Showing posts from February, 2016

How to Pull Twitter Data Using Apache Flume into HDFS

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.

Flume lets Hadoop users make the most of valuable log data. Specifically, Flume allows users to:

- Stream data from multiple sources into Hadoop for analysis
- Collect high-volume Web logs in real time
- Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
- Guarantee data delivery
- Scale horizontally to handle additional data volume

Flume’s high-level architecture is focused on delivering a streamlined codebase that is easy to use and easy to extend. The project team has designed Flume