Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of log data. It
has a simple and flexible architecture based on streaming data flows. It is
robust and fault tolerant with tunable reliability mechanisms and many failover
and recovery mechanisms. It uses a simple extensible data model that allows for
online analytic applications.
Flume
lets Hadoop users make the most of valuable log data. Specifically, Flume
allows users to:
- Stream data from multiple sources into Hadoop for
analysis
- Collect high-volume Web logs in real time
- Insulate themselves from transient spikes when the
rate of incoming data exceeds the rate at which data can be written to the
destination
- Guarantee data delivery
- Scale horizontally to handle additional data volume
Flume’s
high-level architecture is focused on delivering a streamlined codebase that is
easy-to-use and easy-to-extend. The project team has designed Flume with the
following components:
- Event – a singular
unit of data that is transported by Flume (typically a single log entry).
- Source – the entity
through which data enters into Flume. Sources either actively poll for
data or passively wait for data to be delivered to them. A variety of
sources allow data to be collected, such as log4j logs and syslogs.
- Sink – the entity
that delivers the data to the destination. A variety of sinks allow data
to be streamed to a range of destinations. One example is the HDFS sink
that writes events to HDFS.
- Channel – the conduit
between the Source and the Sink. Sources ingest events into the channel
and the sinks drain the channel.
- Agent – any physical
Java virtual machine running Flume. It is a collection of sources, sinks
and channels.
- Client – produces and
transmits the Event to the Source operating within the Agent.
A flow in Flume starts from the Client (Web
Server). The Client transmits the event to a Source operating within the Agent.
The Source receiving this event then delivers it to one or more Channels. These
Channels are drained by one or more Sinks operating within the same Agent.
Channels allow decoupling of ingestion rate from drain rate using the familiar
producer-consumer model of data exchange. When spikes in client side activity
cause data to be generated faster than what the provisioned capacity on the
destination can handle, the channel size increases. This allows sources to
continue normal operation for the duration of the spike. Flume agents can be
chained together by connecting the sink of one agent to the source of another
agent. This enables the creation of complex dataflow topologies.
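As a small illustration of how these components fit together, the sketch below shows a minimal single-agent configuration (the agent, source, channel, and sink names are illustrative placeholders, not anything from this tutorial): a netcat source turns each line received on a TCP port into an Event, a memory channel buffers the events, and a logger sink drains the channel.
# Minimal single-agent sketch; names are placeholders
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
# Source: listens on a TCP port and turns each received line into an Event
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1
# Channel: buffers events in memory between the source and the sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
# Sink: logs every event it drains from the channel
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
Such an agent would be started with the flume-ng launcher, naming the agent and the configuration file, e.g. flume-ng agent -n agent1 -c conf -f example.conf.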
Before we move on to sentiment analysis, we first need to install Apache Flume.
Follow my previous tutorial for the Apache Flume installation procedure from here.
Steps to Connect with Twitter using Flume
Step 1: Use the link below to download flume-sources-1.0-SNAPSHOT.jar:
Flume jar
Step 2: Move the flume-sources-1.0-SNAPSHOT.jar file from the Downloads directory to the lib directory of Apache Flume:
Command: sudo mv Downloads/flume-sources-1.0-SNAPSHOT.jar /usr/lib/apache-flume-1.4.0-bin/lib/
Step 3: Check whether flume-sources-1.0-SNAPSHOT.jar has been moved to the lib folder of Apache Flume:
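For example, one quick way to verify this (assuming the same install path used in Step 2):
Command: ls /usr/lib/apache-flume-1.4.0-bin/lib/ | grep flume-sources
If the jar was moved correctly, flume-sources-1.0-SNAPSHOT.jar should appear in the output.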
Step 4: Copy the flume-env.sh.template content to flume-env.sh:
Command: cd /usr/lib/apache-flume-1.4.0-bin/
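After changing into the Flume directory, the copy itself can be done with a command along these lines (assuming the template sits in the conf directory of this install):
Command: sudo cp conf/flume-env.sh.template conf/flume-env.sh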
Step 5: Edit flume-env.sh as shown in the snapshot below: set JAVA_HOME and FLUME_CLASSPATH.
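If the snapshot is not visible, the two entries in flume-env.sh should look roughly like this (the JDK path below is an assumption; use the Java installation path on your own machine):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
FLUME_CLASSPATH="/usr/lib/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"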
Now we have configured Flume on our machine. Let's run Flume to stream Twitter data onto HDFS.
We need to create an application in Twitter and use its credentials to fetch data.
Step 6: Open a Browser and go to the below URL:
Step 8 : Enter your Twitter account credentials and sign
in:
Step 9 : Click on Create New App to create a new
application and enter all the details in the application:
Step 10: Check Yes, I agree and click on Create your
Twitter application:
Step 11: Your Application will be created:
Step 13: Scroll down and click on Create my access token:
Step 14 : Download the flume.conf file from here.
Step 15: Edit flume.conf
Replace the highlighted credentials in flume.conf with the credentials (Consumer Key, Consumer Secret, Access Token, Access Token Secret) you received after creating the application. Copy them very carefully; everything else remains the same. Save the file and close it.
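For reference, the credential section of flume.conf looks roughly like the sketch below. The source type shown here is the TwitterSource class that ships in flume-sources-1.0-SNAPSHOT.jar; the keywords and HDFS path are assumptions, so keep whatever your downloaded flume.conf already contains and only swap in the four credential values:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <Consumer Key>
TwitterAgent.sources.Twitter.consumerSecret = <Consumer Secret>
TwitterAgent.sources.Twitter.accessToken = <Access Token>
TwitterAgent.sources.Twitter.accessTokenSecret = <Access Token Secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000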
Step 16 : Change permissions for flume directory.
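The exact command is not shown in the post; a permissive recursive change such as the following is commonly used in a single-user sandbox VM (tighten it on any shared machine):
Command: sudo chmod -R 777 /usr/lib/apache-flume-1.4.0-bin/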
Step 17: Start fetching the data from Twitter:
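The streaming is started with the flume-ng launcher, pointing it at the edited flume.conf and at the agent defined in it (TwitterAgent). Assuming flume.conf was saved in the conf directory of this install:
Command: cd /usr/lib/apache-flume-1.4.0-bin/
Command: ./bin/flume-ng agent -n TwitterAgent -c conf -f conf/flume.conf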
Now wait for 20-30 seconds and let Flume stream the data onto HDFS. After that, press Ctrl+C to break the command and stop the streaming. (Since you are stopping the process, you may see a few exceptions; ignore them.)
Step 18 : Open the Mozilla browser in your VM, and go to
/user/flume/tweets in HDFS
Click on the FlumeData file that got created:
If you can see data similar to that shown in the snapshot below, then the unstructured data has been streamed from Twitter onto HDFS successfully.
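If you prefer the terminal to the browser, the same check can be done from the command line (assuming the /user/flume/tweets path configured in flume.conf):
Command: hdfs dfs -ls /user/flume/tweets
Command: hdfs dfs -cat /user/flume/tweets/FlumeData.* | head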
In the next post we will do some analysis of the Twitter data using Apache Hive.
Sir, from where do I have to download the flume.conf file? There are only the env and conf template files.
You can get the conf file from the flume.conf link after Step 15.
16/04/22 02:38:45 ERROR node.PollingPropertiesFileConfigurationProvider: Unhandled error
java.lang.NoSuchMethodError: twitter4j.conf.Configuration.isStallWarningsEnabled()Z
    at twitter4j.TwitterStreamImpl.<init>(TwitterStreamImpl.java:60)
    at twitter4j.TwitterStreamFactory.<init>(TwitterStreamFactory.java:40)
    at org.apache.flume.source.twitter.TwitterSource.configure(TwitterSource.java:115)
    at org.apache.flume.conf.Configurables.configure(Configurables.java:41)
    at org.apache.flume.node.AbstractConfigurationProvider.loadSources(AbstractConfigurationProvider.java:331)
    at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:102)
    at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:140)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Hi, did you get the solution for the above error? If anyone knows, please let us know.
Hi, I am getting the following error. Kindly help me resolve it:
jeni@jeni-VirtualBox:/usr/lib/flume/apache-flume-1.7.0-bin$ ./bin/flume-ng agent -n TwitterAgent -c conf -f /usr/lib/flume/apache-flume-1.4.0-bin/conf/flume.conf
Info: Sourcing environment configuration script /usr/lib/flume/apache-flume-1.7.0-bin/conf/flume-env.sh
Info: Including Hadoop libraries found via (/opt/hadoop/bin/hadoop) for HDFS access
Error: Could not find or load main class org.apache.flume.tools.GetJavaProperty
Info: Including Hive libraries found via () for Hive access
+ '$' exec /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms100m -Xmx2000m -Dcom.sun.management.jmxremote -cp '/usr/lib/flume/apache-flume-1.7.0-bin/conf:/usr/lib/flume/apache-flume-1.5.0.1-bin/lib/*:/usr/lib/flume/apache-flume-1.7.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar:/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/hadoop/share/hadoop/mapreduce/lib/*:/opt/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/lib/*' -Djava.library.path= org.apache.flume.node.Application -n TwitterAgent -f /usr/lib/flume/apache-flume-1.4.0-bin/conf/flume.conf
./bin/flume-ng: line 214: $: command not found
+ exit 0
Hi Jenni,
From the error, I can tell that Apache Flume has not been configured properly, which is why it shows "Error: Could not find or load main class org.apache.flume.tools.GetJavaProperty".
I suggest you remove the Apache Flume package, reinstall it, and try again.