Spark Cumulative Processing on a Single Log File

For log processing using Spark Streaming, I have used the socketStream and textFileStream APIs. Through socketStream, using nc -lk on a particular port, we can read an appending log file, and through textFileStream, any new file added to a directory can be read and processed cumulatively.
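For reference, here is a minimal sketch of these two approaches (the port number, directory path and app name are placeholders, not my actual setup):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LogStreams {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LogStreams").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // Socket approach: read lines pushed by `nc -lk 9999`
        val socketLines = ssc.socketTextStream("localhost", 9999)

        // Directory approach: pick up any *new* file dropped into the directory
        val fileLines = ssc.textFileStream("hdfs:///logs/incoming")

        socketLines.union(fileLines).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }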

What I am looking for is this: for a single log file that grows over time, how can I read that same log file into, say, a DStream or any Spark RDD handle and then process it cumulatively? I don't intend to use nc -lk, as it may not be a general approach. Is there any way or API in Spark to listen to the log file, so that any additions to it are read and processed into RDDs?

asked Oct 25 '25 by Naveen Kumar

1 Answer

I think there are no native APIs in Spark (as of now, before version 1.6) to monitor a single log file and get its continuously growing content.
However, the netcat pattern (tail piped into netcat to ship the continuously appended log) is prevalent for both the socket stream and the Kafka stream.
Use the Spark socket stream to connect to the pipelined netcat:

tail -f xxx.log | nc -lk 9999
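A minimal sketch of the receiving side, assuming localhost:9999 and a per-token running count via updateStateByKey just to illustrate cumulative processing (the checkpoint directory and batch interval are also assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TailOverNetcat {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("TailOverNetcat").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/spark-checkpoint")   // required for stateful operations

        // Lines published by: tail -f xxx.log | nc -lk 9999
        val lines = ssc.socketTextStream("localhost", 9999)

        // Cumulative count per token across all batches seen so far
        val updateFn = (values: Seq[Int], state: Option[Int]) =>
          Some(values.sum + state.getOrElse(0))

        lines.flatMap(_.split("\\s+"))
             .map(token => (token, 1))
             .updateStateByKey(updateFn)
             .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }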

Or use the Spark Kafka stream to connect to a pipelined kafkacat.

kafkacat is a generic non-JVM producer and consumer for Apache Kafka 0.8; think of it as a netcat for Kafka.

https://github.com/edenhill/kafkacat

tail -f /var/log/syslog | kafkacat -b mybroker -t syslog -z snappy  

Note: this reads messages from stdin and produces them to the 'syslog' topic with snappy compression.
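On the Spark side, a minimal sketch of the receiver-based Kafka consumer (the ZooKeeper address and consumer group are assumptions, the topic name matches the kafkacat command above, and the spark-streaming-kafka artifact must be on the classpath):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object SyslogOverKafka {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SyslogOverKafka").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))

        // Map of topic -> number of receiver threads
        val stream = KafkaUtils.createStream(
          ssc, "zkhost:2181", "syslog-consumer-group", Map("syslog" -> 1))

        // Each element is (key, message); keep only the log line itself
        stream.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }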

answered Oct 27 '25 by Shawn Guo