Spark Cumulative Processing on a Single Log File

For log processing using Spark Streaming, I have used the socketStream and textFileStream APIs. Through socketStream, using nc -lk on a particular port, we can read an appending log file, and through textFileStream, any new file added to a directory can be read and processed cumulatively.
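For reference, here is a minimal sketch of these two approaches (the port number, directory path and app name are placeholders, not my actual setup):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LogStreams {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LogStreams").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // Socket approach: read lines pushed by `nc -lk 9999`
        val socketLines = ssc.socketTextStream("localhost", 9999)

        // Directory approach: pick up any *new* file dropped into the directory
        val fileLines = ssc.textFileStream("hdfs:///logs/incoming")

        socketLines.union(fileLines).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }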

What I am looking for is this: for a single log file that grows over time, how can I read that same log file into, say, a DStream or any Spark RDD handle and then process it cumulatively? I don't intend to use nc -lk, as it may not be a general approach. Is there any way or API in Spark to listen to the log file, so that any additions to it are read and processed into RDDs?

asked Oct 25 '25 by Naveen Kumar

1 Answer

I think there are no native APIs in Spark (as of now, before version 1.6) to monitor a single log file and get its continuously growing content.
However, the netcat pattern (tail piped into netcat to ship the continuously appended log) is prevalent for both the socket stream and the Kafka stream.
Use the Spark socket stream to connect to the pipelined netcat:

tail -f xxx.log | nc -lk 9999
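A minimal sketch of the receiving side, assuming localhost:9999 and a per-token running count via updateStateByKey just to illustrate cumulative processing (the checkpoint directory and batch interval are also assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TailOverNetcat {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("TailOverNetcat").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/spark-checkpoint")   // required for stateful operations

        // Lines published by: tail -f xxx.log | nc -lk 9999
        val lines = ssc.socketTextStream("localhost", 9999)

        // Cumulative count per token across all batches seen so far
        val updateFn = (values: Seq[Int], state: Option[Int]) =>
          Some(values.sum + state.getOrElse(0))

        lines.flatMap(_.split("\\s+"))
             .map(token => (token, 1))
             .updateStateByKey(updateFn)
             .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }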

Or use the Spark Kafka stream to connect to a pipelined kafkacat.

kafkacat is a generic non-JVM producer and consumer for Apache Kafka 0.8; think of it as a netcat for Kafka.

https://github.com/edenhill/kafkacat

tail -f /var/log/syslog | kafkacat -b mybroker -t syslog -z snappy  

Note: this reads messages from stdin and produces them to the 'syslog' topic with snappy compression.
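On the Spark side, a minimal sketch of the receiver-based Kafka consumer (the ZooKeeper address and consumer group are assumptions, the topic name matches the kafkacat command above, and the spark-streaming-kafka artifact must be on the classpath):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object SyslogOverKafka {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SyslogOverKafka").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))

        // Map of topic -> number of receiver threads
        val stream = KafkaUtils.createStream(
          ssc, "zkhost:2181", "syslog-consumer-group", Map("syslog" -> 1))

        // Each element is (key, message); keep only the log line itself
        stream.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }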

answered Oct 27 '25 by Shawn Guo