I am developing a solution in Java that communicates with a set of devices through REST APIs belonging to different vendors. For each vendor, there is a set of processes that I have to perform inside my solution, and these processes vary from vendor to vendor. The following are the high-level processes that need to be performed.
- Retrieve an XML file from a folder
- Process the XML file
- Perform some image processing
- Schedule a job and execute it at the scheduled time
- Store data in a MySQL DB and perform some REST calls to outside APIs
One vendor might need all of the above processes, but another might not need some of them (e.g. image processing). The selected solution should provide the following.
- I should be able to create custom workflows for new vendors
- It needs to identify any failures that have occurred within the workflow and apply retry mechanisms
- It should be able to execute some functions in parallel (e.g. image processing)
- Scalable
- Open source
I was told to look into workflow managers like NiFi/Airflow/Falcon. I did some research on them but couldn't settle on the most suitable solution.
NOTE: There is NO requirement to use Hadoop or any other cluster, and the data flow frequency is not that high.
Currently, I am thinking of using NiFi. Can anyone please give an opinion on that? What would be the best solution for my use case?
Apache NiFi is not a workflow manager in the way that Apache Airflow or Apache Oozie are. It is a data flow tool: it routes and transforms data. It is not intended to schedule jobs; rather, it lets you collect data from multiple locations, define discrete steps to process that data, and route it to different destinations.
Apache Falcon is different again, in that it lets you more easily define and manage HDFS datasets. It is effectively data management within an HDFS cluster.
Based on your description, NiFi would be a useful addition to your solution. It would be able to collect your XML file, process it in some manner, store the data in MySQL, and perform REST calls. It would also be easily configurable for new vendors, and it tolerates failures well. It performs most functions in parallel and can be scaled into a clustered NiFi with multiple host machines. It was designed with performance and reliability in mind.
What I am unsure about is the ability to perform image processing. There are some processors (extract image metadata, resize image) but otherwise you would need to develop a new processor in Java - which is relatively easy. Or, if the image processing uses Python or some other scripting language, you can use one of the ExecuteScript processors.
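To give a sense of scale for the custom-processor route, here is a minimal sketch of the kind of image logic such a processor might wrap, using only the JDK. The class and method names (`GrayscaleStep`, `toGrayscalePng`) are hypothetical, the grayscale conversion is just a stand-in for whatever processing a vendor actually needs, and the NiFi processor wrapper itself is deliberately left out.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.imageio.ImageIO;

// Hypothetical helper: bytes in, bytes out, so it could sit behind any
// stream-oriented step (for example, the body of a custom processor).
public class GrayscaleStep {

    static byte[] toGrayscalePng(byte[] imageBytes) throws IOException {
        BufferedImage source = ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (source == null) {
            // ImageIO.read returns null when no registered reader understands the data
            throw new IOException("Unsupported or corrupt image data");
        }

        // Drawing onto a TYPE_BYTE_GRAY image performs the color conversion
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(gray, "png", out);
        return out.toByteArray();
    }

    // Usage: java GrayscaleStep input.jpg output.png
    public static void main(String[] args) throws IOException {
        byte[] input = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), toGrayscalePng(input));
    }
}
```

Keeping the logic as a plain bytes-to-bytes function means it can be unit tested without any flow engine running.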
'Scheduling jobs' using NiFi is not recommended.
Full disclosure: I am an Apache NiFi contributor.
I am using NiFi for a use case similar to the OP's. Regarding scheduling, I like how NiFi works with Kafka: I have some scripts scheduled to run from a crontab, and they just add a message to a Kafka topic. NiFi listens to that topic and then starts the orchestration for loading, transforming, fetching, indexing, storing, etc. You can also handle HTTP requests, so you can build "webhook receivers" that trigger a process from an external HTTP POST. Once again, for simple deployments (the ones you plug and play on a single machine), a cron job nails the task. For image processing, I have an OCR image reader in Python connected through an ExecuteScript processor, and a facial recognition step with OpenCV through an ExecuteCommand processor. NiFi's automatic back-pressure has solved many of the problems I ran into when running the Python script and the command on their own.
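As a rough illustration of the webhook-style trigger, an external Java program could fire the flow with a plain HTTP POST. This is only a sketch: the endpoint URL, port, path, JSON payload shape, and the `FlowTrigger` class are all assumptions, and it presumes the flow exposes an HTTP listener at that address and that the vendor name is a simple identifier needing no JSON escaping.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class FlowTrigger {

    // Hypothetical address of the flow's HTTP listener; adjust to your setup
    static final URI ENDPOINT = URI.create("http://localhost:9090/trigger");

    // Builds the POST that asks the flow to start work for one vendor.
    // The payload format is an assumption; the flow decides what it expects.
    static HttpRequest buildTrigger(URI endpoint, String vendor) {
        String body = "{\"vendor\":\"" + vendor + "\"}";
        return HttpRequest.newBuilder(endpoint)
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                buildTrigger(ENDPOINT, "vendorA"),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("Flow listener answered with HTTP " + response.statusCode());
    }
}
```

Run from cron (or any scheduler), a program like this keeps the "when" outside the flow tool and leaves only the "what" inside it, which matches the split described above.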