There is a module named flink-jdbc, but it only provides a non-parallel, tuple-based JDBC InputFormat.
To build a parallel InputFormat for JDBC, it seems one needs to implement a custom split type based on the interface org.apache.flink.core.io.InputSplit.
So in my case, how can I implement a custom JdbcInputSplit to query data from the database in parallel?
Apache Flink does not provide a parallel JDBC InputFormat, so you need to implement one yourself. The existing non-parallel JDBC InputFormat is a good starting point.
In order to query a database in parallel, you need to split the query into several queries that cover non-overlapping (and ideally equally-sized) parts of the result set. Each of these smaller queries would be wrapped in an InputSplit and handed to a parallel instance of the input format.
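As a rough illustration, a split could simply carry the SQL query it is responsible for. The following is a minimal sketch, not part of any Flink API; the class name JdbcInputSplit and its query field are just assumptions for this example.

```java
import org.apache.flink.core.io.InputSplit;

// Hypothetical split type: one instance per partial query of the result set.
public class JdbcInputSplit implements InputSplit {

    private static final long serialVersionUID = 1L;

    private final int splitNumber;
    private final String query;   // the SQL query this split covers

    public JdbcInputSplit(int splitNumber, String query) {
        this.splitNumber = splitNumber;
        this.query = query;
    }

    @Override
    public int getSplitNumber() {
        return splitNumber;
    }

    public String getQuery() {
        return query;
    }
}
```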
Splitting the query is the challenging part because it depends on the query and on the data, so you need a bit of metadata to come up with good splits. You might want to delegate this to the user of the input format and ask for a set of queries instead of a single one, as sketched below. You should also check whether the queried database actually handles several parallel requests better than a single query.
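Here is a hedged sketch of such an input format built on the split type above. The class name ParallelJdbcInputFormat, its constructor, and the assumption that each user-supplied query selects a BIGINT id and a VARCHAR value are all illustrative, not an existing Flink class. Each parallel instance receives one split and runs exactly one query.

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.flink.api.common.io.DefaultInputSplitAssigner;
import org.apache.flink.api.common.io.RichInputFormat;
import org.apache.flink.api.common.io.statistics.BaseStatistics;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.io.InputSplitAssigner;

// Hypothetical parallel JDBC input format: the user provides one query per split.
public class ParallelJdbcInputFormat
        extends RichInputFormat<Tuple2<Long, String>, JdbcInputSplit> {

    private static final long serialVersionUID = 1L;

    private final String jdbcUrl;
    private final String[] queries;   // one non-overlapping query per split

    private transient Connection connection;
    private transient Statement statement;
    private transient ResultSet resultSet;
    private transient boolean hasNext;

    public ParallelJdbcInputFormat(String jdbcUrl, String[] queries) {
        this.jdbcUrl = jdbcUrl;
        this.queries = queries;
    }

    @Override
    public void configure(Configuration parameters) {
        // nothing to configure in this sketch
    }

    @Override
    public BaseStatistics getStatistics(BaseStatistics cachedStatistics) {
        return cachedStatistics;   // no statistics available
    }

    @Override
    public JdbcInputSplit[] createInputSplits(int minNumSplits) {
        // One split per user-provided query; minNumSplits is ignored here.
        JdbcInputSplit[] splits = new JdbcInputSplit[queries.length];
        for (int i = 0; i < queries.length; i++) {
            splits[i] = new JdbcInputSplit(i, queries[i]);
        }
        return splits;
    }

    @Override
    public InputSplitAssigner getInputSplitAssigner(JdbcInputSplit[] splits) {
        return new DefaultInputSplitAssigner(splits);
    }

    @Override
    public void open(JdbcInputSplit split) throws IOException {
        try {
            connection = DriverManager.getConnection(jdbcUrl);
            statement = connection.createStatement();
            resultSet = statement.executeQuery(split.getQuery());
            hasNext = resultSet.next();
        } catch (SQLException e) {
            throw new IOException("Could not open split " + split.getSplitNumber(), e);
        }
    }

    @Override
    public boolean reachedEnd() {
        return !hasNext;
    }

    @Override
    public Tuple2<Long, String> nextRecord(Tuple2<Long, String> reuse) throws IOException {
        try {
            // Assumes each query selects a BIGINT id and a VARCHAR value.
            reuse.f0 = resultSet.getLong(1);
            reuse.f1 = resultSet.getString(2);
            hasNext = resultSet.next();
            return reuse;
        } catch (SQLException e) {
            throw new IOException("Could not read record", e);
        }
    }

    @Override
    public void close() throws IOException {
        try {
            if (resultSet != null) resultSet.close();
            if (statement != null) statement.close();
            if (connection != null) connection.close();
        } catch (SQLException e) {
            throw new IOException("Could not close JDBC resources", e);
        }
    }
}
```

Such a format could then be used like any other input format, e.g. env.createInput(new ParallelJdbcInputFormat(url, queries)) on an ExecutionEnvironment, where each query covers a non-overlapping key range of the table.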