
Challenge for H2 and other in-memory database gurus

I have an interesting scenario that is, I believe, an excellent application of an IMDB (such as H2) and, possibly, jOOQ. However, there are some interesting challenges and questions that arise.

We’ve developed a specialized, Java-based ETL platform for insurance data conversion that is now in its fourth generation. Without going into unnecessary detail, we routinely extract data from source systems such as SQL Server, DB2, etc. that are normalized to varying degrees. Insurance data conversion has two characteristics that are highly relevant here:

  1. We typically convert one insurance entity (e.g. a policy, application, or claim) at a time, unless it is part of a package or other transactional grouping, in which case we might convert a few entities together. Importantly, therefore, a given conversion transaction seldom involves even 1 MB of data at a time. Indeed, a typical transaction involves less than 50 KB of data, minuscule by any modern measure.

  2. Because source and target systems can differ so dramatically in their schemas, granularity, and even underlying semantics, the transformations can be very complex. In terms of source processing, the queries are numerous and complex, frequently joining many tables, using subqueries, etc. Given this fact, obtaining reasonable performance means saving the query results in some fashion. Until now, we’ve relied on a proprietary approach involving “insurance maps,” which are specialized Java maps. We knew this approach was ultimately insufficient, but it served our needs initially.

Now that I have some time to reflect, I’m thinking about a long-term approach. If we just consider the basic characteristics above, it would seem that an IMDB like H2 would be perfect:

  1. Execute all the complex queries against the source database (e.g. SQL Server) up front, creating tables and performing inserts/updates, in order to create an IMDB representation of all the data that pertains to a single conversion transaction (e.g. a single insurance policy). BTW, I could see how jOOQ could be really helpful here (and elsewhere) for simplifying and increasing the type safety of these queries. (See the sketch after this list.)

  2. Execute all the complex transformation queries against the IMDB. Again, jOOQ might have significant benefits.

  3. Discard and recreate the IMDB for each insurance conversion transaction.
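
To make step 1 concrete, here is a minimal sketch of what I have in mind, using plain JDBC against a private in-memory H2 database (jOOQ could layer on top of either side for type safety). The PolicyExtractor class, the POLICY table, and its columns are purely illustrative:

    // A minimal sketch of step 1; table and column names are illustrative only.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PolicyExtractor {

        // Builds a private in-memory H2 database holding everything that
        // pertains to one conversion transaction (here, a single policy).
        public Connection extractPolicy(Connection source, String policyNo) throws Exception {
            // "jdbc:h2:mem:" with no name creates a private, unnamed IMDB
            // that lives only as long as this connection stays open.
            Connection imdb = DriverManager.getConnection("jdbc:h2:mem:");

            try (Statement ddl = imdb.createStatement()) {
                ddl.execute("CREATE TABLE POLICY ("
                        + "POLICY_NO VARCHAR(20), "
                        + "EFFECTIVE_DATE DATE, "
                        + "PREMIUM DECIMAL(12,2))");
            }

            // One of the many complex source queries, executed up front;
            // its result is copied row by row into the IMDB.
            String sourceSql = "SELECT p.policy_no, p.effective_date, p.premium "
                    + "FROM policy p WHERE p.policy_no = ?";
            try (PreparedStatement src = source.prepareStatement(sourceSql);
                 PreparedStatement ins = imdb.prepareStatement(
                         "INSERT INTO POLICY VALUES (?, ?, ?)")) {
                src.setString(1, policyNo);
                try (ResultSet rs = src.executeQuery()) {
                    while (rs.next()) {
                        ins.setString(1, rs.getString(1));
                        ins.setDate(2, rs.getDate(2));
                        ins.setBigDecimal(3, rs.getBigDecimal(3));
                        ins.executeUpdate();
                    }
                }
            }
            return imdb;
        }
    }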

One of the things that I love about this approach (at least with H2) is the ability to encapsulate queries in Java-based stored procedures, which is much better than writing T-SQL stored procs. And would using jOOQ against the IMDB, rather than, say, the native H2 stored procedure API, again make things even easier and safer?
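
For illustration, here is a minimal sketch of the kind of Java-based stored procedure I mean: a public static method registered in H2 with CREATE ALIAS and then callable from SQL. The ConversionProcs class, the COVERAGE table, and the alias name are hypothetical:

    // A minimal sketch of an H2 Java stored procedure; names are hypothetical.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ConversionProcs {

        // When the first parameter is a Connection, H2 passes the
        // connection of the session that invoked the alias.
        public static ResultSet coveragesForPolicy(Connection conn, String policyNo)
                throws Exception {
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT c.* FROM COVERAGE c WHERE c.POLICY_NO = ?");
            ps.setString(1, policyNo);
            return ps.executeQuery();
        }

        // Registered once per IMDB instance; afterwards the method can be
        // called from SQL, e.g. SELECT * FROM COVERAGES_FOR_POLICY('P-1001').
        public static void register(Connection imdb) throws Exception {
            try (Statement st = imdb.createStatement()) {
                st.execute("CREATE ALIAS COVERAGES_FOR_POLICY FOR "
                        + "\"com.example.etl.ConversionProcs.coveragesForPolicy\"");
            }
        }
    }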

However, I have two concerns:

  1. Serialization: This is actually a distributed platform (I’ve simplified my description above for discussion purposes), and we make fairly heavy use of services and message queuing to pass/queue data. This all works wonderfully when we’re working with XML data sources, which is frequently the case. How well will this work with an IMDB?

    For a given insurance transaction IMDB, we must be able to a) serialize the IMDB, b) transmit and/or queue the IMDB and, finally, c) deserialize the data back into a fully functioning IMDB for conversion processing.

    It appears that the best way to do this with H2, for example, is to use the SQL SCRIPT command to serialize the data, and then run the script to deserialize it (see the sketch after this list). I’m wondering about the performance characteristics of this approach. I don’t consider our platform to be particularly performance-sensitive, but I do want to avoid an approach that is particularly sluggish or architecturally awkward.

  2. Target loading: This discussion has focused on source-side database processing because we frequently generate XML on the target side (we have mature subsystems for this purpose). Sometimes, however, we need to address databases directly on the target side as well. In this case, we must be able to insert/update directly against mainstream relational databases in accordance with the converted data.

    The approach I’m contemplating again uses an IMDB, but on the target side. The transformed data populates an IMDB with the same schema as the actual target database. Then this target IMDB could be serialized and transmitted as needed. Finally, the contents of the target IMDB would be used to insert/update against the actual target database (which, of course, could contain many gigabytes of data). What would be tremendous (but I’m not optimistic) is if I could use a simple SQL SCRIPT statement against the IMDB to generate a script containing INSERT/UPDATE statements that I could then simply run against the target database. I suspect it won’t be that easy.

    In any event, does this general approach to target loading seem reasonable?
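
For reference, here is a minimal sketch of the SCRIPT-based serialization I described under concern 1 (the same mechanism I would hope to reuse for target loading). The script goes through a temporary file in the sketch; in practice the script text would be queued or transmitted by our messaging layer, and all names are illustrative:

    // A minimal sketch of SCRIPT-based IMDB serialization/deserialization.
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImdbSerializer {

        // Dump the entire IMDB (DDL plus INSERTs) to a single SQL string.
        // SCRIPT, run as a query, returns one script statement per row;
        // SCRIPT TO 'file' is the file-based alternative.
        public static String serialize(Connection imdb) throws Exception {
            StringBuilder script = new StringBuilder();
            try (Statement st = imdb.createStatement();
                 ResultSet rs = st.executeQuery("SCRIPT")) {
                while (rs.next()) {
                    script.append(rs.getString(1)).append('\n');
                }
            }
            return script.toString();
        }

        // Rebuild a fully functioning IMDB from the script on the receiving side.
        public static Connection deserialize(String script) throws Exception {
            Connection imdb = DriverManager.getConnection("jdbc:h2:mem:");
            Path tmp = Files.createTempFile("imdb", ".sql");
            Files.write(tmp, script.getBytes(StandardCharsets.UTF_8));
            try (Statement st = imdb.createStatement()) {
                st.execute("RUNSCRIPT FROM '" + tmp + "'");
            }
            Files.deleteIfExists(tmp);
            return imdb;
        }
    }

As for concern 2, that same SCRIPT output does contain the INSERT statements, but it also contains CREATE TABLE and other DDL that the real target database already has, and as far as I can tell it generates only INSERTs (no UPDATEs), so I suspect some filtering or post-processing would still be required.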

I apologize for the length of this post, but this is a critically important question for our team. Thank you so much, in advance, for your responses.

asked Jan 19 '26 by Rob Oaks


1 Answer

A bit off topic... One thing to remember is that H2 is a non-distributed database and thus, at best, a rather primitive solution. Essentially, it is a whatever-fits-on-the-heap-of-a-single-JVM database. There are better approaches unless you are talking about an absolutely simplistic use case (which I don't think you are).

GridGain's In-Memory Database, for example, uses H2 for its SQL processing internally (with all of its benefits) but also provides full distribution for SQL, as well as a host of other features. There are other distributed in-memory databases, and even some sophisticated data grids, that can fit your use case.

Just my 2 cents here.

answered Jan 20 '26 by Nikita Ivanov


