 

Searching through polymorphic data with Elasticsearch

I am stumped by what seems to be a fundamental problem with Elasticsearch and polymorphic data. I would like to be able to find multiple types of results (e.g. users, videos and playlists) with just one Elasticsearch query. It has to be just one query, since that way Elasticsearch can do all the scoring and I won't have to do any magic to combine multiple query results of different types.

I know that Elasticsearch uses a flat document structure, which brings me to the following problem: if I index polymorphic data, I will have to specify a 'missing' value for every unique attribute that I care about when scoring subtypes of the polymorphic data.
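For concreteness, here is a rough sketch of the kind of documents I mean (the types and field names are just made up for illustration):

    # Hypothetical documents of different types sharing one flat index.
    # Fields that only apply to one type are simply "missing" on the others,
    # which is exactly the problem when those fields are used for scoring.
    user_doc = {
        "type": "user",
        "name": "alice",
        "follower_count": 120,   # only meaningful for users
        # "view_count" and "video_count" are missing here
    }

    video_doc = {
        "type": "video",
        "title": "My cat video",
        "view_count": 5400,      # only meaningful for videos
        # "follower_count" is missing here
    }

    playlist_doc = {
        "type": "playlist",
        "title": "Cat compilations",
        "video_count": 12,       # only meaningful for playlists
        # "follower_count" and "view_count" are missing here
    }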

I've looked for examples of others dealing with this problem and couldn't find any. There doesn't seem to be anything in the documentation on this either. Am I overlooking something obvious, or was Elasticsearch just not designed to do something like this?

Kind regards,

Steffan



1 Answer

That's not an issue with Elasticsearch itself; it's a problem (or limitation) of the underlying Lucene indexes. So any db/engine based on Lucene will have the same problems, if not worse (ES already does a ton of the work for you). ES will probably ease the pain in future releases, but not dramatically. And IMO, there's hardly any high-performance search engine that can cope with truly polymorphic data.

The answer depends on your data structure, that's for sure. Basically, you have two options:

  1. Put all your data in a single index and split it by type. You already know the overhead: Lucene indexes work poorly with sparse data. The more similar your data is, the fewer problems you have. Either way, ES will do all the underlying work for "missing" values; you only have to cope with the memory/disk overhead of storing sparse data.

    If your data is organised with a parent-child relation (e.g. video -> playlist), you definitely need a single index for that data, which leaves you with this approach only.

  2. Divide your data into multiple indices. This way you have slightly higher disk overhead for the Lucene indexes, plus possibly higher CPU usage when aggregating data from multiple shards (so you should tune your sharding accordingly).

    You can still query ES for all your documents in a single request, as ES supports multi-index queries (see the sketch just below this list).
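
As a minimal sketch, assuming your indices are called users, videos and playlists and you talk to the standard REST API over HTTP (the index names, field names and search term are all illustrative):

    # Sketch of one multi-index search over hypothetical indices
    # "users", "videos" and "playlists" via the Elasticsearch REST API.
    # ES scores all hits together and returns a single ranked result list.
    import requests

    ES_URL = "http://localhost:9200"  # assumed local cluster

    body = {
        "query": {
            "multi_match": {
                "query": "cats",
                # Field names are illustrative; use whatever your mappings define.
                "fields": ["name", "title", "description"],
            }
        },
        "size": 20,
    }

    resp = requests.post(f"{ES_URL}/users,videos,playlists/_search", json=body)
    resp.raise_for_status()

    for hit in resp.json()["hits"]["hits"]:
        # _index tells you which "type" each hit came from.
        print(hit["_index"], hit["_score"], hit["_source"])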

So this is purely a question of your data structure. I'd recommend simply firing up a small cluster and measuring memory/disk/CPU usage for your expected data. For more details on "index vs shard", see the great article by Adrien.

Slightly off-topic: if a single ES query doesn't seem to fit your needs, I suggest you still consider merging data on the application side. ES works great with many light requests (rather than a few heavy ones), and since your results from ES are already sorted, all that's left is merging sorted streams from sorted input. Not so much magic there, tbh.
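
Roughly, assuming each per-type query already returns hits sorted by score, something like Python's heapq.merge is all you need (the ids and scores below are made up):

    # Sketch: merging already-sorted per-type result lists on the application side.
    # heapq.merge yields items in ascending order, so the key negates the score
    # to get the highest-scoring hits first without re-sorting everything.
    import heapq

    users     = [{"id": "u1", "score": 9.2}, {"id": "u2", "score": 4.1}]
    videos    = [{"id": "v1", "score": 7.8}, {"id": "v2", "score": 3.3}]
    playlists = [{"id": "p1", "score": 5.0}]

    merged = heapq.merge(users, videos, playlists, key=lambda hit: -hit["score"])

    for hit in merged:
        print(hit["id"], hit["score"])
    # -> u1 9.2, v1 7.8, p1 5.0, u2 4.1, v2 3.3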



