I am designing an application which will have a heavy reliance on searching using a Lucene.NET repository. The repository will be built using data from an operational database that is constantly changing. I'm trying to figure out the best strategy to keep the Lucene repository synced up with the source database. Should I have a service running that wakes up every few minutes, queries the database for updated records, and adds/removes from the Lucene index? Should I rebuild the Lucene repository every night and tolerate some latency in the data?
What are the best practices for keeping the data in a Lucene repository fresh? How do the different strategies affect latency, performance, etc.?
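For concreteness, the polling option described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `docs` table, its monotonically increasing `version` column, and the soft-delete `deleted` flag are all hypothetical, and a plain dict stands in for the Lucene index so the example is self-contained. A real service would call the index writer's update and delete operations at the marked points and run `sync_once` on a timer.

```python
import sqlite3


def sync_once(conn, index, last_seen):
    """Apply rows changed since `last_seen` to `index`; return the new watermark."""
    rows = conn.execute(
        "SELECT id, body, deleted, version FROM docs "
        "WHERE version > ? ORDER BY version",
        (last_seen,),
    ).fetchall()
    for doc_id, body, deleted, version in rows:
        if deleted:
            index.pop(doc_id, None)   # stand-in for deleting the document from the index
        else:
            index[doc_id] = body      # stand-in for adding or updating the document
        last_seen = version
    return last_seen


# Hypothetical source table: `version` is bumped on every change,
# and deletes are recorded as a flag so the poller can see them.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, "
    "deleted INTEGER DEFAULT 0, version INTEGER)"
)
conn.executemany(
    "INSERT INTO docs (id, body, deleted, version) VALUES (?, ?, ?, ?)",
    [(1, "first", 0, 1), (2, "second", 0, 2)],
)
conn.commit()

index = {}
watermark = sync_once(conn, index, 0)          # initial load

conn.execute("UPDATE docs SET deleted = 1, version = 3 WHERE id = 1")
conn.commit()
watermark = sync_once(conn, index, watermark)  # picks up only the change
```

With this approach the worst-case staleness is roughly the polling interval, and hard deletes are invisible to the poller unless the database records them somewhere, which is why the sketch assumes a soft-delete flag.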
Lucene supports so-called near real-time search, which means that updates to the index become visible in query results almost instantly. So you can send updates to the index as soon as they are saved in the database. Lucene should have no problem handling even quite frequent updates; Twitter search, for example, is built on it (of course, to sustain that kind of load you would need to distribute your index).
So preferably, you would send your updates from code that is triggered after the transaction is committed. It is hard to say anything more specific without knowing which database or queuing system you are using. Some general thoughts on this matter, as well as examples of using it along with CouchDB or RabbitMQ, are shown in the Elasticsearch river documentation.
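As a rough sketch of the "send the update after commit" idea, assuming nothing about your actual stack: here SQLite plays the operational database, a `queue.Queue` stands in for a message broker such as RabbitMQ, and a dict stands in for the Lucene index. The function names (`save_document`, `delete_document`, `drain`) and the `docs` table are made up for illustration.

```python
import queue
import sqlite3

changes = queue.Queue()  # stand-in for a real message queue


def save_document(conn, doc_id, body):
    # `with conn` commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO docs (id, body) VALUES (?, ?)",
            (doc_id, body),
        )
    # Reached only after a successful commit, so the index never sees
    # changes that were rolled back.
    changes.put(("upsert", doc_id, body))


def delete_document(conn, doc_id):
    with conn:
        conn.execute("DELETE FROM docs WHERE id = ?", (doc_id,))
    changes.put(("delete", doc_id, None))


def drain(index):
    """Apply all queued changes to `index`; return how many were applied."""
    applied = 0
    while True:
        try:
            op, doc_id, body = changes.get_nowait()
        except queue.Empty:
            return applied
        if op == "upsert":
            index[doc_id] = body     # stand-in for updating the indexed document
        else:
            index.pop(doc_id, None)  # stand-in for deleting it from the index
        applied += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

save_document(conn, 1, "first")
save_document(conn, 2, "second")
delete_document(conn, 1)

index = {}
applied = drain(index)
```

In a real deployment the consumer would run continuously in its own process, and you would have to decide what happens if the process dies between the commit and the enqueue. A periodic reconciliation pass or a full rebuild is one common safety net.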