 

pymongo cursor 'touch' to avoid timeout

I need to fetch a large number (e.g. 100 million) of documents from a MongoDB (v3.2.10) collection (using PyMongo 3.3.0) and iterate over them. The iteration will take several days, and I often run into an exception due to a timed-out cursor.

In my case I need to sleep for unpredictable amounts of time as I iterate. For example, I might need to:

  • fetch 10 documents
  • sleep for 1 second
  • fetch 1000 documents
  • sleep for 4 hours
  • fetch 1 document
  • etc.

I know I can:

  • disable timeouts entirely, but I'd like to avoid that if possible since it's nice to have the cursors cleaned up for me if my code stops functioning entirely
  • decrease my cursor's batch_size but this won't help if e.g. I need to sleep for 4 hours as in the example above

It seems like a nice solution would be a way to 'touch' the cursor to keep it alive. So for example I'd break up a long sleep into shorter intervals and touch the cursor between each interval.

I didn't see a way to do this via pymongo, but I'm wondering if anyone knows for sure whether it's possible.

asked Jan 18 '26 by nonagon
1 Answer

I can say for sure: it is not possible. What you want is feature SERVER-6036, which is unimplemented.

For such a long-running task I recommend paging through the collection with range queries on an indexed field. E.g. if your documents all have a timestamp "ts":

documents = list(collection.find().sort('ts').limit(1000))
for doc in documents:
    ...  # process doc

while True:
    ids = set(doc['_id'] for doc in documents)
    cursor = collection.find({'ts': {'$gte': documents[-1]['ts']}})
    documents = list(cursor.limit(1000).sort('ts'))
    if not documents:
        break  # All done.
    for doc in documents:
        # Skip documents already processed in the previous batch;
        # the $gte query re-fetches the boundary timestamp.
        if doc['_id'] not in ids:
            ...  # process doc

This code exhausts each cursor immediately, so no cursor ever sits idle long enough to time out; it processes 1000 documents, then the `$gte` query on the indexed "ts" field resumes where the previous batch left off.
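If it helps, here is the same resume-from-last-timestamp pattern wrapped in a generator, so the paging bookkeeping lives in one place and you can sleep as long as you like between yields. This is a sketch, not a PyMongo API: `resumable_docs`, `ts_field`, and `batch_size` are names I made up, and I've added a guard the plain loop above lacks for the case where more than one batch of documents shares a single timestamp.

```python
def resumable_docs(collection, ts_field='ts', batch_size=1000):
    """Yield every document once, issuing a fresh query per batch so
    no single server-side cursor has to survive a long pause."""
    documents = list(collection.find().sort(ts_field).limit(batch_size))
    for doc in documents:
        yield doc
    while documents:
        seen = {doc['_id'] for doc in documents}
        query = {ts_field: {'$gte': documents[-1][ts_field]}}
        documents = list(collection.find(query).sort(ts_field).limit(batch_size))
        # Documents at the boundary timestamp come back again; only the
        # unseen ones represent progress.
        new = [d for d in documents if d['_id'] not in seen]
        if not new:
            break  # Done -- or more than batch_size docs share one timestamp.
        for doc in new:
            yield doc
```

Note the caveat in the final comment: like the loop above, this assumes no more than `batch_size` documents share a single `ts` value; documents beyond that window would be skipped.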

Second idea: configure your server with a very long cursor timeout:

mongod --setParameter cursorTimeoutMillis=21600000  # 6 hrs
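If restarting mongod is inconvenient, `cursorTimeoutMillis` can also be changed at runtime with the `setParameter` admin command (this assumes you have admin privileges on the server):

```shell
mongosh --eval 'db.adminCommand({setParameter: 1, cursorTimeoutMillis: 21600000})'
```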

Third idea: you can be more certain, though not completely certain, that you'll close a no-timeout cursor by using it in a with statement:

cursor = collection.find(..., no_cursor_timeout=True)
with cursor:
    # PyMongo will try to kill the cursor on the server
    # when you leave this block.
    for doc in cursor:
        ...  # do stuff
answered Jan 20 '26 by A. Jesse Jiryu Davis


