The documentation of the shelve module makes the following claim under Restrictions:
The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.)
As far as I can tell, this means that as long as I don't try to have multiple processes write to a single shelf at once, I should be in the clear. Multiple processes using the same shelf as a read-only cache should be safe. Right?
Apparently not. After some struggling, I ended up with a test case that appears to demonstrate some very bad behavior when reading asynchronously from the shelf. The following script:
Creates a new shelf and populates it with "i": 2*i for i from 1 to 10.
Spawns processes to retrieve the value for each key from the shelf file, and reports whether a value was retrieved or not.
import multiprocessing
import shelve
SHELF_FILE = 'test.shlf'
def store(key, obj):
    db = shelve.open(SHELF_FILE, 'w')
    db[key] = obj
    db.close()
def load(key):
    db = None
    try:
        db = shelve.open(SHELF_FILE, 'r')
        n = db.get(key)
        if n is not None:
            print('Got result {} for key {}'.format(n, key))
        else:
            print('NO RESULT for key {}'.format(key))
    except Exception as e:
        print('ERROR on key {}: {}'.format(key, e))
    finally:
        if db is not None: # open() may have failed, leaving nothing to close
            db.close()
if __name__ == '__main__':
    db = shelve.open(SHELF_FILE, 'n') # Create brand-new shelf
    db.close()
    for i in range(1, 11): # populate the new shelf with keys from 1 to 10
        store(str(i), i*2)
    db = shelve.open(SHELF_FILE, 'r') # Make sure everything got in there.
    print(', '.join(key for key in db)) # Should print 1-10 in some order
    db.close()
    # read each key's value from the shelf, asynchronously
    pool = multiprocessing.Pool()
    for i in range(1, 11):
        pool.apply_async(load, [str(i)])
    pool.close()
    pool.join()
The expected output here would naturally be 2, 4, 6, 8 and so on up to 20 (in some order). Instead, arbitrary values cannot be retrieved from the shelf, and sometimes the request causes shelve to blow up altogether. The actual output looks like this ("NO RESULT" lines indicate keys that returned None):
6, 7, 4, 5, 2, 3, 1, 10, 8, 9
ERROR on key 3: need 'c' or 'n' flag to open new db
ERROR on key 6: need 'c' or 'n' flag to open new db
Got result 14 for key 7
NO RESULT for key 10
Got result 2 for key 1
Got result 4 for key 2
NO RESULT for key 8
NO RESULT for key 4
NO RESULT for key 5
NO RESULT for key 9
My intuition, given the error messages, is that maybe external resources (maybe the .dir file?) aren't being flushed to disk appropriately (or maybe they're deleted by other processes?). Even then, I'd expect a slowdown as a process waited for the disk resource, rather than these "oh I guess it's not there" or "What are you talking about this isn't even a shelf file" results. And frankly I wouldn't expect there to be any writing to those files anyway, since the worker processes only use read-only connections...
Is there something I'm missing, or is shelve just outright unusable in multiprocessing environments?
This is Python 3.3 x64 on Windows 7, if that turns out to be relevant.
There's a warning worth noting in the shelve.open() documentation:
Open a persistent dictionary. The filename specified is the base filename for the underlying database. As a side-effect, an extension may be added to the filename and more than one file may be created.
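To see what that side effect looks like on a given machine, a minimal sketch like the following lists the files a shelf actually produces and asks dbm.whichdb which backend wrote them. The helper name inspect_shelf and the temporary directory are illustrative and not part of the question's script; the file names and the backend will differ by platform and Python version.

```python
import dbm
import glob
import os
import shelve
import tempfile


def inspect_shelf(base_path):
    """Create a small shelf at base_path and report what landed on disk.

    inspect_shelf is a hypothetical helper used only for this illustration.
    """
    # Same contents as the question's test case: "i" -> 2*i for i in 1..10.
    with shelve.open(base_path, 'n') as db:
        for i in range(1, 11):
            db[str(i)] = i * 2

    # The backend may append extensions or create several files,
    # so match everything that starts with the base filename.
    pattern = glob.escape(base_path) + '*'
    files = sorted(os.path.basename(p) for p in glob.glob(pattern))

    # whichdb returns the name of the dbm module that can open the shelf,
    # an empty string if the format is unrecognized, or None if unreadable.
    backend = dbm.whichdb(base_path)
    return files, backend


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        files, backend = inspect_shelf(os.path.join(tmp, 'test.shlf'))
        print('files on disk:', files)
        print('dbm backend:', backend)
```

If more than one file shows up, every reader depends on all of them being present and consistent at the moment it opens the shelf.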
Try passing a pre-opened shelf (instead of a filename) to the pool workers, and see if the behaviour changes. That said, I can't reproduce the problem with Python 2.7 on Win7-64 (although the output is all jumbled, of course).
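If the shelf is small enough to fit in memory, a related variant is to read it once in the parent process and hand plain values to the workers, so that no worker ever opens the shelf files. This is only a sketch under that assumption and has not been tested against the failing setup; snapshot and describe are hypothetical helper names.

```python
import multiprocessing
import os
import shelve
import tempfile


def snapshot(shelf_path):
    """Read the whole shelf once, in a single process, into a plain dict."""
    with shelve.open(shelf_path, 'r') as db:
        return dict(db)


def describe(item):
    """Worker function: receives a (key, value) pair, never touches the shelf."""
    key, value = item
    return 'Got result {} for key {}'.format(value, key)


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'test.shlf')

        # Populate a throwaway shelf the same way the question does.
        with shelve.open(path, 'n') as db:
            for i in range(1, 11):
                db[str(i)] = i * 2

        # Only the parent process reads the shelf.
        cache = snapshot(path)

    # Workers get ordinary picklable tuples instead of a filename.
    items = sorted(cache.items(), key=lambda kv: int(kv[0]))
    with multiprocessing.Pool() as pool:
        for line in pool.map(describe, items):
            print(line)
```

This sidesteps concurrent access to the shelf rather than explaining the failure, so it is a workaround, not a diagnosis.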