I'm making a web crawler in Python that collects redirects/links and adds each one to a database as a new row if it isn't already there. I'd like to use multi-threading, but I'm having trouble because I have to check in real time whether an entry with a given URL already exists. I was initially using sqlite3, but realised I can't use it simultaneously on different threads. I don't really want to use MySQL (or something similar), as it needs more disk space and runs as a separate server. Is there any way to make sqlite3 work with multiple threads?
The Python sqlite3 module has a threadsafety level of 1, which means that although you can't share database connections between threads, multiple threads can use the module simultaneously. So you could have each thread create its own connection to the database.
The problem with this approach is that SQLite has poor write concurrency, so having multiple threads doing a ton of INSERTs at once will give you the dreaded "database is locked" error. You can improve things somewhat by using PRAGMA journal_mode=WAL, but that only goes so far.
If performance is an issue and switching to a client-server database isn't an option, then what you'll probably have to do is keep an in-memory cache of your URLs, and arrange your program so that you have one thread that syncs this cache with the SQLite database.