This is a question about both personal experience and practical limitations. If I had, for example, a text file with 100,000 lines (entries) and a database with 100,000 identical entries, each containing one word with no duplicates, which one would I be able to process faster, and which would consume the least memory?
It is my understanding that I could load the entire text file into memory as a list at the start (only about 1 MB). This data is used to validate string contents: every word (delimited by a space) in the string has to exist in the file, or else it gets changed to the most similar entry in the list. In a nutshell, it's very high-level auto-correct. Sadly, however, I have to reinvent the wheel.
So anyway, my question still stands: which is my best choice? I'm trying to use as few external modules as possible, so I'm thinking I might stick with SQLite (it's in the standard library, is it not? Though one more module can't hurt). If newline-delimited text files are both my fastest and most economical option, is there a specific way I should go about handling them? I want this script to be able to perform at least 100 match operations per second, if that's computationally possible in a language such as Python.
If you load all 100,000 words into a Python set, determining whether a given word is in that set is O(1) on average; it doesn't get any faster than that. The penalty is a delay when launching your Python app, because it has to load all the data first; that will be on the order of a couple of seconds at most.
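A minimal sketch of the set approach (the file name `words.txt` is just a placeholder for your newline-delimited word list):

```python
# Load the newline-delimited word list once at startup.
# "words.txt" is a placeholder for your actual 100,000-entry file.
with open("words.txt") as f:
    valid_words = set(line.strip() for line in f)

# Membership tests on a set are average-case O(1):
print("hello" in valid_words)
print("helo" in valid_words)
```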
If you load the words into SQLite (or any other SQL database), you'd need a hash-based index to achieve the same order of performance. I'm not sure if SQLite has that index type. MySQL doesn't.
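For comparison, this is roughly what the SQLite route looks like with the standard-library sqlite3 module (the table name and in-memory database are made up for illustration). Note that a PRIMARY KEY column gives you an ordinary B-tree index, so lookups are O(log n) rather than O(1):

```python
import sqlite3

# In-memory database just for illustration; use a file path for persistence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO words VALUES (?)", [("hello",), ("world",)])
conn.commit()

def in_db(word):
    # Indexed lookup on the primary key (B-tree, O(log n)).
    cur = conn.execute("SELECT 1 FROM words WHERE word = ? LIMIT 1", (word,))
    return cur.fetchone() is not None

print(in_db("hello"))  # True
print(in_db("helo"))   # False
```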
SQL databases usually don't have a function to find 'similar' words, because every user has their own definition of 'similar'. It'll be much easier to implement that in Python, but maybe the database of your choice has something that's exactly what you're looking for.
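In Python, the standard-library difflib module already has a notion of 'similar' you may be able to reuse; whether its ratio matches your definition of similarity is for you to judge. A rough sketch of the whole correction step (the tiny word set stands in for your 100,000-word list):

```python
import difflib

valid_words = {"hello", "world", "python"}  # stand-in for your 100,000-word set

def correct_word(word):
    # Return the word itself if it's valid, otherwise the closest match
    # difflib can find (or the word unchanged if nothing is close enough).
    if word in valid_words:
        return word
    matches = difflib.get_close_matches(word, valid_words, n=1, cutoff=0.6)
    return matches[0] if matches else word

print(" ".join(correct_word(w) for w in "helo wrold python".split()))
# -> "hello world python"
```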
The best choice depends on other requirements you didn't mention. Do the 100,000 words change frequently? Do other people (non-programmers) apart from you need to maintain them? If so, a database might be more convenient, and you might want to trade speed for that. Also, how often do you launch your Python app? If you run it to test single words, you'll wait a couple of seconds for each word. On the other hand, if you write a daemon/server and add an interface (sockets, HTTP, whatever), you only have to load your data once and can throw loads of words at it.
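A very small sketch of the daemon idea, using only the standard library (the port and the word file are made up, and a real service would need more error handling):

```python
import socketserver

# Load the word list once when the server starts; "words.txt" is a placeholder.
with open("words.txt") as f:
    valid_words = set(line.strip() for line in f)

class WordCheckHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Protocol: one word per line in, "OK" or "MISS" per line out.
        for line in self.rfile:
            word = line.decode().strip()
            reply = "OK" if word in valid_words else "MISS"
            self.wfile.write((reply + "\n").encode())

if __name__ == "__main__":
    with socketserver.TCPServer(("localhost", 9999), WordCheckHandler) as server:
        server.serve_forever()
```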