I have a large list of chemical names (~30,000,000) and a large collection of articles (~34,000), stored on a server as XML files.
I am trying to search every XML file, read as a string, for mentions of one or more chemical names. The final result would be a tab-separated text file where each line has a file name followed by the list of chemicals that appear in that file.
The current issue is that I have a for loop over all the chemicals nested inside a for loop over all the XMLs. Inside the inner loop is Python's string-in-string (substring) test. Is there any way to improve performance, either by using a more efficient operation than the substring test or by rearranging the for loops?
My pseudocode:
    for article in articles:
        chemicals_in_article = []
        temp_article = article.lower()
        for chemical in chemicals:
            if chemical in temp_article:
                chemicals_in_article.append(chemical)
        # Write the results into a text file
        output_file.write(article.file_name)
        for chemical in chemicals_in_article:
            output_file.write("\t" + chemical)
        output_file.write("\n")
I am not sure whether 30M entries would exhaust your memory, but a trie-based approach would likely be the fastest. Several packages implement this in slightly different forms, for example FlashText or trieregex. Both have examples that match your scenario exactly.
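To show why a trie helps, here is a minimal pure-Python sketch of the idea. It is not the FlashText or trieregex API; build_trie, find_terms and the sample chemical names are made up for illustration, and it assumes you want whole-word matches only. The point is that the work per text position is bounded by the length of the longest name, not by the number of names.

```python
def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[None] = word  # store the full word at its terminal node
    return root


def find_terms(text, trie):
    """Return the set of trie words that occur in text as whole words."""
    text = text.lower()
    found = set()
    n = len(text)
    for start in range(n):
        # Only start a match at the beginning of a word.
        if start > 0 and text[start - 1].isalnum():
            continue
        node = trie
        for pos in range(start, n):
            node = node.get(text[pos])
            if node is None:
                break
            # A word ends here; accept it only if a non-alphanumeric
            # character (or the end of the text) follows.
            if None in node and (pos + 1 == n or not text[pos + 1].isalnum()):
                found.add(node[None])
    return found


# Hypothetical sample data, standing in for the 30M-name list.
chemicals = ["ethanol", "methanol", "acetic acid", "acetone"]
trie = build_trie(chemicals)
print(sorted(find_terms("Dissolve in ethanol, then add acetic acid.", trie)))
# => ['acetic acid', 'ethanol']
```

A nested-dict trie like this is memory-hungry in Python, so for 30M names a dedicated package is probably the more realistic choice.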
EDIT: ...at least on plain text. Per the comment above, if you want to avoid matching random bits of markup, you could build a trie and then use the XPath matches function to find text nodes where the trie-derived regexp finds a match. Unfortunately, the main XML library for Python does not support matches (and indeed very few libraries support XPath 2.0), so this is not very workable.
Since all you need is to detect the presence of your keywords anywhere in the text of the document, a viable workaround is to convert the XML to text and then employ one of the methods above. Here is an example:
#pip install libxml2-python3 trieregex
from trieregex import TrieRegEx as TRE
from libxml2 import parseDoc
import re
# prepare
words = ['lemon', 'lemons', 'lime', 'limes', 'pomelo', 'pomelos', 'orange', 'oranges', 'citrus', 'citruses']
tre = TRE(*words)
pattern = re.compile(fr"\b{tre.regex()}\b")
# => \b(?:l(?:emons?|imes?)|citrus(?:es)?|oranges?|pomelos?)\b
# search
xml = """
<?xml version="1.0"?>
<recipe>
<substitute for="lemon">three limes</substitute>
<substitute for="orange">pomelo</substitute>
</recipe>
""".strip()
doc = parseDoc(xml)
text = doc.getContent()
matches = pattern.findall(text)
print(matches)
# => ['limes', 'pomelo']
doc.freeDoc()
Note that you only need to prepare the regex once; you can then apply it very fast on multiple documents.
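As a rough sketch of that reuse, the loop below compiles a pattern once and applies it to several documents, producing the tab-separated lines asked for in the question. To keep it self-contained it uses only the standard library: xml.etree.ElementTree stands in for libxml2, and a plain alternation stands in for the trie-derived regex. The helper names xml_to_text and scan_articles and the in-memory articles are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def xml_to_text(xml_string):
    """Concatenate all text nodes, dropping tags and attributes."""
    root = ET.fromstring(xml_string)
    return " ".join(root.itertext())


def scan_articles(articles, pattern):
    """Yield one tab-separated line per article: name, then matched terms."""
    for name, xml_string in articles:
        text = xml_to_text(xml_string).lower()
        hits = sorted(set(pattern.findall(text)))
        yield "\t".join([name, *hits])


# Stand-in for the trie-derived regex; compiled once, reused for every document.
# Longer alternatives come first so that 'limes' is preferred over 'lime'.
words = ["lemon", "lime", "limes", "pomelo"]
alternatives = sorted(words, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")

# Hypothetical in-memory articles; in practice these would be read from files.
articles = [
    ("a.xml", '<recipe><substitute for="lemon">three limes</substitute></recipe>'),
    ("b.xml", "<recipe><note>Pomelo works too</note></recipe>"),
]
lines = list(scan_articles(articles, pattern))
print("\n".join(lines))
```

Note that the attribute value "lemon" in a.xml is not reported, because only text nodes are searched.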