How do I efficiently try to find a large list of words in a large list of XMLs

I have a large list of chemical names (~30,000,000) and a large list of articles (~34,000) in the form of XMLs that are being stored on a server as files.

I am trying to parse every XML as a string for a mention of one or more chemical names. The final result would be a tab-separated text file where I have a file name and then the list of chemicals that appear in the file.

The current issue is that I have a for loop that iterates over all the chemicals, nested inside a for loop that iterates over all the XMLs, and inside the inner loop is Python's substring-in-string check. Is there any way to improve performance, either by using a more efficient operation than the substring check or by rearranging the loops?

My pseudo code:

for article in articles:
    chemicals_in_article = []
    temp_article = article.lower()
    for chemical in chemicals:
        if chemical in temp_article:
            chemicals_in_article.append(chemical)

    # Write the results into a text file
    output_file.write(article.file_name)
    for chemical in chemicals_in_article:
        output_file.write("\t" + chemical)
    output_file.write("\n")

               
Justin asked Jan 25 '26 07:01


1 Answer

I am not sure whether 30M entries would blow your memory, but a trie-based approach would likely be the fastest. Several packages implement this in slightly different forms, for example FlashText or trieregex. Both have examples that are an exact match for your scenario.
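
For reference, a minimal FlashText sketch (the keyword list and sample sentence are invented purely for illustration; FlashText does whole-word matching in a single pass over the text, no matter how many keywords you load):

# pip install flashtext
from flashtext import KeywordProcessor

# Build the keyword trie once; case-insensitive matching mirrors the
# .lower() call in the question's pseudo code.
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(['lemon', 'lime', 'pomelo', 'orange', 'citrus'])

# A single pass over the text returns every loaded keyword that occurs as a whole word.
print(keyword_processor.extract_keywords('Substitute a pomelo or an orange for three limes.'))
# => ['pomelo', 'orange']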

EDIT: ...at least on plain text. Per the comment above, if you want to avoid matching random bits of markup, you could build a trie and then use XPath's matches() function to find the text nodes where the trie-derived regexp matches. Unfortunately, the main XML library for Python does not support matches() (and indeed very few libraries support XPath 2.0), so this is not very workable.

Since all you need is to detect the presence of your keywords anywhere in the text of the document, a viable workaround is to convert the XML to plain text and then apply one of the methods above. Here is an example:

#pip install libxml2-python3 trieregex

from trieregex import TrieRegEx as TRE
from libxml2 import parseDoc
import re


# prepare
words = ['lemon', 'lemons', 'lime', 'limes', 'pomelo', 'pomelos', 'orange', 'oranges', 'citrus', 'citruses']
tre = TRE(*words)
pattern = re.compile(fr"\b{tre.regex()}\b")
# => \b(?:l(?:emons?|imes?)|citrus(?:es)?|oranges?|pomelos?)\b


# search
xml = """
<?xml version="1.0"?>
<recipe>
  <substitute for="lemon">three limes</substitute>
  <substitute for="orange">pomelo</substitute>
</recipe>
""".strip()
doc = parseDoc(xml)
text = doc.getContent()
matches = pattern.findall(text)
print(matches)
# => ['limes', 'pomelo']
doc.freeDoc()

Note that you only need to build the regex once; you can then apply it very quickly to any number of documents.
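
A rough sketch of how that might look across a directory of article files, producing the tab-separated output described in the question (the directory and output paths are placeholders; it reuses pattern and parseDoc from the example above and assumes the chemical names are lowercase, as in the question's pseudo code):

import glob

with open("chemicals_per_article.tsv", "w") as output_file:
    for path in glob.glob("articles/*.xml"):
        with open(path, encoding="utf-8") as f:
            doc = parseDoc(f.read())
        text = doc.getContent().lower()             # same case-folding as in the question
        doc.freeDoc()
        found = sorted(set(pattern.findall(text)))  # deduplicate repeated mentions
        output_file.write(path + "\t" + "\t".join(found) + "\n")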

Amadan answered Jan 27 '26 20:01


