Basically as the question states. I am fairly new to Python and like to learn by seeing and doing.
I would like to create a script that searches through a text document (say the text copied and pasted from a news article for example) for certain words or phrases. Ideally, the list of words and phrases would be stored in a separate file.
When getting the results, it would be great to get the context of the results. So maybe it could print out the 50 characters in the text file before and after each search term that has been found. It'd be cool if it also showed what line the search term was found on.
Any pointers on how to code this, or even code examples would be much appreciated.
Despite the antipathy towards regular expressions frequently expressed in the Python community, they're a precious tool for the appropriate use cases, which definitely include identifying words and phrases, thanks to the \b "word boundary" element in regular expression patterns. Alternatives based on plain string processing are much more of a problem: for example, .split() uses whitespace as the separator and thus annoyingly leaves punctuation attached to the words adjacent to it.
If REs are OK, I would recommend something like:
import re
import sys

def main():
    if len(sys.argv) != 3:
        print("Usage: %s fileofstufftofind filetofinditin" % sys.argv[0])
        sys.exit(1)
    # One search term (word or phrase) per line; blank lines are skipped.
    with open(sys.argv[1]) as f:
        patterns = [r'\b%s\b' % re.escape(s.strip()) for s in f if s.strip()]
    there = re.compile('|'.join(patterns))
    # Report every line containing at least one term, numbered from 1.
    with open(sys.argv[2]) as f:
        for i, s in enumerate(f, 1):
            if there.search(s):
                print("Line %s: %r" % (i, s))

main()
The first argument is (the path of) a text file with the words or phrases to find, one per line, and the second argument is (the path of) a text file in which to find them. It's easy, if desired, to make the search case-insensitive (perhaps optionally, based on a command-line switch), and so on.
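The question also asks for some context around each hit, which the script above doesn't print. Here is a minimal sketch of one way to do that, assuming the whole text fits in memory as a single string. The names build_pattern and find_with_context, and the width and ignore_case parameters, are made up for this illustration, and the sample text is invented:

```python
import re

def build_pattern(terms, ignore_case=False):
    """Compile one alternation pattern from an iterable of words or phrases."""
    parts = [r'\b%s\b' % re.escape(t.strip()) for t in terms if t.strip()]
    return re.compile('|'.join(parts), re.IGNORECASE if ignore_case else 0)

def find_with_context(pattern, text, width=50):
    """Yield (line_number, matched_term, context) for every match in text."""
    for m in pattern.finditer(text):
        # Line number = newlines before the match, plus one.
        line_no = text.count('\n', 0, m.start()) + 1
        # Up to `width` characters on each side of the match.
        start = max(0, m.start() - width)
        context = text[start:m.end() + width]
        yield line_no, m.group(0), context

if __name__ == '__main__':
    sample = "The cat sat.\nA catalog is not a Dog,\nbut a dog is."
    pattern = build_pattern(["cat", "dog"], ignore_case=True)
    for line_no, term, context in find_with_context(pattern, sample, width=10):
        print("Line %d: %r in %r" % (line_no, term, context))
```

Searching the whole text rather than line by line lets the context span line breaks. The price is that the line number has to be recovered by counting newlines, which may get slow on very large files with many hits.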
Some explanation for readers who are not familiar with REs:
The \b item in each pattern ensures that there will be no accidental matches: if you're searching for "cat" or "dog", you won't see an accidental hit in "catalog" or "underdog", and you won't miss a hit in "The cat, smiling, ran away" because some splitting approach thinks the word there is "cat," including the comma ;-).
The | item means or, so e.g. from a text file with contents (two lines)
cat
dog
this will form the pattern '\bcat\b|\bdog\b' which will locate either "cat" or "dog" (as stand-alone words, ignoring punctuation, but rejecting hits within longer words).
The re.escape call escapes punctuation in each search term so it's matched literally, not with the special meaning it would normally have in a RE pattern.
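To see these three points in action, here is a small interactive-style sketch. The sample sentence and the "3.5" term are invented for the illustration:

```python
import re

text = "The cat, smiling, ran away from the catalog."

# Plain splitting keeps punctuation attached: the token is "cat,", not "cat".
print("cat" in text.split())

# \b only matches at word boundaries, so "cat," is a hit but "catalog" is not;
# | lets the same pattern look for "dog" as well.
pattern = re.compile(r'\bcat\b|\bdog\b')
print([m.group(0) for m in pattern.finditer(text)])

# Without re.escape, "." means "any character", so "3.5" would also match "345".
escaped = re.escape("3.5")
print(re.search("3.5", "345") is not None)
print(re.search(escaped, "345") is None)
```

Run as written, this should print False, then ['cat'], then True twice.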