I want to write code that opens multiple text files and counts how many times each predefined string occurs in each file. My desired output is a list with the total number of occurrences of each string across all the files.
The strings I am looking for are the values of a dictionary.
For instance:
mi = { "key1": "string1", "key2": "string2", and so on..." }
For a single file I already have code that does the count I want. See below:
mi = {}  # my dictionary
data = open("test.txt", "r").read()
import collections
od_mi = collections.OrderedDict(sorted(mi.items()))
count_occur = list()
for value in od_mi.values():
    count = data.count(value)
    count_occur.append(count)
lista_keys = []
for key in od_mi.keys():
    lista_keys.append(key)
dic_final = dict(zip(lista_keys, count_occur))
od_mi_final = collections.OrderedDict(sorted(dic_final.items()))
print(od_mi_final)  # A final dictionary whose values are the number of times each string occurs.
My next goal is to do the same with multiple files. I have a group of text files that are named according to a pattern, e.g. "ABC 01.2015.txt", "ABC 02.2015.txt", ...
I made 3 text files for testing, and each string occurs exactly once in each of them. Therefore, in my test run the desired output is a count of 3 for each string.
mi = {}
import collections
od_mi = collections.OrderedDict(sorted(mi.items()))
count_occur = []  # one entry gets appended per (file, string) pair
for i in range(2,5):
    for value in od_mi.values():
        x = "ABC" + " " + str(i) +".2015.txt"
        data = open(x, "r").read()
        contar = data.count(value)
        count_occur.append(contar)
print(count_occur)
Output:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
I realize that my code was overwriting the count each time it entered the loop. How can I fix this?
Make a Counter from the values in your mi dict, then use the intersection between the new Counter dict keys and each line of split words:
mi = { "key1": "string1", "key2": "string2"}
from collections import Counter
counts = Counter(dict.fromkeys(mi.values(), 0))
for fle in list_of_file_names:
    with open(fle) as f:
        for words in map(str.split, f):
            # Python 3 spelling; on Python 2 use counts.viewkeys() & words
            counts.update(counts.keys() & words)
print(counts)
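To make the behaviour of the intersection approach concrete, here is a minimal self-contained sketch for Python 3. The helper name count_tokens, the temporary files and their contents are all made up for illustration; only the Counter and intersection idea comes from the snippet above. Note that the intersection only sees whole whitespace-separated tokens, and a string repeated on one line is counted once for that line.

```python
import os
import tempfile
from collections import Counter

# Hypothetical dictionary, mirroring the shape used in the question.
mi = {"key1": "string1", "key2": "string2"}

def count_tokens(paths, targets):
    """Count, across all files, the lines on which each target appears as a whole token."""
    counts = Counter(dict.fromkeys(targets, 0))
    for path in paths:
        with open(path) as f:
            for line in f:
                # The intersection keeps only the targets present on this line,
                # so each target adds at most 1 per line.
                counts.update(counts.keys() & line.split())
    return counts

# Build three throwaway test files in which each string occurs once.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(1, 4):
        path = os.path.join(tmp, "ABC {:02d}.2015.txt".format(i))
        with open(path, "w") as f:
            f.write("string1 and string2\n")
        paths.append(path)
    totals = count_tokens(paths, mi.values())

print(totals)
```

With three files that each contain every string once, both totals come out as 3, which is the result the question asks for.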
If you are looking for exact matches and you have multi-word phrases to find, your best bet will be a regex with word boundaries:
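from collections import Counter
import re
counts = Counter(dict.fromkeys(mi.values(), 0))
# one alternation of all the strings, each wrapped in word boundaries
patt = re.compile("|".join([r"\b{}\b".format(v) for v in mi.values()]))
for fle in list_of_file_names:
    with open(fle) as f:
        for line in f:
            counts.update(patt.findall(line))
print(counts)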
You could also call the regex once on f.read(), presuming the file content fits into memory:
with open(fle) as f:
    counts.update(patt.findall(f.read()))
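As a rough illustration of the word-boundary regex, here is a small sketch that works on plain strings standing in for the result of f.read(). The function name count_phrases and the sample texts are invented for this example. It also passes each string through re.escape, which is my addition: it assumes the dictionary values are literal text rather than regex patterns.

```python
import re
from collections import Counter

def count_phrases(texts, phrases):
    """Sum whole-word occurrences of each phrase over several texts."""
    phrases = list(phrases)
    counts = Counter(dict.fromkeys(phrases, 0))
    # re.escape treats every phrase as literal text (an assumption of this sketch).
    patt = re.compile("|".join(r"\b{}\b".format(re.escape(p)) for p in phrases))
    for text in texts:
        counts.update(patt.findall(text))
    return counts

# Made-up contents standing in for three files read with f.read().
texts = [
    "string1 here, string2 there",
    "string1 string1",
    "no match: string10",
]
totals = count_phrases(texts, ["string1", "string2"])
print(totals)
```

Here "string10" is not counted as "string1" because there is no word boundary between the 1 and the 0, which is the difference from a plain data.count(value).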
The standard re module won't find overlapping matches. If you pip install regex, that module will catch the overlapping matches once you set the overlapped flag:
import regex
from collections import Counter
counts = Counter(dict.fromkeys(mi.values(), 0))
patt = regex.compile("|".join([r"\b{}\b".format(v) for v in mi.values()]))
for fle in list_of_file_names:
    with open(fle) as f:
        for line in f:
            counts.update(patt.findall(line, overlapped=True))
print(counts)
If we change your examples slightly you can see the difference:
In [30]: s = "O rótulo contém informações conflitantes sobre a natureza mineral e sintética."
In [31]: mi = {"RTL. 10": "conflitantes sobre", "RTL. 11": "sobre"}
In [32]: patt = re.compile("|".join([r"\b{}\b".format(v) for v in mi.values()]))
In [33]: patt.findall(s)
Out[33]: ['conflitantes sobre']
In [34]: patt = regex.compile("|".join([r"\b{}\b".format(v) for v in mi.values()]))
In [35]: patt.findall(s,overlapped=True)
Out[35]: ['conflitantes sobre', 'sobre']
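If installing regex is not an option, one possible workaround with the standard library (a different technique from the overlapped flag, and only a sketch) is to search for each string with its own pattern, so a match of a longer phrase cannot hide a shorter one. The function name count_each_phrase is made up, and re.escape again assumes the strings are literal text.

```python
import re
from collections import Counter

def count_each_phrase(text, phrases):
    """Count each phrase independently so phrases cannot shadow one another."""
    counts = Counter()
    for phrase in phrases:
        # One pattern per phrase: matches of other phrases do not consume the text.
        patt = re.compile(r"\b{}\b".format(re.escape(phrase)))
        counts[phrase] = len(patt.findall(text))
    return counts

s = "O rótulo contém informações conflitantes sobre a natureza mineral e sintética."
mi = {"RTL. 10": "conflitantes sobre", "RTL. 11": "sobre"}
result = count_each_phrase(s, mi.values())
print(result)
```

This makes one pass over the text per string, so it is slower than a single alternation when there are many strings, and it still does not count a phrase that overlaps with another occurrence of itself.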