I am new to Python. I have been given a folder with around 2000 text files. I am supposed to output each word and the number of times it occurs, counting a word at most once per file. For example, a file containing the sentence "i am what i am" should contribute only one occurrence of "i".
I am able to do this for a single file, but how do I do it for multiple files?
from collections import Counter
import re
def openfile(filename):
    fh = open(filename, "r+")
    str = fh.read()
    fh.close()
    return str
def removegarbage(str):
    # Replace one or more non-word (non-alphanumeric) chars with a space
    str = re.sub(r'\W+', ' ', str)
    str = str.lower()
    return str
def getwordbins(words):
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    return cnt
def main(filename, topwords):
    txt = openfile(filename)
    txt = removegarbage(txt)
    words = txt.split(' ')
    bins = getwordbins(words)
    for key, value in bins.most_common(topwords):
        print('{} {}'.format(key, value))
main('speech.txt', 500)
You can get a list of files by using the glob() or iglob() function in the glob module. I also noticed that you weren't using the Counter object efficiently: it is simpler to call its update() method and pass it the list of words. Here's a streamlined version of your code that processes all the *.txt files found in the specified folder:
from collections import Counter
from glob import iglob
import re
import os
def remove_garbage(text):
    """Replace non-word (non-alphanumeric) chars in text with spaces,
       then convert and return a lowercase version of the result.
    """
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text
topwords = 100
folderpath = 'path/to/directory'
counter = Counter()
for filepath in iglob(os.path.join(folderpath, '*.txt')):
    with open(filepath) as file:
        counter.update(remove_garbage(file.read()).split())
for word, count in counter.most_common(topwords):
    print('{}: {}'.format(count, word))
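One caveat worth checking against the question: passing a list to update() counts every occurrence, so a word repeated within one file is counted more than once. Here is a minimal sketch, using the question's example sentence, of how wrapping the words in a set changes that (the variable names are only illustrative):

```python
from collections import Counter

text = "i am what i am"

# update() with a list adds one count per occurrence
total = Counter()
total.update(text.split())

# update() with a set adds one count per distinct word
per_file = Counter()
per_file.update(set(text.split()))

print(total["i"], per_file["i"])
```

In the loop above, that would mean passing set(remove_garbage(file.read()).split()) to counter.update(), assuming the goal is to count each word once per file.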
If I understood your explanation correctly, you want to count, for each word, the number of files that contain it. Here is what you could do.
For each file obtain a set of words in this file (that is, words should be unique). Then, for each word count the number of sets it can be found in.
Here is what I suggest:
Loop over all files in the target directory. You can use os.listdir for this purpose.
Make a set of the words found in each file:
with open(filepath, 'r') as f:
    txt = removegarbage(f.read())
    words = set(txt.split())
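The snippet above assumes filepath is already defined. A rough sketch of the surrounding os.listdir loop might look like the following; word_sets is a hypothetical helper name, and remove_garbage simply restates the cleanup function from the question. Note that os.listdir returns bare file names, so each one has to be joined with the folder path before opening it.

```python
import os
import re


def remove_garbage(text):
    # Same cleanup as in the question: non-word chars to spaces, lowercase
    return re.sub(r'\W+', ' ', text).lower()


def word_sets(folderpath):
    """Yield (filename, set of words) for every .txt file in folderpath."""
    for name in sorted(os.listdir(folderpath)):
        filepath = os.path.join(folderpath, name)
        # Skip subdirectories and anything that is not a .txt file
        if not (name.endswith('.txt') and os.path.isfile(filepath)):
            continue
        with open(filepath, 'r') as f:
            yield name, set(remove_garbage(f.read()).split())
```

Each yielded set contains every word of that file exactly once, which is what the counting step needs.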
Now that you have a set of words for every file, you can use Counter with those sets. It's best to use its update method. Here is a little demo:
>>> a = set("hello Python world hello".split())
>>> a
{'Python', 'world', 'hello'}
>>> b = set("foobar hello world".split())
>>> b
{'foobar', 'hello', 'world'}
>>> from collections import Counter
>>> c = Counter()
>>> c.update(a)
>>> c.update(b)
>>> c
Counter({'world': 2, 'hello': 2, 'Python': 1, 'foobar': 1})
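The same idea can be wrapped in a small function. This is only a sketch: document_frequency is a hypothetical name, and it takes an iterable of strings rather than file paths so that it can be tried without any files on disk.

```python
from collections import Counter
import re


def document_frequency(texts):
    """Count, for each word, how many of the given texts contain it."""
    counts = Counter()
    for text in texts:
        # Clean up the text, then keep each word once per text
        words = set(re.sub(r'\W+', ' ', text).lower().split())
        counts.update(words)
    return counts


df = document_frequency(["i am what i am", "what am i doing", "hello world"])
for word, count in df.most_common(3):
    print('{}: {}'.format(word, count))
```

For the 2000-file folder, the strings would come from reading each file in turn, and most_common(n) then lists the words that appear in the most files.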