I have a generator (a function that yields stuff), but when trying to pass it to gensim.Word2Vec I get the following error:
TypeError: You can't pass a generator as the sentences argument. Try an iterator.
Isn't a generator a kind of iterator? If not, how do I make an iterator from it?
Looking at the library code, it seems to simply iterate over sentences like for x in enumerate(sentences), which works just fine with my generator. What is causing the error then?
A generator is a function that produces a sequence of results instead of a single value. Each time the yield statement is executed the function generates a new value. So a generator is also an iterator. You don't have to worry about the iterator protocol.
A generator is a special kind of iterator: the elegant kind. It lets you write an iterator in a succinct syntax, without writing a class with __iter__() and __next__() methods.
Calling a generator function creates an iterable, so the result can be used with iter() and in a for loop.
In Python, Generator is a subclass of Iterator, and Iterator is in turn a subclass of Iterable. This can be checked with the issubclass() function.
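For anyone who wants to verify the subclass relationships, here is a minimal sketch using the abstract base classes in collections.abc. The count_up function is a made-up example, not something from the question.

```python
from collections.abc import Generator, Iterable, Iterator


def count_up(n):
    """A small generator function, used only for illustration."""
    for i in range(n):
        yield i


gen = count_up(3)

# Class-level relationships between the abstract base classes.
generator_is_iterator = issubclass(Generator, Iterator)
iterator_is_iterable = issubclass(Iterator, Iterable)

# The object returned by calling a generator function is an iterator.
gen_is_iterator = isinstance(gen, Iterator)

# A list is iterable, but it is not itself an iterator.
list_is_iterator = isinstance([1, 2, 3], Iterator)

print(generator_is_iterator, iterator_is_iterable, gen_is_iterator, list_is_iterator)
```

So a generator really is an iterator; the error message in the question is about something else than the type hierarchy.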
A generator is exhausted after one loop over it. Word2Vec needs to traverse the sentences multiple times (and probably needs to get an item at a given index, which is not possible with generators, since they behave like a stack you can only pop from), so it requires something more solid, like a list.
In particular, their code calls two different functions that both iterate over sentences (so if you use a generator, the second one would run on an empty set):
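A quick way to see the exhaustion problem, independent of gensim. The sentences_gen function and its contents are made up for the demonstration.

```python
def sentences_gen():
    """Illustrative generator yielding tokenized sentences."""
    yield ["first", "sentence"]
    yield ["second", "sentence"]


gen = sentences_gen()
first_pass = list(gen)   # consumes every item
second_pass = list(gen)  # the generator is now exhausted, so this is empty

# A list, by contrast, can be traversed as many times as needed.
as_list = list(sentences_gen())
list_first = [s for s in as_list]
list_second = [s for s in as_list]

print(len(first_pass), len(second_pass), list_first == list_second)
```

The second pass over the generator silently yields nothing, which is exactly what a second training step would see.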
    self.build_vocab(sentences, trim_rule=trim_rule)
    self.train(sentences)

It should work with anything that implements __iter__ and is not a GeneratorType. So wrap your function in an iterable interface and make sure that you can traverse it multiple times, meaning that

    sentences = your_code
    for s in sentences:
        print(s)
    for s in sentences:
        print(s)

prints your collection twice.
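One way to do the wrapping, as a rough sketch: a small class whose __iter__ calls the generator function again on every traversal. RestartableSentences and read_sentences are hypothetical names, not part of gensim, and this assumes the generator function can safely be called more than once (for example, it re-reads a file each time).

```python
from types import GeneratorType


class RestartableSentences:
    """Hypothetical wrapper: starts a fresh generator on each iteration."""

    def __init__(self, generator_function, *args, **kwargs):
        self.generator_function = generator_function
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        # Each call to iter() invokes the generator function again.
        return self.generator_function(*self.args, **self.kwargs)


def read_sentences(lines):
    """Illustrative generator: splits each line into words."""
    for line in lines:
        yield line.split()


corpus = ["the quick brown fox", "jumps over the lazy dog"]
sentences = RestartableSentences(read_sentences, corpus)

first_pass = list(sentences)
second_pass = list(sentences)
is_generator = isinstance(sentences, GeneratorType)

print(first_pass == second_pass, is_generator)
```

Unlike converting to a list, this keeps memory usage low, because the sentences are regenerated on each pass instead of being stored.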
As previous posters have mentioned, a generator acts similarly to other iterables such as lists, with two significant differences: a generator gets exhausted, and you can't index one.
I quickly looked up the documentation, on this page -- https://radimrehurek.com/gensim/models/word2vec.html
The documentation states that
gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0, seed=1, workers=1, min_alpha=0.0001, sg=1, hs=1, negative=0, cbow_mean=0, hashfxn=, iter=1, null_word=0, trim_rule=None, sorted_vocab=1) ...
Initialize the model from an iterable of sentences. Each sentence is a list of words (unicode strings) that will be used for training.
I'd venture to guess that the logic inside the function inherently requires one or more list properties, such as item indexing; there might be an explicit assert statement or if statement that raises an error.
A simple hack that can solve your problem is turning your generator into a list with a list comprehension. Your program will pay a CPU performance penalty and use more memory, but this should at least make the code work.
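To illustrate what such an explicit check could look like, here is a hypothetical sketch. It is not copied from gensim; check_sentences is a made-up function, and only the error message is taken from the question.

```python
from types import GeneratorType


def check_sentences(sentences):
    """Hypothetical guard, not gensim's actual code."""
    if isinstance(sentences, GeneratorType):
        raise TypeError(
            "You can't pass a generator as the sentences argument. Try an iterator."
        )
    return sentences


def gen():
    """Illustrative generator of tokenized sentences."""
    yield ["hello", "world"]


try:
    check_sentences(gen())
    generator_rejected = False
except TypeError:
    generator_rejected = True

# The same data as a list passes the check.
list_accepted = check_sentences(list(gen())) == [["hello", "world"]]

print(generator_rejected, list_accepted)
```

A check like this would explain why the loop in the library works fine with a generator while the constructor still refuses it.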
my_iterator = [x for x in generator_obj]