I have recently been playing around with Markov chains, trying to generate text from a large corpus just to see what I got back (some of which was quite interesting).
A large part of building the data structure required for text generation is creating n-grams. Given a small sample text: "Today is Thursday March the sixth" an example n-gram where n = 3 would be:
Today is Thursday
is Thursday March
Thursday March the
March the sixth
# skipped lines that have < 3 words because is isn't enough for a 3-gram
Depending on the size of the text, the list of n-grams generated by my code could be quite large, in some languages there is the concept of a generator that contains a yield statement to make custom iterators, but Perl is unfortunately not one of them.
Instead, in Perl we can use closures over lexical variables to create Iterators, but I am having a little trouble understanding what I am really gaining when using them.
Here is the iterator that I created to create n-grams (assume that n is held in $self->order):
sub _ngrams {
my ($self, @words) = @_;
return sub {
while(@words) {
my @ngram = @words[0 .. $self->order]; # get $order + 1 words
shift @words; # drop the first word
return @ngram;
}
return; # nothing left to do
};
}
Do I really gain anything from this code efficiency-wise? The list of words is still held entirely in memory in @words. Is there an alternative implementation that could reduce my memory footprint?
Here is how the iterator is used to generate the dictionary:
sub seed {
my $self = shift;
my $ngram_it = $self->_ngrams(split /\s+/, $self->text);
GRAM:
while (my @gram = $ngram_it->()) {
next GRAM unless @gram == scalar grep { $_ } @gram;
my $val = pop @gram;
my $key = join ' ', @gram;
if (exists $self->lexicon->{$key}) {
push @{$self->lexicon->{$key}}, $val;
}
else {
$self->lexicon->{$key} = [$val];
}
}
}
Any input would be very helpful.
First of all, your iterator implementation has the bad tendency of returning undef items in the last few values. I'd change it to
sub _ngrams {
my ($self, @words) = @_;
my $order = $self->order;
return sub {
if (@words > $order) {
my @ngram = @words[0 .. $order]; # get $order + 1 words
shift @words; # drop the first word
return @ngram;
}
return; # nothing left to do
};
}
Next, this iterator is a nice abstraction. It is not meant to increase performance in any way, it's only useful to make your main code simpler. Here, your code would be shorter (but not simpler) if you didn't separate out the iteration, and do it all in your main code.
However, iterators can handle interesting things like lazy evaluation or infinite streams. For this to be useful, we'd have to switch over completely to streams:
# contract: an iterator returns a list of things
# or an empty list when depleted
sub _ngrams {
my ($self, $source) = @_;
my $order = $self->order;
my @ngram = (undef, map { $source->() } 1 .. $order);
return sub {
if (my ($next) = $source->()) {
(undef, @ngram) = (@ngram, $next); # or instead: shift/push
return @ngram;
}
return;
};
}
Which would be initialized like
my $text = $self->text;
my $iter = $self->_ngrams(sub {
return $1 if $text =~ /\G\s*(\S+)/gc;
return;
});
Is this useful here? No, because you immediately fetch all elements from the iterator. The simplest solution would use no fancy abstractions, and simply be this:
sub seed {
my $self = shift;
my @words = split /\s+/, $self->text;
my $order = $self->order;
while (@words > $order) {
my @gram = @words[0 .. $order]; # get the next n-gram
shift @words;
my $val = pop @gram;
push @{$self->lexicon->{join ' ', @gram}}, $val;
}
}
I'd wager it's also the most (time-)performant variant.
Note: there is no need to test for exists, as Perl hashes autovivify. (Or are you using strange extensions?)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With