Using closures as Iterators

Question

I have recently been playing around with Markov chains, trying to generate text from a large corpus just to see what I got back (some of which was quite interesting).

A large part of building the data structure required for text generation is creating n-grams. Given a small sample text: "Today is Thursday March the sixth" an example n-gram where n = 3 would be:

Today is Thursday
is Thursday March 
Thursday March the
March the sixth
# skipped lines that have < 3 words because is isn't enough for a 3-gram

Depending on the size of the text, the list of n-grams generated by my code could be quite large, in some languages there is the concept of a generator that contains a yield statement to make custom iterators, but Perl is unfortunately not one of them.

Instead, in Perl we can use closures over lexical variables to create Iterators, but I am having a little trouble understanding what I am really gaining when using them.

Here is the iterator that I created to create n-grams (assume that n is held in $self->order):

sub _ngrams {
   my ($self, @words) = @_; 

   return sub {
      while(@words) {
         my @ngram = @words[0 .. $self->order]; # get $order + 1 words
         shift @words;                          # drop the first word

         return @ngram;
      }

      return; # nothing left to do
  };
}

Do I really gain anything from this code efficiency-wise? The list of words is still held entirely in memory in @words. Is there an alternative implementation that could reduce my memory footprint?

Here is how the iterator is used to generate the dictionary:

sub seed { 
   my $self = shift; 

   my $ngram_it = $self->_ngrams(split /\s+/, $self->text); 
GRAM:
   while (my @gram = $ngram_it->()) {
      next GRAM unless @gram == scalar grep { $_ } @gram;

      my $val = pop @gram; 
      my $key = join ' ', @gram; 

      if (exists $self->lexicon->{$key}) {
         push @{$self->lexicon->{$key}}, $val;
      }
      else {
         $self->lexicon->{$key} = [$val];
      }
   }
}

Any input would be very helpful.

amon · Accepted Answer

First of all, your iterator implementation has the bad tendency of returning undef items in the last few values. I'd change it to

sub _ngrams {
   my ($self, @words) = @_; 
   my $order = $self->order;

   return sub {
      if (@words > $order) {
         my @ngram = @words[0 .. $order]; # get $order + 1 words
         shift @words;                          # drop the first word

         return @ngram;
      }

      return; # nothing left to do
  };
}

Next, this iterator is a nice abstraction. It is not meant to increase performance in any way, it's only useful to make your main code simpler. Here, your code would be shorter (but not simpler) if you didn't separate out the iteration, and do it all in your main code.

However, iterators can handle interesting things like lazy evaluation or infinite streams. For this to be useful, we'd have to switch over completely to streams:

# contract: an iterator returns a list of things
# or an empty list when depleted

sub _ngrams {
   my ($self, $source) = @_; 
   my $order = $self->order;

   my @ngram = (undef, map { $source->() } 1 .. $order);

   return sub {
      if (my ($next) = $source->()) {
          (undef, @ngram) = (@ngram, $next);  # or instead: shift/push
          return @ngram;
      }
      return;
  };
}

Which would be initialized like

my $text = $self->text;
my $iter = $self->_ngrams(sub {
    return $1 if $text =~ /\G\s*(\S+)/gc;
    return;
});

Is this useful here? No, because you immediately fetch all elements from the iterator. The simplest solution would use no fancy abstractions, and simply be this:

sub seed { 
   my $self = shift; 

   my @words = split /\s+/, $self->text;
   my $order = $self->order;
   while (@words > $order) {
      my @gram = @words[0 .. $order];  # get the next n-gram
      shift @words;

      my $val = pop @gram; 
      push @{$self->lexicon->{join ' ', @gram}}, $val;
   }
}

I'd wager it's also the most (time-)performant variant.

Note: there is no need to test for exists, as Perl hashes autovivify. (Or are you using strange extensions?)

Using closures as Iterators

Tags:

iterator

closures

perl

Hunter McMillen

1 Answers

amon

Recent Activity

Donate For Us

Using closures as Iterators

Tags:

iterator

closures

perl

Hunter McMillen

1 Answers

amon

Related questions

Recent Activity

Donate For Us