Iterators and iterables clarified.

So what exactly is a Python iterator, and how is it different from an iterable?

An iterable is an object that implements a method __iter__(), which, when called, returns an iterator. The __iter__() method can be called by the iter() function, and is also called behind the scenes in a for loop.

An iterator is a specific kind of iterable: one that also implements a next() method, which returns a value each time it is called until it runs out of values, at which point it raises a StopIteration exception (and raises it again on every call thereafter). As a general rule, iterators also return self when __iter__() is called.
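
To make the protocol concrete, here is a minimal sketch of an iterator. The Countdown class is a made-up example, not something from the standard library. One assumption to flag: in Python 3 the method is spelled __next__() and is usually called through the built-in next(), so the sketch defines __next__() and aliases it to next() for the Python 2 spelling used in this post.

```python
class Countdown(object):
    """A minimal iterator (hypothetical example): counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # Iterators conventionally return themselves from __iter__().
        return self

    def __next__(self):
        if self.current <= 0:
            # Exhausted: every later call lands here and raises again.
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

    next = __next__  # Python 2 spelling of the same method


countdown = Countdown(3)
first_pass = list(countdown)    # [3, 2, 1]
second_pass = list(countdown)   # [] because the iterator is already exhausted
```

The second pass comes back empty because list() asks the same object for values again, and it has nothing left to give.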

Why does this matter?

The important thing to note here is that iterables do not have to perform their own iteration. To take a very simple case, list objects are iterables but not iterators. That is, they define an __iter__() method, but not a next() method. Rather than returning the list itself, __iter__() returns a listiterator object.


>>> names = ['Tom', 'Dick', 'Muhammad']
>>> iter(names)
<listiterator object at 0x...>

This means that the list doesn’t have to store state information about the iteration process. A list doesn’t need to know that you just looked at the second object, so next time you iterate over it, you should see 'Muhammad'. In fact, it can’t know this, because a list might be in use in two iterations at once.
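
A quick sketch of that independence, using the built-in iter() and next() functions (the variable names are just for illustration): two iterators over the same list each keep their own position, and the list itself keeps none.

```python
names = ['Tom', 'Dick', 'Muhammad']

first = iter(names)
second = iter(names)

# Advance only the first iterator.
seen_by_first = [next(first), next(first)]   # ['Tom', 'Dick']

# The second iterator has its own position and still starts at the beginning.
seen_by_second = next(second)                # 'Tom'

# The list is not its own iterator: iter() hands back a different object.
list_is_own_iterator = iter(names) is names  # False
```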

Looping over an iterable twice

What if you do the following?


>>> def get_pairs(seq):
...     for x in seq:
...         for y in seq:
...             if x is not y:
...                 print(x, y)

The list would have to store its index in the x loop, but then it would have to store a different index in the y loop. It’s far easier to create a separate iterator for each, that just has to handle its own loop. Just to make it clear, if verbose, the following code does approximately the same thing:


>>> seq = ['Tom', 'Dick', 'Muhammad']
>>> iterator_x = iter(seq)
>>> iterator_y = iter(seq) # First pass
>>> x = iterator_x.next() # x == 'Tom'
>>> y = iterator_y.next() # y == 'Tom'
>>> if x is not y: # x is y: skip
...     print(x, y)
>>> y = iterator_y.next() # y == 'Dick'
>>> if x is not y: # x is not y
...     print(x, y) # prints "Tom Dick"
>>> y = iterator_y.next() # y == 'Muhammad'
>>> if x is not y: # x is not y
...     print(x, y) # prints "Tom Muhammad"
>>> y = iterator_y.next()
StopIteration
>>> x = iterator_x.next() # x == 'Dick'
>>> iterator_y = iter(seq) # Restart the inner loop
>>> y = iterator_y.next() # y == 'Tom'
>>> if x is not y: # x is not y
...     print(x, y) # prints "Dick Tom"
>>> y = iterator_y.next() # y == 'Dick'
>>> if x is not y: # x is y: skip
...     print(x, y)
>>> y = iterator_y.next() # y == 'Muhammad'
>>> if x is not y: # x is not y
...     print(x, y) # prints "Dick Muhammad"
>>> y = iterator_y.next()
StopIteration
>>> x = iterator_x.next() # x == 'Muhammad'
>>> iterator_y = iter(seq) # Restart the inner loop
>>> y = iterator_y.next() # y == 'Tom'
>>> if x is not y: # x is not y
...     print(x, y) # prints "Muhammad Tom"
>>> y = iterator_y.next() # y == 'Dick'
>>> if x is not y: # x is not y
...     print(x, y) # prints "Muhammad Dick"
>>> y = iterator_y.next() # y == 'Muhammad'
>>> if x is not y: # x is y: skip
...     print(x, y)
>>> y = iterator_y.next()
StopIteration
>>> x = iterator_x.next()
StopIteration

Note that each loop has its own iterator. This is not the case with, for instance, file objects. File objects implement the iterator protocol directly, which is to say, they have both an __iter__() method and a next() method. As a result, you can only use a file object in one loop at a time.
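
To see where those per-loop iterators come from, here is a rough sketch of what the two for loops in get_pairs() expand to. The function name get_pairs_expanded is my own, and it collects the pairs in a list rather than printing them so the result is easy to inspect. It also uses the built-in next() instead of calling the method directly.

```python
def get_pairs_expanded(seq):
    """Roughly what the nested for loops in get_pairs() do, spelled out."""
    pairs = []
    iterator_x = iter(seq)          # the outer loop asks for an iterator once
    while True:
        try:
            x = next(iterator_x)
        except StopIteration:
            break                   # outer loop finished
        iterator_y = iter(seq)      # the inner loop asks again on every pass
        while True:
            try:
                y = next(iterator_y)
            except StopIteration:
                break               # inner loop finished
            if x is not y:
                pairs.append((x, y))
    return pairs


pairs = get_pairs_expanded(['Tom', 'Dick', 'Muhammad'])
```

With a list, each call to iter(seq) produces a new iterator. With an object that returns itself from __iter__(), every one of those calls hands back the same object.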

Looping over an iterator twice

If we try to run the same get_pairs() function with file objects, it doesn’t give us the expected results.

File 'names.txt':


Tom
Dick
Muhammad

Result of get_pairs(open('names.txt')):


Tom
 Dick
 
Tom
 Muhammad

So two things have happened here:

  1. We're getting newlines from each line of our file.
  2. The only value being used in the x loop is 'Tom'.

The reason for (1) is pretty obvious and not particularly interesting, so let's look a little closer at (2). When you enter the outer loop, Python calls f.__iter__(), which, for an iterator like a file object, returns the file object itself (not a copy). It then calls f.next() on the iterator, and stores the resulting value ('Tom\n') in x.

It then enters the inner loop and calls f.__iter__() again, which returns the very same file object. Now we have two loops working on the same object. So when the inner loop calls f.next(), it immediately returns 'Dick\n', because *it is the same iterator* which just returned 'Tom\n' in the outer loop. The next time through, it returns 'Muhammad\n', and finally, it raises a StopIteration because it has exhausted the file, and returns control to the outer loop.

We’d like the outer loop to loop over 'Dick\n' and 'Muhammad\n' now, but remember that it is operating on the same object as the inner loop was, so when it calls f.next(), it just raises another StopIteration instead, and exits the outer loop.
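
You can reproduce this without a file on disk. The sketch below uses io.StringIO as a stand-in for open('names.txt'), which is my substitution: like a file object, it is its own iterator. collect_pairs is a hypothetical variant of get_pairs() that returns the pairs instead of printing them. The last few lines show one possible way around the problem, which is to read the lines into a list first so that each loop gets its own iterator.

```python
import io


def collect_pairs(seq):
    """Same loops as get_pairs(), but returns the pairs instead of printing."""
    pairs = []
    for x in seq:
        for y in seq:
            if x is not y:
                pairs.append((x, y))
    return pairs


# io.StringIO stands in for open('names.txt') so no file is needed.
f = io.StringIO(u'Tom\nDick\nMuhammad\n')
same_object = iter(f) is f        # True: the file-like object is its own iterator
from_file = collect_pairs(f)      # only the pairs whose x is 'Tom\n'

# Reading the lines into a list first gives each loop its own iterator.
f = io.StringIO(u'Tom\nDick\nMuhammad\n')
lines = [line.strip() for line in f]
from_list = collect_pairs(lines)  # all six pairs
```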

Restarting an iterable

Another benefit of creating an iterable which does not perform its own iteration is that it can be restarted. Once an iterator raises a StopIteration exception, it is required to raise a StopIteration every time next() is called from then on.

An iterable, on the other hand, doesn’t implement next() itself, so it never raises a StopIteration. If you try to start a new loop with your iterable, you get a new, fresh iterator, which isn’t stuck in StopIteration mode.

For instance, with an iterable:


>>> names = ['Tom', 'Dick', 'Muhammad']
>>> for x in names:
...     print x
Tom
Dick
Muhammad
>>> for x in names: # Looping over a new iterator.
...     print x
Tom
Dick
Muhammad

And with an iterator:


>>> f = open('names.txt')
>>> for x in f:
...     print x.strip()
Tom
Dick
Muhammad
>>> for x in f: # Ends immediately: f is already exhausted
...     print x.strip()
>>>

An iterable that hands out a fresh iterator can be looped through over and over again; an iterator, only once.

How can I take advantage of this?

These very different usage patterns are implemented simply by returning different things from the __iter__() method. If you create an object whose __iter__() returns the object itself, you also need to implement next(), and have it raise StopIteration once it is exhausted. If you return a different object from __iter__(), you can loop over your object again and again.

Note that you cannot just write an iterable and expect it to work on its own. It has to return a legitimate iterator, which has both __iter__() and next() methods. Thus writing an iterable is a little more work, but the added flexibility is often worth it.
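
Here is a minimal sketch of that two-class arrangement. NameList and NameListIterator are names I made up for illustration. As an assumption for portability, the iterator defines __next__() (the Python 3 spelling) and aliases it to next() for Python 2.

```python
class NameList(object):
    """An iterable: holds the data and hands out a fresh iterator each time."""

    def __init__(self, names):
        self._names = list(names)

    def __iter__(self):
        return NameListIterator(self._names)


class NameListIterator(object):
    """The iterator: tracks the position for a single pass."""

    def __init__(self, names):
        self._names = names
        self._index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._index >= len(self._names):
            raise StopIteration
        value = self._names[self._index]
        self._index += 1
        return value

    next = __next__  # Python 2 spelling of the same method


names = NameList(['Tom', 'Dick', 'Muhammad'])
first_loop = [x for x in names]
second_loop = [x for x in names]   # works again: a new iterator each time
pairs = [(x, y) for x in names for y in names if x is not y]
```

Because all the per-pass state lives in NameListIterator, a NameList can be used in nested loops and looped over repeatedly, just like a list.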

See my follow-up post: Tricks with Iterators. In it I discuss various ways you can take advantage of iterators and iterables.

  1. Sun Yi
    2010/02/26 at 5:15 pm

    Awesome. You should think about posting your article to one of the major Python forums. I think a lot of people would definitely appreciate it.

    Thank you for posting it.
