Archive for the ‘Code’ Category

Don’t waste your iterators!

2011/08/28 1 comment

Hey all. I kind of put everything in this blog. I hope much of it will be useful to somebody, but most people will probably only care about some of what I write here. Today, I’m writing about python programming. Some days, I use this space as my workout log. If you are using a feed reader, and you only want to see certain kinds of content, you can actually subscribe to individual categories within my blog, by clicking on the category name on the right –> and then using that URL in your feed reader.

If you’re only here for the fitness stuff, feel free to move on.

There’s a pattern I see fairly often in code where someone uses a function that returns a sequence of some sort, filters it, and then wants to use the first result that matches the filter. It looks something like this:

    return [x for x in foo if len(x) > 4][0]

This works, and looks “pythonic” (it uses list comprehensions after all!), but it’s actually a fairly slow and wasteful way to get the results we want.

The actual example I saw which prompted me to post this was from a fun post by Jeff Elmore which explained creating a wu-name generator in six lines of python.

import urllib
from lxml.html import fromstring
def get_wu_name(first_name, last_name):
    """
    >>> get_wu_name("Jeff", "Elmore")
    'Ultra-Chronic Monstah'
    """

    w = urllib.urlopen("http://www.recordstore.com/wuname/wuname.pl",
                       urllib.urlencode({'fname':first_name, 'sname': last_name}))
    doc = fromstring(w.read())
    return [d for d in doc.iter('span') if d.get('class') == 'newname'][0].text

What this will do is find every span in the document, and check to see if its class is ‘newname’. In order to do this, it has to scan the entire document, which may contain a significant amount of unwanted material.

We don’t need a comprehensive list of matching spans. We just need one, and then we can take it and move on. With a list comprehension, we can’t even ask for the first one until the whole document has been processed.

We’re actually better off going through this the old-fashioned way, by using an if nested in a for-loop, and returning the result.

   for d in doc.iter('span'):
       if d.get('class') == 'newname':
           # Return the text, as the list comprehension version does
           return d.text

But we like our brevity, and python provides us with generator expressions, which look like list comprehensions, but don’t do any actual work when they get created. With one, we can ask for the first object before we even begin scanning the document, so python knows to stop processing as soon as it finds the right one. A generator also doesn’t hold its results in memory; it passes them back as they are retrieved, one at a time. We both save the memory it would take to build up a list of spans *and* get to stop searching the moment we find a span matching our conditional.
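Here is a minimal sketch of that difference. The spans() function and the examined list are hypothetical stand-ins for doc.iter('span'), made up so we can count how many items actually get looked at; they are not part of lxml.

```python
# Hypothetical stand-in for doc.iter('span'): yields items one at a time
# and records each one it hands out.
examined = []

def spans():
    for name in ['header', 'newname', 'footer', 'sidebar']:
        examined.append(name)
        yield name

# List comprehension: every item is examined before [0] can be applied.
first = [s for s in spans() if s == 'newname'][0]
examined_by_list = len(examined)

del examined[:]  # reset the record

# Generator expression: creating it examines nothing at all.
gen = (s for s in spans() if s == 'newname')
examined_at_creation = len(examined)

# Pulling a single value examines items only up to the first match.
for first_lazy in gen:
    break
examined_by_generator = len(examined)
```

With this data, the list comprehension looks at all four items, while the generator expression looks at none when it is created and only two by the time it hands back the match.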

The bad news is that we can’t just write:

   return (d for d in doc.iter('span') if d.get('class') == 'newname')[0].text

If we do, we get an exception:

TypeError: 'generator' object is not subscriptable

We’re trying to index into a generator, but the iterator protocol doesn’t support indexing to a particular member. All you can do is start at the beginning and work through it one at a time.
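As a small illustration (the squares generator is just a made-up example), indexing fails, while itertools.islice gets at a later item by doing exactly what the protocol allows: stepping through from the start.

```python
from itertools import islice

# A made-up generator to experiment with.
squares = (n * n for n in range(10))

# Generators have no __getitem__, so indexing raises TypeError.
try:
    squares[0]
    indexable = True
except TypeError:
    indexable = False

# islice walks the generator from the beginning, discarding the first
# three values and yielding only the one at position 3.
fourth = list(islice(squares, 3, 4))[0]
```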

The good news is that since we want the first one anyway, all we have to do is start iterating through the generator, and stop after grabbing the first item.

   generator = (d for d in doc.iter('span') if d.get('class') == 'newname')
   for each in generator:
       # The function returns on the first pass through
       # the loop, forestalling any further processing
       return each.text

But now we’ve lost the terseness of our list comprehension again. All we’ve done is move the if clause out of the for loop and into the generator. Not much of an improvement.

If you understand how generators do their job, though, you can actually maintain the terseness of the original list comprehension version, while enjoying the improved performance of the generator version. Each pass through a for loop retrieves the next value by calling the iterator’s .next() method. So rather than relying on a for loop to go through our iterator for us, we can step through it manually using this method.

   return (d for d in doc.iter('span') if d.get('class') == 'newname').next().text

In python 3 the method is called .__next__() instead of .next(), so this approach isn’t quite compatible across python versions. For python 3, we could use:

   return (d for d in doc.iter('span') if d.get('class') == 'newname').__next__().text

If cross-python compatibility is important to you, or if you would rather not muck around with dunder methods, there is a builtin next() function, added in python 2.6, which takes an iterator and calls the appropriate .next() or .__next__() method on it, returning the result. So now we can write:

   return next(d for d in doc.iter('span') if d.get('class') == 'newname').text

We’ve gotten the results we wanted quickly and efficiently, with no appreciable loss of code clarity.
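One thing to watch for: where the [0] version raises an IndexError when nothing matches, next() raises StopIteration instead. The builtin accepts an optional second argument to return as a default in that case. A quick sketch, using a made-up list of words rather than the wu-name page:

```python
# Made-up data for illustration.
words = ['wu', 'tang', 'clan']

# First item longer than three characters.
first_long = next(w for w in words if len(w) > 3)

# With no match and no default, next() raises StopIteration.
try:
    next(w for w in words if len(w) > 10)
    raised = False
except StopIteration:
    raised = True

# With a default, next() returns it instead of raising. Note that the
# generator expression needs its own parentheses once there is a
# second argument.
fallback = next((w for w in words if len(w) > 10), None)
```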

Now, in this particular case, it’s not a huge deal one way or another. Our bottleneck is going to be pulling the document down from the internet, and the page is fairly short, so we’re not going to be wasting too much time scanning over it. But there are times when this trick can save you quite a bit of time. Imagine scanning through a logfile that hasn’t been placed under log rotation, and has hundreds of megabytes of data in it. Imagine if the processing we were doing on each item in the list comprehension took ten minutes. Or imagine if we were calculating wu names for a million users, where every hundredth of a second spent getting a wu-name translated to nearly three more hours of running time. In any of these cases, knowing when to use iterators, and how to use them effectively, can make a big difference.

Categories: Code

Gwibber is…

2010/11/24 3 comments

Just had a minor annoyance this morning, trying to post a status with gwibber. I was dual-posting to Twitter and Facebook. On Twitter, the post read, simply enough:

DC Tweed Ride 2010. I need to step up my game: http://readysetdc.com/2010/11/video-dc-tweed-ride-2010/

On Facebook, however, Gwibber felt the need to add the word “is” to the beginning of my post. “Cliff Dyer is DC Tweed Ride 2010” makes no sense whatsoever.

Read more…

Categories: Code

Submitting links via POST using jquery

Often, when designing interactive websites, you want to have elements that look like links, but submit POST requests, because they modify some data on your website. Maybe it’s a delete link. Maybe it’s a “Like” button. Today, I just stumbled across a dead easy way to do this using jQuery.
Read more…

Categories: Code

Psycopg2 has a web site? Sweet!

It just came to my attention that psycopg2, the python driver for the PostgreSQL database, has a website again! For several months (years?) the site was unavailable, except for a plaintext rant about the author’s ire toward Trac. Now it has several pages of nice-looking sphinx documentation. I haven’t delved into it yet, so I don’t know how good it is, but it’s nice to see a professional-looking website providing the public presence for the driver that lets me get at my data. It may be technically irrelevant, but it gives me a little more confidence that the software isn’t as hackish as the site used to be.

If you haven’t seen it yet, I recommend hopping on over to their site, http://initd.org/. It’s well worth a look.

Welcome back psycopg2!

Iterators and iterables clarified.

2010/02/25 1 comment

So what exactly is a python iterator, and how is that different from an iterable?

An iterable is an object that implements a method __iter__(), which, when called, returns an iterator. The __iter__() method can be called by the iter() function, and is also called behind the scenes in a for loop.

An iterator is a specific kind of iterable. It is an iterable which also implements a next() method (spelled __next__() in python 3), which returns a value each time it is called, until it runs out of values, at which point it raises a StopIteration exception (and raises StopIteration every time thereafter). As a general rule, iterators also return self when __iter__() is called.
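A minimal sketch of the two roles, using made-up Countdown and CountdownIterator classes (neither comes from any library):

```python
class Countdown(object):
    """An iterable: __iter__() hands back a fresh iterator each time."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)


class CountdownIterator(object):
    """An iterator: returns values until exhausted, then StopIteration."""

    def __init__(self, current):
        self.current = current

    def __iter__(self):
        # Iterators return themselves from __iter__().
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        self.current -= 1
        return self.current + 1

    next = __next__  # python 2 spelling of the same method


countdown = Countdown(3)

# The iterable can be looped over repeatedly, because each loop
# gets its own iterator.
first_pass = list(countdown)
second_pass = list(countdown)

# The iterator can only be consumed once.
iterator = iter(countdown)
same_object = iter(iterator) is iterator
drained = list(iterator)
after_drained = list(iterator)
```

Both passes over countdown give [3, 2, 1], while the iterator gives [3, 2, 1] once and an empty list after that.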

Read more…

Categories: Code

Getting django-south working with gis models.

2009/12/14 3 comments

Just a quick note for posterity…

Django’s south data migration utility has a weird bug that turned up when I was trying to work with gis fields.

I’ve been banging my head all day Friday and today against trying to get south working on my geodjango project. My original problem was (I think) that south 0.5 (which is what Karmic installs by default) doesn’t set up m2m columns when you do a ./manage.py startmigration app --initial. (This seems to have been fixed three weeks ago in Changeset 548:486840db6350, though I’m not sure that’s the right revision.) Once I realized that, and upgraded to the latest version (via pip), I had a new problem: South couldn’t figure out what to do with my Point column:

jcdyer@aalcdl07:/cgi/django/brp/projects/brp$ ./manage.py startmigration content_base --initial
Creating migrations directory at '/cgi/django/brp/apps/cdla_apps/content_base/migrations'...
Creating __init__.py in '/cgi/django/brp/apps/cdla_apps/content_base/migrations'...
+ Added model 'content_base.Item'
+ Added model 'content_base.ItemType'
+ Added M2M 'content_base.Item.subject_set'
+ Added M2M 'content_base.Item.date_published_set'
( Nodefing field: coordinates
( Parsing field: coordinates
WARNING: Cannot get definition for 'coordinates' on 'parkway.location'. Please edit the migration manually to define it, or add the south_field_triple method to it.
Created 0001_initial.py.

It went ahead and created the migration for me, but left some boilerplate for me to clean up in the frozen ORM definition. Where the frozen orm said:

'coordinates': '<< PUT FIELD DEFINITION HERE >>',

I just had to put in the definition of a geometry field. Working from a nearby example that looked like this:

'milepost': ('django.db.models.fields.FloatField', [], {'null': 'True', 'blank': 'True'}),

I changed my code to:

'coordinates': ('django.contrib.gis.db.models.PointField', [], {'null': 'True', 'blank': 'True'}),

I haven’t yet figured out what the south_field_triple method is.

Edit: Bafflingly, there’s a PointField on another table in the same app which has no problem defining itself.

Categories: Code