Python’s hash lookup is insanely good
As near as I can tell, looking up a string inside a set is effectively free.
I made a 9000-element set, where each element is foo_0, or foo_1, foo_2, … , on up to foo_8999. Then I measured the time cost of testing whether an element belongs to that set:
$ python -m timeit -s 's = set(["foo_%s" % i for i in range(9000) ])' '"foo_4500" in s'
1000000 loops, best of 3: 0.447 usec per loop
Next I measured a few scans across a list of the same size:
$ python -m timeit -s 's = ["foo_%s" % i for i in range(9000) ]' '"foo_0" in s'
1000000 loops, best of 3: 0.447 usec per loop
$ python -m timeit -s 's = ["foo_%s" % i for i in range(9000) ]' '"foo_1" in s'
1000000 loops, best of 3: 0.659 usec per loop
$ python -m timeit -s 's = ["foo_%s" % i for i in range(9000) ]' '"foo_900" in s'
10000 loops, best of 3: 130 usec per loop
$ python -m timeit -s 's = ["foo_%s" % i for i in range(9000) ]' '"foo_4500" in s'
1000 loops, best of 3: 631 usec per loop
It takes more time to do to two string comparisons than it does to hash the string and look it up in the set.
Pete said,
March 17, 2008 @ 4:40 pm
This doesn’t look like you are using a set, just doing an ‘in’ operator on a list of strings. Your worst case (foo_4500) gets infinitely faster when switching to this.
python -m timeit -s “s = ['foo_%s' % i for i in range(9000) ]” “‘foo_4500′ in s”
10000 loops, best of 3: 157 usec per loop
python -m timeit -s “s = set(’foo_%s’ % i for i in range(9000))” “‘foo_4500′ in s”
10000000 loops, best of 3: 0.135 usec per loop
Dan L said,
March 17, 2008 @ 4:40 pm
Sets store has values (unlike dicts, which don’t) so testing for membership is basically free.
That is my understanding, anyway.
Pete said,
March 17, 2008 @ 4:42 pm
Ah, nevermind, I see you were comparing the non-set to the set version at the top of the post.
Anyways, yes, Python rocks!
matt said,
March 17, 2008 @ 5:37 pm
Pete: pygame was what got me interested in learning python. It was 2002, and I was looking for a way to use libSDL without having to deal with C.
aj said,
March 17, 2008 @ 5:40 pm
Since all hash implementations usually have constant O(1) access time, there is nothing specific to Python here. Searching for key within a list always requires O(n) access time….also nothing special about Python…
matt said,
March 17, 2008 @ 5:45 pm
aj: sure, hash lookups cost O(1), but as far as I understand, O(1) really just means O(c) where c is some constant that doesn’t vary with the size of the input.
So, if the constant time is really big, then it is possible that O(1) > O(n) when n is small enough.
Dan L: can you explain more about what you mean? How does the lookup inside a set work?
Dan L said,
March 18, 2008 @ 11:24 am
matt:
Sorry I had a typo in my original comment. It was supposed to read, “Sets store hash values…”
So since the hash values of each object in the set is already computed (at instantiation) operations like set membership and set difference (and the rest of the set operations) become really cheap because all the interpreter needs to do is compare hash values — unlike a dictionary where it needs to compute (and re-compute if necessary) the hash of each object for each comparison.
Does that make sense? I know my explanation is a little lacking.
matt said,
March 18, 2008 @ 11:27 am
Dan L: OK, I get it. Instead of having to hash both the input key and every element in the set at the moment of a lookup, we only have to hash the input.
I’m still impressed that hashing is so cheap compared to string comparison.