As near as I can tell, looking up a string inside a set is effectively free.
I made a 9000-element set, where the elements are foo_0, foo_1, foo_2, and so on up to foo_8999. Then I measured the time cost of testing whether an element belongs to that set:
$ python -m timeit -s 's = set(["foo_%s" % i for i in range(9000) ])' '"foo_4500" in s'
1000000 loops, best of 3: 0.447 usec per loop
Next I measured a few scans across a list of the same size:
$ python -m timeit -s 's = ["foo_%s" % i for i in range(9000) ]' '"foo_0" in s'
1000000 loops, best of 3: 0.447 usec per loop
$ python -m timeit -s 's = ["foo_%s" % i for i in range(9000) ]' '"foo_1" in s'
1000000 loops, best of 3: 0.659 usec per loop
$ python -m timeit -s 's = ["foo_%s" % i for i in range(9000) ]' '"foo_900" in s'
10000 loops, best of 3: 130 usec per loop
$ python -m timeit -s 's = ["foo_%s" % i for i in range(9000) ]' '"foo_4500" in s'
1000 loops, best of 3: 631 usec per loop
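For anyone who prefers to reproduce this from a script rather than the shell, the same measurement can be sketched with the `timeit` module (the loop count and the probe key here are arbitrary choices, not anything from the original runs):

```python
import timeit

# Rebuild the 9000-element collections from the post and time one
# membership test against each. number=10000 keeps the list case
# (a linear scan) from taking too long.
setup = 's = %s("foo_%%s" %% i for i in range(9000))'
for kind in ("set", "list"):
    total = timeit.timeit('"foo_4500" in s', setup=setup % kind, number=10000)
    print("%s: %.3f usec per loop" % (kind, total / 10000 * 1e6))
```

On any machine, the set lookup should come out orders of magnitude cheaper than the list scan, whatever the absolute numbers are.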
It takes more time to do two string comparisons than it takes to hash the string and look it up in the set.
This doesn't look like you are using a set; you're just using the 'in' operator on a list of strings. Your worst case (foo_4500) gets dramatically faster when switching to a set.
python -m timeit -s "s = ['foo_%s' % i for i in range(9000) ]" "'foo_4500' in s"
10000 loops, best of 3: 157 usec per loop
python -m timeit -s "s = set('foo_%s' % i for i in range(9000))" "'foo_4500' in s"
10000000 loops, best of 3: 0.135 usec per loop
Sets store has values (unlike dicts, which don’t) so testing for membership is basically free.
That is my understanding, anyway.
Ah, never mind, I see you were comparing the non-set to the set version at the top of the post.
Anyway, yes, Python rocks!
Pete: pygame was what got me interested in learning python. It was 2002, and I was looking for a way to use libSDL without having to deal with C.
Hash table implementations generally offer O(1) average-case access time, so there is nothing specific to Python here. Searching for a key within a list always requires O(n) time. Also nothing special about Python.
aj: sure, hash lookups cost O(1), but as far as I understand, O(1) just means the time is bounded by some constant c that doesn't vary with the size of the input.
So if that constant is big enough, a single O(1) lookup can still be slower than an O(n) scan when n is small enough.
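One way to see that constant is to shrink the containers until the linear scan has almost nothing to do. This is only a sketch; which container wins at this size (if either) depends on your machine and interpreter version, and the point is just that O(1) hides a fixed cost that matters when n is tiny:

```python
import timeit

# A 2-element list scan versus a 2-element set lookup. At this size
# the hash computation is a large fraction of the set's total cost,
# so the gap between the two collapses (or can even reverse).
t_list = timeit.timeit('"b" in s', setup='s = ["a", "b"]', number=1000000)
t_set = timeit.timeit('"b" in s', setup='s = {"a", "b"}', number=1000000)
print("list scan: %.3f s, set lookup: %.3f s" % (t_list, t_set))
```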
Dan L: can you explain more about what you mean? How does the lookup inside a set work?
matt:
Sorry I had a typo in my original comment. It was supposed to read, “Sets store hash values…”
So since the hash value of each object in the set is already stored (computed when the object is inserted), operations like membership testing and set difference (and the rest of the set operations) become really cheap: the interpreter compares the stored hash values first and only falls back to comparing the objects themselves when the hashes match.
Does that make sense? I know my explanation is a little lacking.
Dan L: OK, I get it. Instead of having to hash both the input key and every element in the set at the moment of a lookup, we only have to hash the input.
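The mechanics described above can be sketched with a toy hash table. This uses separate chaining for clarity (CPython's real sets use open addressing, and the function names here are made up), but it shows the "hash once, then compare" structure:

```python
def build(items, nbuckets=16):
    """Toy separate-chaining table: each bucket holds (hash, item) pairs,
    so every element's hash is computed once, at insertion time."""
    buckets = [[] for _ in range(nbuckets)]
    for item in items:
        h = hash(item)
        buckets[h % nbuckets].append((h, item))
    return buckets

def toy_contains(buckets, key):
    h = hash(key)  # the key is hashed exactly once per lookup
    # Compare stored hashes first; the (potentially expensive)
    # equality check only runs when the hashes already agree.
    return any(h == sh and key == k for sh, k in buckets[h % len(buckets)])
```

So a lookup costs one hash of the input plus, usually, a handful of cheap integer comparisons.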
I’m still impressed that hashing is so cheap compared to string comparison.
Not to take away from Python in any way here, but I'd expect the reason you see such amazing performance has to do with the magic of your CPU cache… nonetheless, it is still impressive!
I wish I knew enough about hardware to understand that stuff 🙂
Matt
And hash tables and programming in general lol
______________________________________________
Isn't hash lookup supposed to be an O(1) operation? So why is this so surprising?
http://en.wikipedia.org/wiki/Hash_table
O(1) is a constant amount of time. In the toy hash tables I built in school, the hashing code took a significant amount of time, even if that time was fixed regardless of the number of elements in the table.
What surprised me with Python is how small that time really is. I wasn't able to separate the cost of hashing an object from a simple string comparison.
The lookup is often O(1) but the function that hashes a string so it can be looked up is usually not O(1) and will change depending on how long the string is.
Also, collisions are possible in a hash table and in those cases it will not be O(1) to look those items up.
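That worst case is easy to provoke deliberately. In this sketch (the class name `BadKey` and the sizes are made up for illustration), every key hashes to the same value, so all of them land on the same probe chain and membership testing degrades toward a linear scan of Python-level `__eq__` calls:

```python
import timeit

class BadKey:
    """Every instance hashes to the same value, forcing collisions."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.n == other.n

colliding = {BadKey(i) for i in range(500)}   # one long probe chain
normal = set(range(500))                      # well-distributed hashes
probe = BadKey(499)

t_bad = timeit.timeit(lambda: probe in colliding, number=1000)
t_ok = timeit.timeit(lambda: 499 in normal, number=1000)
print("colliding: %.4f s, normal: %.4f s" % (t_bad, t_ok))
```

The colliding set should come out far slower, which is why a good `__hash__` (one that spreads values around) matters as much as a fast one.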