Take the 2-minute tour ×
Programmers Stack Exchange is a question and answer site for professional programmers interested in conceptual questions about software development. It's 100% free, no registration required.

Why does Python use hash table to implement dict, but not Red-Black Tree?

What is the key? Performance?

share|improve this question

closed as too broad by Martijn Pieters, MichaelT, gnat, World Engineer Apr 8 at 0:10

There are either too many possible answers, or good answers would be too long for this format. Please add details to narrow the answer set or to isolate an issue that can be answered in a few paragraphs.If this question can be reworded to fit the rules in the help center, please edit the question.

2  
Sharing your research helps everyone. Tell us what you've tried and why it didn’t meet your needs. This demonstrates that you’ve taken the time to try to help yourself, it saves us from reiterating obvious answers, and most of all it helps you get a more specific and relevant answer. Also see How to Ask –  gnat Apr 4 at 10:36
    
@gnat Thanks for your suggestion. I'm a new fish. –  longdeqidao Apr 4 at 10:39
2  
add comment

2 Answers 2

up vote 4 down vote accepted

This is a general, non-Python-specific answer.

Algorithmic complexity comparison

       | Hash Table  |   Red-Black Tree    |
       |=============|=====================|
Space  | O(n) : O(n) | O(n)     : O(n)     |
Insert | O(1) : O(n) | O(log n) : O(log n) |
Fetch  | O(1) : O(n) | O(log n) : O(log n) |
Delete | O(1) : O(n) | O(log n) : O(log n) |
       | avg  :worst | average  : worst    |

The problem with hash tables is that hashes can collide. There are various mechanisms to resolve collisions, e.g. open addressing or separate chaining. The absolute worst case is that all keys have the same hash code, in which case a hash table will degrade into a linked list.

In all other cases, a hash table is a great data structure that's easy to implement and delivers good performance. A downside is that implementations that can quickly grow the table and redistribute their entries will likely waste nearly as much memory as is actually being used.

RB-Trees are self-balancing and do not change their algorithmic complexity in a worst case. However, they are more difficult to implement. Their average complexities are also worse than that of a hash table.

Restrictions on keys

All keys in a hash table must be hashable and comparable for equality among each other. This is especially easy for strings or integers, but is also fairly straightforward to extend to user-defined types. In some languages like Java these properties are guaranteed by definition.

Keys in an RB-Tree must have a total order: each key must be comparable with any other key, and the two keys must either compare smaller, greater, or equal. This sorting equality must be equivalent to actual equality. This is straightforward for integers and other numbers, also fairly easy for strings (the order need only be consistent and not externally observable, so the order does not need to consider locales), but difficult for other types that have no inherent order. It is absolutely impossible to have keys of different types unless some comparison between them is possible.

One might think that a cheap solution for RB-Tree keys would be to first test for equality, then compare identity (i.e. compare pointers). However, this ordering would not be transitive: If a == b and b > c, then it must follow that a > c as well, which is not guaranteed here. So instead, we might use the hash code of keys as the lookup keys. Here, the ordering works correctly, but we might end up with multiple distinct keys with the same hash code, which will be assigned to the same node in the RB tree. To solve these hash collisions we can use separate chaining just like with hash tables, but this also inherits the worst case behaviour for hash tables – the worst of both worlds.

Other Aspects

  • I expect to a hash table to have better memory locality than a tree, because a hash table is essentially just an array.

  • Entries in both data structures have a fairly high overhead:

    • hash table: key, value, and next entry pointer in the case of separate chaining. Also storing the hash code can speed up resizing.
    • RB-tree: key, value, colour, left child pointer, right child pointer
  • Insertions and deletions in an RB-tree involve tree rotations. These aren't really costly, but do involve an overhead that. In a hash, insertion and deletion are no more expensive than a simple access (although resizing a hash table upon insertion is an O(n) endeavour).

  • Hash tables are inherently mutable, whereas an RB-tree could also be implemented in an immutable fashion. However, this is rarely useful.

share|improve this answer
    
Can we have a hash table with little RB-trees for colliding hashes? –  aragaer Apr 4 at 13:04
    
@aragaer not generally, but it would be possible in some specific cases. However, collisions are usually handled by linked lists – much easier to implement, much less overhead, and usually much more performant because we typically only have very few collisions. If we expect many collisions, we could change the hash function, or use a simpler B-tree. Self-balancing trees like RB-trees are awesome, but there are many cases where they simply don't add value. –  amon Apr 4 at 14:06
add comment

There are a whole range of reasons that might be true, but the key ones are likely to be:

  • Hash tables are easier to implement than trees. Neither is entirely trivial, but hash tables are a bit easier, and the impact on the domain of legal keys is less stringent as you just need a hashing function and an equality function; trees require a total order function, and that's a lot harder to write.
  • Hash tables (may) have better performance at small sizes. This matters a lot because a significant fraction of work only theoretically deals with large datasets; in practice, much actually works with only tens or hundreds of keys, not millions. Small scale performance matters a lot, and you can't use asymptotic analysis to figure out what is best there; you have to actually implement and measure.

Easier to write/maintain, and a performance winner in typical use cases? Sign me up, please!

share|improve this answer
add comment

Not the answer you're looking for? Browse other questions tagged or ask your own question.