Why does Python use hash table to implement dict, but not Red-Black Tree? [closed]

Question

Why does Python use hash table to implement dict, but not Red-Black Tree?

What is the key? Performance?

Sharing your research helps everyone. Tell us what you've tried and why it didn’t meet your needs. This demonstrates that you’ve taken the time to try to help yourself, it saves us from reiterating obvious answers, and most of all it helps you get a more specific and relevant answer. Also see How to Ask — gnat, Apr 4 at 10:36
Related: Why is python's dict implemented as hash table whereas std::map is tree-based?. — Martijn Pieters, Apr 4 at 11:38

amon · Accepted Answer · 2014-04-04 12:04:03Z

This is a general, non-Python-specific answer.

Algorithmic complexity comparison

       | Hash Table  |   Red-Black Tree    |
       |=============|=====================|
Space  | O(n) : O(n) | O(n)     : O(n)     |
Insert | O(1) : O(n) | O(log n) : O(log n) |
Fetch  | O(1) : O(n) | O(log n) : O(log n) |
Delete | O(1) : O(n) | O(log n) : O(log n) |
       | avg  :worst | average  : worst    |

The problem with hash tables is that hashes can collide. There are various mechanisms to resolve collisions, e.g. open addressing or separate chaining. The absolute worst case is that all keys have the same hash code, in which case a hash table will degrade into a linked list.

In all other cases, a hash table is a great data structure that's easy to implement and delivers good performance. A downside is that implementations that can quickly grow the table and redistribute their entries will likely waste nearly as much memory as is actually being used.

RB-Trees are self-balancing and do not change their algorithmic complexity in a worst case. However, they are more difficult to implement. Their average complexities are also worse than that of a hash table.

Restrictions on keys

All keys in a hash table must be hashable and comparable for equality among each other. This is especially easy for strings or integers, but is also fairly straightforward to extend to user-defined types. In some languages like Java these properties are guaranteed by definition.

Keys in an RB-Tree must have a total order: each key must be comparable with any other key, and the two keys must either compare smaller, greater, or equal. This sorting equality must be equivalent to actual equality. This is straightforward for integers and other numbers, also fairly easy for strings (the order need only be consistent and not externally observable, so the order does not need to consider locales), but difficult for other types that have no inherent order. It is absolutely impossible to have keys of different types unless some comparison between them is possible.

One might think that a cheap solution for RB-Tree keys would be to first test for equality, then compare identity (i.e. compare pointers). However, this ordering would not be transitive: If a == b and b > c, then it must follow that a > c as well, which is not guaranteed here. So instead, we might use the hash code of keys as the lookup keys. Here, the ordering works correctly, but we might end up with multiple distinct keys with the same hash code, which will be assigned to the same node in the RB tree. To solve these hash collisions we can use separate chaining just like with hash tables, but this also inherits the worst case behaviour for hash tables – the worst of both worlds.

Other Aspects

I expect to a hash table to have better memory locality than a tree, because a hash table is essentially just an array.
Entries in both data structures have a fairly high overhead:
- hash table: key, value, and next entry pointer in the case of separate chaining. Also storing the hash code can speed up resizing.
- RB-tree: key, value, colour, left child pointer, right child pointer
Insertions and deletions in an RB-tree involve tree rotations. These aren't really costly, but do involve an overhead that. In a hash, insertion and deletion are no more expensive than a simple access (although resizing a hash table upon insertion is an O(n) endeavour).
Hash tables are inherently mutable, whereas an RB-tree could also be implemented in an immutable fashion. However, this is rarely useful.

Can we have a hash table with little RB-trees for colliding hashes? — aragaer, Apr 4 at 13:04
@aragaer not generally, but it would be possible in some specific cases. However, collisions are usually handled by linked lists – much easier to implement, much less overhead, and usually much more performant because we typically only have very few collisions. If we expect many collisions, we could change the hash function, or use a simpler B-tree. Self-balancing trees like RB-trees are awesome, but there are many cases where they simply don't add value. — amon, Apr 4 at 14:06

Donal Fellows · Answer 2 · 2014-04-04 12:53:06Z

There are a whole range of reasons that might be true, but the key ones are likely to be:

Hash tables are easier to implement than trees. Neither is entirely trivial, but hash tables are a bit easier, and the impact on the domain of legal keys is less stringent as you just need a hashing function and an equality function; trees require a total order function, and that's a lot harder to write.
Hash tables (may) have better performance at small sizes. This matters a lot because a significant fraction of work only theoretically deals with large datasets; in practice, much actually works with only tens or hundreds of keys, not millions. Small scale performance matters a lot, and you can't use asymptotic analysis to figure out what is best there; you have to actually implement and measure.

Easier to write/maintain, and a performance winner in typical use cases? Sign me up, please!

asked	3 months ago
viewed	355 times
active	3 months ago

current community

your communities

more stack exchange communities

Why does Python use hash table to implement dict, but not Red-Black Tree? [closed]

closed as too broad by Martijn Pieters, MichaelT, gnat, World Engineer♦ Apr 8 at 0:10

2 Answers 2

Algorithmic complexity comparison

Restrictions on keys

Other Aspects

Not the answer you're looking for? Browse other questions tagged python data-structures or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Why does Python use hash table to implement dict, but not Red-Black Tree? [closed]

closed as too broad by Martijn Pieters, MichaelT, gnat, World Engineer♦ Apr 8 at 0:10

2 Answers 2

Algorithmic complexity comparison

Restrictions on keys

Other Aspects

Not the answer you're looking for? Browse other questions tagged python data-structures or ask your own question.

Related

Hot Network Questions