Huffman Encoding
The Huffman Encoding algorithm is an extremely simple algorithm for compressing data. In this example, we compress text data to binary data of variable lengths. This algorithm is called "greedy" because it compresses the most common characters to shorter strings than less common characters.
(Example in Python 2.7.11)
from heapq import heapify, heappush, heappop
def huffman(char_freqs):
''' The Huffman Encoding algorithm, taking a dictionary of
characters and their frequencies.
'''
min_queue = [[val, [key, '']] for key, val in char_freqs.iteritems()]
heapify(min_queue)
while len(min_queue) > 1:
low = heappop(min_queue)
high = heappop(min_queue)
for pair in low[1:]:
pair[1] = '0' + pair[1]
for pair in high[1:]:
pair[1] = '1' + pair[1]
heappush(min_queue, [low[0] + high[0]] + low[1:] + high[1:])
return sorted(heappop(min_queue)[1:], key=lambda v: (len(v[-1]), v))
char_freqs = {'A': 0.25, 'T: 0.5, 'C': 0.2, 'G': 0.05}
encoded = huffman(char_freqs)
print(encoded)
# prints: [['T', '1'], ['A', '00'], ['C', '011'], ['G', '010']]
Now, imagine we had a long string representing a genome. That string would only have four characters, repeated many times (A, C, T, and G). We could compress this long string for storage by replacing each letter with it's small binary representation, saving us a lot of space.
Sign up or log in
Save edit as a guest
Join Stack Overflow
We recognize you from another Stack Exchange Network site!
Join and Save Draft