My project involves validating and normalising email addresses in this format
[userpart]@[domainpart].[tld]
After syntactic validation of the address, [tld] is checked for existence; if it doesn't exist, the validation fails.
After validation, [userpart] and [domainpart] are each normalised into a 32-bit numeric ID called [partid]; when a new string is discovered, its partid and the part string are permanently stored on disk. 4 bytes seems the most sensible size for each, as 3 bytes (2^24 ≈ 16.7 million values) is not enough: there are around 200 million registered domain names just now.
To process lists of new emails quickly, I've been using Judy arrays (JudySL to be exact, http://judy.sourceforge.net/doc/index.html), where the part string is mapped to the part ID.
I've used Judy arrays as I know how to use them and they generally perform well, although they are not thread safe, which is a slight drawback (I use mutexes to get around that). There are none of the collisions that I'd get with a hash table, which reduces complexity somewhat.
However they seem to be taking up a much larger proportion of memory than I'd like, so my question is what would you suggest as a better storage method?
Example case:
- 64-bit system (Debian based)
- 150,000,000 "parts", averaging around 8 bytes each, 12 bytes total when including the partid
- ~1.67GB for that data in itself
- ~5.1GB used by Judy
- Perhaps worth noting that the emails are evaluated after being converted to punycode format.
- Input would come in lists varying between 10K and 1M rows
- Inserts on-demand: when a new [part] is found, it is assigned ++partid as its partid, and the partid/part pair is written to disk
- No deletes required, only inserts and lookups.
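In case it clarifies the access pattern, here's a minimal sketch of that insert/lookup flow. A plain Python dict stands in for the Judy array, and the log file name and tab-separated layout are illustrative, not the actual on-disk format:

```python
# Minimal sketch of the on-demand insert/lookup flow described above.
# A plain dict stands in for the JudySL array; "parts.log" and the
# tab-separated layout are illustrative, not the real on-disk format.

part_ids = {}      # part string -> partid (the in-memory index)
next_partid = 0    # last assigned id; a new part gets ++next_partid

def lookup_or_insert(part, log_path="parts.log"):
    """Return the partid for `part`, assigning and persisting a new one if unseen."""
    global next_partid
    partid = part_ids.get(part)
    if partid is None:
        next_partid += 1
        partid = next_partid
        part_ids[part] = partid
        # Append-only persistence: inserts and lookups only, no deletes.
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(f"{partid}\t{part}\n")
    return partid
```

The open question is only about the in-memory index (`part_ids` here); the append-only disk side stays the same whatever structure replaces it.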
So I need around 3x more memory than disk space in order to convert [~8-byte-string] to an int.
Are there any better(*) data structures that I should be considering?
(* by better, I'd hope for reasonably fast, with a smaller memory footprint and preferably something pre-written like a library...)
Update
The TL;DR of my original question: converting a variable-length string to a 32-bit int.
Strictly speaking the userpart is case sensitive, but from what I understand the standard implementation is to ignore case. In that case (no pun intended), there are 56 valid characters in either punycoded part, which fits in 6 bits per character.
I'm thinking I can sacrifice 1 bit to indicate direct conversion of string->int without a lookup table, which would also allow conversion back. The bit would be set when direct encoding is possible and would prevent collisions with looked-up IDs, leaving 31 bits for data and 2^31 IDs for all other parts that don't fit.
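As a rough illustration of that flag-bit scheme: with 6 bits per character, 31 bits of payload holds at most 5 characters. The alphabet below is an assumption for the sketch, not the actual 56-character set:

```python
# Illustrative flag-bit scheme: strings of up to 5 characters from a
# 6-bit alphabet are packed directly into the low 30 bits, the top bit
# (bit 31) marks "directly encoded", and code 0 is reserved as padding
# so shorter strings decode unambiguously.
# This ALPHABET is an assumption, not the author's actual 56-char set.

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._"
CODE = {c: i + 1 for i, c in enumerate(ALPHABET)}   # codes 1..39; 0 = padding
FLAG = 1 << 31

def encode_direct(s):
    """Pack `s` into a 32-bit int with the flag bit set, or None if it doesn't fit."""
    if len(s) > 5 or any(c not in CODE for c in s):
        return None
    value = 0
    for c in s:
        value = (value << 6) | CODE[c]
    return FLAG | value

def decode_direct(n):
    """Reverse encode_direct: strip the flag bit and unpack the 6-bit codes."""
    assert n & FLAG, "not a directly-encoded ID"
    n &= ~FLAG
    chars = []
    while n:
        chars.append(ALPHABET[(n & 0x3F) - 1])
        n >>= 6
    return "".join(reversed(chars))
```

Any ID with the flag bit clear is then an ordinary looked-up partid, so the two ranges can't collide.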
I was messing around with fixed-length 4-, 5-, and 6-bit codes based on the least frequent character, but Huffman encoding would seem to (obviously?) win out in the long run and avoids a 2nd pass of the input string. Here's some data based on the char distribution across those 150M parts.
Huffman manages to squeeze yahoo, gmail and even hotmail into 31 bits, which is excellent. 6.1% of all parts can be mapped this way.
With the same Huffman encoding, a further 70.5% can be mapped to 63 bits, which would allow a fixed-length int key->value list. This should save on memory (I'll try to report back).
That leaves less than 25% of all parts to be mapped still.
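For the Huffman side, a small sketch of building the code table with the standard library's heapq. The real table would be built from the char distribution across the 150M parts; the frequencies in the usage below are made up for illustration:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code table (char -> bitstring) from {char: frequency}."""
    tie = count()  # tie-breaker so heapq never has to compare tree nodes
    heap = [(f, next(tie), c) for c, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: recurse both ways
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record the accumulated code
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

def encoded_bits(s, codes):
    """Total bit length of `s` under the given code table."""
    return sum(len(codes[c]) for c in s)
```

A part then qualifies for the 31-bit direct form whenever `encoded_bits(part, codes)` fits in the 31 payload bits, and for the 63-bit fixed-length form whenever it fits in 63.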
This new layout should help a lot for saving on memory, thanks for the suggestions so far.