My project involves validating and normalising email addresses in this format
[userpart]@[domainpart].[tld]
After syntactic validation of the address, [tld] is checked for existence; if it doesn't exist, the validation fails.
After validation, [userpart] and [domainpart] are each normalised into a 32-bit numeric ID called [partid]; when a new string is discovered, its partid and the part string are permanently stored on disk. 4 bytes seems the most sensible size for each, as 3 bytes (2^24 ≈ 16.7 million values) is not enough: there are around 200 million registered domain names just now.
To process lists of new emails quickly, I've been using Judy arrays (JudySL to be exact, http://judy.sourceforge.net/doc/index.html), where the part string is mapped to the part ID.
I've used Judy arrays as I know how to use them and they generally perform well, although they are not thread safe, which is a slight drawback (I use mutexes to get around that). There are none of the collisions that I'd get with a hash table, which reduces complexity somewhat.
However they seem to be taking up a much larger proportion of memory than I'd like, so my question is what would you suggest as a better storage method?
Example case:
- 64-bit system (Debian based)
- 150,000,000 "parts", averaging around 8 bytes each, 12 bytes total when including the partid
- ~1.67GB for that data in itself
- ~5.1GB used by Judy
- Perhaps worth noting that the emails are evaluated after being converted to punycode format.
- Input would come in lists varying between 10K and 1M rows
- Inserts on-demand: when a new [part] is found, it is assigned ++partid as its partid, and the partid/part pair is written to disk
- No deletes required, only inserts and lookups.
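In case it clarifies the access pattern, here's a minimal sketch of that insert/lookup flow. A plain Python dict stands in for the Judy array, and the log file name and tab-separated layout are illustrative, not the actual on-disk format:

```python
# Minimal sketch of the on-demand insert/lookup flow described above.
# A plain dict stands in for the JudySL array; "parts.log" and the
# tab-separated layout are illustrative, not the real on-disk format.

part_ids = {}      # part string -> partid (the in-memory index)
next_partid = 0    # last assigned id; a new part gets ++next_partid

def lookup_or_insert(part, log_path="parts.log"):
    """Return the partid for `part`, assigning and persisting a new one if unseen."""
    global next_partid
    partid = part_ids.get(part)
    if partid is None:
        next_partid += 1
        partid = next_partid
        part_ids[part] = partid
        # Append-only persistence: inserts and lookups only, no deletes.
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(f"{partid}\t{part}\n")
    return partid
```

The open question is only about the in-memory index (`part_ids` here); the append-only disk side stays the same whatever structure replaces it.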
So I need around 3x more memory than disk space in order to convert [~8-byte-string] to an int.
Are there any better(*) data structures that I should be considering?
(* by better, I'd hope for reasonably fast, with a smaller memory footprint and preferably something pre-written like a library...)
Update
The TL;DR of my original question: converting a variable-length string to a 32-bit int.
Strictly speaking the userpart is case sensitive, but from what I understand the standard implementation is to ignore case. In that case (no pun intended), there are 56 valid characters in either punycoded part, which fits in 6 bits per character.
I'm thinking I can sacrifice 1 bit to indicate direct conversion of string->int without a lookup table, which would also allow conversion back. The bit would be set when direct encoding is possible and would prevent collisions with looked-up IDs, leaving 31 bits for data and 2^31 IDs for all other parts that don't fit.
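As a rough illustration of that flag-bit scheme: with 6 bits per character, 31 bits of payload holds at most 5 characters. The alphabet below is an assumption for the sketch, not the actual 56-character set:

```python
# Illustrative flag-bit scheme: strings of up to 5 characters from a
# 6-bit alphabet are packed directly into the low 30 bits, the top bit
# (bit 31) marks "directly encoded", and code 0 is reserved as padding
# so shorter strings decode unambiguously.
# This ALPHABET is an assumption, not the author's actual 56-char set.

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._"
CODE = {c: i + 1 for i, c in enumerate(ALPHABET)}   # codes 1..39; 0 = padding
FLAG = 1 << 31

def encode_direct(s):
    """Pack `s` into a 32-bit int with the flag bit set, or None if it doesn't fit."""
    if len(s) > 5 or any(c not in CODE for c in s):
        return None
    value = 0
    for c in s:
        value = (value << 6) | CODE[c]
    return FLAG | value

def decode_direct(n):
    """Reverse encode_direct: strip the flag bit and unpack the 6-bit codes."""
    assert n & FLAG, "not a directly-encoded ID"
    n &= ~FLAG
    chars = []
    while n:
        chars.append(ALPHABET[(n & 0x3F) - 1])
        n >>= 6
    return "".join(reversed(chars))
```

Any ID with the flag bit clear is then an ordinary looked-up partid, so the two ranges can't collide.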
I was messing around with fixed-length 4-, 5-, and 6-bit codes based on the least frequent character, but Huffman encoding would seem to (obviously?) win out in the long run and avoids a 2nd pass of the input string. Here's some data based on the char distribution across those 150M parts.
Huffman manages to squeeze yahoo, gmail and even hotmail into 31 bits, which is excellent. 6.1% of all parts can be mapped this way.
With the same Huffman encoding, a further 70.5% can be mapped to 63 bits, which would allow a fixed-length int key->value list. This should save on memory (I'll try to report back).
That leaves less than 25% of all parts to be mapped still.
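For the Huffman side, a small sketch of building the code table with the standard library's heapq. The real table would be built from the char distribution across the 150M parts; the frequencies in the usage below are made up for illustration:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code table (char -> bitstring) from {char: frequency}."""
    tie = count()  # tie-breaker so heapq never has to compare tree nodes
    heap = [(f, next(tie), c) for c, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: recurse both ways
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record the accumulated code
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

def encoded_bits(s, codes):
    """Total bit length of `s` under the given code table."""
    return sum(len(codes[c]) for c in s)
```

A part then qualifies for the 31-bit direct form whenever `encoded_bits(part, codes)` fits in the 31 payload bits, and for the 63-bit fixed-length form whenever it fits in 63.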
This new layout should help a lot for saving on memory, thanks for the suggestions so far.