Use a hash table. Each entry holds a reasonably unique key generated from a string, plus pointers to all the strings that hash to that key.
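As a rough illustration of that layout, here is a minimal Python sketch. The names `toy_key`, `build_table` and `contains` are made up for this example, and the byte-sum key is only a placeholder. Python's `dict` is itself a hash table; it is used here just to hold the buckets so that the key-to-strings mapping is visible.

```python
from collections import defaultdict


def toy_key(s: str) -> int:
    """Hypothetical placeholder key: sum of the UTF-8 bytes, modulo 256."""
    return sum(s.encode("utf-8")) % 256


def build_table(strings):
    """Map each key to the list of strings that produced it."""
    table = defaultdict(list)
    for s in strings:
        table[toy_key(s)].append(s)
    return table


def contains(table, target: str) -> bool:
    """Hash the target, then compare only against strings in that bucket."""
    return target in table.get(toy_key(target), [])


table = build_table(["alpha", "beta", "gamma"])
print(contains(table, "beta"))   # True
print(contains(table, "delta"))  # False
```

A lookup only compares the target against the strings that share its key, which is why the number of collisions per key matters.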
A simple, fast hash with a small hash table (256 entries), but lots of key collisions, would be the XOR of all bytes in the string. A slower, more complicated hash with a much larger hash table (as many entries as strings), but where key collisions are unlikely, would be a cryptographic hash such as SHA-256.
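To make the trade-off concrete, here is a small Python sketch of both ends. The function names are hypothetical, and SHA-256 from `hashlib` is my choice for illustrating the slower, collision-resistant end; any strong hash would make the same point.

```python
import hashlib


def xor_hash(data: bytes) -> int:
    """8-bit key: XOR of all bytes. Only 256 possible values."""
    key = 0
    for b in data:
        key ^= b
    return key


def sha256_hash(data: bytes) -> str:
    """256-bit key as hex: collisions are not expected in practice."""
    return hashlib.sha256(data).hexdigest()


# "ab"/"ba" and "abc"/"cab" contain the same bytes, so XOR cannot tell them apart.
samples = [b"ab", b"ba", b"abc", b"cab"]
print(len({xor_hash(s) for s in samples}))     # 2
print(len({sha256_hash(s) for s in samples}))  # 4
```

XOR ignores byte order, so any two strings that are rearrangements of each other collide, on top of the limit of 256 distinct keys.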
I realize you're using C#, but here's a little Perl script to help you investigate what hash table size you'd like to use. This version of the keyify() function just sums the characters of the string into a 16-bit integer.
# keyspace.pl
use strict;
use warnings;
use constant HASH_KEY_SIZE_IN_BITS => 16;

# Sum the characters of the string into a HASH_KEY_SIZE_IN_BITS-bit checksum.
sub keyify {
    my ($string) = @_;
    return unpack( '%' . HASH_KEY_SIZE_IN_BITS . 'A*', $string );
}

$/ = "\r\n";    # Windows EOL

my %myhash;
my $offset = 0;
while (<>) {
    my $newoffset = $offset + length($_);
    my $key       = keyify($_);
    if ( defined $myhash{$key} ) {
        # key collision, add to the list of offsets
        push @{ $myhash{$key} }, $offset;
    } else {
        # new key, create the list of offsets
        $myhash{$key} = [$offset];
    }
    $offset = $newoffset;
}

printf "%d keys generated\n", scalar( keys %myhash );

my $max = 0;
foreach ( keys %myhash ) {
    my $collisions = scalar @{ $myhash{$_} };
    $max = $collisions if ( $collisions > $max );
}
print "maximum # of string offsets in a hash = $max\n";

exit;    # remove this line to also dump the hash table

# dump hash table
foreach ( keys %myhash ) {
    print "key = $_:";
    foreach my $offset ( @{ $myhash{$_} } ) {
        print " $offset";
    }
    print "\n";
}
Use it like so:
perl keyspace.pl <strings.dat
Here is the same thing in PowerShell, with a much simpler hashing function: the first character of each string. You'll have to put in some effort, starting with a better keyify, if you want this to be useful.
# keyspace.ps1
# Don't use "gc -Encoding Byte -Raw" because it reads the ENTIRE file into memory.

# Placeholder hash: the first character of the string ('' for an empty line).
function keyify {
    $s = $args[0]
    if ($s.Length -eq 0) { return '' }
    return $s.Substring(0, 1)
}

$myHash = @{}
$offset = 0
# Resolve the path first: .NET's working directory can differ from PowerShell's.
$path = (Resolve-Path $args[0]).ProviderPath
$file = New-Object System.IO.StreamReader($path)
# Compare against $null so that an empty line does not end the loop early.
while ($null -ne ($line = $file.ReadLine())) {
    # Adjust by 2 for Windows EOL (CRLF); assumes one byte per character.
    $newoffset = $offset + $line.Length + 2
    $key = keyify $line
    if ($myHash.ContainsKey($key)) {
        # key collision, add to the list of offsets
        $myHash[$key] += $offset
    } else {
        # new key, create the list of offsets
        $myHash[$key] = @($offset)
    }
    $offset = $newoffset
}
$file.Close()

Write-Output "$($myHash.Count) keys generated"
$max = 0
foreach ($key in $myHash.Keys) {
    $collisionList = $myHash[$key]
    if ($collisionList.Count -gt $max) { $max = $collisionList.Count }
}
Write-Output "maximum # of string offsets in a hash = $max"
# Write-Output $myHash
Use it like so:
.\keyspace.ps1 strings.dat
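The scripts above only measure how the keys spread out. For completeness, here is a hedged Python sketch of how the stored offsets might then be used for a lookup: hash the target, seek to each candidate offset, and compare. The names `byte_sum_key`, `build_index` and `find` are hypothetical, and the key is a 16-bit byte sum in the spirit of the Perl keyify(), except that it leaves out the line ending.

```python
def byte_sum_key(line: bytes) -> int:
    """Hypothetical 16-bit key: sum of the bytes, truncated to 16 bits."""
    return sum(line) & 0xFFFF


def build_index(path, keyify=byte_sum_key):
    """Map each key to the byte offsets of the lines that produced it."""
    index = {}
    offset = 0
    with open(path, "rb") as f:      # binary mode keeps offsets exact
        for line in f:               # splits on b"\n" and keeps the EOL
            key = keyify(line.rstrip(b"\r\n"))
            index.setdefault(key, []).append(offset)
            offset += len(line)
    return index


def find(path, index, target: bytes, keyify=byte_sum_key):
    """Return the byte offset of target in the file, or None if absent."""
    with open(path, "rb") as f:
        for offset in index.get(keyify(target), []):
            f.seek(offset)
            if f.readline().rstrip(b"\r\n") == target:
                return offset
    return None
```

With an index built by `build_index("strings.dat")`, a call such as `find("strings.dat", index, b"some string")` reads only the lines whose key matches, so a better-distributed key means fewer seeks per lookup.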