Take the 2-minute tour ×
Programmers Stack Exchange is a question and answer site for professional programmers interested in conceptual questions about software development. It's 100% free.

In an app, a user can initiate the generation a six-digit alphanumeric code (actually done via Web API). The code needs to be checked against a remote database to ensure it is unique. Other than for diagnostics or logging, the DB table is only used by the Web API and an Azure web job that cleans out expired codes.

In my tests, collisions occurred on average only one in ~24K times with a high of 85K and a low of 170. It is the latter that concerns me. Although the table is emptied of expired codes daily, at some point there will be a collision. Although unlikely, it could happen with very few records in the table.

It is worth noting that because of exterior requirement, the code is limited to six characters from AEFGHJKLPQRSTUWXYZ123456789.

A simplistic solution would be to keep running the generation/check until a unique code is created. In all likelihood, this would only mean 1-2 iterations. But there's no guarantee. And certainly it can't run unfettered.

The next step would be to put a limit - say 5 tries - through a simple loop. This may be enough of a solution. Despite concern that the user would be frustrated if it ultimately fails, if it does there's probably enough wrong with the overall system to ruin the user experience anyway.

I have good exponential backoff code so although this is not a network issue (also a concern), it may help to incorporate it into the retry scheme. In fact, that might be a nice catch-all solution. Given 5 increasing temporal delays in retrying, it would allow both enough chances for the logical code to exceed the chances of a collision and mitigate introducing more stress on network resources.

Any suggestions/observations would be appreciated.

FYI, this is the code used to generate codes:

    /// <summary>
/// Generates a randomized alphanumeric code
/// </summary>
/// <param name="strLength">e.g. 6 chars</param>
/// <param name="charSet">e.g. AEFGHJKLPQRSTUWXYZ123456789</param>
/// <returns>string of the specified length</returns>
public static string GenerateCode(int strLength, string charSet)
{
    var chars = charSet.ToCharArray();
    var data = new byte[1];
    var crypto = new RNGCryptoServiceProvider();
    crypto.GetNonZeroBytes(data);
    data = new byte[strLength];
    crypto.GetNonZeroBytes(data);
    var result = new StringBuilder(strLength);
    foreach (var b in data)
    {
        result.Append(chars[b % (chars.Length)]);
    }
    return result.ToString();
}
share|improve this question
    
If it is at all possible I'd recommend adding a unique code generator method to the aformentioned Web API, instead of having to deal with collisions client side. –  darri Dec 22 '14 at 17:01
    
Sorry, clarification: The code generator is invoked by the Web API controller. –  Stonetip Dec 22 '14 at 17:12
    
@Stonetip You should update your question with your clarification. As it reads now it sounds like the code is generated client side and then checked with the web api. Also, why is it only 6 digits if you are concerned with collisions? –  GrandmasterB Dec 22 '14 at 17:34
    
Done. Six char limit is a usability thing. They're sometimes relayed verbally to other users (e.g. via conference call). –  Stonetip Dec 22 '14 at 17:43
    
Is the remote database the code is being checked against part of the same application that generates the code? –  GrandmasterB Dec 22 '14 at 17:58

2 Answers 2

up vote 3 down vote accepted

Lets think about your different six-digit codes. You can generate (26 (for alpha) + 10 (for numeric))^6 = 2176782336 different entries. Lets now step into the birthday paradoxon if you have sqrt(2176782336) = 46656 (36^3) elements already in your list, you will have a 50 percent chance to find a duplicate for each new try. Since you are filling the list on the server, it will be harder for the client to find a non-used-code.

If you will not reach 46656 elements within a day, you have very good chances, to use up to 4 tries. But if you come closer to the 46K-barrier, you will have to increase the retry count. If you want lesser probable collisons use more digits or more characters 25 for lower case (omit l) and 24 (Omit O and I) for upper case and 10 for numbers this will increase your 50 Percent barrier to 205379 (59^3).

on the server side you will experience the so called Coupon collectors problem Analogoue to the problem, that the clients are try to give you (the server) coupons you do not already have...

hth.

share|improve this answer
    
Thanks for the math info and the reason for possibly needing to increase the retry count if the table is approaching the sqrt. My actual sqrt is only 19683 (27 char set). If I'm correct, 5 tries will mean ~ 1 in 3200 chance (.5^5) of the user failing to get a code IF the table already has that many values. Seems adequate, given the project requirements. –  Stonetip Dec 22 '14 at 17:38
    
5 retries means a 50 percent chance to fail, if you fail, another 50 percent chance on 50 percent the cases to fail again (Bernoulli experiment) (you only need to succeed once) - this means your success in five tries near the 50 percent boundary is round about 50% + 25% + 12,5% + 6,25% + 3,25% = 96,75 %. If you need more certaincy you will need to extend the number of tries - regardless of the algorithm you use. –  thepacker Dec 22 '14 at 17:50
    
I am bad enough at math to be dangerous. –  Stonetip Dec 22 '14 at 17:56
    
Do i need to elaborate more on the 96 percent thing? If so i will add it to my answer. - my fault 3,25 should be 3,125% and the 96,75 wasnt that correct, i should have used a calculator. –  thepacker Dec 22 '14 at 18:04
1  
No, I just meant that my assumption of .5^5 was bad. I see that going to 7 retries gets me to just over 99% certainty of success and 10 to 99.9% –  Stonetip Dec 22 '14 at 18:07

You can use a table with an auto-incrementing id as the basis for your code. If need be, start the increments at 100,000 for 6 digits. Insert a row, get the id, and return it. It will be guaranteed to be unique.

Depending on your expected usage levels, you might consider tacking on a few random letters to the id. So that code '1234' becomes '1234AB'. This makes it more difficult to stumble upon other people's codes.

You can also implement the above as a lookup-table. Fill a table with suitable codes, and return the corresponding one based on the auto-incremented id.

share|improve this answer
    
I'm using the. .NET RNGCrypographyProvider (msdn.microsoft.com/en-us/library/…) to generate codes. It's fast and makes unpredictable strings. BTW, I tried a lookup table. It was slow and huge (2.5Gb compressed). –  Stonetip Dec 22 '14 at 19:32
    
Why would using the auto-increment id itself not be suitable? How many codes do you need per day? –  GrandmasterB Dec 22 '14 at 19:44
    
Unless I'm missing something, that's going to be more easily guessable than my method (now shown in my question). Here's a sample of generated values: 5TWUYE, X21K5S, QGKJGT, ESS121, JSL8RY –  Stonetip Dec 22 '14 at 20:26

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.