How Unique Should a Unique Code Be? | Mon 15 Nov 2010 | Blog

A system I'm working on needs to generate and store unique codes that will be distributed to its users. Any user can then enter one of these codes to obtain a discount at the basket.

Because anybody can enter a code, the codes need to be random enough that users who haven't been given one can't simply guess one.

The codes will be handed out ad-hoc to users who will take the code and enter it at the basket. Nothing ties the code to any given user though.

The code I've come up with to generate the codes (read "pasted and modified off Google") is like this:

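// Shared generator; it must be static so the static Generate method can use it.
private static readonly Random _rng = new Random();
// 28 characters: look-alikes such as I/1, O/0 and Z/2 are left out.
private const string _chars = "ABCDEFGHJKLMNPQRSTWXY3456789";

public static string Generate(int size)
{
    char[] buffer = new char[size];

    for (int i = 0; i < size; i++)
    {
        // Pick one character at random from the allowed set.
        buffer[i] = _chars[_rng.Next(_chars.Length)];
    }
    return new string(buffer);
}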

It generates codes like this:

TWEXB8GE, WHJ55459, AEJA6XP5, D4J3RXJK, NRMALE8H, WMJGQAKY, YAQQKKX7
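
Since the aim is for codes to be hard to guess, it's worth noting that System.Random is a general-purpose pseudo-random generator and isn't designed to be unpredictable. Here is a minimal sketch of the same loop using a cryptographic source instead. It assumes a runtime that has RandomNumberGenerator.GetInt32 (.NET Core 3.0 or later), and the class name SecureCodeGenerator is made up for illustration.

```csharp
using System;
using System.Security.Cryptography;

public static class SecureCodeGenerator
{
    // Same 28-character alphabet as the post.
    private const string Chars = "ABCDEFGHJKLMNPQRSTWXY3456789";

    public static string Generate(int size)
    {
        char[] buffer = new char[size];
        for (int i = 0; i < size; i++)
        {
            // GetInt32 returns a uniformly distributed value in [0, Chars.Length).
            buffer[i] = Chars[RandomNumberGenerator.GetInt32(Chars.Length)];
        }
        return new string(buffer);
    }
}
```

The output looks the same as before; only the source of randomness changes.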

Notice that I've left out confusingly similar character combinations such as I and 1, Z and 2. Going further, I could probably leave out S and 5 as well as 8 and B?

The issue I have is one of compromise. How do I balance uniqueness and "security" with ease of entry for the end user?

A string that's 8 characters long might not look too hard to guess, but the number of possible codes is in fact very large.

If the set of characters used has 28 members and the string is 8 characters long, then the chance of guessing one particular code in a single attempt is one in 28^8, or 28 to the power of 8, or 28*28*28*28*28*28*28*28, which is 377,801,998,336.

Assuming my maths is right?

Here's how some other combinations stack up:

Possible Characters	Code Length	Possible Codes
28	12	232,218,265,089,212,416
28	8	377,801,998,336
28	6	481,890,304
28	4	614,656
10	10	10,000,000,000
10	3	1,000
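
The figures in the table are just the alphabet size raised to the power of the code length. A small sketch for checking them follows; CodeSpace and Count are names invented here, and the checked keyword is only there so that an overly long code length fails loudly instead of silently overflowing.

```csharp
using System;

public static class CodeSpace
{
    // Number of distinct codes: alphabetSize multiplied by itself 'length' times.
    public static long Count(int alphabetSize, int length)
    {
        long total = 1;
        for (int i = 0; i < length; i++)
        {
            total = checked(total * alphabetSize); // throws OverflowException if too big for a long
        }
        return total;
    }
}
```

For example, CodeSpace.Count(28, 8) returns 377801998336 and CodeSpace.Count(28, 4) returns 614656.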

Maybe a code 8 characters long is too much. Probably 6 would suffice. What would you opt for?

Comments

- Jerry Carter
- Mon 15 Nov 2010 10:18 AM
A couple of thoughts - expiration (life span) and ease of use.
Given ease of use as a factor (how hard for users to type in if they can't puzzle out copy and paste) and setting an expiration (n days on m population of codes) you could well manage with 4 characters... which is readily available in the form of a Note document ID for those writing Domino based applications. I've used this for a url clipping implementation where the short urls are not long lived and it gives me plenty of usage. I might go to 6 just to extend the life span of the urls out a bit or even 8 would do if I wanted them to live on for almost-ever.

1. - Jake Howlett
  - Mon 15 Nov 2010 10:26 AM
  As I understand it there's a chance the codes will sometimes be printed on paper and literally handed out to users (or even non-users as an incentive to become one). So no copy/paste there and hence I wanted to keep them as short and readable as possible.
  They need to last indefinitely. There's an option for them to have an expiry date but also an option not to set one.
  
- Mark Barton
- Mon 15 Nov 2010 10:47 AM
If the codes are going to be given out then generate a complex one and produce a QR code - http://code.google.com/apis/chart/docs/gallery/qr_codes.html.
Then just implement a barcode reader (already done for Android / iPhone) for webcams ;-)

- Dragon Cotterill
- Mon 15 Nov 2010 11:15 AM
Always suspicious of auto-generated codes. Have to make sure that they don't match real words. Otherwise odd combinations involving cfku (and other variants) could crop up.

1. - Jake Howlett
  - Mon 15 Nov 2010 11:22 AM
  Hadn't thought of that. Although, according to the table above, the chance of any given four-letter-word cropping up is 1:614,656. That's fairly unlikely, no?
  
  1. - Jerry Carter
    - Mon 15 Nov 2010 01:38 PM
    An important fact about probabilities - there is always "a" chance, sort of like having a deck of 52 cards. While the odds you draw a 2 are one in 13 and that it's the 2 of Spades is 1 in 52, the chance that it is the next card you draw is equal to the chance that you draw any other card... all draws have a 1:52 probability and hence equal chance for any combination.
    Maybe blacklisting specific options would be a good idea, or setting up some regular expression filtering would be better.
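
A blacklist check along those lines might look like the sketch below. The two blocked words are placeholders picked for illustration, and the names CodeFilter and IsAcceptable are assumptions, not part of the original system.

```csharp
using System;
using System.Linq;

public static class CodeFilter
{
    // Placeholder entries for illustration; a real list would be longer.
    private static readonly string[] Blocked = { "DAMN", "FART" };

    // True when the code contains none of the blocked substrings.
    public static bool IsAcceptable(string code)
    {
        return !Blocked.Any(word =>
            code.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

A generator would then simply throw away any code that fails the check and generate another.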
    
- Flemming Riis
- Mon 15 Nov 2010 11:17 AM
It depends on the value of what you are protecting.
If it's bra size for Victoria's Secret and there isn't a name or any info to ID the person, keep it as short as possible.
If it hides a home address, kill it and opt for user/password + token.
GUIDs are nice but not very user-friendly to type back in.
If you buy something from RIM you get a /reg.do ID=820874xxx&PD=29962xxx
So that's 18 numbers only, if that's easy to retype and represents a high value.
So it depends on the value of the data.

1. - Flemming Riis
  - Mon 15 Nov 2010 11:18 AM
  I need to learn how to read; you mentioned the data already. Is there a policy for code reuse, so that if there is a duplicate the next one won't get a discount?
  
  1. - Jake Howlett
    - Mon 15 Nov 2010 11:20 AM
    Codes are never re-used and there's code to make sure duplicates never occur.
    
- Hynek Kobelka
- Mon 15 Nov 2010 11:30 AM
Obviously the length of the code also depends on the number of valid codes that you will give out, because a "guesser" only needs to hit any one of them.
Purely mathematically, I think that if you want to keep it short then the best way is to expand your character map as much as possible. Right now you use only 28 allowed chars. But with the whole alphabet in upper and lowercase plus the digits you can get 62.
Of course people will then have problems with similar symbols but maybe this could be resolved with the choice of a proper font.
And one more thing is that if you decide to have a longer code then think about dividing it into smaller groups for better readability: NRMA-LE8H , NR-1234-8H, NR MA LE 8H,...
But these are just ideas :-)
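
Grouping only changes how a code is displayed, so it can be added and stripped again around whatever is stored. A rough sketch, with CodeFormat, Group and Normalize as invented names:

```csharp
using System;
using System.Text;

public static class CodeFormat
{
    // Inserts a separator after every groupSize characters: NRMALE8H -> NRMA-LE8H.
    public static string Group(string code, int groupSize, char separator = '-')
    {
        if (groupSize <= 0) throw new ArgumentOutOfRangeException(nameof(groupSize));
        var sb = new StringBuilder();
        for (int i = 0; i < code.Length; i++)
        {
            if (i > 0 && i % groupSize == 0) sb.Append(separator);
            sb.Append(code[i]);
        }
        return sb.ToString();
    }

    // Strips hyphens and spaces and upper-cases what the user typed,
    // so it can be compared with the stored, ungrouped code.
    public static string Normalize(string typed)
    {
        return typed.Replace("-", "").Replace(" ", "").ToUpperInvariant();
    }
}
```

Group("NRMALE8H", 4) gives NRMA-LE8H, and Normalize("nrma-le8h") gives back NRMALE8H.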

1. - Jake Howlett
  - Mon 15 Nov 2010 11:58 AM
  Good ideas though Hynek. Thanks!
  
- Richard Schwartz
- Mon 15 Nov 2010 11:45 AM
My calculator gets 377,801,998,336 for 28**8.
How many of these are you going to give out? If you give out a million of them, then (even with my higher number) on average it will take only 377,802 guesses to crack one. That's not very many guesses if it is done with some computer assistance.
And there's something possibly more important than that. You probably don't want to re-use these codes, but 28**8 is only on the order of 2**26 values, so if you just generate ~8000 random codes (2**13, actually) there will be a 50% chance that you have re-used at least one. (Look up 'birthday paradox' on Wikipedia for the details.)
I would go with a longer code, and I would not make it random. I would create codes by applying a hash to a set of unique strings. This has the advantage, too, of allowing you to have customer-specific codes that can't be shared (because the hash input strings contain customer names or account numbers), or having codes that are specific to a particular partner web site (by having the partner name or number in the hash input strings), or codes that are sharable and generic, all with the same format and generation mechanism.
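
One way the hash idea could be sketched is below. The comment doesn't name a hash, so the choice of a keyed HMAC-SHA256 (with a secret key, so that outsiders can't recompute codes from a customer name) is an assumption, as are the names HashedCode and FromInput. Mapping bytes to the 28-character alphabet with a modulo introduces a slight bias, which this sketch ignores.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class HashedCode
{
    private const string Chars = "ABCDEFGHJKLMNPQRSTWXY3456789";

    // Derives a code from an input string (e.g. an account number) and a secret key.
    public static string FromInput(string input, byte[] secretKey, int size)
    {
        if (size < 1 || size > 32) throw new ArgumentOutOfRangeException(nameof(size));
        using (var hmac = new HMACSHA256(secretKey))
        {
            byte[] digest = hmac.ComputeHash(Encoding.UTF8.GetBytes(input)); // 32 bytes
            char[] buffer = new char[size];
            for (int i = 0; i < size; i++)
            {
                // Map each digest byte onto the alphabet (slightly biased, see above).
                buffer[i] = Chars[digest[i] % Chars.Length];
            }
            return new string(buffer);
        }
    }
}
```

The same input and key always produce the same code. Because the digest is cut down to a handful of characters, two different inputs can still end up with the same code, so a duplicate check would still be needed.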

1. - Richard Schwartz
  - Mon 15 Nov 2010 11:53 AM
  Oops! It's not 2**26. I took the natural log on my calculator instead of the log2. It's more like 2**39. That means you can generate close to 1,000,000 codes before there's a 50% probability that you generate a dupe.
  
2. - Jake Howlett
  - Mon 15 Nov 2010 12:01 PM
  "My calculator gets 377,801,998,336" Mine too :-)
  I can't imagine there ever being more than about 10,000 of these codes in existence (and that's a high-end guess).
  You're losing me with all this log2 stuff. It's been a long time since I did any advanced maths. Working out it was 28^8 took me long enough...
  
  1. - Richard Schwartz
    - Mon 15 Nov 2010 01:47 PM
    The log2 stuff is really just asking: How many bits are there in the number when written in binary? That's the key factor when you're trying to figure the probability of getting the same value in a set twice.
    The birthday problem is this: Every time you walk into a bar with at least X people in it, you bet the bartender that there are at least 2 people in the bar with the same birthday. How big does X have to be so that you will win more often than you lose? The answer is just 23, which surprises most people because it's a lot lower than you might suspect. But if you never make this bet when there are fewer than 23 people in the bar, and you always make this bet when there are more than 23, then you will make money. Unless, of course, the bartender knows everybody's birthday and doesn't take the bet when he would lose! ;-)
    The bits come into it because there's an easy way to get an approximate answer just by taking the number of bits that you need to represent the number of choices, divide that number of bits by 2, and raise 2 to that power. The larger the numbers you're dealing with, the better this is as an approximation. (It's actually not a very good approximation for a number as low as 365. You get 19 this way, which will cause you to lose money!)
    The log2 comes in because that's how you define the number of bits when your number of choices isn't a power of 2. E.g., for 365, the Log2 is 8.51, so that's how many "bits" you are dealing with.
    Anyhow, if there really won't ever be more than 10,000 of these codes, you're probably okay.
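
For anyone who'd rather let the computer do the sums, here is a small sketch of both calculations. It uses the standard approximation 1 - e^(-n(n-1)/(2N)) for the chance of at least one duplicate among n random picks from N possibilities, which is a different shortcut from the bits-divided-by-two rule described above. The names Birthday, CollisionProbability and Bits are made up.

```csharp
using System;

public static class Birthday
{
    // Approximate probability that n random picks from 'space' possible
    // values contain at least one duplicate: 1 - e^(-n(n-1) / (2 * space)).
    public static double CollisionProbability(double n, double space)
    {
        return 1.0 - Math.Exp(-n * (n - 1.0) / (2.0 * space));
    }

    // Number of "bits" in the space, i.e. log base 2 of the number of choices.
    public static double Bits(double space)
    {
        return Math.Log(space, 2);
    }
}
```

Bits(365) comes out at about 8.51, and CollisionProbability(23, 365) lands at roughly one half. For 10,000 codes drawn from the 377,801,998,336 possible 8-character codes, the estimate is roughly 0.013%.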
    
- Greg
- Mon 15 Nov 2010 04:54 PM
One other thing you could consider if you want to prevent people guessing your codes is to add a checksum character into the code. At its simplest this can just be the character that might represent the sum of all the other characters. I'm sure you could work out the details of how that would work pretty quickly.
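
A minimal sketch of the simplest version described here: add up the alphabet positions of the characters, take the remainder after dividing by the alphabet size, and append the character at that position. CodeChecksum, Append and IsValid are invented names. A plain sum like this catches any single mistyped character but not two characters swapped around, and anyone who knows the rule can compute the check character themselves.

```csharp
using System;

public static class CodeChecksum
{
    private const string Chars = "ABCDEFGHJKLMNPQRSTWXY3456789";

    // Sum of the alphabet positions of the characters, modulo the alphabet size.
    // Returns -1 if the body contains a character outside the alphabet.
    private static int Sum(string body)
    {
        int sum = 0;
        foreach (char c in body)
        {
            int index = Chars.IndexOf(c);
            if (index < 0) return -1;
            sum += index;
        }
        return sum % Chars.Length;
    }

    // Returns the body with its check character appended.
    public static string Append(string body)
    {
        int check = Sum(body);
        if (check < 0) throw new ArgumentException("Invalid character in code.", nameof(body));
        return body + Chars[check];
    }

    // True when the last character matches the checksum of the rest.
    public static bool IsValid(string code)
    {
        if (string.IsNullOrEmpty(code) || code.Length < 2) return false;
        int check = Sum(code.Substring(0, code.Length - 1));
        return check >= 0 && Chars[check] == code[code.Length - 1];
    }
}
```

With this alphabet, Append("TWEXB8G") gives TWEXB8GH, and changing any one character of that result makes IsValid return false.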

1. - Jake Howlett
  - Tue 16 Nov 2010 09:52 AM
  "I'm sure you could work out the details of how that would work pretty quickly."
  Hmm, your faith in me may be misplaced. Never did get checksums.
  
- Curtis Kuhn
- Mon 15 Nov 2010 05:05 PM
One thing you might want to keep in mind from a usability standpoint is to keep the letters in lowercase. That way users can more easily distinguish between letters and numbers. They won't be left wondering if something is a 0 (number zero) or an O (letter O). Of course you then might run into confusion with lowercase l and the number 1. Maybe a good idea to eliminate 0s, Os, ls and 1s altogether. It decreases your pool of available codes but would probably lead to less frustration and a higher success rate.

- Jeroen Jacobs
- Mon 15 Nov 2010 05:11 PM
Windows API has a CoCreateGuid() function, which can be called from LotusScript too...
It creates 128-bit integers, but you can perform a base32 conversion on it, so you will get an alphanumeric text-string.
Oh yeah, make sure your generated codes do not contain any profanity :-))
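
As a rough idea of what that conversion involves, here is a sketch that uses .NET's Guid.NewGuid() as a stand-in for calling CoCreateGuid directly, and the standard RFC 4648 base32 alphabet without padding. The class and method names are invented. A GUID is designed to be unique, which isn't necessarily the same thing as being hard to guess.

```csharp
using System;
using System.Text;

public static class GuidCode
{
    private const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"; // RFC 4648 base32

    // Encodes bytes as base32 text, five bits per character, without padding.
    public static string Encode(byte[] bytes)
    {
        var sb = new StringBuilder();
        int buffer = 0, bitsInBuffer = 0;
        foreach (byte b in bytes)
        {
            buffer = (buffer << 8) | b;
            bitsInBuffer += 8;
            while (bitsInBuffer >= 5)
            {
                bitsInBuffer -= 5;
                sb.Append(Alphabet[(buffer >> bitsInBuffer) & 31]);
            }
            buffer &= (1 << bitsInBuffer) - 1; // keep only the leftover bits
        }
        if (bitsInBuffer > 0)
        {
            // Pad the final partial group with zero bits on the right.
            sb.Append(Alphabet[(buffer << (5 - bitsInBuffer)) & 31]);
        }
        return sb.ToString();
    }

    // A new GUID (16 bytes, 128 bits) rendered as 26 base32 characters.
    public static string NewCode()
    {
        return Encode(Guid.NewGuid().ToByteArray());
    }
}
```

At 26 characters the result is a lot longer than the 6 to 8 characters being discussed, which matters if people have to type it in.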

- Liam McLaughlin
- Tue 16 Nov 2010 09:45 AM
The opposite of security is usually usability - and in this case if the user is typing in the code then it has to be short-ish. IMHO less than 9 and as Hynek suggested grouped for readability.
Case sensitivity is to be avoided for good usability, and likewise any similar letters/numbers.
I'm also interested in the google search you'll have to do to try to find the list of unsuitable words to parse out...could be some interesting results. Let us know how that one goes

1. - Jake Howlett
  - Tue 16 Nov 2010 09:50 AM
  I think just removing most of the vowels will remove any risk of profanities popping up.
  My new list of chars is:
  ACDEFGHJKLMNPQRTWXY34679
  If you can spell a naughty word with them you're a smarter fecker than me ;-)
  
  1. - Sorry!
    - Tue 16 Nov 2010 02:36 PM
    FART?
    
  2. - Michelle O'Rorke
    - Tue 16 Nov 2010 05:26 PM
    Maybe not strictly swearing, but I can still make
    .. eat me
    And what about words in other languages?
    You may need to add to the algorithm so that there is no more than two consecutive letters before a number is added.
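
One way to apply that rule without changing the generator is to measure the longest run of letters in a finished code and reject any code where it is more than two. A small sketch, with LetterRuns and LongestLetterRun as made-up names:

```csharp
using System;

public static class LetterRuns
{
    // Length of the longest run of consecutive letters in the code.
    public static int LongestLetterRun(string code)
    {
        int longest = 0, current = 0;
        foreach (char c in code)
        {
            // A digit (or any other non-letter) resets the current run.
            current = char.IsLetter(c) ? current + 1 : 0;
            longest = Math.Max(longest, current);
        }
        return longest;
    }
}
```

LongestLetterRun("NRMALE8H") is 6, so that code would be rejected, while "PE4AR7" scores 2 and would pass. Rejecting codes this way shrinks the pool of possible codes, so the guessing odds get a little worse.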
    
    1. - Jake Howlett
      - Thu 18 Nov 2010 04:34 AM
      Would you believe it, I just found a generated code with PEAR in the middle of it. So, the chance of a four letter word cropping up can't be dismissed. The word "pear" re-arranged could spell a word likely to cause offence too!
      Right, I'm dropping "A" now too.
      
  3. - Roger Melly
    - Wed 17 Nov 2010 09:13 AM
    w@nk, pr@t, wedgy, pearl necklace :o)
    
- Andrew Magerman
- Wed 17 Nov 2010 01:52 AM
Hi Jake, if this is Domino, why not use @Unique (without a parameter)? It gives strings that are like this:
AMAG-8BAB9A

1. - Jake Howlett
  - Wed 17 Nov 2010 02:29 AM
  It's not Domino, but, if it were, I'm not sure @unique would cut it.
  The first 4 chars are fixed and so there's "only" 308,915,776 possibilities, which I guess is enough in reality, but aren't they produced sequentially?
  My code would produce, say, 100 codes at once. Assuming they are in fact guaranteed unique I'm guessing there's a chance that a user who received code AMAG-8BAB9A could then take a stab at AMAG-8BAB9B and AMAG-8BAB9C etc.
  
  1. - Andrew Magerman
    - Fri 19 Nov 2010 02:41 AM
    AFAIK it's a time-stamp. One thought - if the users identify themselves and you know their identity before you send the numbers to them, I would make the number a hash of their names, plus another salt. That would make the code work only for that particular user. In Domino, @Password would be a good first choice
    
- Rob Shaver
- Wed 1 Dec 2010 05:01 PM
How about using a selection of 1000 short words? You put two words together with a digit between them. Words are easier to type because they are recognizable. This would give you about 10 million combinations I think.
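
A rough sketch of the idea follows. The five-word list is a tiny placeholder standing in for the 1000 short words suggested, the names are invented, and it assumes a runtime with RandomNumberGenerator.GetInt32 (.NET Core 3.0 or later).

```csharp
using System;
using System.Security.Cryptography;

public static class WordCode
{
    // Tiny placeholder list; the suggestion is for about 1000 short words.
    private static readonly string[] Words = { "apple", "brick", "cloud", "delta", "ember" };

    // Two randomly chosen words with a single random digit between them.
    public static string Generate()
    {
        string first = Words[RandomNumberGenerator.GetInt32(Words.Length)];
        string second = Words[RandomNumberGenerator.GetInt32(Words.Length)];
        int digit = RandomNumberGenerator.GetInt32(10);
        return first + digit + second;
    }

    // Choices for the first word * 10 digits * choices for the second word.
    public static long Combinations(int wordCount)
    {
        return (long)wordCount * 10 * wordCount;
    }
}
```

Combinations(1000) returns 10000000, which matches the estimate of about 10 million. That is a much smaller pool than the 377,801,998,336 possible 8-character codes.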
