The quest for the ID
The ID. The identification number (or code) is used everywhere, and most of the time it is crucial for software applications and for business; often it is also crucial for personal identification.
I would start by noting that the email address is a major example of an ID that most people readily understand. It is unique across the Internet.
We might need to exclude a couple of friends of mine who share a “family” address: in that case the email address is not personal.
What most people might not know is that there is a worldwide effort to “unify” most IDs and make them unique. For products, one example of such an effort is the Global Trade Item Number. It is a good example because you need to pay to be included in the database. The fact that you have to pay shows that keeping an ID unique and working worldwide takes a non-zero effort.
The best ID I have ever used is the “sequential” ID: 1, 2, 3, ..., n, because it is clean and neat to use. As you might have guessed, this post is about everything but the sequential ID; but please do not forget that you have it, and use it every time you can: it is still the best ID ever!
The ID is used within databases to identify a single unique entry. SQL databases have historically used a sequential number. A few programmers separate the software record ID from the DB record ID even when the DB can provide a unique ID, because they feel they have no control over the ID from the software side. This lack of trust has paid off, since newer IDs may have to be global, or the DB may simply fail. But do we really need bright programmers for that?
Let’s see some properties of IDs or DB indexes that applications might require.
- Security:
  - The ID should not be guessable
  - The ID should not be guessable from a previous ID
  - The ID should not be guessable by a number-crunching program
- Uniqueness:
  - The ID should be unique within the application
  - The ID should be unique worldwide
  - The ID should be unique… see discussion later
- Readability:
  - The ID should be readable by a human being
  - The ID should be readable by a human being and possibly easy to memorize
- Explanatory:
  - The ID should contain information that is meaningful to a human being
  - The ID should contain some classification information which can be partially processed
- Speed:
  - The ID should contain some sortable values which speed up searches and listing
These are some of the requirements I have come across in my “ID life”. There might be more.
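To make the security requirements a bit more concrete, here is a minimal Python sketch that contrasts a sequential ID with an unguessable one. The function names are my own for illustration; the only library call is `secrets.token_urlsafe` from the Python standard library.

```python
import secrets


def sequential_ids(start, count):
    """Sequential IDs: clean and neat, but the next one is trivial to guess."""
    return list(range(start, start + count))


def unguessable_id(nbytes=16):
    """Random URL-safe token drawn from the OS cryptographic random source.

    Knowing one token gives no practical help in guessing another.
    Note: this addresses guessability only; uniqueness is merely very likely.
    """
    return secrets.token_urlsafe(nbytes)


print(sequential_ids(32451, 3))   # the neighbours of 32451 are obvious
print(unguessable_id())           # different on every call
```

The trade-off shows up immediately: the random token fails the readability requirement that the sequential number satisfies.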
Before going further, you might want to read these more general documents:
- UUID (Wikipedia: Universally unique identifier) / GUID (Wikipedia: Globally Unique Identifier)
- Wikipedia: Identification (I did not read it all, but it gives you an idea of the problem)
- Mongo ObjectId (a nice example of an ID)
Examples
Here I describe a few examples which I hope summarise the problem of finding a good ID.
Invoices
A typical invoice would have a simple sequential number like “32451”. A typical conversation today might go like this:
VEND: Hello, I am from Company A and I am calling about the invoice 32451 we sent you.
CUST: Yes, no problem, but I cannot find any order on my side from Company A with ID 32451. Are you sure you called the right person?
VEND: Yes, it was an order made on 12/12/2012 for $123.45.
CUST: Let me look… yes, I have such an order but it is from X-Bay with ID: sxAs%rtmnlh-32451
VEND: Sure, sorry. What are the first 5 chars?
CUST: sx lowercase, A capital, s lowercase and the percent symbol.
VEND: …. sure, you paid it twice!
Here you see a few problems in action. In the first place, an invoice is typically an administrative document which has to be referenced by its own ID. The final customer, however, registers the order and the payment due under the name of the main web site and not under Company A. Both parties do agree on a date and an amount. Then they exchange an ID by voice, which is prone to errors.
Invoices solution
Our solution runs this way:
IN-2016-08-04-<EmittingCompanyID>-<SequentialNumber>
This is human readable, and unique for the given target. You might need to make sure that it complies with the law of the given country.
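As a minimal sketch of how such an ID could be assembled (this is not the OFBiz implementation; the function name and the sample company ID "ACME" are only assumptions for illustration):

```python
from datetime import date


def invoice_id(issue_date, company_id, sequence):
    """Build IN-<YYYY-MM-DD>-<EmittingCompanyID>-<SequentialNumber>."""
    if sequence < 1:
        raise ValueError("the sequential number must be positive")
    # date.isoformat() yields YYYY-MM-DD, matching the pattern above
    return f"IN-{issue_date.isoformat()}-{company_id}-{sequence}"


print(invoice_id(date(2016, 8, 4), "ACME", 32451))
# prints: IN-2016-08-04-ACME-32451
```

Read aloud over the phone, this ID carries the date and the emitting company, the two things the vendor and the customer in the conversation above had to work out by hand.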
This solution works on Apache OFBiz, thanks to Jacques Le Roux. As easy as it might look, it took a long discussion between me, Jacques and Pramod prasanth.
Mongo concurrency on probability
You could consider the mongoID as an example of uniqueness, but that is not the case. While it has a high level of randomness, it is not 100% guaranteed to generate unique numbers for the “worldwide” part.
Technically we are left with the last “3-byte counter, starting with a random value”. Why? Because in a production system I might have multiple MongoDB clusters running and, in particular, I might “duplicate” these clusters for backup or fault tolerance. MongoDB IDs work as long as Mongo is the only application generating them in parallel.
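To see what is inside a mongoID, here is a small Python sketch that splits the 12 bytes according to the layout this post refers to: 4-byte timestamp, 3-byte machine ID, 2-byte process ID, 3-byte counter. This is an assumption about the layout: more recent MongoDB documentation describes the middle five bytes as a single random value. The function name is mine, and the input is just a sample value.

```python
from datetime import datetime, timezone


def split_object_id(hex_id):
    """Split a 24-character hex ObjectId into its four legacy fields."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is exactly 12 bytes (24 hex characters)")
    seconds = int.from_bytes(raw[0:4], "big")
    return {
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),                     # 3 bytes
        "process": int.from_bytes(raw[7:9], "big"),    # 2 bytes
        "counter": int.from_bytes(raw[9:12], "big"),   # 3 bytes
    }


parts = split_object_id("507f1f77bcf86cd799439011")
for name, value in parts.items():
    print(name, value)
```

If two clusters end up with the same machine and process bytes within the same second, only the counter field is left to tell their IDs apart.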
Let’s make things a bit more complicated: 1 million concurrent requests on 1 million unique items.
UUID, GUID
For this, please read the Wikipedia documentation. There is enough information there about uniqueness.
The most common variant (version 4) relies on a random number generator. There are several versions, some of them based on timestamps or on name hashes instead.
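If you want to try a few versions yourself, Python ships them in the standard `uuid` module. A short sketch (the name "example.com" is only a placeholder):

```python
import uuid

random_id = uuid.uuid4()   # version 4: built from random bits
time_id = uuid.uuid1()     # version 1: built from a timestamp and a node ID
name_id = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")  # version 5: SHA-1 of a name

for u in (random_id, time_id, name_id):
    print(u.version, u)
```

Only the version 5 ID is reproducible: the same namespace and name always give the same UUID, while the other two change on every call.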
A few links are given above, but please try searching for it yourself.
The ID
The idea of this post is to give directions and, of course, my personal opinion.
For example, the invoice ID is unique, but it is public. I would not use public IDs as DB IDs. Sorry for the strong contradiction. The point is simple: do not use public IDs for your database. This is the most basic drawback of sequential IDs. You need a very good programmer who can hide the backend process.
Changing subject: the best ID I know of is the mongoID. It has many sub-features. The only problem is the machine ID, which might not always be available.
The generator
You might not have realised it from the lines above, but the main problem is: who generates the IDs?
You can have a totally random number, but you will need to keep track of it for any purpose.
You can have a paid service which guarantees unique IDs.
For general-purpose IDs I suggest the mongoID with a few changes.
The machineID might not always be available, and generating a random number for it might make little sense. Consider also the case where the processID might be a random number.
The best idea I have come up with is a generator key distributor.
In practice the key distributor gives a unique string to each generator, which uses it to generate IDs in MongoDB style, with the machineID replaced by the string.
This way we can be sure that a single generator makes unique IDs. A second generator (or thousands, by design) will not share the same key.
The exercise is to make a good key distributor. It will hold the list of allowed ID generators.
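To give an idea of the design, here is a minimal single-machine sketch in Python. It is not the promised demo: the class names, the 6-hex-digit key format and the generator names "billing" and "shipping" are all assumptions made for illustration, and a real distributor would have to be a shared service.

```python
import itertools
import threading
import time


class KeyDistributor:
    """Hands out one unique key per generator and remembers who got it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = itertools.count(1)
        self._issued = {}  # key -> generator name (the list of allowed generators)

    def register(self, name):
        with self._lock:
            key = f"{next(self._next):06x}"  # 3 bytes, same width as the machineID
            self._issued[key] = name
            return key

    def is_allowed(self, key):
        return key in self._issued


class IdGenerator:
    """mongoID-style IDs with the machineID replaced by the distributed key."""

    def __init__(self, key, process_id=0):
        self._key = key
        self._process_id = process_id & 0xFFFF
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def new_id(self):
        with self._lock:
            n = next(self._counter) & 0xFFFFFF  # 3-byte counter, wraps around
        ts = int(time.time()) & 0xFFFFFFFF      # 4-byte timestamp in seconds
        return f"{ts:08x}{self._key}{self._process_id:04x}{n:06x}"


distributor = KeyDistributor()
billing = IdGenerator(distributor.register("billing"))
shipping = IdGenerator(distributor.register("shipping"))
print(billing.new_id())
print(shipping.new_id())
```

Two generators can never produce the same ID because their keys differ. Within one generator the counter does the work, at least until it wraps around after 2^24 IDs within the same second.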
I will shortly publish a demo on GitHub in CoffeeScript.
Solved problems
The first is concurrency. We assume that a generator can spawn processes with a given processID and control a sequential number within each process. In this picture concurrency is not a problem.
Uniqueness: every ID is indeed unique, as long as the machineID generator (the key distributor) hands out unique values.