Meaningful primary keys beat sequences and UUIDs

Summary

Instead of serial primary keys or UUIDs for primary keys, when possible, you should make a PK that somehow describes the underlying data. This is not as hard as you might think!

In detail

Everything you read about databases will say that every table really ought to have a primary key, in other words, some value that is unique in every row. The clumsiest, crudest, simplest way to do that is to use something like a sequence (or a row number) for that primary key.

Here’s a table where I track schedules at golf courses:

create table loops -- a "loop" is what people call a game of golf
(
loop_number serial primary key,
club_short_name text not null references clubs (short_name),
tee_time date not null default today(),
golfer integer not null references golfers (golfer_id),
number_holes integer not null check (number_holes in (9, 18)),
unique (club_name, golfer, tee_time, number_holes) -- prevent duplicates!
);

Every time somebody inserts a row in that loops table, the loop_number column will automatically get a unique number. Now if we’re making some kind of web app, it is easy to make a link to a particular game (aka loop) by just making a URL with the loop_number in there, like this:

https://example.com/loop?loop_number=99

or this:

https://example.com/loop/99

Depending on how fancy you like your URLs.

This style was super-popular in the Ruby on Rails heyday.

Then maybe a few years after that, when distributed databases started catching on, it wasn’t as easy to get the next sequential value, because you couldn’t check all the nodes quickly. So people started using stuff like UUIDs like this:

row_id uuid not null default uuid_generate_v4() primary key

instead of

row_number serial primary key

And then URLs started looking more like this:

https://example.com/loop/5f7664e6-15c2-4d08-858e-3306f7a8ca07

Sidenote

I’ve heard a lot of people argue that replacing sequential primary keys with UUIDs is somehow more secure because it is very easy for some malicious person to change

https://example.com/loop/99

to something like

https://example.com/loop/100

and possibly spy on information not meant for them. Doing the same trick with UUIDs is not so easy; in other words, if you hand out a URL like

https://example.com/loop/5f7664e6-15c2-4d08-858e-3306f7a8ca07

to some customer, they probably won’t find any valid rows by just slightly incrementing that UUID (and that’s if they can figure out HOW to increment it).

I personally don’t think that switching from serial PK’s to UUID’s is always enough to block this attack, but it is absolutely a great first step! In practice, it seems pretty hard to guess another valid UUID, but it certainly is not impossible, and the RFC for UUIDs has this little nugget of advice:

Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities (identifiers whose mere possession grants
access), for example. A predictable random number source will
exacerbate the situation.

Incidentally, this URL tweaking is a big source of data breaches! I’ve seen it in the wild numerous times. I found this post that describes this kind of attack in more detail.

Unfortunately, the web popular frameworks don’t offer much help for this issue. They all make it easy to check that a user is authenticated (they are who they say they are), and maybe they offer some kind of role-based permissions, but you’re pretty much on your own when building a multi-tenant system with data privacy.

In other words, if you want to block user X from spying on data that should only be seen by user Y, you need to check that yourself, in all the different places in your code where you pull back data.

All that said, my favorite reason to go with UUIDs is that you don’t reveal that you only have like three clients on your system when you’re out doing demos, and that’s pretty dang important when you’re fundraising!

Back to the main point

Based on the table above, any loop is a unique combination of a club, a golfer, a tee time, and a number of holes of golf. For example, one row might track that Thurston Howell III is playing 18 holes at snooty-snooterton country club on April 1st, 2018.

It would be great if we could have a primary key like this:

snt-2018-04-01-th3-18

The snt part identifies the club, the 2018-04-01 part identifies the tee time, the th3 identifies the golfer, and the 18 part identifies the number of holes.

This makes for vastly easier to understand data! And it isn’t hard to do this. Just add a trigger on your table that fires before insert and update that sets your primary key column:

create or replace function set_loop_pk ()
returns trigger
as
$$
begin

NEW.loop_pk = NEW.club_short_name
|| '-'
|| to_char(NEW.tee_time, 'yyyy-mm-dd')
|| '-'
|| NEW.golfer_initials
|| '-'
|| number_holes,
;

return NEW;

end;
$$ language plpgsql;

create trigger set_loop_pk
before insert or update
on loops
for each row
execute procedure set_loop_pk();

Of course I replaced columns like golfer with golfer_initials, but hopefully that didn’t trip you up.

Also, the code above assumes you’re using the postgresql database, but you can translate it into whatever other database environment you want.

If you’re some kind of crazy person, I suppose you could even build that PK in your ORM layer.

Why is this better? This is better because anyone that sees a meaningful PK can infer a lot about the inner data. This is one of those things that will save you tons of time in a crisis because you don’t need to write lots of joins to understand who the heck is user ‘ac29a573-35f2-4200-b10a-384999426ee6’ or which club has club_id 876.

Your end-users will be more confident in the system as well. Labels printed with a meaningful PK are self-evident. URLs hint about the contents.

Sure, there are times when you need obfuscation, but it easy to have a meaningful PK and then scramble it somehow, with a real cryptographic solution, rather than leaning on a UUID.

Or really, the best approach in my opinion is to keep your meaningful PK, but also tag on another parameter that combines that meaningful PK with a secret value and then hashes it. People call this approach an HMAC.

Last point: if your data is only unique because you’re using a sequence or because you’re using a UUID, well, you’re not really “doing databases right”. Using a meaningful PK means you’ve figured out enough about what you’re storing to know what makes a row unique.

  • Nice idea, but there are a few problems.

    1) What happens when the same golfer plays two rounds of golf at the same club in the same day? Might not be that common of an occurrence, but if it could happen we either need to support it or prevent it.

    2) What if we have two golfers with the same initials? We have to query the table until we find a variation that works. To make matters worse initialst distributed evenly over the alphabet, so the most common names will take the most time to enter since they have to hunt for an available variation.

    3) What happens when a golfer changes thier name? Maybe it was entered wrong the first time, or they changed their name. Whatever the case people and clubs change their names. Do we update all the keys? What if we have archived some data, do we need to update the keys in all the data warehouses too? This brings us back to point 2 again if we have an initial collision.

    At some point, the amount of data required to uniquely and clearly identify a row, and not result in collisions, will exceed that required for a UUID and will approach the number of bits in the other columns.

    We had a system we were designing that had multiple disconnected databases that would eventually sync up. There was no way to have a centralized sequece generator. We finally came up with a system involving timestamps, unique identifiers, and local sequences that was satisfactory for uniquely identifying rows. It was 2 bytes short of a UUID, so we went with a UUID.

    Oh, and that is not the usage of an HMAC.