Meaningful primary keys beat sequences and UUIDs

Summary

Instead of using a serial number or a UUID as the primary key, when possible, you should make a PK that describes the underlying data. This is not as hard as you might think!

In detail

Everything you read about databases will say that every table really ought to have a primary key, in other words, some value that is unique in every row. The clumsiest, crudest, simplest way to do that is to use something like a sequence (or a row number) for that primary key.

Here’s a table where I track schedules at golf courses:

create table loops -- a "loop" is what people call a game of golf
(
    loop_number serial primary key,
    club_short_name text not null references clubs (short_name),
    tee_time date not null default current_date, -- PostgreSQL has no today() function
    golfer integer not null references golfers (golfer_id),
    number_holes integer not null check (number_holes in (9, 18)),
    unique (club_short_name, golfer, tee_time, number_holes) -- prevent duplicates!
);

Every time somebody inserts a row in that loops table, the loop_number column will automatically get a unique number. Now if we’re making some kind of web app, it is easy to make a link to a particular game (aka loop) by just making a URL with the loop_number in there, like this:

https://example.com/loop?loop_number=99

or this:

https://example.com/loop/99

Depending on how fancy you like your URLs.

This style was super-popular in the Ruby on Rails heyday.

Then maybe a few years after that, when distributed databases started catching on, it wasn’t as easy to get the next sequential value, because you couldn’t check all the nodes quickly. So people started using stuff like UUIDs like this:

row_id uuid not null default uuid_generate_v4() primary key

instead of

row_number serial primary key

And then URLs started looking more like this:

https://example.com/loop/5f7664e6-15c2-4d08-858e-3306f7a8ca07

Sidenote

I’ve heard a lot of people argue that replacing sequential primary keys with UUIDs is somehow more secure because it is very easy for some malicious person to change

https://example.com/loop/99

to something like

https://example.com/loop/100

and possibly spy on information not meant for them. Doing the same trick with UUIDs is not so easy; in other words, if you hand out a URL like

https://example.com/loop/5f7664e6-15c2-4d08-858e-3306f7a8ca07

to some customer, they probably won’t find any valid rows by just slightly incrementing that UUID (and that’s if they can figure out HOW to increment it).
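For what it’s worth, the incrementing part is mechanically easy. Here is a minimal Python sketch (Python is my pick for illustration; nothing above depends on it) that bumps the UUID from that URL by one. What protects you is that a version 4 UUID carries 122 random bits, so the neighbor is almost certainly not a row that exists, assuming the generator had a good random source.

```python
import uuid

# The UUID from the example URL above.
handed_out = uuid.UUID("5f7664e6-15c2-4d08-858e-3306f7a8ca07")

# A UUID is just a 128-bit integer, so "incrementing" it is plain arithmetic.
neighbor = uuid.UUID(int=handed_out.int + 1)

print(handed_out.version)  # the version number stored inside the UUID
print(neighbor)            # the "next" UUID, which very likely matches no row
```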

I personally don’t think that switching from serial PKs to UUIDs is always enough to block this attack, but it is absolutely a great first step! In practice, it seems pretty hard to guess another valid UUID, but it certainly is not impossible, and the RFC for UUIDs (RFC 4122) has this little nugget of advice:

Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities (identifiers whose mere possession grants
access), for example. A predictable random number source will
exacerbate the situation.

Incidentally, this URL tweaking is a big source of data breaches! I’ve seen it in the wild numerous times. I found this post that describes this kind of attack in more detail.

Unfortunately, the popular web frameworks don’t offer much help for this issue. They all make it easy to check that a user is authenticated (they are who they say they are), and maybe they offer some kind of role-based permissions, but you’re pretty much on your own when building a multi-tenant system with data privacy.

In other words, if you want to block user X from spying on data that should only be seen by user Y, you need to check that yourself, in all the different places in your code where you pull back data.
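Here is a rough sketch of what that check can look like, in Python with an in-memory SQLite table standing in for the real database. The function name fetch_loop, the sample rows, and the rule that only the golfer on a loop may view it are all assumptions I made up for illustration; your ownership rule will probably be more involved.

```python
import sqlite3

def fetch_loop(conn, loop_number, current_golfer_id):
    """Return the loop only if it belongs to the golfer who is asking."""
    # The ownership test lives in the WHERE clause, so a guessed or
    # incremented loop_number for someone else's row simply finds nothing.
    return conn.execute(
        "select loop_number, club_short_name, tee_time, number_holes"
        " from loops where loop_number = ? and golfer = ?",
        (loop_number, current_golfer_id),
    ).fetchone()

# Tiny in-memory stand-in for the real database (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table loops (loop_number integer primary key, club_short_name text,"
    " tee_time text, golfer integer, number_holes integer)"
)
conn.executemany(
    "insert into loops values (?, ?, ?, ?, ?)",
    [(99, "snt", "2018-04-01", 1, 18), (100, "snt", "2018-04-02", 2, 9)],
)

print(fetch_loop(conn, 99, 1))   # golfer 1 asking for their own loop
print(fetch_loop(conn, 100, 1))  # golfer 1 asking for golfer 2's loop
```

The point is that the check happens in the same query that pulls back the data, so there is no code path that fetches the row first and forgets to ask who is looking.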

All that said, my favorite reason to go with UUIDs is that you don’t reveal that you only have like three clients on your system when you’re out doing demos, and that’s pretty dang important when you’re fundraising!

Back to the main point

Based on the table above, any loop is a unique combination of a club, a golfer, a tee time, and a number of holes of golf. For example, one row might track that Thurston Howell III is playing 18 holes at snooty-snooterton country club on April 1st, 2018.

It would be great if we could have a primary key like this:

snt-2018-04-01-th3-18

The snt part identifies the club, the 2018-04-01 part identifies the tee time, the th3 identifies the golfer, and the 18 part identifies the number of holes.

This makes your data vastly easier to understand! And it isn’t hard to do. Just add a trigger on your table that fires before insert and update and sets your primary key column:

create or replace function set_loop_pk ()
returns trigger
as
$$
begin

    -- Build the key from its parts: club, tee date, golfer initials, holes.
    NEW.loop_pk := NEW.club_short_name
        || '-'
        || to_char(NEW.tee_time, 'yyyy-mm-dd')
        || '-'
        || NEW.golfer_initials
        || '-'
        || NEW.number_holes::text;

    return NEW;

end;
$$ language plpgsql;

create trigger set_loop_pk
before insert or update
on loops
for each row
execute procedure set_loop_pk();

Of course, I replaced columns like golfer with golfer_initials, and the table needs a loop_pk text column to hold the key in place of loop_number, but hopefully that didn’t trip you up.

Also, the code above assumes you’re using PostgreSQL, but you can translate it into whatever other database environment you want.

If you’re some kind of crazy person, I suppose you could even build that PK in your ORM layer.

Why is this better? This is better because anyone that sees a meaningful PK can infer a lot about the inner data. This is one of those things that will save you tons of time in a crisis because you don’t need to write lots of joins to understand who the heck is user ‘ac29a573-35f2-4200-b10a-384999426ee6’ or which club has club_id 876.
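To make the "infer a lot" claim concrete, here is a small Python sketch that pulls the parts back out of a key without touching the database. The function name parse_loop_pk is made up, and it assumes the key layout from the example above, with golfer initials that contain no hyphens.

```python
from datetime import date

def parse_loop_pk(loop_pk):
    """Split a key like 'snt-2018-04-01-th3-18' into its four parts."""
    # Peel the golfer and hole count off the right-hand side first, so a
    # club short name that itself contains hyphens still parses.
    head, golfer_initials, holes = loop_pk.rsplit("-", 2)
    # What remains is '<club>-yyyy-mm-dd'; the date is the last 10 characters.
    club_short_name, tee_date = head[:-11], head[-10:]
    return {
        "club_short_name": club_short_name,
        "tee_time": date.fromisoformat(tee_date),
        "golfer_initials": golfer_initials,
        "number_holes": int(holes),
    }

print(parse_loop_pk("snt-2018-04-01-th3-18"))
```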

Your end-users will be more confident in the system as well. Labels printed with a meaningful PK are self-evident. URLs hint about the contents.

Sure, there are times when you need obfuscation, but it is easy to have a meaningful PK and then scramble it somehow, with a real cryptographic solution, rather than leaning on a UUID.

Or really, the best approach in my opinion is to keep your meaningful PK, but also tag on another parameter that combines that meaningful PK with a secret value and then hashes it. People call this approach an HMAC.
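A minimal sketch of that idea, using Python’s standard hmac module with SHA-256. The secret value, the function names sign and is_valid, and the sig URL parameter are all placeholders I invented for this example; treat it as one way to do it, not a finished design.

```python
import hashlib
import hmac

# Assumption: in a real app this secret comes from configuration, not source code.
SECRET = b"replace-me-with-a-real-secret"

def sign(loop_pk):
    """Return a hex HMAC-SHA256 tag for the given primary key."""
    return hmac.new(SECRET, loop_pk.encode("utf-8"), hashlib.sha256).hexdigest()

def is_valid(loop_pk, tag):
    """Check a tag from the URL, using a constant-time comparison."""
    return hmac.compare_digest(sign(loop_pk), tag)

pk = "snt-2018-04-01-th3-18"
tag = sign(pk)
print(f"https://example.com/loop/{pk}?sig={tag}")  # "sig" is a made-up parameter name

print(is_valid(pk, tag))                       # the URL as handed out
print(is_valid("snt-2018-04-02-th3-18", tag))  # the same tag on a tweaked key
```

Anyone can still read the meaningful PK in the URL, but changing it without knowing the secret produces a tag that no longer matches, so the tweaked request can be rejected before any data is fetched.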

Last point: if your data is only unique because you’re using a sequence or because you’re using a UUID, well, you’re not really “doing databases right”. Using a meaningful PK means you’ve figured out enough about what you’re storing to know what makes a row unique.

I see a path to 1984-style thought crimes from where we are now

A little while ago I got admonished online.

I posted a remark that I was shocked by the sex assault allegations against Charlie Rose.

I wrote, “who’s next — Lurch from Addams Family?”

A friend wrote back I was trivializing sex assault. Then she went on to say something about how I was part of the problem. I thought about it for a while and I just didn’t agree with it.

So I decided not to spend time debating the issue and I deleted the comment.

Then an hour later my friend posted a screenshot showing how I deleted the comment and said I was gaslighting her.

And these were not short posts she was making. These were full-on rants. From somebody I’ve known for 20 years.

Again I decided there was no way we were going to have a good conversation about this, so I deleted the post and blocked this person. I’m not going to host somebody insulting me on my story.

After that, I heard through someone else that she had posted a rant about my behavior calling me a bully.

It seems like at this point in the game, just expressing a regressive point of view is apparently the same as committing acts of violence.

If you see me in the labor camps, please say hi!