Grab all the rows when parameter is null, or just grab the rows that match

If I pass in NULL for xyz, I want to get all the rows.

If I set xyz to a value, I only want the rows in the table where the xyz column matches the value I pass in.

And I don’t want to do build up a string in my app code.

Here is how:


select *
from my_table
where
case
when %(xyz)s is null then true
when %(xyz)s = xyz then true
else false
end
;

And here’s a slight tweak if you want to pass in an array of allowed values:


select *
from my_table
where
case
when %(xyz)s is null then true
when xyz = any(%(xyz)s) then true
else false
end
;

I hope this helps! If you know a better way, let me know!

Meaningful primary keys beat sequences and UUIDs

Summary

Instead of serial primary keys or UUIDs for primary keys, when possible, you should make a PK that somehow describes the underlying data. This is not as hard as you might think!

In detail

Everything you read about databases will say that every table really ought to have a primary key, in other words, some value that is unique in every row. The clumsiest, crudest, simplest way to do that is to use something like a sequence (or a row number) for that primary key.

Here’s a table where I track schedules at golf courses:

create table loops -- a "loop" is what people call a game of golf
(
loop_number serial primary key,
club_short_name text not null references clubs (short_name),
tee_time date not null default today(),
golfer integer not null references golfers (golfer_id),
number_holes integer not null check (number_holes in (9, 18)),
unique (club_name, golfer, tee_time, number_holes) -- prevent duplicates!
);

Every time somebody inserts a row in that loops table, the loop_number column will automatically get a unique number. Now if we’re making some kind of web app, it is easy to make a link to a particular game (aka loop) by just making a URL with the loop_number in there, like this:

https://example.com/loop?loop_number=99

or this:

https://example.com/loop/99

Depending on how fancy you like your URLs.

This style was super-popular in the Ruby on Rails heyday.

Then maybe a few years after that, when distributed databases started catching on, it wasn’t as easy to get the next sequential value, because you couldn’t check all the nodes quickly. So people started using stuff like UUIDs like this:

row_id uuid not null default uuid_generate_v4() primary key

instead of

row_number serial primary key

And then URLs started looking more like this:

https://example.com/loop/5f7664e6-15c2-4d08-858e-3306f7a8ca07

Sidenote

I’ve heard a lot of people argue that replacing sequential primary keys with UUIDs is somehow more secure because it is very easy for some malicious person to change

https://example.com/loop/99

to something like

https://example.com/loop/100

and possibly spy on information not meant for them. Doing the same trick with UUIDs is not so easy; in other words, if you hand out a URL like

https://example.com/loop/5f7664e6-15c2-4d08-858e-3306f7a8ca07

to some customer, they probably won’t find any valid rows by just slightly incrementing that UUID (and that’s if they can figure out HOW to increment it).

I personally don’t think that switching from serial PK’s to UUID’s is always enough to block this attack, but it is absolutely a great first step! In practice, it seems pretty hard to guess another valid UUID, but it certainly is not impossible, and the RFC for UUIDs has this little nugget of advice:

Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities (identifiers whose mere possession grants
access), for example. A predictable random number source will
exacerbate the situation.

Incidentally, this URL tweaking is a big source of data breaches! I’ve seen it in the wild numerous times. I found this post that describes this kind of attack in more detail.

Unfortunately, the web popular frameworks don’t offer much help for this issue. They all make it easy to check that a user is authenticated (they are who they say they are), and maybe they offer some kind of role-based permissions, but you’re pretty much on your own when building a multi-tenant system with data privacy.

In other words, if you want to block user X from spying on data that should only be seen by user Y, you need to check that yourself, in all the different places in your code where you pull back data.

All that said, my favorite reason to go with UUIDs is that you don’t reveal that you only have like three clients on your system when you’re out doing demos, and that’s pretty dang important when you’re fundraising!

Back to the main point

Based on the table above, any loop is a unique combination of a club, a golfer, a tee time, and a number of holes of golf. For example, one row might track that Thurston Howell III is playing 18 holes at snooty-snooterton country club on April 1st, 2018.

It would be great if we could have a primary key like this:

snt-2018-04-01-th3-18

The snt part identifies the club, the 2018-04-01 part identifies the tee time, the th3 identifies the golfer, and the 18 part identifies the number of holes.

This makes for vastly easier to understand data! And it isn’t hard to do this. Just add a trigger on your table that fires before insert and update that sets your primary key column:

create or replace function set_loop_pk ()
returns trigger
as
$$
begin

NEW.loop_pk = NEW.club_short_name
|| '-'
|| to_char(NEW.tee_time, 'yyyy-mm-dd')
|| '-'
|| NEW.golfer_initials
|| '-'
|| number_holes,
;

return NEW;

end;
$$ language plpgsql;

create trigger set_loop_pk
before insert or update
on loops
for each row
execute procedure set_loop_pk();

Of course I replaced columns like golfer with golfer_initials, but hopefully that didn’t trip you up.

Also, the code above assumes you’re using the postgresql database, but you can translate it into whatever other database environment you want.

If you’re some kind of crazy person, I suppose you could even build that PK in your ORM layer.

Why is this better? This is better because anyone that sees a meaningful PK can infer a lot about the inner data. This is one of those things that will save you tons of time in a crisis because you don’t need to write lots of joins to understand who the heck is user ‘ac29a573-35f2-4200-b10a-384999426ee6’ or which club has club_id 876.

Your end-users will be more confident in the system as well. Labels printed with a meaningful PK are self-evident. URLs hint about the contents.

Sure, there are times when you need obfuscation, but it easy to have a meaningful PK and then scramble it somehow, with a real cryptographic solution, rather than leaning on a UUID.

Or really, the best approach in my opinion is to keep your meaningful PK, but also tag on another parameter that combines that meaningful PK with a secret value and then hashes it. People call this approach an HMAC.

Last point: if your data is only unique because you’re using a sequence or because you’re using a UUID, well, you’re not really “doing databases right”. Using a meaningful PK means you’ve figured out enough about what you’re storing to know what makes a row unique.

Postgresql: convert a string to a date or NULL

We’re working with some user-submitted text that we need to convert into dates. Most of the data looks correct, but some of it looks glitchy:


expiration_date
8/16
5/17
damaged
6/16

See that line with “damaged” in there? That will cause to_date to throw an error:

select to_date('damaged', 'YY/MM');
ERROR: invalid value "da" for "YY"
DETAIL: Value must be an integer.

So I wrote this function:

create or replace function dt_or_null (s text, fmt text)

returns date
as
$$
begin

return to_date(s, fmt);

exception
when others then return null;

end;

$$ language plpgsql;

And this is how it works:

select 'damaged', dt_or_null('damaged', 'YY/MM');
+----------+------------+
| ?column? | dt_or_null |
+----------+------------+
| damaged | |
+----------+------------+
(1 row)

Customize your psql terminal

Here’s the contents of my ~/.psqlrc file:

-- Report time used for each query.
\timing

-- Set a cute border around query results.
\pset border 2

-- Set pager to less, rather than more.
\setenv PAGER /usr/bin/less

-- Monkey up the prompt.

-- This one shows [email protected]:database
-- \set PROMPT1 '%[email protected]%m:%/%# '

-- Same thing, but with pretty colors, across several lines.
\set PROMPT1 '\n%[%033[0;36;29m%]%[email protected]%m:%/%[%033[0m%]\n%# '

Now I get a psql prompt that looks like this:

[email protected]:dbname
# select * from current_timestamp;
+-------------------------------+
| now |
+-------------------------------+
| 2012-10-23 12:22:38.263557-04 |
+-------------------------------+
(1 row)

Time: 1.907 ms

The postgreSQL crosstab function is really powerful

I’m just beginning to wrap my brain around it. I’ve got some data that looks like this:

select rr as row_name, dt, c from mm limit 3;
+----------+----------+------+
| row_name | dt | c |
+----------+----------+------+
| xxx | 2009 mar | 552 |
| xxx | 2009 nov | 2179 |
| xxx | 2009 jun | 1101 |
+----------+----------+------+
(3 rows)

And now I’m going to transpose it like this:

select * from crosstab('select rr as row_name, dt, c from mm limit 3')
as foo(row_name text, dt_1 bigint, dt_2 bigint);
+----------+------+------+
| row_name | dt_1 | dt_2 |
+----------+------+------+
| xxx | 552 | 2179 |
+----------+------+------+
(1 row)

Neat, right?

A few simple PostgreSQL tricks

I’ve got a bunch of these things collecting dust in a text file in my home directory. Maybe they’ll help somebody else out.

Transpose display in psql with \x


matt=# select now(), 1 + 1 as sum;
now | sum
-------------------------------+-----
2008-08-20 13:00:44.178076-04 | 2
(1 row)

matt=# \x
Expanded display is on.
matt=# select now(), 1 + 1 as sum;
-[ RECORD 1 ]----------------------
now | 2008-08-20 13:00:46.930375-04
sum | 2
matt=# \x
Expanded display is off.
matt=# select now(), 1 + 1 as sum;
now | sum
-------------------------------+-----
2008-08-20 13:01:19.725394-04 | 2
(1 row)

See how long every query takes


matt=# \timing
Timing is on.
matt=# select now();
now
-----------------------------
2008-12-18 12:31:50.60008-05
(1 row)

Time: 76.322 ms

By the way, you can put these into your $HOME/.psqlrc to always turn on timing at the beginning of a session::

$ cat $HOME/.psqlrc

-- Report time used for each query.
\timing

Define a function in python

First you have to install plpython. Then connect to the database as somebody with sufficient privileges and then type:

CREATE LANGUAGE 'plpythonu';

to allow functions defined in plpython.

Here’s a toy example:

matt=# create or replace function snoz (INT)
returns INT AS
$$
x = args[0]
return x * x * x
$$ language 'plpythonu';
CREATE FUNCTION

And here it is in use:

matt=# select 3, snoz(3);
?column? | snoz
----------+------
3 | 27
(1 row)

Use a trigger to set the modified date column

First define a function to set the modifieddate column:

create or replace function ts_modifieddate()
returns TRIGGER as
'
BEGIN
NEW.modifieddate = now();
RETURN NEW;
END;
' language 'plpgsql';

Now set up the trigger to call that function::

create trigger set_modifieddate before update
on knid for each row
execute procedure ts_modifieddate();

Use a trigger to set one column based on other columns

I got a table like this:

create table snoz (
a bool default false,
b bool default false,
c bool default false
);

I want a trigger to set c to true when a and/or b is true.

First, I define a function that does what I want:

create or replace function set_c()
returns trigger as $$
BEGIN
if NEW.a = true or NEW.b = true then NEW.c = true; END IF;
RETURN NEW;
END;
$$ language 'plpgsql';

Now I wire up that trigger to execute when I want:

create trigger update_c
before insert or update
on snoz
for each row
execute procedure set_c();

Here it is all in action:

matt=# insert into snoz (a, b) values (false, false), (false, false);
INSERT 0 2
matt=# select * from snoz;
a | b | c
---+---+---
f | f | f
f | f | f
(2 rows)

matt=# insert into snoz (a, b) values (false, true), (true, false);
INSERT 0 2
matt=# select * from snoz;
a | b | c
---+---+---
f | f | f
f | f | f
f | t | t
t | f | t
(4 rows)

See, it works!

Some research on generic/EAV tables

Yesterday I confirmed a hunch I’ve had about database schema design. Here’s the background: I’m working on a feature where I track employees and their preferred locations, shifts, and stations.

For example, I’ll track that Alice likes the morning shift at the west-side location, and she likes to work the front register station most of all, but her second choice is the drive-though.

Meanwhile, Bob likes the west-side and north-side locations, is indifferent about the shift, and likes the dishwasher station. Note the one-to-many relationship between Bob and his preferred locations and his lack of shift preferences.

I came up with two ways to make my tables:

FIRST METHOD

create table preferred_location (
employee_id int references employee (id),
location_id int references location (id));

create table preferred_shift (
employee_id int references employee (id),
shift int references shift (id));

create table preferred_station (
employee_id int references employee (id),
station_id int references station (id));

Hopefully, this is obvious. I store that Alice likes the west-side location in the preferred_location table like this:

(Alice's ID, west-side location ID)

Then I store the fact that she likes the morning shift in the preferred shift table like this:

(Alice's ID, morning shift ID)

Every time I want to add some new type of preference, e.g., hats, I need to make a table to hold all the legal hats and then make a table linking employees to their hat preference.

SECOND METHOD

This way keeps all the preferences in a single table.

create table preferences (
employee_id int references employee (id),
preference_type text,
preference_value text));

Here’s how I would store that Bob likes to be a dishwasher:

(Bob's ID, 'station', 'dishwasher')

Here’s what I like about this method two: I don’t need to tweak the database schema whatsoever when I dream up new preferences. In fact, I can let system users create new preference types at run-time, and the system just works. In this scenario, adding each employee’s hat preference does not require updating my schema.

On the downside, I wouldn’t have any FK constraints. Somebody could store a preference like they want to work a nonexistent shift and I wouldn’t know until I get an angry customer calling me. I’d have to do a lot of application-level data validation, which I hate.

Finally, there’s just something about method two that seems … wrong, even though I’ve seen variations of this theme in production environments at previous jobs (cough, ALLCODES, cough, PINDATA, cough).

So, with this dilemma, I wrote a post to the PostgreSQL users mailing list and got a fantastic reply. Here’s some excerpts:

Your “method 2″ is something called an Entity-Attribute-Value table design[1].

That said, by going the EAV/”Method-2” route, you’re gaining flexibility, but at the cost of increased complication, and ultimately repurposing a relational database to do something that isn’t very database-like, that’s really more like a spreadsheet. (So why not just use a spreadsheet?) You have little room for recording additional information, like ordering preferences, or indicating that (say) a station preference depends on a location preference, or that a shift time depends on day of the week, etc — so you’re probably not getting as much flexibility as you think. Sure, you could add an “Extra_Data” column, so you have rows:

Marie-Location-West-1,
Marie-Location-East-2,
Marie-Shift-Evening-Tuesday,
Marie-Station-Register-West,
Marie-Shift-Morning-Sunday,

etc. But you can see the data integrity nightmare already, when you somehow manage to record “Marie-Shift-Register-1”. Not to mention that you’ll have to do format conversions for that “Extra_Data” field, and incorporate logic somewhere else in your program that deciphers whatever’s in the generic data field to come up with ordering preferences for locations, station preferences by shift times, or whatever else you want to store.

[1] http://en.wikipedia.org/wiki/Entity-Attribute-Value_model

At this point, I was pretty sure I would go with method 1, but not absolutely certain. Then I read that linked article, which really just said more of the same.

Then I read this Ask Tom post and that erased the last bit of lingering doubt I had. Method 2 is incompatible with performance. Method 2 turns your database into a glorified flat file. Here’s some of my favorite excerpts from the Ask Tom post:

Frequently I see applications built on a generic data model for “maximum flexibility” or applications built in ways that prohibit performance. Many times – these are one in the same thing! For example, it is well known you can represent any object in a database using just four tables:

Create table objects ( oid int primary key, name varchar2(255) );

Create table attributes
( attrId int primary key, attrName varchar2(255),
datatype varchar2(25) );

Create table object_Attributes
( oid int, attrId int, value varchar2(4000),
primary key(oid,attrId) );

Create table Links ( oid1 int, oid2 int,
primary key (oid1, oid2) );

Looks great, right? I mean, the developers don’t have to create tables anymore, we can add columns at the drop of a hat (just requires an insert into the ATTRIBUTES table). The developers can do whatever they want and the DBA can’t stop them. This is ultimate “flexibility”. I’ve seen people try to build entire systems on this model.

But, how does it perform? Miserably, terribly, horribly. A simple “select first_name, last_name from person” query is transformed into a 3-table join with aggregates and all.

There’s a comment on that story about some java developers that insisted on this approach and then had to redesign the whole thing post-launch. I also like the “chief big table” remark.

Anyhow, it’s nice to know that (this time) my instincts were correct.

A few half-formed thoughts on SQLObject

I love SQLObject, but this is a rant about the tiny frustrations I face with it.

First, this is a minor point. I don’t really care about database independence that much. Postgres has a lot of wonderful features: I never have to worry about choosing the table engine that will enforce foreign key constraints, I like creating indexes with function inside:

create unique index nodup_parent on category (org_id, parent_cat, lower(name));

and I really like how easy it is to write stored procedures. Anyway, since I know I’m going to use postgresql, I don’t want to be restricted to only the features that exist or can be emulated in every platform. I know all about sqlmeta and createSQL and use it plenty. But I don’t like how when I set a default value, sometimes it is set in the database table, and other times, it isn’t.

Anyway, in practice, the most dangerous part of using SQLObject is that it hypnotizes you into forgetting about the queries behind everything. Imagine you have employees, departments, and a join table between them. You can set this up in SQLObject like this:

class Employee(SQLobject):
name = UnicodeCol(alternateID=True)
departments = RelatedJoin('Department')

class Department(SQLObject):
name = UnicodeCol(alternateID=True)
employees = RelatedJoin('Employee')

You want to draw a grid that indicates whether each user is a member in every group, so you might dash off some code like this:

for emp in Employee.select():
for d in Department.select():
if d in emp.departments:
print "yes!"
else:
print "no!"

In an ideal scenario, you can do this with three simple queries:

  • You need a list of employees
  • You need a list of departments
  • You need the list of employee-department of associations.

People that talk about how you can use outer joins to cram all that into one query will be dropped into a bottomless pit. Besides, I profiled it, and three separate queries is often much cheaper.

Anyway, back to the point. SQLObject will only run a single query to get the employees and a separate single query to get all the departments. So that’s good.

However, the place where all hell breaks loose is that if clause in the middle. If we have three employees and four departments, this statement

if d in emp.departments:

executes a dozen times. That’s unavoidable. The problem is that each time it executes, SQLObject runs a query like:

select department_id from department_employee where employee_id = (whatever);

Every time you say “is this particular department in this employee’s list of departments?” SQLObject grabs the full list of departments for that employee. So, if you ask about 10 different departments, you will run the exact same query ten times. Sure, the database is likely to cache the results of the query for you, but it is still very wasteful.

With just a few employees and a few departments, that’s not so bad. Eventually, though, as the number of employees and departments grow, the cost of that code grows at N2, which is just geek slang for sucky.

So, in conclusion, this may sound like a rant, but it really isnt. SQLObject is great. But it isn’t magic. It’s a great scaffolding system. But now I find that I’m rewriting a fair portion of code in order to reduce the database costs.

Aside: when I started paying attention to the queries generated by SQLObject, I found it really useful to edit postgresql.conf and enable log_min_duration_statement. Then every query and its cost will be logged for you. This is really useful stuff. It’s helped me to relax about doing a lot of things that I used to think were really bad.

How to use compound foreign keys with Postgresql

I am working on a multiple-choice quiz program. We have four tables related to questions and answers:

  • a table of questions
  • a table of answers
  • a table that links each question to multiple answers
  • a table that records a user’s actual answer to a question.

I needed a way to track answers to multiple-choice questions, and to limit possible answers to each question. For example, I want to make sure that if I ask the question “what is your favorite color?” then the answer is a color, and not something silly like “yes.” This is tricky, because “yes” is a valid answer for some questions, but not for others.

This table holds each question.


create table questions (
q_id serial primary key,
q_text text
);

Here are some questions. Since I used the serial data type, I’m not going to assign the q_id column. Postgresql will assign the q_id automatically.


insert into questions (q_text) values ('What is your favorite color?'); /* q_id 1 */
insert into questions (q_text) values ('Is it raining outside?'); /* q_id 2 */

After inserting, this is what the database looks like now:

select * from questions;
q_id | q_text
------+------------------------------
1 | What is your favorite color?
2 | Is it raining outside?
(2 rows)

This table lists every possible answer, each with a unique id.

create table all_answers (
a_id serial primary key,
a_text text
);

These are some answers.

insert into all_answers (a_text) values ('red'); /* a_id 1 */
insert into all_answers (a_text) values ('yes'); /* a_id 2 */
insert into all_answers (a_text) values ('green'); /* a_id 3 */
insert into all_answers (a_text) values ('no'); /* a_id 4 */

Here’s what all_answers looks like after adding data:

select * from all_answers;
a_id | a_text
------+--------
1 | red
2 | yes
3 | green
4 | no
(4 rows)

This table links each question to meaningful possible answers. I am using question ID (q_id) and answer ID (a_id) numbers in order to save space.

create table possible_answers (
q_id integer references questions (q_id),
a_id integer references all_answers (a_id),
primary key (q_id, a_id)
);

Now, I’ll link certain questions with certain answers. This insert links ‘What is your favorite color?’ with ‘red’.

insert into possible_answers (q_id, a_id) values (1, 1);

This statement links ‘Is it raining outside?’ with ‘no’.

insert into possible_answers (q_id, a_id) values (1, 3);

And, just to continue the trend…

insert into possible_answers (q_id, a_id) values (2, 2);
insert into possible_answers (q_id, a_id) values (2, 4);

Now let’s see what we’ve got:

select * from possible_answers;
q_id | a_id
------+------
1 | 1
1 | 3
2 | 2
2 | 4
(4 rows)

Finally, this is the table that will record the actual answer given for a question and make sure that the answer is appropriate for the question.

create table real_answers (
q_id integer references questions (q_id), /* simple FK */
a_id integer references all_answers (a_id), /* simple FK */
foreign key (q_id, a_id) references possible_answers (q_id, a_id) /* compound FK */
);

Now watch what happens when I try to insert an answer that doesn’t match the question into the database. I’m going to try to answer the question “What is your favorite color?” with the answer “yes”.

Hopefully the database will prevent me.

select q_text from questions where q_id = 1;
q_text
------------------------------
What is your favorite color?
(1 row)


select a_text from all_answers where a_id = 2;
a_text
--------
yes
(1 row)

insert into real_answers (q_id, a_id) values (1, 2);
ERROR: $3 referential integrity violation - key referenced from real_answers
not found in possible_answers
Hurray! We are prevented from entering the data! Now, let’s try storing an acceptable answer.

insert into real_answers (q_id, a_id) values (1,1);
INSERT 17196 1

That means it worked!