Category Archives: PostgreSQL

Customize your psql terminal

Here’s the contents of my ~/.psqlrc file:



-- Report time used for each query.
timing

-- Set a cute border around query results.
pset border 2

-- Set pager to less, rather than more.
setenv PAGER /usr/bin/less

-- Monkey up the prompt.

-- This one shows user@host:database
-- set PROMPT1 '%n@%m:%/%# '

-- Same thing, but with pretty colors, across several lines.
set PROMPT1 'n%[%033[0;36;29m%]%n@%m:%/%[%033[0m%]n%# '

Now I get a psql prompt that looks like this:


user@localhost:dbname
# select * from current_timestamp;
+-------------------------------+
|              now              |
+-------------------------------+
| 2012-10-23 12:22:38.263557-04 |
+-------------------------------+
(1 row)

Time: 1.907 ms

The postgreSQL crosstab function is really powerful

I’m just beginning to wrap my brain around it. I’ve got some data that looks like this:


select rr as row_name, dt, c from mm limit 3;
+----------+----------+------+
| row_name |    dt    |  c   |
+----------+----------+------+
| xxx      | 2009 mar |  552 |
| xxx      | 2009 nov | 2179 |
| xxx      | 2009 jun | 1101 |
+----------+----------+------+
(3 rows)

And now I’m going to transpose it like this:

select * from crosstab('select rr as row_name, dt, c from mm limit 3')
as foo(row_name text, dt_1 bigint, dt_2 bigint);
+----------+------+------+
| row_name | dt_1 | dt_2 |
+----------+------+------+
| xxx      |  552 | 2179 |
+----------+------+------+
(1 row)

Neat, right?

A few simple PostgreSQL tricks

I’ve got a bunch of these things collecting dust in a text file in my home directory. Maybe they’ll help somebody else out.

Transpose display in psql with \x


    matt=# select now(), 1 + 1 as sum;
                  now              | sum 
    -------------------------------+-----
     2008-08-20 13:00:44.178076-04 |   2
    (1 row)

    matt=# x
    Expanded display is on.
    matt=# select now(), 1 + 1 as sum;
    -[ RECORD 1 ]----------------------
    now | 2008-08-20 13:00:46.930375-04
    sum | 2
    matt=# x
    Expanded display is off.
    matt=# select now(), 1 + 1 as sum;
                  now              | sum 
    -------------------------------+-----
     2008-08-20 13:01:19.725394-04 |   2
    (1 row)

See how long every query takes


matt=# timing
Timing is on.
matt=# select now();
             now
-----------------------------
 2008-12-18 12:31:50.60008-05
(1 row)

Time: 76.322 ms

By the way, you can put these into your $HOME/.psqlrc to always turn on timing at the beginning of a session::

    $ cat $HOME/.psqlrc

    -- Report time used for each query.
    timing

Define a function in python

First you have to install plpython. Then connect to the database as somebody with sufficient privileges and then type:


    CREATE LANGUAGE 'plpythonu'; 

to allow functions defined in plpython.

Here’s a toy example:


    matt=# create or replace function snoz (INT)
    returns INT AS
    $$
    x = args[0]
    return x * x * x
    $$ language 'plpythonu';
    CREATE FUNCTION

And here it is in use:


    matt=# select 3, snoz(3);
     ?column? | snoz 
    ----------+------
            3 |   27
    (1 row)

Use a trigger to set the modified date column

First define a function to set the modifieddate column:


    create or replace function ts_modifieddate()
    returns TRIGGER as
    '
    BEGIN
       NEW.modifieddate = now();
        RETURN NEW;
    END;
    ' language 'plpgsql';

Now set up the trigger to call that function::

    create trigger set_modifieddate before update
    on knid for each row
    execute procedure ts_modifieddate();

Use a trigger to set one column based on other columns

I got a table like this:


    create table snoz (
        a bool default false, 
        b bool default false, 
        c bool default false
    );

I want a trigger to set c to true when a and/or b is true.

First, I define a function that does what I want:


    create or replace function set_c()
    returns trigger as $$
    BEGIN
    if NEW.a = true or NEW.b = true then NEW.c = true; END IF;
    RETURN NEW;
    END;
    $$ language 'plpgsql';

Now I wire up that trigger to execute when I want:

    create trigger update_c 
    before insert or update 
    on snoz 
    for each row
    execute procedure set_c();

Here it is all in action:

    matt=# insert into snoz (a, b) values (false, false), (false, false);
    INSERT 0 2
    matt=# select * from snoz;
     a | b | c
    ---+---+---
     f | f | f
     f | f | f
    (2 rows)

    matt=# insert into snoz (a, b) values (false, true), (true, false);
    INSERT 0 2
    matt=# select * from snoz;
     a | b | c
    ---+---+---
     f | f | f
     f | f | f
     f | t | t
     t | f | t
    (4 rows)

See, it works!

Some research on generic/EAV tables

Yesterday I confirmed a hunch I’ve had about database schema design. Here’s the background: I’m working on a feature where I track employees and their preferred locations, shifts, and stations.

For example, I’ll track that Alice likes the morning shift at the west-side location, and she likes to work the front register station most of all, but her second choice is the drive-though.

Meanwhile, Bob likes the west-side and north-side locations, is indifferent about the shift, and likes the dishwasher station. Note the one-to-many relationship between Bob and his preferred locations and his lack of shift preferences.

I came up with two ways to make my tables:

FIRST METHOD

create table preferred_location (
   employee_id int references employee (id),
   location_id int references location (id));

create table preferred_shift (
   employee_id int references employee (id),
   shift int references shift (id));

create table preferred_station (
   employee_id int references employee (id),
   station_id int references station (id));

Hopefully, this is obvious. I store that Alice likes the west-side location in the preferred_location table like this:


    (Alice's ID, west-side location ID)

Then I store the fact that she likes the morning shift in the preferred shift table like this:


    (Alice's ID, morning shift ID)

Every time I want to add some new type of preference, e.g., hats, I need to make a table to hold all the legal hats and then make a table linking employees to their hat preference.

SECOND METHOD

This way keeps all the preferences in a single table.

create table preferences (
   employee_id int references employee (id),
   preference_type text,
   preference_value text));

Here’s how I would store that Bob likes to be a dishwasher:


    (Bob's ID, 'station', 'dishwasher')

Here’s what I like about this method two: I don’t need to tweak the database schema whatsoever when I dream up new preferences. In fact, I can let system users create new preference types at run-time, and the system just works. In this scenario, adding each employee’s hat preference does not require updating my schema.

On the downside, I wouldn’t have any FK constraints. Somebody could store a preference like they want to work a nonexistent shift and I wouldn’t know until I get an angry customer calling me. I’d have to do a lot of application-level data validation, which I hate.

Finally, there’s just something about method two that seems … wrong, even though I’ve seen variations of this theme in production environments at previous jobs (cough, ALLCODES, cough, PINDATA, cough).

So, with this dilemma, I wrote a post to the PostgreSQL users mailing list and got a fantastic reply. Here’s some excerpts:

Your “method 2″ is something called an Entity-Attribute-Value table design[1].

That said, by going the EAV/”Method-2″ route, you’re gaining flexibility, but at the cost of increased complication, and ultimately repurposing a relational database to do something that isn’t very database-like, that’s really more like a spreadsheet. (So why not just use a spreadsheet?) You have little room for recording additional information, like ordering preferences, or indicating that (say) a station preference depends on a location preference, or that a shift time depends on day of the week, etc — so you’re probably not getting as much flexibility as you think. Sure, you could add an “Extra_Data” column, so you have rows:


 Marie-Location-West-1,
 Marie-Location-East-2,
 Marie-Shift-Evening-Tuesday,
 Marie-Station-Register-West,
 Marie-Shift-Morning-Sunday,

etc. But you can see the data integrity nightmare already, when you somehow manage to record “Marie-Shift-Register-1″. Not to mention that you’ll have to do format conversions for that “Extra_Data” field, and incorporate logic somewhere else in your program that deciphers whatever’s in the generic data field to come up with ordering preferences for locations, station preferences by shift times, or whatever else you want to store.

[1] http://en.wikipedia.org/wiki/Entity-Attribute-Value_model

At this point, I was pretty sure I would go with method 1, but not absolutely certain. Then I read that linked article, which really just said more of the same.

Then I read this Ask Tom post and that erased the last bit of lingering doubt I had. Method 2 is incompatible with performance. Method 2 turns your database into a glorified flat file. Here’s some of my favorite excerpts from the Ask Tom post:

Frequently I see applications built on a generic data model for “maximum flexibility” or applications built in ways that prohibit performance. Many times – these are one in the same thing! For example, it is well known you can represent any object in a database using just four tables:


Create table objects ( oid int primary key, name varchar2(255) );

Create table attributes 
( attrId int primary key, attrName varchar2(255), 
datatype varchar2(25) );

Create table object_Attributes 
( oid int, attrId int, value varchar2(4000), 
primary key(oid,attrId) );

Create table Links ( oid1 int, oid2 int, 
primary key (oid1, oid2) );  

Looks great, right? I mean, the developers don’t have to create tables anymore, we can add columns at the drop of a hat (just requires an insert into the ATTRIBUTES table). The developers can do whatever they want and the DBA can’t stop them. This is ultimate “flexibility”. I’ve seen people try to build entire systems on this model.

But, how does it perform? Miserably, terribly, horribly. A simple “select first_name, last_name from person” query is transformed into a 3-table join with aggregates and all.

There’s a comment on that story about some java developers that insisted on this approach and then had to redesign the whole thing post-launch. I also like the “chief big table” remark.

Anyhow, it’s nice to know that (this time) my instincts were correct.

A few half-formed thoughts on SQLObject

I love SQLObject, but this is a rant about the tiny frustrations I face with it.

First, this is a minor point. I don’t really care about database independence that much. Postgres has a lot of wonderful features: I never have to worry about choosing the table engine that will enforce foreign key constraints, I like creating indexes with function inside:


create unique index nodup_parent on category (org_id, parent_cat, lower(name));

and I really like how easy it is to write stored procedures. Anyway, since I know I’m going to use postgresql, I don’t want to be restricted to only the features that exist or can be emulated in every platform. I know all about sqlmeta and createSQL and use it plenty. But I don’t like how when I set a default value, sometimes it is set in the database table, and other times, it isn’t.

Anyway, in practice, the most dangerous part of using SQLObject is that it hypnotizes you into forgetting about the queries behind everything. Imagine you have employees, departments, and a join table between them. You can set this up in SQLObject like this:

class Employee(SQLobject):
    name = UnicodeCol(alternateID=True)
    departments = RelatedJoin('Department')

class Department(SQLObject):
    name = UnicodeCol(alternateID=True)
    employees = RelatedJoin('Employee')

You want to draw a grid that indicates whether each user is a member in every group, so you might dash off some code like this:

for emp in Employee.select():
    for d in Department.select():
        if d in emp.departments: 
            print "yes!"
        else: 
            print "no!"

In an ideal scenario, you can do this with three simple queries:

  • You need a list of employees
  • You need a list of departments
  • You need the list of employee-department of associations.

People that talk about how you can use outer joins to cram all that into one query will be dropped into a bottomless pit. Besides, I profiled it, and three separate queries is often much cheaper.

Anyway, back to the point. SQLObject will only run a single query to get the employees and a separate single query to get all the departments. So that’s good.

However, the place where all hell breaks loose is that if clause in the middle. If we have three employees and four departments, this statement


if d in emp.departments:

executes a dozen times. That’s unavoidable. The problem is that each time it executes, SQLObject runs a query like:

select department_id from department_employee where employee_id = (whatever);

Every time you say “is this particular department in this employee’s list of departments?” SQLObject grabs the full list of departments for that employee. So, if you ask about 10 different departments, you will run the exact same query ten times. Sure, the database is likely to cache the results of the query for you, but it is still very wasteful.

With just a few employees and a few departments, that’s not so bad. Eventually, though, as the number of employees and departments grow, the cost of that code grows at N2, which is just geek slang for sucky.

So, in conclusion, this may sound like a rant, but it really isnt. SQLObject is great. But it isn’t magic. It’s a great scaffolding system. But now I find that I’m rewriting a fair portion of code in order to reduce the database costs.

Aside: when I started paying attention to the queries generated by SQLObject, I found it really useful to edit postgresql.conf and enable log_min_duration_statement. Then every query and its cost will be logged for you. This is really useful stuff. It’s helped me to relax about doing a lot of things that I used to think were really bad.

How to use compound foreign keys with Postgresql

I am working on a multiple-choice quiz program. We have four tables related to questions and answers:

  • a table of questions
  • a table of answers
  • a table that links each question to multiple answers
  • a table that records a user’s actual answer to a question.

I needed a way to track answers to multiple-choice questions, and to limit possible answers to each question. For example, I want to make sure that if I ask the question “what is your favorite color?” then the answer is a color, and not something silly like “yes.” This is tricky, because “yes” is a valid answer for some questions, but not for others.

This table holds each question.


create table questions (
q_id serial primary key,
q_text text
);

Here are some questions. Since I used the serial data type, I’m not going to assign the q_id column. Postgresql will assign the q_id automatically.


insert into questions (q_text) values ('What is your favorite color?'); /* q_id 1 */
insert into questions (q_text) values ('Is it raining outside?');       /* q_id 2 */

After inserting, this is what the database looks like now:


select * from questions;
q_id |            q_text
------+------------------------------
1 | What is your favorite color?
2 | Is it raining outside?
(2 rows)

This table lists every possible answer, each with a unique id.

create table all_answers (
a_id serial primary key,
a_text text
);

These are some answers.

insert into all_answers (a_text) values ('red');      /* a_id 1 */
insert into all_answers (a_text) values ('yes');      /* a_id 2 */
insert into all_answers (a_text) values ('green');    /* a_id 3 */
insert into all_answers (a_text) values ('no');       /* a_id 4 */

Here’s what all_answers looks like after adding data:

select * from all_answers;
a_id | a_text
------+--------
1 | red
2 | yes
3 | green
4 | no
(4 rows)

This table links each question to meaningful possible answers. I am using question ID (q_id) and answer ID (a_id) numbers in order to save space.

create table possible_answers (
q_id integer references questions (q_id),
a_id integer references all_answers (a_id),
primary key (q_id, a_id)
);

Now, I’ll link certain questions with certain answers. This insert links ‘What is your favorite color?’ with ‘red’.


insert into possible_answers (q_id, a_id) values (1, 1);

This statement links ‘Is it raining outside?’ with ‘no’.

insert into possible_answers (q_id, a_id) values (1, 3);

And, just to continue the trend…

insert into possible_answers (q_id, a_id) values (2, 2);
insert into possible_answers (q_id, a_id) values (2, 4);

Now let’s see what we’ve got:

select * from possible_answers;
q_id | a_id
------+------
1 |    1
1 |    3
2 |    2
2 |    4
(4 rows)

Finally, this is the table that will record the actual answer given for a question and make sure that the answer is appropriate for the question.

create table real_answers (
q_id integer references questions (q_id),                         /* simple FK    */
a_id integer references all_answers (a_id),                       /* simple FK    */
foreign key (q_id, a_id) references possible_answers (q_id, a_id) /* compound FK  */
);

Now watch what happens when I try to insert an answer that doesn’t match the question into the database. I’m going to try to answer the question “What is your favorite color?” with the answer “yes”.

Hopefully the database will prevent me.


select q_text from questions where q_id = 1;
q_text
------------------------------
What is your favorite color?
(1 row)


select a_text from all_answers where a_id = 2;
a_text
--------
yes
(1 row)

insert into real_answers (q_id, a_id) values (1, 2);
ERROR:  $3 referential integrity violation - key referenced from real_answers
not found in possible_answers
Hurray! We are prevented from entering the data! Now, let’s try storing an acceptable answer.

insert into real_answers (q_id, a_id) values (1,1);
INSERT 17196 1

That means it worked!