Archive for SQL


A few half-formed thoughts on SQLObject

I love SQLObject, but this is a rant about the tiny frustrations I face with it.

First, this is a minor point. I don’t really care about database independence that much. Postgres has a lot of wonderful features: I never have to worry about choosing the table engine that will enforce foreign key constraints, I like creating indexes with function inside:

create unique index nodup_parent on category (org_id, parent_cat, lower(name));

and I really like how easy it is to write stored procedures. Anyway, since I know I’m going to use postgresql, I don’t want to be restricted to only the features that exist or can be emulated in every platform. I know all about sqlmeta and createSQL and use it plenty. But I don’t like how when I set a default value, sometimes it is set in the database table, and other times, it isn’t.

Anyway, in practice, the most dangerous part of using SQLObject is that it hypnotizes you into forgetting about the queries behind everything. Imagine you have employees, departments, and a join table between them. You can set this up in SQLObject like this:

class Employee(SQLobject):
    name = UnicodeCol(alternateID=True)
    departments = RelatedJoin('Department')
 
class Department(SQLObject):
    name = UnicodeCol(alternateID=True)
    employees = RelatedJoin('Employee')

You want to draw a grid that indicates whether each user is a member in every group, so you might dash off some code like this:

for emp in Employee.select():
    for d in Department.select():
        if d in emp.departments:
            print "yes!"
        else:
            print "no!"

In an ideal scenario, you can do this with three simple queries:

  • You need a list of employees
  • You need a list of departments
  • You need the list of employee-department of associations.

People that talk about how you can use outer joins to cram all that into one query will be dropped into a bottomless pit. Besides, I profiled it, and three separate queries is often much cheaper.

Anyway, back to the point. SQLObject will only run a single query to get the employees and a separate single query to get all the departments. So that’s good.

However, the place where all hell breaks loose is that if clause in the middle. If we have three employees and four departments, this statement

if d in emp.departments:

executes a dozen times. That’s unavoidable. The problem is that each time it executes, SQLObject runs a query like:

select department_id from department_employee where employee_id = (whatever);

Every time you say “is this particular department in this employee’s list of departments?” SQLObject grabs the full list of departments for that employee. So, if you ask about 10 different departments, you will run the exact same query ten times. Sure, the database is likely to cache the results of the query for you, but it is still very wasteful.

With just a few employees and a few departments, that’s not so bad. Eventually, though, as the number of employees and departments grow, the cost of that code grows at N2, which is just geek slang for sucky.

So, in conclusion, this may sound like a rant, but it really isnt. SQLObject is great. But it isn’t magic. It’s a great scaffolding system. But now I find that I’m rewriting a fair portion of code in order to reduce the database costs.

Aside: when I started paying attention to the queries generated by SQLObject, I found it really useful to edit postgresql.conf and enable log_min_duration_statement. Then every query and its cost will be logged for you. This is really useful stuff. It’s helped me to relax about doing a lot of things that I used to think were really bad.

Comments (4)

A few different ways to store data with varying attributes

Got an email from a friend:

I want to create a database that acts just like the index card note cards one used to make for doing research papers in HS and Univ. I’ve got most of it down, but I am having trouble figuring out how to normalize the page numbers and the source data.
Let’s say one has three kinds of sources - books, magazines, and websites. Well, a book will have:

author(s)
title
place of publication
publishedr
copywrite date

a magazine:
author(s) - but only maybe - what does one do about The Economist?
title of article
title of magazine
date of publication

a website:
author(s) - again only maybe
title of website
URL

Here’s what I said in reply:

So, I think I get your question. Books, magazines, and websites are all different examples of sources that you might cite. They have some attributes in common and some attributes that are unique.

Going with the high school term paper example, let’s pretend that you wrote a paper and your bibliography looks like this:

  • (book) Tom Sawyer by Mark Twain. Published by Hustler, 1833, in NY.
  • (book) Huckleberry Finn by Mark Twain. Published by Hustler, 1834, in NY.
  • (magazine) “Indonesia sucks”, The Economist. No author listed. February 2001 issue. page 67.
  • (magazine) “No, the Economist Sucks”, Jakarta Post. Joe Brown is the author. Article appeared in the March 14, 2007 issue, on page 6D.

  • (website): “Indonesia” on WIkipedia, http://en.wikipedia.org/wiki/Indonesia. Lots and lots of authors. I used text found on the site as of June 1st, 2007.
  • (website) “blog #96 post”, http://anncoulter.com, Ann Coulter is the author, article posted on July 4th, 2007. I used text found on the site as of this date.

I can see at least three ways to set this up:

1. You can make a single table called sources that includes the union of all these different types. So, you would have a column called “publisher” and another column called URL. The book sources would have a blank URL field, and the website sources would have a blank publisher field. You could have a column called source type which would have values of “book”, “website”, “URL”, or anything else that fits.

PROs: simple!

CONs: It is tricky to bake in good data validation into your database. You can’t easily add rules to enforce that you get all the required data for each row. Also, every time you discover a new source type, you may need to modify the table and add even more columns.

2. You create a separate table for each source type. So, you have a books table, a magazines table, and then a websites table.

PROs: Now, you can easily make sure that all your books data has all the required data.

CONs: Accumulating all the results for one of your papers means you have to do a query against each table separately and then use some UNION keyword to add them together. Also, when you need to a new source type, you’ll need to add a new table to your schema.

3. Make a bunch of tables:

sources (source_id)

fields (field_id, field_name)

source_fields(source_id, field_id, field_value)

So, this entry:

(book) Tom Sawyer by Mark Twain. Published by Hustler, 1833, in NY.

Would get a single row in the sources table.

And the fields table would have these values:

(field_id, field_name)
1, source type
2, title
3, author
4, publisher
5, publish date
6, publish location

Then finally, we’d put the actual data in the source fields table:

(source_id, field_id, field_value)
1, 1, “book”
1, 2, “Tom Sawyer”
1, 3, “Mark Twain”
1, 4, “Hustler”

… you get the idea, hopefully.

Then, when you want to store a magazine, the first thing you do is add any new field types you need to the fields table, and then add your data

PROs: you can make up new attributes for your data any time you want, and never have to change your database. For example, if you need to start storing TV shows, you can just add the new types to the fields table and you’re good.

CONs: The field_value field needs to accept any kind of data. So, probably, you’ll want to make it a column type like a TEXT column that can hold arbitrarily large objects, and then before you store anything in your database you need to convert it to a text object. So, you’re not going to be able to index this data well and you’re not going to be able to require that the data matches some formats.

So, figuring which of these approaches is the correct one depends on the specifics of the scenario.

How well can you predict today all the future types of data? If you have perfect clairvoyance, or if you don’t mind monkeying with the database, approach #3 is pointless. I recommend approach #3 in a scenario when you have lots of users, and you don’t want each of them monkeying with the schema.

How worried are you about bad data getting entered in? You can always use triggers and stored procedures or some outer application code to add validation on any of these, but it won’t be easy. Using approach #2 will make validation the easiest.

How fast do queries need to be? If we want to know all the books written by Mark Twain, approach #2 will likely give the fastest response, followed by #1 and then #3.

By the way, while we’re on the subject of bibliographies, all my class notes on normal forms are here and I wrote up a blog on compound foreign key constraints a while back here.

Matt

PS: I’m using this email as my next blog entry. And here it is :)

How to use compound foreign keys with Postgresql

I am working on a multiple-choice quiz program. We have four tables related to questions and answers:

  • a table of questions
  • a table of answers
  • a table that links each question to multiple answers
  • a table that records a user’s actual answer to a question.

I needed a way to track answers to multiple-choice questions, and to limit possible answers to each question. For example, I want to make sure that if I ask the question “what is your favorite color?” then the answer is a color, and not something silly like “yes.” This is tricky, because “yes” is a valid answer for some questions, but not for others.

This table holds each question.


create table questions (
q_id serial primary key,
q_text text
);

Here are some questions. Since I used the serial data type, I’m not going to assign the q_id column. Postgresql will assign the q_id automatically.


insert into questions (q_text) values ('What is your favorite color?'); /* q_id 1 */
insert into questions (q_text) values ('Is it raining outside?');       /* q_id 2 */

After inserting, this is what the database looks like now:

select * from questions;
q_id |            q_text
------+------------------------------
1 | What is your favorite color?
2 | Is it raining outside?
(2 rows)

This table lists every possible answer, each with a unique id.

create table all_answers (
a_id serial primary key,
a_text text
);

These are some answers.

insert into all_answers (a_text) values ('red');      /* a_id 1 */
insert into all_answers (a_text) values ('yes');      /* a_id 2 */
insert into all_answers (a_text) values ('green');    /* a_id 3 */
insert into all_answers (a_text) values ('no');       /* a_id 4 */

Here’s what all_answers looks like after adding data:

select * from all_answers;
a_id | a_text
------+------–
1 | red
2 | yes
3 | green
4 | no
(4 rows)

This table links each question to meaningful possible answers. I am using question ID (q_id) and answer ID (a_id) numbers in order to save space.

create table possible_answers (
q_id integer references questions (q_id),
a_id integer references all_answers (a_id),
primary key (q_id, a_id)
);

Now, I’ll link certain questions with certain answers. This insert links ‘What is your favorite color?’ with ‘red’.

insert into possible_answers (q_id, a_id) values (1, 1);

This statement links ‘Is it raining outside?’ with ‘no’.

insert into possible_answers (q_id, a_id) values (1, 3);

And, just to continue the trend…

insert into possible_answers (q_id, a_id) values (2, 2);
insert into possible_answers (q_id, a_id) values (2, 4);

Now let’s see what we’ve got:

select * from possible_answers;
q_id | a_id
------+------
1 |    1
1 |    3
2 |    2
2 |    4
(4 rows)

Finally, this is the table that will record the actual answer given for a question and make sure that the answer is appropriate for the question.

create table real_answers (
q_id integer references questions (q_id),                         /* simple FK    */
a_id integer references all_answers (a_id),                       /* simple FK    */
foreign key (q_id, a_id) references possible_answers (q_id, a_id) /* compound FK  */
);

Now watch what happens when I try to insert an answer that doesn’t match the question into the database. I’m going to try to answer the question “What is your favorite color?” with the answer “yes”.

Hopefully the database will prevent me.

select q_text from questions where q_id = 1;
q_text
------------------------------
What is your favorite color?
(1 row)


select a_text from all_answers where a_id = 2;
a_text
------–
yes
(1 row)

insert into real_answers (q_id, a_id) values (1, 2);
ERROR:  $3 referential integrity violation - key referenced from real_answers
not found in possible_answers
Hurray! We are prevented from entering the data! Now, let’s try storing an acceptable answer.

insert into real_answers (q_id, a_id) values (1,1);
INSERT 17196 1

That means it worked!