Got an email from a friend:
I want to create a database that acts just like the index card note cards one used to make for doing research papers in HS and Univ. I’ve got most of it down, but I am having trouble figuring out how to normalize the page numbers and the source data.
Let’s say one has three kinds of sources – books, magazines, and websites. Well, a book will have:
- place of publication
- author(s) – but only maybe – what does one do about The Economist?

A magazine will have:
- title of article
- title of magazine
- date of publication
- author(s) – again only maybe

A website will have:
- title of website
Here’s what I said in reply:
So, I think I get your question. Books, magazines, and websites are all different examples of sources that you might cite. They have some attributes in common and some attributes that are unique.
Going with the high school term paper example, let’s pretend that you wrote a paper and your bibliography looks like this:
- (book) Tom Sawyer by Mark Twain. Published by Hustler, 1833, in NY.
- (book) Huckleberry Finn by Mark Twain. Published by Hustler, 1834, in NY.
- (magazine) “Indonesia sucks”, The Economist. No author listed. February 2001 issue. page 67.
- (website) “Indonesia” on Wikipedia, http://en.wikipedia.org/wiki/Indonesia. Lots and lots of authors. I used text found on the site as of June 1st, 2007.
- (website) “blog #96 post”, http://anncoulter.com, Ann Coulter is the author, article posted on July 4th, 2007. I used text found on the site as of this date.
- (magazine) “No, the Economist Sucks”, Jakarta Post. Joe Brown is the author. Article appeared in the March 14, 2007 issue, on page 6D.
I can see at least three ways to set this up:
1. You can make a single table called sources that includes the union of all these different types. So, you would have a column called “publisher” and another column called “URL”. The book sources would have a blank URL field, and the website sources would have a blank publisher field. You could have a column called source type which would have values of “book”, “magazine”, “website”, or anything else that fits.
CONs: It is tricky to bake good data validation into your database. You can’t easily add rules to enforce that you get all the required data for each row. Also, every time you discover a new source type, you may need to modify the table and add even more columns.
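Here’s a minimal sketch of approach #1 in SQLite (via Python’s sqlite3 module) – the column names here are my own guesses, not anything sacred:

```python
import sqlite3

# Approach #1: one wide "sources" table holding the union of all attributes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sources (
        source_id    INTEGER PRIMARY KEY,
        source_type  TEXT NOT NULL,   -- 'book', 'magazine', 'website', ...
        title        TEXT,
        author       TEXT,
        publisher    TEXT,            -- books only; NULL for websites
        url          TEXT,            -- websites only; NULL for books
        publish_date TEXT
    )
""")
conn.execute("INSERT INTO sources VALUES "
             "(1, 'book', 'Tom Sawyer', 'Mark Twain', 'Hustler', NULL, '1833')")
conn.execute("INSERT INTO sources VALUES "
             "(2, 'website', 'Indonesia', NULL, NULL, "
             "'http://en.wikipedia.org/wiki/Indonesia', NULL)")

# One query covers every source type -- the main appeal of this layout.
rows = conn.execute(
    "SELECT source_type, title FROM sources ORDER BY source_id").fetchall()
print(rows)  # [('book', 'Tom Sawyer'), ('website', 'Indonesia')]
```

Note that nothing stops you from inserting a book with no publisher – that’s exactly the validation problem described above.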
2. You create a separate table for each source type. So, you have a books table, a magazines table, and then a websites table.
PROs: Now, you can easily make sure that every row in the books table has all the required data.
CONs: Accumulating all the results for one of your papers means you have to do a query against each table separately and then use the UNION keyword to add them together. Also, when you need to add a new source type, you’ll need to add a new table to your schema.
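A sketch of approach #2 – again in SQLite, with made-up column names – showing the UNION you’d need to pull everything back together:

```python
import sqlite3

# Approach #2: one table per source type, each with its own required columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books     (title TEXT NOT NULL, author TEXT NOT NULL,
                            publisher TEXT NOT NULL);
    CREATE TABLE magazines (article_title TEXT NOT NULL,
                            magazine_title TEXT NOT NULL, page TEXT);
    CREATE TABLE websites  (title TEXT NOT NULL, url TEXT NOT NULL);
    INSERT INTO books     VALUES ('Tom Sawyer', 'Mark Twain', 'Hustler');
    INSERT INTO magazines VALUES ('Indonesia sucks', 'The Economist', '67');
    INSERT INTO websites  VALUES ('Indonesia',
                                  'http://en.wikipedia.org/wiki/Indonesia');
""")

# Listing the whole bibliography takes one SELECT per table, glued together:
rows = conn.execute("""
    SELECT 'book' AS source_type, title FROM books
    UNION ALL
    SELECT 'magazine', article_title FROM magazines
    UNION ALL
    SELECT 'website', title FROM websites
""").fetchall()
print(sorted(rows))
```

The NOT NULL constraints are the payoff here: a book without a publisher simply can’t be inserted.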
3. Make a bunch of tables:
sources (source_id)
fields (field_id, field_name)
source_fields (source_id, field_id, field_value)
So, this entry:
(book) Tom Sawyer by Mark Twain. Published by Hustler, 1833, in NY.
Would get a single row in the sources table.
And the fields table would have these values:
1, source type
2, title
3, author
4, publisher
5, publish date
6, publish location
Then finally, we’d put the actual data in the source_fields table:
(source_id, field_id, field_value)
1, 1, “book”
1, 2, “Tom Sawyer”
1, 3, “Mark Twain”
1, 4, “Hustler”
… you get the idea, hopefully.
Then, when you want to store a magazine, the first thing you do is add any new field types you need to the fields table, and then add your data.
PROs: you can make up new attributes for your data any time you want, and never have to change your database. For example, if you need to start storing TV shows, you can just add the new types to the fields table and you’re good.
CONs: The field_value column needs to accept any kind of data. So, you’ll probably make it something like a TEXT column that can hold arbitrarily large values, and convert everything to text before you store it. That means you won’t be able to index this data well, and you won’t be able to require that the data match any particular format.
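Putting approach #3 together as a runnable sketch (SQLite again; the schema follows the table list above, and everything lands in one TEXT column):

```python
import sqlite3

# Approach #3: an entity-attribute-value layout. Attributes live as ROWS
# in the fields table, so adding a new one never changes the schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sources (source_id INTEGER PRIMARY KEY);
    CREATE TABLE fields  (field_id INTEGER PRIMARY KEY, field_name TEXT NOT NULL);
    CREATE TABLE source_fields (
        source_id   INTEGER REFERENCES sources(source_id),
        field_id    INTEGER REFERENCES fields(field_id),
        field_value TEXT    -- everything is stored as text, whatever it really is
    );
    INSERT INTO sources VALUES (1);
    INSERT INTO fields VALUES
        (1, 'source type'), (2, 'title'), (3, 'author'), (4, 'publisher');
    INSERT INTO source_fields VALUES
        (1, 1, 'book'), (1, 2, 'Tom Sawyer'),
        (1, 3, 'Mark Twain'), (1, 4, 'Hustler');
""")

# Reassembling one source means joining attribute rows back together:
rows = conn.execute("""
    SELECT f.field_name, sf.field_value
    FROM source_fields sf JOIN fields f ON f.field_id = sf.field_id
    WHERE sf.source_id = 1
    ORDER BY sf.field_id
""").fetchall()
print(rows)
```

Notice that field_value can’t be declared NOT NULL per attribute, can’t be type-checked, and takes a join just to read back one citation.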
So, figuring out which of these approaches is the correct one depends on the specifics of the scenario.
How well can you predict today all the future types of data? If you have perfect clairvoyance, or if you don’t mind monkeying with the database, approach #3 is pointless. I recommend approach #3 in a scenario when you have lots of users, and you don’t want each of them monkeying with the schema.
How worried are you about bad data getting entered in? You can always use triggers and stored procedures or some outer application code to add validation on any of these, but it won’t be easy. Using approach #2 will make validation the easiest.
How fast do queries need to be? If we want to know all the books written by Mark Twain, approach #2 will likely give the fastest response, followed by #1 and then #3.
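To make the speed point concrete, here’s the “books by Mark Twain” question asked against the approach #3 schema (sample rows are mine). In approach #2 it’s a one-liner against the books table; here, every attribute you filter on costs another self-join:

```python
import sqlite3

# A few sources stored EAV-style, as in approach #3 (sample data is made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fields (field_id INTEGER PRIMARY KEY, field_name TEXT);
    CREATE TABLE source_fields (source_id INTEGER, field_id INTEGER,
                                field_value TEXT);
    INSERT INTO fields VALUES (1, 'source type'), (2, 'title'), (3, 'author');
    INSERT INTO source_fields VALUES
        (1, 1, 'book'),     (1, 2, 'Tom Sawyer'),       (1, 3, 'Mark Twain'),
        (2, 1, 'book'),     (2, 2, 'Huckleberry Finn'), (2, 3, 'Mark Twain'),
        (3, 1, 'magazine'), (3, 2, 'Indonesia sucks');
""")

# Approach #2 version: SELECT title FROM books WHERE author = 'Mark Twain'.
# Approach #3 version: one self-join per attribute we test.
rows = conn.execute("""
    SELECT t.field_value
    FROM source_fields t
    JOIN source_fields a ON a.source_id = t.source_id
                        AND a.field_id = 3 AND a.field_value = 'Mark Twain'
    JOIN source_fields k ON k.source_id = t.source_id
                        AND k.field_id = 1 AND k.field_value = 'book'
    WHERE t.field_id = 2
    ORDER BY t.source_id
""").fetchall()
print(rows)  # [('Tom Sawyer',), ('Huckleberry Finn',)]
```

Same answer either way, but the EAV query touches the one big table three times, which is why I rank #3 last for speed.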
PS: I’m using this email as my next blog entry. And here it is 🙂