Some research on generic/EAV tables
Yesterday I confirmed a hunch I’ve had about database schema design. Here’s the background: I’m working on a feature where I track employees and their preferred locations, shifts, and stations.
For example, I’ll track that Alice likes the morning shift at the west-side location, and she likes to work the front register station most of all, but her second choice is the drive-though.
Meanwhile, Bob likes the west-side and north-side locations, is indifferent about the shift, and likes the dishwasher station. Note the one-to-many relationship between Bob and his preferred locations and his lack of shift preferences.
I came up with two ways to make my tables:
FIRST METHOD
create table preferred_location (
employee_id int references employee (id),
location_id int references location (id));
create table preferred_shift (
employee_id int references employee (id),
shift int references shift (id));
create table preferred_station (
employee_id int references employee (id),
station_id int references station (id));
Hopefully, this is obvious. I store that Alice likes the west-side location in the preferred_location table like this:
(Alice's ID, west-side location ID)
Then I store the fact that she likes the morning shift in the preferred shift table like this:
(Alice's ID, morning shift ID)
Every time I want to add some new type of preference, e.g., hats, I need to make a table to hold all the legal hats and then make a table linking employees to their hat preference.
SECOND METHOD
This way keeps all the preferences in a single table.
create table preferences (
employee_id int references employee (id),
preference_type text,
preference_value text));
Here’s how I would store that Bob likes to be a dishwasher:
(Bob's ID, 'station', 'dishwasher')
Here’s what I like about this method two: I don’t need to tweak the database schema whatsoever when I dream up new preferences. In fact, I can let system users create new preference types at run-time, and the system just works. In this scenario, adding each employee’s hat preference does not require updating my schema.
On the downside, I wouldn’t have any FK constraints. Somebody could store a preference like they want to work a nonexistent shift and I wouldn’t know until I get an angry customer calling me. I’d have to do a lot of application-level data validation, which I hate.
Finally, there’s just something about method two that seems … wrong, even though I’ve seen variations of this theme in production environments at previous jobs (cough, ALLCODES, cough, PINDATA, cough).
So, with this dilemma, I wrote a post to the PostgreSQL users mailing list and got a fantastic reply. Here’s some excerpts:
Your “method 2″ is something called an Entity-Attribute-Value table design[1].
…
That said, by going the EAV/”Method-2″ route, you’re gaining flexibility, but at the cost of increased complication, and ultimately repurposing a relational database to do something that isn’t very database-like, that’s really more like a spreadsheet. (So why not just use a spreadsheet?) You have little room for recording additional information, like ordering preferences, or indicating that (say) a station preference depends on a location preference, or that a shift time depends on day of the week, etc — so you’re probably not getting as much flexibility as you think. Sure, you could add an “Extra_Data” column, so you have rows:
Marie-Location-West-1,
Marie-Location-East-2,
Marie-Shift-Evening-Tuesday,
Marie-Station-Register-West,
Marie-Shift-Morning-Sunday,
etc. But you can see the data integrity nightmare already, when you somehow manage to record “Marie-Shift-Register-1″. Not to mention that you’ll have to do format conversions for that “Extra_Data” field, and incorporate logic somewhere else in your program that deciphers whatever’s in the generic data field to come up with ordering preferences for locations, station preferences by shift times, or whatever else you want to store.…
[1] http://en.wikipedia.org/wiki/Entity-Attribute-Value_model
At this point, I was pretty sure I would go with method 1, but not absolutely certain. Then I read that linked article, which really just said more of the same.
Then I read this Ask Tom post and that erased the last bit of lingering doubt I had. Method 2 is incompatible with performance. Method 2 turns your database into a glorified flat file. Here’s some of my favorite excerpts from the Ask Tom post:
Frequently I see applications built on a generic data model for “maximum flexibility” or applications built in ways that prohibit performance. Many times - these are one in the same thing! For example, it is well known you can represent any object in a database using just four tables:
Create table objects ( oid int primary key, name varchar2(255) );
Create table attributes
( attrId int primary key, attrName varchar2(255),
datatype varchar2(25) );
Create table object_Attributes
( oid int, attrId int, value varchar2(4000),
primary key(oid,attrId) );
Create table Links ( oid1 int, oid2 int,
primary key (oid1, oid2) );
Looks great, right? I mean, the developers don’t have to create tables anymore, we can add columns at the drop of a hat (just requires an insert into the ATTRIBUTES table). The developers can do whatever they want and the DBA can’t stop them. This is ultimate “flexibility”. I’ve seen people try to build entire systems on this model.
…
But, how does it perform? Miserably, terribly, horribly. A simple “select first_name, last_name from person” query is transformed into a 3-table join with aggregates and all.
There’s a comment on that story about some java developers that insisted on this approach and then had to redesign the whole thing post-launch. I also like the “chief big table” remark.
Anyhow, it’s nice to know that (this time) my instincts were correct.