Some research on generic/EAV tables

Yesterday I confirmed a hunch I’ve had about database schema design. Here’s the background: I’m working on a feature where I track employees and their preferred locations, shifts, and stations.

For example, I’ll track that Alice likes the morning shift at the west-side location, and she likes to work the front register station most of all, but her second choice is the drive-through.

Meanwhile, Bob likes the west-side and north-side locations, is indifferent about the shift, and likes the dishwasher station. Note the one-to-many relationship between Bob and his preferred locations and his lack of shift preferences.

I came up with two ways to make my tables:

FIRST METHOD

create table preferred_location (
employee_id int references employee (id),
location_id int references location (id));

create table preferred_shift (
employee_id int references employee (id),
shift_id int references shift (id));

create table preferred_station (
employee_id int references employee (id),
station_id int references station (id));

Hopefully, this is obvious. I store that Alice likes the west-side location in the preferred_location table like this:

(Alice's ID, west-side location ID)

Then I store the fact that she likes the morning shift in the preferred_shift table like this:

(Alice's ID, morning shift ID)

Every time I want to add some new type of preference, e.g., hats, I need to make a table to hold all the legal hats and then make a table linking employees to their hat preference.
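Following the pattern above, the hat example would need two new tables, something like this (the table and column names here are my own guesses, not part of my actual schema):

```sql
-- A table of the legal hats, plus a link table,
-- mirroring the preferred_* tables above:
create table hat (
id int primary key,
name text);

create table preferred_hat (
employee_id int references employee (id),
hat_id int references hat (id));
```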

SECOND METHOD

This way keeps all the preferences in a single table.

create table preferences (
employee_id int references employee (id),
preference_type text,
preference_value text);

Here’s how I would store that Bob likes to be a dishwasher:

(Bob's ID, 'station', 'dishwasher')

Here’s what I like about method two: I don’t need to tweak the database schema at all when I dream up new preferences. In fact, I can let system users create new preference types at run time, and the system just works. In this scenario, adding each employee’s hat preference doesn’t require updating my schema.
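For example, under method two a brand-new preference type is just another row; no DDL required (the ID and values here are made up):

```sql
-- No "create table" needed; a new preference type is just data:
insert into preferences (employee_id, preference_type, preference_value)
values (7, 'hat', 'fedora');
```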

On the downside, I wouldn’t have any FK constraints. Somebody could store a preference for a nonexistent shift, and I wouldn’t know until an angry customer called me. I’d have to do a lot of application-level data validation, which I hate.
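Here’s the kind of bad row method two happily accepts, since nothing references the shift table (again, the values are made up):

```sql
-- This insert succeeds even though no such shift exists:
insert into preferences (employee_id, preference_type, preference_value)
values (7, 'shift', 'graveyardd');
```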

Finally, there’s just something about method two that seems … wrong, even though I’ve seen variations of this theme in production environments at previous jobs (cough, ALLCODES, cough, PINDATA, cough).

Stuck on this dilemma, I wrote to the PostgreSQL users mailing list and got a fantastic reply. Here are some excerpts:

Your “method 2” is something called an Entity-Attribute-Value table design[1].

That said, by going the EAV/“Method 2” route, you’re gaining flexibility, but at the cost of increased complication, and ultimately repurposing a relational database to do something that isn’t very database-like, that’s really more like a spreadsheet. (So why not just use a spreadsheet?) You have little room for recording additional information, like ordering preferences, or indicating that (say) a station preference depends on a location preference, or that a shift time depends on day of the week, etc — so you’re probably not getting as much flexibility as you think. Sure, you could add an “Extra_Data” column, so you have rows:

Marie-Location-West-1,
Marie-Location-East-2,
Marie-Shift-Evening-Tuesday,
Marie-Station-Register-West,
Marie-Shift-Morning-Sunday,

etc. But you can see the data integrity nightmare already, when you somehow manage to record “Marie-Shift-Register-1”. Not to mention that you’ll have to do format conversions for that “Extra_Data” field, and incorporate logic somewhere else in your program that deciphers whatever’s in the generic data field to come up with ordering preferences for locations, station preferences by shift times, or whatever else you want to store.

[1] http://en.wikipedia.org/wiki/Entity-Attribute-Value_model

At this point, I was pretty sure I would go with method 1, but not absolutely certain. Then I read that linked article, which really just said more of the same.

Then I read this Ask Tom post, and that erased the last bit of lingering doubt I had. Method 2 is incompatible with performance. Method 2 turns your database into a glorified flat file. Here are some of my favorite excerpts from the Ask Tom post:

Frequently I see applications built on a generic data model for “maximum flexibility” or applications built in ways that prohibit performance. Many times – these are one in the same thing! For example, it is well known you can represent any object in a database using just four tables:

Create table objects ( oid int primary key, name varchar2(255) );

Create table attributes
( attrId int primary key, attrName varchar2(255),
datatype varchar2(25) );

Create table object_Attributes
( oid int, attrId int, value varchar2(4000),
primary key(oid,attrId) );

Create table Links ( oid1 int, oid2 int,
primary key (oid1, oid2) );

Looks great, right? I mean, the developers don’t have to create tables anymore, we can add columns at the drop of a hat (just requires an insert into the ATTRIBUTES table). The developers can do whatever they want and the DBA can’t stop them. This is ultimate “flexibility”. I’ve seen people try to build entire systems on this model.

But, how does it perform? Miserably, terribly, horribly. A simple “select first_name, last_name from person” query is transformed into a 3-table join with aggregates and all.
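To make that concrete, here’s my own sketch (not from the Ask Tom post) of what that simple query might turn into under the four-table model, with the attribute names made up for illustration:

```sql
-- What "select first_name, last_name from person" becomes:
-- every column fetch is now a join against the attribute tables.
select fn.value as first_name, ln.value as last_name
from objects o
join object_attributes fn
on fn.oid = o.oid
and fn.attrid = (select attrid from attributes
where attrname = 'first_name')
join object_attributes ln
on ln.oid = o.oid
and ln.attrid = (select attrid from attributes
where attrname = 'last_name');
```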

There’s a comment on that story about some Java developers who insisted on this approach and then had to redesign the whole thing post-launch. I also like the “chief big table” remark.

Anyhow, it’s nice to know that (this time) my instincts were correct.

Published by

matt

My name is Matt Wilson. I make websites for Kiwee to pay the bills. I live in Cleveland Heights, Ohio, with my wife and son.

  • Mac

    Matt,

    It's good we have diverse databases to work with. RDBMSs give you the performance you need (join queries aside), but no matter what Tom says, just ask a developer what it's like to handle a requirement change at a late stage of development.

    The EAV model is definitely much more flexible and gives maximum freedom to the developers. And it removes the join query, since there are no two tables to join.

    I think you should check out the schemaless semantic database that ships with the Brainwave Platform, a complete development and deployment suite mostly built in Python.

    -Mac

  • Doug R

    I don't know, Matt, there's something to be said for Entity-Attribute-Value, if it's used judiciously. Adding a new column to a table every time you want some sort of new behavior in your application is just not a sustainable way to program.

    What I like to do is have first-class properties (i.e. database columns) on my model objects if a) I know every single object of this type will need this to be valued, or b) I am concerned about data integrity, or c) I plan to query this information frequently. Then for other things that may not be defined for every instance of that type, and data integrity doesn't matter, use attributes. For example, in a standard CMS application, your Page object might have a column for “page_url” since every page needs to have a URL. But I might use an attribute record for something like “main image caption”, since not every page necessarily has an image, and if the caption is missing or just some random text I don't care. Now when you add a second image to the page, and a second caption, you don't need to add a new column, you just add a new record.

    Tom's example seems completely ridiculous, because everybody has first name and last name and birthday…What he advocates is adding a “<couple of flex fields for oddities>” So now you've got data in these extra fields, except you have a fixed number of them, and there's no context around what data is actually in those fields, which means you still can't query it easily, and you still can't enforce data integrity. I prefer attributes to this mess.

    To address your problem of people putting in preferences for non-existent shifts, this sounds like a user-interface problem, not a database design problem. If you are going to let them type in any old thing, then of course you can't have foreign keys, and will have to do application level validation.

  • http://blog.tplus1.com Matt Wilson

    Hi Doug, thanks for the comment. After I wrote this up I saw this pretty good EAV implementation here:

    http://permalink.gmane.org/gmane.comp.db.postgr

    It's flexible and it lets me use constraints to make sure nothing silly happens. I bet it's slower than the really inflexible approach of keeping every preference in a separate table, though, and I don't see how I could store information about the nature of the preferences, like which preference outranks another.

    I've had bad luck relying on the UI to protect the integrity of the data. A lot of times I end up with numerous different interfaces: a command-line deal, some web app stuff, and so forth.

    I know it works for others, but I've been bitten more than once by not locking down my database against silly users.

    Anyhow, I've been trying to learn couchdb in my very scarce free time. It's a completely different take on databases — every object has its own set of attributes, but there's no guarantee that any two objects share anything. Seems like the antithesis of the relational model. So it probably makes a much better fit for some things.

    Completely unrelated — you a programmer now too? How much does the fact that you learned stuff like SAS influence how you program now? Looking back on the Fed, I can't believe they would just scoop up complete novice programmers, and then just expect them to navigate and improve that code, and in general, it worked OK.

  • GB

    True.
    But this is also the downfall of databases.
    How have you solved your initial problem? You still cannot ask your original schema for everything about a person's preferences without knowing beforehand all the tables that exist. That means every time you come up with a new preference table, you have to add it to every query that needs to look at all preferences.
    Databases do not seem to be good at managing information of this kind.

  • http://felspar.com Mike A

    Seems to me that the problem is that SQL was never really 'relational'.

    The massive wart on the face of SQL is NULLs. These things get everywhere and distort application logic into special-casing them. Hack on top of ugly hack. We are forced to live with it as accepted religion. Every nullable column in a table represents a missing table, on average perhaps a hundred lines of hacky conditionals, weeks of programmer time already wasted, and months of maintenance issues ahead.

    See, what happened after SQL was that common programmers discovered the normal forms, but of course, these being wrapped in all that academic, mathematical nonsense, the masses never really got them, and so they settled on ideas like '3NF is good enough' and 'sometimes you have to denormalise'.

    The fact is that even 5NF is not good enough for any serious application. The only decent policy, and one that data warehouses are starting to realise, is 6NF throughout (not DKNF). This gives you all of the flexibility benefits of EAV (adding an attribute means just adding a new table – none of your existing code needs to change), all of the benefits of 5NF and lower (real data integrity), and as an added bonus, you never have to worry about NULLs (there are none), and you can historize your data (turning your database into a revision system for free). Migration is also much easier, as you can d.

    As for JOIN performance, think about this – we know EAV sucks for performance as described above. Relational joins can also suck – but not if they are in 6NF. That's because, disregarding the number of rows, the time taken to join two tables, not to mention the sorting, filtering etc. is proportional to the number of attributes in the tables, which is a massive waste of time if you have 30 attributes in each table but only need 2 or 3 of them for the query, which is often the case.

    Because of this, developers and DBAs become disillusioned with their carefully normalised 5NF databases, and hence they are vulnerable to EAV.

    6NF, on the other hand, does linear time joins, no matter how many attributes there are in the schema (as you only join what you need). That really matters when you have massive amounts of data, and is why such snake oil as EAV databases like BigTable are doomed. The advent of SSDs and cheap RAM will only make this more pertinent.

    If only people had actually listened to Codd.

    Check out anchor modelling for an example (although the naming scheme is a waste of time).

  • http://blog.tplus1.com Matt Wilson

    Hi Mike, You wrote a lot of interesting stuff. I've never looked at 6NF but I'm going to now. Am reading a book “SQL for Smarties” and the author spends a lot of time referring back to Codd's original work.

    Thanks for the post! I'll maybe have some intelligent things to say after I do my homework.

  • http://felspar.com Mike A

    Sorry, I meant to say: 'Migration is also much easier, as you can drop and add tables, without having to change code.'

  • http://felspar.com Mike A

    Oops, also, 6NF joins are constant time (not linear) for constant number of rows, bad mistake. (for more on this, see page 29 on http://www.anchormodeling.com/tiedostot/SU_KTH_…)

    I'm glad you enjoyed the post. I got interested in the EAV/relational debate after going to a Google Devfest where they explained AppEngine and BigTable. Now, AppEngine is cool, but I remember distinctly a point in the presentation where they explained BigTable and someone at the back asked, “How do we do joins?”. The guy from the Google team replied, “Ah, well, ok, you need to rewrite your app so there are no joins – just denormalise all your data”.

    There was an audible 5 second stony silence as people exchanged anxious glances, including from the Google team.

    From talking to people over the next few weeks, I realised that, at that moment, I'd guess about 30% of people in the room realised that the Emperor had no clothes (one guy actually walked out), the rest somehow managed to convince themselves that It's Google So It Must Be Good™.

    And I just think that's not right.

  • http://blog.tplus1.com Matt Wilson

    “just denormalise all your data”

    Yeah, that seems to be what BigTable, CouchDB, and SimpleDB are all about. Saying “just denormalize” makes it seem like converting a database from SQL to something like a gigantic spreadsheet is no big deal. But for a serious project, in order to make sure that my stuff still works, I'd need to implement all sorts of constraints and checks in my application layer. By the time I finish, I will probably have written a crappy subset of SQL.

    On the other hand, I think there are a lot of projects that can use a simple denormalized database very well. There's room in the ecosystem for all sorts of different critters, but I don't think one can easily be exchanged for another.

    I'll take a look at that PDF.
