Database Naming: A response

[Note to non-technical types: skip this one. It contains no entertainment value whatsoever.]

I read an interesting article linked from DZone today, proposing a naming approach for database tables and fields.
What was a little startling to me, though, is that it was clearly a carefully-considered approach, but I disagreed with almost every single conclusion he drew. I’ll review the main points and outline my approach along the way.

Identifying the Problem

From the article:

Perhaps the most common practice in database naming is to use all one case (upper or lower) with words separated by underscores, and heavily abbreviated. Fixed case names have become the standard because database systems are traditional case-insensitive in table and field names (or required fixed case). […] Separating words with underscores is a natural result of fixed case names and also is commonplace in academia. The heavy use of abbreviations is a holdover from the days of dBase and other systems where there was a stringent limit on table and field name length.

We’re agreed at the starting point — personally, some of my most frustrating development experiences have been working with inherited AS/400 databases; I’m not sure what the table & column name character limits were — 10? — but particularly when every abbreviations seemed to have been made up on the fly (“customer” could be cst, cust, cus, cstm, cu, etc.) it’s simply hell. They also tended to try to prefix column names with some shorter abbreviation of the table name as well (leaving even fewer precious characters for the actual field name…). The databases were nearly impossible to use.

I also want to point out that almost any naming standard is better than none, or a random mixture of standards depending on who was adding the particular table — if I were working with one of the databases designed using his standard, I’d stick to it (though I’d probably gripe about it!).

On to his scheme. His first solution is to use all mixed case names:

Solution #1: Mixed case (vs. Underscores)

  • I use mixed case names.
  • I do not use underscores. Mixed-case makes underscores unnecessary and once you are used to omitting them, they look ugly and it is easier not to type them.

The first problem here is that databases are more or less still case-insensitive, not just “traditionally”. Try googling for database table names mixed case and see how many times you see words like bug/issue/defect/HELP! associated with how many different databases… people looking for workarounds because Oracle, DB2, MySQL, PostgreSQL, etc. are chewing up their mixed-case table names. I know I’ve had to recreate schemas before because of table name case issues.

Second, are underscores actually harder to read? Example: a column holding the date of the last order:

Plain English: last order date
Underscores:   last_order_date
Mixed case:    lastOrderDate

If the mixed case were significantly faster to read & easier on the eyes, it might be worth it to push against the database case-insensitive history & ongoing issues (I imagine you can get around the problems by quoting all table name references, for example). I’m not seeing the benefit, though.

Solution #2: Table type prefix to table name

  • I prefix table names to indicate one of three different types of tables.

I’m a bit more understanding on this one — I also used to be a hardcore fan of Hungarian notation in code, though I found as my code got better I fell back to a much laxer standard, only keeping prefixes to flag incoming method arguments and constants. Prefixes help by grouping tables by type automatically when they’re in alpha order and providing a bit more info than the base name; the downside is added ugliness and distraction when you don’t need the added info.

His prefixes: tbl for regular data tables, link for many-to-many relationship tables, and pltbl for “picklist” tables — lists of valid values for columns elsewhere.

First off… “pltbl”? That makes me think more of dinner table scenes from Calvin & Hobbes than anything else. But beyond that, I don’t use these, and probably won’t start; I get more value out of just constructing table names carefully so that order-related tables all start with “order”, etc. — instance-specific prefixes. But if I were, I’d drop the “tbl” prefix entirely (since that’s the majority of the tables; why clutter up the name for no added info?) and change “pltbl” to “lookup” or some meaningful, single word.

Solution #3: Consistent singular (vs. plural) table names

He sticks with singular (e.g., customer vs. customers). I agree on this one — then I can have a table customer with a PK customer_id, for example. It just works out cleanly.

Solution #4: Table-name prefix on column names

  • I prefix field names with a three letter code to indicate which table owns the field.

Disagree completely. This is what table aliases are for in joins… and you can leave them out when you’re selecting from a single table, and aliases won’t play hell with mapping a result set to object fields. Also, when (for example) order IDs are referenced from multiple tables, I want them to use the same column name, because they’re the same data! It should be clear that you can join activity_log.order_id to order.order_id… whereas if those columns are named actlOrderId and ordOrderId that’s not obvious anymore.

Reason #1 from the article:

When you join tables together that each contain fields of the identical name, you need to clutter your SQL statements with a bunch of AS clauses to make the resulting field names unique.

This confused me at first — take the “order_id” example above — why would I select from both order_id columns? They’ll have the same data. Then I noticed that he’s selecting * from all of the joined tables in his example. That’s why…. I pretty much never select * when I’m joining multiple tables, partly as a self-check; if I have to change the columns in a table (renaming, removing, etc.) I don’t want queries working normally, only to find later that an essential piece of data is no longer making it into my objects; if it’s wrong, the sooner it breaks and makes some noise, the better.

Reason #2:

It makes it more difficult to interpret certain error messages. For example, suppose you are using traditional naming and you get the error “Field cust_id not found.” If you have multiple tables with a cust_id field, you may not immediately know which table generated the error.

This is an error-reporting shortfall; if the exception handling captures enough data (at a minimum, the faulty SQL statement), this is never an issue.

Solution #5: Never use spaces in field names

This one’s hard to argue with; it makes for horrendous SQL. But my feeling is that you can get almost the same readability using underscores anyway (see above).

Solution #6: Avoid duplicating column names in different tables

From the article:

For example, your customer table may have a notes field for notes about the customer and your order table may have a notes field for notes about the order. Using my prefix system, you may be inclined to name them this way:

  • cusNotes
  • ordNotes

You can see here how using prefixes to indicate the owner table is already convenient to distinguish order notes from customer notes on a join operation.

However, I recommend this method:

  • cusCustomerNotes
  • ordOrderNotes

This is where the downsides of this approach start to be really clear: his “solutions” have given him column names such as cusCustomerNotes — that’s 11 characters you have to parse before you can find the 5 character actual name. And how much useful data is actually in those 11 characters? Almost none. It’s basically 70% noise and 30% data.

The problem he’s addressing is valid — I’ve had special cases where overlap like this is obviously going to present a problem, so I’ll choose column names carefully to avoid it. But the special cases aren’t common. Look at the example — how often are you going to be selecting both a customer’s notes and one of his order’s notes into a single object? It’s much more likely that you’d have a Customer object, then separate list of Order objects. Hence, no overlap & no conflict, and 11 characters just muddling things up without adding anything.

A related issue that he doesn’t cover (but you should) is naming of common field types — I’m not talking about Hungarian notation here, but common datatypes that you’re using everywhere and which tend to include the type in the name somehow, like dates, timestamps, PK values, and true/false values. Take the boolean example: if you don’t set a standard, you’ll have some developers using ‘T’/’F’ in char(1) fields, others using ‘Y’/’N’ or ‘T’/null or empty string in the same fields, others using 1/0 values in number(1) fields… with column names including things like is_active, active_flag, active_tf, active_bool, and so on. I wouldn’t insist on what approach to take exactly, but do think it over, pick an approach, and stick with it.

Solution #7: Use GUID strings as primary keys, not integers

He advocates using a 36-character long, randomized string instead of the standard auto-incremented numeric value for primary keys. I actually worked with a database that took this approach recently, and since we were migrating it anyway, we switched the approach in the process to use a numeric PK — it was just too frustrating to work with!

He lists performance as the first downside — though I don’t know that he actually benchmarked this; personally I don’t think it would make a big difference. Then he does mention the main problem — these IDs are simply hard to work with:

While integers are messy in that they don’t tell you much about the object, during the development process, it is much easier to tell a colleague that there is a problem with customer number 3452 than to tell him that there is a problem with customer CUS:005bf43a-01ca-454a-996c-2c73621ceeb6. In my development, GUIDs have proved somewhat inconvenient when trying to figure out why the application cannot find customer CUS:005bf43a-01ca-454a-996c-2c73621ceeb6 or trying to find the corresponding foreign key value in a database. It can be done, but looking at columns and rows or GUIDs for any length of time can give you a serious headache!

Exactly — even an integer up in the millions is far easier to copy & paste, when you’re writing ad-hoc SQL, debugging, etc.. So, that’s the cost, but he thinks it’s generally outweighed by the benefits:

Benefit #1: Merging databases; i.e., you don’t have to worry about overlap between different data sets.
Well, this situation doesn’t come up often, of course, but if it does, if you’re following my approach of giving the same data the same column name irrespective of table (e.g., order_id everywhere, not cusOrderId and ordOrderId) then it’s pretty trivial to migrate — just add a fixed offset to those incoming columns, update your sequences when you’re done, and you’re all set.

Benefit #2 : Knowing the PK before adding the record.
Maybe we’re just dealing with different databases, but I definitely do not need to follow all of the steps he outlines to search for a row I’ve just written to retrieve the auto-incremented ID. AFAIK modern major databases do not have this problem. For example, in Oracle you can either retrieve the next ID from a sequence on your own (before writing the record) or you can build it into the insert statement like this:

insert into order (order_id, customer_id) values (order_id_seq.nextVal, 12);

…then retrieve the just-added ID from the sequence like this:

select order_id_seq.currVal from dual;

No searching, selecting, etc.; it doesn’t even have to hit the disk to give you this info.

Benefit #3: Web database key values in the URLs

If you ever write a database application for the web, one thing you probably want to avoid is to use consecutive integers as your primary key values. This is because most database-driven web sites at some point display a key field in the address bar of the web browser. Any enterprising hacker can simply increment or decrement the key value and potentially see data that they shouldn’t. By using GUIDs as primary keys, it is much more difficult for the hacker to penetrate the inner workings of the site by this method.

This is another case of trying to hack a solution to a problem using SQL when the problem is elsewhere. GUIDs are absolutely not an excuse for leaving your webapp wide-open. Substituting IDs in URLs doesn’t even take an “enterprising hacker”… anyone can try it, and eventually someone will get it. If you don’t implement any security to prevent people from accessing data they’re not supposed to see, you’ve got a much bigger problem that this can solve.

Benefit #4: You’ll never run out of GUIDs
The maximum numeric column value in most databases is more than large enough for anything I’ve ever run into… it’s 38 digits in Oracle — whatever actual number that translates to, it’s big.

Summary

Well, that’s the gist of it. The important thing (I think we’d agree) is that you put some thought into the usability of the table and column names you choose, in the situations where you’ll be using them most — and make some kind of standard that you can follow consistently for the life of the database. What you do will depend partly on your tools, your development language(s), your database, your coding habits, and so on — just make sure you are making your choices based on your real everyday usage, not based on the “cleverness” of an idea or because of imagined issues that are actually minimal risks.

18 Comments

  1. Reply
    Jeremy Weiskotten 25th January, 2008

    Great post! I’ve been using Ruby on Rails for about a year, which defines the database conventions for you (eliminating most of this kind of debate), but the Rails conventions pretty much mirror what you’ve outlined here. One exception is that table names are plural (the Customer class maps to the customers table), and the surrogate PK is always “id”.

    I have done enough work with databases built with ridiculous conventions to appreciate the simpler, cleaner ideas you’ve described.

  2. Reply
    BlazS 28th January, 2008

    I think that tables should always be plural. Customers is table and its logical that in table Customers we will find one or more customers. Each customer has own row and so because of that tables should be plural. Customers > customer

  3. Reply
    rob 28th January, 2008

    @BlazS: It’s one of those things that I don’t have a strong opinion on — personally I use singular because it means the table name and object name match exactly, and the PK is [tablename]_id… and I’ve used that approach for years, so it sounds right to me. I’d rather refer to the customer table than the customers table — it’s better grammatically. Also, automating the plural -> singular translation is normally simple, but not always — unless you pluralize “reply” as “replys”, “person” as “persons”, etc..

    On the other hand, another argument for plural (beyond what you mentioned): some databases have administrative tables with names that you perhaps shouldn’t duplicate in your own schema, like “user” and “privilege”. THOSE are singular (in Oracle at least, which is what I’m thinking of), so if you have a table for your users and your own role/privilege setup in the database, using plural tablenames there’s no possible confusion.

    The more important point with this one is consistency, since there’s no significant cost or risk either way — if I’m working on a database designed by someone else with plural names, I definitely follow that convention with any new tables.

  4. Reply
    rob 28th January, 2008

    Ha! I was digging around online to see what other arguments there were for singular vs. plural table names, and ran across a brilliant suggestion: just choose table names that are nouns with the same form for plural and singular! Examples:
    aircraft
    sheep
    deer
    cannon
    fish

    It, ah, perhaps limits your options a bit, but it certainly resolves the issue!

  5. Reply
    JGM 28th January, 2008

    Just my personal preference, but I avoid all underscores in names on MySQL as this is a wildcard character. I prefer not to have to escape table or field names to prevent pattern matching.

  6. Reply
    rob 28th January, 2008

    @JGM: I don’t follow — you can write a query with a wildcard in the table name? I’ve only ever heard of wildcards in something like “select * from table where name like ‘Jo_’; …where that would return values like Jon/Job/etc..

    But that’s only values, not table or column names.

    Can you give an example of where that could be a problem? I done a few projects using MySQL, but I’ve never heard of this issue before.

  7. Reply
    karthik 28th January, 2008

    I think the performance benefit of knowing the primary key is in a specialized use – most frequently observed when using a tool such as Hibernate. If you are attempting to write a 1000 rows into a table in a batch and your PK is an Oracle sequence, Hibernate will execute a 1000 select … dual before it writes the rows. With a GUID, Hibernate can generate the key on the client side.

    I see your point about the readability of GUIDs. However, I think there is benefit to using GUIDs for certain PKs. I’ve successfully used GUID as the pk and a sequence as a surrogate key (populated with a trigger) and had the surrogate key as the foreign key elsewhere.

    For table names, it makes sense to ALWAYS use a prefix – sometimes you have tables from another app migrated into yours – it helps to easily distinguish between table sets then. I always use a TLA – 2 letters identifying the application and one for sub module. So for a bulletin board app, table prefixes may be BBS for system tables, BBU for user tables, BBM for message tables, etc.

  8. Reply
    Steve 28th January, 2008

    You absolutely do not have to escape table or column names that contain underscores in MySQL, I’m not sure what JGM is talking about.

    I agree with all your points; just I tend to use “id” rather than “tablename_id”, and as I do a fair bit of Rails work I tend to use pluralized forms nowadays – Rails has a reasonably comprehensive singular plural form function to do all the hard work. Makes you wonder why they bothered, but oh well.

    http://svn.rubyonrails.org/rails/trunk/activesupport/lib/active_support/inflections.rb

  9. Reply
    ME 28th January, 2008

    On the case sensitivity: This has been the case in programming languages for donkeys years (C -> C++ -> Java etc. ) and programmers handle mixed case names fine.

    Yes the DB may be case sensitive, so you have to write SQL in the correct case by following the naming convention.

    People not following a convention is not an argument against the convention.

  10. Reply
    rob 29th January, 2008

    @karthik: I’m not sure about the options for Hibernate bulk inserts, though I suspect you’d be surprised at how fast pulling 1000 id’s from a sequence would be over an open connection (much faster than reading from a table on the disk, for example…). I haven’t hit the situation, though — I’ve never used Hibernate for bulk data moves; it’s always seemed much more appropriate for application-style work with generally small numbers of objects at a time.

    WRT your second point — what were the reasons for using the GUID PK in the first place? It sounds like you had two separate primary keys in that situation (one as the “real” PK, one to operate as a readable identifier and foreign key), which would be confusing, so I imagine there were reasons for needing them both.

    The prefix argument I still don’t buy. Are there good reasons for sharing a single database schema for multiple applications? You can always leave them in the same database instance for reasons of simple backups, perhaps shared data, etc.; just put them in different schemas, then (you can still hit both schemas from one DB connection, but they won’t be stepping on each others toes)… that also makes it easier to move them to separate servers later as they grow, which seems more likely than that you’d be adding more applications to a single schema. And If I want to import tables from another database, well, I’d tweak the incoming data however I needed to make it consistent with the “home” schema. There’s some minor extra work involved in renaming the tables, columns, etc. to match the existing standards, but it’s a rare enough situation that I wouldn’t worry about it.

    @ME: Yes, programming languages like you mentioned are case-sensitive; but SQL by default and historically is not… you can select * from CUSTOMER or SELECT * fRom custoMER and it’ll work either way.

    I believe there’s support nowadays for using case-sensitive table names in most databases, but it seems to be problematic (partly because of the history of case-insensitivity affects tool assumptions).

    So you can do it, but it’s not the “norm” and I am arguing that it isn’t worth the trouble.

  11. Reply
    Liam 30th January, 2008

    I actually fall under a different philosophy for singular/plural table names, but it still often leads me to calling the tables singular:

    “The name of a table should describe one row of the table.”

    So, a table of customers would be called ‘customer’, a crosswalk table between customers and companies would be ‘customer_company’, and one where each row was a matrix of companies would be “companies”. In practice, I’ve only ever found singular table rows, however.

  12. Reply
    Mike 30th January, 2008

    I prefer to have my table names plural. The reason is because a table is actually a set of like items. So even though the individual item name would be Customer, when you select from a table, your actually returning Customers. If you look at what your actually doing logically, you wouldn’t want to say Customer->Customer[0]->id, Customers->Customer[0]->id just makes more sense.

    As for table prefixes, leave them out. They limit your ability to turn tables into views and vice-versa if your requirement later change. Then for linking tables I usually use something like CustomersAddressesMap.

  13. Reply
    Tim 30th January, 2008

    Let’s see how many of my naming rules (http://tottinge.blogsome.com/meaningfulnames/) dzone wants us to break ….

    I think only four. That’s better than the average legacy code base, but could be better, I think.

    I prefer non-space names, underscore-enriched, no warts or prefixes, no concern for plural v. singular in table names (be consistent only), totally agree with Rob on #6, and I don’t particularly like guids for IDs.

    Scary to agree so much.

  14. Reply
    Tyler 30th January, 2008

    Fucking eh! Good read.

  15. Reply
    mikeh 30th January, 2008

    Another option not discussed is to use a “Counter” table (VERY old school, but effective for many scenarios). Grab the next value for a PK int field via a stored proc, which increments the counter value when reading. One of the advantages of this is being able to reserve a range of consecutive values in advance. By using a char pk instead of an int you can even “namespace” your counter table values so different registered databases will not create overlapping ids (ie: “AAA000000001”, “AAA000000001”)

  16. Reply
    rob 30th January, 2008

    @Tim: “Scary to agree so much”
    I agree. No! I disagree! Huh. “This statement is false.” Hey, that’s a nice-looking naming guide.

    @mikeh: I’ve done that before, actually! I had a table for key values, with columns something like pk_name, value, chunk_size, max_value, then a singleton SequenceManager that would claim PKs from the table in “chunk_size” chunks, dole them out as needed, send email warnings if we were approaching the max_value, and reset the sequence back to 1 when it hit the max.

    In retrospect, though, I wouldn’t recommend it. The design was simple; implementing the code was trickier, and we did have some subtle bugs show up — that singleton object that doles out the IDs needed to be threadsafe but couldn’t lock out *all* PK requests when it needed to claim a new chunk for just one stack (welcome to concurrency tricks). Now cluster the app servers, so each server has its *own* SequenceManager, and they have to play nicely even when refreshing at the same time….

    The biggest problem is reimplementing all of the features of sequences that are nowadays very likely built into the database already. That project was working with DB2400; I don’t remember what it had (and possibly didn’t know at the time…). But at least in Oracle, this is *all* stuff the sequence can do for you (even pre-claiming chunks of N at a time) so getting a new PK is really going to be extremely fast… very unlikely to be near the top of performance hotspots, and thus not a candidate for a bunch of complex custom code.

  17. Reply
    Robert Williams 21st February, 2009

    Does anyone know where I can find more information about this?

  18. Reply
    Barrett 21st November, 2012

    Thanks for this article. I’m very glad that I happened to read it. I’m designing my first database (for personal use only so the stakes aren’t so high) and I’ve been trying to educate myself about conventions used, best practices, etc. I read the article that you are debunking and, being a novice, found his arguments enticing if a bit inconvenient (ex: cusCustomerNotes). I’m so glad that I happened upon your counter points. While his schema may be useful when merging databases, the more simple approach is definitely going to serve me better for my first project which only I will ever use.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.