Database Naming: A response
[Note to non-technical types: skip this one. It contains no entertainment value whatsoever.]
I read an interesting article linked from DZone today, proposing a naming approach for database tables and fields.
What was a little startling to me, though, is that it was clearly a carefully-considered approach, but I disagreed with almost every single conclusion he drew. I’ll review the main points and outline my approach along the way.
Identifying the Problem
From the article:
Perhaps the most common practice in database naming is to use all one case (upper or lower) with words separated by underscores, and heavily abbreviated. Fixed case names have become the standard because database systems are traditional case-insensitive in table and field names (or required fixed case). […] Separating words with underscores is a natural result of fixed case names and also is commonplace in academia. The heavy use of abbreviations is a holdover from the days of dBase and other systems where there was a stringent limit on table and field name length.
We’re agreed at the starting point — personally, some of my most frustrating development experiences have been working with inherited AS/400 databases; I’m not sure what the table & column name character limits were — 10? — but particularly when every abbreviations seemed to have been made up on the fly (“customer” could be cst, cust, cus, cstm, cu, etc.) it’s simply hell. They also tended to try to prefix column names with some shorter abbreviation of the table name as well (leaving even fewer precious characters for the actual field name…). The databases were nearly impossible to use.
I also want to point out that almost any naming standard is better than none, or a random mixture of standards depending on who was adding the particular table — if I were working with one of the databases designed using his standard, I’d stick to it (though I’d probably gripe about it!).
On to his scheme. His first solution is to use all mixed case names:
Solution #1: Mixed case (vs. Underscores)
- I use mixed case names.
- I do not use underscores. Mixed-case makes underscores unnecessary and once you are used to omitting them, they look ugly and it is easier not to type them.
The first problem here is that databases are more or less still case-insensitive, not just “traditionally”. Try googling for database table names mixed case and see how many times you see words like bug/issue/defect/HELP! associated with how many different databases… people looking for workarounds because Oracle, DB2, MySQL, PostgreSQL, etc. are chewing up their mixed-case table names. I know I’ve had to recreate schemas before because of table name case issues.
Second, are underscores actually harder to read? Example: a column holding the date of the last order:
Plain English: last order date Underscores: last_order_date Mixed case: lastOrderDate
If the mixed case were significantly faster to read & easier on the eyes, it might be worth it to push against the database case-insensitive history & ongoing issues (I imagine you can get around the problems by quoting all table name references, for example). I’m not seeing the benefit, though.
Solution #2: Table type prefix to table name
- I prefix table names to indicate one of three different types of tables.
I’m a bit more understanding on this one — I also used to be a hardcore fan of Hungarian notation in code, though I found as my code got better I fell back to a much laxer standard, only keeping prefixes to flag incoming method arguments and constants. Prefixes help by grouping tables by type automatically when they’re in alpha order and providing a bit more info than the base name; the downside is added ugliness and distraction when you don’t need the added info.
His prefixes: tbl for regular data tables, link for many-to-many relationship tables, and pltbl for “picklist” tables — lists of valid values for columns elsewhere.
First off… “pltbl”? That makes me think more of dinner table scenes from Calvin & Hobbes than anything else. But beyond that, I don’t use these, and probably won’t start; I get more value out of just constructing table names carefully so that order-related tables all start with “order”, etc. — instance-specific prefixes. But if I were, I’d drop the “tbl” prefix entirely (since that’s the majority of the tables; why clutter up the name for no added info?) and change “pltbl” to “lookup” or some meaningful, single word.
Solution #3: Consistent singular (vs. plural) table names
He sticks with singular (e.g., customer vs. customers). I agree on this one — then I can have a table customer with a PK customer_id, for example. It just works out cleanly.
Solution #4: Table-name prefix on column names
- I prefix field names with a three letter code to indicate which table owns the field.
Disagree completely. This is what table aliases are for in joins… and you can leave them out when you’re selecting from a single table, and aliases won’t play hell with mapping a result set to object fields. Also, when (for example) order IDs are referenced from multiple tables, I want them to use the same column name, because they’re the same data! It should be clear that you can join activity_log.order_id to order.order_id… whereas if those columns are named actlOrderId and ordOrderId that’s not obvious anymore.
Reason #1 from the article:
When you join tables together that each contain fields of the identical name, you need to clutter your SQL statements with a bunch of AS clauses to make the resulting field names unique.
This confused me at first — take the “order_id” example above — why would I select from both order_id columns? They’ll have the same data. Then I noticed that he’s selecting * from all of the joined tables in his example. That’s why…. I pretty much never select * when I’m joining multiple tables, partly as a self-check; if I have to change the columns in a table (renaming, removing, etc.) I don’t want queries working normally, only to find later that an essential piece of data is no longer making it into my objects; if it’s wrong, the sooner it breaks and makes some noise, the better.
It makes it more difficult to interpret certain error messages. For example, suppose you are using traditional naming and you get the error “Field cust_id not found.” If you have multiple tables with a cust_id field, you may not immediately know which table generated the error.
This is an error-reporting shortfall; if the exception handling captures enough data (at a minimum, the faulty SQL statement), this is never an issue.
Solution #5: Never use spaces in field names
This one’s hard to argue with; it makes for horrendous SQL. But my feeling is that you can get almost the same readability using underscores anyway (see above).
Solution #6: Avoid duplicating column names in different tables
From the article:
For example, your customer table may have a notes field for notes about the customer and your order table may have a notes field for notes about the order. Using my prefix system, you may be inclined to name them this way:
You can see here how using prefixes to indicate the owner table is already convenient to distinguish order notes from customer notes on a join operation.
However, I recommend this method:
This is where the downsides of this approach start to be really clear: his “solutions” have given him column names such as cusCustomerNotes — that’s 11 characters you have to parse before you can find the 5 character actual name. And how much useful data is actually in those 11 characters? Almost none. It’s basically 70% noise and 30% data.
The problem he’s addressing is valid — I’ve had special cases where overlap like this is obviously going to present a problem, so I’ll choose column names carefully to avoid it. But the special cases aren’t common. Look at the example — how often are you going to be selecting both a customer’s notes and one of his order’s notes into a single object? It’s much more likely that you’d have a Customer object, then separate list of Order objects. Hence, no overlap & no conflict, and 11 characters just muddling things up without adding anything.
A related issue that he doesn’t cover (but you should) is naming of common field types — I’m not talking about Hungarian notation here, but common datatypes that you’re using everywhere and which tend to include the type in the name somehow, like dates, timestamps, PK values, and true/false values. Take the boolean example: if you don’t set a standard, you’ll have some developers using ‘T’/’F’ in char(1) fields, others using ‘Y’/’N’ or ‘T’/null or empty string in the same fields, others using 1/0 values in number(1) fields… with column names including things like is_active, active_flag, active_tf, active_bool, and so on. I wouldn’t insist on what approach to take exactly, but do think it over, pick an approach, and stick with it.
Solution #7: Use GUID strings as primary keys, not integers
He advocates using a 36-character long, randomized string instead of the standard auto-incremented numeric value for primary keys. I actually worked with a database that took this approach recently, and since we were migrating it anyway, we switched the approach in the process to use a numeric PK — it was just too frustrating to work with!
He lists performance as the first downside — though I don’t know that he actually benchmarked this; personally I don’t think it would make a big difference. Then he does mention the main problem — these IDs are simply hard to work with:
While integers are messy in that they don’t tell you much about the object, during the development process, it is much easier to tell a colleague that there is a problem with customer number 3452 than to tell him that there is a problem with customer CUS:005bf43a-01ca-454a-996c-2c73621ceeb6. In my development, GUIDs have proved somewhat inconvenient when trying to figure out why the application cannot find customer CUS:005bf43a-01ca-454a-996c-2c73621ceeb6 or trying to find the corresponding foreign key value in a database. It can be done, but looking at columns and rows or GUIDs for any length of time can give you a serious headache!
Exactly — even an integer up in the millions is far easier to copy & paste, when you’re writing ad-hoc SQL, debugging, etc.. So, that’s the cost, but he thinks it’s generally outweighed by the benefits:
Benefit #1: Merging databases; i.e., you don’t have to worry about overlap between different data sets.
Well, this situation doesn’t come up often, of course, but if it does, if you’re following my approach of giving the same data the same column name irrespective of table (e.g., order_id everywhere, not cusOrderId and ordOrderId) then it’s pretty trivial to migrate — just add a fixed offset to those incoming columns, update your sequences when you’re done, and you’re all set.
Benefit #2 : Knowing the PK before adding the record.
Maybe we’re just dealing with different databases, but I definitely do not need to follow all of the steps he outlines to search for a row I’ve just written to retrieve the auto-incremented ID. AFAIK modern major databases do not have this problem. For example, in Oracle you can either retrieve the next ID from a sequence on your own (before writing the record) or you can build it into the insert statement like this:
insert into order (order_id, customer_id) values (order_id_seq.nextVal, 12);
…then retrieve the just-added ID from the sequence like this:
select order_id_seq.currVal from dual;
No searching, selecting, etc.; it doesn’t even have to hit the disk to give you this info.
Benefit #3: Web database key values in the URLs
If you ever write a database application for the web, one thing you probably want to avoid is to use consecutive integers as your primary key values. This is because most database-driven web sites at some point display a key field in the address bar of the web browser. Any enterprising hacker can simply increment or decrement the key value and potentially see data that they shouldn’t. By using GUIDs as primary keys, it is much more difficult for the hacker to penetrate the inner workings of the site by this method.
This is another case of trying to hack a solution to a problem using SQL when the problem is elsewhere. GUIDs are absolutely not an excuse for leaving your webapp wide-open. Substituting IDs in URLs doesn’t even take an “enterprising hacker”… anyone can try it, and eventually someone will get it. If you don’t implement any security to prevent people from accessing data they’re not supposed to see, you’ve got a much bigger problem that this can solve.
Benefit #4: You’ll never run out of GUIDs
The maximum numeric column value in most databases is more than large enough for anything I’ve ever run into… it’s 38 digits in Oracle — whatever actual number that translates to, it’s big.
Well, that’s the gist of it. The important thing (I think we’d agree) is that you put some thought into the usability of the table and column names you choose, in the situations where you’ll be using them most — and make some kind of standard that you can follow consistently for the life of the database. What you do will depend partly on your tools, your development language(s), your database, your coding habits, and so on — just make sure you are making your choices based on your real everyday usage, not based on the “cleverness” of an idea or because of imagined issues that are actually minimal risks.