Tips for Choosing a Primary Key

We had a great lunch the other day with a long time customer, talking about technologies past, present and future, as well as what the customer is currently doing. The main LOB application needs a serious upgrade, and it looks like a total rewrite to take the desktop app to the web. This was a good time to rearchitect the database, and our customer was stuck with some technical decisions on what to use for primary keys. Having been down this road several times before, I shared some of my experiences, and here's a summary of my thoughts. I'm not dogmatic, these suggestions aren't set in stone, and people will disagree with me here. This post comes from years of experience in data warehouses and BI apps, and represents my current thinking. Historically I did not apply the logic I discuss here in every situation, and if I could take what I know now and go back in time, I'd probably key a few tables differently. Most not, but a few, yes. A primary key is important for more than just query optimization--this is the key which will be used in relationships between tables. Primary keys are the data primarily responsible for the integrity of your data, and the choice should be taken very seriously. In my work, the data relations were more important than having a good index for a query--we usually ended up creating other indices for the application and reporting queries. There is a great deal of information on choosing the right primary key, and in the end the choice needs to suit your circumstances. You'll likely have to take into account regulatory information, as well as limitations in storage capacity, historical data integrity, application design, etc. Ideally a primary key should be a single column, since that will simplify the relations between tables and in the join. If you have multiple columns, consider using an identity column instead. I don't usually recommend homebrewing your own key generation logic because it's more than likely you'll find your "that'll never happen" scenario. Even if you add some lookup ability and incremement a counter in case of a duplicate, how does that improve your application beyond just a standard identity field? One of the complaints I've heard about normalizing databases is "when I'm looking at the data, I know what the column means". My answer is to stop looking at the raw data directly, use a view or your application. If the data can change, be blank or be duplicated, don't use it The prototypical example here is customer name. Customer names, or portions thereof, make lousy primary keys. Names can change, as can the customer's location. Email and phone numbers also make poor primary keys, because these can also change (or in the case of phone numbers, reassigned to a completely different person). In any of these conditions, I'd recommend an identity column or a GUID (another decision, see below). If there is a central registry, use it The prototypical example here is a countries table. There is an ISO list of recognized countries and their two and three letter abbreviations at Rather than use a numeric CountryId, use the Alpha-2 or Alpha-3 abbreviations. There is nothing wrong with having a CountryId identitiy column and using that, but it's not the most graceful solution, and using the ISO standards of USA, CAN, GBR as the key can save you a join in many queries. Everybody likes to save a join. Another good key is a stocked part number. UPCs are issued through a central registry, and specifically identify a particular product and size. Manufacturing part numbers for inventory items are good, but for custom jobs the part number may not exist at all. What to do here is a decision based on the needs of your domain. Anyone with a memory of the last few decades might be flagging a violation of my rule about changing data. Aside from a few years in the mid-1990s, the list of recognized countries changes very slowly. What to do in the case of another situation like the USSR or Czechoslovakia dissolving depends on your enterprise's needs, but you may end up adding new countries to the table, and updating customer records. Never use personal information Even though Social Security numbers are issued by a central registry, they should not be used as a primary key. This is a piece of very personal identifying information, and some industries have regulations specifically prohibiting this practice. Even if you're not in one of those industries, it's still a best practice to not use this type of information. Integer or GUID? I hear all sorts of arguments for one datatype or the other, all of them are well reasoned. There are multiple choices, and multiple reasons, because no data type is a one-size-fits-all solution. Not to be repetitive, but your choice should be made looking at the requirements of your application and enterprise, not just the opinions online. One of the more common arguments is the storage space required. For most of us, the difference will be minimal and more than overcome by normalizing our data better. GUIDs are composed of hexadecimal numbers, and are stored that way. They take up less actual disk space than it looks they need when you view them in a query result. Integers are sequential, so sorting an integer key gives you the order of creation with some reliability. GUIDs are not sequential, so you'll definitely need to add a timestamp (which you should anyway). It is true that in very large tables, SQL Server can sort integers faster than GUIDs, which may mean you include both in your table. If it's likely you'll combine fact tables (perhaps there are multiple field personnal working from local databases), it's easier if you use a GUID, since you can just copy the data directly into the target tables (after any schema changes are implemented). If you use an integer or other value, it's still no big deal, but a little extra work is required. We did that several times as our database evolved--we simply added an "OldId" column to the target database, mapped in the old ID, then updated the FK column as we migrated the fact tables. Whatever you do, guard those IDs! Whatever you do, be sure to guard your IDs carefully. Don't expose them in a querystring, or assume no one looks at your HTML code. There have been a number of security breaches this year along because app developers--even at major companies--handled their information in insecure ways. Summary The primary key is essential for data integrity, and the choice should be taken seriously. Values used in primary keys should be unique and non-null, should not change, and ideally be one data column. Industry standardized lists are good sources of primary key values; an individual's name or code derived from a name is not a good choice. Lean on your database to provide auto-incrementing columns rather than writing one yourself.


GrapeCity Developer Tools
comments powered by Disqus