SQL Database Design » 13. Design Heuristics » 13.1 Natural vs. Artificial Keys

13.1 Natural vs. Artificial Keys

Table of Contents

While primary keys are also a conceptual issue, the implications detailed in this section apply to the relational model, but not to the conceptual level. Designers, who directly start with the logical model will have to go through primary key selections.

Primary keys are a concept that exists for two reasons: to identify single entities and to avoid duplicates. While the former is an automatic, the latter is not. It can not be guaranteed just because a primary key was specified. Duplicates cause problems of their own, including addressability, duplicate removal, countability, orderability, data correctness, and interpretability of results. Avoiding duplicates is the domain of neverending discussions about natural and artificial keys. There are still religious wars going on between natural and artificial key protagonists, but as long as they keep fighting I say both of can be used in databases at the same time.

While the primary key topic is important, it has to be noted that candidate keys are actually the more important topic. Most people, like myself, find themselves defining foreign keys to other tables' primary keys, but they can also specified to reference alternative keys (no news here).

Artificial keys are often denoted as surrogate keys. Both are regularly used meaning the same. However, "artificial" is the exact semantic opposite of "natural", so I prefer using the term artificial keys over surrogate keys. I will also use the term "artificial keys", because "surrogate" is not really a word the average person, especially non-native English speakers would enter into a search engine.Artificial keys are rather an umbrella term for all kinds of artificial keys, one of which are surrogate keys.

Overall, there are three sub types of artificial keys, resulting in a total of four different primary key types:

  • Natural keys
  • Artificial keys:
    • Constructed keys
    • Identity keys
    • Surrogate keys

I will go into each key type in the following sections.

13.1.1 Natural Keys

Celko defines natural keys as following: A natural key is a subset of attributes which occur in the data model and act as a unique identifier. They are seen by the user. You can go to the external reality and verify them. You would also like to have some validation rule, such as look-ups and check digits in a UPC code.

Pascal has a more originary view on the topic: In a sense, no identifier in the real world is natural. Indeed, we assign names, social security numbers, policy numbers, employee numbers, car licenses, and so on precisely because there aren't any inherent ones. However, once assigned and used in the real worlds, these identifiers become natural and the columns representing them in the database are natural keys. For database purposes, then, a key is deemed natural if the attribute it represents was used for identification prior to (and, thus, independently of) the database. Natural keys have the advantage of familiarity - they are meaningful to users.

But, is anybody familiar with Universal Product Codes (UPCs)? Celko clearly adds two other characteristics to natural keys: they can be seen in and verified by an "external reality". (In this case, the external reality is the institution that created the UPCs.) Pascal also characterizes natural keys, in that they precede the database.

Both positions are compatible. I think the most important aspect of natural keys is the familiarity, which is a number of attributes known before the actual database was developed. Everyone designing a database for a bookstore will immediately realize, that ISBN numbers are predestined for becoming the primary key. ISBN numbers are natural, because they are an attribute of books and you are (somewhat) familiar with them. Of course, being represented by a single column does not automatically mean it cannot be natural.

At first, it is not quite obvious when a key is natural and when it is already artificial. There are some additional characteristics involved with artificial keys, discussed next.

13.1.2 Artificial Keys

Celko defines an artificial key as an extra attribute added to the table which is seen by the user. Note, he means an extra attribute added after all attributes of the entities have been modeled. Again, to distinguish them from natural keys, artificial keys are not familiar.

As mentioned before, there are three artificial key sub types:

  • Constructed keys:

    According to Celko, an artificial key, what I will call a constructed key, is characterized as following: It does not exist in the external reality, but can be verified for syntax or check digits inside itself. It's up to the DBA to maintain a trusted source for them. For example, a user can assign the open codes in the UPC scheme to products. In the grocery business, this might be where you find bread baked in the store, rather than bread bought from a national bakery chain. The check digits still work the same way to validate the barcodes, but you have to verify them inside your own enterprise.

    This means, for example, any official UPCs are natural and those which are created for internal use are artificial. At first, this does not seem to make much sense, but the point is, that the scheme and/or range of possible keys is developed by an internal instance solely for the purpose of the database. While these codes do have a business use, they are not meant to be used outside the enterprise. Although the keys might obey to the same rules as their natural counterparts, they are classified as a distinct artificial key type.

  • Identity keys:

    Pascal writes: When ... individual entities are of interest, the visual/arbitrary on the fly identifiers can be represented in the database by columns with, say, sequential integer values. Such columns are surrogate keys: They do not identify entities in a meaningful manner, like natural keys, they just distinguish between them in arbitrary fashion.

    Although Pascal denotes identity keys as surrogate keys, Celko indicates, that Codd originally defined surrogate keys differently: ... Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them ...

    Another source states: Surrogate keys should not be confused with [identity] keys. It is very difficult to find natural keys for primary entities such as Customer, Employee and Vendor. Therefore, we create artificial keys to make the entity unique, such as Customer ID, Employee ID and Vendor ID. These IDs are not surrogate keys because they have a business usage. "What is your customer ID?" "Please enter you Employee ID followed by the # sign." "Where is the payment for Vendor ID 12345?"

  • Surrogate Keys:

    The word "surrogate" means substitute or replacement. Surrogate keys are the "most" artificial keys of all. According to Wikipedia, surrogate keys are system-wide, system-generated, unique identifiers. They are never composed of other attributes (from other tables), and because they have no semantic meaning, there is no way check their validity. Surrogate keys are never exposed to the user and cannot (should not) be manipulated.

    Another source writes the following: Surrogate keys are often touted to be a means to database (DB) performance. A surrogate key is a single immutable, non-intelligent identifier used to substitute for the natural key, especially (but not only) a compound natural key. ... Surrogate keys are totally artificial and hidden, and are for system use.

13.1.3 Scheme of Key Characteristics

I used the information I gathered to develop a scheme of key characteristics:

Characteristics of Natural and Artificial (Sub) Keys
Key TypeSynonymsCharacteristics
AttributesCreationVerificationBusiness Use / User Visibility
Natural keyIntelligent keyMutable existing attributesHuman-developed, preceding the database (meaningful, familiar)Verifiable in realityBusiness use (visible to the user)
Constructed keyArtificial key (Celko), assigned code (Johansen)Immutable extra attributeHuman-developed for the purpose of the database (meaningful)Verifiable in itselfBusiness use (visible to the user)
Identity keyExposed physical locator (Celko), synthetic key (Johansen)Immutable extra attributeSystem-generated while inserted into the database (meaningless)Not verifiableBusiness use (visible to the user)
Surrogate keyHidden physical locatorImmutable extra attributeSystem-generated while inserted into the database (meaningless)Not verifiableNo business use (invisible to the user)

I have not used surrogate keys in practice yet, so there is nothing I could say about them. Thus, the discussions ahead omit surrogate keys.

13.1.4 Using Natural Keys

Natural keys prevent duplicate rows, because the attributes are part of the primary key. Any duplicate set of data will be rejected from being inserted by plain primary key logic (unique and not null). This dramatically increases the overall data quality compared to artificial keys. One of the simplest choices for natural keys are standards. Very often, other people have already solved a plethora of detail questions around keying.

With natural keys, you can create pretty intuitive and non-redundant data models. As a matter of fact, tables that use natural keys can access information residing in other tables without a join. This can sometimes be handy, but you will usually need some joins to access the full set of attributes. You must perform a join for all attributes, that are not part of the primary key (the non-primary key attributes, e.g. all optional attributes). Thus, you cannot always save on joins with natural keys. Joins with natural keys are comparably costly, especially on lower aggregation levels.

One of natural keys' disadvantages is the repetition of columns: because each referencing table's primary key has the same attributes as the referenced table, natural keys have quite some overhead storing the same data in those tables which reference others - especially in well-normalized designs. With natural keys, each table adds at least one attribute that is part of the primary key, which in turn are propagated to other referencing tables again. This results in an increasing number of columns, transitively passed on to each aggregation level. Natural keys effectively repeat columns to avoid the repetition of rows (duplicates). Decide for yourself.

Although it is true, that many natural (multi-column) primary key constellations will result in slower accesses than artificial (single-column) keys, the argument to generally avoid natural keys because of performance problems is not very elaborate. Natural keys are not slow because they are natural, but rather, because natural keys tend to involve many columns at some point, especially on lower aggregation levels. Artificial keys can, but do not generally have advantages over natural keys. After all, there are database optimization techniques that can cache joins. Performance concerns should be addressed after design.

You should not let performance concerns dictate your designs prematurely. I really do not see why the resulting data model should be affected by ease of use or the laziness of the developer to think about natural keys. Most of the time, you will be pre-optimizing portions of your databases, that are not time-critical anyway. The performance argument against natural keys diminishes even further with today's computers becoming faster and faster. As long as you do not maintain a high-traffic database server, the performance argument should not be overemphasized.

13.1.5 Using Identity Keys

You might run into situations, where using natural keys even becomes absolutely impossible. What if you cannot gather all the information needed to compose the primary key? I once modeled persons of which I at max knew the first and last names, but not any birth dates, street addresses, nor ZIP codes. How would anyone create a natural key when critical information is missing?

In other situations, you need to model inherently artificial entities for which it is impossible to determine a natural key at all. How do you want to find a natural key for something that is purely artificial? How do you naturally identify a specific forum post? Most computer and Internet stuff seems to be affected by this. Maybe this is the reason why so many people are using identity keys exclusively.

Inherently, natural keys must be mutable, because the primary keys contain regular attributes. According to this source key mutability has disadvantages (just replace surrogate keys with artificial or identity keys in the two quotes following):

Using Natural Keys as foreign key references causes a number of problems, notably performance for long and composite keys and join complexity for composite keys. It is difficult for applications to efficiently use a composite key of which most pure natural keys usually are. Also, what happens when the natural key changes, it will cause a very big transaction updating all those foreign key references, and the application will have very significant problems and possibly update the wrong data because you are referencing the natural key which may now have changed!

...

Surrogate keys are an answer to this performance and concurrency problem. They never change, never need updating and are consistent within the database and application, the application can very easily use them too. A surrogate key is using just an integer number or some people use guid's in a distributed environment. Auto-populating of the surrogate key is simple, you can use the IDENTITY property which incurrs no locking penalty unlike MAX( x ) + 1. Guid's present their own problem, they aren't easily readable so application debuging is difficult, also there generation is not sequential so you will get bad fragmentation on any index you use on them, and lets face it the surrogates will need indexing!

Identity keys are the common way most developers in the industry define their primary keys today. In fact, many people swear by using identity keys exclusively. They seem to believe, that using natural keys automatically results in bad performance. That is not true. For every declared primary key, whether single- or multi-column, an index will be created by the underlying DBMS - usually automatically. It is in fact the reason why indices were "invented": to speed up data access.

However, identity keys generate problems of their own. Often, duplicate data rows with differing IDs are inserted into the database, which effectively destroys one purpose of primary keys. Identity keys have the reputation to cause duplicate rows. While this is true in an unconstrained set of columns, there is an easy way to avoid this problem: you could avoid duplicates with artificial keys by declaring each column involved as not-null and also put a unique constraint upon the whole set of attributes that would otherwise comprise the natural key. This (astonishingly) little-known "trick" probably makes unique constraints one of the most neglected features of SQL.

However, identity keys have specific disadvantages. First, they add a meaningless, dissociated column to the rest of the columns. Identity keys expose the physical order in which a row was added to a table. The value is not an attribute of anything except the machine's past internal state. If you were hiding auto-increment attributes from the user, they would become Codd/Celko surrogate keys.

13.1.6 Using Constructed Keys

To find a compromise between natural and identity keys, you might want to use constructed keys every now and then. Constructed keys could be touted as semi-artificial keys. They combine the advantages of identity keys with the natural keys' meaningfulness. Constructed keys are basically single-column natural keys, or maybe naturalized keys. In a way, they are like SQL dates and times. Both are not truly atomic, because they actually consist of several pieces of information. If you use dates and times, you probably will not find any arguments against using constructed keys.

One of the biggest advantages of constructed keys is that they prevent multi-column natural keys, whose attributes would propagate to referencing tables. You frequently find such keys in practice, which are often backed by a standard. Examples are Unicode locale identifiers, ISBN numbers, and EANs. Constructed keys are also useful in tree hierarchies (like Unicode locales), where a single-column constructed key replaces the multi-column natural key. The foreign key to the parent would produce the double number of already many natural key columns. Constructed keys circumvent the propagation of many columns.

13.1.7 Using Surrogate Keys

TODO

13.1.8 Summary

I have to admit, that I used to belong to the group of "auto-ID developers", too. In the beginnings, I did not really know what I was doing. I used identity keys throughout my designs because everyone seemed to be doing this, until I realized, that having just another meaningless "ID" column in my otherwise well-normalized designs was redundant. Ever since I introduced natural keys and tried to reuse keys, my database designs became less cluttered with unnecessary ID columns.

Natural keys can be nice. You definitely have to put some brain into finding the right ones, but it is usually worth the effort. You will get a much better understanding of the data you are working with and it also makes others realize you know what you are doing.

I think, the best overall approach is to mix key types. Of course, mixing natural and artificial keys has the consequence of interleaving real-world attributes with truly ficticious ones. The difficulty is to determine when to use the one and when to use another. Even for more experienced database designers, this is probably the most brain-consuming issue.

My rule of thumb is: try to find natural keys first. If they consist of only two (or maybe three) columns, they are worthy candidates. Any keys involving more columns usually get rather complicated, especially when they are often referenced from other tables. On the lowest aggregation levels, I allow up to three- or four-column natural keys, because they are not referenced anyway. For other tables, you might want to look for alternative constructed keys. Often, standards already exist. If not, construct some yourself. If that does not make much sense, introduce identity keys, which replace the natural (foreign) key columns. Do this for tables which are often referenced to avoid the propagation of too many natural key columns. Note, that this is just a rule of thumb to get you started.

Some sources claim, logical models must be free of artificial keys. From a puristic point of view this could be demanded, as every (real-world) entity should be identifiable by its natural identifier. However, some entities do not have a natural key or it is impossible to find one, because the subject itself is inherently artificial. Often, it is simply too expensive to gather all the information needed to determine a natural key. There is one question arising though: if logical models are required to be complete, and you do not model artificial keys, how do you ever fulfill the completeness requirement? As a matter of fact, you cannot create natural keys from missing information.

After all, finding appropriate keys is a practical issue:

The relational model requires each table to have a key, but key selection is a pragmatic, not theoretical issue. The chapter on keys in PRACTICAL ISSUES IN DATABASE MANAGEMENT discusses the four pragmatic criteria for selecting keys and explains that often they are in conflict with one another and the optimal tradeoff between them varies by database. In the case of composite keys, familiarity can sometimes be traded off for simplicity and, therefore system-generated surrogate keys may be appropriate.

References for Natural vs. Artificial Keys:

Last updated: 2010-08-04