SQL Database Design » 13. Design Heuristics » 13.3 Predefined Values

13.3 Predefined Values

Table of Contents

Predefined values are classic global constants as well as static sets/lists of predefined values. Deciding for a global constants solution involves no real decision processes. Section 12.3 Constants Tables already offers a pretty good solution.

However, the situation is different for static sets/lists of predefined values. You have several implementation options to choose from, which have been introduced in previous sections. This section uses techniques from the sections 8.1 Domains, 8.3 Enumerated Types, 10.4 Check Constraints, and 12. Auxiliary Tables.

You can choose from two basic variants:

  • Use a column constraint
  • Use a separate table

The first group of variants is to constrain a column. This includes using portable IN check constraints and non-portable ENUMs (PostgreSQL, MySQL, etc.). Both are similar, although not the same. Your second option is to use a separate table, of which there are two basic sub variants to choose from: lookup and sequence tables.

13.3.1 Constraint-Based Solutions

Note, that even though I discuss IN check constraints and ENUMs in one breath, the latter have a specific downside as explained in section 10.4.1 Foreign Key Type Propagation. However, if you use the workaround from section 10.4.2 MySQL ENUM Propagation Workaround, you can treat MySQL ENUMs just like check constraints.

The DBMSs I have examined allow the following constraint-based variants:

  • IN check constraint (ordered, non-reusable, portable)
  • IN check constraint used in a domain (ordered, reusable, portable)
  • PostgreSQL ENUM (unordered, reusable, non-portable)
  • MySQL ENUM (unordered, non-reusable, non-portable)

The first two implement unordered set logic for nominally-scaled attributes. The latter two add order to represent lists/arrays for ordinally-scaled attributes.

PostgreSQL allows all variants, that is IN check constraints (with or without domain) and (reusable) ENUMs. On MySQL, you have no other alternative but to constrain string types by using non-reusable ENUMs. Oracle and SQL Server only support non-reusable IN check constraints.

The discussion on whether or not to use constraint-based variants largely stems from the MySQL community. There has been great disagreement on that topic. As a matter of fact, you can perfectly live without them, but the question cannot be answered that easily. Because MySQL only has ENUMs and no check constraints, decisions only have to be made between ENUM and separate tables. The principles for ENUM however also apply to standard SQL IN check constraints.

The grand advantage of using IN check constraints and ENUMs is that they save you an extra table in the design, plus respective foreign keys and joins. Furthermore, constraint-based types offer high transparency on which column values are valid. They also serve documentary purposes, e.g. in visual database diagrams.

Constraint-based solutions have a few implications, which are not obvious. An inherent problem with them is, that the values a column can take are inside the type declaration. This practice is often associated with the words "maintenance nightmare", but that is not true in general. However, the maintenance nightmare only applies, when repeating IN checks or ENUMs. Look for a table-based solution if reusability is an issue.

Many developers also use constraint-based implementations on values that often change. One of the premises that predefined values make, is that the values are fixed in number and are known up front (ideally). If your data does not comply to this, you should define a separate table instead.

One last disadvantage of constraint-based solutions is, that there is no way of inheriting values. You might need predefined values for male and female person genders, but you cannot reuse these values for male, female, and mixed team genders. In a table design, inheritance could be done with a sub table.

13.3.2 Table-Based Solutions

Some design constellations make separate table inevitable: either the number of values is very large, changes often, or the final values are not known (which correlates with the previous property). Also, whenever your values actually involve more than one attribute, there is no alternative but to use a separate table. To add order, you can store an integer along with the actual values and make it a sequence table.

All DBMSs will allow the following table-based solutions:

  • Lookup table (unordered, reusable, portable)
  • Sequence table (ordered, reusable, portable)

Be aware, that using an extra table can become very unwieldy at some point, as the complexity of your database design quickly increases. Designs consistently using extra tables become rather bloated for data that is rather simple. Furthermore, separate tables force you into additional foreign key constraints and joins.

13.3.3 Decision Matrix

The previous two section presented a total of six different implementation options. To decide for an appropriate implementation, you have to weigh certain factors against each other. The selection depends on the following criteria:

  • DBMS
  • Scale
  • Reusability
  • Portability
  • Extensibility (performance of update)
  • Performance (selects)? TODO

Your DBMS is the first criterion. Some techniques might not apply, so some of them are not at your disposal. The scale of the values is determined by the values' nature (as just described). The reusability criterion depends on the concrete design. You have to know if you have to reference the predefined values from more than one table or not.

Portability is always an issue, but it is rather average in priority. Extensibility is a low-priority and rather a performance factor when changing the element set/list. The number of elements is the current state and is not a criterion by itself. It is rather related to the extensibility criterion (change frequency). There are probably more factors, but I found these to be the relevant ones for making a decision.

Here is the decision matrix for predefined values, based on the two major criteria scale and reusability:

Decision Matrix for Predefined Values
Unordered Ordered
Non-reusable
  • IN check constraint (portable)
  • MySQL ENUM (non-portable)
Reusable
  • Domain with IN check constraint* (portable)
  • Lookup table (portable)
  • PostgreSQL ENUM (non-portable)
  • Sequence table (portable)

*) Domains are only implemented by very few DBMSs today. Please refer to section 8.1 Domains for more information.

You should first check the scale of your static values. In most cases you do not have ordinally-scaled attributes at hand and do not need any order. Unordered set logic usually suffices. In a second step, check if you need to reference the predefined values from more than one table. If so, aim for a reusable solution. Note, that portability is always an issue, so you should always prefer portable solutions over non-portable ones.

If you want change the elements every now and then, a lookup table solution requires DML INSERT or DELETE statements, while a check constraint or ENUMs requires DDL ALTER TABLE statements, which are usually a lot more expensive. Each time you have to make changes to the list of possible enum values, the DBMS has to perform data integrity checks for each row.

13.3.4 Value Naming

A general question arising for predefined values is about the values themselves: shall you use codes, abbreviations or full names? It's hard to say. The rule here probably is: be as concise as possible without sacrificing ease of comprehension. If the full words aren't too long, use them, abbreviate them otherwise. Just don't use abbrevations that no one understands. If there are generally accepted codes in the real world's environment, use them.

The important thing is: there will always be people who look at the database and do not immediately understand every aspect. While overly wordy names are usually unnecessary, cryptic ones usually just hide information. Note, that using the long names, e.g. final names as appearing on a graphical user interface, is advised against. The only reason why you would use these is because you do not want to translate codes to displayed text, such as in a dropdown box. View issues should not affect database design.

But what about language? As with string booleans, your enum values must be written in some human-readable language. This is why abbreviations and generally accepted codes are better than lengthy names, too. Usually, codes that stem from international standards are defined in a rather language neutral way. Besides, you save time in not having to think about naming conventions.

Last updated: 2010-08-27