SQL Database Design » 14. Normalization » 14.2 Misconception of Entity Types

14.2 Misconception of Entity Types

There is general problem with collective and structured (object-relational) types and the relational model concerning normal forms (NFs). Tables that have repeating attributes or use collective or structured types, are non-normalized. After some digging, everyone seems to agree with collective types, but the database community seems to be undecided upon structured types, which - as I will explain - is not quite correct.

A frequent task is how to model one-to-many relationships with a concrete cardinality of two (on the many side of course). Common examples are customers and suppliers (both companies) or home and away teams. The often used implementation approach is to use two separate columns, each one being a foreign key to the same table:

CREATE TABLE Games
(
  ...
  home_team_id INTEGER NOT NULL REFERENCES Teams (team_id),
  away_team_id INTEGER NOT NULL REFERENCES Teams (team_id),
  ...
)

Such implementations circumvent the need to ensure a one-to-many is exactly one-to-two and the column names clearly express the role of the reference. At first, there is not much to complain about this approach of two distinct columns. However, it has two problems: first, using such double foreign key attributes in joins requires you to have two join conditions in your SELECTs, which need to be passed on for every additional joined table (because you need two self-joins). Second, one-to-two should be implemented as a regular one-to-many relationship with another entity type (which also avoids the transitive self-join condition problem):

CREATE TABLE Games
(
  id INTEGER PRIMARY KEY,
  ...
)

CREATE TABLE Scores
(
  game_id     INTEGER  NOT NULL REFERENCES Games (id),
  is_home     BOOLEAN  NOT NULL,    -- make sure, only two scores per game can exist
  final_score SMALLINT,
  PRIMARY KEY (game_id, is_home),
  ...
)

The above is a perfect example of mixing artificial and natural columns into a mixed-type primary key. Using a separate table plus a one-to-many relationship to implement the repeating attributes, you would have to add one attribute that can only take exactly two values (booleans, checked integers, or enums) and add it to the primary key. The extra column also contains the semantics (role) via its name. Pascal explains the problem of repeating attributes:

Tables with multivalued columns represent mixtures of entity types (more precisely, their rows represent propositions about entities of multiple types) ... . Some deficiencies of such designs are obvious:

  • Possibility that the preset number of times the group repeats (here, five) will not be sufficient (if an employee has six children)

  • Waste of storage space for employees who have fewer than five or no children

This is really obvious: for each excess child, you cannot store the data and for every missing child, you are storing too much data. The maximum number of repeating attributes will usually be capped at some appropriate maximum, which, if complete data is needed, can be comparably high, and so will the overhead be. Furthermore, you get a pretty cluttered table with respective results sets when displayed. There are even more issues with collective types:

But if a DBMS has explicit support of multivalued columns, additional, less obvious complications will arise ..:

  • More complex addressing scheme. The combination 'table name + column name + key value' will no longer be unique for every value and, thus, would not be sufficient to address logically each and every database value.

  • Increase in the number of necessary data operations. For single-valued columns, at least four operations must be supported by the data language: insert, retrieve, update, and delete. But with multivalued columns, there must be two versions of these operations, one for single-valued and another for multivalued columns, doubling the number of operations.

  • Increased operation complexity. Operations on multivalued columns must rely on the order of values and order-based operations are notoriously complex and prone to errors. (For the reader familiar with arrays, the addressing of values within the column and, thus, data operations, must rely on subscripts.)

  • More complex queries. Some queries are difficult to express, for example, "Who are the employees who have two children, Steve and Roberta?"

  • Lack of formal design guidelines. "if the system supports repeating groups, we now have multiple ways of representing the same information (at the very least we can do it with and without repeating groups; there are probably several variants with repeating groups). Which design will we follow? How do we design? Is there any scientific basis (like the principles of further normalization) for making such decisions?" ..

The most important aspect is, that multi-valued attributes actually represent distinct entity types. They indicate you are doing something wrong and should be using a separate table instead. As Pascal puts it: Multivalued columns produce complications and provide no benefits. In fact, "There is nothing that can be represented logically with repeating groups that cannot also be represented without them". ...

Matthiessen and Unterstein state (translated from German): Repeated attributes have practical problems. Imagine a bus line management database, which handles a maximum of 25 hardcoded stops per line: a query in the form "Which bus lines stop at Central Station?" had to be formulated like "For which bus line does stop_01 = 'Central Station' or stop_02 = 'Central Station' or ... stop_25 = 'Central Station'?". All 25 stops had to be coded manually. Queries like "Are there any changeovers from line five to line twelve?" would even result in a query with 25 * 25 = 625 (!) of these terms: "Line_05.stop_01 = Line_12.stop_01 or Line_05.stop_01 = Line_12.stop_02 ... or Line_05.stop_25 = Line_12.stop_24 or Line_05.stop_25 = Line_12.stop_25". Repeating groups get you the past the design stage, but not any further. Avoid these kind of designs.

Normalization, or rather the normal forms (NF), help you avoid the described problems. Normal forms go up to the fifth level, but I will only discuss the first three normal forms. The first normal form can be described as the "non-repeating attributes normal form", while the second and third normal forms deal with dependencies between columns of a single table and are a littler harder to understand. The normal forms form a hierarchy: to reach the next normal form, the previous normal forms' requirements must be met.

Last updated: 2010-08-04