15.1 Trees

Table of Contents

While foreign keys define tree-like structures between differing entity types all over the database, first and foremost aggregations, compositions, and inheritance relationships, this section solely deals with recursive tree structures. The difference between "common" relationships and recursive tree structures is, that, at some point, one entity will reference another from the same or a super entity type. Examples of classic tree structures are family trees, organizational diagrams, car-part compositions, or league hierarchies.

The trees in this section are made up of entities that have relationships with entities of the same entity type/set or one of their super entity types/sets. They are recursive "type-wise". Tree structures where children have more than one direct parent (networks) or where entities can become direct or indirect parents of themselves (cyclic), that is, a child or sub child of one entity is also its direct or indirect parent, are outside the scope of this section.

To make a long story short, a tree in this article's sense is a finite, directed, acyclic graph. That means, a node cannot have several parents (no network), it does not have an infinite number of elements (finite), you cannot promote any element to be the root (directed), and the data structure does not contain cycles (acyclic).

The problem with tree structures is not really how to model them, but rather, how it affects the implementation of the queries. For the simplest and most straightforward case, you simply define a table that keeps a reference to itself or to a super table. The real problem you will encounter occurrs at a later point of time: how do you query a tree structure for the information you need?

The query results depend on how you implement your tree data structure(s), which again depends on what you need to query from such a tree. To make any use of tree structures, there are certain limitations to differing approaches. There are some things that cannot be done in pre-SQL:1999 DBMSs. Only some of the newer, SQL:1999-compatible DBMSs make some of the things possible, that could not be achieved before.

There are several basic methods to model tree structures:

  • Adjacency lists: one table with references to itself or a super table
  • Nested sets: one table with simulated references to itself or a super table

There are some common, recurring questions regularly involved with tree structures:

  1. Which node is the parent?
  2. Which nodes are the children?
  3. At which level is the current node?
  4. How do I query the whole/sub tree?
  5. How do I find a path to a node from the root?
  6. How do I find all root, branch, and/or leaf nodes?

Depending on what you need and what your DBMS offers, there are implementation alternatives that work well, work not so well, or do not work at all.

15.1.1 Adjacency Lists

The adjacency list model is the simplest and most straightforward way of implementing tree structures, but it is also the most restrictive one. You declare a hierarchy of sports leagues like:

CREATE TABLE Leagues
(
  id        INTEGER     PRIMARY KEY,
  name      VARCHAR(20) NOT NULL,
  ...
  parent_id INTEGER REFERENCES Leagues (id),
  ...
)

Classic relational algebra offers no support for recursive queries. For example, it is not possible per se to query for all ancestors of a person (finding a path to a node), when these are all saved in a single relation with one attribute being a reference to itself (the parent). The ancestors could only be determined by a series of queries, for which you have to know the maximum number of hierarchy levels up front, because you need to hardcode that number of self joins. The latter is the reason why the adjacency list model is pretty bad on displaying the whole tree, finding the path to a node, and determining the level of a node.

Pascal summarizes the adjacency model as following: Simple hierarchies are frequently represented by just one table. But such tables have at least one row - representing the root node - that is different from the others, an indication that not all entities represented by the table's rows are of only one type. Any design that "bundles" multiple entity types into one table produces complications and the need for special integrity constraints, but in the case of tree structures, such lumping also causes "numerous database traversal problems".

According to some sources, SQL:1999 introduced an extended relational algebra, which permits an operation to perform recursive queries. To quote Matthiessen (translated):

Non-recursive queries allow us to express everything that is possible with first-order predicate logic, but not anything that can be calculated. When considering a database of persons, which, if known, have a foreign key to their father and mother, we can express relations like "a is grand father of b", "a is grand grand father of b", but also "a is uncle of b". We can also express "a is grand grand grand father of b" or "a is grand grand grand grand father of b" (for every level the join must be expanded with another relation). We cannot express more fuzzy issues such as "a is an ancestor of b", although this can be calculated efficiently. Because the number of levels (of fathers) is not limited, the number of relations in a join is not, too, but there are no unlimited queries that exist. By introducing recursion, we could express just that. ...

PostgreSQL up to 8.3, MySQL, and SQL Server up to version 2000 do not support recursive queries. PostgreSQL 8.4+ and SQL Server 2005+ support them via the WITH keyword. Oracle uses its proprietary CONNECT BY syntax. Of those DBMS I have not examined, DB2 seems to support Oracle-like syntax and Sybase's Transact-SQL also supports the WITH syntax. It will definitely take a while before recursive queries becomes widely adopted, especially with smaller DBMSs.

Because of the problems described (as long as the adjacency model does not suffice), you must use a different type of tree implementation if recursive queries are not supported by your DBMSs. There are several approaches, like using separate tables for nodes and links, but there is another widely-used approach.

15.1.2 Nested Sets

With the nested set model, you simply store two integral boundary numbers per entity, usually called "left" and "right". These numbers mark the range of values which sub entities must fall inbetween. References to parents are denoted just by these numbers. By nature, you have ensure that the numbers are correct, otherwise the structure will fall apart. This means, when inserting or deleting rows from the table, you have to adjust the numbering of all parent rows.

The left and right bounds effectively replace the foreign key structure and puts the same value constraints upon all nodes: there are no nodes (rows) that are missing values, like the root node having a null parent with adjacency lists. Thus, no special treatment is required. The relationships between nodes can be queried from the table, here the left and right bounds. For each leaf, the lower bound is exactly one less than the upper bound. The number of total descendants is (upper - (lower + 1)) / 2.

The advantage of the nested set model is, that, whereas you have slightly more complicated queries for the "very easy stuff", you can more easily perform queries for more complex or impossible tasks in contrast to the classic adjancency list model. Remember, querying for the whole tree or a path actually cannot be done there without hardcoding the maximum number of hierarchy levels. The nested set model gives you a 100 percent flexibility on this issue, because the only thing you do is compare numbers (ranges).

15.1.3 Summary

The mixture of DBMS and implementation approach results in more or less complicated queries, some of which cannot be performed at all. The consequences are summarized in the following table, so you can decide for yourself which approach to follow.

Common Queries and Tree Implementation Strategies
QuestionAdjacency ListsNested Sets
Without Recursive QueriesWith Recursive Queries
Which node is the parent? ++
Which nodes are the children of anothjer node? ++
How do I find the root node? ++
How do I find all branch nodes? ++
How do I find all leaf nodes? ++
How do I query the full tree? - (fixed maximum)?+
How do I find the path to a node? - (fixed maximum)?+
How do I find the level at which a node is in the tree? ? (fixed maximum)?o
How do I query a sub tree? ? (fixed maximum)?o
How do I find the level at which a node is in the sub tree?? (fixed maximum)?o
...

The advantage of the nested set model is, that it does not change the table design, unlike other (uninvestigated) approaches. This enables you to use adjacency list and the nested set models at the same time. Design-wise, the classic approach has the consequence of adding a nullable self-reference to the table, while the nested set model adds two non-null integral numbers. Because this document is solely about design, I will leave further explanation to those people who have more experience than me querying data from trees.

As a final note for MySQL users, you usually have a problem: while newer versions of other DBMSs support SQL:1999 recursive queries, MySQL does not. Perspectively, using the extended SQL:1999 syntax looks like the future-safe way to go. However, MySQL appears to be far away from that, so you either have to fall back to the nested set model, which might require you to reimplement your queries when MySQL supports recursive queries or you have to use another DBMS.

References for Trees:

Last updated: 2010-08-04