recursive - storing tree data in sql




What is the most efficient/elegant way to parse a flat table into a tree? (10)

As of Oracle 9i, you can use CONNECT BY.

SELECT LPAD(' ', (LEVEL - 1) * 4) || "Name" AS "Name"
FROM (SELECT * FROM TMP_NODE ORDER BY "Order")
CONNECT BY PRIOR "Id" = "ParentId"
START WITH "Id" IN (SELECT "Id" FROM TMP_NODE WHERE "ParentId" = 0)

As of SQL Server 2005, you can use a recursive common table expression (CTE).

WITH [NodeList] (
  [Id]
  , [ParentId]
  , [Level]
  , [Order]
) AS (
  SELECT [Node].[Id]
    , [Node].[ParentId]
    , 0 AS [Level]
    , CONVERT([varchar](MAX), [Node].[Order]) AS [Order]
  FROM [Node]
  WHERE [Node].[ParentId] = 0
  UNION ALL
  SELECT [Node].[Id]
    , [Node].[ParentId]
    , [NodeList].[Level] + 1 AS [Level]
    , [NodeList].[Order] + '|'
      + CONVERT([varchar](MAX), [Node].[Order]) AS [Order]
  FROM [Node]
    INNER JOIN [NodeList] ON [NodeList].[Id] = [Node].[ParentId]
) SELECT REPLICATE(' ', [NodeList].[Level] * 4) + [Node].[Name] AS [Name]
FROM [Node]
  INNER JOIN [NodeList] ON [NodeList].[Id] = [Node].[Id]
ORDER BY [NodeList].[Order]

Both will output the following results.

Name
'Node 1'
'    Node 1.1'
'        Node 1.1.1'
'    Node 1.2'
'Node 2'
'    Node 2.1'

Assume you have a flat table that stores an ordered tree hierarchy:

Id   Name         ParentId   Order
 1   'Node 1'            0      10
 2   'Node 1.1'          1      10
 3   'Node 2'            0      20
 4   'Node 1.1.1'        2      10
 5   'Node 2.1'          3      10
 6   'Node 1.2'          1      20

Here's a diagram, where we have [id] Name. Root node 0 is fictional.

                       [0] ROOT
                          /    \ 
              [1] Node 1          [3] Node 2
              /       \                   \
    [2] Node 1.1     [6] Node 1.2      [5] Node 2.1
          /          
 [4] Node 1.1.1

What minimalistic approach would you use to output that to HTML (or text, for that matter) as a correctly ordered, correctly indented tree?

Assume further you only have basic data structures (arrays and hashmaps), no fancy objects with parent/children references, no ORM, no framework, just your two hands. The table is represented as a result set, which can be accessed randomly.

Pseudo code or plain English is okay, this is purely a conceptional question.

Bonus question: Is there a fundamentally better way to store a tree structure like this in a RDBMS?


EDITS AND ADDITIONS

To answer one commenter's (Mark Bessey's) question: A root node is not necessary, because it is never going to be displayed anyway. ParentId = 0 is the convention to express "these are top level". The Order column defines how nodes with the same parent are going to be sorted.

The "result set" I spoke of can be pictured as an array of hashmaps (to stay in that terminology). For my example was meant to be already there. Some answers go the extra mile and construct it first, but thats okay.

The tree can be arbitrarily deep. Each node can have N children. I did not exactly have a "millions of entries" tree in mind, though.

Don't mistake my choice of node naming ('Node 1.1.1') for something to rely on. The nodes could equally well be called 'Frank' or 'Bob', no naming structure is implied, this was merely to make it readable.

I have posted my own solution so you guys can pull it to pieces.


Assuming that you know that the root elements are zero, here's the pseudocode to output to text:

function PrintLevel (int curr, int level)
    //print the indents
    for (i=1; i<=level; i++)
        print a tab
    print curr \n;
    for each child in the table with a parent of curr
        PrintLevel (child, level+1)


for each elementID where the parentid is zero
    PrintLevel(elementID, 0)

If elements are in tree order, as shown in your example, you can use something like the following Python example:

delimiter = '.'
stack = []
for item in items:
  while stack and not item.startswith(stack[-1]+delimiter):
    print "</div>"
    stack.pop()
  print "<div>"
  print item
  stack.append(item)

What this does is maintain a stack representing the current position in the tree. For each element in the table, it pops stack elements (closing the matching divs) until it finds the parent of the current item. Then it outputs the start of that node and pushes it to the stack.

If you want to output the tree using indenting rather than nested elements, you can simply skip the print statements to print the divs, and print a number of spaces equal to some multiple of the size of the stack before each item. For example, in Python:

print "  " * len(stack)

You could also easily use this method to construct a set of nested lists or dictionaries.

Edit: I see from your clarification that the names were not intended to be node paths. That suggests an alternate approach:

idx = {}
idx[0] = []
for node in results:
  child_list = []
  idx[node.Id] = child_list
  idx[node.ParentId].append((node, child_list))

This constructs a tree of arrays of tuples(!). idx[0] represents the root(s) of the tree. Each element in an array is a 2-tuple consisting of the node itself and a list of all its children. Once constructed, you can hold on to idx[0] and discard idx, unless you want to access nodes by their ID.


If nested hash maps or arrays can be created, then I can simply go down the table from the beginning and add each item to the nested array. I must trace each line to the root node in order to know which level in the nested array to insert into. I can employ memoization so that I do not need to look up the same parent over and over again.

Edit: I would read the entire table into an array first, so it will not query the DB repeatedly. Of course this won't be practical if your table is very large.

After the structure is built, I must do a depth first traverse through it and print out the HTML.

There's no better fundamental way to store this information using one table (I could be wrong though ;), and would love to see a better solution ). However, if you create a scheme to employ dynamically created db tables, then you opened up a whole new world at the sacrifice of simplicity, and the risk of SQL hell ;).


It's a quite old question, but as it's got many views I think it's worth to present an alternative, and in my opinion very elegant, solution.

In order to read a tree structure you can use recursive Common Table Expressions (CTEs). It gives a possibility to fetch whole tree structure at once, have the information about the level of the node, its parent node and order within children of the parent node.

Let me show you how this would work in PostgreSQL 9.1.

  1. Create a structure

    CREATE TABLE tree (
        id int  NOT NULL,
        name varchar(32)  NOT NULL,
        parent_id int  NULL,
        node_order int  NOT NULL,
        CONSTRAINT tree_pk PRIMARY KEY (id),
        CONSTRAINT tree_tree_fk FOREIGN KEY (parent_id) 
          REFERENCES tree (id) NOT DEFERRABLE
    );
    
    
    insert into tree values
      (0, 'ROOT', NULL, 0),
      (1, 'Node 1', 0, 10),
      (2, 'Node 1.1', 1, 10),
      (3, 'Node 2', 0, 20),
      (4, 'Node 1.1.1', 2, 10),
      (5, 'Node 2.1', 3, 10),
      (6, 'Node 1.2', 1, 20);
    
  2. Write a query

    WITH RECURSIVE 
    tree_search (id, name, level, parent_id, node_order) AS (
      SELECT 
        id, 
        name,
        0,
        parent_id, 
        1 
      FROM tree
      WHERE parent_id is NULL
    
      UNION ALL 
      SELECT 
        t.id, 
        t.name,
        ts.level + 1, 
        ts.id, 
        t.node_order 
      FROM tree t, tree_search ts 
      WHERE t.parent_id = ts.id 
    ) 
    SELECT * FROM tree_search 
    WHERE level > 0 
    ORDER BY level, parent_id, node_order;
    

    Here are the results:

     id |    name    | level | parent_id | node_order 
    ----+------------+-------+-----------+------------
      1 | Node 1     |     1 |         0 |         10
      3 | Node 2     |     1 |         0 |         20
      2 | Node 1.1   |     2 |         1 |         10
      6 | Node 1.2   |     2 |         1 |         20
      5 | Node 2.1   |     2 |         3 |         10
      4 | Node 1.1.1 |     3 |         2 |         10
    (6 rows)
    

    The tree nodes are ordered by a level of depth. In the final output we would present them in the subsequent lines.

    For each level, they are ordered by parent_id and node_order within the parent. This tells us how to present them in the output - link node to the parent in this order.

    Having such a structure it wouldn't be difficult to make a really nice presentation in HTML.

    Recursive CTEs are available in PostgreSQL, IBM DB2, MS SQL Server and Oracle.

    If you'd like to read more on recursive SQL queries, you can either check the documentation of your favourite DBMS or read my two articles covering this topic:


Now that MySQL 8.0 is nearing release, all popular SQL databases will support recursive queries in standard syntax.

WITH RECURSIVE MyTree AS (
    SELECT * FROM MyTable WHERE ParentId IS NULL
    UNION ALL
    SELECT m.* FROM MyTABLE AS m JOIN MyTree AS t ON m.ParentId = t.Id
)
SELECT * FROM MyTree;

I tested recursive queries in MySQL 8.0 in my presentation Recursive Query Throwdown in 2017.

Below is my original answer from 2008:


There are several ways to store tree-structured data in a relational database. What you show in your example uses two methods:

  • Adjacency List (the "parent" column) and
  • Path Enumeration (the dotted-numbers in your name column).

Another solution is called Nested Sets, and it can be stored in the same table too. Read "Trees and Hierarchies in SQL for Smarties" by Joe Celko for a lot more information on these designs.

I usually prefer a design called Closure Table (aka "Adjacency Relation") for storing tree-structured data. It requires another table, but then querying trees is pretty easy.

I cover Closure Table in my presentation Models for Hierarchical Data with SQL and PHP and in my book SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

CREATE TABLE ClosureTable (
  ancestor_id   INT NOT NULL REFERENCES FlatTable(id),
  descendant_id INT NOT NULL REFERENCES FlatTable(id),
  PRIMARY KEY (ancestor_id, descendant_id)
);

Store all paths in the Closure Table, where there is a direct ancestry from one node to another. Include a row for each node to reference itself. For example, using the data set you showed in your question:

INSERT INTO ClosureTable (ancestor_id, descendant_id) VALUES
  (1,1), (1,2), (1,4), (1,6),
  (2,2), (2,4),
  (3,3), (3,5),
  (4,4),
  (5,5),
  (6,6);

Now you can get a tree starting at node 1 like this:

SELECT f.* 
FROM FlatTable f 
  JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1;

The output (in MySQL client) looks like the following:

+----+
| id |
+----+
|  1 | 
|  2 | 
|  4 | 
|  6 | 
+----+

In other words, nodes 3 and 5 are excluded, because they're part of a separate hierarchy, not descending from node 1.


Re: comment from e-satis about immediate children (or immediate parent). You can add a "path_length" column to the ClosureTable to make it easier to query specifically for an immediate child or parent (or any other distance).

INSERT INTO ClosureTable (ancestor_id, descendant_id, path_length) VALUES
  (1,1,0), (1,2,1), (1,4,2), (1,6,1),
  (2,2,0), (2,4,1),
  (3,3,0), (3,5,1),
  (4,4,0),
  (5,5,0),
  (6,6,0);

Then you can add a term in your search for querying the immediate children of a given node. These are descendants whose path_length is 1.

SELECT f.* 
FROM FlatTable f 
  JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
  AND path_length = 1;

+----+
| id |
+----+
|  2 | 
|  6 | 
+----+

Re comment from @ashraf: "How about sorting the whole tree [by name]?"

Here's an example query to return all nodes that are descendants of node 1, join them to the FlatTable that contains other node attributes such as name, and sort by the name.

SELECT f.name
FROM FlatTable f 
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
ORDER BY f.name;

Re comment from @Nate:

SELECT f.name, GROUP_CONCAT(b.ancestor_id order by b.path_length desc) AS breadcrumbs
FROM FlatTable f 
JOIN ClosureTable a ON (f.id = a.descendant_id) 
JOIN ClosureTable b ON (b.descendant_id = a.descendant_id) 
WHERE a.ancestor_id = 1 
GROUP BY a.descendant_id 
ORDER BY f.name

+------------+-------------+
| name       | breadcrumbs |
+------------+-------------+
| Node 1     | 1           |
| Node 1.1   | 1,2         |
| Node 1.1.1 | 1,2,4       |
| Node 1.2   | 1,6         |
+------------+-------------+

A user suggested an edit today. SO moderators approved the edit, but I am reversing it.

The edit suggested that the ORDER BY in the last query above should be ORDER BY b.path_length, f.name, presumably to make sure the ordering matches the hierarchy. But this doesn't work, because it would order "Node 1.1.1" after "Node 1.2".

If you want the ordering to match the hierarchy in a sensible way, that is possible, but not simply by ordering by the path length. For example, see my answer to MySQL Closure Table hierarchical database - How to pull information out in the correct order.


Think about using nosql tools like neo4j for hierarchial structures. e.g a networked application like linkedin uses couchbase (another nosql solution)

But use nosql only for data-mart level queries and not to store / maintain transactions


This was written quickly, and is neither pretty nor efficient (plus it autoboxes alot, converting between int and Integer is annoying!), but it works.

It probably breaks the rules since I'm creating my own objects but hey I'm doing this as a diversion from real work :)

This also assumes that the resultSet/table is completely read into some sort of structure before you start building Nodes, which wouldn't be the best solution if you have hundreds of thousands of rows.

public class Node {

    private Node parent = null;

    private List<Node> children;

    private String name;

    private int id = -1;

    public Node(Node parent, int id, String name) {
        this.parent = parent;
        this.children = new ArrayList<Node>();
        this.name = name;
        this.id = id;
    }

    public int getId() {
        return this.id;
    }

    public String getName() {
        return this.name;
    }

    public void addChild(Node child) {
        children.add(child);
    }

    public List<Node> getChildren() {
        return children;
    }

    public boolean isRoot() {
        return (this.parent == null);
    }

    @Override
    public String toString() {
        return "id=" + id + ", name=" + name + ", parent=" + parent;
    }
}

public class NodeBuilder {

    public static Node build(List<Map<String, String>> input) {

        // maps id of a node to it's Node object
        Map<Integer, Node> nodeMap = new HashMap<Integer, Node>();

        // maps id of a node to the id of it's parent
        Map<Integer, Integer> childParentMap = new HashMap<Integer, Integer>();

        // create special 'root' Node with id=0
        Node root = new Node(null, 0, "root");
        nodeMap.put(root.getId(), root);

        // iterate thru the input
        for (Map<String, String> map : input) {

            // expect each Map to have keys for "id", "name", "parent" ... a
            // real implementation would read from a SQL object or resultset
            int id = Integer.parseInt(map.get("id"));
            String name = map.get("name");
            int parent = Integer.parseInt(map.get("parent"));

            Node node = new Node(null, id, name);
            nodeMap.put(id, node);

            childParentMap.put(id, parent);
        }

        // now that each Node is created, setup the child-parent relationships
        for (Map.Entry<Integer, Integer> entry : childParentMap.entrySet()) {
            int nodeId = entry.getKey();
            int parentId = entry.getValue();

            Node child = nodeMap.get(nodeId);
            Node parent = nodeMap.get(parentId);
            parent.addChild(child);
        }

        return root;
    }
}

public class NodePrinter {

    static void printRootNode(Node root) {
        printNodes(root, 0);
    }

    static void printNodes(Node node, int indentLevel) {

        printNode(node, indentLevel);
        // recurse
        for (Node child : node.getChildren()) {
            printNodes(child, indentLevel + 1);
        }
    }

    static void printNode(Node node, int indentLevel) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < indentLevel; i++) {
            sb.append("\t");
        }
        sb.append(node);

        System.out.println(sb.toString());
    }

    public static void main(String[] args) {

        // setup dummy data
        List<Map<String, String>> resultSet = new ArrayList<Map<String, String>>();
        resultSet.add(newMap("1", "Node 1", "0"));
        resultSet.add(newMap("2", "Node 1.1", "1"));
        resultSet.add(newMap("3", "Node 2", "0"));
        resultSet.add(newMap("4", "Node 1.1.1", "2"));
        resultSet.add(newMap("5", "Node 2.1", "3"));
        resultSet.add(newMap("6", "Node 1.2", "1"));

        Node root = NodeBuilder.build(resultSet);
        printRootNode(root);

    }

    //convenience method for creating our dummy data
    private static Map<String, String> newMap(String id, String name, String parentId) {
        Map<String, String> row = new HashMap<String, String>();
        row.put("id", id);
        row.put("name", name);
        row.put("parent", parentId);
        return row;
    }
}

Well given the choice, I'd be using objects. I'd create an object for each record where each object has a children collection and store them all in an assoc array (/hashtable) where the Id is the key. And blitz through the collection once, adding the children to the relevant children fields. Simple.

But because you're being no fun by restricting use of some good OOP, I'd probably iterate based on:

function PrintLine(int pID, int level)
    foreach record where ParentID == pID
        print level*tabs + record-data
        PrintLine(record.ID, level + 1)

PrintLine(0, 0)

Edit: this is similar to a couple of other entries, but I think it's slightly cleaner. One thing I'll add: this is extremely SQL-intensive. It's nasty. If you have the choice, go the OOP route.


You can emulate any other data structure with a hashmap, so that's not a terrible limitation. Scanning from the top to the bottom, you create a hashmap for each row of the database, with an entry for each column. Add each of these hashmaps to a "master" hashmap, keyed on the id. If any node has a "parent" that you haven't seen yet, create an placeholder entry for it in the master hashmap, and fill it in when you see the actual node.

To print it out, do a simple depth-first pass through the data, keeping track of indent level along the way. You can make this easier by keeping a "children" entry for each row, and populating it as you scan the data.

As for whether there's a "better" way to store a tree in a database, that depends on how you're going to use the data. I've seen systems that had a known maximum depth that used a different table for each level in the hierarchy. That makes a lot of sense if the levels in the tree aren't quite equivalent after all (top level categories being different than the leaves).





hierarchical-data