Reprinted with Permission by Quest Software June 2006


Hierarchical Queries
by Howard J. Rogers, www.dizwell.com

I heard that it is possible to write something called a ‘hierarchical query’. Can you tell me what one of those is, and how and why I might do it with Oracle?

Sure. A hierarchical query allows you to trace or ‘walk’ the relationship between rows in the same table -which, as an explanation, is pretty accurate but probably doesn’t enlighten you very much! So let me cite a specific example in the hope that the specifics will explain things a bit better!

Consider Scott’s EMP table:


SQL> select empno, ename, mgr from scott.emp;

     EMPNO ENAME             MGR
---------- ---------- ----------
      7369 SMITH            7902
      7499 ALLEN            7698
      7521 WARD             7698
      7566 JONES            7839
      7654 MARTIN           7698
      7698 BLAKE            7839
      7782 CLARK            7839
      7788 SCOTT            7566
      7839 KING
      7844 TURNER           7698
      7876 ADAMS            7788
      7900 JAMES            7698
      7902 FORD             7566
      7934 MILLER           7782

Each employee record tells us who that employee’s manager is. Using this information, I can deduce that SMITH reports to FORD, that FORD in his turn reports to JONES, and finally that JONES reports to KING -who has no manager to report to, and must therefore presumably be the ultimate boss of everyone. In a similar fashion, I can see that ALLEN reports to BLAKE, and BLAKE reports directly to KING. And finally, I see that SCOTT reports to JONES, and JONES (as we’ve already seen) reports to KING. And so on. In other words, rows in this EMP table are related to each other because the MGR column for one row usually points us to an EMPNO for an another employee. A hierarchical query will ‘discover’ these relationships for you.

OK... so how would you go about querying the EMP table to display that sort of ‘reports to’ relationship you’ve just described?

Relatively easily, in fact. The magic SQL clause needed to do it is CONNECT BY PRIOR, which happens also to need an additional START WITH clause if it is to return vaguely sensible results. So, for example:

SQL> select ename, level from emp
  2  start with ename='KING'
  3  connect by prior empno=mgr;

ENAME          LEVEL
---------- ----------
KING               1
JONES              2
SCOTT              3
ADAMS              4
FORD               3
SMITH              4
BLAKE              2
ALLEN              3
WARD               3
MARTIN             3
TURNER             3
JAMES              3
CLARK              2
MILLER             3

That is in fact a hierarchical tree of EMP, though it doesn’t look much like a tree right now! To help bring out the ‘treeness’ of the data, I’ve selected here a pseudo-column called LEVEL which is automatically assigned to each row when you do these sorts of queries. You’ll note, for example, that KING is Level 1: that means he’s the ‘root’ node of the hierarchical tree, which is confirmation of what we already knew -he’s the ultimate boss. Jointly at level 2, we have JONES, BLAKE and CLARK. All three of these people report directly to KING. Beneath them, we have the Level 3 and 4 employees. Hence, ADAMS (level 4) reports to SCOTT (level 3) who reports to JONES, who reports to KING -and all of this should be exactly what I worked out from the visual inspection of the table earlier.

It’s not exactly obvious, though, is it?! Can’t you make it display the results any better?

What do you expect? Fireworks and 3-D graphics??! Oh, OK: I can make things look a bit more meaningful if I start indenting the different levels of the hierarchy -and I can do that easily enough if I use the LEVEL pseudo column to tell Oracle how much to indent records by: as you become more and more junior, your level gets bigger and bigger... so junior staff should end up being more indented than senior ones. You can see this at work in this example:

SQL> select lpad(ename,LEVEL*2,' ') from emp
  2  start with ename='KING'
  3  connect by prior empno=mgr;

LPAD(ENAME,LEVEL*2,'')
--------------------------------------------

KI
JONE
 SCOTT
   ADAMS
  FORD
   SMITH
BLAK
 ALLEN
  WARD
MARTIN
TURNER
 JAMES
CLAR
MILLER

...and you can see this is getting really quite close to something useful. The seniority relationship between, say, ADAMS, SCOTT, JONES and KING is now very visible -if only the employee names were not being truncated! But, fortunately, that too can be fixed with just a little tweaking of the LPAD bit of the query:

SQL> select lpad(ename,length(ename)+LEVEL*2,' ') from emp
  2  start with ename='KING'
  3  connect by prior empno=mgr;

LPAD(ENAME,LENGTH(ENAME)+LEVEL*2,'')
----------------------------------------------------------

  KING
   JONES
     SCOTT
       ADAMS
     FORD
       SMITH
   BLAKE
     ALLEN
     WARD
     MARTIN
     TURNER
     JAMES
   CLARK
     MILLER

And that, I think, is just about as good as you’ll get.

That’s quite neat. But I notice I have to know who to ‘START WITH’ before the query runs. Suppose I don’t know that?

Then you can miss that bit of the clause out. If you do, Oracle will begin its search at the base of the tree (in my case, with all the junior staff in turn) and work upwards. If I’m going to work upwards through the tree, though, I also need to reverse the order of the CONNECT BY PRIOR clause. Instead of the ‘EMPNO leads me to a MGR’ version I’ve been using, I will have to change that to read ‘MGR leads me to an EMPNO’. For example:

SQL> select lpad(ename,length(ename)+LEVEL*2,' ') from emp
  2  connect by prior mgr=empno;

LPAD(ENAME,LENGTH(ENAME)+LEVEL*2,'')
-----------------------------------------------------------

  SMITH
   FORD
     JONES
       KING
  ALLEN
   BLAKE
     KING
  WARD
   BLAKE
     KING
  JONES
   KING
  MARTIN
   BLAKE
     KING
  BLAKE
   KING
  CLARK
   KING
  SCOTT
   JONES
     KING
  KING
  TURNER
   BLAKE
     KING
  ADAMS
   SCOTT
     JONES
       KING
  JAMES

...and on for lots more rows (39 of them, in fact). Clearly, that’s not quite what we want: the query has used each row in the EMP table in turn to generate a separate ‘mini-tree’, and presented the entire collection of mini-trees as the query’s answer. In reality, of course, we want just one, global tree -so this approach must be wrong... but you might notice that every single one of the mini-trees ends with an entry for KING -which is a pretty good clue that KING must be at the top of the global tree. When you haven’t a clue what to START WITH therefore, query without the clause first to give you an idea of what the appropriate row to start with really is.

Of course, for this particular data set, I could have used this sort of alternative approach:

SQL> select lpad(ename,length(ename)+LEVEL*2,' ') from emp
  2  start with ename=
  3      (select ename from EMP where MGR is null)
  4       connect by prior empno=mgr;

LPAD(ENAME,LENGTH(ENAME)+LEVEL*2,'')
----------------------------------------------------------

  KING
   JONES
     SCOTT
       ADAMS
     FORD
       SMITH
   BLAKE
     ALLEN
     WARD
     MARTIN
     TURNER
     JAMES
   CLARK
     MILLER

The real point here, of course, is that it is perfectly legitimate to use a subquery to generate the value with which the main hierarchical query should start. If you can divine a similar sort of business rule for the specific real data you might be working with, then you can construct a similar sort of subquery to express it -and so, again, no: you don’t strictly need to ‘know’ where to START WITH ahead of time when doing these sorts of queries.

OK. One thing I don’t quite understand: why does the syntax use CONNECT BY PRIOR X=Y? And what difference does it make if you specify X=Y or Y=X?

Excellent questions! As with all matters syntactical, I think it helps sometimes just to accept that the syntax is as it is, and that there is no obvious, rational explanation why it got that way! In this case, the words CONNECT BY PRIOR EMPNO=MGR, in the context of START WITH KING, can be taken to mean ‘read the row for KING and find his EMPNO. Now find other records in EMP which have that EMPNO as their MGR value. For each record that you find, use that record’s EMPNO to search for other EMP records which use it as their MGR value.’ And so on.

In short, a row’s EMPNO field becomes the search criterion to check all other row’s MGR columns. And hence the query has to be written as EMPNO=MGR. If you did it the other way around (that is, MGR=EMPNO), again starting with KING, you’d be saying ‘extract the starting row’s MGR value, then search EMP for other rows which have that as their EMPNO’. And if you remember anything at all about Scott’s EMP table, you’ll realize that’s a problem, because KING is the President of the company, and has a null MGR field... meaning that the query can find no other rows related to KING at all and thus grinds to a halt after just the first row (try it if you don’t believe me!).

So yes, the ordering of the X=Y bit of the CONNECT BY PRIOR clause is important. If you are starting at the top of the tree and walking downwards, then the first row must give you a data value that leads to the children -and in my case, that data value can only be EMPNO. Hence the first row’s EMPNO will give us something to search the MGR column in all the other rows of the table for -and hence, EMPNO=MGR gives us a top-down hierarchical tree walk.

If you would like to start at the base of the tree, however, and walk upwards, then the first row’s MGR value will lead us to another row with that as its EMPNO -and hence MGR=EMPNO would yield a bottom-up tree walk -though, of course, you’d then have to change the START WITH part of the query too.

As for why the syntax reads CONNECT BY PRIOR... well, as I said, I just learn to stop asking why syntax reads as it does! I dislike the keyword ‘connect’, because it implies, to me, somehow making multiple logons to a database! And the ‘prior’ bit is confusing, too: it is supposed to mean ‘the previous (or prior) row you found gave you the value to search for other rows by’. But it’s pushing things, I think, to make coherent sense from that! I’d advise just getting used to the peculiar use of words, and get the real gist of the matter: the ordering of the columns you give the clause determines the direction of travel through the tree, up or down.

Can I lop bits of the tree off? Suppose I only want to see the KING-BLAKE-ALLEN bit of the tree?

That’s quite easy to do, actually -though it is possible to fall into a trap when trying to do it. What I mean is that you might think of doing something like this:

SQL> select lpad(ename,length(ename)+LEVEL*2,' ') from emp
  2  where ename<>'BLAKE'
  3  start with ename='KING'
  4  connect by prior empno=mgr;

LPAD(ENAME,LENGTH(ENAME)+LEVEL*2,'')
----------------------------------------------------------
  KING
   JONES
     SCOTT
       ADAMS
     FORD
       SMITH
     ALLEN
     WARD
     MARTIN
     TURNER
     JAMES
   CLARK
     MILLER
 

The use of a WHERE clause here certainly removes BLAKE from the tree... but, you might note, it also leaves BLAKE’s subordinates in the tree, and thus distorts the picture. Here, it looks as if ALLEN and WARD, for example, report directly to JONES: they appear, after all, to be at the same level in the tree as SCOTT and FORD.

That is probably not what you were after: you were probably more interested in lopping off the entire BLAKE branch of the hierarchy -and, fortunately, you can do that too:

SQL> select lpad(ename,length(ename)+LEVEL*2,' ') from emp
  2  start with ename='KING'
  3  connect by prior empno=mgr
  4  and ename <> 'BLAKE';

LPAD(ENAME,LENGTH(ENAME)+LEVEL*2,'')
----------------------------------------------------------

  KING
   JONES
     SCOTT
       ADAMS
     FORD
       SMITH
   CLARK
     MILLER

I’m using very much the same sort of logic as before (ENAME<>’BLAKE”), but I’ve moved it out of a WHERE clause and into the CONNECT BY PRIOR clause. The effect of such a slight change is, as you can see, quite dramatic: the entire BLAKE branch is now missing, and the report only lists 8 employees (out of 14). But at least this time the tree structure is not being mis-displayed!

Nice... but is there actually much use for this sort of thing in the real world?

Well, it depends! Bear in mind that what we are doing is exploring a set of relationships that are defined within a single table. That is indeed rather unusual in a relational database which -almost by definition- defines relationships between tables. So whilst one could perhaps imagine a hierarchy of components which ‘roll up’ into ‘parts’ and which then roll up into ‘products’, I suspect that most “proper” relational databases would have separate COMPONENTS, PARTS, and PRODUCTS tables related to each other with foreign key constraints.

The official Oracle documentation cites examples such as genealogy (family trees), livestock (pedigrees and breeding programs), evolutionary research (species evolve into other species), as well as the classic one I’ve discussed here -management/employee hierarchies. Those are fairly specialized areas of database use, don’t you think!? The key thing about all of them, though, is that they have a self-referencing quality: a horse will be the offspring of two animals that are themselves horses; species can only evolve from (“out of”) other species. Employees can only be managed by other employees. Likewise, in my example, products are merely manufactured goods which are themselves made up of other, component, manufactured goods.

So, no: I don’t imagine that this query technique is going to be in daily use for everybody. But spare a thought for a Data Warehouse or an OLAP database. They are quite commonly repositories for enormous quantities of data, and normalizing all that data in the way you'd expect to do in an OLTP-style database is quite commonly out of the question. For one thing, joining it all back together again when you do queries becomes an outrageously expensive thing for the database engine to have to do. It's therefore quite usual for Warehouse-style databases to store de-normalized data, with related yet logically distinct information stored in the same physical table (like creating a table as a select from a join of the ever-familiar EMP and DEPT tables, for example). Once you accept the existence of denormalized tables, you must inevitably come to the conclusion that there will be plenty of tables in the real world that contain their relationships within themselves, and not as a set of relations to other, non-self, tables. At which point, hierarchical queries can really come into their own and be extremely useful.

In short: these things do have a use in the real world, but it's fairly specialized and not the sort of thing you are likely to do very much of in the sort of OLTP environments with which most DBAs are familiar. Writing hierarchical queries nevertheless remains an extremely important technique to know about.


Click here to learn more about Howard Rogers.