Reprinted with Permission by Quest Software April  2002

 

Data Modeling: Reality Requires Super and Sub Types
Bert Scalzo, PhD

“All fixed set patterns are incapable of adaptability or pliability. The truth is outside of all fixed patterns.” – Bruce Lee

Welcome to this third paper in an ongoing series regarding common data modeling mistakes and their impact. The key premise throughout is that poor database design:

Each paper in the series focuses upon a narrow range of the more common and costly data modeling mistakes, exposing their impacts to project success and espousing sound data modeling practices to overcome them.

This issue we’ll examine the often overlooked and under utilized technique of entity super and sub typing, also known as generalization hierarchies or inheritance – with the goal of answering two key questions:

1.      Truly Logical = Break It Down

When logical data modeling (i.e. truly modeling from the business perspective), super and sub typing is arguably one of the most useful tools. It is a true “divide and conquer” technique that permits us to subdivide our complex business entities into smaller and more manageable component pieces. These subdivided entities are much easier to work with and are usually more readily understood than the larger whole from whence they came. Plus, the additional new relationships involving these subdivided entities clearly expose important business rules that may have been implied, but not expressly modeled. Moreover, hidden business rules such as these are quite often the ones that programmers would have had to implement via application code – when and if they remembered.

By definition, a super type is an entity that characterizes all attributes and relationships that are common across any of its contained sub types – much like a “super set” in mathematical terms. Whereas a sub type is an entity that actualizes only those attributes and relationships that are specific to the person, place or thing that it represents – much like a “sub set”. Hence, super and sub typing could be viewed as a hierarchical object based (but not object oriented) technique. OK, enough of the math already.

Let’s examine a simple portion of a Human Resources (HR) data model – the nature of our employees. Sounds simple, right? Look at Figure 1.

Figure 1

We know that we have employees, departments and unions. But which employees belong to unions? Look at all the optional attributes. What rules if any are there for when each one must or may have a value? Of course any business analyst would see right through this and correctly subdivide the model as shown in Figure 2.

Figure 2

What have we accomplished? First, we’ve clearly shown that we have two mutually exclusive types of employees (i.e. you either are exempt or non-exempt, but not both). We also clearly see now that four of our previously optional attributes are mandatory based upon the employee type. Plus we see that only non-exempt employees belong to unions. But did you also notice that our entire non-exempt work force is unionized? Look again at the cardinality, it now shows that each non-exempt employee is a union member.

That’s quite a bit of information now more clearly exposed and accurately defined.

2.      Physical Logical = Roll It Up

Forgive me as I get up on my soapbox for a moment and preach a little ER religion.

I’ve run into numerous data modelers who really aren’t business analysts, but rather are either DBA’s or programmers tasked with doing a data model (hopefully before the darn application is built). Here we have people doing a more physical attempt at a logical data model – and often the results are more about the desired database design or application programming logic than the true underlying business requirements.

Be that as it may, super and sub types are useful here as well – it’s just that the approach is now one of combining instead of dividing entities. Let’s return to our HR example.

Figure 3 shows an example of what a more physically minded data modeler might produce as their HR data model.

Figure 3

Look at all the extra work that was required to do! I had to create new attributes for the employee ID and SSN as you cannot reuse an attribute that’s been included in a unique identifier. In fact I skipped creating the third unique identifier for each employee type, as I got really tired of doing such silly busy work. Same thing happens all the time in the real world as well – data modeling shortcuts resulting in inaccurate results.

Anytime you find that you have the same attributes and relationships for two different entities as in Figure 3, you probably have a good candidate to roll-up into a super and sub type instead. Do you see that Figure 3 is really the same as Figure 2? The only difference is probably that some DBA or programmer desired a two table physical implementation, so they drew it as such. As we’ll see in the next section, this was not necessary at all.

3.      Physical Implementation Options

Let’s resume using Figure 2 as our correct HR data model. So how do we translate such a purely logical data model into a successful database design?

First and foremost and prior to generating a physical data model (PDM) from our logical data model (CDM), we must define the physical generation properties for our super and sub types (i.e. the inheritance object properties) as show in figure 4.

Figure 4

We have three basic options:

4.      One Table for Everything

If you want one table for everything, then you need to check generate parent and not check generate children on the physical generation properties in Figure 4. You also must add a column to the physical generation properties – which will serve as the entity-specifying attribute, also known as the determinant. Figure 5 shows such a logical to physical translation (note the new column for employee type that was added). It’s advisable to use a domain for this so that you can define the legitimate column values (which for our case are N and E).

Figure 5

The resulting DDL for the one table for everything is

create table EMPLOYEE  (
   EMP_ID               NUMBER(12)                       not null,
   DEPT_ID              NUMBER(8)                        not null,
   UNION_ID             NUMBER(8)                        not null,
   EMP_SSN              CHAR(9)                          not null,
   EMP_FIRST_NAME       VARCHAR2(20)                     not null,
   EMP_LAST_NAME        VARCHAR2(30)                     not null,
   EMP_BIRTH_DATE       DATE                             not null,
   EMP_GENDER           CHAR(1)                          not null,
   EMP_HIRE_DATE        DATE                             not null,
   EMP_STREET           VARCHAR2(80),
   EMP_CITY             VARCHAR2(40),
   EMP_STATE            CHAR(2),
   EMP_ZIP              CHAR(5),
   EMP_TYPE             CHAR(1)                          not null
         constraint CKC_EMP_TYPE_EMPLOYEE check (EMP_TYPE in ('E','N')),
   EMP_HOURLY_RATE      NUMBER(5,2),
   EMP_OVERTIME_RATE    NUMBER(5,2),
   EMP_MONTHLY_SALARY   NUMBER(8,2),
   EMP_VACATION_WEEKS   SMALLINT,
   constraint PK_EMPLOYEE primary key (EMP_ID),
   constraint AK_EMP_UID2_EMPLOYEE unique (EMP_SSN),
   constraint AK_EMP_UID3_EMPLOYEE unique (EMP_FIRST_NAME,
         EMP_LAST_NAME, EMP_BIRTH_DATE, EMP_GENDER),
   constraint FK_EMPLOYEE_RELATIONS_DEPARTME foreign key (DEPT_ID)
         references DEPARTMENT (DEPT_ID),
   constraint FK_EMPLOYEE_RELATIONS_UNION foreign key (UNION_ID)
         references "UNION" (UNION_ID)
)

Guess What? We followed all the proper steps, yet we have an incomplete DDL script! It’s missing a very key constraint (no pun intended). Remember all that stuff the super typing made explicit from Figure 1 to Figure 2? Well we still want that to be true, so we need to manually add the following table constraint to our table definition.

CONSTRAINT    emp_type_null
CHECK ( (emp_type = 'E' AND
         emp_monthly_salary IS NOT NULL AND
         emp_vacation_weeks IS NOT NULL AND
         emp_hourly_rate    IS NULL AND
         emp_overtime_rate  IS NULL AND
         union_id       IS NULL)
      OR
        (emp_type = 'N' AND
         emp_monthly_salary IS NULL AND
         emp_vacation_weeks IS NULL AND
         emp_hourly_rate    IS NOT NULL AND
         emp_overtime_rate  IS NOT NULL AND
         union_id       IS NOT NULL)
        ) )

5.      One Table Per Sub Type

If you want one table per sub type, then you need to not check generate parent and check generate children with inherit all attributes on the physical generation properties in Figure 4. Figure 6 shows such a logical to physical translation.

Figure 6

Doesn’t Figure 6 remind you of Figure 3 (the more physically minded modelers attempt to data model our HR system)? See – we can model the business rules correctly and still produce whatever the desired physical implementation might be!

The resulting DDL for the table per sub type is

create table EXEMPT_EMPLOYEE  (
  EMP_ID               NUMBER(12)                       not null,
   DEPT_ID              NUMBER(8)                        not null,
   EMP_SSN              CHAR(9)                          not null,
   EMP_FIRST_NAME       VARCHAR2(20)                     not null,
   EMP_LAST_NAME        VARCHAR2(30)                     not null,
   EMP_BIRTH_DATE       DATE                             not null,
   EMP_GENDER           CHAR(1)                          not null,
   EMP_HIRE_DATE        DATE                             not null,
   EMP_STREET           VARCHAR2(80),
   EMP_CITY             VARCHAR2(40),
   EMP_STATE            CHAR(2),
   EMP_ZIP              CHAR(5),
   EMP_MONTHLY_SALARY   NUMBER(8,2)                      not null,
   EMP_VACATION_WEEKS   SMALLINT                         not null,
   constraint PK_EXEMPT_EMPLOYEE primary key (EMP_ID),
   constraint FK_EXEMPT_E_RELATIONS_DEPARTME foreign key (DEPT_ID)
         references DEPARTMENT (DEPT_ID)
)

create table NON_EXEMPT_EMPLOYEE  (
   EMP_ID               NUMBER(12)                       not null,
   UNION_ID             NUMBER(8)                        not null,
   DEPT_ID              NUMBER(8)                        not null,
   EMP_SSN              CHAR(9)                          not null,
   EMP_FIRST_NAME       VARCHAR2(20)                     not null,
   EMP_LAST_NAME        VARCHAR2(30)                     not null,
   EMP_BIRTH_DATE       DATE                             not null,
   EMP_GENDER           CHAR(1)                          not null,
   EMP_HIRE_DATE        DATE                             not null,
   EMP_STREET           VARCHAR2(80),
   EMP_CITY             VARCHAR2(40),
   EMP_STATE            CHAR(2),
   EMP_ZIP              CHAR(5),
   EMP_HOURLY_RATE      NUMBER(5,2)                      not null,
   EMP_OVERTIME_RATE    NUMBER(5,2)                      not null,
   constraint PK_NON_EXEMPT_EMPLOYEE primary key (EMP_ID),
   constraint FK_NON_EXEM_RELATIONS_UNION foreign key (UNION_ID)
         references "UNION" (UNION_ID),
   constraint FK_NON_EXEM_RELATIONS_DEPARTME foreign key (DEPT_ID)
         references DEPARTMENT (DEPT_ID)
)

Guess What? We followed all the proper steps, yet we have an incomplete DDL script! It too is missing a very key constraint (no pun intended). We don’t want to permit an employee to be exempt and non-exempt – right? So we need to maintain our primary key and unique key integrity across multiple tables! This cannot be done with a traditional key constraint as they are limited in scope to a single table. So we need to manually add the following triggers to our table definitions.

CREATE TRIGGER exempt_employee_check
BEFORE INSERT OR UPDATE OF emp_id, emp_ssn
ON exempt_employee
FOR EACH ROW
DECLARE
  dummy INTEGER := 0;
BEGIN
  IF ( INSERTING OR
      (UPDATING AND
       :new.emp_id <> :old.emp_id OR
       :new.emp_ssn <> :old.emp_ssn)) THEN
    SELECT COUNT(*)
    INTO dummy
    FROM non_exempt_employee
    WHERE emp_id =  :new.emp_id OR
          emp_ssn = :new.emp_ssn;
    IF (dummy <> 0) THEN
      RAISE DUP_VAL_ON_INDEX;
    END IF;
  END IF;
END;

CREATE TRIGGER non_exempt_employee_check
BEFORE INSERT OR UPDATE OF emp_id, emp_ssn
ON non_exempt_employee
FOR EACH ROW
DECLARE
  dummy INTEGER := 0;
BEGIN
  IF ( INSERTING OR
      (UPDATING AND
       :new.emp_id <> :old.emp_id OR
       :new.emp_ssn <> :old.emp_ssn)) THEN
    SELECT COUNT(*)
    INTO dummy
    FROM exempt_employee
    WHERE emp_id =  :new.emp_id OR
          emp_ssn = :new.emp_ssn;
    IF (dummy <> 0) THEN
      RAISE DUP_VAL_ON_INDEX;
    END IF;
  END IF;
END;

6.      Table for the parent plus one table per sub type

If you want one table per sub type, then you need to check generate parent and also check generate children with inherit only primary attributes selected on the physical generation properties in Figure 4. Figure 7 shows such a logical to physical translation.

Figure 7

Note that Figure 7 nearly matches our touted logical model in Figure 2. At first this may not seem like such a big deal. But it’s much easier to maintain a logical model and add new business functions when there is such a direct correlation between the logical model and the actual database. Adding a new attribute or relationship directly matches adding a column or foreign key constraint. Your business analysts, DBA’s and programmers will have a more common frame of reference and suffer less communication breakdown with this approach. Since data models are really all about communication in the first place, I suggest that this is an overwhelming advantage – worth any extra work real or perceived.

The resulting DDL for table for the parent plus one table per sub type is

create table EMPLOYEE  (
   EMP_ID               NUMBER(12)                       not null,
   DEPT_ID              NUMBER(8)                        not null,
   EMP_SSN              CHAR(9)                          not null,
   EMP_FIRST_NAME       VARCHAR2(20)                     not null,
   EMP_LAST_NAME        VARCHAR2(30)                     not null,
   EMP_BIRTH_DATE       DATE                             not null,
   EMP_GENDER           CHAR(1)                          not null,
   EMP_HIRE_DATE        DATE                             not null,
   EMP_STREET           VARCHAR2(80),
   EMP_CITY             VARCHAR2(40),
   EMP_STATE            CHAR(2),
   EMP_ZIP              CHAR(5),
   constraint PK_EMPLOYEE primary key (EMP_ID),
   constraint AK_EMP_UID2_EMPLOYEE unique (EMP_SSN),
   constraint AK_EMP_UID3_EMPLOYEE unique (EMP_FIRST_NAME, EMP_LAST_NAME, EMP_BIRTH_DATE, EMP_GENDER),
   constraint FK_EMPLOYEE_RELATIONS_DEPARTME foreign key (DEPT_ID)
         references DEPARTMENT (DEPT_ID)
)

create table EXEMPT_EMPLOYEE  (
   EMP_ID               NUMBER(12)                       not null,
   EMP_MONTHLY_SALARY   NUMBER(8,2)                      not null,
   EMP_VACATION_WEEKS   SMALLINT                         not null,
   constraint PK_EXEMPT_EMPLOYEE primary key (EMP_ID),
   constraint FK_EXEMPT_E_INHERITAN_EMPLOYEE foreign key (EMP_ID)
         references EMPLOYEE (EMP_ID)
)

create table NON_EXEMPT_EMPLOYEE  (
   EMP_ID               NUMBER(12)                       not null,
   UNION_ID             NUMBER(8)                        not null,
   EMP_HOURLY_RATE      NUMBER(5,2)                      not null,
   EMP_OVERTIME_RATE    NUMBER(5,2)                      not null,
   constraint PK_NON_EXEMPT_EMPLOYEE primary key (EMP_ID),
   constraint FK_NON_EXEM_RELATIONS_UNION foreign key (UNION_ID)
         references "UNION" (UNION_ID),
   constraint FK_NON_EXEM_INHERITAN_EMPLOYEE foreign key (EMP_ID)
         references EMPLOYEE (EMP_ID)

Guess What? We followed all the proper steps, yet we have an incomplete DDL script! It too is missing a very key constraint (no pun intended). We don’t want to permit an employee to be exempt and non-exempt – right? So we need to maintain our primary key across across the sub type tables! This cannot be done with a traditional key constraint as they are limited in scope to a single table. But do note that this is much better than the table per sub type approach as we only need to do this for the primary key and not all the unique keys. So we need to manually add the following triggers to our table definitions.

CREATE TRIGGER exempt_employee_check
BEFORE INSERT OR UPDATE OF emp_id
ON exempt_employee
FOR EACH ROW
DECLARE
  dummy INTEGER := 0;
BEGIN
  IF ( INSERTING OR
      (UPDATING AND :new.emp_id <> :old.emp_id)) THEN
    SELECT COUNT(*)
    INTO dummy
    FROM nonexempt_employee
    WHERE emp_id =  :new.emp_id
    IF (dummy <> 0) THEN
      RAISE DUP_VAL_ON_INDEX;
    END IF;
  END IF;
END;
 

CREATE TRIGGER non_exempt_employee_check
BEFORE INSERT OR UPDATE OF emp_id
ON non_exempt_employee
FOR EACH ROW
DECLARE
  dummy INTEGER := 0;
BEGIN
  IF ( INSERTING OR
      (UPDATING AND :new.emp_id <> :old.emp_id)) THEN
    SELECT COUNT(*)
    INTO dummy
    FROM exempt_employee
    WHERE emp_id =  :new.emp_id
    IF (dummy <> 0) THEN
      RAISE DUP_VAL_ON_INDEX;
    END IF;
  END IF;
END;

7.      Some Final Thoughts

One of the more interesting rationales people often give as justification for their selection among these three design alternatives is that of normalization. People often say that they picked method X in order to obtain Y normal form. More often than not it’s just a crutch for them to reach comfort with their decision. But a decision is always better than not, so this is not such a negative thing. We just need to make sure they’re basically correct.

Without bringing up those awful relational algebra or relational calculus definitions for the joys of normalization, let’s nonetheless try to make some kind of determination as to where these alternatives lie. We’ll use the easy to remember phrase “The key, the whole key, and nothing but the key – so help me Codd”.

The one table for everything approach fails third normal form. The attributes inherited into the parent from the children rely on both the key and the determinant. For example, EMP_MONTHLY_SALARY relies on the EMP_ID for uniqueness and the EMP_TYPE for its existence. So it’s second normal form at best as it fails “nothing but the key”.

The one table per sub type approach arguably fails first normal form (although I expect quite a few people to disagree here). I’ll argue that we have repeating groups. While they are not within the same entity per se, we are in fact repeating the entities’ overlapping attributes based upon their entity role (i.e. exempt versus non-exempt). Therefore, it fails (in my humble opinion) first normal form for repeating groups and/or “the key” (since the attributes also depend on the entity nature or role).

Only the third approach seems to pass third normal form. Until next issue …