Reprinted with Permission by Quest Software June 2005


Parlez-Vous BI?

Robert Catterall

Everything you need to know to talk intelligently about business intelligence.

Does this scenario sound familiar? You're a DB2 DBA with years of experience managing operational database systems. You're in a meeting in which some business intelligence (BI) experts - your colleagues or consultants - are discussing various aspects of a new DB2-based decision-support system that's being developed. You hear these people talking about OLAP, MOLAP, ROLAP, marts, ODSs, stars, and snowflakes, and you think to yourself, "What in the world are these people talking about?" You're faced with two unpleasant options: Smile and nod and hope that no one asks for your opinion; or publicly admit your bewilderment.

Believe me, I've been there, and I have a suggestion for you: Before you go into that meeting, read this column. I'll do my best to demystify these and other BI concepts. If I can learn this stuff, so can you.

What, Where, and When?

Consider the nature of the databases that might support an organization's BI needs. For a number of years, CheckFree Corp. has had a mainframe DB2 system on which we've run a query-and-reporting workload. Until quite recently, this system was commonly referred to as our data warehouse, an appellation that drove our BI people crazy. "No," they would say, "it's not a data warehouse, it's an ODS. We don't have a data warehouse."

An ODS is an operational data store; ours was basically a copy of many of the most important tables in the production DB2 for z/OS database that is the foundation for our electronic bill presentment and payment (EBPP) application. ODSs are often used, as ours is, to offload query and reporting processing from an operational system. An ODS may contain more historical data than its production counterpart (at CheckFree, it's several years versus several months). The frequency with which data in an ODS is updated will vary from organization to organization. We use a bulk data extract and movement process that runs in batch mode and updates our ODS every two hours. At some sites, data in the ODS is updated immediately after an update to the production source data.

Our ODS had served us well (and still does), but about a year ago we launched our first true enterprise data warehouse (EDW) using DB2 UDB with the data partitioning feature (DPF) on an IBM pSeries server running the AIX operating system. (With DPF, we partition the EDW data across several logical nodes on one large SMP server.)

Key differences between our EDW and our ODS include:

The CheckFree electronic commerce division's EDW has enhanced the company's ability to understand how it can best serve the needs of key constituents in the EBPP market: billers, financial institutions, and consumers. Still, we know that we can do more to provide our business analysts and executives with tools and information to take our BI capabilities to the next level. To that end, we're about to roll out our first data mart, using DB2 UDB on a Linux server.

Data marts are often fed from data warehouses (as ours will be). They tend to be relatively specialized, in terms of subject area, and they usually contain more in the way of aggregated data versus the atomic-level data typically found in an EDW. Data marts provide excellent platforms for a type of BI processing known as online analytic processing (OLAP).

O-What?

OLAP basically involves looking at data using a cube model. A "cube" is BI-speak for a multidimensional representation of data. Suppose, for example, that someone working for a retailer wants to analyze sales data along three dimensions: time, geography, and product line. At what time of the year does certain apparel sell best in the Midwest? These three dimensions can be thought of as the height, width, and depth of a cube of data. By drilling through the cube, you can examine data "slices" of interest — slices being data values that exist, logically, at the intersection of certain dimensional values (for example, May 2003, Northeast, electronics).

A physical cube, of course, has three dimensions. Although my simple data cube example has only three, a data cube can have more than three dimensions. The term "cube" is used to describe the multidimensional data representation because we can easily visualize a three-dimensional object. Even when you add a fourth dimension, such as purchaser demographics, you still call the resulting data set a cube.

As you might expect, cubes can be pretty complex, and building OLAP-supporting cubes can take some time. Sometimes, cube-building time can be reduced when the underlying data is arranged in what's called a star schema. A star schema in a relational database consists of a so-called fact table and associated dimensional tables. The fact table is typically quite large (sometimes very large, with a billion or more rows); dimensional tables are usually quite a bit smaller. If you draw a large-box fact table surrounded by small-box dimensional tables, you can see why these arrangements are referred to as star schemas.

To use another retail example (retailers have long been BI trailblazers), think of a fact table holding basic information about sales. Dimensional tables in the star schema might hold data about time periods, store locations, product lines, and purchasers (the company might provide "loyalty cards" that allow consumers to buy at a discount and allow the retailer to know something about the person making the purchase). Rows in the fact table are linked to rows in the dimensional tables using surrogate key values generated for that purpose. When a cube-building request comes along in the form of an SQL statement, the DB2 query optimizer can choose from among several different star join techniques in making access path decisions. Often, the optimizer will choose to join dimensional tables to each other before going to the fact table, even though the dimensional tables don't have any columns in common. Although this technique would likely be less than optimal for an operational application, it can be a sensible choice for a star schema if it allows the DBMS to filter out a large percentage of the fact table rows when accessing that table.

Although dimensional tables are typically small, in some cases they can be rather large. A given dimensional table can be normalized into several smaller tables; when that happens, you get what's called a "snowflake."

When cubes are built dynamically from data in the database (perhaps from tables arranged in a star schema), that's called relational OLAP (ROLAP). Sometimes, a tool will extract data from the relational database and use it to build cubes that are stored in a proprietary, nonrelational data structure, which is called multidimensional OLAP (MOLAP). ROLAP and MOLAP can be used together. You might begin an analytic exercise using a MOLAP cube because it's already built (and would be expensive to build on the fly). As you drill down through the MOLAP cube, getting to levels characterized by progressively less-aggregated data, you might eventually hit a limit beyond which further drill down is possible only through the use of a ROLAP cube built using the more granular data in the relational database.

Until fairly recently, DB2 didn't have much built-in awareness of OLAP needs.

May I Suggest an MQT?

DB2 UDB provides a query performance-boosting capability called materialized query tables (MQTs). An MQT is table holding a precomputed query result set. The result set could be generated by a SELECT with a GROUP BY clause applied to one or more base tables in a database. If a query targeting the base table of an MQT could be processed more quickly using the MQT, the DB2 optimizer can automatically redirect the query to the MQT; the resulting improvement in the query's elapsed time can be dramatic.

Because building cubes generally involves data aggregation, MQTs can provide a major OLAP performance boost; however, it can be a challenge to figure out which MQTs should be defined in order to get the best OLAP performance. MQTs take up disk space and consume cycles when being built or updated, so you don't want to have too many of them. On the other hand, you could miss out on getting better OLAP performance if you define too few MQTs (or the wrong ones). Because DB2 itself had no idea as to the nature of the cubes to be built from data in the database, it couldn't help with the development of an OLAP-supporting MQT strategy.

That situation changed with the introduction of an IBM product called DB2 Cube Views. Cube Views provides a means of supplying metadata about cubes that will be constructed using data in the database. This cube metadata is stored in tables that are extensions of the DB2 catalog. In addition to making DB2 cube-aware, DB2 Cube Views has an advisor that can process the cube metadata and make recommendations as to which MQTs should be defined to optimize the building of the specified cubes. So, MQTs provide a great OLAP performance boost in the form of MQTs; now, the DB2 Cube Views advisor provides a way to make the best use of MQTs to speed up an OLAP workload.

Making BI Work

I don't want to suggest that this one column contains all the information you'll ever want to have on BI. But I'm hoping that, with an understanding of BI basics, you'll be in a better position to help develop systems that will enable your organization to turn data into information, and information into insight.

CATTERALL'S GLOSSARY OF BASIC BI TERMS

BI: Business intelligence; the use by an organization of its data assets to improve decision-making and, therefore, organizational performance.

ODS: Operational data store; a database that essentially mirrors a production data source and is used to offload query and reporting work from the production system.

EDW: Enterprise data warehouse; a database that's the foundation for a comprehensive BI infrastructure, typically characterized by a data-centric (as opposed to application-centric) database design.

Data mart: A database, often sourced from an EDW, that serves the advanced data analysis needs of a particular group of BI users within an organization.

Cube: A structure in which data is arranged along various dimensions (for example, time, geography, and product line), and is used to support OLAP functions.

OLAP: Online analytic processing; the sophisticated manipulation of data for analytic purposes, often using data organized into cube structures.

ROLAP: Relational OLAP; characterized by the dynamic building of cubes using data sourced from a relational database.

MOLAP: Multidimensional OLAP; involving the analysis of data in prebuilt cubes that are stored in a proprietary, nonrelational data structure.

Star schema: A database design, often used to improve OLAP performance, that features a central fact table linked to associated dimensional tables by way of surrogate keys.

Snowflake: Similar to a star schema, but with larger dimensional tables that have been denormalized into sets of smaller tables.


RESOURCES
DB2 Business Intelligence
ibm.com/db2bi
 


Robert Catterall is a database technology strategist at Atlanta-based CheckFree Corp. You can reach him at rcatterall@checkfree.com.