Reprinted with Permission by Quest Software Mar.  2002

 

DB2 Index Design: Analyze How Data is Used
Topic Extracted from Knowledge Xpert for DB2

To understand the cost vs. benefits of using indexes, you must analyze how the data is used to determine which columns should be indexed. Identify columns that are frequently searched or joined, and estimate the percentage of rows processed by frequently executed SQL statements. Consider creating the clustering index on the column most frequently searched or joined in sequence. Consider non-clustering indexes on searched or joined columns in which less than 5 to 10 percent of the rows are to be selected, updated, deleted, and joined.

Refine this analysis with an estimate of the percentage of the rows that are inserted, updated, and deleted over a given period of time. Minimize the number of indexes when inserting, updating, and deleting more than about 10 percent of the rows on a weekly basis. If heavy update activity is concentrated on a weekly, monthly, quarterly, or annual basis, consider the techniques for avoiding index maintenance described in the section on batch processing.

Composite indexes are useful when columns are frequently referenced together. They reduce the number of indexes that must be maintained and increase the chance of index-only retrieval.

These are only general guidelines. The physical design of the tables and indexes requires an analysis of how the data is to be processed. A process analysis matrix sometimes called a CRUD (Create, Retrieve, Update, and Delete) matrix is useful in summarizing how the data will be processed. Below is an example of such a matrix.

*Note P (Equal predicate, >P range), Sn (Select column, number of rows), In (Insert), Dn (Delete), Jm (Join, m=x for one of many joins) in each cell

**Frequency of execution: D (Daily), W (Weekly), M (Monthly), A (Annual)

Tables and columns are listed as rows of the matrix. Beside each table is noted the table cardinality (the number of rows in the table) and beside each column is noted the column cardinality (the number of distinct values in the column). Each column of the matrix has the name of a transaction or batch program.

For each cell in the matrix, note Sn (Select column), In (Insert), and Dn (Delete) in each cell where n represents the number of rows processed. Joins are noted with Jm where m represents one of many joins. If a predicate is specified on the column to narrow the search, note a P for an equal predicate, >P for a greater-than predicate, and so on. The last row in the matrix is for the frequency of execution; D (Daily), W (Weekly), M (Monthly), and A (Annual) are convenient notations. The form of the matrix is not critical. The important thing is to perform a detailed analysis of how the data will be processed and summarize the processing requirements to provide a basis for design of the indexes. The process matrix will also prove useful in tuning the design of the tables.

To make process analysis more manageable, identify the top 10 programs or top 10 percent of the programs for which performance is most critical. The indexes identified will likely be beneficial to most of the programs. You can do more tuning of the index design when the transaction and batch programs are tested.