Reprinted with Permission by Quest Software Dec. 2004


Taking High Availability Higher
Robert Catterall

Getting more uptime from DB2 UDB for z/OS or OS/390.

IBM zSeries servers, otherwise known as mainframe computers, have a reputation as highly available systems — a reputation earned through four decades of rock-solid reliability.

DB2 for z/OS is an important part of the zSeries availability story. In this column, I'll describe some of the things that the company I work for, CheckFree Corp., has done to make a highly available system even more so.

The First Big Step: Data Sharing

In the spring of 2000, a couple of months after I started working at CheckFree, we implemented DB2 data sharing in the production environment. The term "data sharing" refers to a technology that allows multiple DB2 for z/OS subsystems on multiple servers — a mainframe cluster called a parallel sysplex — to share read/write access to a single DB2 database.

DB2 data sharing was introduced by IBM in the mid-1990s with DB2 for OS/390 v.4 and, for quite some time, was thought of primarily as a capacity-related play. (It was important in that regard, because IBM had recently shifted to CMOS microprocessors for its mainframes, and it would be a while before those chips exceeded the speed of the old bipolar engines.) However, processing capacity wasn't CheckFree's chief motivation for deciding on data sharing. Instead, the company was after availability; it wanted an enterprise data-serving platform that could keep running in the unlikely event of a mainframe, operating system, or even (gasp!) DB2 subsystem failure.

The DB2 data sharing group met that goal. It helped CheckFree weather some system component failures that otherwise could have resulted in significant outage situations. Since then, we've taken a number of steps, which I'll share with you, to build on this high-availability foundation.

More DB2 members. We increased from one DB2 for z/OS subsystem on each of the three zSeries servers in the production sysplex to two per server. This move improved availability by spreading the DB2 workload across more subsystems.

What's the connection between more members of the data sharing group and better availability? The answer is twofold. First, more subsystems means fewer retained locks in the event of a subsystem failure. Retained locks are update, also called X-type, data sharing locks held by a failed subsystem at the time of failure. Resources (typically data pages) locked in that manner will be unavailable until the failing subsystem is restarted. Second, more DB2 members means fewer in-flight units of work per subsystem, which results in faster recovery of a failed subsystem and faster freeing of retained locks. Note that the CPU overhead cost of going from three- to six-way data sharing is very small.
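The arithmetic behind the first point can be sketched in a few lines. The snippet assumes, purely for illustration, that the X-type locks held at any instant are spread evenly across the members of the group; the lock count of 600 is a made-up figure, not a CheckFree measurement.

```python
def retained_locks_after_failure(total_x_locks: int, members: int) -> float:
    """Expected X-type locks retained when one member fails.

    Assumes locks are spread evenly across the members (an
    illustrative simplification, not a measured distribution).
    """
    if members < 1:
        raise ValueError("a data sharing group needs at least one member")
    return total_x_locks / members


# Hypothetical figure: 600 X-type locks held group-wide at the moment of failure.
three_way = retained_locks_after_failure(600, 3)
six_way = retained_locks_after_failure(600, 6)
print(three_way, six_way)
```

Under that even-spread assumption, doubling the member count halves the locks stranded by any single failure.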

Updated DB2 maintenance. Generally speaking, DB2 restart following an abnormal termination is faster in a data sharing environment than for a standalone subsystem. (Updated pages in the buffer pool of a data sharing member tend to be externalized — to the group buffer pool in the coupling facility — at commit time.) We wanted that restart time to be faster still. Working with us and reviewing diagnostic information from some of our DB2 restart events, the IBM developers came up with a code enhancement. We applied this enhancement (APAR PQ66444) to our systems and saw DB2 restart times reduced by 75 percent.

Optimized application locking. One of our high-use transactions was fairly long-running and included a DB2 data update operation near the beginning of the unit of work. The transaction caused some long-duration X-type page locks to be held on a key table in the database, leading to more retained locks in the event of an abnormal DB2 termination. Our programmers were able to put that up-front DB2 update operation in a separate, very quick transaction, helping to significantly reduce the impact of retained locks on the continued operation of the data sharing system following a component failure.

These and other changes enabled us to more fully leverage our DB2 data sharing group as a defense against unplanned outages. More recently, we've focused on reducing our need for planned system outages.

Maintenance Without the Window

We have a regularly scheduled maintenance window — a time each week during which we can shut down our primary application system. During this window, we can upgrade or repair server and disk hardware, apply operating system and subsystem software fixes, make database changes, implement new application code, and perform other upkeep operations.

Although we like this window, we're getting ready to kiss it goodbye in its current form. In fact, we've already started the weaning process by significantly reducing the duration of each maintenance window, and we're preparing for further reductions in scheduled downtime. To that end, we're planning to exploit the capability of our DB2 data sharing group to support "online" system maintenance procedures.

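Key to success with online maintenance is the use of what we call a "rolling maintenance" technique, in which changes are applied to one member of the data sharing group at a time while the remaining members continue to process the workload.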

In this way, we can execute all manner of hardware and software changes on the back-end system with no interruption in application workload processing. The approach works because all the members of the DB2 data sharing group have access to all the data in the database. There's no need for a transaction or job to run on any one node of the system.
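One way to picture a rolling procedure is as a loop over the members of the group. The sketch below is an assumption about the general shape of such a procedure, not CheckFree's actual runbook: the member names are hypothetical, and "quiesce" and "restart" stand in for whatever workload-routing and restart steps a site really uses.

```python
from typing import Callable, List


def rolling_maintenance(members: List[str],
                        apply_change: Callable[[str], None]) -> List[str]:
    """Apply a change to one member at a time and return an event log.

    Hypothetical model: the strings logged here are placeholders for
    real operational steps.
    """
    if len(members) < 2:
        raise ValueError("need at least two members to stay online")
    active = set(members)
    events = []
    for member in members:
        active.remove(member)      # route work to the other members
        events.append(f"quiesce {member}; active={sorted(active)}")
        apply_change(member)       # hardware or software maintenance
        active.add(member)         # member rejoins the group
        events.append(f"restart {member}")
    return events


patched = []
log = rolling_maintenance(["DB2A", "DB2B", "DB2C"], patched.append)
```

At every point in the loop at least one member stays active, which is the property that keeps the application workload running.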

As useful as data sharing is, it's not the only DB2 technology we've used to reduce our need for scheduled downtime.

Getting Organized on the Fly

For many of our largest and most frequently accessed mainframe DB2 tables, a continuously ascending clustering key wouldn't be a good choice. The clustering keys that we typically use for these tables are such that new rows tend to be inserted into the middle of a table, as opposed to being appended to the end. When there's no hole (empty space) that would allow a new row to be inserted in or near the page where it should go (as indicated by the clustering index), DB2 puts the row where space is available. The resulting disorganization of the tablespace can negatively impact application performance; that's why periodic tablespace reorganization (REORG) operations are necessary. Reorganizing a tablespace used to mean taking it offline. Then online REORG arrived.

Introduced with DB2 for OS/390 v.5, online REORG works by first creating "shadow" copies of the data sets associated with the objects to be reorganized (tablespaces, or partitions of tablespaces, and indexes). Next, the contents of the "original" objects are copied to the shadow data sets. If the REORG is running in "share level change" mode (that is, with programs allowed to update the objects as they're being reorganized), the originals might change as this copy step is being carried out. The utility retrieves any such changes from the DB2 log and applies them to the shadow data sets. The acquisition and application of logged updates to the original data sets continues in an iterative fashion until the currency of data in the shadow data sets is very close to that of the data in the original data sets. At that point, updates are held up briefly while a final original object/shadow object synchronizing operation is performed. Then the shadow data sets are switched with the originals, and the job is done.
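The copy, catch-up, and switch sequence described above can be modeled with a toy simulation. This is a sketch of the idea only, not of DB2 internals: dictionaries stand in for data sets, tuples stand in for log records, and the `close_enough` threshold is a hypothetical stand-in for the utility's decision that the shadow is nearly current.

```python
from typing import Dict, List, Tuple

Change = Tuple[int, str]  # (row key, new value), a stand-in for a log record


def online_reorg(original: Dict[int, str],
                 log_batches: List[List[Change]],
                 close_enough: int = 1) -> Dict[int, str]:
    """Toy model of a share-level-change reorganization.

    original: rows in arbitrary (disorganized) order.
    log_batches: updates that arrive while the utility runs.
    close_enough: backlog size at which the final sync is attempted.
    """
    if close_enough < 0:
        raise ValueError("close_enough must be zero or more")

    # 1. Copy the original into a shadow, in clustering-key order.
    shadow = {key: original[key] for key in sorted(original)}

    # Updates that arrived during the copy step are now waiting in the log.
    batches = iter(log_batches)
    pending = next(batches, [])
    for key, value in pending:
        original[key] = value

    # 2. Apply logged changes iteratively; more arrive during each pass.
    while len(pending) > close_enough:
        to_apply, pending = pending, next(batches, [])
        for key, value in pending:
            original[key] = value      # applications keep updating
        for key, value in to_apply:
            shadow[key] = value        # utility catches the shadow up

    # 3. Updates are held briefly while the last few records are applied.
    for key, value in pending:
        shadow[key] = value

    # 4. Switch: the shadow takes the place of the original.
    return shadow


table = {30: "c", 10: "a", 20: "b"}
updates = [[(10, "a1"), (20, "b1"), (30, "c1")],
           [(20, "b2"), (10, "a2")],
           [(30, "c2")]]
reorganized = online_reorg(table, updates)
```

In this run the backlog shrinks from three records to two to one, at which point the final synchronization and switch take place.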

We use the DB2 online REORG utility for virtually all our tablespace and index reorganizations, even those performed during a maintenance window. Why use online REORG when online access isn't required during the reorganization process? There are two reasons. First, online REORG uses about the same amount of CPU time as an offline REORG. Second, if you have to terminate an online REORG before it completes, no recovery process is required because the utility hasn't changed the original tablespace and indexes.

In shifting from offline to online REORG jobs, we had to deal with a couple of issues. One had to do with the SWITCH phase of the utility, during which the shadow copies of the target objects are switched for the originals. For this to occur, the REORG job has to acquire what's called a drain lock on the original data sets. The actual type of lock is a drain all, which can't be acquired until all other processes release the read and write claims they hold on the data sets. Claims are released at commit time, and early on we had some online REORG jobs time out during the SWITCH phase because a long-running job wouldn't release its read claims on the original data sets. To eliminate this problem, we had to convey to our developers the importance of having even read-only programs issue frequent commits. The developers got the message, and the necessary commit logic was added to read-only application processes.
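The commit logic in question can be as simple as a counter in the read loop. The sketch below is generic and hypothetical: `handle` and `commit` are caller-supplied callables, the interval of 1,000 rows is an arbitrary illustration, and the question of how a program keeps or re-establishes its cursor position across commits is left out.

```python
from typing import Callable, Iterable


def read_with_commits(rows: Iterable,
                      handle: Callable[[object], None],
                      commit: Callable[[], None],
                      commit_every: int = 1000) -> int:
    """Process rows, committing periodically; return the number of commits."""
    if commit_every < 1:
        raise ValueError("commit_every must be at least 1")
    commits = 0
    since_commit = 0
    for row in rows:
        handle(row)
        since_commit += 1
        if since_commit >= commit_every:
            commit()               # in DB2 terms, this is where claims are released
            commits += 1
            since_commit = 0
    if since_commit:
        commit()                   # final commit for the partial batch
        commits += 1
    return commits


seen = []
commit_calls = []
total_commits = read_with_commits(range(2500), seen.append,
                                  lambda: commit_calls.append(len(seen)))
```

The longest stretch without a commit is bounded by `commit_every` rows, so a utility waiting on a drain is not held up for the life of the whole job.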

The second challenge presented by online REORGs was related to another phase of the utility, called BUILD2. This phase occurs only when online REORG is operating on a subset of the partitions of a partitioned tablespace on which at least one nonpartitioning index (NPI) is defined. During BUILD2, the index entries in the NPIs are updated with the correct RID values — a process made necessary by the fact that rows in the reorganized partitions have changed position as a result of the online REORG operation. While the BUILD2 phase of online REORG is running, programs can't access the affected logical partitions of the NPIs, and this situation can have the effect of blocking access to the data in the partitions being reorganized. The elapsed time of the BUILD2 phase of online REORG isn't typically very lengthy, but it's often not brief enough for us (our DB2 timeout value is only 20 seconds). BUILD2 can cause application processes on our system to time out — and that's not good.

To avoid BUILD2-related timeouts, we at first tried to do without NPIs on some of our partitioned tablespaces. Later, we hit on a better idea: let the BUILD2 phase take place during the maintenance window. (I mentioned that our window is getting smaller, but it still exists and we only need a little bit of time for the BUILD2 phase.) How can we possibly get an online REORG job to complete, in essence, on cue? It's easy — we just specify MAXRO DEFER in the utility control statement. This specification tells DB2 to indefinitely delay the SWITCH phase of online REORG, which precedes BUILD2. DB2 continues to apply updates to the shadow data sets to keep them close to the original, currency-wise (something that consumes a relatively small amount of CPU time, especially if you're reorganizing one or two partitions in a 200-partition tablespace). Once we're into the maintenance window, we use the ALTER UTILITY command to change the MAXRO value to a very small number (a few seconds). Very soon after, the SWITCH phase occurs, BUILD2 follows, and — voila — a REORG completes with 99-plus percent of the work done outside the maintenance window.
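The deferral trick can be modeled as a small state machine. This is a sketch of the behavior described above, not DB2 code: the class, its method names, and the timing figures are all hypothetical, and `DEFER` is represented by `None`.

```python
from typing import Optional

DEFER = None  # stand-in for the MAXRO DEFER specification


class ReorgLogPhase:
    """Toy model of the log-apply phase of an online REORG."""

    def __init__(self, maxro: Optional[float]):
        self.maxro = maxro
        self.switched = False
        self.iterations = 0

    def alter_maxro(self, seconds: float) -> None:
        """Stand-in for changing MAXRO on a running utility."""
        self.maxro = seconds

    def log_iteration(self, estimated_final_seconds: float) -> bool:
        """Run one log-apply pass; switch if the final pass fits in MAXRO."""
        if self.switched:
            return True
        self.iterations += 1
        if self.maxro is not DEFER and estimated_final_seconds <= self.maxro:
            self.switched = True   # SWITCH (and then BUILD2) can proceed
        return self.switched


job = ReorgLogPhase(maxro=DEFER)
# Outside the window: the shadow stays nearly current, but no switch occurs.
outside_window = [job.log_iteration(2.0) for _ in range(50)]
# Inside the window: MAXRO is lowered to a few seconds and the switch follows.
job.alter_maxro(5.0)
in_window = job.log_iteration(2.0)
```

However many passes run while MAXRO is deferred, the switch happens on the first pass after the value is altered, provided the estimated final pass fits within it.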

We're looking forward to trying out the data-partitioned secondary indexes (DPSIs) introduced with DB2 for z/OS v.8. DPSIs should eliminate BUILD2 processing because they are secondary indexes that are partitioned in the same way as the data in the associated tablespace. Online REORG of a partition will result in the creation of shadow data sets for the corresponding DPSI partition as well as for the tablespace partition.

What's Storage Got To Do With It?

Another step we took to enhance the availability of our DB2 for z/OS environment was to implement 64-bit hardware addressing on our mainframe servers. You don't see the connection to DB2 availability? It's kind of an indirect link. 64-bit hardware addressing refers to a recent zSeries enhancement that allows a z/OS system to use more than 2GB of byte-addressable storage (more commonly known as central storage). Although this technology upgrade didn't change the size limit for a virtual storage address space (that remains at 2GB for DB2 for z/OS and OS/390 v.7), it did make it possible to have more address spaces active on a system without encountering a real storage paging problem. (In essence, it allowed for a horizontal expansion of virtual storage resources.) Support for more address spaces enables the exploitation of the buffer-pools-in-data-spaces feature of DB2 for z/OS (a data space is a data-only z/OS address space).

Moving buffer pools from the database services address space (often referred to as DBM1) to data spaces improved our DB2 availability situation in two ways.

First, it took a lot of pressure off the DBM1 address space. Previously, we'd run close to the limit of DBM1 virtual storage utilization, almost reaching the 2GB line. Although doing so helped application performance (largely through the I/O reduction possible with large buffer pools), it required us to monitor DB2 virtual storage allocation very carefully, as we had very little margin for error. Moving the DB2 buffer pools to data spaces freed up hundreds of megabytes of virtual storage in the DBM1 address space. That, in turn, improved our availability picture by reducing the risk of maxing out a critical DB2 resource.

Second, we used some (but definitely not all) of the freed-up DBM1 space to significantly increase the size of our DB2 EDM pool (EDM stands for Environmental Descriptor Manager). The larger EDM pool allowed us to bind many more DB2 packages with the RELEASE(DEALLOCATE) option. The combination of RELEASE(DEALLOCATE) and CICS-DB2 protected threads (a combination that drives up EDM pool space utilization) improves the CPU efficiency of a DB2 workload, especially in a data sharing system. (Data sharing lock contention can be substantially reduced this way.) Less CPU overhead leads to better throughput, and the quicker you can get units of work through the system, the fewer retained locks you're likely to have in the event of an abnormal DB2 termination. You'll recall that fewer retained locks means faster DB2 restart and a reduction in the impact of a system component failure on the processing of the application workload.

We're Not Done Yet

DB2 database availability is a very big deal for CheckFree, and we're always striving to raise that bar another notch. I used to be much more into performance tuning than maximizing uptime, but I've changed my tune over the years (older and wiser, perhaps). Judy Ruby-Brown, Ms. DB2 Recovery/Restart and a former colleague of mine at the IBM Dallas Systems Center, put it this way: "Performance may be more exciting than availability, but a DB2 system that's down isn't performing well at all."


Robert Catterall is a database technology strategist at Atlanta-based CheckFree Corp. You can reach him at rcatterall@checkfree.com.