Reprinted with Permission by Quest Software Sep.  2003


Microstate Response-time Performance Profiling, Part 1
Danisment Gazi Unal

Getting optimum performance for mission-critical Oracle systems is an extremely complex task. For senior Oracle DBAs, measuring internal services waits within an Oracle database is a critical aspect of advanced response-time profiling. This article is extremely complex because the internals of Oracle response-time are, by their nature, very complex. However, this article provides an excellent overview of the complex world of Oracle response-time, and a good starting point for those Oracle professionals who wish to become intimate with Oracle internal response-time mechanisms. This is not a trivial article; it may require several readings to fully understand the internal response-time mechanisms and the commands that are used to gather the response-time information, but it is worthwhile if your goal is to fully understand the complex interactions between Oracle and the operating system.

Goal

Solving performance-related problems requires an understanding of the measurement techniques. The accuracy of measurements is an important factor in all types of research. However, there are no 100-percent accurate measurements in Nature; there are always some distortions in the measurements. How can a method be accepted as an accurate method? The answer depends on the impact of measurement errors.

The goal of this article is to adapt microstate accounting, which is a more accurate measurement technique provided by OS (operating system) vendors, to database management systems (DBMSs).

Microstate Response-time Performance Profiling (MRRP) for Oracle is not a new performance modeling technique, but it is a new performance profiling technique for Oracle. It adapts microstate accounting to Oracle by using universal response time performance modeling.

In this article, Oracle and UNIX are used as the DBMS and OS, respectively. However, the concept can be easily adapted to other DBMS.

Measurement of Oracle Services and Waits in OS Level

Measurement of Oracle Services in OS Level

Oracle measures CPU usage by the CPU used by this session statistic. This is done via getrusage() or times() system calls depending on platforms. Exhibit 1 shows a sample output.

Exhibit 1. Test Case: strace Command Output on SuSELinux 7.2/Oracle 8.1.7

19:56:05 gettimeofday({1000781765, 366717}, NULL) = 0
19:56:05 getrusage(RUSAGE_SELF, {ru_utime={0, 80000}, ru_stime={0, 10000}, ...}) = 0
19:56:05 gettimeofday({1000781765, 367403}, NULL) = 0
Oracle uses the getrusage() system call to find CPU usage in user mode(ru_utime), and CPU usage in kernel mode or system mode(ru_stime).

Measurement of Oracle Waits in OS Level

There are three wait mechanisms in Oracle according to wait measurement techniques:

  1. Synchronous wait mechanism
  2. Asynchronous wait mechanism
  3. Timed out wait mechanism
Synchronous Wait Mechanism. In this mechanism, after wait is requested, Oracle process is immediately put into the OS wait-queue. Until wait-event is completed, process sleeps in OS wait-queue. When wait-event is completed, OS kernel posts the waiting process and makes it runnable.

Oracle uses gettimeofday() system call before and after wait-event. The time difference between them is updated in dictionary as wait-time of event. Exhibit 2 shows a sample output.

Exhibit 2. Test Case: strace Command Output on SuSELinux 7.2/Oracle 8.1.7

15:26:02.348478 gettimeofday({1003530362, 348525}, NULL) = 0
15:26:02.348564 gettimeofday({1003530362, 348588}, NULL) = 0
15:26:02.348659 pread(409,
"\6\2\0\0\240\27\200\0\325\227\7\0\0\0\2\0\0\0\0\0\1\0\7\0_\f\0\0\210I
\7\0\0\0\24P\2\6\3\0\30/\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\236\0\0\0
N\1\354\3R\6R\6\0\0\236\0\1\0\2\0\3\0\4\0\5\0"...,
2048, 12386304) = 2048
15:26:02.348995 gettimeofday({1003530362, 349035}, NULL) = 0
15:26:02.349174 gettimeofday({1003530362, 349209}, NULL) = 0
15:26:02.349260 write(6, "WAIT #1: nam=\'db file sequential read\'
ela= 0 p1=2 p2=6048 p3=1", 63) = 63
15:26:02.349341 write(6, "\n", 1)          = 1
Oracle gets time by gettimeofday() system call before and after pread() system call. The difference between them is added to wait-time of wait-event.

Some wait-events that use the synchronous wait mechanism include:

V$SESSION_EVENT.TOTAL_TIMEOUTS and V$SYSTEM_EVENT.TOTAL_TIMEOUTS are always 0 for this type of wait-event.

Asynchronous Wait Mechanism. Async I/O routines provide the ability to do real asynchronous I/O in an application. This is accomplished by allowing the calling process or thread to continue processing after issuing a read or write and receive notification either upon completion of the I/O operation or of an error condition that prevented the I/O from being completed.

Oracle uses the asynchronous wait mechanism in asynchronous I/O (AIO) operations. By this mechanism, process does not enter wait-queue immediately after the AIO request is submitted. So, while the AIO operation is continuing on disk, process continues its computations on CPU. That is, service and wait can be implemented in parallel.

From an Oracle perspective, the wait-time during AIO is the actual wait-time in user level, not in OS level. The process does not start waiting immediately; it starts waiting if there is nothing to do on CPU.

Here are some wait-events that use the asynchronous wait mechanism if AIO is available in both Oracle and OS:

Timed Out Wait Mechanism. In the timed-out mechanism, process does not have to wait for posting by OS kernel. Process wakes up after a specific timeout and checks if resource is available. If resource is available, process gets it; if not, process goes to sleep again. The wait-time is updated in each timeout.

Process timer information is written to OS timer queue, which includes process current timeout information. This timeout information is calculated/updated at each clock cycle by an OS-specific algorithm.

If there are a lot of timed-out based waits in a database, OS kernel CPU usage may become high because OS kernel calculates/updates OS timer queue.

V$SESSION_EVENT.TOTAL_TIMEOUTS and V$SYSTEM_EVENT.TOTAL_TIMEOUTS are incremented in each timeout.

There are two mechanisms to implement timeouts:

  1. FIFO ordered timeouts
  2. Non-ordered timeouts
The FIFO (first-in, first-out) ordered timeout mechanism is used by wait-events that need serializations. This mechanism uses the semaphore facilities of OS. Semaphore operations include hardware-based atomic instructions to protect critical sections in source code, and processes are queued in FIFO order in semaphore operations.

Before waiting, process sets OS alarm clock (normally three seconds, but changes according to wait events) using setitimer() system call. When timeout occurs, a SIGALRM signal is delivered by OS to waiting process. If a SIGALRM signal is received by a waiting process and resource is not still available, OS alarm clock is set again for that process, so timer is restarted; if resource is available, timer is switched off using setitimer() system call and process gets resource.

FIFO ordered timeouts are postable. That means waiting process can be posted by OS kernel before timeout occurs. This mechanism has the pseudo code shown in Exhibit 3.

Exhibit 3. Pseudo Code for FIFO Ordered Timeouts Wait Mechanism

while (resource_not_available)
  {
    setitimer();
    if (semop()==0) break;  // if TRUE resource is available,
                            // otherwise sleeps by using semaphore
    if (SIGALRM)            // signal handling
      {
      if (resource_is_available) break;
      }
  }
Here are the system calls of a process while waiting for locked resources in enqueue (TX) wait-event. As seen in Exhibit 4, there is a loop running around each three seconds.

Exhibit 4. Test Case: strace Command Output on SuSELinux 7.2/Oracle 8.1.7

19:56:00 --- SIGALRM (Alarm clock) ---
19:56:00 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
19:56:00 gettimeofday({1000781760, 944916}, NULL) = 0
19:56:00 rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
19:56:00 rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0
19:56:00 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
19:56:00 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
19:56:00 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
19:56:00 gettimeofday({1000781760, 945344}, NULL) = 0
19:56:00 gettimeofday({1000781760, 945446}, NULL) = 0
19:56:00 rt_sigprocmask(SIG_BLOCK, NULL, [RT_0], 8) = 0
19:56:00 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
19:56:00 gettimeofday({1000781760, 945649}, NULL) = 0
19:56:00 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={3, 70000}}, NULL) = 0
19:56:00 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
19:56:00 semop(98304, 0xbfffa8f4, 1)   = -1 EINTR (Interrupted system call)
19:56:04 --- SIGALRM (Alarm clock) ---
19:56:04 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
19:56:04 gettimeofday({1000781764, 14912}, NULL) = 0
19:56:04 rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
19:56:04 rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0
19:56:04 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
19:56:04 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
19:56:04 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
19:56:04 gettimeofday({1000781764, 15410}, NULL) = 0
19:56:04 gettimeofday({1000781764, 15498}, NULL) = 0
19:56:04 rt_sigprocmask(SIG_BLOCK, NULL, [RT_0], 8) = 0
19:56:04 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
19:56:04 gettimeofday({1000781764, 15686}, NULL) = 0
19:56:04 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={3, 70000}}, NULL) = 0
19:56:04 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
19:56:04 semop(98304, 0xbfffa8f4, 1)   = 0
Some wait-events that use FIFO ordered timeouts wait mechanism include: A non-ordered timeout mechanism is used by waitevent which requires timeout mechanism without an order. A latch free wait event uses non-ordered timeouts. Because latch operations do not need to be done in FIFO order, there is no need for a semaphore queue for latch free wait-event. Atomic instructions such as test-and-set, compare-and-swap, etc. are sufficient, instead of expensive semaphore operations. That is why Oracle does not use semaphores in latch operations.

Exhibit 5 lists some assembly codes. The function sskgslcas ( ) includes a compare-and-swap instruction called lock cmpxchg.

Exhibit 5. Test Case: objdump -d Command Output of oracle Executable on SuSELinux 7.2 (x86)/Oracle 8.1.7

08c2c272 <sskgslcas>:
8c2c272:      8b 4c 24 04             mov           0x4(%esp,1),%ecx
8c2c276:      8b 44 24 08             mov           0x8(%esp,1),%eax
8c2c27a:      8b 54 24 0c             mov           0xc(%esp,1),%edx
8c2c27e:      f0 0f b1 11             lock  cmpxchg %edx,(%ecx)
8c2c282:      75 06                   jne             8c2c28a <.failed_cmp>
8c2c284:      b8 01 00 00 00          mov             $0x1,%eax
8c2c289:      c3                      ret
The lock prefix invokes a locked (atomic) read-modify-write operation when modifying a memory operand.

If a process tries to get a latch in willing-to-wait mode, it spins CPU by _SPIN_COUNT times on multi-processor systems. If _SPIN_COUNT is greater than 1 on singleprocessor machines, it is ignored.

If a process cannot get latch after spinning, it sleeps by using select() system call. The select() system call can measure timeout values down to microseconds.

As seen in Exhibit 6, Oracle does not use file descriptors in select() to check if there is available data for file descriptors. Just the <timeout> value is used. By using select() system call in this way, select() system call causes sleep until <timeout> occurs unless process is interrupted by a signal.

Exhibit 6. Test Case: strace Command Output on SuSELinux 7.2/Oracle 8.1.7

16:56:07.107143 gettimeofday({1003708567, 107165}, NULL) = 0
16:56:07.107199 gettimeofday({1003708567, 107221}, NULL) = 0
16:56:07.107251 gettimeofday({1003708567, 107271}, NULL) = 0
16:56:07.107525 gettimeofday({1003708567, 107585}, NULL) = 0
16:56:07.107698 select(0, [], [], [], {0, 10001}) = 0 (Timeout)
16:56:07.172634 gettimeofday({1003708567, 172688}, NULL) = 0
16:56:07.172787 gettimeofday({1003708567, 172815}, NULL) = 0
16:56:07.172866 write(6, "WAIT #1: nam=\'latch free\' ela= 7 p1=1343645620 p2=66 p3=0", 57) = 57
16:56:07.172955 write(6, "\n", 1) = 1
There are two mechanisms to implement non-ordered timeouts:
  1. Postable non-ordered timeouts
  2. Non-postable non-ordered timeouts
If _LATCH_WAIT_POSTING=1, which is the default, waiting process can be interrupted for library cache and shared pool latches only, and gets these latches. If _LATCH_WAIT_POSTING>1, waiting process can be interrupted for all latches, and gets latches. If _LATCH_WAIT_POSTING=0, process that is waiting for latch is never posted, even if latch is free.

Measurement Errors

There are several types of measurement errors, including OS measurement errors, Oracle measurement errors, and user measurement errors. This section discusses performance-related measurement errors on Oracle by examining OS, Oracle, and user measurement errors.

Service Measurement Errors

Active Service Measurement Error. active service is the service of a process while it is running on CPU.

OS updates CPU usage of running process in each clock tick. This update is not the time spent by process on CPU; rather, it is just a clock tick. The clock tick value of a process on CPU is incremented by one in each clock tick. Whether or not running on CPU is not important. If process is on CPU but waits for an event such as fetching from memory, then clock tick is assigned to process.

When Oracle uses getrusages() system call, getrusages() converts tick numbers to time by dividing tick numbers by HZ value of OS. A typical HZ value is 100 ticks per second, which means a typical clock tick rate is 10 milliseconds (ms) per tick.

This method causes some measurement errors called active service measurement errors. Exhibit 7 shows the measurement error samples.

Exhibit 7. CPU Measurement Errors

Measurement errors include: The error is minimal at high usage levels, but ranges up to 80 percent or more at low levels. The problem is that usage is underreported, and the range of error increases on faster CPUs. At a real usage level of 5 percent busy, one will often see vmstat reporting that the system is only 1 percent busy.

Active Wait Measurement Errors. When a process runs in OS perspective, but it seems like a wait in user perspective, that process is in active wait. There are two types of active waits in Oracle:

  1. Latch-spinning
  2. Wait-loop
Because the time spent for these operations is included in service time, it may cause measurement errors. This type of error is called an active wait measurement error.

Spinning time is included in the CPU used by this session statistic. That is, although process is waiting in user perspective, the time spent in spinning is included in both OS perspective and Oracle perspective. This means that latch-spinnings are seen as if running.

Oracle wait-events wake up in a loop for wait-events that implement timed out wait mechanism until they get resources. This loop occupies CPU.

Exhibit 8 shows a sample enqueue (TX) wait-event of DELETE operation. Most CPU usage in a wait-loop is done in kernel mode. There is little time spent in user mode (see Exhibit 9).

Exhibit 8. Test Case: SuSELinux 7.2/Oracle 8.1.7 and strace Command for OS Side

TIME STATISTICS BEFORE DELETE OPERATION:

In Oracle side:

SID        NAME                                           VALUE
---------- ---------------------------------------------- ----------
         9 CPU used when call started                              1
         9 CPU used by this session                                1
         9 OS User time used                                       5
         9 OS System time used                                     1
4 rows selected.

In OS side:

21:19:57.303910 getrusage(RUSAGE_SELF, {ru_utime={0, 50000}, ru_stime={0, 10000}, ...}) = 0

A SNAPSHOT OF TIME STATISTICS IN WAIT-LOOP:

In Oracle side:

SID        NAME                                          VALUE
---------- --------------------------------------------- ----------
         9 CPU used when call started                             1
         9 CPU used by this session                               1
         9 OS User time used                                     15
         9 OS System time used                                  114
4 rows selected.

In OS side:

22:36:18.468218 getrusage(RUSAGE_SELF, {ru_utime={0, 150000}, ru_stime={1, 140000}, ...}) = 0
Exhibit 9. Time Statistics after DELETE Operation

In Oracle side:

SID        NAME                                          VALUE
---------- --------------------------------------------- ----------
         9 CPU used when call started                             1
         9 CPU used by this session                             124
         9 OS User time used                                     15
         9 OS System time used                                  114
4 rows selected.

In OS side:

22:36:33.101442 getrusage(RUSAGE_SELF, {ru_utime={0, 150000}, ru_stime={1, 140000}, ...}) = 0
CPU used by this session is updated after DELETE statement is completed, and this statistic also includes time spent in CPU during wait-loop. That is, CPU used by this session includes CPU usage although process is waiting. Exhibit 10 shows a sample enqueue wait-event of the DELETE operation. As seen, the wait-loop impact increases while enqueued sessions are increased. Here are the interpretations for the following timed terms: Exhibit 10. Test Case: 766 MHz SuSELinux 7.2/Oracle 8.1.7

Total sessions: 10
Enqueued sessions: 1
Vmstat:

procs              memory   swap     io    system       cpu
r b w swpd  free buff cache si so bi bo  in    cs us sy  id
0 0 0 30664 9628 1164 78580  0  0  0  8 103    84  0  0 100

Wait-loop impact per session: 1.670 ms

Total sessions: 60
Enqueued sessions: 51
Vmstat:

procs              memory   swap     io   system      cpu
r b w swpd  free buff cache si so bi bo in    cs us sy id
0 0 0 57240 1396  520 45552  0  0  0 4 102   106  0  1 99

Wait-loop impact per session: 1.688 ms

Total sessions: 210
Enqueued sessions: 202
Vmstat:

procs               memory   swap     io    system      cpu
r b w swpd   free buff cache si so bi bo  in    cs us sy id
0 0 1 248948 1504  476 37776  0  0  0 12 110   187  1  5 94

Wait-loop impact per session: 1.752 ms
Wait-loop elapsed time shows the approximate time spent in wait-loops. It includes time spent of all possible events in wait-loops, such as CPU usage, OS activities such page faults, context switches, etc.

Exhibit 11 shows a wait-loop that has average values. The first instruction starts at 14:38:24.736238, and the last instruction (semop) starts at 14:38:24.737990. The time difference between them is 0.001752 seconds, equal to 1.752 ms wall-clock time. This is an average value and does not include time spent in the semop() system call. So, the average wait-loop elapsed time is 1.752 ms. 

Exhibit 11. Wait-Loop System Calls

14:38:24.736238 --- SIGALRM (Alarm clock) ---
14:38:24.736782 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
14:38:24.736937 gettimeofday({944433504, 736967}, NULL) = 0
14:38:24.737007 rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
14:38:24.737091 rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0
14:38:24.737183 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
14:38:24.737269 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
14:38:24.737341 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
14:38:24.737428 gettimeofday({944433504, 737452}, NULL) = 0
14:38:24.737531 gettimeofday({944433504, 737562}, NULL) = 0
14:38:24.737610 rt_sigprocmask(SIG_BLOCK, NULL, [RT_0], 8) = 0
14:38:24.737699 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
14:38:24.737779 gettimeofday({944433504, 737806}, NULL) = 0
14:38:24.737844 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={3, 70000}}, NULL) = 0
14:38:24.737912 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
14:38:24.737990 semop(98304, 0xbfffa5c4, 1) = -1 EINTR (Interrupted system call)
Exhibit 12. CPU Usage Summary of Wait-Loop System Calls
oracle@linux:~/test5 > strace -p 714 -tt -c
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 52.82    0.000234           2       109           rt_sigprocmask
 22.57    0.000100           2        55           gettimeofday
 21.67    0.000096           3        28           setitimer
  2.93    0.000013           1        13        13 semop
------ ----------- ----------- --------- --------- ----------------
100.00    0.000443                   205        13 total
Exhibit 12 shows a CPU usage summary of wait-loop system calls that has average values. There are 13 semop() system calls. Because each loop includes one semop() system call, there are 13 loops in this sample. It was found that the kernel mode is predominantly used in wait-loops. So, the time spent in system calls can be approximately accepted as the average wait-loop CPU time. Thus, the average wait-loop CPU time = 0.000443/13 = 0.034 ms.
Wait-loop utilization = 100* (average wait-loop
  CPU usage) / (average wait-loop elapsed time)
    = 100 * (0,034/1.752)
    = 1.94%
The wait-loop utilization is too low. This means that most of the time in the wait-loop was spent in OS-specific tasks such as page faults, context switches, etc. These OS tasks were recorded to enqueue wait-event by Oracle.

The wait-loop impact may be small in session level. However, it becomes larger when the number of sessions increases. For example:

wait-loop impact in system level = (wait-loop
  impact per session) * (average timed out
  sessions)
  
    = 1.752 ms * (202)
    = 353,904 ms
    = 0.354 sec
Because enqueue timeout normally occurs every three seconds in this test case, there will be 0.354 seconds total wait-loop impact every three seconds. This means that the total wait-loop impact will be increased every three seconds. Because time spent in wait-loops is recorded as relevant Oracle wait-events, it distorts Oracle wait-events.

Wait Measurement Errors

Inactive Wait Measurement Errors. If a running process is stopped by Oracle process itself, its called process is in inactive wait. Sleeping during inactive waits is controlled by Oracle. That is, inactive waits are Oracle wait-events.

If a process completes its waiting or wait-timeout occurs, it does not get the CPU immediately. It is put into the ready (runnable) queue first. Then, when a CPU is assigned, the process is determined by CPU Scheduler. Operations from taking the process from wait-queue to assigning the CPU to a process are done by OS kernel.

When an Oracle process gets CPU, it updates its waittime. This wait-time includes wait-time in wait-queue, wait-time in ready (runnable) queue, preemption latency time, context switch time, and any operations such as swapping, paging etc. during Oracle waits. This means there are some other operations included in Oracle waits. Time spent in these non-Oracle operations are called inactive wait measurement errors. Inactive wait measurement errors make Oracle waits distorted and overvalued.

Exhibit 13 shows a sample measurement error. An Oracle process slept 1 centisecond (10,001 microseconds) on wait-queue by select() system call. After getting CPU, process measured elapsed time by gettimeofday() system call (7 centiseconds) and updated the latch sleep time by 7 centiseconds. However, this is the total time and includes latch-sleep time in wait-queue, wait-time in the ready (runnable) queue, preemption latency time, and context switch time. In fact, the time spent in latch-sleep was 1 centisecond. Other rounded 6 centiseconds were caused by other components of the system, not by Oracle latch.

Exhibit 13. Test Case: strace Command Output on SuSELinux 7.2/Oracle 8.1.7

16:56:07.107143 gettimeofday({1003708567, 107165}, NULL) = 0
16:56:07.107199 gettimeofday({1003708567, 107221}, NULL) = 0
16:56:07.107251 gettimeofday({1003708567, 107271}, NULL) = 0
16:56:07.107525 gettimeofday({1003708567, 107585}, NULL) = 0
16:56:07.107698 select(0, [], [], [], {0, 10001}) = 0 (Timeout)
16:56:07.172634 gettimeofday({1003708567, 172688}, NULL) = 0
16:56:07.172787 gettimeofday({1003708567, 172815}, NULL) = 0
16:56:07.172866 write(6, "WAIT #1: nam=\'latch free\' ela= 7 p1=1343645620 p2=66 p3=0", 57) = 57
16:56:07.172955 write(6, "\n", 1)       = 1
If there are bottlenecks in CPUs, the number of processes in ready (runnable) queue increases. This manifests as if problems are in Oracle wait-events.

Inactive Service Measurement Errors. If a running process is stopped by OS kernel, its called process is in inactive service. A running process can be stopped by OS for preempting, swapping, paging, etc.

Sleeping during inactive services are not controlled by Oracle. Oracle processes do not know this state. Time spent in inactive services is called an inactive service measurement error. This error is not included in Oracle waits. In other words, it is missed.

Recurring Waits. If a wait-event occurs for the same resource in multiple sessions at the same time,it is called a recurring wait. Adding up the time spent in each wait-event to total wait-time may give distorted results.5 For example, let a log switch completion take 2 seconds. If five processes wait for log file switch completion wait-event at the same time through these 2 seconds, the total time spent in the log file switch completion wait-event will be 5 × 2 = 10 seconds. But, in fact, it takes 2 seconds in system.

The log file switch completion wait-event is a recurring wait-event because it occurs one time in system level, but recurrs multiple times in session level.

Exhibit 14 shows the recurring wait measurement errors. While the number of sessions increases, cumulative measurements may be distorted. Average waits per session may not give correct measurements because wait-times among sessions may not be distributed uniformly. That is, the reponse time performance modeling is not a proper model in system level. Also, a system should be a serial system for response time calculation; that is, the waittime should not include service time and service time should not include wait-time. Both service time and waittime must be separate.

Exhibit 14. Recurring Waits Measurement Errors

Microstate Accounting

Microstate accounting, on the other hand, takes a highresolution timestamp on every state change, every system call, every page fault, and every scheduler change.

The prusage structure of SunOS 5.5 (Exhibit 15) shows the resource usage structure used in microstate accounting. As seen in this structure, more detailed and accurate time information is measured than that provided by getrusage(), times(), and gettimeofday() system calls. By this mechanism it is possible to see the time spent in several system components.

Exhibit 15. prusage Structure of SunOS 5.5

typedef struct prusage {
    id_t        pr_lwpid;    /* lwp id. 0: process or defunct */
    u_long      pr_count;    /* number of contributing lwps */
    timestruc_t pr_tstamp;   /* current time stamp */
    timestruc_t pr_create;   /* process/lwp creation time stamp */
    timestruc_t pr_term;     /* process/lwp termination time stamp */
    timestruc_t pr_rtime;    /* total lwp real (elapsed) time */

    timestruc_t pr_utime;    /* user level CPU time */
    timestruc_t pr_stime;    /* system call CPU time */
    timestruc_t pr_ttime;    /* other system trap CPU time */
    timestruc_t pr_tftime;   /* text page fault sleep time */
    timestruc_t pr_dftime;   /* data page fault sleep time */
    timestruc_t pr_kftime;   /* kernel page fault sleep time */
    timestruc_t pr_ltime;    /* user lock wait sleep time */
    timestruc_t pr_slptime;  /* all other sleep time */
    timestruc_t pr_wtime;    /* wait-cpu (latency) time */

    timestruc_t pr_stoptime; /* stopped time */
    u_long      pr_minf;     /* minor page faults */
    u_long      pr_majf;     /* major page faults */
    u_long      pr_nswap;    /* swaps */
    u_long      pr_inblk;    /* input blocks */
    u_long      pr_oublk;    /* output blocks */
    u_long      pr_msnd;     /* messages sent */
    u_long      pr_mrcv;     /* messages received */
    u_long      pr_sigs;     /* signals received */
    u_long      pr_vctx;     /* voluntary context switches */
    u_long      pr_ictx;     /* involuntary context switches */
    u_long      pr_sysc;     /* system calls */
    u_long      pr_ioch;     /* chars read and written */
} prusage_t;
Oracle and Microstate Accounting

If the OS does not support microstate accounting, Oracle cannot support it. Exhibit 16 shows some system calls generated by the Oracle process when TIMED_OS_ STATISTICS is set. The ioctl() calls are used with the parameter of PIOCUSAGE. If TIMED_OS_STATISTICS is set 0, ioctl() disappears.

Exhibit 16. Test Case: truss Command Output on Solaris 8/Oracle 9.0

$ truss -p 907
read(8, 0x024F28D6, 2064) (sleeping...)
read(8, "\0BC\0\006\0\0\0\0\011 i".., 2064) = 188
times(0xFFBED5C8) = 855271
ioctl(10, PIOCUSAGE, 0xFFBED18C) = 0
ioctl(10, PIOCSTATUS, 0xFFBED28C) = 0
times(0xFFBED560) = 855271
times(0xFFBED5C8) = 855271
times(0xFFBED420) = 855271
Applying PIOCUSAGE to a process that does not have microstate accounting enabled will enable microstate accounting and return an estimate of time spent in the various states up to this point. Further invocations of PIOCUSAGE will yield accurate microstate time accounting from this point.

This means that Oracle uses microstate accounting for the following timed OS statistics for this test case: