|
|
Getting optimum performance for mission-critical Oracle systems is an extremely complex
task. For senior Oracle DBAs, measuring internal services waits within an Oracle database is
a critical aspect of advanced response-time profiling. This article is extremely complex
because the internals of Oracle response-time are, by their nature, very complex. However,
this article provides an excellent overview of the complex world of Oracle response-time, and
a good starting point for those Oracle professionals who wish to become intimate with Oracle
internal response-time mechanisms. This is not a trivial article; it may require several readings
to fully understand the internal response-time mechanisms and the commands that are
used to gather the response-time information, but it is worthwhile if your goal is to fully
understand the complex interactions between Oracle and the operating system.
Goal
Solving performance-related problems requires an understanding of the measurement techniques.
The accuracy of measurements is an important factor in all types of research. However,
there are no 100-percent accurate measurements in Nature; there are always some distortions in the measurements. How can a method be accepted as an accurate method? The
answer depends on the impact of measurement errors.
The goal of this article is to adapt microstate accounting, which is a more accurate measurement
technique provided by OS (operating system) vendors, to database management systems
(DBMSs).
Microstate Response-time Performance Profiling (MRRP) for Oracle is not a new performance
modeling technique, but it is a new performance profiling technique for Oracle. It
adapts microstate accounting to Oracle by using universal response time performance modeling.
In this article, Oracle and UNIX are used as the DBMS and OS, respectively. However, the
concept can be easily adapted to other DBMS.
Measurement of Oracle Services and Waits in OS Level
Measurement of Oracle Services in OS Level
Oracle measures CPU usage by the CPU used by this session statistic. This is done via getrusage() or
times() system calls depending on platforms. Exhibit 1 shows a sample output.
Exhibit 1. Test Case: strace Command Output on SuSELinux 7.2/Oracle 8.1.7
Measurement of Oracle Waits in OS Level
There are three wait mechanisms in Oracle according to
wait measurement techniques:
Oracle uses gettimeofday() system call before and after wait-event. The time difference between them is
updated in dictionary as wait-time of event. Exhibit 2 shows a sample output.
Exhibit 2. Test Case: strace Command Output on SuSELinux 7.2/Oracle 8.1.7
Some wait-events that use the synchronous wait mechanism include:
Asynchronous Wait Mechanism. Async I/O routines provide the ability to do real asynchronous I/O
in an application. This is accomplished by allowing the calling process or thread to continue
processing after issuing a read or write and receive notification either upon completion of the I/O
operation or of an error condition that prevented the I/O from being completed.
Oracle uses the asynchronous wait mechanism in asynchronous I/O (AIO) operations. By this mechanism,
process does not enter wait-queue immediately after the AIO request is submitted. So, while the AIO
operation is continuing on disk, process continues its computations on CPU. That is, service and wait
can be implemented in parallel.
From an Oracle perspective, the wait-time during AIO is the actual wait-time in user level, not in
OS level. The process does not start waiting immediately; it starts waiting if there is nothing to do
on CPU.
Here are some wait-events that use the asynchronous wait mechanism if AIO is available in both Oracle
and OS:
Process timer information is written to OS timer queue, which includes process current timeout
information. This timeout information is calculated/updated at each clock cycle by an OS-specific
algorithm.
If there are a lot of timed-out based waits in a database, OS kernel CPU usage may become high because
OS kernel calculates/updates OS timer queue.
V$SESSION_EVENT.TOTAL_TIMEOUTS and V$SYSTEM_EVENT.TOTAL_TIMEOUTS are incremented in each timeout.
There are two mechanisms to implement timeouts:
Before waiting, process sets OS alarm clock (normally three seconds, but changes according to wait
events) using setitimer() system call. When timeout occurs, a SIGALRM signal is delivered by OS to
waiting process. If a SIGALRM signal is received by a waiting process and resource is not still
available, OS alarm clock is set again for that process, so timer is restarted; if resource is
available, timer is switched off using setitimer() system call and process gets resource.
FIFO ordered timeouts are postable. That means waiting process can be posted by OS kernel before
timeout occurs. This mechanism has the pseudo code shown in Exhibit 3.
Exhibit 3. Pseudo Code for FIFO Ordered Timeouts Wait Mechanism
Exhibit 4. Test Case: strace Command Output on SuSELinux 7.2/Oracle 8.1.7
Exhibit 5 lists some assembly codes. The function sskgslcas ( ) includes a compare-and-swap instruction
called lock cmpxchg.
Exhibit 5. Test Case: objdump -d Command Output of oracle Executable on SuSELinux 7.2
(x86)/Oracle 8.1.7
If a process tries to get a latch in willing-to-wait mode, it spins CPU by _SPIN_COUNT times on
multi-processor systems. If _SPIN_COUNT is greater than 1 on singleprocessor machines, it is ignored.
If a process cannot get latch after spinning, it sleeps by using select() system call. The select()
system call can measure timeout values down to microseconds.
As seen in Exhibit 6, Oracle does not use file descriptors in select() to check if there is available
data for file descriptors. Just the <timeout> value is used. By using select() system
call in this way, select() system call causes sleep until
<timeout> occurs unless process is interrupted by a signal.
Exhibit 6. Test Case: strace Command Output on SuSELinux 7.2/Oracle 8.1.7
Measurement Errors
There are several types of measurement errors, including OS measurement errors, Oracle measurement
errors, and user measurement errors. This section discusses performance-related measurement errors on
Oracle by examining OS, Oracle, and user measurement errors.
Service Measurement Errors
Active Service Measurement Error. active service is the service of a process while it is
running on CPU.
OS updates CPU usage of running process in each clock tick. This update is not the time spent by
process on CPU; rather, it is just a clock tick. The clock tick value of a process on CPU is incremented
by one in each clock tick. Whether or not running on CPU is not important. If process is on CPU but
waits for an event such as fetching from memory, then clock tick is assigned to process.
When Oracle uses getrusages() system call, getrusages() converts tick numbers to time by dividing tick
numbers by HZ value of OS. A typical HZ value is 100 ticks per second, which means a typical clock
tick rate is 10 milliseconds (ms) per tick.
This method causes some measurement errors called active service measurement errors. Exhibit 7 shows
the measurement error samples.
Exhibit 7. CPU Measurement Errors
Result: process CPU time, which is 5 ms, was lost.
Result: although process CPU time is 5 ms, 10 ms are measured. CPU time was measured higher.
Result: although process CPU time is 15 ms, 10 ms are measured. CPU time was measured lower.
Active Wait Measurement Errors. When a process runs in OS perspective, but it seems like a wait
in user perspective, that process is in active wait. There are two types of active waits in Oracle:
Spinning time is included in the CPU used by this session statistic. That is, although process is
waiting in user perspective, the time spent in spinning is included in both OS perspective and Oracle
perspective. This means that latch-spinnings are seen as if running.
Oracle wait-events wake up in a loop for wait-events that implement timed out wait mechanism until
they get resources. This loop occupies CPU.
Exhibit 8 shows a sample enqueue (TX) wait-event of DELETE operation. Most CPU usage in a wait-loop is
done in kernel mode. There is little time spent in user mode (see Exhibit 9).
Exhibit 8. Test Case: SuSELinux 7.2/Oracle 8.1.7 and strace Command for OS Side
Exhibit 11 shows a wait-loop that has average values. The first instruction starts at 14:38:24.736238,
and the last instruction (semop) starts at 14:38:24.737990. The time difference between them is
0.001752 seconds, equal to 1.752 ms wall-clock time. This is an average value and
does not include time spent in the semop() system call. So, the average wait-loop elapsed time is
1.752 ms.
Exhibit 11. Wait-Loop System Calls
The wait-loop impact may be small in session level. However, it becomes larger when the number of
sessions increases. For example:
Wait Measurement Errors
Inactive Wait Measurement Errors. If a running process is stopped by Oracle process itself,
its called process is in inactive wait. Sleeping during inactive waits is controlled by Oracle. That
is, inactive waits are Oracle wait-events.
If a process completes its waiting or wait-timeout occurs, it does not get the CPU immediately. It is
put into the ready (runnable) queue first. Then, when a CPU is assigned, the process is determined
by CPU Scheduler. Operations from taking the process from wait-queue to assigning the CPU to a process
are done by OS kernel.
When an Oracle process gets CPU, it updates its waittime. This wait-time includes wait-time in
wait-queue, wait-time in ready (runnable) queue, preemption latency time, context switch time, and any
operations such as swapping, paging etc. during Oracle waits. This means there are some other
operations included in Oracle waits. Time spent in these non-Oracle operations are called inactive
wait measurement errors. Inactive wait measurement errors make Oracle waits distorted and overvalued.
Exhibit 13 shows a sample measurement error. An Oracle
process slept 1 centisecond (10,001 microseconds) on
wait-queue by select() system call. After getting CPU, process
measured elapsed time by gettimeofday() system call
(7 centiseconds) and updated the latch sleep time by 7 centiseconds.
However, this is the total time and includes
latch-sleep time in wait-queue, wait-time in the ready
(runnable) queue, preemption latency time, and context
switch time. In fact, the time spent in latch-sleep was 1
centisecond. Other rounded 6 centiseconds were caused by
other components of the system, not by Oracle latch.
Exhibit 13. Test Case: strace Command Output on SuSELinux 7.2/Oracle 8.1.7
Inactive Service Measurement Errors. If a running
process is stopped by OS kernel, its called process is
in inactive service. A running process can be stopped by
OS for preempting, swapping, paging, etc.
Sleeping during inactive services are not controlled by
Oracle. Oracle processes do not know this state. Time
spent in inactive services is called an inactive service
measurement error. This error is not included in Oracle
waits. In other words, it is missed.
Recurring Waits. If a wait-event occurs for the same
resource in multiple sessions at the same time,it is called a
recurring wait. Adding up the time spent in each
wait-event to total wait-time may give distorted results.5
For example, let a log switch completion take 2 seconds.
If five processes wait for log file switch completion
wait-event at the same time through these 2 seconds,
the total time spent in the log file switch completion
wait-event will be 5 × 2 = 10 seconds. But, in fact,
it takes 2 seconds in system.
The log file switch completion wait-event is a recurring
wait-event because it occurs one time in system
level, but recurrs multiple times in session level.
Exhibit 14 shows the recurring wait measurement
errors. While the number of sessions increases, cumulative
measurements may be distorted. Average waits per session
may not give correct measurements because wait-times
among sessions may not be distributed uniformly. That
is, the reponse time performance modeling is not a proper
model in system level. Also, a system should be a serial
system for response time calculation; that is, the waittime
should not include service time and service time
should not include wait-time. Both service time and waittime
must be separate.
Exhibit 14. Recurring Waits Measurement Errors
Microstate accounting, on the other hand, takes a highresolution
timestamp on every state change, every system
call, every page fault, and every scheduler change.
The prusage structure of SunOS 5.5 (Exhibit 15)
shows the resource usage structure used in microstate
accounting. As seen in this structure, more detailed and
accurate time information is measured than that provided
by getrusage(), times(), and gettimeofday() system
calls. By this mechanism it is possible to see the time
spent in several system components.
Exhibit 15. prusage Structure of SunOS 5.5
If the OS does not support microstate accounting, Oracle
cannot support it. Exhibit 16 shows some system calls
generated by the Oracle process when TIMED_OS_
STATISTICS is set. The ioctl() calls are used with the
parameter of PIOCUSAGE. If TIMED_OS_STATISTICS
is set 0, ioctl() disappears.
Exhibit 16. Test Case: truss Command Output on Solaris 8/Oracle 9.0
This means that Oracle uses microstate accounting for
the following timed OS statistics for this test case:
Oracle uses the getrusage() system call to find CPU usage in user mode(ru_utime), and CPU usage in
kernel mode or system mode(ru_stime).19:56:05 gettimeofday({1000781765, 366717}, NULL) = 0
19:56:05 getrusage(RUSAGE_SELF, {ru_utime={0, 80000}, ru_stime={0, 10000}, ...}) = 0
19:56:05 gettimeofday({1000781765, 367403}, NULL) = 0
Synchronous Wait Mechanism. In this mechanism, after wait is requested, Oracle process is
immediately put into the OS wait-queue. Until wait-event is completed, process sleeps in OS wait-queue.
When wait-event is completed, OS kernel posts the waiting process and makes it runnable.
Oracle gets time by gettimeofday() system call before and after pread() system call. The difference
between them is added to wait-time of wait-event.15:26:02.348478 gettimeofday({1003530362, 348525}, NULL) = 0
15:26:02.348564 gettimeofday({1003530362, 348588}, NULL) = 0
15:26:02.348659 pread(409,
"\6\2\0\0\240\27\200\0\325\227\7\0\0\0\2\0\0\0\0\0\1\0\7\0_\f\0\0\210I
\7\0\0\0\24P\2\6\3\0\30/\200\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\236\0\0\0
N\1\354\3R\6R\6\0\0\236\0\1\0\2\0\3\0\4\0\5\0"...,
2048, 12386304) = 2048
15:26:02.348995 gettimeofday({1003530362, 349035}, NULL) = 0
15:26:02.349174 gettimeofday({1003530362, 349209}, NULL) = 0
15:26:02.349260 write(6, "WAIT #1: nam=\'db file sequential read\'
ela= 0 p1=2 p2=6048 p3=1", 63) = 63
15:26:02.349341 write(6, "\n", 1) = 1
V$SESSION_EVENT.TOTAL_TIMEOUTS and V$SYSTEM_EVENT.TOTAL_TIMEOUTS are always 0 for this type of wait-event.
Timed Out Wait Mechanism. In the timed-out mechanism, process does not have to wait for posting
by OS kernel. Process wakes up after a specific timeout and checks if resource is available. If
resource is available, process gets it; if not, process goes to sleep again. The wait-time is updated
in each timeout.
The FIFO (first-in, first-out) ordered timeout mechanism is used by wait-events that need
serializations. This mechanism uses the semaphore facilities of OS. Semaphore operations include
hardware-based atomic instructions to protect critical sections in source code, and processes are
queued in FIFO order in semaphore operations.
Here are the system calls of a process while waiting for locked resources in enqueue (TX) wait-event.
As seen in Exhibit 4, there is a loop running around each three seconds.while (resource_not_available)
{
setitimer();
if (semop()==0) break; // if TRUE resource is available,
// otherwise sleeps by using semaphore
if (SIGALRM) // signal handling
{
if (resource_is_available) break;
}
}
Some wait-events that use FIFO ordered timeouts wait mechanism include:
19:56:00 --- SIGALRM (Alarm clock) ---
19:56:00 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
19:56:00 gettimeofday({1000781760, 944916}, NULL) = 0
19:56:00 rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
19:56:00 rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0
19:56:00 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
19:56:00 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
19:56:00 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
19:56:00 gettimeofday({1000781760, 945344}, NULL) = 0
19:56:00 gettimeofday({1000781760, 945446}, NULL) = 0
19:56:00 rt_sigprocmask(SIG_BLOCK, NULL, [RT_0], 8) = 0
19:56:00 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
19:56:00 gettimeofday({1000781760, 945649}, NULL) = 0
19:56:00 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={3, 70000}}, NULL) = 0
19:56:00 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
19:56:00 semop(98304, 0xbfffa8f4, 1) = -1 EINTR (Interrupted system call)
19:56:04 --- SIGALRM (Alarm clock) ---
19:56:04 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
19:56:04 gettimeofday({1000781764, 14912}, NULL) = 0
19:56:04 rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
19:56:04 rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0
19:56:04 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
19:56:04 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
19:56:04 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
19:56:04 gettimeofday({1000781764, 15410}, NULL) = 0
19:56:04 gettimeofday({1000781764, 15498}, NULL) = 0
19:56:04 rt_sigprocmask(SIG_BLOCK, NULL, [RT_0], 8) = 0
19:56:04 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
19:56:04 gettimeofday({1000781764, 15686}, NULL) = 0
19:56:04 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={3, 70000}}, NULL) = 0
19:56:04 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
19:56:04 semop(98304, 0xbfffa8f4, 1) = 0
A non-ordered timeout mechanism is used by waitevent which requires timeout mechanism without an
order. A latch free wait event uses non-ordered timeouts. Because latch operations do not need to be
done in FIFO order, there is no need for a semaphore queue for latch free wait-event. Atomic
instructions such as test-and-set, compare-and-swap, etc. are sufficient, instead of expensive
semaphore operations. That is why Oracle does not use semaphores in latch operations.
The lock prefix invokes a locked (atomic) read-modify-write operation when modifying a memory operand.08c2c272 <sskgslcas>:
8c2c272: 8b 4c 24 04 mov 0x4(%esp,1),%ecx
8c2c276: 8b 44 24 08 mov 0x8(%esp,1),%eax
8c2c27a: 8b 54 24 0c mov 0xc(%esp,1),%edx
8c2c27e: f0 0f b1 11 lock cmpxchg %edx,(%ecx)
8c2c282: 75 06 jne 8c2c28a <.failed_cmp>
8c2c284: b8 01 00 00 00 mov $0x1,%eax
8c2c289: c3 ret
There are two mechanisms to implement non-ordered timeouts:
16:56:07.107143 gettimeofday({1003708567, 107165}, NULL) = 0
16:56:07.107199 gettimeofday({1003708567, 107221}, NULL) = 0
16:56:07.107251 gettimeofday({1003708567, 107271}, NULL) = 0
16:56:07.107525 gettimeofday({1003708567, 107585}, NULL) = 0
16:56:07.107698 select(0, [], [], [], {0, 10001}) = 0 (Timeout)
16:56:07.172634 gettimeofday({1003708567, 172688}, NULL) = 0
16:56:07.172787 gettimeofday({1003708567, 172815}, NULL) = 0
16:56:07.172866 write(6, "WAIT #1: nam=\'latch free\' ela= 7 p1=1343645620 p2=66 p3=0", 57) = 57
16:56:07.172955 write(6, "\n", 1) = 1
If _LATCH_WAIT_POSTING=1, which is the default, waiting process can be interrupted for library cache and
shared pool latches only, and gets these latches. If _LATCH_WAIT_POSTING>1, waiting process can be
interrupted for all latches, and gets latches. If _LATCH_WAIT_POSTING=0, process that is waiting for
latch is never posted, even if latch is free.
Measurement errors include:
The error is minimal at high usage levels, but ranges up to 80 percent or more at low levels. The
problem is that usage is underreported, and the range of error increases on faster CPUs. At a real
usage level of 5 percent busy, one will often see vmstat reporting that the system is only 1 percent
busy.
Because the time spent for these operations is included in service time, it may cause measurement
errors. This type of error is called an active wait measurement error.
Exhibit 9. Time Statistics after DELETE OperationTIME STATISTICS BEFORE DELETE OPERATION:
In Oracle side:
SID NAME VALUE
---------- ---------------------------------------------- ----------
9 CPU used when call started 1
9 CPU used by this session 1
9 OS User time used 5
9 OS System time used 1
4 rows selected.
In OS side:
21:19:57.303910 getrusage(RUSAGE_SELF, {ru_utime={0, 50000}, ru_stime={0, 10000}, ...}) = 0
A SNAPSHOT OF TIME STATISTICS IN WAIT-LOOP:
In Oracle side:
SID NAME VALUE
---------- --------------------------------------------- ----------
9 CPU used when call started 1
9 CPU used by this session 1
9 OS User time used 15
9 OS System time used 114
4 rows selected.
In OS side:
22:36:18.468218 getrusage(RUSAGE_SELF, {ru_utime={0, 150000}, ru_stime={1, 140000}, ...}) = 0
CPU used by this session is updated after DELETE statement is completed, and this statistic also
includes time spent in CPU during wait-loop. That is, CPU used by this session includes CPU usage
although process is waiting.
Exhibit 10 shows a sample enqueue wait-event of the DELETE operation. As seen, the wait-loop impact
increases while enqueued sessions are increased. Here are the interpretations for the following timed
terms:
In Oracle side:
SID NAME VALUE
---------- --------------------------------------------- ----------
9 CPU used when call started 1
9 CPU used by this session 124
9 OS User time used 15
9 OS System time used 114
4 rows selected.
In OS side:
22:36:33.101442 getrusage(RUSAGE_SELF, {ru_utime={0, 150000}, ru_stime={1, 140000}, ...}) = 0
Exhibit 10. Test Case: 766 MHz SuSELinux 7.2/Oracle 8.1.7Total sessions: 10
Enqueued sessions: 1
Vmstat:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 30664 9628 1164 78580 0 0 0 8 103 84 0 0 100
Wait-loop impact per session: 1.670 ms
Total sessions: 60
Enqueued sessions: 51
Vmstat:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 57240 1396 520 45552 0 0 0 4 102 106 0 1 99
Wait-loop impact per session: 1.688 ms
Wait-loop elapsed time shows the approximate time spent in wait-loops. It includes time spent of all
possible events in wait-loops, such as CPU usage, OS activities such page faults, context switches,
etc.Total sessions: 210
Enqueued sessions: 202
Vmstat:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 1 248948 1504 476 37776 0 0 0 12 110 187 1 5 94
Wait-loop impact per session: 1.752 ms
Exhibit 12. CPU Usage Summary of Wait-Loop System Calls
14:38:24.736238 --- SIGALRM (Alarm clock) ---
14:38:24.736782 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
14:38:24.736937 gettimeofday({944433504, 736967}, NULL) = 0
14:38:24.737007 rt_sigprocmask(SIG_UNBLOCK, [], NULL, 8) = 0
14:38:24.737091 rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0
14:38:24.737183 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
14:38:24.737269 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
14:38:24.737341 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
14:38:24.737428 gettimeofday({944433504, 737452}, NULL) = 0
14:38:24.737531 gettimeofday({944433504, 737562}, NULL) = 0
14:38:24.737610 rt_sigprocmask(SIG_BLOCK, NULL, [RT_0], 8) = 0
14:38:24.737699 rt_sigprocmask(SIG_BLOCK, [ALRM], NULL, 8) = 0
14:38:24.737779 gettimeofday({944433504, 737806}, NULL) = 0
14:38:24.737844 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={3, 70000}}, NULL) = 0
14:38:24.737912 rt_sigprocmask(SIG_UNBLOCK, [ALRM], NULL, 8) = 0
14:38:24.737990 semop(98304, 0xbfffa5c4, 1) = -1 EINTR (Interrupted system call)
Exhibit 12 shows a CPU usage summary of wait-loop system calls that has average values. There are 13
semop() system calls. Because each loop includes one semop() system call, there are 13 loops in this
sample. It was found that the kernel mode is predominantly used in wait-loops. So, the time spent in
system calls can be approximately accepted as the average wait-loop CPU time. Thus, the average
wait-loop CPU time = 0.000443/13 = 0.034 ms.
oracle@linux:~/test5 > strace -p 714 -tt -c
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
52.82 0.000234 2 109 rt_sigprocmask
22.57 0.000100 2 55 gettimeofday
21.67 0.000096 3 28 setitimer
2.93 0.000013 1 13 13 semop
------ ----------- ----------- --------- --------- ----------------
100.00 0.000443 205 13 total
The wait-loop utilization is too low. This means that most of the time in the wait-loop was spent in
OS-specific tasks such as page faults, context switches, etc. These OS tasks were recorded to enqueue
wait-event by Oracle.Wait-loop utilization = 100* (average wait-loop
CPU usage) / (average wait-loop elapsed time)
= 100 * (0,034/1.752)
= 1.94%
Because enqueue timeout normally occurs every three seconds in this test case, there will be 0.354
seconds total wait-loop impact every three seconds. This means that the total wait-loop impact will be
increased every three seconds. Because time spent in wait-loops is recorded as relevant Oracle
wait-events, it distorts Oracle wait-events.wait-loop impact in system level = (wait-loop
impact per session) * (average timed out
sessions)
= 1.752 ms * (202)
= 353,904 ms
= 0.354 sec
If there are bottlenecks in CPUs, the number of processes
in ready (runnable) queue increases. This manifests
as if problems are in Oracle wait-events.16:56:07.107143 gettimeofday({1003708567, 107165}, NULL) = 0
16:56:07.107199 gettimeofday({1003708567, 107221}, NULL) = 0
16:56:07.107251 gettimeofday({1003708567, 107271}, NULL) = 0
16:56:07.107525 gettimeofday({1003708567, 107585}, NULL) = 0
16:56:07.107698 select(0, [], [], [], {0, 10001}) = 0 (Timeout)
16:56:07.172634 gettimeofday({1003708567, 172688}, NULL) = 0
16:56:07.172787 gettimeofday({1003708567, 172815}, NULL) = 0
16:56:07.172866 write(6, "WAIT #1: nam=\'latch free\' ela= 7 p1=1343645620 p2=66 p3=0", 57) = 57
16:56:07.172955 write(6, "\n", 1) = 1
Microstate Accounting
Oracle and Microstate Accountingtypedef struct prusage {
id_t pr_lwpid; /* lwp id. 0: process or defunct */
u_long pr_count; /* number of contributing lwps */
timestruc_t pr_tstamp; /* current time stamp */
timestruc_t pr_create; /* process/lwp creation time stamp */
timestruc_t pr_term; /* process/lwp termination time stamp */
timestruc_t pr_rtime; /* total lwp real (elapsed) time */
timestruc_t pr_utime; /* user level CPU time */
timestruc_t pr_stime; /* system call CPU time */
timestruc_t pr_ttime; /* other system trap CPU time */
timestruc_t pr_tftime; /* text page fault sleep time */
timestruc_t pr_dftime; /* data page fault sleep time */
timestruc_t pr_kftime; /* kernel page fault sleep time */
timestruc_t pr_ltime; /* user lock wait sleep time */
timestruc_t pr_slptime; /* all other sleep time */
timestruc_t pr_wtime; /* wait-cpu (latency) time */
timestruc_t pr_stoptime; /* stopped time */
u_long pr_minf; /* minor page faults */
u_long pr_majf; /* major page faults */
u_long pr_nswap; /* swaps */
u_long pr_inblk; /* input blocks */
u_long pr_oublk; /* output blocks */
u_long pr_msnd; /* messages sent */
u_long pr_mrcv; /* messages received */
u_long pr_sigs; /* signals received */
u_long pr_vctx; /* voluntary context switches */
u_long pr_ictx; /* involuntary context switches */
u_long pr_sysc; /* system calls */
u_long pr_ioch; /* chars read and written */
} prusage_t;
Applying PIOCUSAGE to a process that does not have
microstate accounting enabled will enable microstate
accounting and return an estimate of time spent in the
various states up to this point. Further invocations of
PIOCUSAGE will yield accurate microstate time accounting
from this point.$ truss -p 907
read(8, 0x024F28D6, 2064) (sleeping...)
read(8, "\0BC\0\006\0\0\0\0\011 i".., 2064) = 188
times(0xFFBED5C8) = 855271
ioctl(10, PIOCUSAGE, 0xFFBED18C) = 0
ioctl(10, PIOCSTATUS, 0xFFBED28C) = 0
times(0xFFBED560) = 855271
times(0xFFBED5C8) = 855271
times(0xFFBED420) = 855271