Reprinted with Permission by Quest Software Dec. 2005


The Limits and Limitlessness of Time
Craig Shallahamer

Oracle performance analysis has come a long way in the last 20 years. First, there was the “just add more resources” approach, then there was ratio analysis, and finally wait event analysis. But that was not complete or broad enough, so in 2001 I published my Oracle Response Time Analysis (RTA) paper. The RTA paper brought together and detailed the two elements of response time, that is, service time and queue time. One of the more subtle aspects of RTA is understanding the relationship between service time and queue time and how that relates to the operating system.

The response time a user feels is the result of many interrelated systems. The Oracle server is just one of these systems. The time spent in each system can be divided into service time and queue time. We could just add up all the service time
and all the queue time to determine the end user’s response time. But in reality, sometimes the effort is not worth the results…not to mention extremely difficult to do. But what we can do, that’s definitely worth the minimal effort, is to get service time and queue time from an Oracle perspective. In addition we can get details about how this relates to the database server’s CPU subsystem.

Take a look at these basic Oracle-focused response time formulas.

Just one of the many things we can learn from analyzing response time is this: Service time has a limit but queue it limitlessness. Each CPU subsystem has a maximum fixed amount of CPU power it can provide. If you have a one CPU
machine, that machine can provide a maximum of 60 seconds of CPU each minute. If you have a 10 CPU machine, it can provide a maximum of 600 seconds of CPU each minute. The situation is distinctly different with queue time.

Queue time is not fixed and is limited only by the workload. If the workload is relatively small, queue time may be near zero. But keep increasing the workload and queue time will reach infinity…it has no limit.

The limitlessness of queue time, births two concepts we need to consider. The first concept is, if Oracle is consuming all the available CPU, then asking for more CPU will increase service time and can also increase Oracle wait time. The result is an increase in response time. This is bad, very bad. This means our solutions need to carefully weigh how they will affect the CPU subsystem. (This concept is elaborated upon in my new paper entitled, Oracle Wait Interface Deceit.) The second concept is, we now have another way to view a very dynamic system. This not only helps our understanding of the system but it allows us to help others understand a potentially very complex Oracle based system.

For example, consider the figure below. The data was gathered from a real life production Oracle system. Each hour, the response time components were gathered and summarized. The queue time was gathered from v$system_event and the
service time was gathered from v$sysstat. By looking at this graph, if performances poor and all non-Oracle server architecture components are shown not to be the bottleneck, the bottleneck would rest on the Oracle server. Armed with the graph below, we could infer there is a serious IO subsystem bottleneck or locking/blocking issues. Perhaps at around 2200 there is a CPU bottleneck, but the rest of the time is definitely an IO bottleneck or locking/blocking issues.

Let’s look at this in more detail. It all centers on knowing the percentage of available CPU power Oracle is consuming from the operating system. Look closely at the figure above. Let’s start with the CPU subsystem. Since around a maximum of 1000 minutes of CPU time was consumed in a single hour, we know there must be at least 17 CPUs. A worst-case scenario would be the CPU resources used equaled the CPU resources available. That combined with the numbers taken from the above graph yields:

So there must be at least 17 CPUs on the database server. If we know, from monitoring the operating system, CPU utilization was around 50% during 2200, we can also infer there could be around 34 CPUs (~16.67 X 2).

Now let’s put this information together so it’s useful. Notice that around lunchtime, Oracle is not consuming up to the 1000 minutes of CPU each hour. We know the server can supply 1000 minutes of CPU each hour, yet at around lunch time it
supplied less than 500 minutes of CPU time each hour, and users are not satisfied with performance. Therefore, we can infer that the queue time (i.e., Oracle wait events) is not CPU related but either IO related or locking/block (e.g., enqueue wait) related. Powerful information!

We know that understanding and applying Oracle response time components increases our performance prowess. But understanding how Oracle service time relates to the operating system CPU takes us to yet another level. Without
understanding the percentage of available CPU Oracle is consuming from the operating system, we simply don’t know if there is plenty of CPU available or not. Knowing the CPU percentage allows our performance diagnosis to be more accurate which should lead to more powerful solutions! Understanding the limits and limitlessness of time will help not only you, but also your peers and the people you need to influence. The better they understand the situation, the more likely they will agree with your proposed course of action. In other words, it makes your performance analysis stronger and it builds trust. Something all DBAs can use a little bit more of.


About the Author

Craig Shallahamer has 19-plus years experience in IT. As the president of OraPub, Inc. his heart is to empower Oracle performance specialists by "doing" and teaching them to "do" whole system performance optimization (reactive and proactive). In addition to course development and delivery, Craig is a consultant, a researcher and writer, HoriZone’s architect (OraPub’s performance prediction product), technically reviews Oracle books and articles, and keynotes at Oracle conferences.