Reprinted with Permission by Quest Software July 2006


One Analytic Function Can do More Than a 1000 Lines of Code
By Alex Nuitjen, AMIS Services B.V.

Introduction

Oracle 8i Enterprise Edition introduced Analytic Functions, since 9iR2 this functionality was included in the Standard Edition. With these analytic functions you can retrieve data from more than one row at the same time without the need for a self join. You can create a ranking based on a value within a group of values. Even though it takes a bit getting used to the syntax, once mastered Analytic Functions can make your life a lot easier. This doesn’t mean they are easy to use. Sometimes it feels your brain is doing cartwheels inside your head, but what a joy it is to solve some of the more challenging programming puzzles. The performance is beyond compare, and it looks so easy when you’re done.

So without further ado, let’s dive into the nitty gritty.

Overview

The processing order of a query with Analytic Functions happens in three stages. First of all the Joins, WHERE conditions, GROUP BY and HAVING clauses are applied. The result set you get from this action will have the Analytic Functions applied to it. Finally the ORDER BY is processed.


Figure 1. Process Order of a Query

Partitions

The result set you get from a query can be subdivided in Partitions. Note: the partitions in the context of Analytic Function are not related to Table Partitions. A Partition reflects a group of logically related rows, like all employees in the same Department or with the same Job. The size of a partition can be as large as the whole result set, and as small as just a few rows or even a single row. This simple example calculates the Analytical Function for each record for all employees in the same department and with the same job:

avg(sal) over (partition by deptno, job)

The Partition clause breaks the result set into groups. The criteria in the partition clause determine how many different groups you’ll have – never more than one per record, never less than one. If you specify no partition clause, then the whole result set is considered as a single partition or group. The Analytic Function is reset at for each group. If you have, say a running total for salaries per department, the counter is set at zero for each department.

Windows

Within each partition, it’s possible to specify a Window. The Window determines the range of rows in the current partition for which you want to perform calculations for the current row.

Analytic Functions are always performed from the perspective of the current row. You can consider the current row as the reference point for the window.

The default Windows clause is RANGE UNBOUNDED PRECEDING. This means that the window is expanding from the first row in the partition to the Current Row. Although it may seem a little cryptic now, it will become clear in the next few sections.

Windows come in two flavors. The first one is an anchored window, the second a sliding window. The default window-clause is an anchored window. This window starts with the first row of the group and ends with the current row. It’s called an anchored window because the first row of the window is fixed. In a sliding window this is different.

The sliding window comes in two varieties. There is a range window, determined by a numeric offset. This window includes all rows where the specified column has a value that falls within the range starting from the current row. For example: all rows with employees that earn within $200 from the current employee.

over (order by sal range between 200 preceding and 200 following)

You can only use this kind of window on numeric values or on dates. With dates, the offset is a number of days.

The second type of a sliding window is a row window. You can specify how many rows you want to look back or ahead in the result set. For example: calculate the analytical function for the current record over (no more than) two records prior to the current record and (no more than) two records following the current record – with all records sorted by salary:

over (order by sal rows between 2 preceding and 2 following)


Figure 2. Windows

Analytic Functions

Some of the regular aggregate functions which are available in SQL also have an Analytic counterpart. Some examples of these are SUM, COUNT, and AVG. You can tell you’re using the analytic function because of the reserved word OVER.

Other analytic functions are a little more esoteric. Just to name a few: STDDEV_SAMP, VAR_POP, and CUME_DIST. It may be advisable when using some of these more esoteric functions that you have a statistical analyst sitting right beside you to interpret the values the query is producing.

There is also a group of analytic functions that may look very strange at first glance, but are not as bad as they seem.

ROW_NUMBER is one of them. You can think of this one as an Analytic counterpart of ROWNUM. Using ROW_NUMBER you can assign a rownum to each row within each partition. This might help you to answer questions like: “Show the top-3 earners for each department” or “Who has the second to highest salary within the company?”

RANK and DENSE_RANK are ranking functions that assign a rank to records within a group. LAG and LEAD allow you to access values from rows that come earlier or later in the result set given a certain ordering of rows.

Let’s look at some of the most common usages of Analytical Functions. Of course, there are many more possibilities. These merely serve as examples to get you going. It is possible to reproduce these result sets with plain ol’ SQL, but the Analytic way is so much more attractive and elegant and often better performing.

For the following section of examples, I will use EMP and DEPT tables. These tables store information regarding Departments and Employees. They are part of the Scott schema, which was the demo-schema for many Oracle versions. Most people will know them by heart.

Running Totals

Running Totals are deceptively easy to create using analytic functions. To do something similar the traditional way can be quite a hard task to accomplish. Let’s show you an example of a running totals query

SQL> select empno
  2       , ename
  3       , sal
  4       , Sum (sal) over (order by empno) overall_total
  5    from emp
  6   order by empno
  7  /

     EMPNO ENAME             SAL OVERALL_TOTAL
---------- ---------- ---------- -------------
      7369 SMITH             800           800
      7499 ALLEN            1600          2400
      7521 WARD             1250          3650
      7566 JONES            2975          6625
      7654 MARTIN           1250          7875
      7698 BLAKE            2850         10725
      7782 CLARK            2450         13175
      7788 SCOTT            3000         16175
      7839 KING             5000         21175
      7844 TURNER           1500         22675
      7876 ADAMS            1100         23775
      7900 JAMES             950         24725
      7902 FORD             3000         27725
      7934 MILLER           1300         29025

In this example you can see that the partition keyword is omitted, therefore there’s only one single group. There is, however, a window that is expanding with each row. The window is determined by the ORDER BY clause in line 4. The – implicit - window clause in this example is RANGE UNBOUNDED PRECEDING, the default for the Window clause. This means that all preceding salaries are added to the current row’s salary, hereby creating a running total.

If we look at Empno 7521 (WARD), we can see that the overall_total column shows 3650. This is a summary of all preceding salaries, including Ward’s. (800 + 1600 + 1250)

To make this a running total per department, simply add a partition clause as in the next example.

SQL> select empno
  2       , ename
  3       , sal
  4       , deptno
  5       , Sum (sal) over (partition by deptno
  6                             order by empno
  7                        ) department_total
  8    from emp
  9   order by deptno, empno
 10  /

     EMPNO ENAME             SAL     DEPTNO DEPARTMENT_TOTAL
---------- ---------- ---------- ---------- ----------------
      7782 CLARK            2450         10             2450
      7839 KING             5000         10             7450
      7934 MILLER           1300         10             8750
      7369 SMITH             800         20              800
      7566 JONES            2975         20             3775
      7788 SCOTT            3000         20             6775
      7876 ADAMS            1100         20             7875
      7902 FORD             3000         20            10875
      7499 ALLEN            1600         30             1600
      7521 WARD             1250         30             2850
      7654 MARTIN           1250         30             4100
      7698 BLAKE            2850         30             6950
      7844 TURNER           1500         30             8450
      7900 JAMES             950         30             9400

When we look at Empno 7521 (WARD) again, we see that the Department_Total column shows 2850. This is a summary of all preceding salaries within his department (1600 + 1250).

Just to show the effect of a running total more clearly, I adjusted to final order by predicate to match the window’s sort condition. It should be noted that there’s no need to do this. When the final order by is different from the sort order in the partition, the final result can be rather confusing. To help you visualize where a window starts and where it ends, you can use the Analytic Functions FIRST_VALUE and LAST_VALUE, that return the specified column value for the first or last record in the window..

SQL> select ename
  2       , Sum (sal) over (partition by deptno
  3                             order by empno
  4                        ) dept_total
  5       , First_Value (ename) over (partition by deptno
  6                                       order by empno
  7                                  ) fv
  8       , Last_Value (ename) over (partition by deptno
  9                                      order by empno
 10                                 ) lv
 11    from emp
 12   where deptno = 20
 13   order by deptno
 14          , empno
 15  /

ENAME      DEPT_TOTAL FV         LV
---------- ---------- ---------- ----------
SMITH             800 SMITH      SMITH
JONES            3775 SMITH      JONES
SCOTT            6775 SMITH      SCOTT
ADAMS            7875 SMITH      ADAMS
FORD            10875 SMITH      FORD

Here you can see that the window is expanding per row. When we look at SCOTT, the window starts with the employee named SMITH and ends with the current row (SCOTT).

Ranking: Top N

Showing the Top 3 of each department is a breeze with Analytic Functions. What we need to do is create partitions and assign a Rank to each row within their partition. There are three variants to the ranking function. There is RANK, DENSE_RANK and ROW_NUMBER. All of these functions assign numbers based on the ORDER BY clause within each Partition. They all do it a little bit differently.

The difference lies in the way equality is resolved. RANK allows ranking numbers to be skipped. DENSE_RANK uses a different kind of ranking which doesn’t skip any numbers. ROW_NUMBER assign an arbitrary number to each row when it is not possible to resolve the ORDER BY of the windows clause, comparable to the way ROWNUM assigns a value to a row.

Take a look at the next example. You can see the difference in the different types of ranking. When we look at department 20 more closely, you can you that there is a tie for the first place, both have 3000. The second highest salary (2975) has a RANK of 3, while for the DENSE_RANK it is ranked 2. RANK skipped second place altogether: we do not award a silver medal if we already have two gold medalists. The last column, where ROW_NUMBER is used, assigns a value arbitrarily to the highest paid employees.

SQL>  select ename
  2       , deptno
  3       , sal
  4       , Rank() over (partition by deptno
  5                          order by sal desc
  6                     ) rk
  7       , Dense_Rank() over (partition by deptno
  8                                order by sal desc
  9                           ) dr
 10       , Row_Number() over (partition by deptno
 11                                order by sal desc
 12                           ) rn
 13    from emp
 14   where deptno = 20
 15   order by deptno
 16          , sal desc;
ENAME          DEPTNO        SAL         RK         DR         RN
---------- ---------- ---------- ---------- ---------- ----------
SCOTT              20       3000          1          1          1
FORD               20       3000          1          1          2
JONES              20       2975          3          2          3
ADAMS              20       1100          4          3          4
SMITH              20        800          5          4          5

Because Analytic Functions cannot be used in the where-clause of a query, it’s necessary to push the Analytic Function into an in-line view to restrict the final result set to a regular Top N, in this case a Top 3 of the highest salaries.

SQL> select ename
  2       , deptno
  3       , sal
  4       , rn
  5    from (select ename
  6               , deptno
  7               , sal
  8               , Row_Number() over (partition by deptno
  9                                        order by sal desc
 10                                   ) rn
 11            from emp
 12         )
 13   where rn <= 3
 14  /

ENAME          DEPTNO        SAL         RN
---------- ---------- ---------- ----------
KING               10       5000          1
CLARK              10       2450          2
MILLER             10       1300          3
SCOTT              20       3000          1
FORD               20       3000          2
JONES              20       2975          3
BLAKE              30       2850          1
ALLEN              30       1600          2
TURNER             30       1500          3

Ranking to pivot the result set

Now that you have an understanding of how ranking works, another fun thing to do with this ranking is to turn rows into columns. Instead of showing the Top 3 earners listed down, we can turn the result set going across the page, Using the previous example, this is what a pivoted result set might look like.

SQL> select deptno
  2       , Max (Decode (rn, 1, ename)) "Top 1"
  3       , Max (Decode (rn, 2, ename)) "Top 2"
  4       , Max (Decode (rn, 3, ename)) "Top 3"
  5    from (select ename
  6               , deptno
  7               , Row_Number() over (partition by deptno
  8                                        order by sal desc
  9                                   ) rn
 10            from emp
 11         )
 12   where rn <= 3
 13   group by deptno
 14  /

    DEPTNO Top 1      Top 2      Top 3
---------- ---------- ---------- ----------
        10 KING       CLARK      MILLER
        20 SCOTT      FORD       JONES
        30 BLAKE      ALLEN      TURNER

It’s not always necessary to use an Analytic Function to create a pivoted result set, but for something like a Top 3 as in the example, it’s really easy.

Deduplication

Another great use of the ranking functions is to remove duplicates from a table. When we accidentally insert the content of the EMP table into itself, duplicate employees will be present in the table. To remove these, we are going to subdivide the contents of the table into Partitions based on the name of the employee. Each record within this Partition will be assigned a rank. Since we don’t care in this case which of the records is going to stay in the table and which record needs to be removed, the windows clause is not really relevant. However, the windows clause is mandatory for ranking function, so we simply order by NULL.

Note that in general you will partition the records by the column(s) that specify their identity so that identical rows (duplicates) are in the same partition. Then you will order these rows by the expression that designates the duplicatie-that-is-to-stay. The CASE expression is frequently handy for this order by clause.

SQL> delete from duplicate_emps
  2   where rowid in (select rid
  3                     from (select rowid rid
  4                                , row_number() over (partition by ename order by null) rn
  5                             from duplicate_emps
  6                          )
  7                    where rn > 1
  8                   );

14 rows deleted.

In the inner most query, we assign a rank to each record in the table based on the ename. In this example it would be a one or a two, since we inserted the data of EMP only once into the same table again. The Analytic function cannot be used in the final order by predicate of this inner query, so we needed to create an in-line view to restrict the output to only those records that we need to remove. Finally the delete statement removes those records which were accidentally duplicated.

Accessing other rows in the result set

Occasionally it’s necessary to access values from different rows in the result set. The LAG and LEAD functions do just this. LAG looks at values from previous rows, and LEAD looks to the following records in the result set. The current row is always the starting point for the number of rows you want to look forward or back into the result set. The need for a self-join has vanished for many cases. As always a piece of code says more than a thousand words, so let’s take a look at another example. We want to show for each employee the next - lead(ename,1) - employee to be hired - order by hiredate - in the same job -partition by job - :

SQL> select ename, job
  2       , hiredate
  3       , Lead (ename) over (partition by job
  4                                order by hiredate
  5                           ) next_hiree_in_job
  6    from emp
  7   order by job
  8          , hiredate;

ENAME      JOB       HIREDATE  NEXT_HIREE
---------- --------- --------- ----------
FORD       ANALYST   03-DEC-81 SCOTT
SCOTT      ANALYST   19-APR-87
SMITH      CLERK     17-DEC-80 JAMES
JAMES      CLERK     03-DEC-81 MILLER
MILLER     CLERK     23-JAN-82 ADAMS
ADAMS      CLERK     23-MAY-87
JONES      MANAGER   02-APR-81 BLAKE
BLAKE      MANAGER   01-MAY-81 CLARK
CLARK      MANAGER   09-JUN-81
KING       PRESIDENT 17-NOV-81
ALLEN      SALESMAN  20-FEB-81 WARD
WARD       SALESMAN  22-FEB-81 TURNER
TURNER     SALESMAN  08-SEP-81 MARTIN
MARTIN     SALESMAN  28-SEP-81

In this example I used the LEAD function to look to following records. We divided the result in Partitions based on the job and sorted by hiredate. When we look at the MANAGERs, you can see that the next hiree from Blake’s perspective is Clark. From Clark’s perspective there were no new managers hired, that’s why this row’s next hiree is null.

It’s also possible to look further ahead or back in the result set. Lag and Lead have two additional optional parameters. One parameter is for the offset of number of records. The third parameter is to provide a default value in case LAG or LEAD points outside the window.

Death of the End Date?

Occasionally you will find tables where there is something like a begin date and an end date. The begin date tells you when is record is active and the end date tells you when the record is deactivated, or should be deactivated. Making sure that a previous valid period ends when the next valid period starts, can be quite a challenge to implement. Now that you know about the LEAD and LAG function, you can ask yourself “Do you actually need an end date column at all?” It is possible to access the end date by querying the begin date of the previous record.

My colleague Lucas Jellema wrote a weblog about this, you can find it here: http://technology.amis.nl/blog/?p=825. The overall conclusion from his research shows that it’s quite possible to implement the requirement using the LEAD and LAG functions; however when the requirements state that there can be a period of no active records at all this method will fail. Performance wise it is better to have an end date in the table. To build up the in-line view to determine all begin and end dates can be quite an expensive operation. This is especially true when the table grows in size over time.

Caveats

As with all new features, it’s not always gold. Once you master the syntax of these Analytic Functions, the sky is the limit. However there are some things to be wary of.

Analytic Functions cannot be used in the WHERE clause or the final ORDER BY clause. What you can do to circumvent this limitation is to push the Analytic Function into an in-line view. Another way of circumventing the ORDER BY limitation is to use a column alias and use that in the ORDER BY clause.

What you also need to look out for is the ordering of NULL values and how this will affect the ORDER BY clause of the Analytic Window. I will show you an example a little later.

Yet another thing to look out for is the performance impact. While Analytic Functions may look like the best thing that ever happened, especially regarding the performance, .using different windows and sort orders may use a considerable amount of sorting and sifting. This might influence the overall performance of the query. As with all features that you use, you should test before you put it in production code with a representative set of data.

Finally a warning, once you get more comfortable using Analytic Functions you will want to use them all the time. It’s easy to get carried away and you might produce some code that is hard, if not impossible to maintain.

Just to illustrate a quick demonstration of how you might get carried away. Say we have a table which contains the contents of the EMP table multiple times and we want to identify the duplicate names, we can issue a query like:

SQL> select distinct
  2         ename
  3    from (select ename
  4               , Row_Number() over (partition by ename
  5                                        order by null
  6                                    ) rn
  7            from big_emp
  8         )
  9   where rn > 1;

ENAME
----------
ADAMS
ALLEN
BLAKE
A more traditional way to fulfill the same requirement is a query like;
SQL> select ename
  2    from big_emp
  3   group by ename
  4  having Count(*) > 1;

ENAME
----------
ADAMS
ALLEN
BLAKE

I tested both of these queries with around 200000 records in the table, and the traditional way gives a slightly better response time and it uses a lot less resources.

Multiple solutions for same problem

Let’s say I have a table which stores different numbers per category and I want to identify the ranges within this table. Of course this should be done per category. The table looks like:

SQL> select * from t;

C      VALUE
- ----------
A         12
A         13
A         14
A         15
A         25
A        503
A        504
B       1502
B       1503
B       1504
B         42

The final result that I want should show the ranges per category. Consecutive numbers can be omitted. It should look similar to this:

C RNG
- ----------
A 12-15
A 25
A 503-504
B 42
B 1502-1504

This following description of the next query is from the deepest in-line view outward.

One way to solve this requirement is to compare the values from the current row with the value of other records using the LAG function. When the difference is one, it means the adjacent records are sequential, no gap is detected. If the difference is more than one or if it is NULL, you have found a gap.

The start of a gap range gets a number assigned to it. This number is carried down to all records within this Partition. At this stage you have broken up the partitions based on the category into smaller groups based on gap-detection. Using the FIRST_VALUE function it’s easy to find the beginning and ending of each small Partition. Finally format the output to the way you want it to look. This technique is demonstrated on AskTom (www.asktom.oracle.com) numerous times. As you can tell from the query it is not really easy to understand what is going on.

select category
      ,case fv
         when lv then to_char(fv)
                 else to_char(fv)||'-'||to_char(lv)
       end rng
from (select distinct category
      ,first_value(value) over (partition by category,grp order by value) fv
      ,first_value(value) over (partition by category,grp order by value desc ) lv
from (select category, value, max(g) over (partition by category order by value) grp
from (select category, value
      ,case value - lag(value) over (partition by category order by value)
         when 1 then null
                else row_number() over (partition by category order by value)
       end g
from   t
)));

Another way is to do it the following way. The in-line view assigns a rank to each record within the partition. The outer query simply groups, not only by category but also by the difference between the value from the table and the assigned rank. I think it is a lot easier to comprehend. The first time I saw this technique was on Oracle Technology Network by a member called Vadim. I think it’s a real clever and original approach to resolve the requirement (still not very easy to read without comments).

select category
     , case
       when min (value) = max (value)
       then to_char (min(value))
       else to_char (min (value))||' - '|| to_char (max (value))
       end rng
from
  (select category
        , value
        , row_number() over (partition by category order by value) rn
     from t
  )
group by category
        ,value - rn
order by category;

There are often multiple solutions to a problem. Sometimes it can be hard to take a step back and look at the solution you created. It is quite possible that one day someone comes by and shows you a radical different way of providing a solution. Always keep an open mind.

The proof of the pudding is in the eating

While working on a project a coworker called me over to show me this query he had just finished. After looking at this query for a few minutes I still had no clue, as to what this query was supposed to do.

The comments didn’t really help much: “Get the most current contract”. Thankfully I had the original developer at hand, so he could explain in detail what the requirement was.

select id
     , cat
     , start_date
     , end_date
  from contracts main_q
 where not exists (select 1
                     from date_list nest_q
                    where nest_q.cat = main_q.cat
                      and Decode (nest_q.end_date, main_q.end_date
                                 , Decode (nest_q.start_date, main_q.start_date
                                          ,Trunc (sysdate,'DDD')
                                           +
                                           Decode (nest_q.id
                                                  ,Least (nest_q.id, main_q.id)
                                                  , 0, 1)
                                          , Nvl(nest_q.start_date
                                               , main_q.start_date-1)
							                     )
                                 , Nvl(nest_q.end_date, main_q.end_date+1))
                        >
                        Decode (main_q.end_date, nest_q.end_date
                               , Decode (main_q.start_date, nest_q.start_date
                                        , Trunc (sysdate,'DDD')
                                          +
                                          Decode (main_q.id
                                                 ,Least (nest_q.id, main_q.id)
                                                 , 0, 1)
                                          , Nvl(main_q.start_date
                                               ,nest_q.start_date-1)
							                     )
                               , Nvl(main_q.end_date, nest_q.end_date+1)))

The requirement, as he told me was: “Get the current contract. You can identify the most recent one by looking at the End Date. The current contract would have the highest End Date or the End Date is not filled in at all. When there are multiple contracts with the same End Date (and both of them NULL being equal in this case), then you need to look at the Begin Date. You want the most recent Begin Date. When the Begin Date is not filled in at all, it doesn’t make any sense. Nulls are considered the earliest date possible.

When the End Date and the Begin Date are equal, with NULL and all, then it really doesn’t matter which contract shows up. Just get the one with the highest ID. Oh, and by the way, this needs to be done per category of contracts.”

The way he solved it was by querying the same table twice and using several nested DECODEs to compare the End Date of the Main Query to the one of the Nested Query. Also for the comparison of the Begin Date, he used several nested DECODEs. And again for the highest ID. Basically he created his own implementation of the LAG and LEAD Analytic Functions we discussed previously.

Needless to say it took a while to sink in. So you take some time as well. The part where he said “Per category” triggered me. That sounded like the partition clause of the Analytic Function. Only having a partition clause is not enough to write a query with. You still need an Analytic Function as well.

Because of the explanation of the requirements “Get the highest value” and “compare this End Date to the next”, my next inclination was to use MAX and doing the comparison with LAG and LEAD functions.

After struggling with these functions for about an hour and many cups of coffee later, I realized a different approach was needed. What I started doing was recreate the solution he made, but now using Analytic Functions.

Is it possible to rephrase the requirement from “Get the most current contract” to something like “Get the Top 1 of the contracts per category”. This sounds like a different approach altogether. This calls for a ranking function, like ROW_NUMBER, RANK or DENSE_RANK.

Because the id is the primary key, I picked the ROW_NUMBER Function and I came up with this query:

select id
     , cat
     , start_date
     , end_date
  from (select id
             , cat
             , start_date
             , end_date
             , Row_Number() over (partition by cat
                                      order by end_date desc nulls first
                                             , start_date desc nulls last
                                             , id desc
                                  ) rn
          from contract
        )
  where rn = 1

The Top 1 is determined by the most current End Date, then the most current Begin Date and finally the highest ID. Using DESC to order each partition would sort the rows from high to low. The NULLS FIRST and NULLS LAST were introduced in Oracle 8i which allows specifying the treatment of nulls in the ORDER BY Clause. These clauses are not limited to Analytic Functions. There’s no need to remember that NULLs are infinitely big, or was it infinitely small?

Performance Difference

It looks so simple and elegant. I know which of the two queries I would like to maintain, and it isn’t the original one. What about performance? Theoretically it should outperform the “traditional” query. But what about some hard evidence?

The first tests were on our test data, a mere 60 records, and I was taken by surprise. With timing on in SQL*Plus, the original query took 00:00:00.01, while the Analytic query took 00:00:00.03. Wasn’t this Analytic stuff supposed to be faster?

A little bit disappointed, I started looking at statistics and here was the good news. The Analytic Query was doing a lot less IO than the traditional one.

Maybe the test data set was just too small to be representative. It turned out that the test data was not at all representative of the production database. So I increased the record count of the test data to do additional performance tests. And lo and behold, the Analytic Query was far superior to the Original Query. For the last test I upped the record count to 1.9 million records and the Analytic Query returned in about 9 seconds. How long the Original Query took I don’t know. After letting it run for over three hours, I decided to call it a victory and go home for the day.

But don’t take my word for it, which is why you can download the demonstration script from our AMIS Technology Blog to verify my test results. (http://technology.amis.nl/blog/?p=506)

Record Count Original Query Analytic Query
60 00:00:00.01 00:00:00.03
7.680 00:00:01.65 00:00:00.07
1.966.080 > 3 hours 00:00:08.37

Conclusion

Commenting your source code and queries is essential. Especially what your code is supposed to be doing and not just the way you solved the requirement. It might be that there was a bug in your code or an otherwise suboptimal implementation. Or, as I’ve shown in this article, Oracle introduced some new feature that fits the bill. Then it’s relatively easy to use this new feature without having to first figure out what the original query was supposed to be doing in the first place.

Analytic Functions are something to get used to. The hardest part is thinking in sets instead of on a row-by-row basis. Once you make this paradigm shift it gets easier. The most important part of writing queries is taking a step back, and realize that there are multiple ways to solving the requirements. Keep all options open.

Test your queries with a representative test set. The quantity of your test data is just as important as the quality. The number of records in a table has an enormous impact on the best possible way to write a query. What seems like a well performing query may bring the production database down to a grinding halt.

Analytic functions offer a very elegant way to write your query. They are easy to read and therefore easy to maintain. The performance they offer, the code reduction and the sheer development joy justifies using them on a daily basis. Just be careful, not everything is a nail even if the Analytic Hammer looks really cool.


Author

Thank you very much. You can reach me for any further background or discussion at alex.nuijten@amis.nl.. You also find some additional code samples, resourcelinks on http://www.amis.nl/technology.The AMIS Technology Weblog is found at http://technology.amis.nl/blog, or through http://www.orablogs.com/orablogs/.

About the author: Alex Nuijten is a consultant for AMIS Services BV in Nieuwegein, The Netherlands.AMIS Services focuses on Oracle and Java development for a variety of clients in The Netherlands and Germany. He has been working with Oracle for the last seven years, with the main focus on SQL and PL/SQL performance.