Reprinted with Permission by Quest Software March 2005


Compressed Composites (Oracle 10g Compression) Explained

Mark Rittman, www.rittman.net

If you've an interest in the OLAP Option and you've read some of my recent articles on the new features in Oracle 10g OLAP, you've probably seen a feature called "compression" mentioned. In this DBAZine article I described compression as "a novel form of cube compression, which promises to both enhance query performance (by retrieving fewer blocks of data for a given logical amount of data) and drastically reduce batch loading and aggregation times, saving disk space on the way. Compression in Oracle 10g OLAP is more about improving performance and scalability than saving disk space (although that’s a nice side effect). The net result of this is that, in a given batch window, Oracle 10g OLAP can now load and aggregate more data than before, and for a given amount of disk, can store more information than before. This, plus big advances in scalability internally around areas such as very big composites, makes Oracle 10g OLAP potentially a very effective platform when building particularly large cubes." So how does this compression feature actually work?

Good question. I actually found out about the compression feature from talking to a couple of people in the OLAP Product team, and at the time this was news to me as I hadn't seen this feature too well signposted in either the recent (2003) batch of Open World papers or in the online documentation. Indeed I was under the impression that it was only coming with the 10.1.0.3 patch release, but I've now located this feature in the online documentation and it's all a bit clearer now.

It appears that what we're talking about when referring to "compression" in 10g OLAP cubes is actually something called "Compressed Composites". Composites should be fairly familiar to anyone working with Express or Oracle OLAP and are structures that you set up when you've got very sparse cubes. In cases where there are few actual used combinations of dimensions for your variable, compared to the potential set of valid combinations, setting up composites, which only store within them the actual combinations of dimensions used, together with indexes to the underlying dimensions, reduces the amount of NA values stored in the variable and results in more efficient data storage. 

So what are compressed composites? Probably the best description is to just quote the OLAP DML Reference R1 (10.1) section on compressed composites:

"In some cases, when you aggregate data in a variable dimensioned by a composite defined with one or more hierarchical dimension, one parent node may have only one descendant node ­— and so on all the way up to the top level. When a variable has a good deal of this type of sparsity, use a compressed composite as the dimension of the variable. Dimensioning this type of variable with a compressed composite creates the smallest possible variable, composite, and composite index—much smaller than if you dimension a variable with a b-tree or hash composite. 

This reduction in size does not occur at the detail level. Oracle OLAP creates composite values for detail level the same way for all composites. A composite contains one composite tuple for each set of base dimension values that identifies non-NA detail data in the variables that use it. 

The reduction in size occurs for those sets of base dimension values that identify non-NA data at higher levels of hierarchical dimensions. Oracle OLAP populates these higher-level values differently depending on whether a variable is dimensioned by a b-tree, hash, or compressed composite: 

For variables dimensioned by b-tree and hash composites, Oracle OLAP creates composite tuples for non-NA data at higher levels the same way that it does for non-NA data at the detail level. There is one composite tuple (with its own physical position) for each set of base dimension values that identifies non-NA data. The composite index contains all of the index entries needed to relate the composite tuple to the base dimension values. 

For variables dimensioned by compressed composites, Oracle OLAP reduces redundancy in the variable, composite, and composite index by using the"intelligence" of the AGGREGATE command that populates the variable. For sets of base dimension values that represent parent nodes, Oracle OLAP creates a physical position in the composite only for those tuples that represent a parent with more than one descendant. Oracle OLAP then creates an index between this composite structure and the base dimensions and uses this composite structure as the dimension of the variable. Since the actual structure of a compressed composite is smaller than that of a b-tree or hash composite, a variable dimensioned by a compressed composite is also smaller than a variable dimensioned by a b-tree or hash composite. Also, since the index for a compressed composite only has nodes for parents with more than one descendant, the index of a compressed composite has fewer levels and is smaller than the index of a b-tree composite. 

Although performance varies depending on the depth of the hierarchies and the order of the dimensions in the composite, aggregating variables defined with compressed composites is typically much faster than aggregating variables defined with b-tree or hash composites."

You can define a composite as compressed using the syntax:

DEFINE name COMPOSITE <dims...> [AW workspace] COMPRESSED [SESSION]

Alternatively, with the forthcoming 10.1.0.4 patch release and the accompanying new version of Analytic Workspace Manager, you can turn on compression for a cube (which translates to one or more variables) using the GUI. 

This presumably tells AWM to create any composites with the COMPRESSED option as detailed above.

One of the things that was mentioned to me when I was first told about compression was that in this initial version, the scope for using it was quite limited. This would appear to be borne out by the reference in the above documentation that states:

"...when you aggregate data in a variable dimensioned by a composite defined with one or more hierarchical dimension, one parent node may have only one descendant node ­— and so on all the way up to the top level. When a variable has a good deal of this type of sparsity, use a compressed composite as the dimension of the variable. Dimensioning this type of variable with a compressed composite creates the smallest possible variable, composite, and composite index—much smaller than if you dimension a variable with a b-tree or hash composite"

which would suggest that compression in this release is only of value when you have the specific situation where you're finding parent nodes with only one descendent node. If anyone from the product team is reading, could you add any more detail for this?

Now that this is a bit clearer, the next step is to get hold of a migrated OES database that's going to 10g, identify whether the specific conditions for compressed composites are met, implement them and do some benchmarking. We've got a couple of clients who are looking to implement this feature so I'll report back with results in due course. Once again full details on compressed composites can be found in the OLAP DML Reference available on OTN.


More Information On Oracle 10g OLAP Compressed Composites

A few weeks ago I posted an article about the new Compressed Composites feature in Oracle 10g OLAP. Basically, compressed composites give you the opportunity, for certain types of sparse cubes, to dramatically reduce the size of the cube and the time it takes to perform aggregation. Although most people aren't really aware of this feature (it wasn't particularly well documented or highlighted when Oracle 10g came out) it's potentially one of the most useful features of the Oracle 10g OLAP Option, and anecdotal evidence suggested that build time could be brought down by up to a factor of eight.

I was particularly interested therefore to see a posting by Scott Powell on the OTN OLAP Forum, that gave a real world example of compressed composites in action, and the performance improvements that he encountered:

"Ok, just got finished testing compressed composites, and in all my life with IRI Express / Oracle Express / Oracle OLAP, I've NEVER seen a better feature added!!!!! 

The details of the results are below, but the quick summary is that a full aggregation went from 166 minutes down to ONE MINUTE!!! (ok, one minute 12 seconds, so sue me!) Also, the size of the fully aggregated db went from approx 3.7 Gb to 148 Mb!!! 

I'm Giddy!!! Note that I did validate the total numbers to make sure the aggregation actually worked (because it solved so quickly I didn't believe it...) 

Scott 

Cube info: 

5 dimensions
dates dim (14 leaf values, 5 levels)
line of bus dim (162 leaf values, 4 levels)
compliance rating dim (23 leaf values, 3 levels)
instruments dim (171 leaf values, 3 levels)
ownership dim (69,771 leaf values, 6 levels, 2 separate hierarchies)

The composites were defined with the dates dimension being DENSE, all others SPARSE:
<dim_dates <dim_ownership dim_lob dim_instruments dim_compliance>>

There were 190,676 total tuples after the data load and before aggregation.

SKIP-LEVEL aggregation, normal composite:
Agg time: 36 minutes 50 seconds
Tuples after agg: 1,520,377
Approx DB size: 1.3 Gb

FULL aggregation, normal composite:
Agg time: 2 hours, 46 minutes
Tuples after agg: 4,650,470
Approx DB size: 3.65 Gb

FULL aggregation, COMPRESSED composite:
Agg time: 1 minute 12 seconds!!!
Tuples after agg: 4,650,470 (assumed, can't measure like you can a normal composite)
Approx DB size: 148 Mb "

Scott and I have discussed the OLAP Option before so I dropped him a line to get a bit more information. According to Scott,

"The performance increase I saw was absolutely fantastic - seeing 2 hour 45 minute solves go down to just a minute or two. One note is that the number of "singles" for my example my be higher than average. I am analyzing investment securities, and some of the dimensions are very "demographic" in nature (i.e. security type, security rating, etc. doesn't vary for a given low level security, so this "forces" many singles. This would also be true in customer demographic applications, etc.)"

Scott also subsequently posted a couple of other threads on compressed composites, the first of which discussed aggregations using compressed composites (basically, it appears that only full aggregations using SUM are currently supported) and a second posting that summarised the current limitations (and workarounds) with compressed composites, together with any workarounds that might be needed:

"Ok, per my previous post, I'm completely in love with compressed composites. However, it does seem like there are some limitations with using them. Can someone who has played with these more than I please confirm the following (and let me know if there are workarounds, and/or if additional functionality will be added in future)?

1. Compressed composites must be FULLY aggregated, you can not do skip-level, partial aggregations

Workaround for #1 - none needed for me, speed performance was so dramatic that my 166 minute full aggregation went down to 1 minute, 12 seconds and the database size was a fraction of the normal size (150 Mb vs. almost 4 Gb)

2. Compressed composites can only do SUM functions, no others

Workaround for #2 - Keep any non-additve dimensions DENSE and out of the composite. This works great for me, where our major non-additive dimension is TIME (we do ending balance type calcs, not sums across time), which was already our dense dim. I should be able to solve the dense TIME dimension using a separate time-only aggmap, and then aggregate the sparse dims using a SUM only aggmap. Note - if you have many non-additive dims, this isn't going to work well

3. Compressed composites can only be used as the base for a single variable? (at least when I try to dimension a second variable by them I get an error message)

Workaround for #3 - either build a separate compressed composite for each variable, or define a "measure" dimension and have a single variable store all of your different values. In this case, and per workaround #2, you will probably have to define the new measure dimension as being DENSE (because you won't want to aggregate across measures)

4. You can not define a variable as having both a normal composite and a compressed composite?

Workaround for #4 - not really needed, this was just my attempt to make one composite (the compressed one) have all of the SUM only dimensions in it, and another composite (a normal b-tree composite) have all of the non-additive aggregation dims in it. Oracle seems to think this will not work, but they were not 100% sure and I have not tested.

Any other ideas / workarounds / knowledge of what the future will bring for compressed composites would be greatly appreciated - this is a HUGE step forward for Oracle OLAP."

to which Chris Chiappa replied:

"For #2, you can also play games with the new Aggregate "dataflow" (FROM/FROMVAR) capabilities in 10.1 as well - you can layer a completely dynamic variable on top of an aggregated CC and using dynamic aggregate to perform some of the operations you wouldn't be able to in the base variable. 

#3 is unlikely to go away any time soon, but since you need separate composites to take full advantage of the new multi-writer stuff in 10.1 anyhow, sharing composites is likely to become much less common going forward."

For anyone looking for more information on compressed composites, check out my original posting and the OLAP DML Reference R1 (10.1) section on compressed composites. Thanks again to Scott for allowing me to reprint his postings.