Questions to ask before deciding to use a cube for analytics
Posted by: ward in Untagged on Sep 18, 2008
Is a cube what you need?
Bill Inmon (aka: the father of the data warehouse) is a world famous expert in data warehousing and a big advocate of cubes, which he calls “Dimensional Modeling”. Inmon wrote an article published under the title "Issues with Dimensional Modeling" (http://www.dmreview.com/issues/20000501/2184-1.html).
Despite being an advocate of cubes, he explains at least ten root cause for problems that occur when cubes are in-appropriately. I’m not allowed to quote him due to a restrictive copyright statement at the bottom of that URL. However, a paraphrased subset of his view seems to be:
- dimensional modeling fits where the requirements are known before the infrastructure is built.
- dimensional modeling is not a good candidate for “exploration warehouses.”
- dimensional modellers have crafted a fine hammer. Now everything looks like a nail to them.
Basically, one cube won’t answer the questions for all people. If you try to deliver a system that does that, you will fail.
He also says you must take the time to know the specific needs of the user before you start. Implicitly, if you don’t know what reports people will want, then cubes aren’t the way to go.
To me, this means a cube doesn’t sound right for analysis, at least not my definition of it. My definition of an analytical workflow is that you ask a question, get the answer and then either revise the question or move on to another one.
In “A Dimensional Modeling Manifesto,” http://www.dbmsmag.com/9708d15.html, Ralph Kimball discusses problems that arise from building fact tables that start at too high a level. He advocates working at the transaction level. He states that, if a fact table were designed at a higher level, then it would be difficult to find and maintain the correct dimensional members, the attributes, and facts for the fact table.
Unfortunately, computing resources and operating systems were relatively less powerful back then (circa 90 to 97) and for that and many other reasons, packing fact tables full of transaction level detail was rarely done. The cube sizes just got too big when designers followed that mantra. In addition to data volumes being too big for memory, probably the most evident problem were the OLAP Clients and Reporting tools. They have memory and GUI limitations, so they could not expose a large number of dimensions and fact table data. Thus, it is necessary to simplify the data, to aggregate multiple physical entities into a single abstract entity. Summarized data is better than nothing.
Kimball's recommendation of working at the transaction level is a key point that should be re-visited with more modern tools than the current round of OLAP cube clients, ones that don't have the limitations in the GUI or their crude visualization.
So, as volume creeps up, what happens? Who hasn't loaded up production data only to find the tool just crawls along, doesn’t work, or looks terrible. Is the problem that there's too much data and metadata for GUI tools to render properly? Or is it the tool? If you have a great hammer, does everything have to look like a nail?
-
With too much metadata, GUI controls operate slowly and with errors, users can’t find the members they are looking for, and the labels get too long.
-
With too much data and metadata, visualizations get cluttered because labels overwrite and there are too many datapoints, colors get re-used, and you can’t discern individual points.
If the size of the data and metadata is going to be a problem, it must be reduced to an acceptable level. Most cube designers therefore have no choice but to aggregate transaction level values to a single cube member. As-is, this is fine for visualization of a high level view of your business, but it doesn’t necessarily lead to useful or correct analysis of your business.
Life to date historical information, the Root Cause
Cubes will usually contain life to date information. For example, a fact table must contain every product that’s ever been sold and the product dimension must contain all products ever sold. If a company is large enough to afford BI, it’s not uncommon to see dimensions with levels that have 100,000 to 500,000 (often more) members.
It's also a good point to mention another intrinsic problem that happens: Slowly Changing Dimensions. Wiki documents various techniques for dealing with the gradual degradation of cubes that they cause. There are no good answers, just workarounds. Bottom line: In a production world, cubes become larger, and slower. If you keep the focus of the reports they produce narrow, the effect is less. The more reports you try to produce, the worse it gets. http://en.wikipedia.org/wiki/Dimensional_database
The Downstream Effect of the Life to Date and Slowly Changing dimensions
The query design parts of the tools don’t have any awareness of the age of the data. Even products that haven’t been sold in ten years show up alongside those that were sold this week. The problem is: GUI controls tends to freeze up when asked to fetch more than 30,000 members.
To help alleviate this, parent child relationships are established. Many potential analytic operations become difficult once members are no longer siblings which represents a challenge either in cube design or necessitates the needs for some very complex cube queries.
If a user is forced to work within an arbitrary parent-child relationship then it’s difficult to find members. To mitigate this, to make these more self-documenting, many report designers display the hierarchy as part of the name. We have seen 250 character product names where only the trailing few characters are different. Think for a moment how that appears in a drop down list box or a crosstab or a the legend of a chart.
The visualization experience suffers, because long labels tend to cause overwriting in the UI (notably the charts) and the extra characters reduces the amount of meaningful data on the screen (e.g. extra indenting is required, more text is shown rather than numbers).
This, and for other reasons due to relative superiority of Excel in filtering and calculations, most cube users open a crosstab and immediately export to Microsoft Excel. Building a cube to feed Excel sounds like a lot of overhead. If they’re going to end up in Excel, what does that say about the investment you put into the cube and it’s dimensional approach?
Because of aggregation, it’s very easy to completely miss important facts when working with cube data
The business analyst is expected to recognize important changes in data. Let’s consider some scenarios that might reasonably arise in a production environment. Start by assuming you have a member labelled “soda”. Soda is a summary of at least three distinct inventory items. “cola in bottle 10 oz english label, orange in can 10 oz english label, orange in can 10 oz spanish label” all get rolled into “3 soda”.
Here’s what might happen, and you’d never even know it:
-
Suppose that sales of cans takes market share from the bottle packaging. The analyst will never notice that the shift in purchasing behaviour . To track cans and bottles separately after the cube is designed and being built is a complicated task and will take some time. They will be chasing the trend long after its happened.
-
Suppose you had a special promotion with inventory from a new supplier whereby you sold 1000 extra colas but refunded 500. In a cube, you’d see positive change of 500 and you’d report a positive new trend, the organization would order more supplies but what really should be reporting is an issue relating to a new supplier.
-
Suppose that you sold a 10,000 extra colas and correspondingly fewer orange. That’s an important shift in market share, you’d want to know about it, but through aggregation, you wouldn’t know about it.
Because of the aggregation of individual transactions to a single “soda”, changes within soda are hidden. An analyst would reasonably conclude “Sales of Soda are basically unchanged” but we know they are anything but unchanged:
-
There’s a shift to cans,
-
Coke market share rose, Sprite dropped, and
-
you’re refunding a large portion of what you sell because of a new supplier.
The point is: The culprit in the wrong conclusion was the aggregation to a single member. If you had had the foresight and a client tool that could handle the volume, you could've seen those separate transactions as separate members. Cube technology being what it is, it's not easy to go back and re-engineer the incoming data to fix it after the fact.
The lesson learned is that aggregation hides the important data but aggregation is a necessary aspect of using cubes. Important anomalies, patterns, and trends that occur at the shop floor level will not be seen by most cube users due to aggregation.
The OLAP cube vendor will tell you start your analysis by looking from the top down and drilling to detail when you see important values. This story sounds reasonable, it suggests you can see important business trends after which you can drill to detail relational records. The problem is this: Those scenarios posed above would not have been flagged by a cube user. There'd be no drill thru to detail because the user would not have known the problem was occurring.
Other points can be made that detract from the strategy of exposing analytics through summarized data:
-
People cannot reliably and consistently see that a pie slice or bar riser in one period is smaller or larger than another period, unless it is extreme.
-
aggregation will often mask changes, whereby the sum of child values net to essentially the same value, even though individually they have experienced dramatic change
-
the sheer difference in the magnitude of a parent’s value can hide the significant changes that occur one of it’s child’s values. (e.g. would a 20% change in Albany change the sales total for New York state?)
-
conclusions made by observing superficial members are not practical and actionable. What’s the point of the business analyst extending trend lines on summary data when it has known issues in the deriving the final values?
Don't Ignore License Cost and Installation
If you are still considering doing analytics with an OLAP Cube, you will need the cube server license and then you will need to install it. Don’t under-estimate the cost of the license.
If you are OEM or ISV, you will end up with another licensee in your installation and their will erode your profit, year after year.
Also, don't underestimate the magnitude of effort associated with installing their software. Even just installing Microsoft SSAS (from scratch) can take days - try it!
In my opinion...
The only reasonable time to use a cube is if you already have it built and it has the calculations you want already in it. But you are still exposed to the risk of not seeing the issues that arise at the layer below your fact table or the level of summarization that you're looking at, so beware.




