Home arrow Blog
My Ramblings on Open Source BI

Seth Grimes published an article about the potential growth of four BI companies. 

“That core software components are free makes open source attractive to users as well as systems integrators and independent software vendors that sell products and services built on open source BI components.” 

Take note: People at least start out expecting products which advertise as Open Source to be free.

The companies he wrote about leverage the words "open source" in their marketing albeit with the word "Commercial" associated with "Open Source". 

I am worried that, to someone who's not "in the know," this distinction is too subtle.  I am thinking of the general press (ie. not BI press) and customers who are new to BI, etc etc. 

They latch onto the words "Open Source," we all know that. They think it's all one and the same, they're talking about True Open Source.

I would worry that might feel as if a bait-and-switch occurred once they learned what's really going on.  To varying degrees, this will eventually detract from the brand and philosophy of "Open Source" (per the OSI definition).  In essence, this is what Roberto Gallopini joked about when he read one of our early days press releases.    BTW, that particular press release was a mistake, some of the wording about Open Source got out of our control and published before we could correct it -- Roberto was quite right to tease us about it :).

Seth said:

 “Commercial open source BI vendors, notably Pentaho and JasperSoft, offer these components in free community editions with open source licenses, and also packaged with non-open source extensions in paid, supported, indemnified editions.
"It's hard to get going without help from the vendor, and anytime I need to upgrade, I get a little bit nervous. (from Seth's article, a quote by Venkat Gaddipati, CTO at online marketplace OnForce)"
“Beyond Compliance harnesses the Palo OLAP Server on the back end and the non-open source Palo Worksheet Server for report distribution.”

To me, these quotes imply that a customer needs closed software to find it useful.   Are these companies really Open Source vendors? I see what I see.   Form your own opinion.

In a separate article by Stephen Swoyer, Vincent Pineau, GM Talend Americas, said:

"We definitely do not feel that open source should equate to free. Yes, we are lower [in price] than our competitors, but we also do not want people to think that cheap in price or cheap in features. We believe that the right features are what people should focus on," Pineau says
At least that's a pretty plain statement, nobody can complain about that.

Ok, then, we have had a project up at Codeplex for a long time now.  It is true open source.  I'd really prefer you pay for the commercial one (at this site).  Can we now call ourselves Commercial Open Source? Is that now an acceptable term? 

Premise-Promise of Open Source

Seth said “For most open source BI adopters, however, the solution search starts with cost.” 

Folks, from my experience, people seeking Open Source want a FREE, totally free, experience.  They want to pay nothing. Period.

This isn't the case with Commercial Open Source vendors. Customers end up paying $30,000.  And they might spend a month or two of expensive IT time trying to get it working before they resort to these Commercial Open Source vendors. 

Remember, this is something they originally thought would get free.  $30,000+ and several months of work is not free.  One might even argue that "Commercial Open Source" is not in-expensive.

Sometimes it's better to pay and get documentation, support, and quality control. IMO, you will end up spending less.

Our own philosophy is that the stuff you need to be open-source, is. 

For example, our User Interface and programmatic interfaces are all Open Source. Supported, customizable Open Source.

Therefore, one shouldn't divide the BI world into two categories.  There are more than the four "commercial open source vendors" and the monolithic mainstream BI vendors who were mentioned. 

In fact, there are plenty of other vendors who can fulfill BI needs very satisfactorily for less cost than Pentaho and Jaspersoft.

Those are good companies. I know some of the founders and they are great people. I can't quite comment on their product quality since Open Source by its nature can easily have random bugs injected and no one is accountable.  

My Point is, if you're shopping for Commercial Open Source, then also open your minds to other vendors too.  The TCO will be about the same and you might have a better experience. Not everyone charges as much as the big guys, although, I wish :)  ...

 
Cindy Howson on cool things happening in BI
Cindi Howson wrote an article on November 10, 2008  for Intelligent Enterprise
I found the article very informative.  She wrote about a number of things. I took an interest in "In-memory analytics" and "Advanced Visualization".  Following are my comments on the topics she wrote about. 


In-memory analytics

:  One of the potential benefits of in-memory is that it replaces the expensive & time-consuming aspects of reporting databases or cubes. 

When a user has "new" business question, underlying data is created on-the-fly with no IT overhead either to design it, or to maintain it.


Advanced Visualization

:  Most "visualization" technology gets its data directly from a database or cube and suffers the limitations of query technology (and not using a business tier in a 3 tier model): 

* single-pass queries, such as SQL;
* single data source;
* complex SQL & MDX being the gating factor in answering business questions;

Due visualization being based on query technology, the data you're going to discover just isn't going to be that interesting.


Here's my solution

What these products need is a "business layer" sitting on top of the query results. 

Advanced metrics and kpis can be developed in a business layer, without much impact on the database.

If it can be created in-memory, so much the better.

If visualization products can plot it, even better.

Our product acts is an in-memory business layer and can supply data to any BI or visualization tool. We give them useful numbers to display and explore visually.

On top of that, per your previous page in this article, it is highly desirable that the business user can experiment with and author their own metrics and KPIs.  Again, we've solved that.  Come take a look!

 

 
First steps in developing KPIs & Metrics

Craig Schiff wrote an article at http://www.intelligententerprise.com/showArticle.jhtml?articleID=51201364 some time ago.  Following are some salient points he made:

"Without steady involvement by the business side, IT's dashboard projects are doomed to failure."
"IT thinks in terms of improving access to data, whereas finance focuses on improving its processes (particularly the painful ones)."
"What's most important is the content: that is, what's being measured and displayed on the dashboard."
"confront one of the toughest challenges: How do you get to that short list of agreed-upon key measures?"
"Once you've settled the corporate strategy, chosen the metrics development team, and put a facilitator in place, you can get down to business. The team needs to work down from the strategy to determine the key business drivers."
"the organization may not have the underlying data necessary to perform the calculations to support the KPI. What do you do then? Throw the KPI out? Wrong answer. What most companies do is manually enter the data for the KPI."
"KPIs should offer a good mix of financial as well as operational measures."

I think I've represented some of his key points accurately.  It's enough to tell me (and you) that Craig and I think alike.   His article was written in 2006 and following are my comments on what I think has evolved since then. 

To begin, I've think it's necessary to cut down the reliance on IT in providing the metrics and generating the KPIs because:

  1. Managers like to generate their own information whether it be KPIs or metrics. The key evidence of this is the popularity of Excel. They need this to be able to react to fast changing business needs.  Routing new business questions and performance metrics and indicators through IT is too slow.

  2. He suggested that, if a company isn't tracking the data, entering it by hand is risky and an auditable process won't be likely.  For me, this would be a least preferable approach, but at the time it was written, it was a decent & practical one.

    Instead, I believe there's a very good chance the data is there in the company, it's just that conventional BI tools are limited to what they can do. 

    If you're using a cube, then the data has to be in the one cube or you have to be a grandmaster in MDX to get what you want. If you're using a reporting tool, then the data has to be available from a single SQL query. Relying on this kind of technology and its limitations is just plain in-adequate now that better choices are avaialble. Let's for the moment ignore the fact that they charge way too much money and the IT overhead is a trememendous financial burden.
  3. Users, not IT, need a way to read in data from multiple heterogenous data sources into the same report.

My work at nextanalytics has been addressing this evolving requirement.   I think we've made great gains and will continue to do so in the future.

 

 
How to Improve Your Scorecards and Dashboards

Most people don't realize just how much work there is in populating the data for a dashboard or scorecard.  Let’s make a list of some of the things that contribute to the problem.

The Query.  Getting the data

To get useful data, SQL is too difficult and expertise is scarce and expensive.  Moreover, SQL can’t do all the things you need. Often, the solution requires extra tables which are usually undesirable.  MDX & Cubes are an alternative, but they are expensive to design, create and maintain especially if the needs are likely to change.

Once the query has been submitted, too much data returns.  The ODBC layer or DataSet Object uses too much memory and causes failures at the presentation layer.

Putting The Data Into The Chart

Now that you have the data, there's too many labels, and large ones affect how the legends and chart labels appear. It is way too easy to get a chart that looks horrible.  Visualizations fail on large number of rows and columns. Overwriting risers, lines, symbols are all quite common.

Most chart package support some  local data manipulation and calculation operations, but they are not integrated with the query, causing a design & implementation burden for the presentation layer developer.

Crosstabs & Datagrid Objects

Just like with charts, too many and large labels take up too much screen real-estate. There’s no room for numbers and you can’t visualize anything about your data. Too much scrolling.

As with chart packages, local math and sort operations are not integrated with the query, and it’s difficult to save the user’s actions to the same state the next time they open the view.

UI Controls  (Combos, listboxes, etc )

Plainly said, these fail given a normal amount of production data.  UI developers resort to hard-coding but slowly changing data (in the business) makes the dashboard a maintenance nightmare.  Metadata mitigates this, but at a great overhead expense .

Three Part Solution

Reduce the volume, provide subsets as a result of analytical workflows, not just arbitrary hierarchy like OLAP provides.  From the user interface, let the users choose an analyzed subset to look at with their scorecard or dashboard.  Alternatively, use analytics to aggregate lower level detail into upper level views but avoid the overhead of a cube or temporary database tables.

Use analytic workflows to improve the content. Pre-calculate patterns, trends, outliers, comparisons, and whatever else suits your business. 

Provide your users with valuable "already analyzed" numbers.  If you do this, you don’t need to chart as much data, and the data being shown is more useful than raw “database data”.  

Avoid the trap of encouraging them to export to Excel, spreadsheets are an entity that requires auditability per Sarbanes-Oxley. Don't get yourself into a legal mess.

Improve your charts and crosstabs by using programmatic techniques to perform text and numeric substitutions.  This makes your visualizations more readable.

Three easy steps. That's all it takes to make scorecards & dashboards more useful.   

Nextanalytics makes these steps easy.  It offers additional processing on query results and transparently provides the analyzed data to presentation layers.  A simple, elegant solution that is fast and easy to implement.

 

 
Questions to ask before deciding to use a cube for analytics

Is a cube  what you need?

Bill Inmon (aka: the father of the data warehouse) is a world famous expert in data warehousing and a big advocate of cubes, which he calls “Dimensional Modeling”.   Inmon wrote an article published under the title "Issues with Dimensional Modeling" (http://www.dmreview.com/issues/20000501/2184-1.html). 

Despite being an advocate of cubes, he explains at least ten root cause for problems that occur when cubes are in-appropriately.  I’m not allowed to quote him due to a restrictive copyright statement at the bottom of that URL. However, a paraphrased subset of his view seems to be:

  • dimensional modeling fits where the requirements are known before the infrastructure is built.
  • dimensional modeling is not a good candidate for “exploration warehouses.”
  • dimensional modellers have crafted a fine hammer. Now everything looks like a nail to them.

Basically, one cube won’t answer the questions for all people. If you try to deliver a system that does that, you will fail. 

He also says you must take the time to know the specific needs of the user before you start.  Implicitly, if you don’t know what reports people will want, then cubes aren’t the way to go.

To me, this means a cube doesn’t sound right for analysis, at least not my definition of it.  My definition of an analytical workflow is that you ask a question, get the answer and then either revise the question or move on to another one.

In “A Dimensional Modeling Manifesto,” http://www.dbmsmag.com/9708d15.html, Ralph Kimball discusses problems that arise from building fact tables that start at too high a level.  He advocates working at the transaction level.  He states that, if a fact table were designed at a higher level, then it would be difficult to find and maintain the correct dimensional members, the attributes, and facts for the fact table. 

Unfortunately, computing resources and operating systems were relatively less powerful back then (circa 90 to 97) and for that and many other reasons, packing fact tables full of transaction level detail was rarely done.   The cube sizes just got too big when designers followed that mantra. In addition to data volumes being too big for memory, probably the most evident problem were the OLAP Clients and Reporting tools.  They have memory and GUI limitations, so they could not expose a large number of dimensions and fact table data.  Thus, it is necessary to simplify the data, to aggregate multiple physical entities into a single abstract entity.  Summarized data is better than nothing.

Kimball's recommendation of working at the transaction level is a key point that should be re-visited with more modern tools than the current round of OLAP cube clients, ones that don't have the limitations in the GUI or their crude visualization.

So, as volume creeps up, what happens? Who hasn't loaded up production data only to find the tool just crawls along, doesn’t work, or looks terrible.  Is the problem that there's too much data and metadata for GUI tools to render properly? Or is it the tool?  If you have a great hammer, does everything have to look like a nail?

  • With too much metadata, GUI controls operate slowly and with errors, users can’t find the members they are looking for, and the labels get too long.
  • With too much data and metadata, visualizations get cluttered because labels overwrite and there are too many datapoints, colors get re-used, and you can’t discern individual points.

If the size of the data and metadata is going to be a problem, it must be reduced to an acceptable level. Most cube designers therefore have no choice but to aggregate transaction level values to a single cube member.  As-is, this is fine for visualization of a high level view of your business, but it doesn’t necessarily lead to useful or correct analysis of your business.

Life to date historical information, the Root Cause

Cubes will usually contain life to date information.  For example, a fact table must contain every product that’s ever been sold and the product dimension must contain all products ever sold.  If a company is large enough to afford BI, it’s not uncommon to see dimensions with levels that have 100,000 to 500,000 (often more) members.

It's also a good point to mention another intrinsic problem that happens: Slowly Changing Dimensions.  Wiki documents various techniques for dealing with the gradual degradation of cubes that they cause. There are no good answers, just workarounds. Bottom line: In a production world, cubes become larger, and slower.  If you keep the focus of the reports they produce narrow, the effect is less. The more reports you try to produce, the worse it gets.  http://en.wikipedia.org/wiki/Dimensional_database

The Downstream Effect of the Life to Date and Slowly Changing dimensions

The query design parts of the tools don’t have any awareness of the age of the data. Even products that haven’t been sold in ten years show up alongside those that were sold this week. The problem is: GUI controls tends to freeze up when asked to fetch more than 30,000 members.

To help alleviate this, parent child relationships are established. Many potential analytic operations become difficult once members are no longer siblings which represents a challenge either in cube design or necessitates the needs for some very complex cube queries.

If a user is forced to work within an arbitrary parent-child relationship then it’s difficult to find members.  To mitigate this, to make these more self-documenting, many report designers display the hierarchy as part of the name. We have seen 250 character product names where only the trailing few characters are different.  Think for a moment how that appears in a drop down list box or a crosstab or a the legend of a chart.

The visualization experience suffers, because long labels tend to cause overwriting in the UI (notably the charts) and the extra characters reduces the amount of meaningful data on the screen (e.g. extra indenting is required, more text is shown rather than numbers).

This, and for other reasons due to relative superiority of Excel in filtering and calculations, most cube users open a crosstab and immediately export to Microsoft Excel.   Building a cube to feed Excel sounds like a lot of overhead.  If they’re going to end up in Excel, what does that say about the investment you put into the cube and it’s dimensional approach?

Because of aggregation, it’s very easy to completely miss important facts when working with cube data

The business analyst is expected to recognize important changes in data. Let’s consider some scenarios that might reasonably arise in a production environment. Start by assuming you have a member labelled “soda”. Soda is a summary of at least three distinct inventory items. “cola in bottle 10 oz english label,  orange in can 10 oz english label,  orange in can 10 oz spanish label” all get rolled into “3 soda”.

Here’s what might happen, and you’d never even know it:

  1. Suppose that sales of cans takes market share from the bottle packaging.  The analyst will never notice that the shift in purchasing behaviour .  To track cans and bottles separately after the cube is designed and being built is a complicated task and will take some time. They will be chasing the trend long after its happened.
  2. Suppose you had a special promotion with inventory from a new supplier whereby you sold 1000 extra colas but refunded 500. In a cube, you’d see positive change of 500 and you’d report a positive new trend, the organization would order more supplies but what really should be reporting is an issue relating to a new supplier.
  3. Suppose that you sold a 10,000 extra colas and correspondingly fewer orange. That’s an important shift in market share, you’d want to know about it, but through aggregation, you wouldn’t know about it. 

Because of the aggregation of individual transactions to a single “soda”, changes within soda are hidden. An analyst would reasonably conclude “Sales of Soda are basically unchanged” but we know they are anything but unchanged:  

  1. There’s a shift to cans,
  2. Coke market share rose, Sprite dropped, and
  3. you’re refunding a large portion of what you sell because of a new supplier.

The point is: The culprit in the wrong conclusion was the aggregation to a single member. If you had had the foresight and a client tool that could handle the volume, you could've seen those separate transactions as separate members.  Cube technology being what it is, it's not easy to go back and re-engineer the incoming data to fix it after the fact. 

The lesson learned is that aggregation hides the important data but aggregation is a necessary aspect of using cubes. Important anomalies, patterns, and trends that occur at the shop floor level will not be seen by most cube users due to aggregation.     

The OLAP cube vendor will tell you start your analysis by looking from the top down and drilling to detail when you see important values.   This story sounds reasonable, it suggests you can see important business trends after which you can drill to detail relational records.  The problem is this: Those scenarios posed above would not have been flagged by a cube user. There'd be no drill thru to detail because the user would not have known the problem was occurring.

Other points can be made that detract from the strategy of exposing analytics through summarized data:

  • People cannot reliably and consistently see that a pie slice or bar riser in one period is smaller or larger than another period, unless it is extreme.
  • aggregation will often mask changes, whereby the sum of child values net to essentially the same value, even though individually they have experienced dramatic change
  • the sheer difference in the magnitude of a parent’s value can hide the significant changes that occur one of it’s child’s values. (e.g. would a 20% change in Albany change the sales total for New York state?)
  • conclusions made by observing superficial members are not practical and actionable.  What’s the point of the business analyst extending trend lines on summary data when it has known issues in the deriving the final values?
Due to aggregation, detail is rarely in a cube.   It’s only in the detail that will you see real-world outliers, patterns, trends, and anomalies as they happen, not in the summarized data.

Don't Ignore License Cost and Installation 

If you are still considering doing analytics with an OLAP Cube, you will need the cube server license and then you will need to install it. Don’t under-estimate the cost of the license.

If you are OEM or ISV, you will end up with another licensee in your installation and their will erode your profit, year after year.

Also, don't underestimate the magnitude of effort associated with installing their software. Even just installing Microsoft SSAS (from scratch) can take days -  try it!

In my opinion...

The only reasonable time to use a cube is if you already have it built and it has the calculations you want already in it.  But you are still exposed to the risk of not seeing the issues that arise at the layer below your fact table or the level of summarization that you're looking at, so beware.

 

 
<< Start < Prev 1 2 Next > End >>

Results 1 - 9 of 15