We are continuing a series of discussions about data warehousing best practices based on my blog entry here. This time we take on choosing the right granularity in a data warehouse. The rule of thumb, and the point I made in the original blog entry was that you should always choose the lowest level of granularity. That makes sense in that you can always aggregate to a higher level, but you can't go back down to detail if the data is not stored.
I can give an example of one data warehouse that I am familiar with that has won awards and accolades for best practices. However, a closer look reveals several trade-offs that were made during design that have had far and long lasting effects. During the design process, the decision was made to drop a lot of detail data in order to save space. This data, which many would consider essential to any data warehouse, contained transactional details and customer information. The data warehouse was built for financial use, and the finance organization was unconcerned about these details. They were looking at metrics by products and the chart of accounts.
Was this necessary? Arguably, it was. Since IT projects were funded by the business organization, the finance organization had no incentive to pay for storage or CPU required to process more data than was absolutely required. They simply made the decision to design for their own users' requirements.
However, this decision also had several long term effects that may have not been taken into consideration. Since the warehouse did not contain details that were important to other departments, such as marketing, sales, and operation, the warehouse was rendered useless for a large segment of the company. These organizations, in turn, were then forced to build their own data warehouses from the same source data in order to satisfy their users' needs. The end result being, as a company, they spent more building and maintaining multiple data warehousing systems than by investing more in the first one.
Another effect that had considerable cost was the other organizations' difficulty reconciling their numbers to the finance organization. Since the detail source data was not maintained by finance, and their business rules were buried in thousands of lines of code, users in the other departments spent countless hours trying to resolve differences between the systems.
So, was the right choice to build one true enterprise data warehouse to serve the entire organization?
In theory, the enterprise data warehouse is the best solution. However, we recognize that the larger the organization, the more difficult this is to accomplish. Enterprise level designs in 50 billion dollar + companies become almost a fantasy. There have been a few documented successes, but the kinds of investment involved in these projects take years and millions of dollars.
It is important that there be a chief architect overseeing the development of all information systems. That person must have a handle on the current environment and a vision for integration and reduced costs. I have met a few of these guys who had the title and responsibility but lacked the vision. I have also met a few who were convinced that silos were simply more cost effective and maintainable. Every organization is different and what works for one might not work for another.
So back to the original consideration for this post. Should a data warehouse reflect the lowest level of granularity? In general, yes, it should, but there are a number of other factors that must be considered when making the decision. Ultimately, the answer is that the granularity of the data should reflect the lowest level that has the greatest use to the greatest number of users.
Recent Comments