In late 2007, two insurance-industry colleagues brought me a problem they were having with an SSRS report that listed properties. They wanted to show a count of properties as well as note information for each property. Because each property could have many notes, this involved two levels of grain: properties, and property notes. They had tried a property report with a property-note sub-report, but it wasn't working for them. They needed the full report, including its sub-report sections, to render in Excel, which the SSRS Excel rendering extension did not support.
If they combined the two grains into one dataset, then all the information was available and visible to the report and exportable to Excel. However, this made the count of properties wrong, because the report double-counted any property that had more than one note.
I’ve run into scenarios like this again and again so I thought a blog post might be useful to somebody.
Definition of Grain
Any dataset (not only in the context of a data warehouse) has a grain. The grain is exactly what each record in the dataset represents.
Here’s a great Kimball Group Article on Grain.
In the insurance properties example above, there were two levels of grain that really necessitated two datasets: properties, and property notes.
Often, one needs to aggregate measurements stored in datasets. If the dataset's grain is lower (finer) than the grain of the measurement, the measurement's value is no longer unique per row: it repeats, and a naive aggregation WILL include it more than once (probably not what you want). Getting the grain right is therefore imperative if you want accurately aggregated measurements.
In the example below, I illustrate one solution to the “wrong level of grain” problem.
The database objects and fictitious manufacturing data shown below were created for this post to illustrate the techniques mentioned. Although they were created for illustration purposes only, similar data structures can be found in real manufacturing execution systems.
Examine the following ER diagram.
This database models manufacturing lines that each have a series of machines. Each machine records many events. You receive the following report requirement and begin to write your SQL query.
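The original diagram isn't reproduced here, but a minimal T-SQL sketch of the three tables it describes might look like the following (table and column names are my assumptions, not the post's actual DDL):

```sql
-- Hypothetical schema for illustration only.
CREATE TABLE dbo.[Line] (
    LineId   INT IDENTITY PRIMARY KEY,
    LineName VARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Machine (
    MachineId   INT IDENTITY PRIMARY KEY,
    LineId      INT NOT NULL REFERENCES dbo.[Line](LineId),
    MachineName VARCHAR(50) NOT NULL,
    DailyCost   MONEY NOT NULL          -- measurement at the Machine grain
);

CREATE TABLE dbo.MachineEvent (
    MachineEventId INT IDENTITY PRIMARY KEY,
    MachineId      INT NOT NULL REFERENCES dbo.Machine(MachineId),
    EventTime      DATETIME NOT NULL,
    EventText      VARCHAR(255) NOT NULL -- detail at the lower, Event grain
);
```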
"Want to see Machines with their daily cost AND Machine events for a given line."
Sounds simple enough. The difficulty, as you shall quickly see, is that events are at a lower grain than machines. Summing the daily cost for each machine will yield wrong daily costs, as each machine's cost is duplicated for every event returned.
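A naive join along those lines (a sketch against the hypothetical schema above, with @LineId standing in for a report parameter) shows the problem:

```sql
-- Joining the two grains directly repeats DailyCost once per event...
SELECT m.MachineName, m.DailyCost, e.EventTime, e.EventText
FROM dbo.Machine AS m
JOIN dbo.MachineEvent AS e ON e.MachineId = m.MachineId
WHERE m.LineId = @LineId;

-- ...so aggregating over it overcounts: a machine with five events
-- contributes 5 x DailyCost to the total.
SELECT SUM(m.DailyCost) AS LineDailyCost
FROM dbo.Machine AS m
JOIN dbo.MachineEvent AS e ON e.MachineId = m.MachineId
WHERE m.LineId = @LineId;
```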
At this point, you'll be tempted to blame the business user for an unreasonable requirement, but you must not: the request is in fact quite reasonable. When looking at cost, they very likely want to see details that might explain that cost. Furthermore, they want to see these costs and details together in their favorite tool, Excel. And don't go giving up on SQL just yet, writing two queries at two different grains and massaging like crazy inside Excel. That's just not scalable.
Here's the pattern I like to use. Even if you don't implement it exactly the way I did (there are many other ways), the principle is what matters:
Flatten the lower grain and roll it into the aggregatable grain.
I chose to do this with FOR XML. The resulting XML is easy to format in your favorite reporting tool. This preserves the Machine grain and allows you to accurately aggregate the measurements (in this case, cost), while still displaying the useful lower grain: the events.
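The post's original query isn't shown, but one way to apply the pattern, sketched against the hypothetical schema above, is a correlated FOR XML PATH subquery:

```sql
-- One row per machine (the aggregatable grain);
-- each machine's events flattened into a single XML column.
SELECT m.MachineId,
       m.MachineName,
       m.DailyCost,                         -- now safe to SUM
       (SELECT e.EventTime AS '@time',
               e.EventText AS 'text()'
        FROM dbo.MachineEvent AS e
        WHERE e.MachineId = m.MachineId
        ORDER BY e.EventTime
        FOR XML PATH('Event'), ROOT('Events'), TYPE) AS Events
FROM dbo.Machine AS m
WHERE m.LineId = @LineId;
```

Because the result set has exactly one row per machine, SUM(DailyCost) over it is accurate, and the Events column carries the detail along for display.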
When you expand one of the XML Events Values, here is what it looks like.
This is essentially the solution that was implemented for the properties and property notes problem in the insurance example. Each property was represented by a single row in the result set, and there was an XML column for the notes (all scrunched up in there). It was easy enough in SSRS to use the built-in Replace function to replace XML tags with line endings and indentation, yielding a beautiful single-dataset, Excel-compatible report.
The reason I put the MachineEvents in a double subquery is that I may want to UNION different types of details in the future, which all roll into one XML column.
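That double-subquery shape might look like this (a sketch, showing only the XML column; the derived table is the seam where future detail sources would be added):

```sql
-- FOR XML AUTO names elements after the derived table's alias.
(SELECT x.EventTime, x.EventText
 FROM (SELECT e.EventTime, e.EventText
       FROM dbo.MachineEvent AS e
       WHERE e.MachineId = m.MachineId
       -- UNION ALL SELECT ... other detail sources, same column shape,
       -- can be added here later without touching the outer query
      ) AS x
 FOR XML AUTO, TYPE, ROOT('Events')) AS Events
```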
In a transactional system, you may want to consider doing such rolling up in a view, so that in the future you can easily join to it at the desired grain without encountering the "wrong level of grain" problem again. For a data warehouse, after considering the reporting requirements, you probably want to do the flattening during ETL and plop the XML into a dimension.
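For the transactional-system case, such a view could be as simple as wrapping the rolled-up query (again a sketch; the view name is my invention):

```sql
-- Locks in the Machine grain; events arrive pre-flattened.
CREATE VIEW dbo.MachineWithEvents
AS
SELECT m.MachineId,
       m.LineId,
       m.MachineName,
       m.DailyCost,
       (SELECT e.EventTime AS '@time',
               e.EventText AS 'text()'
        FROM dbo.MachineEvent AS e
        WHERE e.MachineId = m.MachineId
        FOR XML PATH('Event'), ROOT('Events'), TYPE) AS Events
FROM dbo.Machine AS m;
```

Any query that joins to this view at the Machine grain gets the event detail for free, and can still SUM(DailyCost) safely.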
You'll run into the "wrong level of grain" problem again and again, everywhere, and now you can see it's not very difficult to solve. The key is understanding grain and coming up with ways to roll details into the desired grain.