My first employer provided me with some valuable insight:
Microsoft Excel is the main competition for any piece of software.
Over the years this statement has proven true an alarming number of times. And it makes sense too. Everyone knows how to use Excel, and it’s extremely flexible, especially when you’re working with any kind of (tabular) data. In other words: all roads lead to Excel.
Roads from SQL also often lead to Excel, even though they’re not always pretty. Sure, if you’re on foot or horseback, with a limited amount of luggage, the road will be fine. However, here’s a particular scenario that obscures the path.
Scenario
These are the basic constraints:
- Available tools: MSSQL 2012, SSRS, SSIS, Visual Studio 2012 & .NET 4.5.
- Excel versions: either XLS (2003 and below) or XLSX (2007 and up, slightly preferred) will do.
- Form of data: combination of normalized and denormalized data (see below).
- Amount of data: tops 250.000 rows (times 20 when unpivoted).
- Required response time: live exports that should run within seconds.
- Databases: many instances each with the exact same schema but different data.
So there’s access to the latest and greatest Microsoft tools, and the option to include custom components. Free and open source components are preferred, but buying tools and components is also an option.
Data
Here’s a simplified version of how the data is modeled:
- Person is a “flat” table, containing some columns that have “fixed” personal details.
- Property and Value allow for custom (normalized) Person fields.
Here’s a visual of this simplified model:
You can view the SqlFiddle with some sample data. A typical query to get the data that’s going to be our starting point:
1 2 3 4 5 6 7 8 9 10 11 12 |
SELECT p.Id, p.FirstName, p.Surname, prop.Id, prop.PropName, val.Id, val.ValName, ppv.CustomValue FROM Person p LEFT JOIN PersonPropertyValue ppv ON ppv.PersonId = p.Id LEFT JOIN Value val ON val.Id = ppv.ValueId LEFT JOIN Property prop ON prop.Id = val.PropertyId |
This will give output similar to this:
Id | FirstName | Surname | Id | PropName | Id | ValName | CustomValue |
---|---|---|---|---|---|---|---|
1 | John | Doe | 1 | Trait | 2 | Bold | NULL |
1 | John | Doe | 1 | Trait | 3 | Easygoing | NULL |
1 | John | Doe | 2 | Eye color | 4 | Green | NULL |
1 | John | Doe | 3 | Pet name | 7 | Placeholder | Fluffybunny |
2 | Mark | Roe | 1 | Trait | 3 | Easygoing | NULL |
3 | Mary | Fisch | 2 | Eye color | 6 | Other… | Red-brown-ish |
3 | Mary | Fisch | 3 | Pet name | 7 | Placeholder | Ravi |
Note that in reality I’m dealing with these figures:
- About 30 columns on the Person table;
- About 20 different Properties with about 6 possible Values on average;
- Anywhere between 100 and 250.000 Persons;
- Usually between 0 and 2 Values per Person per Property;
For one, this means that the normal output of mentioned query has a lot of redundant information (i.e. the 30-ish Person columns).
Target Output
The business requirement here when moving this data to Excel should be obvious: the data should be pivoted. Each “Property” should become a group of columns, with one column per “Value”. A table says more than a thousand words; this is the requested output:
Trait | Eye Color | Pet name | |||||
---|---|---|---|---|---|---|---|
Id | FirstName | Surname | Bold | Easygoing | Green | Other… | Placeholder |
1 | John | Doe | x | x | x | Fluffybunny | |
2 | Mark | Roe | x | ||||
3 | Mary | Fisch | Red-brown-ish | Ravi |
Something along these lines is what the business user’s would like to see.
Bonus Objectives
Getting the target output in itself is a challenge. I’m not done yet though, here are some bonus objectives we have (with MoSCoW indications):
- Properties and Values both have ordering, the order of columns Should respect that.
- Any solution Should allow for some styling (fonts, borders, backgrounds. It’d be Nice to have further control, for example enable a theme, alternate row coloring, etc.
- I Would like to have a place for metadata (date exported, etc) somewhere in the generated file.
- Localization of the column headers (where applicable) would be Nice to have.
- It’d be Nice to be able to reuse much of the solution in generating XML files instead of Excel sheets.
- Any solution Must be solid and maintainable.
- Any solution Must run on moderate hardware without hogging resources.
Current Solution
Right now, the above is accomplished using Reporting Services. This works decently well for datasets containing no more than a few thousand Person rows with related Property Values.
However, beyond about 3000 records performance quickly starts to degrade. This isn’t entirely unexpected, because Reporting Services isn’t really meant for this task (it’s much better at showing aggregates than at exporting large volumes of data).
Possible Solutions
There are many possible solutions. I’m currently considering a subset of them (some solutions merely for “benchmark” purposes):
- SSIS packages. The tool seems meant for the job. I do hold a grudge against the tool, but maybe it’s time to get over that.
- Dynamic SQL + generated RDLs. Use DynSQL to do the pivoting. This requires generated RDL files because the fields of a query must be known up fron to SSRS.
- Dynamic SQL + OPENROWSET + OleDB. Use DynSQL to do the pivoting, and export it straight to Excel using OleDB.
- FOR XML queries into OpenXML. The basic idea: fast FOR XML queries, possibly an XSLT, generating OpenXML data, and plug it in a basic XLSX.
- ORM or ADO.NET into an OpenXML using an XmlWriter. Something along these lines.
- BCP directly into Excel files. Haven’t looked into this yet, but it may be an option.
- SQL CLR. Not sure how this would work (if at all), but there might be options here.
Now it’s time for me to try some of them, see which one fits the requirements best.