Maintaining data quality is important throughout the organization, but certainly BI applications are expected to maintain an even higher standard because we have greater control over the data going into the data warehouse. As with most things in life, the ideal seldom matches the real for a number of reasons. In SQL Server 2008 (November CTP), the new Data Profiling Task in Integration Services should help a lot becase not only can you perform data quality assessments on an ad hoc basis, but you can also automate data quality by integrating the quality assessment into the ETL process. One important caveat about this task is that it works only with SQL Server sources (2000 or higher).
This one little task has lots of options which I will address over a series of blog posts. Today, I will introduce the Quick Profile feature of the task and show you the output.
To get started with the Data Profiling task – whether you’re using the Quick Profile or not – you must define either a file connection or a variable to hold the output. The output will be generated in XML format. In the editor, there is a Quick Profile button to open the Quick Profile Editor. Consider this the quick and easy approach to profiling a single table (or view).
Now I must say I did read BOL to get an overview of the task, but one little detail about the connection for the Quick Profile escaped detection. I had dutifully created a OLE DB connection manager to AdventureWorks, because how else does one connect to SQL Server databases? But that connection did not show up in the connection drop-down list in the Quick Profile Editor. (Neither did it give me the option to create one, which might have given me a clue. Most other places to set connections give me the ability to create a new connection on the spot, so this seemed odd to me, but it IS a CTP…). Then I went back and re-read BOL to discover that the Quick Profile edtor requires an ADO.NET connection manager using the SqlClient provider. Not particularly intuitive, in my opinion, but it is documented so I can’t complain too loudly. Once I set up the correct connection type, it appeared in the drop-down list and I was set to continue.
Once the connection is set, you have the opportunity define which profiles you want to create. Just for kicks, I decided to profile the Product table in AdventureWorks and kept the default profile selections:
The Quick Profile editor adds all the profiles you selected to the Profile Requests page of the Data Profiling Task Editor.
If you click on a particular request, you can view, and of course change, the properties of the selected request:
You then execute the package and get the results. If you output the results to a file, you can use the Data Profile Viewer. Just click Start, Run, and type DataProfileViewer and click OK. (It’s actually found in C:\Program Files\Microsoft SQL Server\100\DTS\Binn if you’re curious.) Use the Open button there to navigate to your output file fromthe package execution.
Here’s an example of the Candidate Key Profile:
Now in the case of an existing table, this particular profile isn’t particularly enlightening. After all, the key has already been defined. It’s more useful for data we haven’t seen before, perhaps an extract sent to us by a business partner, and we want to evaluate the contents so we can build a table appropriately for it. In terms of data quality, this profile on the Product table tells us that the combination of columns – ProductID, Name, and ProductNumber uniquely identify each record in the table. That’s a good result because that’s what we’d expect! If there were any key violations, they would be listed in the viewer.
Here’s an example of the Column Null Ratio Profile for the Color column:
This profile tells us there are 248 records that have a null value in the column which represents 49% of the total records in the table. That seems like a high number of records, so this is could be an opportunity to do some data clean-up in the source. If that’s not an option, then you might want to consider replacing NULL with some value acceptable to business users before loading into your data warehouse.
Here’s an example of the Column Value Distribution Profile for the Color column:
This information shows the breakdown of records by distinct value showing both the actual count and the percentage. This is useful for validating distinct values and ensuring you don’t have any surprises. This is where dirty data that likes to pop up after you’ve built an Analysis Services database. Now you have an opportunity to spot potential problems and fix them before the data gets to the data warehouse.
Here’s an example of drilldown:
When examining a profile, you can select a profile row and click the drilldown button to view the records associated with that profile characteristic. This drilldown ability lets you know exactly which records fit this aspect of the profile – no need to stop and write a query to figure it out. (Not all profiles – at least in CTP5 – have drilldown capability.)
Here’s an example of the Column Statistics Profile on the StandardCost column:
This profile is useful for checking the min and max values in a column as well as the mean and standard deviation. If you see ranges you don’t expect, you should investigate and clean up accordingly.
Here’s an example of the Column Length Distribution Profile:
This profile identifies the maximum length of values in the column and provides counts and precentage of total records for each length found. This is most useful for columns in which you expect a constant length. For example, if Style should have only values of 1 character, but some records have 2-character values, then you need to find and fix those records with invalid values.
So, you can see the default Quick Profile of the Data Profiling Task provides quite a lot of information without a lot of setup. Many of the profiles can be configured to fine-tune the profiling process which I will explain in a future blog post. –Stacia