Tuesday, November 27, 2007
Maintaining data quality is important throughout the organization, but certainly BI applications are expected to maintain an even higher standard because we have greater control over the data going into the data warehouse. As with most things in life, the ideal seldom matches the real for a number of reasons. In SQL Server 2008 (November CTP), the new Data Profiling Task in Integration Services should help a lot becase not only can you perform data quality assessments on an ad hoc basis, but you can also automate data quality by integrating the quality assessment into the ETL process. One important caveat about this task is that it works only with SQL Server sources (2000 or higher).
 
This one little task has lots of options which I will address over a series of blog posts. Today, I will introduce the Quick Profile feature of the task and show you the output.
 
To get started with the Data Profiling task - whether you're using the Quick Profile or not - you must define either a file connection or a variable to hold the output. The output will be generated in XML format. In the editor, there is a Quick Profile button to open the Quick Profile Editor. Consider this the quick and easy approach to profiling a single table (or view).
 
Now I must say I did read BOL to get an overview of the task, but one little detail about the connection for the Quick Profile escaped detection. I had dutifully created a OLE DB connection manager to AdventureWorks, because how else does one connect to SQL Server databases? But that connection did not show up in the connection drop-down list in the Quick Profile Editor. (Neither did it give me the option to create one, which might have given me a clue. Most other places to set connections give me the ability to create a new connection on the spot, so this seemed odd to me, but it IS a CTP...). Then I went back and re-read BOL to discover that the Quick Profile edtor requires an ADO.NET connection manager using the SqlClient provider. Not particularly intuitive, in my opinion, but it is documented so I can't complain too loudly. Once I set up the correct connection type, it appeared in the drop-down list and I was set to continue.
 
Once the connection is set, you have the opportunity define which profiles you want to create. Just for kicks, I decided to profile the Product table in AdventureWorks and kept the default profile selections:
 
 
 
The Quick Profile editor adds all the profiles you selected to the Profile Requests page of the Data Profiling Task Editor.
If you click on a particular request, you can view, and of course change, the properties of the selected request:
 
 
You then execute the package and get the results. If you output the results to a file, you can use the Data Profile Viewer. Just click Start, Run, and type DataProfileViewer and click OK. (It's actually found in C:\Program Files\Microsoft SQL Server\100\DTS\Binn if you're curious.) Use the Open button there to navigate to your output file fromthe package execution.
 
Here's an example of the Candidate Key Profile:
 
 
Now in the case of an existing table, this particular profile isn't particularly enlightening. After all, the key has already been defined. It's more useful for data we haven't seen before, perhaps an extract sent to us by a business partner, and we want to evaluate the contents so we can build a table appropriately for it. In terms of data quality, this profile on the Product table tells us that the combination of columns - ProductID, Name, and ProductNumber uniquely identify each record in the table. That's a good result because that's what we'd expect! If there were any key violations, they would be listed in the viewer.
 
Here's an example of the Column Null Ratio Profile for the Color column:
 
 
This profile tells us there are 248 records that have a null value in the column which represents 49% of the total records in the table. That seems like a high number of records, so this is could be an opportunity to do some data clean-up in the source. If that's not an option, then you might want to consider replacing NULL with some value acceptable to business users before loading into your data warehouse.
 
Here's an example of the Column Value Distribution Profile for the Color column:
 
 
This information shows the breakdown of records by distinct value showing both the actual count and the percentage. This is useful for validating distinct values and ensuring you don't have any surprises. This is where dirty data that likes to pop up after you've built an Analysis Services database. Now you have an opportunity to spot potential problems and fix them before the data gets to the data warehouse.
 
Here's an example of drilldown:
 
 
When examining a profile, you can select a profile row and click the drilldown button to view the records associated with that profile characteristic. This drilldown ability lets you know exactly which records fit this aspect of the profile - no need to stop and write a query to figure it out. (Not all profiles - at least in CTP5 - have drilldown capability.)
 
Here's an example of the Column Statistics Profile on the StandardCost column:
 
 
This profile is useful for checking the min and max values in a column as well as the mean and standard deviation. If you see ranges you don't expect, you should investigate and clean up accordingly.
 
Here's an example of the Column Length Distribution Profile:
 
 
This profile identifies the maximum length of values in the column and provides counts and precentage of total records for each length found. This is most useful for columns in which you expect a constant length. For example, if Style should have only values of 1 character, but some records have 2-character values, then you need to find and fix those records with invalid values.
 
So, you can see the default Quick Profile of the Data Profiling Task provides quite a lot of information without a lot of setup. Many of the profiles can be configured to fine-tune the profiling process which I will explain in a future blog post. --Stacia
Tuesday, November 27, 2007 11:59:05 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]  | 
Tuesday, November 20, 2007
The most anticipated change to the Script Task and Script Component has certainly been the change from VSA to VSTA for developing the script code. As a result of this change, the script language is no longer limited to Visual Basic .NET; now developers can choose to code using Visual C# .NET instead. Virtually every introduction to Integration Services I've conducted generates at least one complaint from the audience about the lack of support for Visual C# .NET in SQL Server 2005, so I'm glad to see this capability added.
 
All the buzz I've seen around the Script Task and Script Component focuses on the C# capability, but misses other changes that I think are also important to know about. Let's take a closer look at other features introduced in these items with SQL Server 2008 Integraton Services.
 
ReadOnlyVariables and ReadWriteVariables properties

In the editor, the ReadOnlyVariables and ReadWriteVariables in SQL Server 2005 required you to type in the variable names. You can still do this, but I always ask, why type when you can select an item from a list? It saves keystrokes, and more importantly, ensures a correct value! In SQL Server 2008 there is an ellipsis button to click which opens a Select Variables dialog box which displays all the variables in the System and User namespaces. The downside (at least in the November CTP in which I most recentely tested this feature) is that this list seems to be a bit random and you can't sort by name or type.

Precompilation

The PrecompileScriptIntoBinaryCode property in SQL Server 2005 is removed in SQL Server 2008. The removal of this property means you no longer get the choice of whether to precompile the code before package execution (which gets better performance at the expense of a larger package) or to compile just-in-time.

The default behavior in SQL Server 2008 is to precompile your script. When you're editing the script, a set of VSTA project files are created (or re-opened), but they are deleted from your hard disk and persisted only in the package.

Edit Script button (formerly known as the Design Script button)

This is such a little change, but it means a lot to me and I'm sure to a lot of others. While intellectually I understand that it doesn't require a lot of physical effort to click the mouse a few extra times, I invariably am annoyed when I feel I'm clicking much more than necesary to accomplish a task or to reach a destination. With regard to the Script Task and Script Component, I appreciate that the Edit Script button is now on the first page of the editor. Hooray! What really happened here was the order of the pages was changed. In SQL Server 2005, the pages in the Editor are General, Script, and Expressions whereas, in SQL Server 2008, the pages are Script, General, and Expressions. Microsoft acknowledged that the most frequent reason people open the editor is to edit the script and thus shortened the path to get there.

Access to External Assemblies

If you want to add a reference to a DLL in your script in SQL Server 2005, the VSA interface limits access to external assemblies in the Windows\Microsoft.NET\Framework\v2.0.50727 folder on your development machine and to GAC or this folder on the machine running the package. The Add Reference dialog presents a list of DLLs which you can add to your script, but there is no way to browse to another location where you might prefer to (or must) store your DLLs.  Furthermore, there is no way to add a Web reference if you want to interact with a Web service. Instead, you have to create your own proxy class first, register the class on the dev and production machines, and then reference that class in your script. Not impossible, but lots of extra steps required.

In SQL Server 2008, the VSTA interface in the Script Task and Script Component makes development much easier. First, the Add Reference dialog box has all the tabs you'd expect in a full IDE - .NET, COM, Projects, Browse, and Recent. The key point here is you can browse to the location of the DLL you want to use now. Second, you have the ability to add a Web reference and thereby get the proxy class created automatically for you without having to go through all the steps required to get this behavior in SQL Server 2005. Jamie Thomson has posted a video demonstrating how to consume a Web service with VSTA in SQL Server 2008 Integration Services which you should check out if this sort of functionality is useful in your environment. --Stacia

Tuesday, November 20, 2007 10:52:12 PM (GMT Standard Time, UTC+00:00)  #    Comments [0]  | 
Wednesday, October 17, 2007
In theory, using a Lookup transformation in SQL Server 2005 Integration Services is faster than trying to achieve the same goal with a SQL query. If you frequently use the Lookup transformation, you are probably aware of some of its limitations. One limitation in particular--the larger the reference table used in the lookup, the more likely you're going to experience performance degradation. There are several ways to mitigate performance issues, but for the sake of this discussion, I will focus on better performance through caching .
 
To review, in SQL Server 2005 you have three caching options:
  • Full caching (default): loads the entire reference dataset into memory before the lookups begin.
  • Partial caching: loads a subset of the reference dataset into memory (which can be constrained to a specific size), queries the source database when a lookup match isn't found in the existing cache, and loads new rows into the cache if a match is found in the source database.
  • No caching: the source dataset is queried for each row in the pipeline.

While you can tune the caching for the lookup to get optimal performance, that cache goes away. So if you have a second package that needs to lookup to the same reference table, you have to pay the overhead cost of loading up the cache (if you're using it) again. That's where SQL Server 2008 comes to the rescue with "persistent lookups."

The new Lookup transformation hasn't been made available yet in a public CTP, so I haven't had a chance to benchmark the caching, but I think the improvement of persistent lookups shows promise. Here's my understanding of this feature (which of course is subject to change before SQL Server 2008 goes RTM):

  • Caching as it was in SQL Server 2005 continues to work as described with full, partial, and no cache (only better).
  • A new type of cache - a "persistent" cache - can be created and re-used.

One big change to caching in general is that you can have a cache larger than 4GB, on both 32-bit and 64-bit systems which should help the scalability of your Lookups.

Let's explore the idea of a persistent cache further. Obviously, it's not useful when the reference dataset is highly volatile. But for a relatively stable reference dataset, it's got possibilities. Essentially, you start the process by populating the cache. You can do this in a separate data flow from the one containing your Lookup transformation. The cache stays in memory to be used by as many Lookups as you like until the package stops executing.

But wait - there's more! As an alternative, you could populate the cache in its own package and store it in a .caw file. Then that cache file can be used in as many other packages as needed. Here's the resusability factor that was missing in SQL Server 2005. The .caw file can be read faster into memory than reading in a table, view, or query for a full cache. It's like the Raw File you can use for asynchronous processing and optimized for reads by Integration Services. In fact, you can use the Raw File Source to load the cache contents into a data flow, do whatchya gotta do to the data, and land the results someplace else if you use the data for more than just Lookups.

Another benefit of storing the cache in a file is the ability to deploy a package to multiple environments and ensure the contents of each cache are identical. Simply add it to the Miscellaneous folder and the package deployment utility will include it in the set of files to be transferred to each target server.

An important change with partial caching is the ability to store rows in the cache that don't match rows in the data flow. You will even be able to specify how much of the cache you want to devote to this purpose.

Considering a considerable part of the ETL process in data warehousing depends on the Lookup transformation, it's good to see this part of Integration Services has received attention in the next release of SQL Server. Watch for the new Lookup transformation, Cache transformation, and Cache connection manager in a future CTP. --Stacia

Wednesday, October 17, 2007 2:01:23 AM (GMT Standard Time, UTC+00:00)  #    Comments [0]  | 

Theme design by Jelle Druyts