- Full caching (default): loads the entire reference dataset into memory before the lookups begin.
- Partial caching: loads a subset of the reference dataset into memory (which can be constrained to a specific size), queries the source database when a lookup match isn’t found in the existing cache, and loads new rows into the cache if a match is found in the source database.
- No caching: the source dataset is queried for each row in the pipeline.
While you can tune the caching for the lookup to get optimal performance, that cache goes away. So if you have a second package that needs to lookup to the same reference table, you have to pay the overhead cost of loading up the cache (if you’re using it) again. That’s where SQL Server 2008 comes to the rescue with “persistent lookups.”
The new Lookup transformation hasn’t been made available yet in a public CTP, so I haven’t had a chance to benchmark the caching, but I think the improvement of persistent lookups shows promise. Here’s my understanding of this feature (which of course is subject to change before SQL Server 2008 goes RTM):
- Caching as it was in SQL Server 2005 continues to work as described with full, partial, and no cache (only better).
- A new type of cache – a “persistent” cache – can be created and re-used.
One big change to caching in general is that you can have a cache larger than 4GB, on both 32-bit and 64-bit systems which should help the scalability of your Lookups.
Let’s explore the idea of a persistent cache further. Obviously, it’s not useful when the reference dataset is highly volatile. But for a relatively stable reference dataset, it’s got possibilities. Essentially, you start the process by populating the cache. You can do this in a separate data flow from the one containing your Lookup transformation. The cache stays in memory to be used by as many Lookups as you like until the package stops executing.
But wait – there’s more! As an alternative, you could populate the cache in its own package and store it in a .caw file. Then that cache file can be used in as many other packages as needed. Here’s the resusability factor that was missing in SQL Server 2005. The .caw file can be read faster into memory than reading in a table, view, or query for a full cache. It’s like the Raw File you can use for asynchronous processing and optimized for reads by Integration Services. In fact, you can use the Raw File Source to load the cache contents into a data flow, do whatchya gotta do to the data, and land the results someplace else if you use the data for more than just Lookups.
Another benefit of storing the cache in a file is the ability to deploy a package to multiple environments and ensure the contents of each cache are identical. Simply add it to the Miscellaneous folder and the package deployment utility will include it in the set of files to be transferred to each target server.
An important change with partial caching is the ability to store rows in the cache that don’t match rows in the data flow. You will even be able to specify how much of the cache you want to devote to this purpose.
Considering a considerable part of the ETL process in data warehousing depends on the Lookup transformation, it’s good to see this part of Integration Services has received attention in the next release of SQL Server. Watch for the new Lookup transformation, Cache transformation, and Cache connection manager in a future CTP. –Stacia