This post covers examples from Denali  SQL Server 2012 CTP3, version 11.0.1440.

In my last post I hinted towards some interesting findings regarding parallelism and columnstore indexes.  I’ll talk about what I observed in this post.

Just as a quick level-set, columnstore indexing provides two new areas of functionality within SQL Server 2012 that can potentially improve query performance for typical relational data warehouse workloads.  The features contributing to this potential performance gain include the Vertipaq columnstore technology (which uses compression and stores the data by column instead of row) and the new method of query processing that processes blocks of column data in batches.  You can benefit from from the columnstore storage alone  – but you may also see further query performance improvements if the query runs in "batch" execution mode versus "row".  

One key point to underscore– just because you add a columnstore index to a table doesn’t mean that your query will use “batch” execution mode. 

Let’s walk through an example of row vs. batch using the AdventureWorksDWDenali database.  The test server used in this illustration has four visible schedulers (not including the DAC) and 8GB of RAM.

Also – in order to increase the cost of the queries against this table and make it worth the columnstore index's time, I put together the below query to bump up the row count in FactInternetSales from 60,398 rows to 483,184 rows (SalesOrderNumber was part of the primary key, so I populated it with NEWID() and kept the other column values static from the existing rows and then used "GO 3" to populate it a few times):

INSERT dbo.FactInternetSales
(ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber, OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount, ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight, CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
)

SELECT ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, LEFT(CAST(NEWID() AS NVARCHAR(36)),20), SalesOrderLineNumber, RevisionNumber, OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount, ProductStandardCost, TotalProductCost, SalesAmount
,
TaxAmt, Freight, CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate,
ShipDate
FROM
dbo.FactInternetSales
GO 3


 

The following query is executed against a table with no columnstore index (I’ve also enabled the actual Execution Plan to show afterwards – and also execute the query twice in order to measure results with the data in cache):

SET STATISTICS IO ON
SET STATISTICS
TIME
ON


SELECT
c.CommuteDistance

       d.CalendarYear
,
       SUM(f.SalesAmount)
TotalSalesByCommuteDistance
FROM
dbo.FactInternetSales as f
INNER JOIN dbo.DimCustomer as c
ON
           f.CustomerKey =
c.CustomerKey
INNER JOIN dbo.DimDate d
ON
           d.DateKey =
f.OrderDateKey
GROUP BY c.CommuteDistance
,
         d.CalendarYear 

The results for STATISTICS IO were as follows:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'FactInternetSales'. Scan count 1, logical reads 20289, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'DimCustomer'. Scan count 1, logical reads 979, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'DimDate'. Scan count 1, logical reads 66, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

And the results of the SET STATISTICS TIME ON were as follows:

SQL Server Execution Times:

   CPU time = 562 ms,  elapsed time = 625 ms.

 

As for the execution plan, if we hover over the Clustered Index Scan for FactInternetSales you’ll see the “Actual Execution Mode”.

clip_image002[4]

So we see a value of “Row” (the execution mode we’re used to) and “Actual Number of Batches” of 0.

 

Now let’s create a columnstore index:

 

CREATE NONCLUSTERED COLUMNSTORE INDEX [CSI_FactInternetSales] ON [dbo].[FactInternetSales]
(
[OrderDateKey]
,
[CustomerKey]
,
[SalesAmount]
)


GO

Re-executing the query after creating the index, I see the following STATISTICS IO output:

 

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'FactInternetSales'. Scan count 1, logical reads 597, physical reads 0, read-ahead reads 1167, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'DimCustomer'. Scan count 1, logical reads 979, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'DimDate'. Scan count 1, logical reads 66, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

That’s a reduction in logical reads against FactInternetSales from 20289 down to 597.  Very likeable.

What about STATISTICS TIME?

 

SQL Server Execution Times:

   CPU time = 453 ms,  elapsed time = 485 ms.

 

Okay so CPU time was 562 ms now it is 453 ms.  Elapsed time was 625 ms and now it is 485 ms. 

 

What about the execution plan, were we using batch execution mode?

 

clip_image004[4]

No.  Still using “Row” with a columnstore index scan.   So things brings me to what I observed a few days ago. 

One thing I didn’t mention is that in my test environment -  my max degree of parallelism was set to “1”. 

 

What happens if I uncap the max degree of parallelism for my 4-scheduler server and re-execute the query?

clip_image005[4]

 

After lifting the cap on max degree of parallelism, the query executes using a parallel plan and also switches from “row” to “batch” mode.  Notice also the “Actual Number of Batches” – which shows a value of 1,482.

 

This is all nice – but what about the query performance and I/O.  Any further reductions?

 

The I/O results are as follows:

Table 'DimCustomer'. Scan count 5, logical reads 1072, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'DimDate'. Scan count 3, logical reads 56, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'FactInternetSales'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

What’s this? Zero for all stats related to FactInternetSales?  I even dropped clean buffers to double-check, and it still displayed all zeros.  (More on this for another day.)

 

What about the execution statistics?

 

SQL Server Execution Times:

 

                 CPU time = 78 ms,  elapsed time = 114 ms.

 

Now we're talking big improvements. CPU time was 562 ms without columnstore, then 453 ms with columnstore + row processing and then 78 ms for columnstore + batch processing.  Elapsed time was 625 ms without columnstore and then 485 ms for columnstore + row processing and then 114 ms for columnstore + batch processing.  So more non-trivial reductions.

 

So a few key findings in closing:

 

·        Columnstore + row processing can provide a performance boost (getting column based storage and compression)

·        Columnstore + batch processing can provide additional performance improvements on top of the columnstore index I/O benefits

·        When I capped parallelism, I didn’t see batch processing – so check your execution plan and be aware that there are other factors that impact whether batch processing vs. row processing as well

This post covers examples from Denali CTP3, version 11.0.1440.  

I was working with Denali’s columnstore index feature this last week and was testing it on a virtual machine when I encountered the following error message when trying to create a new index:

The statement has been terminated.
Msg 8657, Level 17, State 5, Line 2
Could not get the memory grant of 91152 KB because it exceeds the maximum configuration limit in workload group 'default' (2) and resource pool 'default' (2). Contact the server administrator to increase the memory usage limit.'

Now at the time the VM I was using was constrained for resources. It was configured to use 1 GB of RAM.

So I shut down the VM and added in 7 GB more of RAM. Given that the “max server memory (MB)” wasn’t capped, I was then able to create the columnstore index successfully. Regarding the memory requirements, BOL states that we need [approximately] 8 MBs times the # of columns in the index times the DOP.  Also, the more string data type columns involved, the higher the memory needed for creating the index.

So I thought I would test out these requirements in a more controlled fashion and see how close the estimates were to the reality.

In this test, I used a VM with 4 processors and 8 GB of RAM, but unlike with my previous experience I capped the “max server memory (MB)” to a much lower value to reproduce the issue I had earlier.  I was adding the columnstore index to the dbo.FactInternetSales table from the AdventureWorksDWDenali database (which can be downloaded here).

For the first step, I capped the max server memory for my test Denali SQL Server instance to a very low value (300 MBs did the trick in this case):

EXEC sp_configure 'show advanced options', 1

RECONFIGURE

EXEC sp_configure 'max server memory (MB)', 300

RECONFIGURE

 

Next I attempted to create the following columnstore index:

USE [AdventureWorksDWDenali]

GO

 

CREATE NONCLUSTERED COLUMNSTORE

INDEX [NCSI_FactInternetSales]

ON [dbo].[FactInternetSales]

(

          [ProductKey],

          [OrderDateKey],

          [DueDateKey],

          [ShipDateKey],

          [CustomerKey],

          [PromotionKey],

          [CurrencyKey],

          [SalesTerritoryKey],

          [SalesOrderNumber],

          [SalesOrderLineNumber],

          [RevisionNumber],

          [OrderQuantity],

          [UnitPrice],

          [ExtendedAmount],

          [UnitPriceDiscountPct],

          [DiscountAmount],

          [ProductStandardCost],

          [TotalProductCost],

          [SalesAmount],

          [TaxAmt],

          [Freight],

          [CarrierTrackingNumber],

          [CustomerPONumber],

          [OrderDate],

          [DueDate],

          [ShipDate])

GO

 

This failed with the following error:

 

The statement has been terminated.

Msg 8657, Level 17, State 5, Line 2

Could not get the memory grant of 143560 KB because it exceeds the maximum configuration limit in workload group 'default' (2) and resource pool 'default' (2).  Contact the server administrator to increase the memory usage limit.

 

So the memory grant request was just over 140 MB.  Looking at the requirements of the columnstore index request and taking into account the equation in BOL, we have 8 MBs x 25 columns x 4 available processors = 800 MB.  Quite a bit higher than the 140 MB it needed at runtime.  But this was assuming 4 processors in the equation.  If we factor in just 1 proc being used – we have 200 MB which is closer to what was being requested and factors in the varying data types and sizes for columns defined in FactInternetSales.

What if I capped the DOP but keep the memory capped at the low value?  I gave it a try:

 

EXEC sp_configure 'max degree of parallelism', 1

RECONFIGURE

 

Attempting a creation of the same index gave the following error message and same memory grant value:

 

The statement has been terminated.

Msg 8657, Level 17, State 5, Line 1

Could not get the memory grant of 143560 KB because it exceeds the maximum configuration limit in workload group 'default' (2) and resource pool 'default' (2).  Contact the server administrator to increase the memory usage limit.

 

So in this case, capping the DOP didn’t reduce the memory requirements.

 

What about using the MAXDOP hint with the CREATE statement (WITH (MAXDOP = 1))?  Again this returned the same memory grant requirement of 143560 KB.

 

What about capping the default workload group itself (not recommending this as standard practice – but rather to further explore memory grant requirements for columnstore)?

 

ALTER WORKLOAD GROUP [default]

WITH(max_dop=1)

GO

ALTER RESOURCE GOVERNOR RECONFIGURE

GO

 

And even after this change, the memory grant requirements remained the same. So I reverted the DOP options to get back to my previous state:

 

EXEC sp_configure 'max degree of parallelism', 0

RECONFIGURE

 

ALTER WORKLOAD GROUP [default]

WITH(max_dop=0)

GO

 

ALTER RESOURCE GOVERNOR RECONFIGURE

GO

 

Regarding the maximum memory grant request itself, the error message gives us enough of a hint on where to look (workload group 'default' (2) and resource pool 'default' (2)).   So I ran the following query to return the request_max_memory_grant_percent value for the default workload group:

 

SELECT request_max_memory_grant_percent

FROM sys.resource_governor_workload_groups

WHERE name = 'default'

 

The value returned was 25%.  So with my 300 MB cap, we’re talking 75 MB.  Not the 140 MB we need. 

I then bumped up the max memory grant percent (but left the max server memory at the low value) just to see what would happen:

 

ALTER WORKLOAD GROUP [default] WITH(request_max_memory_grant_percent=70)

GO

 

ALTER RESOURCE GOVERNOR RECONFIGURE

GO

 

Sure enough – my CREATE NONCLUSTERED COLUMNSTORE INDEX was allowed to execute – but while it executed, it didn't complete.  Instead it ran for 2 minutes and 59 seconds before getting the following error message:

 

The statement has been terminated.

Msg 8645, Level 17, State 1, Line 1

A timeout occurred while waiting for memory resources to execute the query in resource pool 'default' (2). Rerun the query.

I tried creating the index a second time so that I could see what was going on in sys.dm_exec_query_memory_grants - but it executed immediately the second time around. So I set back the max memory grant to the default 25 % - dropped the index and tried to recreate and got the error again.  I then set the max memory grant back to 70% and had the timeout again – but this time I was ready for it and I executed a query against sys.dm_exec_query_memory_grants:

 

SELECT scheduler_id, dop, requested_memory_kb, required_memory_kb, ideal_memory_kb

FROM sys.dm_exec_query_memory_grants

 

The results were as follows:

 

clip_image002[4]

 

What jumped out at me was the DOP value of “1” (even though my SQL Server instance’s “max degree of parallelism” was set to “0” and I had 4 available prcs and the “default” pool dop was also set to 0).

 

Now it seems that the requested memory was remaining the same because the plan was assuming to be maxdop of “1” all along.  Which made me wonder if the requested memory would increase if I added a hint for the creation of the columnstore index to more than one processor (I was reaching, I know, but I'm a fiddler by nature)?  Before testing this, I set back the max memory grant percent to 25%:

 

ALTER WORKLOAD GROUP [default] WITH(request_max_memory_grant_percent=25)

GO

 

ALTER RESOURCE GOVERNOR RECONFIGURE

GO

 

Then I used WITH (MAXDOP = 4) for the query.  The result?  Still asking for a memory grant of 143560 KB.  And repeating the test of bumping up the max memory grant and checking the dop value in sys.dm_exec_query_memory_grants – it remained at “1”. 

So this post was more of an exploration and if I saw this case in the wild I would ask more directly about available memory for the SQL Server instance and/or increasing the max server memory if there was sufficient availability already. 

I would also ask about the necessity of each column being included in the columnstore index definition.  For example – if I needed only half of the columns from FactInternetSales, we’re talking about a 60MB memory grant requirement versus a 140 MB one.

 

During my work last week with columnstore indexes, there were also some interesting findings related to parallelism.  I’ll save this for another post.

Theme design by Nukeation based on Jelle Druyts