Yesterday in class we had a discussion around the conceptual problem of reconciling the fact that SQL Server does set-based operations, but that it does them in query plans that pass single rows around between operators. In other words, it uses iterative processing to implement set-based operators.
The crux of the discussion is: if SQL Server is passing single rows around, how is that set-based operations?
[Edit 8/21/19 – what about batch mode operations? See the end of the post for details.]
I explained it in two different ways…
SQL Server Example
This explanation compares two ways of doing the following logical operation using SQL Server: update all the rows in the Products table where ProductType = 1 and set the Price field to be 10% higher.
The cursor based way (row-by-agonizing-row, or RBAR) would be something like the following:
DECLARE @ProductID INT; DECLARE @Price FLOAT; DECLARE [MyUpdate] CURSOR FAST_FORWARD FOR SELECT [ProductID], [Price] FROM [Products] WHERE [ProductType] = 1; OPEN [MyUpdate]; FETCH NEXT FROM [MyUpdate] INTO @ProductID, @Price; WHILE @@FETCH_STATUS = 0 BEGIN UPDATE [Products] SET [Price] = @Price * 1.1 WHERE [ProductID] = @ProductID; FETCH NEXT FROM [MyUpdate] INTO @ProductID, @Price; END CLOSE [MyUpdate]; DEALLOCATE [MyUpdate];
This method has to set up a scan over the Products table based on the ProductType, and then runs a separate UPDATE transaction for each row returned from the scan, incurring all the overhead of setting up the UPDATE query, starting the transaction, seeking to the correct row based on the ProductID, updating it, and tearing down the transaction and query framework again each time.
The set-based way of doing it would be:
UPDATE [Products] SET [Price] = [Price] * 1.1 WHERE [ProductType] = 1;
This will have one scan based on the ProductType, which will update rows matching the ProductType, but the query, transaction, and scan are only set up once, and then all the rows are processed, one-at-a-time inside SQL Server.
The difference is that in the set-based way, all the iteration is done inside SQL Server, in the most efficient way it can, rather than manually iterating outside of SQL Server using the cursor.
This explanation involves a similar problem but not involving SQL Server. Imagine you need to acquire twelve 4′ x 8′ plywood sheets from your local home improvement store.
You could drive to and from the store twelve times, and each time you need to go into the store, purchase the sheet, and wait for a staff member to become available to load the sheet into your pickup truck, then drive home and unload the sheet.
Or you could drive to the store once and purchase all twelve sheets in one go, with maybe four staff members making three trips each out to your pickup, carrying one sheet each time. Or even just one staff member making twelve trips out to your pickup.
Which method is more efficient? Multiple trips to the store or one trip to the store, no matter how many staff members are available to carry the sheets out?
No-one in their right mind is going to make twelve trips to the home improvement store when one will suffice. Just like no developer should be writing cursor/RBAR code to perform an operation that SQL Server can do in a set-based manner (when possible).
Set-based operations don’t mean that SQL Server processes the whole set at once – that’s clearly not possible as most sets have more rows than your server has processors (so all the rows in the set simply *can’t* be processed at the same time, even if all processors were running the same code at the same time) – but that it can process the set very, very efficiently by only constructing the processing framework (i.e. query plan with operators, scans, etc.) for the operation once and then iterating over the set of rows inside this framework.
PS Check out the technical comment from Conor Cunningham below (Architect on the SQL Server team, and my counterpart on the Query Optimizer when I was a Dev Lead in the Storage Engine for SQL Server 2005)
[Edit 8/21/19 PPS What about batch mode operations? These are done using vector-based CPU instructions working on a vector of column values (multiple rows from a column) instead of a single row, and going vector by vector instead of row by row. So it’s still doing one operation at a time, but it’s a bigger, more efficient operation. Plus there are algorithms used that are optimized for multi-core CPUs and better leverage processor caches than row-based algorithms. Bottom line: batch-mode is much more efficient than row-mode, but still has to go piece-by-piece.]