SQL Server 2005
Partition
ed Tables and Indexes

Author: Kimberly L. Tripp, Founder, SQLskills.com

Summary: Although partitioning tables and indexes has always been a design tactic chosen to improve performance and manageability in larger databases, Microsoft SQL Server 2005 has new features that simplify the design. This whitepaper describes the logical progression from manually partitioning data by creating your own tables to the preliminary features, which enabled partitioning through views in SQL Server 7.0 and SQL Server 2000, to the true partitioned table features available in SQL Server 2005. In SQL Server 2005, the new table-based partitioning features significantly simplify design and administration of partitioned tables while continuing to improve performance. The paper's primary focus is to detail and document partitioning within SQL Server 2005 – giving you an understanding of why, when and how to use partitioned tables for the greatest benefit in your VLDB (Very Large Database). Although primarily a VLDB design strategy, not all databases start out large. SQL Server 2005 provides flexibility and performance while significantly simplifying the creation and maintenance of partitioned tables. Review this document to get detailed information about why you should consider partitioned tables, what they can offer and finally how to design, implement, and maintain partitioned tables.

Scripts from this Whitepaper:

The scripts and examples used in the code samples for this whitepaper can be found in the SQLServer2005PartitionedTables.zip file.


Why do you need Partitioning?

Before one can talk about how to implement partitioning and the features of partitioning one must first understand the need; what are partitions and why might someone consider using them? When you create tables, you design those tables to store information about an entity – i.e. customers or sales. Each table should have attributes that describe only that entity and for customers and sales the historical premise is that all of your customers and all of your sales go into their respective tables. While a single table for each entity is the easiest to design and understand, it may not be the best for performance, scalability and manageability especially as the table grows large. Partitioning can provide benefits for both large tables (and/or their indexes) and tables which have varying access patterns. More specifically, through partitioning practices large tables have better scalability and manageability and the use of tables that have changing data is simplified when adding or deleting large "chunks" (or ranges) of data.

 

So what constitutes a large table? The idea of VLDB (Very Large Database) is that the total size of the database is measured in hundreds of gigabytes or even terabytes but the term does not necessarily specify individual table sizes. A large table is one that does not perform as desired or one where the maintenance costs have gone beyond pre-defined maintenance periods. Furthermore, a table can be considered large if one user’s activities significantly affect another or if maintenance operations affect other user’s abilities. In effect, this even limits availability. Even though the server is available, how can you consider your database available when the sales table’s performance is severely degraded or even inaccessible during maintenance for 2 hours per day, per week, or even per month? In some cases, periodic downtime is acceptable yet it is often possible to avoid or minimize downtime through better design.

 

A table whose access patterns vary may also be considered large when sets (or ranges) of rows have very different usage patterns. Although usages patterns may not always vary (and this is not a requirement for partitioning), when usage patterns do vary there can be additional gains. Again, thinking in terms of sales, the current month's data is read-write while the previous month's data (and often the larger part of the table) is read-only. Large tables where the data usage varies or large tables where the maintenance overhead is overwhelming can limit the table's ability to respond to varied user requests, in turn limiting both availability and scalability. Moreover, especially when large sets of data are being used in different ways maintenance operations can end up routinely maintaining static data. Performing maintenance operations on data which does not truly need it – is costly. The costs can be seen in performance problems, blocking problems, backups (space, time and operational costs) as well as negatively impacting the overall scalability of the server.

 

Furthermore, if a large table exists on a system with multiple CPUs, partitioning the table can lead to better performance through parallel operations. Large-scale operations across extremely large data sets – typically many million rows – can benefit by performing multiple operations against individual subsets in parallel. A simple example of performance gains over partitions can be seen in previous releases with aggregations. For example, instead of aggregating a single large table, SQL Server can work on partitions independently and then aggregate the aggregates. In SQL Server 2005, joins can benefit directly from partitioning; SQL Server 2000 supported parallel join operations on subsets yet needed to create the subsets on the fly. In SQL Server 2005, related tables (i.e. Order and OrderDetails) that are partitioned to the same partitioning key and the same partitioning function are said to be aligned. When the optimizer detects that two partitioned and aligned tables are joined, SQL Server 2005 can choose to join the data which resides on the same partitions first and then combine the results. This allows SQL Server 2005 to more effectively use multiple-CPU machines.

 

So how can partitioning help? Where tables and indexes become very large, partitioning can help by splitting large amounts of data into smaller more manageable chunks (i.e. partitions). The type of partitioning described in this paper is termed horizontal partitioning. With horizontal partitioning, large chunks of rows will be stored in multiple separate partitions. The definition of the partitioned set is customized, defined, and managed – by your needs. Partitioning in SQL Server 2005 allows you to partition your tables based on specific data usage patterns using defined ranges. Finally, SQL Server 2005 offers numerous options for the long-term management of partitioned tables and indexes by adding complementary features designed around the new table and index structure.

 

The History of Partitioning

The concept of partitioning is not new to SQL Server. In fact, forms of partitioning have been possible in every release. However, without features to aid in creating and maintaining your partitioning scheme, partitioning has often been cumbersome and underutilized. Typically, the design is misunderstood by users and developers and the benefits are diminished. However, because of the significant performance gains inherent in the concept, SQL Server 7.0 began improving the features enabling forms of partitioning through partitioned views and SQL Server 2005 now offers the largest advances through partitioned tables.

 

Partitioning Objects manually in releases before SQL Server 7.0

In SQL Server 6.5 and earlier, partitioning had to be part of your design as well as built into all of your data access coding and querying practices. By creating multiple tables and then managing access to the correct tables through stored procedures, views or client applications you could often improve performance for some operations but at the cost of complexity of design. Each user and developer needed to be aware of and properly reference the correct tables. Each partition was created and managed separately and views were used to simplify access; however, this solution yielded few performance gains. When a UNIONed view existed to simplify user/application access, the query processor had to access every underlying table in order to determine if data was needed for the result-set. If only a limited subset of those underlying tables was needed, then each user and developer needed to know the design in order to reference only the appropriate table(s).


Partitioned Views in SQL Server 7.0

The challenges faced by manually creating partitions in releases prior to SQL Server 7.0, were primarily related to performance. While views simplified application design, user access and query writing, they did not offer performance gains. With the release of SQL Server 7.0, views were combined with constraints to allow the query optimizer to remove irrelevant tables from the query plan (i.e. partition elimination) and significantly reduce overall plan cost when a UNIONed view accessed multiple tables.

In Figure 1, examine the YearlySales view. Instead of having all sales within one single, large table, you could define twelve individual tables (SalesJanuary2003, SalesFebruary2003, etc¼) and then views for each quarter as well as a View for the entire year – YearlySales.

Figure 1: Partitioned Views in SQL Server 7.0/2000

Users who accessed the YearlySales view with the following query will be directed ONLY to the SalesJanuary2003 table.

SELECT ys.*
FROM dbo.YearlySales AS ys
WHERE ys.SalesDate = '20030113'

 


As long as the constraints are “trusted” and the queries against the view use a WHERE clause to restrict the results based on the partition key (the column on which the constraint was defined) then SQL Server will access only the necessary base table. “Trusted” constraints are constraints where SQL Server is able to guarantee that all data adheres to the properties defined by the constraint. When the constraint is created, the default behavior is to create the constraint WITH CHECK. This setting causes a schema lock to be taken on the table so that the data can be verified against the constraint. Once the verification validates the existing data, the constraint is added; once the schema lock is removed further inserts, updates, and deletes must adhere to the constraint in effect. Using this procedure to create “trusted” constraints, developers can significantly reduce the complexity of their design using views without having to know and/or directly access the table in which they were interested. With trusted constraints, SQL Server improves performance by removing the unnecessary tables from the execution plan.

 

Note: A constraint can become “untrusted” in various ways; for instance by performing a bulk-insert and not specifying the CHECK_CONSTRAINTS argument. Once a constraint has become untrusted, the query processor will revert to scanning all base tables as it has no way of verifying that the requested data is in fact located in the correct base table.

 

Partitioned Views in SQL Server 2000

Although SQL Server 7.0 significantly simplified design and improved performance for SELECT statements, it did not yield any benefits for data modification statements; INSERT, UPDATE and DELETE statements were supported only against the base tables, not directly against the views which UNIONed the tables. SQL Server 2000 allows data modification statements to also benefit from the partitioned view capabilities of SQL Server 7.0. By allowing data modification statements to use the same partitioned view structure, you can direct modifications to the appropriate base table through the view. In order to configure this properly there are additional restrictions on the partitioning key and its creation; however, the basic principals are the same in that not only will SELECT queries be sent directly to the appropriate base tables but the modifications will be as well. For details on the restrictions, setup, configuration and best practices for partitioning in SQL Server 2000, please refer to the Partitioning Whitepaper at: http://msdn.microsoft.com/library/techart/PartitionsInDW.htm.

 

Partitioned Tables in SQL Server 2005

While the improvements in SQL Server 7.0 and SQL Server 2000 significantly enhanced performance using partitioned views, they did not simplify the administration, design or development of a partitioned data set. Using partitioned views, all of the base tables (on which the view is defined) must be created and managed individually. Application design is easier and users benefit by not needing to know which base table to directly access but administration is complex as there are numerous tables to manage and data integrity constraints must be managed for each table. Because of the management issues, partitioned views were often used to separate tables only when data needed to be "archived" or loaded. When data was moved into the read-only table or when data was deleted from the read-only table, the operations were expensive – taking time, log space and often creating blocking.

 

Additionally, because prior partitioning strategies required the developer to create individual tables and indexes and then UNION them through views, the optimizer was required to validate and determine plans for each partition (as indexes could have varied). Therefore, query optimization time in SQL Server 2000 often goes up linearly with the number of processed partitions. This is not in the case of partitioned tables in SQL Server 2005 where each partition has the same indexes by definition.

 

For example, take a current month of OLTP (Online Transaction Processing) data that needs to be moved into an analysis table at month end. The latter table (to be used for read-only queries) is a single table with one clustered index and two non-clustered indexes; the bulk load of 1GB (into the already indexed and active single table) creates blocking with current users as the table and/or indexes become fragmented and/or locked. Additionally, the loading process will take a significant amount of time as the table and indexes must be maintained as each row comes in. There are ways to speed up the bulk load; however, these can directly affect all other users thereby sacrificing concurrency for speed. If this data were isolated into a newly-created (empty) table the load could occur first (and parallel loads are possible) and then the indexes could be created after load (and index creation can be parallelized). Often you will realize gains of 10 times or better by using this scheme. In fact, by loading into an unindexed [heap] table you can take advantage of multiple CPUs by loading multiple data files in parallel or loading multiple chunks from the same file (defined by starting and ending row position). In any release, partitioning gives you this more granular control and does not restrict you to having all of the data in one location – with heavily fragmented indexes – and no real ability to control any aspect of it at a more granular level. A functional partitioning strategy could be achieved in previous releases by dynamically creating and dropping tables and modifying the UNION view. However; in SQL Server 2005 the solution is more elegant; you can simply “switch in” the newly-filled partition(s) as an extra partition of the existing partition scheme and “switch out” any old partition(s). From end to end, the complete process takes only a short period of time and can be further improved using parallel bulk loading and parallel index creation. More importantly, the partition is manipulated outside of the scope of the table so there is NO effect on the actual table until the partition is added. The result is that adding a partition typically occurs within only a few seconds.

 

Even better is the case when data needs to be removed. If one database only needs to see a “sliding-window” set of data then when new data is ready to be migrated in (for example the current month) then the oldest data (maybe the parallel month from the previous year) can be removed. In this same example, you will likely see several orders of magnitude better performance improvement by utilizing partitioning. While this may seem extreme, consider the difference; when all of the data is in a single table the deleting 1GB of data (the oldest data), requires row-by-row manipulation of the table as well as its associated indexes. The process of deleting data creates a significant amount of log activity and does not allow log truncation for the length of the delete (remember the delete is a single auto-commit transaction; however, you can control the size of the transaction by performing multiple deletes – where possible) and which requires a [potentially much] larger log. To remove the same amount of data – by removing the specific partition from a partitioned table – all that must be done is a removal of the partition (which is a meta data operation) and then drop or truncate the standalone table.

 

Moreover, without knowing how to best design partitions one might not be aware that the use of filegroups in conjunction with partitions is ideal for implementing partitioning. Filegroups allow you to place individual tables on different physical disks. If a single table spans multiple files using filegroups then the actual physical location of data could not be predicted. For systems where parallelism is not expected, SQL Server improves performance by using all disks more evenly through filegroups and specific placement is not as critical.

 

In Figure 2, there are three files in a single filegroup. Two tables: Orders and OrderDetails have been placed on this filegroup. When tables are placed on a filegroup SQL Server proportionally fills the files within the filegroup by taking extent (64KB chunks which equal eight 8K Pages) allocations from each of the files as space is needed for the objects within the filegroups. When the Orders and OrderDetails tables are created, the filegroup is empty. When an order comes in data is input into the Orders table (one row per order) and one row per line item is input into the OrderDetails table. SQL Server allocates an extent to the Orders table from File1 and then another extent to the OrderDetails table from File2. It is likely that the OrderDetails table will grow more quickly than the Orders table and the next few allocations will go to the next table needing space. As OrderDetails grows, it will get the next extent from File3 and SQL Server continues to "round robin" through the files within the filegroup. In the diagram below, follow each table to an extent and from each extent to the appropriate filegroup – the extents are allocated as space is needed and each is numbered based on flow.

Figure 2: Proportional Fill using Filegroups

 

SQL Server will continue to balance allocations among all of the objects within that filegroup. While SQL Server benefits from using more disks for a given operation, it is not as optimal from a management or maintenance perspective or where usage patterns are very predictable (and isolated).

 

With partitioned tables in SQL Server 2005, a table can be designed (using a “function” and a “scheme”) such that all rows that have the same partitioning key are placed directly on (and will always go to) a specific location. The function defines the partition boundaries and in which partition the first value should be placed. In the case of a LEFT partition function, the first value will act as an upper boundary in the first partition. In the case of a RIGHT partition function, the first value will act as a lower boundary in the second partition. You will see many more details regarding partition functions later within this paper. Once the function is defined, a partition scheme can be created to define the physical mapping of the partitions to their lo