Sunday, June 24, 2007

Many of you have already heard the "hardware store" story. What's the best way to model products in a hardware store, where new items arrive at the hardware store each day. Each item has a "short list" of similar properties (e.g. UPC, price) and a long list of dissimilar properties (e.g. paint has color, type, amount and curtain rods have width, metal, etc). How to model the dissimilar properties for each item in relational table(s)?

This isn't as unusual of a problem as you might think, examples I've heard lately include:
 Items in a directory system (like AD)
 Readings for lab test results
 Attributes for Sharepoint items

I've always thought of the main contenders as:
1. Sparse tables - one per product
2. Sparse columns - 90% of the column values would be NULL
3. Model as XML - similar properties are subelements, sparse properties are attributes
4. Entity-attribute-value (EAV) - also known as open schema. A separate "properties" table with name-value pairs.

EAV is one of the most popular solutions, even supposedly endorsed by standard schemas in some industries. Many relational purists detest EAV because its non-relational. It's main drawbacks are that the "name-value pair" table gets huge fast, with the corresponding lack of performance, the need for careful editing (color and colour would be two different attributes), and the fact that the "value" column of name-value must have a data type of nvarchar or SQL-variant.

SQL Server 2005 added the PIVOT keyword. One use for PIVOT is the change the EAV tables into something that looks like sparse tables.

I even had the opportunity to ask Joe Celko (no fan of EAV) which he prefers, trying to ease him towards the "model as XML" mechanism. He stood up for sparse tables or sparse columns.

SQL Server 2008 will include support for sparse columns. You can designate a column as

SPARSE in the DDL, like this:

CREATE TABLE products (product_num int, item_num int, price decimal(7,2), ...,
                       color char(5) SPARSE, width float SPARSE...)

You can have a huge number of sparse columns per table, although the number of non-sparse columns remains at 1024. In addition, SQL Server 2008 will support sparse indexes (aka filtered indexes) defined like:

CREATE INDEX coloridx ON products(color) WHERE product_num IN (21,22,42...)

Finally, you can have an XML "COLUMN SET" column for each table; this exposes the sparse properties (or perhaps a subset of them?) for each item as a collection of XML elements, for those folks that like to model these as XML.

ALTER TABLE products ADD COLUMN properties XML COLUMN_SET FOR ALL_SPARSE_COLUMNS

It's an interesting idea; the proof will be in the perf as well as the usability.

Sunday, June 24, 2007 8:29:39 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 

ORDPATH is a hierarchical labeling scheme used in the internal implementation of the XML data type in SQL Server 2005. It's meant to provide optimized representation of hierarchies, simplify insertion of nodes at arbitrary locations in a tree, and also provide document order. It's described in the academic whitepaper "ORDPATHs: Insert-Friendly XML Node Labels". In addition to being used internally when the XML data type is stored, its also part of the key of the PRIMARY XML INDEX, used to speed up XQuery. It's an implementation/specialization of the path enumeration model of representing hierarchies in relational databases, mentioned in Joe Celko's book "Trees and Hierarchies in SQL".

In SQL Server 2008, there are additional uses of ORDPATH. There is a new system data type HierarchyID, that will likely use ORDPATH in its implementation. This allows simply hierarchies to be represented as relational column and provides methods that optimize common hierarchical operations (like parent, child, sibling, ancestors, descendants) without being concerned about the intricacies of elements and attributes.

In addition to representing and indexing XML and hierarchies, Michael Rys mentioned at his TechEd chalktalk on spatial data that the spatial data types may be indexed using a multi-level grid system and that these indexes would also use ORDPATH. Since neither HierarchyID or spatial types are in the current CTP of SQL Server 2008, we'll have to wait a bit to see if this is truly "ordpath everywhere".

Sunday, June 24, 2007 7:51:08 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Saturday, June 23, 2007

SQL Server 2008 will contain, if my count is correct, 7 new data types. Note: none of these are in the current CTP.

DATE - ANSI-compliant date data type
TIME - ANSI-compliant time data type with variable precision
DATETIMEOFFSET - timezone aware/preserved datetime
DATETIME2 - like DATETIME, but with variable precision and large date range

GEOMETRY - "flat earth" spatial data type
GEOGRAPHY - "round earth" spatial data type
HIERARCHYID - represents hierarchies using path enumeration model

The first four (date/time series) are NOT implemented/exposed as .NET system UDTs, but the last three are exposed as .NET system UDTs. This means that, for the first time, .NET will be used as part of SQL Server. In SQL Server 2005, there were originally DATE and TIME data types implemented in .NET, but implementing temporal data is an intricate process. After much wailing and gnashing of teeth by some members of the user community, these were removed.  One of the complaints was the .NET implementation. Hmmm...

One of the nice side-effects of implementing spatial and hierarchyid as .NET types is that these will be shipped as a separate assembly, and that the types will be available for client-side and middle-tier use as well as in the database. So if you want to do some massive number crunching of spatial sequences on a computation server and the network traffic from database to computation server is acceptable, you can do so.

In general, SQLCLR makes the "logic in database or middle-tier" argument easier to deal with. Although there's no "run on database or run on server" switch in VS, with minimal code changes you can move your logic, or even duplicate the logic between tiers if need be. You can't do this with T-SQL; although its faster and better for data access on the database, it doesn't run outside the database. Unless you want to use SQL Server Express Edition as an application server. But that's a discussion for another time.

Saturday, June 23, 2007 10:03:25 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

Had to write about another thing that "caught my ear" at TechEd during a chalktalk by Rick Negrin about Service Broker usage patterns.

Service Broker supports "internal activation", that is, associate a stored procedure that gets invoked when a queue has messages to process as well as "external activiation". When using external activation, an event notification occurs when a queue has messages to process, and this notification is picked up by an external application; the external application processes the queue messages, out-of-process to SQL Server.

There is an SSMS template (see template explorer) for an internal activator procedure and an engine sample implementation of a class library to support SQLCLR activator procs. In addition, Remus Rusanu has written some excellent blog entries on activation program message processing patterns.

There is a sample implementation of an external activator as well. Rick mentioned "productizing the external activator for SQL Server 2008". This would mean making the activator part of the SQL Server product, along with the extensive testing and support that go along with it.

That's great news, and a supported, standard, configurable, external activator would make a great addition for SQL Server 2008. Now, about improved broker diagnostic and configuration utilities...and that SQLCLR support library....and improved SSMS support. Those would help things out too (some people are never satisfied).

Saturday, June 23, 2007 9:39:47 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

The June CTP of SQL Server 2008 contains support for table-valued parameters. Here's a usage scenario for these that has been around for a while.

Imagine you are running an online store and deal with (among other data) orders and order detail lines. You'd like to have a stored procedure that can add an entire order in one server round trip, regardless of the number of items that I order (that is, 1 order header and 1->n order detail lines). You'd even settle for two round trips, one for the order header, one for the details. Before SQL Server 2008, there is no built-in mechanism that supports this. In the past, I've seen some pretty interesting workarounds, such as:

1. Compose an arbitrarily large SQL batch as a single "command text" by using string concatenation on the client/middle-tier. All SQL Server database APIs support one (and only one) batch per Command. ADO classic did something like this when you inserted/updated/deleted multiple rows in a disconnected Recordset and called for a "batch update".

2. Create a stored procedure with some "static" parameters and an arbitrarily large number of repeating parameters, most of which will always be NULL. The limit to the number of parameters in a stored procedure is 2100. Both this method and the previous one make for some pretty hideous-looking code.

3. Use multiple parameter sets. OLE DB does support multiple parameter sets and some databases can optimize inserts that use multiple parameter sets. The SQL Server providers, at least last time I looked with SQL Profiler, turn multiple parameter sets into multiple calls, that is, one round trip per parameter set. That's not what I wanted.
 
SQL Server 2008's solution to this age-old problem is table-valued parameters (TVPs, for short). You start using a TVP by creating a custom type, using the CREATE TYPE statement, like this:

CREATE TYPE lineitem_type (line_number INT, order_id INT, product INT, quantity INT);

Information about these table types appear in sys.types and also in a new metadata view, sys.table_types. Once you've created such a table type, you can use it in T-SQL like this:

CREATE PROC new_order (@order_id INT, @line_items lineitem_type)
AS
-- silly table names used for clarity
INSERT INTO orders_table VALUES (@order_id ... ) ;
INSERT INTO line_items_table
  SELECT * FROM @line_items;

One round trip. Compact, clean code. Nice.

Saturday, June 23, 2007 4:30:18 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4]  | 

SQL Server 2008 will include GROUPING SETS; a (very) short explanation is that these allow the equivalent of multiple GROUP BY clauses in a single SQL statement. The result is a UNION ALL of the resultsets. SQL Server 2008 also contains/allows standard syntax for ROLLUP and CUBE, which have been in SQL Server for a while.

One way to use (or to think of) grouping sets is that, while ROLLUP with N columns produces a UNION of N+1 results and CUBE produces N-squared -1 results, grouping sets can produce an intermediate number of results, when not all the dimensions produced by CUBE are needed. Grouping sets should allow better optimization of this type of dimensional query.

SQL Server 2008 Reporting Services will contain a new type of control, called the TABLIX. The SQL Server 2008 CTP BOL defines a TABLIX as: "A Reporting Services RDL data region that contains rows and columns resembling a table or matrix, possibly sharing characteristics of both." Grouping sets sound like a good fit with this component. When TABLIX is available, a short profiler session would confirm this.

Of course, in addition to performance benefits and TABLIX support, GROUPING SETS are part of the ISO-ANSI SQL-2006 spec. Another plus.

Saturday, June 23, 2007 4:09:40 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

There was a fairly well-known Powerpoint slide that attempted to summarize the new SQL Server 2005 features in bullet points of a single slide. By the release, there were so many new features, the feature list had to be rendered in a 5-point font to fit. At TechEd 2007, Microsoft presented the new features of SQL Server 2008 (was: SQL Server Katmai) in an analogous format. Although it's not yet down to a 5-point font, there are quite a few impressive new features on tap. Here's the list, modulo (my own) potential spelling errors. There is currently a CTP of SQL Server 2008 as well as beta1 of the next version of Visual Studio.NET (Orcas) and the ADO.NET synchronization framework. The features I can find in the current betas have asterisks. [Note: Asterisks updated for the July CTP]

Transparent Data Encryption
External Key Management
Data Auditing
Pluggable CPU
Transparent Failover for Database Mirroring
Declarative Management Framework*
Server Group Management*
Streamlined Installation*
Enterprise System Management*
Performance Data Collection*
System Analysis*
Data Compression
Query Optimization Modes
Resource Governor
Entity Data Model*
LINQ*
Visual Entity Designer*
Entity Aware Adapters

SQL Server Change Tracking*
Synchronized Programming Model*
Visual Studio Support*
SQL Server Conflict Detection
FILESTREAM data type
Integrated Full Text Search
Sparse Columns
Large User Defined Types*
Date/Time Data Type*
LOCATION data type
SPATIAL data type
Virtual Earth Integration
Partitioned Table Parallelism
Query Optimizations*
Persistent Lookups
Change Data Capture*
Backup Compression
MERGE SQL Statement*
Data Profiling
Star Join*

Enterprise Reporting Engine*
Internet Report Deployment
Block Computations
Scale out Analysis
BI Platform Management
Export to Word and Excel
Author reports in Word and Excel
Report Builder Enhancements
TABLIX
Rich Formatted Data
Personalized Perspectives
… and many more

It should be straightforward to eventually produce a 5-point slide because these features haven't made "the list" yet.

Grouping Sets*
Table-valued parameters*
Table-valued constructors*
"Delighters" (1-stmt variable declaration/assignment, increment operators)*
HierarchyID data type*
Large User Defined Aggregates
New versions of ODBC driver*, OLE DB provider*, and ADO.NET data provider
SQLCE Version 3.5*
Database Mirroring Enhancements*
Data Mining Algorithm Enhancements*
SSIS support for the new date/time data types*
SSIS VSTA support*

That's an impressive set of features. And I'm pretty certain, since this is a summary, that's there are more to come.

Saturday, June 23, 2007 3:34:28 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4]  | 

Theme design by Jelle Druyts

Pick a theme: