Selective XML Index – Why is this compelling?

OK. VM with the SP1 CTP is shut down. Fun's over (until I think of something new and cool or you think of something and tell me about it). Time to answer Aaron's implied question about "what makes this so special it should be shipped in a Service Pack, no less?".

This could be a game changer with respect to XML querying in SQL Server. Both in query speed and in data size reduction. But, you say, not many customers do XML querying in SQL Server. Well, this was one of the reasons why.

Cust: I stored some XML in SQL Server and queries are incredibly slow.
Me: You need at least a primary XML index to make queries tolerably fast.
Cust: But I really can't afford all that extra data. And it slows down inserts. And backups. And sucks up data cache in memory. And XML index rebuilds are single-threaded and offline. And…
Me: It's either indexing or slow queries. Take your pick.
Cust: I pick not.
Me: (Shrugs) Can't blame ya. How about partial decomposition into relational? (persisted computed columns)
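
That last suggestion, in sketch form (the table, column, and function names here are mine, not from any particular customer scenario): promote the XML values you query most into persisted computed columns via a schema-bound scalar UDF, and index those.

```sql
-- Hypothetical partial decomposition of XML into relational columns.
-- The UDF must be schema-bound (and deterministic) for the computed
-- column to be persisted and indexable.
CREATE FUNCTION dbo.GetBaz(@x xml)
RETURNS varchar(25)
WITH SCHEMABINDING
AS
BEGIN
  RETURN @x.value('(/foo/bar/baz)[1]', 'varchar(25)');
END;
GO

CREATE TABLE dbo.stuff (
  id int IDENTITY PRIMARY KEY,
  thexml xml,
  baz AS dbo.GetBaz(thexml) PERSISTED
);
GO

CREATE INDEX stuff_baz ON dbo.stuff(baz);
```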

So, depending on the specificity (locality) of your queries against "hunks of XML", this could be a game-changer. Selective XML indexes on a specific (hopefully small) set of pathspecs will speed up your queries and make the overhead MUCH more tolerable. So it might make the concept of using XML + queries worth a second look. There might even be some advantages vis-a-vis persisted computed columns, like the fact that persisted computed columns over UDFs (from partially decomposed XML) can't use parallelism. But, it's still an offline, single-threaded rebuild.

You WILL, however, have to track which queries you're running against the XML that you are "selectively indexing", and maintain your Selective XML Index accordingly (with alter-remove and alter-add for pathspecs). Similar to the way you'd optimize SQL queries. Except SQL queries come with lovely built-in assistants like Database Tuning Advisor. I know some folks don't like DTA, but it's at least useful as a starting point. There's no DTA support for XML indexes of any kind (or spatial or FTS indexes, for that matter).

Even factoring out query/indexing speed, XML in SQL Server does have its limits. Like using the XQuery Data Model, which makes some constructs "valid XML" that some/most parsers don't support. Or that output from a SELECT that puts out XML has a single encoding (maybe CONVERT with an encoding parameter on output would help). And SQL Server itself and the XML data type internals don't support UTF-8, which bloats things a bit. And the version of XQuery SQL Server uses is getting long in the tooth, and is a subset. And XML-DML is based on a proprietary spec, because the real spec postdates SQL Server.

You might ask: "If this Selective XML Index thing is such a good idea, why didn't they do it this way originally?". Good question. SQL Server XML support (data type, indexing, XQuery, etc) was introduced in SQL Server 2005. Sparse columns, the technology on which SXI is based, was introduced in SQL Server 2008. How come it took so long? The chicken and egg problem above. Hard to add resources if it's not a popular feature, but it won't be popular UNTIL you add resources.

And finally, how relevant is XML today? It seems like lots of data-related things these days are using JSON (that's JavaScript Object Notation) instead, no? Well, no. JSON is used to send data over the network; it's not necessarily the best way to store (and certainly not to query) data. Unless you're a NoSQL database like one of these. But these both seem to support both REST (XML)-based and JSON-based output, if not store data that way. Is the idea of a database preformatting data for the consumer (and even storing and querying the format in question) a good idea? Well, SQL Server tried this with XML (HTTP Endpoints over the SOAP protocol). It will be discontinued in SQL Server 2012: too slow, and a waste of memory and cycles for the database server. And SOAP (on which it was based) seemed to have a new permutation every few months or so for a while. HTTP Endpoints couldn't (or didn't) keep up with these.

These days Microsoft, under the guise of OData, seems to be pushing/offering both JSON and AtomPub formats, but the formatting and serving to clients takes place on a middle-tier WCF service, for one of their implementations. And the spec lets you implement the (OData) protocol however you'd like. So I don't really think XML is dead, it's just not the only cool, portable, text-based data format in town anymore.

It's been a long day. I personally think SXI is a cool technology, but only time will tell about its adoption. IMHO, it's worth using a service pack as a ship vehicle to get it to folks sooner. But, as to the future, and text-based data formats, as Jim Bouton, I believe, once said/wrote: "I think I'll go wash my brain out with soap". And that's soap the cleanser, not SOAP the protocol. Of course. ;-)

Send comments.

Cheers, Bob
@bobbeauch

Selective XML Indexes – Learning the rules through error messages

Experimenting without docs is a good way to get lots of error messages along the way. I'm learning by attrition, and it seems like I'd be better off knowing the rules. It's not the best way to learn anything, but you can at least learn the limits through the error messages. I was tired of trying seemingly random things and being surprised, so I decided to look for "selective XML index" error messages.

So, here's how to find (or imply) "the rules" for Selective XML Indexes.

-- Ids 662, 970, 2735, 6342, 6343, 6367-6399 and 9535-9539 inclusive
select message_id, text from sys.messages
where language_id = 1033
and text like '%selective XML%';

Now some of these are general (parameterized) error messages, so you won't get exact limits in terms of numbers. But, I tried to categorize and summarize these for my own use. Here are some interesting ones.

1. Not all SQL data types are supported (not surprising)
2. Not all XQUERY (XSD) data types are supported (there were some surprises here, like xs:integer not supported)
3. Selective XML indexes' typing can't contradict types in an XML SCHEMA COLLECTION
4. Selective XML indexes' typing can affect new data you try to add, for example if the types don't match the SXI-defined types and you're not using an XML SCHEMA COLLECTION.
5. You can't index a non-leaf node for the value or use * wildcard as the last step in the path.
6. There is a maximum depth to the path in a pathspec.

And so on… you can read 'em as well as I can.
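
For example, rule 5 can be demonstrated with something like this (the pathspec is hypothetical; the exact message number will be one of the 63xx range found above):

```sql
-- A pathspec ending in a * wildcard; per rule 5 this should be
-- rejected when you try to create the index.
CREATE SELECTIVE XML INDEX badidx ON foo(thexml)
FOR
(
    badpath = '/foo/bar/*'
);
```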

Hopefully, this and the previous entries should be enough to get you started. I did write these in one sitting, and most of them were written in "realtime", so if you have any additions or questions, and/or find something neat to do with these, drop me a line by email (be creative, figure out my email address) or ping me on twitter.

Motivation in the next (which is also the last, for now) posting.

@bobbeauch

Selective XML Index – Secondary Selective XML Indexes

I've only been talking so far about the "primary" Selective XML Index. But you can also create 0-n "secondary" Selective XML Indexes. The syntax looks a little bit like secondary "non-selective" XML Indexes in that you use the "USING [related XML index]". You specify one (and only one) pathspec. Additionally, the XML value that the pathspec "points to" (see previous post) cannot be data type xs:untypedAtomic (varbinary(max)).

create xml index sxi_secondary1 on foo(thexml)
using xml index fooidx
for (pathname1); -- strongly typed to SQL VARCHAR(25)

create xml index sxi_secondary2 on foo(thexml)
using xml index fooidx
for (anotherxq); -- strongly typed to XQUERY xs:double (SQL FLOAT)

-- Msg 102, Level 15, State 1, Line 3
-- Incorrect syntax near ','.
-- Means: only one pathspec name allowed
create xml index sxi_secondary3 on foo(thexml)
using xml index fooidx
for (pathname1, pathname2);

-- Msg 6391, Level 16, State 0, Line 1
-- Path 'pathname2' is promoted to a type that is invalid for use as a key column in a secondary selective XML index.

-- Means: can't have varbinary(max) as an index key
create xml index sxi_secondary3 on foo(thexml)
using xml index fooidx
for (pathname2);

These work the way you'd think. They create a NONCLUSTERED index over the (internal) table that comprises the "primary" Selective XML Index. The NONCLUSTERED index key is the "value" column in the pathspec, along with the traditional "backpointer" to the primary key of the "primary" Selective XML Index. These NONCLUSTERED indexes are FILTERED indexes, filtered on the value column in the pathspec. The filter is "[Valuecolumn] IS NOT NULL".
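
If you want to see this for yourself, a metadata query along these lines (the exact internal object names are whatever SQL Server generated, so I'm not assuming them) shows the internal table and any filtered nonclustered indexes over it:

```sql
-- List the SXI internal table for table foo, plus its indexes and
-- their filter definitions (secondary SXIs show up as filtered
-- nonclustered indexes here).
SELECT it.name AS internal_table,
       i.name  AS index_name,
       i.type_desc,
       i.has_filter,
       i.filter_definition
FROM sys.internal_tables AS it
JOIN sys.indexes AS i
  ON i.object_id = it.object_id
WHERE it.parent_object_id = OBJECT_ID(N'foo');
```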

So, if an XML value in a pathspec is sparse (meaning, it doesn't appear in every/most rows) these secondary Selective XML Indexes could be helpful. And remember, like Michael said, these indexes are chosen by the optimizer, not the algebrizer, unlike the way the algebrizer chooses the primary Selective XML index, but after the algebrizer decides that the Selective XML Index could be useful at all.

@bobbeauch

Selective XML Index – Implementation Details

A while after I posted part 1, there was a twitter message from Aaron (@AaronBertrand), a well-known "smart person" in the SQL Server space, about using service packs as "ship vehicles", especially when they (possibly) have the effect of requiring keeping all instances in-sync at the service pack level to keep things working, and possibly even making the SP feature relevant during a restore. I answered that "it must be a compelling feature". And then we reminisced about features like "vardecimal" and "15k partitions". So I'll discuss why it is a compelling feature (or at least why I happen to think it is) later on in the process.

But for now, about how it's implemented…

Before that, even: you can add or remove pathspecs on a Selective XML Index. I found this one by "reading into an error message" and trying syntax until it worked.

-- add a pathspec
alter index fooidx on foo
for
(
add another_pathspec = 'foo/bar2'
);

-- remove the pathspec we just added
alter index fooidx on foo
for
(
remove another_pathspec
);

As Michael said, the selective XML index is implemented using sparse columns. The first column (or columns) is the primary key of the base table, always named pk1…n. Then a row_id column, which is incremented if there are multiple matching nodes in the same row.

Each pathspec consists of at least two columns: a path column and 1-2 value column(s). The path column is named path_[n]_id (and is always varbinary(900) and sparse).

The value column is named either "[pathspec_name]_[n]_value" or "[pathspec_name]_[n]_sql_value" (for a SQL-typed pathspec). If the pathspec contains a wildcard, there is an additional column named path_[n]_hid (varbinary(900)), where HID stands for hierarchyid (i.e. the path in the document). A pathspec with an XPath/XQuery wildcard would look like this: '/foo/*/baz2'.

As an aside, remember that the SQL Server XML data type (node paths) and hierarchyid data type use the same encoding scheme, known as Ordpath; see my old blog entry "Ordpath, ordpath, everywhere".

All of the columns (except pk1 and row_id) are sparse. As for data types (the nice thing is that SPARSE columns can preserve strong typing), the columns are typed as:
 Type 1: xs:untypedAtomic = varbinary(max)
 Type 2: The closest SQL data type to XSD type (e.g. xs:double = SQL FLOAT)
 Type 3: The SQL type specified in the "pathspec AS SQL…" specification. Including length and collation.

The length of each of the sparse columns is based on the length of the column (either expressed or implied) in the pathspec.

So, what's this all mean?

It means that, depending on how selective each pathspec is, how many pathspecs are designated in the Selective XML Index definition, and the sparseness of each of the column values, the Selective XML Index can be MUCH, MUCH smaller than the (original) PRIMARY XML INDEX (which is implemented as a 12-column side-table, one row per XML node). Because you're only indexing those values you care about. And individual node data CAN be sparse in XML (element text can be, but XML attribute nodes are even more likely to be sparse).

And it also means that the limitations of SQL Server 2008 SPARSE columns apply. See this starting point in Books Online to understand these limits in detail. And, because there isn't an XML columnset column, you are limited to a theoretical maximum of 511 pathspecs, if none of them are wildcards and there's a 1-column primary key (the 1024 maximum columns in a table without a columnset, minus 2, divided by 2). I don't know what the real limit is yet.
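
You can watch the column count grow as you add pathspecs with a query like this (again, the internal table name is generated, so I go through parent_object_id rather than assuming it):

```sql
-- Columns of the SXI internal table: name, sparseness, type, length.
SELECT c.name, c.is_sparse, t.name AS type_name, c.max_length
FROM sys.internal_tables AS it
JOIN sys.columns AS c
  ON c.object_id = it.object_id
JOIN sys.types AS t
  ON t.user_type_id = c.user_type_id
WHERE it.parent_object_id = OBJECT_ID(N'foo')
ORDER BY c.column_id;
```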

@bobbeauch

Selective XML Indexes in SQL Server – First Try…it works

OK, so let's try something. Load a bunch of documents. Although, actually this part works with an empty table.

create selective xml index fooidx2 on foo(thexml)
for
(
pathname1 = 'foo/bar/baz2'
);

Msg 6379, Level 16, State 201, Line 1
selective XML index 'fooidx' already exists on column 'thexml' in table 'foo'.
Multiple selective XML indexes per column are not allowed.

Only one SXI at a time. With 1-n path specifications.

About path specifications. To paraphrase the talk…there are three basic types of paths.

1. Simple: /foo/bar/baz1
2. With optimizations:
   /foo/bar/baz2 as XQUERY 'xs:string' MAXLENGTH(25)
   /foo/bar/baz3 as XQUERY 'xs:double' SINGLETON
   /foo/bar/baz4 as XQUERY node() -- check for node existence
3. For use with the XQuery value method:
   /foo/bar/baz5 as SQL VARCHAR(25)

Forms 1 and 2 are for XQuery with any XQuery method. Form 3 is for the XQuery value method. This seems to turn out to be more important than I thought it was at first.
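
Putting all three forms together, a single (hypothetical) index definition might look like this; the pathspec names and paths are made up:

```sql
-- One selective XML index mixing the three path specification forms.
CREATE SELECTIVE XML INDEX fooidx ON foo(thexml)
FOR
(
    p1 = '/foo/bar/baz1',                                    -- simple
    p2 = '/foo/bar/baz2' AS XQUERY 'xs:string' MAXLENGTH(25),
    p3 = '/foo/bar/baz3' AS XQUERY 'xs:double' SINGLETON,
    p4 = '/foo/bar/baz4' AS XQUERY node(),                   -- existence
    p5 = '/foo/bar/baz5' AS SQL VARCHAR(25)                  -- for value()
);
```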

Rule for using it in query plan:
   1. If SXI (selective XML index) exists and path matches specification, use it
   2. If PXI (primary XML index) exists, use it
   3. Else use XML Reader with XPath Filter

First experiment:

Defined (the one and only) SXI index for this XML column on path /foo/bar/baz. Type 1 path specification. Issued query:

SELECT thexml.value('(/foo/bar/baz)[1]', 'varchar(25)')
FROM foo;

SXI isn't used in the plan. Hmm…he said it was a decision of the algebrizer, not the optimizer. And the path sure looks like it matches. Futzed with this for an hour or so. I actually added more documents. Used three pathspecs that (I thought) all matched. I thought this worked in his demo, but no joy. Tried to force the SXI with an index hint (you can't force an SXI in any case). But it's the algebrizer, not the optimizer, Bob…

Instead, try an index on /foo/bar/baz AS SQL VARCHAR(25). Type 3 path specification. Now it's used.
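
In sketch form, the combination that finally produced an SXI seek in my plan (drop and recreate, because only one SXI is allowed per column):

```sql
-- Replace the XQuery-typed index with a SQL-typed pathspec.
DROP INDEX fooidx ON foo;

CREATE SELECTIVE XML INDEX fooidx ON foo(thexml)
FOR
(
    pathsql = '/foo/bar/baz' AS SQL VARCHAR(25)
);

-- The plan for this query should now use the selective XML index:
SELECT thexml.value('(/foo/bar/baz)[1]', 'varchar(25)')
FROM foo;
```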

@bobbeauch

Getting started with Selective XML Indexes in SQL Server

So, it was announced that SQL Server 2012 SP1 CTP4 was released yesterday. You've likely already heard this (more than once, everyone seems to revel in repeating announcements from the team). And it contains a new feature (I was surprised, thought that wasn't supposed to happen, new features in service packs, meanwhile…) called Selective XML Indexes. Interesting concept, especially because dragging around a primary XML index that's 3-6 times bigger than the data itself is a burden. But, without it, querying any non-trivial XML is quite slow. To put it nicely…

Downloaded and installed the SP (remember it's a CTP, DON'T install this on a production system). Got my XML test bed out to try it. First problem: no docs on the feature. Well, the only information that exists (OK, that I could find quickly) is in Michael Rys' TechEd US/Europe talks. Downloaded and listened to them; you can find them with a web search. I'll try not to repeat his content. There's DDL in his talk. So:

create table foo (id int identity primary key, thexml xml);

create selective xml index fooidx on foo(thexml)
for
(
pathname1 = 'foo/bar/baz',
pathname2 = 'foo/bar2'
);

Msg 9539, Level 16, State 1, Line 1
Selective XML Index feature is not supported for the current database version

Bummer. Really? Well, there is a trick to it (thanks, Michael).

exec sys.sp_db_selective_xml_index null, 'on' -- turn on for the current database

BTW, if you want to restore/attach this database now to a pre-SP1 system, you have to turn the feature off.

exec sys.sp_db_selective_xml_index null, 'off'

Which likely means deleting all the selective XML index stuff you have in the database, too. Haven't tried it yet, though. And the first parameter is the database name; NULL means "current database".

And it works now. So now we can all play along at home. ;-) Let a hundred selective XML indexes bloom. Michael suggests there will be BIG performance improvements and little disk space used.

@bobbeauch

Using filetable with full-text? Lose the redundant index

I'm not sure how many full-text search over filetable demos I've seen (including some I've written that I was just perusing), that do something like this.

CREATE TABLE foo AS FILETABLE WITH (FILETABLE_DIRECTORY = N'ftdir');
CREATE UNIQUE INDEX fooidx ON foo(stream_id);
CREATE FULLTEXT INDEX ON foo (file_stream TYPE COLUMN file_type) KEY INDEX fooidx;

After looking at this a few times, I couldn't believe there wasn't some kind of key constraint/index on the stream_id column defined as part of the filetable definition. And indeed there is. There's a UNIQUE constraint already that causes a nonclustered unique index to be built to enforce it. So why the extra nonclustered index, just to use full-text? A couple of extra lines of code should be able to tell us what the filetable named its constraint/index, and we can use that name in the full-text index creation instead. And we won't need the redundant index.
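
Those "couple of extra lines" would look something like this (a generic catalog query for the unique index the filetable auto-created over stream_id):

```sql
-- Find the name of the auto-generated unique index on stream_id.
SELECT i.name
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
  ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns AS c
  ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID(N'foo')
  AND i.is_unique = 1
  AND c.name = N'stream_id';
```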

But wait…we don't even have to do that.

CREATE TABLE… AS FILETABLE takes some additional parameters that make the extra code unnecessary. For our purposes, we only need to name the key constraint that corresponds to the KEY INDEX we're using in the full-text index. So…

CREATE TABLE foo AS FILETABLE WITH (FILETABLE_DIRECTORY = N'ftdir', FILETABLE_STREAMID_UNIQUE_CONSTRAINT_NAME=UQ_stream_id);
CREATE FULLTEXT INDEX ON foo (file_stream TYPE COLUMN file_type) KEY INDEX UQ_stream_id;

loses the redundant index. There are also parameters to specify the FILETABLE_PRIMARY_KEY_CONSTRAINT_NAME and FILETABLE_FULLPATH_UNIQUE_CONSTRAINT_NAME as well, if you need them, to make it easier to define foreign keys.  Although the filetable's related objects are fixed at creation time, you can add constraints to a filetable after it is created.

@bobbeauch

Using filestream, streaming I/O and SQL logins with impersonation

A number of months ago, I wrote an article about using the SQL Server filestream feature programmatically. I've gotten a number of inquiries since from folks who *need* to log in to SQL Server using SQL logins (often the same SQL login for everyone), but want to access the stream using streaming I/O (e.g. the SqlFileStream class). This doesn't ordinarily work, but how about using impersonation in SQL Server?

I have a SQL login named "bob". And a Windows login named "ZMV03\Mary". Mary is not a Windows Administrator. Neither one is sysadmin in SQL Server. Added users for each one to the filestream-enabled database with appropriate permissions. In the master database, I granted bob impersonate privilege.

GRANT IMPERSONATE on LOGIN::[zmv03\Mary] to bob;

In the ADO.NET program, I impersonate the login before getting the transaction context and PathName. The rest of the program looks the same.

SqlConnection string contains Bob's credentials. Then:

command.CommandText = @"execute as login ='zmv03\Mary';select Top(1) Photo.PathName(),GET_FILESTREAM_TRANSACTION_CONTEXT () from employees2;revert";

And the streaming I/O using SqlFileStream and the transaction token works. Some caveats:
 You need to impersonate the login, not the user corresponding to the login.
 You can't use "execute as login=" in a CREATE PROCEDURE statement, but you can use it inside a stored procedure.
 You need to use the same Windows login with the SQL impersonation as with the client code. With ASP.NET, for example, that would be the user running the ASP.NET worker process/app pool. You can't do the SQL impersonation using any old Windows login.
 Note that you can even revert the impersonation before accessing the stream. And that's probably a good idea.

And about the setup, I used:
 Local console app, running as zmv03\Mary, which is an individual, not a Windows group, login. Local means console app is running on same machine as SQL Server.
 SQL Server 2012. Database is defined with non_transacted_mode = off.  Machine was not a part of a domain.

Just listed these for sanity reasons. I'm not talking about filetable, but filestream. You can use transacted access with filetable, but normally you'd use non-transacted access and open the file directly, using the network share. If this breaks under different conditions, I'd like to hear about it. I read email more often than I look for blog comments. This may be spec'd to work somewhere, but I couldn't find it.

Also, I've been unsuccessful in using SQLCLR, filestream, and streaming I/O with the context connection. If you really, really want to do this, maybe using a real connection and external access might work. I didn't have time to try this, and am not sure it would be particularly useful. But you would have to be running the SQLCLR proc as the exact Windows login that does the impersonation, and craft your SqlCommand text accordingly.

Also, in case you've confused this discussion with the error message "Either a required impersonation level was not provided, or the provided impersonation level is invalid." on older OSes: I'm not talking about that, but there is a fix for it; see this KB entry. The error that you get trying to access the stream after using a SQL login and GET_FILESTREAM_TRANSACTION_CONTEXT is "Access denied".

@bobbeauch

Extended Events in Windows Azure SQL Database

Continued from the previous post, "Some new/upcoming Windows Azure SQL Database DMVs and diagnostics"… Read that one to find out who Vinod and Michal are.

So, before you all run out and try it, this isn't working yet. But, like the WASD event log view, the pieces seem to be in place. So, standard disclaimer applies. I only saw it working in Vinod and Michal's demo. The stuff I'm looking at may bear no resemblance to the final product. On the other hand, it may…

So, I went looking in WASD today. Some of the extended event views exist in every database today. This jibes with the demo that showed an XEvent node in SSMS Object Explorer in the Database/Programmability folder. As opposed to XEvents sessions, etc in SQL Server which live in master, these appear to live in each database. Which only makes sense if you know how WASD works…

Obviously, that node doesn't show up in SSMS today. But for those of us who like XEvents in SQL Server 2008, sans GUI, that really doesn't present a problem.

They showed four pre-defined XEvent sessions, and said that initially there would only be predefined sessions. No user-defined sessions. The names they showed matched the output from "select * from sys.event_sessions" and are:

azure_xe_activity_tracking
azure_xe_query_batch
azure_xe_query_detail
azure_xe_query_waits

You can get metadata and "current" data by using views that are similarly named to the views in SQL Server. SSMS object explorer for WASD shows the following:

sys.event_session_actions
sys.event_session_events
sys.event_session_fields
sys.event_session_targets
sys.event_sessions

sys.dm_xe_database_session_event_actions
sys.dm_xe_database_session_events
sys.dm_xe_database_session_object_columns
sys.dm_xe_database_session_targets
sys.dm_xe_database_sessions

Some of these views are more "filled in" than others. For example, the views related to the XEvent targets are currently empty, and there are no targets for any of the predefined event sessions. Note the use of the naming convention "dm_xe_database…" for the views instead of "dm_xe_server…". Makes sense.
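
Assuming the WASD views keep the same shape as their SQL Server counterparts, a quick way to see what each predefined session collects is:

```sql
-- Events defined for each predefined XEvent session.
SELECT s.name AS session_name,
       e.package,
       e.name AS event_name
FROM sys.event_sessions AS s
JOIN sys.event_session_events AS e
  ON e.event_session_id = s.event_session_id
ORDER BY s.name, e.name;
```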

So, gui or no gui, target or no target, I just had to try this. Had my "stop" and "harvest" scripts available in case it did start, but…

alter event session azure_xe_query_waits on server state=start;

Produces: "session does not exist or you do not have permission"

Hmmm… how about:

alter event session azure_xe_query_waits on database state=start;

Produces: "Failed to start event session 'azure_xe_query_waits' because doing so will exceed the maximum number allowed for this database."

Sounds like that could be the ticket. In future. So, until then, folks diagnosing WASD issues wait for the announcement for this, and the event log views, with bated breath. When I asked some SQL Azure folks about the lack of announcements about the "intermediate" versions of WASD (i.e. there was no formal announcement on the Azure blog about the new DMVs), the response was "we'd like to make small version changes to WASD a non-event, and only announce large changes". This will be a big…eh…event. THANKS Vinod and Michal.

@bobbeauch

Some new/upcoming Windows Azure SQL Database DMVs and diagnostics

I didn't get to a whole lot of others' sessions at TechEd this year, but was pleasantly surprised by Vinod and Michal's session on "SQL Azure Administration and Application Self-Servicing". Although most of the really interesting stuff they showed wasn't in place yet (they demo'd on a "test portal" and test Windows Azure SQL Database (did I get the new name right?) servers), there are some things folks have been looking for that are available now. The near-future looks even better.

Windows Azure SQL Database (we gotta come up with an acronym for this, how about WASD for now?) has always contained a subset of DMVs from SQL Server. And the ones that are there don't expose "physical information", for example memory addresses or server-level info. That's because WASD is a multi-tenant database and you're not allowed to know what the other tenants are up to. Good.

But, when I lecture about maintaining indexes on WASD, the first comment out of folks' mouths was always "but it's missing sys.dm_db_index_physical/operational/usage_stats"! No longer. These DMVs are there and work NOW, today. Modulo some physical information, of course, but it's enough to tune your indexes with. Hoorah!

Here's a complete list, from the presentation (and from my experimentation, they all work now):

sys.dm_exec_query_memory_grants
sys.dm_exec_cached_plans

sys.dm_exec_trigger_stats
sys.dm_exec_procedure_stats

sys.dm_db_index_physical_stats
sys.dm_db_index_operational_stats
sys.dm_db_index_usage_stats

sys.dm_db_missing_index_details
sys.dm_db_missing_index_group_stats
sys.dm_db_missing_index_groups
sys.dm_db_missing_index_columns

So diagnose away… The only one that I've had "trouble" with so far is sys.dm_db_missing_index_details. Most times I'll get an error message "Msg 40197, Level 20, State 1, Line 1 The service has encountered an error processing your request. Please try again. Error code 40143." The information I see seems to tie this error to database failover, so maybe I'll leave this one alone for the time being.
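
For example, a typical starting point for the index-maintenance conversation above, which now works in WASD (this is a generic pattern, not something from the session):

```sql
-- Indexes that are written to much more than they are read:
-- candidates for dropping or redesigning.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY s.user_updates DESC;
```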

The more interesting views that don't seem to be active yet relate to SQL Server log information. You don't have access to the log file directly in WASD (multi-tenant, figure it out). But there are two views in master that will expose a subset of this information, as well as a portal-to-come. The actual views they mentioned seem to be there now, but return no information. So it's "in place", but not "hooked up" yet, I'm guessing. These views are:

sys.database_connection_stats – connection information
sys.event_log – event log information

The view names are obviously self-explanatory.

The coolest thing they demo'd was related to the question I always get, "Does WASD support SQL Profiler?". There will probably never be SQL Profiler support, AFAIK, but there will be (soon) Extended Event (the new profiler) support. Guess that deserves a post of its own.

@bobbeauch