Choosing Paths for Selective XML Index – Part 1

About a month ago, I wrote a series of blog entries on the Selective XML Index introduced in SQL Server 2012 SP1. Just to summarize before I start up again, the Selective XML Index (which I'll abbreviate as SXI) is a feature that allows indexing of only certain paths in an XML column. It uses side-tables built on sparse column technology (introduced in SQL Server 2008) to effectively index pieces of XML documents while keeping the size of the index relatively low. It is unrelated to the "original" XML index technology (CREATE PRIMARY XML INDEX, etc.) that creates large indexes (2-5 times the size of the original documents) but indexes every element, attribute, and text node. Either XML indexing technology may be used independently, or both may be used together. In addition to the (primary) SXI, you can have "secondary" SXIs, which are simply nonclustered indexes over specific columns in the side table.
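As a quick illustration of that last point, a secondary SXI is created over one of the promoted paths of an existing primary SXI, using the CREATE XML INDEX ... USING XML INDEX form. A minimal sketch, with hypothetical names:

-- Secondary SXI over the promoted path pathOrderDate of an existing
-- primary selective XML index sxi_orders on table Orders (names are made up)
CREATE XML INDEX sxi_orders_orderdate
ON Orders(xmlcol)
USING XML INDEX sxi_orders
FOR (pathOrderDate);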

Since I wrote the original series, Books Online topics for SXI have appeared. So I thought I'd work through a strategy for deciding how to define these indexes and show how they affect query plans. As implied by the name of the feature, since you index specific paths, deciding which paths to index is key. Each path takes up space in the side-table. In addition, you can specify "keywords" on the paths, such as the XQuery data type, the SQL data type, and whether the path is a singleton. I've also found four new Extended Events to help in my decision. These events fire when you have a selective XML index defined but a path is missing or specified incorrectly. Namely,

selective_xml_index_no_compatible_sql_type – Occurs when a value() method is used in a query, specifying a sql type that differs from the type specified during selective XML index creation for that path.

selective_xml_index_no_compatible_xsd_types – Occurs when the user specifies a different resulting XSD type in a query than the one that was used to promote the path during selective XML index creation.

selective_xml_index_path_not_indexed – Occurs when the path given in the user query is not promoted in the selective XML index, and thus the index is not used.

selective_xml_index_path_not_supported – Occurs when the selective XML index is not used for an XML query because the query uses an unsupported path type.
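Here's a minimal sketch of a session that captures these four events along with the statement text. The session name and ring_buffer target are my own choices, and I'm assuming the events live in the sqlserver package:

CREATE EVENT SESSION sxi_diagnostics ON SERVER
ADD EVENT sqlserver.selective_xml_index_no_compatible_sql_type
    (ACTION (sqlserver.sql_text)),
ADD EVENT sqlserver.selective_xml_index_no_compatible_xsd_types
    (ACTION (sqlserver.sql_text)),
ADD EVENT sqlserver.selective_xml_index_path_not_indexed
    (ACTION (sqlserver.sql_text)),
ADD EVENT sqlserver.selective_xml_index_path_not_supported
    (ACTION (sqlserver.sql_text))
ADD TARGET package0.ring_buffer;

ALTER EVENT SESSION sxi_diagnostics ON SERVER STATE = START;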

I've created and started an XEvent session along those lines, also including the text of the SQL Server statement involved. So let's start with a simple example. After enabling SXI on the database (see my original posting), I took this one from the BOL and defined the table as:

CREATE TABLE tbl (  id int identity primary key,  xmlcol XML );

Note that xmlcol is untyped XML, meaning we don't have an associated XML Schema Collection. The exemplar query we're looking to optimize is:
SELECT T.id FROM tbl T WHERE T.xmlcol.exist('/a/b[c = "43"]') = 1;

And add a single row that will return true:
INSERT tbl values('<a><b><c>43</c></b></a>');

Let’s try this query first without any index. The query plan is supplied as a file (because it’s fairly large), but consists of:

1. Clustered index (CI) scan of base table
2. Two "XML Reader with XPath Filter" steps (to select the nodes we want)
   a. One step for the path outside the XPath predicate (a/b)
   b. One step for the path inside the XPath predicate (c as child of a/b)
3. One “XML Reader” to get the value of text node c.

For such a simple query against a 1-row table, the estimated subtree cost is over 1 (1.09117). We can do better.

So let's start by trying the BOL-suggested SXI:

CREATE SELECTIVE XML INDEX simple_sxi ON tbl(xmlcol)
FOR (
path123 = '/a/b',
path124 = '/a/b/c'
);

Unfortunately, this returns an error: Msg 6394, Level 16, State 1, Line 1 Found non leaf element indexed for the value with path 'path123' for selective XML index 'simple_sxi'.

It's "upset" because I have the path /a/b. It won't index any path that is not a "terminating path", i.e., one that has other XML underneath it (in this case, the "c" element). This is interesting for a few reasons. First off, I can't index "mixed content" with the SXI. Mixed content in XML is content that contains both tags and text, for example an HTML page. Here's a simple example.

-- The "value" of node b contains both text and tags (the <i></i> tags).
INSERT tbl values('<a><b>This is some text. This word is <i>italicized</i>.</b></a>');

Since storing/querying mixed content (e.g. WordXML) is a common use case, the fact that it won't work with SXI is good to know. But, in this case, I just want to look for the existence of the /a/b node. BTW, existence in this case isn't related to the XML "exist" method, but to the filter in my XML Reader With XPath Filter step. Is there a way to do this? It turns out this is exactly what the "node()" specifier is for. So let's try that.

CREATE SELECTIVE XML INDEX simple_sxi ON tbl(xmlcol)
FOR (
path123 = '/a/b' as XQUERY 'node()',
path124 = '/a/b/c'
);

No error on creating the SXI now. That’s good. And how about the query plan? Ahh…that’s much better.

1. Clustered index scan on the SXI side-table. Filter on value 43.
2. Clustered index seek on side-table for primary key of base table.
3. Clustered index seek to join to base table.

Query cost: 0.0098608. That’s a little better.
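If you want to double-check which paths ended up in the index, the sys.selective_xml_index_paths catalog view lists them. Here's a quick sketch (I'm going from memory on the column names; the join to sys.indexes is just to filter by index name):

SELECT sxip.name AS path_name, sxip.path
FROM sys.selective_xml_index_paths AS sxip
JOIN sys.indexes AS i
  ON i.object_id = sxip.object_id AND i.index_id = sxip.index_id
WHERE i.name = 'simple_sxi';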

…And this post is getting a little long. So I'll continue with further investigations in the next post. But what about the XEvents? We haven't seen any events yet. When we don't have an SXI, the feature doesn't suggest "missing paths". And the SXI we did use contained all the appropriate paths, so there were no XEvents for that query. So we'll wait on those for now.

@bobbeauch

sxi_plans.zip

New XEvent events in SQL Server 2012 SP1

I've been looking a bit more into the SQL Server 2012 SP1 release, mostly to find diagnostics related to the Selective XML Index (SXI). I found some additional extended events for SXI (there are four of them, in the "index" category) and I'll cover them, as well as exercise them, in the next posting. But in my travels, I also came across five more new-to-SP1-from-RTM extended events. They occur in the "warnings" category when you're using the XEvent GUI, and correspond, for the most part, to warnings that were added to query plans in SQL Server 2012 RTM. So if you liked these as warnings, you'll probably like them as events. They are:

plan_affecting_convert – Occurs when a type conversion issue affects the plan. The expression value indicates the conversion that can cause inaccurate cardinality estimation or prevent the use of a seek-based plan in the query plan choice. If performance is affected, rewriting the query could help.

spatial_guess – Occurs when the optimizer cannot get enough join selectivity information. Use this event to look at the queries where a spatial index could be used and the optimizer guesses the spatial selectivity. (Note: I *always* seem to see this event, even when there should be enough information for the optimizer to estimate the spatial selectivity, and filed a Connect item for this one.)

unmatched_filtered_indexes – Occurs when the optimizer cannot use a filtered index due to parameterization. An event is fired for each such index. The unmatched_database_name, unmatched_schema_name, unmatched_table_name, and unmatched_index_name fields give details of the filtered index that could not be used.

The "original" new SQL Server 2012 plan warnings, hash_warning and sort_warning, exist in RTM in the "errors" category; they came over to Extended Events when all the SQL Profiler events were moved over.

There is also one additional warning.

optimizer_timeout – Occurs when the optimizer times out either due to spending too much time or hitting a memory limit.  Use this event to look at all the queries that are impacted by the optimizer timeout in a particular workload. This can be very useful when tuning a particular workload.

Finally, there is an additional replication XEvent in the replication category:

logreader_start_scan – Outputs the replnextlsn value from the database table that the first replcmds scan uses as a starting point.
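If you want to try the warning-category events out, a minimal session sketch might look like the following. The session name and ring_buffer target are my own choices; I've left spatial_guess and logreader_start_scan out of the sketch for brevity:

CREATE EVENT SESSION plan_warnings ON SERVER
ADD EVENT sqlserver.plan_affecting_convert
    (ACTION (sqlserver.sql_text)),
ADD EVENT sqlserver.unmatched_filtered_indexes
    (ACTION (sqlserver.sql_text)),
ADD EVENT sqlserver.optimizer_timeout
    (ACTION (sqlserver.sql_text))
ADD TARGET package0.ring_buffer;

ALTER EVENT SESSION plan_warnings ON SERVER STATE = START;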

Happy event-ing!

@bobbeauch

Speaking at DevWeek London in March

Lucky me, I'll be speaking at DevWeek in London again this year. The festivities start off with pre-cons on March 4; I'll do a whole day on writing best-performing T-SQL and application code. After that, I'll be doing breakout talks on a variety of topics, from SQL Server Extended Events, to Data Movement in Windows Azure SQL Database, to SQLCLR internals, and more.

Hope to see some familiar faces at this one and make some new friends as well. Stop by and say hi…

@bobbeauch

How To Return Document Property Values in Full-Text Search

About a week ago I got involved with a question via Twitter that was posted originally to Stack Exchange. The question was "Is it possible to list document properties via FTS?" The question refers to SQL Server 2012's new support for property-based search, with a syntax that looks like this:

SELECT name DocumentName, file_stream.GetFileNamespacePath() Path
FROM Documents
WHERE CONTAINS(PROPERTY(file_stream, 'Title'), 'data OR SQL');
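For property-based search to work at all, the full-text index on the table needs a SEARCH PROPERTY LIST containing the properties you want to search. As a rough sketch (the list name is mine; the GUID/integer ID pair shown is the canonical Windows "Title" property):

CREATE SEARCH PROPERTY LIST DocumentProperties;

ALTER SEARCH PROPERTY LIST DocumentProperties
  ADD 'Title'
  WITH (PROPERTY_SET_GUID = 'F29F85E0-4FF9-1068-AB91-08002B27B3D9',
        PROPERTY_INT_ID = 2,
        PROPERTY_DESCRIPTION = 'Title of the document');

ALTER FULLTEXT INDEX ON Documents
  SET SEARCH PROPERTY LIST DocumentProperties;

Note that changing the search property list normally kicks off a full repopulation of the full-text index.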

This functionality has nice synergy with another SQL Server 2012 feature, the filetable feature. So you can store documents in a filetable and search them with FTS, including document properties. There's just one thing missing: there's no way to get the actual property values after you've searched for them. So, as I wrote in this suggestion on Connect:

"Suppose I have three documents that have authors of "Bob Beauchemin", "Bob Newhart" and "Bob Gupta". I can do a CONTAINS-based search for documents with authors that have a keyword "Bob", but can't bring back the entire propery value, which would be very useful.

Currently, the closest I can get is using sys.dm_fts_index_keywords_by_property. But that only produces the keywords, not the property values. So, in the case above, I wouldn't know if the actual author name was "Bob Newhart" or "Newhart Bob". And keywords wouldn't include noise words."
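For reference, here's a sketch of the DMF query I mean, against the Documents table from the example above (I'm going from memory on the column names):

SELECT display_term, property_id, document_count
FROM sys.dm_fts_index_keywords_by_property(DB_ID(), OBJECT_ID('dbo.Documents'))
ORDER BY property_id, display_term;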

Well, now there IS a solution. I'd just written a blog post about a standalone component for full-text search called ThinkHighlight. I wrote to the folks over at Interactive Thoughts and… voila! Their latest version supports returning not only highlighted text, but also an XML-format column that includes highlighted text AND property name-value pairs, via a new function called HitHighlightEx. It includes all the properties as the canonical propertyset/propertyid, and if your FTS index uses a SEARCH PROPERTY LIST, it will resolve the propertyset/propertyid to the property name from the SEARCH PROPERTY LIST.

As an example, let's use the demo script from the FTS blog. I can search for the keyword "compression" and return highlighted text and properties:

DECLARE @query nvarchar(50) = 'compression';
DECLARE @context varbinary(8000) = dbo.HitHighlightContext('Documents', 'file_stream', @query, -1);
SELECT TOP 10 Name, dbo.HitHighlightEx(@context, 'top-fragment', 200, stream_id) as description
FROM Documents
WHERE CONTAINS(file_stream, @query);

But better, I can take the property XML apart and return individual columns as properties:

DECLARE @query nvarchar(50) = 'compression';
DECLARE @context varbinary(8000) = dbo.HitHighlightContext('Documents', 'file_stream', @query, -1);
SELECT Name,
  Description.value('(/document/properties/property[@name="Title"]/@value)[1]', 'nvarchar(max)') as Title,
  Description.value('(/document/properties/property[@name="Author"]/@value)[1]', 'nvarchar(max)') as Author,
  Description.value('(/document/properties/property[@name="Tags"]/@value)[1]', 'nvarchar(max)') as Tags
FROM
(
SELECT TOP 10 Name, dbo.HitHighlightEx(@context, 'top-fragment', 200, stream_id) as description
FROM Documents
WHERE CONTAINS(file_stream, @query)
) a;

And even get properties that aren't in my SEARCH PROPERTY LIST. So, to get the "LastAuthor" property (Property/Details in Windows Explorer calls this "Last Saved By"):

DECLARE @query nvarchar(50) = 'compression';
DECLARE @context varbinary(8000) = dbo.HitHighlightContext('Documents', 'file_stream', @query, -1);
SELECT Name,
  Description.value('(/document/properties/property[@name="Title"]/@value)[1]', 'nvarchar(max)') as Title,
  Description.value('(/document/properties/property[@name="Author"]/@value)[1]', 'nvarchar(max)') as Author,
  Description.value('(/document/properties/property[@name="Tags"]/@value)[1]', 'nvarchar(max)') as Tags,
  Description.value('(/document/properties/property[@set="f29f85e0-4ff9-1068-ab91-08002b27b3d9" and @id="8"]/@value)[1]', 'nvarchar(max)') as LastAuthor
FROM
(
SELECT TOP 10 Name, dbo.HitHighlightEx(@context, 'top-fragment', 200, stream_id) as description
FROM Documents
WHERE CONTAINS(file_stream, @query)
) a;

As always, you can find a complete list of canonical Windows properties here.

So there you have it. To get in on the beta and make suggestions, download the ThinkHighlight beta here. BTW: I have NO affiliation of any kind with Interactive Thoughts. But I DO like the way they respond to enhancement requests.

@bobbeauch

Reacting to XEvents in almost real-time

I've heard it said that the main difference between SQL Server Extended Events and SQL Server Event Notifications is that you can't "react to" Extended Events. Event Notifications are written to the service broker service of your choice, and multiple Event Notifications can even be routed to the same service, even on different machines.

But there IS a way to react to Extended Events in "almost real time", that is, by capturing and reacting using the Extended Events XeReader API. It's certainly not as nice and built-in as Event Notifications, but never say never.

Mike Wachal introduced the XeReader to the world in a blog post around the SQL Server 2012 CTP3 timeframe. There are a variety of constructors for QueryableXEventData; you're interested in the one that takes a connection string and the name of an Extended Events session, which must be predefined (from Mike's blog):

QueryableXEventData stream = new QueryableXEventData(
    @"Data Source = (local); Initial Catalog = master; Integrated Security = SSPI",
    "alert_me",
    EventStreamSourceOptions.EventStream,
    EventStreamCacheOptions.DoNotCache);

In this case the "alert_me" is the name of an existing XEvent session. Now that you're capturing the events, you can take whatever kind of action you want. See the sample in Mike's blog for where you'd hook into this. You can even write the events to a service broker service, if you'd like.

To summarize, it's still not quite as good as Event Notifications:

Event Notifications are defined using DDL and the writing to the broker service is built-in. This method requires:
1. Custom .NET programming
2. An external program that must be running with 2 SQL connections (one to read the events, one to accomplish the actions)
3. Custom code to write the "react to" actions (although this could be templated and parameterized in the code)

In addition, there's a WMI event provider for "Event Notification"-type trace events that can be easily hooked into SQL Server Agent; you'd have to write an XeReader-based equivalent of that for XEvents, too. But until there's something built-in for XEvents, you have a "next best thing" alternative for now. Use the code in Mike's blog as a starting point.

Bob

@bobbeauch

Loading spatial data and cleansing-converting with SQL Server 2012

Someone asked, on Twitter, if it were possible to get a set of spatial data that included cities, counties (administrative regions) and countries in database table format for SQL Server spatial. I'd come across just these datasets (and more) as shapefiles at the CDC website and pointed them out. Then came the question about how to load these shapefiles into SQL Server.

I've always preferred FME by Safe Software to load spatial data into SQL Server. They have a variety of products for SQL Server, including a low-priced option that more than suffices for loading shapefiles. But what if you can't afford even Safe's low-priced option (you're doing a proof-of-concept, for example)? Then I'd recommend Shape2SQL, a free tool that's been around since SQL Server spatial's inception back in SQL Server 2008.

The first time I loaded these shapefiles (let's use the cntry00 (countries) file as an example), I ran into some problems loading them as the geography data type and had to bypass some records in the shapefile, leaving me with a map that was missing some countries. I'd always assumed that this was a ring-order problem that produced geographies larger than a hemisphere. SQL Server spatial in 2012 removed the "larger than a hemisphere" limitation, so I thought I'd try loading as geography again. Unfortunately I got the same errors. So I decided to reload the data as geometry, cleanse the data, and convert it to geography. What follows is a series of steps to do just that. It relies on some SQL Server 2012-specific functionality.

After trying to summarize this on Twitter as 140-character messages (I always refer to Twitter as "my 140-character lobotomy") someone suggested this would be a good blog entry. So here it is.

So, acquire the shapefile data. What you're looking for is a .shp file (the input to Shape2SQL) and possibly the .prj file, to determine the SRID to use. In this case, I'll go with SRID 4326 eventually, for geography. You can use an SRID with geometry too, but it doesn't affect the spatial library's computations like it does with geography.

1. Unzip the zipfile
2. Point Shape2SQL at the .shp file, choosing the following options:
   a. Planar Geometry
   b. Set SRID 4326
   c. Create spatial index – no (checkbox cleared)
   d. Spatial column name – geom (default)

Now all the data will load. So let's run some queries to clean things up and convert to geography.

-- First, look for Invalid geometries and get the reasons why they are invalid
-- IsValidDetailed is SQL Server 2012-specific
-- BTW, it wasn't a ring order problem after all.

-- Error: 24411
select geom.IsValidDetailed(), * from cntry00
where geom.STIsValid() = 0;

-- Make them valid, noting the row ids for your reference
-- rows 21, 42, 252
update cntry00
set geom = geom.MakeValid()
output inserted.id
where geom.STIsValid() = 0;

-- add geography column
alter table cntry00 add geog geography;

-- vacuous conversion to geography
update cntry00
set geog = geography::STGeomFromWKB(geom.STAsBinary(), 4326);

-- see if any invalid now… and the reason
-- 24413, 24414 near the poles (3 rows)
select geog.IsValidDetailed(), * from cntry00
where geog.STIsValid() = 0;

-- validate again
-- rows 1, 181, 215
update cntry00
set geog = geog.MakeValid()
output inserted.id
where geog.STIsValid() = 0;

-- check for more than a hemisphere, none found
select * from cntry00
where geog.EnvelopeAngle() > 90;

-- if there are geographies greater than a hemisphere
-- because of ring order problems fix 'em
-- SQL Server 2012-specific
update cntry00
set geog = geog.ReorientObject()
output inserted.id
where geog.EnvelopeAngle() > 90;

And there you have it. You might think that you can use the spatial results tab in SSMS to compare your data before and after validation, to see if validation has left artifacts or changed the shape. Don't bother, though; the spatial results tab calls MakeValid before displaying the data.
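If you do want a rough sanity check, one option (my own suggestion, not part of the workflow above) is to compare point counts between the geometry and geography columns, since MakeValid can add or collapse points during the second round of validation:

-- rows where validation changed the number of points
select id, geom.STNumPoints() as geom_points, geog.STNumPoints() as geog_points
from cntry00
where geom.STNumPoints() <> geog.STNumPoints();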

Hope this helps,
Bob

@bobbeauch

Hit Highlighting/Summarization product for SQL Server Full-Text Search

I've often been asked about add-ins for SQL Server's Full-Text Search feature. And the most requested feature for FTS has long been some way to accommodate hit highlighting. Recently, I've come across a beta for a product that does just this, along with document summarization formats for display: ThinkHighlight by Interactive Thoughts.

The add-in is implemented as a single SQLCLR assembly that exposes a pair of UDFs, HitHighlightContext and HitHighlight. The first UDF prepares a context; you only have to do this once per query, so this appears to be a nice optimization. HitHighlight is used to produce a column of output with summarized text and highlighted hits. You have a choice of three different output summarization strategies: baseline, complete, and top-fragment. Baseline and top-fragment are meant to be used in a manner similar to what you'd see used on search websites. There's an interactive demo that shows some examples.
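As a sketch of what usage might look like, assuming HitHighlight takes the same arguments as the HitHighlightEx calls shown in the follow-up post above, and a full-text indexed Documents filetable:

DECLARE @query nvarchar(50) = 'compression';
DECLARE @context varbinary(8000) = dbo.HitHighlightContext('Documents', 'file_stream', @query, -1);
SELECT TOP 10 Name, dbo.HitHighlight(@context, 'top-fragment', 200, stream_id) AS summary
FROM Documents
WHERE CONTAINS(file_stream, @query);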

The product comes with a nice installer that populates the assembly (it's an unsafe assembly, as it needs to access COM components like IFilters). However, it doesn't require that user databases be marked trustworthy; instead, there's an asymmetric key installed in master, in accordance with best practices. And I haven't been able to cause any problems with the functions, try as I might. And it works with any type of full-text indexed column (text and binary, as well as filetable) and query (all four FTS predicates/table-valued functions).

If you're looking for extended functionality currently not supported in the FTS feature, this might be right up your alley.

Cheers, Bob

@bobbeauch

Samples from SQLServerDays now posted on SQLskills website

Returned from SQLServerDays in Belgium last week. Had a wonderful time, hope you enjoyed it as much as I did. Thanks!

Samples/demos are posted on SQLskills website in the "Resources" section. Enjoy!

I've been rather remiss on my blogging lately, but I've acquired some interesting things to blog about in the interim and should be starting up again in earnest sometime this week…

Bob

@bobbeauch

I’ll be speaking at SQLServerDays in Antwerp in November

I'm honored to be invited to speak at SQLServerDays, Nov 19-20, in Schelle, Belgium, just outside of Antwerp. I've been to Belgium before, but this is my first time in Antwerp and I'm really looking forward to it.

I'll be doing a day-long precon on Nov 19 on New Features in SQL Server 2012, and following up on Tuesday with talks on two of my favorite topics: Spatial Data and SQL Performance, both about the improvements in SQL Server 2012.

Looking forward to catching up with old friends (the Belgian folks were always kind and invited me to their "country party" at TechEds over the years) and meeting some new friends. And there'll be lots of great speakers at this conference. If you're around Monday and/or Tuesday, stop by and say hi.

Selective XML Index – Why is this compelling?

OK. VM with the SP1 CTP is shut down. Fun's over (until I think of something new and cool, or you think of something and tell me about it). Time to answer Aaron's implied question about "what makes this so special it should be shipped in a Service Pack, no less?"

This could be a game changer with respect to XML querying in SQL Server, both in query speed and in data size reduction. But, you say, not many customers do XML querying in SQL Server. Well, this was one of the reasons why.

Cust: I stored some XML in SQL Server and queries are incredibly slow.
Me: You need at least a primary XML index to make queries tolerably fast.
Cust: But I really can't afford all that extra data. And it slows down inserts. And backups. And sucks up data cache in memory. And XML index rebuilds are single-threaded and offline. And…
Me: It's either indexing or slow queries. Take your pick.
Cust: I pick not.
Me: (Shrugs) Can't blame ya. How about partial decomposition into relational? (persisted computed columns)
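To make that last suggestion concrete, here's a minimal sketch of partial decomposition: promote a single XML value into a relational column through a schema-bound UDF, then persist and index it. The table, column, and path names here are hypothetical.

CREATE FUNCTION dbo.GetOrderTotal (@x xml)
RETURNS int
WITH SCHEMABINDING
AS
BEGIN
  -- promote one value out of the XML blob
  RETURN @x.value('(/Order/Total)[1]', 'int');
END;
GO

ALTER TABLE dbo.Orders ADD OrderTotal AS dbo.GetOrderTotal(xmlcol) PERSISTED;
CREATE INDEX IX_Orders_OrderTotal ON dbo.Orders(OrderTotal);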

So, depending on the specificity (locality) of your queries against "hunks of XML", this could be a game-changer. Selective XML indexes on a specific (hopefully small) set of pathspecs will speed up your queries and make the overhead MUCH more tolerable. So it might make the concept of using XML + queries worth a second look. There might even be some advantages vis-a-vis persisted computed columns, like the fact that persisted computed columns over UDFs (from partially decomposed XML) can't use parallelism. But, it's still an offline, single-threaded rebuild.

You WILL, however, have to track which queries you're running against the XML that you are "selectively indexing", and maintain your Selective XML Index accordingly (with alter-remove and alter-add for pathspecs; a sketch follows below). This is similar to the way you'd optimize SQL queries, except SQL queries come with lovely built-in assistants like the Database Tuning Advisor. I know some folks don't like DTA, but it's at least useful as a starting point. There's no DTA support for XML indexes of any kind (or spatial or FTS indexes, for that matter).
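The maintenance pattern looks something like this sketch, reusing the simple_sxi index from the earlier post; the added path name and XQuery type are made up:

-- drop a path that queries no longer use, add one they now need
ALTER INDEX simple_sxi ON tbl
FOR (
  REMOVE path124,
  ADD path125 = '/a/d' AS XQUERY 'xs:string'
);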

Even factoring out query/indexing speed, XML in SQL Server does have its limits. Like using the XQuery Data Model, which makes some constructs "valid XML" that some/most parsers don't support. Or that output from a SELECT that puts out XML has a single encoding (maybe CONVERT with an encoding parameter on output would help). And SQL Server itself and the XML data type internals don't have support for UTF-8, which bloats things a bit. And the version of XQuery SQL Server uses is getting long in the tooth, and is a subset. And XML-DML is based on a proprietary spec, because the real spec postdates SQL Server.

You might ask: "If this Selective XML Index thing is such a good idea, why didn't they do it this way originally?". Good question. SQL Server XML support (data type, indexing, XQuery, etc) was introduced in SQL Server 2005. Sparse columns, the technology on which SXI is based, was introduced in SQL Server 2008. How come it took so long? The chicken and egg problem above. Hard to add resources if it's not a popular feature, but it won't be popular UNTIL you add resources.

And finally, how relevant is XML today? It seems like lots of data-related things these days are using JSON (that's JavaScript Object Notation) instead, no? Well, no. JSON is used to send data over the network, and it's not necessarily the best way to store (and certainly not to query) data. Unless you're a NoSQL database like one of these. But these both seem to support both REST (XML)-based and JSON-based output, if not store data that way. Is the idea of a database preformatting data for the consumer (and even storing and querying the format in question) a good idea? Well, SQL Server tried this with XML (HTTP Endpoints over the SOAP protocol). It was discontinued in SQL Server 2012; too slow, and a waste of memory and cycles for the database server. And SOAP (on which it was based) seemed to have a new permutation every few months or so for a while. HTTP Endpoints couldn't (or didn't) keep up with these.

These days Microsoft, under the guise of OData, seems to be pushing/offering both JSON and AtomPub formats, but the formatting and serving to clients takes place on a middle-tier WCF service, for one of their implementations. And the spec lets you implement the (OData) protocol however you'd like. So I don't really think XML is dead, it's just not the only cool, portable, text-based data format in town anymore.

It's been a long day. I personally think SXI is a cool technology, but only time will tell about its adoption. IMHO, it's worth using a service pack as a ship vehicle to get it to folks sooner. But, as to the future, and text-based data formats, as Jim Bouton, I believe, once said/wrote: "I think I'll go wash my brain out with soap". And that's soap the cleanser, not SOAP the protocol. Of course. ;-)

Send comments.

Cheers, Bob
@bobbeauch