Choosing Paths for Selective XML Index – Part 2 – Using the SXI XEvents

In part 1 of this series, we tried a simple example from the Books Online without any SXI index at all, and an SXI with all the paths covered. No XEvents were emitted in either case. Now let’s change the XSI around (by dropping and recreating, although we could use ALTER, so envision a drop in between each example) to see what triggers the XEvents and what the query plan effects are.

– first let’s back off one path
CREATE SELECTIVE XML INDEX simple_sxi ON Tbl(xmlcol)
FOR (
path123 =  ‘/a/b’ as XQUERY ‘node()’
– path124 =  ‘/a/b/c’
);

– now, you get XML Reader, XML Reader with XPath Filer, CI Seek on SIDX, CI Scan on base table
– and “selective_xml_index_path_not_indexed” XEvent for /a/b/c
SELECT T.id FROM tbl T WHERE T.xmlcol.exist(‘/a/b[c = "43"]‘) = 1

– how about adding the SINGLETON keyword and XQuery Type?
CREATE SELECTIVE XML INDEX simple_sxi ON Tbl(xmlcol)
FOR (
path123 =  ‘/a/b’ as XQUERY ‘node()’,
path124 =  ‘/a/b/c’ as XQUERY ‘xs:string’ SINGLETON
);

– now, you get a single CI Seek on SIDX and CI Scan on base table
SELECT T.id FROM tbl T WHERE T.xmlcol.exist(‘/a/b[c = "43"]‘) = 1

– how about an XML.value query instead of XML.exists?
– get XML Reader with XPath Filter, XML Reader, and CI Scan on base table
– and “selective_xml_index_no_compatible_sql_type” XEvent for /a/b/c
SELECT T.id FROM tbl T WHERE T.xmlcol.value(‘(/a/b/c)[1]‘, ‘nvarchar(10)’) = ’43′;

– For this to use SXI, we need:
ALTER INDEX simple_sxi ON Tbl FOR (add path124s = ‘/a/b/c’ as SQL nvarchar(10) SINGLETON);

– SXI CI Seek and CI Scan on base table, no XEvents
SELECT T.id FROM tbl T WHERE T.xmlcol.value(‘(/a/b/c)[1]‘, ‘nvarchar(10)’) = ’43′;

BTW, if you use ‘varchar(10)’ rather than ‘nvarchar(10)’ in the query, the SXI won’t be used. One last one, let’s see if the SXI will be used if we specify an incorrect XQuery type:

CREATE SELECTIVE XML INDEX simple_sxi ON Tbl(xmlcol)
FOR (
path123 =  ‘/a/b’ as XQUERY ‘node()’,
path124 =  ‘/a/b/c’ as XQUERY ‘xs:double’ SINGLETON
);

– We get “selective_xml_index_no_compatible_xsd_types” XEvent for /a/b/c
– And XML Reader with XPath Filter, XML Reader, CI Seek on SIDX and CI Scan on base table
SELECT T.id FROM tbl T WHERE T.xmlcol.exist(‘/a/b[c = "43"]‘) = 1;

So, to summarize, in this post and the last post we’ve shown:
The SXI cannot be used to search for mixed content.
The ‘AS XQUERY node()’ specification can be used to check for existance of non-terminating node.
The XEvents for SXI can guide you to understanding which nodes to specify.
You must use “As SQL…” hint in order for SXI to be used in XML.value method.
You use “As XQUERY…” hint to use SXI for XQuery predicates or paths in the XML.exist method.

To go back to the BOL examples and test your understanding, write a selective XML index create statement for the following:

SELECT T.record,
T.xmldata.value(‘(/a/b/c/d/e[./f = "SQL"]/g)[1]‘, ‘nvarchar(100)’)
FROM myXMLTable T

Cheers, Bob

@bobbeauch

Choosing Paths for Selective XML Index – Part 1

About a month ago, I wrote a series of blog entries on the Selective XML Index introduced in SQL Server 2008 SP1. Just to summarize before I start up again, the Selective XML Index (I’ve abbreviate as SXI) is a feature introduced to allow indexing of only certain paths in an XML column. It uses side-tables using sparse column technology (that was introduced in SQL Server 2008) to effectively index pieces of XML documents while keeping the size of the index relatively low. It is unrelated to the “original” XML index technology (CREATE PRIMARY XML INDEX, etc) that creates large indexes (2-5 times size of the original documents) but indexes every element, attribute, and text node. Either XML indexing technology may be used independently or both may be used together. In addition to the (primary) SXI, you can have “secondary” SXIs, which are simply nonclustered indexes over specific columns in the side table.

Since I wrote the original series, Books Online topics for SXI have appeared. So I thought I’d work through a strategy for deciding to define these indexes and show how these indexes affect query plans. As implied by the name of the feature, since you index specific paths, deciding which paths to index is key. Each path takes up space in the side-table. In addition, you can specify “keywords” on the paths, such as XQuery data type, SQL data type, and whether this path is a singleton. I’ve also found four new Extended events to help in my decision. These events fire when you have a selective XML index in general, but the path may be missed or specified incorrectly. Namely,

selective_xml_index_no_compatible_sql_type – Occurs when a value() method is used in a query, specifying a sql type that differs from the type specified during selective XML index creation for that path.

selective_xml_index_no_compatible_xsd_types – Occurs when the user specifies a different resulting XSD type in a query, than the one that was used to promote the path during selective XML index creation.

selective_xml_index_path_not_indexed – Occurs when the path given in the user query is not promoted in the selective XML index, and thus the index is not used.

selective_xml_index_path_not_supported – Occurs when selective XML index is not used for a XML query due to the user using an unsupported path type.

I’ve created and started an XEvent session with these events, also including the text of the SQL Server statement involved. So let’s start with a simple example. After enabling SXI on the database (see my original posting), I took this one from the BOL and define the table as:

CREATE TABLE tbl (  id int identity primary key,  xmlcol XML );

Note that xmlcol is untyped XML, meaning we don’t have an associated XML Schema Collection. We’re looking to optimize, as an exemplar query:
SELECT T.id FROM tbl T WHERE T.xmlcol.exist(‘/a/b[c = "43"]‘) = 1;

And add a single row that will return true:
INSERT T values(‘<a><b><c>43</c></b></a>’);

Let’s try this query first without any index. The query plan is supplied as a file (because it’s fairly large), but consists of:

1. Clustered index (CI) scan of base table
2. Two “XML Reader with XPath Filter” steps (to select the nodes we want)    a. One step for the path outside the XPath predicate (a/b)    b. One step for the path outside the XPath predicate (c as child of a/b)
3. One “XML Reader” to get the value of text node c.

For such a simple query against a 1-row table, the estimated subtree cost is over 1. (1.09117). We can do better.

So let’s start by trying BOL suggested SXI:

CREATE SELECTIVE XML INDEX simple_sxi ON Tbl(xmlcol)
FOR (
path123 =  ‘/a/b’,
path124 =  ‘/a/b/c’
);

Unfortunately, this returns an error: Msg 6394, Level 16, State 1, Line 1 Found non leaf element indexed for the value with path ‘simple_sxi’ for selective XML index ‘path123′.

It’s “upset” because I have the path /a/b. It won’t index any path that not a “terminating path”, i.e. that has other XML underneath (in this case the “c” element). This is interesting for a few reasons. First off, I can’t index “mixed content” with the SXI. Mixed content in XML is content which contains both tags and text, for example an HTML page. Here’s a simple example.

– The “value” of node b contains both text and tags (the <i></i> tags).
INSERT tbl values(‘<a><b>This is some text. This word is <i>italicized</i>.</b></a>’);

Since storing/querying mixed content (e.g. WordXML) is a common use case, the fact that it won’t work with XSI is good to know. But, in this case, I just want to look for the existance of the /a/b node. BTW, existance in this case isn’t related to the “XML exist” method, but to my filter in my XML Reader With XPath Filter step. Is there a way to do this? Turns out, this is exactly what the “node()” specifier is for. So let’s try that.

CREATE SELECTIVE XML INDEX simple_sxi ON Tbl(xmlcol)
FOR (
path123 =  ‘/a/b’ as XQUERY ‘node()’,
path124 =  ‘/a/b/c’
);

No error on creating the SXI now. That’s good. And how about the query plan? Ahh…that’s much better.

1. Clustered index scan on the SXI side-table. Filter on value 43.
2. Clustered index seek on side-table for primary key of base table.
3. Clustered index seek to join to base table.

Query cost: 0.0098608. That’s a little better.

…And this post is getting a little long. So I’ll continue on with further investigations in the next post. But what about the XEvents? We haven’t seen any events yet. When we don’t have an SXI, it doesn’t suggest “missing paths”. And the SXI we did use contained all the appropriate paths, so there’s no XEvents on that query. So we’ll wait on those for now.

@bobbeauch

sxi_plans.zip

New XEvent events in SQL Server 2012 SP1

I’ve been looking a bit more into the SQL Server 2012 SP1 release, mostly to find diagnostics related to the Selective XML Index (SXI). I found some additional extended events for SXI (there are four of them in the “index” category) and I’ll cover them in, as well as exercise them, the next posting. But in my travels, I also came across five more new-to-SP1-from-RTM extended events. They occur in the “warnings” category when you’re using the XEvent GUI, and correspond, for the most part, to warnings that were added to query plans in SQL Server 2012 RTM. So if you liked these as warnings you’ll probably like them as events. They are:

plan_affecting_convert – Occurs when a type convert issue affects the plan.  The expression value indicates the conversion that can cause inaccurate Cardinality Estimation or prevent from using a seek based plan in the query plan choice.  If performance is affected, rewriting the query could help.

spatial_guess – Occurs when the optimizer cannot get enough join selectivity information.  Use this event to look at the queries where spatial index could be used and the optimizer guesses the spatial selectivity. (Note: I *always* seem to see this hint, even when there should be enough information for the optimizer to guess the spatial selectivity, and filed a Connect item for this one.)

unmatched_filtered_indexes – Occurs when the optimizer cannot use a filtered index due to parameterization.  For each such index an event is fired.  The unmatched_database_name, unmatched_schema_name, unmatched_table_name, unmatched_index_name fields give details of the filtered index that could not be used.

The “original” new SQL Server 2012 plan warnings, hash_warning and sort_warning exist in RTM in the “errors” category; they came over to extended events when all the SQL Profiler events were moved over.

There is also one additional warning.

optimizer_timeout – Occurs when the optimizer times out either due to spending too much time or hitting a memory limit.  Use this event to look at all the queries that are impacted by the optimizer timeout in a particular workload. This can be very useful when tuning a particular workload.

Finally, there is an additional replication XEvent in the replication category:

logreader_start_scan – Outputs the replnextlsn value from the database table that the first replcmds scan uses as a starting point.

Happy event-ing!

@bobbeauch

Speaking at DevWeek London in March

Lucky me, I’ll be speaking at DevWeek in London again this year. The festivities start off with pre-cons on March 4, I’ll do a whole day on writing best-performing T-SQL and application code. After that, I’ll be doing breakout talks on a variety of topics, from SQL Server Extended Events, to Data Movement in Windows Azure SQL Database, to SQLCLR internal, and more.

Hope to see some familiar faces at this one and make some new friends as well. Stop by and say hi…

@bobbeauch

How To Return Document Property Values in Full-Text Search

About a week ago I got involved with a question via Twitter that was posted originally to Stack Exchange. The question was "Is it possible to list document properties via FTS"? The question refers to SQL Server 2012's new support or property-based search with a syntax that looks like this:

SELECT name DocumentName, file_stream.GetFileNamespacePath() Path
FROM Documents
WHERE CONTAINS(PROPERTY(file_stream, 'Title'), 'data OR SQL');

This functionality has nice synergy with another SQL Server 2012 feature, the filetable feature. So you can store documents in a filetable and search with FTS. Including document properties. There's just one thing missing: no way to get the actual property values after you've searched for them. So, as I wrote in this suggestion on Connect:

"Suppose I have three documents that have authors of "Bob Beauchemin", "Bob Newhart" and "Bob Gupta". I can do a CONTAINS-based search for documents with authors that have a keyword "Bob", but can't bring back the entire propery value, which would be very useful.

Currently, the closest I can get is using sys.dm_fts_index_keywords_by_property. But think only produces the keywords, not the property values. So, in the case above, I wouldn't know if the actual author name was "Bob Newhart" or "Newhart Bob". and keywords wouldn't include noisewords."

Well, now there IS a solution. I'd just written a blog post about a standalone component for full-text search called ThinkHighlight. Wrote to the folks over at Interactive Thoughts and… voila! Their latest version supports returning not only highlighted text, but an XML format column that includes highlighted text AND property name-value pairs called HitHighlightEx. It includes all the properties as the canonical propertyset/propertyid, and if you're FTS index uses a SEARCH PROPERTY LIST, it will resolve the propertyset/propertyid to the property name from the SEARCH PROPERTY LIST.

As an example, let's use the demo script from the FTS blog. I can search for the keyword "compression" and return highlighted text and properties:

DECLARE @query nvarchar(50) = 'compression';
DECLARE @context varbinary(8000) = dbo.HitHighlightContext('Documents', 'file_stream', @query, -1);
SELECT TOP 10 Name, dbo.HitHighlightEx(@context, 'top-fragment', 200, stream_id) as description
FROM Documents
WHERE CONTAINS(file_stream, @query);

But better, I can take the property XML apart and return individual columns as properties:

DECLARE @query nvarchar(50) = 'compression';
DECLARE @context varbinary(8000) = dbo.HitHighlightContext('Documents', 'file_stream', @query, -1);
SELECT Name,
  Description.value('(/document/properties/property[@name="Title"]/@value)[1]', 'nvarchar(max)') as Title,
  Description.value('(/document/properties/property[@name="Author"]/@value)[1]', 'nvarchar(max)') as Author,
  Description.value('(/document/properties/property[@name="Tags"]/@value)[1]', 'nvarchar(max)') as Tags
FROM
(
SELECT TOP 10 Name, dbo.HitHighlightEx(@context, 'top-fragment', 200, stream_id) as description
FROM Documents
WHERE CONTAINS(file_stream, @query)
) a;

And even get properties that aren't in my SEARCH PROPERTY LIST. So, to get the "LastAuthor" property (Property/Details in Windows Explorer calls this "Last Saved By"):

DECLARE @query nvarchar(50) = 'compression';
DECLARE @context varbinary(8000) = dbo.HitHighlightContext('Documents', 'file_stream', @query, -1);
SELECT Name,
  Description.value('(/document/properties/property[@name="Title"]/@value)[1]', 'nvarchar(max)') as Title,
  Description.value('(/document/properties/property[@name="Author"]/@value)[1]', 'nvarchar(max)') as Author,
  Description.value('(/document/properties/property[@name="Tags"]/@value)[1]', 'nvarchar(max)') as Tags,
  Description.value('(/document/properties/property[@set="f29f85e0-4ff9-1068-ab91-08002b27b3d9" and @id="8"]/@value)[1]', 'nvarchar(max)') as LastAuthor
FROM
(
SELECT TOP 10 Name, dbo.HitHighlightEx(@context, 'top-fragment', 200, stream_id) as description
FROM Documents
WHERE CONTAINS(file_stream, @query)
) a;

As always, you can find a complete list of canonical Windows properties here.

So there you have it. To get in on the beta and make suggestions, download the ThinkHighlight beta here. BTW: I have NO affliation of any kind with Interactive Thoughts. But I DO like the way they respond to enhancement requests.

@bobbeauch

Reacting to XEvents in almost real-time

I've heard it said that the main difference between SQL Server Extended Events and SQL Server Event Notifications is that you can't "react to" Extended Events. Event Notifications are written to the service broker service of your choice, and multiple Event Notifications can even be routed to the same service, even on different machines.

But there IS a way to react to Extended Events in "almost realtime", that is, by capturing and reacting using the Extended Event XeReader API. It's certainly not as nice and built-in as Event Notifications, but never say never.

Mike Wachal introduced the XeReader to the world in a blog post at SQL Server 2012 CTP3. There are a variety of constructors for QueryableXEventData, you're interested in the one that uses a connection string and the name of an Extended Event session, which must be predefined (from Mike's blog):

QueryableXEventData stream = new QueryableXEventData(
  @"Data Source = (local); Initial Catalog = master; Integrated Security = SSPI",
  "alert_me",      EventStreamSourceOptions.EventStream,      EventStreamCacheOptions.DoNotCache);

In this case the "alert_me" is the name of an existing XEvent session. Now that you're capturing the events, you can take whatever kind of action you want. See the sample in Mike's blog for where you'd hook into this. You can even write the events to a service broker service, if you'd like.

It's still not quite as good, to summarize:

Event Notifications are defined using DDL and the writing the broker is built-in. This method requires:
1. Custom .NET programming
2. An external program that must be running with 2 SQL connections (one to read the events, one to accomplish the actions)
3. Custom code to write the "react to" actions (although this could be templated and parameterized in the code)

In addition, there's a WMI event provider for "Event Notification"-type trace events that can be easily hooked into SQL Server Agent. You'd also have to write an XeReader equivalent for XEvents. But until there's something built-in for XEvents, you have a "next best thing" alternative for now. Use the code in Mike's blog as a starting point

Bob

@bobbeauch

Loading spatial data and cleansing-converting with SQL Server 2012

Someone asked, on Twitter, if it were possible to get a set of spatial data that included cities, counties (administrative regions) and countries in database table format for SQL Server spatial. I'd come across just these datasets (and more) as shapefiles at the CDC website and pointed them out. Then came the question about how to load these shapefiles into SQL Server.

I've always preferred FME by Safe Software to load spatial data into SQL Server. They have a variety of products  for SQL Server, including a low-priced option that more than suffices for loading shapefiles. But what if you can't afford even Safe's low-priced option (you're doing it for proof-of-concept, for example). I'd then recommend Shape2SQL, a free tool that's been around since SQL Server Spatial's inception back in SQL Server 2008.

The first time I'd loaded these shapefiles let's use the cntry00 (countries) file as an example, I'd run into some problems loading them as geography data type and had to bypass some records in the shapefile, leaving me with a map that was missing some countries. I'd always assumed that this was a ring-order problem that produced geographies large than a hemisphere. SQL Server Spatial in 2012 fixed the "larger than a hemisphere" limitation, so I thought I'd try loading as a geography again. Unfortunately I got the same errors. So I decided to reload the data as geometry, cleanse the data and convert to geography. What follows is a series of steps to do just that. It relies on some SQL Server 2012-specific functionality.

After trying to summarize this on Twitter as 140-character messages (I always refer to Twitter as "my 140-character lobotomy") someone suggested this would be a good blog entry. So here it is.

So, acquire the shapefile data. What you're looking for is a .shp file (input into Shape2SQL) and possibly the .prj file to determine the SRID to use. In this case, I'll go with SRID 4326 eventually for geography. You can use an SRID with geometry too, but it doesn't effect the spatial library's computation like it does with geography.

1. Unzip the zipfile
2. Pointed Shape2SQL at the .shp file choosing the following options:
   a. Planar Geometry
   b. Set SRID 4326
   c. Create spatial index – no (checkbox cleared)
   d. spatial column name – geom (default)

Now all the data will load. So let's run some queries to clean things up and convert to geography.

– First, look for Invalid geometries and get the reasons why they are invalid
– IsValidDetailed is SQL Server 2012-specific
– BTW, it wasn't a ring order problem after all.

– Error: 24411
select geom.IsValidDetailed(), * from cntry00
where geom.STIsValid() = 0;

– Make them valid, noting the row ids for your reference
– rows 21, 42, 252
update cntry00
set geom = geom.MakeValid()
output inserted.id
where geom.STIsValid() = 0;

– add geography column
alter table cntry00 add geog geography;

– vacuous conversion to geography
update cntry00
set geog = geography::STGeomFromWKB(geom.STAsBinary(), 4326);

– see if any invalid now… and the reason
– 24413, 24414 near the poles (3 rows)
select geog.IsValidDetailed(), * from cntry00
where geog.STIsValid() = 0;

– validate again
– rows 1,181,215
update cntry00
set geog = geog.MakeValid()
output inserted.id
where geog.STIsValid() = 0;

– check for more than a hemisphere, none found
select * from cntry00
where geog.EnvelopeAngle() > 90;

– if there are geographies greater than a hemisphere
– because of ring order problems fix 'em
– SQL Server 2012-specific
update cntry00
 set geog = geog.ReorientObject()
 output inserted.id
 where geog.EnvelopeAngle() > 90;

And there you have it. You might think that you can use the spatial results tab in SSMS to compare your data before and after validation to see if validation has left artifacts or changed the shape. Don't bother, though; the spatial results tab called MakeValid before displaying the data.

Hope this helps,
Bob

@bobbeauch

Hit Highlighting/Summarization product for SQL Server Full-Text Search

I've often asked about add-ins for SQL Server Full-Text Search feature. And the most requested feature for FTS has long been some way to accomodate hit highlighting. Recently, I've come across a beta for a product that does just this, along with document summarization formats for display, ThinkHighlight by Interactive Thoughts.

The add-in is implemented as a single SQLCLR assembly that exposes a pair of UDFs, HitHighlightContext and HitHighlight. The first UDF prepares a context. You only have to do this once per query, so this appears to be a nice optimization. The HitHighlight one is used to produce a column of output with summarized text and highlighted hits. You have a choice of three different output summarization strategies: baseline, complete, and top-fragment. Baseline and top-fragment are meant to be used in a manner similar to one that you'd see used in search websites. There's an interactive demo that shows some examples.

The product comes with a nice installer that populates the assembly (its an unsafe assembly as it needs to access COM components like IFilters). However, it doesn't require that user databases be marked trustworth, there's an asymmetric key installed in master, in accordance with best practices.  And I haven't been able to cause any problems with the functions, try as I might. And it works with any type of full-text indexed column (text. binary, as well as filetable) and query (all four FTS predicates/table-valued functions).

If you're looking for extended functionality currently not supported in the FTS feature, this might be right up your alley.

Cheers, Bob

@bobbeauch

Samples from SQLServerDays now posted on SQLskills website

Returned from SQLServerDays in Belgium last week. Had a wonderful time, hope you enjoyed it as much as I did. Thanks!

Samples/demos are posted on SQLskills website in the "Resources" section. Enjoy!

I've been rather remiss on my blogging lately, but I've acquired some interesting things to blog about in the interim and should be starting up again in earnest something this week…

Bob

@bobbeauch

I’ll be speaking at SQLServerDays in Antwerp in November

I'm honored to be invited to speak at SQLServerDays, Nov 19-20, in Schelle Belgium, just outside of Antwerp. I've been to Belgium before, but this is my first time in Antwerp and I'm really looking forward to it.

I'll be doing a day-long precon on Nov 19 on New Features in SQL Server 2012. And following up on Tuesday with talks on two of my favorite topics: Spatial Data and SQL Performance, both about the improvements on SQL Server 2012.

Looking forward to catching up with old friends (the Belgian folks were always kind and invited me to their "country party" at TechEds over the years) and meeting up with some new friends. And they'll be lots of great speakers at this conference. If you're around Monday and/or Tuesday, stop by and say hi.