Loading spatial data and cleansing-converting with SQL Server 2012

Someone asked, on Twitter, if it were possible to get a set of spatial data that included cities, counties (administrative regions) and countries in database table format for SQL Server spatial. I'd come across just these datasets (and more) as shapefiles at the CDC website and pointed them out. Then came the question about how to load these shapefiles into SQL Server.

I've always preferred FME by Safe Software to load spatial data into SQL Server. They have a variety of products  for SQL Server, including a low-priced option that more than suffices for loading shapefiles. But what if you can't afford even Safe's low-priced option (you're doing it for proof-of-concept, for example). I'd then recommend Shape2SQL, a free tool that's been around since SQL Server Spatial's inception back in SQL Server 2008.

The first time I'd loaded these shapefiles let's use the cntry00 (countries) file as an example, I'd run into some problems loading them as geography data type and had to bypass some records in the shapefile, leaving me with a map that was missing some countries. I'd always assumed that this was a ring-order problem that produced geographies large than a hemisphere. SQL Server Spatial in 2012 fixed the "larger than a hemisphere" limitation, so I thought I'd try loading as a geography again. Unfortunately I got the same errors. So I decided to reload the data as geometry, cleanse the data and convert to geography. What follows is a series of steps to do just that. It relies on some SQL Server 2012-specific functionality.

After trying to summarize this on Twitter as 140-character messages (I always refer to Twitter as "my 140-character lobotomy") someone suggested this would be a good blog entry. So here it is.

So, acquire the shapefile data. What you're looking for is a .shp file (input into Shape2SQL) and possibly the .prj file to determine the SRID to use. In this case, I'll go with SRID 4326 eventually for geography. You can use an SRID with geometry too, but it doesn't effect the spatial library's computation like it does with geography.

1. Unzip the zipfile
2. Pointed Shape2SQL at the .shp file choosing the following options:
   a. Planar Geometry
   b. Set SRID 4326
   c. Create spatial index – no (checkbox cleared)
   d. spatial column name – geom (default)

Now all the data will load. So let's run some queries to clean things up and convert to geography.

– First, look for Invalid geometries and get the reasons why they are invalid
– IsValidDetailed is SQL Server 2012-specific
– BTW, it wasn't a ring order problem after all.

– Error: 24411
select geom.IsValidDetailed(), * from cntry00
where geom.STIsValid() = 0;

– Make them valid, noting the row ids for your reference
– rows 21, 42, 252
update cntry00
set geom = geom.MakeValid()
output inserted.id
where geom.STIsValid() = 0;

– add geography column
alter table cntry00 add geog geography;

– vacuous conversion to geography
update cntry00
set geog = geography::STGeomFromWKB(geom.STAsBinary(), 4326);

– see if any invalid now… and the reason
– 24413, 24414 near the poles (3 rows)
select geog.IsValidDetailed(), * from cntry00
where geog.STIsValid() = 0;

– validate again
– rows 1,181,215
update cntry00
set geog = geog.MakeValid()
output inserted.id
where geog.STIsValid() = 0;

– check for more than a hemisphere, none found
select * from cntry00
where geog.EnvelopeAngle() > 90;

– if there are geographies greater than a hemisphere
– because of ring order problems fix 'em
– SQL Server 2012-specific
update cntry00
 set geog = geog.ReorientObject()
 output inserted.id
 where geog.EnvelopeAngle() > 90;

And there you have it. You might think that you can use the spatial results tab in SSMS to compare your data before and after validation to see if validation has left artifacts or changed the shape. Don't bother, though; the spatial results tab called MakeValid before displaying the data.

Hope this helps,
Bob

@bobbeauch

Hit Highlighting/Summarization product for SQL Server Full-Text Search

I've often asked about add-ins for SQL Server Full-Text Search feature. And the most requested feature for FTS has long been some way to accomodate hit highlighting. Recently, I've come across a beta for a product that does just this, along with document summarization formats for display, ThinkHighlight by Interactive Thoughts.

The add-in is implemented as a single SQLCLR assembly that exposes a pair of UDFs, HitHighlightContext and HitHighlight. The first UDF prepares a context. You only have to do this once per query, so this appears to be a nice optimization. The HitHighlight one is used to produce a column of output with summarized text and highlighted hits. You have a choice of three different output summarization strategies: baseline, complete, and top-fragment. Baseline and top-fragment are meant to be used in a manner similar to one that you'd see used in search websites. There's an interactive demo that shows some examples.

The product comes with a nice installer that populates the assembly (its an unsafe assembly as it needs to access COM components like IFilters). However, it doesn't require that user databases be marked trustworth, there's an asymmetric key installed in master, in accordance with best practices.  And I haven't been able to cause any problems with the functions, try as I might. And it works with any type of full-text indexed column (text. binary, as well as filetable) and query (all four FTS predicates/table-valued functions).

If you're looking for extended functionality currently not supported in the FTS feature, this might be right up your alley.

Cheers, Bob

@bobbeauch

Samples from SQLServerDays now posted on SQLskills website

Returned from SQLServerDays in Belgium last week. Had a wonderful time, hope you enjoyed it as much as I did. Thanks!

Samples/demos are posted on SQLskills website in the "Resources" section. Enjoy!

I've been rather remiss on my blogging lately, but I've acquired some interesting things to blog about in the interim and should be starting up again in earnest something this week…

Bob

@bobbeauch

I’ll be speaking at SQLServerDays in Antwerp in November

I'm honored to be invited to speak at SQLServerDays, Nov 19-20, in Schelle Belgium, just outside of Antwerp. I've been to Belgium before, but this is my first time in Antwerp and I'm really looking forward to it.

I'll be doing a day-long precon on Nov 19 on New Features in SQL Server 2012. And following up on Tuesday with talks on two of my favorite topics: Spatial Data and SQL Performance, both about the improvements on SQL Server 2012.

Looking forward to catching up with old friends (the Belgian folks were always kind and invited me to their "country party" at TechEds over the years) and meeting up with some new friends. And they'll be lots of great speakers at this conference. If you're around Monday and/or Tuesday, stop by and say hi.

Selective XML Index – Why is this compelling?

OK. VM with the SP1 CTP is shut down. Fun's over (until I think of something new and cool or you think of something and tell me about it). Time to answer Aaron's implied question about "what makes this so special it should be shipped in a Service Pack, no less?".

This could be a game changer with respect to XML querying in SQL Server. Both in query speed and in data size reduction. But, you say, not many customers do XML querying in SQL Server. Well, this was one of the reasons why.

Cust: I stored some XML in SQL Server and queries are incredably slow.
Me: You need at least an primary XML index to make queries tolerably fast.
Cust: But I really can't afford all that extra data. And it slows down inserts. And backups. And sucks up data cache in memory. And XML index rebuilds are single-threaded and offline. And…
Me: It's either indexing or slow queries. Take your pick.
Cust: I pick not.
Me: (Shrugs) Can't blame ya. How about partial decomposition into relational? (persisted computed columns)

So, depending on the specificity (locality) of your queries against "hunks of XML", this could be a game-changer. Selective XML indexes on a specific (hopefuly small) set of pathspecs will speed up your queries and make the overhead MUCH more tolerable. So it might make the concept of using XML + queries a second look. There might even be some advantages vis-a-vis persisted computed columns. Like the fact that persisted computed columns over UDFs (from partially decomposed XML) can't use parallelism. But, it's still offline, single-threaded rebuild.

You WILL, however, have to track which queries you're running against the XML that you are "selective indexing". And maintain your Selective XML Index accordingly (with alter-remove and alter-add for pathspecs). Similar to the way you'd optimize SQL queries. Except SQL queries come with lovely built-in assistants like Database Tuning Advisor. I know some folks don't like DTA, but it at least useful as a starting point. There's no DTA support for XML indexes or any kind (or spatial or FTS indexes for that matter).

Even factoring out query/indexing speed, XML in SQL Server does have its limits. Like using XQuery Data Model, which make some constructs "valid XML" that some/most parsers don't support. Or that output from a SELECT that puts out XML has a single encoding (maybe CONVERT with an encoding parameter on output). And SQL Server itself and XML data type internals doesn't have support for UTF-8 which bloats things a bit. And that the version of XQuery SQL Server uses is getting long in the tooth, and is a subset. And XML-DML is based on a propriatary spec, because the real spec postdates SQL Server.

You might ask: "If this Selective XML Index thing is such a good idea, why didn't they do it this way originally?". Good question. SQL Server XML support (data type, indexing, XQuery, etc) was introduced in SQL Server 2005. Sparse columns, the technology on which SXI is based, was introduced in SQL Server 2008. How come it took so long? The chicken and egg problem above. Hard to add resources if it's not a popular feature, but it won't be popular UNTIL you add resources.

And finally, how relevent is XML today? It seems like lots of data-related things these days are using JSON (that's JavaScript Object Notation) instead, no? Well no. And JSON is used to send data over the network, not necessarily the best way to store (and certainly not to query) data. Unless you're a NoSQL database like one of these. But these both seem to support both REST (XML)-based and JSON-based output if not store data that way. Is the idea of a database preformatting data for the consumer (and even storing and querying the format in question) a good idea? Well, SQL Server tried this with XML (HTTP Endpoints over SOAP protocol). It will discontinued in SQL Server 2012, too slow. Waste of memory and cycles for the database server. And SOAP (on which it was based) seemed to have a new permutation every few months or so for a while. HTTP Endpoints couldn't (or didn't) keep up with these.

These days Microsoft under the guise of OData, seems to be pushing/offering both JSON and Atompub formats, but the formatting and serving to clients takes place on a middle tier WCF service, for one of their implementations. And the spec lets you implement the (OData) procotol however you'd like. So I don't really think XML is dead, its just not the only cool, portable, text-based data format in town anymore.

It's been a long day. I personally think SXI is a cool technology, but only time will tell about its adaptation. IMHO, it's worth using a service pack as a ship vehicle to get it to folks sooner. But, as to future, and text-based data formats, as Jim Bouton, I believe, once said/wrote: "I think I'll go wash my brain out with soap". And that's soap the cleanser, not SOAP the protocol. Of course. ;-)

Send comments.

Cheers, Bob
@bobbeauch

Selective XML Indexes – Learning the rules through error messages

Experimenting without docs is a good way to get lots of error messages along the way. I'm learning by attrition, and it seems like I'd be better off knowing the rules. It's not the best way to learn anything, but you can at least learn the limits through the error messages. I was tired of trying seemingly random things and being surprised, so I decided to look for "selective XML index" error messages.

So, here's how to find (or imply) "the rules" for Selective XML Indexes.

– Ids 662, 970, 2735, 6342, 6343 and 6367-6399 and 9535-9539 inclusive
select message_id, text from sys.messages
where language_id = 1033
and text like '%selective XML%';

Now some of these are general (parameterized) error messages, so you won't get exact limits in terms of numbers. But, I tried the catagorize and summarize these for my own use. Here's some interesting ones.

1. Not all SQL data types are supported (not surprising)
2. Not all XQUERY (XSD) data types are supported (there were some surprises here, like xs:integer not supported)
3. Selective XML indexes' typing can't contradict types in an XML SCHEMA COLLECTION
4. Selective XML indexes' typing can affect new data you try and add. For example, if the types don't match the SXI-defined types and you're not using an XML SCHEMA COLLECTION.
5. You can't index a non-leaf node for the value or use * wildcard as the last step in the path.
6. There is a maximum depth to the path in a pathspec.

And so on… you can read 'em as well as I can.

Hopefully, this, and the previous entries should be enough to get you started. I did write these in one sitting and most of them were written in "realtime" so if you have any additions, questions, and/or find something neat to do with these drop me a line by email (be creative, figure out my email address) or ping me on twitter.

Motivation in the next (which is also the last, for now) posting.

@bobbeauch

Selective XML Index – Secondary Selective XML Indices

I've only been talking so far about the "primary" Selective XML Index. But you can also create 0-n "secondary" Selective XML Indexes. The syntax looks a little bit like secondary "non-selective" XML Indexes in that you use the "USING [related XML index]". You specify one (and only one) pathspec. Additionally, the XML value that the pathspec "points to" (see previous post) cannot be data type xs:untypedAtomic (varbinary(max)).

create xml index sxi_secondary1 on foo(thexml)
using xml index fooidx
for (pathname1); — strongly typed to SQLVARCHAR(25)

create xml index sxi_secondary2 on foo(thexml)
using xml index fooidx
for (anotherxq); — strongly typed to XQUERY xs:double (SQL FLOAT)

– Msg 102, Level 15, State 1, Line 3
– Incorrect syntax near ','.
– Means: error: only one pathspec name allowed
create xml index sxi_secondary3 on foo(thexml)
using xml index fooidx
for (pathname1, pathname2);

– Msg 6391, Level 16, State 0, Line 1
– Path 'pathname2' is promoted to a type that is invalid for use as a key column in a secondary selective XML index.

– Means: can't have varbinary(max) as index key
create xml index sxi_secondary3 on foo(thexml)
using xml index fooidx
for (pathname2);

These work the way you'd think. They create a NONCLUSTERED index over the (internal) table that comprises the "primary" Selective XML Index. The NONCLUSTERED index key is the "value" column in the path spec, along with the traditional "backpointer" to the primary key of the "primary" Selective XML index. These NONCLUSTERED index are FILTERED indexes, filtered on the value column in the path spec.The filter is "[Valuecolumn] IS NOT NULL"

So, if an XML value in a pathspec is sparse (meaning, it doesn't appear in every/most rows) these secondary Selective XML Indexes could be helpful. And remember, like Michael said, these indexes are chosen by the optimizer, not the algebrizer, until the way the algebrizer chooses the primary Selective XML index, but after the algebrizer decides that the Selective XML Index could be useful at all.

@bobbeauch

Selective XML Index – Implementation Details

A while after I posted part 1, there was a twitter message from Aaron (@AaronBertrand), a well-known "smart person" in the SQL Server space, about using feature packs as "ship vehicles", especially when they (possibly) have the effect of requiring keeping all instances in-sync at the service pack level to keep things working and possibly even making SP feature relevent during a restore. I answered that "it must be a compelling feature". And then we reminisced about features like "vardecimal" and "15k partitions". So I'll discuss why it is a compelling feature (or at least why I happen to think it is) later on in the process.

But for now, about how it's implemented…

Before this even. You can add or remove path specs on a Selective XML Index. I found this one by "reading into an error message" and trying syntax until it worked.

– add a pathspec
alter index fooidx on foo
for
(
add another_pathspec = 'foo/bar2'
);

–remove the pathspec we just added
alter index fooidx on foo
for
(
remove another_pathspec
);

As Micheal said, the selective XML index is implemented using sparse columns. The first column (or columns) is the primary key of the base table, which they always named pk1…n. Then a row_id column, which is incremented if there are multiple matching nodes in the same row.

Each pathspec consists of at least two columns: a path column and 1-2 value column(s). Path column is named path_[n]_id (which is always varbinary(900) and sparse).

Value column is named either "[pathspec_name]_[n]_value" or "[pathspec_name]_[n]_sql_value" (for SQL type pathspec). If the pathspec contains a wildcard, there is an additional column named path_[n]_hid (varbinary(900)) where HID stands for hierarchyid (i.e. the path in the document). A pathspec with an XPath/XQuery wildcard would look like this: '/foo/*/baz2'.

As an aside, remember that the SQL Server XML data type (node paths) and hierarchyid data type use the same encoding scheme, known as Ordpath, see my old blog entrry "Ordpath, ordpath, everywhere".

All of the columns (except pk1 and row_id) are sparse. As far as data types (because the nice thing is that SPARSE columns can preserve strong typing) as typed as:
 Type 1: xs:untypedAtomic = varbinary(max)
 Type 2: The closest SQL data type to XSD type (e.g. xs:double = SQL FLOAT)
 Type 3: The SQL type specified in the "pathspec AS SQL…" specification. Including length and collation.

The length of all of the sparse columns are based on the length of the column (either express or implied) in the pathspec.

So, what's this all mean?

It means that, depending on how selective each pathspec is, how many pathspecs are designation in the Selective XML Index definition and the sparseness of each of the column values, the Selective XML Index can be MUCH, MUCH smaller than the (original) PRIMARY XML INDEX (which is implemented as a 12-column side-table, one row per XML node). Because you're only indexing those values your care about. And individual node data CAN be sparse in XML (element text can be, but XML attribute nodes are even more likely to be sparse).

And it also means that the limitations of SQL Server 2008 SPARSE columns apply. See this starting point in Books Online to understand these limits in detail. And, because there isn't a XML columnset column, you are limited to theoretical maximum of 511 pathspecs, if none of them are wildcards and there's a 1 column primary key (1024 maximum columns in table without a columnset, – 2 / 2. I don't know what the real limit is yet.

@bobbeauch

Selective XML Indexes in SQL Server – First Try…it works

OK, so let's try something. Load a bunch of documents. Although, actually this part works with an empty table.

create selective xml index fooidx2 on foo(thexml)
for
(
pathname1 = 'foo/bar/baz2'
);

Msg 6379, Level 16, State 201, Line 1
selective XML index 'fooidx' already exists on column 'thexml' in table 'foo'.
Multiple selective XML indexes per column are not allowed.

Only one SXI at a time. With 1-n path specifications.

About path specifications. To paraphrase the talk…there are three basic types of paths.

1. Simple: /foo/bar/baz1
2. With optimizations:
   /foo/bar/baz2 as XQUERY 'xs:string' MAXLENGTH(25)
   /foo/bar/baz3 as XQUERY 'xs:double' SINGLETON
   /foo/bar/baz4 as XQUERY node() — check for node existance
3. For use with the XQuery value method:
   /foo/bar/baz5 as SQL VARCHAR(25)

Forms 1 and 2 are for XQuery with any XQuery method. Form 3 is for XQuery value method. This seems to turn out to be more important than I thought it was at first.

Rule for using it in query plan:
   1. If SXI (selective XML index) exists and path matches specification, use it
   2. If PXI (primary XML index) exists, use it
   3. Else use XML Reader with XPath Filter

First experiment:

Defined (the one and only one) SXI index for this XML column on path /foo/bar/baz. Type 1 path specification. Issued query:

SELECT thexml.value('(/foo/bar/baz)[1]', 'varchar(25)')
FROM foo;

SXI isn't used in the plan. Hmm…he said it was a decision of the algebrizer, not the optimizer. And path sure looks like it matches. Futz with this for an hour or so. I actually added more documents. Used three path specs that (I thought) all matched. I thought this worked in his demo, but no joy. Tried to force the SXI with index hint (you can't force an SXI in any case). But it's the algebrizer, not the optimizer, Bob…

Instead try index on /foo/bar/baz AS SQL VARCHAR(25). Type 3 path specification. Now, it's used.

@bobbeauch

Getting started with Selective XML Indexes in SQL Server

So, it was announced that SQL2012 SP1 CTP4 was released yesterday. You've likely already heard this (more than once, everyone seems to revel in repeating announcements from the team). And it contains a new feature (I was surprised, thought that wasn't supposed to happen, new features in service packs, meanwhile…) called Selective XML Indexes. Interesting concept, especially because dragging around a primary XML index that's 3-6 bigger than the data itself is a burden. But, without it, querying on any non-trivial XML is quite slow. To put it nicely…

Downloaded and installed the SP (remember its a CTP, DON'T install this on a production system). Got my XML test bed out to try. First problem: no docs on the feature. Well, the only information that exists (OK, that I could find quickly) is in Michael Rys' TechEd US/Europe talks. Downloaded, listened to them, you can find them with a web search. I'll try not to repeat his content. There's DDL in his talk. So:

create table foo (id int identity primary key, thexml xml);

create selective xml index fooidx on foo(thexml)
for
(
pathname1 = 'foo/bar/baz',
pathname2 = 'foo/bar2'
);

Msg 9539, Level 16, State 1, Line 1
Selective XML Index feature is not supported for the current database version

Bummer. Really? Well, there is a trick to it (thanks, Michael).

exec sys.sp_db_selective_xml_index null, 'on' — turn on for current database

BTW, if you want to restore/attach this database now to a pre-SP1 system, you have to turn the feature off.

exec sys.sp_db_selective_xml_index null, 'off'

Which likely means deleting all the selective XML index stuff you have in the database too. Haven't tried it yet, though. And first parameter is the database name. NULL means "current database".

And it works now. So now we can all play along at home. ;-) Let a hundred selective XML indexes bloom. Michael suggests there will be BIG performance improvements and little disk space used.

@bobbeauch