Selective XML Index – Why is this compelling?

OK. VM with the SP1 CTP is shut down. Fun's over (until I think of something new and cool or you think of something and tell me about it). Time to answer Aaron's implied question about "what makes this so special it should be shipped in a Service Pack, no less?".

This could be a game changer with respect to XML querying in SQL Server. Both in query speed and in data size reduction. But, you say, not many customers do XML querying in SQL Server. Well, this was one of the reasons why.

Cust: I stored some XML in SQL Server and queries are incredably slow.
Me: You need at least an primary XML index to make queries tolerably fast.
Cust: But I really can't afford all that extra data. And it slows down inserts. And backups. And sucks up data cache in memory. And XML index rebuilds are single-threaded and offline. And…
Me: It's either indexing or slow queries. Take your pick.
Cust: I pick not.
Me: (Shrugs) Can't blame ya. How about partial decomposition into relational? (persisted computed columns)

So, depending on the specificity (locality) of your queries against "hunks of XML", this could be a game-changer. Selective XML indexes on a specific (hopefuly small) set of pathspecs will speed up your queries and make the overhead MUCH more tolerable. So it might make the concept of using XML + queries a second look. There might even be some advantages vis-a-vis persisted computed columns. Like the fact that persisted computed columns over UDFs (from partially decomposed XML) can't use parallelism. But, it's still offline, single-threaded rebuild.

You WILL, however, have to track which queries you're running against the XML that you are "selective indexing". And maintain your Selective XML Index accordingly (with alter-remove and alter-add for pathspecs). Similar to the way you'd optimize SQL queries. Except SQL queries come with lovely built-in assistants like Database Tuning Advisor. I know some folks don't like DTA, but it at least useful as a starting point. There's no DTA support for XML indexes or any kind (or spatial or FTS indexes for that matter).

Even factoring out query/indexing speed, XML in SQL Server does have its limits. Like using XQuery Data Model, which make some constructs "valid XML" that some/most parsers don't support. Or that output from a SELECT that puts out XML has a single encoding (maybe CONVERT with an encoding parameter on output). And SQL Server itself and XML data type internals doesn't have support for UTF-8 which bloats things a bit. And that the version of XQuery SQL Server uses is getting long in the tooth, and is a subset. And XML-DML is based on a propriatary spec, because the real spec postdates SQL Server.

You might ask: "If this Selective XML Index thing is such a good idea, why didn't they do it this way originally?". Good question. SQL Server XML support (data type, indexing, XQuery, etc) was introduced in SQL Server 2005. Sparse columns, the technology on which SXI is based, was introduced in SQL Server 2008. How come it took so long? The chicken and egg problem above. Hard to add resources if it's not a popular feature, but it won't be popular UNTIL you add resources.

And finally, how relevent is XML today? It seems like lots of data-related things these days are using JSON (that's JavaScript Object Notation) instead, no? Well no. And JSON is used to send data over the network, not necessarily the best way to store (and certainly not to query) data. Unless you're a NoSQL database like one of these. But these both seem to support both REST (XML)-based and JSON-based output if not store data that way. Is the idea of a database preformatting data for the consumer (and even storing and querying the format in question) a good idea? Well, SQL Server tried this with XML (HTTP Endpoints over SOAP protocol). It will discontinued in SQL Server 2012, too slow. Waste of memory and cycles for the database server. And SOAP (on which it was based) seemed to have a new permutation every few months or so for a while. HTTP Endpoints couldn't (or didn't) keep up with these.

These days Microsoft under the guise of OData, seems to be pushing/offering both JSON and Atompub formats, but the formatting and serving to clients takes place on a middle tier WCF service, for one of their implementations. And the spec lets you implement the (OData) procotol however you'd like. So I don't really think XML is dead, its just not the only cool, portable, text-based data format in town anymore.

It's been a long day. I personally think SXI is a cool technology, but only time will tell about its adaptation. IMHO, it's worth using a service pack as a ship vehicle to get it to folks sooner. But, as to future, and text-based data formats, as Jim Bouton, I believe, once said/wrote: "I think I'll go wash my brain out with soap". And that's soap the cleanser, not SOAP the protocol. Of course. 😉

Send comments.

Cheers, Bob
@bobbeauch

One thought on “Selective XML Index – Why is this compelling?

  1. "How come it took so long?" It’s true that the company may not allocate enough resources. But a more important reason is that the selective index is more than selecing a portion of elements to index. It internally uses a special mapping, a product of a research project, which may not be an obvious solution for many engineers in the first place. We may understand more when more implementation details are revealed.

    As for the XML’s position, it is in question today because it was heavily abused by the W2C committee, IMHO. Document-oriented and data-oriented data clearly have quite different focuses. XML tries to embraces the two. A direct consequence is that the query standard must support the needs from both sides, whose implementations and efficiencies get screwed and no one wants to use it anymore. A notorious example, IMO, is document order. While document order makes sense for document-oriented apps, it is meaningless for data-oriented apps. Apps always define the "order" based on their own semantics, e.g., by time, by distance or by revenue. And there is no universal order, called document order, that makes sense to all apps. That’s why SQL has bag semantics and apps can always define what they want using ORDER BY. Supporting document order in query languages unnecessarily forces all data-oriented apps to pay the execution cost. Considering this, it is easy to understand why JSON is more popular nowadays in web apps, which mostly deal with "data" exchange, rather than "document" exchange.

Comments are closed.

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.