Using filetable with full-text? Lose the redundant index

I'm not sure how many full-text search over filetable demos I've seen (including some I've written that I was just perusing), that do something like this.

CREATE TABLE foo AS FILETABLE WITH (FILETABLE_DIRECTORY = N'ftdir');
CREATE UNIQUE INDEX fooidx ON foo(stream_id);
CREATE FULLTEXT INDEX ON foo (file_stream TYPE COLUMN file_type) KEY INDEX fooidx;

After looking at this a few times, I couldn't beleive there wasn't some kind of key contraint/index on the stream_id column defined as part of the filetable definition. And indeed there is. There's a UNIQUE constraint already that casues a nonclustered unique index to be built to enforce it. So why the extra nonclustered index, just to use full-text? A couple of extra lines of code should be able to tell us what the filetable named it's constraint/index, and we can use that name in the full-text index creation instead. And we won't need the redundant index.

But wait…we don't even have to do that.

CREATE TABLE… AS FILETABLE takes some additional parameters that make the extra code unnecessary. For our purposes, we only need to name the key constraint that corresponds to the KEY INDEX we're using in the full-text index. So…

CREATE TABLE foo AS FILETABLE WITH (FILETABLE_DIRECTORY = N'ftdir', FILETABLE_STREAMID_UNIQUE_CONSTRAINT_NAME=UQ_stream_id);
CREATE FULLTEXT INDEX ON foo (file_stream TYPE COLUMN file_type) KEY INDEX UQ_stream_id;

loses the redundant index. There are also parameters to specify the FILETABLE_PRIMARY_KEY_CONSTRAINT_NAME and FILETABLE_FULLPATH_UNIQUE_CONSTRAINT_NAME as well, if you need them, to make it easier to define foreign keys.  Although the filetable's related objects are fixed at creation time, you can add constraints to a filetable after it is created.

@bobbeauch

Using filestream, streaming I/O and SQL logins with impersonation

A number of months ago, I wrote an article using the SQL Server filestream feature programmatically. I've gotten a number of inquiries since from folks who *need* to login to SQL Server using SQL logins (often the same SQL login for everyone), but want to access the stream using streaming I/O (e.g. SqlFileStream data type). This doesn't ordinarily work, but how about using impersonation in SQL Server? 

I have a SQL login named "bob". And a Windows login named "ZMV03\Mary". Mary is not a Windows Administrator. Neither one is sysadmin in SQL Server. Added users for each one to the filestream-enabled database with appropriate permissions. In the master database, I granted bob impersonate privilege.

GRANT IMPERSONATE on LOGIN::[zmv03\Mary] to bob;

In the ADO.NET program, I impersonate the login before accessing the getting the transaction context and PathName. The rest of the program looks the same.

SqlConnection string contains Bob's credentials. Then:

command.CommandText = @"execute as login ='zmv03\Mary';select Top(1) Photo.PathName(),GET_FILESTREAM_TRANSACTION_CONTEXT () from employees2;revert";

And the streaming I/O using SqlFileStream and the transaction token works. Some cavaets:
 You need to impersonate the login, not the user correspoinding to the login.
 You can't use "execute as login=" in a CREATE PROCEDURE statement, but you can use it inside a stored procedure.
 You need to use the same Windows login with the SQL impersonation as with the client code. With ASP.NET for example, that would be the user running the ASP.NET worker process/app pool. You can't do the SQL impersonation using any old Windows login.
 Note that you can even revert the impersonation before accesing the stream. And that's probably a good idea.

And about the setup, I used:
 Local console app, running as zmv03\Mary, which is an individual, not a Windows group, login. Local means console app is running on same machine as SQL Server.
 SQL Server 2012. Database is defined with non_transacted_mode = off.  Machine was not a part of a domain.

Just listed these for sanity reasons. I'm not talking about filetable, but filestream. You can use transacted access with filetable, but normally you'd use non-transacted and open the file directly, using the network share. If this breaks using different conditions, I'd like to hear about it. I read email more often that I look for blog comments. This may be spec'd to work somewhere but I couldn't find it.

Also, I've been unsuccessful in using SQLCLR, filestream, and streaming I/O with the context connection. If you really, really want to do this, maybe using a real connection and external access might work. I didn't have time to try this, and am not sure if it would be particularly useful. But you would have to be running the SQLCLR proc as the exact (Windows login) that does the impersonation, and craft your SqlCommand text accordingly.

Also in case you confused this discussion with the error message "Either a required impersonation level was not provided, or the provided impersonation level is invalid." with older OSes, I'm not talking about that, but there is a fix for it. See this KB entry. The error that you get trying to access the stream after using SQL login and GET_FILESTREAM_TRANSACTION_CONTEXT is "Access denied".

@bobbeauch

Extended Events in Windows Azure SQL Database

Continued from the previous post, "Some new/upcoming Windows Azure SQL Database DMVs and diagnostics"… Read that one to find out who Vinod and Michal are.

So, before you all run out and try it, this isn't working yet. But, like the WASD event log view, the pieces seem to be in place. So, standard disclaimer applies. I only saw it working in Vinod and Michal's demo. The stuff I'm looking at may bear no resemblance to the final product. On the other hand, it may…

So, I went looking in WASD today. Some of the extended event views exist in every database today. This jibes with the demo that showed an XEvent node in SSMS Object Explorer in the Database/Programmability folder. As opposed to XEvents sessions, etc in SQL Server which live in master, these appear to live in each database. Which only makes sense if you know how WASD works…

Obviously there is no SSMS node in SSMS today. But for those of us who like XEvents in SQL Server 2008, sans GUI, that really doesn't present a problem.

They showed four pre-defined XEvent sessions, and said that initially there would only be predefined sessions. No user-defined sessions. The names they showed matched the output from "select * from sys.event_sessions" and are:

azure_xe_activity_tracking
azure_xe_query_batch
azure_xe_query_detail
azure_xe_query_waits

You can get metadata and "current" data by using views that are similarly named to the views in SQL Server. SSMS object explorer for WASD shows the following:

sys.event_session_actions
sys.event_session_events
sys.event_session_fields
sys.event_session_targets
sys.event_sessions

sys.dm_xe_database_session_event_actions
sys.dm_xe_database_session_events
sys.dm_xe_database_session_object_columns
sys.dm_xe_database_session_targets
sys.dm_xe_database_sessions

Some of these views are more "filled in" than others. For example, the views related the XEvent targets are currently empty. And there are no targets for any of the predefined event session.  Note the use of the naming convention "dm_xe_database…" for the views instead of "dm_xe_server…". Makes sense.

So, gui or no gui, target or no target, I just had to try this. Had my "stop" and "harvest" scripts available in case it did start, but…

alter event session azure_xe_query_waits on server state=start;

Produces: "session does not exist or you do not have permission"

Hmmm… how about:

alter event session azure_xe_query_waits on database state=start;

Produces: "Failed to start event session 'azure_xe_query_waits' because doing so will exceed the maximum number allowed for this database."

Sounds like that could be the ticket. In future. So, until then, folks diagnosing WASD issues wait for the announcement for this, and the event log views, with baited breath. When I asked some SQL Azure folks about the lack of announcements about the "intermediate" versions of WASD (i.e. there was no formal announcement on the Azure blog about the new DMVs), the response was "we'd like to make small version changes to WASD a non-event, and only announce large changes". This will be a big…eh… event. THANKS Vinod and Michal.

@bobbeauch

Some new/upcoming Windows Azure SQL Database DMVs and diagnostics

I didn't get to a whole lot of others' sessions at TechEd this year, but was pleasantly surprised by Vinod and Michal's session on "SQL Azure Administration and Application Self-Servicing". Although most of the really interesting stuff they showed wasn't in place yet (they demo'd on a "test portal" and test Windows Azure SQL Database (did I get the new name right?) servers, there are some things folks have been looking for that are available now. The near-future looks even better.

Windows Azure SQL Database (we gotta come up with an acronym for this, how about WASD for now?) has always contained a subset of DMVs from SQL Server. And the ones that are there don't expose "physical information", for example memory addresses, server-level info. That's because WASD is a multi-tenant database and you're not allowed to know what the other tenants are up to. Good.

But, when I lecture about maintaning indexes on WASD, the first comment out of folks' mouths was always "but it's missing sys.dm_db_index_physical/operational/usage_stats"! No longer. These DMV are there and work NOW, today. Modulo some physical information, of course, but its enough to tune your indexes with. Hoorah!

Here's a complete list, from the presentation (and from my experimentation, they all work now):

sys.dm_exec_query_memory_grants
sys.dm_exec_cached_plans

sys.dm_exec_trigger_stats
sys.dm_exec_procedure_stats

sys.dm_db_index_physical_stats
sys.dm_db_index_operational_stats
sys.dm_db_index_usage_stats

sys.dm_db_missing_index_details
sys.dm_db_missing_index_group_stats
sys.dm_db_missing_index_groups
sys.dm_db_missing_index_columns

So diagnose away… The only one that I've had "trouble" with so far is sys.dm_db_missing_index_details. Most times I'll get an error message "Msg 40197, Level 20, State 1, Line 1 The service has encountered an error processing your request. Please try again. Error code 40143." The information I see seem to tie this error with database failover, so maybe I'll leave this one alone for the time being.

The more interesting views that don't seem to be active yet relate to SQL Server log information. You don't have access the log file directly in WASD (multi-tenant, figure it out). But there are two views in MASTER that will expose a subset of this information, as well as a portal-to-come. The actual views they mentioned seem to be there now, but return no information. So it's "in place", but not "hooked up" yet, I'm guessing. These views are:

sys.database_connection_stats – connection information
sys.event_log – event log information

The view names are obviously self-explanatory.

The coolest thing they demo'd was related to the question I always get, "Does WASD support SQL Profiler?". There will probably never be SQL Profiler support, AFAIK, but there will be (soon) Extended Event (the new profiler) support. Guess that deserves a post of its own.

@bobbeauch

Saving the contents of a SQL Server XML column to a file

While preparing for my TechEd talk on filetables and full-text search, I came across this tidbit that I thought was interesting.

With the advent of XML support in databases, specifically SQL Server's XML data type in SQL Server 2005, there has been two main ways to store XML; in files on the file system, and in a database. SQL Server has a built-in way to move XML data from files to the database, namely the OLE DB BULK provider and OPENROWSET SINGLE BLOB/CLOB/NCLOB. It looks like this

– Load XML from a file
DECLARE @x XML;
SET @x = ( SELECT * FROM OPENROWSET( BULK 'C:\invoice1.xml', SINGLE_BLOB) AS x);

– Or direct to table
INSERT INTO dbo.invoice_docs(invoice)
SELECT * FROM OPENROWSET(BULK 'C:\invoice1.xml', SINGLE_BLOB) AS x;
go

Because there is no corresponding built-in way to move XML from the database to files, I'd always referred to this asymmetry as "all your XML belong to us". Or, a bit more rudely, as "SQL Server is the roach motel of XML files, they get in, but they can't get out".

All joking aside, in SQL Server 2012, filetables provide a straightforward way to export XML column data (or any data for that matter) as files using only T-SQL. Suppose I have an have a filetable named Documents. This code stores the XML resume from the row with JobCandidateID of 1 to a file in the root directory of the filetable named "JobCandidate1.xml". To store this into a subdirectory using T-SQL, see the code in this blog entry. You can then copy the file to the file system location of your choice.

– get existing database blob into files
declare @resume varbinary(max);
declare @name varchar(40);

select @resume = cast(Resume as varbinary(max))
        ,@name = 'JobCandidate' + cast(JobCandidateID as varchar(1)) + '.xml'
from AdventureWorks.HumanResources.JobCandidate
where JobCandidateID = 1;

insert dbo.Documents(Name, file_stream) values(@name, @resume);

Now XML files can get out!

@bobbeauch

Resources from my TechEd talks are available

The demos from my TechEd talks in Orlando and in Amsterdam are posted on the SQLskills website under Resources. Sorry for the delay.

TechEd Amsterdam isn't near finished yet tho', and I'll be at the Ask-The-Experts tomorrow and at one of the Windows Azure SQL Database booths on Friday morning.

If you're around, stop by and say hi! And for those folks at the talk today, thanks for all the great questions…

@bobbeauch

Does everybody get that? (generalizing esoteric optimization techniques)

This post is an example of the "bring your own problem" approach in the classroom, inspired by upcoming SQLskills Immersion Events in August. See previous posting for the reason behind the blog post title. 

Figured that I couldn't wrap up this series without mentioning T-SQL, as most of the "bring your own problem"-type problems have to do with SQL, as you might guess. Rather than go after specifc problems that have showed up in teaching/consulting gigs, I thought I'd mention a class of problems related to what you might call "generalizing esoteric optimization techniques".

The SQL Server query optimizer does a pretty darn good job of coming up with a reasonable query plan most of the time. And of adapting as the distribution of data amongst tables and within a table changes. In fact, you could site the dynamic adaptation as one of its best features. That said, there are times when it doesn't always come up with the very best plan. my favorite example is the "Reason For Early Termination" = "Good enough plan found" in a query plan. And we've all seen the parameter sniffing behavior where the "sniffer" in the optimzer sniffed a non-representative value(s) and came up with a plan that doesn't work well with any other set of values.

Folks who have been consultants or teachers for long enough are used to discovering the subtle anamolies in optimization, and coming up with neat workarounds for specific cases. These usually take the form of query syntax rewrites or use of query hints. The results, based upon a known set of static test data, can often be dramatic. However, once you've seen the dramatic results (possibly demonstrated somewhere, but especially when you think of it yourself) there is a natural inclination to apply the neat trick to EVERY case that even looks vaguely related. And the possibility exists that because the distribution of the data may not be the same as the original, or may change over time, it won't really be an optimization in every situation. Sometimes with disasterous consequences. That's when, as a consultant, you might get to consult on it again.

Another situation may be relying on extreme parallelization, which may work well in a controlled single-user situation. And parallelism in general ISN'T esoteric. But, in an OLTP system, that plethora of processors may not always (or will never) be available. Someone in a class recently showed me two SQL stored procedures that logically should have been close to equivalent. One ran over 1000 times faster than the other (that's clock-on-the-wall time). The faster one broke the problem into a series of queries, each of which could be parallelized. The other used a single query with a construct that couldn't be parallelized. The big difference was that the parallelizable process used 48 processors (all the processors on the box) in each step. That's a situation that's unlikely in an OLTP system. If its the only sproc running on the box, and you run one of them at a time, this may be fine. Maybe it would still be better if there were 2 processors available. You can play with MAXDOP in a test situation to try this out.

I'm not saying that there aren't some general rules or optimization opportunities that almost always work. Or that you shouldn't try to squeeze every last bit of performance out of your queries, especially the one that's running 1000 times per second. But if you do use a neat trick, especially one that seems counterintutive, document it in your stored procedure. Along with any table/table set distributions that could make a difference. And prepare to go back and revisit these queries every once in a while, to ensure that your optmization still does what it did when you chose it. Especially with hard-coded query hints.

Does everybody get that?

@bobbeauch

Loading ShapeFiles into SQL Server 2008 and 2012

Since I find myself "in a spatial mood", I thought I'd write one more today. About importing data from shapefiles. According to Wikipedia "The Esri shapefile or simply a shapefile is a popular geospatial vector data format for geographic information systems software". And the question always goes something like this: "Does SQL Server have anything built-in to import shapefiles?"

Short answer is "no, not built in". There's some vendor products, like SAFE Software's FME series, including an inexpensive consumer version called "Microsoft SQL Server Data Loader" as well as their well-known FME Desktop. And if you have ESRI software, there's likely a shapefile-to-SQL loader in there. But folks just starting out are likely to have heard of a free tool called Shape2SQL. It's not the only free tool out there, but its quite easy to use.

One problem that arises reasonably frequently when importing shapefiles into the GEOGRAPHY data-type is that SQL Server's GEOGRAPHY is sensitive to ring order. Many shapefile authors are not. This manifests itself as an error in Shape2SQL (you can skip the offending row(s) IIRC) because a GEOGRAPHY with the wrong ring order will be almost always look to SQL Server like its greater than a hemisphere. That's an error in SQL Server 2008/2008 R2, but not in SQL Server 2012, where GEOGRAPHY instances of greater than a hemisphere are supported. You won't get the error (unless he's explicitly checking for larger than hemisphere, but I don't think that's the case) but you will have the wrong shape (e.g. instead of Iceland, you'll have everything that ISN'T Iceland!).

If you're sure that you shouldn't have any GEOGRAPHY instances of more than a hemisphere, here's an easy fix for SQL Server 2012.

– assumes you are using SRID 4326 (default) and have a column named "Geog" in table named "spatialtable"
declare @h geography = 'FULLGLOBE';  — SRID 4326 is default

update spatialtable
set Geog = Geog.ReorientObject()
where Geog.STArea() > @h.STArea() / 2;

BTW, you can also use this method if you're doing a conversion from GEOMETRY (which doesn't care about ring order) to GEOGRAPHY and might have ring order problems. If you might have instances greater than a hemisphere, then you will have to "eyeball-check" (meaning, check the instance in the SSMS spatial results tab for reasonableness) instances on both sides of the hemisphere cutoff. If you're on SQL Server 2008, SAFE Software can automatically correct geographies greater than a hemisphere.

@bobbeauch

Does everybody get that? (Spatial Index Reprise)

This post is an example of the "bring your own problem" approach in the classroom, inspired by upcoming SQLskills Immersion Events in August. See previous posting for the reason behind the blog post title.

I believe it was John Lennon who wrote: "You say you got a real solution. Well, you know, We'd all love to see the plan". I'm certain he wasn't referring to the query plans I'd "love to see" when I am diagnosing a problem, but that's the cognitive overlap that happens when you're listening to tunes while troubleshooting. And sometimes just "seeing the (estimated) plan" isn't enough.

I've been working with spatial data and spatial indexing for quite a while now. And have already published a blog entry on how to ensure your spatial index is used. But, since then, I've answered countless questions in classes, at conferences, and from clients about why the spatial index isn't being used without a hint.

The top two things to be sure of for starters are:
 1. Make sure you are on at least SQL Server 2008 SP1 to
 2. Make sure the query processor can sniff your spatial parameter values (i.e. use sproc or sp_executesql with spatial parameter passed in)

Here's two of more curious ones I've gotten lately. First one is client is using 4 processor machine. All is going well, but they figure it will be faster with a 24-proc machine. The spatial queries go slower in testing, by orders of magnitude. Turns out the 4-proc was using spatial index. With one tester on the 24-proc, the optimizer decided it was best to use a highly parallelised non-spatial-index plan. Setting maxdop down (or using a hint) was a temporary fix. The optimizer algorithm was changed along the way (2008 R2 SP1, IIRC). But, to think of this problem another way, unless you have an entire 24-proc machine to yourself, the actual number of processors available won't usually be equal to the total number of processors on the machine. In fact, on a loaded system, you may only get one or two processors at runtime. So setting maxdop down so the spatial index will be used or hinting isn't a bad idea. BTW, the spatial index plans can and do use parallelism for most or all of the plan if you have enough available processors.

The optimizer is superb at "reasoning" over most relational queries and indexes, but tends to underestimate the spatial index, which can have a dramatic effect when used. I think this happens for two reasons, first is an underestimation of the expense of STIntersects(), especially when one of the spatial operands is a complex polygon. Second is overhead of doing a seek into the base table for each candidate row, which is a side-effect of using the spatial index. So also choosing the correct spatial index parameters is crucial for cutting down on the number of seeks. Second one was that a query in this form: …WHERE spatialcol.STDistance(@spatialparm)*(conversionfactor) < number. This not only wouldn't use the spatial index, you couldn't even hint it. The reason for this is the spatial index is only usable for certain forms, using …WHERE spatialcol.STDistance(@spatialparm)= number/conversionfactor made using the spatial index viable.

So, to wrap up, if your spatial query does not use the spatial index, or seems to run too long, try hinting the spatial index first. If you get an error, you're not using a "spatial index viable" predicate. Once you've got the query "hintable", then you can work having the optimizer choose the spatial index. And compare clock-on-the-wall time with and without the spatial index. Not the estimated query plan.

Does everybobdy get that?

@bobbeauch

Does everybody get that? (XML or relational for “multi-table” XML documents)

This post is an example of the "bring your own problem" approach in the classroom, inspired by upcoming SQLskills Immersion Events in August. See previous posting for the reason behind the blog post title.

SQL Server is a relational database that supports XML. But just because your data is received in XML format doesn't necessarily mean you should store it that way for querying. And if you intend to do any serious querying of your XML, be prepared to use at least a primary XML index.

I'd had students who planned to use XML in SQL Server for a big application with lots of queries (in the hundreds of thousands per day) and lots of inserts. Think of your typical OLTP system does with XML documents instead of multi-table joins. A single document might contain information from 4-5 or more relational tables. Is it a good idea to store what's usually relational data as XML?

XML is stored in SQL Server in a propriatary binary format. SQL Server's XQuery is not standalone, but always combined with SQL statements. To be useable in these statements the XML is decomposed (or the part of its that's needed to satisfy the XML predicate is) into relatonal in-memory structures. See the TVF steps XmlReader or XmlReader with XPath Filter iterators in the query plan. That's more-or-less the structure that is persisted on disk (as an internal table) when you create XML indexes. The primary XML index and secondary XML indexes are simply clustered and nonclustered indexes over the internal table. With these indexes SQL Server has a reasonably fast implimentation of XQuery, for a relational database, because its using the relational engine. But "raw" XML has no statistics on the element and attribute values to help the relational engine select a best plan. You'll see huge costs for this decomposition in SQL query plans, but bear in mind that the costs may be overestimated because of overestimating of cardinality, even with schema-valid XML.

So you can do partial decomposition once-or-more per-query or decompose once (during insert of a row with an XML column) and query the structure using the XML index many times. Either way, you're using relational queries… but the overhead of the decomposition is "non-trivial". And the table structure is non-trivial, taking up between 2-6 times the space of the original XML. 

If you want to store XML but need to query only one or two element or attribute values, consider persisted computed columns, based on XQuery. You can even persist a subset of the XML document into a separate table (think instead-of trigger) and query that, to cut down on the size of the XML index(es). Remember though, you will be doing one or more XML queries against non-indexed data with each insert into the base table.

The upshot of all this is, if you have data that can be modeled as relational, use relational tables. If you'd thought of using XML columns and queries because it is a ubiquitous format that everyone uses (the "its cool" factor), well…you'll thank me later. And if you do go with XQuery, don't think "I won't use XML indexes, they are too big". Use 'em…and/or persisted computed columns.

Does everybody get that?

@bobbeauch