I've devoted the last few blog entries to describing what's in that spatial index analysis proc output and how to use it and interpret the results. Just wanted to describe the rest of the information and how the information relates to what you'd find in a query plan that uses a spatial index.

The procs return:
1. Information about the index
2. Information about the query sample
3. How efficient the index is when used against the query sample

The information you get about the index is:
-  Bounding Box dimensions - BOL says they are NULL for geography but they are always -180, -90, 180, 90 for geography
-  GridLevelSize - 4 levels
-  Cells_Per_Object - maximum tessellated cells per row, specified in DDL
-  Page and Row counts for the spatial index
-  Avg number of cells per base row (how many of the CellsPerObject were actually used)
-  Height, Width, Area of a cell - Geography is always the same, but its nice to know how big the cells are compared to the other levels. Not sure what the unit of measure is for geography (BOL says "depends on SRID"), by looking at the number it doesn't look like it could be meters.
-  CellAreaToBoundingBox percentage - 4 levels
-  Total SRIDs found

The most interesting is likely the info about the number of cells in each level (including level0 - off the bounding box). Any levels that are 0 rows don't appear in the XML output. You get info for:
-Number of Object Cells for Index
-Number of Interior Cells
-Number of Intersecting Cells
-Number of Border Cells
-also some numbers about cells normalized to the leaf grid

The cells are useful in how they relate to the primary filter. The interior, intersecting, and border cells are used by the internal filter.

For the query sample, you get the same cells numbers. No "cells normalized to the leaf" however. BTW, for the geography index ignore the BOL comment that indicates that if Level0 of the query sample is 1 then index is not useable for this query. For geography its always 1.

This relates to the following information that appears in a query plan that uses spatial indexes.
1. All of the query samples (in the spatial predicate) are tessellated using the same parameters (density/max cells)
2. The table is joined against all parent/children of the index (the index levels are actually a hierarchy)
3. Index ranges are calculated
4. A join of the tesselated samples on index is done. This results in the number of rows selected by the primary filter.
5. The internal and secondary filters are applied in a separate query iterator (of type Filter) later on in the plan.

The upshot of this is that, not only is the index dimensions used to tessellate the base table rows, the same dimensions as the index are used to tessellate the query sample for the spatial predicate (query sample, in the sproc). So, for example, I'd always thought that spatial indexes over points should specify all-HIGH density and maybe 2 max cells per object. But if your query sample contains large polygons, those same dimensions will be used against the query sample too. So maybe all-HIGH density for point-based indexes isn't a hard and fast rule.

Hopefully this should give you enough information to investigate your indexes and queries using the spatial analysis stored procs. As far as patterns for "which specifications of index is best", more patterns will show up as I investigate more different patterns of spatial data. For now, I'm dividing these into three main categories:

1. Spatial index over table of points, Query sample is polygon.
2. Spatial index over table of polygons, Query sample is point.
3. Spatial index over table of polygons, Query sample is polygon.

With the simplifying premise that linestrings are thin like a point (ie they won't fill a cell), but if you put a STBuffer around them, they're more like polygons.

Happy index hunting.

A little more about the query sample that gets fed into the spatial index procs.

Query sample is a singleton geometry or geography. It's not a query. So it's not as straightforward as "here's a spatial query that could use an index, run this query and show me the figures". So this procedure is not going to work any differently for a query that use STIntersects vs STDistance vs... It's going to show you how an index on table A would be utilized given a single operation.

It maps most closely when your query is something like:
select ... from tablea a where a.geog.STIntersects(@g)=1

Is this case @g is your query sample, and Number_Of_Output_Rows is the number of rows the query returns. But what if you have a conditional query or query that has more than one row as output, like

select ...
from tablea a,
     tableb b
where a.id = @someid and b.geog.STIntersects(a.geog)=1

In this case you could specify your query sample like this:
DECLARE @my_query_sample geography = (select geog from tablea a where a.id = @someid)

...And you're looking to an analyze usage on an geography index on tableb. But this will cause an error if "select geog from tablea a where a.id = @someid" returns more than one row. How do I use the sprocs in this case? Immediately you think:
DECLARE @my_query_sample geography;
SELECT @my_query_sample = geog from tablea a where a.id = @someid

or even:
select @my_query_sample = a.geog
from tablea a,
     tableb b
where a.id = @someid and b.geog.STIntersects(a.geog)=1

Let's analyze this way of specifying query sample. If your query sample looks like it's going to represent more than one row, note that the spatial index is always use in (at least I've only seen it used in) a SEEK. So if your query window-producing query represents N rows, you need to choose one row. Hopefully a representative row. This would work if the "top(1) without an order by" row was representative:

DECLARE @my_query_sample geography = (select top(1) geog from tablea a where a.id = @someid)

And realize that you're going to have to do this operation (repsented by the output) N times. Let's make this (hopefully) more clear.

select a.zipcode, b.point
from zipcodes a
     points b
where a.zipcode IN (select zipcode from zipcodes where state = 'OR')
and a.geog.STIntersects(b.geog)

There are 430 zipcodes in Oregon. There are 1000 points in your table. You have an index on zipcodes and one on points. Remember that STIntersects is commutative (a.geog.STIntersects(b.geog) = b.geog.STIntersects(a.geog)) which index is "better"? Notice that the query returns 1 column from zipcodes and 1 column from points, so you can't just STUnion all of the Oregon zipcodes together. You'd get the same points that way, but have no idea which zipcode goes with which point. So choices are:

1. Representative query sample for zipcodes in Oregon, analyze point index, realize that you're going to do this "query" 430 times.
DECLARE @my_query_sample geography = (select top(1) geog from zipcodes a
 where a.zipcode IN (select zipcode from zipcodes where state = 'OR') )
2. Representative query sample for point, analyze zipcode index, realize that you're going to do this "query" 1000 times.
DECLARE @my_query_sample geography = (select top(1) geog from points)

Be aware that doing 430 or 1000 seeks against the spatial index might "scare off" the query optimizer from choosing that plan. You could also declare you're query sample like this:

DECLARE @geog GEOGRAPHY
select @geog = a.geog --- or @geog = b.geog
from zipcodes a
     points b
where a.zipcode IN (select zipcode from zipcodes where state = 'OR')
and a.geog.STIntersects(b.geog)

But that's still going to give you one representative combination point or zipcode as a query sample (ie @geog is still a singleton). If all of your points are in Oregon, it's going to produce a similar analytic output, and if all your points are not in Oregon, it might be better to specify it that way. But then you would have to know how many points in Oregon to deduce how many seeks. And you can't check the ouput against "Number_Of_Rows_Returned" because you don't know which row it actually chose.

If you are just needing points, not zipcode-point combinations you could STUnion together all the zipcodes in Oregon and run the query that way. The spatial analysis procs give you a 1-1 mapping against that query. Or use a state table rather than a zipcode table. Of course this is predicated on the fact that zipcodes don't overlap multiple states.

Hope this answers the question, "what do I use as a query sample to troubleshoot query X and index Y?".

Last post discussed the filters. The procs report some raw numbers and some derived numbers. Here's a cheat sheet, although the info is all in the BOL.

N = Number of rows in the table
O = Number of rows output
P = Number of rows selected by primary filter (by the index)
S = Number of rows selected by internal filter (by the index optimizations)

Then,

P-S       = Number_Of_Times_Secondary_Filter_Is_Called (number of times they ran the expensive operation)
(N-P)/N = Percentage_Of_Rows_NotSelected_By_Primary_Filter
S/P       = Percentage_Of_Primary_Filter_Rows_Selected_By_internal_Filter
S/O       = Internal_Filter_Efficiency %
O/P       = Primary_Filter_Efficiency %

<Percentage_Of_Rows_NotSelected_By_Primary_Filter>9.942714746408677e+001
<Percentage_Of_Primary_Filter_Rows_Selected_By_Internal_Filter>9.095871716137836e+001
<Internal_Filter_Efficiency>9.593378913278158e+001
<Primary_Filter_Efficiency>9.481405663596043e+001

So how efficient is the index? Let's change this info into statements, using the values from the previous blog entry:

N = 511650, O = 2779, P = 2931, S = 2666

<Number_Of_Times_Secondary_Filter_Is_Called>265</Number_Of_Times_Secondary_Filter_Is_Called>
<Percentage_Of_Rows_NotSelected_By_Primary_Filter>9.942714746408677e+001
<Percentage_Of_Primary_Filter_Rows_Selected_By_Internal_Filter>9.095871716137836e+001
<Internal_Filter_Efficiency>9.593378913278158e+001
<Primary_Filter_Efficiency>9.481405663596043e+001

1. They had to run STIntersects 265 times, rather than 511650 times if they didn't use the index.
2. 99.4% of the rows were eliminated by using the index
3. The Internal_Filter (internal optimizations) eliminated 90% of the times they would have had to run Intersects, even if the used the index (primary_filter).
4. The Internal_Filter was able to "deduce" 95.9% of the rows in the answer.
5. The Primary_Filter was able to find 94.8% of the rows in the answer.

Pretty good index, huh? And pretty good internal filter. Probably is going to be used.

Let's look at one more, but this time from using a geoemtry index. The index is over in geometry column that contains only points. The query sample is a polygon.

<Base_Table_Rows>11267
<Number_Of_Rows_Selected_By_Primary_Filter>265
<Number_Of_Rows_Selected_By_Internal_Filter>149
<Number_Of_Times_Secondary_Filter_Is_Called>116
<Number_Of_Rows_Output>413
<Percentage_Of_Rows_NotSelected_By_Primary_Filter>9.764799857992367e+001
<Percentage_Of_Primary_Filter_Rows_Selected_By_Internal_Filter>5.622641509433962e+001
<Internal_Filter_Efficiency>3.607748184019371e+001
<Primary_Filter_Efficiency>1.558490566037736e+002

Note that in this case, the internal filter only elimates 56.2% of the rows that the primary filter finds. It only finds 36% of the output rows. But note that in this case, primary filter efficiency = 155%. How can that be? Well, this is geometry and the primary filter didn't find all the rows in the answer. There must be some outside the bounding box. Remember that Primary_Filter_Efficiency is defined as Output_Rows / Primary_Filter rows.

More coming...

The spatial index diagnostic procs' most enlightening pieces of information have to do with the filter counts and efficiencies reported at the end of the set of columns or XML document output. But what are these filters exactly and what do the results mean?

The proc output refers to three filters:
1. Primary Filter
2. Internal Filter
3. Secondary Filter

Remember that the spatial indexes are based on tessellation or tiling. Issac has some nice diagrams that show how the tiling works in theory. Please review his indexing posts before reading these and when/if you get confused by the terminology. Depending on the size and type of the spatial object, the object may completely cover a tile (think big polygon), partially cover (intersect) a tile (think point) or not cover a tile. It can also hit the border of a tile. The tiles are referred to in the spatial procs as cells.

Remember also that the point of a spatial index is to reduce the number of Intersect operations (or other operations covered by the spatial index) actually performed by using tiling. I liked Isaac's analogy that "if I want to find roads that intersect Madison WI, I only have to look at roads that intersect Wisconsin". So the index (primary filter) selects all roads that intersect Wisconsin. It can get false positives (roads that intersect Wisconsin but not Madison). The "Secondary Filter" is this case is the Intersects operation itself.

So what's the "internal filter"?  If you think of the primary filter as filtering out 100 polygons out of 10000, they'd still have to run Intersects on the 100 polygons to actually see if they qualify. But say that the index and query sample both *completely* cover a cell. Or index completely covers the cell and query sample touches it. We know that the cell qualifies for Intersect in that case, we don't need to run Intersect on the query sample for that combination. That's the Internal Filter.

So let's look at some sample output. In this case the index was over a geography column that contains only points. The query sample is a polygon. 

<Base_Table_Rows>511650</Base_Table_Rows>
<Number_Of_Rows_Selected_By_Primary_Filter>2931</Number_Of_Rows_Selected_By_Primary_Filter>
<Number_Of_Rows_Selected_By_Internal_Filter>2666</Number_Of_Rows_Selected_By_Internal_Filter>
<Number_Of_Times_Secondary_Filter_Is_Called>265</Number_Of_Times_Secondary_Filter_Is_Called>
<Number_Of_Rows_Output>2779</Number_Of_Rows_Output>

So: 511650 rows in the table. Using the index (primary filter) eliminates all but 2931 of them. The internal filter determines that 2666 are true hits, not false positive. They have to run Intersect 265 times (2931 - 2666) times. And the number of rows returned is 2779, so there were some false positives. Good thing they ran Intersect. BTW. Using the method Filter() instead of Intersect() just means they're returning all the rows selected by primary filter (2931).

Finally, remember if you're using a geography index, there is no bounding box, so no part of the rows or query sample can be outside the index. With geometry, some of the rows or query sample can be outside the bounding box. This means that with geography, Rows_Selected_By_Primary_Filter >= Number_Of_Rows_Output. With geometry, Number_Of_Rows_Output may be > Rows_Selected_By_Primary_Filter as well.

I've been looking at the spatial index stored procedures (sp_help_spatial_geography_index_xml and friends) a little harder recently, in an attempt to help answer two questions.

1. If I have a spatial query, why does/doesn't it use my spatial index?
2. If I have a specific spatial query in mind, what are the best values to choose for my spatial index density?

In addition to the diagnostic information presented in the procs, there are a few things to keep in mind when attempting to answer the first question. Currently, the query optimizer is estimating "high" when it comes to using the spatial index, so there are occasions were the spatial index should be used and isn't. Isaac mentioned this in his PDC talk, and mentioned that this would be changed in the next refreshed. As far as I could see, it isn't in SQL Server 2008 CU2.

Also, remember that SQL Server will never use cardinality estimates for a parameter set in a variable by the SET statement unless you use OPTION(RECOMPILE) on the statement. This is a general SQL Server behavior, not a spatial behavior because of when SQL Server creates the query plan. So if you want cardinality estimates for the variable @g in the query "select geog from table where geog.Intersects(@g)=1" you must either pass @g in as a stored procedure parameter or use sp_executesql. Realize that if you use a parameterized query in a client app, the client API stacks will change the parameterized query to invoke sp_executesql, so you're OK here. According to Michael Rys' SQLPASS talk, one spatial-specific behavior is that cardinality won't be used if @g if specified as a constant, that is "select...where geog.Intersects('POINT(1 1)')=1 " either. Finally SQL Server doesn't change the query plan in a parameter value changes, general behavior, not spatial-specific.

Back to the diagnostic stored procs.

The information that the spatial index procs provide consists of three main categories:
1. Information about the index itself
2. Information about the query sample
3. How efficient the query index is when used against the query sample

The query sample is specified as an input parameter to the diagnostic procs, and would be "@g" in the Intersects queries above.

If you've read my previous blog post on the diagnostic procs, read it again, I've revised it. Filters are the subject of the next post.

Sorry to appear after a blog drought with theory meta-type blog entries. Too much time at conferences pondering technologies, I guess.

The "relational database bigots" I hang out with don't like LINQ at all. They hope it would shrivel up in a corner and become part of the fad-technology graveyard. Or they're waiting to make big bucks fixing the performance problems they think will ensue.

For the life of me I can't figure out why. I think they're reacting to LINQ to SQL and Entity Framework, layers on top of relational data access that happen to use LINQ to obtain sets of data. Every layer of abstraction is going to make it easier to write less-performant SQL. Granted. But think of it a different way....

You, my SQL-centric friends, have spent much of your career specializing in this declarative query language. SQL. The best language so far (modulo C.J. Date's third manifesto "D" language) to query the relational data structures  at the center of your universe. LINQ is just "SQL for the rest of us(them)". Disguising iterators, everyone is now using SQL (well, at least the SELECT...FROM...WHERE...GROUP BY...ORDER BY part) to query everything and anything. You were right to invest such effort into this query language. With LINQ, it's become universal.

The impression I get (I could be wrong) is that LINQ might become the next OLE DB. OLE DB was supposed to go beyond relational data sources to provide extensions to the column-row paradigm and provide "data access" over non-relational systems like "file system" (OLE DB provider for WebDav), Active Directory, Search Server (fulltext search worked through OLE DB once), Exchange, etc, etc. (Mostly) similar access to every source of data. You wouldn't believe what data sources folks wrote OLE DB providers for in the good-ol days. Anything that could be cast to an API that could implement (or be cast to an implementation of) the Connection/Session/Command/Rowset co-types (or even just IRowset) was fair game.
 
LINQ is a system that provides a *SQL-like* (high-level) language for anything (including collections of objects/.NET types, XML, etc, etc). Anything that has any API that can implement IQueryable/IUpdateable. Yes, LINQ to SQL (or LINQ to Entities) doesn't always generate exactly the most optimal SQL that generates the best query plans. NO argument. But T-SQL/PL-SQL/etc doesn't always generate the most optimal query plans for the storage engine 100% of the time either. Although they will get closer than LINQ-to-database, because they (T-SQL, etc) are more tightly coupled to the relational model and query optimizer than LINQ to SQL is. Every level of abstraction (translation) pays a performance penalty; the goal is to minimize it, while making the abstraction similar. And all of the same kinds of providers that speak the LINQ language are appearing. Just like the OLE DB providers did.

So perhaps, the next version of SQL Server will support not only OLE DB-based Linked Servers, but also "LINQ-ed" servers (bad pun intended) as non-SQL Server rowset sources. Hmmm...

Categories:
Data Access

I've been following the ADO.NET Entity Framework since its inception. Never did buy the idea about it being "more than an ORM", thought it was just marketing hype. After all, if it looks like an ORM and smells like an ORM, then... But, the more I've been wallowing around in it and thinking about its position in the world in general, the more I'd thought about what could drive someone to the "more than an ORM" conclusion. Here's three things that come to mind.

1. It's built on the ADO.NET provider model. This is actually interesting for two reasons. First, anyone with an ADO.NET data provider can plug in without having to change all the tools. More importantly, it provides the EF an entry to any tool that consumes ADO.NET data providers. Imagine being able to  do reporting against your object model with SSRS or Crystal Reports. Or use your CDS (conceptual data store) as an integration source.  Or integration with Office. Granted that currently these utilities don't know anything about IExtendedDataRecord currently, but Microsoft's Report Builder let's you build a user-friendly object model for reporting...why not use a model that already looks like your business objects?

2. Lack of IExtendedDataRecord support brings us to point two. The ESQL language extensions. You can use ESQL to provide a query over the data source, flattening the hierarchies out when needed. And that's just what most tools need, a connection string and a query string. That produces rowsets (even flat rowsets) to be consumed by tools that don't care a bit about object models. Without waiting for all these tools to add support for an object-oriented conceptual data source.

3. Finally, I came across this one. The writer was writing about PowerShell (specifically, why PowerShell in SQL Server when there already is T-SQL), but here's an interesting quote:

Within Microsoft we have a concept called Common Engineering Criteria (CEC). The goal of CEC is to establish consistency across IT products (Windows Server, Exchange, SQL Server, BizTalk, System Center, etc). [some text deleted] ...One of the latest CEC requirements is support for PowerShell. [more text deleted] ...Transact-SQL is a good scripting language but it’s not model driven. We will continue to invest in and evolve T-SQL. But we’ll also continue to invest in and evolve the models over DDL and DML. The model over DDL is SMO. The model over DML is EDM. (emphasis mine)

While this doesn't say (or even imply) that EDM is part of the Common Engineering Criteria, the idea of it a "model-driven DML" is intruiging.

So perhaps "more than an ORM" isn't just marketing hype. We'll just have to see.

Categories:
Data Access

Theme design by Nukeation based on Jelle Druyts