The nearest neighbor optimization in SQL Server Denali

The nearest neighbor query is one of the most common queries against spatial data. How many people haven't gone to a mapping app, typed in their current location and asked for the 10 nearest [your favorite thing goes here]? The obvious way to phrase the spatial query for this, given an STDistance method on the SqlGeometry data type, would be:

DECLARE @me geometry = 'POINT (-121.626 47.8315)';
SELECT TOP(10) Location, Description, Location.STDistance(@me) AS distance
FROM [spatial_table]
ORDER BY distance;

Unfortunately for your performance (it does return the correct answer), in SQL Server 2008/R2 this query doesn't use the spatial index. And attempting to hint the spatial index results in the "could not produce a plan with your hints" error. Oh.

In SQL Server Denali, this just works. You do have to include a predicate that refers to STDistance, like:

…FROM [spatial_table]
WHERE Location.STDistance(@me) < 1000 -- 'where-distance' query
ORDER BY distance

but the predicate can also be something as unobtrusive (i.e., you don't have to commit to a maximum distance) as:

…FROM [spatial_table]
WHERE Location.STDistance(@me) IS NOT NULL -- 'where is not null' query
ORDER BY distance
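Putting the fragments together, a complete version of the "where is not null" query might look like this (the table and column names are placeholders; a spatial index on Location is assumed):

```sql
-- Hypothetical table/column names; assumes a spatial index on Location
DECLARE @me geometry = 'POINT (-121.626 47.8315)';

SELECT TOP(10)
       Location,
       Description,
       Location.STDistance(@me) AS distance
FROM   [spatial_table]
WHERE  Location.STDistance(@me) IS NOT NULL  -- predicate that enables index use
ORDER  BY distance;
```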

This uses the spatial index without hinting. And it's way faster than the 2008 version of the same query. Excellent. But…when the "nearest neighbor" question came up once too often in SQL Server 2008, Isaac Kunen came up with an alternative algorithm that uses a numbers table. And it's quite fast, although you do have to hint the spatial index (or, let's say, I've never gotten it to use the index without hinting). So how does this algorithm do against the new, automatic spatial index use plan?
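For those who haven't seen it, the shape of the numbers-table technique is roughly the following (a sketch from memory, with placeholder table, column, and index names; see Isaac's original post and the book chapter for the real thing). The idea is to probe a series of exponentially growing search rings and keep the rows from the smallest non-empty ring:

```sql
-- Sketch of the numbers-table pattern; all names are placeholders.
-- Numbers(n) is assumed to hold integers 0, 1, 2, …; the index hint is required.
DECLARE @me geometry = 'POINT (-121.626 47.8315)';
DECLARE @start float = 1000;  -- starting search radius; doubles with each ring

WITH NearestPoints AS
(
    SELECT TOP(1) WITH TIES
           s.Location,
           s.Description,
           s.Location.STDistance(@me) AS distance
    FROM   Numbers AS n
    JOIN   [spatial_table] AS s WITH (INDEX([spatial_index]))
           ON s.Location.STDistance(@me) < @start * POWER(2, n.n)
    ORDER  BY n.n  -- WITH TIES keeps every row found in the smallest non-empty ring
)
SELECT TOP(10) Location, Description, distance
FROM   NearestPoints
ORDER  BY distance;
```

Note that a sketch like this can return fewer than 10 rows if the smallest non-empty ring contains fewer, which is one of the edge cases mentioned in the caveats below.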

In the simple example where I tried this out, here are the plan costs:

Isaac's algorithm and hint:      0.517
Denali with WHERE and distance:  4.06683
Denali with WHERE..IS NOT NULL: 14.7

But the differences in query plan costs that involve spatial don't always seem to be relevant (the QO does sometimes underestimate the big difference using the spatial index will make). All the queries are subsecond, compared to 10 seconds or so when not using the index. So, any other comparison? How about I/O or worker time?

Isaac's algorithm and hint:       1x worker time/reads/CLR time
Denali with WHERE and distance:  10x worker time/reads/CLR time
Denali with WHERE..IS NOT NULL:   3x worker time/reads/CLR time

This is interesting because the plan cost was much lower for the where-distance query than for the where-isnotnull query, but the resource cost is exactly the opposite. And more interesting is the fact that Isaac's numbers table-based algorithm wins both ways.

Caveats: The obvious big caveat is that this is only a single test case, albeit a fairly common one (both "@me" and the points of interest are points). YMMV. Second, none of the queries I used, run in SSMS, allows for a proper cardinality estimate at plan creation time (aka parameter sniffing). But attempting to use sp_executesql or a sproc so that the parameter could be estimated causes the Denali "where-distance" query to loop endlessly. Bug reported; it is, after all, CTP1. And sp_executesql with the "where-isnotnull" case doesn't do much better; it's still worse than "numbers table". Third: Isaac's original algorithm is fragile in some edge cases (empty spatial table/no qualifying points, IIRC); Ed and I handled these cases in the slightly revised version in the chapter in Inside SQL Server 2008 Programming.
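For the curious, the parameterized form I'm describing is along these lines (again, table and column names are placeholders; this is the general sp_executesql shape, not the exact repro):

```sql
-- Hypothetical parameterization via sp_executesql, so the optimizer can
-- sniff the parameter value at plan creation time.
DECLARE @p geometry = 'POINT (-121.626 47.8315)';

EXEC sp_executesql
    N'SELECT TOP(10) Location, Description,
             Location.STDistance(@me) AS distance
      FROM   [spatial_table]
      WHERE  Location.STDistance(@me) IS NOT NULL
      ORDER  BY distance;',
    N'@me geometry',
    @me = @p;
```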

So, it appears that, even though the query is far less straightforward to write and the spatial index needs to be hinted, the numbers-table method wins, so far. However, I'd still say that the nearest neighbor optimization is useful to have in the product. But it is still Denali CTP1. And it goes without saying: if you have a test that disproves this, send it right along; I'm always happy to see it.

@bobbeauch
