Be careful with EMPTY/NULL values and spatial indexes

There are two slightly different ways of representing non-existing spatial data, the "empty geometry" concept and database NULL. You can read about the subtle differences between them here. If you're using a spatial index, either of these types have the ability to cause query perforamance problems.

Let's take, as an example, a table with 10000 rows, all of which contain 'POINT EMPTY'. If we put a spatial index on those rows, we will have 10000 rows in our spatial index all with a cellid value of 0 and a SRID (in the index information) of 0. This is true even if you are using the geography data type where STSrid for the data (when using POINT EMPTY) actually returns 4326.

When you're using the spatial index diagnostic stored procedures, these index entries show up as "ObjectCells in Level 0 for Index"; the diagnosic also report 2 SRIDs in the index data when using the geography index. This can have a big effect on your query plan.

Suppose that I have 10000 empty values (either 'POINT EMPTY' or NULL) in a table called "mypoints" and a query like this:

DECLARE @g geography;
SELECT @g = shapegeog FROM zipcodes where id = '97225'
SELECT * FROM mypoints WHERE mypoints.shapegeog.STIntersects(@g)=1

If the query plan uses the spatial index, you'll get a "candidate hit" on each of the empty points during the spatial index seek iterator. This is because the row for 'POINT EMPTY' is technically outside the index bounds (level0 cell) and the spatial method will have to be evaluated later in the query. Having 10000 additional index seeks on the *base* table, to fetch the geography column in order to evaluate the spatial method, will slow things down. You'd think it might be useful to set the geography to NULL and redo the query like this:

SELECT * from mypoints
WHERE mypoints.shapegeog IS NOT NULL AND mypoints.shapegeog.STIntersects(@g)=1

This will still get the 10000 additional candidate hits, and in my limited tests, I still get 10000 additional index seeks on the base table; the IS NOT NULL predicate is manifested as a separate filter iterator.

A workaround for this is to set the "empty" points to a valid value far outside your search areas (say 'POINT(0 90)'). Unless you're doing point-in-polygon computations where the North Pole is a valid choice (in that case, you're answers will be wrong because they'll include your placeholders) your queries will execute much faster.

This workaround can cause big problems when using other methods (like STDistance for example) because now you have a real (but nonsensical) value, so be careful of the complications. You can use CASE statements to substitute NULL values, for example,

— If I used the North Pole to speed up point-in-polygon queries
declare @p geography = 'POINT(0 0)'
   when (shapegeog.STEquals('POINT(0 90)')=1) THEN NULL
   else shapegeog.STDistance(@p)
from mypoints

I'd also thought of using NULLIF to substitute NULL for my placeholder values in spatial operations that don't need to use the spatial index, but this won't work because instances of the spatial data types aren't comparable except when using STEquals. So NULLIF and spatial types causes an error.

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.