Of filetables, full-text and semantic search, and docidmaps

SQL Server Denali's new Filetable feature and full-text search are made for each other. Set up database for non-transactional filestream access with a directory name, create a table "AS FILETABLE" with a subdirectory name, drag over a bunch of files and turn on full-text search. Voila, instant search with two predicates and two TVFs. Even better, if you add the keyword SEMANTIC_STATISTICS after your column name in the CREATE FULLTEXT INDEX statement you have three new semantic search table-valued functions as well (provided that they're supported in your language). Add a property list and you can search on documents' extended properties. What could be easier? Looks quite like the functionality of the long-forgotten WinFS, if you ask me (although I don't remember any Semantic Search in WinFS, just property search).

There's just one thing. Creating a fulltext index on using a FILETABLE uses an additional internal table the fulltext docidmap.The reason for this table is to map full-text's docid (an integer) to the the FTS index's key column. In the FILETABLE's case this would be the stream_id column, which as marked as a ROWGUIDCOL. CREATE FULLTEXT INDEX will use the ROWGUIDCOL column as a key column if you don't specify a key column. FTS uses the docid to do its lookup.

But…

In SQL Server 2008, an optimization was introduced if and only if the FTS index key column was an INT or BIGINT. In the case where FTS index key column is an INT/BIGINT, no docidmap table is created when the FTS index is. Better yet, this optimization does away with the docidmap lookup step in the query plan. This step which basically adds another JOIN to every FTS and every Semantic Search query, is simply removed, improving the plan.

Because the FILETABLE's FTS key column is a UNIQUEIDENTIFIER, it can't take advantage of this optimization. You can't add columns to a FILETABLE either, so this isn't an option. Darn. One last question…because a UNIQUEIDENTIFIER (GUID) has a  bigger value space than BIGINT, what happen's when you get more than VALUESPACE_OF(BIGINT) documents in your FTS/Semantic corpus? Guess I'd need to have over 18,446,744,073,709,551,616 documents to find out…wait, how many are on that C drive again?

@bobbeauch

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.