This is the
next in the series of posts about the new iFTS (Integrated Full Text Search)
in SQL Server 2008.
One of the huge frustrations with using Full Text prior to SQL Server 2008
was that you had no idea what content the iFilter has extracted from the
document you have indexed. This leads to questions like “Why hasn’t this
document been returned when this one has?”. We had a situation where a process
was run that looked for discrimination, this included looking for the word
black. At one time one companies content kept getting flagged as including
discrimination but on looking at the content we couldn’t figure out why.
After doing some digging we realised that the content was HTML, the viewer
showed rendered HTML and not the raw HTML, and the HTML contained the font
elements specifying the colour Black. We were had a Full Text index that wasn’t
type and so the generic text iFilter was used and so Black was included in the
index. We changed to use a type column and specified “HTML” which resulted in
the HTML iFilter being used and so the markup elements were not returned and
stored in the index.
If we had access to the words that had been indexed, it
would have been clear what had happened. Well in SQL Server 2008 there are Two dynamic
management views that return the keywords of the index, and the keywords of each
document in the index. These dmvs are, sys.dm_fts_index_keywords and
sys.dm_fts_index_keywords_by_document respectively.
select
*
from sys.dm_fts_index_keywords (db_id(),object_id('Table_1'))
select
*
from sys.dm_fts_index_keywords_By_Document (db_id(),object_id('Table_1'))
Note: The second dmv whilst is by document does not take a
document id as a parameter. This means that this returns all the keywords for
all the documents in an index. That will be a lot of rows. As an example if you
store the definition from all_sql_modules in master in a table you will have ~
1780 rows. When index using the default stop lists that will result in ~153000
rows being returned from sys.dm_fts_index_keywords_by_document.
Because of this it is not a dmv to be called from your application. If you
want to use the output from his dmv I suggest you have a batch process that
persists the output to a table which can be indexed and so usable in an
application. This would be an ideal candidate for using the MERGE statement.
Further uses for this information?
If you are interested in looking at other query models for
text content then this information provides you a fantastic starting point.
Effectively you have text mining functionality in the database. The ranking
algorithms are fixed in SQL Server (something that is being looked at for
future releases) if you don’t like their ranking algorithm you could build your
own.
Unfortunately the position information of keywords in a
document is not returned by these dmvs so you can’t build an algorithm that
involves position. That also means you can’t use this information to rebuild
snippets based on search criteria in a way that live.com or google.com do.
What it is great at doing is giving you analysis features or
your data. If you have a set of search terms you can easily use the contents of
this index to find how many matches each term makes. This is useful if you want
to display counts of items on your web site.
Another option that I thought of in response to a post on my introduction
post (SQL
Server 2008 – iFTS Introdction) Is the use for partial matches.
Currently it is only possible to do prefix wild card searching (Wildcard-search-in-SQL-Server-Full-Text)
i.e. "Mirco*" You can't look for "*soft", the only way of achieving he latter is
by indexing a reverse version of the content, which is fine for text content
(give the extra space) but not for other content type i.e. word or pdf. Having
the contents index available would allow you to do build a query based on
the terms that match your criteria.The terms in index will be a fraction of the
total index size and so querying using LIKE would not be too bad. i.e. select
term from MyIndexTerms where term LIKE '%<partialword>%'. You can then use
these terms to build a CONTAINS statement, i..e looking for "compute" CONTAINS
(*, 'computedcolname OR computername OR computetime OR ....'). Essentially you
are treating the index as a dictionary.
You could also do matching based on sounds like (not SOUNDEX), but thats
beyond the scope of this post.
The following are the other posts in the series
If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx