This is the first of a series of posts on the new Integrated Full Text feature in SQL Server 2008. This is one of my favourite features which I’ve had to keep stum about  for quite a while.

Firstly an introduction to full text.

Full text is the feature in SQL Server that allows you to search text content (varchar, char,blobs etc) in a way that performs well ad also do stuff that you can’t do simply in the relational world with SQL. The simplest example is a description of a product. If you wanted to search for ice cream containing cherries your query might look like this.

select ProductId, ProductDescription

from dbo.product      p

join dbo.productType  pt on pt.ProductTypeId = p.ProductTypeId

where pt.productType = 'Ice Cream'

and p.description  like '%cherries%'

This is using the like operator to find products that contain the word cherries in the product description.  If you obtained the query plan for this, even though you had an index on productDescription and you only had 1 product which contained cherries, it is more than likely you would end up with a table scan of the products table. If you had an index on the productTypeId column you may end up with a range scan of that index and a bookmark lookup to get the product description.

The reason for this is that a normal index on a text based column only helps predicate based searches i.e. where the column starts with a certain set of characters, i.e.

and p.description  like 'cherries%’

Note: There is no % at the beginning only at the end.

You will also find that you are unable to index a column that is more than 900 bytes. So if your column is a unicode column you are limited to 450 characters (unicode requires 2 bytes per character)

So how do you get optimal querying. This is where full text comes in. Full Text indexes each word in the column not the whole column. This means you can convert the query above into

select ProductId, ProductDescription

from dbo.product      p

join dbo.productType  pt on pt.ProductTypeId = p.ProductTypeId

where pt.ProductType = 'Ice Cream'

and contains (p.ProductDescription,'cherries')

This enables the query engine to obtain the set of products that have cherries and merge the results with those that are ice cream to produce the results. What this means in terms of query performance I will discuss later.

So how does Full Text work.

It consists of a number of components, these are documented well in BOL and in a set of white papers. They are,

1.       iFilter – This takes the value in the column and returns a lump of text containing a set of words, this might sound odd but when you realise that there are iFilters for pdfs, jpegs, CAD drawings you can see that irrespective of the content in the file they are able to return the text from that content to the Full Text engine.

These iFilters are used by all Microsoft Search applications including sharepoint, index server and desktop search.

2.       Word breaker – This takes the text returned from the iFIlter and breaks the paragraphs down into individual words based on the rules for the relevant language, i.e. spaces, commas, etc. Some languages are more complex than others because they contain compound words.

3.       Indexer – The output from the word breaker is then stored in the full text index to provide optimal retrieval, its not just the words put also the position that is stored, which enables you to search where two words are near each other.

4.       Parser – Does a number of things. It removes words that are to be ignored i.e. the, at, in and a. It applies the thesaurus if one has been configured, applies word breaking and converts the result into a set of criteria to be used by the full text service to query the index.

5.       OLEDB interface – In SQL 2005 and before Full Text is a windows service that runs outside of the sqlserver.exe process. This means the sql server engine needs a way of querying the data from the service. This is done using OLEDB, in an execution plan you will see a “Remote Scan” operator. This is the engine talking to the full text service using OLEDB.

Prior to SQL Server 2008 most of the items above where black boxes. If something didn’t work as expected you had no idea what was wrong. Further more the performance of having to use OLEDB to go to an external service was not good.

In the rest of the series I will be going over what the issues are with SQL Server 2005 and how the improvements in SQL Server 2008 truely make the Full Text index first class.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/SQL Server 2008 – iFTS Introduction

Categories:

Comments

Add comment


(Will show your Gravatar icon)  

  Country flag

biuquote
  • Comment
  • Preview
Loading



Theme design by Nukeation based on Jelle Druyts