The next post in the series on iFTS (Integrated Full Text Search) covers one of the major improvements in SQL Server 2008 for Full Text and is related to the Integrated part or iFTS.

SQL Server 2005 introduced a huge performance improvement by moving the location of the docid map.

So what is a docid, you may ask?

Well we all know that to get best performance with indexes our keys should be as small as possible. When you create a Full Text index you specify a unique index to be used as the key for the row. However this might be anything, a uniqueidentifier, a large varchar, anything, so to ensure optimal performance uses a surrogate key to identify a row in a table. This is the docid (Document Id).

Note: In full text terms a document is a row in a table.

There is obviously a need to translate between the docid and the key of the table. Prior to SQL 2005 that was done in the full text engine, this meant two things. Large keys had to be passed through OLEDB to SQL Server and also unnecessary mapping may have undertaken in the Full Text service.

In SQL 2005 this mapping was moved into the database which meant the transfer over OLEDB was quicker and the engine could do more optimisations in terms of doing the mapping.

So what’s new in SQL 2008?

Firstly the index is stored in the engine and so there is no OLEDB stuff going on, but more importantly if you have an integer key for your Full Text index, Full Text will use that key as the docid and so you won’t have to map between two values.

This has huge benefit when you have full text and normal relational filters because the optimiser can push the relational filter into the querying of the full text index and thus read less data and so perform better.

The following query plans highlight the docid mapping. The first is using a uniqueidentifier as a key,

And this shows the query plan with an integer key,

The subtree cost of the first is 0.01 (which isn't high but I only have a 174 rows in this table. The subtree cost of the second is 0.000459, thats a 20 fold decrease. Just for using an integer key. All is not lost, if you have a non-integer key you still benefit from the performance changes and the ability to use all the relational join operators in the engine, its just you will have a few extra steps to perform and so performance will not be as good as with an integer key.

What about upgrade?

If you are upgrading from SQL 2005 you have a number of options. The first is to maintain the index as is (but held within SQL Server). This has the benefit of reduced downtime, but you still have a docidmap.

The second option is to rebuild the index. This will rebuild the index using the new structure AND remove the docidmap. So will give you the best performance but requires a full rebuild of the index (which should be quicker in 2008 as well).

The third option takes the index offline and can’t be used. This allows you to delay the rebuild until a time that is convenient for you and your server. (If you upgrade 10 databases it is probably not wie to have them all rebuilding their full text indexes at the same time).

What about new databases ?

One of the other aspects is that because the full text is in the database you no longer have to enable the database for full text. You simply need to create your full text catalog and away you go.

Be conscious of the docid issue when creating new tables. You will get best performance if the unique index you use for the full text index has an integer key.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/SQL Server 2008 – iFTS Performance – DocidMap

Categories:


The next post in the series on iFTS (Integrated Full Text Search) covers a thesaurus files. A feature introduced in SQL 2005.

The loading of thesaurus files is a bit of an uncontrolled beast in SQL Server 2005 (there weren’t supported or documented prior to SQL 2005). The only way you could force a thesaurus files to be reloaded and used was to restart the Full Text service. (I believe you may have had to restart the SQL Service as well but that isn’t documented). This is obviously a slight pain.

In SQL Server 2008 thesaurus files are still files on the file system residing in the FTDATA folder, but the difference is that you can reload them when you want using the sys.sp_fulltext_load_thesaurus_file procedure and specifying which language to load i.e.

EXEC sys.sp_fulltext_load_thesaurus_file 1033;

The books online has(will have by RTM) much better FTS documentation that in SQL Server 2005 around thesaurus files.

One area to be wary of a term can only be included in one pat of a replacement or sub of an expansion. If you have a term repeated it appears that expansions take precedence over replacements and the order in the file the takes precedence. If you find that words are not being expanded or replaced as expected check whether you have included the term more than once. There are some trace flags that support can use to help you if you run into difficulties.

If you are running a search function on a website, thesaurus files are a great way of allowing or mis spelling of terms. i.e. Mircosoft, using expansion elements will allow the user to spell something wrong and the person writing the content to have got it wrong. This would be the case if users could post their own content. If you know your content is rock solid you can just use replacement to auto correct spelling mistakes into the correct spelling, i.e.

        <replacement>
            <pat>mircosoft</pat>
            <pat>microsft</pat>
            <pat>micrsoft</pat>
            <sub>microsoft</sub>
        </replacement>

If you are not capturing what people are searching on and the results they get back then you should be, the information whilst often overwhelming can be like gold dust. If you’re running a music website how many people know how to spell Anastacia. If you reviewed misspelt searches you could return results based on a corrected spelling or like Live does append the results to those of the badly spelt word as often a misspelt word can still be a valid word even in the context of the search. In the case of Anastacia if I entered Anastasia this is still valid as there are film scores for this however based on popularity you could assume people wanted Anastacia and so included those results by use of a thesaurus.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/SQL Server 2008 – iFTS Manageability – Loading thesaurus files

Categories:


This is the next in the series of posts about the new iFTS (Integrated Full Text Search) in SQL Server 2008.

One of the huge frustrations with using Full Text prior to SQL Server 2008 was that you had no idea what content the iFilter has extracted from the document you have indexed. This leads to questions like “Why hasn’t this document been returned when this one has?”. We had a situation where a process was run that looked for discrimination, this included looking for the word black. At one time one companies  content kept getting flagged as including discrimination but on looking at the content we couldn’t figure out why.  After doing some digging we realised that the content was HTML, the viewer showed rendered HTML and not the raw HTML, and the HTML contained the font elements specifying the colour Black. We were had a Full Text index that wasn’t type and so the generic text iFilter was used and so Black was included in the index. We changed to use a type column and specified “HTML” which resulted in the HTML iFilter being used and so the markup elements were not returned and stored in the index.

If we had access to the words that had been indexed, it would have been clear what had happened. Well in SQL Server 2008 there are Two dynamic management views that return the keywords of the index, and the keywords of each document in the index. These dmvs are, sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document respectively.

select *
  from sys.dm_fts_index_keywords (db_id(),object_id('Table_1'))

 select *
  from sys.dm_fts_index_keywords_By_Document (db_id(),object_id('Table_1'))

Note: The second dmv whilst is by document does not take a document id as a parameter. This means that this returns all the keywords for all the documents in an index. That will be a lot of rows. As an example if you store the definition from all_sql_modules in master in a table you will have ~ 1780 rows. When index using the default stop lists that will result in ~153000 rows being returned from sys.dm_fts_index_keywords_by_document.

Because of this it is not a dmv to be called from your application. If you want to use the output from his dmv I suggest you have a batch process that persists the output to a table which can be indexed and so usable in an application. This would be an ideal candidate for using the MERGE statement.

Further uses for this information?

If you are interested in looking at other query models for text content then this information provides you a fantastic starting point. Effectively you have text mining functionality in the database. The ranking algorithms are fixed in SQL Server (something that is being looked at for future releases) if you don’t like their ranking algorithm you could build your own.

Unfortunately the position information of keywords in a document is not returned by these dmvs so you can’t build an algorithm that involves position. That also means you can’t use this information to rebuild snippets based on search criteria in a way that live.com or google.com do.

What it is great at doing is giving you analysis features or your data. If you have a set of search terms you can easily use the contents of this index to find how many matches each term makes. This is useful if you want to display counts of items on your web site.

Another option that I thought of in response to a post on my introduction post (SQL Server 2008 – iFTS Introdction) Is the use for partial matches.

Currently it is only possible to do prefix wild card searching (Wildcard-search-in-SQL-Server-Full-Text) i.e. "Mirco*" You can't look for "*soft", the only way of achieving he latter is by indexing a reverse version of the content, which is fine for text content (give the extra space) but not for other content type i.e. word or pdf. Having the contents index available would allow you to do build a query based on the terms that match your criteria.The terms in index will be a fraction of the total index size and so querying using LIKE would not be too bad. i.e. select term from MyIndexTerms where term LIKE '%<partialword>%'. You can then use these terms to build a CONTAINS statement, i..e looking for "compute" CONTAINS (*, 'computedcolname OR computername OR computetime OR ....'). Essentially you are treating the index as a dictionary.

You could also do matching based on sounds like (not SOUNDEX), but thats beyond the scope of this post.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/SQL Server 2008 – iFTS Transparency – Viewing the words in the index

Categories:


SQL Server 2008 – iFTS Transparency – dm_fts_parser

In the next in this series of posts on Integrated Full Text Search (iFTS) in SQL Server 2008, we look at the new dmv dm_fts_parser.

Wow thats a cool function name what does it do Simon?

Well in my first post I talked about the processes involved in the full text process which until now have been black boxes. This function makes some of these more transparent from a querying perspective.

dm_fts_parser takes a full text query and breaks it up using the word breaker rules, applies stop lists (more on them later), and any configured thesaurus. This is essential in the first step of diagnosing when users are complaining because there queries aren’t working. Often this is due to, a word not breaking as expected, use of noise words that exist in the stop list or thesaurus replacing  or substituting words.

You call the function using the same query string as you would use normally with a CONTAINS statement, along with a language, a stop list and where the search should be accent sensitive.

SELECT *
FROM
sys.dm_fts_parser ('FORMSOF( THESAURUS, "Internet Explorer")', 2057, 0, 0)

This  returns the following,

You can see that in my thesaurus I have added substitution elements for Internet Explorer or firefox and netscape.

The following query ,

SELECT *
FROM
sys.dm_fts_parser ('multi-million', 2057, 0, 0)

Returns the following showing how the word breaking as broken the word up but also maintained the combined word.

Finally

SELECT *
FROM
sys.dm_fts_parser ('SQL OR Server OR 2008 OR is OR the OR best', 2057, 0, 0)

Returns the following which nicely indicates which words are noise words but also that numbers are searched as numbers and text. Note the nn prefix.

And finally finally, the query about c++, c# etc.

SELECT *
FROM sys.dm_fts_parser ('C or c or C++ or c++ or C# or c#', 2057, 0, 0)

Returns the following, which shows what you need to put in to get an exact search on c++, or c#. Capitalise the C. What’s also interesting is that C, C++ both relate to C as well but C# doesn’t, which means it C is removed from the noise word then C++ would return any document containing the word C.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/SQL Server 2008 – iFTS Transparency – dm_fts_parser

Categories:


In the previous post I talked about Full Text in general. I will now talk about whats new in SQL Server 2008.

I would categorise the new features into Performance, Manageability and Transparency. So what’s new in SQL Server 2008, well we have,

1.       Full Text Indexes are stored in the database

2.       Full Text Query engine is part of the main SQL Server Query Engine

3.       Words in a Full Text Index are visible.

4.       Words in a document are

5.       Noise words are no longer in a text file but stored in the database and are called Stop Lists

6.       Contents of a Stop List can be amended on the fly (pre indexed content will not be amended) without restarting SQL Server

7.       Thesaurus Files can be amended on the fly without restarting SQL Server

8.       Indexes are maintained in SQL Server memory rather than the OS.

9.       A new function exists to show how a query gets translated by word breaking, stop lists and thesaurus’s

10.   A new external process is used to index content. (This is for security and stability reasons as iFilters can be written by anybody)

11.   Support for filestream data, which means you don’t have to blobs in your database and means managing file growth is easier.

If you are wondering iFTS means Integrated Full Text Search which is what it is.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx

 


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/SQL Server 2008 – iFTS New features

Categories:


The next CTP of SQL Serve 2008 is now available for download.

Whats different about this CTP from the others, apart from the new features which there are a lot, is there is a bug bash contest that is opening.

Simply put the people that raise the most nonduplicate reproducible bugs on CTP6 in the next month will win XBoxes. The rules are available here

https://connect.microsoft.com/SQLServer/content/content.aspx?ContentID=7734

I was suprised at how many XBoxes they are giving away, which shows the commitment to getting bugs ironed out now and not later in the day.

CTP6 is nearest to RTM with very few additional features being added from here on in. So if you fancy an XBox get installing and see if you can find a bug.

 


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/SQL Server 2008 - CTP6 is now available

Categories:


This is the first of a series of posts on the new Integrated Full Text feature in SQL Server 2008. This is one of my favourite features which I’ve had to keep stum about  for quite a while.

Firstly an introduction to full text.

Full text is the feature in SQL Server that allows you to search text content (varchar, char,blobs etc) in a way that performs well ad also do stuff that you can’t do simply in the relational world with SQL. The simplest example is a description of a product. If you wanted to search for ice cream containing cherries your query might look like this.

select ProductId, ProductDescription

from dbo.product      p

join dbo.productType  pt on pt.ProductTypeId = p.ProductTypeId

where pt.productType = 'Ice Cream'

and p.description  like '%cherries%'

This is using the like operator to find products that contain the word cherries in the product description.  If you obtained the query plan for this, even though you had an index on productDescription and you only had 1 product which contained cherries, it is more than likely you would end up with a table scan of the products table. If you had an index on the productTypeId column you may end up with a range scan of that index and a bookmark lookup to get the product description.

The reason for this is that a normal index on a text based column only helps predicate based searches i.e. where the column starts with a certain set of characters, i.e.

and p.description  like 'cherries%’

Note: There is no % at the beginning only at the end.

You will also find that you are unable to index a column that is more than 900 bytes. So if your column is a unicode column you are limited to 450 characters (unicode requires 2 bytes per character)

So how do you get optimal querying. This is where full text comes in. Full Text indexes each word in the column not the whole column. This means you can convert the query above into

select ProductId, ProductDescription

from dbo.product      p

join dbo.productType  pt on pt.ProductTypeId = p.ProductTypeId

where pt.ProductType = 'Ice Cream'

and contains (p.ProductDescription,'cherries')

This enables the query engine to obtain the set of products that have cherries and merge the results with those that are ice cream to produce the results. What this means in terms of query performance I will discuss later.

So how does Full Text work.

It consists of a number of components, these are documented well in BOL and in a set of white papers. They are,

1.       iFilter – This takes the value in the column and returns a lump of text containing a set of words, this might sound odd but when you realise that there are iFilters for pdfs, jpegs, CAD drawings you can see that irrespective of the content in the file they are able to return the text from that content to the Full Text engine.

These iFilters are used by all Microsoft Search applications including sharepoint, index server and desktop search.

2.       Word breaker – This takes the text returned from the iFIlter and breaks the paragraphs down into individual words based on the rules for the relevant language, i.e. spaces, commas, etc. Some languages are more complex than others because they contain compound words.

3.       Indexer – The output from the word breaker is then stored in the full text index to provide optimal retrieval, its not just the words put also the position that is stored, which enables you to search where two words are near each other.

4.       Parser – Does a number of things. It removes words that are to be ignored i.e. the, at, in and a. It applies the thesaurus if one has been configured, applies word breaking and converts the result into a set of criteria to be used by the full text service to query the index.

5.       OLEDB interface – In SQL 2005 and before Full Text is a windows service that runs outside of the sqlserver.exe process. This means the sql server engine needs a way of querying the data from the service. This is done using OLEDB, in an execution plan you will see a “Remote Scan” operator. This is the engine talking to the full text service using OLEDB.

Prior to SQL Server 2008 most of the items above where black boxes. If something didn’t work as expected you had no idea what was wrong. Further more the performance of having to use OLEDB to go to an external service was not good.

In the rest of the series I will be going over what the issues are with SQL Server 2005 and how the improvements in SQL Server 2008 truely make the Full Text index first class.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/SQL Server 2008 – iFTS Introduction

Categories:


I amd really getting into the use of UDTs as a way of packaging up SQL CLR code. I'm also looking into implementing an array type so I can do code like

declare @s dbo.SQLArray

set @s.AddKeyword('sas')

set @s.AddKeyword('dog')

Fo this to be realy useful you would want to be able to select from your array.

select *

from @s.ListValues()

Unfortunately thats not possible, you can't have a table valued function on a type. You get 1 of two errors one complaining that you have to alias the table and columns of the valued function and the other saying you can't alias the column in table valued function. On reading BOL it states that you can't do this.

So your other option is to add a static method to your type and have it take an instance of the type from which you return the list.

declare @s dbo.SQLArray

 

set @s = 'simon,sabin'

 

select *

from dbo.ListValues(@s) as List

I am still looking into the true impact of having such code in terms of performance and supportability, but I am sure it would be of use in some situations to make more readable code even if it doesn't perform great.


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/Table Valued Methods on Type

Categories:
SQL Server 2005


Last year we promised more usergroup meetings and hopefully this year we will deliver.

I have now arranged to have a user group meeting every 3rd Thursday of the month. We will be starting off in London on the 21st of February and then alternating between Reading and London every month.

As mentioned the first one is a week on Thursday at Microsofts new London office you can find more details of the event here http://www.sqlserverfaq.com/?eid=108.

As we get the agendas sorted for the following events we will post those on the site as well. There is so much going on this year there is plenty of content to talk about.

If you want to speak at a meeting feel free to contact Tony ormyself


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/UK SQL Usergroup meeting in London - 21st Feb

Categories:


I'm going to build a list of locations to download spatial data for free.

This is the first

http://arcdata.esri.com/data_downloader/DataDownloader?s=s&service=ESRI_World

It allows you to downloads country boundaries, roads, water bodies and cities


This is cross posted from my SQLBlogcasts blog which can be found here, http://sqlblogcasts.com/blogs/simons/Spatial data where to download

Categories:

Theme design by Nukeation based on Jelle Druyts