The perils of case-insensitive data (and our life in tangent-land)

OK, have you ever been working on one thing…that led you to another (and another and another) and then you seem to have lost hours? OK, I know. That’s our life [in the computer industry and I’m sure others!] – putting out fires and chasing strange behaviors that we eventually call “gremlins” when we really can’t figure them out (especially when they don’t repro). And, I know that we all want to be incredibly prepared but, sometimes bugs happen. And, sometimes bugs lead to serious problems possibly even data corruption/loss (which I’ve seen) and NO, I’m not directly relating this to anything about SQL Server. I’m just wanting to stress the necessity of a backup strategy (ah, a *tested* backup strategy) but, the bugs I’ve run into today are really not all that serious (they do NOT corrupt data). But, they do lead me to believe that far fewer changes are made to collations than I had thought? Or, that many of you change collations at the most granular level (probably at the column level?) and that database collation changes are done but without additional column level changes later.  


And, that’s really the point of this blog post… for now, I’m going to recommend that you make changes at the column-level OR you don’t make additional changes AFTER you’ve changed a particular database’s collation. In other words, if you have a case insensitive server and a case sensitive database then things will probably work well. You can even leverage things like default_collation for temp tables. However, if you try to make addition changes to collations in other objects – such as views – it doesn’t seem to work. Basically, I’ve run into problems creating views with different collations only when the database’s collation is different than the server’s collation. So far, that’s the only thing that I’ve found that’s wrong with what I’ve been doing. And, I didn’t even figure this one out on my own – I did a live search on the error and found this: http://cc.msnscache.com/cache.aspx?q=72171562874629&mkt=en-US&lang=en-US&w=286a60c3&FORM=CVRE which seems like the same problem I’m having (and sorry for the cached page, I couldn’t seem to get to the live page).


Regardless of this issue (is it a bug?), the real reason for this blog post is that a great discussion came up on the Regional Director tech alias (it’s an internal thing we use to leverage each other’s skills). The original question led to a few discussions and in the end, I think there are really two questions that I thought I’d discuss here:


#1) do you want a ONE-TIME way of checking a bunch of data to find rows that are lower-case (and shouldn’t be)
#2) do you want to REPEATEDLY find rows based based on a case-sensitive search (where the data is stored case-insensitive).


In my first response, I answered #2. And, I’m going to start with that one here too. If you want to query a case-insensitive column with a case-sensitive search then changing the collation on the fly (with a where clause) works (although there are some performance issues related to this). So, I took an old example of mine (which was against pubs) and I decided that it needed a refresh (meaning, I wanted to update this to work against AdventureWorks). And, that’s where half of my fun today started since this is where I’ve run into what I think is a bug. Anyway, let’s start with what works:


— First, I’ll create a test database. Without a collation specified,
— it will use the server’s default collation.


CREATE DATABASE TestAdventureWorks
go


— Verify the database collation
SELECT DATABASEPROPERTYEX(‘TestAdventureWorks’, ‘Collation’)
go


— database is set to SQL_Latin1_General_CP1_CI_AS as expected
— this is a case-insensitive database


USE TestAdventureWorks
go


SELECT LastName collate database_default AS LastName
, FirstName collate database_default AS FirstName
, MiddleName collate database_default AS MiddleName
INTO MyTestContacts
FROM Adventureworks.Person.Contact
go


SELECT *
FROM MyTestContacts
WHERE Lastname = N‘Adams’
go — (86 row(s) affected)


SELECT *
FROM MyTestContacts
WHERE Lastname = N‘adams’
go — (86 row(s) affected)


SELECT *
FROM MyTestContacts
WHERE Lastname COLLATE Latin1_General_CS_AS_KS_WS = N‘Adams’
go — (86 row(s) affected)


SELECT *
FROM MyTestContacts
WHERE Lastname COLLATE Latin1_General_CS_AS_KS_WS = N‘adams’
go — (0 row(s) affected)


— Next, create a view:
CREATE VIEW ContactLastNameCaseSensitive
AS
SELECT
LastName COLLATE Latin1_General_CS_AS_KS_WS AS CSName
FROM MyTestContacts
go


SELECT *
FROM ContactLastNameCaseSensitive
WHERE CSName = N‘Adams’
go — (86 row(s) affected)


SELECT *
FROM ContactLastNameCaseSensitive
WHERE CSName = N‘adams’
go — (0 row(s) affected)


And, everything works… in TestAdventureworks. In the *real* AdventureWorks, I get an error when I try to create the view:
Msg 2791, Level 16, State 5, Procedure ContactLastNameCaseSensitive, Line 3
Could not resolve expression for schemabound object or constraint.


So, this is the first issue. It seems as though you can’t create the view if your database has a different collation than the server collation. Well, (again), I haven’t spent all that much time on this one but I did repro what the chain on the forum seemed to have found.


Now, as for the second issue… the query can be EXTREMELY painful and slow if you run this against a large set of data. See, changing the collation on the fly will need to perform a row-by-row comparison of the data. So, to minimize that HUGE hit on performance – you have two options.



1) actually consider changing the column’s collation so that it matches your queries AND then create an index (note: the actual use of the index will be determined by the selectivity of the data).



SELECT LastName collate database_default AS LastName
, FirstName collate database_default AS FirstName
, MiddleName collate database_default AS MiddleName
INTO MyTestContacts2
FROM Adventureworks.Person.Contact
go


ALTER TABLE MyTestContacts2
ALTER COLUMN LastName nvarchar(100) COLLATE Latin1_General_CS_AS_KS_WS
go


CREATE INDEX CSNameInd on MyTestContacts2 (LastName)
go


— let’s use a query that’s highly selective (selective enough to use the index):


SELECT *
FROM MyTestContacts2
WHERE LastName = N‘Barlow’
go — (1 row(s) affected)


SELECT *
FROM MyTestContacts2
WHERE LastName = N‘barlow’
go — (0 row(s) affected)


2) create an index with a different collation… but this is harder than it sounds as the CREATE INDEX statement doesn’t directly allow changing collation (however, it should!). But, you can do this by either creating another column (real or computed) with the case-sensitive collation and then indexing it OR you could do this through an indexed view (but that adds a few complexities as well). I think the computed column that’s indexed is GREAT if the searches are generally highly selective. If they are not, then it is probably better to create a real column – as a computed copy of the inserted value – that is case sensitive. However, at that point, I’m not entirely sure why you’re keeping the case-insensitive version around…unless it’s to keep the actual inserted value (maybe for printing and/or display?). Regardless, here’s how you can create an indexed computed column.


ALTER TABLE MyTestContacts
ADD
CSName
AS LastName COLLATE Latin1_General_CS_AS_KS_WS
go


SELECT *
FROM MyTestContacts
WHERE CSName = N‘Adams’
go — (86 row(s) affected)


SELECT *
FROM MyTestContacts
WHERE CSName = N‘adams’
go — (0 row(s) affected)


CREATE INDEX CSNameInd ON MyTestContacts (CSName)
go


SELECT *
FROM MyTestContacts
WHERE CSName = ‘Barlow’
go — (1 row(s) affected)


SELECT *
FROM MyTestContacts
WHERE CSName = ‘barlow’
go — (0 row(s) affected)


And, the index will be used if the query is highly selective.


OK, so that ends the answer to part 2 of the question (see how tangents can take us a bit off track :)….


Now, let’s get back to question #1.


What if you want to do a one-time search through your data to find all of the lower case data? Well, there are a few thoughts here…. First, let’s modify the ONE Barlow row to be lowercase barlow so that we have something to find:


UPDATE MyTestContacts2
SET LastName = N‘barlow’
WHERE LastName = N‘Barlow’
go — (1 row(s) affected)


NOTE: This is an ABSOLUTELY horrible quiery as I didn’t use any key to point to the exact row I wanted to modify. Had there been a lot of Barlow’s I would have modified them all. This worked here because I knew there was only one row. But, all of your tables should have a primary key, etc. (not even going to begin this tangent :).


Now, having said that… let’s see if we can find this row easily? You should be able to do this using Transact-SQL and using some type of wildcard pattern matching such as:


SELECT * FROM MyTestContacts2
WHERE Lastname like N‘b%’
go — (1 row(s) affected)


And, that works without any problems.


So, what about NOT an upper case B.


SELECT * FROM MyTestContacts2
WHERE Lastname NOT LIKE N‘B%’
go — (18768 row(s) affected)


SELECT * FROM MyTestContacts2
WHERE Lastname NOT LIKE N‘%B%’
go — (18765 row(s) affected)


tangent number 87 <g>: if you’re wondering what the 3 rows are (as was I :)… they are 1 row of O’Brien and 2 rows of Smith-Bates. Here’s that query:



SELECT * FROM (SELECT * FROM MyTestContacts2
WHERE Lastname NOT LIKE N‘B%’) AS Bs
WHERE Lastname like ‘%B%’
go


OK, so, I thought we were there… I thought we could go to what I thought was the next logical step…..


SELECT * FROM MyTestContacts2
WHERE Lastname NOT LIKE N‘%[A-B]%’
go


And… well, we lose barlow from the result set. For some reason…when you do ranges of characters it seems to lose the case??? I remember that [A-Z] and [a-z] were different in some release? Is this a regression? Someone help me out with this one as I’m without a clue. In the end, the ONLY way I could get this to work is to do this:


SELECT * FROM MyTestContacts2
WHERE Lastname not like N‘%A%’
AND Lastname not like N‘%B%’
AND Lastname not like N‘%C%’
AND Lastname not like N‘%D%’
AND Lastname not like N‘%E%’
AND Lastname not like N‘%F%’
AND Lastname not like N‘%G%’
AND Lastname not like N‘%H%’
AND Lastname not like N‘%I%’
AND Lastname not like N‘%J%’
AND Lastname not like N‘%K%’
AND Lastname not like N‘%L%’
AND Lastname not like N‘%M%’
AND Lastname not like N‘%N%’
AND Lastname not like N‘%O%’
AND Lastname not like N‘%P%’
AND Lastname not like N‘%Q%’
AND Lastname not like N‘%R%’
AND Lastname not like N‘%S%’
AND Lastname not like N‘%T%’
AND Lastname not like N‘%U%’
AND Lastname not like N‘%V%’
AND Lastname not like N‘%W%’
AND Lastname not like N‘%X%’
AND Lastname not like N‘%Y%’
AND Lastname not like N‘%Z%’
go


And, well, that works. But, it is NOT pretty! The query’s going to require a table scan anyway AND it is a one-time query. I’m OK with this as a solution to this problem BUT, am I missing something here? Please tell me there’s something more clever here? Is this a bug?


I’m definitely interested in feedback on this one!
kt

6 thoughts on “The perils of case-insensitive data (and our life in tangent-land)

  1. The following query will return any fully lower case names in the table:

    SELECT LastName
    FROM Person.Contact
    WHERE LastName COLLATE Latin1_General_CS_AS_KS_WS = LOWER( LastName ) COLLATE Latin1_General_CS_AS_KS_WS

  2. OK, I’ll definitely give you that one. It’s a great trick – and simple too. Doesn’t require that you create another index. However, just to play devils advocate <G>, you’re still not forcing an index to be used. There are cases where the case-insensitive version of the data is NOT selective and the case-sensitive is selective. Now, all of this depends on your data and in many cases this is probably not likely – however, it could happen. Regardless, if you do have an index on the case-insensitive version of the column (and the case-insensitive data is relatively selective) then this is a nice/simple and easier solution for sure. Thanks!

    And, I still want to get down to the bottom of the "search for all lower case in a case-sensitive (or insensitive) column" though. So, keep the comments coming…

    THANKS Denis!
    kt

  3. Well, we’ve all just posted at the same time… and David’s got it. OK, that’s a lot cleaner. It still doesn’t really answer why the Transact-SQL seems to have broken somewhere along the way (or is it just me?) and why [A-Z] is not uppercase (v. [a-z]) but, it’s definitely a lot prettier than mine (thank goodness for cut and paste).

    OK, cool. Thanks David!

    kt

  4. Hi Kimberly,

    The reason [A-Z] doesn’t work, is that a collation doesn’t just govern case sensitive vs case insensitive but also (amongst others) the sort order of letters. And most case sensitive collations sort like A – a – B – b – … – Z – z. So [^A-Z] would include all letters except the lowercase z.

    You can use [A-Z] to find uppercase characters in a binary collation (since all uppercase characters are in one range of ASCII, and all lowercase characters in another), but not in any other collation.

  5. Hey there Hugo – Of course… I thought about this more last night and realized that as well. We all know the general sort for ASCii but the order of characters in unicode is obviously different.

    So… thanks Hugo! That’s really the last issue there.

    Cheers,
    kt

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Wow! Wow! Wow! THANK YOU!

I announced my retirement from SQL/tech here and your comments on my blog, on LinkedIn, and on Facebook were overwhelming and humbling! I’m so touched

Explore

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.