Follow on post to “The perils of case-insensitive data (and our life in tangent-land)”

OK, so thought I’d do a follow up to the post I did a couple of days ago titled: The perils of case-insensitive data (and our life in tangent-land). The reason I’d like to followup on it is that I received some excellent comments and I want to make sure that you’re all aware of the tips/tricks and recommendations that there were (some of you may not have returned to see all of the comments). Really, I was impressed by the speed at which people responded as well as the great comments (and things I learned!). It just reminds me of the fact that none of us can know everything AND that our SQL community is awesome in its willingness to share and communicate.


As for the tips/tricks and “yes, duh!” realizations I came to… here are the interesting points from the comments:


First – why did my comparison work for a single character (e.g. ‘%A%’) but not when I did a character range (e.g. ‘%[A-Z]%’)? Well, it was because it was unicode! This was a “right! duh!” realization that I think I dreamed after I wrote this BUT, Hugo Kornelis is exactly right in his comment. Thanks Hugo! Here is a direct cut/paste of his comment:



The reason [A-Z] doesn’t work, is that a collation doesn’t just govern case sensitive vs case insensitive but also (amongst others) the sort order of letters. And most case sensitive collations sort like A – a – B – b – … – Z – z. So [^A-Z] would include all letters except the lowercase z.

You can use [A-Z] to find uppercase characters in a binary collation (since all uppercase characters are in one range of ASCII, and all lowercase characters in another), but not in any other collation.


And, you can check out more from Hugo on his blog: http://sqlblog.com/blogs/hugo_kornelis/default.aspx


Second – the comparison query that I wrote all together (where I stated each letter individually in the WHERE clause) only took a few seconds to write (thanks to cut and paste :) AND it did work…And, sometimes getting something to work and moving on is all we can do (come on – you’ve ALL been there, eh? :). However, my main comment was that “it wasn’t pretty”. A much more elegant and unbelievably simple solution came from David R Buckingham (aren’t the great answers always the really simple ones :)). Here is a direct/cut/paste of his comment:



The following query will return any fully lower case names in the table:

SELECT LastName
FROM Person.Contact
WHERE LastName COLLATE Latin1_General_CS_AS_KS_WS = LOWER( LastName ) COLLATE Latin1_General_CS_AS_KS_WS


I don’t believe that David has a blog… maybe he should :).


Third – a very cool and clever trick that came in from Denis Gobo is related to the performance of repeatedly doing case-sensitive searches on a case-insensitive column. I suggested that creating an additional column (preferably a computed column that uses the case sensitive collation) would be an easy and optimal solution. This is still definitely true when the case-insensitive values are NOT selective enough to warrant using an index and the case-sensitive values are… However, if both the case-sensitive AND the case-insensitive values are reasonably selective then the trick that helps is from his comment. Here is a direct cut/paste of his comment:



Kimberly, the way to force an index seek is to do this

SELECT *
FROM MyTestContacts
WHERE Lastname = N’adams’
AND Lastname COLLATE Latin1_General_CS_AS_KS_WS = N’Adams’

The WHERE might return more than one row but the AND will return only the case sensitive one

I wrote about that a while back here:
http://sqlservercode.blogspot.com/2007/05/make-your-case-sensitive-searches-1000.html 


And, you can check out more from Denis on his blog: http://sqlservercode.blogspot.com/


Now, as for the issues related to creating a view in a database that has a different collation from the server’s collation… Here, I’m fairly certain that there’s still a bug. However, I’m happy to say that I don’t think that it’s the most likely situation that exists for collations. I think the two most likely situations are:




  1. The server has one collation. The database inherits that collation. The database developer makes column level collation changes throughout the db. This seems to work well. OR


  2. The server has one collation. The database has a different collation. The database developer consistently uses that collation throughout their app. A good example of this is where people have case-sensitive databases on case-insensitive servers. This works fairly well (although there are some issues wrt to temp tables, etc. and default_collation is a good thing to know).


I guess there’s even a third one where column level changes are made in a database whose collation is different from the server but where there aren’t any views that also change the collation to yet a different collation (and this is where there seems to be a bug).


So, this was an excellent (and reasonably fun :) :) exercise to go through wrt collations. And, this is how I (we?) learn! I really want to thank everyone for reading – and commenting/sharing! – the things they learned/knew. That’s part of why I love the SQL Server community. And, speaking of which, I thought I’d end this entry with a few community links – as a reminder to everything that’s out there:



Thanks for reading! Thanks for commenting!
kt

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Wow! Wow! Wow! THANK YOU!

I announced my retirement from SQL/tech here and your comments on my blog, on LinkedIn, and on Facebook were overwhelming and humbling! I’m so touched

Explore

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.