MSSQL - Query case insensitive data

31. March 2011 23:26

I came across a post a few weeks ago which attempted to explain how to do a case-insensitive query on data. The solution works, well, kind of: it also fails really badly once you scale up the size of the data.

 

The problem is that when you change the collation in the query, SQL Server can no longer perform an index seek; an index scan is performed instead. This happens because the data stored in the index is case sensitive while a case-insensitive comparison is being attempted.

 

Here is what I used to set up the same situation. I created a table using the following script, then modified the collation of the column to make it case sensitive.

 

CREATE TABLE [dbo].[CaseInSensitive](
	[cisId] [int] IDENTITY(1,1) NOT NULL,
	[cisText] [varchar](50) NULL,
 CONSTRAINT [PK_CaseInSensitive] PRIMARY KEY CLUSTERED 
(
	[cisId] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
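
 

To make the column case sensitive, the collation change can be done with a statement along these lines (the collation name here is an assumption; any case-sensitive, CS, collation would demonstrate the same behaviour):

ALTER TABLE [dbo].[CaseInSensitive]
	ALTER COLUMN [cisText] [varchar](50) COLLATE SQL_Latin1_General_CP1_CS_AS NULL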

 

Then I filled it with some sample data, approximately 20,000 rows, which is about what a small to medium database would contain. I also added an index on the cisText column.
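
 

As a sketch, the data fill and index could look like the following. The row count and generated values are placeholders, not the exact data I used:

-- Insert roughly 20,000 rows of placeholder text data
INSERT INTO [dbo].[CaseInSensitive] ([cisText])
SELECT TOP (20000) LEFT(CONVERT(varchar(36), NEWID()), 10)
FROM sys.all_objects a CROSS JOIN sys.all_objects b

-- Index on the case sensitive column
CREATE NONCLUSTERED INDEX [IX_CaseInSensitive_cisText]
	ON [dbo].[CaseInSensitive] ([cisText])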

 

Now run the following queries. The first is not actually what we want, since it is case sensitive, but it is included here as a baseline for comparing query speed. I also turned on IO statistics at this point.
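
 

IO statistics can be switched on for the session with:

SET STATISTICS IO ON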

 

SELECT COUNT(*) FROM CaseInSensitive
	WHERE cisText = 'abc'

SELECT COUNT(*) FROM CaseInSensitive
	WHERE cisText COLLATE SQL_Latin1_General_CP1_CI_AS = 'abc'

 

 

The first query brought in a total of 2 logical reads. The second brought in a total of 88 logical reads, so you can quite clearly see the performance drop-off. In fairness, the first query was not operating correctly, but we will prove the performance again later.

 

The really simple way to bring the performance back is to alter the table slightly, adding a persisted computed column, and then add another index on our new column "cisTextLower". The new table layout will look like the following:

 

 

CREATE TABLE [dbo].[CaseInSensitive](
	[cisId] [int] IDENTITY(1,1) NOT NULL,
	[cisText] [varchar](50) NULL,
	[cisTextLower]  AS (lower([cisText])) PERSISTED,
 CONSTRAINT [PK_CaseInSensitive] PRIMARY KEY CLUSTERED 
(
	[cisId] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
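
 

Rather than recreating the table, the existing one can simply be altered. Something like the following adds the persisted computed column and the index on it (the index name is my own choice):

ALTER TABLE [dbo].[CaseInSensitive]
	ADD [cisTextLower] AS (LOWER([cisText])) PERSISTED

CREATE NONCLUSTERED INDEX [IX_CaseInSensitive_cisTextLower]
	ON [dbo].[CaseInSensitive] ([cisTextLower])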

 

So now when we run this SQL across our database we get the correct results with almost no performance loss.

 

SELECT COUNT(*) FROM CaseInSensitive
	WHERE cisTextLower = LOWER('ABC')
	
SELECT COUNT(*) FROM CaseInSensitive
	WHERE cisText COLLATE SQL_Latin1_General_CP1_CI_AS = 'abc'

 

 

The first SQL statement had a total of 3 logical reads. The second still had 88 logical reads, as it was performing an index scan.

 

The following is a screenshot of the actual execution plan, which really shows the difference in performance (not the performance relative to the batch). Bear in mind this was over 20,000 rows; think of how performance would be affected over a much larger data set. Even with 1,000,000 rows the index seek would probably still perform fewer than 10 logical reads.

 
