The above table is just to give you a visual sample of the available ASCII characters. Now, using our new Tally table, we're going to "unpivot" a string (§KSQL.^TEST ) into rows, determine the ASCII number for each character, and include a RowID to boot (the RowID will be very important shortly).
-- "Unpivot" a string and view each ASCII number
DECLARE @String VARCHAR(15) = '§KSQL.^TEST ' ;
Item = LEFT(SUBSTRING(@String, t.RowID, LEN(@String)), 1),
ASCII_Number = ASCII(LEFT(SUBSTRING(@String, t.RowID, LEN(@String)), 1))
FROM (SELECT RowID = TallyID FROM dbo.Tally WHERE TallyID <= LEN(@String)) AS T;
There's an application that I work with that doesn't properly sanitize input; I'm sure this is an isolated case and noneof you have experienced this, but speaking from the hypothetical here is a solution. I stumbled on a query plan last week that was outputting 2 trillion rows when, at the very most, should have had a maximum of 500k. It was causing performance problems (shocking!) and I took the usual optimization steps but found out that including a LTRIM( RTRIM ( ) ) function in a join was one of the causes of the "fat execution plan lines." I removed the function from the join and found that execution time dropped from 12 seconds to less then 1 and the plan was significantly improved. And then found about 50 other sprocs that would be happier if I did the same. Sounded like a good business case to me.
I'll admit I still don't know why this had the effect it did but regardless I knew that having this function in so many joins is a symptom of a common underlying cause: bandaid code written to accommodate bad data. That I can fix.
It's not as easy as LTRIM-RTRIM. Those functions don't strip ASCII characters such as tabs or carriage returns.
I created a trigger on two of our tables with the heaviest use that strips away any unwanted ASCII characters from ONLY the beginning or the end of the string. Doing a find and replace is relatively simple, but only touching leading and trailing characters AND being choosy about which to whitelist? More fun. Here's the scoop below.
The first step is to get a visual of what you're working with. [master].[dbo].[spt_values], my favorite cuddly undocumented table, can help you do this. The first step is to create a Tally table if you don't have one already. I used this code from StackOverflow as a starting point. I kept my rowcount small which isn't typical but I'm only using it on small strings for the time being.
IF OBJECT_ID('dbo.TallyTable','U') IS NOT NULL DROP TABLE dbo.TallyTable;
SELECT TOP 256 IDENTITY(INT,1,1) AS Number
FROM sys.objects AS S1
CROSS JOIN sys.objects AS S2;
ALTER TABLE dbo.TallyTable
ADD CONSTRAINT PK_NumbersTest PRIMARY KEY CLUSTERED (Number);
[spt_values] contains the ASCII characters (and every character or string in existence in the database) that I want to allow as leading or trailing characters in my field. In this scenario, I only allowed values between 33 and 126 but you should customize the range to your needs.
-- View all ASCII Characters
SELECT ASCII_Number = number,
ASCII_Character = CHAR(number)
WHERE [type] = 'p'
AND number BETWEEN 33 AND 255
ORDER BY number;
In order to prevent unwanted characters from the start and end of our columns, a group effort between Ben Teraberry, Jake Minette and myself came up with the following trigger as our final solution. It handles any inserts or updates (batches too! That was an issue with v1.0) by using the same method above with the Tally table, by finding the first and last acceptable character, and then updating the field with the trimmed string.The app can give us crap, but we no longer allow it.
SET S.StringThing =
(SELECT SUBSTRING(S.StringThing, FirstChar, LastChar - FirstChar + 1) FROM
(SELECT FirstChar = MIN(TallyID), LastChar = MAX(TallyID) FROM dbo.Tally WHERE
TallyID = LEN(s.StringThing)
AND ASCII(SUBSTRING(S.StringThing, TallyID, 1)) BETWEEN 33 AND 126) AS Q),
S.CharThing = (SELECT SUBSTRING(S.CharThing, FirstChar, LastChar - FirstChar + 1)
FROM (SELECT FirstChar = MIN(TallyID), LastChar = MAX(TallyID) FROM dbo.Tally
WHERE TallyID = LEN(S.CharThing)
AND ASCII(SUBSTRING(S.CharThing, TallyID, 1)) BETWEEN 33 AND 126)AS Q),
FROM dbo.SampleData AS S
INNER JOIN INSERTED AS I ON S.RowID = I.RowID;
Let me know if you get any errors creating that trigger. Now, the fun part! Below is a series of queries that attempt to input bad data, which are clean by the trigger. Run them and see how they are intelligently concatenated to insert clean data!
SELECT * FROM dbo.SampleData;
UPDATE dbo.SampleData SET StringThing = 'abcdefg¥' WHERE RowID = 1;
UPDATE dbo.SampleData SET StringThing = ' abcdefg ¥' WHERE RowID = 1;
UPDATE dbo.SampleData SET StringThing = ' abcdefg h' WHERE RowID = 1;
One thing to mention is that with a CHAR datatype, you will always see the "empty spaces" in the app and in SSMS. I thought I was going mad because I had trimmed a CHAR(10) to 8 characters and my SSMS cursor kept indicating 10 spaces. It's true! A join with LTRIM-RTRIM will work, but the space is there and used. Avoid CHAR datatypes ;-)
We've used our Tally table to split the string out, order by RowID, and return the number of each corresponding ASCII character. Now to get down to business. Let's create a sample table, populate some data, and use this code.
-- Create sample data table
IF OBJECT_ID('dbo.SampleData','U') IS NOT NULL DROP TABLE dbo.SampleData;
CREATE TABLE dbo.SampleData(
RowID INT IDENTITY (1,1)NOT NULL,
INSERT INTO dbo.SampleData(StringThing, CharThing)
SELECT * FROM dbo.SampleData;