T-SQL equals and group by ignores trailing spaces
Recently I found a quirk of T-SQL, where a group by statement was treating strings as the same if the only difference was one or more trailing spaces. So, ‘aa’ would be grouped with ‘aa ‘. I did some digging, and this is what I found.
I started with just looking at equals predicates, and like predicates. using the following code:
SELECT
CASE
WHEN 'a ' = 'a'
THEN 1
ELSE 0
END AS EqualsTrailingSpaceFirst
, CASE
WHEN 'a' = 'a '
THEN 1
ELSE 0
END AS EqualsTrailingSpaceLast
, CASE
WHEN ' a' = 'a'
THEN 1
ELSE 0
END AS EqualsLeadingSpace
, CASE
WHEN 'a ' LIKE '%a'
THEN 1
ELSE 0
END AS LikeTrailingSpaceFirst
, CASE
WHEN 'a' LIKE '%a '
THEN 1
ELSE 0
END AS LikeTrailingSpaceLast
, CASE
WHEN 'a ' LIKE 'a'
THEN 1
ELSE 0
END AS LikeTrailingSpaceNoWildcard
, CASE
WHEN N'a ' = N'a'
THEN 1
ELSE 0
END AS EqualsTrailingSpaceNVARCHAR
, CASE
WHEN CAST('a ' AS CHAR(2)) = CAST('a' AS CHAR(2))
THEN 1
ELSE 0
END AS EqualsTrailingSpaceCHAR2;
The only cases where the two values didn’t match was when I put a leading space, and when I added a trailing space to the pattern the LIKE
was matching on. Including one or more trailing spaces on either side of the equals operator, or in the string being checked in the LIKE
operator made no difference to the result.
I did a bit more digging and came up with this article from Microsoft, explaining what’s happening. Under the hood, if you attempt to check if any two strings are the same, SQL Server adds trailing spaces to the shorter string to make their lengths equal. So WHERE 'abc' = 'a'
is actually translated to WHERE 'abc' = 'a '
under the hood. In the case of strings where the only differences are trailing spaces, this makes the strings the same before the comparison takes place.
It’s probably worth noting that this is consistent with the ANSI/ISO SQL-92 specifications, and we should therefore expect this behaviour in any SQL language (although that’s not something I’ve checked).
This also explained what is happening with the GROUP BY
. Essentially a GROUP BY
is a long series of equals checks under the hood. The engine will take the first value, check if it equals the second value, if it does there’s no change and if it doesn’t the engine adds the second value to the table of distinct values. We can see the results of this with the following code:
SELECT
val.Value1
FROM
(
VALUES
('a')
, ('a ')
, ('a ')
, ('b')
) AS val(Value1)
GROUP BY
val.Value1;
This only returns two values, despite us seeing it as four distinct strings.
It’s also essentially random which ‘a’ variant gets returned. There isn’t a way to check with the LEN
function, as that returns the length ignoring trailing spaces, but we can check with the DATALENGTH
function. This returns the size of the value in bytes, so for a VARCHAR
it will return the number of characters including empty spaces (I could have also added a character to the end of the string to show what is happening here). So, to check what ‘a’ variant is returned I ran this:
WITH GroupedValues AS
(
SELECT
val.Value1
FROM
(
VALUES
('a')
, ('a ')
, ('a ')
, ('b')
) AS val(Value1)
GROUP BY
val.Value1
)
SELECT
GroupedValues.Value1
, LEN(GroupedValues.Value1)
, DATALENGTH(GroupedValues.Value1)
FROM GroupedValues;
And it came back with a DATALENGTH
of 3, so it’s picked out the ‘a’ with two trailing spaces. I then moved that row, to the end of the VALUES
block and ran it again:
WITH GroupedValues AS
(
SELECT
val.Value1
FROM
(
VALUES
('a')
, ('a ')
, ('b')
, ('a ')
) AS val(Value1)
GROUP BY
val.Value1
)
SELECT
GroupedValues.Value1
, LEN(GroupedValues.Value1)
, DATALENGTH(GroupedValues.Value1)
FROM GroupedValues;
This time it gave a DATALENGTH
of 2, so it returned a different ‘a’ variant. This, again, is expected behaviour. If I had to guess what is happening, the SQL Server engine is taking the first value it finds in that VALUES
block, and checking the others against it. Because it sees the other ‘a’ variants as the same, it’s just keeping that first value, and one of the basic principals of SQL Server is that data sets are not returned in a guaranteed order (unless you use an ORDER BY
, but that only works in the final displayed result or for things like TOP 1
, it’s not applicable in this situation.
So, this is something to keep an eye on. I ran into this issue because the results of this query were being used by other processes that did care about trailing spaces, and we were returning values with different numbers of trailings spaces on different days. It meant our destination table was ending up with what SQL Server considered duplicates, but the intermediate SSIS process didn’t, which was less than ideal and is taking some time to clean up.
Trackbacks & Pingbacks