Skip to content

T-SQL equals and group by ignores trailing spaces

March 31, 2023

Recently I found a quirk of T-SQL, where a group by statement was treating strings as the same if the only difference was one or more trailing spaces. So, ‘aa’ would be grouped with ‘aa ‘. I did some digging, and this is what I found.

I started with just looking at equals predicates, and like predicates. using the following code:

SELECT
      CASE
          WHEN 'a ' = 'a'
          THEN 1
          ELSE 0
      END AS EqualsTrailingSpaceFirst
    , CASE
          WHEN 'a' = 'a '
          THEN 1
          ELSE 0
      END AS EqualsTrailingSpaceLast
    , CASE
          WHEN ' a' = 'a'
          THEN 1
          ELSE 0
      END AS EqualsLeadingSpace
    , CASE
          WHEN 'a ' LIKE '%a'
          THEN 1
          ELSE 0
      END AS LikeTrailingSpaceFirst
    , CASE
          WHEN 'a' LIKE '%a '
          THEN 1
          ELSE 0
      END AS LikeTrailingSpaceLast
    , CASE
          WHEN 'a ' LIKE 'a'
          THEN 1
          ELSE 0
      END AS LikeTrailingSpaceNoWildcard
    , CASE
          WHEN N'a ' = N'a'
          THEN 1
          ELSE 0
      END AS EqualsTrailingSpaceNVARCHAR
    , CASE
          WHEN CAST('a ' AS CHAR(2)) = CAST('a' AS CHAR(2))
          THEN 1
          ELSE 0
      END AS EqualsTrailingSpaceCHAR2;

The only cases where the two values didn’t match was when I put a leading space, and when I added a trailing space to the pattern the LIKE was matching on. Including one or more trailing spaces on either side of the equals operator, or in the string being checked in the LIKE operator made no difference to the result.

I did a bit more digging and came up with this article from Microsoft, explaining what’s happening. Under the hood, if you attempt to check if any two strings are the same, SQL Server adds trailing spaces to the shorter string to make their lengths equal. So WHERE 'abc' = 'a' is actually translated to WHERE 'abc' = 'a ' under the hood. In the case of strings where the only differences are trailing spaces, this makes the strings the same before the comparison takes place.

It’s probably worth noting that this is consistent with the ANSI/ISO SQL-92 specifications, and we should therefore expect this behaviour in any SQL language (although that’s not something I’ve checked).

This also explained what is happening with the GROUP BY. Essentially a GROUP BY is a long series of equals checks under the hood. The engine will take the first value, check if it equals the second value, if it does there’s no change and if it doesn’t the engine adds the second value to the table of distinct values. We can see the results of this with the following code:

SELECT
      val.Value1
FROM
    (
        VALUES
              ('a')
            , ('a  ')
            , ('a ')
            , ('b')
    ) AS val(Value1)
GROUP BY
      val.Value1;

This only returns two values, despite us seeing it as four distinct strings.

It’s also essentially random which ‘a’ variant gets returned. There isn’t a way to check with the LEN function, as that returns the length ignoring trailing spaces, but we can check with the DATALENGTH function. This returns the size of the value in bytes, so for a VARCHAR it will return the number of characters including empty spaces (I could have also added a character to the end of the string to show what is happening here). So, to check what ‘a’ variant is returned I ran this:

WITH GroupedValues AS
    (
        SELECT
              val.Value1
        FROM
            (
                VALUES
                      ('a')
                    , ('a  ')
                    , ('a ')
                    , ('b')
            ) AS val(Value1)
        GROUP BY
            val.Value1
    )
SELECT
      GroupedValues.Value1
    , LEN(GroupedValues.Value1)
    , DATALENGTH(GroupedValues.Value1)
FROM GroupedValues;

And it came back with a DATALENGTH of 3, so it’s picked out the ‘a’ with two trailing spaces. I then moved that row, to the end of the VALUES block and ran it again:

WITH GroupedValues AS
    (
        SELECT
              val.Value1
        FROM
            (
                VALUES
                      ('a')
                    , ('a ')
                    , ('b')
                    , ('a  ')
            ) AS val(Value1)
        GROUP BY
            val.Value1
    )
SELECT
      GroupedValues.Value1
    , LEN(GroupedValues.Value1)
    , DATALENGTH(GroupedValues.Value1)
FROM GroupedValues;

This time it gave a DATALENGTH of 2, so it returned a different ‘a’ variant. This, again, is expected behaviour. If I had to guess what is happening, the SQL Server engine is taking the first value it finds in that VALUES block, and checking the others against it. Because it sees the other ‘a’ variants as the same, it’s just keeping that first value, and one of the basic principals of SQL Server is that data sets are not returned in a guaranteed order (unless you use an ORDER BY, but that only works in the final displayed result or for things like TOP 1, it’s not applicable in this situation.

So, this is something to keep an eye on. I ran into this issue because the results of this query were being used by other processes that did care about trailing spaces, and we were returning values with different numbers of trailings spaces on different days. It meant our destination table was ending up with what SQL Server considered duplicates, but the intermediate SSIS process didn’t, which was less than ideal and is taking some time to clean up.

From → SQL Server, T-SQL

One Comment

Trackbacks & Pingbacks

  1. ANSI SQL and Trailing Spaces – Curated SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: