Speaking at a user group and epic life quest

One of my goals for 2023 was to speak at a SQL user group again. I set this because in 2021 I’d felt like I was starting to build a speaking profile. I’d spoken at New Stars of Data, and that had gone really well, and I’d applied to GroupBy and DataMinutes and been accepted for both. For reasons I’ve already gone into, I didn’t keep going with it, but it felt like I had some momentum at the time and giving a talk at a user group seemed like a good first step to getting that back without setting myself too big a challenge.

Anyway, last Thursday I spoke at Data Bristol. I gave a modified version of the Visual Studio database projects talk from New Stars of Data, gearing it more towards people who didn’t necessarily have any prior experience with the tool. That felt important for a user group audience, where people don’t get a choice of sessions: it’s fine to pitch an expert-level talk at a large conference, where attendees have other sessions they can go to instead, but at a user group everyone has to listen to your talk, so I don’t feel I can assume existing knowledge.

I felt like the talk went pretty well. I wasn’t as prepared as I was for the New Stars of Data talk, and that showed in a couple of places where the transitions between slides weren’t that smooth, or where I lost the thread of what I was saying a bit. I also felt that the lack of prep affected my confidence, and because of that I didn’t project well enough in places. With that said, I got some great feedback and a few questions, so overall I feel pretty good.

I also feel like that was a good warmup for maybe doing some other talks, so I’m going to keep my eyes open for other opportunities and start applying. I do feel like database projects is a good topic for a talk, and using them to split your database over multiple projects and re-using code across databases is a really powerful tool that nobody else is really talking about.

Epic life quest

Speaking at a user group wasn’t a distinct goal of my epic life quest, but there is a goal to speak 5 times at user groups, and this takes that up to 3.

However, I also got some non-SQL news this week. One of the goals I had was to become a faction ref at my larp system. This didn’t quite happen, but instead I applied this year to be a member of the game team, and I heard on Friday that they’d accepted my application. Larp is a ridiculous hobby that I love doing (there’s a very cool video that my main system has put out as an advertisement) and being on game team means I will be helping to run things (as a volunteer) and bring some of the joy I’ve got from it to other people.

I’m therefore counting this as a point in my epic life quest, and that gives me all 5 points, which means I’ve completed level 8. Hopefully I can push on a bit this year and make some progress towards level 9 as well!

Granularity in queries

Today’s post might be a bit of a rant. I’d like to talk about granularity in queries for a minute, and what I think are better or worse ways to lay your code out when you’re dealing with tables of different granularity. To start off, I’ve got an example of the type of query structure that really bugs me, taken from the WideWorldImporters database that Microsoft use as their sample database.

SELECT
      Ord.OrderID
    , Ord.OrderDate
    , Ppl_PickedBy.LogonName AS LogonOfOrderPicker
    , Ord.DeliveryInstructions
    , Ord.InternalComments
    , SUM(OrdLn.Quantity*OrdLn.UnitPrice) AS OrderCost
FROM Sales.Orders AS Ord
INNER JOIN Sales.OrderLines AS OrdLn
    ON Ord.OrderID = OrdLn.OrderID
INNER JOIN [Application].People AS Ppl_PickedBy
    ON Ord.PickedByPersonID = Ppl_PickedBy.PersonID
GROUP BY
      Ord.OrderID
    , Ord.OrderDate
    , Ppl_PickedBy.LogonName
    , Ord.DeliveryInstructions
    , Ord.InternalComments

This query is pretty simple: we want to return a set of data about the orders placed, including the total cost of each order. However, the orders in the database are split into an Orders table that holds the order-level information and an OrderLines table that holds information on each product ordered in each order, and we need that OrderLines table to get the cost.

So we write our query, and we join to the OrderLines table, and we sum the cost. But then, because we have an aggregate, we need to group by every other column we’re returning. And to me that just doesn’t look right, and it doesn’t convey the intention of the query properly.

What I mean by that intention remark is that we are trying to return a data set where each row represents one order, so our data set should exist at the OrderID grain. But looking at the GROUP BY doesn’t tell us that. Instead it says we are returning something with a composite grain that includes columns like DeliveryInstructions and InternalComments, when those are really just attributes we want to hang off the OrderID.

A better approach, in my opinion, is to categorise any tables you’re joining to as one of three types: the primary table (or tables, in some cases) that holds data at the query’s grain; lookup tables that return attributes and measures that exist at a higher grain; and detail tables that return measures you need to sum up to the correct grain.

Your primary table(s) should ideally be first in the FROM clause, and any lookup tables can then be joined to them without an issue. The detail tables should be moved out to a CTE (or sub-query, or view, or something else) to sum them up to the granularity you want for the overall query. For example, I think this is a significant improvement on the above query:

WITH cte_OrderLines AS
    ( /*Sales.OrderLines data summed to OrderID granularity*/
        SELECT
              OrdLn.OrderID
            , SUM(OrdLn.Quantity*OrdLn.UnitPrice) AS OrderCost
        FROM Sales.OrderLines AS OrdLn
        GROUP BY
              OrdLn.OrderID
    )
SELECT
      Ord.OrderID
    , Ord.OrderDate
    , Ppl_PickedBy.LogonName AS LogonOfOrderPicker
    , Ord.DeliveryInstructions
    , Ord.InternalComments
    , OrdLn.OrderCost
FROM Sales.Orders AS Ord
INNER JOIN cte_OrderLines AS OrdLn
    ON Ord.OrderID = OrdLn.OrderID
INNER JOIN [Application].People AS Ppl_PickedBy
    ON Ord.PickedByPersonID = Ppl_PickedBy.PersonID

We now don’t have a GROUP BY clause in the main query at all.

This approach also makes it easier to add other data that’s stored at a finer grain than OrderID. For instance, if we add the Sales.Invoices table because we want to get the total amount invoiced for each order, it joins via OrderID, so there could potentially be multiple invoices for the same order. If we add this to the original query there is clearly a high risk of duplication, but if we add it as another CTE at the start of the second query (as sketched below) then that risk is mitigated in a more graceful manner, and the overall query is easier to read (at least in my opinion).
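
As a sketch of what that might look like (I’m writing the invoice column names from memory, so treat Sales.Invoices.OrderID and Sales.InvoiceLines.ExtendedPrice as assumptions rather than gospel), the invoiced amount gets its own CTE at the OrderID grain and then just joins in alongside the order lines:

WITH cte_OrderLines AS
    ( /*Sales.OrderLines data summed to OrderID granularity*/
        SELECT
              OrdLn.OrderID
            , SUM(OrdLn.Quantity*OrdLn.UnitPrice) AS OrderCost
        FROM Sales.OrderLines AS OrdLn
        GROUP BY
              OrdLn.OrderID
    )
, cte_Invoices AS
    ( /*Invoice data summed to OrderID granularity*/
        SELECT
              Inv.OrderID
            , SUM(InvLn.ExtendedPrice) AS InvoicedAmount
        FROM Sales.Invoices AS Inv
        INNER JOIN Sales.InvoiceLines AS InvLn
            ON Inv.InvoiceID = InvLn.InvoiceID
        GROUP BY
              Inv.OrderID
    )
SELECT
      Ord.OrderID
    , Ord.OrderDate
    , OrdLn.OrderCost
    , OrdInv.InvoicedAmount
FROM Sales.Orders AS Ord
INNER JOIN cte_OrderLines AS OrdLn
    ON Ord.OrderID = OrdLn.OrderID
INNER JOIN cte_Invoices AS OrdInv
    ON Ord.OrderID = OrdInv.OrderID;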

This might not seem hugely significant when looking at the examples I’ve given here, but I’ve seen some extremely bloated queries that tried to go down the first approach and ended up with long lists of columns and calculations in the GROUP BY clause that really had nothing to do with the query granularity. I’d like to not have to deal with those types of query again, but I suspect I will end up digging through many more of them in my career.

2023 goals – Q1 update

I set myself some goals at the start of the year and, as part of keeping myself accountable, I want to give some roughly quarterly updates on how they’re going.

So:

  • Weigh under 95kg every day for 3 months – This is not going well. I lost some weight at the start of the year, but put quite a bit of it back on in the last month. I did stop focussing on this in the run-up to my strongman competition, so now that’s out of the way this is going to be a major focus for Q2.
  • Deadlift 200kg – I haven’t done a proper check of my max deadlift, but I know it improved in the run-up to the strongman comp. I feel like I’m close to this, and won’t have any issues hitting it this year.
  • Deadlift 250kg – I’m less certain about hitting this before the end of the year, but it still feels possible. If I do, it will likely be sometime in Q4.
  • Complete every event in Bristol’s Strongest Novice – Success! This happened a few weeks ago, and I managed to lift every weight in the deadlift ladder, get every sandbag weight over the bar in the sandbag throw, complete the sandbag carry and sled pull medley well under time, and get several reps on the log press and sandbag over bar events. I feel really good about all of that.
  • Finish a half marathon – I’ve booked one in for the end of September. Training so far hasn’t gone great, but that will be another focus for Q2 along with the weight loss, as the two go well together.
  • Paint a Warhammer 40k combat patrol – I’m almost there, just need to find a few evenings to get this done.
  • Paint a Warhammer 40k incursion force – This is probably something for Q3.
  • Paint a gang/team/squad/something else for one of the Games Workshop boxed games – there’s a campaign starting up soon, so that should provide good motivation to get this done. Hopefully it can get completed by the end of Q2.
  • Complete my paint tracking app – No progress so far, this is something I need to find more time to work on.
  • Blog every week for 6 months – This will be my eighth consecutive week blogging, so 1/3 of the way there.
  • Speak at another SQL Server user group – I’m speaking at the Bristol user group 2 weeks from today, and I’m going to use the long Easter weekend to get the talk up to standard.
  • Sort out a group tent for larp – For various reasons, this is probably being crossed off the list.
  • Book another big holiday – I’ve looked at a few, but not taken the plunge on any yet.
  • Have 5 smaller adventures this year – Booked/arranged so far: LoveTrails running festival in July, Tithing larp in June, mud run in December. For the other two, I’m thinking trips to London to see some shows.

Although I’ve not hit many of my goals yet, I feel like I’m making solid progress on a lot of them. For Q2 I need to focus on the running and weight loss, and keep progressing with the blogging and the deadlifts. I’ll cross off the SQL user group talk in Q2, and hopefully the 200kg deadlift and one or two of the painting goals as well. The main thing I’m concerned about is the painting app, I’d ideally want to see some solid progress on that over the next 3 months but I’m not sure I’ll have the time for it.

T-SQL equals and GROUP BY ignore trailing spaces

Recently I found a quirk of T-SQL, where a group by statement was treating strings as the same if the only difference was one or more trailing spaces. So, ‘aa’ would be grouped with ‘aa ‘. I did some digging, and this is what I found.

I started by looking at just equals predicates and LIKE predicates, using the following code:

SELECT
      CASE
          WHEN 'a ' = 'a'
          THEN 1
          ELSE 0
      END AS EqualsTrailingSpaceFirst
    , CASE
          WHEN 'a' = 'a '
          THEN 1
          ELSE 0
      END AS EqualsTrailingSpaceLast
    , CASE
          WHEN ' a' = 'a'
          THEN 1
          ELSE 0
      END AS EqualsLeadingSpace
    , CASE
          WHEN 'a ' LIKE '%a'
          THEN 1
          ELSE 0
      END AS LikeTrailingSpaceFirst
    , CASE
          WHEN 'a' LIKE '%a '
          THEN 1
          ELSE 0
      END AS LikeTrailingSpaceLast
    , CASE
          WHEN 'a ' LIKE 'a'
          THEN 1
          ELSE 0
      END AS LikeTrailingSpaceNoWildcard
    , CASE
          WHEN N'a ' = N'a'
          THEN 1
          ELSE 0
      END AS EqualsTrailingSpaceNVARCHAR
    , CASE
          WHEN CAST('a ' AS CHAR(2)) = CAST('a' AS CHAR(2))
          THEN 1
          ELSE 0
      END AS EqualsTrailingSpaceCHAR2;

The only cases where the two values didn’t match were when I put in a leading space, and when I added a trailing space to the pattern the LIKE was matching on. Including one or more trailing spaces on either side of the equals operator, or in the string being checked by the LIKE operator, made no difference to the result.

I did a bit more digging and came up with this article from Microsoft explaining what’s happening. Under the hood, if you attempt to check whether any two strings are the same, SQL Server pads the shorter string with trailing spaces to make their lengths equal. So WHERE 'abc' = 'a' is actually evaluated as WHERE 'abc' = 'a  ' under the hood. In the case of strings where the only differences are trailing spaces, this makes the strings identical before the comparison takes place.

It’s probably worth noting that this is consistent with the ANSI/ISO SQL-92 specifications, and we should therefore expect this behaviour in any SQL language (although that’s not something I’ve checked).

This also explains what is happening with the GROUP BY. Essentially a GROUP BY is a long series of equality checks under the hood. The engine takes the first value, then checks whether the next value equals it; if it does there’s no change, and if it doesn’t the engine adds that value to its table of distinct values. We can see the results of this with the following code:

SELECT
      val.Value1
FROM
    (
        VALUES
              ('a')
            , ('a  ')
            , ('a ')
            , ('b')
    ) AS val(Value1)
GROUP BY
      val.Value1;

This only returns two values, despite us seeing it as four distinct strings.

It’s also essentially random which ‘a’ variant gets returned. There isn’t a way to check with the LEN function, as that returns the length ignoring trailing spaces, but we can check with the DATALENGTH function. This returns the size of the value in bytes, so for a VARCHAR it will return the number of characters including trailing spaces (I could also have added a character to the end of the string to show what is happening here). So, to check which ‘a’ variant is returned I ran this:

WITH GroupedValues AS
    (
        SELECT
              val.Value1
        FROM
            (
                VALUES
                      ('a')
                    , ('a  ')
                    , ('a ')
                    , ('b')
            ) AS val(Value1)
        GROUP BY
            val.Value1
    )
SELECT
      GroupedValues.Value1
    , LEN(GroupedValues.Value1)
    , DATALENGTH(GroupedValues.Value1)
FROM GroupedValues;

And it came back with a DATALENGTH of 3, so it’s picked out the ‘a’ with two trailing spaces. I then moved that row to the end of the VALUES block and ran it again:

WITH GroupedValues AS
    (
        SELECT
              val.Value1
        FROM
            (
                VALUES
                      ('a')
                    , ('a ')
                    , ('b')
                    , ('a  ')
            ) AS val(Value1)
        GROUP BY
            val.Value1
    )
SELECT
      GroupedValues.Value1
    , LEN(GroupedValues.Value1)
    , DATALENGTH(GroupedValues.Value1)
FROM GroupedValues;

This time it gave a DATALENGTH of 2, so it returned a different ‘a’ variant. This, again, is expected behaviour. If I had to guess what is happening, the SQL Server engine is taking the first value it finds in that VALUES block and checking the others against it. Because it sees the other ‘a’ variants as the same, it’s just keeping that first value, and one of the basic principles of SQL Server is that data sets are not returned in a guaranteed order (unless you use an ORDER BY, but that only works for the final displayed result or for things like TOP 1, so it’s not applicable in this situation).

So, this is something to keep an eye on. I ran into this issue because the results of this query were being used by other processes that did care about trailing spaces, and we were returning values with different numbers of trailing spaces on different days. It meant our destination table was ending up with rows that SQL Server considered duplicates but the intermediate SSIS process didn’t, which was less than ideal and is taking some time to clean up.
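
If you do need the grouping to treat those variants as distinct, one possible workaround (a sketch, and certainly not the only option) is to also group on a binary representation of the string, because binary comparisons don’t pad with trailing spaces:

SELECT
      val.Value1
    , DATALENGTH(val.Value1) AS ByteLength
FROM
    (
        VALUES
              ('a')
            , ('a  ')
            , ('a ')
            , ('b')
    ) AS val(Value1)
GROUP BY
      val.Value1
    , CAST(val.Value1 AS VARBINARY(10)); /*binary comparison, so trailing spaces count*/

This returns all four values, at the cost of a slightly odd-looking GROUP BY.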

Believe the code over what people tell you

This is just a quick post, about a development principle that I think applies across pretty much all development languages.

Always believe the code over what someone else tells you.

A good example in databases is NULLable columns. You will sometimes get the situation where you are selecting from a NULLable column, but someone in the business tells you there can’t ever be NULLs in there because of how the application works. I get very skeptical of this: users can always find a way to make applications do strange things; applications change over time, so just because something is impossible now doesn’t mean it always has been; and it’s always possible to change the data in ways that bypass the particular application being discussed (a manual data fix, for example).

The first thing to do in that scenario is check if there are any NULLs in there, because if there are you can just go back with the evidence. However, sometimes there won’t be, and sometimes you’ll even run into people who will double down on their assertion that there can’t be NULLs in that column despite the evidence. In that case you have two options: either you code to account for the possibility of NULLs anyway, which may involve making some assumptions about what the business wants to replace them with; or you code as if the column is not NULLable, but if you do that I strongly advise getting the assurances that the data will never be NULL in writing.
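
To make the first option concrete, here’s a minimal sketch of what “code for NULLs anyway” tends to look like (dbo.Customers, EmailAddress and the fallback label are all hypothetical; the point is just that the assumption is visible in the code):

SELECT
      Cst.CustomerID
    , COALESCE(Cst.EmailAddress, 'unknown') AS EmailAddress /*assumed fallback agreed with the business*/
FROM dbo.Customers AS Cst;

Even if the column genuinely never contains NULLs, the COALESCE costs very little and documents the decision for the next developer.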

Which route you end up going down is very much down to your situation at the company, the seniority of the person telling you to not code for NULLs, how serious it would be if there are NULLs, and a whole load of other factors that fall under the broad heading of ‘office politics’ (and just typing that makes me feel a bit dirty inside).

It’s also important to note that this isn’t limited to NULLability; that’s just a convenient example. I had a case once where we had a complaint that related to one of our core processes. I investigated and found a fairly serious flaw that would have major implications for a lot of the work we had done for this client. I took my findings to the person pushing this investigation, who was also the CEO. He flat out refused to believe what I was saying and insisted the problem had to be elsewhere. In this case, I stuck to my assertion that this was what the code was doing, and ended up being pulled off the investigation and replaced with a developer who was more likely to go along with the CEO (that developer privately admitted to me that the process was known to not work correctly). It wasn’t the ideal result for the company, but it didn’t do my standing in the company any real harm, and it meant I was kept away from any fallout that might ensue from the investigation.

In this case the biggest factor in me sticking to my guns was the seriousness of the issue. I definitely didn’t want anyone to be able to say I gave bad advice on this, or misrepresented the code, and a small loss of standing with the CEO was a price I was willing to pay to either get the right result or get me off the project.

TSQLTuesday #160: Microsoft Open AI Wish List

This is my 11th TSQLTuesday, and this month Damien Jones has given the prompt: “What’s on your wishlist for the partnership between Microsoft and OpenAI?”

This is probably going to be quite a short blog, both because I’m pressed for time today, and also because unlike most people I haven’t spent much time looking at the new shiny AI stuff that’s come out recently. There was someone at my company who gave a talk a few weeks ago about how he sees AI being able to write the simpler code that we currently end up repeating over and over again (as a BI developer, merging into a slowly changing dimension springs to mind here), and I think there is something to talk about there, but that’s not what I’m going to talk about today.

Instead, the main hope I have for the Microsoft/OpenAI partnership is that it will produce something to make ingesting files easier.

I have spent a lot of my career writing ETL processes to ingest data into a data warehouse, usually via some sort of staging layer. Inevitably, some of that data will come from files rather than other databases, and that is a problem. The reasons I’ve needed to import file data have varied: sometimes my employer dealt with third parties who were only willing to send the data in file format; sometimes we needed ways for other people in the business to manage some of the data, and for whatever reason nobody was willing or able to build a proper master data system; and one company’s entire business model was managing data for other companies, so we needed to ingest data from both our clients and the third parties they had contracts with. The one thing all of these scenarios had in common was that the files did not remain consistent. By this, I mean everything could change. Some examples off the top of my head:

  • Header text could change
  • Column order could change
  • Data types in the same column could change
  • Headers could be on the wrong columns (always fun)
  • New columns could appear
  • The number of rows before the data started could change
  • Values might be wrapped in “” one month and not the next
  • In Excel, sheet names could change
  • In Excel, region names could change if we were picking only the data in a particular region
  • I even had one case where row delimiters would sometimes change

Needless to say, this is always extremely frustrating to deal with. I’ve come up with various ways to mitigate some of these issues over the years, usually involving something like checking inside the file, building a basic import table based off that, and dynamically building an import and cleanse process that would hopefully produce the columns we needed at the end of it.
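
For what it’s worth, a very rough sketch of that “peek at the header, build a wide staging table” idea might look something like this (the file path, delimiter, staging schema and table name are all made up for illustration, and it assumes SQL Server 2017+ for STRING_AGG):

DECLARE @HeaderLine NVARCHAR(MAX);
DECLARE @Sql NVARCHAR(MAX);

/*Read the whole file as a single blob and keep just the first line.
  OPENROWSET(BULK ...) needs a literal path, so it's hard-coded here.*/
SELECT @HeaderLine = LEFT(f.BulkColumn, CHARINDEX(CHAR(10), f.BulkColumn + CHAR(10)) - 1)
FROM OPENROWSET(BULK 'C:\Import\supplier_feed.csv', SINGLE_CLOB) AS f;

/*Build one NVARCHAR(MAX) column per header value, so the load survives
  reordered columns and shifting data types; cleansing happens later.
  STRING_SPLIT doesn't guarantee ordering before SQL Server 2022's ordinal
  argument, which is one of the reasons this stays a sketch.*/
SELECT @Sql =
      N'CREATE TABLE staging.SupplierFeed ('
    + STRING_AGG(CAST(QUOTENAME(LTRIM(RTRIM(REPLACE(hdr.value, CHAR(13), N'')))) + N' NVARCHAR(MAX) NULL' AS NVARCHAR(MAX)), N', ')
    + N');'
FROM STRING_SPLIT(@HeaderLine, N',') AS hdr;

EXEC sys.sp_executesql @Sql;

The real versions of this always grew far more defensive logic than that, which is exactly the sort of tedium I’d love an AI to take off my hands.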

This really feels like something an AI might be able to do, from my largely unqualified vantage point. The import is largely repetitive, but with some changes every time, so maybe you would need to train it on a set of files you already have, but I feel like it could learn the particular quirks unique to a data supplier and the best ways of dealing with them when they happen. I feel like I’m being quite unambitious here, talking about a tech that might change the world, but I’ve been at a company where the files they were receiving were so numerous, and so important to the business, that they employed half a dozen contractors just to write and troubleshoot file imports.

SELECT INTO with IDENTITY columns

This is a little quirk of SQL Server that I ran into a while ago.

We’ve probably all seen or used a SELECT... INTO script before, but just in case:

A SELECT... INTO script typically looks something like

SELECT
      *
INTO #Test1
FROM dbo.Test1;

And it will create a new table (#Test1) from the data selected in the SELECT statement, with the column definitions being pulled from the columns in the SELECT in the same way that the engine defines data types etc. for a view.

But what happens when one of the columns being selected is an identity column? To answer that, let’s run a quick test:

DROP TABLE IF EXISTS #Test1;
DROP TABLE IF EXISTS dbo.Test2;
DROP TABLE IF EXISTS dbo.Test3;

CREATE TABLE #Test1
    (
          Test1ID INT IDENTITY(1, 1) NOT NULL
        , SomeData VARCHAR(50) NOT NULL
    );

INSERT INTO #Test1
    (
          SomeData
    )
VALUES
      ('Some data')
    , ('Some more data')
    , ('Even more data')
    , ('And a bit more data');

SELECT
      *
INTO dbo.Test2
FROM #Test1;

SELECT
      Test1ID
INTO dbo.Test3
FROM #Test1;

What we’re doing here is creating a temp table (#Test1) with an identity column, then running a couple of SELECT... INTO statements that include that identity column to create new permanent tables in the dbo schema (Test2 and Test3). The reason I’m creating permanent tables is that it makes it easier for us to look at their definitions, which are:

CREATE TABLE [dbo].[Test2](
	[Test1ID] [int] IDENTITY(1,1) NOT NULL,
	[SomeData] [varchar](50) NOT NULL
);

CREATE TABLE [dbo].[Test3](
	[Test1ID] [int] IDENTITY(1,1) NOT NULL
);

So, we can see in this case the SELECT... INTO has created the Test1ID column as an IDENTITY column, the same as the source. The next question is what happens when the IDENTITY column exists at a different grain to the results?

DROP TABLE IF EXISTS #Test1;
DROP TABLE IF EXISTS #Test2;
DROP TABLE IF EXISTS dbo.Test3;

CREATE TABLE #Test1
    (
          Test1ID INT IDENTITY(1, 1) NOT NULL
        , SomeData VARCHAR(50) NOT NULL
    );

CREATE TABLE #Test2
    (
          Test2ID INT IDENTITY(1, 1) NOT NULL
        , Test1ID INT NOT NULL
        , SomeOtherData VARCHAR(50) NOT NULL
    );

SET IDENTITY_INSERT #Test1 ON;

INSERT INTO #Test1
    (
          Test1ID
        , SomeData
    )
VALUES
      (1, 'Some data')
    , (2, 'Some more data')
    , (3, 'Even more data')
    , (4, 'And a bit more data');

SET IDENTITY_INSERT #Test1 OFF;

INSERT INTO #Test2
    (
          Test1ID
        , SomeOtherData
    )
VALUES
      (1, 'DataData')
    , (1, 'DataDtaData')
    , (2, 'DataDataDataData')
    , (2, 'Too much Data')
    , (3, 'Not enough data')
    , (4, 'I also like data')
    , (4, 'Feed me data')
    , (4, 'I hunger for data');

SELECT
      T1.Test1ID
    , T1.SomeData
    , T2.Test2ID
    , T2.SomeOtherData
INTO dbo.Test3
FROM #Test1 AS T1
INNER JOIN #Test2 AS T2
    ON T1.Test1ID = T2.Test1ID;

So in this case we have two tables joined, both of which have IDENTITY columns that essentially make up their primary keys, but the result is:

CREATE TABLE [dbo].[Test3](
	[Test1ID] [int] NOT NULL,
	[SomeData] [varchar](50) NOT NULL,
	[Test2ID] [int] NOT NULL,
	[SomeOtherData] [varchar](50) NOT NULL
);

So no IDENTITY column in the newly created table. In this case the engine can’t see which of these should be IDENTITY columns. But what if we feed it a bit more information?

DROP TABLE IF EXISTS dbo.Test2;
DROP TABLE IF EXISTS dbo.Test1;
DROP TABLE IF EXISTS dbo.Test3;
DROP TABLE IF EXISTS dbo.Test4;

CREATE TABLE dbo.Test1
    (
          Test1ID INT IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED
        , SomeData VARCHAR(50) NOT NULL
    );

CREATE TABLE dbo.Test2
    (
          Test2ID INT IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED
        , Test1ID INT NOT NULL
        , SomeOtherData VARCHAR(50) NOT NULL
        , CONSTRAINT FK_Test2_Test1ID
            FOREIGN KEY (Test1ID)
            REFERENCES dbo.Test1(Test1ID)
    );

SET IDENTITY_INSERT dbo.Test1 ON;

INSERT INTO dbo.Test1
    (
          Test1ID
        , SomeData
    )
VALUES
      (1, 'Some data')
    , (2, 'Some more data')
    , (3, 'Even more data')
    , (4, 'And a bit more data');

SET IDENTITY_INSERT dbo.Test1 OFF;

INSERT INTO dbo.Test2
    (
          Test1ID
        , SomeOtherData
    )
VALUES
      (1, 'DataData')
    , (1, 'DataDtaData')
    , (2, 'DataDataDataData')
    , (2, 'Too much Data')
    , (3, 'Not enough data')
    , (4, 'I also like data')
    , (4, 'Feed me data')
    , (4, 'I hunger for data');

SELECT
      T1.Test1ID
    , T1.SomeData
    , T2.Test1ID AS FKTest1ID
    , T2.Test2ID
    , T2.SomeOtherData
INTO dbo.Test3
FROM dbo.Test2 AS T2
LEFT JOIN dbo.Test1 AS T1
    ON T1.Test1ID = T2.Test1ID;

SELECT
      Test2ID
    , SomeOtherData
INTO dbo.Test4
FROM dbo.Test2;

So here I’ve made it explicit that Test2ID is the primary key of dbo.Test2, added in a foreign key to dbo.Test1 to also make it explicit that the join won’t be introducing any duplicate rows, and even turned the join into a LEFT JOIN to make explicit to the engine that we’re returning every row from dbo.Test2, but the answer is still:

CREATE TABLE [dbo].[Test3](
	[Test1ID] [int] NOT NULL,
	[SomeData] [varchar](50) NOT NULL,
	[Test2ID] [int] NOT NULL,
	[SomeOtherData] [varchar](50) NOT NULL
);

CREATE TABLE [dbo].[Test4](
	[Test2ID] [int] IDENTITY(1,1) NOT NULL,
	[SomeOtherData] [varchar](50) NOT NULL
);

So we can see from dbo.Test4 that we can get Test2ID to be recognised as an IDENTITY column but only if we don’t join to anything else.

This is mainly a bit of fun for me, going down a bit of a rabbit hole to see what’s at the end, but it does have some real world implications. If you’re using SELECT... INTO anywhere in your scripts then you do need to be aware of when it will and won’t create IDENTITY columns and account for that in future interactions with the created table. I became aware of this because I was editing an existing script and needed to add new data to a table created in this way, and I got errors because I was trying to insert into an IDENTITY column, so there are problems it can cause. You also need to be aware that if you’ve accounted for the IDENTITY column in the created table, but subsequently add a join into your SELECT, that column will no longer be created as an IDENTITY and you may need to do some editing downstream.
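
As an aside, this behaviour can be used deliberately: as far as I’m aware, applying an expression to the identity column (a CAST, for example) also stops SELECT... INTO copying the IDENTITY property, in the same way a join does. A quick sketch reusing dbo.Test1 from earlier (dbo.Test5 is just a throwaway name) gives you a plain INT column you can insert into freely later:

SELECT
      CAST(T1.Test1ID AS INT) AS Test1ID /*an expression, so the IDENTITY property isn't copied*/
    , T1.SomeData
INTO dbo.Test5
FROM dbo.Test1 AS T1;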

In general I don’t particularly like SELECT... INTO: I feel that I lose some control as a developer when I use it rather than explicitly creating the table, and I lose some of the self-documenting nature of the code, and this just gives me another reason to avoid it. But I also know that at a minimum I will end up debugging someone else’s code that uses it in the future, so it’s a good idea to know about these quirks.

SSDT database projects CASE changes

Just a quick post today, to highlight one of the quirks of SSDT database projects.

Usually when you hit publish, or publish through your automated pipeline, or generate a script, or do whatever else you do to push the project out onto a server, the process is pretty simple. There’s some sort of comparison engine that runs under the hood, and it does some technical wizardry to detect any differences between your project and the database you’re deploying to. It generates a script based on those differences, and runs that script against the database to bring it in line with the project. Simple enough, right?

Mostly, yes. But there are some cases where it isn’t that simple. I’ve blogged before about how you need to write any check constraints in a specific way, and today I’m going to look at another of these quirks.

The scenario

Recently in my job we were looking at tidying up the case of some description fields in our dimensions. One of the procedures that generates the dimension data was producing a description field as ‘YES’ and ‘NO’, and we felt that ‘Yes’ and ‘No’ were better options. This was all being generated through a CASE expression (CASE WHEN Column = 1 THEN 'YES' WHEN Column = 0 THEN 'NO' END), so we thought it would be an easy fix to just change the YES to a Yes and so on. We were wrong.

When we came to deploy, we found that the changes weren’t deploying to any of our environments. As they were part of a larger change set, we were really puzzled about why it was just these procedures that weren’t deploying, until we realised that it was because the only difference between the project version and the database version of these procedures was the case of the labels. I did some looking into it, and it turns out this is expected behaviour. I had a look around the publish options, and couldn’t find any ‘detect changes in case’ type option, and the internet suggested the only way to make the publish process work the way we wanted was to update the collation of the database to be case sensitive.

This was definitely a pain, and the answer in the end was to add a comment to the procedures to trick the comparison engine into thinking they’d changed, but the more I thought about it the more it made sense. By setting the database to be case insensitive, we had told SQL Server that we didn’t care about the case of anything in that database. Why, in that case, would the comparison engine waste time altering a proc when the only difference was something we had called out as unimportant?

More generally, I think this speaks to a principle of doing the right thing in the right place. If your database is case insensitive then don’t try to do case-sensitive stuff in there. Instead, worry about the case in whatever layer you have that does care about the case of your labels. And if that layer is your database then make it explicit by making the database case sensitive.

Why I’ve not been posting for a while

As I mentioned in my blog last week, I’ve not posted for a long time on here. In fact, before last week, you had to go back to May 2021 for my last post, so that’s almost a 2 year gap. This gap wasn’t really by choice, but was a result of external factors sweeping in and blowing all of my plans out the window. But before I get to what happened, I want to go through where I was professionally in 2021, to give a bit of context.

2021

So, 2021, the pandemic had been going for a while at this point, and I was generally coping ok. I’d come through some nasty mental health issues in 2020, and was feeling a lot better about a lot of things. I’d managed to blog at least once a week for 3 months in 2020, and had set myself the goal of doing that for 6 months in 2021, as well as regularly contributing to T-SQL Tuesdays. I also applied to present at New Stars of Data, and was accepted, which meant getting some mentoring from the awesome DBA with a beard. The presentation itself went really well, and off the back of it I got accepted to give a quick talk at DataMinutes. I also applied to give the New Stars of Data talk at GroupBy, and got accepted. This was a huge deal, because I’d be putting my work in front of a couple of hundred people, and it really felt like between the conferences and this blog I was taking the first steps to becoming known in the community.

Why this didn’t happen

Obviously, this didn’t happen. Instead I got a call one day from one of my aunts to say my mum had suffered a stroke. Immediately every plan I had was thrown out, personally and professionally. On the personal side, I ended up moving to be close to my mum when she was released back home, so I could help her through the transition back to whatever life she’d have. Professionally I obviously stopped blogging, I cancelled the talk at GroupBy (they were great about it by the way), and ended up still giving the DataMinutes talk (they were also great about it, and were happy for me to either give the talk or not), although I probably shouldn’t have, and I don’t think it was very good (one of the big names in the SQL world was watching and he called out a major mistake).

I also had to take a break from work to deal with the incredible stress all of this put on me. In the early days covid meant nobody could visit in the hospital so the only communication we had with doctors was over the phone, and they didn’t always have the time to explain things. The only communication I had with my mum was via video chat, which needed the nurses to know how to make that happen, and they weren’t always technically savvy (not a criticism, knowing how to use Zoom was not a job requirement and they were doing awesome work under horrible circumstances). When I came back to work, they were starting to bring people back into the office and I was planning to move away for an unknown amount of time, so I ended up changing jobs as part of the move.

Where I am now

Since then, things have been difficult but we’ve all come through. My mum is home, and largely independent, and I’m starting to look at picking up my life from where I left it in 2021. Professionally I feel like I’ve lost the momentum I had in 2021, so that is something I need to work on getting back. That means picking up this blog, so I’ve again set the target of posting at least once a week for 6 months in 2023, as well as trying to do some speaking and maybe picking up some new skills. My post last week had a lot about my goals for the year, and these are all on there.

I also took a holiday at the end of the year, and that helped a lot. I spent almost 2 weeks in Bali, having fun, relaxing by the pool, drinking cocktails out of coconuts, doing yoga and meditation, and generally putting a bit of distance between me and everything that was going on. It gave me a chance to think, and to put things into perspective, and I’ve come back feeling a lot more optimistic about my future.

This isn’t a post for sympathy, just an explanation of why I’ve been away. I don’t think I really have a regular readership, but I wanted to explain for anyone who was curious about why I vanished. I also wanted to remind people that these things happen, and they will knock you off course. All you can really do is pick yourself up at the end of it, see where you are, and start working on making progress again. And when you are in the middle of something like this, remember to look after yourself. It’s something I wasn’t always good at, but if you’re not ok then you can’t help anyone else.

TSQLTuesday #159: What’s your favourite new feature

This is my 10th TSQLTuesday, and this month Deepthi Goguri has given the double prompt to talk about our favourite new features in SQL Server 2022 and our New Year’s resolutions.

Now, for various reasons, I’ve not been posting on this blog for a while. Like, a long while. Like the last post was TSQLTuesday #138. I was going to do a post about why that was, but I haven’t got it written yet and I really want to post on this topic so that might come next week.

Anyway, the same things that kept me away from the blog (and a lot of other things) also stopped me from looking into SQL Server 2022 in any great detail. I did a post in April 2021 about the GREATEST and LEAST functions that were present in Azure and were coming to on prem at some point in the future, but apart from that I haven’t really been paying attention. With all of that, I guess the thing I’m looking forward to the most is the thing I don’t know is there yet. It’s always fun to watch talks and read blogs on new features, and think about how you can apply them to the big issues at your workplace and make life easier, so that’s what I’m hoping I’ll get a chance to do over the next few months.

New Year’s goals

Many years ago I stumbled on Brent Ozar’s epic life quest. Reading around that led me to Steve Kamb’s Epic Quest of Awesome, and I decided I wanted to do something similar. I write about this in more detail here, and list all of my goals here, but the basic idea is to make a list of things you want to do in your life, then focus on ticking them off. I really struggle with keeping focus on things for a long period of time, so having something like this allows me to take some time and set down what’s important, and refer back to it later as a guide to what I should be working on.

Some years after I did this I decided I was done with the idea of New Year’s Resolutions, because they always end up being a bit vague, and you get to the end of the year and you can’t really say if you did them, and they just tend to fade from my attention after a bit. I decided that what I wanted to do instead was set myself some concrete, S.M.A.R.T. goals, and to do that I largely pulled from my epic quest, along with a few other things. The last couple of years have been really bad when it came to achieving anything, because of the pandemic and other reasons, but this year I’m feeling a lot better, and I have some catching up to do, so I’ve been ambitious and set myself quite a few goals. The plan is to check in every 3 months and see how I’m doing, and keep all of them in mind as I go through the year. So, goals for 2023 are:

  • Weigh under 95kg every day for 3 months. This has been something I’ve tried for a couple of years before now, but I think I’m in a better place to tackle it this year. The idea is if I can get to that weight and maintain it for 3 months it will set some healthier habits in place rather than immediately putting it all back on again.
  • Deadlift 200kg. Current tested max is 165kg, but I think I’d be closer to 180kg if I tested it today.
  • Deadlift 250kg. This is definitely more ambitious, and probably won’t happen until near the end of the year if at all.
  • I’ve entered a novice strongman comp that’s coming up in about a month, so I’ve set myself the challenge of completing every event. That means getting at least one rep on the ‘for reps’ events, hitting all of the weights in the ‘progressive’ events (things like deadlift x weight then y then z, I want to get all the way to z), and completing the carry and pull medley in the time (carry 2 sandbags then pull a sled a set distance in 60 seconds).
  • Finish a half marathon. I’ve not done much running the last couple of years, but I’ve entered the Nottingham half, which is the end of September, and if I train right I can definitely complete it.
  • Paint a Warhammer 40k combat patrol. This was nearly completed last year, but I didn’t quite make it so I want to finish it off this year.
  • Paint a Warhammer 40k incursion force. This will incorporate the combat patrol, but double the size of it. I really enjoy painting these models, so this is about having fun and reducing my ‘to be painted’ pile.
  • Paint a gang/team/squad/something else for one of the Games Workshop boxed games. Same as above, painting relaxes me and gives me a feeling of achieving something.
  • Related to painting, I had an idea for a simple app to track all the paints I currently own, and what paints I used on each model, to help me repeat existing paint jobs on similar models in the future. I’ve not done app development before, or really anything outside of databases, so I’m thinking I’ll set myself the challenge of building this in Azure, and hopefully learn some stuff along the way.
  • Blog every week for 6 months. This blog is really neglected and I need to show it some love.
  • Speak at another SQL Server user group. I spoke at the Nottingham user group a couple of times, and was just starting to get accepted to conferences when things got difficult. I’d like to do more speaking, and both the Cardiff and Bristol groups aren’t that far from me so I’m keeping an eye out for when they start up again this year.
  • Sort out a group tent for larp. If you don’t know what larp is, this won’t make a lot of sense, but essentially I want to get a tent and decorate it like it’s from a magical fantasy kingdom.
  • Book another big holiday. I may not get to go on one this year but I’d like to at least have one lined up for 2024 by the end of the year. I went to Bali over New Year and it was really good for me, and I feel like I need to do those trips more often.
  • Have 5 smaller adventures this year. So far I’ve booked a running festival in the summer, and a new larp event, so I need to find another 3. I’m thinking some weekends in London to catch some shows.

Anyway, that is it (apart from one goal I’m keeping secret for personal reasons, but I’ll post if it happens). Like I said, some are off my epic quest and others are just things I thought would be good for me this year, but having the list to refer back to every time I have a spare half a day helps keep me focussed on chasing the things that matter more to me.