Chris Johnson

March 13, 2021

New Stars of Data

I gave my first ever conference talk yesterday at New Stars of Data. This is a conference specifically for new speakers, where you get paired with a mentor who gives you advice and works with you to make your talk better over the months before the conference. It started out as a new speaker track at DataGrillen, but with the main event being cancelled due to Coronavirus it’s shifted to being a separate conference for the last couple of years.

I can honestly say the experience was fantastic. My mentor was great (Rob Sewell, also known as Beardy McBeardface or DBAWithABeard), and I learned a huge amount from him over the couple of months we were working on the talk together. I also got some really positive feedback from some of the more experienced presenters watching, so that’s really good as well.

I can’t recommend enough that you apply for this if you’re thinking about getting into speaking. Hopefully next year the DataGrillen conference will happen, and I think at that point this will be folded back into the main conference, but you will still have the support and mentoring I had. I’m definitely going to try and speak at more events, and I can’t think of a better way to get started on my speaking journey than this. Huge thank you to everyone involved.

February 9, 2021

T-SQL Tuesday #135: The outstanding tools of the trade that make your job awesome

This image has an empty alt attribute; its file name is tsqltues.png

This is my 7th TSQLTuesday, and this month we’re all invited to blog about the tools we use in our job that we are particularly attached to.

As it happens, I’m giving a talk in just over a month at New Stars of Data on a tool I think a lot of the community overlook: Visual Studio Database Projects.

I’m not going to regurgitate the whole talk here, that would take far too long, but I will give a quick preview of what’s covered on a couple of slides that cover what I like to call the “quality of life” improvements, small things that just become easier when coding through a Visual Studio project rather than in Management Studio or similar.

Renaming and moving objects

Usually we avoid renaming anything in SQL Server. You don’t know what other bit of code might be lurking out there that depends on the object you just renamed, or when that code might be called, and so spelling errors end up being left in for years because remembering it’s the Customet table not the Customer table is easier than having the application fall over at 3am because we missed a reference somewhere in our code base.

Database projects gets rid of that fear. If you rename an object, or move it from one schema to another, Visual Studio automatically propagates that change through your solution. And even if there’s some reference somewhere it didn’t recognise as needing updating, the intellisense-style validation will cause the next build to fail and tell you the database is no longer fully functional until you fix the issue. You do have to remember to rename through the interface rather than in the scripts, but in exchange you get the peace of mind that comes with knowing you’ve not broken anything with your change and you can finally query the Customer table.

All of your scripts are object creation scripts

This is a bit of an odd point but bear with me, it leads to something good.

Every script you write in a database project is the script to create the object from scratch. It isn’t a generated script from the object, which is how you view objects in Management Studio.

The reason this is significant is it means you can add comments to objects you wouldn’t normally be able to add comments to, like tables. This means you can add a comment next to a column name and give a decent explanation of what it means, or add a description at the top of the table script to say exactly what each row in this table represents, and why the table exists.

If you’re encrypting your code, which you might do if you are selling a managed service that gets deployed to the client’s servers, you can also store the unencrypted code in your database project. If you try to retrieve the code from Management Studio you won’t get anything (that’s the point of encrypted objects) but Visual Studio just holds the create script, not the object itself.

F12 to jump to object

Finally, if you highlight an object in your code and hit F12, you will open another tab with the code for that object. No more hunting around in the database for the view definition, now you can just open it up instantly. I found this incredibly useful once I got into the habit of it, but I have to warn you that it only works if the code you’re currently in is error free.

So there you have it, three nice little features in Visual Studio database projects that can make a real difference to your coding. I hope this was helpful to someone, and I’d encourage anyone reading this who isn’t a TSQLTuesday regular to give it a go next month.

January 12, 2021

TSQLTuesday #134: Give me a break

This is my 6th TSQLTuesday, and it’s a subject close to my heart this month: travel.

Travelling is awesome

As I wrote about in a previous TSQLTuesday blog, I really like to travel. I have a big list of places I want to see, as a big part of my epic quest of awesome, and I know I’m never going to get to see everything I want to. Obviously money and time are factors in how often I can get away, and so after the Everest trip in 2019 I’d set myself the fairly modest goal in 2020 of visiting Rome, to tick off the Colosseum, Vatican City, and take a day trip to see the Leaning Tower of Pisa. I also wanted to visit Stonehenge, as it’s in the same country which means I can make it a weekend trip and see some friends in Bristol at the same time.

Reassessing plans

Clearly the Rome trip never happened. It became quite obvious fairly early in the year that travel to Italy was a bad idea, and I came to terms with that pretty quickly. I held out some hopes for the Stonehenge trip, but in the end the situation never felt safe enough for that kind of frivolous holiday.

Instead I spent the year getting some things sorted in my home life, and that included a big think about my plans for the future. I plan on writing a more detailed post (or maybe several) about the process I went through to do that at some point, but I won’t go into it too much detail here. One of the things that came out of that was I had a serious look at my finances for the first time in a long while, and looked at the types of holiday I want to do, and how regularly I could afford to do them.

Categorising holidays

As a data person, obviously I needed to split the holidays up into categories:

Weekend trips – things like the trip to Stonehenge, usually in the UK but could be in Europe, little to no time off work and relatively inexpensive. I can probably absorb the cost of one of those into a month’s budget (or maybe spread across 2 months) by cutting down on other expenses like meals out/takeaway and not making any other big purchases. These are holidays I don’t have to plan for.
City breaks – these are trips up to a week long, where I travel to another city (probably in Europe) and see stuff. They obviously need a bit of time off work, won’t come with too much of a plan, but will need some organisation and saving beforehand. The Rome trip I planned was a good example of one of those.
Adventures – these are trips like the trek to Everest base camp, or the Vietnam experience I talked about in my other blog post. They last anywhere between 1-4 weeks, will be booked as a tour with a set itinerary, usually involve changes of location, and often need me to buy quite a bit of gear. These obviously come with a fair price tag and require some significant organisation beforehand.
Crazy adventures – Never done one of these yet, but it would be a combination of city breaks and adventures strung together in one holiday.

Next trip

Having taken stock, through about some of the holidays I wanted to do soon, and mentally assigned them some categories, I decided I could probably bump the next holiday I take up from a city break to an adventure. There were a couple of factors in that, but the main one was that the extra year of not doing anything was an extra year to save and get myself in a good financial position to absorb a holiday like that into my budget.

I had a think about what exactly I wanted out of the next holiday, especially given what had happened on the Everest trek. One of the things I decided was it needed to be a physical holiday, I think I need to prove something to myself that I am still capable of these more physical holidays. I also decided I wanted to go somewhere warm, after the cold of the Everest trek. With those criteria in mind I settled on a trek through the Jordan desert to Petra and the Dead Sea. There are a few different options for that type of trek, but this one looks pretty representative of what you get.

The whole trip looks very exciting, and as a bonus I get to pretend I’m in the Last Crusade.

Other next trips

I also plan on doing Stonehenge this year, provided things get better in time. I know a few people who may want to come with me, and I have friends in Bristol so I can catch up with them at the same time.

Beyond that I might try to fit in a city break in Berlin to see my sister. A few people I know rave about the city so I’d like to see what it is that makes it so special, and obviously I want to spend some time with my sister after not being able to see her for so long.

Grand plans for the future

For a long time I’ve had a crazy adventure in mind. I want to start somewhere in South America, maybe Brazil, and travel north. Start off with something like an adventure tour in the Amazon, maybe one where you canoe down the river for a week, and see some of Rio. Then travel through different countries to see places like Machu Picchu, Chichen Itza, and the Paricutin volcano, eventually entering the US. From there I’d maybe rent a car and spend a couple of months visiting different cities in the US and seeing places like Death Valley and the Grand Canyon, before finally heading north to Canada where I’d see the Northern Lights.

This is probably never going to happen, and may well end up being broken down into several different holidays, but it’s definitely a nice dream to have.

Why plans for the future matter

I think it’s important to have these types of plans, they give you something to look forward to and something to strive for. These things are especially important in the strange times we live in now, but they were also important before and will continue to matter when the world gets back to normal. I know that when work gets bad (as it does sometimes for all of us) and I feel like I’m not making any difference, I can remind myself that even if I’m not achieving stuff at work I am getting myself closer to my next epic adventure.

November 10, 2020

TSQLTuesday #132: How are you coping with the pandemic?

This is my 5th TSQLTuesday, and it’s a little rushed because I only saw the invite this morning.

This isn’t the usual type of blog post that I write. I’m not sure if I have much in the way of a readership out there, but if I do they are probably used to technical articles, or some long post about best practices with lots of code examples sprinkled in. This, obviously, won’t be one of those. It may also meander a little bit, but bear with me, it will all come together in the end.

Travel

I like to travel, and I’ve not done it nearly enough over the last few years. When I was younger I found myself flying to different parts of Europe quite a bit. A former partner was from Luxembourg so I flew there several times to visit her. A friend I made at Uni was from Switzerland, so I flew out there once, and we visited Germany while I was there. Another former partner’s parents had a holiday home in France so we went there a few times. I also ended up in the US a couple of times, once in Florida for Disneyland, and once in Washington DC for my sister’s wedding. I got pretty comfortable on planes from these trips, and came to like airports and long flights, and travel in general.

I also had a few more epic adventures. I spent a couple of months in Southern Africa, living on a nature reserve and doing a bit of backpacking. I also got to white water raft on the Zambezi river below Victoria Falls.

I didn’t do another epic adventure for a while, but 5 years ago I went on a 4 week trip to Vietnam, with a few days in Bangkok at the start. It was pretty epic, and I got to see some amazing things, including Ha Long Bay which remains one of the most beautiful places I’ve ever been. I came back from that holiday determined to do more epic travelling, which leads us to the crucial bit of the story…

Everest

I’d settled on my next destination pretty early. I wanted to do the trek to Everest base camp. It isn’t cheap, and needs a lot of fitness, and for various reasons I didn’t make it happen for a few years. Finally, last year, I had the money and the fitness and I booked it. I made sure I had all of the equipment, and I was confident I was fit enough (I ran a half marathon a couple of weeks before going).

The trip itself was awesome, but very hard at the same time. The days were long, the nights were cold, and the food was dull. But, the scenery was amazing, the people were amazing, and the whole thing felt like I was stepping into another world.

I loved the trek, and I loved the feeling of accomplishment that came with every day. Every so often we would pass markers that said how many hours and kilometers we were from base camp, and I think everyone got more and more excited as we got closer.

Then I got ill. It came on really suddenly. I was pretty much ok at the start of the day, and could barely walk by the end of it. The next morning my fingertips and lips had turned blue, and they had to call a helicopter for me. I spent 3 days in hospital in Kathmandu recovering. It turned out I had what’s called a High Altitude Pulmonary Edema. This is where the decreased oxygen levels cause the heart to beat faster to move oxygen around the body faster, but the increased blood pressure causes fluid to leak from the blood vessels into the lungs further reducing the oxygen levels. It’s fatal if you don’t get to lower altitude in time, but once you do you’re basically going to be ok.

PTSD

For some time after this I thought I was fine. Physically I felt a bit beaten up, but nothing a bit of heathy eating and exercise wouldn’t get me through. It wasn’t until the increased stresses from work started to pile up that I started to really struggle. The tipping point was when I saw a video of someone with an oxygen monitor on their finger. I basically became completely useless for days after that, and had to book some time off work to get through it. I spoke to doctors, and ended up with a PTSD diagnosis, a course of therapy, and some time off work.

I fought my way back to health through a combination of therapy and routines. I’m not going to detail the therapy here, but it forced me to confront what had happened, and in doing that my brain processed it and moved on. The routines I put in place involved waking up at particular times, planning my week pretty rigidly on a Sunday, making sure I got some exercise, going for walks, yoga, a big variety of things.

When it came time to go back to work, I felt like I was ready. I think I needed the structure work provided. I had some conversations with my boss, and we agreed that I would only work my contracted hours, and take a 1.5 hour lunch break. This meant 8-5 with lunch usually starting at 12 and finishing at 1.30. This was all to make sure I had time to keep up the routines and coping mechanisms I had in place.

This hasn’t lasted particularly well. Last week I had to work late most days, which threw my evening plans out and has disrupted my routine a lot. I’ve spoken to my manager a few times, and we are taking steps to fix the issues. One thing we agreed today is I will do 9-6 instead, as that way emergencies late in the day won’t force me to go over my hours. This might even be better in some ways, it may give me some time in the morning to work on this blog or do some other personal project.

Is there a point to this story?

I’m not entirely sure. I wanted to share something about my lockdown experience, and this kind of dominates everything about not just lockdown but the last year for me. I certainly feel like I’ve come out of the whole thing with a better understanding of myself. I know my mental limits a bit better, and understand what I can do to keep myself mentally fit a lot better than I did. I’ve also experienced failure, on a scale that I don’t think I had before, and I’m still working on coming to terms with the idea that it wasn’t my fault, and that I failed despite doing everything right.

I guess the message for people reading this is to make sure you prioritise self-care. If you start to feel anxious every day, if you can’t focus, if you dread starting work, these are signs something is wrong. You don’t necessarily have to change jobs to fix it (right now I think we’re all pretty lucky if we have jobs) but maybe change your relationship to your job in some ways. Set some boundaries, and experiment a bit with what you can do outside of work to let that stress go. I recommend walking and yoga.

October 27, 2020

Avoiding coding traps

Brent Ozar put up a post last week called If Your Trigger Uses UPDATE(), It’s Probably Broken. In it he talks about the UPDATE() function and how it can cause unexpected behaviour in your code.

For anyone not aware, UPDATE() is a function that can be used in triggers, and which returns TRUE if an INSERT or UPDATE action happens on the column passed into the function. So, something like this:

CREATE TABLE dbo.Primarch
    (
          PrimarchID INT IDENTITY(1, 1) NOT NULL
        , PrmarchName VARCHAR(200) NOT NULL
        , ChatpterName VARCHAR(200) NULL
        , LastKnownStatus VARCHAR(100) NOT NULL
        , LastKnownLocation VARCHAR(100) NULL
        , YearsAtLastKnownLocation VARCHAR(100) NULL
        , NumberOfSuccessorChapters INT NOT NULL
    );
GO

CREATE TRIGGER Primarch_Update
    ON dbo.Primarch
    AFTER UPDATE
AS
    IF UPDATE(LastKnownLocation)
        UPDATE Prim
        SET
             YearsAtLastKnownLocation = 0
        FROM dbo.Primarch AS Prim
        INNER JOIN inserted AS ins
            ON Prim.PrimarchID = ins.PrimarchID

The idea here is to set the YearsAnLastKnownLocation to 0 if the LastKnownLocation is changed, with the logic being we are receiving real-time information on the last known location of the primarchs (this example is a bit contrived but go with it). However, if you want to run a full update on a primarch you might have a bit of code like this:

CREATE PROCEDURE dbo.UpdatePrimarchInformation
    (
          @PrimarchID INT
        , @LastKnownStatus VARCHAR(100)
        , @LastKnownLocation VARCHAR(100)
        , @NumberOfSuccessorChapters INT
    )
AS
    UPDATE Prim
    SET
          LastKnownStatus = @LastKnownStatus
        , LastKnownLocation = @LastKnownLocation
        , NumberOfSuccessorChapters = @NumberOfSuccessorChapters
    FROM dbo.Primarch AS Prim
    WHERE 1 = 1
        AND Prim.PrimarchID = @PrimarchID

Now, when you run this and input the current LastKnownLocation and LastKnownStatus, and a new NumberOfSuccessorChapters, you still trigger the update of YearsAtLastKnownLocation to 0.

The exact reasons why aren’t important, and Brent does a typically great job of explaining the issue on his blog. The reason for this post is I wanted to look at these types of errors a bit more, and a coding principal I like to adopt that helps to avoid them.

Coding traps, not bugs

Brent calls this a bug at one point, and while I agree with him that it’s bad practice to code like this I wouldn’t call it a bug myself. The reasoning behind that is without the second bit of code (the stored proc) the application functions as it was intended to. It’s not until someone codes in something that might update LastKnownLocation without changing the value that you start to have undesirable behaviour in the app.

It is, however, absolutely a trap for any future developer working on the app. The stored proc above is an entirely reasonable bit of code to write, most developers I know won’t check every table their code hits for triggers, and similarly won’t read every bit of documentation on a table before writing a simple bit of code against it.

Some people might say that testing should save you, and in theory it should but things do get missed in testing, and that’s assuming you’re working somewhere that encourages testing in the first place. And in any case, even if it does get caught in testing, that’s another bit of work for the dev to do to figure out why this update is causing the YearsAtLastKnownLocation column to always set to 0.

That’s extra time to do the work, extra stress on the dev, and all because of a bit of code in the trigger that could be written better.

Coding to the existing code

So, this is the principal I want to talk about, that I try to work to as much as I can. Essentially it means you code to what is possible within the existing code base, not what people tell you the existing code base will be used for.

One great example of this is always coding around NULLs if they are allowed in a column. It doesn’t matter if the business assures you there will never be a NULL in that column, you should always assume there will be. As an example, the following code assumes the LastKnownLocation will never be NULL:

SELECT
      Prim.PrmarchName 
        + ', primarch of the ' 
        + Prim.ChatpterName 
        + ' chapter. Current status: ' 
        + Prim.LastKnownStatus 
        + '. Last seen at ' 
        + Prim.LastKnownLocation 
        + ',  ' 
        + CAST(Prim.YearsAtLastKnownLocation AS VARCHAR(100)) 
        + ' years ago.'
FROM dbo.Primarch AS Prim
WHERE 1 = 1
    AND Prim.PrimarchID = @PrimarchID

If the LastKnownLocation is NULL, because we don’t have a last known location for the Primarch, the whole string returns as NULL. You may have a guarantee from the project manager, or the business sponsor, that we will always have a value for LastKnownLocation, but the code says something different. If you follow the business steer in these circumstances, and code as if the column will never be NULL, you are introducing a trap for developers working on this software in the future. One of them will see the column allows NULLs and will write some code that can add a NULL into the column, and then this bit of display code somewhere else in the app will break.

Another example that may be familiar to people who’ve written a lot of ETL processes is when you come to create an import table. In a lot of cases you are importing from files that don’t necessarily remain the same every time, so you have a process that reads the column headers from the file and creates a table to hold them all, something a bit like this:

DECLARE @ColumnList AS TABLE
    (
          ColumnName VARCHAR(255)
        , OrdinalPosition SMALLINT
    );
INSERT INTO @ColumnList
    (
          ColumnName
        , OrdinalPosition
    )
VALUES
      ('PrmarchName', 1)
    , ('ChatpterName', 2)
    , ('LastKnownStatus', 3)
    , ('LastKnownLocation', 4)
    , ('YearsAtLastKnownLocation', 5)
    , ('NumberOfSuccessorChapters', 6);

DECLARE
      @Counter AS INT
    , @SQLScript AS NVARCHAR(MAX) = N'CREATE TABLE dbo.PrimarchImport(RowNumber INT IDENTITY(1, 1) NOT NULL';

SELECT @Counter = MIN(OrdinalPosition) FROM @ColumnList;

WHILE @Counter <= (SELECT MAX(OrdinalPosition) FROM @ColumnList)
BEGIN
    SELECT
          @SQLScript = @SQLScript + N', ' + ColLst.ColumnName + N' NVARCHAR(MAX) NULL'
    FROM @ColumnList AS ColLst
    WHERE 1 = 1
        AND ColLst.OrdinalPosition = @Counter;

    SET @Counter = @Counter + 1;
END

SET @SQLScript = @SQLScript + N');';

EXEC sp_executesql
      @stmt = @SQLScript;

This is a bit of a cut down version as I’m just adding the column names direct to the @ColumnList table variable rather than generating them with C# code or something else, but it works for our purposes. This particular block of code will run just fine, but what happens when we get a column name in the source file with a space, or a column, or anything else that will cause us issues with our create table script? And again, we may be promised that this will never happen, but why take the risk? All we need to do to fix it is escape the column names in our dynamic SQL by wrapping them in [].

In both of these examples the application may work as expected for years, or even forever. But it may not, and if it it does break the root cause of the problem will be you not coding for a scenario the existing code told you was possible.

October 20, 2020

Another fun thing with NULLs

I saw this on a post at SQLServerCentral and wanted to share.

The poster had a CASE statement and was wondering why it didn’t work as expected. Essentially they were doing something like:

CASE WHEN @a = @b OR (@a IS NULL AND @b IS NULL) THEN 1 ELSE 0

CASE WHEN NOT(@a = @b OR (@a IS NULL AND @b IS NULL)) THEN 1 ELSE 0

And they wanted to know why both were returning 0 when @a or @b were set to NULL. The issue here is that any normal predicate involving NULL returns an unknown. They had tried to compensate with the OR, which got them the result they wanted in the first statement, but didn’t understand why it did that.

To fully understand I want to break down what each bit returns here, assuming @a IS NULL and @b is some value

First, @a = @b returns UNKNOWN, represented in T-SQL as NULL. This is because you are comparing unknown (@a) to an actual value (@b), and the SQL engine can’t say whether that is TRUE or FALSE.

Second, @a IS NULL returns TRUE, and @B IS NULL returns FALSE. Put together, @a IS NULL AND @b IS NULL returns FALSE as an AND requires both parts to be true to return TRUE. If you replace the AND with an OR it would return TRUE, as OR only requires one part to be true.

So, what we have here is UNKNOWN OR (TRUE AND FALSE). This simplifies to UNKNOWN OR FALSE when we solve the bracket. And, logically, that has to simplify to UNKNOWN. After all, what we are asking here is “are either this unknown result or this false result true?” The false result is obviously not true, but the unknown result might be.

This brings us to the essential issue in the original question, which is how an expression and the reverse of that expression could both return the same value. Taking the first CASE statement, that is saying when the expression is TRUE return 1, else return 0. The expression returns UNKNOWN, but the CASE statement can’t just say “I don’t know”, it has to return 1 or 0, and with the way predicates work in SQL Server it will only return 1 if the expression returns TRUE.

Taking the second CASE statement, that is saying when the reverse of the expression is TRUE return 1, else return 0. But the expression returns UNKNOWN, so what’s the reverse. It’s helpful here to think of UNKNOWN as more of a placeholder than a proper response. The logic will always behave as if there is a TRUE or FALSE response that would be returned if we had all of the facts. Therefore, what it wants to return is the reverse of that TRUE or FALSE, but it doesn’t know which is the correct one, so the NOT(expression) also returns UNKNOWN when the inner expression returns UNKNOWN. This obviously leads to the CASE statement returning 0 again, as it only returns 1 when the predicate returns TRUE.

An alternative way of thinking about it is to imagine someone asks you if my eyes are brown. Clearly there is an answer to that question, but you as a reader of this blog who’s never met me won’t know what it is. Now what if someone asks you if my eyes are not brown? Again, the answer is “I don’t know.” Now what if the question is “do any of these 10 people have brown eyes?” At this point we’re getting into the territory of IN/NOT IN predicates, which are another area where NULLs cause issues. If I’m one of those 10 people, then even if you know all of the other 9 don’t have brown eyes, you can’t give a definite yes or no to that question, and therefore can’t give a definite yes or no to the reverse either. Then, because in T-SQL (as in most coding languages) predicates only trigger on a definitive yes/TRUE, the predicate logic will behave as if the expression returned a FALSE result, but for the reasons outlined in this post it’s important to know that it actually returned UNKNOWN.

October 13, 2020

T-SQL Tuesday #131: Data Analogies, or: Explain Databases Like I’m Five!

This is my fourth time taking part in T-SQL Tuesday. Today the topic is all about your favourite analogies when it comes to explaining SQL Server topics to people.

I really like this topic, mainly because over the last three years I’ve found myself increasingly acting as a teacher to the less experienced T-SQL coders in the office, and anything that makes me think about how to do that better is good.

There are a few analogies I fall back on when trying to explain certain concepts, and I found myself using the old “clustered index is like a library index” one a couple of days ago when I was explaining why the order of columns in a clustered index did matter to the performance of a query. This is something that I’ve struggled to get across earlier on in my career, and having a tried and tested analogy to fall back on this time meant I was more successful this time around.

Analogy of choice: database engine as an office

The analogy I’m choosing here is something I use in my head quite a bit to help me understand certain concepts:

Imagine an office, with a set of filing cabinets filled with paper. The office workers regularly have to get information from these files, add new information, update existing information, or even get rid of some out of date stuff. None of the information is available anywhere but in the files, not even in people’s memories. This is the SQL Server query engine.

The reason I like this is twofold. First, it’s familiar. Even if you’ve never worked in an office that uses physical files you’ve seen them on TV and can understand the concept really easily. Second, it’s multi-purpose. This analogy can be used to cover quite a range of topics if necessary, and having an understanding of all of them stemming from the one analogy makes it easier to see how they fit together.

One obvious use is in explaining indexing, the order of the files is a clustered index, if files are completely unordered that maps to a heap, and extra index cards can represent non-clustered indexes.

You can also use it to look at row by row versus set based operations. In this example, a set based operation means the worker walks to the filing cabinet, gathers all the files they need, takes them back to their desk, and updates them all there. In contrast, a row by row operation means the worker walking to the filing cabinet, grabbing one of the pages they need, taking it back to their desk, updating one row there, returning the page to the filing cabinet, returning to their desk, and repeating the process again for the next row (even if it is on the same page as the previous). This makes the efficiencies of set based over ReBAR (row by agonising row) operations really obvious.

A slightly more obscure use is talking about memory grants. As part of the query execution process, SQL Server calculates how much memory the query will need to store temporary results along the way. For instance, if you are sorting the data you need to store the partial results of the sort while it’s completing. It bases this on three main factors, what actions will need to store data in memory (from the execution plan), how many rows each of these will need (from the table statistics) and how big each of those rows can be (from the column definitions, taking all variable column sizes as the maximum). This is represented in our office by desk space. The office has a number of desks, that together represent the total available memory in the system. When a worker starts work on a task, they reserve an amount of space on these desks based on how much space they think they will need. If their estimates are too high, they will take up space other workers need and hurt the overall efficiency of the office as a result. If their estimates are too low, however, they are not allowed to adjust them on the fly. Instead they have to use one of the spare filing cabinets over in the corner, which represent tempdb. This means they are now taking data from one filing cabinet and moving it to another, before retrieving it from that filing cabinet again moments later You can see how this creates an inefficiency in their work, and is something you want to avoid where you can.

So, there it is. This analogy helps me understand a lot of the inner workings of SQL Server better, from obvious mappings like clustered indexes, to the slightly less obvious like memory grants or locking and blocking. Hopefully some of you reading this will find it a useful analogy as well.

July 11, 2020

Embrace changing requirements

I’ve worked for a few different companies over the years, some small and some big, and one thing that really sets the tone for a lot of inter-team interactions is how the company deals with changing requirements.

Larger companies, and some small/medium companies that feel like they need to start the transition to a larger company, have a tendency to lock in requirements at the start of a piece of work. The developers and the BAs want to have a document that lays down exactly what will be built before any work starts, and have processes in place for changes once work has begun. This is often adopted as a way to protect developers. For smaller companies there will have been incidents in the past where pieces of work have ballooned in size, and this obviously leads to the work taking longer than planned, and this in turn often leads to negative consequences for the developer. The solution someone arrived at was the rigid requirements process, and now everything runs more smoothly.

Except things probably aren’t actually running smoothly under the surface. What’s most likely happened is you’ve traded the issue of ever-changing requirements for a different problem; nobody’s getting what they want any more.

There are two primary reasons for this: firstly, people rarely know exactly what they want; and secondly, when they do know, they suck at articulating it properly. Because of this, when you have a rigid system that asks for all requirements to be defined up front you get an imprecisely worded attempt at describing something that was only ever half an idea in someone’s head anyway.

My experience has been that this goes double when talking about reports, inevitably if someone specs a report and you produce the report to their spec, what you’ll hear is something like “well, that’s good, but looking at it what I really want to know is…”

So, what do you do? You can’t go back to the old way, because the business needs some ability to plan how long things will take, but the new way means the dev teams aren’t delivering value a lot of the time. Often companies don’t address this directly, and you see a lot of bitterness appear between the dev team and the rest of the business as a result. The business resents having to specify everything to the nth degree, and still not getting something they want at the end, and the dev team resent doing exactly what they were told to and getting criticism for it.

The answer, in my opinion, is to embrace the fact that requirements will change and work with the business to deliver valuable work. You probably read that sentence thinking it sounds nice, but doesn’t offer any practical advice, and you’re right. That is the mission statement, not a roadmap of how to get there, but there are some more practical steps you can take:

Define a goal for the piece of work, rather than an exact specification. So, maybe instead of defining every item in a report, you have a goal something like “produce a financial report to provide monthly key metrics on our clients”. Maybe you would have a bit more detail in there, but it wouldn’t define every item in the report. Agile proponents will recognise this as being similar to the user story one-line descriptions that usually begin with “As a user I want to be able to …”

The flip of this is continue to reject requirements changes that completely transform the work, or are effectively new pieces of work entirely. So, if part way through developing the monthly client financials report you get a request that it’s now needs to hook into the internal time tracker system, the HR system, and the billing system, to provide extra details, that’s a fundamental change to what you’re doing so there needs to be a new estimation and a change to the plan. Similarly, if you get a request to add a second report on client loyalty, that’s a new requirement and needs estimating and scheduling separately.

Code with extensibility in mind. It should be easy to change details of your code to adjust to changing requirements. If you make your code overly complex, or put everything in one big procedure, or have steps that are tightly coupled together, it will be harder to adjust when the requirements shift.

Keep defining your understanding of the requirements as you go, and checking with the business stakeholder.

Keep producing prototypes and sharing them with the business stakeholder.

These two go hand in hand with the next one: make sure the business stakeholder is available throughout the process. If they are, and if you are communicating your vision and progress with them, and if you are getting timely feedback from them, then you have a fighting change of producing something they will be happy with at the end.

None of this is easy, and if your company is wedded to a fully waterfall approach it can be flat out impossible to achieve, but if you can make it work there are a lot of benefits. In my current job, I’ve spent some time moving one of the projects I’m on to this more collaborative approach, and people are starting to notice that it’s working. I’ve been able to make this happen as just a developer, with no authority over requirements or project management at all, just by regularly checking in with the business sponsors, and being open to requirement changes. It’s helped that the business sponsors have made themselves available for this, not all business sponsors will and at that point you can either give up or start raising things with managers and aiming for a more formal change in how your company works. That may be the next step for me, and, if I do go down that route, having this existing project as evidence will be a key part of my argument.

Some people may be looking at this and thinking that this is only relevant to waterfall companies, or companies that have adopted agile light practices, but I have seen companies adopt every bit of scrum but still falling into these traps.

In conclusion, requirements will change, it’s a fact of development that pretty much nobody knows exactly what they want at the start of a bit of work, and you can either ignore it or embrace it. If you embrace this fact, and work with it, you increase the chance that you will produce software that adds business value, and ultimately that is what we should be aiming to do.

July 5, 2020

When to allow NULLs

Some years ago I did a post on NULLs where I detailed some of the issues you can get with them. More recently I did another post on NULLs that detailed an example of an interesting issue some NULLs had caused in work when looking at an IN statement.

Now I want to talk a bit about when we should be using NULLs and when we shouldn’t, inspired again by some things I’ve seen in legacy code a few days ago. And when I say using, I mean when should we allow values to be NULL, in tables, and parameters, and anything else.

The first thing to remember is what does a NULL represent. It’s very simple, NULL means “I don’t know”. If you keep that in mind, and you’re practising some good development design where everything has a point, you should be able to tell if it’s possible to be unsure about any particular value for a row.

A quick example:

CREATE TABLE dbo.Customer
    (
          CustomerID INT IDENTITY
        , Forename VARCHAR(100)
        , Surname VARCHAR(100)
        , MiddleNames VARCHAR(300)
        , IsEmailable BIT
        , MobileNumber CHAR(11)
        , HomePhoneNumber CHAR(11)
        , DateOfBirth DATE
    )

Now, I’ve not included any NULL/NOT NULL in this example, because I want to talk through a few of these columns before I give my opinions.

CustomerID – This is the primary key of the table. It uniquely identifies every row. We definitely want this to not be NULL, and we are using an IDENTITY value (another option here is to use a SEQUENCE object) to make sure we always have a value.
MiddleNames – Not everyone has a middle name. We could set this up to not allow NULLs and force a default like blank string (maybe by setting a default in the column definition), or we could allow NULLs. Either approach seems legitimate here.
MobileNumber/HomePhoneNumber – Again, not everyone has a mobile, or a home phone number. NULL seems like a legitimate choice here, especially as we don’t have the empty string option because any value we use has to be 11 characters. We could default to 11 spaces, but then whatever reads this table has to code around that in pretty much exactly the same way it would with NULLs.
Forename/Surname – Most people have both a forename and surname (sidenote, I hate when I see FirstName and Surname, it’s either Forename/Surname or FirstName/LastName). But not everyone does, there are some cultures where it’s less normal, and even within western cultures you have individuals like Cher or Sting. What if they end up in your database for some reason? Even if you are assuming everyone will have a forename and surname, your data may not be the best quality. In these cases you need to make a judgement call, you can either reject data without forename and surname, or allow NULLs or some default value in your database.
DateOfBirth – Everyone definitely has one of these, but your source data may not always capture it. In this case in particular, a default value can do odd things, and you are safer with allowing NULLs or rejecting any data without a date of birth.
IsEmailable – This is the column from real life that triggered this post. In our table it was set to allow NULLs, but I think that is a fundamental mistake. This column is used to decide if we can send an email to someone or not, and regardless of the quality of our source data, we have to know if we will include this person in our email campaign or not. Now, we may get customers who come through without any kind of email permissions in their data set, but when we run an email campaign we either include that customer or not (usually not) so that should be written into the insert into this table, and do not allow a NULL in that column.

What it should look like:

CREATE TABLE dbo.Customer
    (
          CustomerID INT IDENTITY NOT NULL -- primary key
        , ForeName VARCHAR(100) NOT NULL -- everyone has at least 1 name
        , Surname VARCHAR(100) NULL --  not everyone has multiple names
        , MiddleNames VARCHAR(300) NULL -- not everyone has a middle name
        , IsEmailable BIT NOT NULL -- we always know if we will send an email or not
        , MobileNumber CHAR(11) NULL -- not everyone has a mobile number
        , HomePhoneNumber CHAR(11) NULL -- not everyone has a home phone number
        , DateOfBirth DATE NULL -- we don't always get this in the data and it isn't vital to this solution
    )

NULL/Default

In a couple of the columns I said you could use a default or a NULL. I prefer NULLs when you genuinely aren’t sure, because defaults can do odd things to your queries. For example, say you have a default in the DateOfBirth column of ’01-Jan-1900′ because you’re confident that’s not going to be a real value. But, for whatever reason you need to find anyone who’s over 18, so you include a where clause of DATEDIFF(yy, DateOfBirth, SYSDATETIME()) >= 18 and you bring back everyone who’s date of birth you don’t know, and potentially you send some unsuitable mail to a minor and get sued. Leaving DateOfBirth as NULL keeps those rows out of any queries that are filtering on DateOfBirth, and means you have to make a conscious decision to include them.

Conclusion

NULLs are part of the T-SQL language for a reason, and they help capture cases where you might genuinely not know the value of something. Just setting every column to allow NULLs causes you some issues down the line, as every bit of code then has to handle that possibility, but not allowing NULLs when they could legitimately occur can create other issues. You should consider whether to allow NULLs on a column by column basis, both because future developers will have to account for your decisions in their code, and because the decision you make sends a message to those future developers about the shape of the data you are expecting to see inputted into the system.

June 23, 2020

Service Broker 101 Lesson 6: XML and messages

So far we have looked at assembling the Service Broker components, sending messages, and handling them through stored procedures. But in all of the examples, we have only ever treated the message body as a single value that we do something with, like add to a table. This lesson will go through some of the fundamentals of querying XML values to extract information. If you’re already familiar with XML in SQL Server there probably won’t be anything new in here for you.

What is XML?

XML stands for extensible markup language. It’s a way of encoding information to send between services is a semi-structured way. XML works using the concept of elements, each element can contain other elements. An element can also contain a value, or it can be an empty element, and an element can have one or more attributes attached to it. Elements are tagged using the <> notation so an element called SomeElement would appear in XML as <SomeElement>value</SomeElement>. If the element is empty, that can either be represented as <SomeElement></SomeElement> or <SomeElement />. Attributes are included in the opening tab for the elements, like so: <SomeElement Attribute: "AttributeValue">element value</SomeElement>. A more complex example is:

<people>
  <person personID: "123">
    <firstName>
      Larry
    </firstName>
    <lastName>
      Smith
    </lastName>
  </person>
  <person personID: "124">
    <firstName>
      Joanne
    </firstName>
    <lastName>
      White
    </lastName>
    <middleName>
      Rachel
    </middleName>
    <pet>
      <name>
        fluffles
      </name>
      <animalType>
        Dog
      </animalType>
    </pet>
    <pet>
      <name>
        wuffles
      </name>
    </pet
  </person>
</people>

So, what we have here is an element called people. This contains 2 person elements with personID attributes attached to them. These both contain firstName and lastName elements, but personID 124 also has a middleName and two pet elements. These pet elements each hold a name element, but only fluffles has an animalType element.

In this way, we can use XML to hold only the data we actually know. Unless we have specified an XML Schema (outside the scope of this class) there is no validation on what elements or attributes an element should contain.

Retrieving values

Often when we send service broker messages, the message body will contain XML. In these cases, the activation stored procedure often needs to extract the values from the constituent elements in order to act appropriately. SQL Server provides 5 methods you can call from an XML object to extract specific information, as part of the XQuery functionality. We will start with the value method:

DECLARE @XML XML = '<people><person id="5"><firstName>Larry</firstName><lastName>Smith</lastName></person></people>'


SELECT @XML.value('(/people/person/firstName)[1]', 'VARCHAR(100)');
SELECT @XML.value('(/people/person[@id="5"][1]/firstName)[1]', 'VARCHAR(100)');
SELECT @XML.value('/people[1]/person[1]/firstName[1]', 'VARCHAR(100)');

This returns the value “Larry”. There are a few things to note here:

The value method is called as if the XML object is a class and the value is a method in an object oriented language.
The path to the element you want to reach has to travel down form the outermost element, with each element starting with a “/”
You must specify the ordinal number of the element you want to use. This is done by wrapping the path in brackets and specifying the ordinal at the end of the brackets e.g. the first query simply says return the value of the very first firstName, in the people>person>firstName hierarchy.
You can optionally specify ordinals for any of the elements along the path, but can only do away with the round brackets and final ordinal if you specify other ordinals for every level along the way as in the third query. You can also filter at each element level based on attributes using [@attribute=”AttributeValue”] and can combine this with an element ordinal.
You need to specify the data type you want to convert the value to as the second parameter, in this case I have specified VARCHAR(100).

This is fine if you want to essentially send one row of information, so in a situation when each message would contain one person’s information. It is even a possible approach if you know you will always send a fixed number of rows, so if each person would always have three pets attached to them. But what if you want to allow your customers to have a variable number of pets (a far more plausible scenario). That requires you to use the nodes method

Nodes

The nodes method is used to shred XML. Shredding is a term that basically means turn the XML into more of a table structure. A call to the nodes method includes a specified path through the XML, and returns a table containing one row per element at the level specified, with each row containing the whole XML. From there you can use the value method to navigate up and down from the specified element, and return the value(s) you want. For example:

DECLARE @XML XML = '<people><person id="4"><firstName>John</firstName><lastName>Smith</lastName></person><person id="5"><firstName>Larry</firstName><lastName>Smith</lastName></person></people>'


SELECT Person.Person.value('(./firstName)[1]' , 'VARCHAR(100)')
FROM @XML.nodes('/people/person') AS Person(Person);

SELECT Person.FirstName.value('.' , 'VARCHAR(100)')
FROM @XML.nodes('/people/person/firstName') AS Person(FirstName);

Note that in the first SELECT we still need to specify the ordinal position, this is because there may still be multiple firstName elements contained in the person element. In the second example, because we are already at the firstName element, we know that there is only one firstName at our level and do not need the ordinal.

If you ever want to travel up the XML path after using nodes, you can use the ‘.’ notation, ‘.’ means stay at this level but each additional ‘.’ takes you up a level.

This can be a bit tricky to get your head around, so if you are struggling it may be best to do some more reading or experimenting on your own. The basic idea is that the nodes function will return one row for each element at the level you specify. Each row will contain the entire XML, but with a pointer to that row’s element, and any further methods you call against the data will start their paths at that pointer. I’ve included one more example to try and make this make sense:

<people>
	<person id="4">
		<firstName>
			John
		</firstName>
		<lastName>
			Smith
		</lastName>
	</person>
	<person id="5">
		<firstName>
			Larry
		</firstName>
		<lastName>
			Smith
		</lastName>
	</person>
</people>

--Assuming the above XML is contained in @XML

SELECT --something
FROM @XML.nodes('/people/person/firstName') AS Person(FirstName);
  -- This will return a 2 row table. Each row contains the whole XML.
  -- One row's pointer starts at line 3
  -- One row's pointer starts at line 11
  -- To access this, the select needs to reference Person.FirstName and call an XQuery function like 'value'

SELECT --something
FROM @XML.nodes('/people/person') AS Person(Person);
  -- This will return a 2 row table. Each row contains the whole XML.
  -- One row's pointer starts at line 2
  -- One row's pointer starts at line 10
  -- To access this, the select needs to reference Person.Person and call an XQuery function like 'value'

That’s it for this lesson, next lesson will go through an example of how to put everything together in a working application.