
Service Broker 101 Lesson 1: What is Service Broker?

If you don’t really know what Service Broker is, you’re not alone. I had probably heard the term a couple of times in my 14 years as a SQL developer, but had never come across anyone using it until I started my latest job. Even then, I only discovered it when I imported a database into an SSDT database project, and saw a Queue object appear.

I did a little investigation after that, and it seemed an interesting, if little used, piece of functionality. I didn’t really think anything more of it, but filed it away in the part of my brain marked “investigate someday” (that part of my brain gets pretty cluttered and is seldom cleared out).

Then, recently, I had an issue where Service Broker seemed the perfect solution, so I spent a weekend experimenting, coded the fix using Service Broker, and that release is making its way through UAT at the moment.

But what is Service Broker?

I can hear you thinking “yeah, yeah, get to the point already”, so I will.

Service Broker is a queueing system within SQL Server. It allows you to send a message from one queue to another, and handle what happens when that message arrives. That’s pretty much it.
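To make that a little more concrete, here is roughly what sending a single message looks like in T-SQL. This is a minimal sketch with invented names (DemoMessage, DemoContract and so on), just to give you a flavour; we'll go through each of these components properly in a later lesson.

```sql
-- All object names here are invented for illustration
CREATE MESSAGE TYPE DemoMessage VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT DemoContract (DemoMessage SENT BY INITIATOR);

CREATE QUEUE dbo.SenderQueue;
CREATE QUEUE dbo.ReceiverQueue;

CREATE SERVICE SenderService ON QUEUE dbo.SenderQueue (DemoContract);
CREATE SERVICE ReceiverService ON QUEUE dbo.ReceiverQueue (DemoContract);
GO

-- Open a conversation between the two services and send one message
DECLARE @Handle uniqueidentifier;

BEGIN DIALOG CONVERSATION @Handle
    FROM SERVICE SenderService
    TO SERVICE 'ReceiverService'
    ON CONTRACT DemoContract
    WITH ENCRYPTION = OFF;

SEND ON CONVERSATION @Handle
    MESSAGE TYPE DemoMessage (N'<Voucher ID="1" />');
```

The message now sits on dbo.ReceiverQueue until something reads it off.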

So, why would I want that?

Well, that’s where the story I told you at the start comes in. The issue we had was that our client wanted to run a process during the day that usually gets run at night. This process is pretty long, and locks up some tables for a minute or more, including some tables it locks through an indexed view (that’s a whole other issue that I’ll maybe blog about some other day). At the same time, users are logging onto the application to do various things, including downloading one-use vouchers. The stored proc that sits behind that API reads the data ok, but wants to write the record that these vouchers have been viewed to one of the locked tables.

What I’ve done is shift the write to the table from the stored procedure to a queue. Now when a user requests their vouchers the system selects them from the right table, and fires off a message to the queue with all the details of the vouchers they just viewed, and the queue adds them to the table whenever it has a chance.

So, that’s scenario #1, when you have a process that needs to do something but doesn’t need to do it immediately, you can split that part of the process out and use a queue to have it happen as soon as possible, allowing your main process to complete quicker. Typically this will be logging that something has happened, where you need to return the result of the thing immediately but the logging table might be blocked by either a big read/write, or lots of small reads/writes.
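On the receiving side, here is a sketch of what the handler for the voucher example could look like (again, every name here is invented, and activation and the rest of the plumbing are topics for later lessons):

```sql
-- Hypothetical handler: pulls one message off the queue and does the deferred write
CREATE PROCEDURE dbo.ProcessVoucherViewedQueue
AS
BEGIN
    DECLARE @Handle uniqueidentifier;
    DECLARE @Body xml;

    -- Wait up to 5 seconds for a message to arrive
    WAITFOR
    (
        RECEIVE TOP (1)
              @Handle = conversation_handle
            , @Body = CAST(message_body AS xml)
        FROM dbo.VoucherViewedQueue
    ), TIMEOUT 5000;

    IF @Handle IS NOT NULL
    BEGIN
        -- The deferred write: record that the voucher was viewed
        INSERT INTO dbo.VoucherViewedLog (VoucherID, ViewedWhen)
        SELECT @Body.value('(/Voucher/@ID)[1]', 'int'), SYSDATETIME();

        END CONVERSATION @Handle;
    END
END
```

Because this runs whenever the queue gets a chance, the user's request returns immediately even if the log table is locked.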

Scenario #2 is when you are running something like the ETL systems I’ve increasingly seen and heard about getting built. These systems work off queues in any case, typically built as tables, where you have an overarching system that dynamically decides what happens and in what order.

As an example, you start by importing 3 files, so you add those imports to the queue and they all start executing. Once the first one is finished, you add a validation for that to the queue and that starts processing. File 2 finishes importing but that’s flagged as not needing validation so you check what data loads you can run with just that file and add them to the queue. File 3 takes a long time to load, and File 1 finishes validation in that time but there’s nothing that can be populated from File 1 until File 3 is finished loading so nothing gets added to the queue at that point.

If you have a system that wants to work like that, where you are executing a set of modules in a dynamic order, then Service Broker may be useful as a way of managing that order.

I was going to post a bit more here about the different components that make up Service Broker, but I’ve gone on for longer than I expected just on this, so I think I’ll leave that for the next post.

Lessons

I’m trying something new with my blog post this week. I want to start doing different series on subjects I think people would benefit from a deep dive into. I’m starting with Service Broker, a topic I knew nothing about until a few months ago. Other possible topics include:

  • Different components of SQL Server
  • SSAS
  • MDS
  • DQS
  • SQLCLR

Some of these I know quite a bit about, others I don’t have much of a clue about at the moment, and that’s part of the reason I want to write about them: to force myself to learn. These posts will be titled something like [topic] 101 Lesson 1: [lesson subject]. Depending on how this goes I might add some 201 or 301 series in the future, but for now the idea is to assume no knowledge from the reader and try to get to a point where they can not only (in this case) code a simple Service Broker solution, but also understand what they are doing.

Anyway, the first Service Broker post goes up today. Enjoy.

TSQL Tuesday #126: Folding@Home

This month’s T-SQL Tuesday comes from Glenn Berry, and is all about what you are doing to help during the ongoing Coronavirus crisis.

He links to Folding@Home, which allows you to use your personal machine(s) to do complex protein folding calculations to help with medical research. I don’t pretend to understand all of what they’re doing, but essentially it’s taking a problem, breaking it down into lots of mini problems, and sending these mini problems out to individual computers to find solutions. I’ve signed up to that now, and joined the Tech Nottingham team, so that’s one thing I’m doing. Plus I’m trying to get my old desktop working again, so it can run on there as well as my newer machine, and I can fold double the things.

I’m finding this a hard blog to write though, because I don’t feel like I’m doing that much apart from that. The main thing I’ve done is to set up a Microsoft Teams organisation for my local creative writing group, Nottingham Writers Collective. We’ve run a few meetups on there and it’s been a fun way to keep in touch with some people who are very important to me. We’ve also set up a number of things in the team to help people share work, and set challenges for themselves, but so far nobody has really used them. They’re there though, if anyone does feel a need, and I hope as the lockdown continues we will take advantage of them a bit more.

I think this is one of the little things that a lot of us are probably in a position to do. So much of our lives are suddenly being lived online, and it’s easy for us as techie people to forget how daunting a lot of this stuff is. So, make more of an effort to be patient with your family as they ask you for the 10th time how Skype works, or want to know why they can’t see everyone on a Zoom chat. Look out for things that can help keep people connected, and tell everyone about them.

I’ve discovered Tabletop Simulator, Humble Bundle’s offer on Asmodee games, and Board Game Arena’s free-to-play browser versions of many tabletop games, and managed to organise a gaming session with some of my friends a few weekends ago. I think it helped us feel a bit less alone, and I will try to organise similar things in the future.

This was an odd blog to write, I’ve not really written anything on here until now that wasn’t very SQL/tech focussed. I hope anyone who reads this is keeping well and coping ok with everything.

AND and OR interactions

I’ve been working through a particularly nasty bug recently, and when I eventually found the cause it turned out to be a mistake in a WHERE clause including several ANDs and ORs. I thought it’d make an interesting topic to dive into for a quick blog post.

The basic issue looked something like this:

INSERT INTO dbo.TargetTable
    (
          TableGUID
        , Column1
        , Column2
    )
SELECT DISTINCT
      TableGUID
    , Column1
    , Column2
FROM dbo.SourceTable
WHERE SourceTable.StatusColumn = 'A'
    OR (SourceTable.StatusColumn = 'B' AND SourceTable.StatusDate IS NULL)
    AND SourceTable.TableGUID NOT IN 
        (SELECT TableGUID FROM dbo.TargetTable)

The problem was we wanted to apply the last AND predicate every time, but the interactions between the ANDs and the OR meant that wasn’t happening. To see exactly what I mean, here’s a couple of simplified versions of the code where I’ve used brackets to make it clearer what is happening:

SELECT 1 -- returns successfully
WHERE 1 = 1
    OR 2 = 2
    AND 2 = 1 -- we want it to not return because of this
    
SELECT 1 -- this is what is actually happening
WHERE 1 = 1
    OR (2 = 2 AND 2 = 1)

SELECT 1 -- this is what we should have done
WHERE (1 = 1 OR 2 = 2)
    AND 2 = 1

So what’s going on here? In T-SQL, AND has a higher precedence than OR, which means the ANDs bind together first and everything after the OR ends up grouped with it. When the first predicate returns true it doesn’t matter what the rest of the predicates are, because they’re all on the other side of the OR. At this point we have a diagnosis, and the solution seems pretty clear: re-write the code with some brackets to tell the query engine what we actually mean.

INSERT INTO dbo.TargetTable
    (
          TableGUID
        , Column1
        , Column2
    )
SELECT DISTINCT
      TableGUID
    , Column1
    , Column2
FROM dbo.SourceTable
WHERE (SourceTable.StatusColumn = 'A'
    OR (SourceTable.StatusColumn = 'B' AND SourceTable.StatusDate IS NULL))
    AND SourceTable.TableGUID NOT IN 
        (SELECT TableGUID FROM dbo.TargetTable)

That gives us a functionally correct solution, but to me there’s another issue. We have re-written the code to clarify things for the query engine, but I’d argue we haven’t made it particularly clear for the next developer who has to edit this code (this is all part of the same insane block of code I wrote about in my code noise post a couple of weeks ago), and that can lead to all kinds of issues further down the line.

I have a particular approach whenever I’m writing a set of predicates connected with both ANDs and ORs. I effectively layer the predicates, starting with a top layer of either ANDs or ORs, then moving to the second layer which will be the opposite. Each sub-layer is wrapped in brackets and indented, and I usually keep each predicate on a different line. For example, this is how I would lay out the code we started this post with:

INSERT INTO dbo.TargetTable
    (
          TableGUID
        , Column1
        , Column2
    )
SELECT DISTINCT
      TableGUID
    , Column1
    , Column2
FROM dbo.SourceTable
WHERE 1 = 1
    AND (SourceTable.StatusColumn = 'A' -- top layer of ANDs
         OR (SourceTable.StatusColumn = 'B' -- second layer of ORs
             AND SourceTable.StatusDate IS NULL)) -- third layer of ANDs
    AND SourceTable.TableGUID NOT IN
        (SELECT TableGUID FROM dbo.TargetTable)

This makes it quite clear that the last AND needs to be evaluated separately to the rest of the WHERE clause.

Now you might be wondering where the 1 = 1 came from. That’s something I like to include in all of my code to make it easier to debug by allowing you to comment out the first predicate easily. Without that, if you want to comment out the first predicate and keep the second you end up having to do something awkward like this:

FROM dbo.SourceTable
WHERE --(SourceTable.StatusColumn = 'A'
       --OR (SourceTable.StatusColumn = 'B'
           --AND SourceTable.StatusDate IS NULL))
    --AND 
    SourceTable.TableGUID NOT IN
        (SELECT TableGUID FROM dbo.TargetTable)

But with the 1 = 1 you can do this instead:

FROM dbo.SourceTable
WHERE 1 = 1
    --AND (SourceTable.StatusColumn = 'A'
         --OR (SourceTable.StatusColumn = 'B'
             --AND SourceTable.StatusDate IS NULL))
    AND SourceTable.TableGUID NOT IN (SELECT TableGUID FROM dbo.TargetTable)

Which saves you from messing about with the last AND predicate at all.

Now, if your query is largely ORs so you want that to be your top layer, you can’t do quite the same thing because the OR means the WHERE always comes back as TRUE. So, what you use instead is 1 = 2, which achieves the same thing as far as ease of debugging is concerned:

FROM dbo.SourceTable
WHERE 1 = 2
    OR (SourceTable.StatusColumn = 'A'
        AND SourceTable.TableGUID NOT IN
            (SELECT TableGUID FROM dbo.TargetTable))
    OR (SourceTable.StatusColumn = 'B'
        AND SourceTable.StatusDate IS NULL
        AND SourceTable.TableGUID NOT IN
            (SELECT TableGUID FROM dbo.TargetTable))

This isn’t the neatest way of writing the code, because we have to repeat the NOT IN across the different OR predicates, but it does the same thing as the rest of the code we’ve been looking at. I suppose for consistency, I should include the 1 = 1 or 1 = 2 in the bracketed predicates as well, and that would help when it comes to debugging, but it would also clutter the code more than a little as we can see:

FROM dbo.SourceTable
WHERE 1 = 2
    OR (1 = 1
        AND SourceTable.StatusColumn = 'A'
        AND SourceTable.TableGUID NOT IN
            (SELECT TableGUID FROM dbo.TargetTable))
    OR (1 = 1
        AND SourceTable.StatusColumn = 'B'
        AND SourceTable.StatusDate IS NULL
        AND SourceTable.TableGUID NOT IN
            (SELECT TableGUID FROM dbo.TargetTable))

Having said that, I do quite like the way that looks. In particular, I like the way each new AND block is clearly defined with the 1 = 1. These kind of standards are something to discuss with your team, if possible, and work together to standardise the way you write code.

Finally, here’s a made up example with several layers to show how this can look with very complex statements. The numeric predicates (1 = 1 and 1 = 2) are there to allow the commenting out of other predicates, and everything else is there as a stand-in for actual query logic:

SELECT
      1
WHERE 1 = 2 -- false or false or true or false = true
    OR (1 = 1        -- true and true and false = false
        AND 'A' = 'A'  -- true
        AND 'B' = 'B'  -- true
        AND (1 = 2     -- false or false = false
            OR 'AB' = 'AC' -- false
            OR 'AB' IN     -- false
                ('AD', 'A', 'AA', 'ABA')))
    OR (1 = 1        -- true and true and false = false
        AND (1 = 2     -- true or false = true
            OR 'X' = 'X'  -- true
            OR 'Z' = 'A') -- false
        AND (1 = 2     -- true or false = true
            OR 'G' = 'G'  -- true
            OR 'F' = 'W') -- false
        AND (1 = 2     -- false or false = false
            OR 'F' = 'G'   -- false
            OR 'H' = 'I')) -- false
    OR (1 = 1        -- true and true = true
        AND (1 = 2     -- false or true = true
            OR (1 = 1     -- false and true = false
                AND 'E' = 'F'  -- false
                AND 'G' = 'G') -- true
            OR 'A' = 'A') -- true
        AND 'B' = 'B') -- true
    OR NOT 'A' = 'A' -- not true = false

So, in conclusion, if you mix AND and OR in your code, be aware that AND binds more tightly than OR, so everything ANDed together after an OR gets treated as a single bracketed group. Ideally, write your code with explicit brackets and style it in a way that makes it clear what is going on.

Values blocks

Values blocks are a really useful little bit of code in SQL Server. Basically, they are a block of defined values that you can use pretty much like any other data set.

The main place you may have encountered them before is as the source for an INSERT. Often when people need to add a set of records to a table I see something like this:

INSERT INTO dbo.Table1
    (
          Column1
        , Column2
    )
SELECT 1, 'some value'
UNION
SELECT 2, 'some other value';

Or even worse:

INSERT INTO dbo.Table1 (Column1, Column2)
SELECT 1, 'some value';
INSERT INTO dbo.Table1 (Column1, Column2)
SELECT 2, 'some other value';

The first attempt is ok, but it’s bulky and unnecessary, it needs you to keep typing UNION over and over, and UNION also de-duplicates the rows as a side effect (you’d really want UNION ALL here). The second attempt is actively inefficient, as each row is inserted individually instead of inserting everything as a set.

The cleaner way, that uses a VALUES block, is:

INSERT INTO dbo.Table1
    (
          Column1
        , Column2
    )
VALUES
      (1, 'some value')
    , (2, 'some other value');

This saves you from typing out the UNION all the time, and in my opinion looks neater on the page and makes your block of values easier to read.

The basic rules for a values block are:

  1. The data in each row is comma separated
  2. Each row of data is wrapped in a set of brackets
  3. The rows themselves are also separated by commas
  4. Each row has the same number of values (2 in the example above)
  5. Each value position has to hold data of the same type for every row (int, varchar in the example above)
  6. NULLs are allowed for any value

This use case is useful to know about by itself, but I think the more powerful use of values blocks comes when you start using them in other queries. To do this you need to treat them as a subquery, like in this example:

SELECT
      val.Column2
    , tbl.Column4
FROM dbo.Table2 AS tbl
INNER JOIN
    (
        VALUES
              (1, 'some value')
            , (2, 'some other value')
    ) AS val(Column1, Column2)
    ON tbl.Column1 = val.Column1;

So, all you have to do is wrap the values block in brackets, alias it, and name the columns, and you can use it like any other subquery. The only thing that’s different here is the need to name those columns when you do the aliasing, but you do that simply by listing the names in brackets after the alias.

Interestingly, this is something you can do with regular subqueries as well. It’s probably not something you will use very often as you’re more likely to rename the column in the subquery, but it never hurts to know about these things.
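For example (with hypothetical names), you can rename a regular derived table's columns in exactly the same way:

```sql
-- The alias column list renames Column1 to RenamedColumn
SELECT sq.RenamedColumn
FROM
    (
        SELECT Column1 FROM dbo.Table2
    ) AS sq(RenamedColumn);
```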

Finally, if I’m using a big values block in a SQL statement, it can be a bit unwieldy to have it in a subquery. It can dominate the rest of the statement, and if it’s long enough you won’t be able to see all of the statement on the screen, and the individual values rarely add much to your understanding of the code. That’s why I will often put the values block in a common table expression at the start of the statement. That also allows you to reuse it if you need to refer to it more than once. Example below:

WITH val AS
(
    SELECT
          val.Column1
        , val.Column2
    FROM
        (
            VALUES
                  (1, 'some value')
                , (2, 'some other value')
        ) AS val(Column1, Column2)
)
SELECT
      val.Column2
    , tbl.Column4
FROM dbo.Table2 AS tbl
INNER JOIN val
    ON tbl.Column1 = val.Column1;

Another option here might be to put the values into a table variable or temp table, but those won’t be as efficient for the query to process. The only time I’d consider that is if there were several statements that wanted to access the same data set, and the data set was small.
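For completeness, that alternative looks something like this (the temp table name is just for illustration):

```sql
CREATE TABLE #val
    (
          Column1 int
        , Column2 varchar(20)
    );

INSERT INTO #val (Column1, Column2)
VALUES
      (1, 'some value')
    , (2, 'some other value');

-- several statements can now reuse #val
SELECT
      val.Column2
    , tbl.Column4
FROM dbo.Table2 AS tbl
INNER JOIN #val AS val
    ON tbl.Column1 = val.Column1;
```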

So, that’s pretty much everything I know about values blocks. Hope that was useful.

Code noise

I love the term code noise. It’s one of those terms that succinctly encapsulates a quite complex topic in a couple of words, and is instantly recognisable to anyone who’s encountered it even if they had never heard the term before.

Basically code noise is anything that pulls your attention away from what the code is supposed to be doing, or obscures the true nature of the code in some way. It’s not something we consider enough when writing T-SQL code, but I think there is a lot to be said for writing code the next person will be able to read.

As a small example, I was debugging something recently and found that all of the insert statements had ORDER BY clauses. I couldn’t work out why they were making me so angry (after all, it’s not doing anything to hurt performance, and in fact isn’t doing anything at all) until one of the other devs in the office pointed out that it’s one example of the code noise the whole code base is filled with.

More extreme examples are the tendency some developers have to load data into temporary table after temporary table, or to write nested subqueries five layers deep. Both of these things largely just hide where the actual logic of the code is, and make it a nightmare to debug. Here’s an example:

WITH WorkOrderRoutingPlusLocation
AS
    (
        SELECT
              WrkOrdrRtng.WorkOrderID
            , WrkOrdrRtng.ProductID
            , WrkOrdrRtng.ActualCost
            , WrkOrdrRtng.PlannedCost
            , WrkOrdrRtng.OperationSequence
            , Loc.Name AS LocationName
        FROM Production.WorkOrderRouting AS WrkOrdrRtng
        INNER JOIN Production.[Location] AS Loc
            ON WrkOrdrRtng.LocationID = Loc.LocationID
    )
SELECT
      WrkOrdr.WorkOrderID
    , WrkOrdr.ScrappedQty
    , WrkOrdrRtng.OperationSequence
    , Prod.Name AS ProductName
    , WrkOrdrRtng.LocationName AS LocationMovedTo
    , PrevWrkOrdrRtng.LocationName AS LocationMovedFrom
    , WrkOrdrRtng.PlannedCost
    , WrkOrdrRtng.ActualCost
    , CASE
          WHEN WrkOrdrRtng.ActualCost
              > WrkOrdrRtng.PlannedCost
          THEN 'Over budget'
          WHEN WrkOrdrRtng.ActualCost
              < WrkOrdrRtng.PlannedCost
          THEN 'Under budget'
          WHEN WrkOrdrRtng.ActualCost
              = WrkOrdrRtng.PlannedCost
          THEN 'On budget'
          ELSE NULL
      END AS BudgetStatus
FROM Production.WorkOrder AS WrkOrdr
INNER JOIN WorkOrderRoutingPlusLocation AS WrkOrdrRtng
    ON WrkOrdr.WorkOrderID = WrkOrdrRtng.WorkOrderID
INNER JOIN Production.Product AS Prod
    ON WrkOrdrRtng.ProductID = Prod.ProductID
LEFT JOIN WorkOrderRoutingPlusLocation AS PrevWrkOrdrRtng
    ON WrkOrdrRtng.WorkOrderID = PrevWrkOrdrRtng.WorkOrderID
    AND WrkOrdrRtng.ProductID = PrevWrkOrdrRtng.ProductID
    AND WrkOrdrRtng.OperationSequence 
        = PrevWrkOrdrRtng.OperationSequence + 1
INNER JOIN
    (
        SELECT
              WrkOrdrRtng.WorkOrderID
            , SUM(WrkOrdrRtng.PlannedCost) AS TotalPlannedCost
            , SUM(WrkOrdrRtng.ActualCost) AS TotalActualCost
        FROM Production.WorkOrderRouting AS WrkOrdrRtng
        GROUP BY
              WrkOrdrRtng.WorkOrderID
    ) AS sq_WrkOrdrRtngTotals
    ON WrkOrdr.WorkOrderID = sq_WrkOrdrRtngTotals.WorkOrderID
WHERE 1 = 1
    AND sq_WrkOrdrRtngTotals.TotalActualCost
        > sq_WrkOrdrRtngTotals.TotalPlannedCost
ORDER BY
      WrkOrdr.WorkOrderID
    , WrkOrdrRtng.OperationSequence

I think this is fairly self-explanatory code, even without any comments. There’s not much here that isn’t necessary, just the 1 = 1 in the WHERE clause, but that’s to help with debugging. The CTE is there because we use these tables joined together more than once in the query, and one of those times is the right side of a left join. The subquery is there because we genuinely want to look at things at a different level of aggregation to the main query. Everything else is joined together very logically.

Under other circumstances I would have formatted it slightly differently, but to make it fit well on the blog post I’ve tried to make it as thin as possible. To that end I’ve done things like splitting predicates across multiple lines that I wouldn’t ordinarily do, but I don’t think that affects the readability of the code too much.

Now, consider this alternative way of writing this query:

SELECT
      Sub3.WorkOrderID
    , WrkOrdr.ScrappedQty
    , Sub3.OperationSequence
    , Prod.Name AS ProductName
    , LNam AS LocationMovedTo
    , L.[Name] AS LocationMovedTo
FROM
    (
        SELECT
              Sub2.WorkOrderID
            , ProductID
            , BudgetStatus
            , OperationSequence
        FROM
            (
                SELECT
                      WorkOrderID
                    , ScrappedQty
                    , SUM(PlannedCost) AS TotalPlannedCost
                    , SUM(ActualCost) AS TotalActualCost
                FROM
                    (
                        SELECT DISTINCT
                              WrkOrdr.WorkOrderID
                            , ScrappedQty
                            , PlannedCost
                            , ActualCost
                        FROM Production.WorkOrder AS WrkOrdr
                        INNER JOIN Production.WorkOrderRouting AS WrkOrdrRtng
                            ON WrkOrdr.WorkOrderID
                                = WrkOrdrRtng.WorkOrderID
                    ) AS Sub1
                GROUP BY
                      WorkOrderID
                    , ScrappedQty
            ) AS Sub2
        INNER JOIN
            (
                SELECT
                      CASE
                          WHEN ActualCost > PlannedCost
                          THEN 'Over budget'
                          WHEN ActualCost < PlannedCost
                          THEN 'Under budget'
                          WHEN ActualCost = PlannedCost
                          THEN 'On budget'
                          ELSE NULL
                      END AS BudgetStatus
                    , WorkOrderID
                    , ProductID
                    , OperationSequence
                FROM Production.WorkOrderRouting AS WrkOrdrRtng
            ) AS Sub21
            ON Sub2.WorkOrderID = Sub21.WorkOrderID
        WHERE 1 = 1
            AND TotalPlannedCost < TotalActualCost
    ) AS Sub3
INNER JOIN Production.Product AS Prod
    ON Sub3.ProductID = Prod.ProductID
INNER JOIN Production.WorkOrder AS WrkOrdr
    ON Sub3.WorkOrderID = WrkOrdr.WorkOrderID
INNER JOIN Production.WorkOrderRouting AS WrkOrdrRtng
    ON Sub3.WorkOrderID = WrkOrdrRtng.WorkOrderID
    AND Sub3.ProductID = WrkOrdrRtng.ProductID
    AND Sub3.OperationSequence = WrkOrdrRtng.OperationSequence
INNER JOIN
    (
        SELECT
              L.LocationID
            , L.[Name] AS LNam
        FROM Production.[Location] AS L
    ) AS Sub4
    ON WrkOrdrRtng.LocationID= Sub4.LocationID
LEFT JOIN
    (
        SELECT
              WorkOrderID
            , ProductID
            , OperationSequence
            , LocationID
        FROM Production.WorkOrderRouting
    ) SubWOR
    ON WrkOrdrRtng.WorkOrderID = SubWOR.WorkOrderID
    AND WrkOrdrRtng.ProductID = SubWOR.ProductID
    AND WrkOrdrRtng.OperationSequence
        = SubWOR.OperationSequence + 1
LEFT JOIN Production.[Location] AS L
    ON SubWOR.LocationID = L.LocationID

Now, I think these two statements do the same thing, although to be honest I got a little lost writing this second one. It should be obvious, however, that the second statement is not nearly as clear, that there is a lot of extra code around it making a lot of noise.

Obviously this is a made up example, but it is similar to a lot of real-world examples I’ve seen. In particular, the overuse of subqueries (and subqueries inside subqueries inside subqueries) to filter or join data. The danger here, apart from it looking ugly, is that another developer comes along, can’t read the intention behind the original code because of all the noise, and just hacks something else onto the existing mess. You can see this has happened where someone has added the WorkOrderRouting table to the query again (the LEFT JOIN to SubWOR near the bottom), because they need to join from it to the Location table. The WorkOrderRouting table is already part of this query, in Sub21 inside of Sub3, but the new developer hasn’t been able to figure this out, or maybe they’re not sure about how to bubble up the LocationID through all of the subqueries (especially as WorkOrderRouting also exists in Sub1 inside Sub2 inside Sub3 but can’t be bubbled up because there’s some aggregation along the way). Instead they’ve just hacked a new join to the table onto the existing mess and everything has gotten that much harder to understand.

Another thing that’s obviously bad about this statement is the inconsistent naming. Sub1 is always a terrible name: you need to alias anything, but especially a subquery, with something meaningful. I like to prefix any subquery aliases with sq_ so when you reference it elsewhere in the query you know you’re referencing a subquery. You also need to make all column names two-part. Where does LNam in the outer SELECT come from? Or OperationSequence in Sub3? Without two-part names for all columns, this can be a nightmare to figure out.

I want to end with an example of really bad code noise I found yesterday in my actual work. Table and column names are changed, but the rest of the code is as is:

DELETE FROM TriggerTable_A
WHERE A_GUID IN (SELECT tmp_Sync7.A_GUID
                        FROM tmp_Sync7
                        INNER JOIN TriggerTable_A a on tmp_Sync7.A_GUID = a.A_GUID
                        WHERE tmp_Sync7.UpdatedDate = a.UpdatedDate)

This looks a bit odd, but the main question you have looking at it is what’s that tmp_Sync7 table, right? Well, almost 300 lines of code previously we have this little code snippet (and I checked a few times and there is nothing in those 300 lines of code that does anything to tmp_Sync7):

IF EXISTS(SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'tmp_Sync7') 
    DROP TABLE tmp_Sync7

SELECT * INTO tmp_Sync7 from TriggerTable_A

My brain pretty much exploded when I saw this. What we’re saying here is DELETE from TriggerTable_A if A_GUID is in TriggerTable_A joined to TriggerTable_A. Basically, DELETE FROM TriggerTable_A. As a bonus, you don’t have that nasty permanent tmp_Sync7 taking up space on your database, and you improve performance because you’re doing a lot less.
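In other words, the whole dance, temp table and all, collapses (assuming, as in the original, that nothing touches the tables in between) to a single statement:

```sql
-- Equivalent to the snapshot-then-self-join above
DELETE FROM TriggerTable_A;
```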

Bottom line is code noise makes it harder for other developers to read your code. It makes it harder for you to read your code when you come back to it in 6 months to fix some crazy bug that’s only just started showing up. It can make it harder for the query optimiser to read your code, meaning it takes longer to come up with an execution plan, and has more chance of hitting the time-out limit and returning a sub-optimal plan. Both of these things hurt performance, but ultimately your drive to eliminate code noise shouldn’t just be driven by that. You should want to keep your code clean, simple, and elegant, so that when your fellow developers come to build on what you’ve done in 6 months or a year, they can easily understand what your code is doing and make clean, simple, elegant changes to it themselves.

T-SQL Tuesday #125: Unit testing databases – we need to do this!!

This is my second time taking part in T-SQL Tuesday. This month the topic is all about unit testing in databases, and whether it is valuable or not.

This is a topic that is quite close to my heart. I don’t come from a computing background before I started working with SQL Server, so I was quite ignorant when it came to a lot of best practices that other developers who have worked with other languages are aware of. Because of this, I had no idea about unit testing until I attended a talk at a SQL Saturday all about tSQLt. If anyone isn’t aware (as I wasn’t) tSQLt is a free open source unit testing framework for use in SQL Server databases. It is the basis of Redgate’s SQL Test software, and is the most used framework for writing unit tests in SQL Server.

Since then I’ve worked to try and get employers to adopt this as part of a standard development life cycle, with mixed success at best. My current employer is quite keen, but there are two major problems. First, we have a huge amount of legacy code that obviously has no unit tests in place; and second, the way people code is not conducive to unit testing.

It’s the second issue I want to talk about today. Maybe I will cover writing unit tests for legacy systems in another blog someday, but for now I want to discuss how the way you code may need to change if you adopt unit testing, in particular how you need to start adopting a more modular style of coding.

What is a unit test?

Because there can be a lot of confusion about this, I thought it was best to start by defining unit tests. I see good unit tests as having a few main characteristics:

  1. They test against units of code (stored procedures, functions, views etc.)
  2. Each unit test only tests one thing.
  3. A unit test isolates all the dependencies before testing.
  4. Unit tests can be rerun in an automated way.
  5. Unit tests always pass or always fail for a given set of inputs.

In the original T-SQL Tuesday prompt post, these are listed as:

  1. Decisive – the unit test has all info to determine success/failure
  2. Valid – it produces a result that matches the intention of the code written
  3. Complete – it contains all information it needs to run correctly within the test harness
  4. Repeatable – always gives the same results if the test harness and code are same
  5. Isolated – is not affected by other tests run before nor affects the tests run after it
  6. Automated – requires only a start signal in order to run to completion

How this affects the code I write

Whichever set of criteria you want to use, the results for your style of coding start to look the same. Put simply, unit tests are easier to write when a single unit of code does one thing and one thing only. That thing can be extremely complicated (in which case you will probably have quite a few unit tests around it) or very simple (in which case it may not even need testing at all) but your unit of code should not be attempting to do more than one thing at once.

What do I mean by this? Well, if we look at the Wide World Importers Microsoft sample database, we see a stored procedure called Website.RecordColdRoomTemperatures:

CREATE PROCEDURE Website.RecordColdRoomTemperatures
    @SensorReadings Website.SensorDataList READONLY
AS
BEGIN
    BEGIN TRY

        DECLARE @NumberOfReadings int = (SELECT MAX(SensorDataListID) FROM @SensorReadings);
        DECLARE @Counter int = (SELECT MIN(SensorDataListID) FROM @SensorReadings);

        DECLARE @ColdRoomSensorNumber int;
        DECLARE @RecordedWhen datetime2(7);
        DECLARE @Temperature decimal(18,2);

        -- note that we cannot use a merge here because multiple readings might exist for each sensor

        WHILE @Counter <= @NumberOfReadings
        BEGIN
            SELECT @ColdRoomSensorNumber = ColdRoomSensorNumber,
                   @RecordedWhen = RecordedWhen,
                   @Temperature = Temperature
            FROM @SensorReadings
            WHERE SensorDataListID = @Counter;

            UPDATE Warehouse.ColdRoomTemperatures
            SET RecordedWhen = @RecordedWhen,
                Temperature = @Temperature
            WHERE ColdRoomSensorNumber = @ColdRoomSensorNumber;

            IF @@ROWCOUNT = 0
            BEGIN
                INSERT Warehouse.ColdRoomTemperatures
                    (ColdRoomSensorNumber, RecordedWhen, Temperature)
                VALUES (@ColdRoomSensorNumber, @RecordedWhen, @Temperature);
            END;

            SET @Counter += 1;
        END;

    END TRY
    BEGIN CATCH
        THROW 51000, N'Unable to apply the sensor data', 2;

        RETURN 1;
    END CATCH;
END;

We can see the procedure is doing a couple of things here. First it takes the input parameter @SensorReadings as a custom table data type and iterates through it. As it does this, it inserts or updates the Warehouse.ColdRoomTemperatures table with the values from the current row of the table variable. This is not the most awkward thing to test, but it could be made simpler if the UPDATE and INSERT logic inside the WHILE loop were put into its own stored procedure. Then the outer procedure would look more like this:

CREATE PROCEDURE Website.RecordColdRoomTemperatures
    @SensorReadings Website.SensorDataList READONLY
AS
BEGIN
    BEGIN TRY

        DECLARE @NumberOfReadings int = (SELECT MAX(SensorDataListID) FROM @SensorReadings);
        DECLARE @Counter int = (SELECT MIN(SensorDataListID) FROM @SensorReadings);

        DECLARE @ColdRoomSensorNumber int;
        DECLARE @RecordedWhen datetime2(7);
        DECLARE @Temperature decimal(18,2);

        WHILE @Counter <= @NumberOfReadings
        BEGIN
            SELECT @ColdRoomSensorNumber = ColdRoomSensorNumber,
                   @RecordedWhen = RecordedWhen,
                   @Temperature = Temperature
            FROM @SensorReadings
            WHERE SensorDataListID = @Counter;

            EXEC Warehouse.UpdateColdRoomTemperatureBySensorNumber
                  @ColdRoomSensorNumber = @ColdRoomSensorNumber
                , @RecordedWhen = @RecordedWhen
                , @Temperature = @Temperature;

            SET @Counter += 1;
        END;

    END TRY
    BEGIN CATCH
        THROW 51000, N'Unable to apply the sensor data', 2;

        RETURN 1;
    END CATCH;
END;

This way, we can write unit tests against the outer procedure to see if it is looping effectively, and separate unit tests against the inner procedure to test if it is updating the table correctly. A nice side-effect is that if any other code needs to merge a single row into the Warehouse.ColdRoomTemperatures table, it can reuse the Warehouse.UpdateColdRoomTemperatureBySensorNumber stored procedure.
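The inner procedure itself would look something like this. This is a sketch, not part of the Wide World Importers sample; I have assumed the parameter names used in the call above and simply moved the original UPDATE/INSERT logic inside:

```sql
CREATE PROCEDURE Warehouse.UpdateColdRoomTemperatureBySensorNumber
      @ColdRoomSensorNumber int
    , @RecordedWhen datetime2(7)
    , @Temperature decimal(18,2)
AS
BEGIN
    -- Try to update an existing reading for this sensor first
    UPDATE Warehouse.ColdRoomTemperatures
    SET RecordedWhen = @RecordedWhen,
        Temperature = @Temperature
    WHERE ColdRoomSensorNumber = @ColdRoomSensorNumber;

    -- If no row existed for the sensor, insert one instead
    IF @@ROWCOUNT = 0
    BEGIN
        INSERT Warehouse.ColdRoomTemperatures
            (ColdRoomSensorNumber, RecordedWhen, Temperature)
        VALUES (@ColdRoomSensorNumber, @RecordedWhen, @Temperature);
    END;
END;
```

Each procedure now does one thing: the outer one loops, the inner one upserts a single row.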

This, really, is what modular coding is all about. Making sure that each module in your code base is only trying to do one thing. You can then wrap unit tests around that module to make sure it does that thing well, and re-use it over and over again throughout your code whenever you need to do that one thing.

To keep with the example above, I don’t like the way the code does the UPDATE and then the INSERT if @@ROWCOUNT = 0. Despite the comment in the code, you can use a MERGE provided you only merge in the one row being added in the WHILE block. The end result of this change should function the same as the original code but look more elegant. If I have the code related to the Warehouse.ColdRoomTemperatures in its own stored procedure, with some unit tests around it, I can change that stored procedure however I like provided the unit tests still pass, confident that any calling stored procedures will still function the same.
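As a sketch of what that MERGE-based rewrite might look like (the inner procedure name follows my example above and is not part of the sample database; treat this as illustrative rather than tested against the Wide World Importers schema):

```sql
CREATE OR ALTER PROCEDURE Warehouse.UpdateColdRoomTemperatureBySensorNumber
      @ColdRoomSensorNumber int
    , @RecordedWhen datetime2(7)
    , @Temperature decimal(18,2)
AS
BEGIN
    -- Safe to MERGE here because each call handles exactly one sensor reading
    MERGE Warehouse.ColdRoomTemperatures AS target
    USING (SELECT @ColdRoomSensorNumber, @RecordedWhen, @Temperature)
          AS source (ColdRoomSensorNumber, RecordedWhen, Temperature)
        ON target.ColdRoomSensorNumber = source.ColdRoomSensorNumber
    WHEN MATCHED THEN
        UPDATE SET RecordedWhen = source.RecordedWhen,
                   Temperature = source.Temperature
    WHEN NOT MATCHED THEN
        INSERT (ColdRoomSensorNumber, RecordedWhen, Temperature)
        VALUES (source.ColdRoomSensorNumber, source.RecordedWhen, source.Temperature);
END;
```

Because the contract of the procedure (one row in, one row upserted) is unchanged, the calling code never needs to know the implementation switched from UPDATE-then-INSERT to MERGE.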

Orchestration procedures and code contracts

A key part of the modular approach to code is the orchestration procedures. These are the procedures that sit on top of everything else: they call other stored procedures, manage control flow, pass variables, etc. They do not do much themselves, but they decide what will be done. The example procedure above is a simple orchestration procedure, but they can get significantly more complex. They might function as the APIs of your database, called by external applications to do something or return some values.

They are actually quite simple to unit test. All you do is map the possible paths the orchestration process could take, depending on the values passed in and the values retrieved during the process. Then you write a test for each of these paths, e.g. if parameter A is set to 32 and function X returns less than 10,000, we should execute stored procedures 3, 5, and 17 with specific parameters. Whether function X, or any of the stored procedures, performs as expected is not something to worry about when testing an orchestration procedure; that is taken care of by the unit tests around those code units.
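tSQLt supports this style of test directly with tSQLt.SpyProcedure, which replaces a dependency with a stub that just records how it was called. A minimal sketch, reusing the looping procedure from earlier; it assumes a test class testWarehouse has already been created with tSQLt.NewTestClass, and that the inner procedure has been split out as I suggested above:

```sql
CREATE PROCEDURE testWarehouse.[test RecordColdRoomTemperatures calls inner proc once per reading]
AS
BEGIN
    -- Replace the dependency with a spy that logs its calls instead of running
    EXEC tSQLt.SpyProcedure 'Warehouse.UpdateColdRoomTemperatureBySensorNumber';

    DECLARE @Readings Website.SensorDataList;
    INSERT @Readings (SensorDataListID, ColdRoomSensorNumber, RecordedWhen, Temperature)
    VALUES (1, 4, '2020-01-01T10:00:00', 3.5),
           (2, 7, '2020-01-01T10:00:30', 4.1);

    EXEC Website.RecordColdRoomTemperatures @SensorReadings = @Readings;

    -- The spy records each call in a <proc name>_SpyProcedureLog table
    DECLARE @CallCount int =
        (SELECT COUNT(*)
         FROM Warehouse.UpdateColdRoomTemperatureBySensorNumber_SpyProcedureLog);

    EXEC tSQLt.AssertEquals @Expected = 2, @Actual = @CallCount;
END;
```

The test only asserts that the orchestration looped correctly; whether the inner procedure upserts correctly is the job of its own tests.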

This idea can also be thought of as a contract between a particular unit of code and the rest of the database. Defined by a specification (hopefully well documented somewhere, but we can’t expect miracles) and enforced by the unit tests, this contract says that this unit of code will behave in this way. You can change the unit however you like, provided it doesn’t break this contract, and if at some point you find it does, then you will most likely find yourself faced with a long, tedious job of trawling through every other bit of code that calls it to make sure this new contract still fits what they expect from the code unit.

Summary

If you adopt unit testing, you may need to change your coding style to a more modular one in order to get the best from it. This does come with other benefits, however, like clearer separation of responsibilities in a database, easier code reuse, and better defined functional units of code. I think even without unit tests, more modular coding is the way to go when possible (and sometimes performance issues will make it impossible) for all of the reasons just mentioned. I also think that writing unit tests changes your perspective on the code you write; it helps you think about things like error handling, code contracts, and code reuse, and will make you a better coder.

T-SQL Tuesday #123: Life hacks to make your day easier – Custom shortcuts in SSMS/ADS

T-SQL Tuesday logo

This is my first time taking part in T-SQL Tuesday. It’s something I’ve known about for a while but not taken part in before. But one of my goals this year is to post at least once a week on this blog for 3 months, so I figure I should take every prompt I can. Plus this week’s topic is something I feel I can contribute to.

This week’s topic is about any hacks you have to make your day to day life easier, and I wanted to share a little trick someone showed me about 10 years ago, that I’ve found incredibly useful ever since.

The basic idea is that you can add your own query shortcuts to SSMS, and when you run them they append any highlighted text to the shortcut code. To explain, I'll use the example I've found most useful:

One of the first things I do whenever I start a new job or install a new version of SSMS is I set up ctrl+5 as the shortcut for “SELECT TOP 100 * FROM “

Once I’ve done that I can highlight any table or view in my code, and use ctrl+5 to see 100 rows from that object, because it appends the highlighted text to the end of my shortcut text. But you can do more with it. Take this bit of code:

SELECT
      A.Col1
    , B.Col2
    , C.Col3
FROM TableA AS A
INNER JOIN TableB AS B
    ON A.ID = B.ID
LEFT JOIN
    (
        SELECT
              X.Col5
            , Y.Col3
            , X.ID
        FROM TableX AS X
        LEFT JOIN TableY AS Y
            ON X.ID = Y.ID
            AND X.Col3 = 'Some value'
    ) AS C
    ON A.ID = C.ID

I can highlight TableA, TableB, TableX, or TableY, and in each case Ctrl+5 will show me all of the columns in the table and a sample of the data. Or I can highlight TableA AS A INNER JOIN TableB AS B ON A.ID = B.ID and get a sample of all the columns available from that. Or I can do something similar inside the subquery and see all the columns I have available to me in that subquery.

The main other shortcut I like to set is Ctrl+6 as “SELECT SUM(1) FROM “, to tell me how many rows are in a table.

If you want to set these up it’s very easy:

In SSMS go to Tools>Options>Keyboard>Query Shortcuts.

In Azure Data Studio it's a little more hidden. You might think you could do this in File>Preferences>Keyboard Shortcuts, but that only gives you a list of pre-canned commands that you can assign to different shortcuts as you see fit. To write your own, you need to go to Cog icon (bottom left)>Settings>Data>Query. You also need to be a bit careful, because some key combinations will already have other shortcuts assigned to them; go into File>Preferences>Keyboard Shortcuts to move those to other key combinations, otherwise Data Studio will try to do both things when you use your shortcut.

Anyway, that’s it, hopefully someone will find this useful.

Useful link for .NET training

I’m interested in learning some .NET coding. I’ve always been a pure SQL developer but more and more I feel my lack of .NET knowledge is holding me back. I’ve found a surprising lack of information online about how to get started with this, but I found one set of tutorials from Microsoft that talk through the basics of making various different applications.

I’m not sure how much use they would be to a complete beginner, but to someone who’s reasonably proficient with C# coding, and is aware of what a css file contributes to a website, but really isn’t sure how the different things fit together, I found them really useful.

https://docs.microsoft.com/en-us/visualstudio/get-started/csharp/tutorial-console?view=vs-2019

Interesting NULL issue

I had a question today from someone in my team wondering why their code wasn’t returning the correct results. They’d written:

SELECT
      *
FROM TableA;

SELECT
      *
FROM TableA
WHERE 1 = 1
    AND ID1 IN (SELECT ID1 FROM TableB);

SELECT
      *
FROM TableA
WHERE 1 = 1
    AND ID1 NOT IN (SELECT ID1 FROM TableB);

The first query returned about 600 rows, the second returned 300ish, the third returned nothing.

I spotted some NULLs in TableA.ID1, and added an OR ID1 IS NULL to the WHERE clause of the third query. This produced some results but not enough.

We looked at it for a little while longer, and eventually found some NULLs in TableB.ID1. These were causing the problem, because a NULL represents an unknown value. When the query tries to evaluate whether any particular value of TableA.ID1 is not in the list of TableB.ID1s from the subquery, it can't be sure, because one of the items in the list has an unknown value.

In summary, if you are doing a NOT IN, make sure you don't have a NULL in your list anywhere. If you do, the predicate can never evaluate to true (the comparison against the NULL evaluates to UNKNOWN), and if the predicate is in the WHERE clause your query won't return any rows.
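A sketch of the fix, using the example tables from above: either filter the NULLs out of the subquery explicitly, or rewrite with NOT EXISTS, which behaves the way you usually intend because the correlation predicate simply never matches a NULL.

```sql
-- Option 1: exclude NULLs from the IN-list explicitly
SELECT *
FROM TableA
WHERE ID1 NOT IN (SELECT ID1 FROM TableB WHERE ID1 IS NOT NULL);

-- Option 2: NOT EXISTS is not tripped up by NULLs in TableB.ID1,
-- since B.ID1 = A.ID1 is never true for a NULL B.ID1
SELECT *
FROM TableA AS A
WHERE NOT EXISTS (SELECT 1 FROM TableB AS B WHERE B.ID1 = A.ID1);
```

Note that rows where TableA.ID1 is itself NULL still won't be returned by option 1, so the OR ID1 IS NULL we added during investigation may still be wanted, depending on what you mean by "not in".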