Granularity in queries

April 13, 2023

Today’s post might be a bit of a rant. I’d like to talk about granularity in queries for a minute, and what I think are better or worse ways to lay your code out when you’re dealing with tables of different granularity. To start off, I’ve got an example of the type of query structure that really bugs me, taken from the WideWorldImporters database that Microsoft use as their sample database.

SELECT
      Ord.OrderID
    , Ord.OrderDate
    , Ppl_PickedBy.LogonName AS LogonOfOrderPicker
    , Ord.DeliveryInstructions
    , Ord.InternalComments
    , SUM(OrdLn.Quantity*OrdLn.UnitPrice) AS OrderCost
FROM Sales.Orders AS Ord
INNER JOIN Sales.OrderLines AS OrdLn
    ON Ord.OrderID = OrdLn.OrderID
INNER JOIN [Application].People AS Ppl_PickedBy
    ON Ord.PickedByPersonID = Ppl_PickedBy.PersonID
GROUP BY
      Ord.OrderID
    , Ord.OrderDate
    , Ppl_PickedBy.LogonName
    , Ord.DeliveryInstructions
    , Ord.InternalComments

This query is pretty simple, we’re wanting to return a set of data about the orders placed including the total cost of each order. However the orders in the database are split into an Orders table that holds the order level information, and an OrderLines table that holds information on each product ordered in each order, and we need that OrderLines table to get the cost.

So we write our query, and we join to the OrderLines table, and we sum the cost. But then, because we have an aggregate, we need to group by every other column we’re returning. And to me that just doesn’t look right, and it doesn’t convey the intention of the query properly.

What I mean with that intention remark, is we are trying to return a data set where each row represents one order, so our data set should exist at the OrderID grain. But looking at the GROUP BY doesn’t tell us that. Instead it says we are returning something with a composite grain that includes columns like DeliveryInstructions and InternalComments, but those are just attributes we want to hang off the OrderID.

A better approach, in my opinion, is to categorise any tables you’re joining to as either the primary table (or tables in some cases) that holds data at the query grain level, lookup tables that return attributes and measures that exist at a higher grain, and detail tables that return measures you need to sum up to the correct grain.

Your primary table(s) should ideally be first in the FROM clause, and any lookup tables can then be joined to them without an issue. The detail tables should be moved out to a CTE (or sub-query, or view, or something else) to sum them up to the granularity you want for the overall query. For example, I think this is a significant improvement on the above query:

WITH cte_OrderLines AS
    ( /*Sales.OrderLine data summed to OrderID granularity*/
        SELECT
              OrdLn.OrderID
            , SUM(OrdLn.Quantity*OrdLn.UnitPrice) AS OrderCost
        FROM Sales.OrderLines AS OrdLn
    )
SELECT
      Ord.OrderID
    , Ord.OrderDate
    , Ppl_PickedBy.LogonName AS LogonOfOrderPicker
    , Ord.DeliveryInstructions
    , Ord.InternalComments
    , OrdLn.OrderCost
FROM Sales.Orders AS Ord
INNER JOIN cte_OrderLines AS OrdLn
    ON Ord.OrderID = OrdLn.OrderID
INNER JOIN [Application].People AS Ppl_PickedBy
    ON Ord.PickedByPersonID = Ppl_PickedBy.PersonID

We now don’t have a GROUP BY clause in the main query at all.

It’s also easier for adding other data that’s stored at a lower granularity than OrderID. For instance, if we add the Sales.Invoices table because we want to get the total amount invoiced for this order, this joins via OrderID so there could potentially be multiple invoices for the same order. If we add this to the original query there is clearly a high risk of duplication, but if we add it as another CTE at the start of the second query then that risk is mitigated in a more graceful manner, and the overall query is easier to read (at least in my opinion).

This might not seem hugely significant when looking at the examples I’ve given here, but I’ve seen some extremely bloated queries that tried to go down the first approach and ended up with long lists of columns and calculations in the GROUP BY clause that really had nothing to do with the query granularity. I’d like to not have to deal with those types of query again, but I suspect I will end up digging through many more of them in my career.

From → T-SQL

3 Comments

Matthew Swindle permalink

At the risk of instigating a riot, I also favor SELECT-embedded subqueries for joins that are only utilized by a single column.

SELECT
Ord.OrderID
, Ord.OrderDate
, ( SELECT LogonName
FROM [Application].People
WHERE PersonID = Ord.PickedByPersonID
) AS LogonOfOrderPicker
, Ord.DeliveryInstructions
, Ord.InternalComments
, ( SELECT SUM(Quantity*UnitPrice)
FROM Sales.OrderLines
WHERE OrderID = Ord.OrderID
) AS OrderCost
FROM Sales.Orders AS Ord

Reply
- chrisjohnson0120 permalink
  
  If the rest of your team is ok with it then that’s fine, these things are always personal preference and the query engine will most likely come up with the same plan regardless. The key in all of these things is to find something that nobody on the team has a violent objection to.
  
  Although personally I find that quite clunky and unintuitive to read, and would argue against it if we were discussing standards for a team I was in. I also don’t like the extra work involved if you needed to add a second column from the People table, you’d either be adding a whole second subquery or re-writing a chunk of the query. If you’ve added People to the FROM clause then you just need to add another column to your SELECT.
  It’s the same reason I always use 2 part names for column references, even if I’m only selecting from one table. If I’ve not done that and I end up adding a second table I’ll inevitably forget to add the table part to some of the column references, and the next person to look at the query won’t be able to tell by looking which table that column belongs to. I much prefer to assume the code will be expanded in the future and code to anticipate that.
  
  Reply

Trackbacks & Pingbacks

A Thought on Query Granularity – Curated SQL

Granularity in queries

Trackbacks & Pingbacks

Leave a comment Cancel reply

Recent Posts

Archives

Categories

Meta

Subscribe to Blog via Email

Granularity in queries

Share this:

Related

Trackbacks & Pingbacks

Leave a comment Cancel reply

Recent Posts

Archives

Categories

Meta

Subscribe to Blog via Email