Skip to content

T-SQL Tuesday #131: Data Analogies, or: Explain Databases Like I’m Five!

October 13, 2020
This image has an empty alt attribute; its file name is tsqltues.png

This is my fourth time taking part in T-SQL Tuesday. Today the topic is all about your favourite analogies when it comes to explaining SQL Server topics to people.

I really like this topic, mainly because over the last three years I’ve found myself increasingly acting as a teacher to the less experienced T-SQL coders in the office, and anything that makes me think about how to do that better is good.

There are a few analogies I fall back on when trying to explain certain concepts, and I found myself using the old “clustered index is like a library index” one a couple of days ago when I was explaining why the order of columns in a clustered index did matter to the performance of a query. This is something that I’ve struggled to get across earlier on in my career, and having a tried and tested analogy to fall back on this time meant I was more successful this time around.

Analogy of choice: database engine as an office

The analogy I’m choosing here is something I use in my head quite a bit to help me understand certain concepts:

Imagine an office, with a set of filing cabinets filled with paper. The office workers regularly have to get information from these files, add new information, update existing information, or even get rid of some out of date stuff. None of the information is available anywhere but in the files, not even in people’s memories. This is the SQL Server query engine.

The reason I like this is twofold. First, it’s familiar. Even if you’ve never worked in an office that uses physical files you’ve seen them on TV and can understand the concept really easily. Second, it’s multi-purpose. This analogy can be used to cover quite a range of topics if necessary, and having an understanding of all of them stemming from the one analogy makes it easier to see how they fit together.

One obvious use is in explaining indexing, the order of the files is a clustered index, if files are completely unordered that maps to a heap, and extra index cards can represent non-clustered indexes.

You can also use it to look at row by row versus set based operations. In this example, a set based operation means the worker walks to the filing cabinet, gathers all the files they need, takes them back to their desk, and updates them all there. In contrast, a row by row operation means the worker walking to the filing cabinet, grabbing one of the pages they need, taking it back to their desk, updating one row there, returning the page to the filing cabinet, returning to their desk, and repeating the process again for the next row (even if it is on the same page as the previous). This makes the efficiencies of set based over ReBAR (row by agonising row) operations really obvious.

A slightly more obscure use is talking about memory grants. As part of the query execution process, SQL Server calculates how much memory the query will need to store temporary results along the way. For instance, if you are sorting the data you need to store the partial results of the sort while it’s completing. It bases this on three main factors, what actions will need to store data in memory (from the execution plan), how many rows each of these will need (from the table statistics) and how big each of those rows can be (from the column definitions, taking all variable column sizes as the maximum). This is represented in our office by desk space. The office has a number of desks, that together represent the total available memory in the system. When a worker starts work on a task, they reserve an amount of space on these desks based on how much space they think they will need. If their estimates are too high, they will take up space other workers need and hurt the overall efficiency of the office as a result. If their estimates are too low, however, they are not allowed to adjust them on the fly. Instead they have to use one of the spare filing cabinets over in the corner, which represent tempdb. This means they are now taking data from one filing cabinet and moving it to another, before retrieving it from that filing cabinet again moments later You can see how this creates an inefficiency in their work, and is something you want to avoid where you can.

So, there it is. This analogy helps me understand a lot of the inner workings of SQL Server better, from obvious mappings like clustered indexes, to the slightly less obvious like memory grants or locking and blocking. Hopefully some of you reading this will find it a useful analogy as well.

From → Uncategorized

Leave a Comment

Leave a comment