Stuff I’ve learned from my degree part 1
I want to write about some of the things I’ve learned as part of my Open University degree so far.
The degree is in Computing & IT and Statistics, and I started it for a couple of reasons.
Firstly, I had reached a point in my career where all of my peers seemed to have a much better understanding of computing in general than I did, and I needed a way to catch up. Secondly, I had noticed that very few people in the BI industry have any statistical knowledge, and I thought that would be a good way to stand out. In both cases I think it's been beneficial, but here I want to write about some of the computing fundamentals and how they helped me.
Before the course, I knew that a bit was a unit of memory, that byte, megabyte, gigabyte and so on were also units of memory, and that computers "think" in 1s and 0s, but I never understood what that actually meant, or the implications for data storage and data types, until I did this course.
What I learned was that a bit is a single 1 or 0, and a byte is a collection of 8 of these bits. Why 8? Because 8 bits give you 2^8 = 256 distinct combinations, which is enough to represent the most commonly used letters, numbers and symbols in English. In ASCII, for example, 01000001 = A, 01000010 = B, 01000011 = C and so on.
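Here's a quick Python sketch of that mapping (the exact codes come from ASCII, where A happens to be pattern number 65 rather than 0):

```python
# A byte is 8 bits, so it has 2**8 = 256 distinct patterns.
print(2 ** 8)  # 256

# ASCII assigns one of these patterns to each character.
for ch in "ABC":
    print(ch, ord(ch), format(ord(ch), "08b"))
# A 65 01000001
# B 66 01000010
# C 67 01000011
```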
At this point a lightbulb went on in my head: a single character in a Char or Varchar field is represented by one of these bytes. From there, the differences between Char and Varchar became clear in terms of data storage.
A Char field is padded to its declared length with a specific padding character (in practice this is usually the space character, 00100000 in ASCII), which is why a Char field always takes up the same amount of space on every row. The computer reads the long string of 1s and 0s that represents the whole field, slicing it at every 8th bit and translating each slice into a character, and when it reaches the allocated number of bytes for the column it knows that the bits after that point belong to the next column.
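To make that concrete, here's a toy Python sketch of reading two made-up fixed-width columns out of a byte stream (a big simplification of what a real storage engine does, but the slicing idea is the same):

```python
# A made-up row: CHAR(10) name followed by CHAR(10) city,
# each padded to exactly 10 bytes with spaces.
row = "Bob       London    ".encode("ascii")

# The engine knows each column is exactly 10 bytes, so it can
# slice the stream at fixed offsets - no delimiter needed.
name = row[0:10].decode("ascii").rstrip()
city = row[10:20].decode("ascii").rstrip()
print(name, city)  # Bob London
```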
For a Varchar field, there is no fixed length, so the computer needs another way to know where the column ends. One approach is a specific combination of 1s and 0s that marks the end of the value; in practice, databases like SQL Server instead store a small length prefix recording how many bytes the value occupies. Either way, the computer knows that only the bits belonging to the value represent this column.
From this, it follows that a Char field that is too long compared to the data it holds will take up excess space in the database because of all the padding bytes. It also follows that any Varchar field will take up slightly more space than its data, because of the overhead of marking where the value ends (in SQL Server's case, 2 bytes per value for the length prefix).
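Here's the same idea sketched in Python, using a 2-byte length prefix like SQL Server's (the encode_varchar helper is just mine, for illustration):

```python
import struct

# Sketch of variable-length storage: a 2-byte length prefix
# followed by the data itself, so there is no padding but
# every value carries a small fixed overhead.
def encode_varchar(value: str) -> bytes:
    data = value.encode("ascii")
    return struct.pack("<H", len(data)) + data

encoded = encode_varchar("Bob")
print(len(encoded))  # 5: 3 bytes of data + 2 bytes of overhead

# Reading it back: the prefix tells us where the column ends.
length = struct.unpack("<H", encoded[:2])[0]
print(encoded[2:2 + length].decode("ascii"))  # Bob
```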
Finally, the NChar and NVarchar fields also made sense. These are used to represent characters from many languages, so they need far more combinations of 1s and 0s than a single byte can provide. Instead, each character is represented with 2 bytes, which gives 256 × 256 = 65,536 different combinations that can be mapped to 65,536 characters (this is essentially the two-byte Unicode encoding that SQL Server uses). Apart from that, they work exactly the same as the Char and Varchar fields; they just take up double the space.
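You can see the doubling directly in Python by encoding the same text two ways (UTF-16-LE here standing in for the two-bytes-per-character scheme):

```python
text = "ABC"
print(len(text.encode("ascii")))      # 3 bytes: one byte per character
print(len(text.encode("utf-16-le")))  # 6 bytes: two bytes per character

# The second byte buys a far bigger range of characters, e.g. Greek sigma,
# whose code point (931) doesn't fit in a single byte.
print("Σ".encode("utf-16-le"))  # b'\xa3\x03'
```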
None of this was exactly new knowledge: I knew that NChar took up double the space of Char, and that you shouldn't use Char when the values are likely to be variable in length because it wastes storage. But understanding why made a huge difference to how I thought about assigning data types.