Introducing DataBloat

So, what is DataBloat?

DataBloat is a term I use to describe the issues caused when a table takes up more space than necessary due to the use of inappropriate data types.

You don’t always notice it straight away…

I frequently run the “Disk usage by Top Tables” report on a database, as it’s useful to see where the space is going.
A table can become very large. That’s fine if it’s required, but not if it’s unexpected or unnecessary.

I recently spotted a table moving its way steadily up the “Top Tables” charts, much like a Black Lace single in the eighties – it’s there, only no-one quite knows why.

This one wasn’t a key strategic data store, nor was it an especially large table. Compared to its neighbours it had a low number of rows, but it was just BIG, and that didn’t seem right!

Upon investigation, there was trouble here. Not a great table structure, plenty of repeated data, but worst of all – inappropriate data types everywhere!
This isn’t a new problem – I’ve seen it before. It’s normally caused by inexperienced, ignorant or lazy developers.
A word on “lazy”: “lazy developers” and “lazy DBAs” are often quoted as a good thing – automation and efficiency. That’s “smart lazy”. No, no – I’m talking about “lazy” in the old-fashioned way: a general reluctance to do things properly.

The common trouble spots for DataBloat are the Integer, Numeric and DateTime data types.
Let’s work through a few examples:
Starting with Integer. Int takes up 4 bytes, regardless of the value stored. It’ll cover you for values up to 2,147,483,647, but do you really need it to? One example column was “Original Term of Loan” – I don’t work for a loan shark, so we’re hardly going to break into that range. TinyInt (1 byte, values up to 255) would probably have sufficed, but using SmallInt (up to 32,767) to play safe still reduces the column from 4 bytes to 2 – whoopee, 2 bytes! Now multiply that by the number of rows in the table – 80 million – and it starts to make a bit of a difference.
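The arithmetic above can be sketched as a few lines of Python. The type sizes and maximum values are SQL Server’s documented ones; the column name and row count come from the example in the text.

```python
# Storage sizes (bytes) and maximum values for SQL Server integer types.
INT_TYPES = {
    "TinyInt":  (1, 255),
    "SmallInt": (2, 32_767),
    "Int":      (4, 2_147_483_647),
    "BigInt":   (8, 9_223_372_036_854_775_807),
}

def savings_per_table(old_type: str, new_type: str, rows: int) -> int:
    """Bytes saved across the whole table by narrowing old_type to new_type."""
    old_size, _ = INT_TYPES[old_type]
    new_size, _ = INT_TYPES[new_type]
    return (old_size - new_size) * rows

# "Original Term of Loan": Int -> SmallInt over 80 million rows.
saved = savings_per_table("Int", "SmallInt", 80_000_000)
print(f"{saved / 1024**2:.0f} MB saved")  # 2 bytes/row * 80M rows ≈ 153 MB
```

Two bytes per row sounds trivial, but at 80 million rows it’s around 150 MB for a single column.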

Next up, Numeric(18,2) – normally blamed on Excel, since it’s fond of that type. Trimming it down to Numeric(9,2) still gives a healthy maximum value of 9,999,999.99, and takes 5 bytes as opposed to 9 bytes for Numeric(18,2) – a saving of 4 bytes per row, thank you very much. There were sixty of these columns in the table.
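SQL Server stores NUMERIC/DECIMAL in fixed tiers based on precision, which is why dropping from precision 18 to 9 saves 4 bytes. A small sketch of those documented tiers, applied to the sixty columns mentioned above (the 80-million row count is carried over from the earlier example):

```python
def numeric_storage_bytes(precision: int) -> int:
    """Storage for SQL Server NUMERIC/DECIMAL, per the documented precision tiers."""
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if precision <= 9:
        return 5
    if precision <= 19:
        return 9
    if precision <= 28:
        return 13
    return 17

rows, columns = 80_000_000, 60
per_row = (numeric_storage_bytes(18) - numeric_storage_bytes(9)) * columns
print(f"{per_row} bytes/row, ~{per_row * rows / 1024**3:.1f} GB across the table")
```

Sixty bloated Numeric columns at 4 wasted bytes each is 240 bytes per row – which dwarfs the Int saving.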

Dates aren’t immune to bloat either. Do you really need to use DateTime, or would SmallDateTime suffice (8 bytes and 4 bytes respectively)? DateTime covers a larger range, and is more accurate if you’re recording the time. If you only need the date element, you can use Date for just 3 bytes – luckily this was already used in the table, so no action was required here.
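The trade-off between the three types can be laid out side by side. Sizes and ranges below are SQL Server’s documented ones; the 80-million row count is reused from the running example to show what each extra byte costs:

```python
# Storage and range trade-offs for SQL Server date/time types.
DATE_TYPES = {
    "DateTime":      {"bytes": 8, "range": "1753 to 9999, ~3 ms accuracy"},
    "SmallDateTime": {"bytes": 4, "range": "1900 to 2079-06-06, 1 minute accuracy"},
    "Date":          {"bytes": 3, "range": "0001 to 9999, date only"},
}

rows = 80_000_000
for name, info in DATE_TYPES.items():
    extra = (info["bytes"] - DATE_TYPES["Date"]["bytes"]) * rows
    print(f"{name:14} {info['bytes']} bytes  (+{extra / 1024**2:.0f} MB vs Date)  {info['range']}")
```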

Also on my “no action required here” list was CHAR(). Using fixed column sizes has advantages under certain circumstances, but if the data rarely fills the column you may be taking up more space than you expected. Another favourite is the Unicode data types – NCHAR() and NVARCHAR(). They’re great for storing multilingual data, but what if you’re not storing multilingual data? Is there a chance you ever will, as a way of future-proofing your application? If not, think twice: Unicode values reserve 2 bytes for every character – doubling the amount of space used.
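The doubling is easy to quantify. A small sketch, using a hypothetical NCHAR(20) code column (the column and its length are illustrative, not from the actual table; the row count is the running 80-million example):

```python
def char_column_bytes(declared_length: int, unicode: bool) -> int:
    """Bytes reserved per row by a fixed-width character column:
    CHAR(n) stores n bytes, NCHAR(n) stores 2*n bytes (2 per character)."""
    return declared_length * (2 if unicode else 1)

rows, n = 80_000_000, 20  # hypothetical NCHAR(20) column holding plain ASCII codes
extra = (char_column_bytes(n, unicode=True) - char_column_bytes(n, unicode=False)) * rows
print(f"NCHAR({n}) vs CHAR({n}): {extra / 1024**2:.0f} MB extra over {rows:,} rows")
```

For data that will only ever be ASCII, that extra byte per character buys you nothing.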

But why the fuss – disk space is cheap, isn’t it? Every wasted byte will cost you: storage space on disk for the base table, and storage space for the indexes – use an Int where a SmallInt would do and that wastage is replicated all the way through every index that includes the column.

Making your tables more efficient will make them more agile (with a lower case “a”). You’ll fit more rows on each page. SQL Server is essentially a large cache – read some data from a database and it’s brought from disk into memory. It’ll stay there until memory is full and other data is required, at which point it’s thrown out again (until you query it again!). Reduce the space required in the table and you reduce the amount of space required in memory, so more of your data stays cached. You’ll experience better performance and everyone will be happy! Backups will take less time, as will database maintenance – indexing and statistics updates have less work to do.

In order to identify the scale of the issue, I built a small Excel sheet to paste in the table structure and filter on specific data types. I did my research, identified what could be safely changed, put the savings into the spreadsheet, calculated bytes per row and extrapolated up to get the size of my smaller table.
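The spreadsheet logic is simple enough to sketch in code: per-column savings, summed to bytes per row, extrapolated over the row count. The entries below are illustrative, drawn from the examples earlier in the post, not the actual table structure:

```python
# Sketch of the spreadsheet calculation. Each entry:
# (column description, old bytes, new bytes, how many such columns).
changes = [
    ("Original Term of Loan", 4, 2, 1),    # Int -> SmallInt
    ("Numeric(18,2) amounts", 9, 5, 60),   # Numeric(18,2) -> Numeric(9,2)
]
rows = 80_000_000

saved_per_row = sum((old - new) * count for _, old, new, count in changes)
print(f"{saved_per_row} bytes/row saved, ~{saved_per_row * rows / 1024**3:.1f} GB overall")
```

Even with just these two fixes, the projected saving runs into the tens of gigabytes – before counting the knock-on savings in the indexes.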

I then set about testing a change to revise the structure of the table to make it more efficient.
You must ensure that any dependencies are covered – especially SSIS packages that will need a refresh, and any other downstream code too.

Test, test and live-prove.

Follow on actions:

  • Educate the users
  • Stand guard – be vigilant about what your developers want to take into Production, and Just Say No!
