In one of the sessions Kimberly and I taught this week at SQL Connections, we discussed how to choose efficient data-types – I’d like to share the discussion here with an example.


The crux of the matter is the amount of space required to store each data-type in the schema, and ensuring that the data-type choice matches the data being stored without incurring any unnecessary overhead. Here are a few examples:



  • A field that stores a person’s age will usually be able to use a TINYINT type that can store a maximum value of 255 – unless it’s an archeological or historical database, for instance. Without putting a lot of thought into it, however, someone may choose to use an INTEGER type. What’s the difference? A TINYINT takes a single byte to store, whereas an INTEGER takes *4* bytes to store – wasting 3 bytes per record.
  • A field that stores a person’s city of residence in the USA needs to be able to hold 24 characters (see my previous post on how long fields have to be) – so what data-type should you use? A CHAR (24) will be able to store all the possible values, but will *always* take up 24 bytes in the record as it’s a fixed-length column. A VARCHAR (24), on the other hand, will only store the number of bytes equal to the number of characters in the city name, so using the fixed-length type will waste a varying number of bytes per record.
  • In the USA, the names of the various States are commonly abbreviated to two characters – for instance, Washington = WA and California = CA. So is the best type to use CHAR (2) or VARCHAR (2)? Given that the abbreviations are always 2 characters, they’ll both store 2 bytes all the time. BUT, a variable-length column has a two-byte overhead (the two-byte pointer for the column that’s stored in the variable-length column offset array in the record) – so in this case the best data-type to use is CHAR (2).
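These size differences are easy to see for yourself. Here’s a quick sketch using the DATALENGTH function, which returns the number of bytes used to store an expression (the values 30 and ‘Seattle’ are just arbitrary examples):

```sql
-- Compare the storage sizes of various data-type choices.
SELECT
    DATALENGTH (CAST (30 AS TINYINT))             AS [TinyIntBytes],  -- 1 byte
    DATALENGTH (CAST (30 AS INTEGER))             AS [IntBytes],      -- 4 bytes
    DATALENGTH (CAST ('Seattle' AS CHAR (24)))    AS [CharBytes],     -- always 24 bytes
    DATALENGTH (CAST ('Seattle' AS VARCHAR (24))) AS [VarcharBytes];  -- 7 bytes
GO
```

Note that DATALENGTH reports the data bytes only – it doesn’t include the two-byte offset-array overhead that a variable-length column carries in the record.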

You can see how choosing data-types without considering whether it’s the best type can lead to wasted space. Let’s look at an example schema to support a population census of the USA.



CREATE TABLE CensusSchema1 (
    SSN CHAR (256),
    StateName CHAR (256),
    Sex INTEGER,
    Employed INTEGER,
    SelfEmployed INTEGER,
    EthnicOrigin INTEGER,
    MaritalStatus INTEGER,
    NumberOfDependents INTEGER,
    Age INTEGER,
    CountryOfBirth INTEGER,
    HouseholdIncomeGroup INTEGER,
    ZipCode5 INTEGER);
GO


At first glance this may look reasonable, but digging in you’ll see that many of the columns are over-sized. Here’s a cleaned-up schema to compare against, with notes on the per-column savings:



CREATE TABLE CensusSchema2 (
    SSN CHAR (9),                  -- saving 247 bytes
    StateName VARCHAR (256),       -- saving at least 240 bytes (longest state name is 14 characters + the 2-byte varchar overhead)
    Sex BIT,
    Employed BIT,
    SelfEmployed BIT,              -- saving 11 bytes altogether over these three fields
    EthnicOrigin TINYINT,          -- saving 3 bytes
    MaritalStatus TINYINT,         -- saving 3 bytes
    NumberOfDependents TINYINT,    -- saving 3 bytes
    Age TINYINT,                   -- saving 3 bytes
    CountryOfBirth TINYINT,        -- saving 3 bytes
    HouseholdIncomeGroup TINYINT,  -- saving 3 bytes
    ZipCode5 INTEGER);             -- no saving
GO
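As a sanity check, once some rows have been loaded you can see the actual average record size using the sys.dm_db_index_physical_stats DMV – a sketch (DETAILED mode is needed to populate the record-size columns):

```sql
-- Check the real average record size for the cleaned-up schema.
SELECT avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats (
    DB_ID (), OBJECT_ID (N'CensusSchema2'), NULL, NULL, 'DETAILED');
GO
```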


The bad schema gives a per-row size of 574 bytes and the cleaned-up schema is no more than 48 bytes per row. I designed these two schemas to support a US census. The population of the USA is approx. 300 million. This means the bad schema would take around 170GB to store all the info, and the cleaned-up schema only takes around 15GB – nearly 12 times more efficient! Now we’re starting to see how poor data-type choice can lead to poor performance.
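The back-of-the-envelope arithmetic, taking the two per-row sizes at face value, looks like this:

```sql
-- Total storage for approx. 300 million rows at each per-row size.
SELECT
    (574.0 * 300000000) / 1000000000 AS [Schema1GB],  -- ~172 GB
    (48.0  * 300000000) / 1000000000 AS [Schema2GB];  -- ~14 GB
GO
```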


Wider rows mean:

  • Fewer rows can fit on an 8KB page.
  • More CPU is necessary to crack open a record, due to the number of CPU data cache line invalidations necessary to read the record into the CPU’s various caches (every time a cache line is invalidated it takes a bunch of CPU cycles – see here for an explanation of CPUs and cache lines).

Fewer rows per page means:

  • More pages are needed to store the data.
  • Indexes could have a smaller fan-out (if the index keys are wider than they need to be), leading to more levels in the index and less efficient index navigation from the index root page to the index leaf level.

More pages means:

  • More I/Os are necessary to read the same amount of actual data.
  • More buffer pool memory is necessary to hold the same amount of actual data.
  • More disk space is necessary to hold the same amount of actual data.

Clearly there’s a link between various aspects of workload performance and badly chosen data-types.