Another month has passed; time for another T-SQL Tuesday post. The monthly blog party is hosted by Brent Ozar (b|t) this time. And he wants us to write about our favourite (or least favourite) data type.
I don’t really have a data type I like. I also don’t hate data types. The thing I do hate is when the wrong data type is used for the purpose.
So instead of doing what Brent asks, I will write a short post about a very common misconception about a data type that triggers me whenever I read it. Which happens a lot.
Precise versus approximate
One of the most widespread falsehoods about data types is that float (or its synonym, real) should never be used because it is imprecise. And when people say that, they often come with examples where they show how computations involving float data yield incorrect results. For instance, multiplying a price by a quantity, then subtracting the amount paid, and then the remaining balance is one cent off. That demo is then used as “proof” that float is a bad data type that should always be avoided.
What’s weird is that if I run that same query with data type int, I probably also get the wrong results. Yet nobody claims int is a terrible data type that should not be used. When I use data type datetime, the same repro will likely result in an error. When I use data type varchar and use it to compute the sum of 123 and 456, the result will be 123456 and nobody will take that as proof that varchar sucks.
If I would seriously make the claims above, people would quickly respond that it’s my fault, that I’m trying to use a data type in the wrong situation. So here’s my question. Why does nobody realize that the example of multiplying quantity by price and getting a rounding error is also not “proof” that float suck – if anything, it merely proves that float is not the right choice for computing prices.
Back to high school
To explain why float is not a bad data type at all, I’ll need to take you back to your high school days. To science class.
Remember the practicum part of your science class? Remember how you were taught how to use instruments that measure size, weight, distance, speed, power, of whatever else you had to measure? And most importantly, remember how you were taught to assess how precise the instrument and your reading were? When you measured a length, you knew the centimetres (sorry Americans), but how sure were you about the number of millimetres you wrote down? Would you dare to write down fractional millimetres? In those cases, you said that you had significant digits to the centimetre, did write down the best estimation of the millimetres for good measure, but more digits were considered wrong. And of course, when you were allowed to use a high precision measuring device, you had more significant digits; but even those instruments do not have infinite precision.
So let’s say you measured an object, and you wrote down the length as 512.5 centimetres, the width as 371.3 centimetres, and the depth as 110.7 centimetres; all of these significant up to the centimetre, so they all had 3 significant digits. To computed the cubage, you had to multiply the numbers in metres: 5.125 * 3.713 * 1.107 = 21.065241375 m3. If you gave that as the answer on your test, you’d get some points deducted, because you gave way too many digits. Your measurements were trustworthy up to three significant digits, and that means that the answer needs to be given in three significant digits (this is a simplification; see here for the “beyond high school level” version), so the correct answer was 21.1 m3.
Now let’s assume your teacher asked you to compute the energy stored in a single plutonium atom, using Einstein’s famous formula: E = mc2. Of course, this formula expects standard units, so we have to express the mass of that single atom in grammes. If I did my math right and didn’t miscount the zeros, that number is 0.0000000000000000000004052. A long number, but there are only four significant digits. Scientists prefer to write this in a different way, as 4.052 * 10-22 gram.
Similarly, c (the speed of light) needs to be expressed in metres per second. Let’s say we have measured / computed this speed with four significant digits as well. That would then be 299900000 m/s. With this way of writing, it is not even immediately obvious whether there are four significant digits, of five and the fifth just happens to be a zero. So in this case there’s even more reason to write it as 2.999 * 108.
Return to the database
Let’s return to the database. Let’s figure out a way to store these numbers appropriately.
Could we use decimal (or its synonym numeric)? Well, yes. We can. We need 25 digits after the decimal place for 0.0000000000000000000004052, and 9 digits before the decimal place for 299900000, so that would fit in a decimal(34,25). But if you try to compute c2 so you can then multiply that to the m, you’ll run into an overflow error.
Okay. You can use different definitions for each. Play around with the length and precision of the various variables and you’ll find some combinations that don’t return an error. But now you’ll notice different results. I got 0.0000364437000000000000000 as the answer for one of my tests. On another test, the answer was 0.0000360000000000000000000. Still correct, but only up to two significant digits. We had four in our input! For sure, the professor will dock points for this!
And what if you are designing a general use database for a science department, where measurements will be stored and computations run for all departments? What datatype should you use for a (measured or computed) weight, when it’s used by nuclear physicists to store the mass of a single plutonium atom, but also by astrologist to store the mass of our Sun?
And this is where float makes its big entrance! With float data types, the mass of that plutonium atom is stored as 4.052E-22 (which is the standard “computer” representation of 4.052 * 10-22). The speed of light is stored as 2.999E+08. The square of that is then computed and stored as 8.994001E+16, and multiplying this with the mass of a plutonium atom results in 3.64437E-05, or as scientists would write: 3.64437 * 10-5. Is that last digit really correct? Are we sure there’s no rounding error in the computation? I don’t know. And I don’t care. Remember, we only had four significant digits in our measurements anyway, so that means only four, or at most five, digits should be retained: 3.644 * 10-5. Whether or not that seven was correct makes no difference at all!
There are no “bad” data types in SQL Server. Data types can be inappropriate for the job. For working with money, which is by definition precise numeric data, data types such as int, varchar, and float won’t work. But decimal and money are ideal for this job!
But when dealing with scientific data, often the results of measurements and computations based on measurements, we are handling data that by its very definition is an approximation. So this data is best handled by a data type designed for approximate-number data, and that would be float.
[…] Hugo Kornelis levies a defense of floating point data types: […]
[…] Float Does Not Suck by Hugo Kornelis – “To explain why float is not a bad data type at all, I’ll need to take you back to your high school days. To science class.” […]
I once looked after a ERP SQL Server db which had been set up with floats as primary keys – amazingly enough it even survived moving from 32 to 64 bit CPUs without throwing a wobbly
Uhhhmm … yikes, I guess?
Glad it worked for you. But believe me, this is really one of those “just because it works …” cases. Floating point data is even worse as a primary key than guids. In fact, I’d struggle to come up with a worse choice. (Perhaps sql_variant…)