NULLs represent the Unknown state

NULLs have been the bane of Database Engineers since they were first introduced by Ted Codd in the 70’s and are still a widely debated topic.

For a long time, I was in the “NULLs should be avoided at all costs” camp and went out of my way to remove them. Now, I’m in the “NULLS should nearly be avoided at all costs” camp.

So what changed?

I listened to a talk by C.J. Date called “The Problem of Missing Information” ¹ In this talk, he says that NULLs represent something that is unknown. This is so basic, and yet, I overlooked and subsequently overloaded the meaning of NULL for many years.

People incorrectly use NULL to mean all sorts of things.

NULL is sometimes used by application logic to indicate:

NOT - ex: a NULL in a column named verified_at might mean “not verified”
DEFAULT VALUE - ex: a NULL in a boolean might mean “false”

One problem with this is that it’s not explicit. A person needs to read application code to understand the meaning of the NULL or be apprised of some convention being used. The meaning can easily get lost as the project gets refactored over time, and this logic needs to get baked into any downstream application like a reporting system or ad hoc analysis. All too often, these subtleties are missed resulting in incorrect and inconsistent data being reported.

Instead of using NULL to have a meaning besides “unknown”, have another column that explicitly calls out the state. For the verified example above, add a specific column called is_verified with a default value of false and don’t allow NULLs. Set it to true when it’s verified. If it matters when the row was verified, it’s ok to also have the verified_at column with an initial value of NULL because it’s not yet known when the row is verified. Once it’s known, set it.

This makes it so there’s no ambiguity about whether or not something has been verified.

For booleans, allowing NULLs gets into the problem between 2 value logic (2VL) and 3 value logic (3VL). With 2VL, there are 2 possible results: true and false. With the addition of NULLs, there’s now a third option that means “unknown”.

To understand the problem, we must understand a few concepts first.

2VL (2 value logic)

With 2 value logic, these are the connectives (i.e. logical operators)

Truth Tables: 

NOT |     AND | t f   OR | t f  IF | t f  IFF | t f
-------   ---------   --------  --------  --------- 
  t | f    t  | t f    t | t t   t | t f    t | t f
  f | t    f  | f f    f | t f   f | t t    f | f t

To understand the table, read the top operand and the left operand, and find the intersection for the result. For example, in an “AND”, if the first operand is true and the second is true, return true. If either are false, return false. IFF stands for “If and only if” and is much more rare in practice. It refers to the case when the results of both are true or both are false, then return true.

3VL may introduce incorrect results

In 3VL, the truth table gets a lot more complicated and unintuitive.

Looking at this from the point of view of SQL, let’s take a simple example of an OR using the is_verified column mentioned above. If we allowed NULLs on this column, we might need to write a query like this:

SELECT * FROM table WHERE is_verified = 'f' OR is_verified IS NULL;

In this case, our application is assuming a NULL value is false. Since NULL really means unknown, we don’t know if it’s false or not. What if it’s true? Then we’d return wrong results. If we’ve written our application to mean that a NULL always means false, then there is no confusion if we simply set it to false and not allow the NULL in the first place.

The Problem of Missing Information ↩