SQL NULLs aren't weird, they're just based off of Kleene's TRUE-FALSE-UNKNOWN logic!
If you want you can read NULL as UNKNOWN and suddenly a whole bunch of operations involving them become a lot more intuitive:
1. TRUE OR UNKNOWN = TRUE, because you know you have at least one TRUE already.
2. TRUE AND UNKNOWN = UNKNOWN, because you don't know whether you have two TRUEs or not. It's just out there.
3. UNKNOWN XOR UNKNOWN = UNKNOWN, because it could darn near be anything: TRUE XOR TRUE, TRUE XOR FALSE, FALSE XOR FALSE, FALSE XOR TRUE... Internalizing this is where SQL's use of NULL / UNKNOWN really becomes intuitive.
4. (TRUE AND FALSE) XOR (TRUE OR UNKNOWN) = (FALSE) XOR (TRUE) per #1 = TRUE. See, it's consistent, you just need to keep in mind that if you have a lot of known UNKNOWNs they're quite parasitic and your final outcome is likely to be, itself, an UNKNOWN. Just like in real life!
That doesn't address anything in the second half of the post though, starting with this pull quote:
> The fact that NULLs are distinct for UNIQUE columns but are indistinct for SELECT DISTINCT and UNION continues to be puzzling. It seems that NULLs should be either distinct everywhere or nowhere. And the SQL standards documents suggest that NULLs should be distinct everywhere. Yet as of this writing, no SQL engine tested treats NULLs as distinct in a SELECT DISTINCT statement or in a UNION.
That's because "different" and "distinct" don't mean the same thing.
Two unknown values are assumed to be different, but they are not distinct from each other.
For example, take two boxes, in each box is a die, the value of the box is the value shown on the die inside. You don't know the value since you don't see the die, it may even change as you manipulate the box, so it is unknown, NULL in SQL. Because of that, you assume the two boxes have different values. They are, however, indistinguable, so, not distinct. All you need to know is that you hold two boxes with dices in it, which one you hold doesn't matter, and that's what "SELECT DISTINCT" tells you: that you have two boxes and it doesn't matter which is which, even though they have different values.
>That's because "different" and "distinct" don't mean the same thing.
The literal definition distinct is:
>recognizably different in nature from something else of a similar type.
If you want to get down to it nothing is "equal" or the same.
Is a temperature measurement 25C the same as another of 25C? No these measurements are an approximation of the actual values which are actually not equal to each other they are distinct they have just been lumped into the same 25C group due to the resolution of measurement yet equality works just fine on that value in sql.
I have used SQL for a long time null handling is weird and inconsistent and a waste of time. For all the language bugs due to the existence of null at least I can count on null=null and not write garbage like value=param or (param is null and value is null)
> The literal definition distinct is
Irrelevant. What matters is the meaning in the context of SQL.
> weird and inconsistent and a waste of time. For all the language bugs due to the existence of null
There are necessary, semantic cases that need to be dealt with. How else would you do it?
Also, it's really weird to use "bugs" to refer to well defined and well documented behavior.
I wanted to briefly reinforce this point with the fact that SQL has multiple equality operators - there is both `=` and `IS NOT DISTINCT FROM`. The later operator will treat null values as equal in alignment with the `DISTINCT` and `DISTINCT ON` operators.
It is extremely easy using partial uniques and the proper operators to treat nulls as non-distinct values and the approach we have allows some very important functionality that aligns with standard statistical data treatment which was influential to how SQL logic was originally designed.
> Is a temperature measurement 25C the same as another of 25C?
Yes, the measurements are the same.
The actual temperatures probably are not, but measurements are not the same as the thing measured.
>Yes, the measurements are the same.
By the logic two unknown (null) measurements are the same regardless of the actual value which I agree with.
An unknown measurement isn't a measurement value its a statement of (lack of) knowledge about a measurement, that doesn't tell you what the measurement is. Knowledge about a measurement is as different from the measurement as the measurement itself is from the thing measured.
Whether two unknown measurements are the same is unknown.
Whether two measurements of 25C are the same is unknown, these are just values recorded in a database. 25 is a value, null is a value.
The values in the db are the same in both cases which is what I would like my db language to deal with and not make assumptions about what that value actually means.
I see no value in treating null special when in comes to equality in a sql db, in fact it is a hinderance that it does so in my experience.
The SQL null is a database-specific keyword and not something that's part of the domain of your measurements. If you want some kind of sentinel value for your measurements that doesn't have the "unknown" meaning of SQL null, then you should use your own sentinel value and not reuse null for that purpose
Sentinel values suck especially when the language already has a perfectly good one built in.
Is 0 for a temp measurement unknown sentinel or an actual measurement, how about 2,147,483,647 great probably not a measurement now its always included in greater than queries same with max negative with less than.
Null separates the value into its own distinct group and prevents it from being including in range queries due to it not being an actual numeric value while most languages still allow you to compare equality using standard operators.
Sum types would be great in sql but currently we get a union of the sql type and null, so null for sentinel values it is except for the having to using weird syntax to compare it.
Null is not your value that the database is making assumptions about, it's the database's value that you are making assumptions about.
A real sum type would be nice, but when you're using null then you need to accept that null was not designed with your specific use case in mind.
> Null separates the value
NULL is not a value.
NULL is a statement that a value is not available or unspecified reasons.
If you want a particular value where a query would return NULL, it's your job to replace the NULLs with the contextually-appropriate value, e.g., using COALESCE(), to provide it.
It's a convenience shortcut to allow more complicated data models to be rpersented in simpler table structures than a fully normalized NULL-free data model would require, and to provide information about missing data (which can be used with things like COALESCE, where appropriate) when a more complex data model is simplified into a resultset via a query with JOINS, etc.
Changing the emphasis.
> recognizably different in nature from something else of a similar type.
But anyways, the point wasn't to justify the choices of SQL but rather as a way to make intuitive sense of its logic. SQL is one of the oldest and most successful programming languages in existence, we are not going to change it, and it is not going to disappear anytime soon, so we have to go with it, like it or not. There have been some attempts at alternatives, both at changing the paradigm (NoSQL) and at cleaning up the language, which, to be fair, would be a good thing, but without much success. The relational paradigm just works, and SQL is usable enough to make the cost of switching not worth it.
Edit:
And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic. You are using it wrong basically. Sometimes, it is the right thing to do, but if that pattern starts appearing all over the place, it is usually a result of thinking "NULL is broken, I have to use this pattern to handle NULL properly". That's cargo culting, don't fix problems you don't understand by copy-pasting code you don't understand.
Note: this is not addressed to "you" in particular, there can be good reasons, no offense intended. But I think that in general, it is a code smell.
If it is not recognizably different than it is the same in that context correct?
Two measurements of 25C are not recognizably different therefore they are equal, correct, regardless if the actual temperatures are not the same?
Two measurements of unknown are not recognizably different therefore they are equal in the context of the database.
Having null!=null has never been intuitive to me especially since every other programming language treats them equal. I am not hoping this gets changed, I know SQL is to far along for that, I can still complain about it and agree its wierd.
>And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic.
It's needed with parametrized sql when your db doesn't support "is not distinct from" which is itself a silly way to just write '=' or '==' like a normal programming language. The distinct predict exist for this very reason to have yet another way to express equality that includes nulls: https://modern-sql.com/caniuse/T151
This is confusing when you know that NULLs are not comparable, but it makes some sense if you consider the result of distinct/union as the output of a GROUP BY. You can consider everything that's NULL to be part of the same group, all the values are unknown.
So NULLs are not comparable but they are part of the same set.
If nulls are distinct then group by should not group them together, this just ignores the problem. Why does group by treat them as equal?
It doesn't treat them as equal, it treats them as one group. It does this because the result is more useful.
It is not the case that nulls are always the same as one another. It is also not the case that nulls are always distinct from each other. Thus, the normal rule of grouping, that same values are combined and distinct values make different groups, does not apply. Another principle is needed to determine the result of grouping with nulls.
Logic which allows for an unknown value can't be shoehorned into always giving definite true or false answers.
This is where the foundation of a relational database semantics in set theory shows through. You can’t model the behaviour with pure boolean logic.
In the SQL spec by default unique indexes consider nulls distinct because you’re adding tuple to the relation, and this is done by equality.
When doing a select distinct or group by you’re not doing a boolean grouping, you’re doing a set projection. NULLs are considered part of the set of “unknown values”, so NULLs are grouped together but they’re still not equal to each other.
The behaviour is mathematically well defined, but it doesn’t match boolean logic.
I’ve been dealing with databases for well over 20 years now and I can only think of a couple of times when this behaviour wasn’t wanted, so I think it’s the right choice.
Nulls are not necessarily distinct.
I believe this confusion is confusing the tool with the thing being measured. For simplicity, I will use the analogy of a record (stored as a row in the database) as an observation in a scientific experiment. If the tool was able to record a value, I enter a value like 579.13. If the tool was not able to record a value, the tool will enter NULL. I make a total of one hundred observations. Of one hundred rows, some have values and some are NULL.
Are NULLs distinct values? No, they are simply a failure in measurement; it is like asking if all errors are distinct or the same. Are NULLS part of the same dataset? Yes, because they are all observations for the same scientific experiment. What does it mean when "select distinct ... " returns several rows for known/measurable values and but only one row for NULL? If this is confusing, the scientist can update the rows and substitute "UNKNOWN/ERROR" for every NULL. When you do "select distinct ...", you will get the same thing. It will return several rows for known/measurable values and but only one row for "UNKNOWN/ERROR".
MS SQL Server treats NULLs as indistinct for UNIQUE constraints, SELECT DISTINCT and for UNION.
Indeed, the sqlite page the pull quote is from says as much.
> MS SQL Server treats NULLs as indistinct for UNIQUE constraints
Postgres lets you control that behaviour when creating the constraint (or index)
Although only in relatively recent versions. I had to hack around this in TypeOrm, because their Postgres backend hasn't exposed the option yet.
Yes, ergonomics dictates some weird behavior for nulls.
Luckily, Postgres nowadays lets you declare the behavior of each null in unique constraints, like it should be. We can expect this to creep down to the other DBMSs with time.
Making nulls distinct on a "select distinct" or a "group by" would be completely useless.
If you're compromising on your high-minded and unorthodox purist concept for ergonomics, you may as well finish the job of ergonomics and just use normal nulls where X=X is true.
If you're including possibly NULL columns in a distinct or group by and you want to treat them in a particular way, use the COALESCE() or NVL() or whatever similar function to give a real value to the NULL for that purpose.
Isn't "select distinct" wildly frowned upon anyway? It's the same as "group by", but with less options...
It’s not and it’s not, respectively.
Yeah nothing wrong with “select distinct” itself if it’s used correctly for its intended reasons.
But when I see select distinct at the start of a big and complex query, I do immediately suspect that the developer might have missed some join condition down the line and “got too many rows” back from the query. And since the rows look like duplicates due to the missing join predicate, for a junior (or careless) developer, the quick “solution” is to just apply distinct in the top level query to get rid of these pesky duplicates, declare success and move on.
It's possible that this is due to the underlying implementation.
In a unique column normally you'll have an index, so NULL becomes a special value in an index, but in SELECT DISTINCT you probably won't have an index, which means a full scan is performed, then every row has to be compared with every other row.
there is a pattern starting to emerge here on hackernews of highly voted posts by people who present themselves as experts and thought leaders who shamelessly put their lack of understanding at display. it's frightening.
The idea that someone should refrain from publishing a blog post about _anything_ unless they are a certifiable expert is not reasonable. Many people (correctly) write to learn, and even if they are publishing just to "present themselves as experts", it's on the reader to determine value.
In a world filled with false bullshit, crating more false unchecked writing instead of educating yourself is not a benefit to anyone.
What was false in the article?
In a world where there was less false bullshit people believed smoking is fine and sugar is healthy.
Amount of false bullshit doesn’t make qualitative difference.
Only difference to make is that people should not take something as truth just because it is written in a book or in a blog post or if person has a degree or not.
I see it the other way around.
People think if someone wrote blog post with technical details and it got upvoted - somehow it has to be an expert.
go the extra mile and click on about and then check out the linkedin profile.
i quote:
"I graduated top of my class with a BSc in Computer Science [...]. I have a strong background in software engineering and technical leadership"
Who doesn't think of themselves as an expert? That doesn't mean they are one.
Why not both? In my career, I have met countless people who are experts in programming in general, but with relatively modest skills in database systems.
Which is fine! It's really hard to be truly expert in both. There's a reason why "programmer" and "database administrator" used to be two different professions. I'd like to think that I'm better than your average developer at flogging RDBMSes, but most DBAs I've worked with can still run circles around me when it comes to information modeling and database & query optimization.
At a lot of companies, there are still full teams of people slinging t-sql or pl/SQL all day long to support their organization. Not DBAs, just developers who primarily work inside the database system their entire life.
I keep my old SQL Server Anki cards alive for just such a use case. It's been a minute since I had to jump into a 3-digit-LOC SQL script that does some arcane financial processing or what have you, but there's a nice steady niche there in case I ever want to throw my hat back into the ring.
> starting to emerge here on hackernews
It's not getting worse, you're getting better.
HN has for a long time been where I go for whatever you call the tech equivalent of watching stoners think they're having a deep conversation.
I think that's more a pattern of your understanding growing over time.
Most technical writing is actually at the start of the technical journey, there's more people there and its more interesting to talk about strategy and ideas when its not "messy" with internal details and trade offs you make.
I encourage folks at any technical level to write, and I encourage folks to check authors work with a critical eye no matter how long the source has been in the industry, no amount of experience can protect you from being wrong.
Starting?
Like on Reddit etc, which I deliberately avoid for this reason. The hiding of the vote count and the heavy moderation still help a lot that HN is still a massively better platform than any of its alternatives.
eh. This probably shouldn't have gotten so many votes, but it's a little interesting from a logic standpoint. It falls somewhere in the region of a StackOverflow question that makes some people scratch their heads and functions as nerdbait for everyone who knows the answer. These things don't rank for that long on HN, (and I agree that the self-important "expert" blog posture is silly), but I do find them to be a better daily checkin for my brain than actually going on S.O. anymore...
Or maybe people understand but still think it's dumb and hideously inconvenient?
Ergonomics matter.
- [deleted]
This is the correct way of thinking about things. Null is one of the hardest things for traditional software engineers in my experience as a guy who came up as a data admin.
Null in not-SQL (which is most things) usually isn't this tortured and isn't hard.
That's because null in not-SQL is a rather different concept, and while it's pretty easy to understand it's absolutely is hard to actually work with. Hoare didn't call it a "billion dollar mistake" on a whim.
I don't know about that. null/undefined in javascript gives it a pretty good run for it's money
Because they're two orthogonal problems. It's not like you do select distinct from program variables group by scope in your frontend programming language.
Yeah the 3 valued logic of SQL trips people up, me too from time to time
SQL is not three valued. Neither is NULL. BOOLEAN is accused of being three-valued but it has two values and like all values they can be unknown. Similarly a SMALLINT has 65,536 possible values not 65,537.
It’s not.
Your link makes the same mistake I already addressed. It conflates nullable booleans with tri-state logic.
Null is not a value. It is the absence of a value.
> The SQL null value basically means “could be anything”.
This is wrong. Null means it could be any valid value but that value is unknown. If the datatype is DATE then the value cannot be boolean TRUE or the string ‘purple’.
How is that different than “anything”?
If I’m comparing a value of type date to a null I still think it works as it should if value is “unknown”. What greater insight or context do we have if it’s a small-int null?
Suppose you have table with two columns to represent a classroom’s exam scores. It has the columns student_id (varchar) and score (int).
If a student is sick and has not taken the exam, yes you could enter -99 to represent they did not take the test. But if you want to find the class average, you would have to do something like this:
select average(case when score =-99 then null else score end) as class_avg from …
Or you could have entered null to begin with.
> How is that different than “anything”?
Because the possible values are known.
> What greater insight or context do we have if it’s a small-int null?
The insight is that null is not a value. It’s not a smallint or a boolean or a date. It’s the absence of a possible value. The only way to see a null as tri-state is to conflate it with a nullable boolean. This is an incorrect mental model which leads to confusion.
- [deleted]
The Maybe monad is really well understood at this point.
> Null is one of the hardest things for traditional software engineers
Making them harder is not better.
Never said it was
If only it had a name that was more indicative of that, like UNKNOWN, or UNDEFINED or INDERTIMINATE or something.
Javascript has both null and undefined and I'm not sure that's a good idea. At least in SQL we only have one of them, but it can mean unknown or it can mean N/A or even false. It's like a joker, what it means depends on how you use it.
Or VBA, which has Empty, Null, and Nothing:
https://excelbaby.com/learn/the-difference-between-empty-nul...
(and sometimes Missing)
No, it's not those other things, that's just using the tool incorrectly. A NULL is definitely "we dont know", not false, not N/A, especially not any known value.
Except in every other programming language with a null, null is the definite absence of something
Sure, and we're talking about SQL nulls in this context, which is why I am strict in my definition.
Null is shorter, and given the nightmarish queries I've had to read/write, I'll take any mercy that comes my way.
Honestly, at this point I just wish SQL servers supported proper discriminated union types and nullable columns were implemented as some kind of MaybeKnown<T> and I could just have a normal Maybe<T> with normal equality semantics if I wanted.
SQL needs to join 21st century type systems... or needs to get replaced altogether. SQL is the FORTRAN of relational programming languages, but hangs around because every time somebody tries to replace it they throw the relational-theory baby out with the bath-water.
> SQL is the FORTRAN of relational programming languages
and what is an alternative to sql ... quel?
It's unfortunate to namesquat on 'boolean' if your elements have three possible values. Just call it 'ternary logic' and let individuals decide which system to use.
Who's name squatting boolean? The bool column is exactly what it claims to be, you just have the option of introducing unknowability if you define it allow nulls.
The distinction is that not all formal logic systems are Boolean. Meaning that it is nonsensical and confusing to use "Boolean" as a generic synonym for "truth value" in the same way that it's nonsensical to use "Pantone" as a generic synonym for "color value", including when the specific kind of color value you're talking about is CMYK or HSV and definitely not Pantone.
There are two values, TRUE and FALSE. Null is not a value, it the the lack of a value.
You have a list of people and you ask if they own a car. You didn't get around to asking George, so that, somehow means he owns a car because you are using boolean logic? Or does it mean he doesn't own a car, because you are using boolean logic?
No, it means you haven't gathered this data point and don't know.
If there are exactly two possible values, TRUE and FALSE, you're working with Boolean logic.
If there are three possible values, TRUE, FALSE and NULL (unknown), then you're probably working with something like Kleene logic. You can't truly be working with Boolean logic, though, any more than you can be doing integer arithmetic when 15.37 is a member of your domain.
To put it another way, if we're talking about the mathematical definition of boolean algebra and not just some programming language's quirky implementation that happens to be called "bool", then boolean values would by definition be non-nullable. That logic that allows nulls has been implemented using the same unified set of keywords and operator names is a pragmatic decision that simplifies the language implementation and spec, not a principled one that tries to be pedantic about mathematical terminology.
> No, it means you haven't gathered this data point and don't know.
This is how it should be.
> Somehow means he owns a car because you are using boolean logic?
This is how it unfortunately is. There are 3 people, and there are 3 people who don't have a NULL car. Therefore George has a car.
Elsewhere people have argued that NULL propagates, so that your small unknowns infect larger queries. I could get behind that line of thinking, but the query above confidently returns 3.CREATE TABLE people(name text, carId uuid); INSERT INTO people values('Bill', '40c8a2d7-1eb9-40a9-b064-da358d6cee2b'); INSERT INTO people values('Fred', '3446364a-e4a5-400f-bb67-cbcac5dc2254'); INSERT INTO people values('George', NULL); SELECT Count(*) FROM people WHERE name NOT IN ( SELECT name FROM people WHERE carId = NULL );
> There are 3 people, and there are 3 people who don't have a NULL car.
This is not what you are asking with your query: as someone else stated, NULL is meant to be "UNKNOWN", or "it could be any valid value".
So nothing is ever equal to something that can be anything, because even another NULL (i.e. unknown) value is in general different.
So in the line
the condition will always be false. Now if instead if meant to search for the rows where carId is actually unknown you have to writeSELECT name FROM people WHERE carId = NULL
And your query will return as one may expect 2.SELECT name FROM people WHERE carId is NULL
But it is a boolean value, there's only two possible values TRUE and FALSE. But because it's SQL you can define any column as TYPE | NULL.
You could say that a boolean column with a NULL value is FALSE like how a lot of programming languages coerce it but if you wanted that you would just make a default of FALSE. The meaning of NULL in general being "value not specified" lends itself pretty nicely to "either true or false."
What I want is for e.g. "x OR y" where y is NULL (and/or of nullable type) to be an error rather than silently giving surprising results. Just like in a decent programming language I can't do x || y where x and y are of type boolean?, I have to explicitly handle the case where one or other of them is null (or e.g. write x!! || y!! - and that will at still error if either is null rather than silently evaluating to a funny result).
The meaning of NULL in general being "value not specified" lends itself pretty nicely to "either true or false."
You mean neither true or false?
I think I mean either. So yes NULL is a distinct value from true and false so I think it's also right to say it's neither true nor false. But the value NULL does represent is roughly "could be true or false, we don't know yet."
It could mean anything or nothing depending on context, which is part of the problem. (Plenty of people think the meaning of NULL is clear and obvious and consistent, unfortunately they give different answers about what that meaning is)
It could also be neither. It's whatever you define it to be. Null could mean you don't know if it's true or if it's false, or it could mean you know it's neither true nor false.
This is the map territory relation problem.
We bring in the 'Trinity of thought' a priori and forgot about the advantages and costs.
You have the principal of excluded middle, principal of identity, and the principal of non-contradiction.
If your problems territory fits things are convenient.
Once you introduce known, unknown, or unknowable unknowns, the classic model falls down.
Unfortunately you have to choose what to sacrifice based on the context of the problem at hand.
This is exactly where Rice's theorm, the halting problem etc.. arise.
AC from ZF(C) and IID from statistics bring PEM, which gives or forces the classical behavior.
The 'non-trivial' properties of Rice's theorm are exactly those properties where PEM doesn't hold.
The frame problem in machine learning is also related. As is Gödels (in) completeness theories.
Sometimes you can use failure as negation, other times you can use methods like accepting that NULL is an unknown.
But you have to decide what you can sacrifice and still solve real problems.
I think thinking of a problem and it's complement is better.
E.G P is the truthy T and co-P is the truthy F. Obviously P=co-P means that the traditional method works, but jump to NP vs co-NP and ot doesn't.
NP chained ORs, co-NP is chained ANDs is another way of thinking about it.
It could be true or false, but it’s unknown. For example. a user doing a survey is yet to fill in the answer. That doesn’t mean there is no answer, it’s just unrecorded.
Maybe GP was edited, but it doesn't use the word "boolean" anywhere.
Correct, I edited "boolean" out prior to ^^P's comment. My apologies.
- [deleted]
- [deleted]
A null boolean good very well mean true, if the default value is true (in the code if not in the DB).
Is this just being pedantic wrt uninitialized values?
Even simpler, note that the first two examples are essentially just two of the most basic boolean algebra identities. Replace UNKNOWN with a variable and change the notation to the more conventional boolean algebra notation and you've got:
1. x ∧ 1 = 1 (identity law for conjunction)
2. x ∨ 0 = 1 (identity law for disjunction)
They are weird because they are inconsistent compared to nearly everything else.
Being based on someone's logic is not sufficient. Most weird things are based on some (weird) logic.
- [deleted]
> SQL NULLs aren't weird, they're just based off of Kleene's TRUE-FALSE-UNKNOWN logic!
Kleene's TRUE-FALSE-UNKNOWN logic is weird.
SQL nulls effectively violate the reflexive property of equality, because X=X does not result in a value of TRUE.
And in many contexts in SQL, NULL is treated as equivalent to false, such as within a WHERE clause.
So that means that X=X is effectively FALSE in SQL*.
That is a clown language.
*(yes, I have the same opinion about ANSI NaN, but that wouldn't come up so often if Javascript didn't love NaN for some stupid reason.)
> And in many contexts in SQL, NULL is treated as equivalent to false, such as within a WHERE clause.
NULL is not equivalent to FALSE, it is neither FALSE nor TRUE. It has the same effect as FALSE as the final result of evaluating a WHERE clause condition only because WHERE clause conditions allow a row to be included only when they evaluate strictly to TRUE. But if NULL were equivalent to FALSE in a WHERE clause, than a WHERE clause condition which would evaluate to NULL that was instead negated would be equivalent to TRUE but instead it remains NULL which remains not TRUE.
No, because NOT NULL/UNKNOWN is still NULL/UNKNOWN.
I realized earlier I was using the term NULL, but going forwards let's use the ANSI SQL concept where the null state of a Boolean is called UNKNOWN. You'll have to forgive me for using the term NULL this far, but in my defense the concept of NULL column-values and UNKNOWN boolean expression results are pretty intertwingled.
resolves toSELECT * FROM foo WHERE NOT (1 = NULL)
which is the same value thatSELECT * FROM foo WHERE UNKNOWN
resolves to.SELECT * FROM foo WHERE 1 = NULL
So the WHERE clause is treating UNKNOWN/NULL as equivalent to false. The rest of the Boolean algebra is not.
SQL likes to have it both ways. Sometimes UNKNOWN/NULL is equivalent to false sometimes it's not.
It does this because UNKNOWN/NULL Booleans are incredibly inconvenient and tedious and sometimes they'd rather not think about them.
I'd like to do that as well, but this hoary old language won't give me the same permission it gives itself.
> And in many contexts in SQL, NULL is treated as equivalent to false, such as within a WHERE clause.
I don't think any databases treat `NULL` as `FALSE` in the WHERE clause. `SELECT * FROM foo WHERE bar = NULL` doesn't return rows with a NULL in the bar column. `SELECT * FROM foo WHERE bar != NULL` doesn't return rows without NULL in the bar column. `SELECT * FROM foo WHERE (bar = 'a') = NULL;` doesn't return rows where bar is not equal to `a`[1]. As far as I know every DB treats NULL as what it is, an unknown value.
It also doesn't to my mind violate the reflexive property because NULL is not equal to anything. It is a marker for an unknown value, not a value in and of itself. If you have a database of every person in a room and what color shirt they're wearing, and in your database, Alice and Bob both have NULL in their "shirt_color" column, that does not mean that Alice and Bob have the same color shirt. Nor does it mean that they don't have the same color shirt. Nor does it mean that someone with a green colored shirt has the same color shirt as Bob or Alice. It doesn't mean they don't have a shirt either. It means you don't have a record of/don't know what color their shirts are. You can't violate the reflexive property because you can't say what color shirt they have. You're not doing `X = X -> false`, you're doing `X = UNKNOWN -> UNKNOWN`
It treats the NULL/unknown value of the boolean as false
1 <> NULL => Boolean UNKNOWN,
so SELECT * FROM foo WHERE 1 <> NULL returns nothing.
1 = NULL => Boolean UNKNOWN,
so SELECT * FROM foo WHERE 1 = NULL returns nothing.
That's the thing that's being treated as FALSE. That UNKNOWN. Not the value of NULL itself.
> You're not doing `X = X -> false`, you're doing `X = UNKNOWN -> UNKNOWN`
That's not how "=" works. If you want a relationship for testing equality than handles unknown, don't call it equality.
Basic properties of equality, from Wikipedia
https://en.wikipedia.org/wiki/Equality_(mathematics)
- Reflexivity: for every a, one has a = a.
- Symmetry: for every a and b, if a = b, then b = a.
- Transitivity: for every a, b, and c, if a = b and b = c, then a = c.
edit:
We can also see the incoherence of this concept when we look at set theory.
Because UNKONWN booleans are neither true or false, if you use them in a WHERE clause you get the ugly result that the set of
is not everything.X ⋃ Xᶜ
> That's the thing that's being treated as FALSE. That UNKNOWN. Not the value of NULL itself.
I think the difference between how we're looking at this is for me there is no "value" of NULL. NULL has NO value which is why you can't do `1 = NULL` or `1 <> NULL` and have to specifically use `1 IS NOT NULL` or `1 IS DISTINCT FROM NULL`
>That's not how "=" works. If you want a relationship for testing equality >[that] handles unknown, don't call it equality. >Basic properties of equality, from Wikipedia >https://en.wikipedia.org/wiki/Equality_(mathematics) >- Reflexivity: for every a, one has a = a. >- Symmetry: for every a and b, if a = b, then b = a. >- Transitivity: for every a, b, and c, if a = b and b = c, then a = c.
Sure, that's all well and good, but equality only can work IF you know what the values are on either side. NULL means you don't know what a value is and therefore you can't answer whether or not a = NULL because NULL isn't known yet.
Or let me put it another way, is the launch price of the iPhone 17 != $799? The answer is neither true nor false, because the launch price of the iPhone 17 is unknown. We can make reasonable guesses, but if I give you a database with entries for iPhones from the first model and placeholder rows for the 17, 18, 19 and 20 models, none of them belong in the list when someone asks "which iPhones cost $799 at launch?" But equally true that none of them belong in the list when someone asks "which iPhones did not cost $799 at launch?"
> That's the thing that's being treated as FALSE. That UNKNOWN. Not the value of NULL itself.
No, it's being treated as UNKNOWN, and the semantics of SELECT...WHERE only returns rows where the value of the condition is TRUE.
I think you need to look into https://en.wikipedia.org/wiki/Negation_as_failure
Except that NULL is not the same as UNKNOWN! NULL is a data value (like integers) that can appear in data expressions (like NULL + 1) and comparisons (like NULL = 1) whereas UNKNOWN is a truth value that can appear in boolean/logical expressions constructed from logical connectives like AND, OR, NOT.
A data expression always evaluates to a data value, and usually whenever any part of the expression is NULL, the entire expression evaluates to NULL.
A comparison evaluates to a truth value, and usually when a comparison invovles a NULL it returns UNKNOWN. This leads to weird behaviors where both `SELECT 3 WHERE NULL = NULL;` and `SELECT 3 WHERE NULL <> NULL;` returns nothing (because the query engine does not output a row if the predicate returns UNKNOWN on it).
What you listed above only comes into play for boolean/logical connectives like AND, OR, NOT, and in that case we follow 3-valued logic.
And there's more annoying corner cases when you deal with DISTINCT. The situation is so hopeless that SQLite has a whole table documenting divergent behaviors of NULL in different systems: https://www.sqlite.org/nulls.html
Indeed, they're not identical - that's why I just said "based on", and that's likely why the word UNKNOWN itself isn't used in SQL.
Nevertheless I find it a useful intuition pump. I wager that most people reading `UNKNOWN = UNKNOWN` or `UNKNOWN <> UNKNOWN` and thinking about the examples above would stop and say, "Wait, I actually don't know the value of that statement for sure either, since the LHS and the RHS could be completely different things," and would then double check what their SQL dialect would actually do in this situation.
> the word UNKNOWN itself isn't used in SQL
That leads to an even more confusing point, that some systems (at least SQLite) overloads NULL to mean UNKNOWN, for example `SELECT 1 WHERE NULL = (NULL = NULL)`.
And the dangerous thing about NULLs is not when they are explicitly used in the query as a literal (as I did for brevity), but when they appear in tables. It's perfectly reasonable to assume `SELECT COUNT( * ) FROM t;` should be the same as `SELECT COUNT( * ) from t WHERE t.x = t.x OR t.x <> t.x`, but they are not the same because the latter does not return NULL rows. This has lead to real query optimizer bugs. For more examples see this paper https://dl.acm.org/doi/10.14778/3551793.3551818
> quite parasitic
This is the mental model that I use and teach: a NULL is a virus that will likely make everything it touches NULL... unless cordoned off by a COALESCE or IFNULL or something else that explicitly speaks about its NULL-sensitive behavior in its documentation.
So if you see a bug where a WHERE or FILTER clause is unexpectedly failing, and you're in a time crunch, rather than pounding your head against the wall, start sprinkling COALESCE statements around (while being thoughtful about the domain meaning of what you're doing) and it's more likely than not to solve your problem!