In a prior post
about Sorbet's T.nilable, I hinted at having a rant
about (untagged) union types. This is not that rant, but it is
probably going to get that open loop off my chest forever either
way.
In lieu of talking about theory, let's focus on ergonomics. This was originally inspired by an annoying interface in the Scala 3 reflection API. To describe the problem more minimally, the problem statement is as follows:
def bar(x: Int): Unit = // ...
def foo(xs: List[Int] | List[String]): Unit = {
if (xs(0).isInstanceOf[Int]) {
bar(xs(1))
}
}Should the call to bar be allowed to typecheck? I
would argue that yes, it should - the argument xs is
either List[Int] or
List[String], so if xs(0) is
Int, then xs must be
List[Int].
However, every1 language with union types I
could think of trying gets this wrong! More than just Scala2 -- at the time of writing, Typescript,
Python
(via Pyright, mypy also does this but I couldn't figure out how to
share the mypy playground) and Sorbet
Ruby, all claim that xs[1] has type
Int | String rather than specializing to
Int.
So why is this such a common mistake? While the specific
semantics of each language differ dramatically, something common
to all of them is that lists don't carry type
information. In Scala, this is due to type erasure; in the
others, it's because the underlying (runtime) type is just
list. This means that we can't just ask whether
xs is a List[Int] or
List[String], we actually need to query an element3. And while flow-sensitive
typing is pretty good these days, none of these type-checkers are
yet good enough to back-propagate the knowledge that
x[0] is an Int into the type of
xs.
Having discovered this hole, I then went to check if any of
these systems seem to think List[Int] | List[String]
is equivalent to List[Int | String]. The only one
that did was Sorbet;
given what I know of the Sorbet internals this doesn't surprise me
overly-much -- Sorbet was designed to be simple and performant,
leading to a one-pass model with a lot of "good enough"
assumptions.
As a graduate student in compilers, I will admit to having been
nerd-sniped into trying to solve this issue. Without having dug
into any particular algorithm, I suspect the issue has to do with
the checker performing the naive union of all possible output
types (i.e. nth applied to List[Int]
returns Int, nth applied to
List[String] returns String, so
nth applied to List[Int] | List[String]
returns Int | String). To fix this, we'd need to
observe that the return value here is actually an instantiated
type variable and instead produce some kind of "weak union type
variable" that can be backfilled once we examine
xs(0). Messy stuff, but it doesn't actually sound too
different from what Hindley-Milner and/or OCaml's value
restriction already does.
Strictly speaking, C-style unions sidestep this problem by forcing you to name which half of the list-union you're using before you can index. However, because there's no way to examine which variant you're in without building in extra runtime metadata, I'm not counting those.↩︎
Properly, the way to discriminate between two sides of a union is with
match; information gained fromisInstanceOfdoesn't flow into the if body period. When written that way, it's quite obvious that learning the type ofxs(0)won't affect the type ofxs, as the match givesxs(0)a new name (with the refined type). I've glossed over the details here to keep the examples as parallel as possible.↩︎It also means that we can't distinguish between different flavors of empty list. In theory this can have edge ramifications (if we want to add elements, for example), but I am of the opinion that this doesn't matter much.↩︎