Quick thoughts: Union types

In a prior post about Sorbet's T.nilable, I hinted at having a rant about (untagged) union types. This is not that rant, but it is probably going to get that open loop off my chest forever either way.

In lieu of talking about theory, let's focus on ergonomics. This was originally inspired by an annoying interface in the Scala 3 reflection API. To describe the problem more minimally, the problem statement is as follows:

def bar(x: Int): Unit = // ...

def foo(xs: List[Int] | List[String]): Unit = {
  if (xs(0).isInstanceOf[Int]) {
    bar(xs(1))
  }
}

Should the call to bar be allowed to typecheck? I would argue that yes, it should - the argument xs is either List[Int] or List[String], so if xs(0) is Int, then xs must be List[Int].

However, every¹ language with union types I could think of trying gets this wrong! More than just Scala ² -- at the time of writing, Typescript, Python (via Pyright, mypy also does this but I couldn't figure out how to share the mypy playground) and Sorbet Ruby, all claim that xs[1] has type Int | String rather than specializing to Int.

So why is this such a common mistake? While the specific semantics of each language differ dramatically, something common to all of them is that lists don't carry type information. In Scala, this is due to type erasure; in the others, it's because the underlying (runtime) type is just list. This means that we can't just ask whether xs is a List[Int] or List[String], we actually need to query an element³. And while flow-sensitive typing is pretty good these days, none of these type-checkers are yet good enough to back-propagate the knowledge that x[0] is an Int into the type of xs.

Having discovered this hole, I then went to check if any of these systems seem to think List[Int] | List[String] is equivalent to List[Int | String]. The only one that did was Sorbet; given what I know of the Sorbet internals this doesn't surprise me overly-much -- Sorbet was designed to be simple and performant, leading to a one-pass model with a lot of "good enough" assumptions.

As a graduate student in compilers, I will admit to having been nerd-sniped into trying to solve this issue. Without having dug into any particular algorithm, I suspect the issue has to do with the checker performing the naive union of all possible output types (i.e. nth applied to List[Int] returns Int, nth applied to List[String] returns String, so nth applied to List[Int] | List[String] returns Int | String). To fix this, we'd need to observe that the return value here is actually an instantiated type variable and instead produce some kind of "weak union type variable" that can be backfilled once we examine xs(0). Messy stuff, but it doesn't actually sound too different from what Hindley-Milner and/or OCaml's value restriction already does.

Strictly speaking, C-style unions sidestep this problem by forcing you to name which half of the list-union you're using before you can index. However, because there's no way to examine which variant you're in without building in extra runtime metadata, I'm not counting those.↩︎
Properly, the way to discriminate between two sides of a union is with match; information gained from isInstanceOf doesn't flow into the if body period. When written that way, it's quite obvious that learning the type of xs(0) won't affect the type of xs, as the match gives xs(0) a new name (with the refined type). I've glossed over the details here to keep the examples as parallel as possible.↩︎
It also means that we can't distinguish between different flavors of empty list. In theory this can have edge ramifications (if we want to add elements, for example), but I am of the opinion that this doesn't matter much.↩︎