Ramblings on notation - The Uncommon Trajectory

Alternative title: “Notation is fake”

Twitter math

Consider the now-famous expression:

\[6 \div 2(1+2)\]

What number does this evaluate to? This flares up on twitter every now and then (it cropped up in my discord server about two weeks ago, giving the inspiration for this post), with two main parties, claiming that the answer is 9 or 1 respectively.

Let’s lay out the strongest arguments for both sides. PEMDAS (or BODMAS, or what have you) suggests two ways to parse this (with the \(\times\) inserted for clarity):

\[(6 \div 2) \times (1+2)\]
\[6 \div (2 \times (1+2))\]

People subscribing to the first parse will usually state the “rule” that “multiplication and division are the same precedence”, and that operators with the same precedence evaluate in left-to-right order. To use more formal syntactic terms, they bind equally tightly (have the same priority) and left-associate.

As far as I know, “operators with the same precedence evaluate left-to-right” only really exists for multiplication/division and addition/subtraction, and is only controversial when the first pair is involved. Most western languages (of which English is the primary language used to communicate mathematics) are written left-to-right, and so it makes sense that the operator we see first should evaluate first. On the other hand, it can feel strange that we decided to syntactically associate operations that are associative (namely, \(+\) and \(\times\)) with their decidedly non-associative inverses (\(-\) and \(\div\)). Note that, even amongst “elementary” operations, we don’t always follow the left-to-right rule! \(a^{b^c}\) is \(a^{(b^c)}\), not \((a^b)^c = a^{bc}\)! You can make all sorts of allowances (for example, the fact that the \(b^c\) is written “inside” the operator suggests that it should come first), but that is equally arbitrary – why didn’t we pick something similarly “unambiguous” for multiplication?
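As an aside, Python happens to follow the same written conventions, which makes the asymmetry easy to poke at. This is only a small illustration with arbitrarily chosen numbers, not an argument for either camp:

```python
# Subtraction left-associates: 10 - 4 - 3 is read as (10 - 4) - 3.
left_to_right = 10 - 4 - 3
forced_right = 10 - (4 - 3)

# Exponentiation right-associates: a ** b ** c is read as a ** (b ** c).
a, b, c = 2, 3, 2
tower = a ** b ** c
right_grouped = a ** (b ** c)
left_grouped = (a ** b) ** c  # this one collapses to a ** (b * c)

print(left_to_right, forced_right)         # 3 9
print(tower, right_grouped, left_grouped)  # 512 512 64
```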

Proponents of the second interpretation instead tend to note that written mathematics is a natural, human language, and that it should be interpreted in the way that is most intuitive. For the record, I agree with this, in principle (even though I wouldn’t really agree that the expression evaluates to 1 – more on this later).

I’ve heard a few different reasons for why it’s more intuitive to do the multiplication first, such as “parentheses are already used to say ‘do this first’, so multiplication via parentheses should be done earlier”. The argument I find most compelling is that, by writing the multiplicands without an operator to separate them explicitly (e.g., \(\times\)), it suggests to the reader that they are morally one “term”.

Note that the two views don’t have to be incompatible! The statement “multiplication and division happen at the same time in the order of operations” is a claim about syntax, not semantics – there’s nothing fundamental about “multiplication” that makes it “stronger” than “addition”, only the way we, as humans, read a formula that has been written down. To put it another way, the “rules” are intended to let us drop some parentheses (which are really just a way of disambiguating our syntax tree) by saying that, in certain cases, they’re implicit.

But if that’s the case, then there’s nothing stopping us from having two notations for the same operation with different syntactic properties – it’s totally fine for \(\times\) and \(\div\) to bind equally tightly and left-associate and also for multiplication via the \(\cdot (\cdot)\) to apply first.
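To make the “two notations, two precedences” idea concrete, here is a minimal sketch of a toy evaluator, assuming a tiny grammar of non-negative integers, `+`, `-`, `×`/`*`, `÷`/`/`, and parentheses. The names `tokenize`, `evaluate`, and the flag `juxtaposition_binds_tighter` are made up for this illustration; the flag is the only thing that differs between the two camps.

```python
import re
from fractions import Fraction

TOKEN = re.compile(r"\s*(\d+|[-+*/()×÷])")

def tokenize(src):
    """Split the input into number, operator, and parenthesis tokens."""
    src = src.strip()
    tokens, pos = [], 0
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if match is None:
            raise SyntaxError(f"unexpected input: {src[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def evaluate(src, juxtaposition_binds_tighter):
    """Evaluate src; the flag decides how tightly 2(1+2) sticks together."""
    tokens = tokenize(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def atom():
        # A number or a parenthesised sub-expression.
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        tok = advance()
        if tok == "(":
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        if not tok.isdigit():
            raise SyntaxError(f"unexpected token: {tok!r}")
        return Fraction(int(tok))

    def factor():
        # In "tight" mode, a following "(" is glued on before × and ÷ are seen.
        value = atom()
        while juxtaposition_binds_tighter and peek() == "(":
            value *= atom()
        return value

    def term():
        # ×, ÷ (and, in "loose" mode, juxtaposition) left-associate here.
        value = factor()
        while True:
            tok = peek()
            if tok in ("*", "×"):
                advance()
                value *= factor()
            elif tok in ("/", "÷"):
                advance()
                value /= factor()
            elif tok == "(":
                value *= factor()
            else:
                return value

    def expr():
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token: {tokens[pos]!r}")
    return result

print(evaluate("6 ÷ 2(1+2)", juxtaposition_binds_tighter=False))   # 9
print(evaluate("6 ÷ 2(1+2)", juxtaposition_binds_tighter=True))    # 1
print(evaluate("6 ÷ 2 × (1+2)", juxtaposition_binds_tighter=True)) # 9
```

Note that the explicit \(\times\) gives 9 in both modes; only the juxtaposed form changes, which is exactly the compromise described above.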

Something fundamental about multiplication

So why did we decide that \(\times\) binds more tightly than \(+\)? I did some (admittedly brief) googling, but was unable to nail down a concrete explanation.

One suggestion, which I like, is that it’s due to distributivity. Namely, \(a \times (b + c) = (a \times b) + (a \times c)\), and \((a \times b)^c = (a^c) \times (b^c)\) – the stronger operator distributes over the immediately weaker one. This would mean that, given an expression consisting of only the PEMDAS operators, fully distributing exponents over multiplication, then fully distributing the resulting multiplications would result in an expression with no parenthesized expressions.

One problem with this reasoning is that it only works “one level” at a time. For example, exponents definitely do not distribute over addition, so we’d need to do some polynomial expansion to fully “flatten” an expression.
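The smallest case already shows the failure:

\[(a + b)^2 = a^2 + 2ab + b^2 \neq a^2 + b^2 \quad \text{(in general)}\]

The cross term \(2ab\) only shows up after expanding the product; no single “distribute the exponent” step produces it.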

On the other hand, there are other operators with a precedence convention, if we expand beyond just numbers. In boolean logic, the \(\land\) (conjunction, “and”) operator distributes over both \(\oplus\) (“xor”) and \(\lor\) (disjunction, “or”), and indeed we find that most C-family programming languages give & a higher precedence than ^ and | (and likewise give && a higher precedence than ||). On the third hand, ^ traditionally binds more tightly than |, but does not distribute over it.
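These distributivity claims are small enough to check by brute force. The following is a rough sketch over all boolean inputs; the helper name `distributes_over` is invented for this example, and the last two lines lean on Python's own ordering of its bitwise operators (`&` above `^` above `|`).

```python
from itertools import product
import operator

def distributes_over(outer, inner):
    """Check outer(a, inner(b, c)) == inner(outer(a, b), outer(a, c)) on all booleans."""
    return all(
        outer(a, inner(b, c)) == inner(outer(a, b), outer(a, c))
        for a, b, c in product([False, True], repeat=3)
    )

AND, XOR, OR = operator.and_, operator.xor, operator.or_

and_over_xor = distributes_over(AND, XOR)
and_over_or = distributes_over(AND, OR)
xor_over_or = distributes_over(XOR, OR)

print(and_over_xor, and_over_or, xor_over_or)  # True True False

# Precedence without distributivity: ^ still groups before |.
unparenthesized = True | True ^ True
xor_first = True | (True ^ True)
or_first = (True | True) ^ True
print(unparenthesized == xor_first, unparenthesized == or_first)  # True False
```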

Restricting only to \(\land\) and \(\lor\) does make for another rather interesting conversation. Many logicians actually write \(A \cdot B\) or just \(AB\) to mean \(A \land B\) and \(A + B\) for \(A \lor B\), precisely because \(\times\) distributes over \(+\). In fact, while we’ve so far been using “multiplication” and “addition” to refer to the arithmetic operations on numbers, \(\times_R\) distributing over \(+_R\) is actually baked into the definition of an arbitrary ring \((R, +_R, \times_R)\), separating us from the fact that, numerically, \(\times\) is defined in terms of \(+\).

This would be score one for the 9 camp – under this view, “division” (as in, the abstract field operation) is shorthand for “multiply by the inverse”, so multiplication and division are actually the same operation, which associates to the left.

For humans to read

If we’re going to bring in things like “abstract ring operations”, we should probably also talk about more “casual” constructs that more people are probably familiar with.

Here is an equation from an actual paper I was reading the other day:

\[P_p = F_p(V-W)/\eta_p\]

(this was indeed written with a slash and not a fully-qualified fraction)

For the sake of a more illustrative example, let’s rewrite the right hand side:

\[\frac{1}{\eta_p} F_p(V-W)\]

How are we to interpret this, exactly as I’ve written it? Under what order should we perform the operations?

On an entirely unrelated (yeah, right) note, consider also the cumulative distribution function \(F_X(x)\) of a random variable \(X\).

In your head, put in all the parentheses to make this fully unambiguous before you keep reading.

How did you parenthesize? As before, there are two main ways:

\[\left(\frac{1}{\eta_p}F_p\right)(V-W)\]
\[\frac{1}{\eta_p}(F_p(V-W))\]

When I first saw the first interpretation, something didn’t sit right with me. For one, it isn’t even well-formed… if \(F_p\) is a function. For another, it flips the association from the original expression. That is to be expected, since we did move a term to the other side, but it still makes me uncomfortable.

In fact, the \(F_p\) in the paper is actually a scalar, the \(F\) being the standard physics letter used for force, and is entirely unrelated to \(F_p(x)\), the cumulative distribution function over a random variable \(p\), defined as \(F_p(x) = P(p \le x)\). For bonus points, notice that the capital letter \(P\) is also overloaded here, to mean “probability” in the, well, probabilistic tradition, and “power” in the physics sense.

To put it another way, parsing math isn’t even context-free! Traditionally, function application is the tightest possible binding (because it’s shorthand for an entirely new, parenthesized expression), and functions are named \(f\), \(g\) or \(F\). This means that whether or not \(F_p\) is a function actually changes the way an expression associates!

With these in mind, we can put forward two principles which feel unobjectionable:

1. Replacing all variables with their concrete values should yield “the same” expression
2. It is preferable to need less context (as opposed to more context) to correctly parse a term

When taken in combination, we actually find a point for the 1 camp:

Consider the expression \(x \div f(y+z)\). Properly, to parse this, we would need to know whether \(f\) is a function or a scalar. However, \(f\) is most commonly used to denote a function, and so by principle 2 we should assume that \(f(y+z)\) is the tightest binding present, giving a parse of \(x \div (f(y+z))\). But if we then reveal that, in fact, \(f = 2\), by principle 1 we would get that the correct parse is \(x \div (2(y+z))\)!
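To see where the tension comes from, here is a minimal sketch of the competing “equal precedence” reading applied to the fixed shape \(x \div f(\ldots)\). The helper `eval_x_div_f_of` is hypothetical, and its scalar branch hard-codes the 9 camp's convention (juxtaposition is plain \(\times\), left-associating with \(\div\)) purely for illustration.

```python
def eval_x_div_f_of(x, f, arg):
    """Evaluate x ÷ f(arg), where the parse depends on what f turns out to be."""
    if callable(f):
        # Function application binds tightest: x ÷ (f(arg)).
        return x / f(arg)
    # f is a scalar, so f(arg) is "just" f × arg, read left to right: (x ÷ f) × arg.
    return (x / f) * arg

def double(t):
    """The function 'multiply by 2'."""
    return 2 * t

as_function = eval_x_div_f_of(6, double, 1 + 2)
as_scalar = eval_x_div_f_of(6, 2, 1 + 2)
print(as_function, as_scalar)  # 1.0 9.0
```

The function “multiply by 2” and the scalar 2 give different answers under this reading, which is the mismatch that principle 1 objects to.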

Nearly everyone I’ve given this explanation to has found something wrong with it, but they aren’t finding the same wrong thing. People seem split on whether the two principles are reasonable (I could certainly come up with a counterargument for either), which one isn’t reasonable, or whether they’re reasonable individually but not in combination. Similarly, the step of assuming that \(f\) might be a function seems to rub people the wrong way.

All that said, I actually still think this is a reasonable argument. As I said before, mathematical expressions are written entirely for humans to read, and most people who have reached the level of being comfortable with functions would agree that the expression \(f(y+z)\) suggests a function application, which is always “one term” that sticks together. It’s only natural, then, to want the similar-looking expression \(2(1+2)\) to also be “one term”.

Who even writes math like this?

The last time this came up, someone mentioned (paraphrasing) “you could get around this by writing the fraction properly”, and “nobody writes \(2(2)\) outside of school anyway”.

Setting aside my immediate objections (like the fact that \(x(e)\) notation is quite common if \(e\) is a more complex expression), this is more-or-less my viewpoint as well. The point of notation is to be readable, and writing something in an ambiguous form goes against that, even if “the rules” make it technically well-formed. The expression \(6 \div 2(1+2)\) doesn’t have a single value, because the real answer is “it depends” (on whether multiplication via adjacent values binds more tightly than, or just as tightly as, division via \(\div\)). We haven’t settled on a convention in this case, whatever the twitter mob thinks is “correct”.

People who work on theoretical programming languages or logical foundations run into this a lot. What’s the notation for substituting all occurrences of the free variable \(x\) with \(y\) in expression \(e\)? I’ve seen:

\[[x/y]e\]
\[[y/x]e\]
\[e[x/y]\]

and all other variants besides. The context generally makes things clear by “typechecking” the syntax tree in your head (although, in my experience, not always, which is always fun when one of the first two cases is used). This isn’t even getting into all the different ways to notate simultaneous substitution.
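For comparison, code can dodge the “which way round?” question by naming the arguments. The following is a rough sketch on a toy lambda-calculus syntax tree, assuming expressions are nested tuples tagged `"var"`, `"app"`, or `"lam"`; the function `substitute` and its keyword names `old` and `new` are invented here, and it renames one variable to another without any capture avoidance.

```python
def substitute(expr, *, old, new):
    """Replace free occurrences of variable `old` in `expr` with variable `new`.

    Naive on purpose: it does not rename binders to avoid capturing `new`.
    """
    tag = expr[0]
    if tag == "var":
        return ("var", new) if expr[1] == old else expr
    if tag == "app":
        _, func, arg = expr
        return ("app",
                substitute(func, old=old, new=new),
                substitute(arg, old=old, new=new))
    if tag == "lam":
        _, param, body = expr
        if param == old:
            # `old` is bound by this lambda, so nothing below is a free occurrence.
            return expr
        return ("lam", param, substitute(body, old=old, new=new))
    raise ValueError(f"unknown expression tag: {tag!r}")

# x applied to (lambda x. x): only the outer x is free.
e = ("app", ("var", "x"), ("lam", "x", ("var", "x")))
renamed = substitute(e, old="x", new="y")
print(renamed)  # ('app', ('var', 'y'), ('lam', 'x', ('var', 'x')))
```

The keyword-only arguments force every call site to spell out the direction, which is precisely what \([x/y]e\) leaves to context.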

Suffice to say, notation sucks and is made up anyway.