Specs That Run

§13.3.3 of the C++ standard tells a compiler how to pick which overloaded function to call. This is the core of it:

A viable function F1 is defined to be a better function than another viable function F2 if for all arguments i, ICSi(F1) is not a worse conversion sequence than ICSi(F2), and then for some argument j, ICSj(F1) is a better conversion sequence than ICSj(F2), or, if not that, the context is an initialization by user-defined conversion and the standard conversion sequence from the return type of F1 to the destination type is a better conversion sequence than the standard conversion sequence from the return type of F2 to the destination type.

Read it again. Maybe a third time. It’s a comparison function — a sort predicate that takes two candidates and returns which one wins. It cascades through tie-breakers (“or, if not that”) exactly the way you’d write a chain of compare clauses. It’s a program, expressed in a dialect of English so dense that two compiler teams can read it and still ship different behaviour for the same code.

In Haskell, the whole thing is about six lines:

isBetter :: Function -> Function -> [Arg] -> Bool
isBetter f1 f2 args =
  -- "for all arguments i, ICSi(F1) is not a worse conversion sequence"
  all (\i -> ics i f1 >= ics i f2) args
  -- then the cascade of tie-breakers ("or, if not that"):
  && or [ any (\j -> ics j f1 > ics j f2) args  -- strictly better for some j
        , isNonTemplate f1 && isTemplate f2     -- non-template beats template
        , isTemplate f1 && isTemplate f2
            && moreSpecialized f1 f2            -- more specialized template wins
        ]

No ambiguity. You can run it, test it against corner cases, and get a definitive answer — not a committee discussion, not a defect report, not a decade-long disagreement between GCC and Clang about what “if not that” means when the conversion sequences are indistinguishable.

But the deepest advantage isn’t even that you can run it. It’s that the type checker reads it before anyone else does. If ics returns an ICS but you compare it with >, the compiler will tell you that ICS doesn’t have an Ord instance — which is really asking: did you define what it means for one conversion sequence to be “better” than another? The prose spec buries that definition three subclauses away and hopes you’ll find it. The type checker won’t let you proceed until you’ve answered the question. Contradictions that survive decades in prose — two clauses that quietly disagree about whether a conversion is defined — surface as type errors before the spec ever runs.
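Concretely, answering the type checker means writing the ordering down. A sketch, with illustrative constructor names; the standard’s real ranking has more cases and sub-ranks:

```haskell
-- Illustrative only, not the standard's full taxonomy. Constructors are
-- listed worst-first, so the derived Ord instance makes a "greater"
-- conversion sequence a better one, which is exactly the definition
-- the type checker was asking for.
data ICS
  = EllipsisConv     -- matched via ... (worst)
  | UserDefinedConv  -- via a user-defined conversion
  | StandardConv     -- a standard conversion
  | Promotion        -- e.g. short to int
  | ExactMatch       -- no conversion needed (best)
  deriving (Eq, Ord, Show)
```

Five lines of datatype, and the question "what does better mean?" has one answer, checked at compile time.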

I spent eight years on IBM’s compilers — C++ and Fortran — through the C++0x era and into C++11, implementing features like initializer lists and drafting spec text. I watched smart people argue for hours over English phrasing that was trying, desperately, to be unambiguous, and failing, because natural language doesn’t compile. The spec was describing algorithms. It just wouldn’t admit it.

the crack

In 2009, Gerwin Klein and a team at NICTA formally verified seL4 — a microkernel, 8,700 lines of C and 600 lines of ARM assembler — in Isabelle/HOL. Not tested, not fuzzed — proved. Two hundred thousand lines of machine-checked proof establishing that the C implementation was a faithful refinement of an abstract specification. Every possible execution path. Every possible input. The kernel would never crash, never perform an unsafe operation, never deviate from its spec.

One detail stuck with me. They didn’t go straight from spec to C — they went through an intermediate executable specification written in Haskell. The abstract spec described what the kernel should do; the Haskell prototype described how; the C code made it fast. Two refinement steps, each machine-checked. The Haskell layer was the bridge between intention and implementation, and it was the part that made the whole thing feasible.

If you could do this for an operating system kernel — prove, line by line, that 8,700 lines of C faithfully implement a formal spec — then the idea that a language standard had to be written in English suddenly looked like convention. Not necessity. Just convention.

I wrote a short blog post about it at the time. The pitch was simple: take the C++ standard, apply a minimal transformation to Haskell, and you’d have a runnable reference implementation. Test your compiler against it. Prove your compiler matched it. The formal methods existed. The language existed. The only missing piece was the will.

what actually happened

Anyone who’d worked inside a language standard and also touched formal methods would have the same reaction — the gap is that obvious. And independently, in patches, in corners, people did exactly what you’d expect.

WebAssembly got it right from the start. When the W3C designed Wasm in 2015–2017, they wrote the spec using formal small-step operational semantics — mathematical notation that is the definition, not a description of it. Andreas Rossberg, one of the spec’s principal authors, had come from programming language research and built formalism into the process from day one. The formal rules came first; the prose was transliterated from them afterward. The spec ships with a reference interpreter in OCaml. When two implementations disagree about a Wasm program’s behaviour, there is a definitive answer, and it comes from running the spec, not from arguing about what a paragraph means.
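To make “the rules are the definition” concrete: a small-step operational semantics is, at bottom, a step function you can execute. Here is a toy sketch in Haskell — a two-instruction stack machine invented for illustration, nothing like real Wasm:

```haskell
-- A toy small-step semantics, invented for illustration (two
-- instructions, not actual WebAssembly).
data Instr = Const Int | AddI

-- One reduction step: (remaining program, stack) -> next configuration.
-- Each equation is one inference rule. The function IS the semantics.
step :: ([Instr], [Int]) -> Maybe ([Instr], [Int])
step (Const n : rest, stack)      = Just (rest, n : stack)
step (AddI : rest, a : b : stack) = Just (rest, (a + b) : stack)
step _                            = Nothing   -- finished, or stuck

-- Run to a normal form: apply step until no rule fires.
run :: [Instr] -> [Int]
run prog = go (prog, [])
  where go st = maybe (snd st) go (step st)
```

Every question about a program’s behaviour bottoms out in running step, not in interpreting a paragraph. Wasm’s real semantics is this, scaled up.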

Conrad Watt later mechanized the semantics in Isabelle/HOL and found real bugs in the spec during the soundness proof — bugs that would have survived indefinitely in prose form, because prose can’t be type-checked. The PLDI 2017 paper introducing the design calls WebAssembly “the first mainstream language that has been designed with a formal semantics from the start.”

But WebAssembly had an advantage no existing language has: it started fresh. No legacy, no committee of implementors with decades of deployed code to protect, no accumulated historical accident resisting clean formalization. The institution was built around the formal spec, not the other way around.

The K Framework showed what happens when you try the other direction. Grigore Roșu and Chucky Ellison at Illinois defined formal semantics for C — the full core language, not a subset, not a toy — as over a thousand rewrite rules in K. Their executable semantics could run real C programs and produce definitive answers about undefined behaviour, sequence points, memory model edge cases. Run against the GCC torture test suite, it passed 770 out of 776 tests — more than GCC itself (768) and more than Clang (763). It could detect all 77 categories of core language undefined behaviour in C11. John Regehr’s team at Utah used it as an oracle in C-Reduce, which helped Csmith report nearly 400 previously unknown bugs in GCC and LLVM. A genuine executable specification for a legacy language — and it worked better than the compilers.

Nobody adopted it as the standard. WG14 never referenced it, never acknowledged it as normative, never used it to resolve defect reports. A complete, tested, executable semantics for C exists right now, passes more tests than the major compilers, and the actual C standard is still prose.

why specs stay prose

The easy explanation is a failure of adoption — people hadn’t noticed, or the tooling wasn’t mature, or it would happen eventually. But the specs stay prose for reasons that are structural, not accidental, and once you see them you can’t unsee them.

Ambiguity in a specification is not always a bug. Sometimes it’s a feature. A spec is a negotiated document — a treaty between implementors, each with their own platform constraints, optimization strategies, and deployed user bases. When the committee writes prose that leaves room for interpretation, that wiggle room isn’t sloppy drafting. It’s the space that lets GCC and MSVC and Clang coexist, each conforming, each different, each serving users who rely on those differences. An executable spec eliminates that space. It forces resolution on every edge case. And sometimes the committee genuinely cannot afford to resolve an edge case, because resolving it would break one implementor’s user base, and that implementor has veto power, and the prose ambiguity is the political compromise that keeps the standard moving forward.

I saw this firsthand with a proposal for discriminated unions — algebraic data types in C++. The syntax was clean:

enum struct Expr {
    Literal(int);
    Add(Expr&, Expr&);
    Negate(Expr&);
};

enum switch (e) {
    case Expr::Literal(n): return n;
    case Expr::Add(a, b):  return eval(a) + eval(b);
    case Expr::Negate(a):  return -eval(a);
}

Pattern matching. Exhaustiveness checking. The kind of feature that’s a one-liner in Haskell’s type system, that ML has had since the 1970s. The committee’s response was to do it in the library instead — std::variant, a type-erased tagged union with a visitor pattern and no exhaustiveness checking at the language level.
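For comparison, here is that same Expr in actual Haskell, shipping today, with exhaustiveness checking for free (a direct sketch of the proposal above, using values rather than references):

```haskell
data Expr = Literal Int | Add Expr Expr | Negate Expr

eval :: Expr -> Int
eval (Literal n) = n
eval (Add a b)   = eval a + eval b
eval (Negate a)  = negate (eval a)
-- Delete the Negate equation and compile with -Wincomplete-patterns:
-- GHC reports the missing case at compile time.
```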

Not because the library solution was better, but because library additions don’t require rewriting the core language spec. They go in a different chapter. They need fewer committee members to agree. std::variant is what you get when the standardization process optimizes for consensus over coherence.

This is the dynamic that keeps specs in prose: the spec isn’t just describing a language, it’s governing a community. Prose — ambiguous, negotiable, interpreted by humans — is the governance mechanism. An executable spec would be technically superior and politically impossible.

A second force, quieter but just as strong: the people who write language standards are, by long selection, people who are good at reading and writing standardese. Language lawyers. Asking them to switch to an executable spec isn’t asking for a format change — it’s asking them to become a different kind of expert, to give up the skill that makes them valuable in the current system. Institutions don’t do that voluntarily.

And then there’s sheer weight. C++ has forty years of accumulated decisions, special cases, compatibility shims, and deprecated-but-still-specified features. Formalizing that isn’t like writing a clean spec for a new language — it’s like transcribing a legal code amended by hundreds of different legislatures over decades. Every special case exists because someone needed it, and removing or clarifying it risks breaking their code. Compromises don’t formalize cleanly.

the new reader

Language specifications have a new consumer. LLMs read specs — to generate code, to answer questions about semantics, to translate between languages — and they read them literally, without the interpretive flexibility that human language lawyers bring. When an LLM encounters the overload resolution clause I quoted at the top of this post, it doesn’t have twenty years on the committee and a sense for which interpretation the authors probably intended. It parses the text. The text is ambiguous. A Haskell reference implementation of C++’s type system would be better training data than §13.3.3’s prose, because the Haskell can’t be misread. It can only be run.

I wonder whether this new consumer matters enough to break the equilibrium. Language committees don’t answer to LLMs. But they answer to people building tools with LLMs — and those people, increasingly, need machine-readable ground truth about what a language actually means. If enough of the ecosystem starts treating formal semantics as infrastructure rather than research, the committee might discover that prose ambiguity is no longer a viable compromise.

Or maybe not. Maybe the specs stay prose forever, and we build translation layers between the human-readable standard and the machine-readable one, maintained separately, drifting apart, introducing exactly the kind of bugs that formal verification was supposed to prevent. That would be the most C++ outcome imaginable.