omitting spans | arya dradjica

I have a large backlog of posts to write, but I quickly wanted to share a fun idea I explored on Thursday. I’ve been thinking about how incremental compilation should work in Krabby, and a major sticking point there is spans. But I think I have a plan.

A span refers to some portion of source code, e.g. 21:43 -- 22:17 in "foo.rs"; it is used for referring to source code when rustc prints diagnostics. You can find spans in most compilers. But rustc encodes more information into them: in particular, the edition and macro expansion they originate from. Macro expansion data is useful for diagnostics in a very similar way to line and column information; rustc diagnostics can refer to code originating from a macro invocation and print information about the macro invocation and the macro definition. But the most interesting part here is editions.
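To make the distinction concrete, here is a minimal sketch of a span that carries the extra context described above. It is purely illustrative: the type and field names (`Pos`, `FatSpan`, `expansion`) are made up for this example, and rustc's real `Span` is a compact 8-byte encoding rather than a plain struct like this.

```rust
/// Hypothetical edition tag for this sketch (rustc has its own edition type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Edition {
    E2015,
    E2018,
    E2021,
    E2024,
}

/// A line/column position in a source file.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pos {
    pub line: u32,
    pub col: u32,
}

/// Illustrative "fat" span: a source location plus edition and macro context.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FatSpan {
    /// Index into a hypothetical table of source files.
    pub file: u32,
    pub start: Pos,
    pub end: Pos,
    /// Edition of the code this span originates from.
    pub edition: Edition,
    /// ID of the macro expansion that produced the code, if any.
    pub expansion: Option<u32>,
}

impl FatSpan {
    /// Formats the location in the "21:43 -- 22:17" style used above.
    pub fn location(&self) -> String {
        format!(
            "{}:{} -- {}:{}",
            self.start.line, self.start.col, self.end.line, self.end.col
        )
    }
}
```

Only `location()` is about diagnostics; the `edition` field is the part that later compiler stages end up depending on.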

Rust’s edition mechanism is very interesting. It’s not a language standard, like C11 or C23; a Rust project can choose a certain edition and remain fully compatible with Rust libraries that use other editions. The choice of edition affects how the code within the project is processed, but remains a project-local choice. Conceptually, all the crates in your dependency graph are processed into an (effectively) edition-independent representation before being compiled together. Editions can change syntactic choices (e.g. whether async is a keyword) but (usually) not cross-crate interactions like type-checking behavior. I’m hedging my words a bit because rustc sometimes abuses editions to make bigger changes.

The association of edition data to individual tokens causes … complications. In most compilers, span data would only be relevant for printing diagnostics. But rustc depends on the edition data encoded by spans throughout compilation:

Parsing: async changed from a regular identifier to a keyword in Rust 2018
Declarative macros: $p:pat allows for a | b patterns from Rust 2021
Name resolution: paths (foo::bar) are relative to self from Rust 2018
Type-checking: Box<[T]> implements IntoIterator from Rust 2024
Lints: foo.try_into() could give a warning before Rust 2021
Borrow-checking: the order in which temporaries are dropped in if conditions changed in Rust 2024

The end result? rustc has to pass around spans everywhere, all the time. Spans are embedded in every intermediate representation within rustc – every token, every AST node, every HIR expression, every MIR statement. Spans are 8 bytes in size, but they reference more data stored in thread-local storage. rustc doesn’t use struct-of-arrays layouts for its data structures (due to the added complexity), so spans live next to important data and take up valuable space in cache. To worsen the problem, spans affect incremental compilation: innocuous changes to spans (e.g. because you re-order the functions in a file) can cause unnecessary rebuilds. rustc does the best it can in difficult circumstances, but I think a bigger change is necessary.
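The incremental-compilation point can be illustrated with a toy fingerprinting scheme. This is a sketch under stated assumptions, not how rustc actually fingerprints its IRs: `ByteSpan`, `Function`, and the helper functions are hypothetical, and the hash is just the standard library's `DefaultHasher`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical simplified span: a byte range in the source file.
#[derive(Clone, Copy, Hash)]
pub struct ByteSpan {
    pub lo: u32,
    pub hi: u32,
}

/// Toy IR node: a function with a name, a body, and an embedded span.
#[derive(Clone, Hash)]
pub struct Function {
    pub name: String,
    pub body: Vec<String>,
    pub span: ByteSpan,
}

/// Fingerprint of the whole node, span included.
pub fn fingerprint_with_span(f: &Function) -> u64 {
    let mut h = DefaultHasher::new();
    f.hash(&mut h);
    h.finish()
}

/// Fingerprint of the semantic content only (name and body).
pub fn fingerprint_without_span(f: &Function) -> u64 {
    let mut h = DefaultHasher::new();
    f.name.hash(&mut h);
    f.body.hash(&mut h);
    h.finish()
}

/// Simulates moving a function elsewhere in the file: only the span shifts.
pub fn moved(f: &Function, offset: u32) -> Function {
    let mut g = f.clone();
    g.span = ByteSpan {
        lo: f.span.lo + offset,
        hi: f.span.hi + offset,
    };
    g
}
```

Calling `moved` on a function changes `fingerprint_with_span` but leaves `fingerprint_without_span` untouched, so a cache keyed on the former would redo work after a pure re-ordering, while one keyed on the latter would not.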

I think it is possible for Krabby to treat spans differently from rustc.

Step one: track edition data separately from spans. Based on the use sites I found in rustc (see below), edition data is usually extracted from the spans of particular keywords (e.g. dyn, async, if). Like all identifiers, keywords are interned into 32-bit IDs; I can embed edition data in there for these specific keywords.
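One possible encoding for step one, sketched under assumptions: the two top bits of a 32-bit symbol ID hold the edition, and the remaining 30 bits index the interner's string table. The names (`Symbol`, `EDITION_SHIFT`, `same_name`) are hypothetical, and this version reserves the bits on every symbol for simplicity; an alternative would be to intern a separate ID per (keyword, edition) pair for just the keywords that need it.

```rust
/// Hypothetical edition tag, small enough to fit in two bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum Edition {
    E2015 = 0,
    E2018 = 1,
    E2021 = 2,
    E2024 = 3,
}

/// Interned identifier: the low 30 bits are the string-table index,
/// the top 2 bits are the edition the token was written in.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Symbol(u32);

const EDITION_SHIFT: u32 = 30;
const INDEX_MASK: u32 = (1 << EDITION_SHIFT) - 1;

impl Symbol {
    /// Packs an interner index and an edition into one 32-bit ID.
    pub fn new(index: u32, edition: Edition) -> Symbol {
        assert!(index <= INDEX_MASK, "interner index out of range");
        Symbol(index | ((edition as u32) << EDITION_SHIFT))
    }

    /// The string-table index, with the edition bits stripped.
    pub fn index(self) -> u32 {
        self.0 & INDEX_MASK
    }

    /// The edition stored in the top two bits.
    pub fn edition(self) -> Edition {
        match self.0 >> EDITION_SHIFT {
            0 => Edition::E2015,
            1 => Edition::E2018,
            2 => Edition::E2021,
            _ => Edition::E2024,
        }
    }

    /// True if both symbols name the same string, regardless of edition.
    pub fn same_name(self, other: Symbol) -> bool {
        self.index() == other.index()
    }
}
```

With this layout, a pass that needs the edition of an `async` or `dyn` token can read it straight off the symbol without consulting a span, and the symbol stays 4 bytes wide.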

Step two: compute spans on demand. Compiler passes that transform IRs will come in two versions, one that excludes spans and one that includes them. By default, span-excluding passes will be used; but if a span within a particular item turns out to be needed, all previous passes for that item will be repeated with spans. This will require a lot of effort.
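A minimal sketch of how the two versions of a pass could share one implementation, assuming a hypothetical `SpanMode` trait whose associated type is either `()` or a real byte range. The "pass" here is a toy lexer and the "check" is a toy validation; none of these names come from Krabby or rustc.

```rust
/// Byte range in the source text.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ByteRange {
    pub lo: usize,
    pub hi: usize,
}

/// Chooses what a pass stores as a span.
pub trait SpanMode {
    type Span: Copy + std::fmt::Debug;
    fn make(lo: usize, hi: usize) -> Self::Span;
}

/// Span-excluding mode: spans are zero-sized and cost nothing.
pub struct NoSpans;
impl SpanMode for NoSpans {
    type Span = ();
    fn make(_lo: usize, _hi: usize) -> Self::Span {}
}

/// Span-including mode: spans are real byte ranges.
pub struct WithSpans;
impl SpanMode for WithSpans {
    type Span = ByteRange;
    fn make(lo: usize, hi: usize) -> Self::Span {
        ByteRange { lo, hi }
    }
}

/// Toy IR node: one whitespace-separated word.
pub struct Word<M: SpanMode> {
    pub text: String,
    pub span: M::Span,
}

/// Toy pass: split the source into words, recording spans per the mode.
pub fn lex<M: SpanMode>(src: &str) -> Vec<Word<M>> {
    let mut out = Vec::new();
    let mut start = None;
    // A trailing virtual space flushes the last word.
    for (i, c) in src.char_indices().chain(std::iter::once((src.len(), ' '))) {
        match (c.is_whitespace(), start) {
            (false, None) => start = Some(i),
            (true, Some(lo)) => {
                out.push(Word { text: src[lo..i].to_string(), span: M::make(lo, i) });
                start = None;
            }
            _ => {}
        }
    }
    out
}

/// Toy check: index of the first word that is not an integer.
pub fn first_bad<M: SpanMode>(words: &[Word<M>]) -> Option<usize> {
    words.iter().position(|w| w.text.parse::<i64>().is_err())
}

/// Driver: run without spans; only if a span is needed, repeat with spans.
pub fn check(src: &str) -> Result<(), ByteRange> {
    let fast = lex::<NoSpans>(src);
    match first_bad(&fast) {
        None => Ok(()),
        Some(i) => Err(lex::<WithSpans>(src)[i].span),
    }
}
```

For example, `check("1 2 3")` never computes a span, while `check("10 x7 3")` re-runs the lexer with spans and reports the byte range 3..5. Because `()` is zero-sized, `Word<NoSpans>` is no larger than the `String` it wraps, which is the cache-footprint benefit this design is after. The cost is that the pass must be deterministic, so the span-including re-run reproduces the same node indices.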

I’m curious to explore this: it could improve performance in ways that are very hard to measure (better cache behavior and fewer unnecessary recompiles).

appendix: use sites

Here are all the use sites I found in rustc for Span::edition() and some related convenience functions, excluding cases where a diagnostic (warning/error) is being generated. It’s probably not exhaustive – there are many ways to access information from Span and I didn’t check them all – but I found it quite informative.

N.B. I tried to understand the context for these uses, but I may have gotten them wrong! If you are more familiar with the codebase and can correct me, please reach out.

Raw lifetimes (e.g. 'r#a) seem to be allowed from edition 2021; when the lexer is building a raw lifetime, it checks the span to ensure this is allowed.
The interpretation of some meta-variables in declarative macros has changed across editions. $t:expr and $t:pat changed in editions 2024 and 2021 respectively, so the editions of the expr and pat identifiers are checked.
Various keywords are introduced or changed across editions, including dyn, async, await, try, and gen. The parser checks their editions to figure out how to interpret them.
This one was surprising: absolute paths, e.g. ::foo::bar, are transformed into {{root}}::foo::bar, where {{root}} is a magic identifier (named kw::PathRoot in the compiler). The interpretation of absolute paths changed in edition 2018, so the edition of {{root}} (which is presumably based on the edition of the next token) is checked.
The TryInto trait was added to the standard library prelude in edition 2021; when evaluating foo.try_into(), the type checker considers the edition of the try_into identifier. There are similar cases for .poll() and .into_future().
IntoIterator was implemented for [T; N] and Box<[T]> in edition 2021 and 2024 respectively; the edition of into_iter in foo.into_iter(), where foo is of one of these types, is checked.
Because of the change in absolute paths in edition 2018, the foo in pub(foo) struct Bar changed meaning, so its edition is checked.
In edition 2015, #![no_implicit_prelude] did not affect the macro_use prelude, so the editions of identifiers that end up being resolved against the macro_use prelude are checked.
The span of a closure expression, e.g. || a + b, is considered by the type checker; I’ve lost track of why.
Universal function call syntax, e.g. Foo::bar::method::<u32>(), can be linted (based on the edition of the span of the whole expression), I think because of how absolute paths changed in edition 2018.
In edition 2024, it became mandatory to precede extern { ... } blocks with unsafe. The edition of the extern keyword is checked.
There is a nightly feature, introduced in edition 2024, to allow the pattern mut ref mut foo; foo would have type &mut T, and the first mut would mark foo as being mutable itself. The edition of that first mut is checked.
let a = b can be interpreted as an expression for the purposes of if let and let chains. But the latter were only introduced in edition 2024; every time they are parsed, the edition of the let keyword is checked.
Some standard-library functions that used to be safe became unsafe in edition 2024; the editions of call expressions referring to them are checked. At this moment, the affected functions are std::env::{set_var,remove_var} and the deprecated std::os::unix::process::CommandExt::before_exec.
During type-checking, the edition of an opaque type like impl Foo is checked in certain places (incl. RPITIT) to determine whether precise capture rules, introduced in edition 2024, should be assumed.
All the way down in MIR building, the edition of if-else (AFAICT of the leading if) is checked, because the scoping rules for temporaries changed subtly in edition 2024.