I have a large backlog of posts to write, but I quickly wanted to share a fun idea I explored on Thursday. I’ve been thinking about how incremental compilation should work in Krabby, and a major sticking point there is spans. But I think I have a plan.
A span refers to some portion of source code, e.g. 21:43 -- 22:17 in "foo.rs"; it is used for referring to source code when rustc prints diagnostics. You can find spans in most compilers. But rustc encodes more information into them: in particular, the edition and macro expansion they originate from. Macro expansion data is useful for diagnostics in a very similar way to line and column information; rustc diagnostics can refer to code originating from a macro invocation and print information about the macro invocation and the macro definition. But the most interesting part here is editions.
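To make the description above concrete, here is a minimal sketch of a span that carries edition and macro-expansion data alongside its position. Every name in it (`Edition`, `LineCol`, `Span`, `describe`) is hypothetical and chosen for illustration; rustc's real `Span` is a compact 8-byte handle and looks quite different.

```rust
/// Hypothetical edition tag; the variants mirror the existing Rust editions.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Edition {
    E2015,
    E2018,
    E2021,
    E2024,
}

/// A line:column position within a file.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct LineCol {
    pub line: u32,
    pub col: u32,
}

/// Illustrative span: a source range plus the extra data described above.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Span {
    pub file: String,
    pub start: LineCol,
    pub end: LineCol,
    /// Edition of the crate the code was written in.
    pub edition: Edition,
    /// Name of the macro whose expansion produced this code, if any.
    pub expanded_from: Option<String>,
}

impl Span {
    /// Render the span the way a diagnostic header might.
    pub fn describe(&self) -> String {
        let mut out = format!(
            "{}:{}:{} -- {}:{}",
            self.file, self.start.line, self.start.col, self.end.line, self.end.col
        );
        if let Some(name) = &self.expanded_from {
            out.push_str(&format!(" (in expansion of `{}!`)", name));
        }
        out
    }
}
```

With the example range from the text, `describe()` yields `foo.rs:21:43 -- 22:17`, with an extra note appended when the code came out of a macro.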
Rust’s edition mechanism is very interesting. It’s not a language standard, like C11 or C23; a Rust project can choose a certain edition and remain fully compatible with other Rust libraries. The choice of edition affects how the code within the project is processed, but remains a project-local choice. Conceptually, all the crates in your dependency graph are processed into an (effectively) edition-independent representation before being compiled together. Editions can change syntactic choices (e.g. whether async is a keyword) but (usually) not cross-crate interactions like type-checking behavior. I’m hedging my words a bit because rustc sometimes abuses editions to make bigger changes.
The association of edition data to individual tokens causes … complications. In most compilers, span data would only be relevant for printing diagnostics. But rustc depends on the edition data encoded by spans throughout compilation:
- Parsing: `async` changed from a regular identifier to a keyword in Rust 2018
- Declarative macros: `$p:pat` allows for `a | b` patterns from Rust 2021
- Name resolution: paths (`foo::bar`) are relative to `self` from Rust 2018
- Type-checking: `Box<[T]>` implements `IntoIterator` from Rust 2024
- Lints: `foo.try_into()` could give a warning before Rust 2024
- Borrow-checking: the order in which temporaries are dropped in `if` conditions changed in Rust 2024
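The parsing case is the easiest one to sketch. The snippet below classifies a token as a keyword or an identifier based on the edition attached to that token, which is why the edition has to travel with each token and not just with the file. The `Edition`, `TokenKind`, and `classify` names are hypothetical, and the keyword table is a deliberately tiny subset, not rustc's actual one.

```rust
/// Hypothetical edition tag, ordered from oldest to newest so that
/// comparisons like `edition >= Edition::E2018` work.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Edition {
    E2015,
    E2018,
    E2021,
    E2024,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TokenKind {
    Keyword,
    Ident,
}

/// Classify `text` using the edition carried by that particular token.
/// The table is an illustrative subset only.
pub fn classify(text: &str, edition: Edition) -> TokenKind {
    // The edition from which each word is treated as a keyword.
    let keyword_since = match text {
        "fn" | "let" | "if" => Some(Edition::E2015),
        "async" | "await" => Some(Edition::E2018),
        "gen" => Some(Edition::E2024),
        _ => None,
    };
    match keyword_since {
        Some(since) if edition >= since => TokenKind::Keyword,
        _ => TokenKind::Ident,
    }
}
```

A macro defined in a 2015 crate and invoked from a 2018 crate can yield a token stream in which one `async` is an identifier and another is a keyword, so a single per-file edition would not be enough to drive `classify`.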
The end result? rustc has to pass around spans everywhere, all the time. Spans are embedded in every intermediate representation within rustc – every token, every AST node, every HIR expression, every MIR statement. Spans are 8 bytes in size, but they reference more data stored in thread-local storage. rustc doesn’t use struct-of-arrays layouts for its data structures (due to the added complexity), so spans live next to important data and take up valuable space in cache. To worsen the problem, spans affect incremental compilation: innocuous changes to spans (e.g. because you re-order the functions in a file) can cause unnecessary rebuilds. rustc does the best it can in difficult circumstances, but I think a bigger change is necessary.
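The incremental-compilation problem can be illustrated with a toy model. It is not how rustc actually fingerprints items; `Function`, `layout`, and both fingerprint functions are hypothetical. The point is only that a hash which covers the span changes when a function moves, even though nothing about the function itself changed.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy byte-offset span within a single file.
#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq)]
pub struct Span {
    pub lo: u32,
    pub hi: u32,
}

/// Toy "item": a function's name, its source text, and where it sits.
#[derive(Clone, Debug, Hash)]
pub struct Function {
    pub name: &'static str,
    pub body: &'static str,
    pub span: Span,
}

/// Fingerprint covering everything, including the span.
pub fn fingerprint_with_span(f: &Function) -> u64 {
    let mut hasher = DefaultHasher::new();
    f.hash(&mut hasher);
    hasher.finish()
}

/// Fingerprint covering only the span-independent content.
pub fn fingerprint_without_span(f: &Function) -> u64 {
    let mut hasher = DefaultHasher::new();
    f.name.hash(&mut hasher);
    f.body.hash(&mut hasher);
    hasher.finish()
}

/// Place functions one after another in a file, separated by a newline,
/// and assign each the byte range it ends up occupying.
pub fn layout(sources: &[(&'static str, &'static str)]) -> Vec<Function> {
    let mut offset = 0u32;
    sources
        .iter()
        .map(|&(name, body)| {
            let lo = offset;
            let hi = lo + body.len() as u32;
            offset = hi + 1;
            Function { name, body, span: Span { lo, hi } }
        })
        .collect()
}
```

Laying out two functions as `[a, b]` and then as `[b, a]` gives `a` a different span in each layout, so its span-including fingerprint differs between the two, while the span-free fingerprint stays the same and a cached result could be reused.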
I think it is possible for Krabby to treat spans differently from rustc.
Step one: track edition data separately from spans. Based on the use sites I found in rustc (see below), edition data is usually extracted from the spans of particular keywords (e.g. dyn, async, if). Like all identifiers, keywords are interned into 32-bit IDs; I can embed edition data in there for these specific keywords.
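One way step one might look, as a sketch under assumptions: the top three bits of the 32-bit ID hold an edition tag (0 meaning "no edition recorded") and the remaining 29 bits hold the interner index. The bit split, the `Symbol`/`Interner` API, and the list of edition-sensitive words are all hypothetical choices made for this example; the text above only says that edition data would be embedded in the IDs of specific keywords.

```rust
use std::collections::HashMap;

/// Hypothetical edition tag; discriminant 0 is reserved for "untagged".
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Edition {
    E2015 = 1,
    E2018 = 2,
    E2021 = 3,
    E2024 = 4,
}

const EDITION_SHIFT: u32 = 29;
const INDEX_MASK: u32 = (1 << EDITION_SHIFT) - 1;

/// A 32-bit interned ID: 3 bits of edition tag, 29 bits of index.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Symbol(u32);

impl Symbol {
    pub fn index(self) -> u32 {
        self.0 & INDEX_MASK
    }

    /// The edition recorded for this occurrence, if the word is edition-sensitive.
    pub fn edition(self) -> Option<Edition> {
        match self.0 >> EDITION_SHIFT {
            1 => Some(Edition::E2015),
            2 => Some(Edition::E2018),
            3 => Some(Edition::E2021),
            4 => Some(Edition::E2024),
            _ => None,
        }
    }

    /// True if both symbols spell the same word, ignoring the edition tag.
    pub fn same_text(self, other: Symbol) -> bool {
        self.index() == other.index()
    }
}

/// Illustrative subset of words whose meaning depends on the edition.
fn is_edition_sensitive(text: &str) -> bool {
    matches!(text, "dyn" | "async" | "await" | "try" | "gen" | "if" | "let" | "extern")
}

#[derive(Default)]
pub struct Interner {
    indices: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    /// Intern `text`, tagging the ID with `edition` only for sensitive words.
    pub fn intern(&mut self, text: &str, edition: Edition) -> Symbol {
        let existing = self.indices.get(text).copied();
        let index = match existing {
            Some(index) => index,
            None => {
                let index = self.strings.len() as u32;
                assert!(index <= INDEX_MASK, "symbol table full");
                self.strings.push(text.to_owned());
                self.indices.insert(text.to_owned(), index);
                index
            }
        };
        let tag = if is_edition_sensitive(text) { edition as u32 } else { 0 };
        Symbol(index | (tag << EDITION_SHIFT))
    }

    pub fn resolve(&self, symbol: Symbol) -> &str {
        &self.strings[symbol.index() as usize]
    }
}
```

In this sketch, `async` interned from a 2015 crate and from a 2018 crate produces two different IDs that still resolve to the same string, while an ordinary identifier like `foo` gets the same ID regardless of edition. The cost is three fewer bits of index space, and comparisons that care only about spelling have to go through `same_text`.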
Step two: compute spans on demand. Compiler passes that transform IRs will come in two versions, one that excludes spans and one that includes them. By default, span-excluding passes will be used; but if a span within a particular item turns out to be needed, all previous passes for that item will be repeated with spans. This will require a lot of effort.
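A rough sketch of how the two pass versions could share one implementation: make each pass generic over a "span mode", so that the span-excluding version stores a zero-sized `()` and the span-including version stores a real span. This use of generics is an assumption on my part about how the idea might be realized, and the toy passes (`lex`, `check`) and driver (`compile_item`) are hypothetical stand-ins for real compiler passes.

```rust
/// Selects what a pass stores in place of a span.
pub trait SpanMode {
    type Span: Copy + std::fmt::Debug;
    fn span(lo: u32, hi: u32) -> Self::Span;
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Span {
    pub lo: u32,
    pub hi: u32,
}

/// Span-excluding mode: spans are zero-sized and never computed.
pub struct NoSpans;
/// Span-including mode: spans are tracked for diagnostics.
pub struct WithSpans;

impl SpanMode for NoSpans {
    type Span = ();
    fn span(_lo: u32, _hi: u32) -> Self::Span {}
}

impl SpanMode for WithSpans {
    type Span = Span;
    fn span(lo: u32, hi: u32) -> Self::Span {
        Span { lo, hi }
    }
}

pub struct Token<M: SpanMode> {
    pub text: String,
    pub span: M::Span,
}

/// Toy pass 1: split the source on whitespace.
pub fn lex<M: SpanMode>(src: &str) -> Vec<Token<M>> {
    let mut tokens = Vec::new();
    let mut start = None;
    // A trailing space is appended so the last token gets flushed.
    for (i, c) in src.char_indices().chain(std::iter::once((src.len(), ' '))) {
        match (c.is_whitespace(), start) {
            (false, None) => start = Some(i),
            (true, Some(lo)) => {
                tokens.push(Token {
                    text: src[lo..i].to_owned(),
                    span: M::span(lo as u32, i as u32),
                });
                start = None;
            }
            _ => {}
        }
    }
    tokens
}

/// Toy pass 2: every token must be an integer; report the first that is not.
pub fn check<M: SpanMode>(tokens: &[Token<M>]) -> Result<(), M::Span> {
    match tokens.iter().find(|t| t.text.parse::<i64>().is_err()) {
        Some(bad) => Err(bad.span),
        None => Ok(()),
    }
}

/// Driver: run without spans first; only if a span is needed for a
/// diagnostic, repeat all passes for this item with spans enabled.
pub fn compile_item(src: &str) -> Result<(), Span> {
    if check(&lex::<NoSpans>(src)).is_ok() {
        return Ok(());
    }
    check(&lex::<WithSpans>(src))
}
```

On valid input such as `"1 2 3"` only the span-free passes run. On `"1 two 3"` the first run fails without knowing where, and the rerun reports the byte range 2..5. The span-free `Token` is also smaller, since its span field takes no space.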
I’m curious to explore this; this could improve performance in ways that are very hard to measure (caching effects and unnecessary recompiles).
appendix: use sites
Here are all the use sites I found in rustc for Span::edition() and some related convenience functions, excluding cases where a diagnostic (warning/error) is being generated. It’s probably not exhaustive – there are many ways to access information from Span and I didn’t check them all – but I found it quite informative.
N.B. I tried to understand the context for these uses, but I may have gotten them wrong! If you are more familiar with the codebase and can correct me, please reach out.
- Raw lifetimes (e.g. `'r#a`) seem to be allowed from edition 2021; when the lexer is building a raw lifetime, it checks the span to ensure this is allowed.
- The interpretation of some meta-variables in declarative macros has changed across editions. `$t:expr` and `$t:pat` changed in editions 2024 and 2021 respectively, so the editions of the `expr` and `pat` identifiers are checked.
- Various keywords are introduced or changed across editions, including `dyn`, `async`, `await`, `try`, and `gen`. The parser checks their editions to figure out how to interpret them.
- This one was surprising: absolute paths, e.g. `::foo::bar`, are transformed into `{{root}}::foo::bar`, and `{{root}}` is a magic identifier (named `kw::PathRoot` in the compiler). The interpretation of absolute paths changed in edition 2018, so the edition of `{{root}}` (which is presumably based on the edition of the next token) is checked.
- The `TryInto` trait was added to the standard library prelude in edition 2021; when evaluating `foo.try_into()`, the type checker considers the edition of the `try_into` identifier.

  There are similar cases for `.poll()` and `.into_future()`.
- `IntoIterator` was implemented for `[T; N]` and `Box<[T]>` in edition 2021 and 2024 respectively; the edition of `into_iter` in `foo.into_iter()`, where `foo` is of one of these types, is checked.
- Because of the change in absolute paths in edition 2018, the `foo` in `pub(foo) struct Bar` changed meaning, so its edition is checked.
- In edition 2015, `#![no_implicit_prelude]` did not affect the `macro_use` prelude, so the editions of identifiers that end up being resolved against the `macro_use` prelude are checked.
- The span of a closure expression, e.g. `|| a + b`, is considered by the type checker; I’ve lost track of why.
- Universal function call syntax, e.g. `Foo::bar::method::<u32>()`, can be linted (based on the edition of the span of the whole expression), I think because of how absolute paths changed in edition 2018.
- In edition 2024, it became mandatory to precede `extern { ... }` blocks with `unsafe`. The edition of the `extern` keyword is checked.
- There is a nightly feature, introduced in edition 2024, to allow the pattern `mut ref mut foo`; `foo` would have type `&mut T`, and the first `mut` would mark `foo` as being mutable itself. The edition of that first `mut` is checked.
- `let a = b` can be interpreted as an expression for the purposes of `if let` and `let` chains. But the latter were only introduced in edition 2024; every time they are parsed, the edition of the `let` keyword is checked.
- Some standard-library functions that used to be safe became unsafe in edition 2024; the editions of call expressions referring to them are checked. At this moment, the affected functions are `std::env::{set_var,remove_var}` and the deprecated `std::os::unix::process::CommandExt::before_exec`.
- During type-checking, the edition of an opaque type like `impl Foo` is checked in certain places (incl. RPITIT) to determine whether precise capture rules, introduced in edition 2024, should be assumed.
- All the way down in MIR building, the edition of `if`-`else` (AFAICT of the leading `if`) is checked, because the scoping rules for temporaries changed subtly in edition 2024.