krabby: the architecture


by arya dradjica

I felt a bit stuck this week. While I have developed an interesting AST to experiment on, going any further with it is difficult. The next major step is name resolution, and that will require the ASTs of different files to interact with each other. My current testing / benchmarking setup, while good for executing things per file, can't deal with that. Now that I know these memory layout techniques will be effective, I decided to start over with a focus on krabby's high-level architecture.

the task architecture

At a high level, compilation is composed of a large number of small tasks, such as the parsing of a source file or the type-checking of a specific function. Every compiler defines some way for tasks to be generated and distributed throughout the system, whether it is implicit in the control flow or explicitly implemented as an "engine" of some kind. For example, rustc uses a query-based architecture.

While rustc's architecture works pretty well for incremental builds, I think non-incremental builds can be made faster. I'm intrigued by the idea of batched execution: it might be faster to execute similar tasks (e.g. the type-checking of some function bodies) in a group rather than independently. I'm also optimistic about an "eager" architecture, where some tasks (e.g. the parsing of a source file) are generated upfront even if we don't know whether they'll be used. In contrast to a "lazy" architecture like rustc's, this adds more opportunities for early parallelism, and is a nice optimisation if these "guessed" tasks are likely to be used.

My vision for krabby is an eager, push-based task architecture. For every kind of task (e.g. parsing, name resolution, type-checking), there is a queue of pending tasks and a set of runners consuming from that queue. Compilation starts from (for example) a "parse lib.rs" task, which will lead to new tasks being pushed onto the various queues. Runners will wait on these queues and execute tasks as they arrive.
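As a minimal sketch of this push-based design (the task kinds and field names here are illustrative, not krabby's actual ones), each kind of task gets its own queue, and executing one task pushes follow-up tasks onto the appropriate queues:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical task kinds; a real compiler would have many more.
enum Task {
    Parse { path: String },
    Resolve { file_id: u32 },
    TypeCheck { fn_id: u32 },
}

// One queue per task kind, so runners can consume a single kind in bulk.
struct Queues {
    parse: Mutex<VecDeque<Task>>,
    resolve: Mutex<VecDeque<Task>>,
    type_check: Mutex<VecDeque<Task>>,
}

impl Queues {
    fn new() -> Self {
        Queues {
            parse: Mutex::new(VecDeque::new()),
            resolve: Mutex::new(VecDeque::new()),
            type_check: Mutex::new(VecDeque::new()),
        }
    }

    // Route a task onto the queue matching its kind.
    fn push(&self, task: Task) {
        let queue = match task {
            Task::Parse { .. } => &self.parse,
            Task::Resolve { .. } => &self.resolve,
            Task::TypeCheck { .. } => &self.type_check,
        };
        queue.lock().unwrap().push_back(task);
    }
}

fn main() {
    let queues = Queues::new();
    // Compilation starts from a single root task...
    queues.push(Task::Parse { path: "src/lib.rs".into() });
    // ...and running it would push follow-up tasks, e.g.:
    queues.push(Task::Resolve { file_id: 0 });
    assert_eq!(queues.parse.lock().unwrap().len(), 1);
    assert_eq!(queues.resolve.lock().unwrap().len(), 1);
}
```

In a real implementation the queues would be concurrent (e.g. work-stealing deques) rather than mutex-guarded, but the routing logic is the same.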

There is a fixed number of worker threads, and each is capable of executing any kind of task (i.e. every thread has a runner for every kind of task). A work-stealing architecture is used: tasks default to being added to a thread-local queue, but an idle thread can steal tasks from its siblings' queues. This way, threads prioritize working on data that is already in their local cache. Runners representing earlier stages (e.g. lexing) are prioritized over later ones, because they have longer critical paths: their outputs feed more downstream work, so they need to be executed early for compilation to finish quickly.
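One way to combine the two priorities is to scan stages from earliest to latest, and within each stage prefer the local queue before stealing. This is only a sketch of that selection logic under those assumptions (tasks reduced to IDs, mutex-guarded queues standing in for proper work-stealing deques, made-up stage names):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Stage indices; lower = earlier in the pipeline, so higher priority.
// (The stage list is illustrative, not krabby's actual one.)
const STAGES: usize = 3; // 0 = lex/parse, 1 = resolve, 2 = type-check

type Queue = Mutex<VecDeque<u32>>; // tasks are just IDs in this sketch

struct WorkerThread {
    local: [Queue; STAGES],
}

// Pick the next task for thread `me`: scan stages from earliest to
// latest; within a stage, prefer the local queue (its data is likely
// in this core's cache), then try to steal from sibling threads.
fn next_task(me: usize, threads: &[WorkerThread]) -> Option<(usize, u32)> {
    for stage in 0..STAGES {
        if let Some(task) = threads[me].local[stage].lock().unwrap().pop_back() {
            return Some((stage, task));
        }
        // Steal from the oldest end of a sibling's queue.
        for (i, other) in threads.iter().enumerate() {
            if i == me { continue; }
            if let Some(task) = other.local[stage].lock().unwrap().pop_front() {
                return Some((stage, task));
            }
        }
    }
    None
}

fn main() {
    let threads: Vec<WorkerThread> = (0..2)
        .map(|_| WorkerThread {
            local: std::array::from_fn(|_| Mutex::new(VecDeque::new())),
        })
        .collect();
    threads[0].local[2].lock().unwrap().push_back(7); // local type-check task
    threads[1].local[0].lock().unwrap().push_back(3); // sibling's parse task
    // The parse task wins: earlier stages come first, even via stealing.
    assert_eq!(next_task(0, &threads), Some((0, 3)));
    assert_eq!(next_task(0, &threads), Some((2, 7)));
    assert_eq!(next_task(0, &threads), None);
}
```

Note the trade-off this encodes: stage priority beats cache locality, so a thread will steal an early-stage task rather than run its own late-stage one. The opposite ordering is also defensible; which wins in practice is a question for benchmarks.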

Runners attempt to load tasks in batches. This way, the CPU is executing the same code repeatedly, which means that it will stay in (and benefit from) the code cache. Combined with the memory layout techniques from previous posts, this opens up opportunities for vectorization and batching across tasks. I'm particularly excited for tasks that perform lots of "database" lookups (e.g. trait resolution).
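A batched runner might look roughly like this sketch (the batch size and helper name are my own, not krabby's): grab up to N tasks under one lock acquisition, release the lock, then run them back-to-back so the task's code stays hot in the instruction cache.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Pull up to `batch` tasks in one lock acquisition, then run them
// back-to-back. Returns how many tasks were executed.
fn run_batch<T>(queue: &Mutex<VecDeque<T>>, batch: usize, run: impl Fn(T)) -> usize {
    let mut taken = Vec::with_capacity(batch);
    {
        let mut q = queue.lock().unwrap();
        for _ in 0..batch {
            match q.pop_front() {
                Some(task) => taken.push(task),
                None => break,
            }
        }
    } // lock released before any task runs
    let n = taken.len();
    for task in taken {
        run(task);
    }
    n
}

fn main() {
    let queue = Mutex::new((1..=5).collect::<VecDeque<u32>>());
    let sum = std::cell::Cell::new(0);
    assert_eq!(run_batch(&queue, 3, |t| sum.set(sum.get() + t)), 3);
    assert_eq!(run_batch(&queue, 3, |t| sum.set(sum.get() + t)), 2);
    assert_eq!(run_batch(&queue, 3, |_| {}), 0);
    assert_eq!(sum.get(), 15);
}
```

The further step hinted at above — handing the whole batch to one function so it can share lookups or vectorize across tasks — would change the signature to take `Vec<T>` instead of running tasks one at a time.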

Tasks can be added to the system speculatively. For example, the compilation of a crate might start with a "parse all Rust files within the src directory" task. Even though some of these files might not be included in the crate, it is highly likely that they will be, so parsing them upfront (instead of waiting for mod statements to be discovered in parsed files) is a good idea.
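Seeding the speculative parse tasks could be as simple as a directory walk — here is a sketch (the function name is hypothetical) that collects every `.rs` file under `src/` without waiting for `mod` declarations:

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Speculatively collect every `.rs` file under `src` as a parse task,
// rather than waiting for `mod` statements to reveal which are needed.
fn speculative_parse_tasks(src: &Path) -> std::io::Result<Vec<PathBuf>> {
    let mut tasks = Vec::new();
    let mut dirs = vec![src.to_path_buf()];
    while let Some(dir) = dirs.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                dirs.push(path);
            } else if path.extension().map_or(false, |e| e == "rs") {
                tasks.push(path);
            }
        }
    }
    Ok(tasks)
}
```

If a file turns out not to belong to the crate, its parse result is simply discarded; the bet is that wasted parses are rare and the early parallelism pays for them.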

daemonization

While rustc performs some parallelization, it was not originally designed for it. Much like build systems drive C compilers, Cargo would (and still does) execute rustc in parallel, once per crate in the dependency graph, so parallelism within a single crate mattered less. But each invocation of rustc incurs startup overhead, tasks can't be batched across crates, and a crate can only start compiling once its dependencies have finished.

rustc is also incompatible with the needs of a language server, and so rust-analyzer has to re-implement a lot of compilation logic. At a high level, a language server basically re-compiles your code repeatedly with very small incremental changes; in theory, this could be supported as part of the compiler's standard incremental compilation architecture.

I want krabby to operate like a daemon. It can compile multiple crates, and a language server can be implemented in the same codebase (reusing most tasks, with slightly different implementations of some). By using a long-lived daemon, I can cache data in memory (as a faster alternative to the disk), execute tasks in bigger batches, and share resources across different compilations. By providing a wrapper that looks like rustc but submits jobs to the daemon, I can support any existing build system (though I will build in dedicated support for Cargo for better performance).

As a fallback, for conditions where a daemon is unsuitable (e.g. due to security restrictions or containerization), a one-shot runner executable will also be provided. This will offer the same functionality as the daemon, but will only be capable of executing an initially-provided set of tasks. It will be faster to start up but more costly if used repeatedly.

implementation

Here's the plan, in order: