krabby: trying out parent ptrs


by arya dradjica on

I was planning to write a post about the overall architecture I envision for krabby, but there are some complicated bits (regarding macro expansion) that don't feel resolved yet. Maybe that'll come next week.

In the meantime, I decided to tackle the first non-trivial (but relatively simple) compilation stage I could think of: local name resolution. This doesn't try to resolve imports or references to items (e.g. functions), but rather only the variables defined within a function body. While my code certainly isn't perfect, it was a great experiment, and it's helped me figure out parts of the compiler infrastructure that will be crucial as we go on.

Since I didn't want to get bogged down in lexing and parsing, I decided to just use syn. It's a crucial library for writing proc macros, as they only get token streams as input, and syn is universally used to parse those into easy-to-use ASTs. It turns out it can parse plain text too! It made it really easy to get a parsing stage going.

I spent a few hours poring over the pertinent internal documentation and source code in rustc, trying to figure out how local name resolution worked. The first thing I noticed was that rustc uses a high-level AST format that looks really similar to that provided by syn. Apparently, rustc doesn't distinguish local and global name resolution, and performs them in the same "late stage" step (where macro-related processing is the "early" step). It's implemented as a simple top-down traversal of the high-level AST. syn actually provides the same traversal technique—this makes it really straightforward to replicate.

Once I knew how rustc worked, it was time to optimize. I was surprised by the memory layout of its AST; it uses lots of small heap allocations, and it's a nightmare for the CPU cache and prefetcher. I was sure that an optimized memory layout would result in much better performance, but I had to be careful about hyper-focusing on local name resolution—there are many other stages in the compiler, and they might benefit from a different memory layout. Rather than try to optimize for compiler stages I don't know about, I decided to take a step back and do some research.

prior work

I have two sources of inspiration regarding efficiency in compiler architecture. Concerted efforts at designing fast compilers can lead to great success, but I've also noticed that the most widely used compilers don't try things like this. Maybe it's because big, well-established codebases, where the operational logic is incredibly complex and hard to test, don't lend themselves to complete rewrites. This is the motivation for krabby: I want to learn what good compiler architecture looks like before I try to write a serious compiler. But the success stories I've seen (specifically, the Zig compiler and Co-dfns) are amazing.

Both these projects have a lot to say about compiler design, and I'm going to be referencing them constantly as I work on krabby. For now, I'm just going to share my findings on data structures and memory layout, because the two projects have different approaches. In particular, I read the Zig release notes for 0.8.0 (the section on memory layout changes), and Aaron Hsu's dissertation introducing Co-dfns.

Zig uses a highly compact AST representation that is built on an SoA (struct-of-arrays) format. Every AST node is serialized as a 1-byte tag (identifying what kind of node it is) and an 8-byte data pack. The contents of the data pack vary, but for binary expressions like addition, it's simply two 4-byte indices identifying the left and right operands, which are AST nodes themselves. There are lots of additional considerations (e.g. a lot of information can't fit in the tag or data pack, leading to overflow mechanisms), but this is basically it. The nodes are laid out in pre-order traversal (i.e. node, left child and its contents, right child and its contents). Importantly, all the different kinds of AST nodes are packed into the same array; thus, the only way to find specific nodes (e.g. addition operations or function bodies) would be to scan through the whole AST.

Section 3.2 of Hsu's dissertation is a great discussion on choosing a memory layout, and it covers the same sort of ground as this section. My understanding (of the whole dissertation, really) is that Hsu advocates for bottom-up AST traversal as a technique for unlocking fine-grained parallelism. Conventional compilers (including Zig) focus on top-down AST traversal, where you have to search the whole tree to find the nodes you care about for a specific transformation. Co-dfns uses bottom-up traversal, where you start from the nodes you want (which are easy to find, because of the format of the AST), and iteratively look up parent nodes until you find all the information you need.

While both of these projects have interesting things to say, I think Co-dfns takes things much further than Zig does here. Local name resolution is actually pretty well suited to bottom-up traversal, because you can start from references to variables and search through their parent scopes iteratively. Hsu's dissertation spends some time discussing the lower-level details of memory layouts; but it's just describing the SoA format, which I'm already familiar with. I decided to try implementing local name resolution using bottom-up traversal, but with more lessons about memory layouts from Zig and my personal experience.

implementation

I set up a Rust crate with two big modules: flat, which uses an efficient memory layout and tries to optimize everything, and boxy, which replicates rustc and serves as a reference point. I implemented a local name resolver for both modules: flat uses bottom-up traversal while boxy does top-down traversal. While local name resolution isn't that complicated, the bottom-up traversal needs some explanation, and the experience of writing both implementations is an important data point. I've put the code on SourceHut.

Before we get to the actual name res, I should quickly explain the parsing setup. At some point in the future, I will write a parser for flat that parses directly into an optimized AST representation; I expect this to be much faster than syn. For now, I have to transform syn's AST into the format flat expects; I perform this as a setup step and don't count it towards benchmarking. I'm only going to compare the local name resolution steps as performed on the preferred data format for each implementation.

I'll start with the boxy implementation because it's simpler. I initially tried writing my own AST structure to parse syn's output into, but it grew into a lot of code, so I decided to just use syn's format for now. Maybe I'll come back to the benchmarks below when I have the custom structure prepared, as that will likely improve boxy's performance a bit (e.g. I don't intern identifiers right now).

Beyond that, it's a simple AST traversal that maintains a stack of scopes. Every time it reaches a node such as a function definition or a block expression, it pushes a new scope on the stack; it will be popped when exiting that node. Every local variable declaration (from function/closure parameters, match arms and let statements) is added to the bindings known for the innermost scope. Every expression that appears to reference a local variable will be resolved by looking for a matching binding, from the innermost to the outermost scope (ending at a function boundary). If a reference can't be resolved, it's fine -- it needs to be left to "global" name resolution, which accounts for items (functions, types, etc.) and imports.

Okay, it's time to get into the fun stuff. In order to look up names using a bottom-up traversal, we need an array containing every reference, and we need to be able to find the scope of each reference. I wasn't particularly excited to write a whole AST in the optimized memory layout (just yet), so I decided to make a data structure specific to name resolution: a scope tree. The scope tree holds a hierarchy of scopes, and a list of identifiers declared within each scope. Within a scope, later identifiers can shadow (i.e. replace) previous ones, so every declaration and reference keeps track of where in its scope (i.e. how many declarations in) it is located. The first step of local name resolution would be to extract the scope tree, and the list of references, from the AST. Once it's ready, we can resolve each reference by looking up its scope, checking every declaration (in reverse order) to find a matching identifier, and repeating the process with the parent scope (if any).

While I'd like to shed some more light on the memory layout of the scope tree, it's hard to explain without visuals, and I'm pretty lazy. The source code for the flat implementation is quite clean (unlike boxy, because it was easier to get away with), so I'd just recommend reading it. The crux of the layout is parent pointers: every scope tracks the index of its parent (and its location within the parent, for shadowing), and we repeatedly look these up to traverse upwards. I don't use hash maps for the bindings in each scope: most scopes only hold 10-20 declarations, and a linear scan is more performant.

benchmarking

I spent some time looking for big repositories of source code to benchmark against. My first thought was to try rustc itself, but it uses a bunch of nightly features which means that syn can't parse it. I was considering using crater, which the Rust project uses to test compiler changes against the crates.io ecosystem; but it looked like quite a bit of hassle to set up. Eventually, I did come up with a great solution: I just used my local Cargo cache (typically located at ~/.cargo/registry/src). It contains the source code of every crate you've installed or used as a dependency. In my case, it was almost a gigabyte in size, and it proved to be more than enough code.

At some point, I'll implement reading Cargo.toml and discovering crates; for now, krabby just reads a list of paths to Rust files from standard input. There isn't much more to say: here are the benchmarking results with 16,000 files / 315 mebibytes / 8 million lines of code:

--- Parsing with syn
    ... done (26.631s)
        memory usage: +11569.488M
--- Resolving local names (boxy)
    ... got 1365432 resolutions
    ... done (0.897s)
        memory usage: +0.465M
--- Building the scope tree
    ... done (1.082s)
        memory usage: +73.297M
--- Resolving local names (flat)
    ... got 1687238 resolutions
    ... done (0.025s)
        memory usage: +0.000M

The results are striking. Local name resolution with flat is about 35 times faster than with boxy. There's definitely some bias towards flat here, since I haven't accounted for the time to build the scope tree (which needs to be built from an optimized AST structure which doesn't exist right now), but it still looks really impressive. Also note that the scope tree is 150 times smaller than syn's AST; while it includes significantly less information, I expect flat's full AST (when I get around to implementing it) to still be significantly smaller than syn's.

conclusions

The initial results are impressive, exciting, and most importantly, really motivating. There are a lot of things to be done next, and I'm trying to avoid hyper-focusing on one aspect and/or burning myself out. The Co-dfns approach works really well, but I need to investigate it further, especially because Hsu's dissertation doesn't cover advanced compiler stages like type checking. But overall, I'm just really excited.