Continuing from the proof of concept, I'm making good progress on Krabby! I've accomplished three major things in the last two weeks: a faster TOML parser, a custom lexer, and, most importantly, the completion of the task architecture. I'll need all of this infrastructure before I can tackle name resolution, and it's coming together very well.
parsing TOML
The proof of concept implementation would print trace logs of all the tasks it was executing, along with helpful latency measurements. During my testing, I noticed something a bit strange: the first page of trace logs was filled with threads not having enough work to do. This was somewhat expected: until the Cargo manifest of the project to be checked is parsed, there is no work to parallelize. Still, it was a bit frustrating, and it seemed to go on for too long, so I checked the latency of the parsing task.
It turned out that parsing a simple 80-line Cargo manifest was taking Krabby about 250-300 microseconds. This was significantly longer than I expected. The latency of opening and reading the file into memory was only 10 microseconds. 250 microseconds (about 750,000 cycles) is an absurd amount of time to spend parsing just 2,500 bytes. My implementation was deferring to the `cargo_toml` crate to parse the text, and then extracting some fields from the parsed manifest data. I was fairly sure the extraction wasn't where the time was going, so that left the TOML parsing itself.
A quick peek at the code showed that `cargo_toml` was simply relying on `serde` to do the actual parsing, via the standard `toml` crate. I've heard that `serde` (or some serde formats) can be somewhat inefficient at times, but I was incredibly surprised by the size of the slowdown. While I could have dug through the source code of `serde` and `toml` to identify the performance killer, I thought it would be easier to just write a new TOML parser from scratch. I estimated that a custom TOML parser could run 10 to 20 times faster.
I implemented a scheme I've seen referred to as "pull parsing", where every unit of information in the parsed file (e.g. a table section or a key-value entry) is represented as an "event" produced by the parser. I took some care to avoid unnecessary allocations and copies: when an unquoted key or a string literal without escapes was used, I would reference it directly from the source file. The parser ended up being about 1,500 lines of code, which is not bad at all.
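To give a sense of the shape of this API, here is a simplified sketch (not the actual implementation; all names here are illustrative). The parser hands out borrowed events one at a time, and the caller drives it in a loop:

```rust
use std::borrow::Cow;

/// One unit of information pulled from the TOML source. Borrowing `&'src str`
/// from the input avoids copies for unquoted keys and escape-free strings.
#[derive(Debug)]
enum Event<'src> {
    /// A `[table.header]` line; the path segments borrow from the source text.
    TableHeader(Vec<&'src str>),
    /// A `key = value` entry in the current table.
    KeyValue { key: &'src str, value: Value<'src> },
}

#[derive(Debug)]
enum Value<'src> {
    /// `Cow::Borrowed` when the literal contains no escapes; owned otherwise.
    Str(Cow<'src, str>),
    Int(i64),
    Bool(bool),
}

struct Parser<'src> {
    src: &'src str,
    pos: usize,
}

impl<'src> Parser<'src> {
    fn new(src: &'src str) -> Self {
        Parser { src, pos: 0 }
    }

    /// Pull the next event, or `None` at end of input. The actual lexing is
    /// elided in this sketch; the point is that the caller drives the parser
    /// by repeatedly pulling events instead of receiving a full document tree.
    fn next_event(&mut self) -> Option<Result<Event<'src>, String>> {
        todo!()
    }
}
```

The consumer just loops on `next_event`, dispatching each event as it arrives, so no intermediate document tree is ever built.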
I ended up with an interesting architecture for consuming parse events. I defined some "Cargo manifest specification" types, whose fields and layout correspond exactly to the Cargo manifest format. These types then implemented a `ConsumeTOML` trait, through which they could consume parse events and update their data. If a parse event targeted an unexpected location, or used the wrong type (e.g. introduced an array where a table was expected), an error would be returned. When an event had to be delegated to a sub-table, the sub-table's event-consuming functions could be called directly. The whole system composed really well; I should probably set up some derive macros and publish the whole thing as a standalone crate.
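In simplified form, the idea looks something like this, with the value type and the manifest fields pared down for illustration (the field set and error handling here are just for the sketch):

```rust
/// Pared-down value type for this sketch; the real parser handles far more.
#[derive(Debug)]
enum TomlValue<'src> {
    Str(&'src str),
    Bool(bool),
}

trait ConsumeTOML<'src> {
    /// Consume one `key = value` event addressed to this table, returning
    /// an error for unknown keys or wrongly-typed values.
    fn consume(&mut self, key: &'src str, value: TomlValue<'src>) -> Result<(), String>;
}

/// A specification type mirroring (part of) the `[package]` table.
#[derive(Default)]
struct PackageSpec<'src> {
    name: Option<&'src str>,
    edition: Option<&'src str>,
}

impl<'src> ConsumeTOML<'src> for PackageSpec<'src> {
    fn consume(&mut self, key: &'src str, value: TomlValue<'src>) -> Result<(), String> {
        match (key, value) {
            // Borrowed strings are stored directly: no allocation, no copy.
            ("name", TomlValue::Str(s)) => { self.name = Some(s); Ok(()) }
            ("edition", TomlValue::Str(s)) => { self.edition = Some(s); Ok(()) }
            // Known key with the wrong value type.
            ("name" | "edition", v) => Err(format!("expected a string, got {v:?}")),
            // Unknown key: an unexpected location in the manifest.
            (key, _) => Err(format!("unexpected key `{key}`")),
        }
    }
}

/// The top-level manifest spec composes sub-table specs.
#[derive(Default)]
struct ManifestSpec<'src> {
    package: PackageSpec<'src>,
}

impl<'src> ManifestSpec<'src> {
    /// Delegating to a sub-table is a plain method call on its spec type.
    fn consume_in_table(
        &mut self,
        table: &str,
        key: &'src str,
        value: TomlValue<'src>,
    ) -> Result<(), String> {
        match table {
            "package" => self.package.consume(key, value),
            other => Err(format!("unexpected table `[{other}]`")),
        }
    }
}
```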
The end result was perfect: the manifest loading task dropped from 250-300 microseconds down to just 30 microseconds. I did have some further optimization ideas, but I don't think they're going to be worthwhile.
the new lexer
Lexing is the only computationally intensive task the compiler is capable of right now, so we may as well focus on it while we can. While there are plenty of interesting computational optimizations I could perform (I'm especially excited about SIMD approaches), I decided to start light and use some more efficient data structures. Until now, I was just using `proc-macro2`, and had copied and modified its data structures to make them thread-safe. Now I've introduced a "token buffer" representation that is about 5 times more compact, and that should provide much better cache locality. There are a number of specific optimizations and design choices here:
- A struct-of-arrays (SoA) memory layout is used, where every token is represented by three fields: an 8-bit tag, a 32-bit data field, and a 32-bit byte position (see the sketch after this list). `proc-macro2` used 40 bytes for every token (at least on 64-bit architectures); the new representation is 4 times more compact in this regard.
- The token buffer representation is flattened; nested token streams are inlined into the same buffer, at the exact point they are used. This provides better cache locality, especially for the (upcoming) parser, where the tokens between delimiters will be parsed before moving on to later components. Even with the flattened representation, a token stream nested within the buffer can be represented as a range of indices, so there is no loss of functionality regarding access to nested token streams; they are just more efficient to access.
- While the `proc-macro2` approach uses `Rc` to re-use token streams when they appear in multiple locations, this will never happen during Krabby's lexing or parsing. I have some ideas for how to approach this when we get to macro expansion, and I feel pretty confident that it will work out.
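To make the layout concrete, here is a simplified sketch of such a buffer. The names are illustrative, and I'm assuming, purely for this example, that an open delimiter's data field stores the index just past its matching close delimiter, which is one way to recover nested streams as index ranges:

```rust
/// Simplified sketch of the flattened, struct-of-arrays token buffer.
/// Each token costs 9 bytes spread across three parallel arrays, versus
/// ~40 bytes per token in proc-macro2's representation.
struct TokenBuffer {
    /// What kind of token this is (identifier, literal, open/close delimiter, ...).
    tags: Vec<u8>,
    /// Tag-dependent payload, e.g. an interned-symbol index; for an open
    /// delimiter (in this sketch), the index just past the matching close.
    data: Vec<u32>,
    /// Byte offset of the token in the source file.
    positions: Vec<u32>,
}

/// A nested token stream is just a range of indices into the flat buffer,
/// so "entering" a delimited group costs nothing and stays cache-friendly.
#[derive(Clone, Copy)]
struct TokenRange {
    start: u32,
    end: u32,
}

impl TokenBuffer {
    /// Given the index of an open delimiter, recover the nested stream
    /// between the delimiters as a simple index range.
    fn group_contents(&self, open_delim: u32) -> TokenRange {
        TokenRange {
            start: open_delim + 1,
            // data[open] points one past the close delimiter, so the
            // contents end right before the close delimiter itself.
            end: self.data[open_delim as usize] - 1,
        }
    }
}
```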
Initially, I would simply call `proc-macro2`'s lexer, traverse the resulting token stream, and build up the token buffer from it. I have now inlined that lexing code and modified it to build the token buffer directly. I'm also making some more adjustments to make the lexer implementation more readable, rather than necessarily faster; I will provide separate performance-focused implementations later.
the task architecture
While I was able to demonstrate a proof of concept of Krabby in action, there was something missing. Running `krabby build` would certainly try to build the packages you threw at it, but I neglected to mention that the command would never terminate. Krabby was simply unable to tell when it was out of work, and its worker threads polled for tasks in a hot loop. After finishing the requested tasks, it would max out CPU usage and output an infinite stream of meaningless logs. Oops!
The individual tasks in the compiler were already associated with some concept of a context, which represented why the task was being executed and routed its outputs towards the next compilation step. I've now built a hierarchy around this context, so that tasks know who called them. Most importantly, this adds an event hook that is called when a context is finished (i.e. when its last child is finished). This allows me to determine when a high-level task (e.g. "check a crate") is finished, and whether it was successful. And `krabby build` correctly terminates now!
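In simplified form, the idea looks something like this: each context tracks its own task plus its live children with an atomic counter, and fires its hook when the counter hits zero. The names and the exact counting scheme here are just for the sketch:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Sketch of a hierarchical task context with a completion hook.
struct Context {
    parent: Option<Arc<Context>>,
    /// Live children, plus one for the context's own task.
    pending: AtomicUsize,
    /// Called exactly once, when the last pending unit finishes.
    on_finished: Box<dyn Fn() + Send + Sync>,
}

impl Context {
    fn root(on_finished: impl Fn() + Send + Sync + 'static) -> Arc<Self> {
        Arc::new(Context {
            parent: None,
            pending: AtomicUsize::new(1), // the root's own task
            on_finished: Box::new(on_finished),
        })
    }

    /// Spawn a child context: the parent cannot finish until the child does.
    fn child(self: &Arc<Self>, on_finished: impl Fn() + Send + Sync + 'static) -> Arc<Self> {
        self.pending.fetch_add(1, Ordering::Relaxed);
        Arc::new(Context {
            parent: Some(self.clone()),
            pending: AtomicUsize::new(1), // the child's own task
            on_finished: Box::new(on_finished),
        })
    }

    /// Mark this context's own task, or one of its children, as done.
    fn complete_one(&self) {
        if self.pending.fetch_sub(1, Ordering::AcqRel) == 1 {
            // Last pending unit: fire the hook, then release the parent.
            (self.on_finished)();
            if let Some(parent) = &self.parent {
                parent.complete_one();
            }
        }
    }
}
```

The hook on the root context is what tells the main thread that the requested work is done and the workers can shut down.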
what's next
Now that compilation actually terminates, it's significantly easier to benchmark and profile. Rather than moving forward into parsing, I'm going to take some time to optimize the lexer. It turns out less than 50% of the runtime is spent actually lexing; a lot of time is lost in the work-stealing architecture and in some concurrent data structures I neglected to optimize. I've been dreaming about vectorized lexing for about three years now, but I'm going to address my data structures first. It'll be fun!