introducing takeaway


by arya dradjica on

I've written a work-stealing task queue system for high-performance, CPU-bound applications, called takeaway. It offers an intuitive high-level API, and its internal architecture enables some unique features. I wrote it because the only popular work-stealing library for Rust, crossbeam-deque, does not support task prioritization, a feature I need for my very-very-WIP Rust compiler (krabby).

It's taken 2 months of work, but I'm really happy with the current state of takeaway. I've released v0.1.0 as the first "stable" release; I expect to make some (minor) breaking changes in the future, but I think it's ready for public use. If you're writing a high-performance, task-based application like mine, please check it out! It offers competitive (and often better) performance than crossbeam-deque, with a nicer API and more features. Below, I'll expand on takeaway's design and the process of implementing it.

my use case

I'm writing a Rust compiler called krabby because I want to explore novel ideas in compiler design. Like any compiler, its workload consists of individual tasks like parsing files and type-checking functions. It tries to optimize here in two ways: by defining very fine-grained tasks, so that it can distribute work among CPUs more evenly; and by prioritizing certain tasks over others. For example, parsing tasks would be prioritized over type-checking tasks, because the former comes earlier in the compilation pipeline (it has a longer critical path).
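To make this concrete, here's a minimal sketch of the kind of priority ordering krabby relies on. The `Priority` enum and the task names are hypothetical, invented for this example; they're not krabby's or takeaway's actual types:

```rust
// Hypothetical priority levels for compiler tasks. Deriving Ord means later
// variants compare as greater, so the lower-priority kind is listed first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Priority {
    TypeCheck, // later in the pipeline: shorter critical path
    Parse,     // earlier in the pipeline: longer critical path
}

fn main() {
    let mut tasks = vec![
        (Priority::TypeCheck, "check fn foo"),
        (Priority::Parse, "parse lib.rs"),
        (Priority::TypeCheck, "check fn bar"),
        (Priority::Parse, "parse main.rs"),
    ];
    // Sort highest priority first: parsing tasks precede type-checking ones.
    tasks.sort_by(|a, b| b.0.cmp(&a.0));
    assert_eq!(tasks[0].0, Priority::Parse);
    assert_eq!(tasks[1].0, Priority::Parse);
}
```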

I initially used crossbeam-deque here, but its lack of support for prioritization became a hassle. For a while, I worked around it by running 3 independent task queue instances simultaneously, but this only added overhead. I decided to write my own task queue system before progressing further with the actual compiler tasks.

Before going further, I wrote out the design requirements I had for the task queue. The central one, prioritization, is fundamentally at odds with crossbeam-deque's architecture: internally, it keeps thread-specific tasks in a concurrent ring buffer, where every enqueue and dequeue operation is atomic. In that architecture, there isn't any opportunity to sort the tasks by priority.

the design

I initially considered storing all tasks in a global concurrent B-tree, sorted by priority. While this would offer perfect prioritization, it would suffer from a lot of contention (both for enqueueing and dequeueing) and lose all thread locality.

To resolve both those issues, I decided to avoid a global data structure for tasks, and store them in thread-specific spaces. My global data structure would instead track all the worker threads, providing information about where to steal from. Each worker thread would "publish" a set of tasks with a known priority, allowing stealers to filter for high-priority tasks.

More specifically, each worker thread will occasionally sort its internal queue of tasks, pick some high-priority ones to publish, and also publish their minimum priority. A stealer will only steal that set if its minimum priority is greater than any locally-available priority; this ensures tasks will not bounce back and forth between threads. This system cannot guarantee that the highest-priority tasks are the ones executing at any given moment; rather, it ensures that relatively high-priority tasks are executed on a best-effort basis.
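As a rough illustration of this publish-and-filter scheme, here's a mutex-based sketch. The real implementation is lock-free and far more subtle; the `Registry` and `Published` types, and the use of plain numbers as both tasks and priorities, are my own inventions for this example:

```rust
use std::sync::Mutex;

// A set of tasks a worker has made available for theft.
// Tasks are just priority numbers here, for brevity.
struct Published {
    min_priority: u32,
    tasks: Vec<u32>,
}

// The global structure: one slot per worker thread, holding its currently
// published set (if any). Stealers scan the slots.
struct Registry {
    slots: Vec<Mutex<Option<Published>>>,
}

impl Registry {
    fn new(workers: usize) -> Self {
        Registry {
            slots: (0..workers).map(|_| Mutex::new(None)).collect(),
        }
    }

    // Called by a worker when it refreshes its published set.
    fn publish(&self, worker: usize, tasks: Vec<u32>) {
        let min_priority = tasks.iter().copied().min().unwrap_or(0);
        *self.slots[worker].lock().unwrap() = Some(Published { min_priority, tasks });
    }

    // Take the first published set whose minimum priority beats the caller's
    // best local priority; this is the rule that keeps tasks from bouncing.
    fn steal(&self, local_best: u32) -> Option<Vec<u32>> {
        for slot in &self.slots {
            let mut guard = slot.lock().unwrap();
            if guard.as_ref().map_or(false, |p| p.min_priority > local_best) {
                return guard.take().map(|p| p.tasks);
            }
        }
        None
    }
}

fn main() {
    let registry = Registry::new(2);
    registry.publish(0, vec![8, 9, 7]);
    // Our best local task (9) outranks the published minimum (7): don't steal.
    assert_eq!(registry.steal(9), None);
    // The published minimum (7) beats our best (5): steal the whole set.
    assert_eq!(registry.steal(5), Some(vec![8, 9, 7]));
    // The slot is now empty.
    assert_eq!(registry.steal(5), None);
}
```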

Publishing tasks for theft, sorting the local queue of tasks, and stealing new tasks are all expensive operations that should only be performed occasionally. This frequency is encoded by the concept of a batch: a fixed number of tasks a worker thread will process before performing these expensive operations. When a worker's batch is depleted, it will collect all tasks it knows about, sort them by priority, steal new tasks, select tasks for the new batch, and publish tasks for theft.
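The batch lifecycle can be sketched as a single-threaded simulation, with stealing and publishing elided. The `Worker` struct and its methods are illustrative, not takeaway's API:

```rust
// Illustrative batch size; takeaway makes this configurable.
const BATCH: usize = 4;

struct Worker {
    batch: Vec<u32>,    // tasks to run before the next refill
    backlog: Vec<u32>,  // everything else, left unsorted between refills
    executed: Vec<u32>, // record of execution order, for demonstration
}

impl Worker {
    // The occasional expensive step: sort by priority, carve off a batch.
    // (The real version also steals new tasks and publishes some for theft.)
    fn refill(&mut self) {
        self.backlog.sort_unstable_by(|a, b| b.cmp(a)); // descending priority
        let take = BATCH.min(self.backlog.len());
        self.batch = self.backlog.drain(..take).collect();
        // pop() takes from the back, so reverse to run highest priority first.
        self.batch.reverse();
    }

    // The cheap steady state: just pop tasks until the batch is depleted.
    fn run(&mut self) {
        while !self.batch.is_empty() || !self.backlog.is_empty() {
            if self.batch.is_empty() {
                self.refill();
            }
            if let Some(task) = self.batch.pop() {
                self.executed.push(task);
            }
        }
    }
}

fn main() {
    let mut w = Worker {
        batch: Vec::new(),
        backlog: vec![3, 9, 1, 7, 5, 8, 2, 6],
        executed: Vec::new(),
    };
    w.run();
    // Within each batch of 4, higher-priority tasks ran first.
    assert_eq!(w.executed, vec![9, 8, 7, 6, 5, 3, 2, 1]);
}
```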

There's one last consideration here: in the worst case, a single thread has all the highest-priority tasks, and they need to be distributed to the remaining threads quickly. Which tasks should a worker thread publish for theft? It can't be all the highest-priority tasks, because some of them should be executed locally, and it can't be the lower-priority tasks, because they're not the ones we want to distribute. My system opts to publish every other high-priority task; this ensures that the priority distribution of locally-executed tasks roughly matches that of the published ones.
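The every-other-task split might look something like this; again an illustrative sketch (with priorities standing in for tasks), not takeaway's actual code:

```rust
// Given tasks already sorted by descending priority, alternate them between
// the local set and the published set, so both get a similar priority mix.
fn split_for_theft<T>(sorted: Vec<T>) -> (Vec<T>, Vec<T>) {
    let mut local = Vec::new();
    let mut published = Vec::new();
    for (i, task) in sorted.into_iter().enumerate() {
        if i % 2 == 0 {
            local.push(task);
        } else {
            published.push(task);
        }
    }
    (local, published)
}

fn main() {
    let (local, published) = split_for_theft(vec![9, 8, 7, 6, 5]);
    // Both halves span the full priority range.
    assert_eq!(local, vec![9, 7, 5]);
    assert_eq!(published, vec![8, 6]);
}
```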

the implementation

Once I had a design, it only took a week to implement an initial prototype. By the second week, I was reasonably happy with the code, as an internal module within krabby. I then decided to publish it as a standalone crate for others to use as well; this took a surprising 6 weeks, where I restructured the code, significantly developed the API, and improved performance. I'm happy with its state now, but it was frustrating to see so much time slip by.

I really enjoyed the difficulties of concurrent programming, and I learnt a lot. The implementation is littered with unsafe code, and it became crucial to document the assumptions and invariants I was maintaining at every step of control flow. I also developed some incredibly complex atomic variables, with many states and operations by different threads; I began drawing state machine diagrams to understand them better. This was the only way I could catch some particularly nasty deadlocks. I plan to add support for loom one day, but it's not straightforward.

The API initially reflected crossbeam-deque quite heavily, but as I wrote example programs and learnt usage patterns, it evolved to do a lot of work for the user. It can automatically track when the system runs out of tasks and shut down the queue. It provides an asynchronous interface so that it can sleep when tasks aren't available, without blocking the thread entirely. At least for krabby, this is significantly easier to handle than working with crossbeam-deque directly.

In local benchmarks, takeaway can hold its own against crossbeam-deque. Guided by the profiler, I've optimized some hot spots (in particular, how tasks are enqueued), and takeaway is often faster on my machine. Some graphs are available on the Git repository. It's somewhat surprising, because takeaway is optimized for the harder case of task prioritization. takeaway is also very sensitive to the choice of batch size; small batch sizes tend to do very poorly, and users may need to tweak it if they have very small tasks.

krabby has really benefited from takeaway. While it's hard to judge crossbeam-deque here, given my hacky use of it, it was taking roughly 20% of the compiler's runtime, about 6.5% for each of the 3 instances. With takeaway (batch size 64), that time has dropped to 1.5% in total: roughly 4 times faster than an individual queue, and ~13 times faster than all 3 combined.

what's next

I'm proud of where takeaway is today, and I'd love for you to check it out. There's still more work to do before I'm fully satisfied, but I think it's ready for use.

I have a list of pending ideas that will appear in v0.2.0 and beyond. Once these are done, I expect to publish v1.0 and passively maintain the codebase.