I’m a loud and proud perfectionist. In the context of programming, I believe there is always an ideal way to solve a problem. I’ve developed a habit of noticing imperfections, particularly when it comes to API design, software architecture, and performance. Over time, I’ve accumulated a list, at least in my head, of things that could (and maybe should) work better. I’m writing that list down so that I can remember it better, and perhaps others will find it interesting; but if someone were to pick up and solve one of these, I’d be eternally grateful. I hope to solve some of them myself.
First, an important note
Below, I talk about problems with other people’s work. Sharing negative opinions is not something I take lightly. I want to explain my intention and my stance towards these works. I want to stress that I am not blaming anyone for these problems – I am talking about these problems because I think solving them would be a good thing.
First of all, I am glad that each and every one of these works exist. Making Things™︎ takes a lot of effort and I always respect that. I’ve been programming for over a decade, and regardless of how passionate I am or how many big dreams I have, I have very few concrete projects to share.
Solving problems requires time and energy, and these are not always available. This is incredibly common for open-source projects, which are usually developed and maintained by a single individual, out of passion and/or need. Developers work on projects at their own discretion; after all, these projects are only available because the developers chose to publish them. Nobody is responsible for solving the problems I point out.
The developers may not have discovered these problems. Perhaps it falls outside their purview (e.g. it might not be an important use case for them). Even if they are aware of a problem, they might not value its costs the way I do. Missing a problem does not make one a bad developer. And technical disagreements are common and perfectly valid.
Problems are often caused by deliberate technical tradeoffs, which might not be visible at first glance. It is entirely possible that a problem I point out has been considered before, but cannot be solved without causing more problems. I may have missed obvious reasons why potential solutions are intractable. I will try to note down such cases.
Large, structural problems tend to be caused by decisions made long ago that made sense at the time (and perhaps remain worthwhile in hindsight) but are difficult or impossible to undo now. These usually involve a library API or high-level software architecture; they can be resolved with a complete rewrite or painstaking incremental effort. Rewrites tend to be impossible because of the amount of churn they would cause, halting other development and preventing contributors from working in parallel. Incremental effort can be worthwhile, but it can take years to pay off.
Regardless of these barriers, I think discussing these problems is worthwhile. While I would love to solve them myself, I simply don’t have enough time or energy. I hope that these discussions are interesting, and that looking for solutions to these problems will make us better programmers. And I would be incredibly grateful if someone does put in the effort to solve any of them.
loom
last updated , loom version 0.7.2
loom is a Rust tool for checking concurrent programs; it runs a test many times, intercepting atomic operations and ensuring that each iteration evaluates a different sequence of concurrent operations. I think it is a fascinating tool, and as a big fan of concurrent data structures, I’d love to use it to verify the correctness of my designs. It’s used for testing by a number of concurrent data structure libraries, like crossbeam. But it has a few drawbacks that make it painful to use.
- `loom` is used by substituting `std`’s concurrency-related types (primarily atomics) for the corresponding ones provided by `loom`. But `loom`’s API does not exactly correspond to `std`’s API, requiring the user to build their own better-matching abstraction. While it appears that some API differences are forced by the underlying implementation, other parts of the `std` API are not implemented at all.
- `loom`’s `Atomic*` types do not have the same in-memory representation as the standard types – they are all (effectively) `usize`s. This is particularly painful when the caller uses customized allocation or pointer arithmetic code; I find myself in this case quite often. I’m not the first person to notice this, or to consider a solution for it; an open issue has existed since 2023. The poster even implemented their proposal, but their questions regarding an edge case were never answered.
- Development seems to be very slow, essentially stalled. The current major version (0.7.0) was released in 2023, and only a handful of minor enhancements have occurred since then. Many issues and pull requests have been open for years (see `loom`’s GitHub issues page).
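For context, the substitution usually looks something like this. The `--cfg loom` flag is the conventional setup from loom’s documentation; the counter function is just an invented stand-in for code under test.

```rust
// Sketch of the type-substitution pattern loom relies on. Building with
// `RUSTFLAGS="--cfg loom"` swaps in loom's checked types; a normal build
// uses std. Real projects typically hide this switch in an internal module.
#[cfg(loom)]
use loom::sync::atomic::{AtomicUsize, Ordering};
#[cfg(not(loom))]
use std::sync::atomic::{AtomicUsize, Ordering};

// Code under test is written against the re-exported names, so the same
// source runs natively and under loom's model checker.
pub fn bump(counter: &AtomicUsize) -> usize {
    counter.fetch_add(1, Ordering::SeqCst) + 1
}
```

Under loom, a test then wraps the scenario in `loom::model(|| { … })` so the checker can explore the interleavings. It’s this parallel-universe module layout that every user ends up rebuilding by hand.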
While the technical issues have (potential) solutions, the real problem is the stalled development. The only thing I can do is spread awareness about loom and its utility, and hope to reach somebody who has the passion/need, time, and energy.
rustls
last updated , rustls version 0.23.36
rustls is the de-facto Rust library for communicating over TLS. Its defining feature is that it just works – it’s well maintained, it has easy-to-use APIs, and it’s pretty much always the right thing to reach for.
At work, we have two daemons that communicate through a TLS-based protocol. While both sides are usually implemented by us, they have to be compatible with other client or server implementations. We’re trying to maximize throughput (in fact, we open a separate TLS connection per CPU); while trying to optimize this further, I noticed some performance issues with rustls.
- `rustls` performs a lot of allocations internally, even in its “unbuffered” API. I’m sure this helped simplify the implementation, but the internals have grown such that removing these allocations is incredibly difficult. Others have tried to improve the situation already, with small incremental changes, but progress is slow.
- `rustls` pairs the reading side and the writing side of its TLS connection state together; this leads to a subtle issue for request-response protocols. A client should be able to send multiple requests while previous responses are waiting to be received; but `rustls` (both in its high-level and low-level interfaces) makes this separation of the request sender and response receiver very difficult. I managed to hack something together with mutexes, but it was too messy to be worthwhile.
I invested some time in contributing to rustls and trying to resolve these. I opened an issue to discuss the reader-writer split, and built a proof of concept for the reader-writer API. While the PR wasn’t meant to be merged, it seems the maintainers are interested in such an API and are updating the internal architecture to make it possible. I’m really happy to have been able to help, and I hope I can find time to contribute some incremental improvements.
Beyond rustls (which aims for hard-to-misuse APIs), I’m curious to see what a really lean TLS interface looks like. For sending data, I can imagine the user would provide a buffer, write the outgoing data in a library-specified range, and then the library would encrypt the message in place and add the necessary headers. The library-specified ranges would account for message fragmentation, outgoing alerts, and perhaps even the initial handshake messages. A similar interface could exist for receiving data. A generic request-response mechanism could be built around this, which would be applicable to many TLS-based protocols. I’m excited to see what’s possible.
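To make the buffer-range idea concrete, here’s a rough sketch of what the sending half could look like. Every name here is invented, the “seal” step is a no-op stand-in for real encryption, and the five-byte header merely mimics a TLS record header – this shows the in-place data flow, not a real design.

```rust
use std::ops::Range;

// Hypothetical lean sending interface (all names invented): the library
// tells the caller where plaintext may go, then seals it in place.
trait LeanTlsSender {
    /// Range within a `buf_len`-byte buffer where the caller should write
    /// the next chunk of plaintext.
    fn payload_range(&self, buf_len: usize) -> Range<usize>;
    /// Encrypt `payload_len` bytes in place, prepend the record header, and
    /// return the wire-ready range of `buf`.
    fn seal_in_place(&mut self, buf: &mut [u8], payload_len: usize) -> Range<usize>;
}

// Toy stand-in: a TLS-style 5-byte record header and no actual encryption,
// just to demonstrate writing the payload and header into one buffer.
struct ToySender;
const HEADER_LEN: usize = 5;

impl LeanTlsSender for ToySender {
    fn payload_range(&self, buf_len: usize) -> Range<usize> {
        HEADER_LEN..buf_len
    }
    fn seal_in_place(&mut self, buf: &mut [u8], payload_len: usize) -> Range<usize> {
        buf[0] = 23; // content type: application data
        buf[1..3].copy_from_slice(&[0x03, 0x03]); // legacy record version
        buf[3..5].copy_from_slice(&(payload_len as u16).to_be_bytes());
        0..HEADER_LEN + payload_len
    }
}
```

The appeal of this shape is that the caller owns the single buffer end to end: no copies between a “plaintext buffer” and a “TLS buffer”, and no internal allocations, because the library only ever hands out ranges.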
CachePadded<T>
last updated
False sharing is a somewhat subtle problem when working with atomic variables. On most architectures (well, at least x86), atomic operations like compare_exchange are implemented by cache-line locks; the CPU will lock the containing cache line, taking exclusive access to it, and will then perform the requested operation. If two atomic variables, contended by different CPUs, end up on the same cache line, performance can be seriously affected.
This is usually solved by padding every atomic variable (that might suffer contention) so nothing else occurs in the same cache line. In Rust, this can be implemented with a relatively simple wrapper type, but figuring out the correct cache line size for different architectures is hard. crossbeam-utils provides one such implementation … in fact, it was the only implementation I could find. For my work-stealing task queue library, takeaway, I ended up copying it.
Aside from some small issues with the API, I’d really like to see this code moved to the standard library. It’s a hundred lines of code (200 including documentation) that you always need for high-performance multi-threaded work. It’s too small to have its own crate, and crossbeam-utils is too large to include if you only need CachePadded.
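To show just how small the core is: this sketch hardcodes a 128-byte alignment (a value used on modern x86_64 to cover the adjacent-cache-line prefetcher; the real crossbeam-utils implementation selects the value per architecture, which is where most of those hundred lines go).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Minimal sketch of a cache-padding wrapper. 128 bytes is hardcoded here;
// a real implementation picks 32/64/128 depending on the target.
#[repr(align(128))]
#[derive(Default)]
pub struct CachePadded<T>(pub T);

// Each counter now occupies its own cache line, so contended updates from
// different CPUs no longer invalidate each other's lines.
#[derive(Default)]
pub struct Stats {
    pub hits: CachePadded<AtomicU64>,
    pub misses: CachePadded<AtomicU64>,
}
```

The `repr(align(...))` attribute both aligns the value and rounds its size up to the alignment, which is exactly the guarantee you want: nothing else can share the line.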
I don’t have experience with this process, but it seems that adding CachePadded to the standard library would require significant effort. An ACP (API change proposal) is needed to officially propose the change and discuss the interface. In my experience, the community’s stance towards getting things right (which I find incredibly valuable!) can prolong this process, sometimes indefinitely. A PR should only begin after consensus is built; there may be little code to add, but documentation will be important.
SIMD intrinsics
last updated
I absolutely adore working with SIMD (in particular on x86). While it is often applied in simple use cases (low-level data processing or numeric computation), I think it could significantly benefit something as complicated as a Rust compiler. The really hard part is reorganizing data structures and algorithms so that SIMD approaches are possible.
But SIMD intrinsics can be painful to use, due to frustrating APIs. There is no platform-independent API for them, and while the Intel intrinsics may be widely supported, they are not pleasant to use. Simple integer operations can look like _mm256_add_epi32() and the AVX-512 instructions get even more hairy, e.g. _mm512_mask_cmp_epi16_mask(). While it’s possible to get used to them, they are simply long and complicated to type, and take no advantage of Rust’s type system.
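As a small taste, here’s an eight-lane i32 addition written against the raw intrinsics. The function name is mine; the AVX2 path is guarded by runtime detection, with a scalar fallback so it runs anywhere.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m256i, _mm256_add_epi32, _mm256_loadu_si256, _mm256_storeu_si256};

// Add two vectors of eight i32 lanes. Note how little the intrinsic names
// tell you: `_mm256_add_epi32` is just "add 32-bit integers, 256-bit wide".
pub fn add_8xi32(a: [i32; 8], b: [i32; 8]) -> [i32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: AVX2 support was just verified at runtime.
            unsafe {
                let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
                let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
                let mut out = [0i32; 8];
                _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_add_epi32(va, vb));
                return out;
            }
        }
    }
    // Scalar fallback for other architectures (or pre-AVX2 CPUs).
    std::array::from_fn(|i| a[i] + b[i])
}
```

Nothing in the types stops you from passing a vector of `u8`s to `_mm256_add_epi32` – the intrinsics all traffic in the same opaque `__m256i`, so Rust’s type system sits entirely unused.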
Within Rust, there is a long-running nightly feature called Portable SIMD. It introduces a simple, high-level, and platform-independent SIMD abstraction that can satisfy common use cases for explicit SIMD intrinsics. This would make it a lot easier to work with SIMD, especially when a single implementation can be used on multiple architectures.
While I hope that the Portable SIMD efforts continue, I don’t think they would often serve the use cases I have in mind. I’m primarily interested in x86, and there are plenty of specialized instructions that I reach for daily. PSHUFB is great for classifying bytes, PSADBW provides horizontal additions, and there are various operations for min/max and saturating arithmetic. Portable SIMD can’t provide for these (while it covers some functionality, e.g. swizzle_dyn for PSHUFB, it does not replicate PSHUFB’s parsing of the index vector). And algorithms written with portable SIMD cannot account for platform-specific performance characteristics, such as the lack of unsigned integer comparisons in AVX2 or the cost of crossing 128-bit lanes. I want to squeeze out every iota of performance I can.
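To illustrate what I mean about the index vector: PSHUFB looks only at each index byte’s low nibble, and zeroes the output lane whenever the index’s high bit is set, whereas a portable swizzle treats indices as plain element numbers. The scalar fallback below spells out the PSHUFB rule (the SSSE3 path is guarded by runtime detection; the function name is mine).

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_shuffle_epi8, _mm_storeu_si128};

// Shuffle `table` by `idx` with PSHUFB semantics: each output byte is
// table[idx & 0x0f], or 0 when the index byte's high bit is set.
pub fn pshufb(table: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // Safety: SSSE3 support was just verified at runtime.
            unsafe {
                let t = _mm_loadu_si128(table.as_ptr() as *const __m128i);
                let i = _mm_loadu_si128(idx.as_ptr() as *const __m128i);
                let mut out = [0u8; 16];
                _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_shuffle_epi8(t, i));
                return out;
            }
        }
    }
    // Scalar fallback replicating PSHUFB's exact index handling.
    std::array::from_fn(|j| {
        if idx[j] & 0x80 != 0 { 0 } else { table[(idx[j] & 0x0f) as usize] }
    })
}
```

That high-bit-zeroing rule is precisely what makes PSHUFB so handy for byte classification: bytes outside the class of interest can be mapped to zero for free, in the same instruction.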
For my personal SIMD programming, I’m much more interested in high-level APIs tailored to particular architectures. I did try implementing a library for this concept of “non-portable SIMD” back in 2024, with npsimd. npsimd tried to offer high-level SIMD vector types with strongly-typed elements and operator overloading, while supporting dynamic instruction set detection. I had a lot of trouble with the optimizer (as LLVM avoids inlining functions that use target-specific features), and the #[target_feature] attribute was harder to use at the time. I hope to find the time to return to that project, reevaluate it, and find a good interface that achieves the right optimizations.