We all love Rust because it doesn't need a GC to guarantee memory safety, among other things.
One idea that keeps creeping up on me is: what if garbage collection could be done in a few picoseconds using a hardware device that keeps track of object lifetimes, either separately (as in a client/server model) or with additional bits added to each word to track them individually?
As the Rust forums are a place with a lot of professionals knowledgeable about hardware, I wanted to ask this here. I'm by no means a hardware person.
To answer your question: yes, in principle anything that can be run on a Turing machine can have a direct hardware implementation.
However, it would be worse than pointless to do so. First off, the hardware likely wouldn't accelerate all that much, since the garbage collection algorithm is superlinear, and hardware can do nothing to reduce asymptotic complexity.
Secondly, it likely wouldn't be economical to develop some kind of ASIC for garbage collection purposes. Too large an investment, and not nearly enough return, in both the technical and economic realms.
Lastly, and most fundamentally, garbage collection is an incomplete solution to the actual problem: resource management.
Yes, RAM is a resource, but it is far from the only one. To pick some easy examples, think of sockets, or file handles: garbage collection does nothing to properly manage those, and instead just invites language-level hacks like with-statements that are unnecessarily restrictive, all things considered.
Rust's ownership and borrowing systems, on the other hand, do tackle the resource management issue properly. You can get access to sockets or file handles, manipulate them, move them, store them in e.g. a struct, and when you're done, you just (implicitly) drop them and let the ownership/borrow systems do the actual cleanup. No fuss, no muss.
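A minimal sketch of that in code, assuming a made-up `Connection` wrapper around a `TcpStream` (the address and label are placeholders for the example):

```rust
use std::io::Write;
use std::net::TcpStream;

// A struct that owns a socket. Dropping the struct drops the TcpStream,
// which closes the underlying OS handle -- no manual cleanup call needed.
struct Connection {
    stream: TcpStream,
    label: String,
}

impl Connection {
    fn open(addr: &str, label: &str) -> std::io::Result<Self> {
        Ok(Connection {
            stream: TcpStream::connect(addr)?,
            label: label.to_string(),
        })
    }

    fn send(&mut self, msg: &[u8]) -> std::io::Result<()> {
        self.stream.write_all(msg)
    }
}

// An explicit Drop impl only to show where extra teardown would go.
impl Drop for Connection {
    fn drop(&mut self) {
        eprintln!("closing connection '{}'", self.label);
        // self.stream is dropped (and the socket closed) right after this.
    }
}

fn main() -> std::io::Result<()> {
    {
        let mut conn = Connection::open("127.0.0.1:8080", "example")?; // placeholder address
        conn.send(b"hello")?;
        // conn can be moved, stored in another struct, returned, etc.
    } // <- conn goes out of scope here; the socket is closed automatically.
    Ok(())
}
```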
Among many crazy ideas, that one was tried by Intel. Over 40 years ago, in fact. It didn't work back then, and it wouldn't work today.
The real problem with GC is not where you think it is. The promise of GC, “don't think about memory allocations, we have your back”, is seductive. It's also wrong. I remember how we traced down unacceptable jitter in an Android app (back when Android was using Apache Harmony) and found that one single printf("%d", x) equivalent in Java would allocate (and then garbage-collect) almost 300 objects. The C version would only allocate stack space.
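For contrast, here's a small Rust sketch (not the Java or C code from that story) showing that formatting an integer doesn't have to heap-allocate at all; the output goes into a fixed stack buffer:

```rust
use std::fmt::Write;

// A tiny fmt::Write sink backed by a fixed-size stack buffer.
struct StackBuf {
    buf: [u8; 64],
    len: usize,
}

impl Write for StackBuf {
    fn write_str(&mut self, s: &str) -> std::fmt::Result {
        let bytes = s.as_bytes();
        if self.len + bytes.len() > self.buf.len() {
            return Err(std::fmt::Error); // buffer too small
        }
        self.buf[self.len..self.len + bytes.len()].copy_from_slice(bytes);
        self.len += bytes.len();
        Ok(())
    }
}

fn main() {
    let x = 42;
    let mut out = StackBuf { buf: [0; 64], len: 0 };
    write!(out, "{}", x).unwrap(); // formats entirely into the stack buffer
    println!("{}", std::str::from_utf8(&out.buf[..out.len]).unwrap());
}
```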
You may use whatever tricks you can imagine, but if your program generates useless allocations on an industrial scale, then it will always produce slow-downs. Worse: attempts to reduce excessive memory use lead to crazy tricks in the implementations, which in turn lead to strange bugs that are then fixed with something like this:
Not sure what you mean by that. Standard compacting garbage collection algorithms run in amortized time proportional to the amount of memory being allocated.
Standard compacting garbage collection introduces pauses, which could be eliminated entirely with help from hardware: introduce two RAM modes where data is stored either twice or once. That would give you the ability to create snapshots instantly; then the GC can run in software, but on an entirely separate CPU core, without affecting the program at all. It would be entirely separate and would spend zero time on the worker CPU cores… but as I've said, the real issue is not with GC, but with what GC enables: sloppy designs that create a lot of memory pressure.
A particular CPU/firmware design may, for simplicity/practicality, not "totally eliminate" the pauses, but I certainly agree that malloc/free logic could be provided by hardware/firmware.
Agree 100%. Memory management is part of coding strategy. Tools like Rust that simplify this process are ideal. Relative to GC languages, and particularly while learning Rust memory management, Rust may increase the complexity of coding. However, relative to integrating memory management into coding strategy, Rust is a help. Once one learns the concepts, one can be productive, and the resulting program performance is excellent...
Such crates are a helpful and key part of coding strategy. And such crates teach engineers how to think about memory in their design strategy.
By recognizing patterns, businesses and engineers can build up higher-level crates (models) that handle memory management for common application scenarios, greatly simplifying program development and enhancing productivity, such that GC languages have little productivity advantage. Yet GC languages cannot match the performance or the code size (runtime-system bloat). At that point, a trade-off consideration is hardly needed, and languages like Rust (currently only Rust?) are the obvious solution - especially as the Rust infrastructure continues to mature.
No. You may move all the GC logic to a dedicated core with the help of hardware changes; then the other cores can pretend that GC doesn't exist at all. Memory allocations would still take time, that's unavoidable, and that's always true, even in Rust. But deallocations may be free. The core observation here: if you have some garbage in an old memory snapshot, then that garbage, without the work of the GC, would stay garbage forever. That means you only need zero-cost snapshots for that, and that can be achieved using many approaches.
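A toy software-only sketch of that observation (the index-based heap, the cloned snapshot, and the thread standing in for a dedicated core are all simplifications for illustration): anything unreachable in the snapshot can only stay unreachable, so the collector can trace it on another thread while the mutator keeps running.

```rust
use std::collections::HashSet;
use std::thread;

// A heap modeled as an index-based arena: edges[i] lists what object i points to.
#[derive(Clone)]
struct HeapSnapshot {
    edges: Vec<Vec<usize>>,
    roots: Vec<usize>,
}

// Classic mark phase over the snapshot: everything reachable from the roots is live.
fn mark(snapshot: &HeapSnapshot) -> HashSet<usize> {
    let mut live = HashSet::new();
    let mut stack = snapshot.roots.clone();
    while let Some(obj) = stack.pop() {
        if live.insert(obj) {
            stack.extend(&snapshot.edges[obj]);
        }
    }
    live
}

fn main() {
    // Objects 0..5: 0 and 1 are reachable from the root, 2, 3 and 4 are garbage.
    let snapshot = HeapSnapshot {
        edges: vec![vec![1], vec![], vec![3], vec![], vec![]],
        roots: vec![0],
    };

    // Stand-in for the "hardware snapshot": hand a copy to a separate thread (core).
    let gc_view = snapshot.clone();
    let collector = thread::spawn(move || mark(&gc_view));

    // ... the mutator keeps allocating and mutating its own view here, unaffected ...

    let live = collector.join().unwrap();
    let garbage: Vec<usize> = (0..snapshot.edges.len())
        .filter(|i| !live.contains(i))
        .collect();
    println!("live: {:?}, garbage: {:?}", live, garbage);
}
```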
Whether that's a good idea is debatable, but it's certainly possible.
Arguably Lisp Machines predate these, but it's pretty close. I wouldn't be surprised if there were earlier examples, either.
Only in sensible systems, yes: unfortunately, finalizers in both Java and .NET can "resurrect" allocations. If you don't have them, then GC is much less useful, unable to eventually reclaim non-memory resources; and even if you somehow ban resurrection, you would need to figure out how finalizers run against the new memory space without the freed object and still do something useful.
There's also the issue that modern GCs are generational, so there isn't any single "snapshot" to take, and compacting, which makes tracking object identity tricky. You could still do it, of course, but it seems quite a bit more complex than what you generally see in hardware, so I'd expect something more like transactional memory, which I think gets you most of the way.
If you don't have them, then GC is much less useful, unable to eventually reclaim non-memory resources; and even if you somehow ban resurrection, you would need to figure out how finalizers run against the new memory space without the freed object and still do something useful.
With hardware support there is no “new memory space”. The garbage collector observes memory in the state it was in when it started; the rest of the system continues unmodified. When the garbage collector signals “I've done a pass”, the two states are synchronised.
The question of “can the existing GC be made better by adding hardware support” is an entirely different kettle of fish. We are talking about whether “garbage collection can be done in a few picoseconds”, not about whether you can keep the existing design and still do that.
Has anyone done working transactional memory in practice? Keeping two copies of data that are hardware-synchronised or de-synchronised sounds simpler than transactional memory…
I don't see that as advantageous or even practical from a hardware point of view.
The CPU instructions managing a GC should already be close to optimal in number of cycles. If you had to perform the same task in a new hardware unit, you'd face the same critical path anyway, which is the memory accesses. I'm not sure you'd get any significant gain.
You'd still need to pass through the MMU, which authorizes the memory accesses, translates their virtual addresses to physical addresses, and manages the loading of pages when necessary. The new unit would need to respect the order of completion of those accesses vs the CPU instructions, since they may depend on what needs to be freed. Each GC having its own routines to allocate and deallocate memory, that unit would have to interleave those calls in the flow, too (I don't even want to think about the synchronization between the two).
So all you'd get at best is a hardware unit generating the microinstructions that are now generated from the software's instructions, minus the flexibility to implement different or new strategies, plus some hairy complications.
Configuration-wise, it doesn't sound easy, either. Each process having its own GC, you'd have to switch that configuration each time you switch a core's process. The unit would need to know which deallocation code to call, what the parameters are, what potential errors could be triggered and how to manage them... On its side, the running software would also need to send the information about the use of each block, so shared counters. Finally, if one process has more than one GC, it's even worse.
No, really, I don't think it would be worth the trouble.
If your goal is to make the GC itself faster, then you are 100% correct. If your goal is to make a GC-using app faster, it's a different kettle of fish.
Sure. But if that is happening on a dedicated core, then your app can continue its work without interruptions.
Why?
What do you mean by “each GC”? There is obviously only one in that scheme. You are clearly discussing something else than what the topicstarter was asking about.
You have just added tons of extra requirements to what the topicstarter was talking about. MMUs. Processes. Multiple GCs. Where are they in the original?
Probably in yours, with some [unspecified] limitations. It would be nice to know what they are first, though.
Because. Think Lisp machine: all code in a memory-safe, GC-handled language. No MMU (because there is no need for memory protection), no processes (because they are not needed), one system-provided GC. All these things that you talk about are simply not there to impede anything.
I'm not sure what you even mean by "GC-using app". I'm only answering the OP's question about a hardware-accelerated GC.
Because it's the basic requirement for memory integrity.
A CPU is made of different parts, among which is an MMU. You have to understand how a modern CPU works in order to know what a new unit must interact with. As for multiple GCs: each language uses its own, so a CPU would need to support that in order to be economically viable.
The OP is talking about a hardware accelerator for a practical use, not a toy experiment, so no, you can't discard all those constraints. But even if you did, you'd still need to deal with the pipeline flow of the CPU; if you're not familiar with how modern superscalars work, which seems to be the case, I invite you to read up on the subject, as it's a fascinating one.
You're talking about LISP machines, but they're limited to their own context and as such have nothing to do with the current discussion. There has been dedicated hardware for Java and Forth, too. All those experiments were interesting at a time when superscalar wasn't a word yet, or they were used in a very specific context like old embedded microprocessors, but they haven't survived in today's context and are completely irrelevant.
I guess this is what the UC Berkeley team demonstrated with an FPGA co-processor design:
Figure 1a shows the fraction of CPU time spent in GC pauses for different workloads. We confirm that workloads can spend up to 35% of their time performing garbage collection.
Holy mackerel! I didn't know the GC workload was so high! This explains my electricity bill.
In the conclusion they report a 3.3x overall speedup of GC at a hardware cost equivalent to 64 KB of extra SRAM on the SoC:
We introduced the design of a hardware accelerator for GC that can be integrated into a server or mobile SoC and performs GC for the application on the CPU. The unit can be implemented at a very low hardware cost (equivalent to 64KB of SRAM), generalizes to stop-the-world and concurrent collectors and does not require modifications to the SoC beyond those required by any DMA-capable device.
By implementing a prototype of this design in the context of a real SoC, we demonstrate that the integration effort is manageable, and that the unit takes up 18.5% the area of a small CPU, while speeding up GC by 3.3× overall (4.2× for marking). At this low cost, we believe that there is a strong case to integrate such a device into SoC designs.
You are asserting that moving GC to a separate CPU core is pointless because “you'd face the same critical path anyway, which is the memory accesses”. That is true for the speed of the GC itself, but not true for the speed of the business logic, which may continue to be executed on an entirely different CPU using entirely different RAM.
By making some unspecified assumptions. It would be nice to know what those assumptions are.
For business logic, yes; for GC, no. The GC only needs a consistent snapshot of memory as it existed at some time in the past. You can organise such a snapshot without affecting foreground operations.
An MMU is not an essential part of a CPU, and is not even a mandatory one. You assumed it would be used. Why?
If your goal is something like “let's add one hardware unit to speed up IntelliJ on Windows 11”, then sure. That wasn't what the original question was about. At least it didn't sound like that.
Only the OP can answer whether that assumption is correct, but I read it precisely as a thought experiment. Because if hardware acceleration on Windows were possible, then someone would have developed and sold such an accelerator already… that's not even worth discussing. But if we are not trying to run existing binaries, then possibilities open up.
I can talk to hardware guys instead… try that some time. You may find out that there are OSes besides Android and Windows, and CPU architectures besides ARM and x86.
The GC co-processor would still use some fraction of the total available memory bandwidth, robbing it from the rest of the CPU. Compared to a non-GC design (C, C++, Rust, ...) that will be worse. Sure, deallocation takes memory bandwidth as well, but if you free data immediately after using it, chances are higher that it will still be at least partly in cache. A tracing GC necessarily uses more memory bandwidth.
Not all applications are bottlenecked on memory bandwidth of course, but for those that are this would be worse.
If you want to speed up deallocation your best bet is to reduce the amount of work needed during deallocation. Prefer trivial drop, bump allocators, arenas, less indirection, etc.
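For example, a minimal sketch of the arena idea (the Node and Arena types here are made up for illustration): nodes refer to each other by index inside one Vec-backed arena, so dropping the arena frees the whole graph in one cheap operation.

```rust
// A node that points to another node by index instead of via Box/Rc.
struct Node {
    value: u64,
    next: Option<usize>,
}

// A Vec-backed arena: one allocation pool for every node.
struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    // Push a node and return its index; no per-node heap allocation
    // beyond the occasional Vec growth.
    fn alloc(&mut self, value: u64, next: Option<usize>) -> usize {
        self.nodes.push(Node { value, next });
        self.nodes.len() - 1
    }
}

fn main() {
    let mut arena = Arena::new();
    let tail = arena.alloc(2, None);
    let head = arena.alloc(1, Some(tail));

    let mut cursor = Some(head);
    while let Some(i) = cursor {
        print!("{} ", arena.nodes[i].value);
        cursor = arena.nodes[i].next;
    }
    println!();
} // <- the arena is dropped here: one deallocation for the whole graph.
```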
If you are not memory bandwidth bound, moving specific deallocations to background threads can also be a valid approach, and is the same approach taken by GC on a broader scale.
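A sketch of that approach in Rust, assuming the workload is not memory-bandwidth bound: move the expensive-to-drop value onto a background thread so the deallocation happens off the latency-sensitive path.

```rust
use std::thread;

fn main() {
    // Some large structure whose drop has to free many separate allocations.
    let big: Vec<Vec<u8>> = (0..1_000).map(|_| vec![0u8; 4096]).collect();

    // Move it into a background thread; `drop` runs there, off the hot path.
    let handle = thread::spawn(move || drop(big));

    // ... latency-sensitive work continues here without waiting for the free ...

    handle.join().unwrap(); // in real code you might instead detach or reuse a pool thread
}
```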
And if you want to compare hardware assisted GC to software GC, the question becomes if the silicon usage is worth it, or it could be more useful for some other function, such as part of an extra core, more cache, etc.
Not necessarily. One can do what SEGA did with the Saturn: the GC can use spare bandwidth not used by foreground processes. Utilising such an architecture is not trivial, but developers successfully used the 2nd CPU in the Saturn for specialized processing; it should be possible to do the same with a GC.
It's not really clear; very hard to say without actual tests.
The core issue is not even that. It's the permanent chicken-and-egg problem: without software for such a system it's not very useful, and without such hardware no one would write software for it.
I think things have changed in the 30 or so years since the SEGA Saturn. Today's processors can easily saturate memory bandwidth, especially when you have half a dozen cores hammering on it. I don't think you will find any spare bandwidth.
That needs investigation, obviously, but workloads that push memory bandwidth to the limits tend to be very light on memory allocations: huge arrays of data, rarely changing. And workloads that include lots of allocations and deallocations very often do not saturate memory. Plus, the DDR5 trick (two parallel requests to memory) can be pushed further.
Whether that would still be a net positive or a net negative is a good question.
However, it's on a small CPU in a non-switching, single flat memory context only, so the problems I mentioned earlier would still need solving if one wanted to use it in a real environment, which may (should?) require a relocation of the logic.
Besides, I'm not convinced that the huge proportion required by that unit (18.5%!) would have a significantly smaller ratio with a full-sized CPU. I haven't seen any details on the timing closure numbers, either, but I could have missed that; still, I'd be curious to see how it'd fare in a hyper-pipelined ASIC.
That being said, it's impressive work, and breakthroughs come from trying again and again; but I also think that the advantages given by Rust's ownership and memory management system are already proven, and that their scope is larger than reducing the overhead of GC.
That's also what I thought. It seems to be a better fit for microservices on the server, and for embedded applications using MicroPython, for things like robotics.
The 18.5% chip area should be compared against the same area spent on added compute capability, and how the two end up in real-world efficiency. But given that most apps don't floor the gas pedal all the time, the gains can turn out to outweigh the losses across the wider spectrum.