Is a hardware accelerated garbage collector possible?

To be honest, I saw there were other research papers in a more general context, including multicores, but I haven't read them. GC is used in many popular languages, so I understand why it would be appealing to improve its performance, and the interest that could generate for R&D in that area. Maybe I'm underestimating that. After all, they've added HW support for AI to general-purpose CPUs.

That one, at least, makes sense. They are supporting minifloats that are trivially implementable in hardware and much faster than software emulation, but more importantly: they enable things that are simply not feasible without them.

GC accelerators are much more problematic. They might save quite a lot, but not in a form that can be easily plugged into a Windows PC to run IntelliJ. That's a much harder thing to sell. They would be like Jazelle DBX: perfectly usable in theory, absolutely useless in practice.

Remember, a "memory space" is just a definition of what memory addresses mean: if you're "taking a snapshot", that essentially means making the current memory space read-only and creating a new (probably copy-on-write) memory space.
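As a small-scale illustration of that shape (not how a real VM would do it, which involves page tables), Rust's `Arc` plus `make_mut` gives copy-on-write at the level of a single value: the "snapshot" handle keeps the old data alive while the "live" side writes into a fresh copy.

```rust
use std::sync::Arc;

fn main() {
    // The "current memory space": a shared value.
    let space = Arc::new(vec![1, 2, 3]);

    // "Taking a snapshot" is just keeping another handle to the same data.
    let snapshot = Arc::clone(&space);

    // The mutator continues copy-on-write style: make_mut clones the
    // underlying data precisely because the snapshot handle still exists.
    let mut live = space;
    Arc::make_mut(&mut live)[0] = 99;

    // The snapshot still sees the world as it was at the cut.
    assert_eq!(*snapshot, vec![1, 2, 3]);
    assert_eq!(*live, vec![99, 2, 3]);
}
```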

For the context I was talking about, finalizers, this matters because they need to write to do anything useful, so they seemingly need to run in the non-snapshot memory space. That further seems to mean they either run before the normal GC cleans up, in which case they might restore access to the object that was about to be cleaned up, or after, in which case they no longer have access to what was freed and so can't do anything useful with it.

I'm sure there's some way to resolve this, but it's an extra source of messiness.

Sure, but there are good algorithmic reasons to make GC generational, and good hardware reasons to make it compacting. Even if you only count the time to create a snapshot for the background process, you aren't making things easier for yourself when that background process can't trace faster than new memory is being allocated.
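A toy model of that last point (the rates and the pessimistic assumption that the tracer must also cover memory allocated while it runs are made up for illustration): solving `t * trace_rate = live_bytes + t * alloc_rate` gives the time for a background trace to finish, and it only converges when tracing outpaces allocation.

```rust
// Time for a concurrent trace to finish, assuming it must also cover
// memory allocated while it runs. Rates in MB per millisecond.
fn trace_time_ms(live_bytes: f64, trace_rate: f64, alloc_rate: f64) -> Option<f64> {
    if trace_rate <= alloc_rate {
        None // the tracer never catches up with new allocation
    } else {
        Some(live_bytes / (trace_rate - alloc_rate))
    }
}

fn main() {
    // 100 MB live, tracing 50 MB/ms while allocating 10 MB/ms: done in 2.5 ms.
    assert_eq!(trace_time_ms(100.0, 50.0, 10.0), Some(2.5));
    // If allocation outpaces tracing, the cycle never terminates.
    assert_eq!(trace_time_ms(100.0, 10.0, 50.0), None);
}
```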

... define "working", I suppose :grinning_face_with_smiling_eyes:. There have certainly been enough attempts, but it seems to be pretty cursed. That said, I don't see how this "background GC" would actually "synchronize" at the end without at least something like it, though it would be a lot simpler if hardware could enforce memory protection at the object level.

Speaking of that, though, how the allocator would work sounds deeply confusing; effectively it would have to be pretty tightly integrated with the hardware, right?

For resource cleanup without resurrection, you can offer an interface like (Rust pseudocode):

```rust
let fd: i32 = ...;
let file = Gc::new(File { fd });
when_garbage(&file, || close(fd));
```

That is, a separate function (closure) is scheduled to be called after the object becomes garbage. The closure cannot resurrect file because it doesn't refer to file, and, if snapshots are involved, would indeed run fully in the non-snapshot space. (The caveat in this system is what to do if the closure does refer to file; naively, that makes it uncollectable garbage similar to a reference-counting cycle. Arguably, this is the only correct thing to do if you want the correctness property “objects can never be observed in their post-finalization state”.)

I don’t recall what the standard name of this technique is, if it has one.
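For illustration, here is a runnable sketch of the `when_garbage` pattern above using plain `Rc`/`Weak` instead of a tracing `Gc` (the `CleanupRegistry` type and its `sweep` pass are made up; a real collector would run the equivalent of `sweep` after tracing). The key property is visible in the types: the registered closure holds no handle to the object, so it cannot resurrect it.

```rust
use std::rc::{Rc, Weak};

struct File {
    fd: i32,
}

// Hypothetical registry: pairs a Weak handle with a cleanup closure
// that does NOT capture the object, so it cannot resurrect it.
struct CleanupRegistry {
    entries: Vec<(Weak<File>, Box<dyn FnOnce()>)>,
}

impl CleanupRegistry {
    fn new() -> Self {
        Self { entries: Vec::new() }
    }

    fn when_garbage(&mut self, obj: &Rc<File>, f: impl FnOnce() + 'static) {
        self.entries.push((Rc::downgrade(obj), Box::new(f)));
    }

    // A "GC pass": run the cleanup of every entry whose object is garbage.
    fn sweep(&mut self) -> usize {
        let mut ran = 0;
        let mut kept = Vec::new();
        for (weak, f) in self.entries.drain(..) {
            if weak.upgrade().is_none() {
                f(); // object is unreachable: run its cleanup
                ran += 1;
            } else {
                kept.push((weak, f));
            }
        }
        self.entries = kept;
        ran
    }
}

fn main() {
    let mut reg = CleanupRegistry::new();
    let fd: i32 = 3; // stand-in for a real file descriptor
    let file = Rc::new(File { fd });
    reg.when_garbage(&file, move || println!("close({fd})"));

    assert_eq!(reg.sweep(), 0); // still reachable: nothing runs
    drop(file);
    assert_eq!(reg.sweep(), 1); // now garbage: cleanup runs exactly once
}
```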


You're describing an illusion created by hardware. “Down below”, at the hardware level, nothing like that exists. You have separate RAM chips, numerous caches, the MOESI cache-coherency protocol, and other such things.

If the goal is to create periodic hardware snapshots for GC needs, then you would plug into the system at that level. Perhaps even using doubled RAM chips that keep either one or two copies of the data sent to them.

Finalizers only matter if your language has them. Not all languages include finalizers. And even if you do have finalizers, you can just run them (if they need to be run), then run GC again, and only collect garbage once no finalizer has done anything.
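That "run finalizers, trace again, collect only at a fixed point" loop can be sketched with a toy heap model (all the types and fields here are made up; a real collector tracks reachability by tracing, not by a `rooted` flag):

```rust
// Toy model: a "heap" of objects; each may have a pending finalizer
// that can resurrect (re-root) the object.
struct Obj {
    rooted: bool,
    finalizer_pending: bool,
    resurrects: bool,
}

// Returns the number of objects actually collected.
fn collect(heap: &mut Vec<Obj>) -> usize {
    loop {
        // Run finalizers of currently-unreachable objects first.
        let mut any_ran = false;
        for o in heap.iter_mut() {
            if !o.rooted && o.finalizer_pending {
                o.finalizer_pending = false;
                any_ran = true;
                if o.resurrects {
                    o.rooted = true; // the finalizer re-rooted the object
                }
            }
        }
        if !any_ran {
            // Fixed point: no finalizer did anything, safe to collect.
            let before = heap.len();
            heap.retain(|o| o.rooted);
            return before - heap.len();
        }
        // Otherwise "run GC again" and repeat.
    }
}

fn main() {
    let mut heap = vec![
        Obj { rooted: true,  finalizer_pending: false, resurrects: false }, // live
        Obj { rooted: false, finalizer_pending: true,  resurrects: true  }, // resurrected
        Obj { rooted: false, finalizer_pending: true,  resurrects: false }, // finalized, freed
        Obj { rooted: false, finalizer_pending: false, resurrects: false }, // plain garbage
    ];
    assert_eq!(collect(&mut heap), 2); // two objects actually collected
    assert_eq!(heap.len(), 2);         // the live and the resurrected one survive
}
```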

Sure, they can do that, why not?

Not if our goal is realtime. I have worked with people who did HFT in Java. It was achieved, more or less, by taming the GC and ensuring that it would never run while markets were open. All the tuning of where GC happens pushed it either before/after trading hours, or explicitly to moments when something catastrophic was already happening.

Similarly here: if our goal is explicitly “GC in a hard real-time system” then we may need to give up on the niceties of generational GC or compaction.

They are cursed because people constantly find ways to do things that are not supposed to happen.

It's easier because it's not supposed to support the existing semantics of memory accesses. Transactional memory is in the same bucket as most other things we're discussing here: not too hard to introduce into a new system, but very hard to add in a compatible manner.

Not really. It needs to be integrated with the GC, obviously, while the hardware's role is limited to supporting a “shadow state”: precisely one snapshot of “things that existed before the cutoff moment” (obviously only supported in a special GC mode, similar to how today many things are only allowed in SMM mode). In fact, if we are really serious about realtime capabilities, then GC mode and SMM mode should be merged, since SMM code can cause the arbitrary delays that we want to prevent.

... yes? I thought that's what I was saying to you. We're programmers; everything we do is "an illusion created by hardware" eventually.

The point was just:

because now you have a "background process" that's no longer cleanly in the background, which makes this more complex. Like all the other issues I'm mentioning, this isn't fatal, but it makes the feature a lot bigger and more fragile than it might first seem.

Most have some form of them, eventually. JavaScript added FinalizationRegistry a few years ago, for example. Without finalization, GC is far less useful, handling only memory. Whether finalizers can resurrect objects varies, but allowing it seems to be more common among the languages I'm aware of.

I'm not sure in what context you imagine these CPUs being used, but if it's acceleration of existing workloads, the majority of languages really do want to support finalization, for compatibility if nothing else, and that would outweigh performance by default. That makes the use of these features a flag with a big warning message, or worse, something that works randomly until it doesn't.

There's another name for that: not using a GC. That's something you do by carefully contorting your use of the language specifically to avoid allocation at all.

More power to them, though I'm glad I don't have to deal with those codebases. I'm not really sure what that has to do with, apparently, wanting to make GC massively slower on the same input. Not using a generational GC takes you from microsecond single-thread pauses (since 90+% of objects never make it out of Gen 0) to seconds or minutes of whole-process pauses to pass over the entire memory space as you come under more memory pressure.

That's inherently pointer chasing: memory-bandwidth limited, and maybe even thrashing paging. There's no magic bit of hardware that gets those bytes to you faster; you need to be more clever, and that cleverness is inherently more complex.
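A minimal mark phase makes the pointer-chasing nature concrete: every step is a dependent load at an address you only learn by reading the previous object (toy heap here, with objects represented as lists of indices they point to):

```rust
use std::collections::VecDeque;

// Toy heap: object i's fields are heap[i], the indices it points to.
fn mark(heap: &[Vec<usize>], roots: &[usize]) -> Vec<bool> {
    let mut marked = vec![false; heap.len()];
    let mut work: VecDeque<usize> = roots.iter().copied().collect();
    while let Some(i) = work.pop_front() {
        if !marked[i] {
            marked[i] = true;
            // Chase the object's outgoing pointers: each target address
            // is only known after loading the object itself, so the CPU
            // stalls on memory latency rather than compute.
            work.extend(heap[i].iter().copied());
        }
    }
    marked
}

fn main() {
    // 0 -> 1 -> 2 is live; 3 -> 4 is unreachable garbage.
    let heap = vec![vec![1], vec![2], vec![], vec![4], vec![]];
    assert_eq!(mark(&heap, &[0]), vec![true, true, true, false, false]);
}
```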


Not when we're discussing whether something can be hardware-accelerated. You have to consider what's doable in hardware and what is simpler to do in software, similarly to how HWASan was designed.

On that I can agree. It's something on the scale of iAPX 432 or Itanic with squirreled registers.

That's another interesting question. But I would assume some kind of realtime work. Although sometimes the ultimate form of garbage-collector speedup is used.

If we are talking about porting the existing code then chances are very high that the end result would be highly unsatisfactory.

It's used to reconfigure the system, but not during normal operation. It works, at least from what I've heard, better than using C/C++ for everything (Rust wasn't yet at 1.0 when I talked to these guys; not sure whether they've tried Rust by now).

If the majority of the system doesn't need GC at all, but a small yet important part of it does, then you can run the code that reconfigures your system in parallel with everything else.

The solution to that is to never come under more memory pressure. All existing GCs need around 2x, sometimes 3x, memory overhead compared to an “ideal” GC, or else they start taking more and more time as you approach 1x. This would just need more memory… and that's where we hit something I wrote in my very first message: if you are really careful about memory allocations, then you can often happily live without any tracing GC at all, and if you are not careful, then hardware won't save you. So the usefulness of everything we are discussing here is highly questionable.
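The 2x–3x figure falls out of a classic back-of-the-envelope model (simplified here, ignoring generations and concurrency): each full trace costs work proportional to the live set L and buys H − L bytes of allocation headroom, so the tracing cost per allocated byte is roughly L / (H − L), which diverges as H approaches L.

```rust
// Amortized tracing cost per allocated byte for heap size `heap`
// and live set `live` (arbitrary consistent units).
fn gc_cost_per_byte(heap: f64, live: f64) -> f64 {
    live / (heap - live)
}

fn main() {
    let live = 1.0;
    // At 3x headroom tracing overhead is modest...
    assert!(gc_cost_per_byte(3.0, live) < 1.0); // 0.5 units of work per byte
    // ...at 2x it doubles...
    assert_eq!(gc_cost_per_byte(2.0, live), 1.0);
    // ...and it diverges as the heap approaches the live set ("1x").
    assert!(gc_cost_per_byte(1.1, live) > 9.0);
}
```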

honestly that goes for most threads here :sweat_smile:


Maybe, but I have learned to separate discussions about possibility and discussions about feasibility.

These are two very different beasts: when something is not possible, that's final; we can never do it, because of this or that reason. You have your answer, and you can stick to it.

When something is not feasible, it's an open-ended discussion: if the reason that makes it infeasible is lifted (because of some change in the things we have), then the answer may change, too.
