How hard would it be to build a Rust transpiler that translates Rust to C?

FYI: The Nim programming language (https://nim-lang.org) works this way: it compiles to C first. Could be worth checking out!


As others have said, targeting LLVM is actually easier, since it's designed for that. And it would be possible to make a TCC-style backend for LLVM too, that compiles faster, but that's not really what LLVM is designed to do right now.

But one absolutely could make a rustc backend that produces C code. It would be a fair bit of work, would probably need a bunch of compiler extensions to handle certain things, and would have to produce incredibly non-idiomatic C code to avoid violating a bunch of C rules that Rust doesn't have, but it'd be possible.

If you wanted to play with doing so, you might be interested in the other backends like bjorn3/rustc_codegen_cranelift (a Cranelift-based backend for rustc) and antoyo/rustc_codegen_gcc (libgccjit AOT codegen for rustc), both on GitHub, which also compile Rust through something other than LLVM.


I know C doesn't have the active union variant rules C++ does (since C99 TC3, union punning is explicitly allowed), so I went to double check if C has TBAA rules. I wasn't able to confirm one way or the other[1], but I did learn something else fun — casting directly between two pointer types where the pointer value is insufficiently aligned for the target type is UB.

I was originally going to say that no, a rust2c frontend would be marginally easier than rust2llvm; you could theoretically go straight from fully elaborated Rust HIR (or would it be THIR?) to C, staying in structured control flow without lowering to the basic-block CFG form used for MIR. This would still require most of the complicated frontend work (type inference, trait solving, constant evaluation, monomorphization, etc.) and would mean doing the older, less capable HIR-borrowck instead of the modern MIR-borrowck, but getting to a place capable of emitting C code would be very marginally less work than getting to one capable of emitting LLVM IR. Even with basic-block form, though, C supports function-local goto, so you don't need anything like the Emscripten relooper algorithm.
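To make the goto point concrete, here is a minimal sketch of what a basic-block lowering to C might look like. The function name sum_below and the labels bb1 to bb3 are made up for illustration; this is not what any existing backend emits. It assumes a Rust loop that adds up the integers below n with wrapping arithmetic.

```c
#include <stdint.h>

/* Hypothetical lowering of a Rust loop roughly like:
 *     let (mut sum, mut i) = (0u32, 0u32);
 *     while i < n { sum = sum.wrapping_add(i); i += 1; }
 * Each label stands for one basic block, and every block ends in a
 * goto or a return, mirroring a MIR-style terminator. */
uint32_t sum_below(uint32_t n)
{
    uint32_t sum;
    uint32_t i;

    /* entry block: initialize the locals */
    sum = 0;
    i = 0;
    goto bb1;

bb1: /* loop header: evaluate the condition and branch */
    if (i < n) goto bb2; else goto bb3;

bb2: /* loop body */
    sum = sum + i; /* unsigned arithmetic wraps in C, so this is not UB */
    i = i + 1;     /* cannot overflow, because i < n held in bb1 */
    goto bb1;

bb3: /* exit block */
    return sum;
}
```

Calling sum_below(5) walks bb1 and bb2 five times before falling out through bb3. The control flow maps over mechanically; the hard parts discussed in this thread are elsewhere.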

But the fact that standard C makes things that are defined in Rust implementation-defined or even undefined means that simply "lowering" (cf. transpiling) directly from Rust to C and leaving any interesting work to the C compiler isn't going to work; you're going to need to go all the way to byte-level translation of working with structs and field offsets. And at that point, emitting C doesn't save you any work (except maybe the THIR->MIR lowering) and adds extra work to comply with C's model, compared to emitting LLVM IR and complying with its model, which was designed to be a compiler back-end.

So TL;DR: transpiling to a higher level language seems appealing, but only works if the language semantics are a strict weakening of your starting semantics. Because of this, C makes a poor lowering target, and LLVM (or Cranelift, libgccjit, etc) was designed to be a usable lowering target.

A C target can be extremely useful (e.g. Zig uses a C target for bootstrap) but will be a similar amount of work to other backend targets and not produce anything remotely close to usable C function interfaces.

The other direction, generating unsafe Rust from C, is perhaps even worse in full generality, but surprisingly tractable if you make some simplifying assumptions (e.g. already preprocessed source, only LP64 int sizing, only a single translation unit) since C as a language is designed to be friendly to single-pass line-by-line translation.


[1] As best as I could determine, it's essentially "implementation undefined" for the typical cast case. Casting between two non-char pointer types produces an implementation-defined value, except that casting back to the original type produces a pointer that compares equal to the original. It's thus implementation-defined whether dereferencing a typecast pointer value is undefined behavior. But because union punning is allowed, pointers of unknown provenance could be to members of the same union and allowed to alias that way.


To elaborate a bit on why C isn't a great compilation target, even for the stuff that can be dealt with "easily":

One might think x = (a + b) as i32; in Rust (with a and b of type i16, and wrapping on overflow) could be x = a + b; in C. But of course C has a bunch of implicit promotions, and signed overflow is UB, so it's not. That might need to be, for example, x = (int32_t)(int16_t)((uint16_t)a + (uint16_t)b); to get the behaviour right.

Whereas in LLVM it's add i16 then sext i16 to i32 -- a much more direct and obvious translation of the behaviour.
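For anyone who wants to try the C side, here is a small sketch of that translation as a function. The name add_i16_widen is invented for this example, and it assumes wrapping semantics for the Rust addition (as with wrapping_add), a 32-bit int, and a compiler that wraps on narrowing conversions, which GCC and Clang document and C23 requires.

```c
#include <stdint.h>

/* Sketch of Rust's `a.wrapping_add(b) as i32` for i16 operands. */
int32_t add_i16_widen(int16_t a, int16_t b)
{
    /* Both operands are promoted to int before the addition, so the
     * sum itself cannot overflow; the cast back to uint16_t then
     * reduces it modulo 65536. */
    uint16_t wrapped = (uint16_t)((uint16_t)a + (uint16_t)b);

    /* Converting an out-of-range value to int16_t is
     * implementation-defined before C23; this assumes the usual
     * two's-complement wrap. */
    int16_t narrow = (int16_t)wrapped;

    /* The final widening is a plain sign extension. */
    return (int32_t)narrow;
}
```

With those assumptions, add_i16_widen(32767, 1) gives -32768, where a naive a + b in C would silently give 32768 because the addition happens in int.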


It absolutely does. In C parlance it's called the "strict aliasing rule". The broad rule is that you are not allowed to access (read or write) a value of type T through an lvalue of type U unless T == U (modulo CV-qualification) or U == signed/unsigned char.

Anything else is UB. So something like extracting the bits from a float by casting its address to a pointer to the equivalent-width unsigned type and then dereferencing it is instant UB. Such type punning is simply not possible before C99 IIUC; with the arrival of C99, it's doable via unions.

A bit of a nitpick: As well as qualified versions, U can also be the signed or unsigned version of T, if T is an integer type. This is an exception which is very useful for real code, but often elided in discussions of the strict aliasing rules, and I worry that people may imagine a strawman version of the rules that disallows it. There are a couple more minor exceptions; cppreference.com has the full list of rules.
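As an illustration of the float example, here is a sketch of two ways to get at the bits without going through a casted pointer. The function names are made up, and the code assumes float is a 32-bit IEEE-754 type. As far as I know, the memcpy version is fine in any C standard because it copies the object representation byte by byte, while the union version relies on the C99 punning allowance.

```c
#include <stdint.h>
#include <string.h>

/* Copy the object representation; this does not depend on the
 * strict aliasing rule at all. */
uint32_t float_bits_memcpy(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Write one union member and read another, which C99 and later allow. */
uint32_t float_bits_union(float f)
{
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;
}

/* For contrast, the version that violates strict aliasing:
 *     uint32_t bits = *(uint32_t *)&f;
 * It is left as a comment on purpose. */
```

A Rust-to-C backend would presumably have to emit something like the memcpy form for every transmute-like operation, which is part of why the generated C ends up so far from idiomatic.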


I have tried on several occasions to build a C backend for rustc. Each time I got stuck on generating valid C syntax. How do you turn something like fn() -> [fn(); 8] into C? And how do you handle statics when you don't know where relocations are in case of forward declarations and imports and thus can't produce a valid type with pointers at the right location? Or what about cases like casting a captureless closure to a function pointer where Rust will happily convert between fn(self: ZstClosureType, arg: T) and fn(arg: T)? Not every target ignores ZST types in its ABI. Some reserve a register for them: https://github.com/rust-lang/rust/blob/c54c8cbac882e149e04a9e1f2d146fd548ae30ae/compiler/rustc_ty_utils/src/abi.rs#L365-L376
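To show why fn() -> [fn(); 8] is awkward, here is one possible C spelling, offered only as a sketch. C functions cannot return arrays, so the array is wrapped in a struct and everything is named through typedefs. All names here (unit_fn, unit_fn_x8, make_table, call_all, bump) are invented for the example, and it says nothing about the ABI, statics, or ZST questions above.

```c
#include <stddef.h>

typedef void (*unit_fn)(void);                   /* Rust `fn()` */
typedef struct { unit_fn elems[8]; } unit_fn_x8; /* Rust `[fn(); 8]` */

static int calls;                   /* demo-only counter */
static void bump(void) { calls++; } /* demo-only `fn()` value */

/* Rust `fn() -> [fn(); 8]`: the array travels by value inside the
 * wrapper struct. */
unit_fn_x8 make_table(void)
{
    unit_fn_x8 table;
    for (size_t i = 0; i < 8; i++) {
        table.elems[i] = bump;
    }
    return table;
}

/* Takes a pointer to such a function, calls it, then calls every
 * element of the returned table and reports how many calls ran. */
int call_all(unit_fn_x8 (*make)(void))
{
    unit_fn_x8 table = make();
    calls = 0;
    for (size_t i = 0; i < 8; i++) {
        table.elems[i]();
    }
    return calls;
}
```

A generator would need to invent a wrapper type like unit_fn_x8 for every array type that appears in a signature, and emit those typedefs in dependency order, before it can even write the function prototype.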


All right, so I think it's pretty clear now that using C as a target is not a good idea :) . Thanks guys.
As I said, using C was an attempt to make things easier. After seeing how easy it can be to build a C compiler, I was hoping that the hard part of a compiler was the bytecode/machine code generation and optimization.
So by now it should be clear that it's not :) .
The borrow checker seems particularly complicated, and that definitely is not related to any code generation.
So I guess I have to give up. It's just that it looks so damn cool.

Well, it is a difficult problem. That's why LLVM (and Cranelift, and libgccjit, etc.) are such amazing projects: they're big projects developed by very talented people that make it possible to just write a compiler frontend and then hook it into the compiler backend API provided by the library, instead of having to implement an entire compiler backend yourself.

It's just that writing the frontend is also a difficult problem. A frontend for a relatively simple language like wasm can itself be relatively simple. For a more sophisticated language, you need a more sophisticated frontend. And Rust is one of the most sophisticated language frontends out there, at least among "production languages" rather than "research languages" (for what little the distinction is worth).

For compiling Rust? Yeah, probably.

For writing a compiler? No, go for it! Pick one of the resources to help you get started, and make up your own little language and grow it organically as you add the features you want. There's a lot to learn and grow in the process, even if you won't end up with a "useful" artifact in the end. And if you write your compiler in Rust, you maybe won't learn about Rust by writing a compiler for it, but you'll certainly learn about Rust by using it.


In WebAssembly there's a WASI standard runtime that probably could do it. It lacks C's setjmp.h, and its model for multithreaded environments is still being worked on. However, Rust compiles to that bytecode, and W2C2 translates that bytecode into the 1989-1990 edition of ANSI/ISO C. W2C2 is also written in that same edition of C and can translate its own source code back to C (in a messy fashion) if compiled into WebAssembly using WASI-SDK's Clang version. WebAssembly makes lots of stuff possible! (Or at least it soon will, once the runtimes are feature complete.)

I just want to add my thanks for that link, as someone who had a similar opinion of compilers. I’ve been working through it the past few evenings, translating the Pascal to Rust, but still producing 68k assembly for now. So far there’s a couple of code errors and some statements about compiler optimization that I’m sure have aged like milk, but as a non-CS graduate it’s certainly more accessible than many of the more formal sources. I’ve bookmarked Crafting Interpreters for next.

I think this recent advance counts as a Rust-to-C transpiler.