C to Rust translator

Whatever happened to the C to Rust translator projects? There was Corrode, and there was C2Rust. They're still around, but they just convert C to rather painful Rust which does the same thing:

C:
p[j] = p[j-1];

Rust:
*p.offset(j as isize) = *p.offset((j - 1 as libc::c_int) as isize);

This clearly works by turning subscripting into pointer arithmetic. Has anyone gotten as far as converting C arrays to Vec and using subscripts. That would produce Rust you could work on.

1 Like

I'm pretty sure the answer is no.

A while ago I've written a converter that prefers to generate readable code over semantically accurate one.

I've then used it to convert lodepng to pure Rust:

However, I'm not pursuing this further, because the result was underwhelming. I've spent a lot of time cleaning the output, only to realize I still have a C-style codebase, I've merely replaced clang with rustc. The minor inaccuracies in translation also introduced bugs. I was lucky lodepng had proper unit tests.

In the process I've realized that converting C code is way more work than just replacing pointers with slices. The are high-level idioms, and architectural choices typical for C programs that don't make a good Rust program. For example, instead of generics it used byte offsets of sizeof(type). It had ad-hoc buffer management and writes to raw pointers for I/O instead of io::Write, etc.

10 Likes

Huawei posted something recently about https://c2rust.com

It was mentioned earlier this week in their discussion on joining the Rust Foundation:

https://trusted-programming.github.io/2021-02-07/index.html

1 Like

They say: "We have created automated tools to refactor and clean up this generated Rust code through source-to-source transformations."

I can't find a link to it at the moment, but they showed some demo refactorings that looked nice.

As much as I love the idea of RiiR, I've always been an advocate of using the original library over FFI and getting a human to create an idiomatic wrapper around it. It's just not (currently) possible for a computer to understand things like architecture choices and high-level idioms from one language and convert them to the equivalent in another.

I'm not saying people shouldn't try to make these sorts of projects, I think being able to comprehend a codebase is an important step towards developing true AI after all, just that you won't see them in mainstream use for a while because of these sorts of issues.

1 Like

I was hoping to convert OpenJPEG, which is a decompressor for JPEG 2000 files. It works, but it has multiple known security holes and is the subject of at least three security advisories. it's used in PDF readers and is a common attack vector. Converted to safe Rust, bad images would make it subscript out of range and abort, not let someone attack the computer. Converting it to unsafe Rust with pointer manipulation just preserves the security holes.

There's a FFI interface to it from Rust now. It's no safer than the C version, of course.

So that's a use case for not using the FFI approach.

1 Like

That right there tells me why this is an impossible task. Or at least an unsolved research problem that AI might manage one day.

The C language does not really have any notion of arrays. Despite what it might look like in the syntax.

For example often arrays are passed into functions with signatures like:

int myFunc (int* myArray, int myArrayLen)

What can we make of that?

Apparently "myArray" is a pointer to an int. Or is it a pointer to many consecutive ints? If that latter how may consecutive `int's ? How can we know?

Well, the human programmer sees the other parameter, 'myArrayLen', and assumes that whoever wrote that code is not crazy and has not written buggy code and therefor that myArrayLen likely means that the pointer is actually a pointer to many consecutive ints and not just one.

It's going to be really hard for a machine translator to figure this out so that it could create a function that accepts a Vec instead.

The C language has no notion of strings either. There is no such type in C. There is char* and there is the convention that a string ends with a byte of zero. Might be clear to a human reader. Virtually impossible for an algorithm to figure out.

Which is why programs like c2rust create the unintelligible mess of an output that they do.

2 Likes

It's not just C that's hard to translate to a readable form of another language. RatFor was a Fortran-like language with more modern syntax that got translated to Fortran. Unfortunately, you had to look at the generated output when debugging. It was almost impossible to figure out what was going on in your program. A c2rust translator is going to have both the RatFor problem and the one @ZiCog points out.

I was originally going to suggest porting it one function at a time but I just had a look at the project and it's something like 200k lines of C, which is a bit... impractical.

Are there any alternative libraries out there? I feel like 200k lines is a little excessive for an image format parser and wouldn't be surprised if a library like pest or nom let someone write something that'll get the job done in about 2-5k lines.

Can you isolate any specific algorithms? If you can extract and test against against a small subset of the full code you might be able to make progress more easily, even if you have to write some amount of glue code from scratch.

Of course, a translation of a program must preserve its semantics. Therefore, a correct translation of a buggy program has the same bugs of the original. So, I cannot see the point of such translation. You should first remove all the bugs from your C programs, and then you can hope to have a correct Rust translation. It's like to say "I have some sentences in French, but they contain some contradictions, so I want to translate them into English".
The fact that safe Rust has less bugs than C means that it is impossible to translate any C code to Rust.

1 Like

If you want "abort on bad memory access" semantics and don't mind a moderate slowdown you could compile the C code with clang's -fsanitize=address.

2 Likes

So many "it can't be done" comments. Here's an outline of how to approach the problem.

Here's some C code with pointer arithmetic from OpenJPEG. This module has has buffer overflows reported in security advisories, so it's a good place to start.

static void opj_dwt_deinterleave_h(const OPJ_INT32 *OPJ_RESTRICT a,
                                   OPJ_INT32 *OPJ_RESTRICT b,
                                   OPJ_INT32 dn,
                                   OPJ_INT32 sn, 
                                   OPJ_INT32 cas)
{
    OPJ_INT32 i;
    OPJ_INT32 * OPJ_RESTRICT l_dest = b;
    const OPJ_INT32 * OPJ_RESTRICT l_src = a + cas;

    for (i = 0; i < sn; ++i) {
        *l_dest++ = *l_src;
        l_src += 2;
    }

    l_dest = b + sn;
    l_src = a + 1 - cas;

    for (i = 0; i < dn; ++i)  {
        *l_dest++ = *l_src;
        l_src += 2;
    }

So how would an analyzer proceed?

First, expand OBJ_RESTRICT, which is defined as C "restrict". That means no aliasing. Nice to know.

Which parameters are arrays? This function calls no other functions, so this is decidable from the code here. Parameters a and b are potentially arrays.

Pointer arithmetic is detected at *l_dest++ and at l_src = a + cas. l_dest gets its value from b. So parameters a and b are definitely arrays. *l_dest++ appears on the left side of an assignment, so b is mutated.

We can now write the function definition in Rust.

fn opj_dwt_deinterleave_h(a: &[i32], b: &mut[i32], dn: i32, sn: i32, cas: i32)

That's straightforward, safe Rust.

OK, here's a pointer that gets mutated.

`OPJ_INT32 * OPJ_RESTRICT l_dest = b;`

What do we do there? l_dest always points to
some part of b. So it can be expressed as a
subscript into b. So

let mut l_dest: usize = 0;

Next,

const OPJ_INT32 * OPJ_RESTRICT l_src = a + cas;

which always points to some part of a. So

let mut l_src: usize = cas as usize;

Now the hard part, pointer arithmetic in the first FOR loop.

for (i = 0; i < sn; ++i) {
    *l_dest++ = *l_src;
    l_src += 2;
}

The "for" statement is such a common C idiom that it needs to be recognized. Corrode does that. So

for i in 0..sn {

Now, we know that l_dest points to somewhere in b, and l_src points to somewhere in a. So we just have to calculate the subscript.

b[l_dest] = a[l_src];

Now we need to figure out the incrementation.

l_dest++

increments by one size of the thing pointed to,
so it's

l_dest += 1;

and

l_src += 2;

And that's the first loop, converted to safe Rust.

This is a bit long, but I wanted to show that this is not an impossible problem.

There are hard cases to resolve. Those involve cases where you have a pointer that's hard to tie back to the underlying array. Those may need human assistance. But most code doesn't do that, unless it's a storage allocator. If you can determine which array a pointer points to, you can convert pointer arithmetic into subscripts automatically.

After conversion to Rust, subscript checking will protect against buffer overflow attacks.

So, this is not impossible. Just hard. A good thesis project for someone getting their CS degree.

1 Like

Oh, it totally does. It's spectacularly unhelpful, though, when arrays are passed to functions due to array-to-pointer decay. But if there's no function boundary between the definition and use of an array, it pretty much works as expected (e.g. sizeof is correct).

As usual with halting problem-like constructs, I suspect it's feasible in many practical cases, feasible but very (perhaps prohibitively) difficult in some others, and theoretically impossible in the rest. I'm sure you have to read the above comments as "impossible in the general case", not as "always impossible".

Incidentally, alias analysis, provenance analysis, and in general data flow analysis is the lion's share of the work when it comes to optimizing compilers, so there's been a ton of research into the specific problem of "which pointer points into which allocation". Maybe the implementation of LLVM (or that of another open source, well-designed, readable, optimizing compiler) could serve as inspiration for this task.

That's on a good day. You're going to find plenty of int myFunc (int* myArray) where the size of the array is implied from context.

For example, in image processing it's common to take just a uchar pointer to image data, and expect it to have a size such as width*height*(bits_per_pixel + 7)/8).

1 Like

Yes, I was a bit extreme there. Let's just say that:

"does not really have any notion of arrays" == "has only a minimal notion of arrays"

I would say that even with a reasonable/not crazy programmer, that function signature is still ambiguous.
Like, here is a few way that this C signature can be interpreted in Rust

fn myFunc(myArray: &[isize]); // read only

fn myFunc(myArray: &mut [isize]); // mutable

fn myFunc(myArray: Box<[isize]>); // owned

fn myFunc(myArray: &mut [MaybeUninit<isize>]); // write buffer

fn myFunc(myArray: Box<[MaybeUninit<isize>]>); // owned write buffer

If we do ever get a nice C-to-Rust translator, it will also have to infer all this information, probably from the function body if it was ever feasible.

3 Likes

Exactly. I used to do proof of correctness work,where you can run into undecidability. However, if you find completing a loop or staying within array bounds to be approaching undecidablity, the program is essentially broken. That's the position Microsoft took with the Static Driver Veriifer - if it can't prove safety in some number of minutes, the submitted driver is broken.

Worst case is you have to fall back to "fat pointers", where you carry along a reference to the array and an offset into the array so you can check at run time.

Incidentally, alias analysis, provenance analysis, and in general data flow analysis is the lion's share of the work when it comes to optimizing compilers, so there's been a ton of research into the specific problem of "which pointer points into which allocation". Maybe the implementation of LLVM (or that of another open source, well-designed, readable, optimizing compiler) could serve as inspiration for this task.

Yes. This sort of flow analysis is a known problem in optimizing compilers.

It's a good problem for someone to work on. I'm off on a different hard problem this year, but, as I said before, this is a good thesis topic. Could even lead to a startup.

1 Like

I think you're confusing the notion of array with the one of slice.

C has the notion of arrays, you can store them in structs and even pass them to functions. The problem is that they work like rust's arrays, you need to statically know their size and that doesn't happen very often.

Slices however are non existant, you need to manually pass a pointer and a length.