Rust-accel linking raw *.cu files

  1. This is a bit of a desperate hail mary.

  2. For those using (already filed a ticket there), is there a way to call CUDA device functions defined in raw *.cu files from the #[kernel] functions of rust-accel?

Long Version:

rust-accel is a crate that lets us write CUDA functions in Rust (by using the nvptx backend). Unfortunately, it's not clear whether CUDA's volatile keyword is supported in rust-accel.

Thus, I would like to write part of my kernel in raw *.cu (which allows me to use the volatile keyword). Then, I want to call these CUDA device functions in raw *.cu from the #[kernel] rust-accel functions that I define.

It is not obvious to me how to do this.

Has anyone done this before? If so, an example would be greatly appreciated.

I'm pretty sure accel does not currently support linking to custom PTX files. As you've no doubt noticed, CUDA support in Rust is limited at the moment.

Excusing my ignorance, PTX is basically just "CUDA Assembly" right?

So it seems that if we can do
*.cu -> ptx (nvcc)
#[kernel] *.rs -> ptx (rust-nvptx)
multiple *.ptx -> CUDA bin file (nvcc?)

Then it should be possible to get rust-accel to call functions defined in raw *.cu files.

In theory, is this right?

You're right that there's no fundamental reason why it would be impossible, I just don't think it's supported currently.

Combining multiple PTX files is one limitation. This would require a linker, and as far as I can tell there is no well-supported linker that works for PTX files (rust-ptx-linker links LLVM bitcode and then uses it to generate PTX files, and the only other one I can find appears to be abandoned).

It is possible to link PTX files together at runtime. There are functions in the CUDA driver API for it, at least. However, I don't think accel exposes this feature. Accel also doesn't easily expose the PTX files that it generates, so that's another problem. I also don't know if every CUDA driver supports this feature.

I'd like to say that you could just use my RustaCUDA library instead, with rust-ptx-builder to compile your Rust code to PTX, but RustaCUDA also doesn't expose the runtime linking feature yet. I probably won't get around to building that out any time soon, either.

That being the case, I think your options are as follows:

  • Call cuda-sys functions directly to do your linking at runtime.
  • Contribute support for just-in-time linking to Accel or RustaCUDA, and then use that
  • Write your kernels in CUDA C and then load and launch them with RustaCUDA

Honestly, my recommendation would be that last one (though I might be biased). Support for compiling Rust to PTX is really limited and unstable at this point: it's missing important features like shared memory, volatile reads, and vector loads/stores, and even basic things like the syncthreads intrinsic don't work. There's an unofficial group working on improving this (and I am part of that group), but it's still early days; we're still trying to make sure that the compiler even works reliably.
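To illustrate the last option, here is a minimal sketch of what such a CUDA C file might look like. The names `load_volatile`, `copy_volatile` and `DEVICE_FN` are made up for this example and do not come from RustaCUDA or accel. The `__CUDACC__` guard is only there so that the small helper can also be compiled and checked with an ordinary C++ compiler.

```cpp
#include <cstddef>

// nvcc defines __CUDACC__, so the CUDA qualifiers apply there. With a plain
// C++ compiler the macro expands to nothing and the helper is ordinary code.
#ifdef __CUDACC__
#define DEVICE_FN __host__ __device__
#else
#define DEVICE_FN
#endif

// Hypothetical helper: reading through a volatile pointer forces the compiler
// to issue a real load on every call instead of reusing an earlier value.
DEVICE_FN inline int load_volatile(const volatile int* p) {
    return *p;
}

#ifdef __CUDACC__
// extern "C" keeps the kernel name unmangled in the generated PTX, so the
// host side can look it up as "copy_volatile".
extern "C" __global__ void copy_volatile(const int* in, int* out, std::size_t n) {
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) {
        out[i] = load_volatile(in + i);
    }
}
#endif
```

Assuming the file is saved as kernel.cu (a hypothetical name), something like `nvcc --ptx kernel.cu -o kernel.ptx` should produce the PTX. On the Rust side, the PTX text can then be handed to RustaCUDA's `Module::load_from_string` and the kernel started with its `launch!` macro.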

@bheisler : Completely true story: I didn't even realize you were the author of Rustacuda ... and after looking over all the "Cuda" crates on, independently from you, decided that "use Rustacuda + write everything in cuda C" was the right approach.

I'm glad we got to the same conclusion, and thanks for all your effort in creating Rustacuda!

Haha, no worries. Yeah, hopefully in a year or two we'll have a better story for compiling Rust to PTX. In the meantime, I hope RustaCUDA works out for you. Let me know if you have any trouble with it!

What tool are you using to write *.cu files?

I'm using IntelliJ/Rust for Rust, but can't find a corresponding plugin for IntelliJ/Cuda.

NVidia appears to have Eclipse/Nsight, but I'm not sure if it's worth it to get familiar with Eclipse just for the sake of a few *.cu kernels.

@bheisler : I find that many of my kernels end up being "map" or "reduce" over either a vector or a matrix. Are you writing kernels in raw *.cu, or have you found some C++ lib that has nice abstractions for mapping/reducing over Cuda Arrays AND also generates ptx that works well with Rust?

In my projects so far the code required to implement my own map or reduce loops over an array has been trivial compared to the function that I'm applying to the elements of the array, so I just write my own.
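As a rough illustration of how small such hand-written loops can be, here is a sketch of a map kernel and a sum-reduce kernel built around one per-element function. All names (`scale_and_shift`, `map_kernel`, `reduce_sum_kernel`, the `*_host` helpers) are invented for this example, and the reduction uses a simple `atomicAdd` per thread rather than a tuned shared-memory scheme. The CPU versions are only there so the per-element function can be checked without a GPU.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#define DEVICE_FN __host__ __device__
#else
#define DEVICE_FN
#endif

// Hypothetical per-element function; in practice this is where most of the
// real work would live.
DEVICE_FN inline float scale_and_shift(float x) {
    return 2.0f * x + 1.0f;
}

#ifdef __CUDACC__
// Map: grid-stride loop, so any launch configuration covers all n elements.
extern "C" __global__ void map_kernel(const float* in, float* out, std::size_t n) {
    std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    for (std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
         i < n; i += stride) {
        out[i] = scale_and_shift(in[i]);
    }
}

// Reduce: each thread sums its own elements, then adds the partial sum to
// *total atomically. The caller must set *total to zero before the launch.
extern "C" __global__ void reduce_sum_kernel(const float* in, float* total, std::size_t n) {
    std::size_t stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
    float partial = 0.0f;
    for (std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
         i < n; i += stride) {
        partial += scale_and_shift(in[i]);
    }
    atomicAdd(total, partial);
}
#endif

// CPU reference versions of the same two loops.
inline void map_host(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = scale_and_shift(in[i]);
    }
}

inline float reduce_sum_host(const float* in, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        total += scale_and_shift(in[i]);
    }
    return total;
}
```

Because floating-point addition is not associative, the GPU reduction may differ from the CPU reference in the last few bits for large inputs.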

As for C++, in theory PTX generated from C++ should work fine (subject to the normal limitations on data structure layout in FFI). RustaCUDA does nothing with the PTX files except pass them on to CUDA so it should work with any valid PTX file.

@bheisler :

  1. How do you debug Cuda kernels that generate runtime errors when called from Rust?

  2. Are you able to use any of the tools listed on NVIDIA's "Debugging Solutions | NVIDIA Developer" page?

  3. Do you need to set up custom C/C++ calling code that mimics what Rust does (relative to the kernel) so you can debug the kernel?

I don't know - I haven't tried debugging CUDA kernels launched from Rust before. I expect it would be the same as debugging CUDA kernels launched from C/C++, although the debugger might get a bit confused by rustc's name-mangling of the host-side functions if it displays that.
If you figure it out, you might write a blog post or something similar and share it.

@bheisler : How do you debug CUDA kernels then? Do you test them first from C/C++, and only call them from Rust once everything is working?