Rust programs automatically optimized for CPU+FPGA+GPU

vitiral · June 16, 2017, 7:28am

I read a very interesting article recently. One of the items that popped out to me the most was:

ARM chips are already integrated with all leading FPGA solutions. ARM chips are low power at the cost of performance but GPU’s are extremely fast and also power efficient so the GPU can provide the processing muscle while the ARM cores can handle the mundane IO and UI management tasks that don’t demand a lot of compute power. The growing body of Big-data, HPC, and especially machine learning applications don’t need Windows and don’t perform on X86.

This got me thinking... is it possible that parts of the LLVM IR could be compiled into an HDL and offloaded to an FPGA or GPU -- and that the compiler could figure that out for us?

It turns out... you already CAN compile LLVM to HDL: this project allows compiling C99, OpenCL and some of C++ LLVM "down to synthesizable processor RTL (VHDL and Verilog backends supported) and parallel program binaries.".

I am willing to bet Rust would be uniquely positioned as a programming language for automatically optimizing FPGA / GPU code running on hybrid CPUs. The lifetime system could tell the compiler a LOT about where the data dependencies are, which could significantly reduce the FPGA binary size (I would think). Not to mention that the concurrency primitives are pretty simple as well.
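
As a small illustration of the kind of information the borrow checker already has today, here is a minimal sketch (the function name is made up for this example, and nothing here targets an FPGA or GPU). `split_at_mut` returns two non-overlapping mutable slices, so the compiler can verify that the two threads never touch the same element:

```rust
use std::thread;

/// Doubles every element, processing the two halves on separate threads.
/// `split_at_mut` hands back two non-overlapping `&mut` slices, so the
/// borrow checker can verify at compile time that the threads never
/// write to the same element.
pub fn double_in_two_halves(data: &mut [u32]) {
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    // Scoped threads may borrow local data; both are joined when the scope ends.
    thread::scope(|s| {
        s.spawn(|| left.iter_mut().for_each(|x| *x *= 2));
        s.spawn(|| right.iter_mut().for_each(|x| *x *= 2));
    });
}
```

Whether a hardware backend could actually exploit this knowledge is an open question; the sketch only shows that the independence of the two halves is known at compile time.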

HadrienG · June 16, 2017, 9:31am

Not meaning to distract from your post's main point, which I find extremely interesting and will come back to at the end, but as someone working in the field of scientific computing, I would like to warn about the common simplification, featured in this article among many others, that GPUs, FPGAs, or the latest fancy ANN ASIC are fast and performance-focused while CPUs are slow and should be relegated only to low-performance tasks.

This train of thought is partly driven by genuine performance advantages of alternative hardware solutions (e.g. more FLOPS/main memory bandwidth for GPU), but also fueled by decades of obscenely biased vendor-funded benchmarking that will not shy away from comparing a full GPU to a single CPU core (protip: always ask for, and examine carefully, the details of benchmarking protocols), or from artificially inflating a benchmark's problem size to unrealistic dimensions in order to measure raw DRAM bandwidth/latency as opposed to the characteristics of the appropriate layer of the CPU cache hierarchy.

The situation is not helped either by less knowledgeable technology commentators who blindly regurgitate the marketing terms and material of hardware manufacturers, for example by calling "processing cores" what is actually closer to a SIMD lane or a hyperthread in Intel jargon, both of which exhibit very different performance characteristics from the physical CPU cores that they are being compared to. Or by treating ARM chips with integrated GPUs as a new and revolutionary thing, even though we've had them for years now.

In practice, the hardware that will perform best depends significantly on the characteristics of your problem, and on the amount of manpower which you are willing to expend into tuning your algorithm to fit the characteristics of each individual chip. In this respect, CPUs remain arguably the most flexible and "forgiving" computing hardware available today, making any announcement of their death as a platform for performant software greatly exaggerated. The reality is more that they are getting increasingly complemented by more specialized chips that perform some common tasks such as linear algebra (a lot) better.

Which finally allows me to get back on topic: yes, we really need more tools to ease porting CPU code to other platforms in order to better explore the increasingly heterogeneous computing ecosystem that we have today and see what works best for each individual problem.

jonh · June 16, 2017, 11:22am

To keep it short I will just say the article is low on technical facts and high on statistics.

As you say, you CAN convert automatically. Getting optimized output, on the other hand, is unlikely.
Rust does not optimize for multi-core CPUs. It is down to the programmer to add threading correctly to get optimal results. Use of CPU SIMD is also currently mostly down to the programmer, by adding a library.
GPUs are best for high-volume SIMD. "Data dependencies" and "concurrency primitives" are both CPU-focused; what GPUs want is no dependencies at all, so that processing can be done in parallel.
(Can't comment much constructively about how FPGAs differ.)
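
To make the "no dependencies" point concrete, here is a minimal sketch with two made-up functions. In the first, every output element depends only on the matching input element. In the second, each output depends on the previous one:

```rust
/// Element-wise scale: output[i] depends only on input[i], so every
/// iteration is independent and could run in any order, or all at once.
pub fn scale(input: &[f32], factor: f32) -> Vec<f32> {
    input.iter().map(|x| x * factor).collect()
}

/// Running sum: output[i] depends on output[i - 1]. Written this way,
/// the loop-carried dependency forces the iterations to run in order.
pub fn running_sum(input: &[f32]) -> Vec<f32> {
    let mut acc = 0.0;
    input
        .iter()
        .map(|x| {
            acc += x;
            acc
        })
        .collect()
}
```

The first shape maps naturally onto wide SIMD hardware. The second can also be parallelized (parallel prefix-sum algorithms exist), but only by restructuring the algorithm, which is the kind of work that tends to fall on the programmer rather than the compiler.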

In addition to @HadrienG's comment: when you hear about the wonders of hardware acceleration, what often goes unmentioned is the cost of loading data onto and off the device. (The article blames Intel for this.) That cost will always exist; how large it is depends on the task and the hardware architecture being used.
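
A back-of-the-envelope way to think about that cost, as a sketch only: the struct, its fields and the simple "fixed latency plus bytes over bandwidth" model below are all assumptions made for illustration, not something taken from the article or from any real library.

```rust
/// Hypothetical cost model for one offloaded task. All times are in
/// seconds; none of these names come from a real library.
pub struct OffloadEstimate {
    /// Time to run the task on the CPU alone.
    pub cpu_time: f64,
    /// Time the accelerator needs once the data is on the device.
    pub device_time: f64,
    /// Total bytes copied to and from the device.
    pub bytes_moved: f64,
    /// Sustained link bandwidth in bytes per second.
    pub link_bandwidth: f64,
    /// Fixed per-transfer overhead (driver calls, launch latency).
    pub fixed_latency: f64,
}

impl OffloadEstimate {
    /// Time spent just moving data on and off the device.
    pub fn transfer_time(&self) -> f64 {
        self.fixed_latency + self.bytes_moved / self.link_bandwidth
    }

    /// Offloading only wins if compute plus transfer beats the CPU.
    pub fn offload_pays_off(&self) -> bool {
        self.device_time + self.transfer_time() < self.cpu_time
    }
}
```

With a model like this, a task that is ten times faster on the device can still lose to the CPU when the transfer time is comparable to the CPU run time.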

Botev · June 16, 2017, 12:05pm

From my personal experience in HPC, specifically ML on GPUs, the choice of language for programming the kernel or interacting with the device does not really make that much of a difference. Whether I use something like pycuda, Matlab or C++, the overhead is minimal. In these situations I think two things are a lot more important:

Understanding the device architecture and how to actually write a very good kernel.
Understanding the whole problem the program solves, and how to optimize memory usage and combine operations where possible.

This is one of the main reasons why tools like Tensorflow, Theano, Pytorch etc. are currently popular: they do both.
And yes, having Rust do this would be nice, but I think it would be very hard to get a lot of people on that train without good reasons for demand.

vitiral · June 17, 2017, 6:56am

Thanks for the great replies.

My design for a Rust HDL compiler would be that

it is hybrid: some operations run on the CPU or explicitly on the GPU, while others run on a program-specific FPGA.
you could require some functions to be run on the FPGA with something like #[fpga(always, cycles=num)]. The compiler would then make sure that your program could actually be compiled to an FPGA within the required number of cycles.
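
No such `#[fpga(...)]` attribute exists, so the following is only a rough sketch of the "compiler checks the cycle budget" idea using what stable Rust offers today: a compile-time assertion. The cycle formula, the budget and every name here are invented for illustration.

```rust
/// Hypothetical cycle estimate for a kernel doing `taps`
/// multiply-accumulate steps. The formula is made up for illustration.
pub const fn estimated_cycles(taps: u32) -> u32 {
    2 * taps + 4
}

/// Hypothetical budget, playing the role of `cycles=num`.
pub const CYCLE_BUDGET: u32 = 32;

/// Number of filter taps in the example kernel.
pub const FIR_TAPS: u32 = 8;

// Evaluated at compile time: the build fails if the estimate exceeds the budget.
const _: () = assert!(estimated_cycles(FIR_TAPS) <= CYCLE_BUDGET);

/// One output sample of a FIR filter: the dot product of a window of
/// samples with the coefficients. This is the kind of fixed-size kernel
/// one might want to pin to an FPGA.
pub fn fir_step(
    window: &[i32; FIR_TAPS as usize],
    coeffs: &[i32; FIR_TAPS as usize],
) -> i32 {
    window.iter().zip(coeffs.iter()).map(|(w, c)| w * c).sum()
}
```

Raising `FIR_TAPS` to 16 would push the estimate to 36 and break the build, which is roughly the behaviour the attribute is meant to provide. A real implementation would need an actual timing model from the synthesis tools rather than a hand-written formula.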

Hopefully this would help a little bit with some of the negatives. And yes, this wouldn't remove the need for explicit design for CUDA, etc.

Botev · June 19, 2017, 2:01am

What exactly is the use case for this compiler? What is it going to achieve, or what gap is it bridging?
Is it just that you think it would be cool to do this in Rust because you like it, or is there a real goal, e.g. targeting mathematical HPC, some form of parallel simulations, tree searches with HPC...? I think it is important to start from there rather than from what you want, unless it is option 1.

vitiral · June 19, 2017, 4:19am

The main goal would simply be to be able to create the fastest possible implementation but without needing much (if any) specialized code.

The goal is to make Rust the best general purpose programming language for running as fast as possible and being ultra low power, with very little effort from the programmer.

Botev · June 19, 2017, 7:21am

So this is more or less not possible. Specialised code, especially in HPC, is going to be much harder to optimize without architectural knowledge... just check how many research papers there are on how to optimise something as basic as GEMM...
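
For readers who have not looked at GEMM before, here is a minimal sketch of why even this "basic" operation is sensitive to the hardware. Both functions (names chosen for this example) compute the same product of square row-major matrices and differ only in loop order:

```rust
/// C = A * B for row-major n x n matrices, textbook i-j-k loop order.
pub fn gemm_ijk(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut c = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..n {
            let mut sum = 0.0;
            for k in 0..n {
                // b is read with stride n here: one element per row.
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    c
}

/// Same product with i-k-j loop order: the inner loop now walks
/// a row of b and a row of c contiguously.
pub fn gemm_ikj(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut c = vec![0.0; n * n];
    for i in 0..n {
        for k in 0..n {
            let a_ik = a[i * n + k];
            for j in 0..n {
                c[i * n + j] += a_ik * b[k * n + j];
            }
        }
    }
    c
}
```

The second ordering is typically friendlier to CPU caches, but how much it helps depends on the matrix size and the machine, and nothing was measured here. Tuned libraries go much further (blocking, vectorization, per-architecture kernels), which is where the architectural knowledge comes in.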

From what you wrote, I assume your target is to be able to run Rust on any platform, e.g. by enabling a compiler flag for converting LLVM IR to HDL when needed?

vitiral · June 19, 2017, 4:58pm

I'm sure each architecture would have to have its own compiler extensions. It probably wouldn't be easy until there was a standardized way of dealing with FPGAs inline.

I'm imagining that the kernel could support binaries that contain FPGA code, with a standardized method of loading that code onto the FPGA when the program is run, similar to how memory is loaded now. The compiler would then create binaries with this layout (for architectures that support it, i.e. linux-arm64-hdl or something).

femomelo · November 15, 2017, 2:29am

Hi,

Good idea, but let's focus on: "The lifetime system could tell the compiler a LOT about where the data dependencies"

I think this can be very useful for HSA, http://www.hsafoundation.com. It has a similar approach, focused strictly on APUs (CPU+GPU), where the latency to access the GPU is lower, making the solution less complex.

Maybe if HSA used Rust, or if rustc had an option for HSA LLVM IR.

HSA already exists, but it could be better with Rust.
