TheRustBandwidthBenchmark

Hello Rustaceans,

I'd like to showcase a small project I did a few months ago: a port of TheBandwidthBenchmark, a C + OpenMP collection of simple microkernels used for benchmarking. Inspired by the benchmarks of John McCalpin ("Dr. Bandwidth"), TheBandwidthBenchmark offers more sophisticated kernels with different data transfer rates, allowing one to measure effective memory bandwidth. These microkernels are used in HPC to take basic measurements of a node or machine. I call my Rust port TheRustBandwidthBenchmark.

Link to my project: GitHub - adityauj/TheRustBandwidthBenchmark: This is a collection of simple streaming kernels.

The general output format of the code is as follows:

Benchmarking with 8 threads.
Total allocated datasize: 3840.00 MB.
Initialization of arrays took : 506.008814ms.
----------------------------------------------------------------------------------------------------------
Function        | Rate(MB/s)      | Rate(MFlop/s)   | Avg time       | Min time        | Max time        |
----------------------------------------------------------------------------------------------------------
Init:           | 8923.15         | -               | 0.1120         | 0.1076          | 0.1372          |
Sum:            | 19562.93        | 2445.37         | 0.0549         | 0.0491          | 0.0883          |
Copy:           | 11859.23        | -               | 0.1655         | 0.1619          | 0.1868          |
Update:         | 17723.62        | 1107.73         | 0.1100         | 0.1083          | 0.1143          |
Triad:          | 13162.47        | 1096.87         | 0.2207         | 0.2188          | 0.2255          |
Daxpy:          | 18254.28        | 1521.19         | 0.1604         | 0.1578          | 0.1643          |
STriad:         | 14149.89        | 884.37          | 0.2732         | 0.2714          | 0.2819          |
SDaxpy:         | 17545.77        | 1096.61         | 0.2219         | 0.2189          | 0.2240          |
----------------------------------------------------------------------------------------------------------
Solution Validates
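For readers who want to check the table, the numbers appear consistent with each rate being computed from the minimum time, with 1 MB = 10^6 bytes. The sketch below is my reading of the output, not code from the repository: the function names are hypothetical, and the element count of 120,000,000 per array is inferred from 3840 MB spread over four f64 arrays.

```rust
/// Bandwidth in MB/s (1 MB = 10^6 bytes) for `bytes_moved` bytes in `seconds`.
fn rate_mb_per_s(bytes_moved: f64, seconds: f64) -> f64 {
    bytes_moved / seconds / 1.0e6
}

/// Floating-point rate in MFlop/s for `flops` operations in `seconds`.
fn rate_mflop_per_s(flops: f64, seconds: f64) -> f64 {
    flops / seconds / 1.0e6
}

/// Rates for Triad, a[i] = b[i] + s * c[i]:
/// three f64 streams (two loads, one store) and two flops per element.
/// `n` is the number of elements per array (an inferred value, see above).
fn triad_rates(n: usize, min_seconds: f64) -> (f64, f64) {
    let bytes = 3.0 * 8.0 * n as f64;
    let flops = 2.0 * n as f64;
    (
        rate_mb_per_s(bytes, min_seconds),
        rate_mflop_per_s(flops, min_seconds),
    )
}
```

Calling `triad_rates(120_000_000, 0.2188)` gives roughly 13163 MB/s and 1097 MFlop/s, which matches the Triad row up to the rounding of the printed minimum time.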

For now, the closest I could get to OpenMP's static scheduling was to use Rayon. However, I have since learned that Rayon and OpenMP work differently: Rayon uses work stealing while the microkernel executes, whereas OpenMP with static scheduling divides most of the workload among the threads before the microkernel runs.
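To make the scheduling difference concrete, here is a minimal sketch of a static partition written with `std::thread::scope` from the standard library instead of Rayon. It is not the code from my repository, and `triad_static` is a hypothetical name: every thread receives one contiguous block decided up front, which is roughly what OpenMP's `schedule(static)` does, and no work is stolen afterwards.

```rust
use std::thread;

/// Triad a[i] = b[i] + s * c[i] with a static partition:
/// the arrays are cut into at most `num_threads` contiguous blocks
/// and each block is handled by exactly one thread.
fn triad_static(a: &mut [f64], b: &[f64], c: &[f64], s: f64, num_threads: usize) {
    assert!(num_threads > 0);
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), c.len());
    if a.is_empty() {
        return; // chunks_mut(0) would panic, so handle the empty case first
    }
    // Block size is ceil(len / num_threads); the last block may be shorter.
    let chunk = (a.len() + num_threads - 1) / num_threads;

    thread::scope(|scope| {
        let blocks = a
            .chunks_mut(chunk)
            .zip(b.chunks(chunk))
            .zip(c.chunks(chunk));
        for ((a_blk, b_blk), c_blk) in blocks {
            // Each spawned thread owns its block of `a` exclusively.
            scope.spawn(move || {
                for ((x, y), z) in a_blk.iter_mut().zip(b_blk).zip(c_blk) {
                    *x = *y + s * *z;
                }
            });
        }
        // All threads are joined when the scope ends.
    });
}
```

One caveat with this sketch: it spawns fresh threads on every call, whereas OpenMP and Rayon both reuse a pool, so the spawn cost would have to be kept out of the timed region or amortized over large arrays.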

The Rust + Rayon implementation was also a bit slower than the C + OpenMP version. As you can see in my code, I had to use the zip operator to iterate over the elements of several slices together. When I ran perf annotate to find the hotspots in the kernels, the zip operator accounted for 30%-40% of the time, which led to poorer performance than the C implementation. The C implementation achieves higher effective bandwidth, which I had hoped the Rust implementation could match. In general, Rust might not be the best fit for microkernel benchmarks like these, but I did enjoy learning Rust.
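For comparison, here is a sequential sketch of the same kernel written two ways, with hypothetical function names rather than the ones in my repository. The first uses zip as described above; the second uses plain indexing after re-slicing all three arrays to a common length, which usually lets the compiler drop the per-element bounds checks. I have not measured whether the second form closes the gap. It is also possible that perf charges the time to the inlined zip code simply because that is where the loads and stores sit in a memory-bound loop.

```rust
/// Triad a[i] = b[i] + s * c[i] using iterator zips.
fn triad_zip(a: &mut [f64], b: &[f64], c: &[f64], s: f64) {
    for ((x, y), z) in a.iter_mut().zip(b).zip(c) {
        *x = *y + s * *z;
    }
}

/// The same kernel with an indexed loop.
/// Re-slicing to the common length `n` up front gives the compiler
/// enough information to prove that every index below is in bounds.
fn triad_indexed(a: &mut [f64], b: &[f64], c: &[f64], s: f64) {
    let n = a.len().min(b.len()).min(c.len());
    let (a, b, c) = (&mut a[..n], &b[..n], &c[..n]);
    for i in 0..n {
        a[i] = b[i] + s * c[i];
    }
}
```

Both functions stop at the shortest of the three slices, so they produce identical results and can be swapped in a benchmark to compare the generated assembly.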

I would be happy to hear your thoughts on the current code, so that I can improve it and bring it up to the C + OpenMP level.

This post is just to showcase something I created as a side project and to get the community's opinion.

I have really become a Rust fanatic and want to keep contributing to the Rust community. I am reading the excellent book Effective Rust by David Drysdale, and I also follow Jon Gjengset on YouTube. Any other material, links, or suggestions would be really appreciated =)
