Performance issues with AtomicU8?

Hello,

I have written my first Rust program here:

It takes an image and adds a constant value to the rgb-values for all pixels.
I have one sequential implementation and two parallelized test implementations.

There are three functions: profile_sequential runs the code sequentially in memory, profile_parallel_batches runs in parallel but over sequential batches, and profile_parallel_blocks divides the image into logical blocks (no data restructuring) and processes them in parallel.

I have even been generous to the parallelized code in the measurements by not counting, e.g., the Vec<u8> to Vec<AtomicU8> conversions back and forth, since I'm mostly interested in the raw performance and want to assume I have good input data for the pixel processing.

Still the performance suffers on my computer with 12 cores.
The best I can get is about half the execution time which I think is poor.
Posting one example run here:

Image: "pillars.png", dimensions: (675, 675), ColorType: Rgba, BitDepth: Eight
Stored seq, batch, block in: "out_seq.png", "out_batch.png" and "out_block.png"
Number of logical cores: 12
Number of samples: 1000
Performing performance analysis on sequential code
Performing performance analysis on parallel code with batch_size: 4096
Performing performance analysis on parallel blocks with size: (64, 64)
Sequential     time: 1.234812ms, standard deviation:  148.542µs
Parallel Batch time: 422.524µs, standard deviation:  37.498µs
Parallel Block time: 1.913075ms, standard deviation:  298.475µs

I have run with several different values for both block_size and batch_size, but the results are basically the same.
Are there more efficient means to update pixel data as I want to do here?
Do I need to do something else with rayon?

Thanks for anything!

Without looking at the code in any detail: to get high performance, I would suggest you need to break the work into more significant chunks. Locking (synchronising) at the level of a single byte is always going to be inefficient if there are a lot of them.

I know and I tried here:

But I was told that I could not write to the buffer 'in-place' as I want to do in this case without using AtomicU8.
I should read up more on parallel programming in Rust.

That is going to be a bit tricky to process efficiently with multi-threading. I guess the first thing you need to do is split it into chunks. I think there are ways to do that, perhaps this (although I have never used it and I may be wrong...).

I have not looked over your code, but from what you describe it sounds like you could use the rayon crate to iterate over your buffers, splitting them into chunks that can be processed in parallel by separate cores. Rayon has the par_iter_mut() method for this.

I like Rust as an abstract language that is not too close to the hardware, but performance-wise it's not even close to intrinsics.
I did this calculation in a very simple single-threaded intrinsics program that clocks in at about 120 microseconds in C/C++ (NOTE: it uses intrinsics, so it's not pure C/C++).
So for me, who likes performance, I say Rust is not yet a viable option, unless you have opened up intrinsics, which I seriously doubt.

Rust's atomic ops are going to compile down to intrinsics, so there must be some other difference between the two tests (Rust vs C/C++). Also, did you use --release when testing with Rust?

As I mentioned, I wrote hand-coded intrinsics in C/C++, not plain C/C++ code.
I ran "cargo run -r".

Here is the core of the loop. I also handle the remainder, but it's almost the same.

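	// Number of full 16-byte vectors; the remainder is handled separately.
	int32_t intr_n = Ntot >> 4;
	// Add 64 to R, G and B of each RGBA pixel; leave alpha unchanged.
	const uint8_t inc[16] = { 64, 64, 64, 0, 64, 64, 64, 0, 64, 64, 64, 0, 64, 64, 64, 0 };
	// Unaligned load: a plain array is not guaranteed to be 16-byte aligned.
	__m128i intr_inc = _mm_loadu_si128((const __m128i*) inc);
	// cur_ptr must be 16-byte aligned for the aligned load/store below.
	while (intr_n-- > 0) {
		__m128i intr_x = _mm_load_si128((const __m128i*) cur_ptr);
		// Saturating unsigned add, so each channel clamps at 255.
		__m128i intr_sum = _mm_adds_epu8(intr_x, intr_inc);
		_mm_store_si128((__m128i*) cur_ptr, intr_sum);
		cur_ptr += 16;
	}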

Looks like Rust does expose intrinsics (unsafe of course), for example, _mm_load_si128.
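For comparison, here is a rough sketch of how the same loop might look with the intrinsics from std::arch. The names add_rgb_simd and add_rgb_scalar are made up for this example and are not taken from the original program. It uses the unaligned load/store variants so it does not depend on how the buffer was allocated, and it falls back to scalar code on non-x86_64 targets.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_adds_epu8, _mm_loadu_si128, _mm_storeu_si128};

/// Scalar version: saturating add on R, G and B, alpha (every 4th byte) untouched.
/// Assumes `bytes` starts on a pixel boundary.
fn add_rgb_scalar(bytes: &mut [u8], inc: u8) {
    for (i, b) in bytes.iter_mut().enumerate() {
        if i % 4 != 3 {
            *b = b.saturating_add(inc);
        }
    }
}

/// SSE2 version (hypothetical name), 16 bytes = 4 RGBA pixels per step.
#[cfg(target_arch = "x86_64")]
pub fn add_rgb_simd(pixels: &mut [u8], inc: u8) {
    let pattern: [u8; 16] = [
        inc, inc, inc, 0, inc, inc, inc, 0, inc, inc, inc, 0, inc, inc, inc, 0,
    ];
    let mut chunks = pixels.chunks_exact_mut(16);
    // SAFETY: SSE2 is part of the x86_64 baseline, the loadu/storeu variants
    // have no alignment requirement, and every chunk is exactly 16 bytes long.
    unsafe {
        let v_inc = _mm_loadu_si128(pattern.as_ptr() as *const __m128i);
        for chunk in &mut chunks {
            let p = chunk.as_mut_ptr() as *mut __m128i;
            let x = _mm_loadu_si128(p as *const __m128i);
            _mm_storeu_si128(p, _mm_adds_epu8(x, v_inc));
        }
    }
    // The remainder starts at a multiple of 16 bytes, so it is pixel-aligned.
    add_rgb_scalar(chunks.into_remainder(), inc);
}

/// Fallback so the sketch also compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn add_rgb_simd(pixels: &mut [u8], inc: u8) {
    add_rgb_scalar(pixels, inc);
}
```

Calling add_rgb_simd(&mut buf, 64) on an RGBA buffer should do the same work as the C loop above plus its remainder handling. Whether it is actually as fast would need to be measured.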

If you're ok with using unstable, you can also use the generic Simd API for cross-platform portability.

Thanks, that is really useful.

You shouldn't need to use explicit SIMD — LLVM is quite good at auto-vectorization — but you do need to write code in a simd-friendly way to get SIMD optimizations. A good post about this:
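As a small illustration of what "simd-friendly" could mean here: the sketch below (add_rgb_autovec is a name invented for this example) walks the buffer in fixed-size 4-byte pixels with chunks_exact_mut, so there are no index bounds checks in the loop body and no data-dependent branches. Whether LLVM really vectorizes it depends on the compiler version and flags, so the assembly output is worth checking.

```rust
/// Saturating add of `inc` to R, G and B of every RGBA pixel, alpha untouched.
/// Any trailing bytes that do not form a whole pixel are left as they are.
pub fn add_rgb_autovec(pixels: &mut [u8], inc: u8) {
    for px in pixels.chunks_exact_mut(4) {
        // The chunk length is known to be 4, so these constant indices
        // need no runtime bounds check.
        px[0] = px[0].saturating_add(inc);
        px[1] = px[1].saturating_add(inc);
        px[2] = px[2].saturating_add(inc);
    }
}
```

Building with "cargo build --release" and comparing the timing against the hand-written intrinsics is the simplest way to see how close this gets.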

Using rayon's parallel iterator will likely have synchronization overhead. Your best bet for absolute performance is to chunk manually so that chunks don't cross cache-line boundaries, and to use scoped threads (either fresh threads with std or pooled with rayon, it shouldn't matter too much) working with &mut and <[_]>::align_to.
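A minimal sketch of that idea, using only std. It assumes a 64-byte cache line (the CacheLine type, the function names and the thread-count parameter are all invented for this example). align_to_mut splits the buffer into an unaligned prefix, a middle part made of whole cache lines, and a suffix. The middle part is divided between scoped threads, so no two threads write to the same cache line.

```rust
use std::thread;

/// One 64-byte cache line (assumed size; some CPUs use 128 bytes).
#[repr(C, align(64))]
struct CacheLine([u8; 64]);

/// Saturating add on R, G, B. `offset` is the position of `bytes[0]` in the
/// whole image buffer, so the alpha byte is found even in unaligned pieces.
fn add_rgb_at(bytes: &mut [u8], offset: usize, inc: u8) {
    for (i, b) in bytes.iter_mut().enumerate() {
        if (offset + i) % 4 != 3 {
            *b = b.saturating_add(inc);
        }
    }
}

pub fn add_rgb_cache_aligned(pixels: &mut [u8], inc: u8, threads: usize) {
    // SAFETY: CacheLine is a plain array of u8, so any bytes are a valid value.
    let (prefix, lines, suffix) = unsafe { pixels.align_to_mut::<CacheLine>() };
    let prefix_len = prefix.len();
    let suffix_offset = prefix_len + lines.len() * 64;
    // The unaligned ends are small; handle them on the current thread.
    add_rgb_at(prefix, 0, inc);
    add_rgb_at(suffix, suffix_offset, inc);
    if lines.is_empty() {
        return;
    }
    let threads = threads.max(1);
    let per_thread = (lines.len() + threads - 1) / threads;
    thread::scope(|s| {
        for (n, part) in lines.chunks_mut(per_thread).enumerate() {
            let base = prefix_len + n * per_thread * 64;
            // Each thread gets exclusive &mut access to whole cache lines.
            s.spawn(move || {
                for (k, line) in part.iter_mut().enumerate() {
                    add_rgb_at(&mut line.0, base + k * 64, inc);
                }
            });
        }
    });
}
```

The per-byte modulo keeps the sketch short. A tuned version would more likely process whole pixels or vectors inside each thread, and for an image this small the cost of spawning threads may well outweigh the gain.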

Directly using SIMD can certainly be easier than coercing LLVM to see what you want, though, if you already know what you want in terms of SIMD ops.

You probably have a false sharing problem. If you modify bytes in memory that sit close to bytes accessed by other CPU cores (in the same cache line, typically 64 bytes, or 128 on some CPUs), it becomes very slow, because the CPU caches have to be kept coherent across all cores.

When processing images, you should never parallelize per pixel. Parallelize per line of the image, and ideally per larger group of lines.

As a bonus, if you split the image into lines of &mut [u8], you avoid the cost of atomic operations and the cost of bounds checks. In image processing code, indexing with buf[i] in a hot loop can be very expensive when the bounds check is not optimized away, so you should prefer iterators.
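A sketch of the row-based approach with plain std scoped threads, assuming tightly packed RGBA rows (add_rgb_rows, width and rows_per_task are names chosen for this example). chunks_mut hands each thread its own &mut [u8] band of rows, so the borrow checker can see that the writes don't overlap and no AtomicU8 is needed.

```rust
use std::thread;

/// Adds `inc` (saturating) to R, G and B, working on bands of whole rows.
/// `width` is the image width in pixels; rows are assumed to be 4 * width bytes.
pub fn add_rgb_rows(pixels: &mut [u8], width: usize, rows_per_task: usize, inc: u8) {
    let band_bytes = width * 4 * rows_per_task.max(1);
    if band_bytes == 0 {
        return;
    }
    thread::scope(|s| {
        // Each band is a disjoint &mut [u8], so plain u8 writes are fine.
        for band in pixels.chunks_mut(band_bytes) {
            s.spawn(move || {
                // Iterate pixel by pixel, no manual indexing into the image.
                for px in band.chunks_exact_mut(4) {
                    for c in &mut px[..3] {
                        *c = c.saturating_add(inc);
                    }
                }
            });
        }
    });
}
```

This spawns one thread per band, so for the 675-row image in the question something like rows_per_task = 64 would give 11 threads. With rayon, the same split could be driven by its thread pool instead of fresh threads.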

Likely (but not checked), a vector index bounds check is being added (and not optimised away) in the sequential code.

Now this is very confusing. Your opening post here was about speeding up code by using multiple threads on multiple cores. But now you are talking about speeding up code by using vector operations.

Firstly I have to disagree with your characterization of Rust as "an abstract language that is not too close to the hardware". This is clearly not true. Rust compiles down to native instructions with no run-time support needed. Just like C/C++, Pascal, Ada, etc. Anything one can do in C one can do in Rust. This has been shown many times:

Or if you want to get more serious:
https://cliffle.com/blog/m4vga-in-rust/

Yes, Rust has higher level abstractions than C, but still it compiles down to the metal.

Secondly, you are now comparing Rust with, shall I say, "not C/C++" by introducing intrinsics for vector operations, which are not part of either language.

Anyway, I look forward to seeing how you get on using those same intrinsics in Rust and C/C++.

Hi, sorry if I caused some flame... that was entirely unintentional.

Since I'm new to Rust and this was my very first hack (program) in Rust, I didn't know that the compiler allows compiling "unsafe code" (whatever we mean by that). This changes the picture a bit, since I can now get access to both hardware and low-level instructions with pointers if I want to. Most likely, whenever the profiler tells me that I need to optimize some code, I will disregard memory safety, assuming the gains in speed are worthwhile.

I also added the minimal C++ example to my repo, so anyone interested can compile it.
Intrinsics are not magic to me, but they are not the optimal way for a junior/mid engineer to work, due to their low-level nature.
As a personal preference, I do like the fact that I have full control over what is executing on my target machine. Naturally, I do not recommend writing intrinsics code as a general recipe for success, but in this case I just found it trivial to do so.

I believe in Rust when it comes to memory-safety concerns, but as a general means of raising security, I don't know. I would ask myself a couple of basic questions (regardless of whether the code was written in Rust or not), beginning with:
Who wrote this code?
Junior/Senior?
What company?
Do I trust these implementors?

Since I'm new to Rust and this was my very first hack (program) in Rust, I didn't know that the compiler allows compiling "unsafe code" (whatever we mean by that). This changes the picture a bit, since I can now get access to both hardware and low-level instructions with pointers if I want to. Most likely, whenever the profiler tells me that I need to optimize some code, I will disregard memory safety, assuming the gains in speed are worthwhile.

Don't assume that unsafe code is going to be faster. Often the compiler can produce optimal machine code from “naive” Rust code, or from slightly tweaked code — writing unsafe code is then taking on additional risk with no benefit.

Profile your program. Benchmark changes. Examine the assembly output. Ask for help. Only use unsafe code if you find that the results are actually better with it.

It's best not to conflate "safety" as defined in the Rust language with "security". They are closely related but very different.

"Safety" in Rust is all about memory usage: checking you don't use the wrong types in the wrong place, checking you don't do out-of-bounds accesses on arrays, checking objects actually exist and are initialised, preventing data races in multi-threaded code, etc.

None of that will save you from security issues when you make a mistake such that your password checker always returns "Ok", or your crypto algorithm fails to encrypt anything, or you have a hard-coded password in your program, etc.

As it happens Microsoft and others have analysed their known history of security issues and published reports indicating that 70% or so of their security problems could be traced back to mistakes in memory usage. Which Rust can help greatly to prevent. That still leaves 30% of security issues down to other logical problems in their code.
