Request for review - an example of using std::sync::mpsc for a cryptographic puzzle

Hello,

I am about to write an article about using std::sync::mpsc with multiple threads of execution to solve a computation-heavy cryptographic puzzle (in fact, it's a very basic implementation of the idea behind "mining" cryptocurrencies). I have already prepared the code!

Here's the repository. Please note that I'm trying to keep it as simple as possible, so that readers can easily grasp the idea of communication between worker threads and the main thread (in std::sync::mpsc's fashion). I picked this particular cryptographic puzzle because it's relatively easy to understand, it has real-life applications (see: cryptocurrencies), and it should be easy to experiment with (by manipulating the value of one constant in the source code).

Thanks in advance for your valuable comments!

2 Likes

One thing you may want to address is that the workers that don't find a solution keep going even after a solution has been found by another worker. That's OK here because your main exits, but in a real application you'll want the workers to receive a stop signal from the consumer. The other thing is that the producers can detect the consumer going away (being dropped), which is a neat feature of the channels. Detecting that requires interacting with the channel, though, and your example code only does so in the worker that finds a solution.

Maybe this is beyond "keep it really simple", but if you want to show off mpsc channels, the example should make use of their features a bit more, including explicitly handling termination.

Good luck with the article!

2 Likes

Thank you @vitalyd - you are right. I just updated the code to detect whether the receiving end of the channel is still "listening" - if not, the worker threads finish their job and exit.

Now off to write an article!

I must admit that even though I know this code has pedagogical intent and is written first and foremost for readability, my software optimization OCD is a bit disturbed by the worker function:

fn search_for_solution(start_at: usize, sender: mpsc::Sender<Option<Solution>>) {
    // In the body of a tight loop with many iterations...
    for i in (start_at..).step_by(THREADS) {
        // - Invoke the big format machinery just to do the job of ToString
        // - Dynamically allocate a string for at most 20 digits (ceil(log10(2^64)))
        // - Immediately turn it into bytes as all you really need are ASCII chars
        // - Compute a hash, dynamically allocate a string to store the result even
        //   if we know the exact number of hex digits at compile time (256/4 = 64)
        let hash: String = Sha256::hash(format!("{}", i * BASE).as_bytes()).hex();

        let result = if hash.ends_with(DIFFICULTY) {
            Some(Solution(i, hash))
        } else {
            None
        };

        // Then, still on every iteration, synchronize with other threads, most of
        // the time only to swamp the main thread with a huge amount of Nones and
        // check if we need to terminate.
        match sender.send(result) {
            Ok(_)  => {},
            Err(_) => {
                println!("Receiver has stopped listening, dropping worker number {}", start_at);
                break;
            },
        }
    }
}

If you want to later do a follow-up post on "now that we got the basic algorithm working, how can we speed it up?", I'd be interested in helping out :slight_smile:

1 Like

Thanks @HadrienG, I'd appreciate any suggestions on how you think the worker function could be improved/optimized (while still keeping readability) :slight_smile:

Here are a few ideas, from least invasive to most invasive:

  • Replace format!("{}", i * BASE) with (i * BASE).to_string(). The latter is arguably easier to read, and may be faster if the format! macro does not treat the common "{}" pattern specially.
  • Replace (i * BASE).to_string().as_bytes() with a specialized function that stores the result in a stack-allocated ArrayVec. This gets rid of one heap allocation. Here's a basic implementation; if it turns out to be a bottleneck (which I wouldn't expect, compared to the SHA-256 computation next to it), it can be optimized further.
use arrayvec::ArrayVec;

// NOTE: Not portable to >64-bit hardware architectures, but even if such
// archs existed, rustc would not support them anyhow. In future Rust, we'll
// be able to compute this more cleanly using const fn.
const MAX_USIZE_DIGITS: usize = 20;

// Turn a usize into an array of ASCII decimal digits
fn to_digits(x: usize) -> ArrayVec<[u8; MAX_USIZE_DIGITS]> {
    // Set up the digits accumulator
    let mut digits = ArrayVec::new();

    // Extract the decimal digits from right to left
    let mut remainder = x;
    while remainder != 0 {
        digits.push(b'0' + ((remainder % 10) as u8));
        remainder /= 10;
    }

    // Put the digits in the right order
    digits.as_mut().reverse();

    // This algorithm does not work for zero (it produces an empty array),
    // so we need to handle that input value specially.
    if x == 0 { digits.push(b'0'); }

    // And we're done
    digits
}
  • Patch easy_hash so that it can write its hex output into an ArrayString as well (you may want to discuss this with the maintainer first, as they may or may not want this to be an optional crate feature). This gets rid of the other heap allocation in the hot path.
  • Use an AtomicBool rather than the built-in mpsc drop-detection mechanism to notify the workers that the solution has been found, and only check the flag infrequently (e.g. every 1000 worker iterations) to keep thread synchronization out of the hot path.
  • Do not send anything to the main thread unless the solution has been found. (Both of these last two ideas are sketched below.)
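
To make those last two ideas concrete, here is a rough sketch of one possible shape for the worker - not the only way to structure it. It reuses the THREADS, BASE, DIFFICULTY, Solution and Sha256 names from your repository; the channel now carries Solution instead of Option<Solution>, and the 1000-iteration check interval is just an arbitrary example:

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;

fn search_for_solution(start_at: usize,
                       sender: mpsc::Sender<Solution>,
                       is_solution_found: Arc<AtomicBool>) {
    for (iteration, i) in (start_at..).step_by(THREADS).enumerate() {
        let hash: String = Sha256::hash((i * BASE).to_string().as_bytes()).hex();

        if hash.ends_with(DIFFICULTY) {
            // Tell the other workers to stop...
            is_solution_found.store(true, Ordering::Relaxed);
            // ...and report the solution - the only message we ever send now
            sender.send(Solution(i, hash))
                  .expect("Receiver has stopped listening");
            break;
        }

        // Synchronize with the other threads only once in a while
        if iteration % 1000 == 0 && is_solution_found.load(Ordering::Relaxed) {
            break;
        }
    }
}

The main thread would create the flag once with Arc::new(AtomicBool::new(false)) and hand a clone to each worker.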

Also, optimization aside, swallowing errors like this is a bit of a Rust antipattern:

    loop {
        match receiver.recv() {
            Ok(None) => continue,
            Ok(Some(Solution(i, hash))) => {
                println!("Found the solution.");
                println!("The number is: {}.", i);
                println!("Result hash: {}.", hash);
                break;
            },
            _ => {},  // Please, don't do this
        }
    }

You may want to make the match more explicit and panic in the error case instead:

    loop {
        match receiver.recv() {
            Ok(None) => continue,
            Ok(Some(Solution(i, hash))) => {
                println!("Found the solution.");
                println!("The number is: {}.", i);
                println!("Result hash: {}.", hash);
                break;
            },
            Err(RecvError) => panic!("Worker threads disconnected before the solution was found"),
        }
    }

Wow, thank you very much for this thorough review @HadrienG. These tips are priceless to me - I learned new things!

If you want to have more fun with error handling, note that the i * BASE multiplication in the worker thread can overflow. Actually, the underlying loop on i can overflow as well if you give it enough time. I leave it up to you to decide whether you want to handle this one, and if so how :slight_smile:
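
If you do decide to handle it, one minimal option is checked_mul. Here is a sketch of the loop body, reusing the names from the existing worker (if BASE > 1, the multiplication overflows long before i itself does):

for i in (start_at..).step_by(THREADS) {
    // Stop cleanly when i * BASE would overflow, instead of panicking in
    // debug builds or silently wrapping around in release builds
    let input = match i.checked_mul(BASE) {
        Some(product) => product,
        None => break,
    };
    let hash: String = Sha256::hash(input.to_string().as_bytes()).hex();
    // ... rest of the loop body unchanged ...
}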

Nice. For the purpose of the article, I will probably leave out the last one (as well as using AtomicBool). While I know these are the most correct ways to deal with these issues, I want to keep the example quite simple (and take the optimistic approach to both integer overflow and the performance flaws caused by "flooding" the receiver with None values most of the time).

Anyway - thanks!

Fine by me! Taking a quick look at the documentation, it seems that when you wrote the README you weren't aware that cargo run also accepts the --release flag. It would be a simpler alternative to what you are proposing, I think.

1 Like

Whoa. Another trick learned today! Just updated the docs.

Just profiled the puzzle code out of curiosity. It spends roughly 25% of its CPU cycles computing SHA-256 hashes, which gives you an upper bound on achievable performance improvements without modifying the underlying sha2 crate: something in the range of 4x faster (1 / 0.25 = 4), assuming you can optimize out the overhead of everything else.

The targets for performance optimization that emerge are pretty much the ones I expected: synchronization and memory allocation/deallocation/copies. I'm a bit surprised at the magnitude, though: adding together the various mpsc-related contributions, queue traffic alone seems to account for nearly as many CPU cycles as the hash computation itself (~23%).

That's interesting. Does it mean that just switching from "flooding" the receiver with Nones to sending a message only when the solution is found (plus using an AtomicBool to control the execution of the worker threads) could greatly boost the performance?

If my interpretation of the profile is correct, you could get up to a 2x performance boost from that alone.

Great, I will definitely check that out. It might be worth doing. Could you tell me what you used for the profiling? I'm curious.

The "perf" profiler integrated in the Linux kernel. I wrote a bit about what it does and how one uses it in Profilers and how to interpret results on recursive functions - #2 by HadrienG .

@HadrienG I followed your tips and used an Arc<AtomicBool> to detect whether any of the worker threads has found the solution (I check it every 1000 iterations within each worker thread). Here's the pull request with a diff. The only thing I am really confused about is the Ordering. For now, I used Ordering::Relaxed for both store and load. This is the first time I've used any atomic, and I find the documentation for Ordering quite confusing. Would you mind taking a look?

If you've ever used C++'s std::atomic, the concept of Ordering is similar to the concept of std::memory_order there.

What problem is it solving? Well, let me first tell you a story about one of the many things that compilers and hardware do behind your back. Even though you like to think of your program as reading from memory and writing to it in a well-defined order, both the Rust compiler and the CPU will not hesitate to reorder your memory accesses in order to improve their performance, as long as the result is invisible to sequential code. This is usually a good thing, as most programmers are not very good at writing programs whose memory access patterns are optimal in hardware.

However, even though sequential code does not see them, these memory access reorderings can break the correctness of parallel code. For example, imagine what would happen if your compiler or CPU felt free to reorder reads and writes to a mutex-protected piece of data outside of the code region where the mutex is locked... as you can imagine, this can get ugly rather quickly.

The atomic access orderings are designed to restrict what kind of memory access reorderings the compiler and hardware are allowed to perform:

  • Acquire has the same semantics as acquiring a mutex ("no read or write should be reordered before the Acquire operation").
  • Release has the same semantics as releasing a mutex ("no read or write should be reordered after the Release operation").
  • AcqRel combines the Acquire and Release restrictions.
  • SeqCst additionally guarantees that all CPU cores will see the corresponding atomic operations as occurring in the same temporal order. This is not something that you get for free, because atomic memory operations take some variable time to propagate from one CPU core to another, which naturally tends to shuffle the order in which various CPUs perceive them. It is also, generally speaking, a very expensive guarantee to ensure in hardware, as you typically need to lock the bus used for communication between CPU cores, roll back memory transactions, or something similarly ugly.
  • Relaxed means "I don't care, do whatever you want". It is the most dangerous ordering to use in general, but is appropriate for simple signaling flags like the AtomicBool that we're discussing here because all the "useful" synchronization happens via the mpsc queue.
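
To make the Acquire/Release pairing concrete, here is a small standalone sketch (not code from your repository): the Release store guarantees that once the Acquire load observes true, the earlier write to data is visible as well.

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

fn main() {
    let data = Arc::new(AtomicUsize::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let producer = {
        let (data, ready) = (data.clone(), ready.clone());
        thread::spawn(move || {
            data.store(42, Ordering::Relaxed);
            // Release: the write to `data` above may not be reordered
            // after this store to `ready`
            ready.store(true, Ordering::Release);
        })
    };

    // Acquire: once we observe `ready == true`, the write to `data` made
    // before the matching Release store is guaranteed to be visible
    while !ready.load(Ordering::Acquire) {}
    assert_eq!(data.load(Ordering::Relaxed), 42);

    producer.join().unwrap();
}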
1 Like

Got it. It looks like Ordering::Release would also work here for the store operation (along with Ordering::Acquire for the load), but to keep things simple I'll stick with Ordering::Relaxed.

BTW - it's interesting that the docs don't mention that Ordering::Acquire should not be used with a store operation (and the other way around: Ordering::Release should not be used with a load). Both such attempts panic at runtime. Even though it might be intuitive, I would love to see such an explanation in the documentation.
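
For example, something like this compiles fine but panics the moment it runs:

use std::sync::atomic::{AtomicBool, Ordering};

fn main() {
    let flag = AtomicBool::new(false);
    // Panics at runtime: Release is not a valid ordering for a load
    let _ = flag.load(Ordering::Release);
}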

Thanks again @HadrienG. I will definitely mention your effort in the article. Cheers!

1 Like