How can I parallelize code on a GPU?

Here is some code below. I'm currently at a dead end: I'm trying to figure out how to port this program to something CUDA-related (primarily), maybe OpenCL, or anything else that can speed this code up by a lot.

Basically, the entire while loop in this code needs to run on every GPU core instead of every CPU core.

I have done a bit of research, but I can't really find many options for Rust.

[dependencies]
num_cpus = "1.16.0"
secp256k1 = { version = "0.29.0", features = ["rand-std", "global-context"] }
tiny-keccak = { version = "2", features = ["keccak"] }

[profile.release]
strip = true
opt-level = 3
lto = "fat"
codegen-units = 1
panic = "abort"


use secp256k1::generate_keypair;
use secp256k1::rand::thread_rng;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use tiny_keccak::{Hasher, Keccak};

use crate::VanityKeypair;

#[derive(Clone)]
struct AddressPart {
    part: String,
    part_lower: String,
}

impl From<String> for AddressPart {
    fn from(part: String) -> Self {
        let part_lower = part.to_lowercase();
        Self { part_lower, part }
    }
}

impl From<&str> for AddressPart {
    fn from(part: &str) -> Self {
        let part_lower = part.to_lowercase();
        let part = part.to_owned();
        Self { part_lower, part }
    }
}

struct Rules {
    prefix: Option<AddressPart>,
    suffix: Option<AddressPart>,
}

struct Configuration {
    thread_count: usize,
    rules: Rules,
}

const ADDRESS_HEX_LENGTH: usize = 40;
const KECCAK256_HASH_BYTES_LENGTH: usize = 32;
const KECCAK256_HASH_HEX_LENGTH: usize = KECCAK256_HASH_BYTES_LENGTH * 2;
// An uncompressed secp256k1 public key starts with a 0x04 tag byte.
const PUBLIC_KEY_BYTES_START_INDEX: usize = 1;
// An Ethereum address is the last 20 bytes of keccak256(public key).
const ADDRESS_BYTES_START_INDEX: usize = 12;

const HEX_CHARS_LOWER: &[u8; 16] = b"0123456789abcdef";

fn to_hex<'a>(bytes: &[u8], buf: &'a mut [u8]) -> &'a [u8] {
    for i in 0..bytes.len() {
        buf[i * 2] = HEX_CHARS_LOWER[(bytes[i] >> 4) as usize];
        buf[i * 2 + 1] = HEX_CHARS_LOWER[(bytes[i] & 0xf) as usize];
    }

    &buf[0..bytes.len() * 2]
}

fn to_checksum_address<'a>(addr: &[u8], checksum_addr_hex_buf: &'a mut [u8]) -> &'a [u8] {
    let mut hash = [0u8; KECCAK256_HASH_BYTES_LENGTH];
    let mut hasher = Keccak::v256();
    hasher.update(addr);
    hasher.finalize(&mut hash);

    let mut buf = [0u8; KECCAK256_HASH_HEX_LENGTH];
    let addr_hash = to_hex(&hash, &mut buf);

    for i in 0..addr.len() {
        let byte = addr[i];
        // EIP-55: uppercase the address hex digit when the matching hex digit
        // of the hash is '8'..='f' (b'8' == 56), i.e. the hash nibble is >= 8.
        checksum_addr_hex_buf[i] = if addr_hash[i] >= 56 {
            byte.to_ascii_uppercase()
        } else {
            byte
        };
    }

    &checksum_addr_hex_buf[0..addr.len()]
}

fn is_addr_matching(prefix: &[u8], suffix: &[u8], addr: &[u8]) -> bool {
    addr.starts_with(prefix) && addr.ends_with(suffix)
}

pub fn vanity_eth_address(starts_with: String, ends_with: String) -> VanityKeypair {

    let secret_key = Arc::new(Mutex::new(String::new()));
    let public_key = Arc::new(Mutex::new(String::new()));
    let num_cpus = num_cpus::get();

    let cfg: Configuration = Configuration {
        thread_count: num_cpus,
        rules: Rules {
            prefix: Some(AddressPart::from(starts_with)),
            suffix: Some(AddressPart::from(ends_with)),
        },
    };

    let rules = cfg.rules;

    let prefix = rules.prefix.clone().unwrap().part;
    let suffix = rules.suffix.clone().unwrap().part;

    let prefix = prefix.into_bytes();
    let suffix = suffix.into_bytes();

    let prefix_lower = rules.prefix.clone().unwrap().part_lower;
    let suffix_lower = rules.suffix.clone().unwrap().part_lower;

    let prefix_lower = prefix_lower.into_bytes();
    let suffix_lower = suffix_lower.into_bytes();

    let found = Arc::new(AtomicBool::new(false));

    let mut threads = Vec::with_capacity(cfg.thread_count);

    for _ in 0..cfg.thread_count {
        let secret_key = Arc::clone(&secret_key);
        let public_key = Arc::clone(&public_key);

        let found = found.clone();

        let prefix = prefix.clone();
        let suffix = suffix.clone();

        let prefix_lower = prefix_lower.clone();
        let suffix_lower = suffix_lower.clone();

        threads.push(thread::spawn(move || {
            let mut hash = [0u8; KECCAK256_HASH_BYTES_LENGTH];
            let mut addr_hex_buf = [0u8; ADDRESS_HEX_LENGTH];
            let mut checksum_addr_hex_buf = [0u8; ADDRESS_HEX_LENGTH];

            const ITERATIONS_UNTIL_CHECK: usize = 100_000;

            while !found.load(Ordering::Acquire) {
                for _ in 0..ITERATIONS_UNTIL_CHECK {
                    let (sk, pk) = generate_keypair(&mut thread_rng());

                    let pk_bytes = &pk.serialize_uncompressed()[PUBLIC_KEY_BYTES_START_INDEX..];

                    let mut hasher = Keccak::v256();
                    hasher.update(pk_bytes);
                    hasher.finalize(&mut hash);

                    let addr_bytes = &hash[ADDRESS_BYTES_START_INDEX..];
                    let addr = to_hex(addr_bytes, &mut addr_hex_buf);

                    if is_addr_matching(&prefix_lower, &suffix_lower, addr) {
                        let checksum_addr = to_checksum_address(addr, &mut checksum_addr_hex_buf);

                        if !is_addr_matching(&prefix, &suffix, checksum_addr) {
                            continue;
                        }

                        found.store(true, Ordering::Release);

                        let mut public = public_key.lock().unwrap();
                        *public = String::from_utf8(checksum_addr.to_vec()).unwrap();

                        let mut secret = secret_key.lock().unwrap();
                        *secret = sk.display_secret().to_string();

                        return;
                    }
                }
            }
        }));
    }

    for thread in threads {
        thread.join().unwrap();
    }

    let pk = public_key.lock().unwrap().clone();
    let sk = secret_key.lock().unwrap().clone();

    let keypair = VanityKeypair {
        pk: format!("0x{pk}"),
        // field name assumed; VanityKeypair is defined elsewhere in the crate
        sk,
    };

    keypair
}
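For context, the checksum step above implements the EIP-55 rule: a hex digit of the address is uppercased exactly when the matching hex digit of keccak256(lowercase address) is '8' or higher. Here is a minimal std-only sketch of just that casing rule, with made-up address/hash values so it runs without a Keccak dependency:

```rust
// EIP-55 casing rule in isolation: uppercase an address hex digit when the
// corresponding hex digit of the address hash is '8'..='f' (ASCII >= 56).
fn apply_checksum_casing(addr_hex: &[u8], hash_hex: &[u8]) -> String {
    let out: Vec<u8> = addr_hex
        .iter()
        .zip(hash_hex)
        .map(|(&c, &h)| if h >= b'8' { c.to_ascii_uppercase() } else { c })
        .collect();
    String::from_utf8(out).unwrap()
}

fn main() {
    // Made-up values, not a real address/hash pair.
    let addr = b"ab12cd";
    let hash = b"9f03a2";
    println!("{}", apply_checksum_casing(addr, hash)); // prints "AB12Cd"
}
```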
In practice, you're rewriting all this, including the crypto libraries, in a different language: CUDA, GLSL (for OpenCL/GL), HLSL (for DirectX), or, my preference, the shiny new WGSL for wgpu.

So... learn one of those languages?

rust-gpu is the closest thing to what you're asking for, but it's basically still an experiment and isn't intended to be used for arbitrary Rust code.


Writing the crypto libraries is no problem, but will rust-gpu suit me well in this situation? I mean, it mainly looks like it's intended for graphics. Or are you saying the best bet is porting all this to C++ CUDA?

Or any of the other languages, sure. If you can already write CUDA, then that's an obvious choice.

You can still use Rust bindings and libraries if you like.

As for rust-gpu, there's not really any difference between the shader code you write for graphics and for compute, other than the annotations for the entry point's inputs and outputs, so I doubt it can't do compute.

Hmmm, it seems all these Rust CUDA libraries were last updated 5 years ago lol

but I found this on the Rust Package Registry

Hmm, that's the 12th result, I wonder why? It's normally quite good at ranking by quality.

So could I use cudarc? Is it complete enough?

Sorry, I have no idea! :sweat_smile:

I mostly mess with wgpu for my GPU programming needs, but while I have spent a decent bit of time looking into what options exist overall, CUDA never came up past a "what is this, how is it different" cursory look.

So would I literally have to use the wgpu crate? And where can I get the boilerplate to initialize the GPU?

Not sure I understand?

You can use any library that lets you program a GPU, the point was more that there's no (at least no reliable) magic to get normal Rust code to work on a GPU yet.

If you're already familiar with CUDA, feel free to go ahead with one of those binding libraries, it should be a smoother road than learning a new shader language, I'm just saying I don't know how to use those.

WGPU (sort of) requires you to learn WGSL, the new shader language created for WebGPU (which WGPU is an implementation of). The "sort of" is because WGPU uses Naga to provide WGSL support by translating it to the backend's shader language, and Naga can also translate other shader languages to WGSL if you're OK with translating twice.

If you still want to try wgpu, there are plenty of examples on GitHub; you probably want to start with hello_compute.
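To give a flavor of the compute side, here is a hypothetical WGSL compute shader (embedded as a Rust string constant, the way wgpu examples typically ship shaders). The `@compute` entry-point annotations are the main thing that distinguishes it from graphics shader code; the shader itself is illustrative, not taken from hello_compute:

```rust
// A hypothetical WGSL compute shader of the kind hello_compute dispatches.
// Each invocation handles one index of a storage buffer; for the vanity
// address problem, each invocation would instead test one candidate key.
const SHADER: &str = r#"
@group(0) @binding(0) var<storage, read_write> data: array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    data[gid.x] = data[gid.x] * 2u;
}
"#;

fn main() {
    // No GPU here; just showing the compute entry-point annotations.
    assert!(SHADER.contains("@compute"));
    println!("{SHADER}");
}
```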


By the way, if I go with cudarc, can I just spawn the entire function I already made on a GPU thread, even though it's making use of other crypto libraries, and it just runs them?

If you mean the Rust code you posted above: not from what I can see in the docs; you need to have CUDA code (.cu) and compile it with e.g. compile_ptx - but that's just from a quick look at the docs.

That's the basic theme: GPU code is different enough from CPU code that you can't just take code written for one and use it on the other. There are some efforts to make an environment that can target both, but they're essentially new languages.
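To make that concrete, here is a CPU-only sketch (plain Rust, all names illustrative) of the restructuring a GPU port forces on the loop in the original code: instead of threads spinning on a shared RNG and mutexes, each GPU invocation deterministically derives one candidate from its global id and writes any match into an output buffer that the host reads back after the dispatch:

```rust
// CPU-only sketch of the data-parallel shape a GPU port needs. `kernel` is
// what each GPU invocation would run; `main` plays the host launching a grid.
// Everything here is illustrative - a real port would derive a secp256k1 key
// from (base seed + global id) and run Keccak, all implemented in the kernel.
fn kernel(global_id: u64, out: &mut [u64]) {
    // Derive a candidate purely from the id: no shared RNG, no locks.
    let candidate = global_id
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    // Stand-in for "address matches the vanity prefix/suffix".
    if candidate % 1000 == 0 {
        out[global_id as usize] = candidate; // report the match to the host
    }
}

fn main() {
    const GRID_SIZE: u64 = 1 << 16; // blocks * threads_per_block, so to speak
    let mut out = vec![0u64; GRID_SIZE as usize];
    // On a GPU this whole loop is a single dispatch/launch.
    for id in 0..GRID_SIZE {
        kernel(id, &mut out);
    }
    let hits = out.iter().filter(|&&v| v != 0).count();
    println!("{hits} matching candidates in this batch");
}
```

The key point is that the shared `AtomicBool`/`Mutex` state from the CPU version disappears: GPU threads communicate results through a buffer, and the host decides whether to launch another batch.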


I was also looking into these questions a lot due to some numerical simulations that I am developing. It seems that there is no de-facto solution (like serde for (de)serialization) at the moment. I am planning on using the opencl3 crate for those tasks in the future, but I do not have any experience working with it yet.