Cheap random number generator with `std`

For writing a benchmark, I would like to use some random numbers. They wouldn't need to be cryptographically secure, so I wonder if the following approach is suitable:

use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

fn main() {
    let mut hasher = RandomState::new().build_hasher();
    for _ in 0..10 {
        // Feed a dummy byte to advance the hasher's internal state.
        hasher.write(&[0]);
        // Take the current hash value as the next "random" number.
        let random = hasher.finish();
        println!("{random}");
    }
}

(Playground)

Some questions:

  • Is this approach reasonably fast?
  • Is there another, more idiomatic way to generate cheap random numbers?
  • Is the chance of collision much higher than with a truly random number generator?
  • Could this leak a pattern which interferes with usages of HashMap and HashSet such that they become slower, e.g. if I stored such random numbers in a HashMap/HashSet that uses the same hashing algorithm?

A slightly nicer, encapsulated version (using the same algorithm):
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, Hasher};

pub struct RGen(DefaultHasher);

impl RGen {
    pub fn new() -> Self {
        RGen(RandomState::new().build_hasher())
    }
    pub fn random(&mut self) -> u64 {
        // Same trick as above: write a dummy byte, then read out the hash state.
        self.0.write(&[0]);
        self.0.finish()
    }
}

fn main() {
    let mut g = RGen::new();
    for _ in 0..10 {
        let random = g.random();
        println!("{random}");
    }
}

(Playground of encapsulated version)

1 Like

If your goal is to avoid collisions, why do you need "random" numbers, rather than consecutive numbers 0, 1, 2, 3, ... (which would guarantee no collisions)?

2 Likes

Consecutive numbers might result in worse or better performance than if I used real and/or random data.

1 Like

If you want fast weak pseudo-random numbers, a better approach would be to use SmallRng from the rand crate.

The default Hasher is not really designed as a pseudo-random number generator, so there is no guarantee of randomness quality. There is certainly a chance that this will interact differently with your algorithms than truly random data.

SmallRng might also interact poorly with your algorithm, although it's less likely because it's specifically designed to be "pseudo-random".

However, since you care about the quality of your random numbers, you should probably just use StdRng from the rand crate.
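For illustration, here is a minimal sketch of that suggestion, assuming rand 0.8 with its small_rng feature enabled (constructor names differ slightly between rand versions):

use rand::rngs::SmallRng;
use rand::{Rng, SeedableRng};

fn main() {
    // Requires rand 0.8 with the "small_rng" feature (an assumption here).
    // Seed a small, fast, non-cryptographic PRNG from the OS entropy source.
    let mut rng = SmallRng::from_entropy();
    for _ in 0..10 {
        // Draw a uniformly distributed u64.
        let random: u64 = rng.gen();
        println!("{random}");
    }
}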

8 Likes

But also: random data might result in worse or better performance than if you used real data and/or consecutive numbers. (I.e. the real data might be quite unlike random data, too, if the shape of the data turns out to be relevant for the benchmark in the first place.)


Of course, random data is a reasonable test case nonetheless. If the reason you are looking for a particularly fast RNG is that you want to avoid benchmarking the time for generating random data together with the actual thing to be benchmarked, it might also be reasonable to simply generate the data in advance, and only then start the benchmark.
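A minimal sketch of that idea, again assuming rand 0.8 with the small_rng feature; work_on is a made-up placeholder for whatever is actually being measured:

use rand::rngs::SmallRng;
use rand::{Rng, SeedableRng};
use std::time::Instant;

// Hypothetical stand-in for the code under test.
fn work_on(data: &[u64]) -> u64 {
    data.iter().copied().fold(0, u64::wrapping_add)
}

fn main() {
    // Generate all test data before timing starts, so the RNG cost is excluded.
    let mut rng = SmallRng::from_entropy();
    let data: Vec<u64> = (0..1_000_000).map(|_| rng.gen()).collect();

    let start = Instant::now();
    let result = work_on(&data);
    println!("{result} computed in {:?}", start.elapsed());
}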

4 Likes

Could you elaborate on this? What exactly are the goals of the default hasher? Isn't the goal to keep collision probability (relatively) low, which would be achieved with some sort of "randomness"? How else does the default hasher achieve its goal if not by employing pseudo-randomness? What does it do then?

Forgive me my ignorance, I'm not familiar with these things.

That's right, but I find it more difficult to model real data than to use random data.

Yeah, I thought random data is more reasonable than consecutive numbers, but I see your point.

I guess this might consume a lot of memory, though perhaps that doesn't influence the benchmark much?

Actually, I made my OP for two reasons: My particular use case and out of general interest. So let me split up both requests:

  1. Is my approach suitable for creating random numbers that could be used as test data for algorithms / data structures? I interpreted @tczajka's post as a "no" to that question. I would like to understand why. I think even if just 10% of all possible values were used (assuming an equal distribution of those and no occurrence of the remaining 90%), that would still be sufficient for most cases. Or is the default hasher even worse than that?
  2. How to write a benchmark for key-value stores?

The second question might be more complex to answer as there are a lot of aspects to take into consideration (reading vs. writing, database size, etc.). In this post, I was more interested in issue #1 than in issue #2, but feel free to give me advice on #2 as well. (I have already been looking into criterion for that matter.)
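As a purely illustrative sketch of what a criterion benchmark with pre-generated random keys might look like (assuming criterion 0.5 and rand 0.8 with the small_rng feature; the benchmark name and sizes are made up, and the file would live under benches/ with harness = false):

use criterion::{criterion_group, criterion_main, Criterion};
use rand::rngs::SmallRng;
use rand::{Rng, SeedableRng};
use std::collections::HashMap;

fn bench_insert(c: &mut Criterion) {
    // Generate the keys up front so the RNG cost stays out of the measurement.
    let mut rng = SmallRng::from_entropy();
    let keys: Vec<u64> = (0..10_000).map(|_| rng.gen()).collect();

    c.bench_function("hashmap_insert_10k", |b| {
        b.iter(|| {
            // Insert all pre-generated keys into a fresh map each iteration.
            let mut map = HashMap::new();
            for &k in &keys {
                map.insert(k, ());
            }
            map
        })
    });
}

criterion_group!(benches, bench_insert);
criterion_main!(benches);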


P.S.: I will likely use rand::rngs::SmallRng for my purposes as you advise, but would still like to understand why the hasher is a bad idea.


Is the problem that only collisions are attempted to be avoided, but the "structure" (not sure what the mathematically correct term is) of the generated data might be non-random? E.g. it's not guaranteed that the output won't be something like this:

hash([0]) => 1
hash([0, 0]) => 2
hash([0, 0, 0]) => 3
hash([0, 0, 0, 0]) => 4

Note that on wasm32-unknown-unknown, libstd uses a hard-coded seed rather than an actual random number. While every successive call will produce a new "random" number, the exact sequence of numbers is the same for every execution. The getrandom crate (which the rand crate uses), however, knows how to use wasm-bindgen to call the JavaScript API for getting random numbers.
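For reference, a minimal sketch of pulling randomness from getrandom directly, assuming getrandom 0.2 (on wasm32-unknown-unknown this needs the crate's js feature):

// Assumes getrandom 0.2; on wasm32-unknown-unknown, enable its "js" feature.
fn random_u64() -> u64 {
    // Fill 8 bytes from the platform's randomness source
    // (the OS, or the JavaScript API via wasm-bindgen on wasm).
    let mut buf = [0u8; 8];
    getrandom::getrandom(&mut buf).expect("no randomness source available");
    u64::from_ne_bytes(buf)
}

fn main() {
    for _ in 0..10 {
        println!("{}", random_u64());
    }
}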

3 Likes

The goal of RandomState using random numbers is to prevent HashDOS attacks by preventing an attacker from knowing the key used to initialize the hasher. In other words, it is meant to prevent an attacker from being able to generate arbitrarily many collisions. For example, wasm32-unknown-unknown doesn't allow non-determinism unless an API for this is explicitly exposed to it. As such, we had to not use any randomness in the wasm32-unknown-unknown implementation of RandomState. Given that wasm32-unknown-unknown code generally runs in the browser with all or most input coming from a trusted server, defending against HashDOS is not as important (but still important in some cases).

3 Likes

I understand that HashDOS is a different use case than creating random numbers. But in order to mitigate HashDOS, I think randomness is employed. After all, it's called RandomState (even though I understand that on some platforms it may not be actually random).

I think the problem with using a hash algorithm is that you can avoid collisions (which is the goal of the hash function) even when there exist input sequences which generate non-random output sequences, like in my example above (as long as there are not more invariants like that which can be exploited in combination).

Low collision probability is the goal, but that is a weaker property than pseudo-randomness. A low number of collisions doesn't mean that the numbers behave randomly in other ways.

For instance, a great hashing algorithm for u8 -> u64 is just the identity function: it has 0% collision probability! But the identity function is a poor pseudorandom sequence.

Now DefaultHasher does something more complicated, because the actual mapping used by HashMap consists of two parts, Key -> u64 -> (a smaller range), but the general idea stands -- its goals are weaker than behaving randomly in every possible way.
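To make that concrete, here is a toy sketch; IdentityHasher is invented for illustration and is not a real std type. For single-byte keys it is perfectly collision-free, yet its outputs are obviously not a random-looking sequence:

use std::hash::Hasher;

// Toy hasher: remembers the last byte written and returns it as the hash.
// For u8 keys this is collision-free, yet hashing 0, 1, 2, ... yields the
// decidedly non-random sequence 0, 1, 2, ...
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn write(&mut self, bytes: &[u8]) {
        if let Some(&b) = bytes.last() {
            self.0 = b as u64;
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

fn main() {
    for key in 0u8..5 {
        let mut h = IdentityHasher(0);
        h.write(&[key]);
        println!("hash([{key}]) => {}", h.finish());
    }
}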

3 Likes

The important part about hashes for hash tables is that they're a tradeoff between more speed and fewer collisions.

The best hash table hashers are intentionally only just good enough to work, and if analysed from a crypto bit-mixing perspective, they are often rather bad.

2 Likes

Well, a tradeoff between speed and quality was actually what I was aiming for.


However, I don't need a really fast random number generator anymore. See this post of mine for a solution to my concrete problem.
