Testing functions that use randomness

We have a team-internal discussion how one should test functions that use randomness in Rust. So I would like to hear if there are any community-standards / guidelines abut this and if not what are the opinions of community members.

To give a stupid simple example, lets say we have the following function:

fn my_fun(mut vec: Vec<u64>, rng: &mut impl RngExt>) -> u64 {
    ... do some stuff with vec, mutating it in a deterministic way
    vec.shuffle(rng);
    ... process the shuffled vec deterministically, do some stuff with it and compute the end result
}

Now if we want to write some unit tests for this, we have some options:

  1. Seed an rng with a fixed seed, then the function should always return the same result and we can assert this result.
  2. Use a custom implementation of the Rng trait to "mock" the randomness. This is difficult since you can only implement the Rng methods and not the higher level RngExt methods so you would need to reverse engineer the shuffle implementation to do a low-level mock to achieve exactly the wanted shuffle.
  3. Introduce a high level trait for shuffling that can be mocked in the test (i.e. have a trait Shuffler that can shuffle a given vec and instead of impl Rng use impl Shuffler, then the unit test can have a custom implementation of that shuffler trait that is deterministic and predictable by the unit test)
  4. Write the test in a way that the asserts only check properties of the result that always are correct regardless what the shuffling did. (depending on the type of the algorithm you might not be able to find many such checks and you might easily miss bugs that still produce valid results according to those properties but doesn't do what you want with the probabilities you want)
  5. Do some statistical tests (i.e. run with many different seeds and apply a statistical test to the distribution of the result).
  6. Extract the determiinstic parts of the function into private functions (in the example two functions, one for the stuff done before the shuffling and one for the stuff done afterwards), then only test those parts.

That is all the options I could think of for now, but maybe I missed some important ones? Also I guess the answer might depend on the type of randomised algorithm one wants to test (for example if we have some sort of biased sampling, probably a statistical test about the outcome distribution would be helpful, while for other stuff statistical tests might get too complicated) but maybe there are some guidelines to avoid certain things at all times and also when to use what (I guess sometimes also a combination of the above ideas can be useful).

I've personally found option 3 to be best - having a higher level concept makes testing cleaner. In the shuffle case, you can have e.g. a no-op shuffler or a sorting shuffler that can make it much easier to know a priori what the output will look like and therefore how the rest of the logic will behave.

Testing is hard!

I think having a few seeds and snapshotting the results is a pretty reasonable approach, but really the ideal is something deterministic with meaningful, semantic tests, and I don't see how you can do that without 6.

In my limited need to test a random using function (cryptography related) I ended up with both snapshot tests and testing factored out code, but I don't think I was ever particularly happy.

My opinions:

This is a good “cheap” option: low coverage, but very strong assertions except that you have to change it if you change the algorithm's use of the RNG.

In my experience, mocking (in the narrow technical sense where fakes and stubs are different than mocks) is extremely brittle and, as the codebase evolves, is likely to result in tests that fail to reflect reality. So I would recommend against it. That said, in some cases it might be the only practical option.

This is a good strategy, but one part of it is wrong: don’t “write the test”, “write a test”. You can and should write multiple tests that apply different strategies (e.g. a property test and a fixed-seed test).

I’ve never done this, but it could be reasonable. It requires setting a threshold, but so do property tests involving floats.

If you’re looking for thorough testing, then testing the deterministic parts in isolation is a good idea, because it lets you feed them edge cases. But just like I said for 4, “only” is not a good way to look at the problem. Test the deterministic parts separately, and test the whole function.

Personally, I prefer the first option. To improve coverage of corner cases the test can be executed several times or you could find an appropriate seed.

Small comment about the example: I would recommend to use impl Rng instead of impl RngExt. It may result in importing both traits, but IMO the code should be generic over the base trait, not over its extension.

These are testing implementation details rather than the documented behavior. You should generally be testing the interface, not implementation particulars, so I'd say no to these. The main benefit of testing is that you can change the implementation and the test should still be passing. Those tests wouldn't have that property.

These are the right things to test.

5 depends on why you're using randomness. Is it for performance? Have performance benchmarks. It is to achieve a certain distribution? Have a statistical test for the distribution.

This might make sense too.

Thank you for your answer. For me personally I feel 1. has a few quite big drawbacks:

  1. Someone reading the code does not know if the asserts make sense or not, they need to run the test to see if it works or not.
  2. Every time a small detail changes in the implementation (maybe swapping two operations that both access randomness) the test will fail and nobody will understand why. I had such tests in the past and what happened (especially if you are not the one that wrote it) is that people just update the test with the new values and thats it. So it is manual work and the failing test didn't help at all catching a bug and even if it would have catched an invalid semantic change people would just ignore it and update the values to the new outcomes.
  3. It is hard to test edge cases that happen only very rarely (you would either need to test a LOT of seeds, or somehow use tooling to search for a seed such that your wanted edge case appear (this gets even harder if you cannot tell if the edge case happened base on the output, then you need some logging or whatever to find out).
  4. As others already mentioned it depends a lot on a third party dependency, namely the Rng you are using, so if this one changes the tests break again.

Of course the main advantage of 1. is that it is the simplest but I still feel it is not worth the trade-offs.

but really the ideal is something deterministic with meaningful, semantic tests, and I don't see how you can do that without 6.

I feel like 3. can also be a meaningful semantic test if your semantic is defined as "this method does XYZ, then shuffles the list and then does ABC ...", it depends on the contract of the method, if the usage of the randomness is defined in the contract, then for me 3. is still just testing the contract.