I'm building a scientific computing tool (specifically, a computational fluid dynamics solver) in Rust to learn the language and see how suitable it is for scientific computing. All you really need to know is that the program does the following:
Read floating-point data from a CSV file.
Create several linear algebra systems from that data.
Solve those systems using iterative techniques such as Jacobi, BiCGSTAB, etc.
Print some aggregated values from the solution to the console
Based on the solution, generate new linear systems, solve those new systems, and so on
All of the preceding steps are hard-coded. The only external influence should be:
The contents of the input file read in step 1
The number of times step 5 is completed, which is specified by a command line argument
Consequently, if the program is run repeatedly with the same command and the same input data, it should do exactly the same thing every time and produce the same console output. However, I have found this is not the case. About half the time, the output is mostly the same. That is, most of the values it spits out are equal or roughly equal to each other. The other half of the time, the output is completely unpredictable.
It's my understanding that nondeterministic (I believe I'm using this term correctly?) behavior like this can be caused by a few things:
Hashing functions
External factors
Accessing uninitialized memory
Intentional randomness, obviously
Differences due to threading as pointed out by @ZiCog
Assuming I didn't miss anything, I'll go through those point by point.
While my program does use HashMaps, they shouldn't have any effect on the results: as long as the same key corresponds to the same value, everything is good. Also, I'm initializing them with a constant hash builder, e.g.:
let hash_builder = ahash::RandomState::with_seed(31415);
let example_hashmap: HashMap<KeyType, ValueType, ahash::RandomState> =
    HashMap::with_hasher(hash_builder);
The input file is set read-only, so that shouldn't be an issue, and I'm running the same command on the same binary each time.
I don't believe this is possible in Rust without unsafe, right? I do not use unsafe anywhere, though I believe the libraries I'm using for linear algebra (nalgebra, nalgebra_sparse) use unsafe.
n/a
This issue occurs with and without threading.
It seems to me like #3 is the only plausible option. However, my formal CS training is weak, so I know I have a lot of blind spots. What do you all think?
Unfortunately, it's not feasible to distill my source code into a minimal example, but you can see the code in its entirety here. There's also a bit more information on this specific problem in its issue.
I see that you are spawning threads in there. That is a source of random behaviour: the ordering of thread scheduling may not always be the same, and changing the order of floating-point operations can change the results.
You can try running your code under Miri. It can find some errors in unsafe code and might give you a first hint at what the error is; you can post the result here.
Thanks! I should have mentioned that I see this issue occurs with and without threading. I just disabled the threading on line 98 of solver.rs, and I still got two wildly different results on two subsequent runs. I also know that nalgebra_sparse performs all operations in serial.
For education's sake, what magnitude of difference in floating-point results are we talking about due to threading?
I tried running your project and the first thing I notice is that the command outputs
Reading solution data from ./examples/couette_flow.csv...
...
Writing data to ./examples/couette_flow.csv...
Sure enough, a follow-up git status shows examples/couette_flow.csv has changed. If the output you are looking at depends at all on the data read from that file then that seems like a good place to look.
Good catch - that is generally the desired behavior, but for testing I've locally set the data file to read-only and also abort the program before it gets to that point. My local git status confirms that the file hasn't changed since its last commit.
Given that you've considered all the reasonable things to consider and the problem is still mysterious, I would recommend that you pursue minimization: take your large program and figure out what you can cut away from it, while preserving the misbehavior. Start a new git branch and start ruthlessly deleting things.
Remove the file reading; use data stored in literals.
It happens even without threading? Delete the code that would use threads.
Replace functions with simpler functions. Don't worry about whether the program still computes the correct answer you originally sought; for this task, your definition of “correct” is “still exhibits nondeterminism”.
It can be tricky to find a good place to cut next, but it is something that gets easier with practice. At the end of the process, you should have a program which is simple enough that either you can see where your mistake was, or it makes a good bug report for whichever of your dependencies is misbehaving.
Smaller programs are also better candidates for running under slow tools like Miri or loom.
Thanks - I was hoping to not have to do that, but here we go. I already made a simple program that does some large matrix-vector multiplication with nalgebra, and that didn't turn anything up, so it seems the answer is more complex than that. I'll start hacking away.
Given the size of the data structures involved (~8k x ~8k matrices), I foresee it being difficult to translate those to literals. I do have smaller test cases (as small as 9x9 matrices), but I haven't seen the issue on those. I think smaller systems are just not numerically unstable enough to where a bit of numerical noise can cause them to diverge. I suppose I will just set up some print commands to directly print the systems as code.
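One way to do that dumping step is a small helper that formats a matrix as pastable Rust source. This is just a sketch using std only; the function name and the dense row-major `Vec<f64>` layout are assumptions for illustration, not the project's actual API:

```rust
/// Format a matrix as a pastable Rust literal (hypothetical helper;
/// assumes a dense row-major slice of `rows * cols` entries).
fn as_literal(name: &str, data: &[f64], rows: usize, cols: usize) -> String {
    assert_eq!(data.len(), rows * cols);
    let mut out = format!("let {name}: Vec<f64> = vec![\n");
    for r in 0..rows {
        let row: Vec<String> = data[r * cols..(r + 1) * cols]
            .iter()
            // `{:?}` prints enough digits to round-trip an f64 exactly,
            // which matters when chasing tiny numerical differences.
            .map(|x| format!("{x:?}"))
            .collect();
        out.push_str(&format!("    {},\n", row.join(", ")));
    }
    out.push_str("];");
    out
}

fn main() {
    println!("{}", as_literal("a", &[1.0, 0.5, 0.25, 1e-20], 2, 2));
}
```

The `{:?}` formatting is the important part: `{}` may drop digits, and a literal that doesn't round-trip exactly could mask the very instability being hunted.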
Yeah, Miri is definitely not going to be the answer as-is - after 20 minutes, it still hadn't completed the first step, which constitutes <<1% of execution time.
The same thing happens for me. I would change the code to write to a different file. That way, you can easily compare the outputs of different runs from the same inputs.
I've simplified the code a lot, but I'm running into an issue. The issue at hand is seemingly random and can be hard to reproduce. As I simplify things, it's not clear whether the issue has gone away or whether it's just chance.
It is apparent, however, that certain values are still inconsistent from one run to the next. These values are TINY though, on the order of 10^-20. Forgive my lack of CS knowledge, but is there some level of expected variation in small floating-point values even in a perfectly-written, completely serial program with fixed input values?
When running the exe multiple times, I noticed that the Face zone messages always print in random orders, even though you are supposedly using a fixed seed. I was able to remove the randomness by disabling the runtime-rng feature flag (it is documented here).
Change the ahash dependency line in your Cargo.toml:
ahash = { version = "0.8.11", default-features = false, features = ["std"] }
2nd edit: RandomState::with_seeds should be used instead of the singular with_seed. The latter still includes a source of randomness. This is the real solution.
If you move to a different math library (typically by running on a different machine or OS), things like tan might return slightly different answers.
But there's no inherent randomness to floats. If you do the exact same primitive operations in the same order, you'll get the same thing. (RTS games have been a constructive proof of this for decades now, for example.)
Well I'll be darned! I cannot fathom how randomness in the hashmap would be producing irregular results, as I don't think anything depends on the order of the items in the hashmap, but that seems to have fixed it! With let hash_builder = RandomState::with_seed(31415) I get a different result each time, and with let hash_builder = RandomState::with_seeds(0, 1, 2, 3) I get identical results as far as I can tell, no matter how many times I run it. Thank you!!
I guess it might simply be the fact that floating-point operations are not real-number operations; in particular, they are not associative. A different order of iteration means a different order of operations, slightly different rounding of intermediate results, and slightly different output.
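A minimal, std-only demonstration of that rounding effect:

```rust
fn main() {
    let (a, b, c) = (0.1_f64, 0.2, 0.3);

    // Mathematically equal, but floating-point addition is not
    // associative: each intermediate sum is rounded to 53 bits.
    let left = (a + b) + c;  // 0.6000000000000001
    let right = a + (b + c); // 0.6

    println!("{left:?} vs {right:?}");
    assert_ne!(left, right);
    assert!((left - right).abs() < 1e-15); // tiny, but nonzero
}
```

One such last-bit difference is harmless on its own; fed through an ill-conditioned 8k x 8k system, it can grow into the wildly different outputs described above.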
Thanks for the explanation. To try to understand what's actually happening here, I made a branch that reverts to with_seed(31415) and changes all of the instances of for (key, value) in hashmap { ... }
to
for key in 0..hashmap.len() {
let value = &hashmap[&key];
...
}
I similarly changed for key in hashmap.keys(), for value in hashmap.values(), etc. (For the record, all keys are just indices counting upward from 0.) The issue remains. I must be missing a way that hashmaps' randomness manifests. I'm happy to perform further tests to try to document what's actually happening here, but I'm out of my league and content with @parasyte's solution.
Edit: Actually, I missed some for _ in _ instances. I'll update after fixing and testing.
Edit 2: I stand corrected - changing the iteration methods as described above fixed the issue even without the fixed hash generator.
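For completeness, a std-only sketch of one way to keep a HashMap but make a floating-point reduction deterministic (the function name is illustrative): sort the keys before iterating, so the order of additions - and therefore the rounding - is fixed regardless of the hasher.

```rust
use std::collections::HashMap;

/// Sum values in ascending key order, so the order of floating-point
/// additions (and hence the rounding) is fixed across runs.
fn ordered_sum(map: &HashMap<usize, f64>) -> f64 {
    // HashMap iteration order is unspecified, so summing in that order
    // can differ between hasher instances. Sorting the keys fixes it.
    let mut keys: Vec<_> = map.keys().copied().collect();
    keys.sort_unstable();
    keys.iter().map(|k| map[k]).sum()
}

fn main() {
    let map: HashMap<usize, f64> =
        [0.1, 0.2, 0.3, 0.4].iter().copied().enumerate().collect();
    println!("{:?}", ordered_sum(&map));
}
```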
Is the documentation of with_seed incorrect, then?
The provided key does not need to be of high quality, but all RandomStates created from the same key will produce identical hashers. (In contrast to generate_with above)
Thanks - I'll fix other iterators like what you show as I come across them.
Correct me if I'm wrong (again, no formal CS background here), but I have read that large hashmaps generally perform better than large vecs, offering constant-time key lookups instead of linear-time searches. I'm writing the code to handle millions or billions of elements, so lookup speed matters.
Edit: Also, before I understood the issues that can arise when iterating for (key, value) in hashmap, I found that more convenient than using vector.iter().enumerate(). Rust noob though, so maybe just a skill issue?
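For what it's worth, when the keys are dense indices counting up from 0 (as described above), a plain Vec also gives constant-time lookups: indexing by position is O(1), and the linear cost only arises when searching for a matching element rather than indexing. A std-only sketch:

```rust
fn main() {
    // With dense integer keys starting at 0, a Vec supports the same
    // O(1) lookup as a HashMap: indexing by position never scans.
    let values: Vec<f64> = vec![0.1, 0.2, 0.3];
    assert_eq!(values[1], 0.2);

    // Unlike a HashMap, iteration order is always index order, so
    // floating-point reductions are reproducible from run to run.
    for (key, value) in values.iter().enumerate() {
        println!("{key}: {value:?}");
    }
}
```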