To clarify this question more generally: your questions about these code examples aren't really about "memory ordering" in particular.
I'll ignore the code that runs behind println! in the following explanation, just to keep things simple.
Your first code example:
s.spawn(move || {
println!("lexically Nth thread: {}", counter.load(Relaxed));
counter.fetch_add(1, Relaxed);
});
This code makes two different accesses to counter: a thread running it will execute (at least) two instructions, one for the atomic load and one for the atomic fetch-add. This means that, concurrently, any arbitrary amount of "other stuff" can happen in the CPU between the first and second instruction, and if you have (say) four different threads running this code at once, you can experience any interleaving of those instructions you can imagine. For example, let's imagine we have only two threads. Given the whims of your scheduler and CPU architecture, the following is entirely possible:
thread-A: counter.load(Relaxed) // This will load 0
thread-B: counter.load(Relaxed) // This will load 0
thread-A: counter.fetch_add(1, Relaxed) // this will store 1
thread-B: counter.fetch_add(1, Relaxed) // this will store 2
Outcome: 0 0
Alternatively, and also completely possible:
thread-A: counter.load(Relaxed) // This will load 0
thread-A: counter.fetch_add(1, Relaxed) // this will store 1
thread-B: counter.load(Relaxed) // This will load 1
thread-B: counter.fetch_add(1, Relaxed) // this will store 2
Outcome: 0 1
These are both entirely possible with the same code - it is just up to the scheduler, your cache coherency protocol, and a bunch of other complex and largely out-of-your-control factors. There was no "agreement" between threads or coordination or anything like that; they all just looked at whatever value counter happened to hold when they happened to reach that line of code.
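If it helps to experiment, here's a minimal runnable version of that first pattern (the helper name and thread count are mine, not from your code). The printed values can repeat or skip around between runs, but the final counter value is always deterministic:

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};
use std::thread;

// Hypothetical helper: n threads each do the load-then-fetch_add pattern.
fn race_demo(n: usize) -> usize {
    let counter = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..n {
            s.spawn(|| {
                // Two separate atomic instructions: any number of other
                // threads can run between them, so `seen` may repeat
                // across threads (e.g. two threads both seeing 0).
                let seen = counter.load(Relaxed);
                counter.fetch_add(1, Relaxed);
                println!("saw: {seen}");
            });
        }
    });
    // Every fetch_add still lands exactly once, so the *final* value
    // is deterministic even though the printed values are not.
    counter.load(Relaxed)
}

fn main() {
    assert_eq!(race_demo(4), 4);
}
```

Run it a few times and you'll likely see duplicated "saw" values, exactly like the interleavings sketched above.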
Now what about your other code example?
s.spawn(move || {
println!("lexically Nth thread: {}", counter.fetch_add(1, Relaxed));
});
This code (again ignoring the println! machinery) now makes only one access to the counter variable, which it can do in a single instruction. The threads can still run this code in any arbitrary order relative to each other, but because each thread runs only a single instruction, reorderings don't have nearly as much of an effect. Consider a couple of different possibilities the scheduler could spit out at you, pretending we have three threads:
thread-A: counter.fetch_add(1, Relaxed) // loads 0, stores 1
thread-B: counter.fetch_add(1, Relaxed) // loads 1, stores 2
thread-C: counter.fetch_add(1, Relaxed) // loads 2, stores 3
Outcome: thread-A: 0; thread-B: 1; thread-C: 2
Or:
thread-C: counter.fetch_add(1, Relaxed) // loads 0, stores 1
thread-A: counter.fetch_add(1, Relaxed) // loads 1, stores 2
thread-B: counter.fetch_add(1, Relaxed) // loads 2, stores 3
Outcome: thread-C: 0; thread-A: 1; thread-B: 2
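You can check this property for yourself with a runnable sketch of the second pattern (the helper name and thread count are mine). However the scheduler interleaves the threads, every thread gets a unique value:

```rust
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};
use std::sync::Mutex;
use std::thread;

// Hypothetical helper: n threads each take one "ticket" via a single
// atomic read-modify-write; we collect and sort what they observed.
fn unique_tickets(n: usize) -> Vec<usize> {
    let counter = AtomicUsize::new(0);
    let seen = Mutex::new(Vec::new());
    thread::scope(|s| {
        for _ in 0..n {
            s.spawn(|| {
                // One indivisible instruction: no other thread can slip
                // in between the load and the store, so no two threads
                // can ever observe the same value.
                let ticket = counter.fetch_add(1, Relaxed);
                seen.lock().unwrap().push(ticket);
            });
        }
    });
    let mut tickets = seen.into_inner().unwrap();
    tickets.sort();
    tickets
}

fn main() {
    // Whatever order the scheduler picks, the observed values are
    // always exactly 0..n with no duplicates.
    assert_eq!(unique_tickets(8), (0..8).collect::<Vec<_>>());
}
```

The *assignment* of values to threads still varies run to run; only the "each value handed out exactly once" property is guaranteed.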
To make it more complicated: in reality the fetch_add and the println! that reports it are not one atomic step, so the threads can also load their values in a different order than they end up printing them, giving you something like this:
thread-C: counter.fetch_add(1, Relaxed) // loads 0, stores 1
thread-A: counter.fetch_add(1, Relaxed) // loads 1, stores 2
thread-A: println!(...)
thread-B: counter.fetch_add(1, Relaxed) // loads 2, stores 3
thread-C: println!(...)
thread-B: println!(...)
Outcome: thread-A: 1; thread-C: 0; thread-B: 2
Welcome to the joys of concurrent programming. Keep in mind too, as per my comment immediately above this one, these "orderings" are somewhat "made up" for the sake of example - in reality a lot of these things are happening genuinely concurrently (assuming a multicore system), even if in the end you can decide some temporal ordering of them that "fits" the observed result.
(PS: If my examples have any mistakes let me know)