Today I'm trying to figure out how to make some code amenable to instruction-level parallelism (pipelining) on a single processor in Rust. I'm building a decoder of u64s, and I want to make each symbol's decoding independent of the others.

This is the current implementation:

```rust
/// Decodes the whole sequence given as input.
pub fn decode_all(&self) -> Vec<RawSymbol> {
    let mut states = self.states; // a [u64; 4]
    let mut decoded = vec![
        unsafe { MaybeUninit::<u64>::uninit().assume_init() };
        self.sequence_length as usize
    ];
    let mut norm_bits = self.normalized_bits.iter(); // an Iter<u32>
    let mut current_symbol_index: usize = 0;
    while ... { // some condition
        decoded[current_symbol_index] = self.decode_sym(&mut states[3], &mut norm_bits);
        decoded[current_symbol_index + 1] = self.decode_sym(&mut states[2], &mut norm_bits);
        decoded[current_symbol_index + 2] = self.decode_sym(&mut states[1], &mut norm_bits);
        decoded[current_symbol_index + 3] = self.decode_sym(&mut states[0], &mut norm_bits);
        current_symbol_index += 4;
    }
    decoded
}
```

As I understand it, I have to isolate each decode_sym call, but the mutable reference passed to each call prevents instruction pipelining. The problem is that I can't avoid it: the logic inside decode_sym sometimes needs to sequentially pull elements out of that iterator and use them in the method itself.

unsafe { MaybeUninit::<u64>::uninit().assume_init() } is instant UB. You need to either keep the elements as MaybeUninit<u64>, write to them in the while loop, and finish with a transmute (or decoded.into_iter().map(|v| unsafe { v.assume_init() }).collect(), where the optimizer works out what you meant and elides the copy), or just initialize to 0 or similar and let the optimizer remove the initialization if it's never read. Note, though, that you must ensure every value has been made valid via MaybeUninit::write before you call assume_init; otherwise you're back to UB.
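A minimal sketch of the MaybeUninit route, written as a free function rather than a method (decode here is a stand-in for the real per-symbol logic, not your actual decoder):

```rust
use std::mem::MaybeUninit;

// Hypothetical stand-in for the real per-symbol decode.
fn decode(raw: u64) -> u64 {
    raw.wrapping_mul(3)
}

fn decode_all(input: &[u64]) -> Vec<u64> {
    // Uninitialized storage is fine as long as it stays MaybeUninit.
    let mut decoded: Vec<MaybeUninit<u64>> = Vec::with_capacity(input.len());
    decoded.resize_with(input.len(), MaybeUninit::uninit);

    for (slot, &raw) in decoded.iter_mut().zip(input) {
        // MaybeUninit::write is what makes each slot valid.
        slot.write(decode(raw));
    }

    // Sound only because every slot was written above; the optimizer
    // typically elides the copy in this map/collect.
    decoded
        .into_iter()
        .map(|v| unsafe { v.assume_init() })
        .collect()
}
```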

Calling decode_sym with &mut norm_bits cannot be made to work reliably in parallel: you're saying that decode_sym may mutate norm_bits freely, but there's no synchronization between two different calls, so running them concurrently risks a data race. You either need to pass &norm_bits, or synchronize at runtime in the cases where decode_sym does mutate norm_bits (e.g. using an RwLock).
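Another way to make the shared &mut iterator disappear is to give each of the four states its own lane of the bit stream, assuming you can partition normalized_bits per state up front. This is only a sketch under that assumption; decode_lane is a hypothetical stand-in, not your decode_sym:

```rust
// Hypothetical stand-in: fold the next bit word (if any) into the state.
fn decode_lane(state: &mut u64, bits: &mut std::slice::Iter<'_, u32>) -> u64 {
    if let Some(&b) = bits.next() {
        *state ^= b as u64;
    }
    *state
}

// Each state owns its own iterator, so no call can race with another:
// the four decode_lane calls are independent and free to pipeline.
fn decode_four(states: &mut [u64; 4], lanes: [&[u32]; 4]) -> [u64; 4] {
    let mut iters = lanes.map(|lane| lane.iter());
    let mut out = [0u64; 4];
    for i in 0..4 {
        out[i] = decode_lane(&mut states[i], &mut iters[i]);
    }
    out
}
```

Whether pre-splitting is feasible depends on whether the number of words each state consumes is known (or computable) ahead of time; if it is purely data-dependent, you're back to the shared-iterator problem.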

Separately, replacing indexing with iteration helps the optimizer do better, because each indexing operation can panic with a different message (so the compiler first has to prove there are no panics before it can optimize), while iteration can't panic, so the compiler avoids the proof step. There's a trick where you use an Iterator<Item = &mut T> as an "output":

```rust
for (output, input_data) in decoded.iter_mut().zip(input_bits.into_iter()) {
    *output = decode(input_data);
}
```