Simulating an outage: panic while holding a lock

Hi,

We are currently testing outages that occur during operation to see how we can recover.

The artificial panic!() occurs after we have acquired a RwLock for writing.

The RwLock lives in an mmapped file.

If a separate program clears the poison from the lock right after the main program has panicked, without trying to read or write through the lock again, it works: once that program finishes and I relaunch my main program, everything works properly.

But if the script relaunches the main executable directly (without first running the program that only clears the poison), then I get a deadlock that no amount of poison clearing can recover from.

Simplified section:

 match take_my_lock {
      Some(out) => {

        // ------------------------
        // Check for healthy lock
        // ------------------------
        let j = &mut *out;

        match j.lock_admin.write() {
            Ok(e) => {

                    // -----------------
                    // Checked healthy
                    // -----------------
                    e
            }
            Err(f) => {

              println!("Write Poison error {:?}", f);

              // -----------------------
              // Poisoned. Retry once
              // -----------------------
              j.lock_admin.clear_poison();

              // <- Deadlocks here if and only if it runs after a panic!().
              //    Does it happen too fast after the poison clearing?
              match j.lock_admin.write() {
                  Ok(out) => {

                      // -----------------
                      // Checked healthy
                      // -----------------
                      println!("Write OK for second write");

                      out
                  }
                  Err(f) => {

                      // -------------------
                      // Definitely f***d.
                      // -------------------
                      println!("Write Poison error 2 {:?}", f);

                      return Err("Poisoned lock".into());
                  }
              }
            }
        }
      }
      None => return Err("No admin header".into())
}
};

So clearing the poison is no longer enough in that specific case.
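One possible way to see why clearing the poison may not be enough: in std::sync::RwLock, the poison flag and the "held" state are two separate things. The sketch below is in-process only and uses mem::forget to stand in for a holder that disappears without ever running the guard's destructor. Whether this is what actually happens with the lock in the mmapped file is an assumption, not something confirmed here. The function names are made up for illustration, and clear_poison needs Rust 1.77 or later.

```rust
use std::mem;
use std::sync::{Arc, RwLock, TryLockError};
use std::thread;

/// The holder panics and unwinds: the write guard's destructor runs, so the
/// lock is released and only the poison flag is left behind.
/// Returns (was_poisoned, writable_after_clear).
fn holder_unwinds() -> (bool, bool) {
    let lock = Arc::new(RwLock::new(0u32));
    let held = Arc::clone(&lock);
    // The spawned thread panics while holding the write guard.
    let _ = thread::spawn(move || {
        let _guard = held.write().unwrap();
        panic!("simulated outage");
    })
    .join();

    let was_poisoned = lock.is_poisoned();
    lock.clear_poison();
    let writable = lock.try_write().is_ok();
    (was_poisoned, writable)
}

/// The holder vanishes without unlocking (simulated with mem::forget):
/// the lock is not poisoned at all, it is simply still held.
/// Returns (was_poisoned, still_blocked_after_clear).
fn holder_never_unlocks() -> (bool, bool) {
    let lock = RwLock::new(0u32);
    // Leak the write guard so its destructor never releases the lock.
    mem::forget(lock.write().unwrap());

    let was_poisoned = lock.is_poisoned();
    lock.clear_poison(); // nothing to clear; the "held" state is untouched
    let still_blocked = matches!(lock.try_write(), Err(TryLockError::WouldBlock));
    (was_poisoned, still_blocked)
}
```

In the first case clear_poison() is sufficient. In the second, a blocking write() would wait forever, which would look like the deadlock described above.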

Again, the panic!() is just to simulate an outage/catastrophe; there are no panics or unwrap calls in the code otherwise.
What can I do? I need to "force reset" the lock in my mmapped memory.

Just in case, adding a sleep doesn't change anything:

j.lock_admin.clear_poison();
let th_millis = time::Duration::from_millis(1000);
thread::sleep(th_millis);
match j.lock_admin.write() {

What I need is a my_lock.release() that works no matter which thread calls it.
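To make that wish concrete, here is a minimal sketch of a lock that any thread or process could release. It is a hypothetical hand-rolled type (PidLock and all its methods are invented for this example), not an API of std or of any crate. It assumes an exclusive try-lock is enough (no readers, no blocking), that the struct would be placed in the shared mapping by the caller, and that checking /proc/<pid> on Linux is an acceptable liveness test despite possible PID reuse.

```rust
use std::path::Path;
use std::process;
use std::sync::atomic::{AtomicU32, Ordering};

const UNLOCKED: u32 = 0;

/// Hypothetical exclusive lock whose whole state is one atomic word:
/// 0 means free, any other value is the PID of the current owner.
#[repr(C)]
pub struct PidLock {
    owner: AtomicU32,
}

impl PidLock {
    pub const fn new() -> Self {
        Self { owner: AtomicU32::new(UNLOCKED) }
    }

    /// Try to take the lock without blocking; records our PID as the owner.
    pub fn try_lock(&self) -> bool {
        self.owner
            .compare_exchange(UNLOCKED, process::id(), Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Unconditional release: works no matter who calls it.
    /// Only safe when the caller knows the previous owner is gone.
    pub fn force_release(&self) {
        self.owner.store(UNLOCKED, Ordering::Release);
    }

    /// Release the lock only if the recorded owner no longer exists.
    /// Assumption: Linux, where a live process has a /proc/<pid> entry.
    pub fn recover_if_owner_dead(&self) -> bool {
        let pid = self.owner.load(Ordering::Acquire);
        if pid == UNLOCKED || Path::new(&format!("/proc/{pid}")).exists() {
            return false;
        }
        // Only reset if the stale owner is still the one recorded.
        self.owner
            .compare_exchange(pid, UNLOCKED, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
    }
}
```

At startup, the main program could call recover_if_owner_dead() before its first try_lock(), so a lock left behind by a crashed run does not block it forever.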

The platform is Linux x86_64 only.

Sleeping during

Otherwise, please suggest an alternative that can live in persistent memory and is easy to recover, without my having to reinvent the wheel.

Thanks!

I'm not sure about using it with persistent memory, but parking_lot::RwLock does not use poisoning.


Yes, that version of RwLock works perfectly in an outage scenario.
Thanks!
