A subtle concurrency bug that I can't seem to solve

So I have this in my code:

static CONTROLLERS: Lazy<RwLock<Vec<NvmeController>>> = Lazy::new(|| RwLock::new(Vec::new()));
static INTRS: Lazy<RwLock<Vec<Interrupt>>> = Lazy::new(|| RwLock::new(Vec::new()));

Where Interrupt is defined to be:

#[derive(Clone, Copy, Debug, Eq, PartialEq, Ord, PartialOrd, Hash)]
struct Interrupt(pub u128, pub u16);

At the same time, I have this code:

        loop {
            if matches!(INTRS.try_read(), Some(i) if i[self.intrspos].1 > 0) {
                break;
            }
            hlt();
        }

And:

    register_interrupt_handler(
        dev.int_line,
        Box::new(move |_| {
            INTRS
                .write()
                .iter_mut()
                .filter(|i| dev.unique_dev_id == i.0)
                .for_each(|i| i.1 += 1);
        }),
    );

This final function is called when a given interrupt is received. The issue is this:

  1. This code works perfectly fine the first time around. The driver sends the command to the controller, the controller responds, the writer lock is acquired, the update is performed, and the lock is released. The driver then acquires a reader lock, checks whether this particular instance has received something, acquires a writer lock to decrement the counter, and releases the lock.
  2. It doesn't, however, work the second time around: the driver gets stuck, no matter where I put the hlt instruction. I've tried classic mutexes, atomics (which didn't compile), SeqLocks, and then went back to reader-writer locks. One of two things happens:
    1. The interrupt is received and the handler is called, but the driver never wakes up to check whether the handler ran (see the sketch after this list); or
    2. The interrupt never occurs at all.
      In both cases, the driver hangs.
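
To make case 1 concrete, here is the wait loop from above again, annotated with the window I suspect matters: if the interrupt lands between the check and the hlt, the handler has already incremented the counter, but the core still goes to sleep, and if nothing else interrupts it afterwards the loop never re-checks (hlt() is the same wrapper as above):

    loop {
        // (1) The counter is still 0 at this point, so we fall through to hlt().
        if matches!(INTRS.try_read(), Some(i) if i[self.intrspos].1 > 0) {
            break;
        }
        // (2) If the interrupt fires right here, the handler increments the
        //     counter, but this loop hasn't seen it yet...
        // (3) ...and hlt() then puts the core to sleep waiting for an
        //     interrupt that has already come and gone.
        hlt();
    }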

I'm dealing with a (potentially) unbounded number of controllers and interrupt sources. I know the emulated system I'm programming this in has only one controller and, therefore, only one interrupt source for this driver; however, I mustn't assume anything about the execution environment, because for all I know someone who runs this in the future might have 30 or 300 controllers.

I've thought about an MPMC queue. The problems I've hit are:

  • The MPMC queues provided by the heapless crate only offer up to 64 entries in the queue at a time. The NVMe specification allows a controller to have up to 65534 queues, plus the admin queue, so this is insufficient for my needs.
  • I can't seem to find an unbounded MPMC queue for no_std. Maybe one exists and I just skipped past it when I was looking through the Concurrency category on crates.io.

What do you guys suggest? I figured that I might hit a concurrency bug like this sooner or later, I just hadn't prepared for it.

For an MPMC queue, maybe crossbeam::queue::ArrayQueue for a bounded MPMC queue of any size (or crossbeam::queue::SegQueue for an unbounded MPMC)? It says it works with no_std when alloc is enabled, which it looks like you have, since you're using Box.
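
Rough shape of the API, in case it helps (just a sketch; Interrupt and the handler body are from your code, and I'm assuming a recent crossbeam where SegQueue::new() is const and pop() returns an Option):

    use crossbeam::queue::SegQueue;

    // Unbounded, so it can live directly in a static without Lazy.
    static INTRS: SegQueue<Interrupt> = SegQueue::new();

    // Producer side (interrupt handler): push allocates as needed and never fails.
    fn on_interrupt(unique_dev_id: u128) {
        INTRS.push(Interrupt(unique_dev_id, 1));
    }

    // Consumer side (driver): pop returns None when the queue is empty.
    fn drain_one() -> Option<Interrupt> {
        INTRS.pop()
    }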

That worked, but I'm still getting weird halts. Code is like this:

static CONTROLLERS: Lazy<Mutex<Vec<NvmeController>>> = Lazy::new(|| Mutex::new(Vec::new()));
static INTRS: SegQueue<Interrupt> = SegQueue::new();

#[derive(Clone, Copy, Debug, Eq, PartialEq, Ord, PartialOrd, Hash)]
struct Interrupt(pub u128, pub usize);

// ...

    fn process_command(
        &mut self,
        req: Self::CommandRequest,
    ) -> Result<Self::Response, Self::Error> {
        debug!("Processing command {:?}", req);
        debug!("Waiting for controller to be ready...");
        // Spin until CSTS.RDY (bit 0) reports that the controller is ready.
        loop {
            if self.read_csts().get_bit(0) {
                break;
            }
        }
        debug!("Writing request");
        self.sqs[req.qid].queue_command(req.entry);
        if req.qid == 0 {
            debug!("Queue is admin queue, writing admin queue doorbell");
            self.write_adm_sub_tail_queue_doorbell(self.sqs[req.qid].get_queue_tail());
        } else {
            debug!("Writing to queue {} doorbell", req.qid);
            self.write_sub_tail_doorbell(req.qid, self.sqs[req.qid].get_queue_tail());
        }
        debug!("Waiting for response");
        let mut i: Interrupt;
        // Spin until an interrupt tagged with this controller's ID shows up;
        // interrupts for other controllers are pushed back onto the queue.
        loop {
            if !INTRS.is_empty() {
                i = INTRS.pop().unwrap();
                if i.0 == self.id {
                    break;
                } else {
                    INTRS.push(i);
                }
            }
        }
        let mut entries: MiniVec<queues::CompletionQueueEntry> = MiniVec::new();
        self.cqs[req.qid].read_new_entries(&mut entries);
        if req.qid == 0 {
            debug!("Writing to admin completion queue doorbell");
            self.write_adm_comp_head_queue_doorbell(self.cqs[req.qid].get_queue_head());
        } else {
            debug!("Writing completion queue doorbell for queue {}", req.qid);
            self.write_comp_head_doorbell(req.qid, self.cqs[req.qid].get_queue_head());
        }
        if entries.len() > 1 || i.1 > 1 {
            warn!(
                "Retrieved {} responses; returning only first",
                entries.len()
            );
            entries.truncate(1);
        }
        let entry = entries[0];
        if entry.status.sc != 0x00 {
            Err(entry.status)
        } else {
            Ok(Response {
                qid: req.qid,
                entry,
            })
        }
    }

// ...

pub async fn init(dev: PciDevice) {
    info!(
        "Registering interrupt handler for interrupt {}",
        dev.int_line
    );
    register_interrupt_handler(
        dev.int_line,
        Box::new(move |_| {
            INTRS.push(Interrupt(dev.unique_dev_id, 1));
        }),
    );
    let mut controllers = CONTROLLERS.lock();
    let controller = unsafe { NvmeController::new(dev).await };
    if let Some(c) = controller {
        controllers.push(c);
    } else {
        error!("Cannot add NVMe controller");
        return;
    }
}

I've no idea why it's halting now.
