How do I store raw data + borrowed parsed pieces in one object?

I am trying to implement a library that efficiently parses and transforms Linux audit logs. Audit events usually consist of multiple records that are spread across several lines, and those lines may not have been transmitted together. My idea is pretty straightforward: store all the records belonging to an event that hasn't been fully processed, and return them once an end-of-event marker has been received.

Measurements on a prototype implementation in another language have taught me that allocation/deallocation leads to significant overhead at high message rates, so I want to avoid copying raw data. My basic types (Key, Value) are mostly &[u8] slices borrowed from the raw message:

#[derive(Debug,Clone)]
pub struct EventBody<'a> {
    data: HashMap<MessageType,Vec<(Key<'a>,Value<'a>)>>,
    raw: Vec<u8>,
}

#[derive(Debug)]
pub struct Coalesce<'a> {
    inflight: HashMap<EventID,EventBody<'a>>,
}

(I am pretty sure that I will want to work with some sort of buffer pool for the raw messages at some point, but that's going to be another story.)

My peg grammar for individual lines has a top-level definition as:

peg::parser! {
    grammar audit_parser() for [u8] {
        pub rule record() -> (MessageType, EventID, Vec<(Key<'input>, Value<'input>)>)
            = ...
    }
}

I assume that the resulting function looks something like this:

mod audit_parser {
    pub fn record<'input>(input: &'input [u8]) -> (MessageType, EventID, Vec<(Key<'input>, Value<'input>)>)
    { ... }
}

So far, things look fine. The parser does not copy every bit of data it is fed, and I am aware that this means the input data has to live as long as the values. Since I have no use for the input data after passing it into the Coalesce object, I figured I'd just pass a Vec<u8> or a Box<[u8]> and store that as raw data.

Alas, at this point I have run into a wall. Here's a code snippet from one of the many variants I have tried, edited for brevity:

impl<'obj> Coalesce<'obj> {
    pub fn add_line(&mut self, raw: Vec<u8>) -> Result<(EventID, Option<EventBody>), String> {
        let (typ, id, values) = audit_parser::record(raw.as_ref()).map_err(|e|e.to_string())?;
        match typ {
            MessageType(1300) =>  {
                if self.inflight.contains_key(&id) {
                    return Err(format!("duplicate SYSCALL for id {}", id))
                }
                let mut data = HashMap::new();
                data.insert(typ, values);
                self.inflight.insert(id, EventBody{data, raw: raw});
                return Ok((id, None));
            }
            // …
            _ => Err("not implemented".to_string())
        }
    }
}

The compiler tells me:

  • error[E0597]: `raw` does not live long enough
  • error[E0505]: cannot move out of `raw` because it is borrowed

I think I understand both errors. And I think that the compiler is trying to tell me in twisted terms that raw is going to need a lifetime. Sure, I could just change add_line so it takes an &'obj[u8], but then the library's users would need to worry about keeping that data alive, wouldn't they? What am I missing?

You cannot store owned data and references to that data in the same struct, as you're trying to do with EventBody; you'll have to restructure your code to avoid that. One workaround is to store usize indices into raw in your HashMap, and construct references with &self.raw[start..end] as needed.
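A minimal sketch of that index-based layout, assuming byte ranges are recorded instead of slices. The `push` and `pairs` methods and the `MessageType` stand-in are hypothetical names for illustration, not part of the original code:

```rust
use std::collections::HashMap;
use std::ops::Range;

// Hypothetical stand-in for the MessageType used in the question.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct MessageType(pub u32);

/// Owns the raw line and stores only byte ranges into it,
/// so the struct needs no lifetime parameter.
#[derive(Debug, Clone)]
pub struct EventBody {
    raw: Vec<u8>,
    data: HashMap<MessageType, Vec<(Range<usize>, Range<usize>)>>,
}

impl EventBody {
    pub fn new(raw: Vec<u8>) -> Self {
        EventBody { raw, data: HashMap::new() }
    }

    /// Records a key/value pair as two ranges into `raw`.
    pub fn push(&mut self, typ: MessageType, key: Range<usize>, value: Range<usize>) {
        self.data.entry(typ).or_default().push((key, value));
    }

    /// Rebuilds the borrowed slices on demand; they borrow from `self`.
    pub fn pairs<'a>(
        &'a self,
        typ: MessageType,
    ) -> impl Iterator<Item = (&'a [u8], &'a [u8])> + 'a {
        self.data
            .get(&typ)
            .into_iter()
            .flatten()
            .map(move |(k, v)| (&self.raw[k.clone()], &self.raw[v.clone()]))
    }
}
```

The slices returned by `pairs` are tied to a borrow of the `EventBody` itself, which is something the borrow checker can track. How the offsets are obtained depends on the parser; with peg, the `position!()` expression is probably the simplest way to capture them inside a rule.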


Thank you very much. I think I learned something today. :)

After thinking a bit more about this, I realized why I shouldn't be allowed to mix owned and borrowed data the way I tried: nothing would prevent other code from changing the owned data, thereby corrupting the borrowed data.

Having to build my own indirection has another nice side effect: since I can be sure that the underlying data blobs will be at most a few kB each, I can get away with two u16 values instead of a pointer/usize pair.
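Such a compact index pair could look roughly like the sketch below. `Span`, `new`, and `get` are hypothetical names, and the start/length layout is just one possible choice:

```rust
/// Hypothetical compact span: a start offset and a length, two bytes each.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    start: u16,
    len: u16,
}

impl Span {
    /// Returns `None` if `end < start` or if a value does not fit into a `u16`.
    pub fn new(start: usize, end: usize) -> Option<Span> {
        let len = end.checked_sub(start)?;
        let max = usize::from(u16::MAX);
        if start > max || len > max {
            return None;
        }
        // Both casts are lossless because of the range check above.
        Some(Span { start: start as u16, len: len as u16 })
    }

    /// Resolves the span against the buffer it was created for.
    /// Returns `None` if the span reaches past the end of `raw`.
    pub fn get<'a>(&self, raw: &'a [u8]) -> Option<&'a [u8]> {
        let start = usize::from(self.start);
        raw.get(start..start + usize::from(self.len))
    }
}
```

A `Span` occupies 4 bytes, compared with two machine words for a `&[u8]`. The price is that every access has to be handed the right buffer, and nothing in the type system ties a span to the buffer it came from.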

No, this isn't the problem: the shared reference to self.raw already prevents (or would prevent) other code from getting a mutable reference to it. I admit I don't fully understand why the borrow checker doesn't support "self-referential structs" like this. One reason that's often cited is that safe code (like std::mem::replace) could move an EventBody value from one memory location to another, which would potentially invalidate the internal reference. That doesn't actually happen in this case, because moving a Vec<u8> doesn't move the heap data it owns, but the borrow checker probably can't reason that way.
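The point about Vec<u8> can be checked with a small experiment. This is only an illustration with a hypothetical `Holder` type standing in for EventBody; it shows where the bytes live and does not make the self-referential layout expressible in safe Rust:

```rust
/// Hypothetical owner of a raw line, standing in for `EventBody`.
pub struct Holder {
    raw: Vec<u8>,
}

/// Returns the address of the Vec's heap buffer before and after
/// the owning struct is moved into a `Box`.
pub fn heap_address_across_move() -> (*const u8, *const u8) {
    let holder = Holder { raw: b"type=SYSCALL".to_vec() };
    let before = holder.raw.as_ptr();
    // Moving `holder` relocates the struct (pointer, length, capacity),
    // but not the heap allocation that the Vec points to.
    let moved = Box::new(holder);
    let after = moved.raw.as_ptr();
    (before, after)
}
```

Both pointers compare equal, so a slice into the buffer would still point at valid bytes after the move. The compiler has no way to know that in general, since the same struct could just as well hold an inline array, whose contents do move with it.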


Might it be that you have both an owned value and a separate borrowed reference to it? That means you could violate the aliasing rules by creating a mutable reference through the owned value. The borrowed reference wouldn't know anything about that, hence the issue.