Parsing on read with LazyCell

I’m analysing some large tab-separated log files in various ways, and due to (1) the size of the logs, (2) the cost of parsing some fields, and (3) each analysis only accessing a small subset of the fields, I want to parse individual fields only when they’re read.

I was able to achieve what I wanted with code of the following form (here simplified to parsing a couple of trivial values to avoid cluttering this post with unnecessary detail). However, the boxed closure feels messy to me. Is this just inherent complexity, or am I missing a more elegant way of achieving this?

use std::cell::LazyCell;
use std::num::{ParseFloatError, ParseIntError};
use std::str::FromStr;

fn main() {
    // in real code this string would come from BufRead's Lines iterator
    let test = "12\t3.45".to_owned();
    
    let values = Values::from_str(&test).expect("String has sufficient values");

    assert_eq!(&Ok(12), values.a());
    assert_eq!(&Ok(3.45), values.b());
}

type ParseResult<T> = Result<T, <T as FromStr>::Err>;

type LazyParse<'source, T> =
    LazyCell<ParseResult<T>, Box<dyn FnOnce() -> ParseResult<T> + 'source>>;

// pretend there are many more fields and the types being parsed are much more
// expensive to parse than u32/f32
pub struct Values<'source> {
    a: LazyParse<'source, u32>,
    b: LazyParse<'source, f32>,
}

#[derive(Debug)]
pub struct ParseValuesError;

impl<'source> Values<'source> {
    pub fn from_str(s: &'source str) -> Result<Self, ParseValuesError> {
        let mut parts = s.split('\t');
        let a = parts.next().ok_or(ParseValuesError)?;
        let b = parts.next().ok_or(ParseValuesError)?;

        Ok(Self {
            a: LazyCell::new(Box::new(|| a.parse())),
            b: LazyCell::new(Box::new(|| b.parse())),
        })
    }

    pub fn a(&self) -> &Result<u32, ParseIntError> {
        &self.a
    }

    pub fn b(&self) -> &Result<f32, ParseFloatError> {
        &self.b
    }
}

I don't think it's necessarily more elegant, but one alternative is to store the raw string slices inline in the struct and cache the parsed results in OnceCell fields, instead of capturing the slices in heap-allocated boxed closures:

use std::cell::OnceCell;
use std::num::{ParseFloatError, ParseIntError};
use std::str::FromStr;

fn main() {
    // in real code this string would come from BufRead's Lines iterator
    let test = "12\t3.45".to_owned();

    let values = Values::from_str(&test).expect("String has sufficient values");

    assert_eq!(&Ok(12), values.a());
    assert_eq!(&Ok(3.45), values.b());
}

type ParseResult<T> = Result<T, <T as FromStr>::Err>;

type LazyParse<T> = OnceCell<ParseResult<T>>;

// pretend there are many more fields and the types being parsed are much more
// expensive to parse than u32/f32
#[derive(Default)]
pub struct Values<'source> {
    a_raw: &'source str,
    b_raw: &'source str,
    a: LazyParse<u32>,
    b: LazyParse<f32>,
}

#[derive(Debug)]
pub struct ParseValuesError;

impl<'source> Values<'source> {
    pub fn from_str(s: &'source str) -> Result<Self, ParseValuesError> {
        let mut parts = s.split('\t');

        let a_raw = parts.next().ok_or(ParseValuesError)?;
        let b_raw = parts.next().ok_or(ParseValuesError)?;

        Ok(Self {
            a_raw,
            b_raw,
            ..Default::default()
        })
    }

    pub fn a(&self) -> &Result<u32, ParseIntError> {
        // get_or_init already returns a reference, so no extra `&` is needed
        self.a.get_or_init(|| self.a_raw.parse())
    }

    pub fn b(&self) -> &Result<f32, ParseFloatError> {
        self.b.get_or_init(|| self.b_raw.parse())
    }
}
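
If you want to convince yourself that this pattern really defers the parse and then caches it, here is a minimal sketch. `Counted`, `Record`, `PARSES` and `parse_counts` are names I made up for illustration; `Counted` stands in for an expensive type, and its `FromStr` impl bumps a thread-local counter on every call.

```rust
use std::cell::{Cell, OnceCell};
use std::num::ParseIntError;
use std::str::FromStr;

thread_local! {
    // Hypothetical counter: how often Counted::from_str has run on this thread.
    static PARSES: Cell<usize> = Cell::new(0);
}

// Stand-in for a type that is expensive to parse.
#[derive(Debug, PartialEq)]
struct Counted(u32);

impl FromStr for Counted {
    type Err = ParseIntError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        PARSES.with(|c| c.set(c.get() + 1));
        s.parse::<u32>().map(Counted)
    }
}

// One raw slice plus one cache cell, as in the struct above.
struct Record<'source> {
    raw: &'source str,
    parsed: OnceCell<Result<Counted, ParseIntError>>,
}

impl<'source> Record<'source> {
    fn new(raw: &'source str) -> Self {
        Self { raw, parsed: OnceCell::new() }
    }

    fn value(&self) -> &Result<Counted, ParseIntError> {
        self.parsed.get_or_init(|| self.raw.parse())
    }
}

fn parses() -> usize {
    PARSES.with(|c| c.get())
}

// Parse counts observed after construction, after the first read,
// and after the second read, relative to the count at entry.
fn parse_counts(raw: &str) -> (usize, usize, usize) {
    let start = parses();
    let record = Record::new(raw);
    let after_new = parses() - start;
    let _ = record.value();
    let after_first = parses() - start;
    let _ = record.value();
    let after_second = parses() - start;
    (after_new, after_first, after_second)
}

fn main() {
    // Nothing is parsed on construction; the first read parses; the second reuses the cache.
    assert_eq!(parse_counts("12"), (0, 1, 1));
    // A failed parse is cached as well, so the error is not recomputed.
    assert_eq!(parse_counts("not a number"), (0, 1, 1));
    assert_eq!(Record::new("7").value(), &Ok(Counted(7)));
}
```

Note that the `Err` result is stored in the cell too, so a malformed field is only parsed once even if it is read repeatedly.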



I was writing an almost identical solution to jofas's, and I was so close to done that I decided to post it anyway :'D. It's just a bit different: I used a wrapper type, which is useful when you have multiple structs that need lazy parsing, or when you want to add shared behavior in one place. For a single struct, jofas's response is simpler. In your case, just storing the string slice directly and caching the parsed result will do; you don't need the box.

use std::cell::OnceCell;
use std::num::{ParseFloatError, ParseIntError};
use std::str::FromStr;

#[derive(Debug)]
pub struct ParseValuesError;

pub struct LazyField<'a, T: FromStr> {
    source: &'a str,
    cached: OnceCell<Result<T, T::Err>>,
}

impl<'a, T: FromStr> LazyField<'a, T> {
    pub fn new(source: &'a str) -> Self {
        Self {
            source,
            cached: OnceCell::new(),
        }
    }

    pub fn get(&self) -> &Result<T, T::Err> {
        self.cached.get_or_init(|| self.source.parse())
    }

    // Bonus: exposes raw string publicly without parsing
    pub fn raw(&self) -> &str {
        self.source
    }
}

// with the wrapper, 2 fields instead of 4 (no separate raw + cache pairs)
pub struct Values<'source> {
    a: LazyField<'source, u32>,
    b: LazyField<'source, f32>,
}

impl<'source> Values<'source> {
    pub fn from_str(s: &'source str) -> Result<Self, ParseValuesError> {
        let mut parts = s.split('\t');
        let a = parts.next().ok_or(ParseValuesError)?;
        let b = parts.next().ok_or(ParseValuesError)?;

        Ok(Self {
            a: LazyField::new(a),
            b: LazyField::new(b),
        })
    }

    pub fn a(&self) -> &Result<u32, ParseIntError> {
        self.a.get()
    }

    pub fn b(&self) -> &Result<f32, ParseFloatError> {
        self.b.get()
    }
}
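
To illustrate the "multiple structs" point, here is a rough sketch of a second record type reusing the same wrapper. `Request`, `from_line`, `status` and `latency_ms` are hypothetical names, not anything from your logs, and `LazyField` is repeated only so the snippet compiles on its own.

```rust
use std::cell::OnceCell;
use std::str::FromStr;

// Same shape as the LazyField wrapper above.
pub struct LazyField<'a, T: FromStr> {
    source: &'a str,
    cached: OnceCell<Result<T, T::Err>>,
}

impl<'a, T: FromStr> LazyField<'a, T> {
    pub fn new(source: &'a str) -> Self {
        Self { source, cached: OnceCell::new() }
    }

    // Parses on the first call, then returns the cached result.
    pub fn get(&self) -> &Result<T, T::Err> {
        self.cached.get_or_init(|| self.source.parse())
    }

    // Returns the unparsed text without touching the cache.
    pub fn raw(&self) -> &str {
        self.source
    }
}

// Hypothetical second record type sharing the wrapper.
pub struct Request<'source> {
    status: LazyField<'source, u16>,
    latency_ms: LazyField<'source, f64>,
}

impl<'source> Request<'source> {
    // Only splits the line; no field is parsed here.
    pub fn from_line(line: &'source str) -> Option<Self> {
        let mut parts = line.split('\t');
        Some(Self {
            status: LazyField::new(parts.next()?),
            latency_ms: LazyField::new(parts.next()?),
        })
    }
}

fn main() {
    let line = "200\tfast".to_owned();
    let req = Request::from_line(&line).expect("line has two fields");

    assert_eq!(req.status.get(), &Ok(200));
    // The malformed second field only surfaces as an error when it is read.
    assert!(req.latency_ms.get().is_err());
    // The raw text is still available for diagnostics.
    assert_eq!(req.latency_ms.raw(), "fast");
}
```

One consequence of this design worth knowing: a line with a malformed field still constructs fine, and the error only shows up for analyses that actually read that field.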