I’m analysing some large tab-separated log files in various ways, and due to (1) the size of the logs, (2) the cost of parsing some fields, and (3) each analysis only accessing a small subset of the fields, I want to parse individual fields only when they’re read.
I was able to achieve what I wanted with code of the following form (here simplified to parsing a couple of trivial values to avoid cluttering this post with unnecessary detail). However, the boxed closure feels messy to me. Is this just inherent complexity, or am I missing a more elegant way of achieving this?
```rust
use std::cell::LazyCell;
use std::num::{ParseFloatError, ParseIntError};
use std::str::FromStr;

fn main() {
    // in real code this string would come from BufRead's Lines iterator
    let test = "12\t3.45".to_owned();
    let values = Values::from_str(&test).expect("String has sufficient values");
    assert_eq!(&Ok(12), values.a());
    assert_eq!(&Ok(3.45), values.b());
}

type ParseResult<T> = Result<T, <T as FromStr>::Err>;
type LazyParse<'source, T> =
    LazyCell<ParseResult<T>, Box<dyn FnOnce() -> ParseResult<T> + 'source>>;

// pretend there are many more fields and the types being parsed are much more
// expensive to parse than u32/f32
pub struct Values<'source> {
    a: LazyParse<'source, u32>,
    b: LazyParse<'source, f32>,
}

#[derive(Debug)]
pub struct ParseValuesError;

impl<'source> Values<'source> {
    pub fn from_str(s: &'source str) -> Result<Self, ParseValuesError> {
        let mut parts = s.split('\t');
        let a = parts.next().ok_or(ParseValuesError)?;
        let b = parts.next().ok_or(ParseValuesError)?;
        Ok(Self {
            a: LazyCell::new(Box::new(|| a.parse())),
            b: LazyCell::new(Box::new(|| b.parse())),
        })
    }

    pub fn a(&self) -> &Result<u32, ParseIntError> {
        &self.a
    }

    pub fn b(&self) -> &Result<f32, ParseFloatError> {
        &self.b
    }
}
```
I don't think it's necessarily more elegant, but storing the raw strings inline instead of putting them on the heap as part of the closure could be an alternative:
```rust
use std::cell::OnceCell;
use std::num::{ParseFloatError, ParseIntError};
use std::str::FromStr;

fn main() {
    // in real code this string would come from BufRead's Lines iterator
    let test = "12\t3.45".to_owned();
    let values = Values::from_str(&test).expect("String has sufficient values");
    assert_eq!(&Ok(12), values.a());
    assert_eq!(&Ok(3.45), values.b());
}

type ParseResult<T> = Result<T, <T as FromStr>::Err>;
type LazyParse<T> = OnceCell<ParseResult<T>>;

// pretend there are many more fields and the types being parsed are much more
// expensive to parse than u32/f32
#[derive(Default)]
pub struct Values<'source> {
    a_raw: &'source str,
    b_raw: &'source str,
    a: LazyParse<u32>,
    b: LazyParse<f32>,
}

#[derive(Debug)]
pub struct ParseValuesError;

impl<'source> Values<'source> {
    pub fn from_str(s: &'source str) -> Result<Self, ParseValuesError> {
        let mut parts = s.split('\t');
        let a_raw = parts.next().ok_or(ParseValuesError)?;
        let b_raw = parts.next().ok_or(ParseValuesError)?;
        Ok(Self {
            a_raw,
            b_raw,
            ..Default::default()
        })
    }

    // get_or_init already returns a reference, so no extra `&` is needed
    pub fn a(&self) -> &Result<u32, ParseIntError> {
        self.a.get_or_init(|| self.a_raw.parse())
    }

    pub fn b(&self) -> &Result<f32, ParseFloatError> {
        self.b.get_or_init(|| self.b_raw.parse())
    }
}
```
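Since the question mentions that the lines really come from `BufRead`'s `Lines` iterator, here is a rough sketch of how that might look. Each `Values` borrows from the line's `String`, so it has to be built and used within the same loop iteration. `sum_b` is a hypothetical analysis invented for this illustration, and `Values` is repeated in trimmed form (with `from_str` returning an `Option` rather than `ParseValuesError`) so the block compiles on its own:

```rust
use std::cell::OnceCell;
use std::io::{self, BufRead, Cursor};
use std::num::{ParseFloatError, ParseIntError};

pub struct Values<'source> {
    a_raw: &'source str,
    b_raw: &'source str,
    a: OnceCell<Result<u32, ParseIntError>>,
    b: OnceCell<Result<f32, ParseFloatError>>,
}

impl<'source> Values<'source> {
    // Simplified for this sketch: None if the line has fewer than two fields.
    pub fn from_str(s: &'source str) -> Option<Self> {
        let mut parts = s.split('\t');
        Some(Self {
            a_raw: parts.next()?,
            b_raw: parts.next()?,
            a: OnceCell::new(),
            b: OnceCell::new(),
        })
    }

    pub fn a(&self) -> &Result<u32, ParseIntError> {
        self.a.get_or_init(|| self.a_raw.parse())
    }

    pub fn b(&self) -> &Result<f32, ParseFloatError> {
        self.b.get_or_init(|| self.b_raw.parse())
    }
}

// Hypothetical analysis: sum field `b`, skipping short lines and bad values.
// Field `a` is never touched, so it is never parsed.
pub fn sum_b(reader: impl BufRead) -> io::Result<f32> {
    let mut total = 0.0;
    for line in reader.lines() {
        let line = line?; // owned String, alive until the end of this iteration
        let Some(values) = Values::from_str(&line) else { continue };
        if let Ok(b) = values.b() {
            total += *b;
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    // Cursor stands in for a real file wrapped in a BufReader.
    let input = Cursor::new("12\t1.5\n7\t2.25\nnot-a-number\t0.25\n");
    // The malformed `a` on the third line is irrelevant because only `b` is read.
    assert_eq!(sum_b(input)?, 4.0);

    let line = String::from("12\t3.45");
    let values = Values::from_str(&line).expect("line has two fields");
    assert_eq!(values.a(), &Ok(12));
    Ok(())
}
```

One consequence of the borrow is that a `Values` can't be collected into a `Vec` that outlives the loop unless the lines themselves are kept around as well.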
I was writing an almost identical solution to jofas's, and I was so close to done that I decided to post it anyway :'D. It differs only slightly: I used a wrapper type, which is useful when you have multiple structs needing lazy parsing or want to add shared behavior in one place. For a single struct, jofas's response is simpler. In your case, storing the string slice directly and caching the parsed result is enough; you don't need the box.
```rust
use std::cell::OnceCell;
use std::num::{ParseFloatError, ParseIntError};
use std::str::FromStr;

#[derive(Debug)]
pub struct ParseValuesError;

pub struct LazyField<'a, T: FromStr> {
    source: &'a str,
    cached: OnceCell<Result<T, T::Err>>,
}

impl<'a, T: FromStr> LazyField<'a, T> {
    pub fn new(source: &'a str) -> Self {
        Self {
            source,
            cached: OnceCell::new(),
        }
    }

    // Parses on first access, then returns the cached result
    pub fn get(&self) -> &Result<T, T::Err> {
        self.cached.get_or_init(|| self.source.parse())
    }

    // Bonus: exposes the raw string without parsing it
    pub fn raw(&self) -> &str {
        self.source
    }
}

// with the wrapper, 2 fields instead of 4 (no separate raw + cache pairs)
pub struct Values<'source> {
    a: LazyField<'source, u32>,
    b: LazyField<'source, f32>,
}

impl<'source> Values<'source> {
    pub fn from_str(s: &'source str) -> Result<Self, ParseValuesError> {
        let mut parts = s.split('\t');
        let a = parts.next().ok_or(ParseValuesError)?;
        let b = parts.next().ok_or(ParseValuesError)?;
        Ok(Self {
            a: LazyField::new(a),
            b: LazyField::new(b),
        })
    }

    pub fn a(&self) -> &Result<u32, ParseIntError> {
        self.a.get()
    }

    pub fn b(&self) -> &Result<f32, ParseFloatError> {
        self.b.get()
    }
}
```
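To check that the wrapper really defers and caches the work, here is a minimal sketch using a stand-in type whose `FromStr` impl bumps a thread-local counter. `Counted`, `PARSE_CALLS`, and `parse_calls` are names made up for this illustration; `LazyField` is repeated (without `raw`) so the block compiles on its own:

```rust
use std::cell::{Cell, OnceCell};
use std::num::ParseIntError;
use std::str::FromStr;

thread_local! {
    // Hypothetical counter, only here to observe how often parsing runs.
    static PARSE_CALLS: Cell<u32> = Cell::new(0);
}

// Hypothetical stand-in for a type that is expensive to parse.
#[derive(Debug, PartialEq)]
pub struct Counted(pub u32);

impl FromStr for Counted {
    type Err = ParseIntError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        PARSE_CALLS.with(|c| c.set(c.get() + 1));
        s.parse::<u32>().map(Counted)
    }
}

// Number of times Counted::from_str has run on the current thread.
pub fn parse_calls() -> u32 {
    PARSE_CALLS.with(|c| c.get())
}

pub struct LazyField<'a, T: FromStr> {
    source: &'a str,
    cached: OnceCell<Result<T, T::Err>>,
}

impl<'a, T: FromStr> LazyField<'a, T> {
    pub fn new(source: &'a str) -> Self {
        Self {
            source,
            cached: OnceCell::new(),
        }
    }

    pub fn get(&self) -> &Result<T, T::Err> {
        self.cached.get_or_init(|| self.source.parse())
    }
}

fn main() {
    let field: LazyField<Counted> = LazyField::new("12");
    assert_eq!(parse_calls(), 0); // constructing the field parses nothing

    assert_eq!(field.get(), &Ok(Counted(12)));
    assert_eq!(field.get(), &Ok(Counted(12)));
    assert_eq!(parse_calls(), 1); // the second get() reused the cached result
}
```

Note that a failed parse is cached too, since the `OnceCell` stores the whole `Result`; a bad field is only parsed once, and every later `get()` returns the same `Err`.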