Which method is fastest to dump/load data to/from local disk?

I currently use the functions below to dump/load data to/from local disk:

pub fn load<T>(file_name: &str) -> Result<T, Box<dyn Error>>
where
    // `serde_pickle::from_reader` requires an owned deserialization,
    // so the bound must be `DeserializeOwned`, not `Deserialize<'a>`.
    T: serde::de::DeserializeOwned,
{
    assert!(
        std::path::Path::new(file_name).exists(),
        "{:?} doesn't exist.",
        file_name
    );
    let data: T = serde_pickle::from_reader(
        std::fs::OpenOptions::new().read(true).open(file_name)?,
        Default::default(),
    )?;
    Ok(data)
}

pub fn dump<T>(file_name: &str, data: &T) -> Result<(), Box<dyn Error>>
where
    T: serde::Serialize,
{
    let mut coin_file = std::fs::OpenOptions::new()
        .create(true)
        .truncate(true) // If the file already exists we want to overwrite the old data
        .write(true)
        .read(true)
        .open(file_name)?;
    serde_pickle::to_writer(&mut coin_file, data, serde_pickle::SerOptions::new())?;
    Ok(())
}

I mainly dump/load `Vec<Vec<f64>>` data with the functions above, but I find that the `dump` fn is too slow, and both functions are called frequently. Which method is the fastest way to dump data to local disk?

Pickle, being a generic serialization format for a dynamically-typed language, isn't the fastest thing. Furthermore, you mention that these functions are called frequently, which involves re-opening the file(s) many times, which in turn also has considerable overhead.

If you are mainly working with raw numeric data, then you have other options:

  • Pick a compact binary format, such as bincode, msgpack, or cbor;
  • If you need to interoperate with Python, work with contiguous ndarray instances and dump them in npy format, which is about as fast as it gets.

And on top of all of that, you probably want to use a real database instead of a file (or many separate files), to avoid opening and closing a file many times. There are small, embeddable databases that store their data in a single file on disk; the industry standard is SQLite.
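Since the data really is just `Vec<Vec<f64>>`, a useful baseline to measure any format against is writing the raw little-endian bytes yourself through buffered I/O. This is a minimal std-only sketch; the length-prefixed framing (`u64` row count, then a `u64` length plus raw `f64` bytes per row) is my own ad-hoc layout, not a standard format:

```rust
use std::io::{self, Read, Write};

// Dump: a u64 row count, then for each row a u64 element count
// followed by the row's f64 values as raw little-endian bytes.
fn dump_raw<W: Write>(mut w: W, data: &[Vec<f64>]) -> io::Result<()> {
    w.write_all(&(data.len() as u64).to_le_bytes())?;
    for row in data {
        w.write_all(&(row.len() as u64).to_le_bytes())?;
        for &x in row {
            w.write_all(&x.to_le_bytes())?;
        }
    }
    w.flush()
}

// Load: the exact inverse of `dump_raw`.
fn load_raw<R: Read>(mut r: R) -> io::Result<Vec<Vec<f64>>> {
    let mut len_buf = [0u8; 8];
    r.read_exact(&mut len_buf)?;
    let rows = u64::from_le_bytes(len_buf) as usize;
    let mut data = Vec::with_capacity(rows);
    for _ in 0..rows {
        r.read_exact(&mut len_buf)?;
        let len = u64::from_le_bytes(len_buf) as usize;
        let mut row = Vec::with_capacity(len);
        for _ in 0..len {
            let mut f_buf = [0u8; 8];
            r.read_exact(&mut f_buf)?;
            row.push(f64::from_le_bytes(f_buf));
        }
        data.push(row);
    }
    Ok(data)
}
```

When pointing these at actual files, wrap the file handles in `std::io::BufWriter` / `std::io::BufReader`, otherwise every 8-byte write becomes its own syscall.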


Hi @H2CO3
Thank you for your help!
If I only want to load/dump many files as quickly as possible, without considering other factors, which method would you recommend?

Measure all of your possible options. It's not possible to predict performance perfectly; there are just too many factors.
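Measuring can be as simple as timing each candidate serializer into an in-memory buffer, so disk latency doesn't drown out the pure encoding cost. A minimal std-only sketch (`bench_dump` is a hypothetical helper of mine, not from any crate):

```rust
use std::io::Write;
use std::time::{Duration, Instant};

// Time one candidate "dump" routine by serializing into an in-memory
// buffer repeatedly; returns the average time per iteration.
// (`Write` is imported so candidate closures can call `write_all`.)
fn bench_dump<F: FnMut(&mut Vec<u8>)>(name: &str, iters: u32, mut f: F) -> Duration {
    let mut buf = Vec::new();
    let start = Instant::now();
    for _ in 0..iters {
        buf.clear();
        f(&mut buf);
    }
    let per_iter = start.elapsed() / iters;
    println!("{name}: {per_iter:?} per iteration ({} bytes)", buf.len());
    per_iter
}
```

Call it once per candidate on your real data, e.g. `bench_dump("pickle", 100, |buf| serde_pickle::to_writer(buf, &data, serde_pickle::SerOptions::new()).unwrap())`, and compare both the timings and the output sizes. For a serious comparison, the criterion crate does this with proper statistics.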


Of the formats mentioned above, bincode is typically fastest to serialize. You may try to combine it with fast compression like lz4.


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.