I have some code that parses some JSON files and converts them into a set of parquet files. It all works great, but I'm hoping the memory usage could be reduced. I would really appreciate any advice on how to do so.
One area where I would really like to reduce allocations is in the deduplication and output section. In summary, I have a struct of fields of interest, let's call it Key; I want to assign an integer ID to each unique Key value and write out a parquet file containing each unique Key alongside its ID. I am using Polars to generate the files, and Polars really wants column-ordered data to turn into Series and then into a DataFrame; that is, I need a struct of column vectors rather than a vector of row structs.
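To make that concrete, the writing-out step that I've elided from the simplified code further down is roughly like this. This is a sketch, not the real code, and the exact Series/DataFrame/ParquetWriter signatures vary between Polars releases; this matches the older NamedFrom-style API, so names or column wrappers may need tweaking on newer versions:

use polars::prelude::*;
use std::fs::File;

// Turn the accumulated columns into a DataFrame and write it out as parquet.
// (Sketch only: constructor signatures differ between Polars versions.)
fn write_keys(key_cols: &KeyCols, path: &str) -> PolarsResult<()> {
    let mut df = DataFrame::new(vec![
        Series::new("key_id", &key_cols.key_id),
        Series::new("a", &key_cols.a),
        Series::new("b", &key_cols.b),
    ])?;
    let file = File::create(path).expect("could not create output file");
    ParquetWriter::new(file).finish(&mut df)?;
    Ok(())
}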
The code I have uses the once_cell crate to set up a singleton cache for all "global" state. It has a HashMap to dedupe the Keys. The function that looks up the integer ID for a key also, if the key has not been seen before, inserts it into a column-ordered writer struct that will eventually be saved out to a parquet file. I have written a simplified version of the code, which I append below.
My issue is that every String field needs two copies: one as the key in the key_ids hashmap, and one in the KeyCols writer struct. I would very much like to halve my memory usage by doing something clever.
My first thought was just to use Arc<str> instead of String, but I don't think I can get from a Vec<Arc<str>> to a Polars Series. I also experimented with Cow<'a, str> but couldn't work out the lifetime issues, and I ran into similar problems just storing plain references in KeyCols.
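For illustration, here is a rough, untested sketch of the Arc<str> direction I was attempting: both the HashMap key and KeyCols hold Arc<str>, so each string is allocated once and "copying" it is just a reference-count bump; at write time I would borrow &str views out of the Arcs, on the assumption that Polars can build a string Series from a Vec<&str> (which is exactly the part I'm not sure about). The ArcKey/ArcKeyCols names are just placeholders for this sketch:

use std::sync::Arc;

// Same as Key, but the string data is shared via Arc<str>,
// e.g. ArcKey { a: Arc::from("Hello"), b: Arc::from("World") }.
#[derive(Debug, Hash, PartialEq, Eq)]
struct ArcKey {
    a: Arc<str>,
    b: Arc<str>,
}

#[derive(Debug, Default)]
struct ArcKeyCols {
    key_id: Vec<u64>,
    a: Vec<Arc<str>>,
    b: Vec<Arc<str>>,
}

impl ArcKeyCols {
    // Cloning an Arc<str> only bumps a reference count; the string data
    // itself is shared with the copy living in the key_ids HashMap.
    fn append(&mut self, key_id: u64, key: &ArcKey) {
        self.key_id.push(key_id);
        self.a.push(Arc::clone(&key.a));
        self.b.push(Arc::clone(&key.b));
    }

    // At write time, borrow plain &str views to hand to Polars
    // (assuming Series::new accepts a Vec<&str>; I haven't verified this).
    fn a_as_strs(&self) -> Vec<&str> {
        self.a.iter().map(|s| &**s).collect()
    }
}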
I'm not expecting anyone to rewrite my code for me, but any advice on how to proceed, or any notes on the code at all, would be greatly appreciated.
use std::{collections::HashMap, sync::Mutex};
use once_cell::sync::OnceCell;

// Struct with various fields of interest.
#[derive(Debug, Hash, PartialEq, Eq)]
struct Key {
    a: String,
    b: String,
}

// Column-ordered rows to be turned into a Polars dataframe for writing out to parquet.
#[derive(Debug)]
struct KeyCols {
    key_id: Vec<u64>,
    a: Vec<String>,
    b: Vec<String>,
}

impl KeyCols {
    fn new() -> Self {
        KeyCols {
            key_id: Vec::new(),
            a: Vec::new(),
            b: Vec::new(),
        }
    }

    // Add a row
    fn append(&mut self, key_id: u64, key: &Key) {
        self.key_id.push(key_id);
        self.a.push(key.a.clone());
        self.b.push(key.b.clone());
    }
}

#[derive(Debug)]
struct Caches {
    key_id: u64,                // Next unused unique ID
    key_ids: HashMap<Key, u64>, // key to ID map
    key_cols: KeyCols,          // columns to be turned into a polars dataframe
}

impl Caches {
    // Return a unique ID for each different Key.
    fn key_id_for_key(&mut self, key: Key) -> u64 {
        *self.key_ids.entry(key).or_insert_with_key(|k| {
            // Not previously seen. Assign a new ID.
            let id = self.key_id;
            // Also insert it into key_cols to be written out.
            self.key_cols.append(id, k);
            self.key_id += 1;
            id
        })
    }
}

// Global singleton with all state that we're accumulating.
static CACHE: OnceCell<Mutex<Caches>> = OnceCell::new();

fn setup_cache() {
    CACHE
        .set({
            let cache = Caches {
                key_id: 0,
                key_ids: HashMap::new(),
                key_cols: KeyCols::new(),
            };
            Mutex::new(cache)
        })
        .expect("Unable to initialize cache");
}

fn main() {
    setup_cache();
    let mut cache = CACHE.get().unwrap().lock().unwrap();

    let k1 = Key {
        a: "Hello".into(),
        b: "World".into(),
    };
    let id1 = cache.key_id_for_key(k1);

    let k2 = Key {
        a: "Foo".into(),
        b: "Bar".into(),
    };
    let id2 = cache.key_id_for_key(k2);

    dbg!(&cache, id1, id2);
}