I have a rust program that reads a CSV file (approx 3000 rows with 5 pieces of data / row) and inserts it into an SQLite database. Currently I have one function that reads the CSV into a vector of structures and a second function that takes the vector and writes it to an SQLite database. It works. Does it make more sense to write the data to the SQL database as it is being read from the CSV file i.e. record by record, eliminating the need to create a large vector of structures?
It depends. It depends on your requirements. Only the requirements.
So ask yourself a few questions:
- How large will the typical CSV file be?
- Under realistic assumptions, how large could the largest CSV file be?
- Will you always process files, or do you also need to read from a stream?
- What are the memory constraints of the machine your tool runs on?
- What are the processing constraints of the machine your tool runs on?
- And so on.
Most importantly: what do you need now?
In this context, “now” could mean this week, this month, or this year. It depends on the environment your program runs in.
There are various factors which can affect whether one or the other is more efficient, and it's difficult to say without taking measurements.
Ways writing all the records at once can be more efficient:
- The CPU does not have to switch between executing code to read CSVs and executing code to write SQLite, potentially making better use of both instruction cache and data cache.
- SQLite may be more efficient at writing multiple records in a single transaction. (I don't know if this is true).
Ways writing the records one at a time can be more efficient:
- Whenever an item is added to a vector exceeding its capacity, the vector has to grow — copying the data into a new memory allocation. If you write the records one at a time, you don't need to do this copying.
- The process takes less memory overall, so the operating system does not have to displace other potential uses of the memory (e.g. disk cache or other processes’ memory).
A frequently useful compromise between these modes of operation is to pick a number of records to be your buffer size, and read only up to that many records before switching to writing. However, before considering complicating things this way, you should measure the performance of both of the basic strategies, on large files and small files.
It is true. Depending on whether Write-Ahead Logging is enabled, the difference can be enormous.
SQLite can read and import CSV. There is no need for any manual copying.
I think for my case reading/writing one record at a time is the way to go. I wrote a function (below) that utilizes the CSV crate to read a CSV file line by line and inturn insert each line into an SQLite database. I use a transaction, which appears to be more efficient. I read that there is a crate called serde that will deserialize and allow me to easily convert the input to rust types like f64. Will look at that next. Thank you for the assistance.
pub fn csv_sql() {
let db_file_path: &str = "stock_data.db";
let mut conn = Connection::open(db_file_path).unwrap();
match conn.execute(
"CREATE TABLE IF NOT EXISTS data (
date TEXT NOT NULL PRIMARY KEY,
open REAL NOT NULL,
high REAL NOT NULL,
low REAL NOT NULL,
close REAL NOT NULL,
volume INTEGER NOT NULL
)",
[],
) {
Ok(_) => {
println!("table created successfully")
}
Err(_) => {
println!("unable to create table");
return
}
};
let trx = match conn.transaction(){
Ok(transaction) => {
transaction
},
Err(e) => {
println!("Error starting transaction: {}", e);
return;
}
};
let csv_file_path: &str = "/Users/Chris/Developer-Rust/TestData.csv";
let mut csv_reader = Reader::from_path(csv_file_path).unwrap();
let mut record = StringRecord::new();
loop {
match csv_reader.read_record(&mut record) {
Ok(success) => {
if success {
match trx.execute(
"Insert or Ignore INTO data (date, open, high, low, close, volume)
VALUES (?1, ?2, ?3, ?4, ?5, ?6)",
(
record.get(0).unwrap().to_string(),
record.get(1).unwrap().to_string().parse::<f64>().unwrap_or(0.0),
record.get(2).unwrap().to_string().parse::<f64>().unwrap_or(0.0),
record.get(3).unwrap().to_string().parse::<f64>().unwrap_or(0.0),
record.get(4).unwrap().to_string().parse::<f64>().unwrap_or(0.0),
record.get(5).unwrap().to_string().parse::<i64>().unwrap_or(0),
),
) // end of match conn.execute
{
Ok(_) => {
},
Err(e) => {
println!("error = {}", e)
},
}
} else {
println!("end of csv file reached");
break
}
},
Err(_e) => {
println!("error reading csv file");
return
}
}
}; // end of loop
match trx.commit(){
Ok(_) => {
println!("transaction committed successfully")
},
Err(e) => {
println!("Error commmitting the transaction: {}", e);
return;
}
};
}