CSV import to postgres

I am trying to import a CSV file into Postgres. All my code is posted below. This should be simple, but I can't figure out the 'Rust' way to do it, and I get an error during the insert that I don't know how to troubleshoot.

The error first. The table has some int and varchar columns. In the CSV, whenever I see "NA" I turn it into "0" so it will convert correctly. From the error below I can't tell whether that is what is wrong.

dan@dan-VirtualBox:~/Downloads/csvreaders/rustcsvreader$ ./target/release/rustcsvreader  < ~/Downloads/2008.csv
thread 'main' panicked at 'Insert failed: Db(DbError { severity: "ERROR", parsed_severity: None, code: SyntaxError, message: "syntax error at end of input", detail: None, hint: None, position: Some(Normal(188)), where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: Some("scan.l"), line: Some(1074), routine: Some("scanner_yyerror") })', /checkout/src/libcore/result.rs:859

Also, is the way I am building the SQL parameters correct? Is there a better way?

dan@dan-VirtualBox:~/Downloads/csvreaders/rustcsvreader$ rustup show
Default host: x86_64-unknown-linux-gnu

stable-x86_64-unknown-linux-gnu (default)
rustc 1.18.0 (03fc9d622 2017-06-06)

Thank you.
Code starts here:

extern crate csv;
extern crate postgres;
use std::fs;
use std::string::String;
//use std::io;
//use std::fs::File;
//use std::path::Path;
use postgres::{Connection, TlsMode};
type OnTimeRecord = (
	String, String, String, String, String, String, String,
	String, String, String, String, String, String, String,
	String, String, String, String, String, String, String,
	String, String
);
fn main() {
	//let path = Path::new("/home/dan/Downloads/2008.csv");

	//http://stat-computing.org/dataexpo/2009/the-data.html
	let mut rowcnt = 0;
	let mut rdr = csv::Reader::from_reader(std::io::stdin());
	//let mut rdr = csv::Reader::from_path("/home/dan/Downloads/2008.csv").unwrap();
	let conn = Connection::connect("postgres://dan:dan@localhost/rowcounttest", TlsMode::None)
		.expect("Connection failed");
	let mut rec = csv::ByteRecord::new();
	//while rdr.read_byte_record(&mut rec).expect("all done") {
	for result in rdr.records() {
		let  rec = result.expect("no result");
		//println!("{:?}", rec);
		let mut fields : String = rec.iter().map(|f| if f == "NA" { "0" } else { f } ).collect();
		rowcnt += 1;
		conn.execute("INSERT INTO ontime_performance VALUES (
			$1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11,
			$12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22,
			$23, $24, $25, $26, $27, $28, $29
			", &[&fields]
				// &rec.get(0), &rec.get(1), &rec.get(2), &rec.get(3), &rec.get(4), &rec.get(5),
				// &rec.get(6), &rec.get(7), &rec.get(8), &rec.get(9), &rec.get(10), &rec.get(11),
				// &rec.get(12), &rec.get(13), &rec.get(14), &rec.get(15), &rec.get(16), &rec.get(17),
				// &rec.get(18), &rec.get(19), &rec.get(20), &rec.get(21), &rec.get(22), &rec.get(23),
				// &rec.get(24), &rec.get(25), &rec.get(26), &rec.get(27), &rec.get(28)
				// ]
			).expect("Insert failed");
	}
	
	println!("rowcount = {}", rowcnt);
}

stack backtrace:

   0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
             at /checkout/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at /checkout/src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at /checkout/src/libstd/sys_common/backtrace.rs:60
             at /checkout/src/libstd/panicking.rs:355
   3: std::panicking::default_hook
             at /checkout/src/libstd/panicking.rs:371
   4: std::panicking::rust_panic_with_hook
             at /checkout/src/libstd/panicking.rs:549
   5: std::panicking::begin_panic
             at /checkout/src/libstd/panicking.rs:511
   6: std::panicking::begin_panic_fmt
             at /checkout/src/libstd/panicking.rs:495
   7: rust_begin_unwind
             at /checkout/src/libstd/panicking.rs:471
   8: core::panicking::panic_fmt
             at /checkout/src/libcore/panicking.rs:69
   9: core::result::unwrap_failed
  10: rustcsvreader::main
  11: __rust_maybe_catch_panic
             at /checkout/src/libpanic_unwind/lib.rs:98
  12: std::rt::lang_start
             at /checkout/src/libstd/panicking.rs:433
             at /checkout/src/libstd/panic.rs:361
             at /checkout/src/libstd/rt.rs:57
  13: __libc_start_main
  14: _start

You are missing a closing ) at the end of the set of values.

best backtrace line ever!


Can you be more specific? Which line(s)? The manual list of values is commented out. The code compiles but bails on execution. I don't understand how a missing paren would change that.

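I think they were referring to the SQL string passed to conn.execute. Looks like the ( after the VALUES keyword isn't closed, so the code compiles fine but Postgres rejects the statement at run time.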
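
One way to avoid hand-typing (and mis-closing) the placeholder list is to generate it. This is only a minimal sketch: insert_sql is a hypothetical helper, not part of the postgres crate, and it assumes the table name is a trusted identifier, since it is interpolated as-is.

```rust
/// Builds "INSERT INTO <table> VALUES ($1, $2, ..., $n)" for `n` columns.
/// `table` is interpolated verbatim, so it must be a trusted identifier.
/// (Hypothetical helper for illustration only.)
fn insert_sql(table: &str, n: usize) -> String {
    // "$1", "$2", ... "$n"
    let placeholders: Vec<String> = (1..=n).map(|i| format!("${}", i)).collect();
    // The parentheses are emitted by the format string, so they always balance.
    format!("INSERT INTO {} VALUES ({})", table, placeholders.join(", "))
}
```

Calling insert_sql("ontime_performance", 29) would yield the statement used in this thread, with the closing paren in place.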

Yep, that is one problem. Fixed, thank you. Now I don't know what data type to give this variable:

let mut fields = rec.iter().map(|f| if f == "NA" { "0" } else { f } ).collect();

I tried String, but that is wrong: it produces a single element, so the number of parameters doesn't match the size of the 'array'. The compiler also won't infer the data type on its own.

dan@dan-VirtualBox:~/Downloads/csvreaders/rustcsvreader$ cargo build --release
   Compiling rustcsvreader v0.1.0 (file:///home/dan/Downloads/csvreaders/rustcsvreader)
error[E0277]: the trait bound `postgres::types::ToSql: std::marker::Sized` is not satisfied
  --> src/main.rs:28:7
   |
28 | 		let fields = rec.iter().map(|f| if f == "NA" { "0" } else { f } ).collect();
   | 		    ^^^^^^ the trait `std::marker::Sized` is not implemented for `postgres::types::ToSql`
   |
   = note: `postgres::types::ToSql` does not have a constant size known at compile-time
   = note: all local variables must have a statically known size

Using &[&str] seemed right based on the postgres docs. Truthfully, I don't know what I am doing. (This is my second day using Rust.)

   Compiling rustcsvreader v0.1.0 (file:///home/dan/Downloads/csvreaders/rustcsvreader)
error[E0277]: the trait bound `&[&str]: std::iter::FromIterator<&str>` is not satisfied
  --> src/main.rs:28:78
   |
28 | 		let fields: &[&str] = rec.iter().map(|f| if f == "NA" { "0" } else { f } ).collect();
   | 		                                                                           ^^^^^^^ the trait `std::iter::FromIterator<&str>` is not implemented for `&[&str]`
   |
   = note: a collection of type `&[&str]` cannot be built from an iterator over elements of type `&str`

error: aborting due to previous error

error: Could not compile `rustcsvreader`.
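
Both errors come from collect() needing a concrete, owned collection to build: a borrowed slice such as &[&str] cannot be constructed from an iterator, but a Vec<&str> can, and it can then be borrowed as a slice. A minimal sketch, assuming a plain slice of &str stands in for the CSV record (replace_na is a hypothetical name):

```rust
/// Replaces "NA" with "0" and collects the fields into an owned Vec.
/// `record` stands in for a CSV record; any iterator over &str behaves the same.
/// (Hypothetical helper for illustration only.)
fn replace_na<'a>(record: &[&'a str]) -> Vec<&'a str> {
    record
        .iter()
        .map(|&f| if f == "NA" { "0" } else { f })
        .collect() // the return type tells collect() to build a Vec<&str>
}
```

A slice is then available as &fields[..]. Note that this only fixes the collect() step: the INT columns still need integer values rather than strings, which is what the working version later in the thread ends up doing.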

I got a working version of this code. It is ugly and probably not very fast. I would enjoy comments that make this more idiomatic.

extern crate csv;
extern crate postgres;
use std::fs;
use std::string::String;
//use std::io;
//use std::fs::File;
//use std::path::Path;
use postgres::{Connection, TlsMode};
type OnTimeRecord = (
	String, String, String, String, String, String, String,
	String, String, String, String, String, String, String,
	String, String, String, String, String, String, String,
	String, String
);
fn main() {
	//let path = Path::new("/home/dan/Downloads/2008.csv");

	//http://stat-computing.org/dataexpo/2009/the-data.html
	let mut rowcnt = 0;
	let mut rdr = csv::Reader::from_reader(std::io::stdin());
	//let mut rdr = csv::Reader::from_path("/home/dan/Downloads/2008.csv").unwrap();
	let conn = Connection::connect("postgres://dan:dan@localhost/rowcounttest", TlsMode::None)
		.expect("Connection failed");
	let mut rec = csv::ByteRecord::new();
	//while rdr.read_byte_record(&mut rec).expect("all done") {
	for result in rdr.records() {
		let rec = result.expect("no result");
		//let fields: [String; 29] = rec.iter().map(|f| if f == "NA" { "0" } else { f } ).collect();
		//println!("{:?}", fields);
		let f0:i32 = if rec.get(0).unwrap() == "NA" { 0 } else { rec.get(0).unwrap().parse().unwrap() };
		let f1:i32 = if rec.get(1).unwrap() == "NA" { 0 } else { rec.get(1).unwrap().parse().unwrap() };
		let f2:i32 = if rec.get(2).unwrap() == "NA" { 0 } else { rec.get(2).unwrap().parse().unwrap() };
		let f3:i32 = if rec.get(3).unwrap() == "NA" { 0 } else { rec.get(3).unwrap().parse().unwrap() };
		let f4:i32 = if rec.get(4).unwrap() == "NA" { 0 } else { rec.get(4).unwrap().parse().unwrap() };
		let f5:i32 = if rec.get(5).unwrap() == "NA" { 0 } else { rec.get(5).unwrap().parse().unwrap() };
		let f6:i32 = if rec.get(6).unwrap() == "NA" { 0 } else { rec.get(6).unwrap().parse().unwrap() };
		let f7:i32 = if rec.get(7).unwrap() == "NA" { 0 } else { rec.get(7).unwrap().parse().unwrap() };
		let f8:&str = if rec.get(8).unwrap() == "NA" { "0" } else { rec.get(8).unwrap() };
		let f9:i32 = if rec.get(9).unwrap() == "NA" { 0 } else { rec.get(9).unwrap().parse().unwrap() };
		let f10:&str = if rec.get(10).unwrap() == "NA" { "0" } else { rec.get(10).unwrap() };
		let f11:i32 = if rec.get(11).unwrap() == "NA" { 0 } else { rec.get(11).unwrap().parse().unwrap() };
		let f12:i32 = if rec.get(12).unwrap() == "NA" { 0 } else { rec.get(12).unwrap().parse().unwrap() };
		let f13:i32 = if rec.get(13).unwrap() == "NA" { 0 } else { rec.get(13).unwrap().parse().unwrap() };
		let f14:i32 = if rec.get(14).unwrap() == "NA" { 0 } else { rec.get(14).unwrap().parse().unwrap() };
		let f15:i32 = if rec.get(15).unwrap() == "NA" { 0 } else { rec.get(15).unwrap().parse().unwrap() };
		let f16:&str = if rec.get(16).unwrap() == "NA" { "0" } else { rec.get(16).unwrap() };
		let f17:&str = if rec.get(17).unwrap() == "NA" { "0" } else { rec.get(17).unwrap() };
		let f18:i32 = if rec.get(18).unwrap() == "NA" { 0 } else { rec.get(18).unwrap().parse().unwrap() };
		let f19:i32 = if rec.get(19).unwrap() == "NA" { 0 } else { rec.get(19).unwrap().parse().unwrap() };
		let f20:i32 = if rec.get(20).unwrap() == "NA" { 0 } else { rec.get(20).unwrap().parse().unwrap() };
		let f21:i32 = if rec.get(21).unwrap() == "NA" { 0 } else { rec.get(21).unwrap().parse().unwrap() };
		let f22:&str = if rec.get(22).unwrap() == "NA" { "0" } else { rec.get(22).unwrap() };
		let f23:i32 = if rec.get(23).unwrap() == "NA" { 0 } else { rec.get(23).unwrap().parse().unwrap() };
		let f24:i32 = if rec.get(24).unwrap() == "NA" { 0 } else { rec.get(24).unwrap().parse().unwrap() };
		let f25:i32 = if rec.get(25).unwrap() == "NA" { 0 } else { rec.get(25).unwrap().parse().unwrap() };
		let f26:i32 = if rec.get(26).unwrap() == "NA" { 0 } else { rec.get(26).unwrap().parse().unwrap() };
		let f27:i32 = if rec.get(27).unwrap() == "NA" { 0 } else { rec.get(27).unwrap().parse().unwrap() };
		let f28:i32 = if rec.get(28).unwrap() == "NA" { 0 } else { rec.get(28).unwrap().parse().unwrap() };
		rowcnt += 1;
		conn.execute("INSERT INTO ontime_performance VALUES (
			$1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11,
			$12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22,
			$23, $24, $25, $26, $27, $28, $29
			);", //&[&fields]
				&[
				&f0, &f1, &f2, &f3, &f4, &f5,
				&f6, &f7, &f8, &f9, &f10, &f11,
				&f12, &f13, &f14, &f15, &f16, &f17,
				&f18, &f19, &f20, &f21, &f22, &f23,
				&f24, &f25, &f26, &f27, &f28
				]
			).expect("Insert failed");
		if rowcnt % 100000 == 0 {
			println!("rowcount = {}", rowcnt);
		} 
	}
	
	println!("rowcount = {}", rowcnt);
}

@dan You probably want to use serde to deserialize your records. Since you have some custom logic like NA in fields that are otherwise floats, you'll probably want to define your own type that does deserialization appropriately. The key thing I'd try to fix first is the repetitive code. You could, at the very least, define a helper function, and you probably want to do error handling.

If you send me a sample of your CSV data (masked if necessary), then I'd be happy to try to come up with something for you.
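
The helper-function and error-handling suggestions could be combined roughly as follows. This is a sketch using only the standard library; parse_int_field is a hypothetical name, and mapping "NA" to 0 simply mirrors the code above.

```rust
use std::num::ParseIntError;

/// Parses one integer column, mapping the dataset's "NA" marker to 0.
/// Any other non-numeric text is returned as an error instead of panicking.
/// (Hypothetical helper for illustration only.)
fn parse_int_field(field: &str) -> Result<i32, ParseIntError> {
    if field == "NA" {
        Ok(0)
    } else {
        field.parse()
    }
}
```

Each of the repeated lines would then shrink to something like parse_int_field(rec.get(0).unwrap()), with the caller deciding whether to unwrap or propagate the error.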


The data comes from here: http://stat-computing.org/dataexpo/2009/the-data.html (2009, Joint Statistical Computing and Statistical Graphics Section). I used the 2008 link.

The goal of this was to get a feel for the performance of Rust in an ETL setting. I am comparing it to Python. I was able to put this data in a Postgres table in 15 min in Python. My Rust implementation takes 55 min. Simply iterating over the file (7 million records) took 1.5 seconds in Rust (11 in Python), so I was very hopeful that Rust would significantly outperform Python (Go took 50 seconds to iterate over those same records).

Thank you for any optimization and code improvement.

--https://www.transtats.bts.gov/Fields.asp?Table_ID=236
CREATE TABLE ontime_performance (
 	Year 	INT,
 	Month   INT, --1-12
 	DayofMonth 	INT, --1-31
 	DayOfWeek 	INT, --1 (Monday) - 7 (Sunday)
 	DepTime 	INT, --actual departure time (local, hhmm)
 	CRSDepTime 	INT, --scheduled departure time (local, hhmm)
 	ArrTime 	INT, --actual arrival time (local, hhmm)
 	CRSArrTime 	INT, --scheduled arrival time (local, hhmm)
 	UniqueCarrier 	VARCHAR(4), --unique carrier code
 	FlightNum 	INT, --flight number
 	TailNum 	VARCHAR(10), --plane tail number
 	ActualElapsedTime INT, --in minutes
 	CRSElapsedTime 	INT, --in minutes
 	AirTime 	INT, --in minutes
 	ArrDelay 	INT, --arrival delay, in minutes
 	DepDelay 	INT, --departure delay, in minutes
 	Origin 		CHAR(3), --origin IATA airport code
 	Dest 		CHAR(3), --destination IATA airport code
 	Distance 	INT, --in miles
 	TaxiIn 		INT, --taxi in time, in minutes
 	TaxiOut 	INT, --taxi out time in minutes
 	Cancelled 	INT, --was the flight cancelled?
 	CancellationCode VARCHAR(2), --	reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
 	Diverted 	INT, --1 = yes, 0 = no
 	CarrierDelay 	INT, --in minutes
 	WeatherDelay 	INT, --in minutes
 	NASDelay 	INT, --in minutes
 	SecurityDelay 	INT, --in minutes
 	LateAircraftDelay INT --in minutes
);

I would recommend performing the writes in a transaction and preparing the statement once. If you really want to go fast, you can try the postgres_binary_copy crate.


Actually, since you already have a CSV, you can just dump it straight into Postgres with COPY, using postgres::stmt::Statement:

COPY ontime_performance FROM STDIN WITH (FORMAT csv);

Thank you for the options. I am considering bulk copy activities separately. In this case I need to 'transform' the data before I import it. I am after a tech stack that does this the fastest. The postgres binary copy crate is a good option.
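
Transforming and COPY are not mutually exclusive: the rows can be rewritten on the fly and the result streamed into COPY ... FROM STDIN. Below is a minimal sketch of the transform step only, under stated assumptions: na_to_empty and TEXT_COLUMNS are hypothetical names, the text-column positions are taken from the code in this thread, and the naive split on ',' assumes no field contains a quoted comma. It relies on COPY's csv format reading an unquoted empty field as NULL by default, so "NA" ends up as NULL rather than 0.

```rust
/// Zero-based positions of the text columns in the 2008.csv layout
/// (UniqueCarrier, TailNum, Origin, Dest, CancellationCode).
const TEXT_COLUMNS: [usize; 5] = [8, 10, 16, 17, 22];

/// Rewrites one CSV line so that "NA" in a numeric column becomes an empty
/// field. Assumes no field contains a quoted comma (naive split on ',').
/// (Hypothetical helper for illustration only.)
fn na_to_empty(line: &str) -> String {
    let fields: Vec<&str> = line
        .split(',')
        .enumerate()
        .map(|(idx, f)| {
            if f == "NA" && !TEXT_COLUMNS.contains(&idx) {
                "" // numeric "NA": leave the field empty
            } else {
                f // text columns and real values pass through unchanged
            }
        })
        .collect();
    fields.join(",")
}
```

Each rewritten line would then be written to whatever stream feeds COPY ontime_performance FROM STDIN WITH (FORMAT csv). If zeros are wanted instead of NULLs, the empty string can be swapped for "0".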


Putting everything in one transaction reduced the time to 6 min. That is way better and more in line with what I expected. I would really like to see this written better, as an example of how to write Rust correctly.

Thanks to everyone who responded.


You should really try the copy-in (COPY ... FROM STDIN); in my experience it is always MUCH faster.


This is the current version; it is much slimmer but likely not the best way to write it. For the curious:

extern crate csv;
extern crate postgres;
extern crate chrono;

use chrono::prelude::*;
use std::str;
use postgres::{Connection, TlsMode};

fn main() {
	let insert_sql = "INSERT INTO ontime_performance VALUES (
			$1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11,
			$12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22,
			$23, $24, $25, $26, $27, $28, $29
			);";
	//let path = Path::new("/home/dan/Downloads/2008.csv");
	//http://stat-computing.org/dataexpo/2009/the-data.html
	//unzipped ~700MB and 7million records.  This is 5.5 min on my VM (15 in python)
	let mut rowcnt = 0;
	let mut rdr = csv::Reader::from_reader(std::io::stdin());
	//let mut rdr = csv::Reader::from_path("/home/dan/Downloads/2008.csv").unwrap();
	let mut rec = csv::ByteRecord::new();
	let mut ints : Vec<i32> = vec![0; rdr.byte_headers().unwrap().len()];
	let conn = Connection::connect("postgres://dan:dan@localhost/rowcounttest", TlsMode::None)
		.expect("Connection failed");
	let tran = conn.transaction().unwrap();
	let stmt =  tran.prepare(insert_sql).expect("Prepare failed");
	let mut utc: DateTime<Utc> = Utc::now();
	while rdr.read_byte_record(&mut rec).expect("all done") {
		for idx in 0..29 {
			if idx != 8 && idx != 10 && idx != 16  && idx != 17 && idx != 22 {
				let temp = str::from_utf8(&rec[idx]).unwrap();
				if temp == "NA" { 
					ints[idx] = 0;
				} else { 
					ints[idx] = temp.parse().unwrap();
				}
			} 
		}
		let f8:&str = str::from_utf8(&rec[8]).unwrap();
		let f10:&str = str::from_utf8(&rec[10]).unwrap();
		let f16:&str = str::from_utf8(&rec[16]).unwrap();
		let f17:&str = str::from_utf8(&rec[17]).unwrap();
		let f22:&str = str::from_utf8(&rec[22]).unwrap();
		rowcnt += 1;
		let rn = stmt.execute(
				&[
				&ints[0], &ints[1], &ints[2], &ints[3], &ints[4], &ints[5], &ints[6], &ints[7], 
				&f8, &ints[9], &f10, &ints[11], &ints[12], &ints[13], &ints[14], &ints[15],
				&f16, &f17, &ints[18], &ints[19], &ints[20], &ints[21], &f22, &ints[23],  &ints[24], 
				&ints[25], &ints[26], &ints[27], &ints[28]
				]
			).expect("Insert failed");
		if rowcnt % 100000 == 0 {
			println!("rowcount = {} in {}", rowcnt,  Utc::now().signed_duration_since( utc) );
			utc = Utc::now();
		} 
	}
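	// Check the results so a failed commit does not go unnoticed.
	tran.commit().expect("Commit failed");
	stmt.finish().expect("Finish failed");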
	println!("rowcount = {}", rowcnt);
}

Thanks for sharing! It's always nice to see how other people solve problems!

Quick remarks

  • Since you mentioned wanting this to be an example for others, I'd suggest paying a bit more attention to layout and comments. A few empty lines here and there to separate the semantic blocks ("prepare and compile statement", "connect to db", "process a single row", etc.) would greatly improve readability for other newcomers to Rust (and for yourself, if you're re-using the code five months from now :wink: ).
  • Why are you using a csv::ByteRecord if you call str::from_utf8() on every single field directly afterwards anyway? In that case you might as well take the csv crate's normal record and let the library do the UTF-8 conversion, making your own code smaller and less cluttered.
  • Readability again: probably extract the temp/NA conversion into a small function.
  • As BurntSushi mentioned, a LOT of readability can be gained by using serde and a row struct; see the example in the csv docs. BurntSushi is sort of an authority on this topic: he literally does exactly this kind of data wrangling for a living, which is the reason Rust has nice things like ripgrep, xsv and the csv crate :smile:

(I'm posting from mobile on my commute, otherwise I'd have written some code examples of what I mean.)

Disclaimer: your code is perfectly fine for "quick one-off" levels of programming :smile: . My remarks are only intended for taking this forward to a "nice, reusable example".


@juleskers @dan Sorry I haven't gotten around to looking at this more closely, but the csv::invalid_option function might help. In particular, I'm envisioning you might use it like this:

extern crate csv;
extern crate serde;
#[macro_use]
extern crate serde_derive;

use std::error::Error;
use std::io;
use std::process;

#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct Row {
    #[serde(deserialize_with = "csv::invalid_option")]
    year: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    month: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    #[serde(rename = "DayofMonth")]
    day_of_month: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    day_of_week: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    dep_time: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    #[serde(rename = "CRSDepTime")]
    crs_dep_time: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    arr_time: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    #[serde(rename = "CRSArrTime")]
    crs_arr_time: Option<i32>,
    unique_carrier: String,
    #[serde(deserialize_with = "csv::invalid_option")]
    flight_num: Option<i32>,
    tail_num: String,
    #[serde(deserialize_with = "csv::invalid_option")]
    actual_elapsed_time: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    #[serde(rename = "CRSElapsedTime")]
    crs_elapsed_time: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    air_time: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    arr_delay: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    dep_delay: Option<i32>,
    origin: String,
    dest: String,
    #[serde(deserialize_with = "csv::invalid_option")]
    distance: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    taxi_in: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    taxi_out: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    cancelled: Option<i32>,
    cancellation_code: String,
    #[serde(deserialize_with = "csv::invalid_option")]
    diverted: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    carrier_delay: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    weather_delay: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    #[serde(rename = "NASDelay")]
    nas_delay: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    security_delay: Option<i32>,
    #[serde(deserialize_with = "csv::invalid_option")]
    late_aircraft_delay: Option<i32>,
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

fn run() -> Result<(), Box<Error>> {
    let mut rdr = csv::Reader::from_reader(io::stdin());
    for result in rdr.deserialize() {
        let row: Row = result?;
        println!("{:?}", row);
    }
    Ok(())
}

Note that this doesn't have strictly the same semantics as your code. Namely, a numeric field will get None if it fails to parse even when the value isn't NA, whereas your code will trip an error in that case. (If you want that behavior, then you'll need to do a bit more work.) And if you want to treat an NA as a 0 when inserting it into a table, then you'd just do it on access. For example, row.dep_delay.unwrap_or(0).
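
The difference between the two behaviors can be sketched with the standard library alone. The functions below are hypothetical stand-ins written for illustration, not the csv crate's implementation: lenient mimics the "anything unparsable becomes None" behavior described above, while strict only accepts "NA" as missing and reports other bad input as an error.

```rust
use std::num::ParseIntError;

/// Lenient: anything that is not a valid integer (including "NA") is None.
/// (Hypothetical stand-in for illustration only.)
fn lenient(field: &str) -> Option<i32> {
    field.parse::<i32>().ok()
}

/// Strict: only "NA" becomes None; other unparsable text is an error.
/// (Hypothetical stand-in for illustration only.)
fn strict(field: &str) -> Result<Option<i32>, ParseIntError> {
    if field == "NA" {
        Ok(None)
    } else {
        field.parse::<i32>().map(Some)
    }
}
```

With either variant, an Option<i32> can be turned back into the original "NA means 0" convention at the point of use with unwrap_or(0).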


Ha, of course you have an option for that :slight_smile:
I vaguely recalled that there was some way to override CSV/SerDe's parse handling, but didn't know enough to find this on a tiny touchscreen device.
Thanks for sharing!


Thank you. At first I didn't know how to turn NA into NULL. I used 0 because that was all I knew how to do, but I really wanted NULL. Option<> was the piece I was missing.

I am really after performance and, based on reading a lot about CSV parsing in Rust, the ByteRecord is apparently the fastest. I used a StringRecord first, and switching to the ByteRecord reduced execution time by a minute (6 -> 5). The StringRecord is definitely easier to read. I will have to make a trade-off at some point. I will (re)implement this code with the struct you have supplied and time it, so I have timings for that implementation too.


The ByteRecord is the fastest, yes, because it skips the bytes-to-valid-UTF-8 conversion that the normal record does.
But since you immediately do that conversion yourself, there shouldn't be any (significant) difference. (Or does the normal record do even more magic than I'm aware of?)

Are you sure it wasn't due to the order you executed the tests? (file-system caching of the input files can also really make a difference, for example)

In any case, 5 minutes or 6 minutes are both instances of "go get coffee, return later"-quick, so the next question is, how much time should you spend saving that minute?
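
If the per-field UTF-8 conversion is the concern, one option is to parse the integer columns straight from the record's bytes, so that str::from_utf8 is only needed for the text columns. This is a minimal sketch under stated assumptions: parse_i32_bytes is a hypothetical helper, it accepts only an optional leading '-' followed by ASCII digits, and whether it is measurably faster on this data has not been tested here.

```rust
/// Parses an optionally negative decimal integer straight from bytes,
/// without a UTF-8 validation pass. Returns None for empty input,
/// non-digit bytes (such as "NA"), or overflow.
/// (Hypothetical helper for illustration only.)
fn parse_i32_bytes(bytes: &[u8]) -> Option<i32> {
    // Peel off an optional leading minus sign.
    let (negative, digits) = match bytes.split_first() {
        Some((&b'-', rest)) => (true, rest),
        _ => (false, bytes),
    };
    if digits.is_empty() {
        return None;
    }
    let mut value: i32 = 0;
    for &b in digits {
        if !b.is_ascii_digit() {
            return None;
        }
        let d = (b - b'0') as i32;
        // Accumulate as a negative number so that i32::MIN is representable.
        value = value.checked_mul(10)?.checked_sub(d)?;
    }
    if negative { Some(value) } else { value.checked_neg() }
}
```

Since "NA" yields None, the result can be passed on as an Option<i32> or collapsed with unwrap_or(0), depending on whether NULL or 0 is wanted in the table.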