CSV and multi-threading the inserts


#1

Based on my other post about Rust-based ETL activities (CSV import to Postgres), I am now wondering if I can spawn a few threads and insert the data simultaneously. I don’t know how at this point. Here is my idea in pseudocode. I would appreciate some feedback; there may be data structures that do this already.

I have read about using r2d2 as a connection pooler so the connection can be shared across threads.

1. Create a Vector of "Record".
2. Load all the data into that Vector. This is the tough part: the file is 700MB.
3. Create x slices of that Vector and send each slice to a different thread.
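The slicing-and-spawning part of the steps above can be sketched with scoped threads from the standard library, which let each thread borrow its slice directly. The `Record` fields and `insert_batch` body here are placeholders (the real work would be the Postgres inserts, e.g. via a connection checked out of an r2d2 pool):

```rust
use std::thread;

// Hypothetical record type standing in for a parsed CSV row.
#[derive(Debug, Clone)]
struct Record {
    id: u32,
    name: String,
}

// Placeholder for the per-thread work: in the real program this would
// run the INSERTs; here it just reports how many rows it handled.
fn insert_batch(slice: &[Record]) -> usize {
    slice.len()
}

// Split the vector into roughly equal slices and hand each slice to a
// scoped thread; the scope guarantees all threads finish before we return.
fn process_in_parallel(records: &[Record], num_threads: usize) -> usize {
    // Ceiling division so no rows are left over.
    let chunk_size = (records.len() + num_threads - 1) / num_threads;
    thread::scope(|s| {
        let handles: Vec<_> = records
            .chunks(chunk_size)
            .map(|slice| s.spawn(move || insert_batch(slice)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let records: Vec<Record> = (0..10)
        .map(|i| Record { id: i, name: format!("row {i}") })
        .collect();
    let inserted = process_in_parallel(&records, 4);
    println!("inserted {inserted} rows");
}
```

So yes, the idea works mechanically; the caveats are that the whole file must fit in memory first, and each thread needs its own database connection.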

does that work?

Another option could be to pass the data from the CSV reader to ‘inserter’ threads. Since the CSV reader is faster than the inserts, this seems a reasonable option as well, and it wouldn’t require that the entire file end up in memory.
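That reader/inserter pipeline can be sketched with a standard-library channel. One reader thread feeds rows in, several inserter threads drain them; since std's `Receiver` can't be cloned, it is shared behind a `Mutex` here (a multi-consumer channel crate like crossbeam would avoid that). The rows are plain strings standing in for parsed CSV records, and the insert itself is simulated:

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Pipeline sketch: one producer (the CSV reader) and several consumers
// (the inserters). Returns the total number of rows the inserters saw.
fn run_pipeline(rows: Vec<String>, num_inserters: usize) -> usize {
    let (tx, rx) = mpsc::channel::<String>();
    // Share the single receiver across inserter threads.
    let rx = Arc::new(Mutex::new(rx));

    // Reader thread: sends every row, then drops the sender,
    // which closes the channel and lets the inserters exit.
    let reader = thread::spawn(move || {
        for row in rows {
            tx.send(row).unwrap();
        }
    });

    let inserters: Vec<_> = (0..num_inserters)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut count = 0;
                loop {
                    // Lock only long enough to take one row off the channel.
                    let msg = rx.lock().unwrap().recv();
                    match msg {
                        Ok(_row) => count += 1, // real code: INSERT the row
                        Err(_) => break,        // channel closed: reader is done
                    }
                }
                count
            })
        })
        .collect();

    reader.join().unwrap();
    inserters.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let rows: Vec<String> = (0..1000).map(|i| format!("row {i}")).collect();
    println!("processed {} rows", run_pipeline(rows, 4));
}
```

A nice property of this shape is backpressure-free streaming: the reader never holds more than the channel's queue in memory, so the 700MB file never needs to be fully loaded.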

thoughts?

thank you
dan


#2

Postgres has a bulk insertion call (COPY) that should be able to ingest 700MB in a few minutes (or seconds).

Please don’t create the table, create its indices, and then run inserts: every insertion then has to update the indices, which costs a lot of performance. Create the indices after the bulk load instead.
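For the COPY route, the main job on the Rust side is producing the tab-separated text that `COPY ... FROM STDIN` expects. The table and column names below are made up; with the `postgres` crate the resulting buffer would be written into the writer returned by something like `client.copy_in("COPY people (id, name) FROM STDIN")` (an assumption about how you'd wire it up, not part of the original post):

```rust
use std::io::Write;

// Serialize rows into Postgres COPY text format:
// tab-separated columns, one newline-terminated line per row.
// Real code must also escape tabs, newlines and backslashes in the
// string fields before writing them.
fn to_copy_text(rows: &[(u32, &str)]) -> Vec<u8> {
    let mut buf = Vec::new();
    for (id, name) in rows {
        writeln!(buf, "{id}\t{name}").unwrap();
    }
    buf
}

fn main() {
    let rows = [(1, "alice"), (2, "bob")];
    let payload = to_copy_text(&rows);
    // In the real program this payload goes to the COPY writer
    // instead of stdout.
    print!("{}", String::from_utf8(payload).unwrap());
}
```

Since COPY is a single bulk operation per connection, one connection doing COPY will usually outrun many threads doing row-by-row INSERTs.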