Language/Setup for ETL (Extract, Transform, Load)

  1. I'm moving to a style of programming where instead of using serde_json, I use rusqlite/diesel, and store all state in SQLite3.

  2. I now have a number of tasks of the form "take this file, which has a bunch of data in an ad-hoc, ill-defined, sorta-specified format ... and convert it into a SQLite3 table."

  3. Because there isn't a context-free grammar defined upfront, the work involves a lot of: read some data, print it out, try to parse it, get some error, etc. ... (i.e. 99% of the battle is defining the types that one would derive JSON encode/decode on)

-- because of the above, a lot of the work is REPL-oriented: exploring the data to see how it's structured (a rough sketch of what each task ends up looking like is below).

  1. If you have run into problems similar to the above, what language / setup do you use for converting "structured but no formal grammar" into "nice SQL tables"?
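Concretely, each task ends up being some variation of this rusqlite sketch (the tab-separated "name<TAB>count" format and the table layout are made up purely for illustration):

```rust
// One "ETL task", roughly: ad-hoc parse -> typed struct -> SQLite insert.
use rusqlite::{params, Connection, Result};
use std::fs;

#[derive(Debug)]
struct Record {
    name: String,
    count: i64,
}

fn parse_line(line: &str) -> Option<Record> {
    // The "parsing" is usually this kind of ad-hoc splitting, adjusted
    // by trial and error while eyeballing the data.
    let mut parts = line.splitn(2, '\t');
    let name = parts.next()?.trim().to_string();
    let count = parts.next()?.trim().parse().ok()?;
    Some(Record { name, count })
}

fn main() -> Result<()> {
    let conn = Connection::open("out.db")?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (name TEXT, count INTEGER)",
        [],
    )?;
    let text = fs::read_to_string("input.txt").expect("read input");
    for line in text.lines() {
        match parse_line(line) {
            Some(r) => {
                conn.execute(
                    "INSERT INTO records (name, count) VALUES (?1, ?2)",
                    params![r.name, r.count],
                )?;
            }
            // Printing what doesn't parse is most of the "REPL" work.
            None => println!("could not parse: {line}"),
        }
    }
    Ok(())
}
```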

I ran into a bottleneck in my program with SQLite, which is basically single-threaded. I was able to easily parallelize most of my code to get an 8x speedup, while all the perf tricks for SQLite maxed out at around 1.5x-2x. So beware. Use Postgres or some other storage.
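The usual single-connection tricks are things like WAL mode and batching inserts into one transaction; roughly, with rusqlite (a sketch only, the table name is illustrative and this is not necessarily exactly what I ran):

```rust
use rusqlite::{params, Connection, Result};

// Standard pragmas for faster single-writer workloads.
fn tune(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "PRAGMA journal_mode = WAL;
         PRAGMA synchronous = NORMAL;",
    )
}

// Wrap all inserts in one transaction instead of committing per row.
fn bulk_insert(conn: &mut Connection, rows: &[(String, i64)]) -> Result<()> {
    let tx = conn.transaction()?;
    {
        let mut stmt =
            tx.prepare("INSERT INTO records (name, count) VALUES (?1, ?2)")?;
        for (name, count) in rows {
            stmt.execute(params![name, count])?;
        }
    }
    tx.commit()
}
```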

I failed to specify:

  1. the 'data' is < 100MB; the issue here is the mess of ETL, not 'big data'

  2. at this point, I'm not sure Rust is a good fit for ETL, compared to languages like Python or Clojure

Ah, OK, 100MB is easy for Rust (and even sqlite). I thought your case would be more similar to this scenario.

I'm not sure how bad your grammar is. In the worst case, the Rust regex crate is quite good and fast (but PCRE, available in other languages, is fine too). If there are structured bits in the data, then nom is quite nice for making parsers.
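For example, a minimal regex-crate sketch for pulling fields out of semi-structured lines (the pattern and field names are made up):

```rust
use regex::Regex;

fn main() {
    // Hypothetical "key = number" lines; adjust the pattern to taste.
    let re = Regex::new(r"(?P<name>\w+)\s*=\s*(?P<value>\d+)").unwrap();
    let input = "foo = 12\nbar= 7\njunk line\n";
    for line in input.lines() {
        match re.captures(line) {
            Some(caps) => println!("name={} value={}", &caps["name"], &caps["value"]),
            None => println!("no match: {line}"),
        }
    }
}
```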

I botched this question. The issue here is not how long it takes to run the ETL process, but the developer time it takes to write the ETL code.

So I have this situation where I have lots of random input data formats, and I want an ETL process that, for each case, stuffs the data into a nicely formatted SQLite3 table (which I can then use Diesel with).

The issue is: what is a good workflow for writing lots of one-off ETL processes? :slight_smile:

So:

  1. I have lots of "ETL tasks"
  2. Each "ETL task" has a different input format.
  3. Each "ETL task" is write once, run one, and no need to run again.
  4. However, each "ETL task" has a different format, so I need a way to rapidly write one-off ETL tasks.

Oh I see. Rust probably isn't the best choice for write-once-run-once code. You don't need speed, you don't need security, you don't need maintainability.

Unless, you'd like to write some kind of framework for processing, essentially reinventing a Perl interpreter in Rust :slight_smile:

Thanks for the "debugging" help. Now that I've rephrased the specs, it's pretty clear that this isn't a Rust question (and Rust isn't the right answer).

For this sort of once-off thing I'd normally either hack together a quick bash script or a sed/awk/grep pipeline (for simple transformations), or write it in Python using some sort of interactive interface like a Jupyter notebook.

Normally it's more effort than it's worth to write a Rust program for this. You'd spend more development time getting the types in order, properly handling errors, and getting everything to compile than actually getting anything done. Unfortunately, that's one of the disadvantages of using a statically typed language.


Yeah. A script might be better, but it depends on whether you want to do it just once or you'll be repeating the same process in the future.

If you want more on ETL with Rust, you can look at this Reddit post.

Amazon Mechanical Turk or some freelancer. If there is enough data, you can combine this with machine learning later.
If I really had to do it myself, I'd go with some interactive process: run the parser over everything; for whatever doesn't parse, open a text editor and fix either the data or the parser; then apply the fixed code to the remaining data.
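Roughly, that loop could look like this (`parse_line`, the comma-separated format, and the file names are all placeholders):

```rust
// Keep whatever parses, dump the rest to a file to fix by hand
// (or to refine the parser against), then re-run.
use std::fs::{self, File};
use std::io::Write;

fn parse_line(line: &str) -> Option<(String, i64)> {
    let mut parts = line.split(',');
    Some((parts.next()?.trim().to_string(), parts.next()?.trim().parse().ok()?))
}

fn main() -> std::io::Result<()> {
    let text = fs::read_to_string("input.txt")?;
    let mut rejects = File::create("rejects.txt")?;
    let mut parsed = Vec::new();
    for line in text.lines() {
        match parse_line(line) {
            Some(rec) => parsed.push(rec),
            None => writeln!(rejects, "{line}")?,
        }
    }
    println!("parsed {} rows, see rejects.txt for the rest", parsed.len());
    Ok(())
}
```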

Python Pandas, or R data frames, or the Julia DataFrames package. DataFrame packages are written specifically to deal with exploratory or ad-hoc data tasks like this, where you need to take data with poorly specified types, give it structure, do map-reduce / groupby / split-apply-combine operations, and process it initially for later downstream processing. R and Pandas are the industry standard.

Thanks to everyone for the great advice.

In the end, I went with Rust + println! + Vi. :slight_smile:

println! + Vi for exploration, Rust for everything else. [Turns out, dealing with txt data isn't all that bad].
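The exploration part looks roughly like this sort of thing (the file name and the field-count heuristic are just illustrative, not my actual code):

```rust
// Dump a handful of lines plus a crude shape summary, stare at it in Vi,
// adjust the parser, repeat.
use std::collections::BTreeMap;
use std::fs;

fn main() {
    let text = fs::read_to_string("input.txt").expect("read input");
    // Print the first few lines verbatim to see what we're dealing with.
    for line in text.lines().take(5) {
        println!("{line:?}");
    }
    // Count how many tab-separated fields each line has; a quick way to
    // spot whether the format is as regular as it looks.
    let mut widths: BTreeMap<usize, usize> = BTreeMap::new();
    for line in text.lines() {
        *widths.entry(line.split('\t').count()).or_insert(0) += 1;
    }
    println!("field-count histogram: {widths:?}");
}
```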
