I had a dialogue with my fellow researcher, discussing alternatives to Python & Pandas. The more I work with Pandas, the more issues I see with the code. In SQL you do
select column_a, column_b, column_in_table2
from table1
join table2 on foreign_key1=foreign_key2
where column_b>1
and you don't need to prefix column names unless they appear in both tables. In Python/Pandas the equivalent is painful, because everything has to be explicit:
(  # parentheses to make this line-breakable
    df1[df1.column_b > 1][['column_a', 'column_b', 'foreign_key1']]
    .merge(df2[['column_in_table2', 'foreign_key2']],
           left_on='foreign_key1', right_on='foreign_key2')
)
(I have 6 years of experience with Pandas, and I swear this is idiomatic enough. I could only shorten it slightly.)
We tried Julia; he likes it for the better speed and the more specific syntax for vectorization (write a function for one value, broadcast it over a Vector at the speed of C). But otherwise it's pretty much the same.
# (needs: using DataFrames)
innerjoin(
    select(
        filter(:column_b => v -> v > 1, df1),  # Rust equivalent: filter(df.column_b, |v| v > 1)
        [:column_a, :column_b, :foreign_key1]),
    select(df2, [:foreign_key2, :column_in_table2]),
    on = :foreign_key1 => :foreign_key2
)
Not more convenient. But some expressions are shorter than in Python.
Another big problem with both Python and Julia is that if you get column names wrong, you only discover it at runtime. It can't be statically checked.
Now we found Polars. And he immediately points me at this:
use polars::prelude::*;

fn example() -> Result<DataFrame, PolarsError> {
    LazyCsvReader::new("foo.csv")
        .has_header(true)
        .finish()?
        .filter(col("bar").gt(lit(100)))
        .groupby(vec![col("ham")])
        .agg(vec![col("spam").sum(), col("ham").sort(false).first()])
        .collect()
}
He says this is just unreadable spaghetti code.
I'd agree with him, because
- to point at a column, you must write col("bar").
- filtering is just unreadable: .filter(col("bar").gt(lit(100))).
I understand that these are tradeoffs of Rust.
- You still need a var to point at a column, or to pass this lazy evaluator to the filter() method of the DataFrame.
- You can't override comparison operators in Rust to output vectors: the comparison traits require the method to return a single bool, while in Python you override __eq__ / __gt__ to return a vector. So you have to provide some kind of .eq()/.gt()/.lt() methods instead (see the sketch after this list).
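To make that second point concrete, here is a minimal sketch (my own toy types, not Polars' actual internals) of why an expression API ends up with a .gt() method: Rust's comparison operators come from PartialEq/PartialOrd, whose methods must return a bare bool, so a lazy column expression has to build its comparison nodes through ordinary methods instead.

// Toy expression type, only to illustrate the trade-off; all names here are hypothetical.
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
}

fn col(name: &str) -> Expr { Expr::Column(name.to_string()) }
fn lit(value: i64) -> Expr { Expr::Literal(value) }

impl Expr {
    // PartialOrd::gt must return bool, so `col("bar") > lit(100)` can't
    // produce an expression node; a plain method can.
    fn gt(self, other: Expr) -> Expr {
        Expr::Gt(Box::new(self), Box::new(other))
    }
}

fn main() {
    let predicate = col("bar").gt(lit(100)); // same shape as the Polars call
    println!("{:?}", predicate);
}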
In Rust, this is still more compact than defining your own struct for serde + CSV and then aggregating the values in a HashMap. But for anyone also considering R/Python/Julia, this is worse in terms of code, with none of the benefits of Rust's static analysis and reliability.
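For reference, this is roughly the serde + csv + HashMap baseline I mean (a sketch only; the column names and types are assumptions matching the Polars snippet above, not taken from any real project):

use std::collections::HashMap;
use serde::Deserialize;

// Row type for serde; the fields mirror the columns in the Polars example.
#[derive(Deserialize)]
struct Row {
    bar: i64,
    ham: String,
    spam: f64,
}

// Filter bar > 100, group by ham, sum spam -- all by hand.
fn example() -> Result<HashMap<String, f64>, Box<dyn std::error::Error>> {
    let mut sums: HashMap<String, f64> = HashMap::new();
    let mut reader = csv::Reader::from_path("foo.csv")?;
    for record in reader.deserialize() {
        let row: Row = record?;
        if row.bar > 100 {
            *sums.entry(row.ham).or_insert(0.0) += row.spam;
        }
    }
    Ok(sums)
}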
Makes me think: wouldn't it be more profitable to make a struct whose fields hold the columns, and use all the benefits of static analysis? (And use macros where we need to access columns in the context of a dataframe.)
struct MySourceDf {
    bar: Col<f32>,
    ham: Col<i32>,
    spam: Col<Option<f32>>,
}

struct MyGroupedDf {
    ham: Col<i32>,
    spam: Col<f32>,
}

let my_source_df: MySourceDf = read_csv("foo.csv");
let my_grouped_df: MyGroupedDf = agg!(
    group!(
        filter!(my_source_df, bar, |v| v > 100),
        // ^ expands to filter(my_source_df, my_source_df.bar.map(...))
        // and can be checked at compile time
        ham),
    spam => sum);
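As a rough, macro-free sketch of the same idea (entirely hypothetical; Col is replaced by plain Vec here, and the grouping is written out by hand to show the kind of code filter!/group!/agg! could expand to):

use std::collections::HashMap;

// Hypothetical typed dataframes: each column is an ordinary field,
// so a misspelled column name is a compile-time error, not a runtime one.
struct MySourceDf {
    bar: Vec<f32>,
    ham: Vec<i32>,
    spam: Vec<Option<f32>>,
}

struct MyGroupedDf {
    ham: Vec<i32>,
    spam: Vec<f32>,
}

// Filter bar > 100, group by ham, sum spam -- the hand-written version
// of what the macros above would generate.
fn group_and_sum(df: &MySourceDf) -> MyGroupedDf {
    let mut sums: HashMap<i32, f32> = HashMap::new();
    for i in 0..df.bar.len() {
        if df.bar[i] > 100.0 {
            *sums.entry(df.ham[i]).or_insert(0.0) += df.spam[i].unwrap_or(0.0);
        }
    }
    let (ham, spam) = sums.into_iter().unzip();
    MyGroupedDf { ham, spam }
}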