Isn't Polars code too unreadable?

I had a dialogue with my fellow researcher, discussing alternatives to Python & Pandas. The more I work with Pandas, the more I see issues with code. In SQL you do

select column_a, column_b, column_in_table2
from table1
    join table2 on foreign_key1=foreign_key2
where column_b>1

and you don't need to prefix column names unless they appear in both tables. In Python/Pandas the equivalent will be painful, because everything has to be explicit:

(  # parenthesis to make this line-breakable
    df1[df1.column_b > 1][['column_a', 'column_b', 'foreign_key1']]
    .merge(df2[['column_from_table2', 'foreign_key2']],
        left_on='foreign_key1', right_on='foreign_key2')
)

(I have 6 years of experience with Pandas, and I swear this is enough idiomatic. Could only slightly shorten this.)

We tried Julia, he likes it because of better speed and more specific syntax for vectorization (write a function for one value, run against a Vec at the speed of C.). But otherwise it's pretty much the same.

# (need to import DataFrames.jl)
innerjoin(
    select(
        filter(:column_b => v -> v > 1, df1),   # rust equivalent: filter(df.column_b, |v| v > 1)
        [:column_a, :column_b, :foreign_key1]),
    select(df2, [:foreign_key2, :column_in_table2])
)

Not more convenient. But some expressions are shorter than in Python.

Another big problem with both Python and Julia is that if you get column names wrong, you only discover this in runtime. It can't be statically checked.

Now we found Polars. And he immediately points me at this:

use polars::prelude::*;

fn example() -> Result<DataFrame, PolarsError> {
    LazyCsvReader::new("foo.csv")
        .has_header(true)
        .finish()?
        .filter(col("bar").gt(lit(100)))
        .groupby(vec![col("ham")])
        .agg(vec![col("spam").sum(), col("ham").sort(false).first()])
        .collect()
}

He says this is just unreadable spaghetti code.

I'd agree with him, because

  1. to point at a column, you must write col("bar").
  2. filtering is just unreadable. .filter(col("bar").gt(lit(100)).

I uderstand that these are tradeoffs of Rust.

  1. You still need a var to point at a columns. Or to pass this lazy evaluator to the filter() method of the DF.
  2. You can't override comparison operators in Rust to output vectors. The traits require the method to output a single bool, while in Python you override __eq__, __gt__ to output a vector. So you have to make some kind of .eq/gt/lt() method.

In Rust, this is more compact than defining own struct for Serde-CSV and then aggregate the values in a HashMap. But for anyone considering R/Python/Julia, this is worse in terms of code, and no benefit of Rust static analysis and reliability.

Makes me think, wouldn't it be more profitable to make a struct where fields would keep columns, and use all the benefits of static analysis? (And use macros where we need to access columns in context of a dataframe.)

struct MySourceDf {
    bar: Col<f32>,
    ham: Col<i32>,
    spam: Col<Option<f32>>,
}

struct MyGroupedDf {
    ham: Col<i32>,
    spam: Col<f32>,
}

let my_source_df:MySourceDf = read_csv("foo.csv");
let my_group_df:MyGroupedDf = agg!(
    group!(
       filter!(my_source_df, bar, |v| v > 100)
     // ^ expands to filter(my_source_df, my_source_df.bar.map(...))
     //   and can checked at compile time
       ham),
    spam => sum)

As a not-scientist and a professionnal programmer, those first 4 examples are all on the same level of readability (except for Julia which I'm just not used to reading).

1 Like

Note that SQL is objectively kinda bad as a language syntax. There's good reason that Splunk and Azure Data Explorer and such use a very different structure. And even things that do support SQL-like syntax often make you clarify where the values came from anyway (like in CosmosDB's SQL-like queries).

YOLOing where the fields come from is convenient for exploration, but for anything you're making a Rust program to do, you don't want it to start failing just because someone added a new column to one of the tables you're joining. So requiring source.field instead of just field makes great sense to me.

4 Likes

This. :point_up_2: When authoring SQL queries, I increasingly find myself in paranoid mode, spelling out table.column explicitly, exactly for this and similar reasons.

The thing is, I still think that SQL is a good and suitable DSL w.r.t. its semantics. Unfortunately, its syntax is way too verbose and ugly to my taste, and the overwhelming majority of implementations is simply not strongly typed enough to squeeze out correctness guarantees of queries comparable to Rust (or say, Haskell).

What you (OP) are observing is the famous "object-relational" mismatch. (I use scare quotes because it has little to nothing to do with "objects", especially in a non-OO language like Rust, but I digress.) The thing is, it's spectacularly hard to express relational queries in a way that is:

  • readable
  • logically correct
  • statically type-checked
  • easy to optimize for the query planner and executor
  • can represent arbitrary, complex constraints and aggregations (e.g. recursion and graph traversal/building)
  • amenable to integration into the structured, tree-like data model of "regular" programs

at the same time. In the traditional SE world, ORMs are supposed to "solve" these problems; unfortunately, most of them (Diesel included) fall short on the "can express arbitrary queries" and "can get an arbitrary data structure out" fronts.

Some pretty solid work has recently been done by @2e71828 in his mem_query crate which you might be interested in. In the meantime, I've been working on and off on a database DSL with the ambitious goal of being the Ultimate Solution That Everyone Should Use. (Unfortunately, shenanigans such as having to earn money mostly preclude me from dedicating as much time to it as would be necessary for steady progress.)

But otherwise, correct, elegant, and efficient data querying remains a massive pain in the ass an active area of research, and integration with in-memory data models requires considerable boilerplate.

As for your Pandas example, you could make it considerably more eye-friendly by using .query() (what a coincidence), moving the projection to the end, and reformatting:

df1.query('column_b > 6')
   .merge(df2, left_on='foreign_key_1', right_on='foreign_key_2')
   [['column_a', 'column_b', 'foreign_key_1']]

I'd argue that the Polars code isn't too bad – .gt() instead of > is not pretty, but not the end of the world, either. Regarding your suggestion:

That's certainly doable, but it comes with a different trade-off. Basically, you can either have generic soup with real functions (so-called "type-delimited eDSL"), or macro soup with macros (so-called "quote-delimited eDSL"), but you can't avoid both. I guess Rustaceans prefer generics and value-level computation over macros by default (and if possible), since macros are an untyped syntactic abstraction, so it's harder to enforce guarantees in a predominantly macro-based system.

In my experience with trying to implement my aforementioned database DSL thrice, neither approach (nor a combination) really works as nicely as I would like to, and thus I reluctantly concluded that an embedded DSL for sophisticated query planning, either type-delimited or quote-delimited, is unrealistic at best.

Thus, I'd encourage you, until better tools and languages come out, to bite the bullet and go with whichever semi-sophisticated query system is the most convenient for you or best-integrated with the rest of your system, and don't give up the search for a better option over time.

10 Likes

Yes, I'm familiar with query, but prefer native code, probably irrationally. As for projection, in this example this is correct, but in to avoid potential collision of columns, I prefer doing it first (same as you write table.column). I probably should have written SQL code a bit more verbose, for a fair comparison:

select t1.column_a, t1.column_b, t2.that_other_column
from table1 t1
    join table2 t2 on ...

The Rust language is somewhat cluttered if we want to write queries with it. On an API level I think it is more fair to compare the polars python API with pandas API.

# pandas (your query)

(  
    df1[df1.column_b > 1][['column_a', 'column_b', 'foreign_key1']]
    .merge(df2[['column_from_table2', 'foreign_key2']],
        left_on='foreign_key1', right_on='foreign_key2')
)


# polars
(
    df1
    .filter(pl.col("column_b") > 1)
    .join(df2, left_on="foreign_key1", right_on="foreign_key2")
    .select(["column_a", "column_b", "column_in_table2"])
)

It probably is a matter of taste, but if you ask me the intent of a polars query is more clear. In pandas, both the filtering operation and the select operation use an index [] access and that operation thus has some ambiguity.

In rust the API is more cluttered because we have no automatic default arguments, references, clones, etc. I agree that all of this hurts readability of a query somewhat.

3 Likes

To be frank it sounds like you're repeatedly using tools that aren't a good fit for your problem domain. If you feel that Pandas isn't expressive enough for your use case after using it for six years, I don't think the problem is Pandas. If the concision and ergonomics you're able to achieve using SQL are better, then use SQL (or use SQLAlchemy, either on its own or as input into Pandas, thus giving you the best of both worlds). Replacing one dataframe implementation with another one isn't working for you, yet you keep trying to find one that will work. I think that if you take a step back and reflect upon your choices here (Pandas no good, Julia no good, Polars no good) you might reasonably conclude that R probably won't work for you either. My advice is to take some time to reconsider your approach to the problem.

3 Likes

I do in fact use SQL where possible and replace pandas code with SQlAlchemy for production, as much as possible. (because with our amount of data, Pandas would simply crash from memory overflow in a typical cloud VM). The method you point at, I use often as a shorthand to load/save to Postgres.

Trying to answer to you, I wrote a big summary of all tools, and then realized that Pandas isn't as bad -- in fact it turns out usually the tersest one, but I just use it to the limits.

Regarding Polars, my doubt is that it does offer Rust's speed, but not as much Rust's reliability as I'd expect.

With a bit of work this could have been rewritten as:

fn example() -> Result<DataFrame, PolarsError> {
    LazyCsvReader::new("foo.csv")
        .has_header(true)
        .finish()?
        .filter(cond!(bar > 100))
        .groupby("ham")
        .agg(vec![col("spam").sum(), col("ham").sort(false).first()])
        .collect()
}

I can understand how you get to the conclusion that diesel cannot do certain things, but I nevertheless want to clarify that this is technically not true. There are certainly things in diesel's built-in dsl that are missing to represent "arbitrary queries". In addition there are use cases, which are mostly documented as "well that's not how diesel is supposed to work", but that mostly refers to the built-in dsl as well. What is in common of both statements is that they refer to the built-in dsl, which means:

  • It is possible to extent this dsl to do essentially whatever you want.
  • Fall back to lower level API's to do whatever yo want for specific use cases.

These points allow to essentially perform any query and return an arbitrary datastructure ( as it allows to access the data in the format returned form the database).

2 Likes

Yeah, but that's missing the point.

Of course, you can always drop down to raw SQL for querying and dynamically-typed rows for data retrieval. But then you don't get the same guarantees that you would get had the ORM's own DSL supported the query in question.

I'm obviously not arguing that it's suddenly impossible to express certain queries just because you happen to be using an ORM (along with perhaps DB-native SQL). My point is that I'd like such queries to be type-safe and return strongly-typed data. That's what doesn't work beyond some (pretty reasonable) complexity.

Maybe that was a bit unclear above: I did not talk about just using diesel::sql_query (although that's always a valid solution to get something quickly done). I did talk about extending the DSL, which gives you exactly the same guarantees as code written in diesel itself. From a technical point of view you can write anything as diesel DSL extension. This makes such queries type safe and strongly typed (assuming the extension in itself is sound). You can express arbitrary queries there as rusts type system itself is Turing complete, which implies you can represent anything there.

From a practical point of view it might sometimes be hard to implement a extension for certain queries, as this could easily require implementing large extensions or even reimplementing diesel internal stuff. What I want to express is essentially: It can be done and for certain use cases it might be worth (especially for such cases where you can publish such extensions as standalone crate).

I have been thinking about this for years. I would have been the happiest if it were practical for more complex queries, but alas, after a certain point, it simply isn't.

I investigated this and other problems during my PhD studies. You can check out the preprint of my publication on the matter, as well as the code in the supplementary material, which demonstrates that rewriting all but the most trivial queries in a type-safe manner absolutely destroys the readability (and sometimes also the performance) of the code.

The thing that comes closest to being somewhat readable without requiring excessive trickery is the Haskell/Persistent/Esqueleto implementation (which is funny because I have literally zero previous experience with this framework). It is not an ORM, however, as it is (and claims to be) merely an embedding of SQL into the Haskell type system. This means that some very rudimentary type checking is performed, but most subtly-broken queries that you could write in untyped SQL and have it pass the DB's parser/planner will still compile and silently do the wrong thing under this embedding.

4 Likes

At this point sqlx approach (which just simply verifies all your queries against actual database during compilation) sounds simpler and more robust.

Thanks for providing these documents. I did not have the time to go over everything in your publication in detail, so I might be wrong with the following statements. The first point to notice here is that your examples do not include diesel. It might be tempting to conclude based on this examples it's the same for all frameworks. I personally would not word it that way, maybe only that there might be some indication that it is this or that way. To be clear here: I do not say that your work is bad. On the contrary it's quite interesting to read these findings. As a fellow PhD student I'm well aware that all of this is hard work.

I tried to find the "base line" SQL queries in the linked repository, but it seems like I looked in the wrong place. Having a look at these queries would allow to reason about what's already possible with diesel's built-in DSL, what requires "simple" extensions and what requires more work. Anyway if you are interested in knowing that, feel free to reach out.

To specifically go over the points listed in your paper as "Issues of Current Data Abstraction APIs"

The friction associated with strongly, statically typed ORMS and the additional layer of
abstraction between the database and the programmer is high

See the point above about drawing general conclusions from a limited set of examples. In general it's hard to judge whether this applies to diesel or not as this would require reimplementing your research for diesel as well.

Relatedly, ORMs generally appear to abstract over the lowest common denominator of
various kinds of databases

This is not true for diesel. It allows (and encourages you) to write highly database system specific queries if that's required. It also checks that a certain query is valid for the given backend at compile time.

The next point seems to be specific for one of your chosen example framework.

Refining the previous point, we can address the problem that ORMs today also lack the
ability to extend their own DSLs by 3rd parties in a principled manner, via specialized
and strongly-typed interfaces, instead of making the programmer write raw SQL by hand.

As pointed out in the previous answer: Diesel has these extension points. To represent a SQL fragment you need to implement QueryFragment (which describes how something is translated into SQL) and Expression (which provides compile time type information about some SQL fragment). In addition there is the Query trait that allows to mark such a custom type as complete query. (There are a few additional traits to control prepared statement caching and such things but they are not relevant for this discussion).

Esqueleto is probably the library in other languages, that is most similar to diesel. In contrast to Esqueleto diesel does provide some ORM features on top of this query builder syntax. As I have not used Esqueleto by myself I cannot really comment on what you mean by "rudimentary type checking is performed"?. At least for diesel we treat something like that as a bug, which should be fixed. Similar to the guarantees from rust itself, this only covers cases which are clearly broken, not such cases that are technically correct by likely not desired. (A famous example here is comparing to a null value via = in SQL, which is valid SQL, but likely not that what users want).

All of that above should not be read as "diesel is perfect". That's by no means true as well. There are many points that could see improvements in diesel as well, for example turning more of these Technically it's possible to write an extension for this" statements into a sane API like for example sql_function!. This requires input from others what's missing and design work to figure out how to provide a suitable API. That's likely something that will be work in progress for a long time.

The point is: SQLx does not really make queries by itself type-safe. They "only" provide an interface for strongly typed data. That might sound like a minor difference, but has practical implications as soon as you start building conditional queries or try to check a certain query part only. I'm not saying that their approach does not work, just that there are limitations. I even have some ideas how that could be improved, maybe I will add that to diesel someday.

Indeed, but I'm not intending that publication to be about Diesel, so… this point is moot. I also ended up excluding Esqueleto from the publication, because time was pressing (at that time at least), and I couldn't find a fast and generic-enough Haskell lexer/parser that could have done a job comparable to the parsers of the other languages. Still, it's one of the strongest contestants.

They can be found under ExampleData.

Exactly how many ORMs do you expect me to master before I'm allowed to come to conclusions? Obviously, there can't be an exhaustive study on all ORMs, which automatically means that we need to be inductive if we ever hope to come to any conclusions. I'm not saying that all of this automatically applies to Diesel or any other ORM in particular, but ORMs have been around for decades, and it's very easy to see the general pattern of them being convenient for CRUD and breaking down with increasing complexity and requirements.

I'm aware of QueryFragment, and in an upcoming DB abstraction system, I was planning to provide a similar escape hatch for writing raw SQL/NoSQL. However, that's not the end of the story. This approach is basically the equivalent of unsafe for ORMs. It's not exactly reasonable to expect this to be an ultimate solution, because while it's workable and the encapsulation of raw SQL can be maintained to some point, programmers will make mistakes if you make them write raw SQL, so the lack of automation there is still a problem.

Basically, my opinion is that needing such extensions should be exception rather than the norm, and the ORM should strive for supporting the more complex use cases as well, instead of essentially defining away the problem by telling the user that they can just re-implement it themselves.

I have a feeling that this is very abstract, and thus there are a couple of concrete use cases that I would like to see in a modern ORM. For example, does Diesel support any or all of the following, out of the box, in a type-safe manner?

  • Recursive/hierarchical queries (the equivalent of SQL's rCTE)
  • Sum types (not just trivial enums, but enums with associated data) and pattern matching on them
  • Checked/typed migrations, ensuring that migrating up and down are inverses of each other (as much as it's possible given the potential data loss of e.g. dropping fields/columns)

It means that trivial "type mismatch"-style errors, such as equating the PK of an entity with a FK referencing a mismatching entity, are detected, but basically not much more. For example, the CRUD layer behind Esqueleto called Persistent happily stows a list of entities into a different entity by-value as a big blob of JSON, causing unintentional denormalization, instead of using foreign key relationships.

I don't expect you to master any ORMs, I merely suggest to word your conclusions a bit more cautions. Maybe just say that this seems to be the case for everything that works similar to your example ORM's? This does not reduce your work there, it just removes one point where others could claim that some hypothetical framework did things better.
Other than that I agree with your general conclusion that at least the "classic" ORM pattern breaks down with increasing complexity, which is one of the reasons diesel work how it works. Such research is gives at least some indication how to improve diesel.

Another more general point I noticed today: I'm not sure if you are comparing the "same" thing in your publication. After all the base line SQL queries, that you cite as "best" case do not perform the same work as the ORM's. It misses a rather critical piece of work: Mapping results to something that could be used in your program. After all that's an important step and you should maybe include that in your considerations by performing these SQL queries via the most convenient SQL interface of the corresponding language.

The point is: QueryFragment is not an escape hatch, it's the fundamental building block of diesels abstraction. All the DSL provided by diesel does work via QueryFragment. Of course that assumes that a certain impl is correct, but it gives users the necessary tools to write there own extensions that work exactly the same way as what diesel provides. In addition it allows to build abstractions to register extensions for certain specific use cases.

sql_function! is really a good example for cases where extensions are required. You cannot ever provide bindings to all sql functions provided by a database systems, as users could define their own functions there. sql_function! just takes some signature and generates the corresponding QueryFragment impl based on that. I think this might be similar true for other cases as well.

Recursive/hierarchical queries (the equivalent of SQL's rCTE)

These are not supported by the built-in dsl yet. That's mostly because I myself had not a use case for such queries. As they have some tricky problem to figure out (how to reference the fields of the inner selection in rusts type system?) I rather spend my time working on stuff that I actually use.

Sum types (not just trivial enums, but enums with associated data) and pattern matching on them

It depends a bit on what you are actually want to see here, but yes you can load/store such enums via diesel. You must define the corresponding mapping to the underlying table via a manual implementation of the corresponding trait, just because there is no standard way to perform these mappings.

Checked/typed migrations, ensuring that migrating up and down are inverses of each other (as much as it's possible given the potential data loss of e.g. dropping fields/columns)

Diesel itself does not handle migrations at all. It generates a schema based on what's there in your database. This is then used to perform compile time checking of your queries. Obviously its possible that the schema and the actual database disagree about something, but I really do not see any way how you could prevent that for a shared mutable database at all.
There is the diesel_migrations crate which do migrations based on SQL strings. Future versions of diesel might also support the generation of these migrations based on the difference between the "built-in" schema and the actual database state, but I see that mostly as convenient way to fill in migration stub.

All in all I agree with you that there is definitively work left to do in diesel. My point here is mostly: Just because it's not done yet, it does not mean it's not possible, but rather: I didn't had the time yet.

Diesel catches quite a few more things at compile time. For example it disallows mixing aggregated and non-aggregated expressions as long as that does not match your group by clause at compile time or it enforces postgres invariants around distinct on and order by at compile time.