Jaccard Similarity in Rust Polars

Eager to learn Rust (and, to be honest, really struggling). I'm trying to convert something I did in Python, but unfortunately none of the help I've received so far has resulted in a solution.

Apart from the fact that I would really like this solved, I was wondering why there was so little response. I posted 2x on SO, 3x on the Polars Discord server and twice on 2 Rust Discord servers. Is it just a boring or stupid question? Didn't I explain it well? Or is Polars perhaps niche and opaque?

2 Likes

I think the function from your last snippet should look like this:

use polars::prelude::*;
use std::collections::HashSet;

fn jaccard_similarity(sa: &Series, sb: Vec<String>) -> f32 {
    // Collect the utf8 values of the Series into a HashSet of Option<&str>.
    let s1 = sa.utf8()
        .unwrap()
        .into_iter()
        .collect::<HashSet<_>>();

    // Wrap each comparison word in Some so both sets have the same item type.
    let s2 = HashSet::from_iter(sb.iter().map(|x| Some(x.as_str())));

    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    let result =  i / u;

    result
}

I'm not very surprised that learning Rust with Polars makes you struggle. I find the API rather complex and not yet very well documented. It also requires a decent amount of knowledge of collections and iterators in Rust. Having to fight Rust's strict type system while also fighting against Arrow/Polars doesn't sound like much fun.

As for why you haven't gotten much support, I don't know the Polars Discord and Rust Discord channels and how active they are. On SO I've noticed a few Polars questions lately, but I'm afraid the community doesn't have many Polars experts yet. It's a relatively new technology after all. Here on URLO we rarely have Polars questions (compared to SO), IMO.

I think the best chance for an answer is to provide a unit-test-like code snippet with the behaviour you expect. That allows me to copy it to the playground or Rustexplorer and start debugging.
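For example, something along these lines (a plain-Rust sketch with made-up values, no Polars involved; run it with cargo test), just to show the shape of such a snippet:

use std::collections::HashSet;

fn jaccard_similarity(a: &[&str], b: &[&str]) -> f32 {
    let s1: HashSet<_> = a.iter().collect();
    let s2: HashSet<_> = b.iter().collect();
    s1.intersection(&s2).count() as f32 / s1.union(&s2).count() as f32
}

#[test]
fn jaccard_of_two_word_lists() {
    // intersection = {"a", "ab"} (2 items), union has 4 items => 2 / 4 = 0.5
    assert_eq!(jaccard_similarity(&["a", "ab", "abc"], &["a", "ab", "abcd"]), 0.5);
}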

7 Likes

Cool, after googling I now know what Jaccard Similarity is. Thanks. Now, what is "Polars"?

I guess not many people are familiar with all that and Rust at the same time.

2 Likes

Polars is a data frame implementation, for manipulating tabularly structured (mostly numeric) data. It's basically a glorified matrix for statisticians. :sweat_smile: (The expression "data frame" originated in the R community as far as I can tell, and propagated to Python's Pandas user base when Python started to gain traction in statistics and machine learning. Polars also has a Python API that is offered as an alternative to Pandas.)

3 Likes

Thanks so much, Jonas. That solved the HashSet inconsistency. Afaict the function is used as a closure in a df.apply construction, so it should produce a PolarsResult<Series> instead of an f32, as in this example: pyo3-polars/parallel_jaccard_mod.rs at main · pola-rs/pyo3-polars · GitHub

A test seems like a good idea, if only I knew how to do that with a df. I had hoped that the Python example would provide enough information.

1 Like

This can definitely be improved (I'm not a Polars expert), but I guess it is a starting point for collecting multiple Jaccard Similarities into a Series:

/*
[dependencies]
polars = "0.28.0"
*/
use polars::prelude::*;

use std::collections::HashSet;

fn jaccard_similarity(sa: &Series, sb: &[&str]) -> f32 {
    let s1 = sa.utf8().unwrap().into_iter().collect::<HashSet<_>>();

    let s2 = HashSet::from_iter(sb.iter().map(|x| Some(*x)));

    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    let result = i / u;

    result
}

fn main() {
    let words = vec!["a", "ab", "abc", "abcd"];

    let df = df! [
        "keys" => ["a ab", "a ab abc", "b ba abc abcd", "b ba bbc abcd bbcd"],
    ]
    .unwrap();

    let d = df
        .column("keys")
        .unwrap()
        .iter()
        .map(|k| {
            let s = Series::from_iter(k.get_str().unwrap().split(' '));
            jaccard_similarity(&s, &words)
        })
        .collect::<Series>();

    println!("{d:?}");
}

Rustexplorer.

1 Like

Great. That certainly produces the right computation. Almost there! The purpose of that column is to sort the df. I hadn't managed to look into that yet. I will get back to you.

DataFrame::sort looks very promising :smile:

The purpose of the Jaccard similarity function is to sort the DataFrame based on the column "keys" and then extract that data from the df. So the "keys" column itself shouldn't be changed, only used to create new columns.

My approach was to first create a column "new_keys":

let df = df.lazy().with_column(col("keys").str().split(" ").alias("new_keys"));

And then try to use the Jaccard similarity function in a map closure whilst creating yet another column ("jc") that can be used to sort the df:

let o = GetOutput::from_type(DataType::Float32);
let lf = lf.with_column(col("new_keys").apply(|s| Ok(jaccard_similarity(&s, words)), o).alias("jc").collect());

That doesn't work, because the apply closure expects a Series, and there are probably other errors too.

Is this the right approach or should the jc function be adapted? I tried your map approach but that also doesn't work. Because of using a lazy frame? Or can't your .iter().map() idea be used in the with_column context?

How to approach such a problem? Is my reasoning sound or am I missing something essential?

Here is how I'd add the jaccard similarity to the data frame and sort by it:

/*
[dependencies]
polars = "0.28.0"
*/
use polars::prelude::*;

use std::collections::HashSet;

fn jaccard_similarity<'a, 'b>(
    sa: impl Iterator<Item = &'a str>,
    sb: impl Iterator<Item = &'b str>,
) -> f32 {
    let s1: HashSet<&str> = HashSet::from_iter(sa);
    let s2: HashSet<&str> = HashSet::from_iter(sb);

    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    let result = i / u;

    result
}

fn main() {
    let words = vec!["a", "ab", "abc", "abcd"];

    let mut df = df! [
        "keys" => ["a ab", "a ab abc", "b ba abc abcd", "b ba bbc abcd bbcd"],
    ]
    .unwrap();

    let js: Series = df
        .column("keys")
        .unwrap()
        .iter()
        .map(|k| {
            let s = k.get_str().unwrap().split(' ');
            jaccard_similarity(s, words.iter().map(|x| *x))
        })
        .collect();

    df.with_column(Series::new("jaccard-similarity", js))
        .unwrap();

    df.sort_in_place(&["jaccard-similarity"], vec![true])
        .unwrap();

    println!("{df:?}");
}

Rustexplorer.

I tried to do that but found it relatively hard. Constructing a Series that contains collections rather than scalars seems to be an edge case (or at least an easy way to do it wasn't apparent to me from the API docs).

Thanks. Very neat. My only concern is that the .lazy() context is recommended for things like this. The actual use case is in a complex search and it needs to be as fast as possible.

let js: Series = df
        .column("keys")
        .unwrap()
        .iter()
        .map(|k| {
            let s = k.get_str().unwrap().split(' ');
            jaccard_similarity(s, words.iter().map(|x| *x))
        })
        .collect();

Is this possible in a lazy frame context?

It took me a little while, but this is what I came up with using the lazy API:

/*
[dependencies]
polars = { version = "0.28.0", features = ["lazy"] }
*/
use polars::prelude::*;

use std::collections::HashSet;
use std::sync::Arc;

fn jaccard_similarity<'a, 'b>(
    sa: impl Iterator<Item = &'a str>,
    sb: impl Iterator<Item = &'b str>,
) -> f32 {
    let s1: HashSet<&str> = HashSet::from_iter(sa);
    let s2: HashSet<&str> = HashSet::from_iter(sb);

    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    let result = i / u;

    result
}

fn main() {
    let words = Arc::new(vec!["a", "ab", "abc", "abcd"]);

    let df = df! [
        "keys" => ["a ab", "a ab abc", "b ba abc abcd", "b ba bbc abcd bbcd"],
    ]
    .unwrap()
    .lazy()
    .with_column(
        col("keys")
            .map(
                move |s| {
                    Ok(Some(
                        s.iter()
                            .map(|k| {
                                jaccard_similarity(
                                    k.get_str().unwrap().split(' '),
                                    words.clone().iter().map(|x| *x),
                                )
                            })
                            .collect(),
                    ))
                },
                GetOutput::from_type(DataType::Float32),
            )
            .alias("jaccard"),
    )
    .sort("jaccard", Default::default())
    .collect()
    .unwrap();

    println!("{df:?}");
}

Rustexplorer.

2 Likes

Thanks. Terribly impressed. The only thing I had to change for my use case (although I am only guessing which multithreading option to select) is:

.sort("jaccard", SortOptions { descending: true, nulls_last: true, multithreaded: true })

It is also a bit intimidating. How do I learn to do this? Is there any sort of approach that I should be acquainted with? The map function seems incredibly important. And where did this come from all of a sudden?:

impl Iterator<Item = &'a str>,

Argument-position impl Trait is just shorthand for generics with unnamed (and unnameable) type parameters.
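For example, these two signatures accept exactly the same arguments; the second just gives the type parameter a name (the helper names here are made up for illustration):

fn count_words_impl<'a>(words: impl Iterator<Item = &'a str>) -> usize {
    words.count()
}

fn count_words_generic<'a, I: Iterator<Item = &'a str>>(words: I) -> usize {
    words.count()
}

fn main() {
    let v = vec!["a", "ab", "abc"];
    assert_eq!(count_words_impl(v.iter().copied()), 3);
    assert_eq!(count_words_generic(v.iter().copied()), 3);
}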

You follow the documentation and look for names, types, functions, and traits that suggest they might be doing at least part of what you are trying to do.

Like I said earlier, Polars is a complex tool without much documentation, so it is no wonder that you are struggling. I personally arrived at the answers I gave you by sifting through Polars' documentation and the user guide, and by using a bit of SO, but it was by no means easy. The only advice I have is to stick with it. Come back here when you are stuck, continue programming in Rust, hone your skills, and one day you'll look back and appreciate how far you've come and how much you've learned.

That was me trying to make the function more general and easier to use by using anonymous type parameters. It does not change the functionality whatsoever and you can ignore it. It only allows a nicer interface, since the function now accepts any iterator over string slices (&str), whether it comes from a vector, a Series, a slice, or anything else that can produce such an iterator. I find generics in Rust quite complex and one of the most advanced topics, so my tip would be to keep using concrete types like we did before, until you are further along on your Rust journey.
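For instance, here is a small self-contained sketch (reusing the generic function from above, unchanged) that feeds it iterators from a few different sources, just to illustrate the point:

use std::collections::HashSet;

fn jaccard_similarity<'a, 'b>(
    sa: impl Iterator<Item = &'a str>,
    sb: impl Iterator<Item = &'b str>,
) -> f32 {
    let s1: HashSet<&str> = HashSet::from_iter(sa);
    let s2: HashSet<&str> = HashSet::from_iter(sb);

    s1.intersection(&s2).count() as f32 / s1.union(&s2).count() as f32
}

fn main() {
    let from_split = "a ab abc".split(' ');       // iterator from splitting a &str
    let from_vec: Vec<&str> = vec!["a", "ab", "abcd"];
    let from_slice: &[&str] = &["ab", "abc", "b"];

    println!("{}", jaccard_similarity(from_split.clone(), from_vec.iter().copied()));
    println!("{}", jaccard_similarity(from_split, from_slice.iter().copied()));
}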

4 Likes

Unfortunately, when I try to use it in my search function it gives me "borrowed data escapes outside of function". Not a very helpful message. I can't quite see what causes the problem.

I put together a minimal test case (the eventual one is a bit more elaborate):

use polars::prelude::*;
use std::collections::HashSet;


fn jaccard_similarity<'a, 'b>(
    sa: impl Iterator<Item = &'a str>,
    sb: impl Iterator<Item = &'b str>,
) -> f32 {
    let s1: HashSet<&str> = HashSet::from_iter(sa);
    let s2: HashSet<&str> = HashSet::from_iter(sb);

    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    let result = i / u;

    result
}

fn search(lf: LazyFrame, input: &str) -> PolarsResult<Vec<String>> {

    let words: Vec<&str> = input.split(' ').collect::<Vec<_>>();

    let df = lf
        .with_column(
            col("keys")
                .map(
                   move |s| {
                        Ok(Some(
                            s.iter()
                                .map(|k| {
                                    jaccard_similarity(
                                        k.get_str().unwrap().split(' '),
                                        words.clone().iter().map(|x| *x),
                                    )
                                })
                                .collect(),
                        ))
                    },
                    GetOutput::from_type(DataType::Float32),
                )
                .alias("jaccard"),
        )
        .sort("jaccard", SortOptions { descending: true, nulls_last: true, multithreaded: false })
        .collect()
        .unwrap();

    let ca = df.column("keys")?.utf8()?;
    let to_vec: Vec<&str> = ca.into_no_null_iter().collect();
    let vec_str = to_vec.iter().map(|&x| x.into()).collect();

    Ok(vec_str)
}

fn main() {
    let input = "a ab abc abcd";

    let lf: LazyFrame = df! [
        "keys" => ["a ab", "a ab abc", "b ba abc abcd", "b ba bbc abcd bbcd"],
    ]
    .unwrap().lazy();

    let vec_words = search(lf, input);

    println!("{:?}", &vec_words.unwrap())
}

Well, you can't pass input as &str into the callback you use in col("keys").map(...), because it may not live long enough. We don't know when callbacks you use in the lazy API are executed. That's the beauty of the lazy API that allows us to build the compute graph and optimize it, but it adds a constraint to the interface of our callback closures, namely that every captured variable must be 'static. Since we don't know when the callback will run, it could very well run at a point in time when input is no longer valid, because the data it references went out of scope and got invalidated or deallocated. That would be a dangling pointer, which is exactly what Rust's ownership model prevents.

The 'static bound is fulfilled by all owned data types (everything that is not a reference and does not contain any non-'static references), as well as by &'static and &'static mut references. In this case input: &str is not 'static, even though the value passed in from main happens to be a &'static str, because the signature uses lifetime elision and lets the compiler infer some anonymous, caller-chosen lifetime for input.

In this contrived example, this would fix your problem:

- fn search(lf: LazyFrame, input: &str) -> PolarsResult<Vec<String>> {
+ fn search(lf: LazyFrame, input: &'static str) -> PolarsResult<Vec<String>> {

But I assume this is not very helpful for your real problem. You have to pass owned data instead. So instead of &str, pass input as a String and construct the words vector inside the callback. The move keyword in front of the closure makes it capture its environment (input) by value, moving ownership of input into the closure (you couldn't use input in the search function afterwards, since it has been moved into the closure). Since the closure now captures only owned data, it fulfills the 'static requirement of Polars' lazy API, and the lifetime problem is solved:

/*
[dependencies]
polars = { version = "0.28.0", features = ["lazy"] }
*/
use polars::prelude::*;
use std::collections::HashSet;


fn jaccard_similarity<'a, 'b>(
    sa: impl Iterator<Item = &'a str>,
    sb: impl Iterator<Item = &'b str>,
) -> f32 {
    let s1: HashSet<&str> = HashSet::from_iter(sa);
    let s2: HashSet<&str> = HashSet::from_iter(sb);

    let i = s1.intersection(&s2).count() as f32;
    let u = s1.union(&s2).count() as f32;
    let result = i / u;

    result
}

fn search(lf: LazyFrame, input: String) -> PolarsResult<Vec<String>> {
    let df = lf
        .with_column(
            col("keys")
                .map(
                   move |s| {
                        let words: Vec<&str> = input.split(' ').collect();
                       
                        Ok(Some(
                            s.iter()
                                .map(|k| {
                                    jaccard_similarity(
                                        k.get_str().unwrap().split(' '),
                                        words.clone().iter().map(|x| *x),
                                    )
                                })
                                .collect(),
                        ))
                    },
                    GetOutput::from_type(DataType::Float32),
                )
                .alias("jaccard"),
        )
        .sort("jaccard", SortOptions { descending: true, nulls_last: true, multithreaded: false })
        .collect()
        .unwrap();

    let ca = df.column("keys")?.utf8()?;
    let to_vec: Vec<&str> = ca.into_no_null_iter().collect();
    let vec_str = to_vec.iter().map(|&x| x.into()).collect();

    Ok(vec_str)
}

fn main() {
    let input = "a ab abc abcd";

    let lf: LazyFrame = df! [
        "keys" => ["a ab", "a ab abc", "b ba abc abcd", "b ba bbc abcd bbcd"],
    ]
    .unwrap().lazy();

    let vec_words = search(lf, input.to_owned());

    println!("{:?}", &vec_words.unwrap())
}

Rustexplorer.

2 Likes

Thanks so much! My search function finally runs. I had to adapt it by passing the input twice: once as a String for your code and once as a &str for some other filters. It looks rather contrived, but I couldn't find another way to get it all working together.

Well, you can't pass input as &str into the callback you use in col("keys").map(...), because it may not live long enough. We don't know when callbacks you use in the lazy API are executed.

There are lots of .filter calls in the lazy frame which don't have a problem with &str, so I don't quite understand that. Or is it because of the move in the .map method?

String coerces to &str thanks to this implementation of the Deref<Target=str> trait. So you can just clone the string and pass &clone_of_string to the places where you need the &str version of input. Or vice versa you can pass the reference and create an owned version of it which you pass to the callback.
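For illustration, a tiny sketch of those two directions (the helper functions are made up):

fn takes_str(_s: &str) {}
fn takes_string(_s: String) {}

fn main() {
    let owned = String::from("a ab abc");
    takes_str(&owned); // &String coerces to &str via Deref<Target = str>

    let borrowed: &str = "a ab abc";
    takes_string(borrowed.to_owned()); // or create an owned copy where a String is needed
}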

I don't follow. Do you mean LazyFrame::filter? It takes an Expr as input, not a callback. If you look at the definition of Expr, you'll see that it is a type that is in itself (i.e. not passed as a reference) 'static. You can see that easily, because it does not have lifetime parameters, i.e. it is defined as enum Expr { ... }, not as enum Expr<'a> { ... }.

Let's take col for example. It takes &str as input and returns an Expr. You'll see that it does not save the reference internally when you look at its implementation, but that it constructs an Arc<str> from the &str (using this trait implementation). It basically copies the data of the &str to a new location and wraps that in an Arc. Arc<str> (Arc stands for atomically reference counted and is a thread-safe version of Rc) is a smart pointer type for multiple ownership that owns the data it references, so no issues with non-static lifetimes, as the data is dropped together with the last Arc that references it.
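As a small illustration of that point (assuming polars 0.28 with the lazy feature, as elsewhere in this thread), an Expr built from a short-lived String is still perfectly valid after that String is gone:

/*
[dependencies]
polars = { version = "0.28.0", features = ["lazy"] }
*/
use polars::prelude::*;

// col() copies the column name into the Expr (as an Arc<str>), so the Expr can
// outlive the &str it was built from.
fn make_expr() -> Expr {
    let name = String::from("keys"); // owned, local to this function
    col(&name).alias("keys_copy")
    // `name` is dropped here, yet the returned Expr is still valid: it is 'static
}

fn main() {
    let df = df! [
        "keys" => ["a ab", "a ab abc"],
    ]
    .unwrap();

    let out = df.lazy().with_column(make_expr()).collect().unwrap();
    println!("{out:?}");
}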

On the other hand, if you look at the implementation of Expr::map you see that the closure we pass as the first argument must fulfill these trait bounds:

Fn(Series) -> Result<Option<Series>, PolarsError> + 'static + Send + Sync,

The + 'static makes it impossible to capture any variable from the environment that is not 'static.

No, that is needed to make the closure capture input by value rather than by reference. If it captured input by reference, we'd again face the problem that the closure captures something that is not 'static. You can read more about how closures capture their environment in the Rust Book: Closures: Anonymous Functions that Capture Their Environment - The Rust Programming Language.
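Here is a contrived, Polars-free sketch of the difference (require_static is a made-up stand-in for the 'static bound on the closure passed to Expr::map):

fn require_static<F: Fn() -> usize + 'static>(f: F) -> usize {
    f()
}

fn main() {
    let input = String::from("a ab abc abcd");

    // This would NOT compile: the closure captures `input` by reference,
    // so it is not 'static:
    // require_static(|| input.len());

    // With `move`, the closure owns `input` and satisfies the 'static bound:
    let n = require_static(move || input.len());
    println!("{n}");

    // `input` can no longer be used here: ownership moved into the closure.
}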

2 Likes

Just one clarification: the 'static bound doesn't require that the captured value itself live for the entire program, only that its type contains no non-'static references. This is why, for example, you can move owned types like String into the closure, even though those particular values are dropped long before the program ends.
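A small sketch of that point (assert_static is a made-up helper):

fn assert_static<T: 'static>(_t: T) {}

fn main() {
    let s = String::from("a ab abc"); // owned, contains no borrowed data
    assert_static(s);                 // fine: String satisfies T: 'static

    let local = String::from("abc");
    let _r: &String = &local;
    // assert_static(_r);             // would not compile: `local` does not live long enough
}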

I also noticed that cloning the temporary vector (and search inputs) can be avoided:

use polars::prelude::*;
use std::ops::Deref as _;

// jaccard_similarity() and main() are unchanged from the previous example.

fn search<S: AsRef<str>>(lf: LazyFrame, input: S) -> PolarsResult<Vec<String>> {
    let words: Vec<_> = input.as_ref().split(' ').map(|s| s.to_string()).collect();
    let df = lf
        .with_column(
            col("keys")
                .map(
                    move |s| {
                        Ok(Some(
                            s.iter()
                                .map(|k| {
                                    jaccard_similarity(
                                        k.get_str().unwrap().split(' '),
                                        words.iter().map(|s| s.deref()),
                                    )
                                })
                                .collect(),
                        ))
                    },
                    GetOutput::from_type(DataType::Float32),
                )
                .alias("jaccard"),
        )
        .sort("jaccard", SortOptions { descending: true, nulls_last: true, multithreaded: false })
        .collect()
        .unwrap();

    let ca = df.column("keys")?.utf8()?;
    let vec_str = ca.into_no_null_iter().map(|x| x.to_string()).collect();

    Ok(vec_str)
}

Taking AsRef<str> as input allows the function to be called more generally without creating owned Strings. For instance, the &'static str in the demo can be passed directly to search, as can any String or &'a str.
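To see the pattern in isolation, here is a minimal sketch with a made-up helper (not part of the search function above):

fn first_word<S: AsRef<str>>(input: S) -> String {
    input.as_ref().split(' ').next().unwrap_or("").to_string()
}

fn main() {
    let owned = String::from("a ab abc abcd");

    assert_eq!(first_word("a ab abc abcd"), "a"); // &'static str
    assert_eq!(first_word(owned.as_str()), "a");  // &str borrowed from a String
    assert_eq!(first_word(owned), "a");           // String, moved into the call
}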

Creating owned strings in the temporary vector allows the jaccard_similarity() call to reuse references without borrow checker issues. And similarly on the outgoing vector, you only have to collect once.

The actual runtime may not change much with these modifications. But it does make the code cleaner, and therefore easier to maintain. It also preserves developer intent: We don't want to clone the temporary vector in the inner iterator. It was just done to appease the borrow checker. Likewise, we don't want that same temporary to own copies of the input, but something has to give. (We decide to move ownership internal to the function as an implementation detail, not as an API contract.) Cloning the input once is probably a better tradeoff than cloning potentially many times.

That's also unnecessary. You can pass the input argument once and then reference it internally to the function wherever needed. The sample I shared here takes care of the ownership issue for the borrow checker's needs when interacting with Polars. For filtering, you still have access to the input through input.as_ref().

Here's an example, for illustration. If you wanted to filter on the first word in the input, you could split it by spaces and take the first element:

    let words: Vec<_> = input.as_ref().split(' ').map(|s| s.to_string()).collect();
    let df = lf
        .with_column(
            col("keys")
                .map(
                    // ...
                )
                .filter(input.as_ref().split(' ').next().unwrap())

This sample is not actually useful (it gives a ColumnNotFound error at runtime because I don't know what I'm doing, lol). But it demonstrates how easy it is to reference the input later on. Once again, clear developer intent while maintaining good performance by limiting unnecessary work.

Caveat emptor: I don't claim that these techniques provide the best performance per se. Rather, when it is possible to eliminate things like clones, and it doesn't add undue burden, then it is a good practice. This is just something that stood out to me as I was going over the thread.

2 Likes