Just one clarification. It isn't necessary that the variable itself be 'static, just that it can only contain 'static references. This is why, for example, you can move owned types to it (which are not themselves 'static).
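To make that concrete, here is a minimal std-only sketch. The make_matcher helper is made up for illustration (it is not from the thread or from Polars); it only shows that a closure with a 'static bound can own a String that was created at runtime.

```rust
// The `'static` bound limits what the closure may capture. It does not
// require the closure itself to live for the whole program.
fn make_matcher(prefix: String) -> Box<dyn Fn(&str) -> bool + 'static> {
    // `move` transfers ownership of `prefix` into the closure, so the
    // closure holds no borrowed data and satisfies `'static`.
    Box::new(move |candidate: &str| candidate.starts_with(prefix.as_str()))
}

fn main() {
    // `local` is built at runtime and is freed when `matcher` is dropped.
    let local = String::from("jac");
    let matcher = make_matcher(local);

    println!("{}", matcher("jaccard"));
    println!("{}", matcher("cosine"));

    // Capturing `&local` instead of moving it would be rejected by the
    // compiler, because a reference to a local variable is not `'static`.
}
```

The closure is dropped at the end of main like any other value. The bound only rules out captured references that could dangle.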
I also noticed that cloning the temporary vector (and search inputs) can be avoided:
use polars::prelude::*;
use std::ops::Deref as _;

fn search<S: AsRef<str>>(lf: LazyFrame, input: S) -> PolarsResult<Vec<String>> {
    // Own the search words once so the closure below can take them by move.
    let words: Vec<_> = input.as_ref().split(' ').map(|s| s.to_string()).collect();
    let df = lf
        .with_column(
            col("keys")
                .map(
                    move |s| {
                        Ok(Some(
                            s.iter()
                                .map(|k| {
                                    jaccard_similarity(
                                        k.get_str().unwrap().split(' '),
                                        // Borrow each owned word as &str; no clone needed.
                                        words.iter().map(|s| s.deref()),
                                    )
                                })
                                .collect(),
                        ))
                    },
                    GetOutput::from_type(DataType::Float32),
                )
                .alias("jaccard"),
        )
        .sort("jaccard", SortOptions { descending: true, nulls_last: true, multithreaded: false })
        .collect()?; // propagate the error instead of panicking
    let ca = df.column("keys")?.utf8()?;
    let vec_str = ca.into_no_null_iter().map(|x| x.to_string()).collect();
    Ok(vec_str)
}
Taking AsRef<str> as input allows the function to be called more generally without creating owned Strings. For instance, the &'static str in the demo can be passed directly to search, as can any String or &'a str.
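A small std-only sketch of what the AsRef<str> bound buys at the call site. The first_word helper is hypothetical and just stands in for search so the example runs without Polars.

```rust
// Hypothetical helper (not from the thread): returns the first
// space-separated word of any string-like input.
fn first_word<S: AsRef<str>>(input: S) -> String {
    // `split` always yields at least one item, so `unwrap_or` is only a guard.
    input.as_ref().split(' ').next().unwrap_or("").to_string()
}

fn main() {
    let literal: &'static str = "polars lazy frame";
    let owned = String::from("jaccard similarity");
    let borrowed: &str = &owned[..7]; // the slice "jaccard"

    // All of these compile against the same signature.
    println!("{}", first_word(literal));
    println!("{}", first_word(borrowed));
    println!("{}", first_word(&owned)); // &String also implements AsRef<str>
    println!("{}", first_word(owned)); // moving the String in works too
}
```

None of the callers has to allocate a String just to satisfy the signature. Any allocation happens inside the function, where it is actually needed.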
Creating owned strings in the temporary vector allows the jaccard_similarity() call to reuse references without borrow checker issues. And similarly on the outgoing vector, you only have to collect once.
The actual runtime may not change much with these modifications. But it does make the code cleaner, and therefore easier to maintain. It also preserves developer intent: We don't want to clone the temporary vector in the inner iterator. It was just done to appease the borrow checker. Likewise, we don't want that same temporary to own copies of the input, but something has to give. (We decide to move ownership internal to the function as an implementation detail, not as an API contract.) Cloning the input once is probably a better tradeoff than cloning potentially many times.
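The "clone once, borrow many times" idea can be sketched without Polars. Everything here is an assumption for illustration: jaccard_similarity is my own stand-in (the thread's version may have a different signature), and apply is a made-up function that demands a 'static closure, much like the closure passed to map above.

```rust
use std::collections::HashSet;

// Stand-in for the thread's `jaccard_similarity`: |A ∩ B| / |A ∪ B|.
fn jaccard_similarity<'a, 'b>(
    a: impl Iterator<Item = &'a str>,
    b: impl Iterator<Item = &'b str>,
) -> f32 {
    let a: HashSet<&str> = a.collect();
    let b: HashSet<&str> = b.collect();
    let shared = a.intersection(&b).count();
    let total = a.union(&b).count();
    if total == 0 { 0.0 } else { shared as f32 / total as f32 }
}

// Made-up API that requires a `'static` closure.
fn apply<F>(keys: &[&str], f: F) -> Vec<f32>
where
    F: Fn(&str) -> f32 + 'static,
{
    keys.iter().map(|&k| f(k)).collect()
}

fn score_keys(keys: &[&str], input: &str) -> Vec<f32> {
    // The only allocation of the search words happens here, once.
    let words: Vec<String> = input.split(' ').map(str::to_string).collect();
    // The closure owns `words`; each call only borrows them as `&str`.
    apply(keys, move |key: &str| {
        jaccard_similarity(key.split(' '), words.iter().map(String::as_str))
    })
}

fn main() {
    let scores = score_keys(&["polars lazy", "rust polars", "python"], "polars lazy");
    println!("{:?}", scores);
}
```

The closure runs once per key, but the words are only copied when the vector is built. Cloning the vector inside the closure would repeat that copy on every call.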
That's also unnecessary. You can pass the input argument once and then reference it internally to the function wherever needed. The sample I shared here takes care of the ownership issue for the borrow checker's needs when interacting with Polars. For filtering, you still have access to the input through input.as_ref().
Here's an example, for illustration. If you wanted to filter on the first word in the input, you could split it by spaces and take the first element:
let words: Vec<_> = input.as_ref().split(' ').map(|s| s.to_string()).collect();
let df = lf
    .with_column(
        col("keys")
            .map(
                // ...
            )
            .filter(input.as_ref().split(' ').next().unwrap())
This sample is not actually useful (it gives a ColumnNotFound error at runtime because I don't know what I'm doing, lol). But it demonstrates how easy it is to reference the input later on. Once again, clear developer intent while maintaining good performance by limiting unnecessary work.
Caveat emptor: I don't claim that these techniques provide the best performance per se. Rather, when it is possible to eliminate things like clones, and it doesn't add undue burden, then it is a good practice. This is just something that stood out to me as I was going over the thread.