Correct way of passing nested vectors as arguments

Assume that I have a pre tokenized list of sentences in a Vec<Vec<String>> or Vec<Vec<&str>>; Upon which I need to compute a bunch of different metrics.

fn main() {
    let sent_words = vec![
        vec!["This", "is", "sentence1."],
        vec!["This", "is", "not", "sentence3"],
    ];
    let m1: f32 = metric1(&sent_words);
    let m2: f32 = metric2(&sent_words);
}

Some constraints/invariants.

  1. The metrics are all read only. They never modify the vector, inner vector or the string.
  2. I need a fast way of computing the number of words in each sentence, so no flattening/streaming.

My question is considering these constraints what is the right way to define the metrics?

fn metric1<T>(t: &[Vec<T>]) -> f32
where
    T: AsRef<str>
{
    return t.len() as f32 / 2.0;   
}

The above definition works, but I don't understand what is &[Vec<T>] is even supposed to mean? Have the inner Vectors moved inside the function?

Can I somehow take &[&[T]] as an argument? Because it does not seem to work.
This Stackoverflow answer suggests using AsRef and it works.

fn metric2<T, U>(t: &[U]) -> f32
where
    T: AsRef<str>,
    U: AsRef<[T]>,
{
    return t.len() as f32 / 2.0;
}

How does my metric1 differ from metric2?
When would I prefer using either?
Is there something better than both ?

To answer both of these questions, let’s talk about generic function and trait bounds.

Starting with the function signature

fn metric1<T>(t: &[Vec<T>]) -> f32
where
    T: AsRef<str>

A generic function like

fn foo<T>(t: &T) {}

is like a scheme (or C++ folks would say template) for a whole lot of concrete functions with concrete function signatures: There’s foo<u32> and foo<String> and foo<Vec<Vec<String>>>, etc… i.e. we’re defining an infinite number of functions, one for each possible type that could be substituted for T. (By the way, in Rust syntax, when calling these this explicitly, one would have to write foo::<u32> instead of foo<u32>.)

Substituting T a concrete type when they’re called is the whole idea behind generic functions; the generic function itself is a template with some wholes to be filled out in order to create a real actual concrete function as a result.

The resulting function signatures of these functions then is easy to determine: substitute T accordingly to find out that foo<u32> has the signature fn(&u32), and foo<String> the signature fn(&String); in fact, the compiler can even confirm this

fn foo<T>(t: &T) {}

fn demonstration() {
    let _: fn(&u32) = foo::<u32>;
    let _: fn(&String) = foo::<String>;
    // compiles successfully, so the types are correctly annotated
}

Now enter trait bounds. A generic function with trait bounds no longer allows you to substitute any type you want when calling it; instead it constrains the possible types to those that have the properties required in the where clause. Let’s just do some compilation test again

fn metric1<T>(t: &[Vec<T>]) -> f32
where
    T: AsRef<str>
{
    todo!()
}

fn demonstration() {
    let _ = metric1::<String>;
    let _ = metric1::<&str>;
    let _ = metric1::<Box<str>>;
    let _ = metric1::<&&&str>;
    // ^^^^^^^^^^^^^^ these all work
    let _ = metric1::<u32>; // <- this doesn’t compile
}

The last line will produce a compilation error

error[E0277]: the trait bound `u32: AsRef<str>` is not satisfied
  --> src/lib.rs:13:13
   |
13 |     let _ = metric1::<u32>;
   |             ^^^^^^^^^^^^^^ the trait `AsRef<str>` is not implemented for `u32`
   |
note: required by a bound in `metric1`
  --> src/lib.rs:3:8
   |
1  | fn metric1<T>(t: &[Vec<T>]) -> f32
   |    ------- required by a bound in this
2  | where
3  |     T: AsRef<str>
   |        ^^^^^^^^^^ required by this bound in `metric1`

which is due to the fact that for T being u32, the constraint T: AsRef<str> from the where clause is not met. The error message explains: the trait bound `u32: AsRef<str>` is not satisfied, and the u32: AsRef<str> comes from substituting u32 for T in T: AsRef<str>.

Notably, the other cases all compiler successfully, so apparently the trait bounds String: AsRef<str>, &str: AsRef<str>, Box<str>: AsRef<str>, and &&&str: AsRef<str> are all satisfied. The signatures are also as expected, just substitute T. This code would compile if it wasn’t for some more advanced details regarding lifetimes that I won’t go into here:

fn demonstration() {
    let _: fn(&[Vec<String>]) -> f32 = metric1::<String>;
    let _: fn(&[Vec<&str>]) -> f32 = metric1::<&str>;
    let _: fn(&[Vec<Box<str>>]) -> f32 = metric1::<Box<str>>;
    let _: fn(&[Vec<&&&str>]) -> f32 = metric1::<&&&str>;
}

So what does &[Vec<T>] mean? It means &[Vec<String>] or &[Vec<&str>] or any other type in place of T as long as the trait bound from the where clause it satisfied.

What does &[Vec<String>] mean though? You’re asking “Have the inner Vectors moved inside the function?”; the answer is, no. The whole type starts with &, so it’s a reference type and nothing is moved. It references a slice of Vec<String>s. This reference &[Vec<String>] can, for example, then be created from (implicitly coercing) a &Vec<Vec<String>> when you call metric1.

How does metric2 differ from metric1? Well, looking at its signature

fn metric2<T, U>(t: &[U]) -> f32
where
    T: AsRef<str>,
    U: AsRef<[T]>,

it’s more generic. Two degrees of freedom instead of one, of you will; or, actually… the way in which it actually is “more generic” is that the metric1 signature is a special case of metric2. You can let the compiler check this by simply calling metric2 from metric1’s function body,

fn metric2<T, U>(t: &[U]) -> f32
where
    T: AsRef<str>,
    U: AsRef<[T]>,
{
    todo!()
}

fn metric1<T>(t: &[Vec<T>]) -> f32
where
    T: AsRef<str>
{
    metric2(t)
}

or you could figure this out yourself by comparing and matching the signatures:

  • in order to make
    (t: &[U]) -> f32
    
    and
    (t: &[Vec<T>]) -> f32
    
    match, we need U = Vec<T>
  • the parameters “T” of both function could just be made to match, even though in general, the name of such parameters doesn’t matter; but I’ll spare us the intermediate step of renaming one of them and figuring out they’re going to have to be the same anyways.
  • the metric2 function has an additional constraint that needs to be eliminated: U: AsRef<T>. We’ve determined we want U to be Vec<T>, so next up we’ll need to check whether or not Vec<T>: AsRef<[T]> is a thing. Turns out it is a thing. Here’s the implementation. And here’s how the compiler can verify for us that it’s a thing.

Now, with the above exploration done, answering the second question “How does my metric1 differ from metric2?”

We’ve seen that metric2 is more generic than metric1. You can use metric2 as a replacement wherever you can use metric1. From a caller’s perspective (assuming metric1 and metric2 implement the same metric), if you have metric2, then also having metric1 is superfluous / useless.

It’s also actually more generic, since metric2 supports function signatures that metric1 can’t support. The example you gave is &[&[T], so something like &[&[String]] would work for metric2, and it does not work for metric1 (you can ask the compiler, why; the answer doesn’t even involve traits, the type of the wrong shape, so it won’t match no matter what we substitute in for T; in fact, the error message highlights that the choice of what goes in for T doesn’t matter by simply printing an “_” in its place, essentially saying “these types aren’t the same no matter what goes into the “_” place”).

We’ve only focused on the function signature so far yet, and how to call instantiate and perhaps call the function; but there’s also the implementation… the function body!

Now there’s an opposite effect here: A more generic function signature makes calling the function easier, but it makes implementing it harder. While you can always call metric2 in place of metric1, you cannot use every function body that’s valid for metric1 and use it in metric2 instead; in fact it’s the opposite: every implementation of metric2 is also a valid metric1 implementation.

When you’re working with the items of type U in metric2, you’ll always need to call .as_ref() on them to get a &[T], whereas in metric1, you had Vec<T>s (behind a reference) and it’s possible to turn a &Vec<T> into &[T] more implicitly.

fn metric1<T>(t: &[Vec<T>]) -> f32
where
    T: AsRef<str>
{
    let x: &[T] = &t[0]; // this works

    return t.len() as f32 / 2.0;   
}

fn metric2<T, U>(t: &[U]) -> f32
where
    T: AsRef<str>,
    U: AsRef<[T]>,
{
    // let x: &[T] = &t[0]; // this doesn’t work
    let x: &[T] = t[0].as_ref(); // but this works

    return t.len() as f32 / 2.0;
}

When the additional capabilities that the metric2 signature gives you on the caller side are useful, then that’s a good reason for preferring that approach. If implementing the metric with the metric2-style signature is problematic (though I don’t see a practical reason why it ever would be) that that could be a strong reason in favor of the metric1 signature. If the caller benefited from the generality of metric2, it would then have to work around the limitation.

There’s no real downside to the more generic function. It’s just a efficient if metric1 and metric2 have the same function body, and even if the latter has a few extra as_ref calls, those are typically exactly as cheap as the implicit coercions that Vec<T> offers. There can be downsides to generic functions in that

  • if there are generic parameters that are not part of any argument type, or possibly even worse neither part of an argument type nor of a return type, then the caller might need to explicitly speficy that generic parameter, or else there’d be a compilation error
  • the way that generic functions are implemented (via “monomorphization”) can in some cases result in a larger binary size; if that trade-off is undesired, less generic functions can be a performance benefit; this is arguably a rather advanced topic though, often involving things like “trait objects”, and sometimes the performance benefit can be achieved without making the function less generic for the caller, by implementing a generic shim around a less generic implementation

“Better” is subjective. More general is certainly possible. E.g. you could start involving IntoIterator trait bounds. At some point it might become useless though; as mentioned above, more generic version are only useful if you have any use[1] for some of the new signatures they support, such as the &[&[T]] example you mention: You had some type you wanted to also support and only this motivates trying to make the function more generic accordingly.


  1. or, in case the function is part of some public external API: if some (potential) user has any use ↩︎

4 Likes

Thanks for the incredibly detailed solution I definitely understood the more generic vs less generic argument. But I am still a bit unclear on the AsRef

I was reading this blog On dealing with owning and borrowing in public interfaces

This is what the author says

Q: AsRef<T> is a &T on steroids in terms of software design. It gives more power to the client as they’ll be able to pass more types in your functions. However, you now introduce a polymorphic API. The semantics are the same: you plan to perform only read-only computations or accept not to use a client-provided move-in value if you need to own something. Keep in mind an important hidden property here: because you accept values to be moved in while you just have a read-only contract on them, you also accept clients values to be dropped by corollary.

I am not understanding what the author means by the bolded part. It seems like a bad thing. For instance if I try to make a metric even more generic I could do

fn metric3<T, U, V>(t: V) -> f32
where
    T: AsRef<str>,
    U: AsRef<[T]>,
    V: AsRef<[U]>
{
    return t.as_ref().len() as f32 / 2.0;
}

So when someone calls metric3 with Vec<Vec<String>> what is actually going on ? Am I accepting a move even though it is read only?

EDIT Can I somehow turn metric3 into something that will accept either a Vec or a slice but still only borrows but does not move?

Yes, they're moving Vec into the function. But they may opt to not doing it - V: AsRef<[U]> implies &V: AsRef<[U]>, so if they need the Vec afterwards, they will just borrow it themselves:

fn main() {
    let v = vec![vec![""]];
    
    // works
    metric3(&v);
    metric3(&v);
    
    // would fail with second line uncommented
    metric3(v);
    // metric3(v);
}
2 Likes

Ah, I see, so it's a matter of implicit vs explicit. I feel like in a public API it might be better to make it explicitly clear that we don't want to move because we are only doing read only operations

The user of metric3 (or anyone else) can coerce a Vec to a slice, so anything that accepts a slice accepts a Vec in that sense. If you mean, without the user typing anything else, no -- if it accepts a Vec in that sense, it's moving.

If you wanted to accept &[_] or &Vec<_> (to any depth of &) but to not accept Vec, you could use

fn metric3<T, U, V>(t: &V) -> f32
//                 new ^
where
    // ...
    V: ?Sized + AsRef<[U]>

This gets rid of the "oops, accidentally gave away my Vec" hazard, though it also introduces an "I don't care if I give away my Vec but now I have to type more to pass &vec anyway" possibility.

2 Likes

Well, as for me, the AsRef bound is explicit enough, since it explicitly allows the user to use the reference and not move the value.

1 Like