Returning references from a thread

Hi,

I'm having difficulties solving this puzzle:

I want to read a big text file in a thread, and return some data from it as Vec<&str>.
The borrow checker complains about it (rightly, I guess), but I couldn't find a way to do that.

I could use crossbeam::scope but I'd like to avoid it for now.

I can't think of any solution other than returning a Vec<String> which I want to avoid.

Appreciate any help.

Alex.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=f0b28946b4f36f1bb40ed975c1da0b28

Vec<&str> doesn't contain any string data. It's only full of pointers to other data that stayed on the thread and got destroyed with the thread.

You can return Vec<String> or slightly more compact Vec<Box<str>> (Box<str> is like &str, but instead of borrowing someone else's data, it carries it itself).

Unfortunately, you can't return (String, Vec<&str>), because the borrow checker can't guarantee the safety of self-referential structs.

You can return (String, Vec<(usize, usize)>) with a vec that contains start and length of each string fragment.
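
A minimal sketch of that last option, assuming the worker thread is handed (or reads) the text itself. The names `load_words` and `word_at` are made up for illustration, and the text is passed in as a parameter instead of being read from a real file:

```rust
use std::thread;

/// Runs on a worker thread: takes ownership of the text and returns it
/// together with the (start, length) byte offsets of each word.
/// `load_words` is a hypothetical stand-in for the file-reading thread.
fn load_words(text: String) -> (String, Vec<(usize, usize)>) {
    thread::spawn(move || {
        let base = text.as_ptr() as usize;
        // Each word is a sub-slice of `text`, so its offset is the
        // distance between its pointer and the start of `text`.
        let spans: Vec<(usize, usize)> = text
            .split_whitespace()
            .map(|word| (word.as_ptr() as usize - base, word.len()))
            .collect();
        // Only owned data and plain integers cross the thread boundary.
        (text, spans)
    })
    .join()
    .expect("worker thread panicked")
}

/// Rebuilds a `&str` view from a (start, length) pair.
fn word_at(text: &str, (start, len): (usize, usize)) -> &str {
    &text[start..start + len]
}
```

The caller keeps the String alive and calls `word_at(&text, spans[i])` whenever it needs a fragment, so nothing is copied per word.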

Disclaimer: **This is almost certainly a bad idea for real programs. I'm asking out of pedagogical curiosity.**

Would a type like this be sound for the general case of carrying around both an owned type and a reference to inside it?

use std::ops::Deref;
use std::sync::Arc;

pub struct ArcRef<Own,Ref>(Arc<Own>, *const Ref);

impl<Own,Ref> ArcRef<Own,Ref> {
    pub unsafe fn new(own:Arc<Own>, getref:impl FnOnce(&Own)->&Ref)->Self {
        // ptr must point to memory inside own
        // unsafe because we can’t ensure that statically
        let ptr = getref(own.deref()) as *const Ref;
        ArcRef(own, ptr)
    }
}

impl<Own,Ref> Deref for ArcRef<Own,Ref> {
    type Target=Ref;
    fn deref(&self)->&Ref { unsafe { &*(self.1) } }
}

Yes, with a bit of care and unsafe you can make self-referential structs. The borrow checker is unable to guarantee that the owning field isn't moved/modified while the borrowing field still uses it, but you can ensure that with your own getters/setters.

There are some crates for this, such as rental and owning_ref.

Luckily, in my case the data is not extremely big, so I ended up returning Vec<String>. But if the borrow checker could "see" that the owned value is not moving (so that references are always valid), that would be a decent quality-of-life improvement.

I'll give rental and owning_ref a try next time!

Thank you guys.

That can be done by scoping, with typed function signatures involving the lifetimes of the scopes, so that these properties can be checked at compile time. That is: ::crossbeam::scope

If you don't want to use the tool that gives you compile-time properties, you will need to pay a runtime cost, such as duplicating the strings into owned variants.
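
For illustration, here is a sketch of the scoped approach. Note that it uses std::thread::scope (in the standard library since Rust 1.63) in place of ::crossbeam::scope, and that it assumes the text is owned by the caller, outside the worker thread. `split_in_scope` is a made-up name:

```rust
use std::thread;

/// Splits `text` on a scoped worker thread and returns borrowed words.
/// The scope guarantees the worker is joined before `text` can go away,
/// which is what lets the thread hand back `&str` borrows.
/// `split_in_scope` is a hypothetical name for this sketch.
fn split_in_scope(text: &str) -> Vec<&str> {
    thread::scope(|scope| {
        let worker = scope.spawn(move || {
            text.split_whitespace().collect::<Vec<&str>>()
        });
        worker.join().expect("worker thread panicked")
    })
}
```

The catch for the original question is that the String has to live in the parent: the worker can borrow from it and return views, but it cannot be the one that owns the data.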

Note that there is another alternative to what @kornel suggested: when creating the initial String, you could convert the format!-ed String or the borrowed &str into an Arc<str> with .into() (for the latter case the cost is equivalent).

  • The main drawback of Arc<str> vs. String is that you cannot mutate the string anymore. But in practice, 99% of the usages of String are just to have 'static (owned) strs.

And then, instead of yielding &str scoped / short-lived borrowing references, you can yield Arc<str> owning references by doing Arc::clone() (although you can't subslice directly, for that we'd be back to ::owning_ref).
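
A small sketch of what that buys you, assuming the whole string (not a sub-slice) is what gets shared. `shared_len` is a hypothetical helper:

```rust
use std::sync::Arc;
use std::thread;

/// Hands the same string to a worker thread without copying its bytes.
/// `shared_len` is a hypothetical helper for this sketch.
fn shared_len(text: &Arc<str>) -> usize {
    // Arc::clone only bumps the reference count; the string data is shared.
    let for_worker = Arc::clone(text);
    thread::spawn(move || for_worker.len())
        .join()
        .expect("worker thread panicked")
}
```

Both handles point at the same allocation (Arc::ptr_eq on a clone returns true), and the data is freed when the last handle is dropped.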


This suggests another alternative, Box::leak, which will deposit the string data in your address space until the program terminates. You can pass around as many internal &'static str references as you want, at the expense of never being able to deallocate the file data.
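
A minimal sketch of the leaking approach, assuming it is acceptable for the text to stay allocated for the rest of the program. `leak_words` is a made-up name, and the text is passed in as a parameter instead of being read from a file:

```rust
use std::thread;

/// Leaks the text on a worker thread and returns `'static` views into it.
/// The leaked memory is never freed.
/// `leak_words` is a hypothetical name for this sketch.
fn leak_words(text: String) -> Vec<&'static str> {
    thread::spawn(move || {
        // Box::leak gives up ownership and yields a reference that is
        // valid for the remainder of the program.
        let leaked: &'static str = Box::leak(text.into_boxed_str());
        leaked.split_whitespace().collect::<Vec<&'static str>>()
    })
    .join()
    .expect("worker thread panicked")
}
```

Since &'static str is Send and has no lifetime tied to the thread, the Vec can be returned from join() directly.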

That's an interesting approach and it works:

    let s:Arc<str> = "foo bar".into();
    let mut v = Vec::<Arc<str>>::new();
    for sub in s.split(" ") {
        v.push(sub.into())
    }
    drop(s);

But is there any advantage over an owning String?

You mean aside from the fact that you can reference it in two threads without worrying about lifetimes?

If making an Arc from a &str copies the source, how is it better than just returning an owning String? In my original question I only care about returning the data from a thread.
So returning Vec<Arc<str>> has no advantage over Vec<String> in the context of just returning a Vec from the thread.

In your case, it doesn't seem to help if you are just doing a move at the end of a thread. The Arc gets you cheap clones + thread safety. Using Box<str>, though, would save you 8 bytes per item in your Vec (on a 64-bit target), since Box<str> doesn't track capacity like String does.
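
A quick way to check the size difference on your own target. The layout of String is not formally guaranteed, so treat the one-word saving as what current toolchains do in practice; `saving_per_item` and `compact` are hypothetical helpers:

```rust
use std::mem::size_of;

/// Bytes saved per element by storing `Box<str>` instead of `String`.
/// In practice `String` is pointer + length + capacity, while `Box<str>`
/// is only pointer + length.
fn saving_per_item() -> usize {
    size_of::<String>() - size_of::<Box<str>>()
}

/// Converts owned strings to the more compact representation.
/// `into_boxed_str` drops any excess capacity, which may reallocate.
fn compact(words: Vec<String>) -> Vec<Box<str>> {
    words.into_iter().map(String::into_boxed_str).collect()
}
```

On a 64-bit target `saving_per_item()` comes out as one usize, i.e. the 8 bytes mentioned above.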


In this example you are sub-slicing, which indeed does not play to Arc's natural advantage.

To detail that owning_ref idea: we'll need pairs of indices rather than &str references, and since there does not seem to be an iterator for that easily available, I have hand-rolled my own .split(" ") iterator that yields pairs of indices instead:

use std::sync::Arc;
use owning_ref::ArcRef;

fn yield_strings (s: impl Into<Arc<str>>)
  -> Vec<ArcRef<str>>
{
    let s: Arc<str> = s.into(); // ownership of the string is _local_ here
    let s: ArcRef<str> = s.into();
    let split_space_idxs =
        s   .char_indices()
            .filter(|&(_, c)| c == ' ').map(|(i, _)| i)
            .chain(Some(s.len()))
            .map({
                let mut prev = 0;
                move |i| (
                    ::core::mem::replace(&mut prev, i + 1),
                    i,
                )
            })
    ;    
    let mut ret = vec![];
    let mut yield_ = |item| ret.push(item);
    for (start, end) in split_space_idxs {
        if end == start { continue; } // Optional: skip empty strings
        let s = s.clone(); // inc the refcount to give ownership to the vec (elements)
        let sub = s.map(|it| &it[start .. end]);
        yield_(sub);
    }
    ret // even though the local `s` is dropped here, the vec has ownership
}

fn main ()
{
    let strs = yield_strings("foo bar");
    assert_eq!(dbg!(&*strs[0]), "foo");
    assert_eq!(dbg!(&*strs[1]), "bar");
    assert!(strs.get(2).is_none());
}

That being said, the above example will lead to up to n + 1 owners, where n is the number of words in the string. That means up to n + 1 increments and decrements of an atomic counter, which can have a non-negligible performance impact. (Technically, using Rc instead of Arc would make sense here, by wrapping the owned RcRef in an Unshare*-kind of wrapper...)

* Unshare
pub use lib::Unshared;
mod lib {
    pub
    struct Unshared<T> /* = */ (
        T,
    );

    impl<T> From<T> for Unshared<T> { ... }
    impl<T> Into<T> for Unshared<T> { ... }
    impl<T> Unshared<T> {
        pub
        fn get_mut (self: &'_ mut Self)
          -> &'_ mut T
        {
              &mut self.0
        }
    }

    unsafe // Safety: no `&Unshared` API whatsoever
        impl<T> Sync for Unshared<T>
        {}
}

In which case, now that we have achieved transforming the iteration into one that yields pairs of indices rather than references, solving the initial problem at hand becomes quite trivial, if only just a tad unergonomic:

- use std::sync::Arc;
- use owning_ref::ArcRef;
- 
- fn yield_strings (s: impl Into<Arc<str>>)
-   -> Vec<ArcRef<str>>
+ fn yield_strings (s: impl Into<String>)
+   -> (String, Vec<(usize, usize)>)
  {
-     let s: Arc<str> = s.into();
-     let s: ArcRef<str> = s.into();
+     // Reference-counting is not that needed anymore.
+     let s: String = s.into();
      let split_space_idxs = ... ;
      ...
      for (start, end) in split_space_idxs {
          if end == start { continue; } // Optional: skip empty strings
-         let s = s.clone();
-         let sub = s.map(|it| &it[start .. end]);
-         yield_(sub);
+         yield_((start, end));
      }
-     ret
+     (s, ret)
}

Usage:

fn main ()
{
    let (s, idxs) = yield_strings("foo bar");
    let get_str = |i| {
        let (start, end) = idxs[i];
        &s[start .. end]
    };
    assert_eq!(dbg!(get_str(0)), "foo");
    assert_eq!(dbg!(get_str(1)), "bar");
}

Of course, now the issue is that the usage is a bit ugly and, worse, error-prone! (What if they mutate s before indexing?)

The trick then is to inline that op directly in the called function / returned value:

fn main ()
{
    let strs = yield_strings("foo bar");
    assert_eq!(strs.len(), 2);
    assert_eq!(dbg!(&strs[0]), "foo");
    assert_eq!(dbg!(&strs[1]), "bar");
}
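
The post does not show the type behind that last main, so here is one way it could look. This is only a sketch: the `Strs` name and its fields are my own assumptions, and this `yield_strings` is a std-only rewrite of the index-based version above, not the original code:

```rust
use std::ops::Index;

/// Owns the text plus the (start, end) byte range of each word.
/// The fields are private, so callers cannot mutate the text and
/// invalidate the stored indices. `Strs` is a hypothetical name.
pub struct Strs {
    text: String,
    idxs: Vec<(usize, usize)>,
}

impl Strs {
    /// Number of words found in the text.
    pub fn len(&self) -> usize {
        self.idxs.len()
    }
}

impl Index<usize> for Strs {
    type Output = str;

    /// Turns the stored index pair back into a `&str` on demand.
    fn index(&self, i: usize) -> &str {
        let (start, end) = self.idxs[i];
        &self.text[start..end]
    }
}

pub fn yield_strings(s: impl Into<String>) -> Strs {
    let text: String = s.into();
    let mut idxs = Vec::new();
    let mut start = 0;
    // Every space is a boundary, and so is the end of the string.
    let boundaries = text
        .match_indices(' ')
        .map(|(i, _)| i)
        .chain(Some(text.len()));
    for end in boundaries {
        if end > start {
            idxs.push((start, end)); // skip empty fragments
        }
        start = end + 1;
    }
    Strs { text, idxs }
}
```

Because `Strs` contains only a String and plain integers, it is Send and can be returned from a thread, and `&strs[0]` gives a `&str` exactly as in the main above.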

Wow, there is so much to learn from this. Thanks!

Another approach, which I don't think has been mentioned here, is to transfer ownership of the original String up the call stack until it lives long enough for your needs, then return &str views into the string.

So if you have something like the following (which doesn't work, as you've seen):

fn main() {
    let corpus: Vec<&str> = fetch_words();
    for word in corpus {
        do_a_thing(word);
    }
}
fn fetch_words<'a>() -> Vec<&'a str> {
    // Some implementation
}

You could instead do this:

fn main() {
    let s: String = fetch_string();
    // &str below is bound to the lifetime of s.
    let corpus: Vec<&str> = split_words(&s); 
    for word in corpus {
        do_a_thing(word);
    }
    // s is dropped here, when main returns
}

This is sort of the simplest, most foundational building block of working with lifetimes in Rust: make sure your owned data lives long enough for the borrows you want to take, and pass ownership up the call stack to extend the life of a value. Rc and unsafe give you extra tools for doing that in special situations, but there's a lot of power in just structuring your operations cleanly.
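
Applied to the original question, that could look roughly like this. The bodies are my own placeholders (the worker just builds a String instead of reading a file), and only the names `fetch_string` and `split_words` come from the snippet above:

```rust
use std::thread;

/// Stand-in for reading the file on a worker thread. Only the owned
/// `String` crosses the thread boundary; no references do.
fn fetch_string() -> String {
    thread::spawn(|| String::from("foo bar baz"))
        .join()
        .expect("worker thread panicked")
}

/// Borrows from `s`, so the result cannot outlive the caller's `String`.
fn split_words(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}
```

The splitting then happens in the caller, after `let s = fetch_string();`, so the `Vec<&str>` borrows from a value that is guaranteed to outlive it.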
