String, str, and borrowing on structs

Again, not necessarily. String can be thought of as a pointer to str with some additional properties. In fact, a str is exactly what you get from its Deref implementation.
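To make the Deref point concrete, here is a minimal sketch. The function names `shout` and `deref_demo` are made up for illustration; only the standard library is used.

```rust
// Hypothetical helper: accepts a borrowed string slice.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

// Hypothetical demo: a `&String` works wherever a `&str` is expected.
fn deref_demo() -> (String, bool) {
    let owned: String = String::from("hello");

    // Deref coercion: `&String` is converted to `&str` at the call site.
    let upper = shout(&owned);

    // `starts_with` is defined on `str`, yet it is callable on a `String`
    // because method lookup goes through `Deref<Target = str>`.
    let starts = owned.starts_with("he");

    // Two explicit spellings of the same conversion.
    let a: &str = &owned;
    let b: &str = owned.as_str();
    assert_eq!(a, b);

    (upper, starts)
}
```

The String keeps owning the heap buffer the whole time; the &str values are only views into it.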

I think in my example one other alternative would be to change the struct to be something like this:

pub struct OrderedString<'a> {
    ordered_symbols: Vec<&'a str>,
    index_by_symbol: HashMap<&'a str, usize>,
    owned_symbols: Vec<String>
}

In this variant, ordered_symbols and index_by_symbol hold references. When the strings are owned by something outside the struct, you just point to them. In the "reading from file" case, where no one else owns the strings, owned_symbols can own them inside the struct.
However, this seems a bit messy. I could encapsulate the fields behind functions so that the consumer wouldn't care about this trick under the hood. But that would probably complicate all the other structs that use OrderedString, since whenever they use the symbols I'd have to decide whether to apply the same logic or copy the value.

It is generally not possible to have references in one field borrow from another field of the same struct.

If you keep the owned strings in the symbols Vec and the other elements keep just indices to this Vec, then you don't have that ownership problem, though it can be tricky if you need to delete some strings.

This would be some form of interning, but in your own struct.
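A rough sketch of that index-based idea, using only the standard library. The names `SymbolTable` and `UserRecord` are hypothetical. To keep each string stored exactly once, this sketch assumes the symbols are kept sorted and looks them up with a binary search instead of a HashMap; that is a swapped-in choice, not something from the thread.

```rust
/// Hypothetical interner: owns every distinct string exactly once.
pub struct SymbolTable {
    // Sorted and deduplicated; a string's position is its index.
    symbols: Vec<String>,
}

impl SymbolTable {
    pub fn new(mut symbols: Vec<String>) -> Self {
        symbols.sort();
        symbols.dedup();
        SymbolTable { symbols }
    }

    /// Index of `symbol`, found by binary search over the sorted Vec.
    pub fn get_index(&self, symbol: &str) -> Option<usize> {
        self.symbols
            .binary_search_by(|s| s.as_str().cmp(symbol))
            .ok()
    }

    /// Reverse lookup: borrow the string stored at `index`.
    pub fn get_symbol(&self, index: usize) -> Option<&str> {
        self.symbols.get(index).map(String::as_str)
    }
}

/// Hypothetical record that stores indices instead of strings,
/// so it never borrows from the table.
pub struct UserRecord {
    pub first_name: usize,
    pub last_name: usize,
}

impl UserRecord {
    /// Resolve the indices against the table when the text is needed.
    pub fn full_name(&self, table: &SymbolTable) -> Option<String> {
        let first = table.get_symbol(self.first_name)?;
        let last = table.get_symbol(self.last_name)?;
        Some(format!("{} {}", first, last))
    }
}
```

Since nothing holds a reference into the Vec, there are no lifetimes to thread through. The cost is that indices only stay valid as long as the table is not reordered, which is the deletion caveat mentioned above.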

Question: wouldn't an IndexSet from indexmap work for you?

let set = symbols_iter.collect::<indexmap::IndexSet<_>>();
for (i, symbol) in set.iter().enumerate() {
    // `i` would be the index you store in your HashMap,
    // you don't need the ordered vec anymore since 
    // that set keeps the insertion order for you.
    // And since you wouldn't need the vec anymore, 
    // why is there a need for the index in the first place ?
}

Yeah, IndexSet (or probably better, IndexMap) may solve my problem in terms of efficiency. Given that my input has a guaranteed insertion order, I can use it instead of a BTreeMap, agreed. Thanks for the suggestion!

I still need to figure out how to build an OrderedString that holds references when it comes from another object that owns the strings, but a "new" OrderedString that actually owns the strings when reading from a serialized format (for instance a file).

Serde allows zero copy deserialization (though of course it comes with caveats), example here.

Applied in your case that does something like this.

Thanks. Serde was just an example; in my specific case I'm not using it, but I wondered what techniques it uses to achieve the borrowing, so that it can deserialize into a struct that has only references.
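As far as I understand it, the general technique is that the deserialized struct borrows from the input buffer, so the caller has to keep that buffer alive for as long as the struct is used. A minimal sketch without serde follows; the struct name `BorrowedSymbols` and the one-symbol-per-line format are assumptions made up for illustration.

```rust
/// Hypothetical struct whose fields borrow from an input buffer.
#[derive(Debug, PartialEq)]
pub struct BorrowedSymbols<'a> {
    pub ordered_symbols: Vec<&'a str>,
}

impl<'a> BorrowedSymbols<'a> {
    /// Assumed format: one symbol per line.
    /// Every `&'a str` points into `input`; no string data is copied.
    pub fn parse(input: &'a str) -> BorrowedSymbols<'a> {
        BorrowedSymbols {
            ordered_symbols: input.lines().collect(),
        }
    }
}
```

The caller reads the whole file into a String first and then calls `BorrowedSymbols::parse(&buffer)`. The buffer is the owner, which is why this only works when the input is already in memory as one contiguous piece, not when reading incrementally from a stream.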

OK, I did a little skeleton coding to show where my problem is. I'm pretty sure my problem is all about the design around ownership. Here's the snippet, with some questions in the code itself (if this makes the problem clearer I will update the original question).

use std::collections::{BTreeMap, BTreeSet, HashMap};
use std::io::Write;
use std::io::Read;

pub struct User {
    first_name: String,
    last_name: String,
    interests: BTreeMap<String, String>,
}

impl User {
    pub fn create(
        first_name: String,
        last_name: String,
        interests: BTreeMap<String, String>,
    ) -> User {
        User {
            first_name,
            last_name,
            interests,
        }
    }
}

// Question 1: should I really use the lifetimes here? All I wanted to do is simply build a TreeSet..
// Since from a client point of view User::create is the main entrypoint,
// I want to extract all the strings seen into an ordered set
pub fn extract_symbols<'a>(users: &Vec<&'a User>) -> BTreeSet<&'a str> {
    let mut symbols = BTreeSet::new();
    users.iter().for_each(|user| {
        symbols.insert(user.first_name.as_str());
        symbols.insert(user.last_name.as_str());

        for key in user.interests.keys() {
            symbols.insert(key.as_str());
        }
        for value in user.interests.values() {
            symbols.insert(value.as_str());
        }
    });

    symbols
}

//Since from the pure Struct perspective I start with a User,
// here makes sense to not copy the string but have a reference to them
pub struct OrderedStrings<'a> {
    ordered_symbols: Vec<&'a str>,
    index_by_symbol: HashMap<&'a str, usize>,
}

impl<'a> OrderedStrings<'a> {
    pub fn new(symbols: BTreeSet<&'a str>) -> OrderedStrings {
        let mut index_by_symbol: HashMap<&'a str, usize> = HashMap::new();
        let mut ordered_symbols: Vec<&'a str> = Vec::with_capacity(symbols.len());

        for (i, item) in symbols.into_iter().enumerate() {
            index_by_symbol.insert(item, i);
            ordered_symbols.push(item);
        }

        OrderedStrings {
            ordered_symbols,
            index_by_symbol,
        }
    }

    pub fn get_index(&self, symbol: &str) -> Option<&usize> {
        self.index_by_symbol.get(symbol)
    }

    pub fn get_symbol(&self, index: usize) -> Option<&str> {
        // Question 2: should I really do this and dereference the string?
        self.ordered_symbols.get(index).map(|s| *s)
    }

    pub fn get_size(&self) -> usize {
        self.ordered_symbols.len()
    }
}

impl<'a> OrderedStrings<'a> {
    pub fn serialize<W>(&self, to: &mut W)
    where
        W: Write,
    {
        //snip
        // writing all bytes into a file (or stream) is fine since I have the reference to the strings! all good here
    }

    pub fn deserialize<R>(from: &'a mut R) -> OrderedStrings<'a>
    where
        R: Read,
    {
        //Question 3: here I can't really imagine how I can build the OrderedStrings since someone needs to own the strings :(
    }
}

impl User {
    pub fn serialize<W>(from: &Vec<&User>, ordered_strings: &OrderedStrings, to: &mut W)
    where
        W: Write,
    {
        // When I serialize on the file system, I will take the string index
        // using OrderedString.get_index(). This is why the ordered_string struct is needed here
        // This is all good though, the OrderedString given is borrowing the values, so I can read and do all my stuff!
    }

    pub fn deserialize<R>(from: &mut R, ordered_strings: &OrderedStrings) -> Vec<User>
    where
        R: Read,
    {
        //Question 4. Here I am in trouble. first of all, I need an OrderedString since on the file there is only the index
        // so I need to do a reverse lookup here: OrderedString.get_symbol(i)
        // This has the string borrowed from somebody else
        // and it is not in the User yet because I'm deserializing it from the file system.
        // Also, supposing that the strings backing OrderedStrings are stored in another vector, I can build the User
        // but then I'm forced to copy the strings. As a result I would keep the strings in memory twice.
    }
}

I would change the struct to this:

pub struct OrderedStrings {
    ordered_symbols: Vec<String>,
    index_by_symbol: HashMap<String, usize>,
}

As for the extract_symbols, it's a bit unclear how it is used. I would match the types to where the input comes from and where the output is used.

Thanks Alice, but if you change OrderedStrings to own the strings, then I need to copy them from User to OrderedStrings. If there are millions of those strings, I thought using a pointer would be more efficient?

Pretty much, a client creates a list of Users and then generates an OrderedStrings as follows:

let users = //vec of users refs
let unique_symbols = extract_symbols(&users);
let ordered_strings = OrderedStrings::new(unique_symbols);
ordered_strings.serialize(&mut buffer);

Well it sounds like your deserialize method only works if OrderedStrings owns them.
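For illustration, a sketch of what an owning deserialize could look like. The name `OwnedOrderedStrings` and the file format (one distinct symbol per line, already in order) are assumptions made up here, not something from the original code.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Hypothetical owning variant of OrderedStrings.
pub struct OwnedOrderedStrings {
    ordered_symbols: Vec<String>,
    index_by_symbol: HashMap<String, usize>,
}

impl OwnedOrderedStrings {
    /// Assumed format: one distinct symbol per line, already in order.
    /// The struct owns every string, so it can outlive the reader.
    pub fn deserialize<R: BufRead>(from: R) -> io::Result<Self> {
        let mut ordered_symbols = Vec::new();
        let mut index_by_symbol = HashMap::new();
        for (i, line) in from.lines().enumerate() {
            let symbol = line?;
            // One copy for the map key, one for the ordered list.
            index_by_symbol.insert(symbol.clone(), i);
            ordered_symbols.push(symbol);
        }
        Ok(OwnedOrderedStrings {
            ordered_symbols,
            index_by_symbol,
        })
    }

    pub fn get_index(&self, symbol: &str) -> Option<usize> {
        self.index_by_symbol.get(symbol).copied()
    }

    pub fn get_symbol(&self, index: usize) -> Option<&str> {
        self.ordered_symbols.get(index).map(String::as_str)
    }
}
```

Note that each symbol is stored twice here (once in the Vec, once as a map key), which is the duplication that shared ownership would avoid.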

Maybe you want to use Rc<str> which can be cheaply cloned?

So you mean changing OrderedStrings to something like this?

pub struct OrderedStrings {
    ordered_symbols: Vec<Rc<String>>,
    index_by_symbol: HashMap<Rc<String>, usize>,
}

impl OrderedStrings {
    pub fn new(symbols: BTreeSet<Rc<String>>) -> OrderedStrings {
        let mut index_by_symbol: HashMap<Rc<String>, usize> = HashMap::new();
        let mut ordered_symbols: Vec<Rc<String>> = Vec::with_capacity(symbols.len());

        for (i, item) in symbols.into_iter().enumerate() {
            // Clone the Rc for the map key, then move the original into the Vec.
            index_by_symbol.insert(item.clone(), i);
            ordered_symbols.push(item);
        }

        OrderedStrings {
            ordered_symbols,
            index_by_symbol,
        }
    }

    pub fn get_index(&self, symbol: &String) -> Option<&usize> {
        self.index_by_symbol.get(symbol)
    }

    pub fn get_symbol(&self, index: usize) -> Option<Rc<String>> {
        // Question 2: should I really do this and dereference the string?
        // (An Rc cannot be moved out of a shared reference, so clone it instead.)
        self.ordered_symbols.get(index).cloned()
    }

    pub fn get_size(&self) -> usize {
        self.ordered_symbols.len()
    }
}

I can experiment and see whether this works. One thing I noticed is that this approach puts &String and Rc in function parameters; maybe that's OK though.

Rc<str> would be better than Rc<String>, since it's strictly more flexible (the same way &str is preferable to &String).

Thanks for the suggestion; I thought I couldn't use <str> as a type.
Also, in this case would the new method look like this:

pub fn new(symbols: BTreeSet<&Rc<str>>) -> OrderedStrings {

instead of like this

pub fn new(symbols: BTreeSet<&Rc<&str>>) -> OrderedStrings {

Is that correct?

But then to call the new method (for instance in my tests) I can't do this, since it produces an &Rc<&str>:

let mut symbols = BTreeSet::new();
symbols.insert(&Rc::new("world"));
symbols.insert(&Rc::new("hello"));

When using Rc, you should not also use ampersands. The Rc type is itself a kind of reference. This is also why you can put a str inside it, as str must be behind some kind of reference.

ok, so does that mean that my new method should have this signature?

pub fn new(symbols: BTreeSet<Rc<str>>) -> OrderedStrings {

and if so how can I create such a BTreeSet?

You create it like any other BTreeSet, except that to convert a string into an Rc<str>, you write Rc::from(the_string). Additionally, feel free to clone the Rc<str> wherever you need it; that's the entire point. Cloning an Rc<str> is very cheap (it only increments a reference count) and gives you a new handle to the same shared string.
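A small sketch of both points, using only the standard library. The helper names `build_symbols` and `share_demo` are made up for illustration.

```rust
use std::collections::BTreeSet;
use std::rc::Rc;

/// Hypothetical helper: builds a sorted set of shared strings.
pub fn build_symbols(words: &[&str]) -> BTreeSet<Rc<str>> {
    // `Rc::from` copies the text once into a reference-counted allocation.
    words.iter().map(|w| Rc::<str>::from(*w)).collect()
}

/// Hypothetical demo: cloning shares the allocation instead of copying it.
pub fn share_demo() -> (usize, bool) {
    // `Rc::from` also accepts an owned String.
    let original: Rc<str> = Rc::from(String::from("hello"));
    let copy = Rc::clone(&original);

    // Two handles, one allocation.
    (Rc::strong_count(&original), Rc::ptr_eq(&original, &copy))
}
```

Rc<str> orders and compares by the text it holds, so the BTreeSet behaves as it would with plain strings.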

Thanks, I missed ::from; I was using ::new instead. So the trick here is that when accepting the Rc, I can always take ownership of the actual Rc and then clone it whenever I need new references. Am I correct?

Yes, always take ownership of the Rc, and clone it everywhere you need to pass out shared ownership to the string.
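Putting this advice together, OrderedStrings could look roughly like the sketch below. It assumes the Rc<str> variant discussed above; the `&str` parameter on get_index and the `Option<usize>` return type are choices made here for illustration.

```rust
use std::collections::{BTreeSet, HashMap};
use std::rc::Rc;

pub struct OrderedStrings {
    ordered_symbols: Vec<Rc<str>>,
    index_by_symbol: HashMap<Rc<str>, usize>,
}

impl OrderedStrings {
    /// Takes ownership of the set; each string ends up with two handles.
    pub fn new(symbols: BTreeSet<Rc<str>>) -> OrderedStrings {
        let mut index_by_symbol = HashMap::with_capacity(symbols.len());
        let mut ordered_symbols = Vec::with_capacity(symbols.len());

        for (i, item) in symbols.into_iter().enumerate() {
            // Clone the handle for the map, move the original into the Vec.
            index_by_symbol.insert(Rc::clone(&item), i);
            ordered_symbols.push(item);
        }

        OrderedStrings {
            ordered_symbols,
            index_by_symbol,
        }
    }

    /// `Rc<str>` implements `Borrow<str>`, so a plain `&str` works as the key.
    pub fn get_index(&self, symbol: &str) -> Option<usize> {
        self.index_by_symbol.get(symbol).copied()
    }

    /// Hands out a new handle to the shared string.
    pub fn get_symbol(&self, index: usize) -> Option<Rc<str>> {
        self.ordered_symbols.get(index).cloned()
    }
}
```

A User built during deserialization can keep the Rc<str> returned by get_symbol in its fields, so the text exists in memory only once. Rc is not thread-safe; Arc<str> would be the equivalent if the data has to cross threads.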
