String, str, and borrowing on strucs

Hi all, as I'm playing with Rust, I'm facing some of the typical (I hope) problems with borrowing especially when handling strings.

Here's my problem: say I want to have a struct of this shape

pub struct OrderedStrings {
    symbols: Vec<String>,
    index_by_symbol: HashMap<String, usize>,
}

That given a high number of strings it stores it in order in the vector, and it also creates an hashmap to get from a string, what's its order.

Now what I'm bugged with is ownership.. since the vector and hashmap contain the exact same string, I'd like to store them only once in memory.

Do you have any suggestion on how to do it?

One additional point is, those strings are immutable so I guess they shoud actually be str instead of String?

I've tried some for of implementation that (spoiler alert) doesn't work. What I'm struggling with is I go to str then to &str then to String but I'm not quite getting it what is best way to go.. Here's the implementation where I believe I should end with (with probably some adjustments):

pub struct OrderedStrings {
    ordered_symbols: Vec<str>,
    index_by_symbol: HashMap<str, usize>,
}

impl OrderedStrings {
    pub fn new(symbols: BTreeSet<&str>) -> OrderedStrings {
        let mut index_by_symbol: HashMap<&str, usize> = HashMap::new();
        let mut ordered_symbols = Vec::with_capacity(symbols.len());

        for (i, &item) in symbols.iter().enumerate() {
            index_by_symbol.insert(item, i);
            ordered_symbols.push(item);
        }

        OrderedStrings {
            ordered_symbols,
            index_by_symbol,
        }
    }
}

there are few problems with this implementation:

  1. index_by_symbol in the struct should probably be &str to get a reference, but then I get a suggestion to use lifetimes..
  2. in the new method I accept an ordered set of strings (where I don't take ownership) that makes things complex when I want to use plain str

This combination of requirements hints that you probably want to use Rc<str> (or Arc<str>, if you need Send/Sync) - a smart pointer for shared immutable ownership.

5 Likes

Why do you need the extra Vec<String>? You can iterate over the keys of the HashMap just as well.

If you have to add Strings at runtime, and maybe have different uses of OrderedStrings, you might want to go with a StringStore(Vec<String>), and then use Vec<&'a str> and Hashmap<&'a str, usize> pointing into it.

1 Like

You probably could learn a lot by using a reference counted smart pointer. You could also use a map which keeps the insertion order for you like indexmap (or study its code) @bluss . playground with ref counting

3 Likes

thanks for the suggestions

I know I can but that vector is ordered.. so I need to access the strings by index, and the index by strings. You can see that as if I have 2 maps:

index_to_string HashMap<usize, str>,
string_to_index: HashMap<str, usize>,

is your suggestion to go with something like this?

use std::collections::{BTreeSet, HashMap};

pub struct OrderedStrings<'a> {
    ordered_symbols: Vec<&'a str>,
    index_by_symbol: HashMap<&'a str, usize>,
}

impl OrderedStrings<'_> {
    pub fn new(symbols: BTreeSet<&str>) -> OrderedStrings {
        let mut index_by_symbol: HashMap<&str, usize> = HashMap::new();
        let mut ordered_symbols = Vec::with_capacity(symbols.len());

        for (i, &item) in symbols.iter().enumerate() {
            index_by_symbol.insert(item, i);
            ordered_symbols.push(item);
        }

        OrderedStrings {
            ordered_symbols,
            index_by_symbol,
        }
    }
}

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn it_can_be_initialized() {
        let mut symbols_tree = BTreeSet::new();
        symbols_tree.insert("bar");
        symbols_tree.insert("foo");

        let ordered_strings = OrderedStrings::new(symbols_tree);

        assert_eq!(2, ordered_strings.ordered_symbols.len());
        assert_eq!(2, ordered_strings.index_by_symbol.len());
    }
}

Would "foo" and "bar" be in memory as long as I keep a reference to the OrderedSymbol struct? Is this a good practice or am I getting this completely wrong?

I was trying to make a variant for which I can actually take the ownership of the string in the struct as in here

pub struct OrderedStrings2<'a> {
    ordered_symbols: Vec<str>,
    index_by_symbol: HashMap<&'a str, usize>,
}

impl OrderedStrings2 {
    pub fn new(symbols: BTreeSet<&str>) -> OrderedStrings2 {
        let mut index_by_symbol: HashMap<&str, usize> = HashMap::new();
        let mut ordered_symbols: Vec<str> = Vec::with_capacity(symbols.len());

        for (i, item) in symbols.into_iter().enumerate() {
            // can't do the following 2 lines
            // ordered_symbols.push(item);
            // index_by_symbol.insert(str::from(item), i);
        }

        OrderedStrings2 {
            ordered_symbols,
            index_by_symbol,
        }
    }
}

but then I have troubles of getting ownership of the str that is in the TreeSet.. :frowning:

Ok, so let's assume you need that Vec :slight_smile: Now, where's that BTreeSet from? Whey are there &str in it? Are they &'static str or are they borrowing from somewhere else? That should very much inform the further choices.

Also, forget about using str directly, you can't have a Vec<str>. Here is a version that works, but it's not clear if that's the right thing without knowing how you intend to use it.

(e) Ah, I realize my version is basically what you had. Taking ownership otoh means knowing who owns the strings in the first place, which is indeed the question "where's that BTreeSet from".

Thanks! yeah that's probably the main problem I agree: knowing when those strings come from.
So what I'm trying to do is to have a more structured object that has other info, one of these being those strings.
So for instance let's say we have some Posts (that are ordered by creation date). Such posts have tags.

From the outside world, you read a database and you create a list of posts, but then I want to export a view of them on file system.

So pretty much what I want to do is, I create a list of those posts, and then I want to create a file with a specific format. This is why I can have the list of all symbols (tags) seen across the different posts.
Now I guess I have 2 choices:

  1. (feels more natural) the posts own the strings and the OrderedStrings take a reference to them to create a projected "view" of a section of my posts
  2. I bisect my posts into multiple structs (one being OrderedStrings) and I give the ownership to each one of those (so OrderedStrings owns the strings)

Option 1 has the advantage that is clear that the main struct owns all the data, and uses the other structs to just create a projected view. However to do that I have to use lifetimes (maybe I'm too newbie and that's what you end up doing most of the cases)

Option 2 instead is nice in a way because I can destruct my main object into their own view-struct. That means I can only get back a reference to the contained data (strings for instance) but at the same time if I don't need a sub-struct anymore that will be freed from the memory..

Maybe option 2 makes more sense?

Shared ownership is expressed via Arc type. In your case, it's Arc<String> or a non-growable slimmer version of Arc<str>. This is like &str, but the reference is non-temporary thanks to Arc.

You can also look for string interning crates that store one copy of a string either globally or per some memory pool. These usually help if you have tons of small repeated strings, like tags or identifiers.

4 Likes

In the scenario you described I think Option 1 makes much more sense. The shape of the code will be like

let posts = get_stuff_from_db();

let ordered_strings = Odered_Strings::from_posts(&posts); // borrowing from the above

... //do stuff with ordered strings

// ordered_strings gets dropped first, removing all references
// posts gets dropped last, no references remaining

which is a pretty clear ownership story and easy to handle, even if you need lifetime annotations sometimes.

If you need to mutate the strings, or need to use API's that take ownership, that will get more complicated, and option 2 might make start to make more sense.

1 Like

alright! thanks for your help! will try it out!

e: actually last question I have on this approach.. if I have to make a getter for the vector, how can I then avoid to have a reference to a reference, i.e.

pub fn get_symbol(&self, index: usize) -> Option<&str> {
    self.ordered_symbols.get(index) // this doesn't work as it return Option<&&string>
}

iterating on this I believe there is an issue though. in

//do stuff with ordered string

I could actually save those strings in a format on the file system to get read later (imagine a sort of cache based on file system files).
If use references, then when reading from the file and building again the struct, the deserialization would own the symbol and when it creates the struct and return it, no one would have the ownership of those strings..or am I missing something? Imagine this:

impl<'a> OrderedString<'a> {
    pub fn new(symbols: BTreeSet<&'a str>) -> OrderedStrings {
       //snippet
    }

    pub fn create_from_buff<R>(&self, from: &mut R) -> OrderedString
        where
            R: Read,
    {
        
       let mut ordered_symbols: Vec<&str> = Vec::new();
       let mut index_by_symbol: HashMap<&str, usize> = Vec::new();
       //read from file...

      // boom! I can't do this anymore
        SymbolTable {
            ordered_symbols,
            index_by_symbol,
        }
    }
}

Sure, but the idea is the same idea as above. We used

let posts = get_stuff_from_db();
let ordered_strings = Odered_Strings::from_posts(&posts);

to have posts own our stuff. When restoring from the cache, do

let stuff_from_file = read_file(path);
let ordered_strings = Odered_Strings::from_filestruct(&stuff_from_file);

and keep around stuff_from_file to own the strings.

yeah, I think the problem here is that to make up the struct, I need to read bytes from the file..that's why the create_from_buff function.
So pretty much in my code above the read from file is reading the bytes to get the symbols.
Now I would expect that a client using my function what it does is just making sure to give you a good buffer, and in returns all it asks is a OrderedString structure with stuff inside..
That's why I don't fully understand how I can bring out the string ownership..

Sorry about the dumb question but this ownership thing is driving me nuts :frowning:

No worries :slight_smile: But: I don't understand your question. Can you phrase it differently? I'll try anyways:

Yeah, and the idea is to not read inside this function, but rather read the file in a higher scope. In that higher scope, you construct your string-owning struct (maybe just a Vec<String>) while reading, and then you pass a reference to that into your OrderedStrings::from_* functions (I realized now that I forgot to use & appropriatly, maybe that confused you, I'll fix it above).

Yeah I guess if I extract the read from file function it can work. My problem here is probably around the encapsulation..
What I want to do is that OrderedStrings can be created from the outside world only with a treeset (to enforce the strings are in order) but then inside the impl I'd like to have a sort of serialize/deserialize function to write and read the struct in a file.

So pretty much when using the struct impl you can:

  • create from an ordered set
  • flush on a buffer self
  • read from a buffer to recreate the struct

that's the reason why I wanted to have the read inside.. I've noticed that rust was bringing me to push stuff out, my only fear is that often it's just moving the problem one step away but it will still be there, so that's why I wonder whether I'm missing something (I'm sure I am)

Yeah in Rust you want to push ownership of things high up in the scope tree. If you don't want to do that explicitely, you might want to think about a top-level wrapper type that users interact with that can own things. As an entry-point, the user would construct the top-level type with the necessary data (database connection, paths..), and then create_from_buff becomes a method on the type taking &mut self and reads from the reader into the wrapper type, from which OrderedStrings can borrow.

1 Like

I was giving another try at this, and I was thinking a little bit about serde. In the doc, you can do soemething like:


fn main() {
    let point = Point { x: 1, y: 2 };

    let serialized = serde_json::to_string(&point).unwrap();
    println!("serialized = {}", serialized);

    let deserialized: Point = serde_json::from_str(&serialized).unwrap();
    println!("deserialized = {:?}", deserialized);
}

the deserialized part is pretty much what I'm trying to do here, I give the function a buffer, and it returns the struct.. I guess the serde example works because it's basic types and not &str so there is no reference to a string that gets lost during the creation...So I suspect if the Point struct of the example contained a &str even serde would not work.
But I wonder, is there anything I could do with the lifetimes to tell rust "hey, the actual string you get from the buffer, keep it somewhere with the same lifetime of the reference I create". I guess the problem is the "keep it somewhere" since there is no variable referencing it right?

this is actually interesting though, because if in rust I do something like

let hello_var = "hello"

hello_var is a reference to a String that is held on the stack (because it's not a String but a str).. should I be thinking to have a str instead?

It's not. It's a &str (reference to str, not String), that is compiled as a constant (held in the binary, not on the stack).

yeah sorry typo. so I guess that's the reason why I can't use str, but only &str and (&)String. Because str is just within the library, it's known at compile time.. I get it