Using Iterator<Item = T> when Iterator<Item = &T> required

Hello,

I made a function ptr_map that creates a map from strings to string pointers.
(It is the simplification of a more complex function.)
However, I am stuck using this function with an Iterator.
In a nutshell, I obtain my strings from an Iterator<Item = String>,
but ptr_map wants an Iterator<Item = &String>.

use std::collections::HashMap;

fn ptr_map<'s>(iter: impl Iterator<Item = &'s String>) {
    let mut symmap: HashMap<String, &String> = HashMap::new();
    for sym in iter {
        symmap.insert(sym.clone(), &sym);
    }
}

fn main() {
    let syms = ["a".to_string(), "b".to_string()];
    // all is fine here ...
    ptr_map(syms.iter())
}

fn howto(iter: impl Iterator<Item = String>) {
    // TODO: ... but what to do here?
    //ptr_map(iter)
}

Rust playground

If you could show me how to either
(a) modify ptr_map so that it takes Iterator<Item = String> or
(b) implement howto,
I would be extremely grateful! :slight_smile:

If you have an iterator of owned objects, you can't really turn that into an iterator of borrowed objects. Who would own the objects that you're borrowing?

It's not entirely clear to me why you need things to work the way they do in ptr_map. Is it because you're trying to use a single copy of the string in both the key and the value of the map? This might be a good use for shared ownership:

use std::rc::Rc;


fn ptr_map(iter: impl Iterator<Item = String>) {
    let mut symmap: HashMap<Rc<String>, Rc<String>> = HashMap::new();
    for sym in iter {
        let k = Rc::new(sym);
        let v = k.clone();
        symmap.insert(k, v);
    }
}
2 Likes

An iterator with item type &String requires that every single reference returned by the iterator remains valid until after the iterator has finished, but an iterator with item type String gives the consumer of the iterator the responsibility of keeping the strings alive.

So the only way to do this would be to collect your iterator of String into a vector and create an iterator over the vector.

2 Likes

It sounds like you're coming from C or C++ and are maybe thinking of references as pointers. It's true that they are implemented as pointers, but they are only one of the rust types that are implemented as pointers, and creating a HashMap that has references in it is almost always wrong. A reference is always borrowed from some other data structure or variable. So for your HashMap you'd have to be able to identify the variable that you're borrowing from. It's possible to do this, but it's probably not what you want to do. It means that you can't modify the variable that you're borrowing from, for instance, and you have to ensure that said variable lives as long as your HashMap does.

1 Like

Thanks for all your answers.

I forgot to mention that I have already a working version of this program using Rc/Arc as suggested by @jethrogb. However, it is quite slow when accessing the Strings from multiple threads, and I believe this may be due to the overhead of Arc.

These conditions can be fulfilled in my program, and it's precisely why I considered references in order to avoid the Rc overhead.

That is something that I would like to avoid, because I need to process the Strings as soon as they arrive from the iterator.

Yes, I understand that. I was wondering how to keep the strings alive without having to collect into a Vec before. For example, I thought about storing every string from the Iterator<Item = String> as soon as the string is processed by ptr_map. The question is, how do I store it so that Rust recognises that the lifetime is satisfied?
Something that I tried and that does not work, just to give you an idea:

fn ptr_map_test(iter: impl Iterator<Item = String>) {
    let mut symmap: HashMap<String, &String> = HashMap::new();
    let collected: Vec<String> = iter.map(|sym|
    {
        symmap.insert(sym.clone(), &sym);
        sym
    }).collect();
}

Here, I tried to store the Strings in the collected Vec to make sure that the references remain valid, but the compiler complains ...

The only slowdowns with Arc should be cloning or destroying -- those are the only operations that require modifications to the atomic refcount. However, moving it or borrowing a &str from Arc<str> should be as cheap as borrowing it from a String. Operations like contains_key and get are designed to not require having an Arc passed in (just a &str would do). So I'm not sure what significant costs you would reduce when switching the HashMap's stored type from an Arc to a borrowed value. Could you possibly elaborate on that?

Or perhaps are you using Arc<String> and taking on the cost of a double indirection (ArcString&str)? If so, perhaps switching to Arc<str> might gain you the benefits.

2 Likes

That example can't work. You are moving sym after inserting it, invalidating the reference to it. Even if you changed it to a for loop and used indexes into the for loop after insertion, it wouldn't work because adding more items requires mutably borrowing the vector, which requires exclusive access.

If you know the length up front, I guess you could do this:

fn ptr_map_test(iter: impl Iterator<Item = String>) {
    let mut symmap: HashMap<String, &String> = HashMap::new();
    
    // we must know the length up front
    let mut collected = vec![String::new(); iter.size_hint().0];
    
    for (ptr, s) in collected.iter_mut().zip(iter) {
        *ptr = s;
        symmap.insert(ptr.clone(), &*ptr);
    }
}

But it isn't ideal.

Of course.
To approximate my program, let's say that it processes sentences, consisting of a sequence of words. Example: "the quick brown fox jumps over the lazy dog". I am currently using Vec<Rc<String>> to represent a sentence, where each Rc<String> represents a word.
I'm using the HashMap to map equal words to the same pointer, such as "the" in the example. This is done to speed up the equality test of words.
I process lots and lots of sentences in my program (also in parallel, in which case a word will be Arc<String>), and during processing I will frequently have to clone these words. Currently, each clone/destroy operation causes some overhead. In particular, I suspect that cloning some frequently occurring words such as "the", "and" etc. becomes a bottleneck when processing in parallel.
My motivation to switch from Rc<String> to &str was to speed up cloning of words.

For that reason, just switching from Rc<String> to Rc<str> unfortunately does not change a lot, because cloning will remain equally expensive. In general, I rarely look at what is inside the Rc, I mainly use Rc::ptr_eq for the equality check. (Which I planned to replace by std::ptr::eq when switching to &str.)

Unfortunately, I do not know the length up front, because I do not know how many words there will be ...

Sounds like a good use for string interning. Take a look at https://crates.io/crates/string-interner

1 Like

I would like to avoid string interning, because
(a) it allows for runtime errors that are currently impossible with the <Rc> solution (resolving strings with the wrong interner, for example), and
(b) it complicates e.g. printing of sentences, because it requires carrying around the interner (particularly annoying for debugging).

It's still not clear to me why you want these words in both the key and the value of the map.

If you want to use references instead of shared ownership, you'll have to decide what part of your call stack is going to own the data. Regardless of what you chose, that owner will need to outlive the map. Here are some options:

  • The original input sentences
  • Some vec where you store everything
  • An arena
  • Leak memory
3 Likes

Wow, your tip about using an arena was a full hit!
It looks like this is exactly what I need.

An example illustrating my usage (requires typed-arena 2.0) follows.
It will print "Palindrome detected" exactly once.
If the line let v = arena.alloc(word.clone()); is uncommented, then it should not print anything.

use std::collections::HashMap;
use typed_arena::Arena;

/// Return true if reversing the order of words yields the same sentence.
fn palindrome(sentence: Vec<&str>) -> bool {
    sentence
        .iter()
        .zip(sentence.iter().rev())
        .all(|(w1, w2)| std::ptr::eq(*w1, *w2))
}

fn process(sentences: impl Iterator<Item = Vec<String>>) {
    let mut wordmap: HashMap<String, &str> = HashMap::new();
    let arena = Arena::new();
    for sentence in sentences {
        let mut rsentence = Vec::<&str>::new();
        for word in sentence {
            let v = wordmap
                .entry(word.clone())
                .or_insert_with(|| arena.alloc(word.clone()));
            // uncomment next line to map every word to a new reference
            //let v = arena.alloc(word.clone());
            rsentence.push(v);
        }

        if palindrome(rsentence) {
            println!("Palindrome detected")
        }
    }
}

fn main() {
    let sentences = vec![
        vec!["ashes".to_string(), "to".to_string(), "ashes".to_string()],
        vec!["the".to_string(), "brown".to_string(), "fox".to_string()],
    ];
    process(sentences.into_iter())
}

Thanks a lot for your help!

It's unclear to me why the Arena works but the interner does not. In both cases, string allocations borrow from the arena/interner. As long as you just have a single interner, you can just do pointer comparisons. And your code would end up using HashMap<&str, &str>, avoiding an additional allocation for the key (even now, you could switch the key's type, but the arena would still allocate for the key on every lookup, whereas the interner would return the last interned value).

Oh, the interner would work as well.
But as I wrote, it is important for me to have compile-time guarantees that I can always get the underlying string for every word in my program --- which cannot be assured by the interner.
(From the docs: pub fn resolve(&self, symbol: S) -> Option<&str>)
And I will actually not allocate so many different words, in comparison with the number of words appearing in sentences. Therefore, the additional allocation of the key is quite negligible in my scenario.

Don't you want get_or_intern() instead of resolve()?

Well, get_or_intern() is useful when creating a symbol, but when I already have a symbol and I want to obtain its underlying string, I need to use resolve(). And this operation can potentially fail.

For future reference, here is the final version of my example.
It maps equal Strings to &strs with the same memory address and processes the result in parallel.
It will print "Palindrome detected" exactly once.
If the line indicated in the code is uncommented, then it should not print anything.

use std::collections::HashSet;

/// Return true if reversing the order of words yields the same sentence.
fn palindrome(sentence: Vec<&str>) -> bool {
    sentence
        .iter()
        .zip(sentence.iter().rev())
        .all(|(w1, w2)| std::ptr::eq(*w1, *w2))
}

fn process(sentences: impl Iterator<Item = Vec<String>> + Send) {
    use colosseum::sync::Arena;
    use rayon::iter::{ParallelBridge, ParallelIterator};

    let mut words: HashSet<&str> = HashSet::new();
    let arena = Arena::new();

    let conv_sentence = |sentence: Vec<String>| -> Vec<&str> {
        let conv_word = |word: String| -> &str {
            words.get(&*word).copied().unwrap_or_else(|| {
                let w: &str = arena.alloc(word);
                words.insert(w);
                w
            })
        };
        // uncomment next line to map every word to a new reference
        //let conv_word = |word: String| -> &str { arena.alloc(word) };
        sentence.into_iter().map(conv_word).collect()
    };
    let process = |rsentence: Vec<&str>| {
        if palindrome(rsentence) {
            println!("Palindrome detected")
        }
    };

    sentences.map(conv_sentence).par_bridge().for_each(process)
}

fn main() {
    let sentences = vec![
        vec!["ashes".to_string(), "to".to_string(), "ashes".to_string()],
        vec!["dust".to_string(), "to".to_string(), "air".to_string()],
    ];
    process(sentences.into_iter())
}

Cargo.toml:

[dependencies]
rayon = "1.3"
colosseum = "0.2"
2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.