String indexing

Hi,
Apologies if this topic has been beaten to death, but I couldn't find satisfactory answers after an hour of googling, looking at stack exchange, user.rust-lang and of course reading the docs and experimenting.
I was trying to implement a simple string de-duplicator in Rust. Here is the code. I got it to work by changing the program logic but want to know how to get the commented version to work. Particularly, how do I get the character at position 'j' in a &str or String?

/*
fn squeeze(x: &str, y: &mut String) {
    for j in 1..x.len()+1 {
        if x.chars().nth(j) != x.chars().nth(j-1) {
            //y.push(&x[j-1..j]);
            print!("{}", &x[j-1..j]);
        }
    }
}
*/

// Places "squeezed" string in 'y'
fn squeeze(x: &str, y: &mut String) {
    let mut r: char;

    r = x.chars().next().unwrap();
    y.push(r);
    for i in x.chars() {
        if i != r {
            r = i;
            y.push(r);
        }
    }
}

fn main() {
    let a: &str = "ssommeesstrinngg";
    let mut b = String::new();

    println!("{}", &a);
    squeeze(&a, &mut b);
    println!("{}", &b);
}

The commented out function confuses characters and bytes.
You may need to read this Strings/Indexing.

Hi,

Thanks for responding. I did go through that chapter, but it did not help my cause much. The answer to the old one was actually right there in my new solution. All I had to do was to use the unwrap() on the 'Item' returned by the nth() function. The following works:

fn squeeze(x: &str, y: &mut String) {
    for j in 1..x.len()+1 {
        if x.chars().nth(j) != x.chars().nth(j-1) {
            y.push(x.chars().nth(j-1).unwrap());
        }
    }
}

But, is it legitimate to compare Items nth(j) and nth(j-1) like in the example above?

I do understand that the code is terrible/wasteful on the performance front (I am unaware of the cost of x.chars().nth(). There is a lot of repetition and I don't know if rustc optimises it). The exercise for me was to understand how to handle characters in string. Thanks.

There are a few different ways you can do this. One, that I think might be closest to the intent of the commented out code is to convert the string to a vector of char, like this:

fn squeeze(s: &str, y: &mut String) {
    let v = s.chars().collect::<Vec<_>>();
    for i in 0..v.len() {
        if i == 0 || v[i - 1] != v[i] {
            y.push(v[i])
        }
    }
}

Moral of the story: if you want to do random access on unicode codepoints, a vector of chars might be a better choice than a string. For sequential access, of course, the chars() iterator is fine.

Also see the dedup method in itertools, which does the deduplication in a more generic way.

Why just not use Vec::dedup()? It's in std!

Also, I would implement it as a topic starter did in the uncommented code. Or I'd tried a more functional way (as an execise):

fn squeeze(s: &str) -> String {
    s.chars().zip(s.chars().skip(1).chain("_".chars()))
        .filter(|&(a, b)| a != b).map(|(a, _)| a).collect()
}

fn main() {
  println!("{:?}", squeeze("aaaabcdddde"));
}

This is a good example of a problem for which you don't need random access. If you don't want to use itertool's dedup or build your own adapter struct, you could still do something like this:

fn append_squeezed(at: &mut String, x: &str) {
    at.extend(x.chars()
        // map c: char => (c, keepit): (char, bool)
        .scan(None, |s, c| Some(match s {
            ref mut old @ None => { *old = Some(c); (c, true) }
            Some(ref mut old) => (c, std::mem::replace(old, c) != c)
        })).filter(|&(_,keepit)| keepit).map(|(c,_)| c)
    );
}

I feel like a fold can be more elegant here:

fn squeezed(orig: &str) -> String {
    let mut newstr = String::new();
    orig.chars().fold(None, |old, new| 
        if Some(new) != old { newstr.push(new); Some(new) } else { old });
    newstr
}

fn main() {
    println!("{}", squeezed("aaaabcdddde"));
}

It's nice, but feels... impure. It uses fold() for side effect of mutating free variable newstr. But purity aside, it should be faster than my version for sure.

I know what you mean. Haskellers would probably shoot me for doing that :slight_smile:

My dream is, Rust being able to optmize away functional code (like my code in this topic above) in a way, it could compete with a classical imperative code. I ran benchmarks in playpen, and my functional code is twice as slow as OP's code, which is sad, but that's the reality.

BTW, your code is even faster than OP's code! Here.

Upd: I gathered all versions in a single playpen, so we can measure our code side-by-side =)

Upd: Strange thing, when I run bench for just a single function, it gives worse results, than when run along with all other functions. And when I reorder benches, top function gives worst time. Seems like compiler optimizes things differently, based on surrounding code. Could it be it sees some common code patterns between functions and makes them reuse some code parts? Or maybe it's kind of instructions cache or something? Can someone more experienced in low level optimizations enlighten me here?

Nice work! It might be good to run the functions multiple times - a few microseconds is not a reliable value for benchmarking, especially on a hosted environment like playpen.

Also since the playpen seems to cache output, you can run the same code only once and then have to make a whitespace change or so to actually rerun the benchmark.

That may be easier in a language that has linked lists and strings, which is a much more natural data structure in an immutable, functional world.

BTW, hoes does the standard dedup do in comparison?

From what I found out, it's basically the same algorithm as implemented by OP, but modifies vector in place using unsafe raw pointers operations.

I used the time crate to measure a million runs of the two algorithms and here are the times. It didn't really matter which of the functions was measured first.

bash-3.2$ cargo run
Running target/debug/str_dedup
PT3.892506986S seconds for squeeze1 1000000 iters.
PT1.073691585S seconds for squeeze 1000000 iters.

Algorithms:

fn squeeze(x: &str, y: &mut String) {
        let mut r: char;

        r = x.chars().next().unwrap();
        y.push(r);

        for i in x.chars() {
                if i != r {
                        r = i;
                        y.push(r);
                }
        }
}

fn squeeze1(s: &str) -> String {
            s.chars().zip(s.chars().skip(1).chain("_".chars()))
                        .filter(|&(a, b)| a != b).map(|(a, _)| a).collect()
}

Since the squeeze1 algorithm is returning a String, it must be getting allocated in one of the called functions which is probably adding to the time.