Help to make code more consise when using Cow

I have the following code:

use std::borrow::Cow;

pub fn json_encode(orig: &str) -> Cow<'_, str> {
    let mut chars = orig.char_indices();
    let mut owned = false;
    let mut res = String::new();
    while let Some( (i,c) ) = chars.next() {
        match c {
            '"' => { res = orig[..i].to_string();
                res.push_str("\\\""); owned = true;
                break},
            '\n' => { res = orig[..i].to_string();
                res.push_str("\x5Cn"); owned = true;
                break},
            '\r' => { res = orig[..i].to_string();
                res.push_str("\x5cr"); owned = true;
                break},
            '\t' => { res = orig[..i].to_string(); 
                res.push_str("\x5ct"); owned = true;
                break},
            '\\' => { res = orig[..i].to_string();
                res.push_str("\x5c\x5c"); owned = true;
                break},
            _ => () ,
        }
    }
    if owned {
        while let Some( (_,c) ) = chars.next() {
            match c {
                '"' => res.push_str("\\\""),
                '\n' => res.push_str("\x5Cn"),
                '\r' => res.push_str("\x5cr"),
                '\t' => res.push_str("\x5ct"),
                '\\' => res.push_str("\x5c\x5c"),
                _ => res.push(c),
            }
        }
        Cow::Owned(res)
    } else {
        Cow::Borrowed(orig)
    }
}


fn main() {
    let value = json_encode(r#"Hello " \ "
        world!"#);
    println!("{}", value);
    let value = json_encode("Hello world!");
    println!("{}", value);
}

It's working, however, I do not like that res string is created regardless if returned the borrowed or an owned value. I tried to declare it without an initialization, but I couldn't convince the compiler - it's okay in the borrowed case.

res = orig[..i].to_string(); looks like a boilerplate, and I couldn't figure out how to get rid of it.

Finally, since I know approximately a length of owned value, I would prefer to set it explicitly.

Any other improvements you could see?

1 Like

how about only saving the indices in the first pass, then run a second pass to escape the string only if the list of indicces is non-empty? something like this:

fn need_escape(c: char) -> bool { todo!() }
fn escape_at(s: &str, i: usize) -> &'static str { todo!() }

// first pass
let escape_indices: Vec<usize> = orig
        .char_indices()
        .filter_map(|(i, c)|  need_escape(c).then_some(i))
        .collect();
if escape_indices.is_empty() {
    return Cow::Borrowed(orig)
}

// second pass
let mut res = String::new();
let mut start = 0usize;
for i in escape_indices {
    res.push_str(&orig[start..i]);
    res.push_str(escape_at(orig, i));
    let escape_len = 1;
    start = i + escape_len;
}
// the trailing segment
res.push_str(&orgi[start..]);
Cow::Owned(res)

you can do some optimization with this method, for example, you can save the escaped sequence along with the indices in the first pass, etc.

2 Likes

I came up with pretty much the same idea, but in a single pass.

3 Likes

Generally, it's an interesting approach, however two passes ruin it.

It looks better, I only think that no reason to save all found escaped characters indices. Just one is sufficient and then just switch to by a char processing algorithm. I didn't do any bench marking yet, but I believe that a speed will be comparable, because I did a similar measurements for Java long time ago.

actually, I don't think a two pass method is that bad.

the second loop count is propotional to the number of escape characters, not the length of the input. granted, in the worst case, it's the same, but I would put it in the pathological category and not worry about it.

besides, after the first pass, you can calculate the length of the final result, so you can use String::with_capacity() to allocate upfront, this can reduce the reallocation overhead. so overall, it should be roughly on par with a single pass algorithm.

3 Likes

Here's a similar approach that uses find... which may be optimized when searching for a slice of chars (but I don't know if it actually is and I didn't benchmark either). Working with bytes could be faster... unfortunately memchr only supports up to three needles at a time and I'm unaware of a similar crate off the top of my head.

Using aho-corasick pattern matching:

4 Likes

memcpy'ing the non-matching prefix like the OP code does is possible with aho-corasick's replace_all_with() method. I did not benchmark it.

There is probably a lot of really optimized code out there that does exactly this. But anyway...

Just for clarity, I would first refactor the original code as:

pub fn json_encode(orig: &str) -> Cow<'_, str> {
    let mut chars = orig.char_indices();
    let mut res = None;
    while let Some((i, c)) = chars.next() {
        if let Some(c_esc) = escape_char(c) {
            res.insert(orig[..i].to_string()).push_str(c_esc);
            break;
        }
    }
    if let Some(mut res) = res {
        while let Some((_, c)) = chars.next() {
            if let Some(c_esc) = escape_char(c) {
                res.push_str(c_esc);
            } else {
                res.push(c);
            }
        }
        Cow::Owned(res)
    } else {
        Cow::Borrowed(orig)
    }
}

fn escape_char(c: char) -> Option<&'static str> {
    match c {
        '"' => Some("\\\""),
        '\n' => Some("\\n"),
        '\r' => Some("\\r"),
        '\t' => Some("\\t"),
        '\\' => Some("\\\\"),
        _ => None,
    }
}

Or, if you prefer:

pub fn json_encode(orig: &str) -> Cow<'_, str> {
    let mut chars = orig.char_indices();
    while let Some((i, c)) = chars.next() {
        if let Some(c_esc) = escape_char(c) {
            let mut res = orig[..i].to_string();
            res.push_str(c_esc);
            while let Some((_, c)) = chars.next() {
                if let Some(c_esc) = escape_char(c) {
                    res.push_str(c_esc);
                } else {
                    res.push(c);
                }
            }
            return Cow::Owned(res);
        }
    }
    Cow::Borrowed(orig)
}

fn escape_char(c: char) -> Option<&'static str> {
    match c {
        '"' => Some("\\\""),
        '\n' => Some("\\n"),
        '\r' => Some("\\r"),
        '\t' => Some("\\t"),
        '\\' => Some("\\\\"),
        _ => None,
    }
}

Regarding performance, note that String::new() does not do any allocations (but I avoided it regardless). The real issue will be the later reallocations, as you noticed. So you want to use String::with_capacity, either with a heuristic once you encounter the first escaped character, or by calculating the length exactly:

pub fn json_encode(orig: &str) -> Cow<'_, str> {
    if let Some(esc_len) = escaped_len(orig) {
        let mut res = String::with_capacity(esc_len);
        for c in orig.chars() {
            if let Some(c_esc) = escape_char(c) {
                res.push_str(c_esc);
            } else {
                // Note: There is probably some missed optimization potential here
                // because `c` was just UTF-8-decoded and now we re-encode it.
                res.push(c);
            }
        }
        Cow::Owned(res)
    } else {
        Cow::Borrowed(orig)
    }
}

fn escaped_len(mut s: &str) -> Option<usize> {
    let mut len = 0;
    let mut escaped = false;
    while let Some((before, esc, after)) = first_escaped_char(s) {
        len += before.len() + esc.len();
        escaped = true;
        s = after;
    }
    if escaped {
        len += s.len();
        Some(len)
    } else {
        None
    }
}

fn first_escaped_char<'a>(s: &'a str) -> Option<(&'a str, &'static str, &'a str)> {
    let mut chars = s.char_indices();
    while let Some((i, c)) = chars.next() {
        if let Some(c_esc) = escape_char(c) {
            return Some((&s[..i], c_esc, &s[chars.offset()..]));
        }
    }
    None
}

fn escape_char(c: char) -> Option<&'static str> {
    match c {
        '"' => Some("\\\""),
        '\n' => Some("\\n"),
        '\r' => Some("\\r"),
        '\t' => Some("\\t"),
        '\\' => Some("\\\\"),
        _ => None,
    }
}

Gentlemen, I did a little benchmarking, there are results:

Would you like to run cow? [Y|n]
202500000 Time elapsed: 1.152325342s
202500000 SReichelt Time elapsed: 920.648827ms
202500000 SReichelt1 Time elapsed: 746.881453ms
202500000 quinedot Time elapsed: 929.837921ms
202500000 quinedot1 Time elapsed: 933.813271ms

Two more algorithms from parasyte and nerditation are pending. I do not use Cargo and for the reason, I need to build the aho-corasick crate first. I also need to cover todo() part of another algorithm. The tests have been done on RK3588 processor and Rust code was compiled with opt-level=3. As you see, you could improve my code by 30%, that I consider very good. There is how my testing bench look


I did all testing in a single core mode, but multi threads version will unlikely change the results.

1 Like

Adding more benchmarks:

205500000 Time elapsed: 1.107804453s
205500000 SReichelt Time elapsed: 933.752354ms
205500000 SReichelt1 Time elapsed: 762.381794ms
205500000 quinedot Time elapsed: 966.918385ms
205500000 quinedot1 Time elapsed: 964.664714ms
205500000 parasyte Time elapsed: 127.979244716s

As you can see, mathematicians have a very strong brain, however they can't compete with CPU power. Rust is also difficult for them:

.....
warning: method `get_for_offset` is never used
   --> src/vector.rs:126:8
    |
120 | impl SensibleMoveMask {
    | --------------------- method in this implementation
...
126 |     fn get_for_offset(self) -> u32 {
    |        ^^^^^^^^^^^^^^

warning: 509 warnings emitted
....
warning: hiding a lifetime that's elided elsewhere is confusing
  --> src/util/alphabet.rs:93:23
   |
93 |     fn element_ranges(&self, class: u8) -> ByteClassElementRanges {
   |                       ^^^^^                ^^^^^^^^^^^^^^^^^^^^^^ the same lifetime is hidden here
   |                       |
   |                       the lifetime is elided here
   |
   = help: the same lifetime is referred to in inconsistent ways, making the signature confusing
help: use `'_` for type paths
   |
93 |     fn element_ranges(&self, class: u8) -> ByteClassElementRanges<'_> {
   |                                                                  ++++

warning: 266 warnings emitted

I can admit also a high warnings amount when I started learning Rust.

Oh, I initially missed the difference between "ms" and "s". :joy:
(BTW, just to make sure: You are measuring with --release, right? Otherwise the numbers will be rather meaningless.)

You can actually get much better performance by changing the API so you can reuse allocated memory (if you expect that case to matter):

pub struct JsonStringEncoder(String);

impl JsonStringEncoder {
    pub fn new() -> Self {
        JsonStringEncoder(String::new())
    }

    pub fn json_encode<'a>(&'a mut self, mut orig: &'a str) -> &'a str {
        if let Some((before, esc, after)) = Self::first_escaped_char(orig) {
            self.0.clear();
            let min_required_len = before.len() + esc.len() + after.len();
            if self.0.capacity() < min_required_len {
                let max_expected_len = before.len() + esc.len() + after.len() * 3 / 2;
                self.0.reserve(max_expected_len);
            }

            self.0.push_str(before);
            self.0.push_str(esc);
            orig = after;
            while let Some((before, esc, after)) = Self::first_escaped_char(orig) {
                self.0.push_str(before);
                self.0.push_str(esc);
                orig = after;
            }
            self.0.push_str(orig);

            &self.0
        } else {
            orig
        }
    }

    fn first_escaped_char<'a>(s: &'a str) -> Option<(&'a str, &'static str, &'a str)> {
        let mut chars = s.char_indices();
        while let Some((i, c)) = chars.next() {
            if let Some(c_esc) = Self::escape_char(c) {
                return Some((&s[..i], c_esc, &s[chars.offset()..]));
            }
        }
        None
    }

    fn escape_char(c: char) -> Option<&'static str> {
        match c {
            '"' => Some("\\\""),
            '\n' => Some("\\n"),
            '\r' => Some("\\r"),
            '\t' => Some("\\t"),
            '\\' => Some("\\\\"),
            _ => None,
        }
    }
}

And then, in your benchmark, construct JsonStringEncoder only once and use the same instance in every loop iteration.

(Similarly, in the aho-corasick version, you are probably supposed to call AhoCorasick::new once and then reuse it. But I don't know how much that will change.)

I mentioned that, I use -C,opt-level=3, but --release uses -C,opt-level=4. My testing showed that there is no difference in benchmark results. You also rose a very interesting point. Rust is heavily depends on RAII. However reusing memory instead of continuously to allocate and free it is more efficient. So your last correction is very aligned to my research now.