Is there a Rust way to modify a String in place?

This question has come up several times in different forms. I just want to confirm that there is no Rust solution for code like the following:

pub fn adjust_separator<'long>(path: &'long mut String) -> &'long String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' {'/'} else {'\\'};
    for c in 0..path.len() {
       if path[c] == foreign_slash { path[c] = MAIN_SEPARATOR }
    }

    &path
}

As expected, rustc rejects this with help messages like the following:

= help: the trait `SliceIndex<str>` is not implemented for `usize`, which is required by `String: Index<_>`

= help: the trait SliceIndex<[_]> is implemented for usize
= help: for that trait implementation, expected [_], found str
= note: required for String to implement Index<usize>

My replacement char has exactly the same size in any Unicode encoding, so it should not make the String grow or shrink.
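
For context, here is a minimal sketch of what already works on a string without unsafe code: positions can be found with `char_indices`, and single bytes can be read through `as_bytes`. Only writing one byte back has no direct safe API. The helper names `separator_offsets` and `byte_at` are hypothetical, chosen for illustration.

```rust
/// Byte offsets of every occurrence of `needle` in `path`.
/// (Hypothetical helper, not part of the original question.)
pub fn separator_offsets(path: &str, needle: char) -> Vec<usize> {
    path.char_indices()
        .filter(|&(_, c)| c == needle)
        .map(|(offset, _)| offset)
        .collect()
}

/// Reads the byte at `index`, or returns `None` when out of range.
/// Reading bytes is safe; it is only mutation that can break UTF-8.
pub fn byte_at(path: &str, index: usize) -> Option<u8> {
    path.as_bytes().get(index).copied()
}
```

Note that the offsets are byte offsets, not character counts, which is why `path[c]` with a plain `usize` has no obvious meaning for a `String`.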

You can use (unsafe) as_mut_vec.

As pointed out already, as_mut_vec is the easiest way to do it. It is unsafe because the compiler cannot guarantee the end result of modification is a valid UTF-8 string, so this is up to your code.

e.g.

pub fn adjust_separator<'long>(path: &'long mut String) -> &'long String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' {
        '/' as u8
    } else {
        '\\' as u8
    };
    let bytes = unsafe { path.as_mut_vec() };
    for c in 0..bytes.len() {
        if bytes[c] == foreign_slash {
            bytes[c] = MAIN_SEPARATOR as u8
        }
    }
    path
}

Well, that's never the case, but I'll take the bait this time.

Assuming you don't care about unicode, you can do this:

pub fn adjust_separator(mut path: String) -> String {
    // `a` is the foreign separator to look for, `b` is the native replacement.
    let (a, b) = if MAIN_SEPARATOR == '/' { ("\\", "/") } else { ("/", "\\") };
    for c in 0..path.len() {
        let c = c..c+1;
        if path[c.clone()] == *a { path.replace_range(c, b) }
    }

    path
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6b36cd38206acaea73364ebbc5aa75b2

I changed the signature because &mut String -> &String like that is not useful. If you want to modify an existing string, then you want either &mut String -> () or String -> String.

Though really, as always for string processing you should just use a regex. Or use a real path library instead of trying to do path handling in strings.
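
As a rough illustration of the "real path library" suggestion, the sketch below splits the text on either separator and lets `PathBuf` join the pieces with the native one. The function name `to_native_path` is hypothetical. It assumes a relative path with no empty segments; a leading separator (an absolute path) would be lost, and on Unix a backslash is a legal file name character, so this is only appropriate when the input is known to use both characters as separators.

```rust
use std::path::PathBuf;

/// Splits `raw` on `/` or `\` and rebuilds it as a `PathBuf`,
/// which joins components with the platform's native separator.
/// (Hypothetical helper; handles relative paths only.)
pub fn to_native_path(raw: &str) -> PathBuf {
    raw.split(['/', '\\']).collect()
}
```

This allocates a new buffer, so it does not answer the in-place question, but it avoids byte-level manipulation entirely.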

pub fn adjust_separator(path: &mut String) -> &String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' { '/' } else { '\\' };
    let foreign_slash_len = foreign_slash.len_utf8();

    let mut start = 0;
    while let Some(i) = path[start..].find(foreign_slash) {
        start += i;
        path.replace_range(start..start + foreign_slash_len, MAIN_SEPARATOR_STR);
        start += MAIN_SEPARATOR_STR.len();
    }
    path
}

Thanks, it looks useful. One question remains: can a slash be part of a multi-byte UTF-8 character? It looks like it cannot, since the 8th bit is set in every byte of such a character.

This shouldn't be a problem in UTF-8 (Rust uses this encoding) if you are only swapping \ (UTF-8 code 92, or 01011100) and / (UTF-8 code 47, or 00101111).

As you point out, the most significant bit (the 8th bit, counting from right to left) of both bytes is 0, so neither can appear inside a multi-byte character: in UTF-8, every byte of a multi-byte character has its MSB set to 1.
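
The bit-pattern argument can be checked with a small sketch. Both function names are hypothetical. The first encodes each char separately and confirms that every byte of a multi-byte character has the high bit set; the second counts an ASCII byte both at the byte level and at the char level, which must agree if that byte never hides inside a longer sequence.

```rust
/// True if every byte of every multi-byte character in `s` has its
/// most significant bit set. Expected to hold for any valid `&str`.
pub fn multibyte_bytes_have_msb_set(s: &str) -> bool {
    s.chars().all(|ch| {
        let mut buf = [0u8; 4];
        let encoded = ch.encode_utf8(&mut buf).as_bytes();
        encoded.len() == 1 || encoded.iter().all(|b| b & 0x80 != 0)
    })
}

/// Counts occurrences of an ASCII `needle` as raw bytes and as chars.
/// The two counts match because ASCII bytes only ever encode themselves.
pub fn byte_and_char_counts(s: &str, needle: u8) -> (usize, usize) {
    assert!(needle.is_ascii(), "needle must be ASCII");
    let as_bytes = s.bytes().filter(|&b| b == needle).count();
    let as_chars = s.chars().filter(|&c| c == char::from(needle)).count();
    (as_bytes, as_chars)
}
```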

This one looks like a solution. I haven't compared its performance against the alternatives, but the important part is that it involves no string reallocation. Thanks.

@jumpnbrownweasel's solution is better, @scottmcm's solution only works for ASCII strings.

Indeed, @jumpnbrownweasel's solution survived Unicode data where the others failed. It's a little disappointing that such a simple task can't be solved the way the initial code snippet attempted.

It can be done in place, though, using as_mut_vec.

Here is the equivalent of the original snippet:

pub fn adjust_separator(path: &mut str) {
    let foreign = match MAIN_SEPARATOR {
        '\\' => b'/',
        '/' => b'\\',
        _ => panic!("Unknown separator")
    };
    let separator = MAIN_SEPARATOR.try_into().unwrap();
    // SAFETY: foreign and separator are both ASCII,
    // so we maintain UTF-8.
    let bytes = unsafe { path.as_bytes_mut() };
    for byte in bytes {
        if *byte == foreign {
            *byte = separator;
        }
    }
}

Thanks, this looks like the variant I will keep in my code, although the unsafe looks weird:

pub fn adjust_separator(mut path: String) -> String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' { '/' } else { '\\' };
    let vec = unsafe {path.as_mut_vec()};
    for c in 0..vec.len() {
        if vec[c] == foreign_slash as u8 { vec[c] = MAIN_SEPARATOR as u8;}
    }

    path
}

Thanks everyone, the discussion was productive and interesting.

Depending on your use case, you may want to check out this crate:

You could do it using safe code if you validate the bytes as UTF-8:

pub fn adjust_separator(path: String) -> String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' { '/' } else { '\\' };
    let mut vec = path.into_bytes();
    for c in 0..vec.len() {
        if vec[c] == foreign_slash as u8 { vec[c] = MAIN_SEPARATOR as u8;}
    }

    String::from_utf8(vec).expect("adjust_separator produced invalid UTF-8")
}
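
Regarding the earlier concern about reallocation: `String::into_bytes` and `String::from_utf8` are both documented as not copying the buffer, so the safe round trip should reuse the original allocation. The price is one extra validation pass over the bytes. The sketch below uses a hypothetical generic helper, `replace_ascii_byte`, rather than `MAIN_SEPARATOR`, so that it behaves the same on every platform.

```rust
/// Replaces every `from` byte with `to` without unsafe code.
/// (Hypothetical helper; both bytes are assumed to be ASCII.)
pub fn replace_ascii_byte(s: String, from: u8, to: u8) -> String {
    assert!(from.is_ascii() && to.is_ascii(), "both bytes must be ASCII");
    // Take ownership of the underlying buffer; no copy is made.
    let mut bytes = s.into_bytes();
    for b in bytes.iter_mut() {
        if *b == from {
            *b = to;
        }
    }
    // Validates the bytes (one linear scan) and reuses the same buffer.
    String::from_utf8(bytes).expect("swapping ASCII for ASCII keeps UTF-8 valid")
}
```

Comparing `as_ptr()` before and after the call is one way to confirm that the buffer was reused.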

The difficulty comes from the UTF-8 encoding, in which chars have variable length. If UTF-32 were used, the initial code snippet could work, but it would be extremely wasteful of space: roughly 4x for mostly-ASCII text.
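
A small sketch of both points, with hypothetical helper names: `utf8_lengths` reports how many bytes each char occupies in UTF-8, and `utf32_size` estimates the size of the same text at four bytes per char.

```rust
/// Number of UTF-8 bytes used by each char of `s`, in order.
pub fn utf8_lengths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

/// Size in bytes the same text would need at 4 bytes per char (UTF-32).
pub fn utf32_size(s: &str) -> usize {
    s.chars().count() * 4
}
```

Because lengths vary, a byte range can land in the middle of a character; `str::get` returns `None` in that case, while indexing with the same range panics.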

It seems the initial design of Rust wasn't future-proof. Back then a MacBook had only 2 GB of RAM and only a few people worked on LLMs. Now a MacBook has 16 GB and nearly everyone works on LLMs. Even Java had u16-based strings in 1995. In Rust's case, having OsString as String, and String as UTFString, would be more reasonable for most use cases. Anyway, we should find the best solution within the current design.

Using more memory for the same data also reduces processor cache hits. For several reasons, Swift switched to UTF-8.

It was designed with a UTF-8 everywhere mindset. Will it hold up? Time will tell. "Future proof" in the 1990s was UCS-2, still with us in various tech/forms such as Windows path encodings.

There are a few misconceptions in this paragraph:

  • Java uses UTF-16, but UTF-16 is a variable-length encoding just like UTF-8 is, so switching from UTF-8 to UTF-16 wouldn't help with the original code.
  • The world has generally shifted away from UTF-16 towards UTF-8, so judging from adoption, UTF-16 isn't more "future proof" than UTF-8, it's the other way around.
  • OsString also doesn't provide a mechanism to arbitrarily modify bytes, so switching to OsString also wouldn't make the original code work. Except of course it's better to use OsString, or better yet PathBuf to represent filesystem paths rather than String, for other reasons. Most strings are not filesystem paths, so it's better to use a portable type for those rather than OsString.
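
The first point can be checked with a short sketch (the helper name `utf16_units_per_char` is hypothetical): characters outside the Basic Multilingual Plane take two 16-bit units in UTF-16, so indexing by code unit has the same problem there as indexing by byte in UTF-8.

```rust
/// Number of UTF-16 code units used by each char of `s`, in order.
pub fn utf16_units_per_char(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf16).collect()
}
```

For example, the string "a/" followed by U+1F600 has three chars but four UTF-16 code units.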

I don't understand what LLMs have to do with this.