Is there a Rust way to modify a String in place?

MOCKBA · October 31, 2024, 2:40am

This question comes up repeatedly in different forms. I just want to confirm that there is no way to make code like the following work in Rust:

pub fn adjust_separator<'long>(path: &'long mut String) -> &'long String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' {'/'} else {'\\'};
    for c in 0..path.len() {
       if path[c] == foreign_slash { path[c] = MAIN_SEPARATOR }
    }

    &path
}

Obviously, I got rustc diagnostics like these:

= help: the trait `SliceIndex<str>` is not implemented for `usize`, which is required by `String: Index<_>`
= help: the trait SliceIndex<[_]> is implemented for usize
= help: for that trait implementation, expected [_], found str
= note: required for String to implement Index<usize>

My replacement char has exactly the same size in any Unicode encoding, so it should not cause the String to grow or shrink.

zirconium-n · October 31, 2024, 3:13am

You can use (unsafe) as_mut_vec.

hax10 · October 31, 2024, 3:26am

As pointed out already, as_mut_vec is the easiest way to do it. It is unsafe because the compiler cannot guarantee the end result of modification is a valid UTF-8 string, so this is up to your code.

e.g.

pub fn adjust_separator<'long>(path: &'long mut String) -> &'long String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' {
        '/' as u8
    } else {
        '\\' as u8
    };
    let bytes = unsafe { path.as_mut_vec() };
    for c in 0..bytes.len() {
        if bytes[c] == foreign_slash {
            bytes[c] = MAIN_SEPARATOR as u8
        }
    }
    path
}

scottmcm · October 31, 2024, 3:36am

Well, that's never the case, but I'll take the bait this time.

Assuming you don't care about unicode, you can do this:

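pub fn adjust_separator(mut path: String) -> String {
    // `a` is the foreign separator to look for, `b` is the native replacement.
    let (a, b) = if MAIN_SEPARATOR == '/' { ("\\", "/") } else { ("/", "\\") };
    for c in 0..path.len() {
        let c = c..c+1;
        // Slicing by byte range panics on non-ASCII input (not a char boundary).
        if path[c.clone()] == *a { path.replace_range(c, b) }
    }

    path
}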

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6b36cd38206acaea73364ebbc5aa75b2

I changed the signature because &mut String -> &String like that is not useful. If you want to modify an existing string, then you want either &mut String -> () or String -> String.

Though really, as always for string processing you should just use a regex. Or use a real path library instead of trying to do path handling in strings.

jumpnbrownweasel · October 31, 2024, 3:50am

use std::path::{MAIN_SEPARATOR, MAIN_SEPARATOR_STR};

pub fn adjust_separator(path: &mut String) -> &String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' { '/' } else { '\\' };
    let foreign_slash_len = foreign_slash.len_utf8();

    let mut start = 0;
    while let Some(i) = path[start..].find(foreign_slash) {
        start += i;
        path.replace_range(start..start + foreign_slash_len, MAIN_SEPARATOR_STR);
        start += MAIN_SEPARATOR_STR.len();
    }
    path
}

MOCKBA · October 31, 2024, 3:56am

Thanks, it looks useful. There is still a question: can a slash byte be part of a multi-byte UTF-8 char? It looks like it cannot, since the 8th bit is set in every byte of such a char.

hax10 · October 31, 2024, 4:04am

This shouldn't be a problem in UTF-8 (Rust uses this encoding) if you are only swapping \ (UTF-8 code 92, or 01011100) and / (UTF-8 code 47, or 00101111).

As you point out, the most significant bit (8th bit if reading from right to left) is always 0, so they cannot be contained in a multi-byte character, which always has the MSB set to 1 in UTF-8.
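The bit-level argument above can be checked mechanically. The following is only a small sketch: the function name `ascii_bytes_inside_multibyte_chars` is made up for illustration and is not from the thread. It walks a string and counts bytes with the high bit clear that belong to a multi-byte character, which should always come out as zero for valid UTF-8.

```rust
/// Counts bytes that look like ASCII (most significant bit is 0) but sit
/// inside the encoding of a multi-byte character.
/// Hypothetical helper for illustration; for any valid `&str` it returns 0.
pub fn ascii_bytes_inside_multibyte_chars(s: &str) -> usize {
    let mut count = 0;
    for (start, ch) in s.char_indices() {
        let len = ch.len_utf8();
        if len > 1 {
            // The bytes that encode this one multi-byte character.
            let encoded = &s.as_bytes()[start..start + len];
            count += encoded.iter().filter(|b| b.is_ascii()).count();
        }
    }
    count
}
```

So a byte equal to b'/' (47) or b'\\' (92) found anywhere in a `str` is always a complete character on its own, which is what makes the byte-wise swap sound.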

MOCKBA · October 31, 2024, 4:26am

This one looks like a solution. I haven't compared its performance against the alternatives, but the important part is that no string reallocation happens. Thanks.

tczajka · October 31, 2024, 10:32am

@jumpnbrownweasel's solution is better, @scottmcm's solution only works for ASCII strings.

MOCKBA · October 31, 2024, 5:27pm

Indeed, @jumpnbrownweasel's solution survived Unicode data where the other failed. It's a little disappointing that such a simple task can't be solved the way the initial code snippet tried to.

binarycat · October 31, 2024, 5:38pm

It can be done in place, though, using as_mut_vec.

tczajka · October 31, 2024, 5:56pm

Here is the equivalent of the original snippet:

pub fn adjust_separator(path: &mut str) {
    let foreign = match MAIN_SEPARATOR {
        '\\' => b'/',
        '/' => b'\\',
        _ => panic!("Unknown separator")
    };
    let separator = MAIN_SEPARATOR.try_into().unwrap();
    // SAFETY: foreign and separator are both ASCII,
    // so we maintain UTF-8.
    let bytes = unsafe { path.as_bytes_mut() };
    for byte in bytes {
        if *byte == foreign {
            *byte = separator;
        }
    }
}

MOCKBA · October 31, 2024, 6:34pm

Thanks, this looks like the variant I will keep in my code, although the unsafe looks weird:

pub fn adjust_separator(mut path: String) -> String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' { '/' } else { '\\' };
    // SAFETY: only ASCII bytes are swapped for ASCII bytes, so the buffer stays valid UTF-8.
    let vec = unsafe {path.as_mut_vec()};
    for c in 0..vec.len() {
        if vec[c] == foreign_slash as u8 { vec[c] = MAIN_SEPARATOR as u8;}
    }

    path
}

Thanks everyone, the discussion was productive and interesting.

quinedot · October 31, 2024, 6:44pm

Depending on your use case, you may want to check out this crate:

Lej77 · October 31, 2024, 6:45pm

You could do it using safe code if you validate the bytes as UTF-8:

pub fn adjust_separator(path: String) -> String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' { '/' } else { '\\' };
    let mut vec = path.into_bytes();
    for c in 0..vec.len() {
        if vec[c] == foreign_slash as u8 { vec[c] = MAIN_SEPARATOR as u8;}
    }

    String::from_utf8(vec).expect("adjust_separator produced invalid UTF-8")
}

Playground
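The same safe idea can also be applied through a `&mut String`, for callers who do not want to move the string in and out. This is only a sketch under assumptions: `replace_ascii_in_place` and its `from`/`to` parameters are hypothetical names, not something from the thread, and the separators are passed in explicitly so the helper does not depend on the platform.

```rust
use std::mem;

/// Replaces every `from` byte with `to` inside `path`, reusing its buffer.
/// Hypothetical helper for illustration; both bytes must be ASCII.
pub fn replace_ascii_in_place(path: &mut String, from: u8, to: u8) {
    assert!(from.is_ascii() && to.is_ascii(), "both bytes must be ASCII");
    // Move the buffer out, leaving an empty String behind (no allocation).
    let mut bytes = mem::take(path).into_bytes();
    for byte in bytes.iter_mut() {
        if *byte == from {
            *byte = to;
        }
    }
    // Re-validation scans the bytes once but keeps the same allocation.
    *path = String::from_utf8(bytes).expect("ASCII-for-ASCII swap keeps UTF-8 valid");
}
```

A call such as `replace_ascii_in_place(&mut path, b'\\', b'/')` would normalize to forward slashes. The trade-off against the unsafe versions is one extra validation pass over the bytes.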

jumpnbrownweasel · October 31, 2024, 6:47pm

The difficulty comes from the UTF-8 encoding, in which chars are variable length. If UTF-32 were used, the initial code snippet could work, but this would be extremely wasteful of space: ~4x for mostly-ASCII text.
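To make the variable-length point concrete, here is a small sketch using only standard-library calls. The helper names `utf8_char_widths` and `utf32_size` are invented for this illustration.

```rust
/// UTF-8 width in bytes of each `char` in `s`, in order.
/// Hypothetical helper for illustration.
pub fn utf8_char_widths(s: &str) -> Vec<usize> {
    s.chars().map(|c| c.len_utf8()).collect()
}

/// Size in bytes that `s` would need as UTF-32, i.e. 4 bytes per `char`.
/// Hypothetical helper for illustration.
pub fn utf32_size(s: &str) -> usize {
    s.chars().count() * std::mem::size_of::<char>()
}
```

For "a/é/€/😀" the widths are 1, 1, 2, 1, 3, 1, 4, so a byte index does not map to a character index. An ASCII path such as "/usr/bin" takes 8 bytes as UTF-8 but would take 32 bytes as UTF-32.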

MOCKBA · October 31, 2024, 8:15pm

It seems the initial design of Rust wasn't future-proof. Back then a MacBook had only 2GB of RAM and only a few people worked on LLMs. Now a MacBook has 16GB and almost everyone works on LLMs. Even Java had u16-based strings in 1995. In Rust's case, having OsString as String, and String as UTFString, would be more reasonable for most use cases. Anyway, we should provide the best solution within the current design.

jumpnbrownweasel · October 31, 2024, 8:43pm

Using more memory for the same data also reduces processor cache hits. For several reasons, Swift switched to UTF-8.

quinedot · October 31, 2024, 8:52pm

It was designed with a UTF-8 everywhere mindset. Will it hold up? Time will tell. "Future proof" in the 1990s was UCS-2, still with us in various tech/forms such as Windows path encodings.

tczajka · October 31, 2024, 9:03pm

There are a few misconceptions in this paragraph:

Java uses UTF-16, but UTF-16 is a variable-length encoding just like UTF-8 is, so switching from UTF-8 to UTF-16 wouldn't help with the original code.
The world has generally shifted away from UTF-16 towards UTF-8, so judging from adoption, UTF-16 isn't more "future proof" than UTF-8, it's the other way around.
OsString also doesn't provide a mechanism to arbitrarily modify bytes, so switching to OsString also wouldn't make the original code work. Except of course it's better to use OsString, or better yet PathBuf to represent filesystem paths rather than String, for other reasons. Most strings are not filesystem paths, so it's better to use a portable type for those rather than OsString.
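The first point can be seen with standard-library calls alone. In this sketch, `utf16_units` is a made-up helper name; `char::len_utf16` and `str::encode_utf16` are real standard-library methods.

```rust
/// Number of 16-bit code units needed to encode `s` as UTF-16.
/// Hypothetical helper for illustration.
pub fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

/// True if every `char` in `s` fits in a single UTF-16 code unit,
/// i.e. indexing by code unit would line up with indexing by `char`.
pub fn is_fixed_width_in_utf16(s: &str) -> bool {
    s.chars().all(|c| c.len_utf16() == 1)
}
```

A string like "a😀" has two chars but three UTF-16 code units, because the emoji needs a surrogate pair. Indexing a UTF-16 string by position therefore has the same boundary problem as indexing a UTF-8 string.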

I don't understand what LLMs have to do with this.
