This question has come up multiple times in different forms. I just want to confirm that there is no Rust solution for code like the one below:
pub fn adjust_separator<'long>(path: &'long mut String) -> &'long String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' { '/' } else { '\\' };
    for c in 0..path.len() {
        if path[c] == foreign_slash { path[c] = MAIN_SEPARATOR }
    }
    &path
}
Obviously I got rustc suggestions like below:
= help: the trait `SliceIndex<str>` is not implemented for `usize`, which is required by `String: Index<_>`
= help: the trait SliceIndex<[_]> is implemented for usize
= help: for that trait implementation, expected [_], found str
= note: required for String to implement Index<usize>
My replacement char has exactly the same size in any Unicode encoding, so the replacement should not cause the String to grow or shrink at all.
As pointed out already, as_mut_vec is the easiest way to do it. It is unsafe because the compiler cannot guarantee that the end result of the modification is a valid UTF-8 string; ensuring that is up to your code.
e.g.
pub fn adjust_separator<'long>(path: &'long mut String) -> &'long String {
    // b'/' and b'\\' are byte literals: the ASCII value of the character.
    let foreign_slash = if MAIN_SEPARATOR == '\\' { b'/' } else { b'\\' };
    // Sound in this case: we only replace one ASCII byte with another,
    // which cannot break UTF-8 well-formedness.
    let bytes = unsafe { path.as_mut_vec() };
    for c in 0..bytes.len() {
        if bytes[c] == foreign_slash {
            bytes[c] = MAIN_SEPARATOR as u8;
        }
    }
    path
}
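As a quick sanity check that the byte-level swap leaves multibyte characters untouched, here is a self-contained sketch of the same approach. MAIN_SEPARATOR is pinned to '/' purely so the example behaves identically on every platform; real code would use std::path::MAIN_SEPARATOR, and the file path is made up.

```rust
// Pinned stand-in for std::path::MAIN_SEPARATOR so the example is
// platform-independent; real code would import the constant instead.
const MAIN_SEPARATOR: char = '/';

fn adjust_separator(path: &mut String) -> &String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' { b'/' } else { b'\\' };
    // Sound here: we only overwrite one ASCII byte with another ASCII
    // byte, which cannot invalidate UTF-8.
    let bytes = unsafe { path.as_mut_vec() };
    for b in bytes.iter_mut() {
        if *b == foreign_slash {
            *b = MAIN_SEPARATOR as u8;
        }
    }
    path
}

fn main() {
    let mut p = String::from("C:\\Users\\日本語\\file.txt");
    // The multibyte characters pass through unchanged.
    assert_eq!(adjust_separator(&mut p), "C:/Users/日本語/file.txt");
}
```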
Well, that's never the case, but I'll take the bait this time.
Assuming you don't care about Unicode, you can do this:
pub fn adjust_separator(mut path: String) -> String {
    // a = the separator to replace, b = the platform separator.
    let (a, b) = if MAIN_SEPARATOR == '/' { ("\\", "/") } else { ("/", "\\") };
    for c in 0..path.len() {
        let c = c..c + 1;
        if path[c.clone()] == *a {
            path.replace_range(c, b)
        }
    }
    path
}
I changed the signature because &mut String -> &String like that is not useful. If you want to modify an existing string, then you want either &mut String -> () or String -> String.
Though really, as always for string processing you should just use a regex. Or use a real path library instead of trying to do path handling in strings.
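To illustrate the path-library route: std::path composes and decomposes paths with the platform separator for you, so most code never needs to touch slashes at all. A minimal sketch (the file names are made up):

```rust
use std::path::PathBuf;

fn main() {
    // push() inserts the platform's separator, so no manual slash
    // handling is needed when composing paths.
    let mut p = PathBuf::from("logs");
    p.push("2024");
    p.push("app.log");

    // components() parses the path back, regardless of which
    // separator the platform uses.
    let parts: Vec<String> = p
        .components()
        .map(|c| c.as_os_str().to_string_lossy().into_owned())
        .collect();
    assert_eq!(parts, ["logs", "2024", "app.log"]);
}
```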
This shouldn't be a problem in UTF-8 (the encoding Rust uses) if you are only swapping \ (byte 92, or 0b01011100) and / (byte 47, or 0b00101111).
As you point out, the most significant bit (8th bit if reading from right to left) is always 0, so they cannot be contained in a multi-byte character, which always has the MSB set to 1 in UTF-8.
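A small demonstration of that point: every byte of a multibyte UTF-8 sequence has its most significant bit set, so it can never collide with an ASCII byte such as b'/' (0x2F) or b'\\' (0x5C).

```rust
fn main() {
    // "日" encodes to three bytes in UTF-8: 0xE6 0x97 0xA5.
    // Every one of them has the MSB set (>= 0x80).
    assert!("日".bytes().all(|b| b >= 0x80));

    // Therefore no ASCII byte can appear inside a multibyte sequence:
    let s = "ファイル\\パス/名前";
    let slashes = s.bytes().filter(|&b| b == b'/' || b == b'\\').count();
    // Exactly the two separators we wrote, nothing from the kana/kanji.
    assert_eq!(slashes, 2);
}
```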
This one looks like a solution. I haven't measured its performance against the alternatives, but the important part, no string reallocation, is there. Thanks.
Indeed, @jumpnbrownweasel's solution survived Unicode data when the others failed. It's a little disappointing that such a simple task can't be solved with the initial code snippet.
You could do it using safe code if you validate the bytes as UTF-8:
pub fn adjust_separator(path: String) -> String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' { '/' } else { '\\' };
    let mut vec = path.into_bytes();
    for c in 0..vec.len() {
        if vec[c] == foreign_slash as u8 {
            vec[c] = MAIN_SEPARATOR as u8;
        }
    }
    String::from_utf8(vec).expect("adjust_separator produced invalid UTF-8")
}
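A quick usage check of the safe into_bytes/from_utf8 approach, with MAIN_SEPARATOR pinned to '/' so the result is deterministic across platforms (real code would use std::path::MAIN_SEPARATOR; the path is invented):

```rust
const MAIN_SEPARATOR: char = '/'; // pinned for the example only

fn adjust_separator(path: String) -> String {
    let foreign_slash = if MAIN_SEPARATOR == '\\' { '/' } else { '\\' };
    let mut vec = path.into_bytes();
    for b in vec.iter_mut() {
        if *b == foreign_slash as u8 {
            *b = MAIN_SEPARATOR as u8;
        }
    }
    // from_utf8 re-validates the bytes; it cannot fail here because we
    // only swapped one ASCII byte for another.
    String::from_utf8(vec).expect("adjust_separator produced invalid UTF-8")
}

fn main() {
    let p = adjust_separator(String::from("dir\\sub\\файл.txt"));
    assert_eq!(p, "dir/sub/файл.txt");
}
```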
The difficulty comes from the UTF-8 encoding, since chars are variable length. If UTF-32 were used, the initial code snippet could work, but that would be extremely wasteful of space: ~4x.
It seems the initial design of Rust wasn't future-proof. Back then a MacBook had only 2 GB of RAM and only a few people worked on LLMs; now a MacBook has 16 GB and almost everyone works on LLMs. Even Java had a u16-based string back in 1995. In Rust's case, having OsString play the role of String, and String be a UTF string, would be more reasonable for most use cases. Anyway, we should provide the best solution within the current design.
It was designed with a UTF-8 everywhere mindset. Will it hold up? Time will tell. "Future proof" in the 1990s was UCS-2, still with us in various tech/forms such as Windows path encodings.
Java uses UTF-16, but UTF-16 is a variable-length encoding just like UTF-8 is, so switching from UTF-8 to UTF-16 wouldn't help with the original code.
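To make the variable-length point concrete: any character outside the Basic Multilingual Plane takes two 16-bit code units in UTF-16 (a surrogate pair), so indexing by code unit has the same hazard as indexing UTF-8 by byte.

```rust
fn main() {
    let mut buf = [0u16; 2];
    // U+1F600 is outside the BMP: two UTF-16 code units.
    assert_eq!('😀'.encode_utf16(&mut buf).len(), 2);
    // ASCII (and anything in the BMP) fits in one unit.
    assert_eq!('a'.encode_utf16(&mut buf).len(), 1);
    // The same char is four bytes in UTF-8.
    assert_eq!('😀'.len_utf8(), 4);
}
```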
The world has generally shifted away from UTF-16 towards UTF-8, so judging from adoption, UTF-16 isn't more "future proof" than UTF-8, it's the other way around.
OsString also doesn't provide a mechanism to arbitrarily modify bytes, so switching to OsString also wouldn't make the original code work. Except of course it's better to use OsString, or better yet PathBuf to represent filesystem paths rather than String, for other reasons. Most strings are not filesystem paths, so it's better to use a portable type for those rather than OsString.
I don't understand what LLMs have to do with this.