Shower thought: substrings and subslices

I have a "shower thought" question...

There are many slice and string functions that return slices and strings. Some, like str::as_bytes and OsStr::new, are cheap conversions from one type to another. Others, like trim and split_at, return sub-slices and sub-strings of their input.

Is there any general guarantee that these functions will create their output from their input? Will they always return sub-slices and sub-strings of their input arguments? Or do they have the freedom to return equivalent objects not derived from their input—for example, returning references to static or interned values?


trim

pub fn trim(&self) -> &str

Returns a string slice with leading and trailing whitespace removed.

Can trimming a string that has no whitespace return a different string? Can this assertion fail?

let s = "foo";
assert_eq!(s.as_ptr(), s.trim().as_ptr());


I'm not saying it'd be a good idea, but would it be permissible for trim to return a reference to a different interned "foo" if it recognized that it had already trimmed that input before?
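One practical reason the question matters: code sometimes recovers the byte offset of a trimmed string by comparing addresses. Here is a minimal sketch of that pattern. offset_in is a hypothetical helper, not a std function, and the example relies on trim returning a sub-slice of its input, which is exactly the assumption being questioned here.

```rust
/// Hypothetical helper (not part of std): if `inner`'s address range lies
/// inside `outer`'s, return the byte offset of `inner` within `outer`.
/// This compares addresses only; it says nothing about provenance.
fn offset_in(outer: &str, inner: &str) -> Option<usize> {
    let outer_start = outer.as_ptr() as usize;
    let inner_start = inner.as_ptr() as usize;
    // `None` if `inner` starts before `outer`.
    let offset = inner_start.checked_sub(outer_start)?;
    if offset + inner.len() <= outer.len() {
        Some(offset)
    } else {
        None
    }
}

fn main() {
    let s = String::from("  foo  ");
    // Assumption: trim returns a sub-slice of its input.
    println!("{:?}", offset_in(&s, s.trim()));

    // A separate allocation with equal contents is not inside `s`.
    let other = String::from("foo");
    println!("{:?}", offset_in(&s, &other));
}
```

If trim were allowed to hand back an interned "foo", the first call would return None even though the contents match. Note also that an empty inner string can pass the address check by coincidence.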


split_at

pub fn split_at(&self, mid: usize) -> (&[T], &[T])

Divides one slice into two at an index.

The wording strongly implies that the output slices are sub-slices of the input. Does splitting a slice always return sub-slices? Is it possible for the first assertion below to fail? Is the second assertion safe?

let s1 = &b"foo"[..];
let s2 = s1.split_at(0).0;
assert_eq!(s1.as_ptr(), s2.as_ptr());
assert_eq!(s1, unsafe { std::slice::from_raw_parts(s2.as_ptr(), s1.len()) });


Hypothetically, would it be legal for split_at to have an optimization like this at the top?

pub fn split_at(&self, mid: usize) -> (&[T], &[T]) {
    if mid == 0 {
        return (&[], self);
    }
    ...
}
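
To make the hypothetical concrete, here is a runnable sketch written as a free function. split_at_hypothetical is an invented name and this is not how std implements split_at; it only shows that such a variant would compile and would be indistinguishable by value.

```rust
/// Hypothetical variant of split_at, written as a free function.
/// This is NOT std's implementation; it only illustrates the question.
fn split_at_hypothetical<T>(s: &[T], mid: usize) -> (&[T], &[T]) {
    if mid == 0 {
        // A promoted empty array: valid for any lifetime, not derived from `s`.
        let empty: &[T] = &[];
        return (empty, s);
    }
    s.split_at(mid)
}

fn main() {
    let s1 = &b"foo"[..];
    let (head, tail) = split_at_hypothetical(s1, 0);
    // By value, the result matches what std's split_at returns.
    assert!(head.is_empty());
    assert_eq!(tail, s1);
    // By address, nothing is promised here, so either value may be printed.
    println!("same address: {}", head.as_ptr() == s1.as_ptr());
}
```

With a variant like this, the first assertion in the example above could fail, and the head pointer would have no connection to the bytes of s1.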

from_utf8_lossy

pub fn from_utf8_lossy(v: &[u8]) -> Cow<'_, str>

Converts a slice of bytes to a string, including invalid characters.

When converting bytes to a string, if the bytes are valid UTF-8, will the str be created from the &[u8]?

use std::borrow::Cow;

let s = "foo";
if let Cow::Borrowed(b) = String::from_utf8_lossy(s.as_bytes()) {
    assert_eq!(s.as_ptr(), b.as_ptr());
}


It seems like it. This test passes. But hang on—

If you change let s = "foo"; to let s = ""; it fails. Why? Because from_utf8_lossy has a code path where it returns Cow::Borrowed(""). You'll notice the "" reference isn't derived from the input slice v.

Is that a bug? Or is it acceptable behavior?

If it is acceptable, what does it say about the other cases above?
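
For anyone who wants to poke at this, here is a small sketch that reports what from_utf8_lossy does for a few inputs without asserting anything about addresses. borrowed_at_input is a hypothetical helper name; the address result for the empty input depends on the std version in use, so no expected output is given.

```rust
use std::borrow::Cow;

/// Hypothetical helper: `None` if the result is owned; otherwise
/// `Some(flag)`, where `flag` says whether the borrowed str starts at the
/// same address as the input bytes.
fn borrowed_at_input(input: &[u8]) -> Option<bool> {
    match String::from_utf8_lossy(input) {
        Cow::Borrowed(b) => Some(b.as_ptr() == input.as_ptr()),
        Cow::Owned(_) => None,
    }
}

fn main() {
    println!("{:?}", borrowed_at_input(b"foo")); // valid UTF-8
    println!("{:?}", borrowed_at_input(b"")); // the empty case discussed above
    println!("{:?}", borrowed_at_input(b"fo\xFFo")); // invalid UTF-8, so owned
}
```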

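The second assertion in the split_at example is considered UB under the (still experimental) Stacked Borrows rules, since s2 only grants read access to its own 0 bytes, not to the 3 bytes of s1. Indeed, Miri throws an error when -Zmiri-tag-raw-pointers is enabled: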

test split_at ... error: Undefined Behavior: trying to reborrow <185788> for SharedReadOnly permission at alloc1[0x0], but that tag does not exist in the borrow stack for this location
  --> /home/lm978/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/slice/raw.rs:97:9
   |
97 |         &*ptr::slice_from_raw_parts(data, len)
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |         |
   |         trying to reborrow <185788> for SharedReadOnly permission at alloc1[0x0], but that tag does not exist in the borrow stack for this location
   |         this error occurs as part of a reborrow at alloc1[0x0..0x3]
   |
   = help: this indicates a potential bug in the program: it performed an invalid operation, but the rules it violated are still experimental
   = help: see https://github.com/rust-lang/unsafe-code-guidelines/blob/master/wip/stacked-borrows.md for further information
           
   = note: inside `std::slice::from_raw_parts::<u8>` at /home/lm978/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/slice/raw.rs:97:9
note: inside `split_at` at src/lib.rs:8:9
  --> src/lib.rs:8:9
   |
8  |         std::slice::from_raw_parts(s2.as_ptr(), s1.len())
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside closure at src/lib.rs:2:1
  --> src/lib.rs:2:1
   |
1  |   #[test]
   |   ------- in this procedural macro expansion
2  | / fn split_at() {
3  | |     let s1 = &b"foo"[..];
4  | |     let s2 = s1.split_at(0).0;
5  | |
...  |
9  | |     });
10 | | }
   | |_^
   = note: this error originates in the attribute macro `test` (in Nightly builds, run with -Z macro-backtrace for more info)

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

error: aborting due to previous error
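
For contrast, a sketch of the same reconstruction done with a pointer taken from the full slice rather than from the empty head. rebuild is a hypothetical helper name. The pointer here is derived from a reference that covers every byte being read, which is what the s2 version lacks.

```rust
/// Hypothetical helper: rebuilds a slice from its own pointer and length.
fn rebuild(parent: &[u8]) -> &[u8] {
    let ptr = parent.as_ptr();
    // SAFETY: `ptr` and the length both come from `parent`, so the pointer
    // is valid for reads of `parent.len()` bytes for the borrow's lifetime.
    unsafe { std::slice::from_raw_parts(ptr, parent.len()) }
}

fn main() {
    let s1 = &b"foo"[..];
    let s2 = s1.split_at(0).0;
    assert!(s2.is_empty());
    // Uses s1's pointer, not s2's.
    assert_eq!(s1, rebuild(s1));
}
```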

I'll assume you mean "under the guarantees of the stdlib". In that case, one could consider cases like Cow::Borrowed("") to be a violation of the documentation. You can file a ticket and the libs team will presumably do one of

  • Agree and change the impl, or
  • Agree but not consider it a big deal and update the documentation to clarify the freedom to do what they're doing now, or
  • Disagree about the meaning of the documentation

On a practical but prudent level, when it comes to the standard library, I would personally go with:

  • Don't assume anything with slices of ZSTs
  • Don't assume empty slices are related
  • Otherwise assume the returned slices/elements are related
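
Those three rules can be sketched as an address check that refuses to answer for ZSTs and empty slices. looks_like_subslice is a hypothetical helper, not a std function, and a true result only means the addresses line up; it is not proof of provenance.

```rust
/// Hypothetical helper following the three rules above.
fn looks_like_subslice<T>(outer: &[T], inner: &[T]) -> bool {
    // Rules 1 and 2: addresses of ZST slices and empty slices tell us nothing.
    if std::mem::size_of::<T>() == 0 || inner.is_empty() {
        return false;
    }
    // Rule 3: otherwise, check that `inner` lies within `outer` by address.
    let outer = outer.as_ptr_range();
    let inner = inner.as_ptr_range();
    outer.start <= inner.start && inner.end <= outer.end
}

fn main() {
    let v = vec![1u8, 2, 3, 4];
    println!("{}", looks_like_subslice(&v, &v[1..3]));
    println!("{}", looks_like_subslice(&v, &v[..0]));
}
```

Returning false for the empty and ZST cases is a conservative choice: it gives up on some genuinely related slices in exchange for never reporting a coincidental match.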

Related reading can be found in the comments and reviews here; e.g.

  • With ZSTs, most address meaning is bunk
  • With empty slices, you can't be sure of provenance even if the addresses seem legit
    • You may not care about false positives though