Preserve the String object from File::lines, but only pass a slice on

Hi,

So this may very well be a recurring topic, and I apologize in advance if it has been answered many, many times before... It feels like a problem others must have stumbled upon, and maybe I should have asked before implementing my own solution from scratch :slight_smile:

So here's the thing: I'm parsing a file with a format that is heavily line-based, and I have written a deserialization function that reads an iterator of lines, treats them as string slices, and builds a structure that may contain smaller slices of those lines. Obviously, this means that the built structure has a lifetime related to the lifetime of the input data - and therein lies the problem.

If I read the full contents of the file into memory, I have a String object that I own, and I can pass str::lines to my deserialization function. However, if I want to read the file line by line, e.g. from a subprocess or from a decompression function or if the file is simply a bit too large, then there seems to be a problem:

  • BufRead::lines and similar methods return Strings - quite understandably! There is no pre-owned storage to point at!
  • if I use those as iterators, and I pass that String as a slice to my deserialization function, the compiler obviously complains that the slice refers to a String that will be dropped very, very soon
  • ...unless I collect all the Strings into an array and keep it until I'm done, but that kind of defeats the purpose of not reading the full file's contents into memory!
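To make the three points above concrete, here is a minimal sketch. The `Field` type and the `parse_field` / `parse_all` functions are hypothetical stand-ins for the deserialization code, not the real thing. It shows that borrowing from one owned `String` works, while the streaming case only compiles if every line is kept alive, which is exactly the workaround that defeats the purpose.

```rust
use std::io::{BufRead, Cursor};

/// Hypothetical borrowed record: both fields point into one input line.
#[derive(Debug, PartialEq)]
struct Field<'a> {
    key: &'a str,
    value: &'a str,
}

/// Hypothetical parser: the result borrows from the line it is given.
fn parse_field(line: &str) -> Option<Field<'_>> {
    let (key, value) = line.split_once(':')?;
    Some(Field { key: key.trim(), value: value.trim() })
}

/// The parsed structure cannot outlive the lines it was built from.
fn parse_all<'a>(lines: impl Iterator<Item = &'a str>) -> Vec<Field<'a>> {
    lines.filter_map(parse_field).collect()
}

fn demo() -> (usize, usize) {
    // Case 1: the whole input is one owned String, so `str::lines` is enough.
    let contents = String::from("Package: hello\nVersion: 1.0\n");
    let from_owned = parse_all(contents.lines());

    // Case 2: `BufRead::lines` yields a fresh String per line. Borrowing from
    // them only works if all of them are kept until parsing is finished.
    let reader = Cursor::new(contents.as_bytes());
    let kept: Vec<String> = reader
        .lines()
        .map(|line| line.expect("reading from memory cannot fail"))
        .collect();
    let from_stream = parse_all(kept.iter().map(String::as_str));

    (from_owned.len(), from_stream.len())
}
```

Dropping `kept` before `from_stream` is used would be rejected by the borrow checker, which is the complaint described in the second point.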

So I guess what I'm asking for is, is there a way to preserve the contents of the String, but let the deserialization function use it as a slice? My first thought was to use Arc or something similar, but whatever I do in a function that handles successive lines and provides them to the iterator, any objects I create there will be dropped very, very soon, just as the String itself.

So my solution was to go off and implement a new trait that provides a very small subset of the str methods - just as much as I need for this particular project - and proxies them to bytes::Bytes internal storage that is, hopefully, always a valid UTF-8-encoded string. Of course, I'm aware of a couple of serious drawbacks of this method:

  • an implementation of this trait cannot simply be a Deref for &str, since the str methods are defined to return str objects, which would drop the ownership; I have to proxy them into returning Self instead
  • this means that I have to implement as many of the str methods as I - or others - may possibly want to use at some future point, and even though the implementations are usually trivial, it still feels sort of wrong
  • there are some str methods that I simply cannot proxy in the same way, e.g. ones using the std::str::pattern::Pattern trait, or at least I cannot implement them in a no-std library, since core::str::pattern is still experimental :slight_smile:
  • there will need to be all kinds of #[cfg(..)] shenanigans related to different Rust versions in the future; right now I'm really happy that str::from_utf8_unchecked() is stable in 1.87, and that's what I'm targeting

All that said, what I have so far is the StrLike and AdoptableStr traits and the StrOfBytes implementation in my - still unreleased - str-of-bytes library; for the moment it is only used in another still unreleased module, facet-deb822 ...but the main point of this post is to ask what have I missed, is there already an implementation of something like that - a struct that behaves as much like str as possible, but retains ownership of the data? As pointed out above, implementations that provide a Deref are not enough, since the ownership will be dropped as soon as the deserialization library invokes .split_once() or .strip_prefix() or something like that.

Of course, "oh come on, you're looking at it totally the wrong way! here's a much better way to do what you really need" answers will also be welcome :slight_smile: And yes, I know how to use parser combinators, but I think that at least nom has the same issue - a string slice in the result has no knowledge of the memory storage it refers to.

Thanks in advance for any insights!

This was confusing phrasing to me until I grokked your code. Ownership isn't dropped, a borrow is returned. What you meant (for the sake of other readers) is that you want an API like

impl StrLike for StrOfBytes {
    fn trim_ascii(&self) -> Self { ... }
}

where substring-returning methods return some shared-ownership type instead of a borrow.
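As a rough illustration of that shape using only the standard library, here is a sketch. `SharedStr` is a hypothetical type (an `Arc<str>` plus a byte range), not the `StrOfBytes` from the post, and it only covers two methods.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical shared-ownership slice: a reference-counted buffer plus the
/// byte range of that buffer which this value represents.
#[derive(Clone, Debug)]
struct SharedStr {
    buf: Arc<str>,
    range: Range<usize>,
}

fn is_ascii_ws(c: char) -> bool {
    c.is_ascii_whitespace()
}

impl SharedStr {
    fn new(s: &str) -> Self {
        Self { buf: Arc::from(s), range: 0..s.len() }
    }

    fn as_str(&self) -> &str {
        // Checked indexing: a bad range would panic, never cause UB.
        &self.buf[self.range.clone()]
    }

    /// New value for the subslice `start..start + len`, relative to `self`.
    fn sub(&self, start: usize, len: usize) -> Self {
        let begin = self.range.start + start;
        Self { buf: Arc::clone(&self.buf), range: begin..begin + len }
    }

    /// Like `str::trim_ascii`, but the result keeps the buffer alive.
    fn trim_ascii(&self) -> Self {
        let s = self.as_str();
        let head = s.trim_start_matches(is_ascii_ws);
        let start = s.len() - head.len();
        let trimmed = head.trim_end_matches(is_ascii_ws);
        self.sub(start, trimmed.len())
    }

    /// Like `str::split_once` with a `char` delimiter.
    fn split_once(&self, delim: char) -> Option<(Self, Self)> {
        let s = self.as_str();
        let pos = s.find(delim)?;
        let after = pos + delim.len_utf8();
        Some((self.sub(0, pos), self.sub(after, s.len() - after)))
    }
}

fn demo() -> (String, String) {
    let (key, value) = {
        let line = SharedStr::new("Package:  hello \n");
        let (k, v) = line.split_once(':').expect("line has a colon");
        (k, v.trim_ascii())
        // `line` is dropped here; the buffer lives on through `k` and `v`.
    };
    (key.as_str().to_owned(), value.as_str().to_owned())
}
```

The point is the return types: every substring-producing method hands back `Self`, so a result can outlive the value it was split from.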

arcstr exists as an alternative to bytes, but it doesn't try to mirror all of str's methods. Like bytes, the idea is that you would use substr_from and friends whenever you need to get back to something owned, instead of trying to mirror every str / [u8] method. I don't know how it compares to bytes comprehensively, but you could consider it as an alternative for your innards.

I personally don't know of something that tries to mirror all of str's methods and provide shared ownership. bstr did the former, so it's not totally unprecedented. Those were concrete types and not a trait though, FWIW.[1]

You can perhaps minimize cfg shenanigans with a combination of an MSRV policy, implementing some new methods manually, and patience.


A code review, though that's not really what you asked.

You can get rid of this unsafe.

 pub const fn str_end(value: &str) -> *const u8 {
-    // SAFETY: This is a string slice; its memory must be linear... we hope!
-    unsafe { value.as_ptr().add(value.len()) }
+    value.as_bytes().as_ptr_range().end
 }

Over here...

    pub fn as_str(&self) -> &str {
        // SAFETY: we *hope* all our data comes from valid string slices.
        unsafe { str::from_utf8_unchecked(&self.data) }
    }

Hope is a poor basis for soundness. That said, I didn't see a way to get non-str data into your Bytes with the code that's there so far. Either upgrade that to an invariant of your type (and reflect that in the safety comment), or if it's not an invariant, make the method unsafe and document the requirement for downstream consumers.
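A minimal sketch of the "invariant of your type" option, assuming a hypothetical `Utf8Bytes` type backed by `Arc<[u8]>` rather than the real `StrOfBytes` over `bytes::Bytes`: the field is private, every constructor goes through a `&str`, and the safety comment cites that instead of hoping.

```rust
use std::str::Utf8Error;
use std::sync::Arc;

/// Hypothetical stand-in for a bytes-backed string type.
///
/// Invariant: `data` always holds valid UTF-8. The field is private and every
/// constructor below copies from a `&str`, so no other code can break it.
#[derive(Clone)]
pub struct Utf8Bytes {
    data: Arc<[u8]>,
}

impl Utf8Bytes {
    /// Infallible: a `&str` is already known to be valid UTF-8.
    pub fn from_str_slice(s: &str) -> Self {
        Self { data: Arc::from(s.as_bytes()) }
    }

    /// Fallible constructor for raw bytes: validation happens once, here.
    pub fn from_utf8(bytes: &[u8]) -> Result<Self, Utf8Error> {
        std::str::from_utf8(bytes).map(Self::from_str_slice)
    }

    pub fn as_str(&self) -> &str {
        // SAFETY: by the type invariant documented on the struct, `data` is
        // valid UTF-8; it is only ever written by the constructors above.
        unsafe { std::str::from_utf8_unchecked(&self.data) }
    }
}
```

Any future constructor or mutating method then has to uphold the same invariant, which is what makes the `unsafe` block reviewable in isolation.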

    /// Adopt a string, either unchecked or copying it, depending on the `adopt-checks` feature.
    #[cfg(feature = "adopt-checks")]
    fn adopt_(&self, value: &str) -> Self {
        self.adopt_or_copy_(value)
    }

    #[cfg(not(feature = "adopt-checks"))]
    fn adopt_(&self, value: &str) -> Self {
        self.adopt_unchecked_(value)
    }

    // ...

pub trait AdoptableStr: Sized {
    /// Attempt to adopt a string slice if it is within our memory range.
    fn try_adopt(&self, value: &str) -> Option<Self>;

    /// Adopt a string slice if possible, otherwise copy it.
    fn adopt_or_copy(&self, value: &str) -> Self;
}

Features are supposed to be additive, but your change in behavior is more exclusive. That is, if I don't include the adopt-checks feature flag and expect panicking sometimes, but I also have a dep that happens to also use your crate and enables adopt-checks, I'm not going to get the behavior I expect.

So IMO just have separate methods:

// Or directly on the type if you drop the trait idea
pub trait AdoptableStr: Sized {
    /// Attempt to adopt a string slice if it is within our memory range.
    fn try_adopt(&self, value: &str) -> Option<Self>;

    /// Adopt a string slice if it is within our memory range, otherwise panic.
    fn adopt(&self, value: &str) -> Self {
        self.try_adopt(value).expect("...")
    }

    /// Adopt a string slice if possible, otherwise copy it.
    fn adopt_or_copy(&self, value: &str) -> Self {
        self.try_adopt(value).unwrap_or_else(|| Self::copy_from_slice(value))
    }
}
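For readers wondering what `try_adopt` might look like underneath, here is a sketch under assumptions: `SharedStr` is a hypothetical `Arc<str>`-plus-range type, the methods are inherent rather than on a trait, and the copying fallback is `SharedStr::new`. It checks whether the slice's address range falls inside the buffer and, if so, shares the buffer instead of copying.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical shared-ownership slice: a buffer plus a byte range into it.
#[derive(Clone, Debug)]
struct SharedStr {
    buf: Arc<str>,
    range: Range<usize>,
}

impl SharedStr {
    /// Copies `s` into a fresh buffer.
    fn new(s: &str) -> Self {
        Self { buf: Arc::from(s), range: 0..s.len() }
    }

    fn as_str(&self) -> &str {
        &self.buf[self.range.clone()]
    }

    /// Adopt `value` if it points into our buffer, sharing instead of copying.
    fn try_adopt(&self, value: &str) -> Option<Self> {
        let whole: &str = &self.buf;
        let base = whole.as_ptr() as usize;
        // Byte offset of `value` inside the buffer, if it starts at or after it.
        let start = (value.as_ptr() as usize).checked_sub(base)?;
        let end = start.checked_add(value.len())?;
        if end > whole.len() {
            return None;
        }
        Some(Self { buf: Arc::clone(&self.buf), range: start..end })
    }

    /// Adopt `value`, panicking if it is not within our memory range.
    fn adopt(&self, value: &str) -> Self {
        self.try_adopt(value).expect("slice does not point into this buffer")
    }

    /// Adopt `value` if possible, otherwise copy it.
    fn adopt_or_copy(&self, value: &str) -> Self {
        self.try_adopt(value).unwrap_or_else(|| Self::new(value))
    }
}
```

With three separate methods the caller picks the behavior at the call site, so no feature flag can change it behind their back.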

Incidentally, inherent methods are preferred over trait methods during method resolution, so you probably don't need all those trailing _s like const fn len_.
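A tiny demonstration of that resolution rule, with hypothetical names (`Wrapper`, `which`): the inherent method and the trait method share a name, plain method-call syntax picks the inherent one, and the trait method stays reachable with a fully qualified call.

```rust
struct Wrapper;

trait StrLike {
    fn which(&self) -> &'static str;
}

impl Wrapper {
    /// Inherent method; unlike a trait method it can be `const` on stable.
    const fn which(&self) -> &'static str {
        "inherent"
    }
}

impl StrLike for Wrapper {
    fn which(&self) -> &'static str {
        "trait"
    }
}

fn demo() -> (&'static str, &'static str) {
    let w = Wrapper;
    // Method-call syntax prefers the inherent method; the qualified path
    // selects the trait method explicitly.
    (w.which(), StrLike::which(&w))
}
```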

Alternatively, drop the traits and you'll have just one method. Or combine your traits, and you can still have only one method in most places, provided as a default on the trait:

pub trait StrLike: AsRef<str> + Clone + Sized {
    fn adopt(&self, value: &str) -> Self;

    fn trim_ascii(&self) -> Self {
        self.adopt(self.as_ref().trim_ascii())
    }

    // ...
}
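To show the default-method idea end to end, here is a sketch with a deliberately trivial implementer: `String`, whose `adopt` simply copies. That implementer is an assumption for illustration only; a shared-ownership type would share its buffer in `adopt` instead. `trim_ascii` is written with `trim_matches` so the sketch does not depend on a recent `str::trim_ascii`.

```rust
/// Combined trait: one required method, everything else defaulted on top.
pub trait StrLike: AsRef<str> + Clone + Sized {
    /// Turn a subslice of `self` back into an owned `Self`.
    fn adopt(&self, value: &str) -> Self;

    fn trim_ascii(&self) -> Self {
        let trimmed = self
            .as_ref()
            .trim_matches(|c: char| c.is_ascii_whitespace());
        self.adopt(trimmed)
    }

    fn strip_prefix(&self, prefix: &str) -> Option<Self> {
        self.as_ref().strip_prefix(prefix).map(|rest| self.adopt(rest))
    }
}

/// Illustrative implementer only: adopting into a `String` always copies.
impl StrLike for String {
    fn adopt(&self, value: &str) -> Self {
        value.to_owned()
    }
}
```

Since `str` has methods with the same names, calling through the trait explicitly, e.g. `StrLike::trim_ascii(&line)`, keeps it obvious which one is meant.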

  1. Do you expect multiple implementers?
