How can I replace the std::mem::transmute() here?

struct CPReader<'a, R> {
    read: R,
    buf: String,
    iter: str::SplitWhitespace<'a>,
}

impl<'a, R: io::BufRead> CPReader<'a, R> {
    fn new(read: R) -> Self {
        Self { read, buf: String::new(), iter: "".split_whitespace() }
    }
    fn get<T: str::FromStr>(&mut self) -> T {
        let mut it = self.iter.next();
        while let None = it {
            self.buf.clear();
            self.read.read_line(&mut self.buf).unwrap();
            self.iter = unsafe {
                std::mem::transmute(self.buf.split_whitespace())
            };
            it = self.iter.next();
        }
        it.unwrap().parse().ok().unwrap()
    }
}

You can't (and shouldn't) create self-referential types in safe Rust. Take the string buffer out of the struct.

In order to make sure this is sound, you must use AliasableString instead of String to prevent aliasing issues invalidating your str::SplitWhitespace. Additionally, you don’t need the lifetime parameter on the struct because the lifetime is internal, not something it depends on. And the str::SplitWhitespace field must come before the String field, so that it gets dropped first and you don’t end up with a dangling reference in the destructor.
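To illustrate the point about field order: struct fields are dropped in declaration order, so whichever field is declared first is destroyed first. The sketch below only demonstrates that ordering; `Noisy`, `Pair` and `drop_order` are hypothetical names made up for this example and are not part of the original code.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Records its name in a shared log when it is dropped.
struct Noisy {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Field order mirrors the advice above: the "borrower" comes first,
/// the "owner" second.
#[allow(dead_code)]
struct Pair {
    iter_like: Noisy,
    buf_like: Noisy,
}

/// Builds a `Pair`, drops it, and returns the order in which its fields
/// were dropped.
fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let pair = Pair {
        iter_like: Noisy { name: "iter", log: Rc::clone(&log) },
        buf_like: Noisy { name: "buf", log: Rc::clone(&log) },
    };
    drop(pair);
    let order = log.borrow().clone();
    order
}
```

Swapping the two field declarations in `Pair` would swap the recorded order, which is exactly the situation where a destructor could observe a dangling reference.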

However, overall self-referential structs are best avoided. Instead, for this type I would recommend creating the iterator in each call to get:

struct CPReader<R> {
    read: R,
    buf: String,
    offset: usize,
}

impl<R: io::BufRead> CPReader<R> {
    fn new(read: R) -> Self {
        Self { read, buf: String::new(), offset: 0 }
    }
    fn get<T: str::FromStr>(&mut self) -> T {
        while self.offset == self.buf.len() {
            self.buf.clear();
            self.read.read_line(&mut self.buf).unwrap();
            self.offset = 0;
        }
        let mut iter = self.buf[self.offset..].split_whitespace();
        let item: T = iter.next().unwrap().parse().unwrap();
        self.offset = (iter.as_str() as *const str as *const () as usize)
            - (&*self.buf as *const str as *const () as usize);
        item
    }
}

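I think this can fail even if there are still subsequent lines left: if only whitespace (such as the trailing newline) remains after the offset, iter.next() returns None and the unwrap() panics.

That part uses an unstable API (iter.as_str()), AFAICT, right?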

Ah, that’s true. It would have to be part of the loop I suppose then.

You’re right, I didn’t realize that.

So, bug-fixed version (hopefully):

    fn get<T: str::FromStr>(&mut self) -> T {
        loop {
            let mut iter = self.buf[self.offset..].split_whitespace();
            let segment = match iter.next() {
                Some(segment) => segment,
                None => {
                    self.buf.clear();
                    self.read.read_line(&mut self.buf).unwrap();
                    self.offset = 0;
                    continue;
                },
            };
            self.offset = match iter.next() {
                Some(part) => (part as *const str as *const () as usize)
                    - (&*self.buf as *const str as *const () as usize),
                None => self.buf.len(),
            };
            break segment.parse().unwrap();
        }
    }
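For reference, here is a self-contained sketch of the same offset idea that needs no pointer arithmetic at all. It is a variant, not the code above: it finds the token with trim_start() and find(), stores the offset of the token's end rather than the start of the next token, and it returns an Option so that end of input does not loop forever. The name try_get and the Option return type are assumptions made for this example.

```rust
use std::io::BufRead;
use std::str::FromStr;

struct CPReader<R> {
    read: R,
    buf: String,
    offset: usize, // byte offset of the first unread byte in `buf`
}

impl<R: BufRead> CPReader<R> {
    fn new(read: R) -> Self {
        Self { read, buf: String::new(), offset: 0 }
    }

    /// Returns the next whitespace-separated token parsed as `T`, or `None`
    /// at end of input, on a read error, or if the token fails to parse.
    fn try_get<T: FromStr>(&mut self) -> Option<T> {
        loop {
            let rest = self.buf[self.offset..].trim_start();
            if !rest.is_empty() {
                // `rest` is a suffix of `buf`, so its start is found by length.
                let start = self.buf.len() - rest.len();
                let len = rest.find(char::is_whitespace).unwrap_or(rest.len());
                self.offset = start + len;
                return self.buf[start..start + len].parse().ok();
            }
            // Nothing left on this line: fetch the next one.
            self.buf.clear();
            self.offset = 0;
            if self.read.read_line(&mut self.buf).ok()? == 0 {
                return None; // end of input
            }
        }
    }
}

/// Example use: sums every integer token in `input`.
fn sum_of_tokens(input: &str) -> i64 {
    let mut reader = CPReader::new(input.as_bytes());
    let mut total = 0;
    while let Some(n) = reader.try_get::<i64>() {
        total += n;
    }
    total
}
```

Note that a token which fails to parse is still consumed here; whether that is the right behaviour depends on the use case.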

(Finishing my original answer to this thread just now.)


That's a self-referencing datatype. The iter field contains references into the buf field. There's no particularly good support for this in Rust, but you can build such types nonetheless using dedicated crates, or you can sometimes work around them.

In this particular case, it should be technically possible to keep unconsumed parts of the input in the BufRead, though I don't know about any good preexisting solutions to split up BufReads along (Unicode) whitespace boundaries, and I won't write my own right now.
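To give a rough idea of what "keeping the unconsumed input in the BufRead" could look like, here is a minimal sketch built on the standard fill_buf() and consume() methods. It is deliberately simplified: it splits on ASCII whitespace only, not Unicode whitespace, and next_token is a hypothetical helper name invented for this example.

```rust
use std::io::{self, BufRead};

/// Reads the next token delimited by ASCII whitespace. Everything after the
/// token (including its delimiter) stays unconsumed in the reader.
/// Returns `Ok(None)` at end of input.
fn next_token<R: BufRead>(read: &mut R) -> io::Result<Option<String>> {
    let mut token: Vec<u8> = Vec::new();
    loop {
        let chunk = read.fill_buf()?;
        if chunk.is_empty() {
            break; // end of input
        }
        let mut used = 0; // bytes of this chunk we are done with
        let mut done = false;
        for &b in chunk {
            if b.is_ascii_whitespace() {
                if token.is_empty() {
                    used += 1; // skip leading whitespace
                    continue;
                }
                done = true; // token finished; leave the delimiter in place
                break;
            }
            token.push(b);
            used += 1;
        }
        read.consume(used);
        if done {
            break;
        }
    }
    if token.is_empty() {
        return Ok(None);
    }
    // Tokens are collected as bytes, so validate UTF-8 once at the end.
    String::from_utf8(token)
        .map(Some)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Because the token is accumulated across calls to fill_buf(), it also works when a token straddles two internal buffer refills. The caller would then parse the returned String as needed.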

So perhaps, going back to dedicated crates... I can recommend ouroboros for this, and with it, you can rewrite your original code e.g. as

/**
[dependencies]
ouroboros = "0.15"
*/

use std::{io, str, mem};
use ouroboros::self_referencing;

struct CPReader<R> {
    read: R,
    buf_iter: OwnedSplitWhitespace,
}

#[self_referencing]
struct OwnedSplitWhitespace {
    buf: String,
    #[borrows(buf)]
    #[not_covariant]
    iter: str::SplitWhitespace<'this>
}
impl Default for OwnedSplitWhitespace {
    fn default() -> Self {
        OwnedSplitWhitespace::new(String::new(), |_| "".split_whitespace()) 
    }
}

impl<R: io::BufRead> CPReader<R> {
    fn new(read: R) -> Self {
        Self { read, buf_iter: Default::default() }
    }
    fn get<T: str::FromStr>(&mut self) -> T {
        let mut it = self.buf_iter.with_iter_mut(|iter| iter.next());
        while let None = it {
            let mut buf = mem::take(&mut self.buf_iter).into_heads().buf;
            buf.clear();
            self.read.read_line(&mut buf).unwrap();
            self.buf_iter = OwnedSplitWhitespaceBuilder {
                buf,
                iter_builder: |buf| buf.split_whitespace(),
            }.build();
            it = self.buf_iter.with_iter_mut(|iter| iter.next());
        }
        it.unwrap().parse().ok().unwrap()
    }
}

In the meantime @SabrinaJewson has presented a reasonable approach, too; once that's fully debugged, using an offset like that might be more straightforward.

Why do we need the intermediate as *const () cast?

error[E0606]: casting `*const str` as `usize` is invalid
  --> src/main.rs:32:21
   |
32 |                     (part as *const str as usize)
   |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = help: cast through a thin pointer first

I think the reason for this is that it’s not clear whether an as usize cast on a *const str should return the address of the pointer as an integer or the length of the string; feasibly, it could do either. Going through *const () makes it clear that it’s about the address.

Is there a reason to use as casts rather than str::as_ptr()?
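A small sketch of the distinction, for anyone curious: a *const str is a "wide" pointer carrying both an address and a length, while *const () is a "thin" pointer carrying only an address. The helper names pointer_sizes and addr_and_len are made up for this example, and the claim that the wide pointer is exactly twice the size of the thin one reflects current implementations rather than a documented guarantee.

```rust
use std::mem::size_of;

/// Sizes in bytes of a wide `*const str` and a thin `*const ()`.
fn pointer_sizes() -> (usize, usize) {
    (size_of::<*const str>(), size_of::<*const ()>())
}

/// Splits a string slice into the two pieces of information its
/// pointer carries: the address and the length.
fn addr_and_len(s: &str) -> (usize, usize) {
    let wide: *const str = s; // address + length
    let thin = wide as *const (); // address only, the length is discarded
    (thin as usize, s.len())
}
```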

Nope, just forgot about that function :Ρ

So yeah, (part.as_ptr() as usize) - (self.buf.as_ptr() as usize) would be better. In the future it’d be part.as_ptr().addr() - self.buf.as_ptr().addr().
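Spelled out as a standalone helper, the as_ptr() subtraction could look like the sketch below. offset_in and token_offsets are hypothetical names for this example; the bounds check is an extra safeguard that the one-liner above does not have.

```rust
/// Byte offset of `part` within `parent`, or `None` if `part` does not
/// lie entirely inside `parent`.
fn offset_in(parent: &str, part: &str) -> Option<usize> {
    let parent_addr = parent.as_ptr() as usize;
    let part_addr = part.as_ptr() as usize;
    // checked_sub guards against `part` starting before `parent`.
    let offset = part_addr.checked_sub(parent_addr)?;
    (offset + part.len() <= parent.len()).then_some(offset)
}

/// Example use: the starting byte offset of every whitespace-separated token.
fn token_offsets(line: &str) -> Vec<usize> {
    line.split_whitespace()
        .filter_map(|token| offset_in(line, token))
        .collect()
}
```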
