Exploring FFI Workshop

Working through Rust course materials for self-education, I found the FFI workshop intimidating. I finished it, but there was so much guessing along the way that I would really appreciate it if you could take a look in case I did something silly there.
Or share any general advice!

The solution is at https://github.com/skaunov/6991_ws09. The scope consists of two items:

in L40

                buffer_current
                .into_iter()
                .map(|x| x.to_ne_bytes()[0] as char)
                .take_while(|x| x != &'\0')
                .collect(),

in open you should have at least 3 cases:

  1. file doesn't exist
  2. file exists but is empty
  3. file exists and contains stuff
  4. (maybe?) file exists but there's a permission issue

depending on the case you might want a different outcome from the open function or from the following read_xyz ones.
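for illustration, here's a rough sketch of where each case surfaces, using std::fs for comparison rather than the workshop's actual FFI API (the function name and shape are made up):

```rust
use std::fs;
use std::io::Read;

// hypothetical helper, not the workshop's API: shows where each case surfaces
fn open_and_read(path: &str) -> Option<String> {
    // case 1 (file doesn't exist) and the permission case both fail here
    let mut file = fs::File::open(path).ok()?;
    let mut contents = String::new();
    // case 2 (empty file) reads zero bytes into an empty String;
    // case 3 reads the actual contents; both succeed
    file.read_to_string(&mut contents).ok()?;
    Some(contents)
}
```

the FFI version has the same decision points, just expressed through null checks and errno instead of Result.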

1 Like

Thank you! Updated the repo.

It was a sad bug in the iterator... Not even really FFI related.

And I added the cases you pointed out in quite a straightforward way. Seems like that's what the exercise is about.

:pray:

the count parameter to fgets() is the buffer capacity in chars; it's pointless to "reserve" space (for whatever reason: EOL, EOF, NUL, etc.). if the call to fgets() succeeded, a NUL terminator is guaranteed to be present, and the EOL is included if it is encountered within the limit of the buffer size.

and you don't need to convert the bytes to a String manually; there are more efficient ways. in C, you are supposed to call strlen() on the return value of fgets() when it succeeds. you could use libc::strlen(), but the closest equivalent of strlen() in rust is CStr::from_ptr(). then CStr::to_str() is used to check for valid utf8 encoding, and ToOwned is used to allocate the final result String:

fn read_string(&mut self) -> Option<String> {
	let mut buffer = [0; 512];
	let count = buffer.len();
	unsafe {
		let ptr = libc::fgets(buffer.as_mut_ptr(), count as c_int, self.stream);
		if ptr.is_null() {
			return None;
		}
		CStr::from_ptr(ptr).to_str().ok().map(ToOwned::to_owned)
	}
}

alternatively, use a heap allocated buffer directly and convert it to the return value:

fn read_string(&mut self) -> Option<String> {
	// use uninitialized byte array instead of c_char array as buffer
	let mut buffer = Vec::<u8>::with_capacity(512);
	let count = buffer.capacity() / size_of::<c_char>();
	unsafe {
		let ptr = libc::fgets(buffer.as_mut_ptr().cast(), count as c_int, self.stream);
		if ptr.is_null() {
			return None;
		}
		buffer.set_len(libc::strlen(ptr) as usize + 1);
		CString::from_vec_with_nul_unchecked(buffer)
			.into_string()
			.ok()
	}
}

1 Like

why do you use to_be_bytes() here? although in practice the result should be the same, since c_char is always either i8 or u8 on modern architectures, technically you can't assume c_char is in big-endian encoding.

actually you can cast a c_char to rust's native char directly. if you are concerned about the signedness of c_char, cast to c_uchar first:
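a minimal sketch of that cast (the function name here is mine, not from the workshop):

```rust
use std::os::raw::{c_char, c_uchar};

// sketch: convert one C char to a Rust char, going through
// c_uchar so the result doesn't depend on c_char's signedness
fn c_char_to_char(c: c_char) -> char {
    (c as c_uchar) as char
}
```

note this is only lossless for ASCII; bytes above 0x7F map to Latin-1 code points, which is usually not what a utf8 file contains.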

1 Like

It was out of desperation to make it work somehow. Where I was more thoughtful I used to_ne_bytes. That approach would at least be correct here, wouldn't it? Meanwhile, let me explore the better one you described. :pray:

Wow... I really feel like mumbling when others do fine speech. X)

Btw, I remember I was trying to get away from choosing an arbitrary constant for the buffer length, so that it could be defined by the context... Is it possible here without insanely complicating things? I mean, to read the whole first line even if it doesn't fit in 512 bytes.

PS Looking at the code now, I guess I would put fgets in a loop to seek the end of the relevant entity. But is that a good approach? :thinking:

Am I correct that this isn't related to FFI / C anymore, and is relevant to any Rust code? Never seen this trick before.

Had a really educative time reading through the variants you provided -- thanks one more time -- and I still have something to compare. Here's how I would incorporate this, though I need to get back to it with a fresh head.
The question is: did I get you right that by efficiency you mean performance? Or is there something more to it? I'm asking because it's hard to make up my mind about introducing another unsafe line... I mean, is it better to go a suboptimal but safe way, or go unsafe with optimization (an endless road?); the answer would of course be highly contextual, but any hints and discussion are very welcome!

it should be correct, to my understanding of the C standard. char in C is specified as "at least 8 bits" and "can be most efficiently processed on the target system".

if the length of lines in the file is unbounded, you have to use a loop, each time growing the buffer and reading more. (unless you use OS-specific APIs and are not restricted to libc, e.g. you can mmap() the file)

you are correct, it's not related to FFI. it's one way to turn a &str into an owned String in rust, since str implements ToOwned.

efficiency is not exactly the same as performance, although they are often related. in my comment, by "more efficient" what I actually meant is that you shouldn't manually iterate the line, map each c_char to a rust char, then collect the chars into a String. instead, it is much cheaper to first convert the entire buffer into a CString (or &CStr) and then convert that to a String. a rust String is utf8; it's not represented as a sequence of chars, so collecting an iterator of chars into a String requires re-encoding each char (code point) into utf8 code units (bytes).
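to make the contrast concrete, here's a small sketch of the two routes on the same NUL-terminated buffer (using std's safe CStr::from_bytes_until_nul instead of raw pointers, so it's runnable without libc; the function name is mine):

```rust
use std::ffi::CStr;

fn both_routes(buffer: &[u8]) -> (String, String) {
    // route 1: per-char map + collect; each char is re-encoded
    // into utf8 as it is pushed into the String
    let collected: String = buffer
        .iter()
        .map(|&b| b as char)
        .take_while(|&c| c != '\0')
        .collect();
    // route 2: scan for the NUL once, validate utf8 once, copy once
    let via_cstr = CStr::from_bytes_until_nul(buffer)
        .unwrap()
        .to_str()
        .unwrap()
        .to_owned();
    (collected, via_cstr)
}
```

the two agree for ASCII input, but route 1 silently mangles non-ASCII bytes (it treats each byte as a Latin-1 code point), while route 2 actually validates the utf8.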

for the difference between efficiency and performance, here's a good cppcon talk:

generally I would always go the safe way first, and only resort to unsafe when I deem it absolutely necessary (most of the time it is necessary because the APIs are unsafe; very rarely would I consider it necessary for performance optimization alone)

it is not "better" per se; I was being pedantic here. as I said, c_char is not guaranteed to be exactly one byte (although it almost always is in practice). in my code, the buffer is Vec<u8>, and libc::fgets() expects the count argument in c_chars, so I divide the capacity by the size of c_char.

the reason I use Vec<u8> instead of Vec<c_char> is because I want to later move the Vec into a CString and that requires Vec<u8>, not Vec<c_char>.

in my example, there's a common pattern for when you want to pass an (uninitialized) buffer to an FFI function to fill in:

  1. allocate a buffer without initializing it using Vec::with_capacity()
  2. obtain the pointer using Vec::as_mut_ptr() and pass it to the FFI function
  3. check that the FFI call is successful, and get the length of the filled data
  4. "fix up" the length of the buffer by calling Vec::set_len()
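in isolation, the four steps might look like this (fill_data is a made-up stand-in for a real FFI call, so the sketch runs without an actual C library):

```rust
// stand-in for an FFI function that fills a buffer
// and reports how many bytes it wrote
unsafe fn fill_data(ptr: *mut u8, cap: usize) -> usize {
    let src = b"filled by ffi";
    let n = src.len().min(cap);
    std::ptr::copy_nonoverlapping(src.as_ptr(), ptr, n);
    n
}

fn read_via_pattern() -> Vec<u8> {
    // 1. allocate without initializing
    let mut buf: Vec<u8> = Vec::with_capacity(64);
    unsafe {
        // 2. pass the raw pointer to the "FFI" function
        let filled = fill_data(buf.as_mut_ptr(), buf.capacity());
        // 3./4. the call succeeded; fix up the length to what was filled
        buf.set_len(filled);
    }
    buf
}
```

the safety obligation is step 3: set_len() must only ever be called with a length you know has actually been initialized.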
1 Like

this left me really puzzled
I remember a brilliant page by Amos on this; AFAIR he shows that in C a string is just bytes, which may happen to carry a correct UTF encoding across consecutive bytes for complex symbols if you're lucky, but the system doesn't give af about the byte-stream content.
Also c_char in libc - Rust shows that it's just i8 indeed, so I'm confused how size_of::<c_char>() varies... Is it architecture-dependent? (And even so, it's defined in Rust, not in C. Good exercise!) %)

I really feel like all I get from fgets is a buffer filled with i8, and all we're discussing is just different ways to look at it. Correct me if I'm wrong, pls! (I'm not being defensive, I hope it doesn't look like that; just trying to grasp the thing.)

I should incorporate this too... At first I didn't "buy" it, since an array looked to me like a way to not introduce a Vec when it's not needed. But on second thought I guess there's no real difference on that point. And the approach itself is much cleaner and more ubiquitous than the relying-on-populating-with-zeroes I employed.

that is correct, but what is a byte? nowadays every architecture agrees that a byte is exactly 8 bits (a.k.a. an octet), but the technical definition of a byte is "the smallest addressable memory unit", and it is architecture-dependent.

yes, it is architecture-dependent. rust's c_char is defined to follow the char type in C, and the char type in C is... complicated, for reasons.

the C standard

many C definitions are intentionally expressed in a convoluted/vague manner (in the hope that C would be compatible with every architecture). the latest C standard is C23. the ISO 9899:2023 document is not publicly available, but the final draft n3096 can be downloaded from open-std.org

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf

I'll excerpt some definitions as an example:

section 3.6 paragraph 1:
byte
addressable unit of data storage large enough to hold any member of the basic character set of the execution environment

paragraph 3:
note 2 to entry: A byte is composed of a contiguous sequence of bits, the number of which is implementation-defined.

section 3.7.1 paragraph 1:
single-byte character
⟨C⟩ bit representation that fits in a byte

section 5.2.1 paragraph 3:
Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet, the 26 lowercase letters of the Latin alphabet, the 10 decimal digits, the following 29 graphic characters [...], the space character, and control characters representing horizontal tab, vertical tab, and form feed.

as you can see, a byte is defined in terms of the "basic character set", and its length is implementation-defined, though the standard does require that a byte is at least 8 bits long, in Annex E paragraph 2.

a certain category of experts are called language lawyers for a reason. luckily for us, a byte (hence char) is always 8 bits long on modern architectures.

it is correct according to the implementation, but not by definition. in the implementation, c_char is aliased to i8, but they are defined to have different semantics. fgets() fills a buffer with c_chars, not i8s.

in your example, you need to allocate for the String return value anyway, so there's no real benefit to avoiding a Vec, as the allocated memory is eventually moved into the final String. you do waste memory though, since the size of a line cannot be determined beforehand and the buffer has to be allocated conservatively. note it is incorrect for lines longer than 512 bytes; to correctly handle unbounded lines, you need to loop and grow the buffer.

in general, if the buffer size is variable but bounded and not too large, using a local array to hold the data temporarily is a good solution. however, if the maximum buffer size is so large that you might overflow the stack, or the data is streamed and the buffer needs to grow without a predetermined maximum size, you'll have to use a heap-allocated buffer.

there's a very common use case where you want to return the data as an owned type (Vec, String, etc.) and the buffer size is known or can be queried precisely; then the Vec::set_len() pattern is very efficient. for example, many Win32 APIs follow this pattern. see:

msdn

there's even a microsoft document page explaining this API pattern, to quote a paragraph:

If NULL is input for pbData and pcbData is not NULL, no error is returned, and the function returns the size, in bytes, of the needed memory buffer in the variable pointed to by pcbData. This lets an application determine the size of, and the best way to allocate, a buffer for the returned data.
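the quoted two-call convention can be sketched like this (get_data is a made-up stand-in mimicking that convention, not the real Win32 function):

```rust
use std::ptr;

// made-up stand-in mimicking the Win32 convention: a null data pointer
// means "just report the required size through pcb_data"
unsafe fn get_data(pb_data: *mut u8, pcb_data: *mut usize) -> bool {
    let payload = b"payload";
    if pb_data.is_null() {
        *pcb_data = payload.len();
        return true;
    }
    if *pcb_data < payload.len() {
        return false;
    }
    ptr::copy_nonoverlapping(payload.as_ptr(), pb_data, payload.len());
    *pcb_data = payload.len();
    true
}

fn query_then_fill() -> Option<Vec<u8>> {
    unsafe {
        // first call: query the needed buffer size
        let mut len = 0usize;
        if !get_data(ptr::null_mut(), &mut len) {
            return None;
        }
        // allocate exactly that much, then fill and fix up the length
        let mut buf = Vec::with_capacity(len);
        if !get_data(buf.as_mut_ptr(), &mut len) {
            return None;
        }
        buf.set_len(len);
        Some(buf)
    }
}
```

because the size is queried precisely, the Vec is allocated exactly once with no waste.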


a simple example to read potentially long lines: loop and read piece by piece until a whole line is finished or EOF is reached. it uses a local array as a buffer for each piece and pushes the pieces into a heap-allocated Vec, which will grow as needed. as an exercise, you can use the spare space of the Vec directly and omit the local array, but then you need to manually grow the capacity. see Vec::spare_capacity_mut() and Vec::reserve().

fn read_string(&mut self) -> Option<String> {
	// 64 is arbitrarily chosen to be the local buffer capacity
	let mut local_buffer = [0u8; 64];
	let mut buffer = Vec::new();
	loop {
		unsafe {
			let ptr = libc::fgets(local_buffer.as_mut_ptr().cast(), 64, self.stream);
			if ptr.is_null() {
				if libc::feof(self.stream) != 0 {
					// eof
					break;
				} else {
					// io error
					return None;
				}
			}
			// excluding the NUL terminator
			let filled = libc::strlen(ptr);
			// because each piece might end at an invalid utf8 boundary, we cannot convert
			// individual pieces to `&str` and use `String::push_str()`. instead we append
			// the pieces to a `Vec<u8>` and do the utf8 validation at the end
			buffer.extend_from_slice(&local_buffer[..filled]);
			// finish if buffer is not fully filled or ends with eol
			// already checked for eof condition, so filled is not zero
			if filled + 1 < 64 || local_buffer[filled - 1] == b'\n' {
				break;
			}
		}
	}
	String::from_utf8(buffer).ok()
}
1 Like

Thank you for such a thorough explanation above! I guess only a small thing is left to wrap up this workshop analysis. I absolutely agree with the efficiency argument, the excessive map usage, and the additional concepts you showed while explaining things around these. What's still bugging me is that map let me get away with only one unsafe in the unavoidable place, and ditching map basically adds a second unavoidable one. X) https://github.com/skaunov/6991_ws09/blob/7f0d751adcc234bb1c869bacf7d116a9a9a82559/src/main.rs#L59

Ofc it's a toy example, and all the approaches you showed here are much more useful than this dilemma. I guess it's just one more takeaway from all the depths this workshop unfolded! :pray:

I feel like I addressed all the comments I could in the repo update. Many thanks to all, especially @nerditation .
Always happy to hear comments and ideas on the topic! :mortar_board:

I came across the https://rust-lang.github.io/unsafe-code-guidelines/ resource. I have no idea if it's good (though I hope so!), or if I'll ever end up really writing unsafe. Just want to leave it here as a possible next step on the unsafe journey.