I'm writing an FFI call with win32, and the function fills a string buffer:
fn get_computer_name_two_step() -> String {
    let mut buff_len = 0u32;
    unsafe {
        // this function will return an error code because it
        // did not actually write the string. This is normal.
        let e = GetComputerNameW(None, &mut buff_len).unwrap_err();
        debug_assert_eq!(e.code(), HRESULT::from(ERROR_BUFFER_OVERFLOW));
    }
    // buff len now has the length of the string (in UTF-16 characters)
    // the function would like to write. This *does include* the
    // null terminator. Let's create a vector buffer and feed that to the function.
    let mut buffer = Vec::<u16>::with_capacity(buff_len as usize);
    unsafe {
        WindowsProgramming::GetComputerNameW(
            Some(PWSTR(buffer.as_mut_ptr())),
            &mut buff_len).unwrap();
        // set the vector length
        // buff_len now includes the size, which *does not include* the null terminator.
        buffer.set_len(buff_len as usize + 1);
    }
    // we can now convert this to a valid Rust string
    // omitting the null terminator
    String::from_utf16_lossy(&buffer[..buff_len as usize])
}
I'm setting the vector length to include the null terminator - I know this is fine and valid, but if I'm never going to need that null terminator, is it valid instead to just set the vector's length to the length of the string (before the terminator)? So instead doing something like:
// set the vector length
// buff_len now includes the size, which *does not include* the null terminator.
// this will set the length to just before the null terminator
buffer.set_len(buff_len as usize);
...
// return the string
String::from_utf16_lossy(&buffer)
Perhaps a better question is: "if I'm using the vector as a buffer, and I write to it, is there any harm in setting the length to less than the number of bytes written, if the bytes at the end are never used?" Or will this leak memory somehow?
Thank you for any help.
Edit: I misunderstood. Original answer follows but see below.
There is harm, as the FFI will overwrite those bytes that aren't part of your allocation (and may be part of some other allocation, but it's UB even if not).
Are you saying the buffer capacity will always include room for the null terminator?
I think I might have done a bad job of explaining my question. I apologize.
@jumpnbrownweasel - yes, I'm always providing capacity for the FFI call to write the null terminator to the buffer. In the example above, the only thing I'm potentially changing is the set_len call after the FFI call has written to the buffer and I'm going to use the buffer in safe rust again.
So if the FFI call is going to write 9 wchar + a null terminator, I'm still allocating 10 with Vec::with_capacity(10).
My question is, after I have those 10 wchar, can I safely set the length to 9 (the length of the string or number of non-null wchars) before operating on the vector in safe rust. This makes the other calls a lot more readable (rather than slicing the buffer to chop off the null terminator).
@quinedot - you mention it's unsafe because the FFI will overwrite part of the allocation, but I'm still allocating space for the null terminator with Vec::with_capacity, and not calling the FFI methods after calling set_len - does that change your answer?
Working backwards from this example (GetUserNameW) in the grob crate may be helpful. Or, you could just change the API call in that example to GetComputerNameW, then move on.
Yes, I did indeed misunderstand. You meant to set the length to buff_len - 1 and not + 1 though, yeah?
you can simply pop() if the data is only accessed from rust, since the nul terminator is not used in rust. however, if the data will be passed back to ffi, you'd better use a wrapper type over Vec that preserves the nul terminator but Derefs to a slice without the nul byte. and guess what? we have such a type in the standard library: CString.
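CString stores bytes rather than u16, so for wide strings a wrapper like that would have to be written by hand or taken from a crate. A minimal sketch of the idea follows. The name NulTerminatedWide and its methods are made up for illustration and are not part of std or the windows crate.

```rust
use std::ops::Deref;

/// Hypothetical wrapper: owns a nul-terminated UTF-16 buffer, but derefs
/// to the slice *without* the trailing nul.
pub struct NulTerminatedWide {
    // Invariant: `buf` is non-empty and its last element is 0.
    buf: Vec<u16>,
}

impl NulTerminatedWide {
    /// Returns `None` if the buffer is empty or not nul-terminated.
    pub fn new(buf: Vec<u16>) -> Option<Self> {
        if buf.last() == Some(&0) {
            Some(Self { buf })
        } else {
            None
        }
    }

    /// Pointer to the full buffer, terminator included, for passing back to FFI.
    pub fn as_ptr(&self) -> *const u16 {
        self.buf.as_ptr()
    }

    /// The full buffer, including the terminator.
    pub fn as_slice_with_nul(&self) -> &[u16] {
        &self.buf
    }
}

impl Deref for NulTerminatedWide {
    type Target = [u16];

    fn deref(&self) -> &[u16] {
        // The invariant guarantees len() >= 1, so this cannot underflow.
        &self.buf[..self.buf.len() - 1]
    }
}
```

With this, safe code sees only the "interesting" characters (for example String::from_utf16_lossy(&wrapper)), while as_ptr() still points at a terminated buffer.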
Other way around. The buffer has to be +1 in size to accommodate the terminator and +1 on the second call. buff_len is the number of "interesting" characters after the GetComputerNameW call.
Oh. Wait. I got that wrong...
The lpnSize parameter specifies the size of the buffer required, including the terminating null character.
Which is precisely why I wrote grob. So I'd never have to deal with that mess again.
The change would be:
buffer.set_len(buff_len as usize + 1); //set length to include null terminator
to
buffer.set_len(buff_len as usize); //set length to just before the null terminator.
The windows APIs are weird here (I'm actually writing a tutorial, which is how I ended up stumbling on this) - but at that moment, right after the win32 call where it successfully wrote to the buffer, buff_len should be the number of wchar written, not including the null terminator.
The windows API here does different things depending on the result of the call. If the call failed because the buffer is too small, it sets the length to the desired size in wchar including the null terminator. If the call succeeded, it sets it to the number of wchar written not including the terminator.
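To make those two behaviours concrete without needing Windows, here is a rough sketch using a stand-in function. mock_get_name is hypothetical and only imitates the size semantics described above; it is not the real GetComputerNameW, and the name "MYPC" is a made-up value.

```rust
/// Hypothetical stand-in for a Win32-style "fill my buffer" call.
///
/// Safety: if `buf` is non-null it must be valid for writes of `*size` u16 values.
unsafe fn mock_get_name(buf: *mut u16, size: &mut u32) -> Result<(), ()> {
    let name: Vec<u16> = "MYPC".encode_utf16().collect();
    let needed = name.len() as u32 + 1; // characters plus the terminator
    if buf.is_null() || *size < needed {
        // Failure: report the required size, *including* the terminator.
        *size = needed;
        return Err(());
    }
    unsafe {
        std::ptr::copy_nonoverlapping(name.as_ptr(), buf, name.len());
        buf.add(name.len()).write(0);
    }
    // Success: report the count written, *excluding* the terminator.
    *size = name.len() as u32;
    Ok(())
}

/// Returns (size reported by the failed call, size reported by the successful call).
pub fn sizes_reported() -> (u32, u32) {
    let mut len = 0u32;
    // First call: no buffer, so it fails and tells us how much room it needs.
    let first = unsafe { mock_get_name(std::ptr::null_mut(), &mut len) };
    assert!(first.is_err());
    let on_failure = len;

    // Second call: the buffer has room for the characters and the terminator.
    let mut buffer = Vec::<u16>::with_capacity(len as usize);
    let second = unsafe { mock_get_name(buffer.as_mut_ptr(), &mut len) };
    assert!(second.is_ok());
    (on_failure, len)
}
```

For the four-character name in this sketch, the failed call reports 5 and the successful call reports 4, which is the off-by-one that makes the set_len arithmetic easy to get wrong.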
@nerditation
we have such a type in the standard library: CString.
I unfortunately don't think I can use CString because it's a set of wchar, not char. I need to investigate OsString but as far as I can tell, it doesn't provide a method to get a mutable pointer in the way that CString does; if you have an example of using one of the FFI types with a wchar API call that would be fantastic!
no, definitely not!
OsString is not for ffi interop, but to losslessly round-trip (potentially ill-formed) string-like data in a platform specific way. for example, on linux, it's just a bunch of bytes with no particular restriction about encodings, and on windows, it is in the so-called wtf8 encoding, specifically, it is NOT utf16!
when dealing with wchars, unless you are using it opaquely without any processing, you always need to encode and decode across the ffi boundary. I recommend the widestring crate when dealing with non utf8 strings.
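for reference, the encode/decode step can also be done with std alone. a small sketch, where to_wide_nul and from_wide_nul are helper names invented here (the widestring crate has its own types for this, which are not shown):

```rust
/// Encode a Rust string as UTF-16 with a trailing nul, ready to hand to a
/// wide-character API.
pub fn to_wide_nul(s: &str) -> Vec<u16> {
    s.encode_utf16().chain(std::iter::once(0)).collect()
}

/// Decode a UTF-16 buffer back into a Rust `String`, stopping at the first
/// nul if there is one. Ill-formed UTF-16 is replaced with U+FFFD.
pub fn from_wide_nul(wide: &[u16]) -> String {
    let end = wide.iter().position(|&unit| unit == 0).unwrap_or(wide.len());
    String::from_utf16_lossy(&wide[..end])
}
```

the conversion allocates in both directions, which is the overhead to weigh against keeping the data in a wide format.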
For context, I'm trying to write a tutorial on using strings with the windows crate, so I'm trying to include the basic low level solution, before introducing HSTRINGs and other constructs.
@Coding-Badly
I'll check out the grob crate and potentially link it in the tutorial I'm writing - having something that simplifies the two-step call definitely is helpful.
@nerditation
I recommend the widestring crate when dealing with non utf8 strings.
Why would you recommend that over the encode/decode methods bundled with the string type?
All that said, I do worry we're getting a bit off topic from the main question: Is what I'm doing with the vectors and length acceptable/safe? To recap:
- I'm allocating a 10 wchar/u16 vec for a function that will write 9 wchar + 1 null terminator (10 total)
- I'm calling the FFI function and it's writing to the vector
- I'd like to set the length to 9 (rather than 10) so it doesn't include the null terminator in safe rust. Will that cause a problem?
I want to make sure that, even though there were items in the buffer that were initialized, it's OK to set the length to be shorter? My concern is that there was an example in the docs that said it's possible to leak memory using set_len, but I'd assume the vec's capacity will get freed, regardless of what the length is, and because a u16 shouldn't require special clean up, it should be fine.
But I wanted to confirm that.
I think you mean this:
While the following example is sound, there is a memory leak since the inner vectors were not freed prior to the set_len call:
In that example, the Vec contains nested Vecs that must be dropped/freed if the length of the Vec is reduced with set_len, to avoid a leak. This doesn't apply because your Vec simply contains u16 elements, and these are not allocated on the heap individually and therefore don't need to be dropped when the length is reduced.
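A small sketch of that difference, using a made-up element type (Noisy) that counts how many times it is dropped. This is an illustration written for this thread, not the example from the standard library docs.

```rust
use std::cell::Cell;

thread_local! {
    // Counts how many `Noisy` values have been dropped on this thread.
    static DROPS: Cell<usize> = Cell::new(0);
}

/// Hypothetical element type with a destructor, standing in for the nested Vecs.
struct Noisy {
    _id: u8,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        DROPS.with(|d| d.set(d.get() + 1));
    }
}

fn three() -> Vec<Noisy> {
    vec![Noisy { _id: 0 }, Noisy { _id: 1 }, Noisy { _id: 2 }]
}

/// Shrinks with `set_len` and returns how many destructors ran.
pub fn drops_with_set_len() -> usize {
    let before = DROPS.with(|d| d.get());
    let mut v = three();
    // SAFETY: 0 <= capacity, and shrinking exposes no uninitialized elements.
    unsafe { v.set_len(0) };
    drop(v); // frees the Vec's own allocation, but the elements are forgotten
    DROPS.with(|d| d.get()) - before
}

/// Shrinks with `truncate` and returns how many destructors ran.
pub fn drops_with_truncate() -> usize {
    let before = DROPS.with(|d| d.get());
    let mut v = three();
    v.truncate(0); // runs each element's destructor
    drop(v);
    DROPS.with(|d| d.get()) - before
}
```

Here set_len runs no destructors while truncate runs all three. With u16 there is no destructor to skip, so nothing is lost either way, and the Vec's backing allocation is freed according to its capacity, not its length.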
no.
the safety condition for set_len() is:
- new_len must be less than or equal to capacity().
- The elements at old_len..new_len must be initialized.
your use case is perfectly valid.
in your example, it is equivalent to calling buffer.set_len(10); followed by buffer.pop();
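a quick sketch of that equivalence. write_wide is a hypothetical stand-in for the ffi call (it just writes the text plus a nul through the raw pointer), so this runs anywhere:

```rust
/// Stand-in for the FFI write: stores `text` as UTF-16 followed by a nul.
/// Returns the number of u16 values written, not counting the terminator.
///
/// Safety: `dst` must be valid for writes of `text.encode_utf16().count() + 1` u16s.
unsafe fn write_wide(dst: *mut u16, text: &str) -> usize {
    let mut n = 0;
    for unit in text.encode_utf16() {
        unsafe { dst.add(n).write(unit) };
        n += 1;
    }
    unsafe { dst.add(n).write(0) };
    n
}

/// Sets the length to just before the terminator.
pub fn via_short_set_len(text: &str) -> Vec<u16> {
    let cap = text.encode_utf16().count() + 1;
    let mut buffer = Vec::<u16>::with_capacity(cap);
    unsafe {
        let written = write_wide(buffer.as_mut_ptr(), text);
        // SAFETY: written < capacity, and elements 0..written are initialized.
        buffer.set_len(written);
    }
    buffer
}

/// Sets the length to include the terminator, then pops it.
pub fn via_set_len_then_pop(text: &str) -> Vec<u16> {
    let cap = text.encode_utf16().count() + 1;
    let mut buffer = Vec::<u16>::with_capacity(cap);
    unsafe {
        let written = write_wide(buffer.as_mut_ptr(), text);
        // SAFETY: written + 1 <= capacity, and elements 0..=written are initialized.
        buffer.set_len(written + 1);
    }
    buffer.pop();
    buffer
}
```

both functions end up with the same contents and length; the capacity (and so the memory that gets freed) still covers the slot holding the terminator.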
the documentation is talking about the general case. when you use set_len() to shrink a Vec<T>, the extra elements are "forgotten", they are NOT dropped, and this could cause resource leaks.
your intuition is correct: this is not a concern for Vec<u16>, since u16 is Copy and doesn't need to be dropped.
the difference is that widestring may validate the encoding, but otherwise leaves the data in the original "wide" format; it doesn't convert the string to rust's native encoding, which is utf8.
these "wide" string types support some basic string operations, like push, pop, replace, iterate, etc. depending on the use case, it may be more desirable since you can eliminate the conversion overhead. but if you need more than these basic operations, it's usually better to convert to rust's native string type.