Should String::from_utf8 remove trailing null?

I converted a C char array (with a trailing null) from a C library into a Rust String using
String::from_utf8 since I want to check that the bytes are valid UTF-8.

I was surprised to discover that the Rust String included the trailing null from the C char array and therefore did not match a String created from a Rust char array. To get them to match I had to add a terminating null to the Rust String, e.g. String::from("abcd\0").

I appreciate that String::from_utf8 "will take care to not copy the vector, for efficiency's sake". However, this behaviour is confusing when comparing Rust Strings that are created from different sources, especially since it isn't apparent when printing the string using a Display marker {}, but only when using a Debug marker {:?}.
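For anyone wanting to reproduce this, here is a minimal sketch of the mismatch. The helper `from_c_bytes` is hypothetical and only imitates the C library by starting from the bytes `b"abcd\0"`.

```rust
// Hypothetical helper: stands in for bytes received from a C library.
fn from_c_bytes() -> String {
    let bytes = b"abcd\0".to_vec();
    String::from_utf8(bytes).expect("valid UTF-8")
}

fn main() {
    let from_c = from_c_bytes();
    let native = String::from("abcd");

    // The trailing NUL is part of the String, so the two values differ.
    assert_ne!(from_c, native);
    assert_eq!(from_c.len(), 5);

    // Display writes the NUL through unchanged; Debug escapes it as \0.
    println!("{}", from_c);
    println!("{:?}", from_c);
    assert_eq!(format!("{:?}", from_c), "\"abcd\\0\"");
}
```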

Other types such as CString are specifically designed to store strings in C format, so shouldn't String::from_utf8 store strings in "standard" Rust format and remove trailing nulls?

This won't happen because it's an unacceptable breaking change. Moreover, NUL bytes are perfectly valid UTF-8.

If you're working with C strings and want to convert to a Rust string, then use CStr::to_str. It will trim the trailing NUL byte for you.
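A small sketch of that route, assuming the C data is already available as a byte slice that ends in a NUL. The helper name `c_bytes_to_str` is made up for illustration.

```rust
use std::ffi::CStr;

// Hypothetical helper: validates a NUL-terminated byte slice and returns
// the text without the terminator, or None if either check fails.
fn c_bytes_to_str(bytes: &[u8]) -> Option<&str> {
    // from_bytes_with_nul requires exactly one NUL, located at the end.
    let c_str = CStr::from_bytes_with_nul(bytes).ok()?;
    // to_str checks UTF-8 validity and leaves the terminator out.
    c_str.to_str().ok()
}

fn main() {
    assert_eq!(c_bytes_to_str(b"abcd\0"), Some("abcd"));
    // Invalid UTF-8 is still rejected, as with String::from_utf8.
    assert_eq!(c_bytes_to_str(b"\xff\0"), None);
}
```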

That format is UTF-8, and for better or for worse, the null char '\0' is a valid Unicode char, which is {en,de}coded as a null byte b'\0'.
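To illustrate, a quick sketch showing that '\0' encodes to a single zero byte and that UTF-8 validation accepts NULs anywhere in the input. The helper `encode_nul` is hypothetical.

```rust
// Hypothetical helper: returns the UTF-8 encoding of the null char.
fn encode_nul() -> Vec<u8> {
    let mut buf = [0xffu8; 4];
    '\0'.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // The null char is code point 0 and takes one byte in UTF-8.
    assert_eq!('\0' as u32, 0);
    assert_eq!(encode_nul(), vec![0u8]);

    // Interior and trailing NULs both pass UTF-8 validation.
    assert!(std::str::from_utf8(b"a\0b\0").is_ok());
}
```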

So, what you are suggesting is not that different from saying that a string created from a UTF-8 sequence encoding "Display-invisible" chars at the end ('\0' being only one of them) should be auto-trimmed, and this gets to logic that should not, imho, be bundled within a "cast" / copyless operation.

Granted, that last point invoked a subjective view, same as you have your legitimate opposing view.

But here comes what is an actual objective rationale:

  • String::from_utf8 and str::from_utf8 ought to have the same behavior (owned / reborrowed variants of the same logic);

  • If str::from_utf8 "dropped" the terminating null, then it would be impossible to go back to a C-string view, unless a copy were to take place. This is unacceptable for a low-level primitive within a systems programming language.

Hence str::from_utf8 is correct in not dropping the terminating nul, and thus String::from_utf8 must also do the same, for the sake of API consistency.
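The round trip can be sketched roughly like this. The helper `views_share_buffer` is hypothetical; it only checks that the &str and the &CStr views point into the same buffer the caller passed in, i.e. that no copy took place.

```rust
use std::ffi::CStr;

// Hypothetical helper: true if the str view and the C-string view both
// borrow the caller's buffer without copying or dropping any byte.
fn views_share_buffer(raw: &[u8]) -> bool {
    let s: &str = std::str::from_utf8(raw).expect("valid UTF-8");
    // Because the terminator was kept, the same bytes can be viewed as a CStr.
    let c: &CStr = CStr::from_bytes_with_nul(s.as_bytes()).expect("single trailing NUL");
    s.as_ptr() == raw.as_ptr()
        && s.len() == raw.len()
        && c.as_ptr().cast::<u8>() == raw.as_ptr()
}

fn main() {
    assert!(views_share_buffer(b"abcd\0"));
}
```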


Regarding your situation: you are mainly surprised that a C "string" literal is syntactic sugar for what Rust would write as b"string\0", and that if you write tests / logic spanning the two languages, then that trailing null is something to be aware of. Interop is rarely 100% smooth, we could say :grinning_face_with_smiling_eyes:

The way to go from a C string to Rust structs is through CStr (CStr::from_ptr or CStr::from_bytes_with_nul). Then, you can control whether you want to bundle the terminating null byte with the .to_bytes_with_nul() / .to_bytes() methods, which will yield you the byte slice / view you desire.

  • For your use case, they even bundle a convenience method of .to_str() which strips the terminating null byte (weirdly enough they don't bundle a .to_str_with_nul()).

  • Btw, for UTF-8 encoded C strings, you may want to try the char_p::{Ref<'_>, Box} types, which do feature .to_str{,_with_null}() APIs to give you the choice (in the borrowed case; in the owned case the "choice API" is missing)
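As a rough sketch of the CStr route described above: `fake_c_api` and `bytes_from_c` are made-up names, and the static byte literal merely stands in for a pointer that a real C library would return.

```rust
use std::ffi::{c_char, CStr};

// Hypothetical stand-in for a C function returning a NUL-terminated string.
fn fake_c_api() -> *const c_char {
    b"abcd\0".as_ptr().cast()
}

// Hypothetical helper: picks the byte view with or without the terminator.
fn bytes_from_c(keep_nul: bool) -> &'static [u8] {
    // SAFETY: the pointer refers to a static buffer that ends in a NUL byte.
    let c_str: &'static CStr = unsafe { CStr::from_ptr(fake_c_api()) };
    if keep_nul {
        c_str.to_bytes_with_nul()
    } else {
        c_str.to_bytes()
    }
}

fn main() {
    assert_eq!(bytes_from_c(false), b"abcd");
    assert_eq!(bytes_from_c(true), b"abcd\0");
}
```

With a real C library the safety comment has to be justified by that library's contract (valid pointer, NUL-terminated, alive for the chosen lifetime).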

Noooo, I was so happy to have escaped zero-terminated strings and the accompanying special treatment of NUL bytes. It's a security nightmare, please don't let them come to Rust :sweat_smile:

(More seriously, CStr/CString and friends are explicitly meant for this purpose of interfacing with C; the behavior has no place in pure Rust code.)

Point of clarification: though not apparent to your eyes on a console, the NUL byte is being output. "Display" really means "output UTF-8", including control and other non-graphical characters. (I really do think naming that trait Display was a mistake, for a couple of reasons like this.)

As others have pointed out, you probably actually want CStr/CString. Take care as embedded NULs (NULs that aren't the terminating NUL) are not allowed.
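A short sketch of the embedded-NUL restriction, using only std. The helper `interior_nul_position` is hypothetical.

```rust
use std::ffi::{CStr, CString};

// Hypothetical helper: reports where CString::new found a NUL, if anywhere.
fn interior_nul_position(s: &str) -> Option<usize> {
    CString::new(s).err().map(|e| e.nul_position())
}

fn main() {
    // CString::new appends the terminator itself and rejects any NUL in the input.
    assert_eq!(interior_nul_position("abcd"), None);
    assert_eq!(interior_nul_position("ab\0cd"), Some(2));

    // CStr::from_bytes_with_nul likewise rejects a NUL before the end.
    assert!(CStr::from_bytes_with_nul(b"ab\0cd\0").is_err());

    // String::from_utf8 has no such rule: NUL is just another char.
    assert!(String::from_utf8(b"ab\0cd\0".to_vec()).is_ok());
}
```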

There are also some crates that support outputting escaped strings or byte arrays; perhaps some of them output escaped NULs.
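Without pulling in a crate, std's own str::escape_default and str::escape_debug may already be enough for this. A minimal sketch, with `escaped` as a hypothetical wrapper:

```rust
// Hypothetical wrapper around std's str::escape_default.
fn escaped(s: &str) -> String {
    s.escape_default().to_string()
}

fn main() {
    // escape_default renders the NUL as a Unicode escape.
    assert_eq!(escaped("abcd\0"), "abcd\\u{0}");
    // escape_debug uses the shorter \0 form, matching {:?} output.
    assert_eq!("abcd\0".escape_debug().to_string(), "abcd\\0");
}
```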

Ha, ha, I totally agree!
I clearly used the wrong tool for the job. I shall endeavour to write pure Rust code in future. :slight_smile:

Thank you for your detailed reply.
It is clear that I used the wrong tool for the job.

String::from_utf8 is for UTF-8 strings, and although a valid ASCII C string is a valid UTF-8 string, the opposite is not the case, since a UTF-8 string may have a valid null char within it.
In addition to the reasons that you give for using CStr/CString, they also have methods that test whether a C byte array contains a null char anywhere other than at the end, which String::from_utf8 permits.

In future I shall just use CStr/CString structs for interfacing with C strings and not any String methods! :slight_smile:

Agreed. I was using the wrong tool for the job, I'm using CString now.

Thank you, you may be right, but it was my mistake, I don't think that it is a problem with display. Now that I've switched to using CString, it does not support Display, so the issue doesn't occur.
