It's quite strange that Rust classifies \0 as a control character rather than a whitespace character. This is a little counter-intuitive. In other programming languages, \0 is usually classified as a whitespace character. What's the reasoning behind this?
Looks like the Java function considers any C0 control code plus the ASCII space character (0x20) to be whitespace. That is, the Java definition of whitespace for this method is different from the Unicode definition of whitespace that the Rust str implementation uses.[1]
The difference in what is considered whitespace explains the difference in behavior. Presumably the Java method won't remove non-breaking spaces and the like, for example.[2]
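A quick self-contained sketch of how the Unicode-based definition behaves in Rust (the example strings here are made up for illustration):

fn main() {
    // NUL is a control code, not whitespace, under the Unicode definition
    assert!(!'\0'.is_whitespace());
    // ...whereas the non-breaking space (U+00A0) does carry the White_Space property
    assert!('\u{00A0}'.is_whitespace());
    // so str::trim removes the leading U+00A0 but leaves the trailing \0 alone
    assert_eq!("\u{00A0}abc\0".trim(), "abc\0");
}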
You had an expectation due to your background. That's reasonable. But as it turns out, Rust has a different definition of whitespace than the Java trim function, based on the Unicode standard.[1] That definition is unlikely to change to arbitrarily include \0.[2][3] (It also means that str::trim trims things that Java does not.)
Bummer, but so it goes. If you need a method that trims \0 (or other C0 codes), and/or one that doesn't trim non-ASCII Unicode whitespace, you'll need a method other than str::trim, be it provided by a crate or your own creation.
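For instance, a minimal sketch of such a helper built on str::trim_matches with a closure (the function name and the exact set of trimmed characters are just placeholders):

fn trim_ws_and_nul(s: &str) -> &str {
    // Trim Unicode whitespace plus NUL from both ends; adjust the predicate as needed
    s.trim_matches(|c: char| c.is_whitespace() || c == '\0')
}

fn main() {
    assert_eq!(trim_ws_and_nul("\0\t hello \0\0"), "hello");
}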
Other languages such as Python also have their own interpretations, etc. ↩︎
It is not classified as whitespace in C, Python, C#, Swift, Go, OCaml, or Haskell. I also believe it is not considered whitespace by most regex engines.
Null is not whitespace in BASIC either. If the first byte (word in UTF-16) is a null, then a null-terminated string has no content, no matter what garbage follows the null within the defined string length. (Windows uses null-terminated strings a lot! Are you talking about another OS?)
In fact, null is not a space of any kind, all the way from ITA2 until now. It is a do-nothing control code.
Alright, it seems this has drifted a bit from my initial point. My thinking was simply that trim() can remove characters like \t, \n, spaces, and so on, yet it doesn't remove \0. That's where my confusion lies.
I'm wondering where you are getting these strings that have nulls in them?
If they are actual null-terminated strings as used by C, perhaps you can make use of CString to handle them, including turning them into a Rust String. CString in std::ffi - Rust
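For example, a rough sketch of going from a NUL-terminated byte buffer to a Rust String via CString (the buffer contents here are invented for illustration; real data would come from the file):

use std::ffi::CString;

fn main() {
    // Hypothetical buffer: a single trailing NUL, no interior NULs
    let raw: Vec<u8> = b"PatientName\0".to_vec();
    let c = CString::from_vec_with_nul(raw).expect("trailing NUL, no interior NULs");
    let s: String = c.into_string().expect("valid UTF-8");
    assert_eq!(s, "PatientName");
}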
But what about those nulls you have at the beginning or in the middle in your example? That is very unusual.
These strings are sourced from DICOM files (a type of medical imaging file). In the original data, there is only one \0 at the end, not at the beginning or in the middle. The \0 at the beginning and in the middle in my example are just for testing the trim() function.
I would recommend stripping the null bytes before creating the string; it's a lot easier to work at the byte level, where you don't need to deal with UTF-8 validity constraints.
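A rough sketch of that byte-level approach (the function name and error handling are placeholders, not a DICOM-aware parser):

fn str_before_nul(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    // Keep only the bytes before the first NUL (or everything, if there is none),
    // then validate as UTF-8 and trim ordinary whitespace
    let end = bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len());
    std::str::from_utf8(&bytes[..end]).map(|s| s.trim())
}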
That sounds like a great idea; I bet the DICOM format is structured that way intentionally to make C parsers easier to implement.
Assuming they have already read the file into a String and split it on lines, it could be as simple as this:
use std::ffi::CStr;

// Alternatively could return `Option<&str>` or `Result<&str, E>`
// and use `?` instead of `unwrap`
fn c_trim(s: &str) -> &str {
    // `from_bytes_until_nul` errors (and the first `unwrap` panics) if `s` contains
    // no NUL at all; the `to_str` unwrap cannot fail here, since cutting a `&str`
    // at a NUL byte always leaves valid UTF-8
    CStr::from_bytes_until_nul(s.as_bytes())
        .unwrap()
        .to_str()
        .unwrap()
        .trim()
}
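Hypothetical usage over the already-loaded contents (the field values below are made up) might look like:

fn main() {
    // Each field is assumed to end in a NUL, as described above
    let contents = String::from("PatientName \0\nStudyDate\0\n");
    for line in contents.lines() {
        println!("{:?}", c_trim(line)); // prints "PatientName" then "StudyDate"
    }
}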
EDIT: If this is for industry use and is not just a hobby project, please see @quinedot's remarks below. I would advise against writing your own DICOM parser.