In Rust, it's a bit counterintuitive that str.trim() doesn't remove '\0' (the null character) at the beginning and end of a string.

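pub fn main() {
    let str1 = "\0-1024\0\0";
    // trim() strips Unicode whitespace only, so the NULs stay.
    let str2 = str1.trim();
    // trim_matches() with this predicate also strips control characters such as '\0'.
    let str3 = str1.trim_matches(|c: char| c.is_whitespace() || c.is_control());
    println!(
        "str1: {:?} len: {}, str2: {:?} len: {}, str3: {:?} len: {}",
        str1, str1.len(), str2, str2.len(), str3, str3.len()
    );
}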
The code prints: 
 str1: "\0-1024\0\0" len: 8, str2: "\0-1024\0\0" len: 8, str3: "-1024" len: 5

It's quite strange that Rust classifies \0 as a control character rather than a whitespace character; it feels a little counterintuitive. In other programming languages, \0 is usually classified as a whitespace character. What's the consideration behind this?

2 Likes

Could you share an example? I haven't run into this anywhere before.

4 Likes

You can copy the code in my post and try to run it.

As in, what's a language that does consider \0 a whitespace character? Python, for example, does not.

2 Likes

And to address this directly, from the documentation:

‘Whitespace’ is defined according to the terms of the Unicode Derived Core Property White_Space, which includes newlines.
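To make that definition concrete, here is a small check using `char::is_whitespace` and `char::is_control` from the standard library. It is only a sketch: `classify` is a made-up helper name and the sample characters are my own picks.

```rust
/// Classify a character roughly the way `str::trim` sees it.
/// (`classify` is an illustrative helper, not a std function.)
/// Whitespace is checked first, so '\t' and '\n' (which are also
/// control characters) are reported as whitespace.
fn classify(c: char) -> &'static str {
    if c.is_whitespace() {
        "whitespace"
    } else if c.is_control() {
        "control"
    } else {
        "other"
    }
}

fn main() {
    // '\u{A0}' is NO-BREAK SPACE, '\u{3000}' is IDEOGRAPHIC SPACE.
    for c in ['\0', ' ', '\t', '\n', '\u{A0}', '\u{3000}', 'a'] {
        println!("{:?}: {}", c, classify(c));
    }
}
```

NUL has no White_Space property, so it lands in the "control" bucket and `trim()` leaves it alone.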

3 Likes
Java code:
public class Main {
    public static void main(String[] args) {
        String str1 = "\0-1024.5\0\0";
        System.out.println("str1 : " + str1 + ", len: " + str1.length());

        // String.trim() strips leading and trailing characters <= U+0020, including \0.
        String str2 = str1.trim();
        System.out.println("str2 : " + str2 + ", len: " + str2.length());
    }
}

The code prints:

str1 : -1024.5, len: 10
str2 : -1024.5, len: 7

The trim() function will remove the \0 in Java.

Looks like the Java function considers any C0 control code plus the ASCII space character (0x20) to be whitespace. That is, the Java definition of whitespace for this method is different than the Unicode definition of whitespace that the Rust str implementation uses.[1]

The difference in what is considered whitespace explains the difference in behavior. Presumably the Java method won't remove non-breaking spaces and the like either, for example.[2]


  1. and pretty odd IMO, but opinions may differ ↩︎

  2. I didn't bother to cook up an example. ↩︎
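For what it's worth, the Java-style rule described above can be approximated in Rust with `trim_matches`. This is a rough sketch, not a drop-in port: `java_style_trim` is a hypothetical name, and it only mirrors the "code point at or below U+0020" rule.

```rust
/// Roughly mimics Java's `String.trim()`: strips every leading and trailing
/// character whose code point is at or below U+0020 (space).
/// `java_style_trim` is a hypothetical helper name, not a std function.
fn java_style_trim(s: &str) -> &str {
    s.trim_matches(|c: char| c <= ' ')
}

fn main() {
    let nul_padded = "\0-1024.5\0\0";
    let nbsp_padded = "\u{A0}-1024.5\u{A0}";

    // NUL is removed by the Java-style rule but kept by str::trim.
    println!("{:?} vs {:?}", java_style_trim(nul_padded), nul_padded.trim());
    // NO-BREAK SPACE (U+00A0) is removed by str::trim but kept by the Java-style rule.
    println!("{:?} vs {:?}", java_style_trim(nbsp_padded), nbsp_padded.trim());
}
```

The two rules disagree in both directions, which is the point: neither is a superset of the other.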

6 Likes

And to address this directly, from the documentation:

let s = "\n Hello\tworld\t\n";

assert_eq!("Hello\tworld", s.trim());

In the documentation, the trim() function will remove characters like \n and \t, but it doesn't remove \0. I think \0 should also be removed.

2 Likes

You had an expectation due to your background. That's reasonable. But as it turns out, Rust has a different definition of whitespace than the Java trim function, based on the Unicode standard.[1] That definition is unlikely to change to arbitrarily include \0.[2][3] (It also means that str::trim trims things that Java does not.)

Bummer, but so it goes. If you need a method that trims \0 (or other C0 codes), and/or one that doesn't trim non-ASCII Unicode whitespace, you'll need a method other than str::trim, be it provided by a crate or your own creation.


  1. other languages such as Python also have other interpretations, etc. ↩︎

  2. or other C0 codes ↩︎

  3. at the Rust level or at the Unicode level ↩︎

11 Likes

It is not classified as whitespace in C, Python, C#, Swift, Go, OCaml, or Haskell. I also believe it is not considered whitespace by most regex engines.

23 Likes

Null is not whitespace in BASIC either. If the first byte (word in UTF-16) is a null, then a null-terminated string has no content, no matter what garbage sits after that null within the defined string length. (Windows uses null-terminated strings a lot! Are you talking about another OS?)

In fact, null is not a space of any kind, all the way from ITA2 till now. It is a do-nothing control code.

Strange? In the C language \0 is not a whitespace character. \0 is used to indicate the end of a string. As such it cannot be any kind of space.

And that is how Unicode defines it: U+0000 NULL (Unicode Explorer)

You could use trim_matches (str - Rust).
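A minimal sketch of that suggestion, using only `str::trim_matches` and `str::trim_end_matches` from the standard library. The helper names `trim_nul` and `trim_trailing_nul` are made up for illustration.

```rust
/// Strips NUL characters from both ends; a plain `char` works as the pattern.
/// (`trim_nul` is an illustrative name, not a std function.)
fn trim_nul(s: &str) -> &str {
    s.trim_matches('\0')
}

/// Strips only trailing NULs, leaving anything at the start untouched.
/// (`trim_trailing_nul` is an illustrative name, not a std function.)
fn trim_trailing_nul(s: &str) -> &str {
    s.trim_end_matches('\0')
}

fn main() {
    let raw = "\0-1024\0\0";
    println!("{:?}", trim_nul(raw));
    println!("{:?}", trim_trailing_nul(raw));
    // Chain with trim() if ordinary whitespace should go as well.
    println!("{:?}", trim_nul(" -1024 \0").trim());
}
```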

1 Like

Alright, it seems the discussion has drifted a bit from my original point. My confusion is simply that trim() removes characters like \t, \n, and the space character, but it doesn't remove \0.

You seem to be ignoring all the previous replies that address your concern.

18 Likes

I'm wondering where you are getting these strings that have nulls in them?

If they are actual null-terminated strings as used by C, perhaps you can use CString to handle them, including turning them into a Rust String. See CString in std::ffi (Rust).

But what about those nulls you have at the beginning in your example or in the middle? That is very unusual.

These strings are sourced from DICOM files (a type of medical imaging file). In the original data, there is only one \0 at the end, not at the beginning or in the middle. The \0 at the beginning and in the middle in my example are just for testing the trim() function.

I would recommend stripping the null bytes before creating the string. It's a lot easier to work at the byte level, where you don't need to handle UTF-8 validity constraints.
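A byte-level version might look roughly like this. It is only a sketch: `field_to_str` is a made-up helper name, the input slice stands in for a field already read from the file, and it assumes the text is meant to be UTF-8.

```rust
use std::str::Utf8Error;

/// Drops trailing NUL bytes from a raw buffer, then validates the rest as UTF-8.
/// `field_to_str` is a hypothetical helper, not part of any library.
fn field_to_str(raw: &[u8]) -> Result<&str, Utf8Error> {
    // Index just past the last non-NUL byte, or 0 if the buffer is all NULs.
    let end = raw.iter().rposition(|&b| b != 0).map_or(0, |i| i + 1);
    std::str::from_utf8(&raw[..end])
}

fn main() {
    let raw: &[u8] = b"-1024\0";
    match field_to_str(raw) {
        // trim() can still be applied afterwards for ordinary whitespace.
        Ok(s) => println!("{:?}", s.trim()),
        Err(e) => eprintln!("not valid UTF-8: {}", e),
    }
}
```

Doing the stripping first means the UTF-8 check only ever sees the bytes you actually care about.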

4 Likes

That sounds like a perfectly valid use case for CString to me, since the strings in that file seem to be \0-terminated.

6 Likes

That sounds like a great idea. I bet the DICOM format is structured that way intentionally to make C parsers easier to implement.

Assuming they have already read the file into a String and split it on lines, it could be as simple as this:

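use std::ffi::CStr;

// Keeps the text before the first NUL and trims surrounding whitespace.
// Panics if `s` contains no NUL byte.
// Alternatively could return `Option<&str>` or `Result<&str, E>`
// and use `?` instead of `unwrap`
fn c_trim(s: &str) -> &str {
    CStr::from_bytes_until_nul(s.as_bytes())
        .unwrap()
        .to_str()
        .unwrap()
        .trim()
}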
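
The `Option`-returning variant mentioned in the comments could look something like this. Treat it as a sketch: `c_trim_opt` is a hypothetical name, and it relies on `CStr::from_bytes_until_nul`, which needs a reasonably recent Rust toolchain.

```rust
use std::ffi::CStr;

/// Returns the whitespace-trimmed text before the first NUL,
/// or `None` if the input contains no NUL at all.
/// `c_trim_opt` is an illustrative name, not a std function.
fn c_trim_opt(s: &str) -> Option<&str> {
    // Fails (and yields None) when there is no NUL terminator.
    let c_str = CStr::from_bytes_until_nul(s.as_bytes()).ok()?;
    let text = c_str.to_str().ok()?;
    Some(text.trim())
}

fn main() {
    println!("{:?}", c_trim_opt(" -1024 \0"));
    println!("{:?}", c_trim_opt("-1024"));
    // A leading NUL terminates the string immediately.
    println!("{:?}", c_trim_opt("\0-1024\0\0"));
}
```

Note that a NUL at the very start yields an empty string here, which matches C semantics but differs from what `trim_matches('\0')` would give.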

EDIT: If this is for industry use and is not just a hobby project, please see @quinedot's remarks below. I would advise against writing your own DICOM parser.

3 Likes