I run the external program file and want to check whether its output reports an ISO-8859 (latin1) encoding. The code that runs it is:
let file_child = Command::new("file")
    .arg(&config.filename_latin)
    .stdout(Stdio::piped())
    .spawn()
    .expect("Failed to run file process");
let file_out = &file_child
    .wait_with_output()
    .expect("Failed to wait on output of file stdout");
let file_output = file_out.stdout.as_slice();
When I call contains on the output, I get a compile error that I cannot seem to resolve:
if file_output.contains("ISO-8859") && file_output.contains("text") {
    println!("input file is latin1");
} else {
    println!("input file is *not* latin1 but is {}", file_output);
}
The errors I get are shown below. I have tried several .as_* conversions but always get the same error messages. For some reason, contains on a slice (&[u8]) only accepts a single u8, not a string.
Compiling latin2utf8 v0.1.0 (/home/data/gbonnema/projects/rust/latin2utf8)
error[E0308]: mismatched types
--> src/main.rs:28:29
|
28 | if file_output.contains("ISO-8859") && file_output.contains("text") {
| ^^^^^^^^^^ expected u8, found str
|
= note: expected type `&u8`
found type `&'static str`
error[E0308]: mismatched types
--> src/main.rs:28:65
|
28 | if file_output.contains("ISO-8859") && file_output.contains("text") {
| ^^^^^^ expected u8, found str
|
= note: expected type `&u8`
found type `&'static str`
error[E0277]: `[u8]` doesn't implement `std::fmt::Display`
--> src/main.rs:31:58
|
31 | println!("input file is *not* latin1 but is {}", file_output);
| ^^^^^^^^^^^ `[u8]` cannot be formatted with the default formatter
|
= help: the trait `std::fmt::Display` is not implemented for `[u8]`
= note: in format strings you may be able to use `{:?}` (or {:#?} for pretty-print) instead
= note: required because of the requirements on the impl of `std::fmt::Display` for `&[u8]`
= note: required by `std::fmt::Display::fmt`
Yes: since the stdout of an arbitrary program may not be valid UTF-8 (the encoding of a Unicode string), the output is not promoted to &str and remains a raw sequence of bytes, a &[u8].
There are multiple ways to solve this.
Either you assert that file outputs a unicode string encoded in utf-8, by using ::std::str::from_utf8, so that you can then use str's enhanced / smarter .contains();
or you perform the search at the byte level, (ab)using the fact you are only looking for ASCII bytes, where all the encodings concur. That is, you are looking for C's memmem() function, for which there must already be some crates.
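The first option can be sketched as follows. This is only a minimal illustration, assuming the output really is UTF-8; looks_like_latin1 is a hypothetical helper name, not something from the question's code:

```rust
use std::str;

/// Returns true when `output` (raw bytes captured from a child process)
/// is valid UTF-8 and mentions both "ISO-8859" and "text".
/// `looks_like_latin1` is a hypothetical name chosen for this sketch.
fn looks_like_latin1(output: &[u8]) -> bool {
    match str::from_utf8(output) {
        // Valid UTF-8: the bytes can be viewed as a &str, so str::contains works.
        Ok(text) => text.contains("ISO-8859") && text.contains("text"),
        // Not valid UTF-8: the bytes cannot be treated as a &str at all.
        Err(_) => false,
    }
}
```

Returning false on invalid UTF-8 is a choice made for this sketch; depending on the program, reporting the error may be more appropriate.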
If you don't care about performance (given your example, it shouldn't matter), there is an easy naive implementation using .windows():
trait ContainsSlice {
    type Item: PartialEq;
    fn contains_slice(self: &'_ Self, slice: &'_ [Self::Item]) -> bool;
}
impl<Item: PartialEq> ContainsSlice for [Item] {
    type Item = Item;
    fn contains_slice(self: &'_ [Item], slice: &'_ [Item]) -> bool {
        let len = slice.len();
        if len == 0 {
            return true;
        }
        self.windows(len)
            .any(move |sub_slice| sub_slice == slice)
    }
}
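Applied to the check from the question, the byte-level search could look roughly like this. It is a sketch under assumptions: contains_bytes, is_latin1_text and describe are hypothetical names, and the free function only restates the .windows() technique so the block compiles on its own. String::from_utf8_lossy is used for printing, since [u8] has no Display implementation:

```rust
/// Same `.windows()` technique as `contains_slice`, written as a free
/// function so this sketch compiles on its own (hypothetical name).
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    // The is_empty() check short-circuits, so windows(0) is never called.
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

/// Byte-level version of the check from the question.
fn is_latin1_text(file_output: &[u8]) -> bool {
    // b"..." literals are byte strings, so no &str is involved.
    contains_bytes(file_output, b"ISO-8859") && contains_bytes(file_output, b"text")
}

/// Builds the message to print. from_utf8_lossy replaces invalid UTF-8
/// sequences with U+FFFD, so the bytes can be formatted with `{}`.
fn describe(file_output: &[u8]) -> String {
    if is_latin1_text(file_output) {
        String::from("input file is latin1")
    } else {
        format!(
            "input file is *not* latin1 but is {}",
            String::from_utf8_lossy(file_output)
        )
    }
}
```

With the ContainsSlice trait in scope, the same check should be expressible as file_output.contains_slice(b"ISO-8859").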
Thank you, that indeed solves my issue. Knowing that the output is UTF-8 encoded would help, but I can't assume that, as the program file may produce output in a different encoding.
Cargo clippy (a linter for Rust) gave me the following tips here:
Replace slice.len() == 0 with slice.is_empty(). The first argument is that this could be more efficient if calculating the length takes longer than determining whether the collection is empty. In our case this shouldn't matter, as we need the length as well. The other argument is that it expresses the intention better.
Replace .position(move |sub_slice| sub_slice == slice).is_some() with .any(move |sub_slice| sub_slice == slice). Apparently they mean the same thing.
Using .any(...) is definitely better / cleaner; I have updated the code to use it (I wasn't super fond of that .is_some()).
As you said, using len == 0 shows a clearer intent regarding the case we are guarding against (.windows panic!ing when len == 0), and performance-wise, fetching the length of a slice should always be equivalent to a field access, since a(ny pointer to a) slice is just a { ptr: NonNull<T>, len: usize } pair (a fat pointer). I have used a variable to make it even clearer, though. This way clippy no longer complains.
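The fat-pointer point can be checked directly. A small sketch; both function names are illustrative only and not from the thread:

```rust
use std::mem::size_of;

/// A reference to a slice carries a data pointer plus a length,
/// so it is two machine words wide.
fn slice_ref_is_two_words() -> bool {
    size_of::<&[u8]>() == 2 * size_of::<usize>()
}

/// For a slice, `is_empty()` and `len() == 0` give the same answer;
/// both only read the length stored in the fat pointer.
fn empty_checks_agree(slice: &[u8]) -> bool {
    slice.is_empty() == (slice.len() == 0)
}
```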
For a more general version (i.e., checking whether a slice of Strings contains a certain subslice of &strs), you can use the following tweaked definition: