Format!-like writing to non-utf8 file?

kaj · September 10, 2016, 10:25am

Background: I'm writing a library for writing PDF files. PDF is a partially text-like binary format; it has no fixed encoding, but large parts of the files are readable as ASCII text. However, that text sometimes contains strings in an arbitrary 8-bit encoding.

Question: Is there something like the format! macro where an argument can implement something like Display but write arbitrary binary data instead of UTF-8? Or is there a way to write &[u8] instead of &str to a std::fmt::Formatter?

Or in other words, is there any way to replace this code:

try!(output.write_all(b"("));
try!(output.write_all(&encoding.encode_string(text)));
try!(output.write_all(b") Tj\n"));

With anything like this:

try!(write!(output, b"({}) Tj\n", encoded_string(text, &encoding)));

(Where encoded_string creates a struct that implements something like the Display trait.)

vitiral · September 12, 2016, 4:06am

This may be a bad idea, but can you format! it and then use as_bytes?

I would think that shouldn't cost too much in terms of performance.

Edit: basically, instead of answering your question I flipped it. Could you use a UTF-8 formatted String and then just write it as bytes to your output?

kaj · September 13, 2016, 5:51pm

The non-utf8 bytestring I have is what I need as arguments to format, not (only) output. The following actually seems to work:

write!(self.output, "({}) Tj\n", unsafe { str::from_utf8_unchecked(&bytes)})

But the documentation for from_utf8_unchecked says "This function is unsafe because it does not check that the bytes passed to it are valid UTF-8. If this constraint is violated, undefined behavior results, as the rest of Rust assumes that &strs are valid UTF-8." So I assume this is not actually a good idea ...
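For illustration, here is a small sketch of why the safe conversions do not help here. It assumes Latin-1 as the 8-bit encoding (the thread does not name one), and the function names are made up for the example. The checked `str::from_utf8` rejects such bytes, and `String::from_utf8_lossy` replaces them, so neither gives a `&str` that carries the original bytes through `write!`.

```rust
use std::str;

/// Bytes for "café" in Latin-1, where 'é' is the single byte 0xE9.
/// Latin-1 is an assumption chosen for illustration only.
fn latin1_cafe() -> Vec<u8> {
    vec![b'c', b'a', b'f', 0xE9]
}

/// True if the bytes could be handed to `write!` as a `&str` without `unsafe`.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

/// Lossy conversion: invalid sequences become U+FFFD, so the output bytes
/// differ from the input bytes and the encoded text is corrupted.
fn lossy_bytes(bytes: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(bytes).into_owned().into_bytes()
}
```

With these helpers, `is_valid_utf8(&latin1_cafe())` is false, and `lossy_bytes` turns the single byte 0xE9 into the three bytes 0xEF 0xBF 0xBD.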

vitiral · September 13, 2016, 8:29pm

ah, you have non-utf8 INPUTS...

Hmm... well, I can't help you there BUT I would be surprised if "undefined behavior" actually hurt you. The way I would implement the formatter for writing strings would be to just write all its bytes -- which is what you expect. I think "undefined behavior" would only happen if you tried to READ the string as UTF-8.

I would glance at the source and write a couple of tests to validate that things work as expected. If everything works as expected, I would open an issue against rust to get this behavior put in the documentation and tested with unit tests. I can't see why it should not be allowed.

vitiral · September 13, 2016, 8:32pm

actually, it kind of is documented right here: "as the rest of Rust assumes that &strs are valid UTF-8"

Since you are not going to read self.output as an &str I would think you should be fine.

Again, I would write a couple of unit tests and make sure everything works as expected, and maybe even open an issue. This is certainly an interesting use case.

stebalien · September 14, 2016, 1:43am

@vitiral

No, unfortunately that's generally not OK. In this case, the formatting internals are Unicode aware and are free to assume that you're writing UTF-8 (parts of the formatting code will, e.g., parse the string as Unicode to count chars). It may be OK in this case but that could change at any time.
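A small sketch of that Unicode awareness, using only valid UTF-8 so it is safe to run (`pad_to_five` is a made-up name for the example): when a width is given, the formatter pads by counting chars, not bytes, which means it has to decode the string as UTF-8.

```rust
/// Right-aligns `s` in a field of width 5 using the standard formatter.
/// The width is measured in chars, so the formatter must decode `s`.
fn pad_to_five(s: &str) -> String {
    format!("{:>5}", s)
}
```

For the input `"\u{e9}"` (one char, two bytes in UTF-8) the result has 5 chars but 6 bytes: four spaces of padding plus the two-byte character. Code like this has no defined behavior on a `&str` that is not really UTF-8.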

@kaj

I think you may just be stuck doing this manually for now (or write some form of macro to make it nicer). Sorry.
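As a rough sketch of what "making it nicer" could look like without `unsafe`: a plain helper function rather than a macro, working purely on byte slices. The names `write_parts` and `show_text` are hypothetical, and `encoded` stands in for the result of `encoding.encode_string(text)` from the original question.

```rust
use std::io::{self, Write};

/// Hypothetical helper: writes each byte slice in `parts` to `out`, in order,
/// stopping at the first I/O error. No UTF-8 validity is assumed or checked.
fn write_parts<W: Write>(out: &mut W, parts: &[&[u8]]) -> io::Result<()> {
    for part in parts {
        out.write_all(part)?;
    }
    Ok(())
}

/// Writes a PDF "show text" operator for an already encoded string.
/// Assumption: any escaping of '(', ')' and '\' has been done by the encoder.
fn show_text<W: Write>(out: &mut W, encoded: &[u8]) -> io::Result<()> {
    write_parts(out, &[b"(", encoded, b") Tj\n"])
}
```

This keeps the three `write_all` calls from the question in one expression. Numbers and other ASCII-only pieces can still be produced with `format!` and passed in via `as_bytes()`.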
