Format!-like writing to non-utf8 file?


#1

Background: I’m writing a library for writing PDF files. PDF is a partially text-like binary format; it has no fixed encoding, but large parts of a file are readable as ASCII text. However, that text sometimes contains strings in an arbitrary 8-bit encoding.

Question: Is there something like the format! macro where an argument can implement something like Display but write arbitrary binary data instead of UTF-8? Or is there a way to write a &[u8] instead of a &str to a std::fmt::Formatter?

Or in other words, is there any way to replace this code:

try!(output.write_all(b"("));
try!(output.write_all(&encoding.encode_string(text)));
try!(output.write_all(b") Tj\n"));

With anything like this:

try!(write!(output, b"{} Tj\n", encoded_string(text, &encoding)));

(Where encoded_string creates a struct that implements something like the Display trait.)
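To make the idea concrete, here is a minimal sketch of what such a Display-like trait could look like if it targeted io::Write directly. The names WriteBytes, EncodedString and show_text are hypothetical and exist only for this illustration; nothing like them is provided by std.

```rust
use std::io::{self, Write};

/// Hypothetical byte-oriented analogue of `Display`: the value writes raw
/// bytes to an `io::Write` instead of UTF-8 text to a `fmt::Formatter`.
trait WriteBytes {
    fn write_bytes(&self, out: &mut dyn Write) -> io::Result<()>;
}

/// Hypothetical wrapper around text that has already been encoded
/// into some 8-bit encoding.
struct EncodedString(Vec<u8>);

impl WriteBytes for EncodedString {
    fn write_bytes(&self, out: &mut dyn Write) -> io::Result<()> {
        // Writes the bytes between parentheses, unchanged.
        // No escaping of `(`, `)` or `\` is done in this sketch.
        out.write_all(b"(")?;
        out.write_all(&self.0)?;
        out.write_all(b")")
    }
}

/// Writes the encoded string followed by the `Tj` operator.
fn show_text(out: &mut dyn Write, text: &EncodedString) -> io::Result<()> {
    text.write_bytes(out)?;
    out.write_all(b" Tj\n")
}
```

This keeps the three write_all calls, but moves them behind one trait method, so call sites shrink to something like show_text(&mut output, &encoded).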


#2

This may be a bad idea, but can you format! it and then use as_bytes?

I would think that shouldn’t cost too much in terms of performance.

Edit: basically, instead of answering your question I flipped it. Could you use a UTF-8 formatted String and then just write it as bytes to your output?


#3

The non-UTF-8 byte string I have is what I need to pass as an argument to the formatting, not (only) the output. The following actually seems to work:

write!(self.output, "({}) Tj\n", unsafe { str::from_utf8_unchecked(&bytes) })

But the documentation for from_utf8_unchecked says “This function is unsafe because it does not check that the bytes passed to it are valid UTF-8. If this constraint is violated, undefined behavior results, as the rest of Rust assumes that &strs are valid UTF-8.” So I assume this is not actually a good idea …
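For comparison, the checked counterpart str::from_utf8 makes the violated constraint visible. This is a small sketch: the helper names is_valid_utf8 and latin1_cafe are made up for the example, and Latin-1 is only an assumed stand-in for "some 8-bit encoding".

```rust
use std::str;

/// Returns true if `bytes` could safely be viewed as a `&str`.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

/// Example input: "café" in Latin-1, where é is the single byte 0xE9.
fn latin1_cafe() -> Vec<u8> {
    vec![b'c', b'a', b'f', 0xE9]
}
```

Here is_valid_utf8(&latin1_cafe()) is false, because a lone 0xE9 byte starts a multi-byte UTF-8 sequence that never completes. Those are exactly the bytes that from_utf8_unchecked would pass along unchecked.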


#4

Ah, you have non-UTF-8 INPUTS…

Hmm… well, I can’t help you there BUT I would be surprised if “undefined behavior” actually hurt you. The way I would implement the formatter for writing strings would be to just write all its bytes – which is what you expect. I think “undefined behavior” would only happen if you tried to READ the string as UTF-8.

I would glance at the source and write a couple of tests to validate that things work as expected. If everything works as expected, I would open an issue against Rust to get this behavior documented and covered by unit tests. I can’t see why it should not be allowed.


#5

Actually, it kind of is documented right here: “as the rest of Rust assumes that &strs are valid UTF-8”

Since you are not going to read self.output as an &str I would think you should be fine.

Again, I would write a couple of unit tests and make sure everything works as expected, and maybe even open an issue. This is certainly an interesting use case.


#6

@vitiral

No, unfortunately that’s generally not OK. In this case, the formatting internals are Unicode aware and are free to assume that you’re writing UTF-8 (parts of the formatting code will, e.g., parse the string as Unicode to count chars). It may be OK in this case but that could change at any time.
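One place where this is visible from safe code is width handling: padding is computed from the number of chars, not the number of bytes, so the formatter has to decode the string. A small sketch with a valid string (pad_right_aligned is just a name chosen for this example):

```rust
/// Right-aligns `s` in a field of `width` using the standard formatter,
/// followed by a `|` so the padding is easy to see.
fn pad_right_aligned(s: &str, width: usize) -> String {
    format!("{:>width$}|", s, width = width)
}
```

For "é" (two bytes, one char) and a width of 4 this yields three spaces of padding, not two. With a &str that is not valid UTF-8, that decoding step would be running on data it is allowed to assume never occurs.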

@kaj

I think you may just be stuck doing this manually for now (or write some form of macro to make it nicer). Sorry.
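A rough sketch of what such a macro could look like, assuming every argument is something that can be sliced to &[u8] (byte string literals, Vec<u8>, and so on). The names write_parts and write_bytes! are hypothetical, not an existing API:

```rust
use std::io::{self, Write};

/// Writes each part in order, stopping at the first I/O error.
fn write_parts<W: Write + ?Sized>(out: &mut W, parts: &[&[u8]]) -> io::Result<()> {
    for part in parts {
        out.write_all(part)?;
    }
    Ok(())
}

/// Hypothetical convenience macro: `write_bytes!(out, a, b, c)` writes the
/// byte slices `a`, `b` and `c` to `out` (a `&mut` writer) one after another.
macro_rules! write_bytes {
    ($out:expr, $($part:expr),+ $(,)?) => {
        write_parts($out, &[$( &$part[..] ),+])
    };
}
```

With that in place, the three calls from the original post could become something like write_bytes!(&mut output, b"(", encoding.encode_string(text), b") Tj\n")?. There is no format string and no {} placeholder, so it is only a thin wrapper over the manual version, but nothing ever has to pretend to be a &str.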