UTF-8, JS string -> Rust Vec<u8>

A JS string is just an immutable array of JS chars (16-bit code units), not necessarily a valid UTF-8 string as in Rust, right?

If so, how do we, using wasm_bindgen, convert a JS string (which may or may not be valid UTF-8) to a Rust Vec<u8>?

(I have no in-depth knowledge of wasm_bindgen, but) see here. JS strings have the same issues as Windows paths and other systems that bought into "16 bits ought to be enough for everyone".


Independently, between posting my question and reading your response, I also discovered that JS chars do some weird 8/16 bit auto conversion thingy that makes me very very uncomfortable for serialization / deserialization tasks.

In the past I've passed strings from JS to Rust (as an argument, not a field) and back again (as a field of the resulting value). The code is in production now, and I don't recall doing anything special for it, nor did/do I get borked data on either receiving side.

Given this, the solution to your problem could be as easy as:

  1. Transmit the string from JS to Rust using WASM
  2. Use String::into_bytes() (or String::as_bytes()) or something like it
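Step 2 can be sketched in plain std Rust (the wasm_bindgen glue is assumed to have already delivered a valid String on the Rust side):

```rust
fn string_to_bytes(s: String) -> Vec<u8> {
    // A Rust String is guaranteed valid UTF-8, so this never fails
    // and never copies: it just reuses the String's existing buffer.
    s.into_bytes()
}

fn main() {
    let bytes = string_to_bytes(String::from("héllo"));
    // "é" encodes as two bytes in UTF-8, so 6 bytes total.
    assert_eq!(bytes.len(), 6);
    assert_eq!(string_to_bytes(String::from("abc")), vec![97, 98, 99]);
}
```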

Were your JS strings valid UTF? My problem is that I am constructing strings from binary serialization, which may or may not be valid UTF.

The strings start out as regular JS strings. I have no knowledge of how JS defines what a string is, so those details are irrelevant in terms of getting it working.

My experience has been that, given a valid JS string, wasm_bindgen just does the right thing in translating it to Rust, as well as back again.

But it should also be noted that, after immediate post-processing, those strings are pretty much deleted from memory; none of them are kept around long-term.

Here's my XY problem:

  1. OCaml/JS converts some data into an array of u8.
  2. I want Rust/wasm32 to read this as a Vec<u8>.

The problem is that what OCaml/JS generates is binary serialized data, unlikely to be UTF valid.

Rust Strings are definitely valid UTF. I'm not sure about JS strings. Either way, if at any step a "force to be valid UTF" conversion occurs, it corrupts the serialized data.
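The corruption concern is concrete. A small std-only sketch of what happens when arbitrary bytes are pushed through Rust's UTF-8 validation:

```rust
fn main() {
    // Binary serialized data, not valid UTF-8 (0xFF never occurs in UTF-8).
    let raw: Vec<u8> = vec![0x00, 0xFF, 0x80, 0x41];

    // Strict conversion refuses the data rather than mangling it.
    assert!(String::from_utf8(raw.clone()).is_err());

    // Lossy conversion "succeeds" but replaces each invalid sequence with
    // U+FFFD, so the original bytes are unrecoverable.
    let lossy = String::from_utf8_lossy(&raw);
    assert_ne!(lossy.as_bytes(), raw.as_slice());
}
```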

Well if you want to convert it to Vec<u8> rather than String, why is it necessary that the data is valid UTF-8?

OCaml/JS and Rust/wasm32 are both running in Chrome.

OCaml/JS needs to convert its args to some JS Value which can then be passed as an argument to Rust/wasm32.

If JS strings are required to be valid UTF, this is unlikely to work, and I need to examine other routes.

In the worst case, this is the sort of problem that uuencode and b64encode are designed to work around, at a cost of increasing the transfer size.
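For reference, a minimal base64 encoder in plain Rust (in practice you'd reach for a crate rather than hand-roll this; the sketch just illustrates the 3-bytes-in, 4-chars-out scheme and its ~33% size overhead):

```rust
const TABLE: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

fn b64encode(data: &[u8]) -> String {
    let mut out = String::new();
    for chunk in data.chunks(3) {
        // Pack up to 3 bytes into a 24-bit group, zero-padded on the right.
        let b1 = chunk[0] as u32;
        let b2 = *chunk.get(1).unwrap_or(&0) as u32;
        let b3 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b1 << 16) | (b2 << 8) | b3;
        // Emit 4 sextets; slots past the real input become '=' padding.
        for i in 0..4 {
            if i <= chunk.len() {
                out.push(TABLE[((n >> (18 - 6 * i)) & 63) as usize] as char);
            } else {
                out.push('=');
            }
        }
    }
    out
}

fn main() {
    assert_eq!(b64encode(b"abc"), "YWJj");
    // Arbitrary binary bytes come out as plain ASCII, UTF-anything-safe.
    assert_eq!(b64encode(&[0xFF]), "/w==");
}
```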


If both sides are using bytes, why bother with strings? Rather: assuming a string gets involved, exactly how does that happen? Are you sure you have to accept a string? Figuring out how to avoid this is the best path.

JS strings are UTF-16, so they don't correspond directly to bytes or to UTF-8. It's not possible to convert a string from one to the other “without validity checking” because they have different kinds of invalid states that don't correspond to each other.
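Both failure modes are observable from the standard library. A sketch using a lone UTF-16 surrogate, the classic "storable in a JS string, but not valid Unicode" state:

```rust
fn main() {
    // A lone high surrogate followed by 'A': representable as JS string
    // code units, but not a valid Unicode scalar sequence, so Rust
    // refuses to build a String from it.
    let units: [u16; 2] = [0xD800, 0x0041];
    assert!(String::from_utf16(&units).is_err());

    // A properly paired surrogate decodes fine (U+1F600).
    let pair: [u16; 2] = [0xD83D, 0xDE00];
    assert_eq!(String::from_utf16(&pair).unwrap(), "😀");
}
```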


Does it not support Uint8Array?


@2e71828 @kpreid @the8472 : Yeah, I'm increasingly convinced Js/string is the wrong approach and Js/uint8array is the only way to go.

To be fair, Windows NT, JavaScript (and the Java it was copying) came out before UTF-16 was created in 1996. To be unfair, Java and JavaScript at least were well after UTF-8 (1992!), and absolutely could have used that for compatibility with existing systems, even if supplementary planes were not a thing yet.

I meant they were more casualties than fools; saying 16 bits was enough was the Unicode Consortium's blunder[1]. UCS-2 was attractive for the same reasons latin-1 was attractive: given a sequence of code units, there are no invalid encodings. Imagine Unicode with no encoding issues! Would have been great if it worked, but alas.

  1. along with the things they did to try and make it work before admitting it was inadequate ↩︎