How to convert Chinese c_char to String

I loaded some data containing Chinese text from a file via FFI.
The c_char data was correct, but converting it to a String gave the wrong result, like this:

pub fn _string(raw_ptr: *const c_char) -> String {
    let c_str = unsafe { CStr::from_ptr(raw_ptr) };
    c_str.to_string_lossy().into_owned()
}
Result:
 Original word: 标识码
 CStr: \xb1\xea\xca\xb6\xc2\xeb
 String after conversion: ��ʶ��

Is there a way to fix this?

Rust Strings must be UTF-8. Based on the escape sequence, you do not have UTF-8 data. Is it BIG5 by any chance? That seems to work, although I don't read CJK.
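To make the "not UTF-8" point concrete, here is a minimal sketch using only the standard library. The helper names `first_invalid_utf8_offset` and `lossy` are made up for illustration; the bytes are the ones from the CStr output in the question.

```rust
// The six bytes shown in the CStr debug output from the question.
const RAW: &[u8] = b"\xb1\xea\xca\xb6\xc2\xeb";

/// Strict check: returns the byte offset of the first invalid UTF-8
/// sequence, or None if the whole slice is valid UTF-8.
/// (Hypothetical helper, not part of any library.)
fn first_invalid_utf8_offset(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}

/// Lossy conversion: every invalid sequence becomes U+FFFD, which is
/// also what CStr::to_string_lossy does with the bytes of the C string.
/// (Hypothetical helper, not part of any library.)
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

Calling `first_invalid_utf8_offset(RAW)` reports an error at offset 0, so not even the first byte is valid UTF-8. `lossy(RAW)` reproduces the garbled string from the question: the only pair that happens to form a valid UTF-8 sequence is `\xca\xb6` (the character ʶ), and everything else turns into the replacement character.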


I usually use Python for this kind of thing because it's easy to experiment with different encodings and it supports a lot out of the box.

>>> b = b'\xb1\xea\xca\xb6\xc2\xeb'
>>> b.decode("big5")
'梓妎鎢'
>>> b.decode("gb18030")
'标识码'

I also can't read it, but the GB 18030 decoding looks like it matches your original text. I discovered this through the highly scientific process of scanning the Wikipedia article "Chinese character encoding" and guessing at encoding names until one looked right.

encoding_rs supports GB 18030, along with some other encodings that look related and might also work.


For your information: c_char, i.e. "char" in the C language, is really just a fancy way of saying "byte". It is a single byte without much meaning, so a sequence of c_chars is just arbitrary binary data and doesn't prescribe any particular encoding. String and char in Rust, by contrast, are Unicode-based: String is UTF-8 encoded, char is 4 bytes long and represents a Unicode scalar value, and CStr::to_string_lossy interprets the data in the CStr as UTF-8. The FFI call that you get your data from should probably specify what encoding the data is in. Judging by its Wikipedia article, GB18030 seems to be a commonly supported government standard in mainland China, so maybe that information is even implicit.
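A small standard-library sketch of the byte-versus-char distinction. The `SAMPLE` constant stands in for whatever buffer the FFI call returns, and the function names are made up for illustration.

```rust
use std::ffi::CStr;
use std::mem::size_of;
use std::os::raw::c_char;

// Stand-in for the FFI buffer: the six bytes from the question plus
// the terminating NUL that a C string carries.
const SAMPLE: &[u8] = b"\xb1\xea\xca\xb6\xc2\xeb\0";

/// Returns the sample as a CStr, as it would look after CStr::from_ptr.
/// (Hypothetical helper, for illustration only.)
fn sample_c_str() -> &'static CStr {
    CStr::from_bytes_with_nul(SAMPLE).expect("sample ends with exactly one NUL")
}

/// Sizes in bytes of C's char and Rust's char, in that order.
/// (Hypothetical helper, for illustration only.)
fn char_sizes() -> (usize, usize) {
    (size_of::<c_char>(), size_of::<char>())
}
```

`char_sizes()` returns 1 and 4. `sample_c_str().to_bytes()` hands back the six raw bytes untouched, with no encoding assumed, whereas `sample_c_str().to_str()` returns an error because it insists on UTF-8. The raw byte slice is what you would pass on to a decoder for the actual encoding.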

Adapting your original code to use encoding_rs with GB18030 could look as follows:

use std::ffi::CStr;
use std::os::raw::c_char;

use encoding_rs::GB18030;

unsafe fn string(raw_ptr: *const c_char) -> String {
    let c_str = CStr::from_ptr(raw_ptr);
    
    GB18030.decode_without_bom_handling(c_str.to_bytes()).0.into_owned()
}



Thank you so much.

Thank you.

Thanks.