How to convert Chinese c_char to String

GISerliang · February 22, 2022, 10:25am

Through FFI, I loaded some data from a file that contains Chinese words.
The c_char data was correct, but converting it to a String gave the wrong result, as shown here.

pub fn _string(raw_ptr: *const c_char) -> String {
    let c_str = unsafe { CStr::from_ptr(raw_ptr) };
    c_str.to_string_lossy().into_owned()
}

Result:
 Original word: 标识码
 CStr: \xb1\xea\xca\xb6\xc2\xeb
 String after conversion: ��ʶ��

Is there any way to fix this?

H2CO3 · February 22, 2022, 10:35am

Rust Strings must be UTF-8. Based on the escape sequence, you do not have UTF-8 data. Is it BIG5 by any chance? That seems to work, although I don't read CJK.
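
A minimal sketch of why the conversion goes wrong, using only the standard library. The six bytes from the question are not well-formed UTF-8, so a strict check rejects them and a lossy conversion substitutes U+FFFD. The names `RAW`, `is_valid_utf8`, and `lossy` are illustrative and do not come from the original code.

```rust
/// The six bytes shown in the question's `CStr` dump.
const RAW: &[u8] = b"\xb1\xea\xca\xb6\xc2\xeb";

/// Strict check: true only if the bytes are well-formed UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Lossy conversion, the same strategy `CStr::to_string_lossy` uses:
/// each invalid sequence is replaced with U+FFFD.
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

Only the pair `\xca\xb6` happens to form a valid UTF-8 sequence (U+02B6, `ʶ`), which is why a single character survives between the replacement characters in the garbled output.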

trentj · February 22, 2022, 11:17am

I usually use Python for this kind of thing because it's easy to experiment with different encodings and it supports a lot out of the box.

>>> b = b'\xb1\xea\xca\xb6\xc2\xeb'
>>> b.decode("big5")
'梓妎鎢'
>>> b.decode("gb18030")
'标识码'

I also can't read it, but the GB 18030 decoding looks like it matches your original text. I discovered this through the highly scientific process of scanning the Wikipedia article on Chinese character encoding and guessing at encoding names until one looked right.

encoding_rs supports GB 18030, along with some other encodings that look related and might also work.

steffahn · February 22, 2022, 12:00pm

For your information: c_char, i.e. “char” in the C language, is really just a fancy way of saying “byte”. It is a single byte without much meaning of its own; a sequence of c_chars is just arbitrary binary data, so it doesn’t prescribe any particular encoding. String and char in Rust are Unicode-based: String is UTF-8 encoded, char is 4 bytes long and represents a Unicode scalar value, and CStr::to_string_lossy interprets the data in the CStr as UTF-8. The FFI call that you get your data from should ideally specify which encoding the data is in. Judging by its Wikipedia article, GB18030 seems to be a commonly supported government standard in mainland China, so that information may even be implicit.

Adapting your original code to use encoding_rs with GB18030 can look as follows:
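
The points about sizes and raw bytes can be sketched as follows, using only the standard library. A locally built `CString` stands in for the pointer returned by the FFI call, and the helper names `sizes`, `raw_bytes`, and `round_trip` are hypothetical.

```rust
use std::ffi::{CStr, CString};
use std::mem::size_of;
use std::os::raw::c_char;

/// Sizes in bytes of C's `char` and Rust's `char`.
fn sizes() -> (usize, usize) {
    (size_of::<c_char>(), size_of::<char>())
}

/// Copies the bytes of a NUL-terminated C string without interpreting them.
///
/// # Safety
/// `raw_ptr` must be non-null and point to a valid NUL-terminated string.
unsafe fn raw_bytes(raw_ptr: *const c_char) -> Vec<u8> {
    CStr::from_ptr(raw_ptr).to_bytes().to_vec()
}

/// Stand-in for the FFI call: sends bytes through a C string and back.
fn round_trip(bytes: &[u8]) -> Vec<u8> {
    let owned = CString::new(bytes).expect("no interior NUL bytes");
    // `owned` outlives the call, so the pointer stays valid.
    unsafe { raw_bytes(owned.as_ptr()) }
}
```

The bytes come back unchanged whatever encoding they were written in. The same three characters take 9 bytes as a UTF-8 `&str` (`"标识码".len()`) but only 6 bytes in the data from the question, which is another sign that the input is not UTF-8.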

use std::ffi::CStr;
use std::os::raw::c_char;

use encoding_rs::GB18030;

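/// # Safety
/// `raw_ptr` must point to a valid NUL-terminated C string.
unsafe fn string(raw_ptr: *const c_char) -> String {
    let c_str = CStr::from_ptr(raw_ptr);

    // Decode the raw bytes as GB18030; `.0` is the decoded text.
    GB18030.decode_without_bom_handling(c_str.to_bytes()).0.into_owned()
}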
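
If the pointer coming from C can be null, a guarded variant may be worth considering. The following is only a sketch under that assumption: `decode_c_string` and `utf8_lossy` are hypothetical helpers, and the decoder is passed in as a closure so the sketch compiles with the standard library alone.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical helper: returns `None` for a null pointer, otherwise
/// hands the raw bytes of the C string to `decode`.
///
/// # Safety
/// If non-null, `raw_ptr` must point to a valid NUL-terminated string.
unsafe fn decode_c_string<F>(raw_ptr: *const c_char, decode: F) -> Option<String>
where
    F: FnOnce(&[u8]) -> String,
{
    if raw_ptr.is_null() {
        return None;
    }
    Some(decode(CStr::from_ptr(raw_ptr).to_bytes()))
}

/// Stand-in decoder for this sketch only (lossy UTF-8).
fn utf8_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

With encoding_rs available, the closure `|b| GB18030.decode_without_bom_handling(b).0.into_owned()` could be passed in place of `utf8_lossy`. The second element of the tuple returned by `decode_without_bom_handling` is a `bool` that reports whether malformed sequences were replaced, so checking it is one way to detect a wrong encoding guess.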


GISerliang · February 22, 2022, 12:08pm

Thank you so much.

GISerliang · February 22, 2022, 12:09pm

Thank you.

GISerliang · February 22, 2022, 12:10pm

Thanks.
