In my CLI application, I search for a user-input string in an array of strings. A Korean user of the app reported a bug, and after some digging, it seems that two seemingly identical strings are encoded as two different byte slices.
The string being searched is `haystack`, and I am trying to find `query` in it. `query` is typed by the user and read by my application.
haystack: "진흙속의연꽃 속 다 지구"
query: "지구"
(Note that `query` seems to be the last two characters of `haystack`.)
Using the above values, `haystack.contains(query)` returns `false`. However, if I hardcode the value of `query` into my source code as `embed` and call `haystack.contains(embed)`, I get `true`!
And looking at their byte representations, I get:
query: "지구" [225, 132, 140, 225, 133, 181, 225, 132, 128, 225, 133, 174]
embed: "지구" [236, 167, 128, 234, 181, 172]
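For reference, here is a minimal reproduction sketch, assuming Rust with only the standard library. The byte arrays are copied from the output above; `code_points` is a hypothetical helper added just for this illustration, and the end of `haystack` is rebuilt from the `embed` bytes so the test does not depend on how this post's text is encoded. Both byte sequences decode as valid UTF-8, but to different code points: four for `query` and two for `embed`.

```rust
// Byte sequences copied verbatim from the output above.
const QUERY_BYTES: [u8; 12] = [225, 132, 140, 225, 133, 181, 225, 132, 128, 225, 133, 174];
const EMBED_BYTES: [u8; 6] = [236, 167, 128, 234, 181, 172];

/// Hypothetical helper: decode UTF-8 bytes and list the code points as "U+XXXX".
fn code_points(bytes: &[u8]) -> Vec<String> {
    let s = std::str::from_utf8(bytes).expect("input should be valid UTF-8");
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Hypothetical helper: build the haystack so that it ends with the `embed` bytes.
fn build_haystack() -> String {
    let embed = std::str::from_utf8(&EMBED_BYTES).expect("input should be valid UTF-8");
    format!("진흙속의연꽃 속 다 {}", embed)
}

fn main() {
    let query = std::str::from_utf8(&QUERY_BYTES).expect("input should be valid UTF-8");
    let embed = std::str::from_utf8(&EMBED_BYTES).expect("input should be valid UTF-8");
    let haystack = build_haystack();

    // Both decode without error, so both are valid UTF-8.
    println!("query: {:?}", code_points(&QUERY_BYTES)); // U+110C, U+1175, U+1100, U+116E
    println!("embed: {:?}", code_points(&EMBED_BYTES)); // U+C9C0, U+AD6C

    // str::contains compares the underlying bytes, so only `embed` matches.
    println!("contains(query) = {}", haystack.contains(query)); // false
    println!("contains(embed) = {}", haystack.contains(embed)); // true
}
```

The two strings render identically but differ at the code-point level, which is why a byte-wise `contains` treats them as different.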
Is this related to UTF-8 vs. UTF-16?
My CLI application is invoked by a third-party GUI, and within my app I can confirm that on my computer it sets `LC_CTYPE` to `en_US.UTF-8`.
In the above examples, `haystack` is fetched from the web, so it's possible that it uses a different encoding than the one my Korean user's system produces when they type in `query`. How can I address this issue to prevent future bugs?