In my CLI application, I search for a user-input string in an array of strings. A Korean user of the app reported a bug, and after some digging, it seems that two seemingly identical strings are encoded as two different byte slices.
I have a string haystack, and I am trying to find query in it. query is typed by the user and read by my application.
haystack: "진흙속의연꽃 속 다 지구" query: "지구"
query seems to be the last two characters in haystack. Using the above values, haystack.contains(query) returns false. However, if I hardcode the value of query into my source code as embed and do haystack.contains(embed), I get true.
And looking at their byte representations, I get:
query: "지구" [225, 132, 140, 225, 133, 181, 225, 132, 128, 225, 133, 174] embed: "지구" [236, 167, 128, 234, 181, 172]
Is this related to UTF-8 vs UTF-16?
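To narrow this down, here is a minimal sketch that decodes both byte slices as UTF-8 and lists the code points each one contains. It assumes Rust, which is what the haystack.contains(...) call and the byte output suggest. The helper name code_points is made up for this illustration and is not part of my application.

```rust
/// Hypothetical helper: decode a byte slice as UTF-8 and return its
/// code points formatted as "U+XXXX". Returns None if the bytes are
/// not valid UTF-8.
fn code_points(bytes: &[u8]) -> Option<Vec<String>> {
    let s = std::str::from_utf8(bytes).ok()?;
    Some(s.chars().map(|c| format!("U+{:04X}", c as u32)).collect())
}

fn main() {
    // Byte values copied from the output shown above.
    let query: [u8; 12] = [225, 132, 140, 225, 133, 181, 225, 132, 128, 225, 133, 174];
    let embed: [u8; 6] = [236, 167, 128, 234, 181, 172];

    // Both slices decode successfully, so both are valid UTF-8.
    println!("query: {:?}", code_points(&query));
    println!("embed: {:?}", code_points(&embed));
}
```

Both slices decode without error, so neither looks like UTF-16. The typed query comes out as four code points (U+110C, U+1175, U+1100, U+116E), which are conjoining Hangul Jamo, while the hardcoded embed comes out as two precomposed syllables (U+C9C0, U+AD6C). So the difference seems to be in which code points are used to spell the same text, not in the byte encoding itself.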
My cli application is invoked using a 3rd party GUI, and within my app I can confirm that on my computer it sets
In the above examples, haystack is fetched from the web, so it's possible it's using a different encoding than what my Korean user has when they type in query. How can I address this issue to prevent future bugs?