Unicode enhancement of the GNU strings tool: search for multi-byte encoded strings in binary data

stringsext is a Unicode enhancement of the GNU strings tool with additional functionality: stringsext recognizes Cyrillic, CJKV characters and other scripts in all supported multi-byte encodings, while GNU strings fails to find any of these scripts in UTF-16 and many other encodings.

stringsext is mainly useful for determining the Unicode content of non-text files. It prints all graphic character sequences in FILE or stdin that are at least MIN bytes long.
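To illustrate the idea of printing graphic character sequences of a minimum length, here is a minimal Python sketch. It is not how stringsext is implemented: the function name find_strings and the parameter min_bytes are hypothetical, the scan assumes the data is aligned to the start of a code unit, and "graphic" is approximated with Python's str.isprintable().

```python
def find_strings(data: bytes, encoding: str, min_bytes: int = 4) -> list[str]:
    """Return runs of printable characters whose encoded length is >= min_bytes."""
    # Undecodable input becomes U+FFFD, which is treated as a run terminator.
    text = data.decode(encoding, errors="replace")
    found: list[str] = []
    run: list[str] = []
    for ch in text + "\x00":  # the trailing NUL flushes the final run
        if ch.isprintable() and ch != "\ufffd":
            run.append(ch)
            continue
        candidate = "".join(run)
        if candidate and len(candidate.encode(encoding)) >= min_bytes:
            found.append(candidate)
        run = []
    return found


# Hypothetical sample: a Cyrillic string and a short Latin string, both UTF-16LE,
# separated by NUL code units.
sample = (
    b"\x00\x00"
    + "Привет мир".encode("utf-16-le")
    + b"\x00\x00"
    + "ab".encode("utf-16-le")
    + b"\x00\x00"
)

print(find_strings(sample, "utf-16-le", min_bytes=8))  # ['Привет мир']
print(find_strings(sample, "ascii", min_bytes=4))      # []
```

The second call shows why a byte-oriented ASCII scan finds nothing here: in UTF-16LE every printable byte is followed by a control byte, so no run reaches the minimum length.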

Unlike GNU strings, stringsext can be configured to search for valid characters not only in ASCII but also in many other input encodings, e.g. utf-8, utf-16be, utf-16le, big5-2003, euc-jp and koi8-r. --list-encodings shows a list of valid encoding names based on the WHATWG Encoding Standard. When more than one encoding is specified, the scans run simultaneously in separate threads.
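The one-thread-per-encoding arrangement can be sketched in Python as follows. This only mirrors the structure described above and is not stringsext's implementation: count_valid_chars and scan_all are hypothetical names, the encoding names are Python codec names rather than WHATWG labels, and Python threads do not give true parallelism for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def count_valid_chars(data: bytes, encoding: str) -> int:
    """Count printable characters obtained when decoding data as encoding."""
    text = data.decode(encoding, errors="replace")
    return sum(1 for ch in text if ch.isprintable() and ch != "\ufffd")


def scan_all(data: bytes, encodings: list[str]) -> dict[str, int]:
    """Run one scan per encoding, each in its own worker thread."""
    with ThreadPoolExecutor(max_workers=len(encodings)) as pool:
        futures = {enc: pool.submit(count_valid_chars, data, enc) for enc in encodings}
        return {enc: future.result() for enc, future in futures.items()}


# Hypothetical input: a Cyrillic word stored as KOI8-R.
data = "Привет".encode("koi8-r")
results = scan_all(data, ["ascii", "koi8-r", "utf-16-le"])
print(results["koi8-r"])  # 6
print(results["ascii"])   # 0
```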

When searching for UTF-16 strings, 96% of all possible two-byte sequences, interpreted as UTF-16 code units, map directly to a Unicode code point. As a result, the probability of encountering a valid Unicode character in a random byte stream interpreted as UTF-16 is 96%. To reduce this large number of false positives, stringsext provides a parameterizable Unicode block filter. See the --encodings option for more details.
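One way to arrive at a figure of this size is to count the 16-bit code units that are not surrogates, since each of those stands for a code point on its own. That reading of the 96% is an assumption; the short calculation below gives the exact fraction under it.

```python
TOTAL_CODE_UNITS = 0x10000                 # all possible 16-bit values
SURROGATES = 0xDFFF - 0xD800 + 1           # 2048 values reserved for surrogate pairs

# Non-surrogate code units each encode one code point by themselves.
fraction = (TOTAL_CODE_UNITS - SURROGATES) / TOTAL_CODE_UNITS
print(f"{fraction:.4%}")  # 96.8750%
```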

stringsext reads its input data from FILE. With no FILE, or when FILE is -, it reads from standard input (stdin).

When invoked with stringsext -e ascii -c i, stringsext can be used as a replacement for GNU strings.
