Announcing Stringsext Version 2

reu · January 17, 2020, 9:44am

This blog post is about stringsext version 2, a major rewrite of the original software. Version 2 comes with performant filters for scripts such as “Latin”, “Arabic”, “Hebrew”, “Cyrillic”, ...

stringsext is a Unicode enhancement of the GNU strings tool with additional functionality: stringsext recognizes Cyrillic, CJKV characters and other scripts in all supported multi-byte encodings, while GNU strings fails to find any of these scripts in UTF-16 and many other encodings.

Screenshot

stringsext -tx -e utf-8 -e utf-16le -e utf-16be \
           -n 10 -a None -u African  /dev/disk/by-uuid/567a8410

 3de2fff0+	(b UTF-16LE)	ݒݓݔݕݖݗݙݪ
 3de30000+	(b UTF-16LE)	ݫݱݶݷݸݹݺ
<3de36528 	(a UTF-8)	فيأنمامعكلأورديافىهولملكاولهبسالإنهيأيقدهلثمبهلوليبلايبكشيام
>3de36528+	(a UTF-8)	أمنتبيلنحبهممشوش
<3de3a708 	(a UTF-8)	علىإلىهذاآخرعددالىهذهصورغيركانولابينعرضذلكهنايومقالعليانالكن
>3de3a708+	(a UTF-8)	حتىقبلوحةاخرفقطعبدركنإذاكمااحدإلافيهبعضكيفبح
 3de3a780+	(a UTF-8)	ثومنوهوأناجدالهاسلمعندليسعبرصلىمنذبهاأنهمثلكنتالاحيثمصرشرححو
 3de3a7f8+	(a UTF-8)	لوفياذالكلمرةانتالفأبوخاصأنتانهاليعضووقدابنخيربنتلكمشاءوهياب
 3de3a870+	(a UTF-8)	وقصصومارقمأحدنحنعدمرأياحةكتبدونيجبمنهتحتجهةسنةيتمكرةغزةنفسبي
 3de3a8e8+	(a UTF-8)	تللهلناتلكقلبلماعنهأولشيءنورأمافيكبكلذاترتببأنهمسانكبيعفقدحس
 3de3a960+	(a UTF-8)	نلهمشعرأهلشهرقطرطلب
 3df4cca8 	(c UTF-16BE)	փօև։֋֍֏֑֛֚֓֕֗֙֜֝֞׹
<3df4cd20 	(c UTF-16BE)	־ֿ׀ׁׂ׃ׅׄ׆ׇ׈׉׊׋

History

stringsext was first publicly released in December 2016 and had not changed much over the following three years. Of course, there were some smaller bug fixes, and its migration to Rust's 2018 edition took some time, but the design and structure remained untouched. In the meantime, the backend library rust-encoding had been abandoned in favour of encoding_rs, and it became apparent that migrating to the new library amounted to a major rewrite of the source code. Starting the new development was also a good opportunity to revise some early design decisions taken when I first started the development in summer 2016. I was not very experienced in Rust programming back then. Since memory safety was a fundamental requirement from the start, the first version of stringsext had to get along with no unsafe{} Rust code. This design decision, together with stringsext's multi-threaded architecture, resulted in numerous costly string copies and heap allocations.

In order to optimize speed, stringsext version 2 avoids copying stream data as much as possible. This goal could not have been achieved without some pointer arithmetic tagged unsafe{}. To prevent misunderstanding: the unsafe{} tag does not mean that the enclosed code is unsafe. Rather, it means that for a few well-encapsulated lines of code the compiler permits operations it cannot verify, such as dereferencing raw pointers, and the programmer has to guarantee and test memory safety without compiler help. The biggest security concern with stringsext is that software tools used in disk and memory forensics are particularly exposed to malicious input. In stringsext, the code that first reads the potentially malicious input is the so-called “decoder”, which is provided by the external library encoding_rs. This explains why stringsext's security depends heavily on the security guarantees of this library. encoding_rs was chosen with security concerns in mind: it is well established, part of common web browsers and as such widely used, tested and trustworthy. In addition, encoding_rs is also written in Rust, and thus offers the same memory safety guarantees as stringsext does. Because of its upstream decoder, stringsext is never directly in contact with the input stream, which reduces its attack surface significantly.
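To illustrate what "a few well-encapsulated unsafe lines" can look like, here is a minimal sketch that returns a borrowed view into a byte buffer instead of copying it. It is not taken from the stringsext source; the function name `valid_prefix` is hypothetical, and only standard library calls are used.

```rust
/// Returns the longest valid UTF-8 prefix of `bytes` as a borrowed `&str`.
/// No bytes are copied: the result points into the caller's buffer.
/// Hypothetical helper for illustration, not part of stringsext.
fn valid_prefix(bytes: &[u8]) -> &str {
    match std::str::from_utf8(bytes) {
        Ok(s) => s,
        Err(e) => {
            let end = e.valid_up_to();
            // SAFETY: `from_utf8` has just verified that `bytes[..end]`
            // is valid UTF-8, so skipping the second check is sound.
            unsafe { std::str::from_utf8_unchecked(&bytes[..end]) }
        }
    }
}

fn main() {
    // A buffer with readable text followed by bytes that are not UTF-8.
    let stream = b"path=/etc/passwd\xFF\xFEgarbage";
    println!("{}", valid_prefix(stream));
}
```

The unsafe block is a single line, and the invariant it relies on is established immediately before it, so callers only ever see a safe function.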

One of the first features added to stringsext version 1 was the Unicode-Block-Filter, which restricts the search to characters within some Unicode code-point range. This feature proved to be very successful in memory forensics, which deals a lot with UTF-16. Almost any random byte stream can be interpreted as valid UTF-16; therefore the Unicode-Block-Filter is vital for reducing false positives when scanning for this encoding. At the same time, forensic practice also showed the limitations of the first filter design. To improve the situation, stringsext version 2 comes with a new and more configurable Unicode-Block-Filter.
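The UTF-16 problem can be reproduced with the standard library alone. The sketch below decodes arbitrary bytes as UTF-16LE and then keeps only code points in a given range. The helper `filter_utf16le` is a hypothetical name, and this per-character filter is a simplification for illustration, not the way stringsext implements its Unicode-Block-Filter.

```rust
use std::char::decode_utf16;
use std::ops::RangeInclusive;

/// Decodes `bytes` as UTF-16LE and keeps only the characters whose
/// code point lies in `range`. Hypothetical helper, not stringsext's API.
fn filter_utf16le(bytes: &[u8], range: RangeInclusive<u32>) -> String {
    let units = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    decode_utf16(units)
        .filter_map(Result::ok) // drop unpaired surrogates
        .filter(|c| range.contains(&(*c as u32)))
        .collect()
}

fn main() {
    // Arbitrary bytes: every 16-bit unit outside the surrogate range
    // decodes to some character, meaningful or not.
    let noise = [0x12, 0x34, 0xAB, 0xCD, 0x41, 0x00, 0x10, 0x04];
    let unfiltered = filter_utf16le(&noise, 0..=0x10FFFF);
    let cyrillic = filter_utf16le(&noise, 0x0400..=0x04FF);
    println!(
        "unfiltered: {} chars, Cyrillic block only: {:?}",
        unfiltered.chars().count(),
        cyrillic
    );
}
```

All four 16-bit units in the sample decode to a character, but only one of them falls into the Cyrillic block (U+0400 to U+04FF), which is the kind of reduction in false positives the filter is meant to provide.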

stringsext is a stream-oriented tool that works well with other UNIX tools like head, tail, sed or grep. Version 2 has its own simple grep-like filter, useful for scanning for file paths and URLs. For more complex filtering with regular expressions, pipe stringsext's output through grep or sed.

Summary of changes, improvements and new features

Much faster (>30%).
Improved Unicode-Block-Filter:
- Search in scripts with predefined filters, e.g.
  Latin, Arabic, Syriac, Cyrillic, etc., and any combination of
  these.
- Configurable custom filter.
Improved ASCII-Filter:
- Predefined filters, e.g. "all ASCII except control characters" or "all
  ASCII with white-space, without controls".
- Custom filters with configurable sets of ASCII-codes that
  pass the filter.
New internal "grep"-like filter, mainly useful to search for path strings.
More detailed position indication for long strings.
Better interface with other stream oriented tools e.g. "head",
"tail", "sed" and "grep".
Better handling of zero terminated (C-style) strings in large fields.
New backend "encoding_rs".
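As a rough illustration of the "configurable sets of ASCII-codes that pass the filter" item above, such a set can be modelled as a 128-bit mask with one bit per ASCII code. This is a sketch under that assumption; the type `AsciiFilter` and its methods are hypothetical names and are not claimed to match stringsext's internals.

```rust
/// A set of ASCII codes stored as a 128-bit mask:
/// bit n set means that code n passes the filter.
/// Hypothetical type for illustration only.
#[derive(Clone, Copy)]
struct AsciiFilter(u128);

impl AsciiFilter {
    /// Builds a filter from the codes that should pass.
    /// Values of 128 and above are ignored.
    fn from_codes(codes: impl IntoIterator<Item = u8>) -> Self {
        let mut mask = 0u128;
        for c in codes {
            if c < 128 {
                mask |= 1u128 << c;
            }
        }
        AsciiFilter(mask)
    }

    /// Returns true if `byte` is an ASCII code contained in the set.
    fn passes(self, byte: u8) -> bool {
        byte < 128 && (self.0 >> byte) & 1 == 1
    }
}

fn main() {
    // "All ASCII except control characters": space (0x20) through tilde (0x7E).
    let printable = AsciiFilter::from_codes(0x20u8..=0x7E);
    let input = b"ok\x00\x07 /usr/bin\n";
    let kept: String = input
        .iter()
        .copied()
        .filter(|&b| printable.passes(b))
        .map(char::from)
        .collect();
    println!("{kept}");
}
```

A mask like this makes the membership test a shift and a bitwise AND, and a custom filter is just a different constant.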

Resources

Read more about stringsext on its project page. A paper about stringsext, published in 2019 in the Journal of Digital Forensics, Security and Law, is available for free download.

stringsext comes with a man page and API documentation. The source code is hosted on GitHub and on GitLab. Binaries are available for download.

