Error handling and UTF-8 validation

Yeah, there isn't much you can do when the JSON is invalid. In this case I was outputting JSON though, and since Rust enforces that UTF-8 validation happens when the string is created, the unwrap() in question would be at the creation of the string, not at the json_encode call.

The proper detection of invalid data should have happened in the code that created the JSON, not in the code receiving the JSON blob.
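A minimal sketch of this point (the byte values here are illustrative): `String::from_utf8` returns a `Result`, so invalid bytes are caught where the string is created, long before any JSON encoding step.

```rust
// UTF-8 validation happens when the String is created, so invalid bytes
// are rejected here, not later in the serialization code.
fn main() {
    let bytes = vec![0x48, 0x69, 0xFF]; // "Hi" followed by an invalid UTF-8 byte
    match String::from_utf8(bytes) {
        Ok(s) => println!("valid: {}", s),
        // from_utf8 returns a FromUtf8Error instead of panicking;
        // an unwrap() here would be the crash site, not the encoder.
        Err(e) => eprintln!("invalid UTF-8 after byte {}", e.utf8_error().valid_up_to()),
    }
}
```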

The late Jon Postel famously said “Be liberal in what you accept, and conservative in what you send” as a recipe for implementing internet protocols.

I have thought this was insanity since I first heard it decades ago. This is going to make for a lot of unreliable systems, I thought. Security was not at the front of my mind back then.

Sure enough, it came to pass that the result was a web of security vulnerabilities and pervasive data corruption everywhere. And a generation or two of sloppy programmers who take little care with these things.

Only today do I find that I was not alone in thinking this: "The Harmful Consequences of the Robustness Principle" (draft-thomson-postel-was-wrong-03):


@alice and @trentj and @ZiCog
Luckily the Great text-sanitizer is here!! :smiley:
So you don't need to crash on any ill-encoded Text anymore!!

Even more, as @Finn suggested, it could create a remark for the Sales Staff:
"Unrecognized Text in Customer::Address"

So you can convert the JSON to a valid Customer object and don't need to lose any Purchase anymore.

I found this nice article about Data Mining

Very interesting is the section:

Dropping of columns

In this step, we are going to drop columns with the least priority. The column such as ‘PassengerId’ and ‘Ticket’ comes under this category. Use drop() to drop the columns.

The most valuable Lesson it demonstrates is that in an imperfect world any Data can most likely be incomplete at any time.
But the important question is:

  • How to deal with imperfect data?
  • What are the Minimal Requirements to complete the Task?

So my Web Service will not give an ECustomerDataError unless the most Minimal Requirement is not fulfilled;
as perhaps: the Email Address is missing or invalid.
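This "minimal requirements" idea can be sketched in plain Rust. The `Customer` type, the `validate` helper and the remark text are hypothetical stand-ins for the web service's real types; only the email is mandatory, everything else merely produces a remark for the sales staff:

```rust
// Hypothetical sketch: only the email address is a hard requirement;
// other missing fields are flagged for review instead of rejected.
#[derive(Debug, Default)]
struct Customer {
    email: Option<String>,
    address: Option<String>,
}

#[derive(Debug, PartialEq)]
enum ECustomerDataError {
    MissingEmail,
}

fn validate(c: &Customer) -> Result<Vec<String>, ECustomerDataError> {
    let mut remarks = Vec::new();
    // The one minimal requirement: a plausible email address.
    match &c.email {
        Some(e) if e.contains('@') => {}
        _ => return Err(ECustomerDataError::MissingEmail),
    }
    if c.address.is_none() {
        // Non-fatal: create a remark for the sales staff instead of failing.
        remarks.push("Unrecognized Text in Customer::Address".to_string());
    }
    Ok(remarks)
}

fn main() {
    let c = Customer { email: Some("a@example.com".into()), address: None };
    println!("{:?}", validate(&c)); // a remark, but no error: the purchase is kept
}
```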

No doubt.

Perhaps we need to make a distinction between being able to do something with incomplete or incorrect data and blindly accepting it or munging it into something it is not and then relying on that.

I have had the latter in mind. The classic case being SQL injection attacks.

Throw enough processing, and nowadays AI, at almost anything and some sense can be made of it. But how much do you then want to rely on it?

Crashing is not the problem, though. Just stop using unwrap and that problem goes away. The real problem, the one you are not solving, is that you can't guarantee that your interpretation of bytes in a text file with mixed encodings will make any sense to a human reader.

There's another old maxim that will help illustrate here: "Garbage in, garbage out."


I think this is a bit harsh.

While not everyone will want to get meaning out of this kind of jumble of bytes, I think it makes sense to want to be able to do it for some domains.

If this were a Rust library, I'd certainly consider using it in some situations. Maybe not as an unconditional filter, but what if it were an option in a text editor? "You've opened a file that doesn't make sense. Try one of these encodings, or see what it looks like passed through a sanitizer."

If someone had accidentally appended binary data to a file, or mixed two common encodings, then it'd allow them to easily get back the parts that were valid text.

Switching between encodings is one thing ... Making sense of a file that is a random mix of more than 1 encoding is entirely a different matter. It isn't hard to imagine coming up with combinations of encodings and permutations of switching between them in one file that will lead to ambiguous interpretations of the data.

So while it may be harsh to call this kind of text "garbage", it's more importantly a pragmatic view of the situation. Trying to decipher files containing mixed encodings is futile if the goal is to make them intelligible in every conceivable case.

from_utf8_lossy handles the first case, AFAIK.
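For reference, a minimal sketch of what `from_utf8_lossy` does with an invalid byte (the input here is illustrative, Latin-1 encoded text):

```rust
// from_utf8_lossy replaces every invalid byte sequence with U+FFFD, so the
// original byte value (here 0xFC, 'ü' in Latin-1) is lost in the output.
fn main() {
    let bytes = b"Ausfl\xFCge"; // Latin-1 encoded "Ausflüge"
    let lossy = String::from_utf8_lossy(bytes);
    println!("{}", lossy); // "Ausfl�ge": the 0xFC is gone, replaced by U+FFFD
}
```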


The text-sanitizer does not automatically correct the Data with the correct Transliteration. It needs to be trained to do so, just as the Version text-sanitizer_v3 started to do.
Perhaps you might see this as a Limitation, but it is a problem common to any AI Logic: it always needs Human Feedback to become better.
On the other hand it is Encoding agnostic (apart from UTF-8), which makes it powerful, since it can transliterate any Byte Sequence according to the given Transliteration Rules.
But it also needs to see the Real Byte Value to get the Transliteration right.
Therefore using String::from_utf8_lossy() is counterproductive, because it deletes the original byte and always replaces it with the same value U+FFFD, as seen in the example:
Lossy transliterated text:

<h2 id="selected"><a href="/de/ausfluge/" title="Ausfl�ge auf Lanzarote">AUSFL&Uuml;GE</a></h2>

and visible byte sequence:

<h2 id="selected"><a href="/de/ausfluge/" title="Ausfl(?fffd)ge auf Lanzarote">AUSFL&Uuml;GE</a></h2>

But by conserving the original data, as in:

 <p><b>Preise mit Mittagessen:</b> Erwachsene: von 44(?80) - Kinder: von 25(?80).</p>

I could even find the original encoding of the web page and thus suggest a possible Transliteration.

I also completed the Logic of the text-sanitizer.
On Playground: text-sanitizer

And I worked on the Implementation in the other Language to get a comparable Result.
The Input/Output issue is resolved and it now behaves like the Rust Application.

 read(0, "--2020-05-16 12:06:23-- http://w"..., 8192) = 8192
read(0, "<a href=\"ausfluge/#ausfluge/atla"..., 16384) = 16384
read(0, "arote Str\344nde\">STR&Auml;NDE</a><"..., 32768) = 32768
read(0, "</span></a></li> <li id=\"english"..., 65536) = 65536
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd7ccd2000
read(0, " Sie das allw\366chentliche Markttr"..., 131072) = 131072
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd7cc92000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd7cb92000
read(0, "it etwas Gl\374ck finden Sie dort O"..., 262144) = 262144
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd7ca92000
read(0, "rados\"> <div class=\"imgcuadro\"> "..., 524288) = 524288
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd7c992000
mmap(NULL, 1114112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd7c882000
munmap(0x7fcd7ca92000, 1048576)         = 0
read(0, "-wochenmarkt\">MEHR INFO</a></div"..., 1048576) = 1048576
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd7c682000
munmap(0x7fcd7c992000, 1048576)         = 0
mmap(NULL, 2162688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd7c472000
munmap(0x7fcd7c882000, 1114112)         = 0
read(0, "en Fahrt mit der F\344hre geht es z"..., 2097152) = 607540
mmap(NULL, 2752512, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd7c8f2000
munmap(0x7fcd7c682000, 2097152)         = 0
read(0, "", 2097152)                    = 0
mmap(NULL, 2752512, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd7c1d2000
write(1, "--2020-05-16 12:06:23-- http://w"..., 2736500) = 2736500

So I repeated the hyperfine Benchmarks:
For the Rust Application:

$ hyperfine --warmup 3 -r 100 'target/release/text-sanitizer -i es de < lanzarote-com_de-ausfluge.html.x100 > /dev/null'
Benchmark #1: target/release/text-sanitizer -i es de < lanzarote-com_de-ausfluge.html.x100 > /dev/null
  Time (mean ± σ):      30.0 ms ±   5.3 ms    [User: 26.0 ms, System: 4.1 ms]
  Range (min … max):    26.4 ms …  46.4 ms    100 runs

And for the Implementation with the other Programming Language:

$ hyperfine --warmup 3 -r 100 './ -i es de < lanzarote-com_de-ausfluge.html.x100 > /dev/null'
Benchmark #1: ./ -i es de < lanzarote-com_de-ausfluge.html.x100 > /dev/null
  Time (mean ± σ):      56.1 ms ±   7.1 ms    [User: 48.9 ms, System: 7.2 ms]
  Range (min … max):    51.6 ms …  81.8 ms    100 runs

The Rust Application does the same Job in only 53.47% of the time.
That makes it almost twice as fast.
That is the Point that shows a clear and measurable Advantage.

This mostly showed up as the Input grew significantly.

I can only suppose, but I think it comes from better memory management, which gains more significance with bigger Data Volumes.

But still the complete version of the Application runs 5 times slower than any other proposed solution.

I see where you're coming from. But I can provide a counterexample in the form of a "garbled" text file that a program cannot properly correct (without knowing the specific details of the file beforehand). In other words, while it might be trivial to write an algorithm that can transform this specific file into the expected output, the developer must know what the expected output is!

Given a different permutation of inputs (encodings used, byte positions where the encoding is switched, etc.) the algorithm will have to be changed for every new input file and the expected output will have to be known beforehand. A problem like this is intractable (and is one property that makes cryptography so successful).

Here's a command to create such a file:

echo -n $'\x48\x65\x6c\x6c\x6f\x2c\x20\x77\x6f\x72\x6c\x64\x21\x0a\xc8\x85\x93\x93\x96\x6b\x40\xa6\x96\x99\x93\x84\x5a\x25\x3f\x6f\x3f\x6f\x3f\x6f\x0a\x25\x21\x5a\x21\x5a\x21\x5a' >borked

The challenge is to:

  1. Figure out which encodings were used
  2. Re-encode the text as UTF-8
  3. Ensure the output matches the expected result

The text uses a mixture of ASCII and EBCDIC encodings. These two were chosen because they are quite incompatible with one another. However, it is easy enough to find incompatible encodings that are more modern and even ones that are based on ASCII. For example, ISO/IEC 8859-1 and Shift-JIS (JIS X 0208:1997) have some overlap in the 0xa1..0xef range that leads to ambiguities that cannot be resolved algorithmically.
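The ambiguity can be sketched with a single byte. Since ISO/IEC 8859-1 maps bytes 0x00..=0xFF directly to U+0000..=U+00FF, `b as char` is a correct stdlib-only Latin-1 decoder; the Shift-JIS reading is stated in a comment only, as the standard library has no Shift-JIS decoder:

```rust
// The same byte is a valid character in both ISO/IEC 8859-1 and Shift-JIS,
// so no algorithm can tell from the bytes alone which encoding was meant.
fn latin1_to_string(bytes: &[u8]) -> String {
    // Latin-1 code points coincide with the first 256 Unicode code points.
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    let byte = [0xB1u8];
    // In ISO/IEC 8859-1 this byte is '±'; in Shift-JIS the same byte is the
    // half-width katakana 'ｱ'. Both readings are "valid" text.
    println!("{}", latin1_to_string(&byte)); // prints "±"
}
```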

The expected UTF-8 output for the borked file is:

Hello, world!
Hello, world!


The sequence of question marks and exclamation points is there just to drive the point home; the raw bytes can be viewed in ASCII as ?o?o?o\n%!Z!Z!Z. While this is valid ASCII (and therefore valid UTF-8) it is not the expected text.
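That the tail of the borked file passes UTF-8 validation can be checked directly; it decodes cleanly, it is simply not the text the author meant:

```rust
// The EBCDIC-encoded tail happens to be valid ASCII (and therefore valid
// UTF-8), so no validator will flag it even though the text is wrong.
fn main() {
    let tail = b"?o?o?o\n%!Z!Z!Z";
    assert!(std::str::from_utf8(tail).is_ok());
    println!("decodes cleanly, but means nothing: {}", std::str::from_utf8(tail).unwrap());
}
```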

I don't mean to be trolly, I'm just explaining why I threw in the towel on this one. Handling every file with mixed encodings is impossible. But sometimes you can make it work in some cases.


EBCDIC, who'd have thought, well played.

Possibly a rare thing in today's web pages. It crossed my mind to create such an example with more common encodings.

The text-sanitizer is giving out:

$ text-sanitizer -i < borked
Hello, world!

While there is still the possibility to develop a Fallback Logic, where stray Bytes could be kept in a BackLog to match user-defined Byte Sequences, this would only be activated when those Bytes fall outside the printable ASCII Range.
This is actually a Limitation which is inherent to its Design.

But in modern Editors you also find user-defined Encoding Handling, as others have commented as well: the User would choose which Encoding it is and save the File with the correct Encoding.

Also it is not the Intention of the Tool to detect Encodings.
It was conceived as a Replacement for cat -A; on the borked file it produces:

$ target/debug/text-sanitizer -i < borked
Hello, world!

Its actual Usage is to simplify System Message Parsing.
From formerly:

$ systemctl status postfix |cat -A
M-bM-^WM-^O postfix.service - Postfix Mail Transport Agent$
   Loaded: loaded (/usr/lib/systemd/system/postfix.service; enabled; vendor preset: disabled)$
   Active: active (running) since vie 2020-05-08 07:03:29 WEST; 2 weeks 3 days ago$
  Process: 4263 ExecStart=/usr/sbin/postfix start (code=exited, status=0/SUCCESS)$
  Process: 4260 ExecStartPre=/usr/libexec/postfix/chroot-update (code=exited, status=0/SUCCESS)$
  Process: 4234 ExecStartPre=/usr/libexec/postfix/aliasesdb (code=exited, status=0/SUCCESS)$
 Main PID: 4521 (master)$
    Tasks: 3$
   CGroup: /system.slice/postfix.service$
           M-bM-^TM-^\M-bM-^TM-^@4521 /usr/libexec/postfix/master -w$
           M-bM-^TM-^\M-bM-^TM-^@4523 qmgr -l -t unix -u$
           M-bM-^TM-^TM-bM-^TM-^@9301 pickup -l -t unix -u$

To now:

$ systemctl status postfix |target/debug/text-sanitizer -i es
* postfix.service - Postfix Mail Transport Agent
   Loaded: loaded (/usr/lib/systemd/system/postfix.service; enabled; vendor preset: disabled)
   Active: active (running) since vie 2020-05-08 07:03:29 WEST; 2 weeks 3 days ago
  Process: 4263 ExecStart=/usr/sbin/postfix start (code=exited, status=0/SUCCESS)
  Process: 4260 ExecStartPre=/usr/libexec/postfix/chroot-update (code=exited, status=0/SUCCESS)
  Process: 4234 ExecStartPre=/usr/libexec/postfix/aliasesdb (code=exited, status=0/SUCCESS)
 Main PID: 4521 (master)
    Tasks: 3
   CGroup: /system.slice/postfix.service
           |--4521 /usr/libexec/postfix/master -w
           |--4523 qmgr -l -t unix -u
           |--9301 pickup -l -t unix -u

where you can parse the Postfix Process ID very nicely like this:

$ systemctl status postfix |target/debug/text-sanitizer -i es|grep -ioE "\|--[[:space:]]*[0-9]+"

Therefore I stated that an Empty Result or a Crash would mean a Mission Failure for the whole System.

But ANSI escape codes are perfectly valid UTF-8 (not to be confused with ANSI codepages, which are not). So are the other control plane and line-drawing characters that systemctl is using. You can store output like that in a Rust string without any problems. I'll prove it:

[nix-shell:~/Development/encoding_c_dependency]$ cat src/ 
use std::io::{Read, stdin};
fn main() {
    let mut stdin = stdin();
    let mut out = String::new();
    stdin.read_to_string(&mut out).unwrap();
    println!("{}", out);
}

[nix-shell:~/Development/encoding_c_dependency]$ systemctl status NetworkManager | cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/encoding_c_dependency`
● NetworkManager.service - Network Manager
   Loaded: loaded (/lib/systemd/system/NetworkManager.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2020-05-20 21:36:27 MST; 4 days ago
     Docs: man:NetworkManager(8)
 Main PID: 1132 (NetworkManager)
    Tasks: 4 (limit: 4915)
   CGroup: /system.slice/NetworkManager.service
           ├─ 1132 /usr/sbin/NetworkManager --no-daemon
           └─15861 /sbin/dhclient -d -q -sf /usr/lib/NetworkManager/nm-dhcp-helper -pf /run/ -lf /var/lib/NetworkManager/ -cf /var/lib/NetworkManager/dhclient-wlp0s20f3.conf wlp0s20f3

May 25 09:09:42 michael-ThinkPad-X1-Carbon-7th NetworkManager[1132]: <info>  [1590422982.4686] device (wlp0s20f3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
May 25 09:09:42 michael-ThinkPad-X1-Carbon-7th NetworkManager[1132]: <info>  [1590422982.4691] device (wlp0s20f3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
May 25 09:09:42 michael-ThinkPad-X1-Carbon-7th NetworkManager[1132]: <info>  [1590422982.4693] device (wlp0s20f3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
May 25 09:09:42 michael-ThinkPad-X1-Carbon-7th NetworkManager[1132]: <info>  [1590422982.4695] manager: NetworkManager state is now CONNECTED_LOCAL
May 25 09:09:42 michael-ThinkPad-X1-Carbon-7th dhclient[15861]: bound to -- renewal in 42933 seconds.
May 25 09:09:42 michael-ThinkPad-X1-Carbon-7th NetworkManager[1132]: <info>  [1590422982.4761] manager: NetworkManager state is now CONNECTED_SITE
May 25 09:09:42 michael-ThinkPad-X1-Carbon-7th NetworkManager[1132]: <info>  [1590422982.4762] policy: set 'shersys9 1' (wlp0s20f3) as default for IPv4 routing and DNS
May 25 09:09:42 michael-ThinkPad-X1-Carbon-7th NetworkManager[1132]: <info>  [1590422982.4766] device (wlp0s20f3): Activation: successful, device activated.
May 25 09:09:44 michael-ThinkPad-X1-Carbon-7th NetworkManager[1132]: <info>  [1590422984.6072] policy: set 'shersys9 1' (wlp0s20f3) as default for IPv6 routing and DNS
May 25 09:09:54 michael-ThinkPad-X1-Carbon-7th NetworkManager[1132]: <info>  [1590422994.6199] manager: NetworkManager state is now CONNECTED_GLOBAL


Yes, the System gives you the correct Encoding according to your LANG Environment Settings.
But this is not all I do with that Output.
So I found:

  1. The Text breaks when you write it to the System Log
  2. I need to figure out the Main Process ID to check it for its Life Time and the Child Process ID to check them for Orphan processes.

I found that with the other Programming Language I was using a Library that introduced an inconvenient overhead in the crucial part of the Application.
When I managed to replace it with my own optimized code, the Application as a whole sped up a lot on big Data Amounts.
I noticed that the hyperfine tool gave very varying results, so I also increased the Sample Rate to 1000 Repetitions.

$ hyperfine --warmup 3 -r 1000 'target/release/text-sanitizer -i es de < lanzarote-com_de-ausfluge.html.x100 >/dev/null'
Benchmark #1: target/release/text-sanitizer -i es de < lanzarote-com_de-ausfluge.html.x100 >/dev/null
  Time (mean ± σ):      28.5 ms ±   6.9 ms    [User: 24.4 ms, System: 4.2 ms]
  Range (min … max):    22.4 ms …  43.8 ms    1000 runs

Now the Implementation in the other Programming Language runs just as fast:

$ hyperfine --warmup 3 -r 1000 './ -i es de < lanzarote-com_de-ausfluge.html.x100 >/dev/null'
Benchmark #1: ./ -i es de < lanzarote-com_de-ausfluge.html.x100 >/dev/null
  Time (mean ± σ):      28.5 ms ±   6.4 ms    [User: 20.8 ms, System: 7.8 ms]
  Range (min … max):    22.7 ms …  42.7 ms    1000 runs

Still, the Results from the hyperfine tool differ strongly from measurements with other profiling tools like the time command.
I found the time Command gives Results that are closer to the real Execution Time:

$ date +"%s.%N" ; time ./ -i es de < lanzarote-com_de-ausfluge.html.x100 >/dev/null ; date +"%s.%N"

real	0m0.041s
user	0m0.029s
sys	0m0.012s
$ echo "scale=3; 14.376726294-14.333980047"|bc -l

and for the Rust Application:

$ date +"%s.%N" ; time target/release/text-sanitizer -i es de < lanzarote-com_de-ausfluge.html.x100 >/dev/null ; date +"%s.%N"

real	0m0.040s
user	0m0.033s
sys	0m0.007s
$ echo "scale=3; 34.208851353-34.166612867"|bc -l

That also confirms the correctness of the date ; $command ; date measurements.

The only things that are still needed are functionalities like warmup, run count and statistical evaluation of the measurements.
Something that a Perl Script can do just fine.

In conclusion I must also admit that Rust does seem to have performant std Libraries that deliver good performance right out of the box.
Other Languages and Ecosystems put more Emphasis on Ease of Use than on Performance.

As I commented before, I was observing an inexplicably big difference between the real System Time that passed and the Measurements given by the hyperfine tool.
So I became curious about what the tool does and analysed it with the strace command.
I generated a strace activity log with:

$ strace -f -o strace_hf-text-san-rs_2020-06-17-1.log hyperfine --warmup 3 -r 3 'target/release/text-sanitizer -i es de < lanzarote-com_de-ausfluge.html >/dev/null'

That gave me very surprising insights:

$ cat strace_hf-text-san-rs_17378_2020-06-17-1.log|grep -iE " clone\("|wc -l
$ cat strace_hf-text-san-rs_17378_2020-06-17-1.log|grep -iE " execve\("|grep -vi unfinished|wc -l
$ cat strace_hf-text-san-rs_17378_2020-06-17-1.log|grep -iE " execve\("|grep -vi unfinish|grep -i "/sh\""|wc -l

For only 3 Execution Measurements it launches 215 Child Processes and tries to execute 416 External Programs.
Especially often it uses the bash Command Shell, about 410 times.

In comparison the time Command with:

$ strace -f -o strace_tm-text-san-c_2020-06-17-1.log time target/release/text-sanitizer -i es de < lanzarote-com_de-ausfluge.html >/dev/null
0.00user 0.00system 0:00.00elapsed 33%CPU (0avgtext+0avgdata 928maxresident)k
0inputs+0outputs (0major+284minor)pagefaults 0swaps
$ cat strace_tm-text-san-c_16279_2020-06-17-1.log|grep -iE " clone\("|wc -l
$ cat strace_tm-text-san-c_16279_2020-06-17-1.log|grep -iE " execve\("|wc -l

uses only 1 Child Process and 1 External Program.

These Statistics are especially relevant because the getrusage() System Function is used to measure the consumed Execution Time.

$ cat strace_hf-text-san-rs_17378_2020-06-17-1.log|grep -iE " getrusage\("|grep -vi "unfinish"|wc -l

The official Documentation explains:

Return resource usage statistics for all children of the
calling process that have terminated and been waited for.
These statistics will include the resources used by
grandchildren, and further removed descendants, if all of the
intervening descendants waited on their terminated children.

So all the Child Process launches alter the Measurements, and with this heavy overhead they can hardly give a correct measurement.

Furthermore, using the Linux bash Command Shell to measure an executable comes very close to what a simple Bash Script would do with the date ; $command ; date sequence.
But even this wouldn't launch so many Child Processes.

Most of those child processes probably come from

 ⠸ Measuring shell spawning time  ██████░░░░░░░░░░ ETA 00:00:01

hyperfine attempts to subtract the time taken by the shell to execute your process, to give results more accurately scoped to your code. Looking at your results of ~30 ms under hyperfine and ~40 ms according to the shell seems to indicate that there might be ~10 ms of extra overhead coming from the shell (though this is surprisingly high).


I looked into the Source Code of the GNU time Command:
It also doesn't do anything different from

date ; $command ; date

and also uses the struct rusage *usage to find the Process Consumption.
But in contrast it launches only 1 Child Process.

So I made a little Application that measures the complete Execution Time from the Spawn to the Reap of the Child Process and measures the Process Consumption with getrusage() continuously.

Very surprisingly after 1000 Executions the Values of the struct rusage *usage have grown significantly:

prfg no. '1000' (code: '0'): rpt: ''
prfg stt: ''
prfg mem: ''
prfg fll stt: ''
prfg user '21.848453'
prfg system '4.904511'
prfg res mem '8236'
prfg usg dmp:
  ru_idrss    => 0,
  ru_inblock  => 0,
  ru_isrss    => 0,
  ru_ixrss    => 0,
  ru_majflt   => 0,
  ru_maxrss   => 8236,
  ru_minflt   => 2219321,
  ru_msgrcv   => 0,
  ru_msgsnd   => 0,
  ru_nivcsw   => 3296,
  ru_nsignals => 0,
  ru_nswap    => 0,
  ru_nvcsw    => 3003,
  ru_oublock  => 0,
  ru_stime    => 4.904511,
  ru_utime    => 21.848453,

This gives the insight that the Documentation's Statement

Return resource usage statistics for all children of the
calling process that have terminated and been waited for.

is accurate.

So the correct interpretation of this value is only possible with a one-to-one call, just as the GNU time command does it.
A small minimalist Application would run the Command to profile, and the Manager Application would evaluate the Statistics.

So for now my Profiling Tool measures only the time from Spawn to Reap and calculates the 95% Line and the Average Time.
Even more, it processes the SIGCHLD Signal to reap the Child Process at the earliest possible moment.
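The spawn-to-reap measurement described above can be sketched with only the Rust standard library. The command `true` is a stand-in for the profiled executable, and a blocking `wait()` replaces the SIGCHLD handling of the original tool; the 95% line uses the simple nearest-rank method:

```rust
// Sketch: run a command N times, record wall-clock time from spawn until
// the child is reaped, then report average and 95th-percentile times.
use std::process::Command;
use std::time::Instant;

fn main() {
    let runs = 20;
    let mut times: Vec<f64> = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        let mut child = Command::new("true") // stand-in for the profiled command
            .spawn()
            .expect("failed to spawn");
        child.wait().expect("failed to reap"); // reap the child (blocking)
        times.push(start.elapsed().as_secs_f64());
    }
    times.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let avg: f64 = times.iter().sum::<f64>() / times.len() as f64;
    // Nearest-rank 95th percentile of the sorted samples.
    let p95 = times[((times.len() as f64 * 0.95).ceil() as usize) - 1];
    println!("Execution Time 95%: '{:.6}' s", p95);
    println!("Execution Time AVG: '{:.6}' s", avg);
}
```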

As a result it reported for the Rust Application:

$ ./ -r 1000 -d --dir=../../rust/text-sanitizer 'target/release/text-sanitizer -i es de < ../../lanzarote-com_de-ausfluge.html.x100 >/dev/null' 
Execution Time 95%: '0.039576' s
Execution Time AVG: '0.028952' s
Execution Time MIN: '0.025503' s
Execution Time MAX: '0.044042' s

And the other Implementation:

$ ./ -r 1000 -d --dir=../../pas*/text* './ -i es de < lanzarote-com_de-ausfluge.html.x100 >/dev/null'
Execution Time 95%: '0.052116' s
Execution Time AVG: '0.038323' s
Execution Time MIN: '0.033591' s
Execution Time MAX: '0.056527' s

Rust holds its Advantage on Big Data Volumes.

But on normal Data Volume it performs just the same or even a bit slower:

$ ./ -r 1000 -d --dir=../../rust/text-sanitizer 'target/release/text-sanitizer -i es de < ../../lanzarote-com_de-ausfluge.html >/dev/null'
Execution Time 95%: '0.003929' s
Execution Time AVG: '0.003573' s
Execution Time MIN: '0.003308' s
Execution Time MAX: '0.007229' s

And the other Implementation:

$ ./ -r 1000 -d --dir=../../pas*/text* './ -i es de < lanzarote-com_de-ausfluge.html >/dev/null'
Execution Time 95%: '0.003617' s
Execution Time AVG: '0.003208' s
Execution Time MIN: '0.002963' s
Execution Time MAX: '0.005126' s

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.