Is there a term for injecting a bug into a library to check a test works?

This is something I did today: I realised there was a "hole" in my testing, such that a whole class of bugs would not be detected by my tests.

So I wrote a test designed to detect this class of bugs, and then (temporarily!) modified my library so it malfunctioned, to check that my test would actually detect the bug...

I suppose you could even use conditional compilation to automate this, although I am not going that far!
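For what it's worth, a minimal sketch of what that conditional compilation could look like (the `inject_bug` cfg name and the `total` function are made up for illustration, not from any real crate):

```rust
/// A trivial stand-in for a library function. The deliberately wrong
/// branch only runs when building with `RUSTFLAGS="--cfg inject_bug"`,
/// purely so we can check the test below really fails.
pub fn total(values: &[i32]) -> i32 {
    if cfg!(inject_bug) {
        // Deliberately broken: silently drops the first element.
        values.iter().skip(1).sum()
    } else {
        values.iter().sum()
    }
}

#[cfg(test)]
mod tests {
    use super::total;

    // Passes normally; should FAIL when built with `--cfg inject_bug`,
    // proving the test actually detects this class of bug.
    #[test]
    fn detects_dropped_element() {
        assert_eq!(total(&[1, 2, 3]), 6);
    }
}
```

Run `cargo test` once normally and once with the cfg flag set; if the second run doesn't fail, the test has a hole.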

1 Like

If it's a previously untested code path, would coverage analysis have caught that?

I guess so. There was lots (well... by my standards, lol!) of unit testing, and more integrated testing as well, but what was missing covered a major part of the functionality. So obvious once you thought about it, but like an "elephant in the room", something I never considered. Actually using the software, it rapidly crashed and burned in the presence of this bug class. The new test is here: test.rs - source

Bebugging

4 Likes

Have you encountered mutation testing before? There's an effort to implement a mutation testing tool for Rust at https://mutants.rs/ - the goal is to see whether breaking your code also causes your tests to break.
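To make the idea concrete (a toy example, not code from any real project): a mutation tool takes a function like the one below, mechanically flips an operator, and re-runs the test suite; if no test fails, the mutant is reported as missed.

```rust
/// Convert a value 0..16 to its lower-case hex digit.
fn hex_digit(n: u8) -> char {
    assert!(n < 16);
    match n {
        0..=9 => (b'0' + n) as char,    // a typical mutant: `b'0' - n`
        _ => (b'a' + (n - 10)) as char, // another: `b'a' - (n - 10)`
    }
}
```

Note that a test asserting only `hex_digit(0) == '0'` would miss the first mutant (since `b'0' - 0` is still `'0'`); asserting `hex_digit(9) == '9'` and `hex_digit(15) == 'f'` catches both.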

10 Likes

Am giving it a whirl. Seems it may be busy for a while!

C:\Users\ano31\Rust\RustDb>cargo mutants
Found 5277 mutants to test
build    Unmutated baseline ... 0.1s
ok       Unmutated baseline in 14.9s build + 16.7s test
 INFO Auto-set test timeout to 1m 23s
build    src/page.rs:668:19: replace == with != in MutPage<'a>::rotate_left ... 0.1s
build    src/compile.rs:308:5: replace c_compare -> CExpPtr<bool> with CExpPtr::new() ... 0.1s
test     src/sortedfile.rs:880:9: replace PageList::append_one with () ... 0.1s
build    src/cexp.rs:131:9: replace <impl CExp for Sub<T>>::eval -> T with Default::default() ... 0.1s
build    src/util.rs:144:26: replace + with - in hex ... 0.1s
MISSED   src/util.rs:144:26: replace + with - in hex in 1.6s build + 10.1s test
build    src/exec.rs:156:15: replace -= with /= in EvalEnv<'r>::discard ... 1.0s
└           Compiling rustdb v5.2.60 (C:\Users\ano31\AppData\Local\Temp\cargo-mutants-RustDb-7fd1xC.tmp)
9/5277 mutants tested, 1 MISSED, 4 caught, 4 unviable, 1m 32s elapsed, about 14h 58m remaining
2 Likes

Sometimes I find myself deliberately tweaking/breaking my code to see if the tests I expect to fail actually do fail. I have called this "perturbation debugging" in the past, for lack of a better term.

Oddly enough I seem to have spent all day doing perturbation debugging today. I was feeding lots of JSON messages into jsonschema to validate them. They all validated OK. That struck me as suspicious. So I deliberately broke the JSON; it still validated OK. The validator accepted anything!

5 Likes

After about 15 minutes, my (Windows) laptop rebooted without being asked...! It gave me a slightly strange error screen I had never seen before, something about how it just had to reboot...

So cargo mutants at least seemed to find a bug in Windows (or my possibly flaky hardware). Am retrying using

cargo mutants --shard 1/100

to see if my laptop can cope with that much!

That finished.

53 mutants tested in 11m 58s: 12 missed, 28 caught, 10 unviable, 3 timeouts

Results are interesting. Here is an example of a "false miss". That is, the change was harmless, so the tests didn't fail.

MISSED src/dividedstg.rs:14:30: replace + with * in 1.6s build + 10.7s test

Code is:

/// Bytes required to save FD ( root, size ).
pub const FD_SIZE: usize = 8 + 8;

This constant can be anything >= 16 and the code will still be correct (albeit non-optimal). I found other similar examples where performance would be degraded but the software would still work (so the tests cannot find anything wrong).
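A sketch of why such a mutant is test-invisible (simplified, not the actual rustdb code; the save/load functions are made up for illustration): the constant only needs to be at least the 16 bytes the two u64 fields occupy, so a mutant like `8 * 8` merely over-allocates.

```rust
/// Bytes reserved to save an FD ( root, size ). Any value >= 16 keeps the
/// round-trip below correct; a mutant like `8 * 8` just wastes space.
pub const FD_SIZE: usize = 8 + 8;

pub fn save_fd(root: u64, size: u64) -> Vec<u8> {
    let mut buf = vec![0u8; FD_SIZE]; // over-allocation is harmless here
    buf[0..8].copy_from_slice(&root.to_le_bytes());
    buf[8..16].copy_from_slice(&size.to_le_bytes());
    buf
}

pub fn load_fd(buf: &[u8]) -> (u64, u64) {
    let mut root = [0u8; 8];
    let mut size = [0u8; 8];
    root.copy_from_slice(&buf[0..8]);
    size.copy_from_slice(&buf[8..16]);
    (u64::from_le_bytes(root), u64::from_le_bytes(size))
}
```

A round-trip test like `load_fd(&save_fd(r, s)) == (r, s)` passes whether FD_SIZE is 16 or 64, which is exactly why the mutant is missed.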

2 Likes

Probably some part of the modified code allocated all available memory, which tends to cause other programs' allocations to fail, and they then handle that poorly, and a cascade of stuff goes wrong.

This constant can be anything >= 16 and code will still be correct.

You will probably have to exclude various things like that.
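If I'm reading the cargo-mutants book right, exclusions like that can go in a `.cargo/mutants.toml` config file (the file path and regex below are illustrative, not taken from the thread):

```toml
# .cargo/mutants.toml - skip files or mutants known to be harmless
exclude_globs = ["src/dividedstg.rs"]
exclude_re = ["FD_SIZE"]
```

As far as I can tell there is also a `#[mutants::skip]` attribute in the companion `mutants` crate for marking individual functions.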

In the end, mutation testing can't tell you “this is a real problem”, it can only tell you “your test suite doesn't require this to be exactly as it is”. Not all such facts are interesting.

1 Like

Sure. So far it doesn't seem to have found holes in my testing. I'm not sure that necessarily says my tests are good; it may just be the close-knit nature of the software - a random mutation is likely to break the software, and the tests will fail. If they don't, the mutation was likely harmless by chance. Still, it is a nice tool!

[ The difficult tests to write are those that will expose subtle flaws in the program logic, which may be hard to find. Still, a tool like this can find areas that are not being properly tested, and maybe should be. ]

Instead of breaking working code, can this be done by having the test run against a trait? Then have a pub struct CorrectImpl; and a pub struct WrongImpl; and assert that when fed WrongImpl, the test fails?

1 Like

This is storage management (database) software, the main thing that can go wrong is it loses storage, or doubly allocates storage or over-writes something, or fails to read or write something, and somehow in that process loses some data, or fails to store or retrieve some data. Besides that it has to compile and execute (interpret) code correctly.

It isn't really "functions producing results" type software.

I do not understand what this changes.

Unit tests are just functions that return true or false.

All these mistakes you have described are things that one can fake in WrongImpl.

Furthermore, the RightImpl / WrongImpl approach allows one to keep both impls at once, without having to statefully make the codebase wrong, run a test, then revert.
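A minimal sketch of that pattern (the toy `Adder` trait and `check` function are made up for illustration):

```rust
// The test logic is written once against a trait, so a deliberately broken
// impl can live beside the correct one instead of mutating the real code.
trait Adder {
    fn add(&self, a: i32, b: i32) -> i32;
}

struct RightImpl;
impl Adder for RightImpl {
    fn add(&self, a: i32, b: i32) -> i32 {
        a + b
    }
}

/// Kept in the codebase only to "test the test".
struct WrongImpl;
impl Adder for WrongImpl {
    fn add(&self, a: i32, b: i32) -> i32 {
        a - b // deliberately broken
    }
}

/// The reusable check: returns true iff the impl behaves correctly.
fn check(adder: &dyn Adder) -> bool {
    adder.add(2, 3) == 5
}
```

Then `assert!(check(&RightImpl))` and `assert!(!check(&WrongImpl))` both run in the normal test suite; the second assertion is the automated "test of the test".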

1 Like

Does Windows not have overcommit and an OOM killer that tries to find the responsible process like Linux does? For simple cases it tends to be pretty good at finding and killing the correct process (though I have seen it sometimes decide to kill the parent terminal instead).

The way I see it, having my application split into RightImpl / WrongImpl traits is adding complexity to the code base. Yet another thing that has to be maintained, doubling the amount of changes required when refactoring happens, etc. That WrongImpl stuff is not satisfying any requirements on my application; it's only there to test the tests.

All in all I don't feel good about carrying broken code around in my code base. I have enough trouble keeping what is supposed to be good in order.

As for "run a test, then revert" that is exactly what I have done on occasion. I break my code to prove to myself that the test detects the breakage. Then I revert with "git checkout".

1 Like

Windows does not overcommit (at least by default).

2 Likes

I'm not saying "Right Impl + Wrong Impl" is better than "Right Impl".

I am saying: conditioned on "testing the test" being important, we should automate "testing the test". The manual "break code, test the test, revert" cycle is not easy to automate, whereas "Right Impl + Wrong Impl" is.

It's not clear to me that we want to automate "testing the test".

The reason is that I believe tests require, pretty much by definition, human consideration and input. We are the ones who know what we want the code to do (hopefully).

I think tests should be as simple and as free from automation as possible. Make them complex and one ends up debugging the test logic, which is a diversion from getting the things one actually wants done, done.

1 Like