Strange benchmark effects

I'm trying to write some benchmarks for my project but they behave really strange.
I just added some more benchmarks and this changed the outcome of the other benchmarks dramatically. But curiously, the new benchmarks run almost at the end of the queue...

Strangely, only the benchmarks of my own crate are affected, but not those for std::io that I added for comparison...

before:

test bench_copy_10000000_io                     ... bench:  16,469,863 ns/iter (+/- 1,096,697) = 607 MB/s
test bench_copy_10000000_io_buffered            ... bench:  16,552,239 ns/iter (+/- 1,932,135) = 604 MB/s
test bench_copy_10000000_netio                  ... bench:  16,338,155 ns/iter (+/- 1,209,088) = 612 MB/s
test bench_read_to_end_10000000_io              ... bench:  22,431,369 ns/iter (+/- 5,010,241) = 445 MB/s
test bench_read_to_end_10000000_io_buffered     ... bench:  23,119,148 ns/iter (+/- 1,466,016) = 432 MB/s
test bench_read_to_end_10000000_netio           ... bench:  16,512,276 ns/iter (+/- 1,303,915) = 605 MB/s
test bench_read_to_end_10000000_netio_sharedbuf ... bench:  30,629,028 ns/iter (+/- 1,631,658) = 326 MB/s
test bench_read_to_end_10000_io                 ... bench:      18,824 ns/iter (+/- 13,737) = 531 MB/s
test bench_read_to_end_10000_io_buffered        ... bench:      27,201 ns/iter (+/- 9,289) = 367 MB/s
test bench_read_to_end_10000_netio              ... bench:      14,666 ns/iter (+/- 2,747) = 681 MB/s
test bench_read_to_end_10000_netio_sharedbuf    ... bench:      20,345 ns/iter (+/- 7,794) = 491 MB/s
test bench_read_to_end_1000_io                  ... bench:       2,418 ns/iter (+/- 479) = 413 MB/s
test bench_read_to_end_1000_io_buffered         ... bench:       9,147 ns/iter (+/- 68,476) = 109 MB/s
test bench_read_to_end_1000_io_sharedbuf        ... bench:       2,475 ns/iter (+/- 726) = 404 MB/s
test bench_read_to_end_1000_netio               ... bench:       6,987 ns/iter (+/- 2,502) = 143 MB/s
test bench_read_to_end_1000_netio_sharedbuf     ... bench:       1,985 ns/iter (+/- 2,566) = 503 MB/s
test bench_tcp_read_to_end_100000_io            ... bench:  12,648,362 ns/iter (+/- 150,517) = 7 MB/s
test bench_tcp_read_to_end_100000_io_buffered   ... bench:  12,579,923 ns/iter (+/- 179,080) = 7 MB/s
test bench_tcp_read_to_end_100000_netio         ... bench:  12,471,772 ns/iter (+/- 147,966) = 8 MB/s

after:

test bench_copy_10000000_io                     ... bench:  16,346,917 ns/iter (+/- 1,662,843) = 611 MB/s
test bench_copy_10000000_io_buffered            ... bench:  16,248,851 ns/iter (+/- 1,232,908) = 615 MB/s
test bench_copy_10000000_netio                  ... bench:  29,436,861 ns/iter (+/- 1,918,322) = 339 MB/s
test bench_read_to_end_10000000_io              ... bench:  22,458,453 ns/iter (+/- 4,132,174) = 445 MB/s
test bench_read_to_end_10000000_io_buffered     ... bench:  23,286,373 ns/iter (+/- 2,092,395) = 429 MB/s
test bench_read_to_end_10000000_netio           ... bench:  29,224,446 ns/iter (+/- 2,300,689) = 342 MB/s
test bench_read_to_end_10000000_netio_sharedbuf ... bench:  30,379,066 ns/iter (+/- 2,015,099) = 329 MB/s
test bench_read_to_end_10000_io                 ... bench:      18,757 ns/iter (+/- 13,740) = 533 MB/s
test bench_read_to_end_10000_io_buffered        ... bench:      27,291 ns/iter (+/- 8,666) = 366 MB/s
test bench_read_to_end_10000_netio              ... bench:      25,967 ns/iter (+/- 8,832) = 385 MB/s
test bench_read_to_end_10000_netio_sharedbuf    ... bench:      20,335 ns/iter (+/- 7,342) = 491 MB/s
test bench_read_to_end_1000_io                  ... bench:       2,441 ns/iter (+/- 461) = 409 MB/s
test bench_read_to_end_1000_io_buffered         ... bench:       8,686 ns/iter (+/- 3,107) = 115 MB/s
test bench_read_to_end_1000_io_sharedbuf        ... bench:       2,459 ns/iter (+/- 866) = 406 MB/s
test bench_read_to_end_1000_netio               ... bench:       8,334 ns/iter (+/- 4,420) = 119 MB/s
test bench_read_to_end_1000_netio_sharedbuf     ... bench:       2,007 ns/iter (+/- 1,902) = 498 MB/s
test bench_read_to_string_10000000_io           ... bench:  25,685,669 ns/iter (+/- 1,578,978) = 389 MB/s
test bench_read_to_string_10000000_io_buffered  ... bench:  26,121,595 ns/iter (+/- 1,309,250) = 382 MB/s
test bench_read_to_string_10000000_netio        ... bench:  17,228,399 ns/iter (+/- 1,351,880) = 580 MB/s
test bench_tcp_read_to_end_100000_io            ... bench:  12,651,002 ns/iter (+/- 136,028) = 7 MB/s
test bench_tcp_read_to_end_100000_io_buffered   ... bench:  12,571,327 ns/iter (+/- 254,252) = 7 MB/s
test bench_tcp_read_to_end_100000_netio         ... bench:  12,450,104 ns/iter (+/- 269,427) = 8 MB/s

Any idea how this is possible and how to improve the benchmarks?

How many different places a function is used in your program influences inlining decisions. Some optimizations are critically dependent on inlining (unfortunately). So that's a guess at what may be happening.

The benchmarks are in a separate benches directory and AFAIK the crate that I'm testing is compiled separately.
The benchmark itself contains only one call to the library (per iteration), so whether or not this call is inlined shouldn't matter.

I'm not sure how/if cross-crate inlining works in Rust but I wouldn't expect such effects without LTO.

Any item with type parameters is available in the crate metadata and can be inlined into other crates.

You were right, it was inlining... So is the solution to mark everything #[inline]?

#[inline] makes an item eligible for inlining, but it does not force it (it only bumps the affinity of it slightly). It's hard to say if it's a solution in general. If you have a function that's really simple but could not inline, then using #[inline] is good.

Well in my case the function was apparently already eligible for inlining, otherwise it wouldn't be fast in some cases. And using #[inline] seems to push it just enough to inline it again.

But on the other hand, I'm not so sure anymore that my benchmark is actually useful at all. I think it may optimize too much to be realistic. Benchmarking I/O functions just with mocked streams that don't do any I/O is probably not ideal...