Is My Criterion Benchmarking Code Actually Answering My Question?

I'm new to benchmarking. Will the following code actually reveal the knowledge I seek – which std method is the faster way of determining whether two floating-point values have the same sign?

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
// use num_traits::{float::Float};
use rand::{thread_rng, Rng};

type Pair = (f64, f64);

#[inline]
/// I think this will be slower because it includes a copysign call 
/// ^ I was wrong
fn compare_same_signum(test_pairs: &[Pair]) {
	test_pairs
		.iter()
		.for_each(|(a, b)| {
			let _ = a.signum() == b.signum();
		});
}

#[inline]
/// I think this will be faster because there's no copysign call 
/// ^ I was wrong
fn compare_same_is_sign(test_pairs: &[Pair]) {
	test_pairs
		.iter()
		.for_each(|(a, b)| {
			let _ = a.is_sign_negative() == b.is_sign_negative();
		});
}

#[inline]
/// I think this will maybe be on par with the is_sign methods which use to_bits 
/// ^ I was kind of right
fn compare_same_to_bits(test_pairs: &[Pair]) {
	// Mask selecting the IEEE 754 sign bit of an f64
	let mask = 0x8000_0000_0000_0000u64;
	test_pairs
		.iter()
		.for_each(|(a, b)| {
			let _ = (a.to_bits() & b.to_bits()) & mask != 0;
		});
}

fn bench_float_comparisons(c: &mut Criterion) {
	// Some random test data 
	let mut rng = thread_rng();
	let test_size = 100_000_000usize;
	let test_pairs = (0..test_size)
		.map(|_| (rng.gen(), rng.gen()))
		.collect::<Vec<Pair>>();
	// A benchmarking group
	let mut group = c.benchmark_group("float sign comparisons");
	// Run each comparison on decade-sized slices (10, 100, ..., 10_000_000 pairs) of the same random data
	for i in (1..8).rev().map(|d| test_size/10usize.pow(d)) {
		group.bench_with_input(
			BenchmarkId::new("signum", i), 
			&i,
			|b, i| b.iter(|| {
				compare_same_signum(&test_pairs[0..*i])
			})
		);
		group.bench_with_input(
			BenchmarkId::new("is_sign_negative", i),
			&i,
			|b, i| b.iter(|| {
				compare_same_is_sign(&test_pairs[0..*i])
			})
		);	
		group.bench_with_input(
			BenchmarkId::new("to_bits", i),
			&i,
			|b, i| b.iter(|| {
				compare_same_to_bits(&test_pairs[0..*i])
			})
		);	

	}
	group.finish();
}

criterion_group!(benches, bench_float_comparisons);
criterion_main!(benches);

Is there a more accurate way to do this? Do you see any major issues with the way I wrote the benchmark?

Unnecessary background information:
I am partly doing this to learn to benchmark properly and accurately before I try benchmarking larger code sections that rely on similar operations to classify points, vectors, etc. that are known to be finite – otherwise I'd just stick to signum.

I ran the code a few times, but after viewing the results, my brain has neither grown any larger nor gained heightened powers of perception.

I think you need to return something from the test functions: as it stands, the optimizer might notice that your functions don't actually do anything and therefore remove the code you intend to test. For example, you could return the count of pairs that have the same sign:

#[inline]
fn compare_same_signum(test_pairs: &[Pair]) -> usize {
	test_pairs
		.iter()
		.filter(|(a, b)| a.signum() == b.signum())
		.count()
}
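
For extra safety you can also route the slice through black_box so the optimizer can't specialize on the input; Criterion already passes the closure's return value through black_box, so with the counting version above the result won't be optimized away either. A minimal sketch of one adjusted bench call (it plugs into the loop from your benchmark, reusing your group, test_pairs, and i):

use criterion::black_box;

group.bench_with_input(
	BenchmarkId::new("signum", i),
	&i,
	// black_box hides the slice from the optimizer; the returned count
	// is black_boxed by Criterion itself.
	|b, i| b.iter(|| compare_same_signum(black_box(&test_pairs[0..*i]))),
);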

Thanks! This seems to produce more consistent results that make sense.


By the way, if you are trying to learn proper benchmarking practice, do it with something much more obvious. In this case, I'd expect no difference: determining the sign of a floating-point number is trivial (it can be expressed with a couple of bitwise ANDs and an equality check). With any reasonably smart compiler/optimizer, there will be no actual calls to copysign or any other function; the equivalent code will be inlined. So try something non-trivial instead, with an obvious performance difference, like sorting a small and a big array.
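
For instance, a minimal sketch of that kind of benchmark (the sizes are arbitrary; iter_batched keeps the clone of the input out of the measured region):

use criterion::{criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion};

fn bench_sort(c: &mut Criterion) {
	let mut group = c.benchmark_group("sort_unstable");
	for size in [100usize, 100_000] {
		// Reverse-sorted input so the sort has real work to do
		let data: Vec<u64> = (0..size as u64).rev().collect();
		group.bench_with_input(BenchmarkId::from_parameter(size), &data, |b, data| {
			// The setup closure (the clone) runs outside the timed region;
			// only the sort itself is measured.
			b.iter_batched(
				|| data.clone(),
				|mut v| {
					v.sort_unstable();
					v
				},
				BatchSize::SmallInput,
			)
		});
	}
	group.finish();
}

criterion_group!(sort_benches, bench_sort);
criterion_main!(sort_benches);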

One more thing: your compare_same_to_bits function is not correct. It checks whether both numbers are negative; you probably want a bitwise XOR instead of the first bitwise AND between the two numbers.
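
Combining that fix with the count-returning version above, a corrected sketch would be (the XOR of two floats' bits has the sign bit clear exactly when their signs match):

#[inline]
fn compare_same_to_bits(test_pairs: &[Pair]) -> usize {
	// IEEE 754 sign bit of an f64
	let mask = 0x8000_0000_0000_0000u64;
	test_pairs
		.iter()
		// XOR sets the sign bit only when the signs differ, so a
		// masked result of zero means "same sign".
		.filter(|(a, b)| (a.to_bits() ^ b.to_bits()) & mask == 0)
		.count()
}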


Thanks. I didn't notice that error in my compare_same_to_bits.

The motivation for testing this particular example was doing a bunch of comparisons of the signs of groups of floating point numbers in real code – I thought it'd be an easy start, but I'll definitely take your point into consideration.

The problem is that it's extremely difficult to extrapolate the results of nano-benchmarks to real code -- especially on modern super-scalar desktop chips. Not to mention what the optimizer will do with code -- if you measure general division, for example, the results are irrelevant for x / 10 because the compiler doesn't use division to do that.
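
You can check that last point for yourself by pasting a sketch like this into a tool like Compiler Explorer and comparing the assembly:

// Divisor known only at runtime: compiles to an actual division
// instruction (plus a divide-by-zero check).
pub fn div_general(x: u32, d: u32) -> u32 {
	x / d
}

// Constant divisor: typically lowered to a multiply-by-magic-number
// and shift, with no division instruction at all.
pub fn div_by_ten(x: u32) -> u32 {
	x / 10
}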

CAD97 put together some great benchmarks in Converting a BGRA &[u8] to RGB [u8;N] (for images)? - #13 by CAD97 that show just how hard it is to understand how something will perform in aggregate. A bunch of operations show up as essentially free because of ILP and speculation and such -- in fact, one of the ones with the most instructions ends up being one of the fastest, and the one that's the fewest instructions is one of the slowest.

So it's critical to find a bigger chunk to measure. Ideally something with a meaningful loop that can run both smaller and larger instances of the problem -- how the unrolling & vectorization ends up can often be more important than how a single run of the loop body performs in isolation.
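
One thing that helps when running the same benchmark at several sizes is to tell Criterion the throughput, so the report normalizes to elements per second and the sizes become directly comparable. A sketch, reusing the loop from the original post:

use criterion::Throughput;

for i in (1..8).rev().map(|d| test_size / 10usize.pow(d)) {
	// One iteration processes i pairs; the report can then show
	// elements/second instead of raw time per iteration.
	group.throughput(Throughput::Elements(i as u64));
	group.bench_with_input(BenchmarkId::new("signum", i), &i, |b, i| {
		b.iter(|| compare_same_signum(&test_pairs[0..*i]))
	});
}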

