Bad idea? Using Criterion to measure something other than time

Hello there,
This might be a bad idea, but I have started using Criterion to benchmark time, and the idea came to me to use all the robustness of Criterion and hack it into measuring something other than time, such as an ML model's accuracy.

Here is an example of what I'm trying to do. In the following project I'm measuring prediction accuracy through two metrics, sensitivity and specificity, and I want to bench these values instead of the time it takes to compute them.

Using the Custom Measurements chapter of the Criterion.rs documentation, I came up with the following file:

use criterion::{
    Criterion, criterion_group, criterion_main,
    measurement::{Measurement, ValueFormatter},
};
use linfa_nn::{CommonNearestNeighbour, NearestNeighbour};
use shopping::{evaluate, load_data_from_url, predict};

const TEST_RATIO: f32 = 0.4;
const K_NEIGHBORS: usize = 2;
const URL: &str = "https://cdn.cs50.net/ai/2023/x/projects/4/shopping.zip";

struct MetricMeasurement;

impl Measurement for MetricMeasurement {
    type Intermediate = f64;
    type Value = f64;

    // `start`/`end` go effectively unused here: the benches below use
    // `iter_custom`, which supplies the measured value directly.
    fn start(&self) -> Self::Intermediate {
        0.0
    }

    fn end(&self, i: Self::Intermediate) -> Self::Value {
        i
    }

    fn add(&self, v1: &Self::Value, v2: &Self::Value) -> Self::Value {
        v1 + v2
    }

    fn zero(&self) -> Self::Value {
        0.0
    }

    fn to_f64(&self, value: &Self::Value) -> f64 {
        *value
    }

    fn formatter(&self) -> &dyn criterion::measurement::ValueFormatter {
        &MetricFormatter
    }
}

struct MetricFormatter;

impl ValueFormatter for MetricFormatter {
    fn format_value(&self, value: f64) -> String {
        format!("{value:.4}")
    }

    fn scale_values(&self, _typical_value: f64, _values: &mut [f64]) -> &'static str {
        ""
    }

    fn scale_throughputs(
        &self,
        _typical_value: f64,
        _throughput: &criterion::Throughput,
        _values: &mut [f64],
    ) -> &'static str {
        ""
    }

    fn scale_for_machines(&self, _values: &mut [f64]) -> &'static str {
        ""
    }
}

fn bench_sensitivity(c: &mut Criterion<MetricMeasurement>) {
    let (evidence, labels) =
        load_data_from_url(URL).expect("Failed to load CSV from URL for benchmark");

    let dataset = linfa::Dataset::new(evidence, labels);
    let mut rnd = rand::thread_rng();

    c.bench_function("Model Sensitivity", |b| {
        b.iter_custom(|iters| {
            let mut total = 0.0;

            for _ in 0..iters {
                let (train, test) = dataset
                    .clone()
                    .shuffle(&mut rnd)
                    .split_with_ratio(TEST_RATIO);

                let model = CommonNearestNeighbour::KdTree
                    .from_batch(train.records(), linfa_nn::distance::L2Dist)
                    .expect("NN index");

                let predictions = predict(&*model, train.targets(), test.records(), K_NEIGHBORS);
                let (sensitivity, _specificity) = evaluate(test.targets(), &predictions);
                total += sensitivity;
            }

            // `iter_custom` must return the *total* across all `iters`
            // runs; Criterion divides by `iters` itself. Returning the
            // mean here would get divided a second time.
            total
        })
    });
}

fn bench_specificity(c: &mut Criterion<MetricMeasurement>) {
    let (evidence, labels) =
        load_data_from_url(URL).expect("Failed to load CSV from URL for benchmark");

    let dataset = linfa::Dataset::new(evidence, labels);

    c.bench_function("Model Specificity", |b| {
        b.iter_custom(|iters| {
            let mut total = 0.0;

            let mut rnd = rand::thread_rng();
            for _ in 0..iters {
                let (train, test) = dataset
                    .clone()
                    .shuffle(&mut rnd)
                    .split_with_ratio(TEST_RATIO);

                let model = CommonNearestNeighbour::KdTree
                    .from_batch(train.records(), linfa_nn::distance::L2Dist)
                    .expect("NN index");

                let predictions = predict(&*model, train.targets(), test.records(), K_NEIGHBORS);
                let (_sensitivity, specificity) = evaluate(test.targets(), &predictions);
                assert!(specificity > 0.5);
                total += specificity;
            }

            // Return the total; Criterion divides by `iters` itself.
            total
        })
    });
}

criterion_group! {
    name = bench_accuracy;
    config = Criterion::default().with_measurement(MetricMeasurement);
    targets = bench_sensitivity, bench_specificity
}
criterion_main!(bench_accuracy);

The whole repo can be found here: Files · feat/benches · CS50-AI / shopping · GitLab

Now the benches are able to run, but they report a much worse accuracy than repeatedly running the main function does.

Now, is this a good idea? Has anyone tried anything similar? Should I keep trying to make this work, or is it a waste of time?

Sounds interesting. An immediate application that comes to mind is measuring CPU cycles or instruction counts, which are much more stable numbers than actual time. The metrics you mention are also interesting, but less obvious since they have a different domain: time ranges from 0 to infinity, whereas sensitivity ranges from 0 to 1.
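
The cycle-counting variant would use the same trait plumbing as `MetricMeasurement` above, with `start` reading a counter and `end` returning the difference. Here is a stdlib-only sketch of just the counter (assuming x86_64 for the real thing; the Criterion `Measurement` impl is omitted since it would mirror the code in the post):

```rust
// Reads a raw cycle count from the CPU's timestamp counter. In a
// Criterion `Measurement`, `start()` would call this and `end(i)`
// would return `read_cycles() - i`.
#[cfg(target_arch = "x86_64")]
fn read_cycles() -> u64 {
    // RDTSC is not a serializing instruction; serious measurements
    // should add fences (or use RDTSCP) to order it against the
    // surrounding work.
    unsafe { core::arch::x86_64::_rdtsc() }
}

#[cfg(not(target_arch = "x86_64"))]
fn read_cycles() -> u64 {
    // Portable stand-in for the sketch: nanoseconds since an
    // arbitrary epoch instead of real cycles.
    use std::time::{SystemTime, UNIX_EPOCH};
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos() as u64
}

fn main() {
    let before = read_cycles();
    let mut acc = 0u64;
    for i in 0..1_000_000u64 {
        acc = acc.wrapping_add(i);
    }
    // black_box keeps the loop from being optimized away.
    std::hint::black_box(acc);
    let cycles = read_cycles() - before;
    println!("summing loop took ~{cycles} counter ticks");
}
```

Instruction counts would need a perf-events interface rather than a raw register read, but the `Measurement` wiring stays the same.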

Very interesting. I could see using this approach to measure memory usage or allocations as well, for example (together with a custom global allocator that tracks this information).
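
A minimal sketch of that allocator idea (the names here are my own, not from any crate): a `GlobalAlloc` wrapper around `System` that counts allocated bytes in an atomic. A `Measurement::start` would snapshot the counter and `end` would return the delta.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Total bytes handed out by `alloc` since program start.
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

// Wraps the system allocator and tracks every allocation's size.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    let before = ALLOCATED.load(Ordering::Relaxed);
    let v: Vec<u64> = (0..1_000).collect();
    let delta = ALLOCATED.load(Ordering::Relaxed) - before;
    // The Vec's buffer alone needs 1_000 * 8 bytes.
    assert!(delta >= 8_000);
    println!("collected {} items, allocated {} bytes", v.len(), delta);
}
```

With `iter_custom`, the closure could snapshot `ALLOCATED` before and after the benchmarked call and return the difference as the `f64` value, just like the accuracy benches in the post.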