Best practices for managing test data

aah · July 20, 2018, 6:21pm

I'm currently trying to work out how to best manage data for integration tests, and I'm wondering if it's a "solved" problem, or one for which there is community consensus?

I have an application + library that processes audio files, and I would like to be able to run tests with example audio workloads. I've struggled to find a neat solution. So far, I have tried and been unsatisfied with:

git-lfs to store test data in a data subdirectory
downloading data in a cargo build script

The former makes it slow to check out the code, while the latter is brittle and does not guarantee a coupling between the data being present, and the test that requires it.

Are there any better options?

azriel91 · July 21, 2018, 9:09am

Not sure about better, but here are some ideas:

Generate the audio files
Can this generate realistic data? not sure
Store the test data in a separate repository, and download just the latest revision when cloning (git submodules might do this, though I haven't confirmed it)
This is kind of a mix between the two options you've mentioned — download during cloning, but not all of the data, and has the added benefit of versioning the data with the code.
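As a rough illustration of the "generate the audio files" idea, here is a minimal sketch using only the standard library. The function names (`sine_wave`, `zero_crossings`) are made up for this example, and a pure tone is of course far simpler than real-world audio, so this only helps when the code under test can be exercised with synthetic signals.

```rust
use std::f32::consts::PI;

/// Generate a mono sine tone as f32 samples in the range [-1.0, 1.0].
/// (Hypothetical helper, not part of any library mentioned in this thread.)
fn sine_wave(freq_hz: f32, sample_rate: u32, duration_secs: f32) -> Vec<f32> {
    let n = (sample_rate as f32 * duration_secs) as usize;
    (0..n)
        .map(|i| {
            let t = i as f32 / sample_rate as f32;
            (2.0 * PI * freq_hz * t).sin()
        })
        .collect()
}

/// Count sign changes between neighbouring samples. This is a crude
/// stand-in for whatever feature detection the real library performs.
fn zero_crossings(samples: &[f32]) -> usize {
    samples
        .windows(2)
        .filter(|w| (w[0] < 0.0) != (w[1] < 0.0))
        .count()
}
```

A test could then generate one second of a 440 Hz tone and check that the number of zero crossings is close to twice the frequency, without touching the disk or the network.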

Centril · July 22, 2018, 12:45pm

It depends on what sort of audio files are expected and processed by the application. If, for example, you are analyzing music that consists of human singing and the application is checking for things like that, it will be difficult. Audio files with speech will also be difficult.

If you want to test simpler things, you could use Markov chains and randomly generate some MIDI, perhaps. You can then use proptest (proptest-rs/proptest on GitHub, Hypothesis-like property testing for Rust) to facilitate the testing and shrinking.

dylan.dpc · July 22, 2018, 12:49pm

I wouldn't recommend this. It is better to keep tests as "offline" as possible. You don't want a test failing due to a download failure.

aah · July 22, 2018, 7:31pm

Thanks for the suggestions, all.

@azriel91 -- unfortunately generating the audio files is out of the question for me, as my library is designed to search for/detect features in real-world audio data, the nature of which artificial tests would not be able to replicate.

Using a git sub-module might be an option. As they are audio files, though, they tend to be quite large. I suppose that might not be so much of a problem if they don't ever change though!

@dylan.dpc -- I completely agree that downloading in a cargo script is non-optimal; that's why I posed the question. I also agree that a test failing due to a download failure is unacceptable. Ideally, I'm looking for a solution that allows me to encode something like the following vague pseudo-code:

if data driven tests enabled: 
    for each test in tests:
        let data = download_data()
        if data == failure:
            skip test
        otherwise: 
            run_test(test, data)

Using cargo or another tool (e.g. git) to download data outside of the testing environment essentially lifts the download_data() call into a separate loop, rather than integrating the data acquisition into the testing environment.
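For what it's worth, the pseudo-code above can be approximated with the stock test harness. The sketch below makes several assumptions: the download step is replaced by reading from a directory that some external tool has already populated, and the environment variable names `RUN_DATA_TESTS` and `TEST_DATA_DIR`, the helper names, and the file name `example.wav` are all hypothetical. The built-in harness has no way to mark a test as skipped at run time, so a "skipped" test here simply returns early and is reported as passed.

```rust
use std::env;
use std::fs;
use std::path::Path;

/// Core of the "skip if unavailable" logic, kept free of global state so it
/// can be exercised directly. Returns None when the tests are disabled, no
/// data directory is configured, or the file cannot be read.
fn load_from(enabled: bool, dir: Option<&Path>, name: &str) -> Option<Vec<u8>> {
    if !enabled {
        return None;
    }
    fs::read(dir?.join(name)).ok()
}

/// Reads the switches from the environment. Both variable names are
/// hypothetical, chosen for this sketch.
fn load_test_data(name: &str) -> Option<Vec<u8>> {
    let enabled = env::var_os("RUN_DATA_TESTS").is_some();
    let dir = env::var_os("TEST_DATA_DIR");
    load_from(enabled, dir.as_deref().map(Path::new), name)
}

#[test]
fn detects_features_in_real_audio() {
    let data = match load_test_data("example.wav") {
        Some(data) => data,
        None => {
            // No data available: bail out instead of failing the test.
            eprintln!("skipping: test data not available");
            return;
        }
    };
    // Placeholder assertion; the real test would run the detector on `data`.
    assert!(!data.is_empty());
}
```

The main drawback is visibility: because an early return counts as a pass, it is easy to believe the data-driven tests ran when they were in fact all skipped.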

Maybe a better question would be: does Rust have any testing environments that support something like the pseudo-code above?

aah · July 23, 2018, 3:07pm

Another piece of information that might be useful for others who find this thread in the future. There are plans afoot for Rust to support a greater range of test frameworks, which might somewhat solve the problem I pose here. There are more details in this blog post, and the linked RFC: http://blog.jrenner.net/rust/testing/2018/07/19/test-in-2018.html

dylan.dpc · July 23, 2018, 5:44pm

You can store some pre-generated audio files in the tests folder and use the same files on every build.
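A minimal sketch of how a test might locate such files, assuming they live under `tests/data/` in the crate (that layout and the `fixture_path` name are assumptions for illustration). Cargo sets `CARGO_MANIFEST_DIR` when running tests, so the lookup does not depend on the current working directory.

```rust
use std::env;
use std::path::PathBuf;

/// Build the path to a file under `<crate root>/tests/data/`.
/// Falls back to the current directory when not run through Cargo.
fn fixture_path(name: &str) -> PathBuf {
    let root = env::var_os("CARGO_MANIFEST_DIR")
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("."));
    root.join("tests").join("data").join(name)
}
```

A test would then call something like `std::fs::read(fixture_path("tone.wav"))` to load the file.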

aah · July 23, 2018, 6:44pm

Unfortunately, the files are really quite big (on the order of 100 MB for some), so I'm loath to include them in git. I've had a play with git-lfs, with some success, but cheap git-lfs hosts are not plentiful.

The files are also publicly available in a very script friendly way, so I'd rather not (essentially) re-host them myself.

owen · February 3, 2020, 10:28pm

I agree with you that a solution that makes the project code slow to check out is unacceptable. I can't support downloading data in a cargo build script, even with a conditional that the build is a test build.

For many projects, integration tests are essential to avoid work, and for some projects, integration tests that require large amounts of data are essential. For such projects, you need to start a separate git repository for your integration tests, and have your original project contain only unit tests. Cargo and Cargo users expect published crates to contain only unit tests. But don't forget: if you are writing a product for others to depend upon, then as a user I expect the author to know it works, so correctly maintaining your integration tests is a real work saver.

So now you have 2 git repos, and you still don't have anywhere to store your test data! Let's list the specs I want:

Acts as a “Content Addressable Storage” system.
Easy to host, for distributing data locally in public and private environments.
Minimal download for updates.
Easy to cache.

I suspect ostree is a good solution. I have not researched git-lfs to see if it matches all of my requirements.
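To make the first requirement concrete, here is a toy in-memory sketch of content-addressed storage: each blob is stored under a key derived from its own bytes, so identical data is stored once and a fetched blob can be checked against its key. This is not how ostree or git-lfs are implemented. The `ContentStore` type is invented for this example, and FNV-1a is used only because it fits in a few lines of std-only code; a real store would use a cryptographic hash such as SHA-256.

```rust
use std::collections::HashMap;

/// FNV-1a 64-bit hash (non-cryptographic, used here for brevity only).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hex key derived from the content itself.
fn key_for(data: &[u8]) -> String {
    format!("{:016x}", fnv1a64(data))
}

/// Toy content-addressed store: blobs are keyed by the hash of their bytes.
#[derive(Default)]
struct ContentStore {
    blobs: HashMap<String, Vec<u8>>,
}

impl ContentStore {
    /// Store a blob and return its key. Storing the same bytes twice
    /// yields the same key and keeps a single copy.
    fn put(&mut self, data: &[u8]) -> String {
        let key = key_for(data);
        self.blobs.entry(key.clone()).or_insert_with(|| data.to_vec());
        key
    }

    /// Fetch a blob, re-hashing it to confirm it still matches its key.
    fn get(&self, key: &str) -> Option<&[u8]> {
        let data = self.blobs.get(key)?;
        if key_for(data) == key {
            Some(data.as_slice())
        } else {
            None
        }
    }
}
```

Because a key changes whenever the content changes, tests can pin the exact data they need by key, and any local cache can be validated without contacting the server.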
