Experienced programmer's first Rust program: how to improve style and organization?

Hi folks, first I want to say how awesome the Rust docs are, and how helpful the compiler is. It really makes a difference for someone first learning the language.

I'm an experienced programmer just learning Rust. I've made a few attempts to power through the Rust Book in recent months, but I've lacked the time and discipline to really absorb and use it all. I learn better when I have a specific task to accomplish, so yesterday I set out to write out a program related to my work as as bioinformatics scientist. After several hours scouring Google, StackOverflow, the Rust Book, and rustc --explain, I've finally gotten the program to work as intended. It's included below, as is a link to a Github repo.

At the moment, it's all a big jumble in the main function. At several points I tried to reorganize the code into smaller dedicated functions, but I'm still don't have great intuition for when ownership, lifetimes, etc., come into play, and I certainly don't have the syntax ready at my fingertips.

So my questions are:

  1. Are there any glaring issues with the code? Its style, implementation, or use (or not) of idiomatic conventions?
  2. How would you recommend reorganizing this code into smaller functions? Argument parsing is a substantial portion of the code, and that could certainly be refactored. Anything else?
extern crate argparse;
extern crate bio;
extern crate rand;

use argparse::{ArgumentParser, Store};
use bio::io::fastq;
use rand::{Rng, RngCore, SeedableRng};
use rand_xorshift::XorShiftRng;
use std::fs::File;
use std::io;

fn main() {
    let mut infile = "-".to_string();
    let mut outfile = "-".to_string();
    let mut numreads = "500".to_string();
    let mut seed = "0".to_string();
    let mut fhin;
    let mut fhout;
    let mut stdout;
    let mut stdin;
    let mut nonseedgen;
    let mut seedgen: XorShiftRng;

    {
        let mut parser = ArgumentParser::new();
        parser
            .refer(&mut outfile)
            .add_option(
                &["-o", "--outfile"],
                Store,
                "Write output to FILE; by default, output is written to the terminal (stdout)",
            )
            .metavar("FILE");
        parser
            .refer(&mut numreads)
            .add_option(
                &["-n", "--num-reads"],
                Store,
                "Randomly sample N sequences from the input; by default N=500",
            )
            .metavar("INT");
        parser
            .refer(&mut seed)
            .add_option(
                &["-s", "--seed"],
                Store,
                "Seed random number generator for reproducible behavior; by default, the RNG sets its own random state"
            )
            .metavar("INT");
        parser
            .refer(&mut infile)
            .add_argument("seqs", Store, "Sequences in Fastq format");
        parser.parse_args_or_exit();
    }

    let instream: &mut dyn std::io::Read = match infile.as_str() {
        "-" => {
            stdin = io::stdin();
            &mut stdin
        }
        filename => {
            fhin = File::open(&filename).unwrap();
            &mut fhin
        }
    };
    let outstream: &mut dyn std::io::Write = match outfile.as_str() {
        "-" => {
            stdout = io::stdout();
            &mut stdout
        }
        filename => {
            fhout = File::create(&filename).unwrap();
            &mut fhout
        }
    };
    let rng: &mut dyn RngCore = match seed.as_str() {
        "0" => {
            nonseedgen = rand::thread_rng();
            &mut nonseedgen
        }
        seedstr => {
            let seednum: u64 = seedstr.parse().unwrap();
            seedgen = SeedableRng::seed_from_u64(seednum);
            &mut seedgen
        }
    };
    let size: usize = numreads.parse().unwrap();
    let mut reservoir: Vec<fastq::Record> = Vec::new();
    let reader = fastq::Reader::new(instream);
    let mut writer = fastq::Writer::new(outstream);
    let mut count = 0;
    for result in reader.records() {
        let record = result.expect("Error during Fastq parsing");
        count += 1;
        if reservoir.len() < size {
            reservoir.push(record);
        } else {
            let r = rng.gen_range(1, count);
            if r <= size {
                reservoir[r - 1] = record;
            }
        }
    }
    eprintln!(
        "Sampled {} reads from a total of {}",
        reservoir.len(),
        count
    );
    for record in reservoir.iter() {
        writer
            .write_record(record)
            .expect("Error writing Fastq record");
    }
}

https://github.com/standage/umngqusho/tree/8cfac54a79cb30b3ad18b7fbb14320fef43c9b41

The style's not too bad, but there are some Rust features you could take advantage of:

  • Instead of using "-" as a filename to mean standard input/output (preventing the user from inputting that as a filename), you could make those Options and represent no file with None so you can use map.
  • I'd make an Args struct that contains all the user input and move parsing into its own method.
  • Instead of having variables for each potential file or stream, I'd make them be a Box<dyn std::io::Write>. It's generally good to trim down the number of variables that may go unused.
  • Instead of parsing values yourself (like seed), argparse should be able to automatically parse them for you if you pass it the right type (like u64.)
  • In general, using iterator chaining when you can is preferable to for loops writing to vector indices and maintaining a count manually. This case isn't too unidiomatic, and actually probably easier to write with the for loop, but I've provided an example of how you might write it in a "declarative" way.

I've reorganized your code to show how you could split it into functions / methods and make it a bit more Rust-y. Since this is a straight-shot command line app and not a server or something, you don't really need to care too much about lifetimes here - elision will do the work.

Besides splitting it into parts, though, the overall structure is pretty good.

use argparse::{ArgumentParser, Store, StoreOption};
use bio::io::fastq;
use rand::{Rng, RngCore, SeedableRng};
use rand_xorshift::XorShiftRng;
use std::{fs::File, io, path::PathBuf};

/// Parsed arguments.
struct Args {
    outfile: Option<PathBuf>,
    seqs: Option<PathBuf>,
    num_reads: u32,
    seed: Option<u64>,
}

impl Args {
    /// Parse command line args.
    fn parse_or_exit() -> Self {
        let mut outfile = None;
        let mut seqs = None;
        let mut num_reads = 500;
        let mut seed = None;

        let mut parser = ArgumentParser::new();

        parser
            .refer(&mut outfile)
            .add_option(
                &["-o", "--outfile"],
                StoreOption,
                "Write output to FILE; by default, output is written to the terminal (stdout)",
            )
            .metavar("FILE");
        parser
            .refer(&mut num_reads)
            .add_option(
                &["-n", "--num-reads"],
                Store,
                "Randomly sample N sequences from the input; by default N=500",
            )
            .metavar("INT");
        parser
            .refer(&mut seed)
            .add_option(
                &["-s", "--seed"],
                StoreOption,
                "Seed random number generator for reproducible behavior; by default, the RNG sets its own random state"
            )
            .metavar("INT");
        parser
            .refer(&mut seqs)
            .add_argument("seqs", StoreOption, "Sequences in Fastq format");

        parser.parse_args_or_exit();

        Self {
            outfile,
            seqs,
            num_reads,
            seed,
        }
    }

    /// Open input stream from either a file or stdin.
    fn in_stream(&self) -> Box<dyn io::Read> {
        self.seqs
            .map(|name| Box::new(File::open(&name).unwrap()) as Box<dyn io::Read>)
            .unwrap_or_else(|| Box::new(io::stdin()))
    }

    /// Open output stream from either a file or stdout.
    fn out_stream(&self) -> Box<dyn io::Write> {
        self.outfile
            .map(|name| Box::new(File::open(&name).unwrap()) as Box<dyn io::Write>)
            .unwrap_or_else(|| Box::new(io::stdout()))
    }

    /// Create an RNG from a optional provided seed.
    fn rng(&self) -> Box<dyn RngCore> {
        self.seed
            .map(|seed| Box::new(XorShiftRng::seed_from_u64(seed)) as Box<dyn RngCore>)
            .unwrap_or_else(|| Box::new(rand::thread_rng()))
    }
}

/// Sample records from the input stream.
fn sample_records(
    in_stream: &mut dyn io::Read,
    rng: &mut dyn RngCore,
    num_reads: u32,
) -> Vec<fastq::Record> {
    // Read the records from the input.
    let records = fastq::Reader::new(in_stream)
        .records()
        .map(|record| record.expect("Error during Fastq parsing"));

    // Create a reservoir from the first `num_reads` records.
    let reservoir = records.by_ref().take(num_reads).collect::<Vec<_>>();

    // Get indices of remaining records.
    let remaining_idxs = reservoir.len()..;

    // Generate random values for the remaining records.
    let rand_vals = remaining_idxs.by_ref().map(|c| rng.gen_range(0, c - 1));

    // For each remaining record, get a mutable reference to a random entry in
    // the reservoir, if the random number is in range.
    let remaining = records
        .zip(rand_vals)
        .filter_map(|(record, r)| reservoir.get_mut(r).map(|entry| (record, entry)));

    // Write the remaining records to the random reservoir entries.
    for (record, entry) in remaining {
        *entry = record;
    }

    eprintln!(
        "Sampled {} reads from a total of {}",
        reservoir.len(),
        remaining_idxs.next().unwrap(),
    );

    reservoir
}

/// Write sampled records to the output stream.
fn write_records(out_stream: &mut dyn io::Write, reservoir: &[fastq::Record]) {
    let mut writer = fastq::Writer::new(out_stream);

    for record in reservoir {
        writer
            .write_record(record)
            .expect("Error writing Fastq record");
    }
}

fn main() {
    let args = Args::parse_or_exit();
    let in_stream = args.in_stream();
    let out_stream = args.out_stream();
    let rng = args.rng();
    let reservoir = sample_records(&mut in_stream, &mut rng, args.num_reads);
    write_records(&mut out_stream, &reservoir);
}

8 Likes

Using index operations is definitely not recommended. However, that doesn't mean you cannot use for-loops. It's just a matter of using the appropriate iterator in for <element> in <iterator>.

EDIT: Manual indexing can cause a panic, while iterating cannot (unless the iterator is faulty or you somehow trigger an OOM error). If the iterator also implements TrustedLen, all out-of-bounds checks will be removed, which results in performance improvements approximately anti-proportional to the amount of work done inside the loop.

1 Like

Since most part of your code is argument parsing, why don't you use structopt which vastly simplifies it? Also, constant amount of allocations would not impact the performance much while improves the code readability.

use bio::io::fastq;
use rand::{Rng, RngCore, SeedableRng};
use rand_xorshift::XorShiftRng;
use std::fs::File;
use std::io;
use std::path::PathBuf;

#[derive(Debug, structopt::StructOpt)]
struct Opt {
    /// Write output to FILE; by default, output is written to the terminal (stdout)
    #[structopt(short, long)]
    outfile: Option<PathBuf>,
    /// Randomly sample N sequences from the input; by default N=500
    #[structopt(short, long, default_value = "500")]
    num_reads: usize,
    /// Seed random number generator for reproducible behavior; by default, the RNG sets its own random state
    #[structopt(short, long)]
    seed: Option<u64>,
    /// Sequences in Fastq format
    seqs: Option<PathBuf>,
}

fn main() {
    let opt = Opt::from_args();
    // eprintln!("{:?}", opt);
    
    let mut instream: Box<dyn std::io::Read> = match &opt.seqs {
        Some(path) => Box::new(File::open(path).unwrap()),
        None => Box::new(io::stdin()),
    };
    let mut outstream: Box<dyn std::io::Write> = match &opt.outfile {
        Some(path) => Box::new(File::open(path).unwrap()),
        None => Box::new(io::stdout()),
    };
    let mut rng: Box<dyn RngCore> = match opt.seed {
        Some(seed) => Box::new(XorShiftRng::seed_from_u64(seed)),
        None => Box::new(rand::thread_rng()),
    };
    
    let size = opt.num_reads;
    let mut reservoir: Vec<fastq::Record> = Vec::new();
    let reader = fastq::Reader::new(instream);
    let mut writer = fastq::Writer::new(outstream);
    let mut count = 0;
    for result in reader.records() {
        let record = result.expect("Error during Fastq parsing");
        count += 1;
        if reservoir.len() < size {
            reservoir.push(record);
        } else {
            let r = rng.gen_range(1, count);
            if r <= size {
                reservoir[r - 1] = record;
            }
        }
    }
    eprintln!(
        "Sampled {} reads from a total of {}",
        reservoir.len(),
        count
    );
    for record in &reservoir {
        writer
            .write_record(record)
            .expect("Error writing Fastq record");
    }
}
1 Like

Thanks for your thorough response @asymmetrikon!

I'm having trouble wrapping my mind around some of your suggested changes. Let's take this for instance.

  • The Box<T> is for heap allocation, right? Is this to make it easier to work with the variable across different scopes?
  • Why do we need to call .unwrap() after File::open(&name)?
  • The code beginning with as is casting the data type of the Box::new() result, correct?
  • I'm familiar with .map functions in other languages, where it's usually applied to lists or iterators. Here it's applied to a file path (or more strictly a PathBuf object). Why is map recommended here?
  • I'm trying to understand when the dyn keyword is needed. This article was helpful, but even in my original where I wasn't doing any heap allocation the compiler issued a warning suggesting I needed to use dyn.

Thank you for your patience with a n00b!

dyn T indicates a trait object - basically, some data and a table of functions that work on that data to implement a trait T. For example, dyn Read is an object that implements the trait Read, and we know nothing else about. in_stream's source can be either a file or stdin. Those are different types, so you can't assign either to the same variable. However, since they both implement io::Read, you can turn them into dyn Reads (here using as - we don't need an as on the stdin one because the type inference engine is smart enough to figure it out.) That has variable size, though, so it needs to be stored somewhere that can accommodate that - we can put unsized things in a Box on the heap.

There are other ways of handling this - you could create your own enum like

enum InStream {
    File(File),
    Stdin(Stdin),
}

then implement io::Read for it and use an instance of it instead of the Box<dyn io::Read>. If you don't want that, you can use something like the Either type from the either crate; it implements Read if both of its variants do, so you'd use an Either<File, Stdin> and it'd work the same way. Boxing trait objects is a pretty standard way of dealing with them if your program is okay with allocating.

File::open returns a Result<File>, so we have to handle the bad case (maybe the user doesn't have permissions for the file, or it's write protected, etc.) Unwrapping takes care of that by panicking and ending the program if any of those things happen. Probably in a real application you'd want to bubble those errors up to main so you can print them out to the user nicely formatted.

Here, it's not being used on a PathBuf, but on an Option<PathBuf> - you have to handle the case where the user didn't provide an input. The pair of map and unwrap_or_else there allow you to return one thing if they did enter a file, and one thing if they didn't. It's equivalent to writing

match self.seqs {
    Some(name) => Box::new(File::open(&name).unwrap()) as Box<dyn io::Read>,
    None => Box::new(io::stdin()),
}

Incidentally, you can treat an Option<T> like an iterator that returns either 0 or 1 Ts, and its map function is just the Option's map function.

2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.