Code review request: first project in Rust

I was facing a memory leak problem in a C++ codebase, so I wrote a tool in Rust to parse UMDH log files (each of which is a snapshot of the address + stack trace of all allocations).

It is ~250 lines of Rust code: umdh_analyzer/main.rs at bb2bb20ca8d2f7a0424d90160288e9a518e17963 · ashishnegi/umdh_analyzer · GitHub

The main idea is to parse multiple UMDH files and build a map of BackTraceId => set { all allocations }.
Allocations which are always present and increasing in count are a sign of a memory leak.

  1. My first version had many clone() calls, most of which I removed. A few remain, and I am not sure what the best way to remove them is.
  2. I am using a lot of &String to save memory. Is that the correct approach?
  3. I want to avoid unnecessary allocations and copies of data. This is my main principle.

Please give your valuable feedback.

use std::cmp::Ordering;
use std::collections::{HashMap, HashSet};
use std::fs::File;
use std::io::{BufRead, Error};
use std::path::Path;
use std::{env, io};

type BacktraceAllocationsMap = HashMap<String, HashSet<i64>>;
type BacktraceRefAllocationsMap<'a> = HashMap<&'a String, HashSet<i64>>;

fn parse_umdh_file(file_name: &str) -> Result<BacktraceAllocationsMap, Error> {
    let path = Path::new(&file_name);

    // Open the path in read-only mode, returns `io::Result<File>`
    let file = File::open(&path)?;
    let lines = io::BufReader::new(file).lines();

    let mut backtrace_addresses: BacktraceAllocationsMap = HashMap::new();

    for line_res in lines {
        let line = line_res?;
        // Format of line:
        // 30 bytes + 30 at 1E30F1B0770 by BackTraceF076263
        if line.contains("BackTrace") {
            let at_pos = line.find("at ");
            if at_pos.is_none() {
                continue;
            }

            let address_start_pos = at_pos.unwrap() + 3;
            let line_str = line.as_bytes();
            let mut address_end_pos = address_start_pos;
            while address_end_pos < line.len() && line_str[address_end_pos] != b' ' {
                address_end_pos += 1;
            }

            if address_end_pos + 4 + 9  > line.len() {
                // 4 for " by ", 9 for "BackTrace "
                continue; // bad format line ?
            }

            let address = match i64::from_str_radix(&line[address_start_pos..address_end_pos], 16) {
                Ok(address) => address,
                Err(_) => continue,
            };

            let backtrace_pos = address_end_pos + 4; //4 == " by ".len();
            let backtrace = String::from(&line[backtrace_pos..line.len()]);

            backtrace_addresses
                .entry(backtrace)
                .or_insert_with(HashSet::new)
                .insert(address);
        }
    }

    Ok(backtrace_addresses)
}

fn find_common_allocations<'a>(
    all_backtraces: &'a [&String],
    backtrace_maps: &[&'a BacktraceAllocationsMap],
) -> BacktraceRefAllocationsMap<'a> {
    let mut common_allocations: BacktraceRefAllocationsMap = HashMap::new();
    // find allocations which are common in all.
    for k in all_backtraces.iter() {
        let mut present = true;

        // Is this BackTrace present in all log files ? if not, skip costly set intersections
        for bk in backtrace_maps.iter() {
            if !bk.contains_key(*k) {
                present = false;
                break;
            }
        }

        if !present {
            continue;
        }

        let mut current_set = backtrace_maps[0].get(*k).unwrap()
            .intersection(backtrace_maps[1].get(*k).unwrap()).cloned().collect::<HashSet<i64>>();

        for bk in backtrace_maps.iter().skip(2) {
            // there is no easy (right?) way to iterate over HashSet and also remove from it;
            current_set = bk[*k]
                .intersection(&current_set)
                .cloned()
                .collect::<HashSet<i64>>();

            if current_set.is_empty() {
                present = false;
                break;
            }
        }

        if present {
            common_allocations.insert(k, current_set);
        }
    }
    common_allocations
}

fn print_allocations(backtraces: &mut Vec<&String>, allocation_diffs: &[BacktraceRefAllocationsMap]) {
    let common_allocations = allocation_diffs.last().unwrap();
    backtraces.sort_by(|a, b| {
        let a_present = common_allocations.contains_key(*a);
        let b_present = common_allocations.contains_key(*b);
        if a_present && b_present {
            // this allocation type is present in last dump
            common_allocations[*a]
            .len()
            .cmp(&common_allocations[*b].len())
            .reverse()
        } else if !a_present && !b_present {
            // both are not present : sort by name itself.
            a.cmp(b).reverse()
        } else if a_present {
            Ordering::Greater
        } else {
            Ordering::Less
        }
    });

    println!("Common allocations: [1st..Last],[2nd..Last],[3rd..Last],..,BackTrace*");
    for backtrace in backtraces {
        for c_a in allocation_diffs {
            if let Some(allocs) = c_a.get(*backtrace) {
                print!("{:?},", allocs.len())
            } else {
                print!(",")
            }
        }

        println!("{}", backtrace);
    }
}

fn parse_umdh_files(file_names: &[String]) -> Vec<BacktraceAllocationsMap> {
    let mut backtrace_maps: Vec<BacktraceAllocationsMap> = Vec::new();

    for umdh_file in file_names {
        backtrace_maps.push(parse_umdh_file(umdh_file).unwrap());
    }

    backtrace_maps
}

fn get_all_backtraces(backtrace_maps: &[BacktraceAllocationsMap]) -> Vec<&String> {
    let mut all_backtraces_set: HashSet<&String> = HashSet::new();
    for keys in backtrace_maps.iter() {
        all_backtraces_set.extend(keys.keys());
    }

    // i believe this cloned() is not costly as it is converting from &&String to &String.
    all_backtraces_set.iter().cloned().collect::<Vec<&String>>()
}

fn main() {
    let args: Vec<String> = env::args().collect();

    if args.len() < 3 {
        println!(
            "Usage: cargo run -- umdh_file_path1 umdh_file_path2  umdh_file_path3\n
                 File paths in order of oldest to latest."
        );
        return;
    }

    let num_files = args.len() - 1;

    let backtrace_maps = parse_umdh_files(&args[1..]);
    let all_backtraces = get_all_backtraces(&backtrace_maps);
    let allocation_diffs = backtrace_maps
        .iter()
        .take(backtrace_maps.len() - 1)
        .map(|m| find_common_allocations(&all_backtraces, &[m, backtrace_maps.last().unwrap()]))
        .collect::<Vec<BacktraceRefAllocationsMap>>();

    if allocation_diffs.len() != num_files - 1 {
        panic!("unexpected allocation diff count")
    }

    // strictly increasing common allocation counts.
    let mut leaked_backtraces: Vec<&String> = Vec::new();
    // constant common allocation counts.
    let mut static_backtraces: Vec<&String> = Vec::new();
    // variable allocations - increasing & decreasing with time.
    let mut variable_backtraces: Vec<&String> = Vec::new();

    let mut missing_keys: HashMap<&String, usize> = HashMap::new();
    // put allocations into different buckets.
    for k in all_backtraces.iter() {
        let mut last_count = 0;
        let mut is_variable = false;
        let mut is_static = true;
        let mut not_present = false;

        for c_a in allocation_diffs.iter() {
            if let Some(allocs) = c_a.get(k) {
                if allocs.len() >= last_count {
                    if (last_count != 0) && (allocs.len() != last_count) {
                        is_static = false;
                    }
                    last_count = allocs.len();
                } else {
                    is_variable = true;
                }
            } else {
                not_present = true;
            }
        }

        // trace only if present in only few files
        if not_present {
            missing_keys.insert(k, last_count);
            continue;
        }

        if is_variable {
            variable_backtraces.push(k);
        } else if is_static {
            static_backtraces.push(k);
        } else {
            leaked_backtraces.push(k);
        }
    }

    println!("Potential Leaked allocations as these allocations always kept increasing:");
    print_allocations(&mut leaked_backtraces, &allocation_diffs);

    println!("Variable allocations: [Count increased and decreased with time] / Can be waiting on some workflow like GC to deallocate these");
    print_allocations(&mut variable_backtraces, &allocation_diffs);

    println!("Allocations that never changed address: Sorted by count: [These can be global or leaked allocations]");
    let always_present_allocations = find_common_allocations(
        &all_backtraces,
        &backtrace_maps.iter().collect::<Vec<&BacktraceAllocationsMap>>(),
    );
    let mut always_present_allocations_vec = always_present_allocations
        .keys()
        .collect::<Vec<&&String>>();
    always_present_allocations_vec.sort_by(|a, b| {
        always_present_allocations[**a]
            .len()
            .cmp(&always_present_allocations[**b].len())
            .reverse()
    });
    for k in always_present_allocations_vec {
        if always_present_allocations[k].len() > 1 {
            println!(
                "{},{} => {:?}",
                k,
                always_present_allocations[k].len(),
                always_present_allocations[k]
            );
        }
    }

    println!("Static allocations: [Count never changed]");
    print_allocations(&mut static_backtraces, &allocation_diffs);

    println!(
        "BackTraces which are definitely not leaking as they were not present in some umdh file"
    );
    println!("{:?}", missing_keys);
    println!("Allocations of last diff");
    println!("{:?}", allocation_diffs.last().unwrap());
}

Edit1: Ran cargo clippy and made changes

First off, welcome! Asking for code reviews like this is one of the best ways to improve your familiarity with the language and find different ways of writing code.

Passing around references instead of cloning and pass-by-value is good, but you can actually do one better by passing &str instead of &String.

A Rust String is a struct containing a pointer, a length field, and a capacity field, so &String is actually a pointer to a pointer to a string. A &str is just a pointer to the string data plus a length, and it makes no assumptions about where the string came from (e.g. it could be a string literal, some array on the stack, the whole array of text pointed to by a String, or just a small sub-section of that String), so by changing your functions to accept &str you remove a level of indirection while also making the code more flexible.
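
As a quick illustration (a minimal sketch, not taken from your code), the same &str parameter happily accepts an owned String, a string literal, or a sub-slice of a String:

fn count_backtrace_lines(text: &str) -> usize {
    text.lines().filter(|line| line.contains("BackTrace")).count()
}

fn main() {
    let owned = String::from("10 bytes + 10 at 1E30F1B0770 by BackTraceF076263");
    println!("{}", count_backtrace_lines(&owned));       // &String coerces to &str
    println!("{}", count_backtrace_lines("no match"));   // string literal
    println!("{}", count_backtrace_lines(&owned[..10])); // sub-slice of the String
}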

You should run cargo clippy on your code. Clippy is a linter which detects common patterns and suggests better alternatives.

That check for whether a backtrace is present in every log file feels super clunky; how about replacing it with something like this:

let in_all_log_files = backtrace_maps.iter().all(|bk| bk.contains_key(*k));
if !in_all_log_files {
    continue;
}

Lots of things are expressions in Rust, so it's often more ergonomic to assign a value directly instead of doing let mut flag = false; for item in items { if condition { flag = true; break; } }. To a Rustacean, this way of coding smells like C :stuck_out_tongue:
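
To illustrate with a toy sketch (nothing project-specific), the mutable-flag-plus-loop pattern usually collapses into a single binding:

fn main() {
    let items = [1, 3, 5, 8];

    // instead of: let mut flag = false; for item in &items { if item % 2 == 0 { flag = true; break; } }
    let has_even = items.iter().any(|item| item % 2 == 0);

    // `if` is an expression too, so the value can be chosen directly
    let label = if has_even { "contains an even number" } else { "all odd" };
    println!("{}", label);
}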

I think parse_umdh_files() should return a Result here instead of blowing up whenever it can't parse a file.

You go to all the effort of capturing errors in parse_umdh_file() so it seems a bit of a waste to not use ? to propagate the errors up.
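
A minimal sketch of that change, reusing parse_umdh_file and the imports/type alias from your posted code, so the error is propagated instead of unwrap() panicking:

fn parse_umdh_files(file_names: &[String]) -> Result<Vec<BacktraceAllocationsMap>, Error> {
    let mut backtrace_maps = Vec::new();
    for umdh_file in file_names {
        // `?` hands the io::Error back to the caller instead of panicking here
        backtrace_maps.push(parse_umdh_file(umdh_file)?);
    }
    Ok(backtrace_maps)
}

Equivalently, file_names.iter().map(|f| parse_umdh_file(f)).collect() collects an iterator of Results into a Result<Vec<_>, _> and stops at the first error; main then decides whether to print a message, unwrap, or propagate further.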

You can leave the end index off here and just write String::from(&line[backtrace_pos..]). Depending on personal preference, you might also find line[backtrace_pos..].to_string() easier to read.

Sorry for the nitpicking, but why didn't you just write let backtrace_pos = address_end_pos + " by ".len()?

File paths are not strings, they are paths.

  1. Correctness - not all filenames are UTF-8, so you risk blowing up when someone runs your tool on a file with a non-UTF-8 name.
  2. Purpose - A string (&str) can contain anything: maybe parse_umdh_file() accepts the contents of the file as a UTF-8 string, maybe that first argument is a path; the only way you'll find out is by reading the source code. If I pass in a &Path, then I can tell at a glance that I've been given a path on disk, there is no opportunity for mix-ups, and the compiler will enforce this (see the sketch after this list).
  3. Make it domain specific - by using a separate type, a Path can hide its internals and provide domain-specific methods that are both helpful and maintain invariants. That's why Path methods operate on the unit of "components" (e.g. the path, to, and foo in /path/to/foo) instead of individual characters and the push() method handles a lot of the intricacies of joining paths - it's more nuanced than let joined = format!("{}/{}", path_a, path_b).
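
A sketch of what the signature change might look like (open_umdh_file and the file names here are just illustrative, not from the posted code):

use std::fs::File;
use std::io::Error;
use std::path::Path;

// Accept anything path-like: &str, String, &Path, and PathBuf all implement AsRef<Path>.
fn open_umdh_file(path: impl AsRef<Path>) -> Result<File, Error> {
    File::open(path.as_ref())
}

fn main() {
    // Both calls compile; the signature says "this is a path on disk", and the compiler enforces it.
    // The results are ignored here because the files may not exist.
    let _ = open_umdh_file("umdh_snapshot_1.log");
    let _ = open_umdh_file(Path::new("umdh_snapshot_2.log"));
}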

Reading through your code, BacktraceAllocationsMap looks like a pretty important data structure.

I feel like you would be able to express things a lot better by using a proper newtype instead of a type alias because you can add domain-specific methods that encapsulate common operations or maintain invariants (e.g. allocations must be sorted).

The thinking is quite similar to the "strings are not paths" comment above.
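
For example, a rough sketch of such a newtype (the method names are purely illustrative):

use std::collections::{HashMap, HashSet};

// Newtype instead of `type BacktraceAllocationsMap = HashMap<String, HashSet<i64>>;`
struct BacktraceAllocations(HashMap<String, HashSet<i64>>);

impl BacktraceAllocations {
    fn new() -> Self {
        BacktraceAllocations(HashMap::new())
    }

    // Domain-specific operation instead of repeating entry()/or_insert_with() at every call site.
    fn record(&mut self, backtrace: String, address: i64) {
        self.0.entry(backtrace).or_default().insert(address);
    }

    fn contains_backtrace(&self, backtrace: &str) -> bool {
        self.0.contains_key(backtrace)
    }
}

fn main() {
    let mut allocations = BacktraceAllocations::new();
    allocations.record("BackTraceF076263".to_string(), 0x1E30F1B0770);
    assert!(allocations.contains_backtrace("BackTraceF076263"));
}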

I find it quite strange that a "print" function is mutating the backtraces variable.

Ideally you would ensure the backtraces are sorted by construction (e.g. by using a different data structure or creating a newtype that maintains invariants) so you side-step the need to sort inside print_allocations().
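
As a toy sketch of "sorted by construction": a BTreeSet keeps its keys ordered, so the print function can take an immutable reference and never needs to sort anything.

use std::collections::BTreeSet;

fn print_backtraces(backtraces: &BTreeSet<String>) {
    // Iteration is already in sorted order; no mutation required.
    for backtrace in backtraces {
        println!("{}", backtrace);
    }
}

fn main() {
    let mut backtraces = BTreeSet::new();
    backtraces.insert("BackTraceF076263".to_string());
    backtraces.insert("BackTrace00AB12".to_string());
    print_backtraces(&backtraces);
}

Your real ordering is by allocation count rather than by name, so a newtype that maintains that invariant is probably the better fit, but the principle is the same.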


Thanks for your insightful comments.

I have a question:

Which implementation is better?
1)

fn get_all_backtraces(backtrace_maps: &[BacktraceAllocationsMap]) -> Vec<&str> {
    let mut all_backtraces_set: HashSet<&String> = HashSet::new();
    for keys in backtrace_maps.iter() {
        all_backtraces_set.extend(keys.keys());
    }

    all_backtraces_set.iter().map(|s| s.as_str()).collect::<Vec<&str>>()
}

vs
2)

fn get_all_backtraces(backtrace_maps: &[BacktraceAllocationsMap]) -> Vec<&str> {
    let mut all_backtraces_set: HashSet<&str> = HashSet::new();
    for keys in backtrace_maps.iter() {
        all_backtraces_set.extend(keys.keys().map(|s| s.as_str()));
    }

    all_backtraces_set.iter().cloned().collect::<Vec<&str>>()
}

It seems to me that here, 1) is better even though it uses &String, since the input parameter is a &String.

I am on Windows, and in a release build I can't see function names in the disassembly (and breakpoints don't work in VS Code). So I can't tell whether the compiler optimizes that " by ".len() to 4. In debug mode, it does not.

What is your setup for debugging release-build binaries in VS Code? Do we need to pass some flags to make a release build easier to debug, like disabling name mangling or keeping frame pointers?

Maybe the problem is that the release binary doesn't contain debug information. You can change that: either increase the optimization level of debug builds or add debug information to release builds. You just need to configure the dev or release profile in your Cargo.toml accordingly (or use one of the config files, or set the respective environment variable if you don't want to modify Cargo.toml):

Profiles - The Cargo Book

The string is a constant, so its length is almost guaranteed to be inlined by LLVM.

For example, let's look at this code snippet.

fn main() {
    let length = " by ".len();
    std::process::exit(length as i32);
}

(playground)

Compiling in debug mode and asking the playground to show assembly:

playground::main:
	subq	$24, %rsp
	leaq	.L__unnamed_2(%rip), %rdi
	movl	$4, %esi
	callq	core::str::<impl str>::len
	movq	%rax, 8(%rsp)
	movq	%rax, 16(%rsp)
	movq	8(%rsp), %rax
	movl	%eax, %edi
	movq	std::process::exit@GOTPCREL(%rip), %rax
	callq	*%rax
	ud2

Then switching to release:

playground::main:
	pushq	%rax
	movl	$4, %edi
	callq	*std::process::exit@GOTPCREL(%rip)
	ud2

You can see the first one calls core::str::<impl str>::len() and passes that to std::process::exit(), while the second one just loads 4 directly into edi.

To be honest, I've been using Rust for over 5 years now and can probably count the number of times I've reached for a debugger on one hand, whereas I would need it several times a day when writing C#.

The vast majority of bugs can be prevented at compile time by just structuring your data types the right way and using the type system to encode assumptions/invariants, then unit/integration tests will catch most of the rest.

That said, when I have needed to use the debugger I've found that whatever rust-analyzer uses tends to just work out of the box.

You can add debug = 1 to the [profile.release] section in your Cargo.toml to tell the compiler to emit some debug info.

I don't have a strong preference, although if it were me I'd write things in a way that doesn't require a for-loop and mutation:

fn get_all_backtraces(backtrace_maps: &[BacktraceAllocationsMap]) -> Vec<&str> {
    let keys: HashSet<_> = backtrace_maps.iter()
        .flat_map(|m| m.keys())
        .collect();

    keys.iter().map(|key| key.as_str()).collect()
}

(playground)

What do you use for production, when things generate crash dumps? I opened it in WinDbg and it complained about missing private PDBs; it had loaded the PDB from the target/release folder correctly.

I don't know about Windows, but on Linux your debug symbols are part of the executable, so there is never any issue with missing PDB files and things tend to just work.

I also tend to rely a lot more on good logs and providing useful error messages. The anyhow crate has a nice mechanism for attaching extra "context" to an error before you propagate it up, letting you record chains of errors that might be printed like this:

Error: Unable to initialize the runtime engine
  Caused by: Unable to load the configuration file
  Caused by: Reading the configuration file failed
  Caused by: Unable to open "/path/to/config.json"
  Caused by: File not found
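
A rough sketch of how such a chain gets built, assuming anyhow is added as a dependency (the path and messages are just the ones from the example output above):

use anyhow::{Context, Result};

fn read_config(path: &str) -> Result<String> {
    std::fs::read_to_string(path)
        .with_context(|| format!("Unable to open {:?}", path))
        .context("Reading the configuration file failed")
}

fn main() -> Result<()> {
    let _config = read_config("/path/to/config.json")
        .context("Unable to load the configuration file")
        .context("Unable to initialize the runtime engine")?;
    Ok(())
}

Returning anyhow::Result from main is what prints the "Error: ... / Caused by: ..." chain when the program fails.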

Wouldn't it be better to chain two mappings together in one iteration, without intermediate collect? Like this:

fn get_all_backtraces(backtrace_maps: &[BacktraceAllocationsMap]) -> Vec<&str> {
    backtrace_maps
        .iter()
        .flat_map(|map| map.keys())
        .map(|key| key.as_str())
        .collect()
}

Collecting into the HashSet will dedup, whereas directly into the Vec will not, so the two-step might be intentional.


Yeah, the original code used a HashSet for deduplication so I copied that pattern.

Alternatively, you could store them in a sorted Vec and use binary search for deduplication, but that requires more code and I'm not quite sure how the OP wants to handle duplicates (merge/ignore/overwrite, etc.).
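
For reference, a sketch of that sorted-Vec alternative (binary_search both detects duplicates and reports where to insert):

use std::collections::{HashMap, HashSet};

type BacktraceAllocationsMap = HashMap<String, HashSet<i64>>;

fn get_all_backtraces(backtrace_maps: &[BacktraceAllocationsMap]) -> Vec<&str> {
    let mut keys: Vec<&str> = Vec::new();
    for map in backtrace_maps {
        for key in map.keys() {
            let key = key.as_str();
            // `keys` stays sorted, so a failed binary_search gives the insertion point.
            if let Err(pos) = keys.binary_search(&key) {
                keys.insert(pos, key);
            }
        }
    }
    keys
}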
