Matching an array of bytes (file headers) precisely

Hi

I am trying to read the first few bytes of a binary file and match them against various magic headers. However, when I search for Microsoft Office documents (docx etc.), the headers for both zip and Microsoft Office match, since newer Microsoft Office documents are also zip files. How do I get only the more precise match instead of both?

use memmem::*;
....
fn search_headers(file_bytes: &[u8])
{
	let vec_mac = hex!("CFFAEDFE").to_vec();
	let vec_mz = hex!("4D5A").to_vec();
	let vec_zip = hex!("504B0304").to_vec();
	let vec_ms = hex!("504B030414000600").to_vec();

	let vec_array = [vec_mac,
					 vec_mz,
					 vec_zip,
					 vec_ms
					];

	for file_array in vec_array.iter()
	{
		let fileheader = memmem::TwoWaySearcher::new(&file_array);
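		// Search only the first 20 bytes; clamp the range so shorter files do not panic.
		let search_output = fileheader.search_in(&file_bytes[..file_bytes.len().min(20)]);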

		match search_output
		{
			Some(_) =>
			{
				println!("\tFound matching file type");
				println!("\tmatching array {:02X?}", file_array);
				println!();
			}
			None =>
			{
				println!();
				continue;
			}
		}
	}
}

You could check for the longest prefix first.

Hi @jethrogb

I switched the order and put the longer prefix first, but since I iterate through the entire array, both still match unless I stop iterating as soon as I find a match. That, however, requires me to manually arrange headers that share a prefix (as in my example) in descending order, from the longest prefix down to the shortest. The longest prefix is then checked first, and the iteration stops when it matches.

Would there be a way to check for precise matches programmatically?

You could sort your Vec of headers by descending length programmatically.
Playground.
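
A minimal sketch of that idea, in case it helps: sort the headers by descending length, then stop at the first one that matches. The byte values are the ones from the question; the labels, the `HEADERS` table and the `detect_sorted` function are made up for illustration. It also assumes the magic number sits at offset 0, so it uses `starts_with` from the standard library instead of a substring search.

```rust
use std::cmp::Reverse;

// Magic numbers from the question, paired with illustrative labels.
const HEADERS: [(&str, &[u8]); 4] = [
    ("mach-o", &[0xCF, 0xFA, 0xED, 0xFE]),
    ("mz", &[0x4D, 0x5A]),
    ("zip", &[0x50, 0x4B, 0x03, 0x04]),
    ("ms-office", &[0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
];

// Hypothetical helper: returns the label of the most specific matching header.
fn detect_sorted(file_bytes: &[u8]) -> Option<&'static str> {
    let mut headers = HEADERS.to_vec();
    // Longest magic first, so a specific header is always tried
    // before any shorter header that is a prefix of it.
    headers.sort_by_key(|&(_, magic)| Reverse(magic.len()));

    headers
        .into_iter()
        // Assumption: the magic number starts at offset 0 of the file.
        .find(|&(_, magic)| file_bytes.starts_with(magic))
        .map(|(name, _)| name)
}
```

Because `find` stops at the first hit, a docx file is reported as the Office header only, while a plain zip still falls through to the shorter zip header.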

Hi @quinedot

Thanks, I will take a look at that.

You can also use tuples to implement more advanced sorting. For example, "sort by string length descending, then lexical order".

use std::cmp::Reverse;

fn main() {
    let mut headers = vec!["foo", "bar", "foobar", "fob", "foz", "fozz"];
    headers.sort_unstable_by_key(|h| (Reverse(h.len()), &h[..]));
    println!("length then string: {:#?}", headers);

    headers.sort_unstable_by_key(|h| (Reverse(&h[..]), h.len()));
    println!("string then length: {:#?}", headers);
}

Outputs the following:

length then string: [
    "foobar",
    "fozz",
    "bar",
    "fob",
    "foo",
    "foz",
]
string then length: [
    "fozz",
    "foz",
    "foobar",
    "foo",
    "fob",
    "bar",
]

(playground)

Sounds like you could also do this with aho-corasick using its leftmost-longest match semantics.
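
For what it's worth, the same "longest match wins" idea can be sketched with the standard library alone, without any sorting. This is not the aho-corasick API, just a hedged illustration: the `MAGICS` table, its labels and the `longest_match` function are hypothetical, and it assumes the magic number sits at offset 0 of the file.

```rust
// Magic numbers from the question, paired with illustrative labels.
// The order of the entries does not matter here.
const MAGICS: [(&str, &[u8]); 4] = [
    ("mach-o", &[0xCF, 0xFA, 0xED, 0xFE]),
    ("mz", &[0x4D, 0x5A]),
    ("zip", &[0x50, 0x4B, 0x03, 0x04]),
    ("ms-office", &[0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00]),
];

// Hypothetical helper: keeps every header that prefixes the input,
// then picks the longest of them.
fn longest_match(file_bytes: &[u8]) -> Option<&'static str> {
    MAGICS
        .iter()
        // Assumption: the magic number starts at offset 0 of the file.
        .filter(|(_, magic)| file_bytes.starts_with(magic))
        .max_by_key(|(_, magic)| magic.len())
        .map(|(name, _)| *name)
}
```

This checks every header on each call, which is fine for a handful of magic numbers; for a large table or for matches at arbitrary offsets, a dedicated multi-pattern searcher is likely the better fit.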
