How can I verify the integrity and type of JPG, PDF, and DOCX files in Rust before storing them in MinIO?

I'm working on a project where I receive JPG, PDF, and DOCX files as byte streams in incoming requests. Before storing these files in MinIO, I want to implement a validation step to ensure that:

  • The file is of the correct type (JPG, PDF, or DOCX)
  • The file is not corrupted
  • The file is not suspicious or potentially malicious

I'm looking for recommendations on libraries or methods that can help me perform these checks efficiently. Ideally, I'd like to be able to return flags or indicators for each of these validation points.
My current workflow:

  • Receive file as bytes in the request
  • [Validation step to be implemented]
  • Store the file in MinIO if it passes validation

I have found two approaches. Can anyone share their views on them?

  • MIME Type and Extension Validation:
    • Description: Determine the file's MIME type from its byte content and ensure it matches the expected MIME type and file extension.
    • Method:
      • Analyze the raw bytes of the received file.
      • Use a library (e.g., infer in Rust) to detect the MIME type.
      • Confirm that the inferred MIME type and file extension are correct.
    • Example: Check if a .jpg file's MIME type is image/jpeg.
  • SOF (Start of File) and EOF (End of File) Structure Validation:
    • Description: Verify the file's integrity by checking that it starts and ends with the correct markers according to its format specification.
    • Method:
      • Compare the file's beginning bytes to known magic numbers (e.g., FF D8 for JPEG).
      • Ensure the file ends with the correct EOF marker (e.g., FF D9 for JPEG).
      • Optionally, check for other structural elements to validate the file's format.
    • Example: Validate a JPEG by ensuring it starts with FF D8 and ends with FF D9.
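
The two approaches can be combined into a single pass that returns one flag per check. The following is a minimal sketch using only the standard library rather than the infer crate. The names `FileKind`, `ValidationFlags`, `detect_kind`, and `validate` are hypothetical, and the end-marker rules (PDF `%%EOF` within the last 1024 bytes, ZIP end-of-central-directory signature near the end) are simplifying assumptions, not a full format check.

```rust
/// File kinds accepted by the upload endpoint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileKind {
    Jpeg,
    Pdf,
    Docx,
}

/// One flag per validation point, so the caller can report each separately.
#[derive(Debug, PartialEq, Eq)]
pub struct ValidationFlags {
    pub detected: Option<FileKind>,
    pub matches_extension: bool,
    pub has_end_marker: bool,
}

/// Returns true if `needle` occurs anywhere in `haystack`.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Returns the last `n` bytes of `bytes` (or all of them if shorter).
fn tail(bytes: &[u8], n: usize) -> &[u8] {
    &bytes[bytes.len().saturating_sub(n)..]
}

/// Guesses the kind from the leading magic number.
pub fn detect_kind(bytes: &[u8]) -> Option<FileKind> {
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(FileKind::Jpeg)
    } else if bytes.starts_with(b"%PDF-") {
        Some(FileKind::Pdf)
    } else if bytes.starts_with(b"PK\x03\x04") {
        // Any ZIP archive matches here; a real DOCX check must also look inside.
        Some(FileKind::Docx)
    } else {
        None
    }
}

/// Runs the magic-number, extension, and end-marker checks.
pub fn validate(bytes: &[u8], extension: &str) -> ValidationFlags {
    let detected = detect_kind(bytes);
    let expected = match extension.to_ascii_lowercase().as_str() {
        "jpg" | "jpeg" => Some(FileKind::Jpeg),
        "pdf" => Some(FileKind::Pdf),
        "docx" => Some(FileKind::Docx),
        _ => None,
    };
    let has_end_marker = match detected {
        // Strict: rejects JPEGs that carry trailing bytes after the EOI marker.
        Some(FileKind::Jpeg) => bytes.ends_with(&[0xFF, 0xD9]),
        // Assumption: the %%EOF comment sits within the last 1024 bytes.
        Some(FileKind::Pdf) => contains(tail(bytes, 1024), b"%%EOF"),
        // The ZIP end record is 22 bytes plus a comment of at most 65535 bytes.
        Some(FileKind::Docx) => contains(tail(bytes, 22 + 65_535), b"PK\x05\x06"),
        None => false,
    };
    ValidationFlags {
        detected,
        matches_extension: detected.is_some() && detected == expected,
        has_end_marker,
    }
}
```

Calling `validate(&bytes, "jpg")` before the MinIO upload gives three independent indicators. Matching markers only show that the file looks like the claimed format at both ends; they say nothing about what lies in between.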

If you have control over the transfer protocol, you can transmit a checksum of some kind to verify that the bytes received are the same as those sent. That, coupled with either of the two methods that you already found, should handle the vast majority of accidental (i.e. non-malicious) problems that might arise.
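
As an illustration of the checksum idea, here is a small sketch of a CRC-32 (the IEEE polynomial used by ZIP and PNG) written against the standard library only. The function names `crc32` and `matches_checksum` are hypothetical, and the assumption is that the sender transmits the expected value alongside the file. A CRC only catches accidental corruption; a cryptographic hash from a crate would be the usual choice if tampering is a concern.

```rust
/// Computes the CRC-32 (IEEE, reflected polynomial 0xEDB88320) of `bytes`.
pub fn crc32(bytes: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in bytes {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Returns true if the received bytes match the checksum sent with them.
pub fn matches_checksum(received: &[u8], expected_crc: u32) -> bool {
    crc32(received) == expected_crc
}
```

The server would compute `crc32` over the received body and compare it with the value from a request header or form field before running any format checks.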

Blocking actually malicious files is much harder, and will depend a lot on your threat model. If you’re just concerned with technical exploits, you may need to fully parse/interpret the files inside an isolated sandbox. If, on the other hand, you want to block correctly-formed files with malicious content, such as fraudulent claims or upsetting imagery, you’ll need a much more complicated solution than even that.


Is there any way to implement static content analysis, or any sandboxing method I can use?

use std::error::Error;
use std::fs;

// Assumes the `log` crate; the `info!` macro from `tracing` is used the same way.
use log::info;

/// Checks for suspicious patterns such as JavaScript code or executable file headers.
fn contains_suspicious_patterns(file_bytes: &[u8]) -> bool {
    // The array length must match the number of patterns listed (18).
    let suspicious_patterns: [&[u8]; 18] = [
        b"<script>",
        b"eval(",
        b"document.write(",
        b"/JS (",
        b"\x7fELF", // ELF magic number; a bare "ELF" also matches ordinary text
        b"cmd.exe",
        b"powershell.exe",
        b"alert(",
        b"/S /JavaScript",
        b"sh -c",
        b"#!/bin/sh",
        b"#!/bin/bash",
        b"#!/usr/bin/env python",
        b"import os",
        b"subprocess.call",
        b"exec(",
        b"<?php",
        b"system(",
    ];

    // Check for risky patterns with a sliding window over the raw bytes.
    for &pattern in suspicious_patterns.iter() {
        if file_bytes.windows(pattern.len()).any(|window| window == pattern) {
            info!("Detected risky pattern: {:?}", String::from_utf8_lossy(pattern));
            return true;
        }
    }
    false
}

/// Analyzes the file to detect suspicious content or execution patterns.
fn analyze_file(file_path: &str) -> Result<(), Box<dyn Error>> {
    // std::fs::read stands in for the read_file_content helper, which was not shown.
    let file_bytes = fs::read(file_path)?;

    if contains_suspicious_patterns(&file_bytes) {
        info!("Suspicious content detected in file: {}", file_path);
    } else {
        info!("File appears to be clean: {}", file_path);
    }

    Ok(())
}

I have found this approach. Can anyone help me optimize it for DOCX, PDF, and JPEG formats?

DOCX and PDF files usually store their contents compressed, so naively searching for specific strings won't tell you whether any of those strings appear in the uncompressed data. If you're looking for malicious files, you'll probably want to run an antivirus engine on them. I haven't used it myself, but the clamav-client crate or something similar is probably what you want.
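
To see the compression point concretely for DOCX, which is a ZIP archive, the sketch below reads the first local file header and reports the entry's compression method. The function names `first_zip_entry` and `method_label` are hypothetical; the field offsets follow the ZIP local file header layout, and only the first entry is inspected.

```rust
/// Name and compression method of the first entry in a ZIP archive (such as a DOCX).
pub fn first_zip_entry(bytes: &[u8]) -> Option<(String, u16)> {
    // Local file header: 4-byte signature, then fixed fields up to offset 30.
    let header = bytes.get(..30)?;
    if !header.starts_with(b"PK\x03\x04") {
        return None;
    }
    // Compression method is a little-endian u16 at offset 8.
    let method = u16::from_le_bytes([header[8], header[9]]);
    // File name length is a little-endian u16 at offset 26; the name starts at 30.
    let name_len = u16::from_le_bytes([header[26], header[27]]) as usize;
    let name = bytes.get(30..30 + name_len)?;
    Some((String::from_utf8_lossy(name).into_owned(), method))
}

/// Human-readable label for the two methods a DOCX normally uses.
pub fn method_label(method: u16) -> &'static str {
    match method {
        0 => "stored (raw bytes, searchable)",
        8 => "deflate (compressed, raw search will miss content)",
        _ => "other",
    }
}
```

When the method is 8, the XML inside the entry is deflated, so a byte pattern such as `<script>` will not appear in the raw upload even if it is present in the document.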

I wouldn't trust antivirus to actually secure content. They're reactive in nature.

If you want to render content inert, you need to re-encode it. For example, JPEGs could be stripped of all headers, have their Huffman coding undone, and so on, and then only the bare image data would be repackaged.
I guess PDFs could be repackaged as PDF/A, fonts could maybe be substituted (font rendering is an attack surface...), and their image contents could be given similar treatment to JPEGs. Or you could just turn the entire PDF into images, though that would be a major inconvenience to users.
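
A much weaker relative of the JPEG idea can be sketched with the standard library alone: walk the marker segments and copy the file while dropping APPn and COM segments (EXIF, ICC profiles, thumbnails, comments). The function name `strip_jpeg_metadata` is hypothetical. This sketch does not decode or re-encode the image data, copies everything from the first Start of Scan marker unchanged, and assumes the file ends exactly at the EOI marker.

```rust
/// Copies a JPEG while dropping APPn (0xE0..=0xEF) and COM (0xFE) segments.
/// Returns None if the marker structure before the first scan is malformed.
pub fn strip_jpeg_metadata(input: &[u8]) -> Option<Vec<u8>> {
    // Require SOI at the start and EOI at the very end.
    if !input.starts_with(&[0xFF, 0xD8]) || !input.ends_with(&[0xFF, 0xD9]) {
        return None;
    }
    let mut out = vec![0xFF, 0xD8];
    let mut pos = 2;
    loop {
        // Each segment header is 0xFF, a marker byte, and a big-endian u16 length.
        let header = input.get(pos..pos + 4)?;
        if header[0] != 0xFF {
            return None;
        }
        let marker = header[1];
        // Markers without a length field are not expected before the first scan.
        if matches!(marker, 0x00 | 0x01 | 0xD0..=0xD9 | 0xFF) {
            return None;
        }
        if marker == 0xDA {
            // Start of Scan: copy the scan header and compressed data unchanged.
            out.extend_from_slice(&input[pos..]);
            return Some(out);
        }
        // The length counts its own two bytes, so anything below 2 is invalid.
        let len = u16::from_be_bytes([header[2], header[3]]) as usize;
        if len < 2 {
            return None;
        }
        let end = pos + 2 + len;
        let segment = input.get(pos..end)?;
        let is_metadata = (0xE0..=0xEF).contains(&marker) || marker == 0xFE;
        if !is_metadata {
            out.extend_from_slice(segment);
        }
        pos = end;
    }
}
```

Dropping every APPn segment also removes colour profiles, orientation data, and the Adobe colour-transform hint, so some images may render differently afterwards. A decoder bug triggered by the compressed scan data is not addressed at all by this approach.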

A somewhat less paranoid approach would be to run files through strict parsers and reject documents that contain spec violations or things like scripts or macros.
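
For DOCX, one piece of such a check could look like the sketch below: read the entry names from the ZIP central directory, fail on any structural inconsistency, and reject archives that carry a VBA macro project. The function names `zip_entry_names` and `docx_is_acceptable` are hypothetical, and the two rules used (a `[Content_Types].xml` part must exist, no part named `vbaProject.bin`) are assumptions about what to enforce, not a complete policy.

```rust
/// Reads a little-endian u16 at `pos`, if in bounds.
fn le_u16(bytes: &[u8], pos: usize) -> Option<usize> {
    let b = bytes.get(pos..pos + 2)?;
    Some(u16::from_le_bytes([b[0], b[1]]) as usize)
}

/// Reads a little-endian u32 at `pos`, if in bounds.
fn le_u32(bytes: &[u8], pos: usize) -> Option<usize> {
    let b = bytes.get(pos..pos + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize)
}

/// Lists entry names from the ZIP central directory; None on any structural error.
pub fn zip_entry_names(bytes: &[u8]) -> Option<Vec<String>> {
    // Take the last "PK\x05\x06" as the End of Central Directory record.
    let eocd = bytes.windows(4).rposition(|w| w == b"PK\x05\x06")?;
    let count = le_u16(bytes, eocd + 10)?; // total number of entries
    let mut pos = le_u32(bytes, eocd + 16)?; // offset of the central directory
    let mut names = Vec::with_capacity(count);
    for _ in 0..count {
        // Every central directory record must start with its signature.
        if bytes.get(pos..pos + 4)? != b"PK\x01\x02" {
            return None;
        }
        let name_len = le_u16(bytes, pos + 28)?;
        let extra_len = le_u16(bytes, pos + 30)?;
        let comment_len = le_u16(bytes, pos + 32)?;
        let name = bytes.get(pos + 46..pos + 46 + name_len)?;
        names.push(String::from_utf8_lossy(name).into_owned());
        pos += 46 + name_len + extra_len + comment_len;
    }
    Some(names)
}

/// True if the archive parses, has the content-types part, and has no macro project.
pub fn docx_is_acceptable(bytes: &[u8]) -> bool {
    match zip_entry_names(bytes) {
        Some(names) => {
            let has_content_types = names.iter().any(|n| n == "[Content_Types].xml");
            let has_macros = names
                .iter()
                .any(|n| n.to_ascii_lowercase().ends_with("vbaproject.bin"));
            has_content_types && !has_macros
        }
        None => false,
    }
}
```

Entry names are stored uncompressed in the central directory, which is why this works without a deflate implementation. Inspecting the XML parts themselves (for example for external references) would require decompressing them with a ZIP library.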

Well, in the end it depends on your threat model: what exactly do you want to defend against?

  • A user clicking "yes" on "do you want to run this script"
  • The user is running an unpatched Acrobat Reader from 6 years ago
  • The image instructs the user to send their money to a Nigerian prince, and the user follows the instructions

are all different problems and require different defenses.


Is there any new view on this?
