Redaction of Document

I have a document and I want to redact certain parts of it, but I want to do it using Rust. My first idea was to convert the document to an image and then the image to a matrix, but I am stuck at the image-to-matrix step and it is getting complicated. I would really like some advice on this topic. Any reference would be extremely helpful.

Thanks!

I don't understand how converting an image to a matrix is aiding in the task (nor how converting a document to an image is helping.) If you have a text-based document, you can redact it by simply replacing text. What is the purpose in rendering a document as an image here?

Oh, my idea was to take it as a grayscale image and then set the pixels I want to redact to 0. What do you suggest?

What kind of document is it? It would probably be easier to edit it directly.

I want to do it for general documents. The idea is: we are given a redacted document, the original one, and the points where it is redacted. One needs to do the redaction on the original and compare the result with the redacted document to check whether they are exactly the same. So I need it to fit into that flow rather than just redacting it on my own.

That only works if the documents are rendered in exactly the same way. Which is almost assuredly never going to happen if you don't care what format they are in.

If you really intend to compare two general documents and accurately detect if the redacted and original share the same source, you need to either have both documents in their original format, or do some AI pattern-matching analysis to measure the similarity of their content, perhaps ignoring parts that are obviously redacted. But it doesn't help to do a redaction, because you're probably never going to be looking at the redacted parts anyway; they don't contain any useful information.

It seems like your method would only tell you two documents are the same if you already know they are the same.

Yes, because I want to verify that the redacted document is indeed a redaction of the given document.

You don't need to "convert from image to matrix." An image is a matrix of pixels. Once you've used the image crate to load the image into memory, you can access and modify the elements of that matrix using the various methods in ImageBuffer and other image types/traits.

For example, if you want to turn all the pixels in a 100-by-100 rectangle black, you can do:

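use image::Luma;

// Assumes `image` is a mutable grayscale buffer (image::GrayImage),
// e.g. obtained from image::open(path)?.to_luma8(), and that it is
// at least 100x100; put_pixel panics on out-of-bounds coordinates.
let width = 100;
let height = 100;
let black = Luma([0u8]);

for y in 0..height {
    for x in 0..width {
        image.put_pixel(x, y, black);
    }
}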
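
If the rectangle is not at the top-left corner, the same loop just needs an offset. Here is a minimal sketch that avoids any crate dependency by treating the grayscale image as a row-major `Vec<u8>`. The `Gray` type and its `redact_rect` method are hypothetical names made up for illustration, not part of the image crate; with the image crate you would call `put_pixel` in the inner loop instead.

```rust
/// A grayscale image stored as a row-major matrix of 8-bit pixels.
/// Hypothetical stand-in for a real image buffer type.
#[derive(Clone, Debug, PartialEq)]
struct Gray {
    width: usize,
    height: usize,
    pixels: Vec<u8>, // pixel (x, y) lives at pixels[y * width + x]
}

impl Gray {
    /// Creates an image in which every pixel has the same value.
    fn filled(width: usize, height: usize, value: u8) -> Self {
        Gray { width, height, pixels: vec![value; width * height] }
    }

    /// Sets every pixel of the w-by-h rectangle whose top-left corner is
    /// (x0, y0) to 0 (black). Parts of the rectangle that fall outside
    /// the image are ignored instead of panicking.
    fn redact_rect(&mut self, x0: usize, y0: usize, w: usize, h: usize) {
        let x_end = x0.saturating_add(w).min(self.width);
        let y_end = y0.saturating_add(h).min(self.height);
        for y in y0..y_end {
            for x in x0..x_end {
                self.pixels[y * self.width + x] = 0;
            }
        }
    }
}
```

For example, `Gray::filled(4, 3, 255)` followed by `redact_rect(1, 1, 2, 5)` blacks out only the 2-by-2 block that actually lies inside the 4-by-3 image.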

Oh, that's interesting. And then I can just display the image in order to check, right?

Yes. Or if you want to compare it to a reference image, you can load both images into memory and then check whether they are equal.
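
A rough sketch of that check, assuming both images have already been decoded into row-major grayscale byte buffers of the same width. `Region`, `redact`, and `count_mismatches` are hypothetical names for illustration. As noted earlier in the thread, an exact comparison like this is only meaningful if both images were rendered in exactly the same way.

```rust
/// A rectangular region to black out: top-left corner plus size, in pixels.
#[derive(Clone, Copy, Debug)]
struct Region {
    x: usize,
    y: usize,
    w: usize,
    h: usize,
}

/// Blacks out every region in a row-major grayscale buffer of the given width.
/// Regions are clipped to the image bounds.
fn redact(pixels: &mut [u8], width: usize, regions: &[Region]) {
    if width == 0 {
        return;
    }
    let height = pixels.len() / width;
    for r in regions {
        let x_end = r.x.saturating_add(r.w).min(width);
        let y_end = r.y.saturating_add(r.h).min(height);
        for y in r.y..y_end {
            for x in r.x..x_end {
                pixels[y * width + x] = 0;
            }
        }
    }
}

/// Redacts a copy of `original` and counts the pixels that differ from
/// `reference`. Returns None if the two buffers have different lengths.
fn count_mismatches(
    original: &[u8],
    reference: &[u8],
    width: usize,
    regions: &[Region],
) -> Option<usize> {
    if original.len() != reference.len() {
        return None;
    }
    let mut redacted = original.to_vec();
    redact(&mut redacted, width, regions);
    Some(redacted.iter().zip(reference).filter(|(a, b)| a != b).count())
}
```

A result of `Some(0)` means the reference is exactly the original with those regions blacked out. Returning a count rather than a plain bool leaves room for a tolerance threshold if the reference went through lossy compression.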

No, I just want to take an image, do redaction at the given places, and then check if it is similar to the original image. That is an interesting way, to be honest. But now I am wondering: what if I want to redact specific parts, not just anywhere in the image? That seems difficult.

It sounds like you might be dealing with photocopy images or scanned images. Something that is more likely to be stored as a .jpg file than a .txt file.

This is why it is difficult. JPEG (or any similar image format) doesn't understand text, it just understands pixels (oversimplifying here, but go with it).

Interpreting "specific parts" of an image, like a date of birth or a name within a document, is a whole interesting challenge for machine learning in its own right. If you can possibly go by generous coordinate regions of the image, you'll have a much easier go of it.

Hopefully you're dealing with a known set of forms or something, and you can pair some metadata with the image to look up how to redact it.
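
To make the metadata idea concrete, here is a minimal sketch, assuming each known form has an identifier and a fixed list of rectangles covering its sensitive fields. The form names, coordinates, and the functions `form_regions` and `redact_form` are all invented for illustration; the image is again just a row-major grayscale byte buffer.

```rust
use std::collections::HashMap;

/// A rectangular region to black out: top-left corner plus size, in pixels.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Region {
    x: usize,
    y: usize,
    w: usize,
    h: usize,
}

/// Hypothetical metadata table: form identifier -> regions to redact.
/// The identifiers and coordinates are made-up example values.
fn form_regions() -> HashMap<&'static str, Vec<Region>> {
    let mut table = HashMap::new();
    table.insert(
        "application-v1",
        vec![
            Region { x: 40, y: 30, w: 200, h: 24 }, // name field
            Region { x: 40, y: 70, w: 120, h: 24 }, // date-of-birth field
        ],
    );
    table.insert("application-v2", vec![Region { x: 60, y: 50, w: 220, h: 24 }]);
    table
}

/// Looks up the regions for `form_id` and blacks them out in a row-major
/// grayscale buffer. Returns false, leaving the image untouched, if the
/// form is unknown.
fn redact_form(
    pixels: &mut [u8],
    width: usize,
    form_id: &str,
    table: &HashMap<&'static str, Vec<Region>>,
) -> bool {
    let regions = match table.get(form_id) {
        Some(regions) => regions,
        None => return false,
    };
    if width == 0 {
        return true;
    }
    let height = pixels.len() / width;
    for r in regions {
        // Clip each region to the image bounds.
        let x_end = r.x.saturating_add(r.w).min(width);
        let y_end = r.y.saturating_add(r.h).min(height);
        for y in r.y..y_end {
            for x in r.x..x_end {
                pixels[y * width + x] = 0;
            }
        }
    }
    true
}
```

Making the regions generous, as suggested above, helps absorb small shifts between scans of the same form.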

Yeah, that's what I am going for. That's interesting, I will think about it. Can I do it in Rust?

If you can do it at all (converting image to text is, as was said before, a fairly hard problem), then yes, you can do it in Rust too.

Yes, but do yourself a favour and do it in Python ;)

Python has a huge head start in machine learning. In fairness, Rust will have its uses in the machine learning space, but in my opinion it is best used at a lower level (for speed and correctness). With time this may well change!

The high level tasks are just that much easier with established libraries with loads of tutorials.

Yeah, that's the conclusion I finally came to yesterday. I will do the redaction part in Python.

Thanks!