Redaction of Document

sims28 · February 24, 2021, 5:14pm

So I have a document and I want to redact certain parts of it, but I want to do it using Rust. My first idea was to convert the document to an image and then the image to a matrix, but I am kind of stuck at the image-to-matrix phase, and it gets complicated. I would really like some advice on this topic. Any reference will be extremely helpful.

Thanks!

skysch · February 24, 2021, 5:17pm

I don't understand how converting an image to a matrix is aiding in the task (nor how converting a document to an image is helping.) If you have a text-based document, you can redact it by simply replacing text. What is the purpose in rendering a document as an image here?

sims28 · February 24, 2021, 5:20pm

Oh, my idea was to take it as a grayscale image and then set the pixels I want to redact to 0. What do you suggest?

Cerber-Ursi · February 24, 2021, 5:23pm

What kind of document is it? It would probably be easier to edit it directly.

sims28 · February 24, 2021, 5:27pm

I want to do it for a general document. The idea is: we are given a redacted document, the original one, and the points where it is redacted. One needs to redo the redaction on the original and check whether the result is exactly the same as the redacted document. So I need it to work as part of that flow rather than just redacting it on my own.

skysch · February 24, 2021, 5:57pm

That only works if the documents are rendered in exactly the same way. Which is almost assuredly never going to happen if you don't care what format they are in.

If you really intend to compare two general documents and accurately detect if the redacted and original share the same source, you need to either have both documents in their original format, or do some AI pattern-matching analysis to measure the similarity of their content, perhaps ignoring parts that are obviously redacted. But it doesn't help to do a redaction, because you're probably never going to be looking at the redacted parts anyway; they don't contain any useful information.

It seems like your method would only tell you two documents are the same if you already know they are the same.

sims28 · February 24, 2021, 6:12pm

Yes, because I want to verify that the redacted document is indeed a redaction of the given document.

mbrubeck · February 24, 2021, 6:12pm

You don't need to "convert from image to matrix." An image is a matrix of pixels. Once you've used the image crate to load the image into memory, you can access and modify the elements of that matrix using the various methods in ImageBuffer and other image types/traits.

For example, if you want to turn all the pixels in a 100-by-100 rectangle black, you can do:

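use image::Luma;

// `image` is assumed to be a mutable grayscale buffer (e.g. image::GrayImage).
let width = 100;
let height = 100;
let black = Luma([0u8]);

// Blacken every pixel in the top-left 100-by-100 rectangle.
for y in 0..height {
    for x in 0..width {
        image.put_pixel(x, y, black);
    }
}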

sims28 · February 24, 2021, 6:13pm

Oh, that's interesting. And then I can just display the image in order to check, right?

mbrubeck · February 24, 2021, 6:14pm

Yes. Or if you want to compare it to a reference image, you can load both images into memory and then check whether they are equal.
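To make the "redact, then compare" idea concrete, here is a minimal, dependency-free sketch. It does not use the image crate: `Gray`, `filled`, `redact`, and `is_redaction_of` are hypothetical names made up for illustration, and the image is just a row-major vector of bytes. It also assumes both images were produced from exactly the same pixels, so any re-encoding or re-rendering would make the comparison fail.

```rust
/// Hypothetical stand-in for a grayscale image: row-major bytes, one per pixel.
#[derive(Clone, Debug, PartialEq)]
struct Gray {
    width: u32,
    height: u32,
    pixels: Vec<u8>,
}

impl Gray {
    /// Create an image filled with a single gray value.
    fn filled(width: u32, height: u32, value: u8) -> Self {
        Gray {
            width,
            height,
            pixels: vec![value; (width * height) as usize],
        }
    }

    /// Set every pixel inside the rectangle to black (0), clipped to the image bounds.
    fn redact(&mut self, x0: u32, y0: u32, w: u32, h: u32) {
        let x_end = x0.saturating_add(w).min(self.width);
        let y_end = y0.saturating_add(h).min(self.height);
        for y in y0..y_end {
            for x in x0..x_end {
                self.pixels[(y * self.width + x) as usize] = 0;
            }
        }
    }
}

/// True if redacting `original` at `rect` (x, y, width, height)
/// reproduces `claimed` pixel for pixel.
fn is_redaction_of(original: &Gray, claimed: &Gray, rect: (u32, u32, u32, u32)) -> bool {
    let mut redacted = original.clone();
    redacted.redact(rect.0, rect.1, rect.2, rect.3);
    redacted == *claimed
}
```

For example, after building `original` with `Gray::filled(8, 8, 200)` and a `claimed` copy redacted at `(2, 2, 3, 3)`, calling `is_redaction_of(&original, &claimed, (2, 2, 3, 3))` returns true, while passing a different rectangle returns false.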

sims28 · February 24, 2021, 6:17pm

No, I just want to take an image, redact it at the given places, and then check whether it is similar to the original image. That is an interesting way, to be honest. But now I am thinking: what if I want to redact specific parts, like not just anywhere in the image but some specific parts? That seems difficult then.

drmason13 · February 24, 2021, 9:00pm

It sounds like you might be dealing with photocopy images, or scanned images. Something that is more likely to be stored as a .jpg file than a .txt file.

This is why it is difficult. The jpeg (or any similar image format) doesn't understand text, it just understands pixels (oversimplifying here but go with it).

Interpreting "specific parts" of an image, like a date of birth or a name within a document, is a whole interesting challenge for machine learning in its own right. If you can possibly go by generous coordinate regions of the image, you'll have a much easier go of it.

Hopefully you're dealing with a known set of forms or something and you can pair some metadata with the image to look up how to redact it.
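As a rough sketch of that metadata idea (standard library only, nothing from a real crate): keep a table from form type to the regions that should be blacked out, then apply them to a row-major grayscale buffer. The form name "application_form", the coordinates, and the names `Region`, `redaction_table`, and `redact_form` are all invented for illustration.

```rust
use std::collections::HashMap;

/// A rectangular region to black out, in pixel coordinates.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Region {
    x: u32,
    y: u32,
    width: u32,
    height: u32,
}

/// Hypothetical metadata: which regions to redact for each known form type.
fn redaction_table() -> HashMap<&'static str, Vec<Region>> {
    let mut table = HashMap::new();
    table.insert(
        "application_form",
        vec![
            Region { x: 10, y: 5, width: 40, height: 8 },  // e.g. a name field
            Region { x: 10, y: 20, width: 25, height: 8 }, // e.g. a date-of-birth field
        ],
    );
    table
}

/// Black out every region listed for `form` in a row-major grayscale buffer.
/// Returns the number of regions applied, or None if the form type is unknown.
fn redact_form(pixels: &mut [u8], img_width: u32, img_height: u32, form: &str) -> Option<usize> {
    assert_eq!(pixels.len(), (img_width * img_height) as usize);
    let table = redaction_table();
    let regions = table.get(form)?;
    for r in regions {
        // Clip each region to the image so out-of-range metadata cannot panic.
        let x_end = r.x.saturating_add(r.width).min(img_width);
        let y_end = r.y.saturating_add(r.height).min(img_height);
        for y in r.y..y_end {
            for x in r.x..x_end {
                pixels[(y * img_width + x) as usize] = 0;
            }
        }
    }
    Some(regions.len())
}
```

Making the regions a bit larger than the fields they cover is the cheap way to tolerate scans that are slightly shifted or skewed.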

sims28 · February 24, 2021, 9:04pm

Yeah, that's what I am going for. That's interesting, I will think about it. Can I do it in Rust?

Cerber-Ursi · February 25, 2021, 3:02am

If you can do it at all (converting image to text is, as was said before, a fairly hard problem), then yes, you can do it in Rust too.

drmason13 · February 25, 2021, 6:13pm

Yes, but do yourself a favour and do it in Python.

Python has a huge head start in machine learning. In fairness, Rust will have its uses in the machine learning space, but in my opinion it is best used at a lower level (for speed and correctness). With time this may well change!

The high level tasks are just that much easier with established libraries with loads of tutorials.

sims28 · February 25, 2021, 7:56pm

Yeah, that's the conclusion I finally came to yesterday. I will do the redaction part in Python.

Thanks!
