Reading an MS Word file

Hi,
How can I read the content of an MS Word '.doc' file and search for keywords in it?
I'd like to do a quick scan of some CVs.

Thanks

I have no idea if any of them are good, but there do seem to be a few crates that support reading from .docx files: https://crates.io/search?q=docx

Another alternative would be Apache's Tika library. It's not a Rust thing, though, unfortunately.

May I ask whether you need to read *.doc files (as stated in the title), *.dic files (as stated in the post), or *.docx files (as the first reply presumes)? These are three very different things.

Actually, it could be *.doc, *.docx, or *.pdf. Just so as not to ask for too many things at once, I asked about *.doc first :slight_smile:

Which happens to be the hardest of the three. PDF and DOCX have publicly available specifications, while DOC is strictly proprietary and ancient.

DOCX under the hood is nothing more than some zipped XML files. Perhaps this knowledge is enough to get the information you need? If you need more specific information, it's best to evaluate the available crates to find the one that suits your needs.
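To make the "zipped XML" point concrete, here is a minimal sketch using only the standard library. It assumes you have already unzipped the archive (any zip tool will do) and read `word/document.xml` into a string. The function names are made up for illustration, and the tag scanning is deliberately naive: it collects the contents of `<w:t>` elements and treats each `</w:p>` as a paragraph break. A real XML parser would be more robust.

```rust
/// Pulls the visible text out of the contents of `word/document.xml`.
/// Each paragraph (`<w:p>`) becomes one line; empty paragraphs are skipped.
fn extract_docx_text(xml: &str) -> String {
    let mut paragraphs = Vec::new();
    for para in xml.split("</w:p>") {
        let text = paragraph_text(para);
        if !text.is_empty() {
            paragraphs.push(text);
        }
    }
    paragraphs.join("\n")
}

/// Concatenates the contents of every `<w:t>` element in one paragraph.
/// Runs are joined without a separator because Word may split a word
/// across several runs.
fn paragraph_text(para: &str) -> String {
    let mut out = String::new();
    let mut rest = para;
    while let Some(start) = rest.find("<w:t") {
        let after = &rest[start + "<w:t".len()..];
        let gt = match after.find('>') {
            Some(i) => i,
            None => break,
        };
        // Accept `<w:t>` and `<w:t attr="...">`, but not `<w:tab/>`, `<w:tbl>`, etc.
        let is_text_tag = after.starts_with('>') || after.starts_with(' ');
        let self_closing = after[..gt].ends_with('/');
        let body = &after[gt + 1..];
        if !is_text_tag || self_closing {
            rest = body;
            continue;
        }
        let end = match body.find("</w:t>") {
            Some(i) => i,
            None => break,
        };
        out.push_str(&body[..end]);
        rest = &body[end + "</w:t>".len()..];
    }
    decode_entities(&out)
}

/// Decodes the five predefined XML entities (`&amp;` last, so that
/// `&amp;lt;` correctly becomes `&lt;`).
fn decode_entities(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&amp;", "&")
}
```

The result is plain text that you can feed to any ordinary string search. Headers, footers, and footnotes live in other XML files inside the archive, so this sketch would not see them.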

PDFs may or may not contain the displayed text in a machine-readable format. A PDF contains either a picture that happens to show text or instructions for how to place letters on a page (this is simplified!). In these instructions there is simply a "pointer" to a glyph from the font, which does not necessarily need to be ASCII, UTF-8, or anything else. Due to compression and minification, the same ID may even mean different letters in different fonts. A PDF may contain extra meta information which maps a picture or placement instruction to actual text in a well-known encoding (I don't remember which one it was, but you can google the spec). Perhaps you will find crates that help you read the PDF, but that's nothing I'd put my money on.
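As a toy illustration of the glyph "pointer" problem (this is not PDF parsing, and every name here is invented for the example): think of each font as carrying its own table from glyph ID to character. The same ID decodes differently depending on the font, and when the table is missing an entry the text simply cannot be recovered.

```rust
use std::collections::HashMap;

/// Toy stand-in for a per-font table from glyph ID to Unicode character.
type GlyphMap = HashMap<u16, char>;

/// Decodes a sequence of (font name, glyph ID) placements.
/// Glyphs without a known mapping become U+FFFD (the replacement character).
fn decode_glyphs(fonts: &HashMap<&str, GlyphMap>, placed: &[(&str, u16)]) -> String {
    placed
        .iter()
        .map(|(font, id)| {
            fonts
                .get(*font)
                .and_then(|map| map.get(id))
                .copied()
                .unwrap_or('\u{FFFD}')
        })
        .collect()
}
```

With two fonts that both use glyph ID 1, one mapping it to `H` and the other to `!`, the same number yields different letters. That is why a plain byte search over a PDF often finds nothing.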

DOC is, as already said, proprietary and also binary. Back in the days when it was still current, there were a lot of reverse-engineered readers and writers across languages; all did their job reasonably well, at least as long as you didn't try to alter files that were created by another tool. I'm not sure whether there is still such a drive, ~15 years after the last MS product shipped that used DOC as its default output.

In general, regardless of the format you get, the documents are not structured in a machine friendly way. Due to the internal format used, the "label" "date of birth" and the actual date may look like they are adjacent in the printable layout but may be miles away from each other in the internal representation. Some applicants may prefer to write "birthday" instead, and so on.
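If you do end up scanning extracted plain text anyway, a small sketch like the following can soften the wording problem a little. It assumes the text has already been extracted by some other means. The function names are hypothetical, and it uses plain substring matching after lowercasing and collapsing whitespace, so a label broken across lines still matches.

```rust
/// Lowercases the text and collapses every run of whitespace
/// (spaces, tabs, line breaks) into a single space.
fn normalize(text: &str) -> String {
    text.split_whitespace()
        .map(|word| word.to_lowercase())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Returns those entries of `alternatives` that occur in `text`,
/// ignoring case and differences in whitespace.
fn matching_keywords<'a>(text: &str, alternatives: &[&'a str]) -> Vec<&'a str> {
    let haystack = normalize(text);
    alternatives
        .iter()
        .copied()
        .filter(|kw| haystack.contains(normalize(kw).as_str()))
        .collect()
}
```

For example, passing `["date of birth", "birthday"]` covers both phrasings mentioned above. Because this is substring matching, a keyword like "rust" would also match inside "trust", so treat the hits as candidates for a human to check, not as a verdict.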

So to be honest, I think the easiest way is to simply ask your applicants to fill out additional fields in the application form, providing the data you want to scan for. I'm already used to doing this on many sites.

For applications that do not come through the application form but are sent in via snail mail or email, there should be a manual import into the system anyway, so you can add the necessary metadata by hand as well.

PS: There are PDF substandards that try to enforce structured metadata for easier access by the blind, which would also mean easier access for machines, but very few people know how to actually create such files correctly. Also, as far as I know, LaTeX is still not able to generate them.

Thanks a lot for the comprehensive explanation, deeply appreciated,

I found dotext easy to use, and it solved my issue.

You could try openXML

still v0.0.0?!

openXML appears to be an empty crate as well. I do not see any documentation or source code repository link.

I second @NobbZ's recommendation to try to work with docx files. I am interested in this topic and have been researching some of the underlying XML libraries.
If you need to manipulate the XML files in place, quick-xml seems like a strong contender.

For primarily reading data, @RazrFalcon has built an excellent library called roxmltree. Its README details alternatives and their associated features.

I modified the print_pos.rs example to search for some text nodes in a document.xml. I was running it against a document.xml extracted from a docx archive: cargo run --example print_pos -- /tmp/extracted-file.docx/word/document.xml.

Better yet, just use anvie's dotext. It is a library for extracting text from docx and other formats. I'll avoid posting on old threads next time :slight_smile:
