PDF to DOCX conversion in Rust (high accuracy)

I’m working on a PDF → DOCX converter in Rust (for a Tauri app), aiming to preserve layout, tables, fonts, and spacing as accurately as possible.

From what I understand, PDFs don’t store structured content like DOCX, so reconstructing everything correctly is quite challenging.

I wanted to ask:

  • Has anyone here worked on PDF to DOCX conversion in Rust?
  • Are there any crates or tools that help with layout reconstruction?
  • Is near-lossless conversion realistic using only Rust, or is a hybrid/external approach usually needed?
  • How do you typically handle tables and complex layouts?

Also, I’d really appreciate any suggestions on how you would approach building something like this.

I don't know how I would attempt to do that, but I know how Google Books does it: they turn PDFs into TIFFs and feed those to an OCR program.

Yes, I'm not joking: it turned out to be simpler and more reliable than playing whack-a-mole with the bazillion spec violations that various programs commit.
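For what it's worth, a rasterize-then-OCR pipeline can be sketched in Rust by shelling out to external tools. I don't know what Google Books actually runs; this sketch swaps in Poppler's `pdftoppm` and the `tesseract` CLI as stand-ins, and assumes both are installed and on `PATH`. The function names `rasterize_page` and `ocr_page` are made up for illustration. The functions only build the commands; nothing is executed until you call `.status()` or `.output()` on them.

```rust
use std::path::Path;
use std::process::Command;

/// Build a `pdftoppm` command that renders one page of `pdf` as a TIFF.
/// With `-singlefile`, the output is assumed to land at `<out_base>.tif`.
fn rasterize_page(pdf: &Path, page: u32, dpi: u32, out_base: &Path) -> Command {
    let mut cmd = Command::new("pdftoppm");
    cmd.arg("-r").arg(dpi.to_string()) // resolution in DPI
        .arg("-f").arg(page.to_string()) // first page to render
        .arg("-l").arg(page.to_string()) // last page to render
        .arg("-singlefile") // no page-number suffix on the output name
        .arg("-tiff")
        .arg(pdf)
        .arg(out_base);
    cmd
}

/// Build a `tesseract` command that OCRs `image` and writes hOCR
/// (recognized text plus bounding boxes) next to `out_base`.
fn ocr_page(image: &Path, out_base: &Path, lang: &str) -> Command {
    let mut cmd = Command::new("tesseract");
    cmd.arg(image)
        .arg(out_base)
        .arg("-l").arg(lang) // recognition language, e.g. "eng"
        .arg("hocr"); // config name that selects hOCR output
    cmd
}
```

hOCR is picked here because it keeps word positions, which is what you would need to rebuild paragraphs and tables afterwards. Turning that into DOCX is still entirely up to you.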

So far this sounds more like “I want to make an omelette and I plan to use a supercollider and a lunar module for that”, but perhaps your app really does need to solve an impossible task (it can be solved with some reliability, just not 100% reliably).

It's worse than that. At least with DOCX you have one canonical program that is supposed to work with it. If Microsoft Word doesn't accept your DOCX file, then you may say that the file is corrupt and people would agree.

With PDFs… it's a crazy zoo out there. Even just to print them you may need to try a few different approaches, because spec violations are rampant. Trying to do anything else, if you don't know who produces these PDFs and how, quickly turns into a full-time job.

Shell out to Pandoc. Or, use Pandoc as the starting model for building a Rust-only version.

The only experience I personally have is with attempts at extracting the highlights ("annotations") embedded in the PDF files themselves. A few hours into the research, it turned out that, for such an apparently "easy" task (how hard could opening a file and extracting a bunch of text really be?), PDFs have no idea of the existence of this "text".

A PDF only knows the coordinates of individual highlights, which you have to match, iterate over, and extract on a glyph-by-glyph basis for each and every annotation on each and every page. That was quite a fun discovery to stumble upon. It was even more fun to scour the web trying to find any library that would do a half-decent job of piecing everything together to accomplish the task.
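To make the glyph matching concrete, here is a minimal sketch. It assumes you have already pulled positioned glyphs out of the page with some PDF library; the `Glyph` and `Rect` types and the function `text_under_highlight` are hypothetical and not taken from any crate. It handles one rectangle covering a single line of text. A highlight that spans several lines carries one quad per line in its `/QuadPoints` entry, so you would call this once per quad.

```rust
/// Axis-aligned rectangle in PDF user space (origin bottom-left, y grows upward).
#[derive(Debug, Clone, Copy)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

impl Rect {
    fn contains(&self, x: f32, y: f32) -> bool {
        x >= self.x0 && x <= self.x1 && y >= self.y0 && y <= self.y1
    }
}

/// Hypothetical glyph record: one character plus its bounding box on the page.
#[derive(Debug, Clone, Copy)]
struct Glyph {
    ch: char,
    bbox: Rect,
}

impl Glyph {
    fn center(&self) -> (f32, f32) {
        (
            (self.bbox.x0 + self.bbox.x1) / 2.0,
            (self.bbox.y0 + self.bbox.y1) / 2.0,
        )
    }
}

/// Collect the glyphs whose center lies inside `highlight` and return them
/// left to right. Assumes the rectangle covers a single line of text.
fn text_under_highlight(glyphs: &[Glyph], highlight: Rect) -> String {
    let mut hits: Vec<&Glyph> = glyphs
        .iter()
        .filter(|g| {
            let (cx, cy) = g.center();
            highlight.contains(cx, cy)
        })
        .collect();
    // Content-stream order is not guaranteed to be reading order, so sort by x.
    hits.sort_by(|a, b| a.center().0.total_cmp(&b.center().0));
    hits.iter().map(|g| g.ch).collect()
}
```

Testing the glyph center rather than the full box is a deliberate simplification: highlight rectangles rarely line up exactly with glyph boxes, and a center test tolerates small overhangs on either side.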

Reconstructing the whole layout, on the other hand, from the bottom up? Into an XML-based format like DOCX? Extracting individual glyphs, each of which only "knows" where and how it should be rendered, and building a fully styled, layout-accurate markup tree out of them? Tables, too?
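To give a sense of what "bottom up" means in practice, here is a rough sketch of just the very first step: turning loose glyphs back into lines of text. It assumes each glyph comes with a left edge, a baseline, and an advance width; `PlacedGlyph`, `rebuild_lines`, and the two tolerance parameters are hypothetical names chosen for illustration. Real documents add rotated text, multiple columns, kerning, and tables on top of this.

```rust
/// Hypothetical glyph record: a character, the x position of its left edge,
/// the y position of its baseline, and its advance width (y grows upward).
#[derive(Debug, Clone, Copy)]
struct PlacedGlyph {
    ch: char,
    x: f32,
    baseline: f32,
    width: f32,
}

/// Group glyphs into lines of text, top of the page first.
/// `line_tol`: maximum baseline difference for glyphs to share a line.
/// `space_gap`: horizontal gap above which a space is inserted.
fn rebuild_lines(glyphs: &[PlacedGlyph], line_tol: f32, space_gap: f32) -> Vec<String> {
    let mut sorted = glyphs.to_vec();
    // Larger baseline values are higher on the page, so they come first.
    sorted.sort_by(|a, b| b.baseline.total_cmp(&a.baseline));

    // Cluster glyphs by comparing each baseline to the first glyph of the current line.
    let mut lines: Vec<Vec<PlacedGlyph>> = Vec::new();
    for g in sorted {
        let starts_new_line = match lines.last() {
            Some(line) => (line[0].baseline - g.baseline).abs() > line_tol,
            None => true,
        };
        if starts_new_line {
            lines.push(Vec::new());
        }
        if let Some(line) = lines.last_mut() {
            line.push(g);
        }
    }

    lines
        .into_iter()
        .map(|mut line| {
            // Left-to-right order within the line.
            line.sort_by(|a, b| a.x.total_cmp(&b.x));
            let mut text = String::new();
            let mut prev_end: Option<f32> = None;
            for g in line {
                if let Some(end) = prev_end {
                    // A wide gap between glyphs is treated as a word break.
                    if g.x - end > space_gap {
                        text.push(' ');
                    }
                }
                text.push(g.ch);
                prev_end = Some(g.x + g.width);
            }
            text
        })
        .collect()
}
```

Note that the spaces are guessed from gaps, because many PDFs never draw a space glyph at all. Both thresholds would need tuning per font size, and this is still a long way from paragraphs, styles, or tables.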

You might want to spend a few hours or days playing around with building PDFs from scratch yourself, for starters. That will give you a much clearer picture of what exactly you're dealing with. After that, you might well reconsider this whole little enterprise. If not, prepare for, at best, sub-par hacky patching of the few pieces you manage to salvage here and there, or for an absolutely gargantuan amount of work (spanning months, if not years) building your own lib from scratch.

Is there an opposite project in Rust, converting .docx and .odt files into PDF?

Yes, I know near-lossless PDF to DOCX conversion isn’t realistic, but it is needed for the app.

Thank you, I will look into it.