Convert PDF to Markdown
last updated: Jul 03, 2024
https://github.com/VikParuchuri/marker
Marker converts PDF to markdown quickly and accurately.
- Supports a wide range of documents (optimized for books and scientific papers)
- Supports all languages
- Removes headers/footers/other artifacts
- Formats tables and code blocks
- Extracts and saves images along with the markdown
- Converts most equations to LaTeX
- Works on GPU, CPU, or MPS
How it works
Marker is a pipeline of deep learning models:
- Extract text, OCR if necessary (heuristics, surya, tesseract)
- Detect page layout and find reading order (surya)
- Clean and format each block (heuristics, texify)
- Combine blocks and postprocess complete text (heuristics, pdf_postprocessor)
It only uses models where necessary, which improves speed and accuracy.
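The "models only where necessary" idea can be sketched roughly as below. This is not Marker's actual code: `Page`, `text_layer_is_usable`, `extract_text`, and the thresholds are hypothetical names and values chosen for illustration. The point is only that a cheap heuristic runs first, and the expensive OCR model is called just for pages whose embedded text layer looks missing or garbled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    # Text pulled directly from the PDF text layer (may be empty or garbage).
    embedded_text: str


def text_layer_is_usable(text: str, min_chars: int = 20,
                         min_alnum_ratio: float = 0.5) -> bool:
    """Cheap heuristic (assumed, not Marker's): the page has enough
    characters and most non-space characters are alphanumeric."""
    stripped = "".join(text.split())
    if len(stripped) < min_chars:
        return False
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) >= min_alnum_ratio


def extract_text(pages: List[Page], ocr: Callable[[Page], str]) -> List[str]:
    """Use the embedded text when it passes the heuristic; otherwise
    fall back to the (expensive) OCR callable for that page only."""
    results = []
    for page in pages:
        if text_layer_is_usable(page.embedded_text):
            results.append(page.embedded_text)
        else:
            results.append(ocr(page))
    return results
```

In a real pipeline the `ocr` argument would wrap an OCR model such as surya or tesseract; here it is any callable, so the gating logic can be tried out with a stub.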
Neat application of machine learning. I haven't used it yet, but I've had situations in the past where it would have come in handy.