docling
last updated: Nov 10, 2025
https://github.com/docling-project/docling
https://docling-project.github.io/docling
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
- 🗂️ Parsing of multiple document formats incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, VTT, images (PNG, TIFF, JPEG, ...), and more
- 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
Recommended by hrbrmstr for turning PDFs into text for use in an AI system: it "works shockingly well".
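
A minimal sketch of the PDF-to-text use, assuming docling is installed (`pip install docling`) and going through its documented `DocumentConverter` entry point. The function name `pdf_to_markdown` and the optional `converter` parameter are hypothetical conveniences for these notes, not part of docling.

```python
def pdf_to_markdown(source, converter=None):
    """Convert a document (local path or URL) to Markdown text.

    `converter` is a hypothetical injection point: pass any object with a
    docling-style `convert(source)` method, or leave it as None to build a
    default docling DocumentConverter.
    """
    if converter is None:
        # Imported lazily so this module can be loaded without docling;
        # the import only happens when a real conversion is requested.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # convert() returns a result object whose `.document` holds the parsed
    # document; export_to_markdown() serializes it to a Markdown string.
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

Calling `pdf_to_markdown("report.pdf")` (a placeholder file name) should return Markdown that can be chunked or passed straight into a prompt. The first run is likely to be slow because docling downloads its layout and table models.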