Using Vision LLMs + late interaction to improve document retrieval (RAG, search engines, etc.), solely using the image representation of document pages (paper)!

Context

To improve the query answering capabilities of LLMs, it is often best to first search for information online or in external document sets (PDFs), before letting an LLM synthesize a grounded response (RAG). In practice, these retrieval pipelines for PDF documents have a huge impact on performance but are non-trivial...

Run Optical Character Recognition (OCR) on scanned PDFs