PageIndex OCR is the world's first OCR model that understands documents as a whole — preserving full structure across pages, instead of treating each page as an independent unit.
We present PageIndex, a reasoning-based retrieval-augmented generation (RAG) system that uses tree search to simulate how human experts navigate long documents and extract knowledge from them.
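To make the tree-search idea concrete, here is a minimal sketch of a greedy descent over a document's section hierarchy. It is not PageIndex's implementation: the `Node` class, the `relevance` function, and the sample report are all hypothetical, and the word-overlap score is only a toy stand-in for whatever reasoning step a real system would use to choose a branch.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One section of a document; children are its subsections (hypothetical)."""
    title: str
    summary: str
    children: list[Node] = field(default_factory=list)


def relevance(query: str, node: Node) -> int:
    """Toy stand-in for the reasoning step: count shared lowercase words."""
    query_words = set(query.lower().split())
    node_words = set((node.title + " " + node.summary).lower().split())
    return len(query_words & node_words)


def tree_search(root: Node, query: str) -> list[str]:
    """Greedily descend from the root and return the path of section titles."""
    path = [root.title]
    node = root
    while node.children:
        # Follow the child whose title and summary best match the query.
        node = max(node.children, key=lambda child: relevance(query, child))
        path.append(node.title)
    return path


# Made-up document outline, for illustration only.
report = Node("Annual report", "company overview", [
    Node("Financials", "revenue and costs", [
        Node("Revenue", "revenue by region"),
        Node("Costs", "operating costs by department"),
    ]),
    Node("Governance", "board and policies"),
])

path = tree_search(report, "operating costs")
print(path)  # ['Annual report', 'Financials', 'Costs']
```

A greedy descent commits to one branch per level; a beam search or backtracking variant would trade extra scoring calls for robustness when a section summary is misleading.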
We introduce Model Augmented Fine-tuning (Mafin) — a novel approach for fine-tuning a black-box embedding model by augmenting it with a trainable embedding model.
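As a rough illustration of what augmenting a frozen embedding model with a trainable one could look like, the sketch below concatenates the two normalized embeddings and rescales the result to unit length. The concatenation rule is an assumption made for this example, not something the sentence above specifies, and `black_box_embed`, `trainable_embed`, and `WEIGHTS` are hypothetical toy stand-ins rather than real models.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return vector if norm == 0 else [x / norm for x in vector]


def black_box_embed(text: str) -> list[float]:
    """Hypothetical frozen model: we can call it but cannot update it."""
    return [float(len(text)), float(text.count(" ") + 1), 1.0]


def trainable_embed(text: str, weights: list[list[float]]) -> list[float]:
    """Hypothetical small model: a linear map whose weights would be fine-tuned."""
    features = [float(text.count("a")), float(text.count("e")), 1.0]
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def augmented_embed(text: str, weights: list[list[float]]) -> list[float]:
    """Assumed combination rule: concatenate both unit vectors, divide by sqrt(2)."""
    frozen = l2_normalize(black_box_embed(text))
    learned = l2_normalize(trainable_embed(text, weights))
    return [x / math.sqrt(2) for x in frozen + learned]


# Made-up initial weights; in training, only these would receive gradients.
WEIGHTS = [[0.5, 0.0, 1.0], [0.0, 0.5, 1.0]]

embedding = augmented_embed("tree search", WEIGHTS)
print(len(embedding))
```

Under this rule the cosine similarity of two augmented embeddings is the average of the frozen-model similarity and the trainable-model similarity, so the trainable part can shift rankings without access to the black-box model's parameters.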
We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by Direct Preference Optimization (DPO).
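The two ingredients named above can be sketched as follows. The entropy estimate is the usual Monte Carlo form (minus the mean log-probability of sampled completions), and the implicit reward uses the standard DPO expression, beta times the log-ratio of policy to reference model. Using the absolute reward gap as the certainty measure is an assumption of this sketch, as are all function names and numbers; how the two quantities are combined into a single acquisition score is not stated here, so the example stops at computing them.

```python
import math


def predictive_entropy(sample_logprobs: list[float]) -> float:
    """Monte Carlo estimate: minus the mean log-probability of sampled completions."""
    return -sum(sample_logprobs) / len(sample_logprobs)


def implicit_reward(policy_logprob: float, reference_logprob: float,
                    beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logprob - reference_logprob)


def preference_certainty(reward_a: float, reward_b: float) -> float:
    """Assumed certainty measure: absolute gap between the two implicit rewards."""
    return abs(reward_a - reward_b)


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that completion A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


# Made-up log-probabilities for one prompt and two candidate completions.
entropy = predictive_entropy([-2.0, -4.0])
reward_a = implicit_reward(policy_logprob=-10.0, reference_logprob=-12.0, beta=0.5)
reward_b = implicit_reward(policy_logprob=-15.0, reference_logprob=-12.0, beta=0.5)
certainty = preference_certainty(reward_a, reward_b)

print(entropy, certainty, preference_probability(reward_a, reward_b))
```

In an active-learning loop, scores like these would be computed for each candidate in a pool and used to decide which pairs to send for preference labeling.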