We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO.
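The two signals named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, predictive entropy is approximated by a Monte Carlo average over completions sampled from the model, and "certainty" is taken to be the absolute gap between DPO implicit rewards, where the implicit reward is beta * (log pi(y|x) - log pi_ref(y|x)) up to a prompt-dependent term that cancels in the difference. How the two signals are combined into a single acquisition score is not specified here and is left out.

```python
import math
from typing import Sequence


def predictive_entropy(sample_logprobs: Sequence[float]) -> float:
    """Monte Carlo estimate of H[p(y | x)].

    `sample_logprobs` holds log p(y_i | x) for completions y_i sampled
    from the model itself, so the entropy is approximated by -mean(log p).
    """
    if not sample_logprobs:
        raise ValueError("need at least one sampled completion")
    return -sum(sample_logprobs) / len(sample_logprobs)


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward for one completion (prompt-only term dropped)."""
    return beta * (logp_policy - logp_ref)


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that completion A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_certainty(reward_a: float, reward_b: float) -> float:
    """Assumed certainty measure: absolute implicit-reward gap of the pair."""
    return abs(reward_a - reward_b)
```

A pair whose implicit rewards are equal gets a preference probability of 0.5 and a certainty of 0, meaning the implicit preference model has no opinion about which completion is better.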
We introduce Model Augmented Fine-tuning (Mafin) — a novel approach for fine-tuning a black-box embedding model by augmenting it with a trainable embedding model.
We argue that context blindness — the inability of vector-based retrieval to condition on full conversational and reasoning context — is the most fundamental limitation of vector RAG, and outline a paradigm shift from semantic similarity to context-dependent relevance classification.
We examine the inherent limitations of OCR from an information-theoretic perspective and show why a direct, vision-based approach with PageIndex is more effective.
We built what Andrej Karpathy described, and solved the hard part. OpenKB is an open-source CLI that compiles raw documents into a structured, interlinked wiki, powered by PageIndex for long PDFs.
PageIndex OCR is the world's first OCR model that understands documents as a whole — preserving full structure and section hierarchy across pages, instead of treating each page as an independent unit.
PageIndex is a vectorless, reasoning-based retrieval framework that simulates how human experts extract knowledge from complex documents. Instead of relying on vector similarity search, it builds a tree-structured index from documents and enables LLMs to perform agentic reasoning over that structure for context-aware retrieval. The retrieval process is traceable and interpretable, and requires no vector DBs or chunking.
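The idea of descending a tree-structured index instead of searching a vector store can be sketched in a few lines. This is a toy illustration, not the PageIndex API: `Node`, `retrieve`, and `keyword_chooser` are hypothetical names, and the keyword-overlap chooser is only a stand-in for the step where an LLM would reason over section titles and summaries to decide which branch to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """One section of a document: title, short summary, and page range."""
    title: str
    summary: str
    pages: Tuple[int, int]
    children: List["Node"] = field(default_factory=list)


# A chooser receives the query and the candidate child sections and
# returns the index of the child to descend into. In a real system this
# decision would be made by an LLM; here it is a pluggable function.
Chooser = Callable[[str, List[Node]], int]


def keyword_chooser(query: str, candidates: List[Node]) -> int:
    """Stand-in for LLM reasoning: pick the child with most shared words."""
    query_words = set(query.lower().split())
    scores = [
        len(query_words & set(f"{c.title} {c.summary}".lower().split()))
        for c in candidates
    ]
    return scores.index(max(scores))


def retrieve(root: Node, query: str, choose: Chooser = keyword_chooser) -> List[Node]:
    """Walk from the root to a leaf, recording the path for traceability."""
    path = [root]
    node = root
    while node.children:
        node = node.children[choose(query, node.children)]
        path.append(node)
    return path


def example_tree() -> Node:
    """A small hand-built index for a hypothetical annual report."""
    return Node("Annual Report", "full report", (1, 120), [
        Node("Risk Factors", "market regulatory litigation risk", (10, 39)),
        Node("Financial Statements",
             "balance sheet income statement cash flow assets liabilities revenue",
             (40, 90), [
                 Node("Income Statement", "revenue expenses net income", (41, 45)),
                 Node("Balance Sheet", "assets liabilities equity", (46, 50)),
             ]),
    ])
```

Calling `retrieve(example_tree(), "what were total liabilities")` walks Annual Report, then Financial Statements, then Balance Sheet, and the returned path carries the page range of each section visited, which is what makes the lookup traceable. No chunking or embedding step is involved.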
PageIndex was recognized on the Open Source Growth Index (OSSCAR) Q1 2026 by Supabase × Commit VC, ranking #14 in GitHub Star Growth and #38 Overall in the Scaling Tier.
VentureBeat covers PageIndex, the vectorless, reasoning-based RAG framework that uses tree search over document structure to reach 98.7% accuracy on FinanceBench, where vector-based retrieval typically fails.
We benchmarked PageIndex Chat against ChatGPT 5.1 on real-world long documents. PageIndex achieved 100% accuracy compared to ChatGPT 5.1's 59–82%, with faster response times and page-level traceability.