Name: PageIndex
Author: Vectify AI

All Posts

Published on
February 8, 2024
Active Preference Learning for Large Language Models
ICML 2024, Research
We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO.
Published on
March 12, 2024
Enhancing Black-Box Embeddings with Model Augmented Fine-Tuning
Research
We introduce Model Augmented Fine-tuning (Mafin) — a novel approach for fine-tuning a black-box embedding model by augmenting it with a trainable embedding model.
Published on
February 19, 2025
PageIndex Leads Financial QA Benchmark
Model, Insights
We introduce Mafin2.5, built on PageIndex, which achieves 98.7% accuracy on the finance industry question-answering benchmark.
Published on
September 1, 2025
From Claude Code to Agentic RAG
Insights
The same bet behind both PageIndex and Claude Code: skip the vector DB and let the LLM itself drive retrieval. We explore the rise of agentic retrieval over vector indexing and PageIndex's agentic, vectorless RAG framework. Where Claude Code retrieves over a codebase with simple bash tools instead of a vector database, PageIndex gives long documents an in-context tree index that an LLM agent navigates by reasoning — no chunking, embeddings, or vector store.
Published on
February 2, 2026
Context Blindness: A Fundamental Limitation of Vector RAG
Insights
We argue that context blindness — the inability of vector-based retrieval to condition on full conversational and reasoning context — is a fundamental limitation of vector RAG, and outline a paradigm shift from semantic similarity to context-dependent relevance classification. In this view, retrieval becomes a relevance decision made by an LLM with full context, scaled efficiently through hierarchical tree search.
Published on
October 27, 2025
Do We Still Need OCR?
Insights
We examine the inherent limitations of OCR from an information-theoretic perspective and show why a direct, vision-based approach with PageIndex is more effective. Because flattening a 2D page into a 1D text sequence is inherently lossy, PageIndex acts as a vectorless retrieval layer that selects the relevant pages of a long document, which a VLM then reads directly as images.
Published on
April 10, 2026
OpenKB: An Open-Source LLM Knowledge Base
Product
We built what Andrej Karpathy described, and solved the hard part. OpenKB is an open-source CLI that compiles raw documents into a structured, interlinked wiki, powered by PageIndex for long PDFs.
Published on
August 5, 2025
PageIndex OCR: The First Long-Context OCR Model
Product
PageIndex OCR is the world's first OCR model that understands documents as a whole — preserving full structure and section hierarchy across pages, instead of treating each page as an independent unit.
Published on
October 20, 2025
Introducing PageIndex Chat
Product
Experience the power of reasoning-based RAG with PageIndex Chat, our new conversational interface for intelligent document understanding.
Published on
May 3, 2026
PageIndex File System: Massive-Scale Document Search
Product
PageIndex File System is a file-level tree layer that sits above your documents and scales the same PageIndex tree search from a single document to millions of documents in one index. It synthesizes a semantic hierarchy with virtual nodes when no usable folder structure exists, builds the tree on demand for each query, and adapts how it searches each node to stay efficient at scale.
Published on
March 31, 2026
PageIndex Selected for GitHub Secure Open Source Fund
News
PageIndex has been selected for GitHub's Secure Open Source Fund, supporting a broader security roadmap for long-document AI infrastructure.
Published on
January 25, 2026
PageIndex Hit #1 GitHub Trending
News
PageIndex reached
Published on
September 19, 2025
PageIndex: Next-Generation Vectorless, Reasoning-based RAG
Research
PageIndex is a vectorless, reasoning-based retrieval framework that simulates how human experts extract knowledge from complex documents. Instead of relying on vector similarity search, it builds a tree-structured index from documents and enables LLMs to perform agentic reasoning over that structure for context-aware retrieval. The retrieval process is traceable and interpretable, and requires no vector DBs or chunking.
Published on
April 29, 2026
PageIndex Featured on The Open-Source Growth Index (OSSCAR)
News
PageIndex was recognized on the Open Source Growth Index (OSSCAR) Q1 2026 by Supabase × Commit VC, ranking #14 in GitHub Star Growth and #38 Overall in the Scaling Tier.
Published on
January 30, 2026
PageIndex Featured in VentureBeat: A Tree Search Framework That Hits 98.7% Where Vector Search Fails
News
VentureBeat covers PageIndex, the vectorless, reasoning-based RAG framework that uses tree search over document structure to reach 98.7% accuracy on FinanceBench, where vector-based retrieval typically fails.
Published on
November 30, 2025
PageIndex vs ChatGPT 5.1
Research, Insights
We benchmarked PageIndex Chat against ChatGPT 5.1 on real-world long documents. PageIndex achieved 100% accuracy compared to ChatGPT 5.1's 59-82%, with faster response times and page-level traceability.
Published on
December 12, 2025
RAG for Technical Manuals
Research, Insights
How PageIndex’s vectorless, reasoning-based RAG overcomes the challenges of traditional vector RAG in long, complex technical manuals.