PageIndex: Next-Generation Reasoning-based RAG
Try Reasoning-based RAG with PageIndex — no vector DB, no vibe retrieval 👋
- Higher Accuracy: Relevance beyond similarity
- Better Transparency: Clear reasoning trajectory
- Like A Human: Retrieve like a human expert
- No Vector DB: No extra infra overhead
- No Chunking: Preserve full context
- No Top-K: Retrieve all relevant passages
Introduction
PageIndex Workflow
PageIndex is a reasoning-based retrieval system that simulates how human experts retrieve knowledge from documents. It first converts documents into hierarchical tree structures, then performs tree search to retrieve relevant information.

- PageIndex OCR: Convert PDF to Markdown with preserved document structure
- PageIndex Tree Generation: Generate hierarchical tree structure optimized for retrieval
- PageIndex Retrieval: Reasoning-based retrieval by document tree search
The First Long-Context OCR Model
PageIndex OCR
PageIndex OCR is the first long-context OCR model designed to preserve the global structure of documents, understanding entire multi-page documents as unified structures rather than isolated pages.
Current OCR Limitations
Process Each Page Independently
Current OCR models process each page independently, losing context across pages and breaking document structure.
Broken Markdown Rendering & Lost Context
Incorrect heading levels break Markdown rendering, making the output hard for downstream LLMs to use.
PageIndex OCR
Model with Long Context Window
The PageIndex OCR model has a long context window, allowing it to understand the entire document as a unified structure.
Global Structure Preservation
Preserves hierarchical organization across page boundaries for complete document understanding and context.
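To illustrate why heading levels matter downstream, here is a minimal Python sketch that nests Markdown headings into an outline. It is not PageIndex code: `build_outline` and both sample strings are hypothetical, and the "per-page" input only imitates what happens when a page restarts its headings at the top level.

```python
import re

# Matches ATX-style Markdown headings such as "## Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_outline(markdown: str) -> list:
    """Nest Markdown headings into a tree based on their '#' depth."""
    root = {"level": 0, "title": "ROOT", "nodes": []}
    stack = [root]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        node = {"level": level, "title": match.group(2), "nodes": []}
        # Pop until the top of the stack is a shallower heading (the parent).
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["nodes"].append(node)
        stack.append(node)
    return root["nodes"]


# Hypothetical output where the second page restarts at heading level 1.
per_page = (
    "# Annual Report\n"
    "## Financial Stability\n"
    "# Monitoring Financial Vulnerabilities\n"
)
# Hypothetical output where heading levels are consistent across pages.
global_doc = (
    "# Annual Report\n"
    "## Financial Stability\n"
    "### Monitoring Financial Vulnerabilities\n"
)

print(len(build_outline(per_page)))    # number of top-level sections
print(len(build_outline(global_doc)))
```

With the per-page input, the subsection is promoted to a sibling of the report title, so any tree built from it loses the parent-child relationship that the consistent input keeps.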

Context-Preserving, Reasoning-Native Index
PageIndex Tree Generation
Documents are indexed as hierarchical tree structures generated by PageIndex. These trees maintain the original document's logical flow and organizational structure. This LLM-optimized tree representation enables precise navigation and is ready for reasoning-based RAG.
- No Vector DB Required: Tree structures are represented as lightweight JSON objects, avoiding the overhead and complexity of vector databases.
- No Chunking Required: Preserves natural document structure without artificial text splitting for better context retention.
- Node Location and Summary: Provides node page number and summary for precise information navigation.
- Optimized for Long Documents: Tree generation is optimized for financial reports, legal documents, and technical manuals that exceed LLM context limits.
# Example of PageIndex Tree Structure
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "page_index": 21,
  "text": "The Federal Reserve maintains financial stability through comprehensive...",
  "prefix_summary": "This section discusses...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "page_index": 22,
      "text": "The Federal Reserve's monitoring focuses on identifying...",
      "summary": "This section discusses..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "page_index": 28,
      "text": "In 2023, the Federal Reserve collaborated internationally...",
      "summary": "This section discusses..."
    }
  ]
}
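Because the tree is plain JSON, it can be traversed with a few lines of standard Python. The sketch below reuses the titles, node IDs, and page numbers from the example above (text and summaries omitted); `walk` and `outline` are illustrative helpers, not part of any PageIndex SDK.

```python
from typing import Iterator, Tuple

# Same shape as the example tree; "text" and summary fields are left out.
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "page_index": 21,
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "page_index": 22,
        },
        {
            "title": "Domestic and International Cooperation and Coordination",
            "node_id": "0008",
            "page_index": 28,
        },
    ],
}


def walk(node: dict, depth: int = 0) -> Iterator[Tuple[int, dict]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def outline(node: dict) -> list:
    """Render the tree as indented lines with node IDs and page numbers."""
    return [
        f"{'  ' * depth}[{n['node_id']}] {n['title']} (p. {n['page_index']})"
        for depth, n in walk(node)
    ]


# Lookup table from node ID to page number, handy for citations.
page_by_id = {n["node_id"]: n["page_index"] for _, n in walk(tree)}

print("\n".join(outline(tree)))
```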
Accurate, Human-like Tree Search
PageIndex Retrieval
PageIndex uses tree search with multi-step reasoning to retrieve information from complex documents. This approach goes beyond traditional vector-based retrieval by simulating how human experts systematically navigate and extract insights from lengthy documents.
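The general shape of such a tree search can be sketched in Python. This is an illustration under strong assumptions, not the PageIndex algorithm: the relevance decision, which in a reasoning-based system would come from an LLM, is replaced here by a simple keyword-overlap function, and `tree_search` and `keyword_overlap` are hypothetical names.

```python
def keyword_overlap(query: str, node: dict) -> bool:
    """Stand-in for an LLM relevance judgment: any shared word with the title."""
    query_words = set(query.lower().split())
    title_words = set(node["title"].lower().split())
    return bool(query_words & title_words)


def tree_search(node: dict, query: str, is_relevant, path=()) -> list:
    """Descend only into branches judged relevant; return every matching node.

    There is no top-k cutoff: all relevant leaves are returned, each with the
    path of node IDs that led to it.
    """
    if not is_relevant(query, node):
        return []
    trail = path + (node["node_id"],)
    hit = {
        "node_id": node["node_id"],
        "page_index": node["page_index"],
        "path": list(trail),
    }
    hits = []
    for child in node.get("nodes", []):
        hits.extend(tree_search(child, query, is_relevant, trail))
    # If no child is relevant, the current node is the most specific match.
    return hits or [hit]


# Titles, IDs, and pages taken from the tree structure example.
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "page_index": 21,
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities",
         "node_id": "0007", "page_index": 22},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008", "page_index": 28},
    ],
}

results = tree_search(tree, "monitoring financial vulnerabilities", keyword_overlap)
print(results)
```

Passing the relevance check in as a function keeps the traversal separate from the judgment, so the keyword stand-in could be swapped for a model call without touching the search itself.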
# Example of PageIndex Retrieval API Response
{
  "title": "Monetary Policy and Economic Developments",
  "node_id": "0004",
  "nodes": [
    {
      "title": "March 2024 Summary",
      "node_id": "0005",
      "relevant_contents": [{
        "page_index": 10,
        "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023..."
      }]
    },
    {
      "title": "June 2023 Summary",
      "node_id": "0006",
      "relevant_contents": [{
        "page_index": 15,
        "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during..."
      }]
    }
  ]
}
- No Top-K Selection Required: Tree search automatically identifies all relevant tree nodes without manual parameter tuning.
- Transparent Search Trajectories: Returns the complete search path through the tree structure, providing transparency and rich contextual information.
- Node and Page References: Every retrieved passage includes its node ID and page number from the original document for verifiable information retrieval.
- LLM-Ready Output Format: Structured data output with relevant paragraphs and search trajectories, ready for downstream LLM processing.
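As one way to consume a response of this shape, the Python sketch below flattens the nested result into citation records and a prompt-ready context string. It assumes only the fields shown in the example response; `flatten_citations` and `to_prompt_context` are hypothetical helpers, and the passage strings are copied from the example, truncation included.

```python
response = {
    "title": "Monetary Policy and Economic Developments",
    "node_id": "0004",
    "nodes": [
        {
            "title": "March 2024 Summary",
            "node_id": "0005",
            "relevant_contents": [{
                "page_index": 10,
                "relevant_content": "The labor market has gained averaging 239,000 per month since June 2023...",
            }],
        },
        {
            "title": "June 2023 Summary",
            "node_id": "0006",
            "relevant_contents": [{
                "page_index": 15,
                "relevant_content": "The labor market has remained very tight, with job gains averaging 314,000 per month during...",
            }],
        },
    ],
}


def flatten_citations(node: dict, path=()) -> list:
    """Collect every retrieved passage with its node ID, page, and title path."""
    trail = path + (node["title"],)
    citations = [
        {
            "node_id": node["node_id"],
            "page_index": item["page_index"],
            "path": " > ".join(trail),
            "content": item["relevant_content"],
        }
        for item in node.get("relevant_contents", [])
    ]
    for child in node.get("nodes", []):
        citations.extend(flatten_citations(child, trail))
    return citations


def to_prompt_context(citations: list) -> str:
    """Format passages so a downstream LLM can cite node and page."""
    return "\n".join(
        f"[node {c['node_id']}, p. {c['page_index']}] {c['content']}"
        for c in citations
    )


citations = flatten_citations(response)
print(to_prompt_context(citations))
```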
RAG Comparison
PageIndex vs Vector DB
Choose the right RAG technique for your task.
PageIndex
- Financial reports and SEC filings
- Regulatory and compliance documents
- Healthcare and medical reports
- Legal contracts and case law
- Technical manuals and scientific documentation
Vector DB
- Vibe retrieval
- Semantic recommendation systems
- Creative writing and ideation tools
- Short news/email retrieval
- Generic knowledge question answering
Case Study
PageIndex Powers Leading Industry Models
PageIndex forms the foundation of Mafin 2.5, a leading RAG model for financial report analysis, achieving 98.7% accuracy on FinanceBench — the highest in the market.
- 30%: RAG with Vector DB (one vector index for all the documents)
- 50%: RAG with Vector DB (one vector index for each document)
- 98.7%: RAG with PageIndex (query-to-SQL for document-level retrieval, PageIndex for node-level retrieval)
The results of RAG with Vector DB are from the FinanceBench paper.
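The PageIndex configuration pairs a SQL step that selects documents with a tree step that selects nodes inside them. The source gives no implementation details, so the Python sketch below is purely hypothetical: the `documents` table, its columns, the company names, and the toy trees are all invented for illustration, a fixed parameterized query stands in for SQL that a model would generate from the question, and a title match stands in for reasoning-based node search.

```python
import sqlite3

# Hypothetical document metadata table (schema invented for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "doc_id TEXT PRIMARY KEY, company TEXT, fiscal_year INTEGER, doc_type TEXT)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    [
        ("doc-1", "ExampleCorp", 2022, "10-K"),
        ("doc-2", "ExampleCorp", 2023, "10-K"),
        ("doc-3", "OtherCo", 2023, "10-K"),
    ],
)


def make_tree(revenue_page: int) -> dict:
    """Build a tiny toy tree; real trees would come from tree generation."""
    return {
        "title": "Annual Report", "node_id": "0000", "page_index": 1,
        "nodes": [{"title": "Revenue", "node_id": "0001",
                   "page_index": revenue_page, "nodes": []}],
    }


trees = {"doc-1": make_tree(38), "doc-2": make_tree(40), "doc-3": make_tree(52)}


def select_documents(company: str, fiscal_year: int) -> list:
    """Stage 1: document-level retrieval via SQL over metadata."""
    rows = conn.execute(
        "SELECT doc_id FROM documents "
        "WHERE company = ? AND fiscal_year = ? ORDER BY doc_id",
        (company, fiscal_year),
    ).fetchall()
    return [row[0] for row in rows]


def find_nodes(node: dict, keyword: str) -> list:
    """Stage 2 stand-in: collect (node_id, page_index) where the title matches."""
    hits = []
    if keyword.lower() in node["title"].lower():
        hits.append((node["node_id"], node["page_index"]))
    for child in node.get("nodes", []):
        hits.extend(find_nodes(child, keyword))
    return hits


def retrieve(company: str, fiscal_year: int, keyword: str) -> dict:
    """Run node-level search only inside the documents selected by SQL."""
    return {
        doc_id: find_nodes(trees[doc_id], keyword)
        for doc_id in select_documents(company, fiscal_year)
    }


print(retrieve("ExampleCorp", 2023, "revenue"))
```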