The First Long-Context OCR Model
PageIndex OCR
Designed to preserve the global structure of documents, understanding entire multi-page documents as unified structures rather than isolated pages.
Current OCR Limitations
Process Each Page Independently
Current OCR models process each page in isolation, so context that spans pages is lost and the document's structure breaks.
Broken Markdown Rendering & Lost Context
Incorrect heading levels break the Markdown hierarchy, making the output hard for downstream LLMs to use.
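To make the heading-level problem concrete, here is a minimal Python sketch that flags headings which skip a level in stitched Markdown output. The function name `heading_level_jumps` and the sample pages are made up for illustration and are not part of PageIndex. The check only catches skipped levels (for example `###` directly under `#`); it cannot tell that a subsection was wrongly promoted to a top-level heading, which needs knowledge of the whole document.

```python
import re

# ATX-style Markdown heading: 1 to 6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def heading_level_jumps(markdown: str) -> list[tuple[int, str, int, int]]:
    """Return (line_no, title, previous_level, level) for each heading that
    is more than one level deeper than the heading before it."""
    jumps = []
    previous = 0  # level of the most recent heading; 0 means none seen yet
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if level > previous + 1:
            jumps.append((line_no, match.group(2).strip(), previous, level))
        previous = level
    return jumps

# Hypothetical output of two pages converted independently and concatenated.
page_1 = "# Annual Report\n## Financial Stability\nBody text."
page_2 = "# Monitoring Financial Vulnerabilities\n### Details\nMore text."
stitched = page_1 + "\n" + page_2

print(heading_level_jumps(stitched))  # [(5, 'Details', 1, 3)]
```

Note that the check stays silent about line 4, where a subsection restarts at `#`: that kind of error is only visible with cross-page context.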
PageIndex OCR
Model with Long Context Window
The PageIndex OCR model has a long context window, enabling it to understand the entire document as a unified structure.
Global Structure Preservation
Preserves hierarchical organization across page boundaries for complete document understanding and context.
Reasoning-Native Index
PageIndex Tree Generation
Documents are indexed as hierarchical trees that maintain the original document's logical flow and organizational structure. This LLM-optimized tree representation enables precise navigation and is ready for reasoning-based RAG.
No Vector DB Required
Tree structures are represented as lightweight JSON objects, avoiding the overhead and complexity of vector databases.
PageIndexTree.json
# Example of PageIndex Tree Structure
{
  "title": "Financial Stability",
  "node_id": "0006",
  "page_index": 21,
  "text": "The Federal Reserve maintains financial stability through comprehensive...",
  "prefix_summary": "This section discusses...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "page_index": 22,
      "text": "The Federal Reserve's monitoring focuses on identifying...",
      "summary": "This section discusses..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "page_index": 28,
      "text": "In 2023, the Federal Reserve collaborated internationally...",
      "summary": "This section discusses..."
    }
  ]
}
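Because the tree is plain JSON, it can be handled with a standard JSON parser and ordinary data structures. The Python sketch below uses an abbreviated copy of the example tree above (titles, node IDs, and page indices only). The helper `walk` and the `by_id` lookup are illustrative names, not part of any PageIndex API.

```python
import json

# Abbreviated copy of the example tree: text and summary fields are omitted.
TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "page_index": 21,
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities",
     "node_id": "0007", "page_index": 22},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "page_index": 28}
  ]
}
"""

def walk(node, depth=0):
    """Yield (depth, node) for a node and all of its descendants, depth-first."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)

tree = json.loads(TREE_JSON)

# A plain dict keyed by node_id gives direct lookup of any section.
by_id = {node["node_id"]: node for _, node in walk(tree)}

# An indented outline recovers the section hierarchy with page references.
outline = [
    f"{'  ' * depth}{node['title']} (p. {node['page_index']})"
    for depth, node in walk(tree)
]
print("\n".join(outline))
print(by_id["0007"]["title"])
```

The outline and the ID lookup both come from a single traversal, which is the sense in which the structure is lightweight: nothing beyond the JSON itself has to be stored.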
Human-like Tree Search
PageIndex Retrieval
PageIndex uses tree search with multi-step reasoning to retrieve information from complex documents. This approach simulates how human experts systematically navigate and extract insights from lengthy documents.
No Top-K Selection Required
Tree search automatically identifies all relevant tree nodes without manual parameter tuning.
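The idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not PageIndex's retrieval implementation: the relevance judgment, which in a reasoning-based system would be made by an LLM, is replaced here by a simple keyword match (`keyword_judge`, a hypothetical stand-in), and `tree_search` is an illustrative name. The search prunes branches judged irrelevant and returns the deepest relevant nodes, so the number of results follows from the tree and the judgments rather than from a fixed k.

```python
from typing import Callable

Node = dict

def tree_search(node: Node, is_relevant: Callable[[Node], bool]) -> list[Node]:
    """Return the deepest relevant nodes, skipping irrelevant subtrees."""
    if not is_relevant(node):
        return []  # prune: do not descend into this branch
    hits: list[Node] = []
    for child in node.get("nodes", []):
        hits.extend(tree_search(child, is_relevant))
    # If no child is more specific, the node itself is the best match.
    return hits or [node]

def keyword_judge(query: str) -> Callable[[Node], bool]:
    """Toy stand-in for an LLM judgment: any query word appears in the title."""
    terms = set(query.lower().split())
    def judge(node: Node) -> bool:
        return any(word in terms for word in node.get("title", "").lower().split())
    return judge

# Titles and node IDs taken from the example tree; other fields omitted.
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007"},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008"},
    ],
}

hits = tree_search(tree, keyword_judge("monitoring financial vulnerabilities"))
print([n["node_id"] for n in hits])  # ['0007']
```

A broader query such as "international cooperation on financial stability" returns both child nodes, and an unrelated query returns none, without any parameter being changed.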