PageIndex API

Bring Powerful Long-Document Understanding to Your Workflow

Get direct API access to PageIndex tree generation and long-document understanding. Integrate them into your applications to query, analyze, and reason over complex documents with precision.

The First Long-Context OCR Model

PageIndex OCR

PageIndex OCR is designed to preserve a document's global structure: it reads an entire multi-page document as one unified structure rather than as a set of isolated pages.

Current OCR Limitations

Process Each Page Independently

Current OCR models process each page independently, so context is lost across pages and the document's structure breaks at page boundaries.

Broken Markdown Rendering & Lost Context

Incorrect heading levels break Markdown rendering, which makes the output hard for downstream LLMs to use.

PageIndex OCR

Model with Long Context Window

The PageIndex OCR model has a long context window, which lets it read the entire document as a unified structure.

Global Structure Preservation

Preserves hierarchical organization across page boundaries for complete document understanding and context.

Reasoning-Native Index

PageIndex Tree Generation

Documents are indexed as hierarchical trees that maintain the original document's logical flow and organizational structure. This LLM-optimized tree representation enables precise navigation and is ready for reasoning-based RAG.

No Vector DB Required

Tree structures are represented as lightweight JSON objects, avoiding the overhead and complexity of vector databases.

PageIndexTree.json

# Example of PageIndex Tree Structure

{
  "title": "Financial Stability",
  "node_id": "0006",
  "page_index": 21,
  "text": "The Federal Reserve maintains financial stability through comprehensive...",
  "prefix_summary": "This section discusses...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "page_index": 22,
      "text": "The Federal Reserve's monitoring focuses on identifying...",
      "summary": "This section discusses..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "page_index": 28,
      "text": "In 2023, the Federal Reserve collaborated internationally...",
      "summary": "This section discusses..."
    }
  ]
}
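Because the tree is plain JSON, it can be loaded and navigated with a few lines of standard-library code. The sketch below is only an illustration: it assumes the field names shown in the example above (`title`, `node_id`, `page_index`, `nodes`), uses an abbreviated copy of that example, and the helper names `walk`, `find_node`, and `outline` are hypothetical, not part of any PageIndex API.

```python
import json
from typing import Any, Dict, Iterator, List, Optional, Tuple

Node = Dict[str, Any]

# Abbreviated copy of the example tree above ("text" and summaries omitted).
TREE_JSON = """
{
  "title": "Financial Stability",
  "node_id": "0006",
  "page_index": 21,
  "nodes": [
    {"title": "Monitoring Financial Vulnerabilities",
     "node_id": "0007", "page_index": 22},
    {"title": "Domestic and International Cooperation and Coordination",
     "node_id": "0008", "page_index": 28}
  ]
}
"""


def walk(node: Node, depth: int = 0) -> Iterator[Tuple[int, Node]]:
    """Yield (depth, node) pairs in document order (pre-order traversal)."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)


def find_node(root: Node, node_id: str) -> Optional[Node]:
    """Return the node with the given node_id, or None if it is absent."""
    for _, node in walk(root):
        if node.get("node_id") == node_id:
            return node
    return None


def outline(root: Node) -> List[str]:
    """Render the tree as indented outline lines, two spaces per level."""
    return [
        f"{'  ' * depth}{node['node_id']} {node['title']} (p. {node['page_index']})"
        for depth, node in walk(root)
    ]


tree = json.loads(TREE_JSON)
print("\n".join(outline(tree)))
```

Keeping the hierarchy as nested `nodes` lists means a section's position in the document is recoverable from the traversal depth alone, with no separate index to maintain.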

Human-like Tree Search

PageIndex Retrieval

PageIndex uses tree search with multi-step reasoning to retrieve information from complex documents. This approach simulates how human experts systematically navigate and extract insights from lengthy documents.

No Top-K Selection Required

Tree search automatically identifies all relevant tree nodes without manual parameter tuning.
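To make the idea of retrieval without a top-K parameter concrete, here is a minimal sketch of a pruning tree search. It is not PageIndex's actual retrieval algorithm: the relevance judgment, which the description above attributes to multi-step reasoning, is replaced by a hypothetical keyword-matching stub (`keyword_judge`), and the function names and the small demo tree are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

Node = Dict[str, Any]

# Small demo tree using the same field names as the example tree structure.
TREE: Node = {
    "title": "Financial Stability",
    "node_id": "0006",
    "nodes": [
        {"title": "Monitoring Financial Vulnerabilities", "node_id": "0007"},
        {"title": "Domestic and International Cooperation and Coordination",
         "node_id": "0008"},
    ],
}


def subtree_text(node: Node) -> str:
    """Lower-cased titles and summaries of a node and all its descendants."""
    parts = [
        str(node.get("title", "")),
        str(node.get("summary", node.get("prefix_summary", ""))),
    ]
    parts.extend(subtree_text(child) for child in node.get("nodes", []))
    return " ".join(parts).lower()


def keyword_judge(query: str) -> Callable[[Node], bool]:
    """Hypothetical stand-in for a reasoning step: substring keyword match."""
    terms = query.lower().split()
    return lambda node: any(term in subtree_text(node) for term in terms)


def tree_search(node: Node, is_relevant: Callable[[Node], bool]) -> List[Node]:
    """Return the deepest relevant nodes under `node`, pruning other branches."""
    if not is_relevant(node):
        return []  # skip this whole branch
    hits: List[Node] = []
    for child in node.get("nodes", []):
        hits.extend(tree_search(child, is_relevant))
    # If no child is a more specific match, the node itself is the result.
    return hits or [node]


for query in ("international cooperation", "vulnerabilities cooperation"):
    ids = [n["node_id"] for n in tree_search(TREE, keyword_judge(query))]
    print(query, "->", ids)
```

The number of returned nodes follows from the relevance judgments themselves, so one query can yield a single node and another can yield several, without a fixed cutoff.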


Want to integrate PageIndex with your LLMs or AI agents?