Logo
Product

PageIndex OCR: The First Long-Context OCR Model

Published on
avatar

Mingtian

Introduction

In the age of AI, OCR (Optical Character Recognition) is no longer just about extracting text — it's about extracting structured knowledge from documents. Most OCR systems are optimized for per-page text recognition, but they are not designed to recognize documents as cohesive, multi-page entities. To support modern AI workflows OCR must move beyond isolated page processing and toward understanding entire documents as unified structures.

We introduce PageIndex OCR — the first long-context OCR model designed to preserve global structure of the documents. It significantly outperforms other leading OCR tools from Mistral and Contextual AI in recognizing the hierarchical layout and semantic relationships across pages.

Current OCR Sees Individual Pages — Not Documents

Most current popular OCR systems analyze each page in isolation — dividing it into blocks, processing each block independently, and ultimately returning a flat, fragmented output. This causes serious issues:

  • Broken Markdown Rendering and Poor Readability: Headings, subheadings, and sections that span across pages lose their structural relationships, resulting in markdown that lacks proper formatting and visual hierarchy. This not only breaks rendering but also makes documents difficult for humans to read, navigate, and follow logically.
  • Lost Context, Weak Searchability: Relationships across pages, sections, and concepts are ignored, causing fragmented understanding and retrieval results that lack the context needed for relevance.
  • Poor Downstream AI Usability: Whether for training AI models or automating workflows, flat, per-page OCR output lacks the depth needed for reasoning or intelligent indexing.

PageIndex OCR Preserves The Document Structure

Instead of treating each page as a standalone input, PageIndex OCR leverages the context window of large vision-language models and treats the entire document as a cohesive, structured whole. It can not only generate accurate page-level markdown content, but can also preserve the hierarchical organization of content — titles, sections, subsections, bullet lists, tables, references — across page boundaries.

  • Accurate Page-Level Markdown Content: PageIndex OCR can transform each page into LLM-ready markdown text.
  • Preserving Multi-Page Structure: PageIndex OCR preserves the hierarchical structure of the whole document. Significantly improves markdown rendering and document representation.

Beyond Markdown: Structured Tree Generation with PageIndex OCR

In addition to the classic markdown output, PageIndex OCR also supports generating a "table of contents (ToC)" tree structure which represents the document. Each node comes with its node text and the correct hierarchical relationships with its parent and child nodes. This enables PageIndex OCR to be not just a markdown tool, but also provides better representation of the document and enables more downstream tasks:

  • For Humans: Enjoy rich, auto-generated ToCs for better navigation and markdown rendering.
  • For AI: Enable hierarchical search and reasoning. A structured output tree allows systems to perform ToC-based retrieval, offering greater transparency and precision.

See this introduction to learn more about conducting reasoning-based RAG with PageIndex tree.

PageIndex OCR Outperforms Leading OCR Tools

Here's a comparison between PageIndex OCR and other popular OCR tools on a sample paper, DeepSeek-R1, highlighting their ability to preserve document structure. Specifically, we extract markdown from Mistral OCR API and PageIndex OCR API and extract the titles, marked with "#" from both output. The results clearly show that PageIndex OCR preserves the correct hierarchical structure of the document, whereas competing tools often flatten or misinterpret section relationships.

summary

You can reproduce this comparison using our open-source notebook here.

Get Started with the PageIndex OCR API

Ready to move beyond single-page OCR and unlock the power of global document structure for downstream LLM and AI tasks?

  • 🚀 Dashboard - Manage your API keys, monitor usage, and explore results — all in one clean, developer-friendly interface.

  • 📘 API Reference - Learn how to integrate PageIndex OCR into your stack. Our documentation includes detailed guides, API specs, and example code to get you started quickly.

Product Announcement

PageIndex Introduction

PageIndex Introduction Video Thumbnail