
Introducing PageIndex File System

PageIndex now scales to millions of documents

Available today for enterprise. Cloud rollout coming soon.

We started PageIndex with one belief: retrieval over long documents should look more like human reading than like semantic similarity search. Since launch, the open-source PageIndex, one of the fastest-growing AI-infra repos on GitHub, has crossed 26k GitHub stars in a few months, hit #1 on GitHub Trending, been selected for the GitHub Secure Open Source Fund, and now serves 23k+ cloud users in production.

Today we're announcing the next chapter: the PageIndex File System, a new layer on top of the vectorless retrieval engine that lets a single index reason over millions of documents. It ships today as part of PageIndex Enterprise, with a cloud edition arriving later this month.

This post is a quick tour: why classic vector-based RAG hits a ceiling, what PageIndex is, why a plain file system stops working at this scale, and what the PageIndex File System adds to get past it.


Where classic vector-based RAG breaks

The standard RAG recipe is by now familiar: chunk every document into passages, run each chunk through an embedding model to get a fixed-size vector, store those vectors in a vector database, and at query time embed the question and pull back the top-K nearest neighbors. It works, until it doesn't. Two things go wrong, and both get worse as the corpus grows.

1. Embeddings have limited representation power. A single fixed-length vector has to summarize an entire chunk into a few hundred numbers, and embedding models cap their input length at a few hundred or a few thousand tokens. That cap forces two compromises that quietly degrade quality:

  • Chunking breaks semantic continuity. Real documents have sections, tables, footnotes, and cross-references that flow across page boundaries. Slicing them into fixed-size windows shreds those dependencies. The chunk that contains the answer is often missing the context that makes the answer make sense.
  • Retrieval is blind to context. Only the user's literal query gets embedded. The conversation that came before, the user's role, the evolving intent of a multi-turn dialogue: all of that has to be discarded before encoding. The retriever sees a context-stripped probe, not a real question in a real situation.

2. Similarity is not the same as relevance. Vector search ranks by cosine similarity to the query. But what users actually want is relevance, and the two come apart in both directions:

  • Similar but not relevant (low accuracy). In professional domains (legal, medical, financial), language is repetitive and small differences carry critical meaning. Two paragraphs can look almost identical to an embedding model and yet say opposite things about who is liable, what dose to give, or which clause applies. Vector search happily returns the wrong one because it "looks right".
  • Relevant but not similar (low recall). Conversely, the right answer is often phrased very differently from the query, or lives many sections away from the most-cited passage. Finding it takes reasoning over the document's structure, not surface-level word matching. Vector search has no mechanism for that, so the genuinely relevant chunk falls past rank K and disappears silently. You don't get an error; you just get a worse answer.
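
Concretely, the recipe at the top of this section boils down to something like the sketch below. It's a minimal illustration, not any particular library: `embed` stands in for whatever embedding model you use, and `VectorIndex` for whatever vector database sits behind it.

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model: any fixed-length encoder fits here."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2 ** 32)
    v = np.random.default_rng(seed).normal(size=256)
    return v / np.linalg.norm(v)

def chunk(document: str, size: int = 500) -> list[str]:
    # Fixed-size windows: the step that shreds cross-section context.
    return [document[i:i + size] for i in range(0, len(document), size)]

class VectorIndex:
    def __init__(self) -> None:
        self.vectors: list[np.ndarray] = []
        self.chunks: list[str] = []

    def add(self, document: str) -> None:
        for c in chunk(document):
            self.vectors.append(embed(c))
            self.chunks.append(c)

    def top_k(self, query: str, k: int = 5) -> list[str]:
        # Cosine ranking against the bare query; rank k+1 is dropped silently.
        q = embed(query)
        order = sorted(range(len(self.vectors)),
                       key=lambda i: float(q @ self.vectors[i]), reverse=True)
        return [self.chunks[i] for i in order[:k]]
```

Everything the retriever will ever know about a passage is frozen into `embed(c)` at ingestion time, and everything it knows about the user is whatever survives in `embed(query)`.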

These aren't edge cases. They're the two failure modes our enterprise customers hit again and again, and they're exactly what motivated us to build a different kind of retriever.


What is PageIndex?

PageIndex is a vectorless RAG framework. Instead of chopping documents into chunks, embedding them into vectors, and ranking by cosine similarity, PageIndex represents each document as a tree (sections nest into subsections, subsections into pages, pages into content blocks) and lets an LLM navigate the tree to find the answer.

The shape of the tree is the table of contents you'd see in a book. The retrieval policy is an LLM that, at each node, asks a single question: given the user's query, the conversation so far, and where I am in the document, should I look inside this subtree? No fixed top-K, no embedding bottleneck, no information dropped silently because it ranked K+1.
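
In code, that amounts to a tree of nodes plus a recursive walk that asks one yes/no question per node. The sketch below is ours, not the actual PageIndex API: `Node` and `llm_should_open` are hypothetical names, and the keyword check inside `llm_should_open` merely stands in for the real LLM judgment, which sees the query, the conversation, and the node's place in the document.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional

@dataclass
class Node:
    title: str                                   # section / subsection / page title
    summary: str = ""                            # short description of what lives below
    text: str = ""                               # content, present only on leaf blocks
    children: list["Node"] = field(default_factory=list)

def llm_should_open(node: Node, query: str, context: str) -> bool:
    """Stand-in for the per-node LLM judgment: is this subtree worth opening?
    The real call conditions on the query, the conversation, and the node summary."""
    probe = f"{node.title} {node.summary}".lower()
    return any(word in probe for word in query.lower().split())

def retrieve(node: Node, query: str, context: str = "",
             path: Optional[list[str]] = None) -> Iterator[tuple[list[str], Node]]:
    """Walk the document tree; yield (path, leaf) pairs as an auditable trace."""
    path = (path or []) + [node.title]
    if not node.children:                        # leaf block: candidate evidence
        yield path, node
        return
    for child in node.children:
        if llm_should_open(child, query, context):   # yes/no per node, no fixed top-K
            yield from retrieve(child, query, context, path)
```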

Three properties fall out of this design, and each one is exactly what classic vector RAG cannot offer:

  • Relevance classification, not semantic similarity. The LLM doesn't compute a cosine score; it makes a yes/no judgment at every node (is this subtree worth opening for this query?) using full-document understanding, not a 768-dimensional proxy. The two failure modes of similarity search (similar-but-irrelevant, relevant-but-dissimilar) simply don't apply.
  • Retrieval depends on context. The decision at each node is conditioned on the query, the conversation history, the user's role, and the path the LLM has already walked. There's no fixed-length cap forcing context to be discarded. Context shapes every navigation step.
  • Transparent retrieval process. The search trace is a readable path through the tree: which sections were opened, which were skipped, which yielded the evidence. You can audit why an answer came back, replay the same path for a different model, and surface the citation chain to the end user. Vector search returns a list of chunks with no story; PageIndex returns a route.

From single document to file system

Here is the obvious objection. Classic vector RAG scales effortlessly to millions of documents: embeddings are pre-computed once, top-K lookup over a billion vectors is a well-solved engineering problem, and the index doesn't care whether you have 1k or 100M chunks. PageIndex, by contrast, is built on a document tree: a richer structure, but one that an LLM has to navigate. Won't the LLM choke when there are a million trees to walk?

It's a fair question, and the answer starts with an observation we can almost take for granted: a file system is already a tree. Folders contain subfolders, subfolders contain files. That is a pre-defined hierarchy over a corpus, given to us for free by every document store that has ever existed. So the natural way to scale tree search across documents is to make the file system itself a node-level layer: each document tree hangs off a leaf of the file system tree, and the whole corpus becomes one big tree.

This unification is the key. The same tree search policy that an LLM uses to navigate a section of a 100-page report can navigate the directory hierarchy of a million-document drive and then descend, without changing tools, into a specific document's internal tree.
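
Reusing the hypothetical `Node` and `retrieve` from the sketch above, mounting document trees under folder nodes doesn't require any new machinery; folders simply become more internal nodes:

```python
# Folders become internal nodes; each document's own tree hangs off a leaf of
# the folder hierarchy, so one retrieve() call spans the whole corpus.
vendor_report = Node(
    title="Q4 2024 vendor report",
    summary="pricing, renewal terms, liability",
    children=[
        Node(title="Pricing", text="(section content)"),
        Node(title="Renewal terms", text="(section content)"),
    ],
)

corpus = Node(
    title="/",
    children=[
        Node(title="contracts", summary="vendor agreements by region",
             children=[vendor_report]),
        Node(title="hr", summary="policies and handbooks"),
    ],
)

# Same policy, one tree: folder levels first, then the document's own sections.
hits = list(retrieve(corpus, query="vendor renewal terms"))
```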

That is the basic shape. But, as the next section shows, a plain file system is not enough at a million-document scale: the hierarchy you inherit from disk is rarely the hierarchy you want to search by. The rest of this post is about how to fix that without abandoning the tree search framework.


Why a simple file system isn't enough

If you have a few thousand documents that already live in a tidy folder hierarchy, you can take that folder tree, point an LLM at it, and call it a day. PageIndex has supported that "inherit the folder structure" mode from day one.

At a million documents, this stops working. Three reasons.

1. Often, there is no folder hierarchy at all. Many enterprise corpora live in document management systems, S3 buckets, or SharePoint libraries that are effectively flat: every file in one giant pool, with nothing but a row of metadata fields (author, date, type) and sometimes not even that. A SQL query over those fields handles the easy cases, but anything that needs content-level understanding has no tree to navigate, because no tree was ever built.

2. The hierarchy is one-dimensional. Even when a folder tree exists, a document is rarely "about" exactly one thing. A contract belongs simultaneously to a vendor, a region, a fiscal year, and a product line. A folder tree forces you to pick one axis. The other axes, the ones the user is actually querying on, are gone.

3. Folder labels are unreliable signals for an LLM. Real corpora accumulate folders called misc/, final_v3_USE_THIS_ONE/, and 2019_legacy/. Even tidy-looking paths like /finance/2024/ say nothing about whether the document discusses pricing risk or liquidity. The LLM ends up reasoning over folder names rather than document meaning, and prunes in the wrong direction.

So the question becomes: what do you put inside the tree when no good tree exists yet?


The PageIndex File System

The PageIndex File System is what we built to fix all three. It's a query-time tree layer that sits above your documents and lets the same tree search policy scale from a single document to millions. Three technical pieces, all live in the enterprise release:

Virtual nodes: synthesizing the structure

When the corpus has no usable hierarchy, PageIndex builds one. Documents are clustered into topic nodes by topic models or LLM-driven grouping; each document also gets LLM-inferred metadata (category, summary, key entities), whose values become additional internal nodes in the tree. The result is a hierarchy whose internal labels are semantic: exactly the signal the LLM needs to prune branches early.

Crucially, the same document can sit under more than one virtual ancestor (vendor and region and year). A flat file system can't express this; a PageIndex tree can.
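
A minimal sketch of the idea, assuming each document carries a dict of LLM-inferred metadata (the field names and values here are invented for illustration): grouping by a facet produces a layer of virtual parent nodes, and the same document can appear under several of them.

```python
from collections import defaultdict

# Hypothetical per-document metadata, as an LLM might infer it at ingestion time.
docs = [
    {"id": "contract_017", "vendor": "Acme",   "region": "EMEA", "year": "2024"},
    {"id": "contract_018", "vendor": "Acme",   "region": "APAC", "year": "2024"},
    {"id": "contract_019", "vendor": "Globex", "region": "EMEA", "year": "2023"},
]

def virtual_layer(documents: list[dict], facet: str) -> dict[str, list[dict]]:
    """Group documents under virtual parent nodes named after one metadata facet."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for doc in documents:
        groups[f"{facet}={doc[facet]}"].append(doc)
    return groups

# contract_017 sits under vendor=Acme, region=EMEA, and year=2024 simultaneously,
# something a physical folder tree cannot express.
by_vendor = virtual_layer(docs, "vendor")
by_region = virtual_layer(docs, "region")
by_year   = virtual_layer(docs, "year")
```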

The tree is query-dependent

A traditional file system has one hierarchy, fixed at ingestion. That works for storage; it does not work for retrieval, because no single hierarchy is right for every question. "What did vendor X charge us in 2024?" wants a tree organized by vendor, then by year. "Show me all contracts up for renewal next quarter" wants a tree organized by status and renewal date. Same corpus, two completely different trees.

PageIndex builds the file system tree on demand, conditioned on the query. Given a question, it picks which metadata axes to use as internal nodes, which clusters to surface, and how deep to nest them, so the LLM is always navigating a hierarchy that is informative for this query. Different queries produce different views over the same documents, without re-ingesting or re-embedding anything.
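
Continuing the sketch above (and reusing its `docs` and `virtual_layer`), a hypothetical `choose_axes` stands in for the LLM's choice of which metadata axes matter for a given query; nesting the virtual layers in that order yields a different view of the same documents per query.

```python
def choose_axes(query: str, available: list[str]) -> list[str]:
    """Stand-in for the LLM picking which metadata axes matter for this query."""
    picked = [axis for axis in available if axis in query.lower()]
    return picked or available            # fall back to a default ordering

def build_view(documents: list[dict], axes: list[str]) -> dict:
    """Nest virtual layers in the chosen order; leaves are document ids."""
    if not axes:
        return {"docs": [doc["id"] for doc in documents]}
    head, rest = axes[0], axes[1:]
    return {label: build_view(group, rest)
            for label, group in virtual_layer(documents, head).items()}

# "What did vendor Acme charge us in 2024?"  ->  a vendor-first view.
view_a = build_view(docs, choose_axes("what did vendor Acme charge us in 2024?",
                                      ["vendor", "year", "region"]))
# "Show me all contracts by region."         ->  a region-first view, same documents.
view_b = build_view(docs, choose_axes("show me all contracts by region",
                                      ["vendor", "year", "region"]))
```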

The same machinery makes the index improve over time: traversal patterns from past queries refine the virtual nodes and metadata, so the more the system is used, the better the tree it builds for the next question.

Tree search adapts to whether the structure is informative

A static traversal (always going layer by layer) is the wrong default at scale. Sometimes a node's children carry rich, query-relevant labels (/contracts/2024/vendor_X/), and the LLM should descend one layer at a time, using each label to prune. Sometimes the labels are uninformative (misc/, folder_1/, an arbitrary user-uploaded directory), and walking the structure layer by layer just burns LLM calls on signal-free intermediate nodes.

PageIndex picks the strategy per node, conditioned on the query:

  • Layer-wise: when child labels are informative for this query, return the children, prune by their labels.
  • Recursive (dynamic flattening): when child labels are uninformative, collapse the subtree down to its leaves and defer the discrimination to the actual content. Uninformative levels are bypassed entirely.

This dynamic flattening is what keeps tree search efficient at a million-document scale. The LLM never has to read structure that doesn't help it; the depth of the search shrinks to the depth that actually carries information for the question being asked.
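
Sketching the per-node choice with the hypothetical `Node` and `llm_should_open` from earlier: `labels_informative` stands in for the LLM's judgment of whether a node's children carry useful structure for this query, and the fallback path collapses the subtree straight to its leaves.

```python
def labels_informative(node: Node, query: str) -> bool:
    """Stand-in for the LLM judging whether child labels help prune for this query;
    the real judgment is query-conditioned, not a fixed blocklist."""
    junk = {"misc", "untitled", ""}
    return all(child.title.lower() not in junk
               and not child.title.lower().startswith("folder_")
               for child in node.children)

def leaves(node: Node):
    """Collapse a subtree down to its leaf content nodes (dynamic flattening)."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)

def search(node: Node, query: str):
    if not node.children:
        yield node
        return
    if labels_informative(node, query):
        # Layer-wise: descend one level at a time, pruning by child labels.
        for child in node.children:
            if llm_should_open(child, query, context=""):
                yield from search(child, query)
    else:
        # Dynamic flattening: skip signal-free levels and judge the leaf content
        # (the stand-in still reads titles; the real call would read the content).
        for leaf in leaves(node):
            if llm_should_open(leaf, query, context=""):
                yield leaf
```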


What's available today

PageIndex Enterprise, with the PageIndex File System included, is generally available now:

  • Single-index scale to millions of documents via the PageIndex File System
  • Virtual-node synthesis and query-dependent index construction
  • Dedicated or VPC deployment

The PageIndex Cloud edition (same engine and File System, fully managed, pay-as-you-go) is rolling out later this month. Existing OSS users keep everything that's in the open-source repo, and the OSS roadmap continues unchanged.

If you're hitting the wall with your current vector RAG stack — accuracy plateaus, recall holes you can't audit, indexes you can't update without a full re-embedding job — get in touch. We'd love to show you what tree search at scale looks like.

The PageIndex Team