OpenKB: An Open-Source LLM Knowledge Base

PageIndex Team


Introduction

Last week, Andrej Karpathy, a founding member of OpenAI and former Director of AI at Tesla, described a workflow he called "LLM Knowledge Bases": using LLMs not just to manipulate code but to compile knowledge. An LLM ingests raw documents, papers, and articles, and continuously builds and maintains a structured, interlinked wiki in Markdown. Browse it in Obsidian, query it with an LLM agent, file the answers back in. Knowledge compounds; nothing is re-derived from scratch on every query.

The response was immediate. The thread went viral. Thousands of developers recognized the pattern. This was not a toy demo or an academic proposal; it was a concrete, working workflow that anyone could adopt today.

But Karpathy himself flagged the hard part. In a reply to the thread, he noted that long books and PDFs remain difficult, and suggested using EPUB instead of PDF where possible, or otherwise processing one chapter at a time.

That problem is exactly what we work on at PageIndex. So we built OpenKB.

What is OpenKB

OpenKB is an open-source, CLI-based system that implements Karpathy's vision end-to-end, and extends it to handle the long-document problem he identified. Drop raw files into a directory; an LLM compiles them into a structured, interlinked wiki of Markdown files. Query the wiki. Run health checks. Watch mode picks up new files automatically. Open the whole thing in Obsidian.

The fundamental insight, which Karpathy articulated clearly, is that knowledge should accumulate. Traditional RAG rediscovers knowledge from scratch on every query. Nothing builds up. OpenKB compiles knowledge once into a persistent wiki, then keeps it current as you add new sources. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything you have consumed, not just what happened to surface in a single retrieval pass.

Solving the Hard Part: Long PDFs

Karpathy's workflow works beautifully for web articles and short papers, where the LLM can read full text directly. But long documents (e.g., hundred-page reports, technical manuals, dense research papers) break this model. You cannot simply feed an 800-page PDF into a context window and expect coherent synthesis.

This is where OpenKB is meaningfully different, and where our work on PageIndex becomes directly relevant.

Long documents are handled via PageIndex tree indexing. Rather than reading a long PDF in full, PageIndex builds a hierarchical tree index of the document, mirroring how a human actually navigates a long text: reading the table of contents, identifying relevant sections, and drilling down. The LLM reads the tree and reasons over it to retrieve what it needs, rather than scanning the full text or relying on static semantic similarity. This approach is vectorless and requires no chunking or vector database.
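To make the navigation idea concrete, here is a minimal sketch of tree-search retrieval over a document's table of contents. All names (`Node`, `navigate`) are illustrative, not the PageIndex API, and the keyword test stands in for the LLM's relevance judgment:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    summary: str
    pages: tuple                      # (start_page, end_page)
    children: list = field(default_factory=list)

def navigate(node, is_relevant, results, depth=0):
    """Descend only into branches the selector deems relevant,
    collecting leaf sections worth reading in full."""
    if depth > 0 and not is_relevant(node):
        return
    if not node.children:             # leaf: a concrete section to read
        results.append(node)
        return
    for child in node.children:
        navigate(child, is_relevant, results, depth + 1)

# Toy tree index for a long report
toc = Node("Report", "Annual engineering report", (1, 800), [
    Node("Ch. 2 Reliability", "Uptime and incident analysis", (120, 240), [
        Node("2.3 Postmortems", "Root causes of incidents and outages", (180, 210)),
    ]),
    Node("Ch. 5 Finance", "Budget and hiring forecasts", (600, 700)),
])

# Query "why did we have outages?", stubbed here as a keyword test
hits = []
navigate(toc, lambda n: "incident" in (n.title + n.summary).lower(), hits)
# hits now holds only section 2.3 (pages 180-210)
```

The LLM reads a few levels of tree nodes instead of 800 pages of text, which is why this scales where full-text reading does not.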

Key Capabilities

Any format. OpenKB ingests PDF, Word, PowerPoint, Excel, HTML, Markdown, CSV, plain text, and more, via Microsoft's markitdown library. Whatever format your source material is in, you can drop it in.

Scale to long documents. Short documents are read in full by the LLM. Long PDFs (configurable threshold, default 20 pages) are indexed by PageIndex into a hierarchical tree. The LLM reads the tree instead of the full text, enabling accurate retrieval from documents that would otherwise exceed context limits or degrade in quality.

Native multi-modality. OpenKB retrieves and understands figures, tables, and embedded images, not just text. Documents are rich artifacts, and the knowledge base reflects that.

Compiled wiki. When you add a document, the LLM generates a summary page, reads existing concept pages, and creates or updates cross-document concept pages that synthesize knowledge across sources. The knowledge base gets richer with every addition.

Query. Ask questions against your compiled wiki. The LLM navigates the structured knowledge to answer, rather than searching raw documents from scratch each time.

Interactive Chat. Multi-turn conversations with persisted sessions you can resume across runs.

Lint. Health checks find contradictions, knowledge gaps, orphaned pages, and stale content. The LLM can suggest new connections and article candidates as your wiki grows.

Watch mode. Drop files into raw/ and the wiki updates automatically in the background.

Obsidian compatible. The wiki is plain .md files with [[wikilinks]]. Open the wiki/ directory as an Obsidian vault and you get graph view, backlinks, and full navigation out of the box, exactly the IDE Karpathy described.
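Because the wiki is just Markdown with `[[wikilinks]]`, the link graph is trivially recoverable by any tool, not only Obsidian. A minimal sketch of the backlink computation (illustrative, not how Obsidian or OpenKB implement it):

```python
import re

WIKILINK = re.compile(r"\[\[([^\]|#]+)")   # captures the link target

def backlinks(pages: dict) -> dict:
    """Build an incoming-link map from a {page_name: markdown_text} dict."""
    incoming = {name: [] for name in pages}
    for src, text in pages.items():
        for target in WIKILINK.findall(text):
            incoming.setdefault(target.strip(), []).append(src)
    return incoming

wiki = {
    "Transformers": "See [[Attention]] and [[Scaling Laws]].",
    "Attention": "The core mechanism behind [[Transformers]].",
}
links = backlinks(wiki)
# "Scaling Laws" is linked but has no page of its own; a lint pass
# could surface exactly this kind of gap.
```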

How It Works

The architecture is straightforward:

raw/                              You drop files here
├─ Short docs ──→ markitdown ──→ LLM reads full text
│                                     │
├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
│                                     │
│                                     ▼
│                         Wiki Compilation (using LLM)
│                                     │
▼                                     ▼
wiki/
├── index.md             Knowledge base overview
├── log.md               Operations timeline
├── AGENTS.md            Wiki schema (LLM instructions)
├── sources/             Full-text conversions
├── summaries/           Per-document summaries
├── concepts/            Cross-document synthesis ← the good stuff
├── explorations/        Saved query results
└── reports/             Lint reports

The wiki's AGENTS.md file defines the structure and conventions of the knowledge base; it is, in effect, the LLM's instruction manual for maintaining the wiki. Edit it to change how your knowledge base is organized; changes take effect immediately at runtime.

Multi-LLM support is built in via LiteLLM: OpenAI, Anthropic, Gemini, and any other LiteLLM-compatible provider.

What's Next

Karpathy ended his thread with a note that stayed with us: "I think there is room here for an incredible new product instead of a hacky collection of scripts."

OpenKB is our answer to that. It is not a collection of scripts; it is a coherent system with a defined architecture, a persistent wiki format, and a retrieval layer built specifically for the document types that matter in serious research.

We would love to hear what you think. Star the repo, open an issue, or reach out directly.


Try OpenKB now: github.com/VectifyAI/OpenKB

To learn more about the retrieval engine powering long-document handling in OpenKB, see PageIndex, our vectorless, reasoning-based RAG engine for long documents.