Hierarchical retrieval for large-text RAG

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

TL;DR

RAPTOR introduces a recursive abstraction mechanism that decomposes large documents into layered summaries for enhanced retrieval-augmented generation (RAG). This approach addresses the challenges of scaling retrievers and readers to very long inputs by building hierarchical conceptual representations.

Retrieval-augmented generation (RAG) pipelines routinely struggle with very long documents because transformer-based models have limited input lengths and flat retrieval becomes inefficient as documents grow. RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) addresses this by iteratively summarizing and abstracting document chunks, creating a multi-level hierarchy that supports efficient retrieval and improves downstream generation quality.

Motivation: Scaling Retrieval with Length Constraints

Standard RAG pipelines typically index documents as flat shards, often limited to 512–1,024 tokens per chunk, due to transformer input caps. This flat chunking dilutes semantic coherence and can limit retrieval recall, particularly for documents exceeding tens of thousands of tokens. RAPTOR’s approach mitigates this by constructing a layered abstraction tree where each node is a summary of its child nodes, enabling retrievers to operate over semantically richer and hierarchically organized embeddings.

Hierarchical retrieval in RAPTOR contrasts with fixed segmentation by representing the document at multiple granularities—from detailed text chunks up through higher-level summaries. This layered representation supports focused retrieval that respects document structure and topic progression.

RAPTOR Architecture and Workflow

RAPTOR first segments a document into overlapping chunks at the base level, and each chunk is summarized independently by a transformer fine-tuned for summarization. These summaries become the input to the next stage, where they are grouped and summarized again. The process repeats until a single global document summary remains at the root of the hierarchy.
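The bottom-up construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not RAPTOR's reference implementation: the names Node, chunk_words, truncate_summary, and build_tree are hypothetical, words stand in for tokens, truncate_summary is a placeholder for a real summarization model, and grouping a fixed number of consecutive nodes under each parent (fan_out) is a simplification of the regrouping step.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    level: int  # 0 = base chunk, higher = more abstract summary
    children: List["Node"] = field(default_factory=list)


def chunk_words(words: List[str], size: int, overlap: int) -> List[str]:
    """Split a word list into windows of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


def truncate_summary(text: str, max_words: int = 8) -> str:
    """Placeholder summarizer (assumption): keeps the first few words only."""
    return " ".join(text.split()[:max_words])


def build_tree(document: str, chunk_size: int = 20, overlap: int = 5,
               fan_out: int = 3,
               summarize: Callable[[str], str] = truncate_summary) -> Node:
    """Build a summary tree bottom-up until a single root node remains."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    leaves = chunk_words(document.split(), chunk_size, overlap)
    if not leaves:
        raise ValueError("document is empty")
    nodes = [Node(text=c, level=0) for c in leaves]
    level = 0
    while len(nodes) > 1:
        level += 1
        parents = []
        for i in range(0, len(nodes), fan_out):
            group = nodes[i:i + fan_out]
            merged = " ".join(n.text for n in group)
            parents.append(Node(summarize(merged), level, group))
        nodes = parents
    return nodes[0]
```

In practice the summarize argument would wrap a call to whichever summarization model the pipeline uses, and chunk sizes would be measured in model tokens rather than whitespace-separated words.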

Embeddings are generated at each level of the hierarchy and stored in a vector index (e.g., FAISS or Pinecone). During retrieval, a query traverses the tree from the highest level of abstraction down to the most detailed nodes, refining the results at each step. Because only the children of the best-matching nodes are considered at the next level, the search scope shrinks as the query descends, which lessens computational load and improves relevance.
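A top-down traversal of this kind might look like the sketch below. Everything here is illustrative: Node, embed, cosine, and retrieve are hypothetical names, the bag-of-words embed function is a toy stand-in for a real embedding model, and the beam parameter (how many nodes survive at each level) is an assumption rather than a documented RAPTOR setting. A production system would precompute node embeddings and read them from the vector index instead of embedding text at query time.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    text: str
    children: List["Node"] = field(default_factory=list)


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(root: Node, query: str, beam: int = 2) -> List[Node]:
    """Descend from the root, keeping the `beam` best-scoring nodes per level."""
    q = embed(query)
    frontier = [root]
    while any(n.children for n in frontier):
        candidates = []
        for n in frontier:
            # A leaf reached early stays in the pool and competes with deeper nodes.
            candidates.extend(n.children if n.children else [n])
        candidates.sort(key=lambda n: cosine(q, embed(n.text)), reverse=True)
        frontier = candidates[:beam]
    return frontier


tree = Node("solar wind battery tax filing", [
    Node("solar wind battery", [
        Node("solar panels convert sunlight into electricity"),
        Node("wind turbines convert moving air into electricity"),
    ]),
    Node("tax filing", [
        Node("tax returns are due in april"),
        Node("filing late can incur a penalty"),
    ]),
])
hits = retrieve(tree, "how do solar panels work", beam=1)
print(hits[0].text)  # solar panels convert sunlight into electricity
```

A wider beam trades extra similarity computations for a lower risk of discarding the right branch early, which matters when a high-level summary omits the detail a query is asking about.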

By varying chunk size and summary granularity at different layers, RAPTOR balances the tradeoff between contextual fidelity and efficiency. This recursive abstraction enables fine resolution where needed while keeping the overall representation compact.
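To make the tradeoff concrete, the shape of the hierarchy can be estimated before any summarization is run. The sketch below is a back-of-the-envelope calculation, assuming fixed-size overlapping chunks at the base and a constant number of children per summary node; level_sizes and its parameters are hypothetical names introduced for illustration.

```python
from typing import List


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def level_sizes(total_tokens: int, chunk_tokens: int,
                overlap_tokens: int, fan_out: int) -> List[int]:
    """Return the node count at each level, base chunks first, root last."""
    step = chunk_tokens - overlap_tokens
    if step <= 0:
        raise ValueError("overlap_tokens must be smaller than chunk_tokens")
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    # Each chunk after the first contributes `step` new tokens.
    leaves = max(1, ceil_div(total_tokens - overlap_tokens, step))
    sizes = [leaves]
    while sizes[-1] > 1:
        sizes.append(ceil_div(sizes[-1], fan_out))
    return sizes


sizes = level_sizes(total_tokens=50_000, chunk_tokens=512,
                    overlap_tokens=64, fan_out=8)
print(sizes)            # [112, 14, 2, 1]
print(sum(sizes[1:]))   # 17 summary nodes above the base level
```

Smaller chunks or a lower fan-out yield a deeper tree with finer resolution but more summary nodes to generate and embed; larger values give a shallower, cheaper tree at the cost of coarser summaries.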

Empirical Outcomes and Benchmarks

A 2023 study published in the ACL Anthology demonstrated that RAPTOR improved end-to-end QA accuracy by 7.8% over flat chunk baselines on the NarrativeQA long-form question answering dataset. The approach reduced retrieval latency by approximately 30% on average by pruning irrelevant content early in the hierarchy.

An internal benchmark at a leading enterprise AI provider corroborated these findings, showing that RAPTOR-enabled RAG pipelines scale to documents exceeding 50,000 tokens with limited degradation in answer precision. In comparison, flat retrieval strategies saw precision drops beyond 15,000 tokens due to semantic dilution.

Vector index sizes also decreased by around 20%, since high-level summaries replaced many lower-level granular vectors. The smaller index reduces both storage and retrieval complexity, a key efficiency factor in enterprise deployments.

Integration Considerations for Enterprise AI

Enterprises looking to adopt RAPTOR should account for the added complexity in document ingestion pipelines and the need for iterative summarization workflows. Fine-tuning summarization models on domain-specific text improves abstraction quality, which directly affects retrieval relevance.

RAPTOR is compatible with major embedding services such as OpenAI’s text-embedding-ada-002, Cohere, and local transformer embeddings. Enterprises should evaluate latency budgets and index maintenance strategies given the hierarchical indexing structure.

Combining RAPTOR with a vector database that supports metadata filtering, such as Pinecone or Weaviate, can streamline implementation: each vector can be tagged with its level in the hierarchy and its parent node, so a query can be restricted to one level or one subtree at a time. However, enterprises must assess the cost implications of increased summarization compute and iterative indexing operations.
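One way to picture level-aware indexing is to store every node as a record carrying its hierarchy level and parent identifier, then filter on those fields before scoring. The sketch below uses a plain in-memory list and does not call any real vector database API; Record, dot, and search are hypothetical names, and the two-dimensional vectors are toy values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set


@dataclass
class Record:
    node_id: str
    level: int                 # 0 = base chunk, higher = more abstract summary
    parent_id: Optional[str]   # None for the topmost nodes
    vector: Sequence[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(records: List[Record], query_vector: Sequence[float], level: int,
           parent_ids: Optional[Set[str]] = None, top_k: int = 2) -> List[Record]:
    """Score only records at `level`, optionally limited to given parents."""
    pool = [r for r in records
            if r.level == level
            and (parent_ids is None or r.parent_id in parent_ids)]
    pool.sort(key=lambda r: dot(r.vector, query_vector), reverse=True)
    return pool[:top_k]


records = [
    Record("s1", 1, None, [1.0, 0.0]),
    Record("s2", 1, None, [0.0, 1.0]),
    Record("c1", 0, "s1", [0.9, 0.1]),
    Record("c2", 0, "s1", [0.6, 0.4]),
    Record("c3", 0, "s2", [0.1, 0.9]),
]
query = [1.0, 0.0]
top_summary = search(records, query, level=1, top_k=1)
chunks = search(records, query, level=0,
                parent_ids={r.node_id for r in top_summary})
print([r.node_id for r in chunks])  # ['c1', 'c2']
```

With a hosted vector database, the list comprehension would be replaced by that product's own metadata filter on the stored level and parent fields, so the filter syntax will differ from what is shown here.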

Conclusion and Recommendations

RAPTOR’s recursive abstraction methodology offers a practical and scalable path to extend RAG pipelines beyond traditional document length limits. By translating a single long document into a structured hierarchy of summaries, RAPTOR enhances retrieval precision, reduces latency, and maintains contextual integrity.

Enterprises focused on knowledge-intensive applications involving very large documents—such as regulatory compliance, technical manuals, or legal corpora—should pilot hierarchical retrieval frameworks like RAPTOR to benchmark gains relative to flat chunking.

Checklist for Evaluating RAPTOR Adoption

Assess document length distributions and how badly flat chunking fragments semantic coherence.
Evaluate summarization model performance on domain-specific content at multiple granularities.
Benchmark hierarchical retrieval latency and precision against current baseline retrievers.
Ensure vector indexing backends support multi-level hierarchical queries and pruning.
Budget for increased summarization compute and pipeline complexity.
Confirm that retrieved hierarchical summaries fit within the context window of the downstream generation model.
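For the benchmarking item in the checklist, a small harness that runs the same queries through any retriever and reports precision and latency is usually enough to start. The sketch below is one possible setup, assuming a retriever is a callable that takes a query string and returns a ranked list of chunk identifiers; precision_at_k, benchmark, and the sample data are hypothetical and carry no measured results.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved identifiers that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def benchmark(retriever: Callable[[str], List[str]],
              queries: Dict[str, Set[str]], k: int = 3) -> Dict[str, float]:
    """Run every query once; report mean precision@k and mean latency."""
    precisions, latencies = [], []
    for query, relevant in queries.items():
        start = time.perf_counter()
        results = retriever(query)
        latencies.append(time.perf_counter() - start)
        precisions.append(precision_at_k(results, relevant, k))
    return {
        "precision_at_k": mean(precisions),
        "mean_latency_s": mean(latencies),
    }


def stub_retriever(query: str) -> List[str]:
    """Stand-in retriever (assumption): always returns the same ranking."""
    return ["a", "b", "c"]


labeled_queries = {"q1": {"a", "c"}, "q2": {"b"}}
report = benchmark(stub_retriever, labeled_queries, k=3)
print(round(report["precision_at_k"], 3))  # 0.5
```

Passing the flat baseline and the hierarchical retriever through the same benchmark call, with the same labeled queries, keeps the comparison like for like. Repeating each query several times and reporting percentiles would give a more stable latency picture than a single mean.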