Recursive Language Models: How to Process Infinite Context With an 8B Model

Context windows are the wrong war.
RLMs (Recursive Language Models) let a language model process inputs 100x larger than its context window by never putting the full input into the context at all. Instead, the prompt lives as a variable in a Python REPL. The model writes code to peek at slices, search with regex, chunk the input, and recursively call itself on each piece. An 8B parameter model fine-tuned this way improves 28.3% over its base and approaches GPT-5 on long-context tasks. The paper is from MIT, the code is open source, and you can run it locally right now.
Every lab is racing to extend context windows. Google ships 2 million tokens with Gemini. Anthropic offers 200K standard with 1M in beta. GPT-5.2 pushes to 400K. The assumption is that bigger windows solve the long-context problem. But there's a fundamental issue with this approach: transformers degrade on long inputs. Attention gets diluted. The model loses track of details buried in the middle — performance is highest when relevant information sits at the start or end of the context, and drops sharply for anything in between. And you pay for every token whether the model actually needs it or not.
The RLM paper from Zhang, Kraska, and Khattab at MIT takes the opposite approach. Don't make the window bigger. Make the model smarter about what it looks at.
How RLMs work: the REPL loop
The core idea is simple once you see it. Instead of feeding a 200K token document into the model's context window, you store it as a string variable in a Python REPL environment. The model never sees the raw document. It sees metadata: how long the document is, the first few hundred characters, and a set of functions it can call to interact with the content.
From there, the model writes Python code. It can peek at any slice. It can grep for patterns. It can chunk the document into pieces and recursively call itself on each chunk. When it has enough information, it returns a final answer.
Here's what the loop looks like at a high level:
1. Load the user's prompt into REPL memory as a string variable
2. Give the root LM only: the query, prompt length, a prefix, and access functions
3. Root LM generates Python code → executes in the REPL
4. Truncated stdout appends to the model's conversation history
5. Repeat steps 3-4 until the model calls FINAL(answer)
6. Return the answer
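The steps above can be sketched in a few lines of Python. This is a minimal toy, not the library's actual implementation: the `run_rlm` helper and the stub model are hypothetical stand-ins for the real loop and a real LM call.

```python
import io, re, contextlib

def run_rlm(query, prompt, model, max_steps=10):
    """Minimal REPL loop: the prompt lives as a variable, never in context."""
    env = {"PROMPT": prompt, "re": re}
    # The root model sees only metadata: query, length, and a short prefix.
    history = [f"Query: {query}\nPROMPT length: {len(prompt)} chars\n"
               f"Prefix: {prompt[:100]!r}"]
    for _ in range(max_steps):
        code = model("\n".join(history))        # model emits Python
        if code.startswith("FINAL("):           # termination signal
            return eval(code[len("FINAL("):-1], env)
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):   # execute inside the REPL
            exec(code, env)
        history.append(buf.getvalue()[:2000])   # truncated stdout
    return None

# Stub "model": greps first, then answers -- stands in for real LM calls.
def stub_model(transcript):
    if "MATCH:" not in transcript:
        return "print('MATCH:', re.search(r'secret=(\\w+)', PROMPT).group(1))"
    return "FINAL('token42')"

doc = "x" * 50_000 + " secret=token42 " + "y" * 50_000
print(run_rlm("What is the secret?", doc, stub_model))  # token42
```

Even in this toy, the defining property holds: the 100K-character document never enters the model's transcript, only a 100-character prefix and the truncated stdout of the code it ran.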
Three design choices make this work:
The prompt is a symbolic handle, not context. In a standard LLM call, the prompt goes into the context window. In an RLM, it becomes a REPL variable. The model interacts with it programmatically. This means prompt length never hits the context window limit because the prompt is never in the window.
Output construction happens in the REPL. Final answers are built as variables and returned, not generated token-by-token by the model. This bypasses output length constraints too.
Recursion is a first-class operation. The code running in the REPL can invoke the LLM on any slice of the input. The model can partition a million-token document into 50 chunks, call itself on each chunk with a specific sub-question, collect the results, and synthesize. All within a single top-level call.
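A concrete sketch of that partition-and-recurse pattern, with a hypothetical `toy_llm` callable standing in for a real sub-model call (the actual library manages this inside the REPL):

```python
def recursive_answer(llm, question, lines, window=200):
    """Partition into window-sized chunks, sub-call on each, synthesize."""
    if len(lines) <= window:
        return llm(question, lines)             # base case: fits in one call
    chunks = [lines[i:i + window] for i in range(0, len(lines), window)]
    # Map: each sub-call sees only its own chunk, never the full input.
    partials = [recursive_answer(llm, question, c, window) for c in chunks]
    # Reduce: the root call synthesizes across the sub-results.
    return llm(question, [l for p in partials for l in p.splitlines()])

# Toy "model" that just extracts lines matching the question's keyword.
def toy_llm(question, lines):
    return "\n".join(ln for ln in lines if "ERROR" in ln)

log_lines = [f"line {i}: ok" for i in range(3000)] + ["line X: ERROR disk full"]
print(recursive_answer(toy_llm, "Find the errors", log_lines))
```

The shape is a classic map-reduce: every level of the recursion stays within the window, so the total input size is bounded only by how many sub-calls you are willing to pay for.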
The strategies that emerge
The researchers didn't hard-code how the model should process long inputs. They gave it the tools (peek, slice, grep, sub-call) and let it figure out strategies on its own. Three patterns emerged consistently:
Regex-first filtering. Before reading anything, the model greps the input for relevant patterns. Processing a 10M token corpus looking for a specific claim? The model runs a regex to narrow the search space from millions of tokens to a few thousand, then reads only the relevant sections. This is how a human would do it.
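The filtering step amounts to something like the following (illustrative toy data; in practice the model generates equivalent code inside the REPL):

```python
import re

corpus = ("filler text\n" * 100_000            # ~1.2M characters of noise
          + "The Q3 revenue figure was $4.2M.\n"
          + "more filler\n" * 100_000)

# Narrow millions of characters to a few candidate windows before reading.
pattern = re.compile(r"revenue", re.IGNORECASE)
windows = [corpus[max(0, m.start() - 200): m.end() + 200]
           for m in pattern.finditer(corpus)]

print(len(corpus), "chars narrowed to", sum(map(len, windows)))
# Only the matching windows are then read (or recursively sub-called on).
```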
Partition and map. For tasks requiring full coverage (summarization, exhaustive search), the model chunks the input into manageable pieces and recursively calls itself on each one. Each sub-call extracts what it needs. The root model then synthesizes across all sub-results.
Progressive refinement. The model peeks at the beginning of the document to understand structure, then uses that understanding to make targeted reads. It doesn't scan linearly. It jumps to where the information is likely to be.
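A toy version of that targeted-read pattern, assuming a document whose head contains a discoverable index (the layout here is invented for illustration):

```python
# Build a document whose first line indexes where each section starts.
body = "x" * 5000 + "RESULTS: accuracy 87.0%." + "y" * 5000
header = f"INDEX: results={body.index('RESULTS')}\n"   # offsets into body
doc = header + body

# Peek at the head to learn the layout, then jump straight to the answer.
head_line, _, rest = doc.partition("\n")
offsets = dict(pair.split("=") for pair in head_line.split()[1:])
start = int(offsets["results"])
print(rest[start:start + 24])   # reads 24 chars instead of ~10K
```

Two cheap reads (a short peek plus one targeted slice) replace a linear scan of the whole document.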
None of these strategies were trained explicitly. They emerged from the model having access to the REPL primitives and figuring out how to use them.
What the benchmarks show
The results are striking, especially for small models.
| Task | Input Length | GPT-5 (vanilla) | RLM(GPT-5) | Improvement |
|---|---|---|---|---|
| OOLONG | 131K tokens | 44.0% | 56.5% | +28.4% |
| OOLONG-Pairs | 32K tokens | 0.04% | 58.0% | +57.96 points |
| BrowseComp+ | 6-11M tokens | 0.0% (exceeded window) | 91.3% | scales beyond window |
| CodeQA | 23K-4.2M tokens | 24.0% (exceeded window) | 62.0% | +158% |
RLM doesn't just handle longer inputs. It handles the same-length inputs better. On OOLONG at 131K tokens (well within GPT-5's context window), the recursive approach still beats vanilla by 28%. The model processes the information more effectively when it controls what it looks at.
The most striking result: RLM(GPT-5-mini) outperforms vanilla GPT-5 on OOLONG by over 34 points — a 114% improvement. A smaller, cheaper model with recursive inference beats a larger model that tries to brute-force the context. That's the thesis of this paper in one data point.
The small model story is even more interesting. The researchers fine-tuned Qwen3-8B on just 1,000 trajectories generated by a larger model (Qwen3-Coder-480B). The training data came from LongBenchPro tasks that were completely unrelated to the evaluation benchmarks. RLM-Qwen3-8B improved 28.3% over base Qwen3-8B and approached GPT-5 performance on three of four tasks.
An 8B model. One thousand training examples. Unrelated training domain. Approaching GPT-5 on long-context work.
The scaling behavior holds too. Push OOLONG to 263K tokens — double the original benchmark — and RLM still maintains a 49% improvement over the base model while vanilla GPT-5 degrades sharply.
Running it yourself
The library is a drop-in replacement for standard LLM completion calls. Install it:
```shell
pip install rlms
```

Basic usage with a cloud model:

```python
from rlm import RLM

rlm = RLM(
    backend="openai",
    backend_kwargs={"model_name": "gpt-5-nano"},
    verbose=True,
)

response = rlm.completion(
    "Summarize the key findings from this research corpus."
    + massive_document_string
)
print(response.response)
```

That's the same API as a standard completion call. The library handles the REPL setup, prompt-as-variable storage, recursive loop management, and answer extraction.
Running it locally with small models
This is where it gets interesting for edge AI and local inference. The library supports vLLM as a backend, which means you can point it at any local model:
```python
from rlm import RLM

# Run against a local model via vLLM's OpenAI-compatible server
rlm = RLM(
    backend="openai",
    backend_kwargs={
        "model_name": "Qwen/Qwen3-8B",
        "base_url": "http://localhost:8000/v1",
        "api_key": "not-needed",
    },
    verbose=True,
)

# Process a document far beyond the model's 32K context window
with open("massive_codebase.txt") as f:
    code = f.read()  # 500K+ tokens

response = rlm.completion(
    f"Find all security vulnerabilities in this codebase and explain each one.\n\n{code}"
)
```

Start the local vLLM server:

```shell
vllm serve Qwen/Qwen3-8B --port 8000
```

The model needs about 16GB of RAM for 8B parameters at BF16. Qwen3-8B ships with pre-quantized FP8 and AWQ variants — swap in Qwen/Qwen3-8B-FP8 or Qwen/Qwen3-8B-AWQ to cut memory roughly in half. Go further with 4-bit GGUF and you're running in 6-8GB. The RLM overhead is minimal because the heavy lifting happens in the REPL, not in the model's context.
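Those memory figures follow from simple arithmetic on weight storage (rough estimates only; real usage adds KV cache and activation overhead on top):

```python
params = 8e9  # 8B parameters

# Weight memory = parameter count x bytes per parameter.
for name, bytes_per_param in [("BF16", 2), ("FP8", 1), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")
```

That's ~16 GB at BF16, ~8 GB at FP8, and ~4 GB of weights at 4-bit, which with runtime overhead lands in the 6-8GB range the GGUF builds report.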
For even smaller setups, the architecture works with any model that can generate Python code and follow the REPL loop pattern. A community fine-tuning gist exists for Qwen3-4B, and the paper's training approach (distill 1,000 trajectories from a larger model, fine-tune the small one) is straightforward to replicate with Unsloth or standard LoRA tooling.
The minimum viable model for RLM needs two capabilities: it has to write correct Python, and it has to understand the REPL loop protocol (peek, slice, sub-call, FINAL). The paper only tests down to 8B, so where the floor sits is an open question — my guess is that below 4B, code generation quality drops enough that the recursive strategies break down, but nobody has published results proving that yet. At 8B you hit a sweet spot: code generation is reliable, the model learns the recursion pattern from a small number of examples, and the whole thing runs on consumer hardware.
RLMs vs RAG vs context stuffing: what changes for builders
RLMs reframe the long-context problem from an architectural one (build bigger windows) to an inference strategy one (use the window you have more intelligently).
Here's how the approaches compare:
| Approach | Max Input | Quality at Scale | Cost per Query | Latency |
|---|---|---|---|---|
| Context stuffing | Limited by window (130K-2M) | Degrades in the middle | Proportional to tokens | Fast (single pass) |
| Summarize-then-answer | Unlimited in theory | Lossy — details get compressed out | Moderate (two passes) | Moderate |
| RAG (retrieval) | Unlimited in theory | Depends on retrieval quality | Low (only relevant chunks) | Fast |
| RLM (recursive) | Unlimited (10M+ tested) | Maintains or improves with scale | Comparable to stuffing, long-tailed | Slow (sequential sub-calls) |
This has three practical implications:
Small models get long-context superpowers. An 8B model with a 32K context window can now process million-token inputs by recursively chunking and sub-calling. You don't need a 128K or million-token context window. You need a model that can write code and follow a loop.
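The arithmetic behind that claim, using round hypothetical numbers (and ignoring per-call instruction overhead):

```python
input_tokens = 1_000_000
window = 32_000
chunk = window // 2                        # leave headroom for instructions + output
num_subcalls = -(-input_tokens // chunk)   # ceiling division
print(num_subcalls, "sub-calls of", chunk, "tokens each")  # 63 sub-calls of 16000 tokens each
```

A million-token input becomes a few dozen well-within-window sub-calls — entirely feasible for a small local model, just not instantaneous.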
Cost drops significantly. On BrowseComp+ (6-11M tokens), RLM(GPT-5) cost a median of $0.99 per query versus $1.50-$2.75 for naive approaches that try to stuff everything into the context. The caveat: cost distributions are long-tailed. If the model gets stuck in a loop or runs redundant verifications, outlier runs can spike well above the median. For local models, the cost is just electricity and time.
The architecture is model-agnostic. The RLM library supports OpenAI, Anthropic, OpenRouter, and local models via vLLM. Same REPL loop, same recursive strategy, different model underneath. You can swap the backbone without changing the inference logic.
The limitation is latency. Recursive sub-calls are sequential right now. A complex task might make 20-30 sub-calls, each requiring a full model inference pass. On a local 8B model, that's minutes, not seconds. The paper acknowledges this — async sub-calls and prefix caching are obvious next optimizations that haven't been implemented yet.
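A sketch of what concurrent sub-calls might look like with asyncio — this is speculative, not part of the library, and the `call_model` coroutine here is a hypothetical stand-in for an async backend call:

```python
import asyncio

async def call_model(question, chunk):
    # Stand-in for a real async API call to the backend model.
    await asyncio.sleep(0.01)              # simulate network/inference latency
    return f"summary of {len(chunk)} chars"

async def map_chunks(question, text, size=1000):
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    # gather() launches all sub-calls concurrently instead of sequentially,
    # so wall-clock time approaches the slowest call, not the sum of all calls.
    return await asyncio.gather(*(call_model(question, c) for c in chunks))

results = asyncio.run(map_chunks("summarize", "z" * 20_000))
print(len(results), "chunks processed concurrently")  # 20 chunks processed concurrently
```

The catch for local inference is that a single GPU serializes the work anyway unless the serving layer batches requests, which is exactly what vLLM's continuous batching is designed to exploit.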
For batch processing and background research pipelines, the latency is a non-issue. This is exactly the pattern I use in Tuon Deep Research — multi-step pipelines where each phase processes a slice of the input and passes findings forward. RLMs formalize that pattern at the model level. For interactive chat, the latency is a real constraint. But I'd rather have a slow correct answer than a fast wrong one from a model that lost track of what it read 50K tokens ago.