AI Agent Architecture: How to Build 'Identity' Without Breaking Performance

If you look at AI dev tooling right now, you see the "mega prompt."

Usually, it's a 2,000-word XML monolith. It promises to turn an LLM into Naval Ravikant or a senior engineer. It packs behavioral guidelines, simulated memories, and philosophical models. It's great for social media bookmark-bait.

But it's terrible architecture.

If you want to give an AI a true "soul," OpenClaw is a great framework. It gives an LLM a persistent "Markdown brain." It maintains daily journals, evolves its own rules, and builds real memory. It solves the identity problem well.

But try dropping that massive persona into a recursive agent loop. Two things happen immediately.

First, you burn through tokens. Fast. A 2,500-token system prompt inside a standard 20-step ReAct loop eats 50,000 input tokens before the agent runs any useful code.

Second, the context bloats. The model gets caught up playing the character. It starts to hallucinate variables and forgets how to execute basic Python tasks.

The industry treats AI identity as a creative writing problem. It's actually a data engineering problem. A prompt over 2,000 words is a signal that your architecture is leaking. You don't need a better prompt. You need state management.

Here is how I build agent architectures that keep the persona without sacrificing the precision of the doer.

1. Tiered Context Architecture and Style Injection

Decouple the thinking from the talking. You stop packing everything into one massive prompt and instead split the task into two calls: The Doer and The Speaker.

When you force an LLM to solve a complex problem and maintain a specific persona simultaneously, both suffer. It usually drops the persona or hallucinates the logic. Agents should work like standard applications. You don't load your entire database into RAM to update one user record. It's basic separation of concerns.

Step 1: The Doer (Logic Layer)

Give the model a dry prompt focused entirely on rules and constraints. No personality. No formatting instructions beyond structural requirements.

Prompt: "Analyze this refund request against our 30-day policy. Calculate the date difference. Output raw JSON only." Output:

{
  "approved": false,
  "days_elapsed": 34,
  "reason": "Policy strictly limits refunds to 30 days."
}

Step 2: The Speaker (Presentation Layer)

Now you run a second, cheaper inference step. Pass that verified JSON payload into a prompt strictly dedicated to tone.

Prompt: "You are an empathetic, direct support agent. Read this JSON outcome. Write a two-sentence email explaining the decision. Do not apologize excessively." Output: "I looked into your request, but we can't process the refund today. Your order was placed 34 days ago, which is past our 30-day return window."

The doer solves the problem. The speaker translates the solution into the persona's voice.

This gives you a clear audit trail. If the final output is wrong, you know exactly where it broke. If the JSON is right but the email sounds weird, you tweak the speaker prompt. You stop playing whack-a-mole with a 2,000-word system prompt.
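The two calls can be sketched as plain request builders. This is an illustrative sketch, not a specific SDK: the function names and the `run_model` split are mine, and the system strings come from the prompts above.

```python
import json

def build_doer_request(ticket_text: str) -> dict:
    """Logic layer: dry rules, structural output only."""
    return {
        "system": ("Analyze this refund request against our 30-day policy. "
                   "Calculate the date difference. Output raw JSON only."),
        "messages": [{"role": "user", "content": ticket_text}],
    }

def build_speaker_request(verdict: dict) -> dict:
    """Presentation layer: tone only, fed the verified JSON from the Doer."""
    return {
        "system": ("You are an empathetic, direct support agent. Read this JSON "
                   "outcome. Write a two-sentence email explaining the decision. "
                   "Do not apologize excessively."),
        "messages": [{"role": "user", "content": json.dumps(verdict)}],
    }

# The Doer's verified output becomes the Speaker's input:
verdict = {"approved": False, "days_elapsed": 34,
           "reason": "Policy strictly limits refunds to 30 days."}
speaker_request = build_speaker_request(verdict)
```

Because the Speaker only ever sees the verdict JSON, a broken email can never mean broken logic, and vice versa.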

2. The RAG Trap: Why RAG Fails for Persona Management

Do not use RAG (Retrieval-Augmented Generation) for persona prompts under 2,000 words. Vectorizing a personality destroys the rhythm and makes the model sound like a robot reading flashcards.

If you chunk a standard persona prompt into a vector database, you shred the voice. The model retrieves isolated sentences about "communication style" but loses the overarching rhythm, and the output reads like disjointed fragments stitched together at random.

RAG is only useful for identity if you have a massive, 50,000+ token character bible. This might include a complete history of past decisions, journal entries, or exact chat transcripts. In that case, you retrieve episodic memory, not core behavioral instructions. For anything under 2,000 words, use prompt caching instead.
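The routing decision above reduces to a token-count check. A minimal sketch, where the threshold constant mirrors the number in this section and the function name is mine:

```python
def identity_strategy(persona_tokens: int) -> str:
    """Pick how to serve identity content based on its size in tokens."""
    # RAG only pays off for massive episodic archives:
    # journals, decision logs, chat transcripts.
    if persona_tokens >= 50_000:
        return "rag"           # retrieve episodic memory chunks on demand
    return "prompt-caching"    # keep the persona intact as a cached prefix
```

A 2,000-word persona lands around 2,500 tokens, so it stays whole and cached; a 50,000-token character bible gets chunked and retrieved.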

3. The Observer Pattern: Reducing Agent Costs

The Observer Pattern cuts costs by up to 80% while maintaining persona quality. You use a cheap model for logic and an expensive model for the final "identity" audit.

This maps to the Supervisor pattern in frameworks like LangGraph. Here is the mechanical breakdown:

  1. The Doer: Use a fast, cheap model (like Claude 3.5 Haiku or GPT-4o-mini). Give it a dry system prompt. Let it write the code.
  2. The Observer: Keep the expensive model (like Claude 3.5 Sonnet or GPT-4o) asleep. This model holds the heavy, complex persona prompt.
  3. The Audit: When the Doer finishes, it passes the output to the Observer. The Observer wakes up, reviews the choices against its "soul" guidelines, issues corrections, and goes back to sleep.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

# 1. The Doer: Fast, cheap, soulless
doer_response = client.messages.create(
    model="CHEAP_MODEL", # e.g. Claude Haiku (or GPT-4o-mini via the OpenAI SDK)
    max_tokens=2000,
    system="Write raw, functional Python code. No explanations.",
    messages=[{"role": "user", "content": "Write a scraper for example.com"}]
)

draft_code = doer_response.content[0].text
 
# 2. The Observer: Expensive, opinionated, only runs once
observer_prompt = """
You are a Staff Engineer. Review this code against our architecture guidelines:
1. Is the network logic tightly coupled to the parser?
2. Are exceptions swallowed?
Fix the architecture and return only the refactored code.
"""
 
final_response = client.messages.create(
    model="EXPENSIVE_MODEL", # e.g. Claude Sonnet or GPT-4o
    max_tokens=2000,
    system=observer_prompt,
    messages=[{"role": "user", "content": f"Review this draft:\n{draft_code}"}]
)

The math is clear. Using a cheap model for the doer is often 12x cheaper than using the flagship model for every step. You trade dollars for seconds because running two sequential API calls is slower. Use this for async background tasks rather than real-time chat.
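To sanity-check the economics for your own stack, plug per-token prices into a quick cost model. The prices below are placeholders, not current list prices; the function is mine:

```python
def loop_input_cost(steps: int, prompt_tokens: int, price_per_mtok: float) -> float:
    """Input-token cost of a loop that resends the full prompt every step."""
    return steps * prompt_tokens * price_per_mtok / 1_000_000

# Placeholder prices: cheap model at $0.25/MTok, flagship at $3.00/MTok
cheap_loop = loop_input_cost(steps=20, prompt_tokens=2_500, price_per_mtok=0.25)
one_audit = loop_input_cost(steps=1, prompt_tokens=2_500, price_per_mtok=3.00)
all_flagship = loop_input_cost(steps=20, prompt_tokens=2_500, price_per_mtok=3.00)
```

With those placeholder numbers, cheap loop plus one flagship audit costs a fraction of running the flagship on all 20 steps; swap in real pricing to get your actual multiple.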

4. State Management and Prompt Caching for AI Agents

Use prompt caching to keep large persona prompts "warm" for a fraction of the cost. This allows continuous identity without the 5,000-token input penalty on every turn.

Anthropic's prompt caching allows you to keep that massive system prompt cached. Instead of paying full price for 5,000 tokens on every turn, you pay a small fraction to read the cached state.

There are two strict rules for caching:

  1. The Token Minimum: Caching only activates if your prompt clears the model's minimum: 1,024 tokens for Sonnet and Opus, or 2,048 tokens for Haiku.
  2. The Prefix Rule: Cached content must be at the very top of your prompt. If you inject a dynamic variable before the cached persona block, you break the cache.
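A small guard enforces the token minimum before you tag a block. This is a hypothetical helper, not part of any SDK, and it assumes you already have a token count for the text:

```python
def cacheable_block(text: str, token_count: int, minimum: int = 1024) -> dict:
    """Build a system block, tagging it ephemeral only if it clears the minimum."""
    block = {"type": "text", "text": text}
    if token_count >= minimum:
        block["cache_control"] = {"type": "ephemeral"}
    return block
```

Blocks under the minimum are returned untagged; tagging them would never create a cache entry anyway.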

Format your system prompt as an array and tag the heaviest block with a cache control flag.

# Generic implementation with the Anthropic SDK
response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1000,
    system=[
        {
            "type": "text",
            "text": "Your massive 5,000-word persona instructions go here...",
            "cache_control": {"type": "ephemeral"} # This saves your budget
        }
    ],
    messages=[{"role": "user", "content": "Handle the next ticket."}]
)

Every time you hit that block, the 5-minute timer resets. Other providers handle this differently:

  • OpenAI: This is automatic. If your prompt is over 1,024 tokens and hits a prefix match, you get a 50% discount.
  • Google Gemini: This requires contextCaching via the Gemini API. Importantly, the minimum threshold is 32,768 tokens. It is designed for entire codebases.

The Takeaway

Personality is expensive. It's beautiful, but it does dilute attention and eat context.

If you want to build agents that ship, stop treating them like improv actors. Treat them like distributed systems. Decouple the identity from the execution. Retrieve what you need when you need it. Let the cheap models do the heavy lifting while the expensive models provide the leverage of quality control.