Ditching Exa AI for Self-Hosted Search

Here's the thing about commercial APIs: they work great until they don't. And when you're building a device that's supposed to work offline, depending on a cloud service for web search feels... wrong.
So I spent today ripping out Exa AI and replacing it with something I can run locally. Here's how it went.
The Problem
The device's "Lookup" tool lets the AI search the web and read URLs, with LFM2.5-1.2B-Instruct from Liquid AI handling synthesis. Until now, the search itself was powered by Exa AI — a solid service, but:
- Needs an API key (costs money)
- Requires internet to their servers
- My data goes through their infrastructure
For a device built around "your data never leaves your device," this was the weak link. There's a growing body of work on fully local RAG systems that run on consumer hardware without cloud dependencies — the goal is the same: keep sensitive data private while still grounding LLM responses in real information.
What I Tried First
Started by looking at DDGS (DuckDuckGo search library). Simple, free, no API key needed. Seemed perfect.
Then I ran the same query twice, 30 seconds apart. Completely different results.
That's a dealbreaker. If your AI asks "what's the weather in Tokyo" and gets garbage results half the time, the whole thinking system falls apart.
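That flakiness is easy to quantify. Here's a sketch (my own, not from the project) that measures the Jaccard overlap between two runs of the same query, assuming a hypothetical `search(query)` callable that returns dicts with a `url` key:

```python
import time

def jaccard(a, b):
    """Overlap between two lists of result URLs (1.0 = identical sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def stability(search, query, delay=30):
    """Run the same query twice, `delay` seconds apart, and score the overlap."""
    first = [r["url"] for r in search(query)]
    time.sleep(delay)
    second = [r["url"] for r in search(query)]
    return jaccard(first, second)
```

A stable backend should score close to 1.0; DDGS was landing well below that for me.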
Enter Perplexica
Found Perplexica — an open-source AI search engine. It bundles:
- SearXNG (metasearch engine — aggregates results from 200+ search services, no tracking, no profiling)
- Answer synthesis (like Perplexity)
- Docker container (one command to run)
The plan was simple: run Perplexica in a Docker container, hit its API instead of Exa's.
What Actually Happened
Spent an hour fighting with the CUSTOM_OPENAI config. The idea was to point Perplexica at my local llama-server for answer synthesis. Set up the config perfectly:
[MODELS.CUSTOM_OPENAI]
API_KEY = "local"
API_URL = "http://host.docker.internal:8081/v1"
MODEL_NAME = "LFM2.5-1.2B-Instruct"

Verified llama-server was reachable from inside the container. Checked the config was mounted correctly. Everything looked right.
Hit /api/providers... and CUSTOM_OPENAI wasn't there.
No error. No warning. Just... missing.
The Pivot
Okay. New plan.
Perplexica bundles SearXNG anyway. What if I just use SearXNG directly for search, and do the answer synthesis myself?
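(If you'd rather skip Perplexica entirely, SearXNG also runs standalone. A minimal compose sketch using the official `searxng/searxng` image; one gotcha worth knowing is that the JSON output format is disabled by default, so `format=json` requests get rejected until you add `json` under `search.formats` in `settings.yml`.)

```yaml
# Minimal standalone SearXNG (sketch; not needed if Perplexica's bundle is running)
services:
  searxng:
    image: searxng/searxng
    ports:
      - "8080:8080"
    volumes:
      - ./searxng:/etc/searxng   # settings.yml lives here; enable the json format in it
```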
Tested it:
curl 'http://localhost:8080/search?q=test&format=json'

Got back beautiful JSON with results from Google, DuckDuckGo, Brave, and Wikipedia. Titles, URLs, snippets, scores. Everything I needed.
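The JSON shape is easy to work with: `results` is a list of objects with `title`, `url`, `content`, and `score` fields. A small sketch of trimming that payload down to what a synthesis step needs (`top_results` is my own illustrative helper, not the project's code):

```python
def top_results(payload, limit=5):
    """Keep the highest-scored SearXNG results, trimmed to three fields."""
    results = sorted(payload.get("results", []),
                     key=lambda r: r.get("score", 0), reverse=True)
    return [{"title": r.get("title", ""),
             "url": r.get("url", ""),
             "snippet": r.get("content", "")}   # SearXNG calls the snippet "content"
            for r in results[:limit]]
```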
The Final Architecture
Three pieces:
- SearXNG (port 8080) — Web search, returns JSON. Aggregates Google, DuckDuckGo, Brave, Wikipedia, and more. 25k+ GitHub stars, actively maintained, and purpose-built for privacy.
- trafilatura — Python library for extracting text from URLs. Published as an ACL 2021 paper, consistently outperforms other open-source extractors in benchmarks, and is used by HuggingFace, Microsoft Research, and Stanford.
- My LLM — Synthesizes answers from search results. This "search then synthesize" pattern is essentially what Zero-Indexing Internet Search Augmented Generation formalizes: fetch results dynamically via search APIs, re-rank them, then use an LLM to extract and synthesize an answer. Same idea, just self-hosted.
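The synthesis half of that pattern is mostly prompt assembly. A rough sketch of what "synthesize from results" can look like (`build_prompt` is a hypothetical helper; the actual prompt wording is whatever your thinking system uses):

```python
def build_prompt(query, results):
    """Format search results as numbered sources for the synthesis model."""
    sources = "\n".join(
        f"[{i}] {r['title']}\n{r['snippet']}"
        for i, r in enumerate(results, 1))
    return (f"Answer the question using only the sources below. "
            f"Cite sources as [n].\n\n{sources}\n\nQuestion: {query}")
```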
The code change was surprisingly clean:
import requests
import trafilatura

# web_search() - hits SearXNG
response = requests.get(f"{searxng_url}/search", params={
    "q": query,
    "format": "json",
    "engines": "google,duckduckgo,brave,wikipedia",
})

# web_contents() - uses trafilatura
downloaded = trafilatura.fetch_url(url)
text = trafilatura.extract(downloaded)

# web_answer() - SearXNG + LLM synthesis
results = search_with_searxng(query)
answer = llm.synthesize_from_results(results)

The output format matches Exa's exactly. Same JSON structure. Zero changes needed to prompts or the thinking system.
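One detail worth guarding against: both `trafilatura.fetch_url` and `trafilatura.extract` return `None` on failure rather than raising. A sketch of a defensive wrapper, with the two functions injected so it's easy to test (`read_url` is my own illustrative helper):

```python
def read_url(url, fetch, extract):
    """Fetch then extract; either step can return None, as trafilatura's do."""
    downloaded = fetch(url)
    if downloaded is None:
        return {"url": url, "text": "", "error": "fetch failed"}
    text = extract(downloaded)
    if text is None:
        return {"url": url, "text": "", "error": "no extractable text"}
    return {"url": url, "text": text, "error": None}
```

In practice you'd call it as `read_url(url, trafilatura.fetch_url, trafilatura.extract)` and let the LLM see the error string instead of crashing the tool.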
The Startup Dance
One gotcha: Docker + llama-server need to be running before the app can do web searches.
Added a startup script that:
- Checks if services are already running (idempotent)
- Starts llama-server on port 8081
- Starts the Perplexica container (which includes SearXNG)
- Waits for health checks
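The "waits for health checks" step can be as simple as polling each service's HTTP endpoint. A stdlib-only sketch (`wait_for` is a hypothetical helper, not the actual startup script):

```python
import time
import urllib.request
import urllib.error

def wait_for(url, timeout=30.0, interval=0.5):
    """Poll an HTTP endpoint until it answers, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # urlopen raises for connection errors and 4xx/5xx responses,
            # so reaching here means the service answered successfully
            with urllib.request.urlopen(url, timeout=2):
                return True
        except (urllib.error.URLError, OSError):
            pass
        time.sleep(interval)
    return False
```

Call it once per service (llama-server on 8081, SearXNG on 8080) before flipping the loading screen to "Ready".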
Then hooked it into the app's loading screen:
"Loading services..." → starts SearXNG
"Loading Whisper model..."
"Loading LLM model..."
"Ready"
First boot after reboot takes an extra 10-15 seconds. After that, services stay running and startup is instant.
What I Learned
- Metasearch > single engine. SearXNG aggregates results from multiple search engines. More consistent than hitting one API. This is a well-known advantage — metasearch reduces the variance of any single engine's ranking quirks.
- Sometimes the wrapper isn't worth it. Perplexica's answer synthesis is nice, but fighting config issues wasn't. Going direct to SearXNG was simpler.
- Output format parity matters. By keeping the exact same JSON structure, I didn't have to touch any prompts or downstream code. The AI has no idea the backend changed.
- Show loading progress. When services take time to start, tell the user what's happening. A blank screen is anxiety-inducing.
Numbers
- Docker image size: 2.08 GB (Perplexica + SearXNG)
- First query latency: ~4 seconds (search + LLM synthesis)
- Subsequent queries: ~2-3 seconds
- RAM overhead: ~200 MB for the container
What's Next
The web search stack now runs entirely on the Pi; the only external traffic goes to the actual search engines themselves (Google, DuckDuckGo, etc.). The search layer is inspectable, self-hosted, and free — exactly the kind of setup SearXNG was designed for.
Next up: testing this in real conversations. The thinking system should work exactly the same, but now it's self-hosted.
Funny footnote: on a different project — a note-taking app — we kept Exa and built an enrichment pipeline on top of it. Different constraints, different call.