When to Stop Adding Rules: Building a Negation-Aware Sanitizer Without Overfitting

I almost fell into a classic trap today. Let me tell you about it.
The Setup
The device I'm building runs LFM2.5-1.2B-Instruct from Liquid AI on a Raspberry Pi 5. The model has 1.17B parameters, is designed for on-device inference, and fits in under 1GB quantized. It works. But small models have quirks.
One quirk: when my Reasoner → Planner pipeline asks the model to select a tool, it sometimes outputs `ToolName(query="weather in Seattle")` instead of the actual tool name, like `Lookup(query="weather in Seattle")`.
The model knows it needs a tool. It just... doesn't commit to which one.
The Obvious Fix
Build a sanitizer. When the model outputs `ToolName(...)`, replace it with the correct tool based on keywords in the query.
"Weather" → Lookup. "Calculate" → Calculate. "Remember what I said" → Remember.
Simple regex patterns — the same approach that got us from 78% to 97%. Worked great on standard tests.
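The keyword sanitizer can be sketched like this. The tool names come from this post; the specific patterns, scoring, and function name are my own illustration, not the production code:

```python
import re

# Illustrative keyword tables; the real sanitizer's lists are larger.
TOOL_KEYWORDS = {
    "Lookup": [r"\bweather\b", r"\bsearch\b", r"\blook\s*up\b"],
    "Calculate": [r"\bcalculat\w*", r"\btip\b"],
    "Remember": [r"\bremember\b", r"\brecall\b"],
}

def sanitize(raw_call: str) -> str:
    """Swap a literal ToolName(...) placeholder for the best-matching tool."""
    m = re.match(r"ToolName\((.*)\)$", raw_call)
    if m is None:
        return raw_call  # the model committed to a real tool; leave it alone
    query = m.group(1).lower()
    scores = {
        tool: sum(1 for pat in pats if re.search(pat, query))
        for tool, pats in TOOL_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return f"{best}({m.group(1)})"
```

On a tie, `max` keeps the first tool in the dict, so the ordering of `TOOL_KEYWORDS` is itself a (fragile) priority list.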
Then I added harder tests. Red herrings like "I remember paying $89, what's the tip?" (Should be Calculate, not Remember.)
Still 90%. Not bad.
Then I pushed harder. "Don't calculate anything, just recall what I told you about my budget."
50% accuracy. The sanitizer saw "calculate" and picked Calculate. It ignored the "Don't."
This isn't just my problem. Recent research calls it "negation blindness" — models often fail to capture semantic changes caused by negation and generate identical responses to both positive and negated queries. A 2025 EMNLP paper built an entire taxonomy of negation types and found that even large neural models persistently underperform on negation-containing inputs. My 1.2B model on a Pi doesn't stand a chance without help.
The Slippery Slope
So I added negation detection:
```python
negation_prefixes = [
    r"don'?t\s+(?:\w+\s+){0,2}",  # "don't calculate"
    r"skip\s+(?:the\s+)?",        # "skip the lookup"
    r"without\s+",                # "without searching"
    r"forget\s+",                 # "forget the calculations"
]
```

When the query contains "don't calculate," penalize Calculate by -2 points. Let positive keywords compete against negative ones.
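The penalty scheme described above can be sketched as follows. The keyword lists and the `score_tool` helper are my own illustration; only the prefixes and the -2 penalty come from the post:

```python
import re

NEGATION_PREFIXES = [
    r"don'?t\s+(?:\w+\s+){0,2}",  # "don't calculate"
    r"skip\s+(?:the\s+)?",        # "skip the lookup"
    r"without\s+",                # "without searching"
    r"forget\s+",                 # "forget the calculations"
]

# Illustrative keyword tables; the real sanitizer's lists are larger.
POSITIVE_KEYWORDS = {
    "Calculate": [r"\bcalculat\w*", r"\btip\b"],
    "Remember": [r"\bremember\b", r"\brecall\b"],
}

def score_tool(tool: str, query: str) -> int:
    """+1 per keyword hit, -2 when that keyword sits under a negation prefix."""
    q = query.lower()
    score = 0
    for kw in POSITIVE_KEYWORDS[tool]:
        if re.search(kw, q):
            score += 1
            if any(re.search(neg + kw, q) for neg in NEGATION_PREFIXES):
                score -= 2  # a negated mention counts against the tool
    return score
```

On "Don't calculate anything, just recall my budget," Calculate nets -1 (one hit, one negation) while Remember nets +1, so the un-negated tool wins.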
Accuracy jumped back to 90%.
Then I tested "Why don't you calculate 25 * 4?"
Failed. The sanitizer penalized Calculate. But "why don't you" is a polite request, not a negation. The user wants calculation.
So I added more patterns:
- "Why didn't you..."
- "What's the reason you didn't..."
- "Shouldn't you..."
- "Wouldn't it be better to..."
Each pattern fixed one test case. Each pattern added complexity.
The Overfitting Question
After an hour of this, something felt wrong. Every new test failure triggered a new regex pattern. The sanitizer was growing. I was playing whack-a-mole.
I stopped and asked: Are we overfitting?
The research is clear on this:
"Traditional rule-based systems face a fundamental challenge with complexity growth through special-case handling... The systems tend to accumulate ad-hoc rules for edge cases, which compromises maintainability and introduces brittleness."
That's exactly what I was doing. Adding ad-hoc rules. Accumulating special cases.
But here's the nuance: not all rules are ad-hoc.
The Linguistics Test
"Don't calculate" vs "Why don't you calculate?" isn't an arbitrary edge case. It's a real linguistic distinction.
Linguists call it metalinguistic negation vs descriptive negation — a distinction formalized by Laurence Horn back in 1989. "Why don't you X?" is what Searle called an indirect speech act — the literal form is a question, but the pragmatic function is a request. The negation operates at the pragmatic level, not the semantic level.
Amazon's Alexa team ran into this exact problem. Their 2024 study on intent encoders found that models consistently confuse negation with implicature — "don't calculate" vs "why don't you calculate" is a textbook case. It's not an edge case. It's a category.
So I added one pattern:
```python
# Skip negation if it's a polite request form
skip_negation = bool(re.search(r"why\s+don'?t|why\s+not", query_lower))
```

And stopped there.
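Wiring that check in front of the penalty logic might look like this. A minimal sketch; the function name is mine:

```python
import re

POLITE_REQUEST = re.compile(r"why\s+don'?t|why\s+not")

def negation_applies(query: str) -> bool:
    """Return False when an apparent negation is really a polite request."""
    q = query.lower()
    if POLITE_REQUEST.search(q):
        return False  # "Why don't you X?" asks for X, it doesn't forbid X
    return True
```

Only when this returns True do the negation penalties fire; otherwise the positive keywords score normally.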
The Final Numbers
| Test Suite | Accuracy | Notes |
|---|---|---|
| Standard (v3) | 97.5% | Basic queries |
| Hard (v4) | 90.0% | Red herrings |
| Extreme (v5) | 82.5% | Heavy misdirection |
| Negation focused | 93.3% | Targeted negation tests |
| Adversarial | 75.0% | Intentionally trying to break it |
The adversarial tests still have failures:
- "Why didn't you calculate that earlier?" (inquiry about past)
- "What's the reason you didn't search?" (complex inquiry pattern)
- "Oh sure, don't bother calculating..." (sarcasm)
I'm not fixing those. That's the overfitting zone. A Feb 2025 paper showed you can improve negation robustness with self-supervised pre-training tasks, gaining 1.8-9.1% on negation benchmarks. But that's a model-level fix — retraining. For a rule-based fallback layer, knowing when to stop adding patterns is the whole game.
What the Research Says
Here's the thing: what I built isn't weird. It's a standard pattern.
NeMo Guardrails (NVIDIA) uses "lightweight, rule-based wrappers to adjust outputs in real-time." PrimeGuard (2024) takes a similar approach — tuning-free routing that directs requests to different model configurations, hitting 97% safe responses without fine-tuning. Same idea as the sanitizer: a lightweight correction layer that doesn't require retraining.
Voiceflow, Rasa, and Google Dialogflow all use hybrid systems: "If the LLM fails to identify a valid intent, the system reverts to the original NLU classification." A 2024 survey on intent detection compared 7 SOTA LLMs and concluded the best approach is a hybrid — LLMs combined with fine-tuned sentence transformers — getting near-LLM accuracy with 50% less latency and better out-of-scope detection.
Research on small language models specifically notes they "struggle with output format adherence." Google DeepMind's 2024 work on automata-based output constraints addresses this directly — using formal language theory to constrain model output to valid formats. My regex sanitizer is the scrappy version of the same instinct.
The legitimate pattern is: LLM first, rule-based fallback second.
The danger zone is: adding rules for every edge case until the rules become the primary system.
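That division of labor, LLM first and rules only on format failure, can be sketched as below. `keyword_fallback` is a hypothetical stand-in for the keyword sanitizer; the tool set comes from this post:

```python
VALID_TOOLS = {"Lookup", "Calculate", "Remember"}

def keyword_fallback(query: str) -> str:
    # Minimal stand-in for the keyword sanitizer described earlier.
    q = query.lower()
    if "weather" in q or "search" in q:
        return "Lookup"
    if "calculate" in q or "tip" in q:
        return "Calculate"
    return "Remember"

def select_tool(llm_output: str, query: str) -> str:
    """LLM first; fall back to rules only when the model didn't commit."""
    tool = llm_output.split("(", 1)[0]
    if tool in VALID_TOOLS:
        return tool  # the model named a real tool: trust it
    return keyword_fallback(query)  # ToolName(...) or garbage: use the rules
```

The key property is that the rules never override a valid model answer; they only fill the gap when the model emits the placeholder.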
The Decision Framework
When should you add a pattern?
1. Is it a category or a case? "Why don't you" is a grammatical category (polite request form). "Oh sure, don't bother" is a specific case (sarcasm). Add patterns for categories.
2. What's the likelihood? "Why don't you X?" is common everyday speech. "I wouldn't say don't calculate" is rare. Prioritize common patterns.
3. What breaks if you add it? Each pattern risks false positives. "Why don't you" is specific enough to not trigger on "I don't know why."
4. Is the upstream fix better? Better prompts that make the model output correct tool names directly would eliminate the need for the sanitizer entirely.
The Stopping Point
I stopped at "why don't" and "why not." Two patterns. Both linguistically principled. Both common in speech.
The adversarial tests still show 6 failures. I'm okay with that.
A 75% accuracy fallback system that handles format failures is better than a 95% accuracy system that's brittle and unmaintainable.
I also tried fine-tuning the model with DPO to bake correct formats into the weights instead of fixing them after the fact. That had its own tradeoffs.