# Model Routing Strategy
- Applies to: All PRDs referencing LLM calls
- Last Updated: February 2026
- Source of truth: This document. PRDs should reference model tiers (fast/default/nano), not specific model names.
## Active Models
| Model | Tier | Provider | Cost (prompt/completion per 1K tokens) | Used For |
|---|---|---|---|---|
| gpt-5 | Default / Reasoning | OpenAI | $0.005 / $0.015 | Response generation (when depth needed) |
| gpt-5-mini | Fast | OpenAI | $0.00015 / $0.0006 | Response generation (default), vision, classification |
| gpt-5-nano | Ultra-fast | OpenAI | $0.00005 / $0.0002 | Emotion detection, entity extraction, query synthesis, event tracking |
| gemini-2.5-flash-lite | Router | Google | N/A (separate billing) | Unified message routing with full conversation context |
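For illustration only, this table could be mirrored in a small registry so call sites and cost estimates share one definition. The `ModelInfo` dataclass, `ACTIVE_MODELS` dict, and `call_cost` helper below are hypothetical names, not part of any codebase; the costs are the USD-per-1K figures from the table, with the router recorded as zero because it is billed separately.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelInfo:
    tier: str
    provider: str
    prompt_cost_per_1k: float       # USD per 1K prompt tokens
    completion_cost_per_1k: float   # USD per 1K completion tokens

# Hypothetical registry mirroring the table above.
ACTIVE_MODELS = {
    "gpt-5":                 ModelInfo("default", "OpenAI", 0.005,   0.015),
    "gpt-5-mini":            ModelInfo("fast",    "OpenAI", 0.00015, 0.0006),
    "gpt-5-nano":            ModelInfo("nano",    "OpenAI", 0.00005, 0.0002),
    "gemini-2.5-flash-lite": ModelInfo("router",  "Google", 0.0,     0.0),  # billed separately
}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of a single call from its token counts."""
    info = ACTIVE_MODELS[model]
    return (prompt_tokens / 1000) * info.prompt_cost_per_1k \
         + (completion_tokens / 1000) * info.completion_cost_per_1k
```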
## Pipeline Stage → Model Mapping
```
User Message
│
├── FastReflex (<10ms, regex, no LLM)
│
├── Smart Message Router ─────────── gemini-2.5-flash-lite
│       Routes to: workflow / miniapp / direct response
│
├── Emotion Classification ───────── gpt-5-nano (150 tokens)
│
├── Entity Extraction ────────────── gpt-5-nano (reflex limits)
│
├── Event Detection ──────────────── gpt-5-nano (300 tokens)
│       Detects future events for proactive surfacing
│
├── Memory Query Synthesis ───────── gpt-5-nano (300 tokens)
│       Generates auxiliary questions for multi-query recall
│
├── Content Moderation (Input) ───── omni-moderation-latest (OpenAI)
│
├── Response Generation ──────────── gpt-5-mini (default, 1000 tokens)
│       │                            gpt-5 (deep reasoning, 2000 tokens)
│       │
│       └── Vision (if image) ────── gpt-5-mini multimodal (1500 tokens)
│
├── Content Moderation (Output) ──── omni-moderation-latest (OpenAI)
│
└── Memory Storage ───────────────── No LLM (Supermemory API)
```
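As a minimal sketch, the LLM stages above could be expressed as a lookup table. The stage keys and the `PIPELINE_STAGE_MODELS` / `model_for_stage` names are hypothetical, not identifiers from the codebase; `None` marks budgets that come from a feature flag or are not token-capped in this document.

```python
# Hypothetical stage table mirroring the diagram: each LLM stage maps to
# (model, max output tokens).
PIPELINE_STAGE_MODELS = {
    "smart_message_router":   ("gemini-2.5-flash-lite",  None),
    "emotion_classification": ("gpt-5-nano",             150),
    "entity_extraction":      ("gpt-5-nano",             None),  # uses the reflex token-limit flag
    "event_detection":        ("gpt-5-nano",             300),
    "memory_query_synthesis": ("gpt-5-nano",             300),
    "moderation_input":       ("omni-moderation-latest", None),
    "response_generation":    ("gpt-5-mini",             1000),  # gpt-5 / 2000 when escalated
    "vision_analysis":        ("gpt-5-mini",             1500),
    "moderation_output":      ("omni-moderation-latest", None),
}

def model_for_stage(stage: str) -> str:
    """Resolve a pipeline stage to its currently configured model name."""
    return PIPELINE_STAGE_MODELS[stage][0]
```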
## Token Limits (Feature-Flag Controlled)
| Context | Flag Key | Default | Emergency (FORCE_BASIC_MODE) |
|---|---|---|---|
| Reflex responses | llm.max_tokens.reflex | 1500 | 200 |
| Deep reasoning | llm.max_tokens.deep_reasoning | 2000 | 200 |
| Default conversation | llm.max_tokens.default | 1000 | 200 |
| Vision analysis | llm.max_tokens.vision | 1500 | 200 |
All OpenAI calls enforce a 15-second timeout to fit within the 20-second routing window.
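A minimal sketch of flag-backed limit resolution, assuming a simple in-process defaults table: the flag keys and values come from the table above, while `DEFAULT_LIMITS`, `max_tokens_for`, and the constants are illustrative names rather than real configuration.

```python
# Defaults mirror the table above; under FORCE_BASIC_MODE every context is
# clamped to the emergency limit.
DEFAULT_LIMITS = {
    "llm.max_tokens.reflex":         1500,
    "llm.max_tokens.deep_reasoning": 2000,
    "llm.max_tokens.default":        1000,
    "llm.max_tokens.vision":         1500,
}
EMERGENCY_LIMIT = 200
OPENAI_TIMEOUT_SECONDS = 15  # keeps every OpenAI call inside the 20-second routing window

def max_tokens_for(flag_key: str, force_basic_mode: bool = False) -> int:
    """Resolve a token budget from its feature flag, honoring emergency mode."""
    if force_basic_mode:
        return EMERGENCY_LIMIT
    return DEFAULT_LIMITS[flag_key]
```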
## PRD Guidance
When writing PRDs, reference model tiers rather than specific model names. This allows engineering to swap models without PRD updates.
| Tier | Use When | Current Model |
|---|---|---|
| router | Routing decisions, intent classification with full context | gemini-2.5-flash-lite |
| nano | Sub-tasks under 300 tokens: entity extraction, emotion, query synthesis | gpt-5-nano |
| fast | Response generation, vision analysis, classification | gpt-5-mini |
| default | Complex reasoning, deep conversation, multi-step logic | gpt-5 |
| moderation | Content safety checks (input and output) | omni-moderation-latest |
Example PRD language:
- "The emotion classifier uses the nano tier for sub-150-token classification."
- "Response generation defaults to the fast tier, escalating to default for complex reasoning."
## Cost Implications
| Tier | Typical Call Cost | Calls Per Conversation Turn |
|---|---|---|
| Router (Gemini) | ~$0.0001 | 1 |
| Nano | ~$0.00003 | 2–4 (emotion + entity + query synthesis + event) |
| Fast (generation) | ~$0.0004 | 1 |
| Default (reasoning) | ~$0.01 | 0–1 (only when escalated) |
| Moderation | ~$0.0001 | 2 (input + output) |
| Total per turn | ~$0.001–$0.011 | Depends on escalation |
At 50 messages/day (the free-tier cap), the estimated daily LLM cost per free user is ~$0.05–$0.07. At ~200 messages/day (an active Superpowers+ user), the estimated daily LLM cost is ~$0.20–$0.50.
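A back-of-the-envelope check of these figures, using the approximate per-call costs from the table above; the escalation rates below are assumptions chosen to illustrate the stated ranges, not measured values.

```python
# Approximate per-call costs (USD) from the table above.
ROUTER, NANO, FAST, DEEP, MODERATION = 0.0001, 0.00003, 0.0004, 0.01, 0.0001

# One turn with no escalation: router + 4 nano sub-tasks + generation + 2 moderation checks.
base_turn = ROUTER + 4 * NANO + FAST + 2 * MODERATION   # ~$0.0008
deep_turn = base_turn + DEEP                             # ~$0.011 when escalated to gpt-5

def daily_cost(messages_per_day: int, escalation_rate: float) -> float:
    """Estimated daily LLM cost, assuming a given share of turns escalate to gpt-5."""
    return messages_per_day * (base_turn + escalation_rate * DEEP)

# Escalation rates are illustrative assumptions.
print(f"free user, 50 msgs/day, ~5% escalation:   ${daily_cost(50, 0.05):.2f}")   # ~$0.07
print(f"heavy user, 200 msgs/day, ~10% escalation: ${daily_cost(200, 0.10):.2f}")  # ~$0.36
```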