
Model Routing Strategy

Applies to: All PRDs referencing LLM calls
Last Updated: February 2026
Source of truth: This document. PRDs should reference model tiers (fast/default/nano), not specific model names.


Active Models

| Model | Tier | Provider | Cost (prompt / completion per 1K tokens) | Used For |
|---|---|---|---|---|
| gpt-5 | Default (reasoning) | OpenAI | $0.005 / $0.015 | Response generation (when depth is needed) |
| gpt-5-mini | Fast | OpenAI | $0.00015 / $0.0006 | Response generation (default), vision, classification |
| gpt-5-nano | Nano (ultra-fast) | OpenAI | $0.00005 / $0.0002 | Emotion detection, entity extraction, query synthesis, event tracking |
| gemini-2.5-flash-lite | Router | Google | N/A (separate billing) | Unified message routing with full conversation context |
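
For engineers wiring this up, the table reduces to a small pricing/config map. A minimal sketch, assuming a Python service; the ACTIVE_MODELS name and field layout are illustrative assumptions, not an existing module.

```python
# Illustrative constants mirroring the Active Models table (names are assumptions).
# Prices are USD per 1K tokens; the router is billed separately, hence None.
ACTIVE_MODELS = {
    "gpt-5":                 {"tier": "default", "prompt": 0.005,   "completion": 0.015},
    "gpt-5-mini":            {"tier": "fast",    "prompt": 0.00015, "completion": 0.0006},
    "gpt-5-nano":            {"tier": "nano",    "prompt": 0.00005, "completion": 0.0002},
    "gemini-2.5-flash-lite": {"tier": "router",  "prompt": None,    "completion": None},
}
```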

Pipeline Stage → Model Mapping

User Message
  ├── FastReflex (<10ms, regex, no LLM)
  ├── Smart Message Router ─────────── gemini-2.5-flash-lite
  │     Routes to: workflow / miniapp / direct response
  ├── Emotion Classification ───────── gpt-5-nano (150 tokens)
  ├── Entity Extraction ────────────── gpt-5-nano (reflex limits)
  ├── Event Detection ──────────────── gpt-5-nano (300 tokens)
  │     Detects future events for proactive surfacing
  ├── Memory Query Synthesis ───────── gpt-5-nano (300 tokens)
  │     Generates auxiliary questions for multi-query recall
  ├── Content Moderation (Input) ───── omni-moderation-latest (OpenAI)
  ├── Response Generation ──────────── gpt-5-mini (default, 1000 tokens)
  │     │                               gpt-5 (deep reasoning, 2000 tokens)
  │     │
  │     └── Vision (if image) ──────── gpt-5-mini multimodal (1500 tokens)
  ├── Content Moderation (Output) ──── omni-moderation-latest (OpenAI)
  └── Memory Storage ───────────────── No LLM (Supermemory API)
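
The same stage-to-model mapping can be expressed as a single lookup table. This is a sketch only; the stage keys and the PIPELINE_MODELS constant are assumptions for illustration, not the service's actual configuration.

```python
# Hypothetical stage → model lookup for the pipeline above.
# FastReflex and memory storage are omitted because they make no LLM calls.
PIPELINE_MODELS = {
    "router":            "gemini-2.5-flash-lite",
    "emotion":           "gpt-5-nano",               # ~150-token budget
    "entity_extraction": "gpt-5-nano",               # reflex token limit
    "event_detection":   "gpt-5-nano",               # ~300-token budget
    "query_synthesis":   "gpt-5-nano",               # ~300-token budget
    "moderation_input":  "omni-moderation-latest",
    "response_default":  "gpt-5-mini",               # ~1000-token budget
    "response_deep":     "gpt-5",                    # ~2000-token budget, escalation only
    "vision":            "gpt-5-mini",               # multimodal, ~1500-token budget
    "moderation_output": "omni-moderation-latest",
}
```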

Token Limits (Feature-Flag Controlled)

| Context | Flag Key | Default | Emergency (FORCE_BASIC_MODE) |
|---|---|---|---|
| Reflex responses | llm.max_tokens.reflex | 1500 | 200 |
| Deep reasoning | llm.max_tokens.deep_reasoning | 2000 | 200 |
| Default conversation | llm.max_tokens.default | 1000 | 200 |
| Vision analysis | llm.max_tokens.vision | 1500 | 200 |

All OpenAI calls enforce a 15-second timeout to fit within the 20-second routing window.
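
A sketch of how the flag-controlled limits and the timeout might be applied, assuming the openai Python SDK and a generic feature-flag client; flags.get_bool / flags.get_int and the FORCE_BASIC_MODE key are placeholder names for whatever flag system the service actually uses.

```python
from openai import OpenAI

# Defaults from the table above; FORCE_BASIC_MODE clamps every context to 200.
TOKEN_DEFAULTS = {
    "llm.max_tokens.reflex": 1500,
    "llm.max_tokens.deep_reasoning": 2000,
    "llm.max_tokens.default": 1000,
    "llm.max_tokens.vision": 1500,
}
EMERGENCY_MAX_TOKENS = 200

def max_tokens_for(context: str, flags) -> int:
    # `flags` is a hypothetical feature-flag client, not a real library.
    if flags.get_bool("FORCE_BASIC_MODE", default=False):
        return EMERGENCY_MAX_TOKENS
    key = f"llm.max_tokens.{context}"
    return flags.get_int(key, default=TOKEN_DEFAULTS[key])

# 15 s client-side timeout keeps every OpenAI call inside the 20 s routing window.
client = OpenAI(timeout=15.0)
```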


PRD Guidance

When writing PRDs, reference model tiers rather than specific model names. This allows engineering to swap models without PRD updates.

| Tier | Use When | Current Model |
|---|---|---|
| router | Routing decisions, intent classification with full context | gemini-2.5-flash-lite |
| nano | Sub-tasks under 300 tokens: entity extraction, emotion, query synthesis | gpt-5-nano |
| fast | Response generation, vision analysis, classification | gpt-5-mini |
| default | Complex reasoning, deep conversation, multi-step logic | gpt-5 |
| moderation | Content safety checks (input and output) | omni-moderation-latest |

Example PRD language:
- "The emotion classifier uses the nano tier for sub-150-token classification."
- "Response generation defaults to the fast tier, escalating to default for complex reasoning."
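
In code, the tier abstraction can live in one resolution function so that swapping a model never touches a PRD or a call site. A minimal sketch; TIER_TO_MODEL, resolve_model, and generation_model are illustrative names, not an existing API.

```python
# One place where tier names resolve to concrete models.
TIER_TO_MODEL = {
    "router":     "gemini-2.5-flash-lite",
    "nano":       "gpt-5-nano",
    "fast":       "gpt-5-mini",
    "default":    "gpt-5",
    "moderation": "omni-moderation-latest",
}

def resolve_model(tier: str) -> str:
    return TIER_TO_MODEL[tier]

def generation_model(needs_deep_reasoning: bool) -> str:
    # Mirrors the PRD wording: default to the fast tier, escalate to default.
    return resolve_model("default" if needs_deep_reasoning else "fast")
```

A PRD sentence like "the emotion classifier uses the nano tier" then maps to resolve_model("nano") without naming gpt-5-nano anywhere in product documents.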


Cost Implications

| Tier | Typical Call Cost | Calls Per Conversation Turn |
|---|---|---|
| Router (Gemini) | ~$0.0001 | 1 |
| Nano | ~$0.00003 | 2–4 (emotion + entity + query synthesis + event) |
| Fast (generation) | ~$0.0004 | 1 |
| Default (reasoning) | ~$0.01 | 0–1 (only when escalated) |
| Moderation | ~$0.0001 | 2 (input + output) |
| Total per turn | ~$0.001–$0.011 | Depends on escalation |

At 50 messages/day (the free-tier cap), estimated daily LLM cost per free user: ~$0.05–$0.07. At ~200 messages/day (an active Superpowers+ user), estimated daily LLM cost: ~$0.20–$0.50.
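
The per-turn range is easy to sanity-check from the typical call costs above; the call counts below are illustrative assumptions, not measured traffic.

```python
# Back-of-envelope per-turn cost using the typical call costs from the table.
ROUTER, NANO, FAST, DEEP, MODERATION = 0.0001, 0.00003, 0.0004, 0.01, 0.0001

def cost_per_turn(nano_calls: int = 4, escalated: bool = False) -> float:
    total = ROUTER + nano_calls * NANO + 2 * MODERATION
    total += DEEP if escalated else FAST
    return total

print(round(cost_per_turn(), 5))                # 0.00082 — no escalation, rounds to ~$0.001
print(round(cost_per_turn(escalated=True), 5))  # 0.01042 — escalated turn, the ~$0.011 ceiling
print(round(50 * cost_per_turn(), 2))           # 0.04 — free-tier cap with zero escalations;
                                                # occasional escalation pushes this toward $0.05–0.07
```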