# Model Routing Strategy
- Applies to: All PRDs referencing LLM calls
- Last Updated: February 2026
- Source of truth: This document. PRDs should reference model tiers (fast/default/nano), not specific model names.
## Active Models
| Model | Tier | Provider | Cost (prompt/completion per 1K tokens) | Used For |
|---|---|---|---|---|
| gpt-5 | Default / Reasoning | OpenAI | $0.005 / $0.015 | Response generation (when depth needed) |
| gpt-5-mini | Fast | OpenAI | $0.00015 / $0.0006 | Response generation (default), vision, classification |
| gpt-5-nano | Ultra-fast | OpenAI | $0.00005 / $0.0002 | Emotion detection, entity extraction, query synthesis, event tracking |
| gemini-2.5-flash-lite | Router | Google | N/A (separate billing) | Unified message routing with full conversation context |
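For illustration only, this table could be mirrored in a small registry so call sites and cost estimates share one definition. The `ModelInfo` dataclass, `ACTIVE_MODELS` dict, and `call_cost` helper below are hypothetical names, not part of any codebase; the costs are the USD-per-1K figures from the table, with the router recorded as zero because it is billed separately.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelInfo:
    tier: str
    provider: str
    prompt_cost_per_1k: float       # USD per 1K prompt tokens
    completion_cost_per_1k: float   # USD per 1K completion tokens

# Hypothetical registry mirroring the table above.
ACTIVE_MODELS = {
    "gpt-5":                 ModelInfo("default", "OpenAI", 0.005,   0.015),
    "gpt-5-mini":            ModelInfo("fast",    "OpenAI", 0.00015, 0.0006),
    "gpt-5-nano":            ModelInfo("nano",    "OpenAI", 0.00005, 0.0002),
    "gemini-2.5-flash-lite": ModelInfo("router",  "Google", 0.0,     0.0),  # billed separately
}

def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of a single call from its token counts."""
    info = ACTIVE_MODELS[model]
    return (prompt_tokens / 1000) * info.prompt_cost_per_1k \
         + (completion_tokens / 1000) * info.completion_cost_per_1k
```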
## Pipeline Stage → Model Mapping
```
User Message
│
├── FastReflex (<10ms, regex, no LLM)
│
├── Smart Message Router ─────────── gemini-2.5-flash-lite
│       Routes to: workflow / miniapp / direct response
│
├── Emotion Classification ───────── gpt-5-nano (150 tokens)
│
├── Entity Extraction ────────────── gpt-5-nano (reflex limits)
│
├── Event Detection ──────────────── gpt-5-nano (300 tokens)
│       Detects future events for proactive surfacing
│
├── Memory Query Synthesis ───────── gpt-5-nano (300 tokens)
│       Generates auxiliary questions for multi-query recall
│
├── Content Moderation (Input) ───── omni-moderation-latest (OpenAI)
│
├── Response Generation ──────────── gpt-5-mini (default, 1000 tokens)
│       │                            gpt-5 (deep reasoning, 2000 tokens)
│       │
│       └── Vision (if image) ────── gpt-5-mini multimodal (1500 tokens)
│
├── Content Moderation (Output) ──── omni-moderation-latest (OpenAI)
│
└── Memory Storage ───────────────── No LLM (Supermemory API)
```
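As a minimal sketch, the LLM stages above could be expressed as a lookup table. The stage keys and the `PIPELINE_STAGE_MODELS` / `model_for_stage` names are hypothetical, not identifiers from the codebase; `None` marks budgets that come from a feature flag or are not token-capped in this document.

```python
# Hypothetical stage table mirroring the diagram: each LLM stage maps to
# (model, max output tokens).
PIPELINE_STAGE_MODELS = {
    "smart_message_router":   ("gemini-2.5-flash-lite",  None),
    "emotion_classification": ("gpt-5-nano",             150),
    "entity_extraction":      ("gpt-5-nano",             None),  # uses the reflex token-limit flag
    "event_detection":        ("gpt-5-nano",             300),
    "memory_query_synthesis": ("gpt-5-nano",             300),
    "moderation_input":       ("omni-moderation-latest", None),
    "response_generation":    ("gpt-5-mini",             1000),  # gpt-5 / 2000 when escalated
    "vision_analysis":        ("gpt-5-mini",             1500),
    "moderation_output":      ("omni-moderation-latest", None),
}

def model_for_stage(stage: str) -> str:
    """Resolve a pipeline stage to its currently configured model name."""
    return PIPELINE_STAGE_MODELS[stage][0]
```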
## Token Limits (Feature-Flag Controlled)
| Context | Flag Key | Default | Emergency (FORCE_BASIC_MODE) |
|---|---|---|---|
| Reflex responses | llm.max_tokens.reflex | 1500 | 200 |
| Deep reasoning | llm.max_tokens.deep_reasoning | 2000 | 200 |
| Default conversation | llm.max_tokens.default | 1000 | 200 |
| Vision analysis | llm.max_tokens.vision | 1500 | 200 |
All OpenAI calls enforce a 15-second timeout to fit within the 20-second routing window.
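A minimal sketch of flag-backed limit resolution, assuming a simple in-process defaults table: the flag keys and values come from the table above, while `DEFAULT_LIMITS`, `max_tokens_for`, and the constants are illustrative names rather than real configuration.

```python
# Defaults mirror the table above; under FORCE_BASIC_MODE every context is
# clamped to the emergency limit.
DEFAULT_LIMITS = {
    "llm.max_tokens.reflex":         1500,
    "llm.max_tokens.deep_reasoning": 2000,
    "llm.max_tokens.default":        1000,
    "llm.max_tokens.vision":         1500,
}
EMERGENCY_LIMIT = 200
OPENAI_TIMEOUT_SECONDS = 15  # keeps every OpenAI call inside the 20-second routing window

def max_tokens_for(flag_key: str, force_basic_mode: bool = False) -> int:
    """Resolve a token budget from its feature flag, honoring emergency mode."""
    if force_basic_mode:
        return EMERGENCY_LIMIT
    return DEFAULT_LIMITS[flag_key]
```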
## PRD Guidance
When writing PRDs, reference model tiers rather than specific model names. This allows engineering to swap models without PRD updates.
| Tier | Use When | Current Model |
|---|---|---|
| router | Routing decisions, intent classification with full context | gemini-2.5-flash-lite |
| nano | Sub-tasks under 300 tokens: entity extraction, emotion, query synthesis | gpt-5-nano |
| fast | Response generation, vision analysis, classification | gpt-5-mini |
| default | Complex reasoning, deep conversation, multi-step logic | gpt-5 |
| moderation | Content safety checks (input and output) | omni-moderation-latest |
Example PRD language:
- "The emotion classifier uses the nano tier for sub-150-token classification."
- "Response generation defaults to the fast tier, escalating to default for complex reasoning."
## Cost Implications
| Tier | Typical Call Cost | Calls Per Conversation Turn |
|---|---|---|
| Router (Gemini) | ~$0.0001 | 1 |
| Nano | ~$0.00003 | 2–4 (emotion + entity + query synthesis + event) |
| Fast (generation) | ~$0.0004 | 1 |
| Default (reasoning) | ~$0.01 | 0–1 (only when escalated) |
| Moderation | ~$0.0001 | 2 (input + output) |
| Total per turn | ~$0.001–$0.011 | Depends on escalation |
At 50 messages/day (the free-tier cap), the estimated daily LLM cost per free user is ~$0.05–$0.07. At ~200 messages/day (an active Superpowers+ user), the estimated daily LLM cost is ~$0.20–$0.50.
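A back-of-the-envelope check of these figures, using the approximate per-call costs from the table above; the escalation rates below are assumptions chosen to illustrate the stated ranges, not measured values.

```python
# Approximate per-call costs (USD) from the table above.
ROUTER, NANO, FAST, DEEP, MODERATION = 0.0001, 0.00003, 0.0004, 0.01, 0.0001

# One turn with no escalation: router + 4 nano sub-tasks + generation + 2 moderation checks.
base_turn = ROUTER + 4 * NANO + FAST + 2 * MODERATION   # ~$0.0008
deep_turn = base_turn + DEEP                             # ~$0.011 when escalated to gpt-5

def daily_cost(messages_per_day: int, escalation_rate: float) -> float:
    """Estimated daily LLM cost, assuming a given share of turns escalate to gpt-5."""
    return messages_per_day * (base_turn + escalation_rate * DEEP)

# Escalation rates are illustrative assumptions.
print(f"free user, 50 msgs/day, ~5% escalation:   ${daily_cost(50, 0.05):.2f}")   # ~$0.07
print(f"heavy user, 200 msgs/day, ~10% escalation: ${daily_cost(200, 0.10):.2f}")  # ~$0.36
```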