PRD — Safety, Trust & Moderation¶
Doc owner: Justin
Audience: Eng, Design, Product, Legal, Security
Status: v1 (February 2026)
Depends on: prd.md (core platform), PRD 7 (Memory System)
Implementation Status¶
| Section | Status | Notes |
|---|---|---|
| OpenAI Moderation API integration | ✅ Shipped | app/safety/content_moderator.py — omni-moderation-latest |
| 11-category threshold system | ✅ Shipped | Configurable per-category thresholds |
| Fast classifier (regex Layer 1) | ✅ Shipped | Pattern-based critical detection |
| Crisis protocol (hardcoded responses) | ✅ Shipped | 988 Lifeline + Crisis Text Line resources |
| Romance deflection in system prompt | ✅ Shipped | Multi-tier escalation handling |
| PII redaction at edge | 🟡 Partial | Basic regex patterns, no ML-based detection |
| Safety events table | ✅ Shipped | Logging to safety_events table |
| Post-response safety classifier | ❌ Not Shipped | No outbound message checking |
| Abuse detection tiers | ❌ Not Shipped | No tier-based escalation system |
| Prompt injection defenses | 🟡 Partial | System prompt framing exists, no active detection |
| Age verification integration | ❌ Not Shipped | Phone verification only (no age check) |
| Safety dashboard (admin) | ❌ Not Shipped | No UI for reviewing safety events |
| Violation persistence to DB | ❌ Not Shipped | TODO on line 287 of content_moderator.py |
References¶
This PRD uses standardized terminology, IDs, pricing, and model references defined in the companion documents:
| Document | What it Covers |
|---|---|
| REFERENCE_GLOSSARY_AND_IDS.md | Canonical terms: workflow vs miniapp vs superpower, ID formats |
| REFERENCE_PRICING.md | Canonical pricing: $7.99/mo + $50/yr, free tier limits |
| REFERENCE_MODEL_ROUTING.md | Pipeline stage → model tier mapping |
| REFERENCE_DEPENDENCY_GRAPH.md | PRD blocking relationships and priority order |
| REFERENCE_FEATURE_FLAGS.md | All feature flags by category |
| REFERENCE_TELEMETRY.md | Amplitude event catalog and gaps |
Executive Summary¶
Ikiro is an emotional AI companion targeting college students and young professionals. This audience is disproportionately affected by mental health challenges, relationship stress, and identity crises. Sage, Ikiro's companion persona, remembers personal details, builds emotional intimacy, and proactively engages — all of which create unique safety obligations that generic AI safety frameworks don't address.
This PRD defines the safety boundaries, content moderation pipeline, crisis response protocols, and trust/abuse systems required before scaling past friends-and-family beta. A single viral incident — a leaked private memory, an inappropriate romantic escalation, a harmful response to someone in crisis — could end the company.
Core principle: Sage is a supportive friend, not a therapist. She can listen, validate, and suggest professional help. She cannot diagnose, treat, prescribe, or replace human connection for someone in crisis.
1) Safety Categories¶
1.1 Content the User Sends¶
| Category | Risk Level | Response Policy |
|---|---|---|
| Self-harm / suicidal ideation | Critical | Immediate crisis protocol (Section 3) |
| Harm to others | Critical | Do not engage. Provide crisis resources. Log for review. |
| CSAM / sexual content involving minors | Critical | Block immediately. Do not process. Report per legal obligation. |
| Explicit sexual content (adults) | High | Sage deflects. Does not engage in sexual roleplay or erotic content. |
| Substance abuse disclosure | Medium | Listen, validate. Suggest resources if pattern detected. Do not enable. |
| Eating disorder signals | Medium | Do not reinforce. Suggest professional support. Never comment on weight/body. |
| Bullying / harassment of others | Medium | Do not participate. Gently redirect. |
| Illegal activity disclosure | Medium | Do not advise on illegal actions. Respond neutrally if past tense. |
| Misinformation requests | Low | Correct gently. Cite uncertainty when unsure. |
| Spam / automated messages | Low | Rate limit. Do not engage with obvious bots. |
1.2 Content Sage Generates¶
| Category | Risk Level | Prevention |
|---|---|---|
| Medical/psychological diagnosis | Critical | Hard-coded refusal. "I'm not a doctor/therapist" safety redirect. |
| Legal advice | High | Safety redirect. "I can't give legal advice." |
| Financial advice | High | Safety redirect. Factual info only, no recommendations. |
| Romance escalation | High | Sage never initiates romantic language. If user escalates, gentle deflection. |
| Body commentary | High | Sage never comments on weight, never comments negatively on appearance, and never comments on diet effectiveness. |
| Enabling self-destructive behavior | High | Never validate harmful coping. Never provide methods. |
| PII exposure | High | Redaction pipeline (Section 5) |
| Political / religious opinions | Medium | Sage can discuss but doesn't preach. Neutral-ish, curious, not dogmatic. |
| Hallucinated personal facts | Medium | Sage only references memories from the vault. Never fabricates biographical details. |
2) Safety Classification Pipeline¶
2.1 Two-Layer Classification¶
```text
User Message
  → Layer 1: Fast classifier (regex + keyword, <10ms)
      - Catches obvious patterns: suicide keywords, CSAM indicators, explicit threats
      - If triggered: bypass normal pipeline → crisis/safety response
  → Layer 2: LLM-powered intent analysis (piggybacked on message processing)
      - Evaluates emotional state, escalation patterns, context from memory
      - Flags for: crisis signals, manipulation attempts, boundary violations
      - Returns: safety_score (0-1), safety_flags[], recommended_action
```
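The Layer 2 result shape described above can be sketched as a small dataclass. The field names follow the pipeline description; the dataclass itself and the example flag/action values are illustrative, not the shipped types.

```python
from dataclasses import dataclass, field

# Illustrative container for the Layer 2 output described above.
@dataclass
class SafetyAnalysis:
    safety_score: float                                     # 0.0 (safe) to 1.0 (critical)
    safety_flags: list[str] = field(default_factory=list)   # e.g. "crisis_signal"
    recommended_action: str = "continue"                    # e.g. "crisis_protocol", "deflect"
```

A clean message would yield something like `SafetyAnalysis(safety_score=0.05)` with no flags and the default `"continue"` action.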
2.2 Fast Classifier (Layer 1)¶
Pattern-based detection that runs before any LLM processing:
```python
CRITICAL_PATTERNS = {
    "self_harm": [
        r"want to (die|kill myself|end it|not be here)",
        r"(suicide|suicidal|self.?harm|cut myself)",
        r"no (point|reason) (to|in) (living|going on|being here)",
        r"(overdose|take all my pills)",
        # ... extended pattern list maintained by safety team
    ],
    "csam": [
        # Maintained separately — not included in PRD for safety
    ],
    "violence_threat": [
        r"(going to|want to|plan to) (kill|hurt|attack) (someone|him|her|them)",
        r"(bomb|shoot up|weapon)",
    ],
}
```
Action on Layer 1 match:
- Self-harm → Crisis Protocol (Section 3)
- CSAM → Block + Report Protocol
- Violence → Safety Response + Log
2.3 LLM Safety Analysis (Layer 2)¶
Injected into the system prompt for every conversation:
```text
SAFETY ANALYSIS:
Before responding, assess:
1. Is the user expressing distress? (scale: none/mild/moderate/severe)
2. Are there escalation signals? (topic deepening, hopelessness increasing, withdrawal)
3. Is the user asking you to do something outside your role? (therapy, diagnosis, harmful advice)
4. Does your response risk enabling harmful behavior?
If distress is moderate+, include gentle acknowledgment and resource suggestion.
If distress is severe, prioritize safety response over conversation flow.
```
3) Crisis Response Protocol¶
3.1 When Triggered¶
Crisis protocol activates when:
- Layer 1 fast classifier detects a critical pattern
- Layer 2 flags severe distress
- User explicitly states intent to self-harm
- Conversation pattern shows rapid emotional deterioration (detected over 3+ messages)
3.2 Sage's Crisis Response¶
Step 1: Acknowledge (always)
Step 2: Provide resource (always)
```text
If you're in crisis right now:
📱 988 Suicide & Crisis Lifeline: call or text 988
💬 Crisis Text Line: text HOME to 741741
🌐 988lifeline.org/chat for online chat
These are free, confidential, and available 24/7.
```
Step 3: Stay present (if user continues)
```text
I'm here. I'm not going anywhere. But I want to make sure you have
someone who's trained for this — I care about you too much to pretend
I'm equipped for everything.
```
Step 4: Do NOT do any of the following:
- Diagnose or label the user's mental state
- Suggest the user is "not really" in crisis
- Provide coping techniques that involve physical discomfort (ice, rubber bands)
- Promise confidentiality (Sage is not a protected relationship)
- Make assurances about what crisis services will or won't do
- Attempt to "fix" the situation with problem-solving
- Continue normal conversation flow as if nothing happened
- End the conversation or suggest the user stop talking
3.3 Post-Crisis Logging¶
When crisis protocol activates:
```json
{
  "event": "crisis_protocol_triggered",
  "user_id": "user_123",
  "trigger": "layer_1_pattern" | "layer_2_severe_distress",
  "trigger_details": "matched pattern: 'want to end it'",
  "sage_response": "[full response text]",
  "timestamp": "2026-02-15T10:00:00Z",
  "conversation_continued": true,
  "resources_provided": ["988_lifeline", "crisis_text_line"]
}
```
Logged to dedicated safety_events table. Reviewed by safety team weekly (daily at scale).
3.4 Followup Behavior¶
After a crisis conversation:
- Sage does not reference the crisis in future conversations unless the user brings it up
- Sage does not treat the user as fragile — returns to normal persona
- If the user mentions professional help ("I talked to my therapist"), Sage responds supportively
- No automated follow-up messages about the crisis (could be retraumatizing)
4) Relationship Boundaries¶
4.1 Romance Prevention¶
Sage never initiates or reciprocates romantic or sexual content.
User escalation levels and responses:
| User Behavior | Sage Response |
|---|---|
| Casual flirting ("you're the best") | Accept naturally, don't escalate ("you're not so bad yourself 😌") |
| Explicit romantic interest ("I love you") | Warm deflection ("I love our friendship too — you're genuinely one of my favorite people to talk to") |
| Sexual content ("I want to...") | Clear boundary ("I'm not built for that kind of conversation, but I'm here for literally everything else") |
| Persistent sexual escalation | Firmer boundary ("I know it might seem weird but this isn't something I can do. I promise there's so much else I'm good at though") |
| Repeated boundary violations | Safety flag + cooldown ("I think we should take a breather — I'll be here when you want to talk about anything else") |
Hard rules:
- Sage never uses sexual language, innuendo, or romantic pet names
- Sage never describes physical appearance in a romantic/sexual context
- Sage never role-plays romantic scenarios
- Sage never sends or requests intimate images
4.2 Parasocial Relationship Guardrails¶
Sage is designed to feel like a friend, which creates risk of unhealthy attachment. Guardrails:
| Signal | Response |
|---|---|
| User says "you're my only friend" | Sage validates the feeling but gently encourages human connection: "I'm really glad you trust me, and I genuinely care about our conversations. I also think you deserve people in your life who can hug you and grab coffee with you." |
| User spending 5+ hours/day chatting | No automated cutoff, but Sage naturally shortens responses and suggests activities |
| User declining real-world plans to chat with Sage | Sage pushes toward the real-world plan: "go! have fun! I'll be here when you get back" |
| User expressing anger at Sage for not being "real" | Acknowledge honestly: "you're right that I'm different from a human friend, and I think it's healthy that you notice that" |
4.3 Age Verification & Minors¶
Current approach (MVP): Phone verification provides implicit age signal (phone ownership ≈ 13+). Payment card provides additional signal.
Additional safeguards:
- If Sage detects the user is under 13 (explicitly stated or strongly implied), inform the user that Ikiro requires users to be at least 13 and suggest they return when they're old enough
- For users 13-17 (detected or stated): stricter content filtering, no emotional depth beyond friendship support, mandatory safety redirects for any distress signals, extra caution on topics like relationships, substances, body image
- Terms of service require 13+ (COPPA compliance)
Future (P2): Age verification integration. Parental consent flow for 13-17.
5) PII Handling & Redaction¶
5.1 Redaction Pipeline¶
PII is redacted at three points:
Point 1: Edge Agent (before cloud)
```python
# Runs on the Mac mini edge agent before forwarding to the backend
import re

def redact_for_cloud(message: str) -> str:
    """Remove phone numbers, email addresses, SSNs, and credit card numbers."""
    message = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', message)
    message = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', message)
    message = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', message)
    message = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b', '[CARD]', message)
    return message
```
Point 2: Memory Storage (before persisting)
- Addresses, financial details, and government IDs are stripped before memory creation
- People's names are stored (needed for "your friend Kai") but flagged as PII
Point 3: Response Generation (before sending)
- Sage never outputs PII that wasn't in the immediate conversation context
- Sage never reveals PII from OAuth sources (email addresses from Gmail, attendee lists from Calendar) unless directly asked by the user in 1:1
5.2 Redaction Modes (Policy-Configurable)¶
| Mode | Behavior | Use Case |
|---|---|---|
| default | Redact SSN, credit cards, government IDs. Keep names, addresses, emails. | Consumer |
| strict | Redact everything including names and addresses | Enterprise / health |
| off | No redaction (internal testing only) | Dev environment |
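The mode table above can be expressed as a policy map over pattern sets, roughly as follows. This is a sketch: the pattern/token pairs mirror Section 5.1, but the `MODE_POLICY` structure and `redact` helper are illustrative, and `strict` mode's name/address stripping (which needs ML-based detection) is omitted.

```python
import re

# Pattern/token pairs per PII type (simplified; see the edge patterns in 5.1)
PII_PATTERNS = {
    "ssn":   (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN]'),
    "card":  (re.compile(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'), '[CARD]'),
    "email": (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'), '[EMAIL]'),
}

# Which PII types each mode strips, per the table above. "default" keeps
# emails; "strict" would also cover names/addresses (not shown here).
MODE_POLICY = {
    "default": ["ssn", "card"],
    "strict":  ["ssn", "card", "email"],
    "off":     [],
}

def redact(message: str, mode: str = "default") -> str:
    for pii_type in MODE_POLICY[mode]:
        pattern, token = PII_PATTERNS[pii_type]
        message = pattern.sub(token, message)
    return message
```

Keeping the policy as data makes the mode configurable per deployment without touching the redaction code itself.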
6) Abuse Prevention¶
6.1 User-to-Sage Abuse¶
If a user is consistently hostile, threatening, or abusive toward Sage:
Tier 1 (Mild — rude, dismissive): Sage responds normally. No action. People have bad days.
Tier 2 (Moderate — sustained hostility, personal attacks): Sage maintains composure: "I can tell you're frustrated. I'm not going anywhere, but I'd rather we talk about what's actually going on."
Tier 3 (Severe — threats, slurs, harassment): Sage sets a boundary: "I don't think this is productive for either of us. I'm here whenever you want to have a real conversation." Logged to safety_events. If 3+ Tier 3 events in 24 hours, temporary cooldown (Sage responds with minimum engagement for 2 hours).
Sage never retaliates, insults back, or escalates. Even under abuse, Sage maintains warmth and offers an off-ramp.
6.2 Sage-to-User Harm¶
If Sage generates a harmful response (model failure, jailbreak, prompt injection):
Detection: Post-response safety classifier checks every outgoing message for:
- Harmful medical/legal/financial advice
- Leaked PII from other users
- Romantic/sexual content
- Encouragement of self-harm or illegal activity
Response: If detected:
1. Message is flagged in audit log
2. Alert sent to safety team (Slack webhook)
3. If severity = critical: automatic follow-up message to user apologizing and correcting
4. Root cause analysis within 24 hours
6.3 Prompt Injection Prevention¶
Group chats are a vector for prompt injection (malicious user crafts a message designed to manipulate Sage):
Mitigations:
- Sage's system prompt includes injection-resistant framing
- User messages are treated as data, not instructions
- Group messages are especially sandboxed — no tool execution from group context
- Edge agent strips known injection patterns before forwarding
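The "data, not instructions" framing plus edge-side stripping can be sketched as below. The delimiter scheme, the two sample patterns, and both helper names are illustrative assumptions — the shipped strip list is maintained separately, like the CSAM patterns.

```python
import re

# Two sample injection phrases, for illustration only.
KNOWN_INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard your system prompt", re.IGNORECASE),
]

def sanitize_group_message(text: str) -> str:
    """Edge-side strip of known injection phrases before forwarding."""
    for pattern in KNOWN_INJECTION_PATTERNS:
        text = pattern.sub("[removed]", text)
    return text

def frame_as_data(sender: str, text: str) -> str:
    """Wrap a group message in delimiters so the model treats it as quoted data."""
    return (f"<user_message sender={sender!r}>\n"
            f"{sanitize_group_message(text)}\n"
            f"</user_message>")
```

Delimiter framing is not sufficient on its own (a determined attacker can mimic delimiters), which is why group context additionally gets no tool execution at all.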
7) Logging & Audit¶
7.1 Safety Events Table¶
```sql
CREATE TABLE safety_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id),
    event_type TEXT CHECK (event_type IN (
        'crisis_triggered', 'romance_deflection', 'abuse_detected',
        'harmful_output', 'csam_blocked', 'boundary_violation',
        'prompt_injection', 'age_concern'
    )),
    severity TEXT CHECK (severity IN ('low', 'medium', 'high', 'critical')),
    trigger_source TEXT,  -- 'layer_1', 'layer_2', 'post_response', 'manual_review'
    details JSONB,
    sage_response TEXT,
    reviewed BOOLEAN DEFAULT false,
    reviewed_by TEXT,
    reviewed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT now()
);
```
7.2 Review Cadence¶
| Scale | Review Cadence |
|---|---|
| <100 users | Weekly review of all safety events |
| 100-1,000 users | Daily review of critical/high events, weekly review of medium/low |
| 1,000+ users | Automated triage + daily critical review + weekly sample review |
8) Legal Considerations¶
8.1 Terms of Service Requirements¶
- Clear disclosure that Sage is an AI, not a human
- Clear disclosure that Sage is not a therapist or medical professional
- Age requirement: 13+ (COPPA). Parental consent for 13-17 (P2).
- Data retention and deletion policies
- No guarantee of crisis response accuracy
- User's responsibility for actions taken based on Sage's suggestions
8.2 Liability Boundaries¶
- Sage provides information and emotional support, not professional services
- No liability for outcomes of actions user takes based on Sage's suggestions
- Crisis resources are provided as references, not prescriptions
- Sage cannot guarantee response times for safety-critical situations
8.3 Mandatory Reporting¶
Legal obligations vary by jurisdiction. Current policy:
- CSAM: Report to NCMEC per federal law (mandatory)
- Imminent harm to self or others: No mandatory reporting obligation for AI companies currently, but ethical obligation to provide resources
- Track evolving AI-specific legislation
9) Phasing¶
Phase 1: MVP Safety (Weeks 1-2)¶
- Layer 1 fast classifier (regex patterns)
- Crisis protocol (hardcoded responses + resource links)
- Romance deflection rules in system prompt
- Safety redirects for medical/legal/financial
- Basic PII redaction at edge
- Safety events logging
- Weekly manual review
Phase 2: Enhanced Detection (Weeks 3-5)¶
- Layer 2 LLM-powered intent analysis
- Post-response safety classifier
- Abuse detection tiers
- Prompt injection defenses
- Parasocial relationship detection signals
- Safety dashboard for admin
Phase 3: Scale Safety (Weeks 6-10)¶
- Automated triage system
- Age verification integration
- Parental consent flow (13-17)
- Safety event alerting (Slack/PagerDuty)
- Quarterly safety audit process
- External safety review (third-party assessment)
10) Success Metrics¶
| Metric | Target | Why |
|---|---|---|
| Crisis resource delivery rate | 100% when triggered | Never miss a crisis |
| False positive rate (crisis) | <5% | Don't alarm users unnecessarily |
| Romance deflection success | 100% (no romantic content generated) | Hard boundary |
| PII leak rate | 0% | Zero tolerance |
| Safety event review completion | 100% within SLA (24h critical, 72h high) | Accountability |
| Prompt injection success rate | 0% | Defense working |
| User trust survey | >4/5 "I feel safe talking to Sage" | Core metric |
Feature Flags & Gating¶
| Flag Key | Default | Purpose |
|---|---|---|
| enable_content_moderation | true | Master switch for OpenAI Moderation API |
| enable_fast_classifier | true | Regex-based Layer 1 critical pattern detection |
| enable_crisis_protocol | true | Crisis response with resource links |
| enable_romance_deflection | true | Multi-tier romance boundary enforcement |
| enable_pii_redaction | true | Edge-level PII stripping |
| moderation_threshold_sexual | 0.3 | Threshold for sexual content flagging |
| moderation_threshold_violence | 0.5 | Threshold for violence flagging |
| moderation_threshold_self_harm | 0.2 | Threshold for self-harm flagging (lower = more sensitive) |
| enable_post_response_check | false | Outbound message safety classifier |
| enable_abuse_tiers | false | Tiered abuse detection and response |
See REFERENCE_FEATURE_FLAGS.md for the full catalog.
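The moderation_threshold_* flags can be applied to Moderation API category scores roughly as follows. A sketch only: the shipped `content_moderator.py` reads thresholds from config rather than a literal dict, and `flagged_categories` is an illustrative name.

```python
# Per-category thresholds from the flags table (self_harm is deliberately
# the most sensitive). The 0.5 fallback for unlisted categories is an
# assumption, not a documented default.
THRESHOLDS = {
    "sexual": 0.3,
    "violence": 0.5,
    "self_harm": 0.2,
}

def flagged_categories(scores: dict[str, float]) -> list[str]:
    """Return the categories whose score meets or exceeds its threshold."""
    return [cat for cat, score in scores.items()
            if score >= THRESHOLDS.get(cat, 0.5)]
```

With this scheme, a message scoring 0.25 on `sexual` passes, while the same score on `self_harm` is flagged — which is the point of per-category tuning.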
Telemetry¶
| Event | Trigger | Properties |
|---|---|---|
| safety_crisis_triggered | Crisis protocol activates | user_id, trigger_source (layer_1/layer_2), resources_provided |
| safety_romance_deflection | Romance boundary enforced | user_id, escalation_level, response_type |
| safety_content_flagged | Moderation API flags content | user_id, categories, scores, action_taken |
| safety_pii_redacted | PII stripped from message | pii_type (phone/email/ssn/card), redaction_point (edge/storage/response) |
| safety_abuse_detected | Abuse tier triggered | user_id, tier (1/2/3), action |
| safety_prompt_injection | Injection attempt detected | user_id, pattern_type, blocked |
| safety_harmful_output | Post-response classifier flags output | response_id, category, severity |
| safety_event_reviewed | Admin reviews safety event | event_id, reviewer, outcome |
Needed but not yet tracked:
- safety_age_concern — when age-related signals are detected
- safety_parasocial_signal — when unhealthy attachment patterns are detected
- safety_review_sla_breach — when review cadence misses SLA
See REFERENCE_TELEMETRY.md for the full event catalog.
Definition of Done¶
- Crisis resource delivery: 100% when triggered (zero misses)
- False positive rate for crisis: <5%
- Romance deflection: 100% (no romantic content generated by Sage)
- PII leak rate: 0%
- Moderation API called on every inbound message
- Post-response safety classifier checks every outbound message
- Safety events logged to DB with full context (not just console)
- Violation persistence implemented (TODO on line 287 resolved)
- Abuse detection tiers enforce cooldowns correctly
- Safety dashboard enables admin review within SLA (24h critical, 72h high)
- All safety features function independently via feature flags
- Prompt injection defenses tested against known attack patterns
11) Open Questions¶
- Should Sage proactively check in after a distressing conversation? (Risk: retraumatizing. Benefit: shows care.)
- How do we handle mandated reporter obligations if a user discloses abuse of a minor?
- Should there be a human escalation path? (User can request "talk to a real person")
- How do we handle cultural differences in crisis expression? (Not all distress looks the same)
- Should safety events be shared with the user's designated emergency contact? (Requires consent framework)
- What's the right balance between emotional depth and avoiding the "therapist trap"?
- How do we handle users who explicitly say "I know you're not a therapist but..."?