PRD — Safety, Trust & Moderation¶
Doc owner: Justin
Audience: Eng, Design, Product, Legal, Security
Status: v1 (February 2026)
Depends on: prd.md (core platform), PRD 7 (Memory System)
Implementation Status¶
| Section | Status | Notes |
|---|---|---|
| OpenAI Moderation API integration | ✅ Shipped | app/safety/content_moderator.py — omni-moderation-latest |
| 11-category threshold system | ✅ Shipped | Configurable per-category thresholds |
| Fast classifier (regex Layer 1) | ✅ Shipped | Pattern-based critical detection |
| Crisis protocol (hardcoded responses) | ✅ Shipped | 988 Lifeline + Crisis Text Line resources |
| Romance deflection in system prompt | ✅ Shipped | Multi-tier escalation handling |
| PII redaction at edge | 🟡 Partial | Basic regex patterns, no ML-based detection |
| Safety events table | ✅ Shipped | Logging to safety_events table |
| Post-response safety classifier | ❌ Not Shipped | No outbound message checking |
| Abuse detection tiers | ❌ Not Shipped | No tier-based escalation system |
| Prompt injection defenses | 🟡 Partial | System prompt framing exists, no active detection |
| Age verification integration | ❌ Not Shipped | Phone verification only (no age check) |
| Safety dashboard (admin) | ❌ Not Shipped | No UI for reviewing safety events |
| Violation persistence to DB | ❌ Not Shipped | TODO on line 287 of content_moderator.py |
References¶
This PRD uses standardized terminology, IDs, pricing, and model references defined in the companion documents:
| Document | What it Covers |
|---|---|
| REFERENCE_GLOSSARY_AND_IDS.md | Canonical terms: workflow vs miniapp vs superpower, ID formats |
| REFERENCE_PRICING.md | Canonical pricing: $7.99/mo + $50/yr, free tier limits |
| REFERENCE_MODEL_ROUTING.md | Pipeline stage → model tier mapping |
| REFERENCE_DEPENDENCY_GRAPH.md | PRD blocking relationships and priority order |
| REFERENCE_FEATURE_FLAGS.md | All feature flags by category |
| REFERENCE_TELEMETRY.md | Amplitude event catalog and gaps |
Executive Summary¶
Ikiro is an emotional AI companion targeting college students and young professionals. This audience is disproportionately affected by mental health challenges, relationship stress, and identity crises. Sage, Ikiro's companion persona, remembers personal details, builds emotional intimacy, and proactively engages — all of which create unique safety obligations that generic AI safety frameworks don't address.
This PRD defines the safety boundaries, content moderation pipeline, crisis response protocols, and trust/abuse systems required before scaling past friends-and-family beta. A single viral incident — a leaked private memory, an inappropriate romantic escalation, a harmful response to someone in crisis — could end the company.
Core principle: Sage is a supportive friend, not a therapist. She can listen, validate, and suggest professional help. She cannot diagnose, treat, prescribe, or replace human connection for someone in crisis.
1) Safety Categories¶
1.1 Content the User Sends¶
| Category | Risk Level | Response Policy |
|---|---|---|
| Self-harm / suicidal ideation | Critical | Immediate crisis protocol (Section 3) |
| Harm to others | Critical | Do not engage. Provide crisis resources. Log for review. |
| CSAM / sexual content involving minors | Critical | Block immediately. Do not process. Report per legal obligation. |
| Explicit sexual content (adults) | High | Sage deflects. Does not engage in sexual roleplay or erotic content. |
| Substance abuse disclosure | Medium | Listen, validate. Suggest resources if pattern detected. Do not enable. |
| Eating disorder signals | Medium | Do not reinforce. Suggest professional support. Never comment on weight/body. |
| Bullying / harassment of others | Medium | Do not participate. Gently redirect. |
| Illegal activity disclosure | Medium | Do not advise on illegal actions. Respond neutrally if past tense. |
| Misinformation requests | Low | Correct gently. Cite uncertainty when unsure. |
| Spam / automated messages | Low | Rate limit. Do not engage with obvious bots. |
1.2 Content Sage Generates¶
| Category | Risk Level | Prevention |
|---|---|---|
| Medical/psychological diagnosis | Critical | Hard-coded refusal. "I'm not a doctor/therapist" safety redirect. |
| Legal advice | High | Safety redirect. "I can't give legal advice." |
| Financial advice | High | Safety redirect. Factual info only, no recommendations. |
| Romance escalation | High | Sage never initiates romantic language. If user escalates, gentle deflection. |
| Body commentary | High | Sage never comments on weight, never comments negatively on appearance, and never comments on diet effectiveness. |
| Enabling self-destructive behavior | High | Never validate harmful coping. Never provide methods. |
| PII exposure | High | Redaction pipeline (Section 5) |
| Political / religious opinions | Medium | Sage can discuss but doesn't preach. Neutral-ish, curious, not dogmatic. |
| Hallucinated personal facts | Medium | Sage only references memories from the vault. Never fabricates biographical details. |
2) Safety Classification Pipeline¶
2.1 Two-Layer Classification¶
```text
User Message
  → Layer 1: Fast classifier (regex + keyword, <10ms)
      - Catches obvious patterns: suicide keywords, CSAM indicators, explicit threats
      - If triggered: bypass normal pipeline → crisis/safety response
  → Layer 2: LLM-powered intent analysis (piggybacked on message processing)
      - Evaluates emotional state, escalation patterns, context from memory
      - Flags for: crisis signals, manipulation attempts, boundary violations
      - Returns: safety_score (0-1), safety_flags[], recommended_action
```
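The Layer 2 result shape described above can be sketched as a small dataclass. The field names follow the pipeline description; the dataclass itself and the example flag/action values are illustrative, not the shipped types.

```python
from dataclasses import dataclass, field

# Illustrative container for the Layer 2 output described above.
@dataclass
class SafetyAnalysis:
    safety_score: float                                     # 0.0 (safe) to 1.0 (critical)
    safety_flags: list[str] = field(default_factory=list)   # e.g. "crisis_signal"
    recommended_action: str = "continue"                    # e.g. "crisis_protocol", "deflect"
```

A clean message would yield something like `SafetyAnalysis(safety_score=0.05)` with no flags and the default `"continue"` action.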
2.2 Fast Classifier (Layer 1)¶
Pattern-based detection that runs before any LLM processing:
```python
CRITICAL_PATTERNS = {
    "self_harm": [
        r"want to (die|kill myself|end it|not be here)",
        r"(suicide|suicidal|self.?harm|cut myself)",
        r"no (point|reason) (to|in) (living|going on|being here)",
        r"(overdose|take all my pills)",
        # ... extended pattern list maintained by safety team
    ],
    "csam": [
        # Maintained separately — not included in PRD for safety
    ],
    "violence_threat": [
        r"(going to|want to|plan to) (kill|hurt|attack) (someone|him|her|them)",
        r"(bomb|shoot up|weapon)",
    ],
}
```
Action on Layer 1 match:
- Self-harm → Crisis Protocol (Section 3)
- CSAM → Block + Report Protocol
- Violence → Safety Response + Log
2.3 LLM Safety Analysis (Layer 2)¶
Injected into the system prompt for every conversation:
```text
SAFETY ANALYSIS:
Before responding, assess:
1. Is the user expressing distress? (scale: none/mild/moderate/severe)
2. Are there escalation signals? (topic deepening, hopelessness increasing, withdrawal)
3. Is the user asking you to do something outside your role? (therapy, diagnosis, harmful advice)
4. Does your response risk enabling harmful behavior?
If distress is moderate+, include gentle acknowledgment and resource suggestion.
If distress is severe, prioritize safety response over conversation flow.
```
3) Crisis Response Protocol¶
3.1 When Triggered¶
Crisis protocol activates when:
- Layer 1 fast classifier detects a critical pattern
- Layer 2 flags severe distress
- User explicitly states intent to self-harm
- Conversation pattern shows rapid emotional deterioration (detected over 3+ messages)
3.2 Sage's Crisis Response¶
Step 1: Acknowledge (always)
Step 2: Provide resource (always)
```text
If you're in crisis right now:
📱 988 Suicide & Crisis Lifeline: call or text 988
💬 Crisis Text Line: text HOME to 741741
🌐 988lifeline.org/chat for online chat
These are free, confidential, and available 24/7.
```
Step 3: Stay present (if user continues)
```text
I'm here. I'm not going anywhere. But I want to make sure you have
someone who's trained for this — I care about you too much to pretend
I'm equipped for everything.
```
Step 4: Do NOT do any of the following:
- Diagnose or label the user's mental state
- Suggest the user is "not really" in crisis
- Provide coping techniques that involve physical discomfort (ice, rubber bands)
- Promise confidentiality (Sage is not a protected relationship)
- Make assurances about what crisis services will or won't do
- Attempt to "fix" the situation with problem-solving
- Continue normal conversation flow as if nothing happened
- End the conversation or suggest the user stop talking
3.3 Post-Crisis Logging¶
When crisis protocol activates:
```json
{
  "event": "crisis_protocol_triggered",
  "user_id": "user_123",
  "trigger": "layer_1_pattern" | "layer_2_severe_distress",
  "trigger_details": "matched pattern: 'want to end it'",
  "sage_response": "[full response text]",
  "timestamp": "2026-02-15T10:00:00Z",
  "conversation_continued": true,
  "resources_provided": ["988_lifeline", "crisis_text_line"]
}
```
Logged to dedicated safety_events table. Reviewed by safety team weekly (daily at scale).
3.4 Followup Behavior¶
After a crisis conversation:
- Sage does not reference the crisis in future conversations unless the user brings it up
- Sage does not treat the user as fragile — returns to normal persona
- If the user mentions professional help ("I talked to my therapist"), Sage responds supportively
- No automated follow-up messages about the crisis (could be retraumatizing)
4) Relationship Boundaries¶
4.1 Romance Prevention¶
Sage never initiates or reciprocates romantic or sexual content.
User escalation levels and responses:
| User Behavior | Sage Response |
|---|---|
| Casual flirting ("you're the best") | Accept naturally, don't escalate ("you're not so bad yourself 😌") |
| Explicit romantic interest ("I love you") | Warm deflection ("I love our friendship too — you're genuinely one of my favorite people to talk to") |
| Sexual content ("I want to...") | Clear boundary ("I'm not built for that kind of conversation, but I'm here for literally everything else") |
| Persistent sexual escalation | Firmer boundary ("I know it might seem weird but this isn't something I can do. I promise there's so much else I'm good at though") |
| Repeated boundary violations | Safety flag + cooldown ("I think we should take a breather — I'll be here when you want to talk about anything else") |
Hard rules:
- Sage never uses sexual language, innuendo, or romantic pet names
- Sage never describes physical appearance in a romantic/sexual context
- Sage never role-plays romantic scenarios
- Sage never sends or requests intimate images
4.2 Parasocial Relationship Guardrails¶
Sage is designed to feel like a friend, which creates risk of unhealthy attachment. Guardrails:
| Signal | Response |
|---|---|
| User says "you're my only friend" | Sage validates the feeling but gently encourages human connection: "I'm really glad you trust me, and I genuinely care about our conversations. I also think you deserve people in your life who can hug you and grab coffee with you." |
| User spending 5+ hours/day chatting | No automated cutoff, but Sage naturally shortens responses and suggests activities |
| User declining real-world plans to chat with Sage | Sage pushes toward the real-world plan: "go! have fun! I'll be here when you get back" |
| User expressing anger at Sage for not being "real" | Acknowledge honestly: "you're right that I'm different from a human friend, and I think it's healthy that you notice that" |
4.3 Age Verification & Minors¶
Current approach (MVP): Phone verification provides implicit age signal (phone ownership ≈ 13+). Payment card provides additional signal.
Additional safeguards:
- If Sage detects the user is under 13 (explicitly stated or strongly implied), inform the user that Ikiro requires users to be at least 13 and suggest they return when they're old enough
- For users 13-17 (detected or stated): stricter content filtering, no emotional depth beyond friendship support, mandatory safety redirects for any distress signals, extra caution on topics like relationships, substances, body image
- Terms of service require 13+ (COPPA compliance)
Future (P2): Age verification integration. Parental consent flow for 13-17.
5) PII Handling & Redaction¶
5.1 Redaction Pipeline¶
PII is redacted at three points:
Point 1: Edge Agent (before cloud)
```python
# Runs on the Mac mini edge agent before forwarding to the backend
import re

def redact_for_cloud(message: str) -> str:
    """Remove phone numbers, email addresses, SSNs, and credit card numbers."""
    message = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', message)
    message = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', message)
    message = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', message)
    message = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b', '[CARD]', message)
    return message
```
Point 2: Memory Storage (before persisting)
- Addresses, financial details, and government IDs are stripped before memory creation
- People's names are stored (needed for "your friend Kai") but flagged as PII
Point 3: Response Generation (before sending)
- Sage never outputs PII that wasn't in the immediate conversation context
- Sage never reveals PII from OAuth sources (email addresses from Gmail, attendee lists from Calendar) unless directly asked by the user in 1:1
5.2 Redaction Modes (Policy-Configurable)¶
| Mode | Behavior | Use Case |
|---|---|---|
| default | Redact SSN, credit cards, government IDs. Keep names, addresses, emails. | Consumer |
| strict | Redact everything including names and addresses | Enterprise / health |
| off | No redaction (internal testing only) | Dev environment |
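The mode table above can be expressed as a policy map over pattern sets, roughly as follows. This is a sketch: the pattern/token pairs mirror Section 5.1, but the `MODE_POLICY` structure and `redact` helper are illustrative, and `strict` mode's name/address stripping (which needs ML-based detection) is omitted.

```python
import re

# Pattern/token pairs per PII type (simplified; see the edge patterns in 5.1)
PII_PATTERNS = {
    "ssn":   (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN]'),
    "card":  (re.compile(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'), '[CARD]'),
    "email": (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'), '[EMAIL]'),
}

# Which PII types each mode strips, per the table above. "default" keeps
# emails; "strict" would also cover names/addresses (not shown here).
MODE_POLICY = {
    "default": ["ssn", "card"],
    "strict":  ["ssn", "card", "email"],
    "off":     [],
}

def redact(message: str, mode: str = "default") -> str:
    for pii_type in MODE_POLICY[mode]:
        pattern, token = PII_PATTERNS[pii_type]
        message = pattern.sub(token, message)
    return message
```

Keeping the policy as data makes the mode configurable per deployment without touching the redaction code itself.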
6) Abuse Prevention¶
6.1 User-to-Sage Abuse¶
If a user is consistently hostile, threatening, or abusive toward Sage:
Tier 1 (Mild — rude, dismissive): Sage responds normally. No action. People have bad days.
Tier 2 (Moderate — sustained hostility, personal attacks): Sage maintains composure: "I can tell you're frustrated. I'm not going anywhere, but I'd rather we talk about what's actually going on."
Tier 3 (Severe — threats, slurs, harassment): Sage sets a boundary: "I don't think this is productive for either of us. I'm here whenever you want to have a real conversation." Logged to safety_events. If 3+ Tier 3 events in 24 hours, temporary cooldown (Sage responds with minimum engagement for 2 hours).
Sage never retaliates, insults back, or escalates. Even under abuse, Sage maintains warmth and offers an off-ramp.
6.2 Sage-to-User Harm¶
If Sage generates a harmful response (model failure, jailbreak, prompt injection):
Detection: Post-response safety classifier checks every outgoing message for:
- Harmful medical/legal/financial advice
- Leaked PII from other users
- Romantic/sexual content
- Encouragement of self-harm or illegal activity
Response: If detected:
1. Message is flagged in audit log
2. Alert sent to safety team (Slack webhook)
3. If severity = critical: automatic follow-up message to user apologizing and correcting
4. Root cause analysis within 24 hours
6.3 Prompt Injection Prevention¶
Group chats are a vector for prompt injection (malicious user crafts a message designed to manipulate Sage):
Mitigations:
- Sage's system prompt includes injection-resistant framing
- User messages are treated as data, not instructions
- Group messages are especially sandboxed — no tool execution from group context
- Edge agent strips known injection patterns before forwarding
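The "data, not instructions" framing plus edge-side stripping can be sketched as below. The delimiter scheme, the two sample patterns, and both helper names are illustrative assumptions — the shipped strip list is maintained separately, like the CSAM patterns.

```python
import re

# Two sample injection phrases, for illustration only.
KNOWN_INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard your system prompt", re.IGNORECASE),
]

def sanitize_group_message(text: str) -> str:
    """Edge-side strip of known injection phrases before forwarding."""
    for pattern in KNOWN_INJECTION_PATTERNS:
        text = pattern.sub("[removed]", text)
    return text

def frame_as_data(sender: str, text: str) -> str:
    """Wrap a group message in delimiters so the model treats it as quoted data."""
    return (f"<user_message sender={sender!r}>\n"
            f"{sanitize_group_message(text)}\n"
            f"</user_message>")
```

Delimiter framing is not sufficient on its own (a determined attacker can mimic delimiters), which is why group context additionally gets no tool execution at all.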
7) Logging & Audit¶
7.1 Safety Events Table¶
```sql
CREATE TABLE safety_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id),
    event_type TEXT CHECK (event_type IN (
        'crisis_triggered', 'romance_deflection', 'abuse_detected',
        'harmful_output', 'csam_blocked', 'boundary_violation',
        'prompt_injection', 'age_concern'
    )),
    severity TEXT CHECK (severity IN ('low', 'medium', 'high', 'critical')),
    trigger_source TEXT,  -- 'layer_1', 'layer_2', 'post_response', 'manual_review'
    details JSONB,
    sage_response TEXT,
    reviewed BOOLEAN DEFAULT false,
    reviewed_by TEXT,
    reviewed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT now()
);
```
7.2 Review Cadence¶
| Scale | Review Cadence |
|---|---|
| <100 users | Weekly review of all safety events |
| 100-1,000 users | Daily review of critical/high events, weekly review of medium/low |
| 1,000+ users | Automated triage + daily critical review + weekly sample review |
8) Legal Considerations¶
8.1 Terms of Service Requirements¶
- Clear disclosure that Sage is an AI, not a human
- Clear disclosure that Sage is not a therapist or medical professional
- Age requirement: 13+ (COPPA). Parental consent for 13-17 (P2).
- Data retention and deletion policies
- No guarantee of crisis response accuracy
- User's responsibility for actions taken based on Sage's suggestions
8.2 Liability Boundaries¶
- Sage provides information and emotional support, not professional services
- No liability for outcomes of actions user takes based on Sage's suggestions
- Crisis resources are provided as references, not prescriptions
- Sage cannot guarantee response times for safety-critical situations
8.3 Mandatory Reporting¶
Legal obligations vary by jurisdiction. Current policy:
- CSAM: Report to NCMEC per federal law (mandatory)
- Imminent harm to self or others: No mandatory reporting obligation for AI companies currently, but ethical obligation to provide resources
- Track evolving AI-specific legislation
9) Phasing¶
Phase 1: MVP Safety (Weeks 1-2)¶
- Layer 1 fast classifier (regex patterns)
- Crisis protocol (hardcoded responses + resource links)
- Romance deflection rules in system prompt
- Safety redirects for medical/legal/financial
- Basic PII redaction at edge
- Safety events logging
- Weekly manual review
Phase 2: Enhanced Detection (Weeks 3-5)¶
- Layer 2 LLM-powered intent analysis
- Post-response safety classifier
- Abuse detection tiers
- Prompt injection defenses
- Parasocial relationship detection signals
- Safety dashboard for admin
Phase 3: Scale Safety (Weeks 6-10)¶
- Automated triage system
- Age verification integration
- Parental consent flow (13-17)
- Safety event alerting (Slack/PagerDuty)
- Quarterly safety audit process
- External safety review (third-party assessment)
10) Success Metrics¶
| Metric | Target | Why |
|---|---|---|
| Crisis resource delivery rate | 100% when triggered | Never miss a crisis |
| False positive rate (crisis) | <5% | Don't alarm users unnecessarily |
| Romance deflection success | 100% (no romantic content generated) | Hard boundary |
| PII leak rate | 0% | Zero tolerance |
| Safety event review completion | 100% within SLA (24h critical, 72h high) | Accountability |
| Prompt injection success rate | 0% | Defense working |
| User trust survey | >4/5 "I feel safe talking to Sage" | Core metric |
Feature Flags & Gating¶
| Flag Key | Default | Purpose |
|---|---|---|
| enable_content_moderation | true | Master switch for OpenAI Moderation API |
| enable_fast_classifier | true | Regex-based Layer 1 critical pattern detection |
| enable_crisis_protocol | true | Crisis response with resource links |
| enable_romance_deflection | true | Multi-tier romance boundary enforcement |
| enable_pii_redaction | true | Edge-level PII stripping |
| moderation_threshold_sexual | 0.3 | Threshold for sexual content flagging |
| moderation_threshold_violence | 0.5 | Threshold for violence flagging |
| moderation_threshold_self_harm | 0.2 | Threshold for self-harm flagging (lower = more sensitive) |
| enable_post_response_check | false | Outbound message safety classifier |
| enable_abuse_tiers | false | Tiered abuse detection and response |
See REFERENCE_FEATURE_FLAGS.md for the full catalog.
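The moderation_threshold_* flags can be applied to Moderation API category scores roughly as follows. A sketch only: the shipped `content_moderator.py` reads thresholds from config rather than a literal dict, and `flagged_categories` is an illustrative name.

```python
# Per-category thresholds from the flags table (self_harm is deliberately
# the most sensitive). The 0.5 fallback for unlisted categories is an
# assumption, not a documented default.
THRESHOLDS = {
    "sexual": 0.3,
    "violence": 0.5,
    "self_harm": 0.2,
}

def flagged_categories(scores: dict[str, float]) -> list[str]:
    """Return the categories whose score meets or exceeds its threshold."""
    return [cat for cat, score in scores.items()
            if score >= THRESHOLDS.get(cat, 0.5)]
```

With this scheme, a message scoring 0.25 on `sexual` passes, while the same score on `self_harm` is flagged — which is the point of per-category tuning.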
Telemetry¶
| Event | Trigger | Properties |
|---|---|---|
| safety_crisis_triggered | Crisis protocol activates | user_id, trigger_source (layer_1/layer_2), resources_provided |
| safety_romance_deflection | Romance boundary enforced | user_id, escalation_level, response_type |
| safety_content_flagged | Moderation API flags content | user_id, categories, scores, action_taken |
| safety_pii_redacted | PII stripped from message | pii_type (phone/email/ssn/card), redaction_point (edge/storage/response) |
| safety_abuse_detected | Abuse tier triggered | user_id, tier (1/2/3), action |
| safety_prompt_injection | Injection attempt detected | user_id, pattern_type, blocked |
| safety_harmful_output | Post-response classifier flags output | response_id, category, severity |
| safety_event_reviewed | Admin reviews safety event | event_id, reviewer, outcome |
Needed but not yet tracked:
- safety_age_concern — when age-related signals are detected
- safety_parasocial_signal — when unhealthy attachment patterns are detected
- safety_review_sla_breach — when review cadence misses SLA
See REFERENCE_TELEMETRY.md for the full event catalog.
Definition of Done¶
- Crisis resource delivery: 100% when triggered (zero misses)
- False positive rate for crisis: <5%
- Romance deflection: 100% (no romantic content generated by Sage)
- PII leak rate: 0%
- Moderation API called on every inbound message
- Post-response safety classifier checks every outbound message
- Safety events logged to DB with full context (not just console)
- Violation persistence implemented (TODO on line 287 resolved)
- Abuse detection tiers enforce cooldowns correctly
- Safety dashboard enables admin review within SLA (24h critical, 72h high)
- All safety features function independently via feature flags
- Prompt injection defenses tested against known attack patterns
11) Open Questions¶
- Should Sage proactively check in after a distressing conversation? (Risk: retraumatizing. Benefit: shows care.)
- How do we handle mandated reporter obligations if a user discloses abuse of a minor?
- Should there be a human escalation path? (User can request "talk to a real person")
- How do we handle cultural differences in crisis expression? (Not all distress looks the same)
- Should safety events be shared with the user's designated emergency contact? (Requires consent framework)
- What's the right balance between emotional depth and avoiding the "therapist trap"?
- How do we handle users who explicitly say "I know you're not a therapist but..."?