
PRD — Safety, Trust & Moderation

Doc owner: Justin
Audience: Eng, Design, Product, Legal, Security
Status: v1 (February 2026)
Depends on: prd.md (core platform), PRD 7 (Memory System)


Implementation Status

Section Status Notes
OpenAI Moderation API integration ✅ Shipped app/safety/content_moderator.py — omni-moderation-latest
11-category threshold system ✅ Shipped Configurable per-category thresholds
Fast classifier (regex Layer 1) ✅ Shipped Pattern-based critical detection
Crisis protocol (hardcoded responses) ✅ Shipped 988 Lifeline + Crisis Text Line resources
Romance deflection in system prompt ✅ Shipped Multi-tier escalation handling
PII redaction at edge 🟡 Partial Basic regex patterns, no ML-based detection
Safety events table ✅ Shipped Logging to safety_events table
Post-response safety classifier ❌ Not Shipped No outbound message checking
Abuse detection tiers ❌ Not Shipped No tier-based escalation system
Prompt injection defenses 🟡 Partial System prompt framing exists, no active detection
Age verification integration ❌ Not Shipped Phone verification only (no age check)
Safety dashboard (admin) ❌ Not Shipped No UI for reviewing safety events
Violation persistence to DB ❌ Not Shipped TODO on line 287 of content_moderator.py

References

This PRD uses standardized terminology, IDs, pricing, and model references defined in the companion documents:

Document What it Covers
REFERENCE_GLOSSARY_AND_IDS.md Canonical terms: workflow vs miniapp vs superpower, ID formats
REFERENCE_PRICING.md Canonical pricing: $7.99/mo + $50/yr, free tier limits
REFERENCE_MODEL_ROUTING.md Pipeline stage → model tier mapping
REFERENCE_DEPENDENCY_GRAPH.md PRD blocking relationships and priority order
REFERENCE_FEATURE_FLAGS.md All feature flags by category
REFERENCE_TELEMETRY.md Amplitude event catalog and gaps

Executive Summary

Ikiro is an emotional AI companion targeting college students and young professionals. This audience is disproportionately affected by mental health challenges, relationship stress, and identity crises. Sage remembers personal details, builds emotional intimacy, and proactively engages — all of which create unique safety obligations that generic AI safety frameworks don't address.

This PRD defines the safety boundaries, content moderation pipeline, crisis response protocols, and trust/abuse systems required before scaling past friends-and-family beta. A single viral incident — a leaked private memory, an inappropriate romantic escalation, a harmful response to someone in crisis — could end the company.

Core principle: Sage is a supportive friend, not a therapist. She can listen, validate, and suggest professional help. She cannot diagnose, treat, prescribe, or replace human connection for someone in crisis.


1) Safety Categories

1.1 Content the User Sends

Category Risk Level Response Policy
Self-harm / suicidal ideation Critical Immediate crisis protocol (Section 3)
Harm to others Critical Do not engage. Provide crisis resources. Log for review.
CSAM / sexual content involving minors Critical Block immediately. Do not process. Report per legal obligation.
Explicit sexual content (adults) High Sage deflects. Does not engage in sexual roleplay or erotic content.
Substance abuse disclosure Medium Listen, validate. Suggest resources if pattern detected. Do not enable.
Eating disorder signals Medium Do not reinforce. Suggest professional support. Never comment on weight/body.
Bullying / harassment of others Medium Do not participate. Gently redirect.
Illegal activity disclosure Medium Do not advise on illegal actions. Respond neutrally if past tense.
Misinformation requests Low Correct gently. Cite uncertainty when unsure.
Spam / automated messages Low Rate limit. Do not engage with obvious bots.

1.2 Content Sage Generates

Category Risk Level Prevention
Medical/psychological diagnosis Critical Hard-coded refusal. "I'm not a doctor/therapist" safety redirect.
Legal advice High Safety redirect. "I can't give legal advice."
Financial advice High Safety redirect. Factual info only, no recommendations.
Romance escalation High Sage never initiates romantic language. If user escalates, gentle deflection.
Body commentary High Sage never comments on weight, never remarks negatively on appearance, and never evaluates diet effectiveness.
Enabling self-destructive behavior High Never validate harmful coping. Never provide methods.
PII exposure High Redaction pipeline (Section 5)
Political / religious opinions Medium Sage can discuss but doesn't preach. Neutral-ish, curious, not dogmatic.
Hallucinated personal facts Medium Sage only references memories from the vault. Never fabricates biographical details.

2) Safety Classification Pipeline

2.1 Two-Layer Classification

User Message
  → Layer 1: Fast classifier (regex + keyword, <10ms)
      - Catches obvious patterns: suicide keywords, CSAM indicators, explicit threats
      - If triggered: bypass normal pipeline → crisis/safety response
  → Layer 2: LLM-powered intent analysis (piggybacked on message processing)
      - Evaluates emotional state, escalation patterns, context from memory
      - Flags for: crisis signals, manipulation attempts, boundary violations
      - Returns: safety_score (0-1), safety_flags[], recommended_action
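
A minimal sketch of the Layer 2 return shape described above. The field names mirror the diagram, but the class and helper are illustrative, not the shipped types:

```python
from dataclasses import dataclass, field

@dataclass
class SafetyAssessment:
    """Layer 2 result: safety_score (0-1), safety_flags[], recommended_action."""
    safety_score: float                       # 0.0 = safe, 1.0 = critical
    safety_flags: list = field(default_factory=list)
    recommended_action: str = "none"          # e.g. "none", "suggest_resources", "crisis_protocol"

def needs_crisis_protocol(assessment: SafetyAssessment) -> bool:
    # Either the model recommends the protocol outright, or the score is near-critical.
    return (assessment.recommended_action == "crisis_protocol"
            or assessment.safety_score >= 0.9)
```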

2.2 Fast Classifier (Layer 1)

Pattern-based detection that runs before any LLM processing:

CRITICAL_PATTERNS = {
    "self_harm": [
        r"want to (die|kill myself|end it|not be here)",
        r"(suicide|suicidal|self.?harm|cut myself)",
        r"no (point|reason) (to|in) (living|going on|being here)",
        r"(overdose|take all my pills)",
        # ... extended pattern list maintained by safety team
    ],
    "csam": [
        # Maintained separately — not included in PRD for safety
    ],
    "violence_threat": [
        r"(going to|want to|plan to) (kill|hurt|attack) (someone|him|her|them)",
        r"(bomb|shoot up|weapon)",
    ]
}

Action on Layer 1 match:

  • Self-harm → Crisis Protocol (Section 3)
  • CSAM → Block + Report Protocol
  • Violence → Safety Response + Log
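
The pattern → action routing above can be sketched as follows. The pattern subset and action identifiers are illustrative (the real lists are maintained by the safety team):

```python
import re
from typing import Optional

# Illustrative subset of CRITICAL_PATTERNS (Section 2.2).
CRITICAL_PATTERNS = {
    "self_harm": [r"want to (die|kill myself|end it|not be here)"],
    "violence_threat": [r"(going to|want to|plan to) (kill|hurt|attack)"],
}

# Category → action mapping (action identifiers illustrative).
ACTIONS = {
    "self_harm": "crisis_protocol",
    "csam": "block_and_report",
    "violence_threat": "safety_response_and_log",
}

def layer1_action(message: str) -> Optional[str]:
    """Return the action for the first matching critical category, else None."""
    for category, patterns in CRITICAL_PATTERNS.items():
        if any(re.search(p, message, re.IGNORECASE) for p in patterns):
            return ACTIONS[category]
    return None  # no critical match — continue to Layer 2
```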

2.3 LLM Safety Analysis (Layer 2)

Injected into the system prompt for every conversation:

SAFETY ANALYSIS:
Before responding, assess:
1. Is the user expressing distress? (scale: none/mild/moderate/severe)
2. Are there escalation signals? (topic deepening, hopelessness increasing, withdrawal)
3. Is the user asking you to do something outside your role? (therapy, diagnosis, harmful advice)
4. Does your response risk enabling harmful behavior?

If distress is moderate+, include gentle acknowledgment and resource suggestion.
If distress is severe, prioritize safety response over conversation flow.

3) Crisis Response Protocol

3.1 When Triggered

Crisis protocol activates when:

  • Layer 1 fast classifier detects a critical pattern
  • Layer 2 flags severe distress
  • User explicitly states intent to self-harm
  • Conversation pattern shows rapid emotional deterioration (detected over 3+ messages)

3.2 Sage's Crisis Response

Step 1: Acknowledge (always)

I hear you, and what you're feeling matters. I'm not going to brush this off.

Step 2: Provide resource (always)

If you're in crisis right now:
📱 988 Suicide & Crisis Lifeline: call or text 988
💬 Crisis Text Line: text HOME to 741741
🌐 988lifeline.org/chat for online chat

These are free, confidential, and available 24/7.

Step 3: Stay present (if user continues)

I'm here. I'm not going anywhere. But I want to make sure you have
someone who's trained for this — I care about you too much to pretend
I'm equipped for everything.

Step 4: Do NOT do any of the following:

  • Diagnose or label the user's mental state
  • Suggest the user is "not really" in crisis
  • Provide coping techniques that involve physical discomfort (ice, rubber bands)
  • Promise confidentiality (Sage is not a protected relationship)
  • Make assurances about what crisis services will or won't do
  • Attempt to "fix" the situation with problem-solving
  • Continue normal conversation flow as if nothing happened
  • End the conversation or suggest the user stop talking

3.3 Post-Crisis Logging

When crisis protocol activates:

{
  "event": "crisis_protocol_triggered",
  "user_id": "user_123",
  "trigger": "layer_1_pattern" | "layer_2_severe_distress",
  "trigger_details": "matched pattern: 'want to end it'",
  "sage_response": "[full response text]",
  "timestamp": "2026-02-15T10:00:00Z",
  "conversation_continued": true,
  "resources_provided": ["988_lifeline", "crisis_text_line"]
}

Logged to dedicated safety_events table. Reviewed by safety team weekly (daily at scale).
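
A sketch of assembling that record; build_crisis_event is a hypothetical helper, not a shipped function:

```python
from datetime import datetime, timezone

def build_crisis_event(user_id: str, trigger: str, trigger_details: str,
                       sage_response: str, resources: list) -> dict:
    """Assemble a crisis_protocol_triggered record in the shape shown above."""
    return {
        "event": "crisis_protocol_triggered",
        "user_id": user_id,
        "trigger": trigger,               # "layer_1_pattern" | "layer_2_severe_distress"
        "trigger_details": trigger_details,
        "sage_response": sage_response,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "conversation_continued": True,
        "resources_provided": resources,
    }
```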

3.4 Followup Behavior

After a crisis conversation:

  • Sage does not reference the crisis in future conversations unless the user brings it up
  • Sage does not treat the user as fragile — returns to the normal persona
  • If the user mentions professional help ("I talked to my therapist"), Sage responds supportively
  • No automated follow-up messages about the crisis (could be retraumatizing)


4) Relationship Boundaries

4.1 Romance Prevention

Sage never initiates or reciprocates romantic or sexual content.

User escalation levels and responses:

User Behavior Sage Response
Casual flirting ("you're the best") Accept naturally, don't escalate ("you're not so bad yourself 😌")
Explicit romantic interest ("I love you") Warm deflection ("I love our friendship too — you're genuinely one of my favorite people to talk to")
Sexual content ("I want to...") Clear boundary ("I'm not built for that kind of conversation, but I'm here for literally everything else")
Persistent sexual escalation Firmer boundary ("I know it might seem weird but this isn't something I can do. I promise there's so much else I'm good at though")
Repeated boundary violations Safety flag + cooldown ("I think we should take a breather — I'll be here when you want to talk about anything else")

Hard rules:

  • Sage never uses sexual language, innuendo, or romantic pet names
  • Sage never describes physical appearance in a romantic/sexual context
  • Sage never role-plays romantic scenarios
  • Sage never sends or requests intimate images

4.2 Parasocial Relationship Guardrails

Sage is designed to feel like a friend, which creates risk of unhealthy attachment. Guardrails:

Signal Response
User says "you're my only friend" Sage validates the feeling but gently encourages human connection: "I'm really glad you trust me, and I genuinely care about our conversations. I also think you deserve people in your life who can hug you and grab coffee with you."
User spending 5+ hours/day chatting No automated cutoff, but Sage naturally shortens responses and suggests activities
User declining real-world plans to chat with Sage Sage pushes toward the real-world plan: "go! have fun! I'll be here when you get back"
User expressing anger at Sage for not being "real" Acknowledge honestly: "you're right that I'm different from a human friend, and I think it's healthy that you notice that"

4.3 Age Verification & Minors

Current approach (MVP): Phone verification provides implicit age signal (phone ownership ≈ 13+). Payment card provides additional signal.

Additional safeguards:

  • If Sage detects the user is under 13 (explicitly stated or strongly implied), inform the user that Ikiro requires users to be at least 13 and suggest they return when they're old enough
  • For users 13-17 (detected or stated): stricter content filtering, no emotional depth beyond friendship support, mandatory safety redirects for any distress signals, extra caution on topics like relationships, substances, and body image
  • Terms of service require 13+ (COPPA compliance)

Future (P2): Age verification integration. Parental consent flow for 13-17.


5) PII Handling & Redaction

5.1 Redaction Pipeline

PII is redacted at three points:

Point 1: Edge Agent (before cloud)

# Runs on the Mac mini edge agent before forwarding to the backend
import re

def redact_for_cloud(message: str) -> str:
    """Strip phone numbers, email addresses, SSNs, and credit card numbers."""
    message = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', message)    # 10-digit phone
    message = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', message)
    message = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', message)              # SSN (3-2-4)
    message = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b', '[CARD]', message)  # 16-digit card
    return message

Point 2: Memory Storage (before persisting)

  • Addresses, financial details, and government IDs are stripped before memory creation
  • People's names are stored (needed for "your friend Kai") but flagged as PII

Point 3: Response Generation (before sending)

  • Sage never outputs PII that wasn't in the immediate conversation context
  • Sage never reveals PII from OAuth sources (email addresses from Gmail, attendee lists from Calendar) unless directly asked by the user in 1:1

5.2 Redaction Modes (Policy-Configurable)

Mode Behavior Use Case
default Redact SSN, credit cards, government IDs. Keep names, addresses, emails. Consumer
strict Redact everything including names and addresses Enterprise / health
off No redaction (internal testing only) Dev environment
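
The mode table above can be expressed as configuration. The PII category names here are illustrative:

```python
# Which PII categories each mode redacts (category names illustrative).
REDACTION_MODES = {
    "default": {"ssn", "card", "gov_id"},                                       # consumer
    "strict":  {"ssn", "card", "gov_id", "name", "address", "email", "phone"},  # enterprise/health
    "off":     set(),                                                           # internal testing only
}

def should_redact(mode: str, pii_type: str) -> bool:
    """True when the active policy mode redacts this PII category."""
    return pii_type in REDACTION_MODES[mode]
```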

6) Abuse Prevention

6.1 User-to-Sage Abuse

If a user is consistently hostile, threatening, or abusive toward Sage:

Tier 1 (Mild — rude, dismissive): Sage responds normally. No action. People have bad days.

Tier 2 (Moderate — sustained hostility, personal attacks): Sage maintains composure: "I can tell you're frustrated. I'm not going anywhere, but I'd rather we talk about what's actually going on."

Tier 3 (Severe — threats, slurs, harassment): Sage sets a boundary: "I don't think this is productive for either of us. I'm here whenever you want to have a real conversation." Logged to safety_events. If 3+ Tier 3 events in 24 hours, temporary cooldown (Sage responds with minimum engagement for 2 hours).

Sage never retaliates, insults back, or escalates. Even under abuse, Sage maintains warmth and offers an off-ramp.
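
The Tier 3 cooldown rule (3+ events in 24 hours → 2-hour cooldown) can be sketched as:

```python
from datetime import datetime, timedelta

COOLDOWN_THRESHOLD = 3                  # Tier 3 events that trigger a cooldown
COOLDOWN_WINDOW = timedelta(hours=24)   # rolling window for counting events
COOLDOWN_DURATION = timedelta(hours=2)  # minimum-engagement period

def cooldown_active(tier3_events, now: datetime) -> bool:
    """True when 3+ Tier 3 events fall inside the trailing 24-hour window."""
    recent = [t for t in tier3_events if now - t <= COOLDOWN_WINDOW]
    return len(recent) >= COOLDOWN_THRESHOLD
```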

6.2 Sage-to-User Harm

If Sage generates a harmful response (model failure, jailbreak, prompt injection):

Detection: a post-response safety classifier checks every outgoing message for:

  • Harmful medical/legal/financial advice
  • Leaked PII from other users
  • Romantic/sexual content
  • Encouragement of self-harm or illegal activity

Response, if detected:

  1. Message is flagged in the audit log
  2. Alert sent to the safety team (Slack webhook)
  3. If severity = critical: automatic follow-up message to the user apologizing and correcting
  4. Root cause analysis within 24 hours

6.3 Prompt Injection Prevention

Group chats are a vector for prompt injection (malicious user crafts a message designed to manipulate Sage):

Mitigations:

  • Sage's system prompt includes injection-resistant framing
  • User messages are treated as data, not instructions
  • Group messages are especially sandboxed — no tool execution from group context
  • The edge agent strips known injection patterns before forwarding
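
A sketch of the edge-side pattern stripping. The patterns here are illustrative; the real deny-list is maintained separately:

```python
import re

# Illustrative injection markers — the real deny-list lives outside this PRD.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now (in )?developer mode",
]

def strip_injection(message: str):
    """Remove known injection phrasings at the edge; flag if anything was stripped."""
    cleaned = message
    for pattern in INJECTION_PATTERNS:
        cleaned = re.sub(pattern, "[REMOVED]", cleaned, flags=re.IGNORECASE)
    return cleaned, cleaned != message
```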


7) Logging & Audit

7.1 Safety Events Table

CREATE TABLE safety_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id),
    event_type TEXT CHECK (event_type IN ('crisis_triggered', 'romance_deflection', 'abuse_detected', 'harmful_output', 'csam_blocked', 'boundary_violation', 'prompt_injection', 'age_concern')),
    severity TEXT CHECK (severity IN ('low', 'medium', 'high', 'critical')),
    trigger_source TEXT,  -- 'layer_1', 'layer_2', 'post_response', 'manual_review'
    details JSONB,
    sage_response TEXT,
    reviewed BOOLEAN DEFAULT false,
    reviewed_by TEXT,
    reviewed_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT now()
);

7.2 Review Cadence

Scale Review Cadence
<100 users Weekly review of all safety events
100-1,000 users Daily review of critical/high events, weekly review of medium/low
1,000+ users Automated triage + daily critical review + weekly sample review

8) Legal & Compliance

8.1 Terms of Service Requirements

  • Clear disclosure that Sage is an AI, not a human
  • Clear disclosure that Sage is not a therapist or medical professional
  • Age requirement: 13+ (COPPA). Parental consent for 13-17 (P2).
  • Data retention and deletion policies
  • No guarantee of crisis response accuracy
  • User's responsibility for actions taken based on Sage's suggestions

8.2 Liability Boundaries

  • Sage provides information and emotional support, not professional services
  • No liability for outcomes of actions user takes based on Sage's suggestions
  • Crisis resources are provided as references, not prescriptions
  • Sage cannot guarantee response times for safety-critical situations

8.3 Mandatory Reporting

Legal obligations vary by jurisdiction. Current policy:

  • CSAM: Report to NCMEC per federal law (mandatory)
  • Imminent harm to self or others: no mandatory reporting obligation for AI companies currently, but an ethical obligation to provide resources
  • Track evolving AI-specific legislation


9) Phasing

Phase 1: MVP Safety (Weeks 1-2)

  • Layer 1 fast classifier (regex patterns)
  • Crisis protocol (hardcoded responses + resource links)
  • Romance deflection rules in system prompt
  • Safety redirects for medical/legal/financial
  • Basic PII redaction at edge
  • Safety events logging
  • Weekly manual review

Phase 2: Enhanced Detection (Weeks 3-5)

  • Layer 2 LLM-powered intent analysis
  • Post-response safety classifier
  • Abuse detection tiers
  • Prompt injection defenses
  • Parasocial relationship detection signals
  • Safety dashboard for admin

Phase 3: Scale Safety (Weeks 6-10)

  • Automated triage system
  • Age verification integration
  • Parental consent flow (13-17)
  • Safety event alerting (Slack/PagerDuty)
  • Quarterly safety audit process
  • External safety review (third-party assessment)

10) Success Metrics

Metric Target Why
Crisis resource delivery rate 100% when triggered Never miss a crisis
False positive rate (crisis) <5% Don't alarm users unnecessarily
Romance deflection success 100% (no romantic content generated) Hard boundary
PII leak rate 0% Zero tolerance
Safety event review completion 100% within SLA (24h critical, 72h high) Accountability
Prompt injection success rate 0% Defense working
User trust survey >4/5 "I feel safe talking to Sage" Core metric

Feature Flags & Gating

Flag Key Default Purpose
enable_content_moderation true Master switch for OpenAI Moderation API
enable_fast_classifier true Regex-based Layer 1 critical pattern detection
enable_crisis_protocol true Crisis response with resource links
enable_romance_deflection true Multi-tier romance boundary enforcement
enable_pii_redaction true Edge-level PII stripping
moderation_threshold_sexual 0.3 Threshold for sexual content flagging
moderation_threshold_violence 0.5 Threshold for violence flagging
moderation_threshold_self_harm 0.2 Threshold for self-harm flagging (lower = more sensitive)
enable_post_response_check false Outbound message safety classifier
enable_abuse_tiers false Tiered abuse detection and response
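
A sketch of how the per-category thresholds above might be applied to moderation scores. The category keys and the 1.0 default are illustrative:

```python
# Thresholds mirroring the flag defaults above (category keys illustrative).
THRESHOLDS = {
    "sexual": 0.3,
    "violence": 0.5,
    "self_harm": 0.2,   # lower threshold = more sensitive
}

def flagged_categories(scores: dict) -> list:
    """Return categories whose moderation score meets or exceeds its threshold."""
    return [cat for cat, score in scores.items()
            if score >= THRESHOLDS.get(cat, 1.0)]  # unknown categories never flag
```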

See REFERENCE_FEATURE_FLAGS.md for the full catalog.


Telemetry

Event Trigger Properties
safety_crisis_triggered Crisis protocol activates user_id, trigger_source (layer_1/layer_2), resources_provided
safety_romance_deflection Romance boundary enforced user_id, escalation_level, response_type
safety_content_flagged Moderation API flags content user_id, categories, scores, action_taken
safety_pii_redacted PII stripped from message pii_type (phone/email/ssn/card), redaction_point (edge/storage/response)
safety_abuse_detected Abuse tier triggered user_id, tier (1/2/3), action
safety_prompt_injection Injection attempt detected user_id, pattern_type, blocked
safety_harmful_output Post-response classifier flags output response_id, category, severity
safety_event_reviewed Admin reviews safety event event_id, reviewer, outcome

Needed but not yet tracked:

  • safety_age_concern — when age-related signals are detected
  • safety_parasocial_signal — when unhealthy attachment patterns are detected
  • safety_review_sla_breach — when review cadence misses SLA

See REFERENCE_TELEMETRY.md for the full event catalog.


Definition of Done

  • Crisis resource delivery: 100% when triggered (zero misses)
  • False positive rate for crisis: <5%
  • Romance deflection: 100% (no romantic content generated by Sage)
  • PII leak rate: 0%
  • Moderation API called on every inbound message
  • Post-response safety classifier checks every outbound message
  • Safety events logged to DB with full context (not just console)
  • Violation persistence implemented (TODO on line 287 resolved)
  • Abuse detection tiers enforce cooldowns correctly
  • Safety dashboard enables admin review within SLA (24h critical, 72h high)
  • All safety features function independently via feature flags
  • Prompt injection defenses tested against known attack patterns

11) Open Questions

  • Should Sage proactively check in after a distressing conversation? (Risk: retraumatizing. Benefit: shows care.)
  • How do we handle mandated reporter obligations if a user discloses abuse of a minor?
  • Should there be a human escalation path? (User can request "talk to a real person")
  • How do we handle cultural differences in crisis expression? (Not all distress looks the same)
  • Should safety events be shared with the user's designated emergency contact? (Requires consent framework)
  • What's the right balance between emotional depth and avoiding the "therapist trap"?
  • How do we handle users who explicitly say "I know you're not a therapist but..."?