Conversation Data Format Specification

Purpose: Guide for engineers contributing conversation data to improve the multi-bubble messaging system.

Last Updated: October 31, 2025

Overview

We need real conversation data to improve three aspects of the system:

1. Few-Shot Examples → Add to system prompts for better tone/style
2. Burst Pattern Training → Improve multi-bubble timing and flow
3. Fine-Tuning Data → Future model fine-tuning (optional)

This document specifies the exact formats needed for each use case.

Format 1: Few-Shot Examples (PRIMARY NEED)

Use Case: Add to persona passport examples and system prompts to guide tone, style, and multi-bubble patterns.

Where it goes: app/persona/passports/sage.json → examples array

Required Format

{
  "conversation_id": "conv_001",
  "scenario": "Friend giving advice about dating/being single",
  "tone_tags": ["supportive", "casual", "teasing", "empowering"],
  "examples": [
    {
      "user": "that sounds so horrible",
      "assistant": "LOL no its empowering",
      "context": "User reacting negatively to movie suggestion",
      "notes": "Quick pushback with humor, doesn't over-explain"
    },
    {
      "user": "I think I would actually cry",
      "assistant": "no its good!",
      "context": "User doubling down on negativity",
      "notes": "Persistent encouragement, stays upbeat"
    },
    {
      "user": "I have no interest in staying in and watching a show w another guy",
      "assistant": "just skip [ACTIVITY] just one night",
      "context": "User resistant to idea",
      "notes": "Gentle persuasion, specific actionable advice"
    }
  ]
}

Guidelines

DO:
- Keep examples short (1-2 sentence exchanges)
- Include diverse scenarios (advice, banter, emotional support, tasks)
- Preserve natural typos/lowercase if they show personality
- Replace specific names with [PERSON], [PLACE], [ACTIVITY]
- Add context to explain the interaction

DON'T:
- Include long multi-paragraph responses
- Remove all personality (some typos/slang are good!)
- Include PII (real names, addresses, phone numbers)
- Use examples where tone is mean, toxic, or inappropriate
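The field requirements above can be checked mechanically before a file goes into a PR. The following is a minimal sketch, not the project's validation script: the function name `check_few_shot` is hypothetical, and it assumes every field shown in the Required Format is mandatory.

```python
# Assumption: all fields shown in the Required Format are mandatory.
REQUIRED_TOP_LEVEL = ("conversation_id", "scenario", "tone_tags", "examples")
REQUIRED_EXAMPLE = ("user", "assistant", "context", "notes")


def check_few_shot(record: dict) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = [
        f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in record
    ]
    for i, example in enumerate(record.get("examples", [])):
        for key in REQUIRED_EXAMPLE:
            # A field that is present but empty is reported as well.
            if not example.get(key):
                problems.append(f"examples[{i}]: missing or empty {key}")
    return problems
```

A check like this only covers structure. Tone, PII, and scenario diversity still need a human reviewer.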

Example Scenarios We Need

High priority:
- ✅ Supportive advice (dating, career, school stress)
- ⏳ Playful teasing/banter between friends
- ⏳ Making plans (logistics, coordinating)
- ⏳ Emotional support (dealing with loss, anxiety)
- ⏳ Celebrating wins together
- ⏳ Inside jokes developing over time

Format 2: Burst Pattern Examples

Use Case: Understand how humans naturally split ideas across multiple messages (the "multi-bubble" pattern).

Where it goes: data/burst_patterns/ for analysis

Required Format

{
  "burst_id": "burst_001",
  "scenario": "Explaining movie night idea to friend",
  "messages": [
    {
      "order": 1,
      "text": "omg you know what you two should do",
      "timestamp_relative_ms": 0,
      "bubble_type": "setup",
      "function": "Get attention, tease the idea"
    },
    {
      "order": 2,
      "text": "not go out one night and drink wine and watch 'how to be single' together",
      "timestamp_relative_ms": 7009,
      "bubble_type": "main_idea",
      "function": "Deliver the actual suggestion with details"
    },
    {
      "order": 3,
      "text": "that movie",
      "timestamp_relative_ms": 1119,
      "bubble_type": "clarification",
      "function": "Clarify/emphasize one detail"
    }
  ],
  "pattern_notes": "3-bubble burst: attention-getter → full idea → quick clarification",
  "total_duration_ms": 8128,
  "interruption_occurred": false
}

What We Extract

From burst patterns we learn:
- Timing: How long between messages? (helps tune delivery delays)
- Structure: How do people break up complex ideas?
- Function: What role does each bubble play?
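As an illustration of the timing analysis, the sketch below summarizes the pauses inside one burst record. It assumes each `timestamp_relative_ms` value is the gap since the previous bubble, which is the reading under which the sample record's values add up to its `total_duration_ms` (7009 + 1119 = 8128). The function name `burst_timing` is hypothetical.

```python
from statistics import mean


def burst_timing(burst: dict) -> dict:
    """Summarize the pauses between bubbles in one burst record."""
    ordered = sorted(burst["messages"], key=lambda m: m["order"])
    # Assumption: each value is the pause since the previous bubble,
    # so the first bubble (0 ms) is left out of the gap list.
    gaps = [m["timestamp_relative_ms"] for m in ordered[1:]]
    return {
        "gaps_ms": gaps,
        "total_ms": sum(gaps),
        "mean_gap_ms": mean(gaps) if gaps else 0,
    }
```

If the values turn out to be offsets from the start of the burst instead, the gaps would be the differences between consecutive values rather than the values themselves.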

Bubble Type Taxonomy

Common patterns we've observed:
- setup: "omg you know what...", "wait hear me out"
- main_idea: The actual point/suggestion
- clarification: "that movie", "you know what i mean"
- emphasis: "no its good!", "i'm serious"
- reaction: "LOL", "💀", "omg"
- follow_up_question: "have you seen it?"

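Guidelines

DO:
- Include timestamp differences in milliseconds. In the example above, each timestamp_relative_ms is the gap since the previous message, so the values sum to total_duration_ms (0 + 7009 + 1119 = 8128).
- Label the function of each message
- Note whether the conversation was interrupted
- Include sequences of 2-5 messages

DON'T:
- Include single isolated messages
- Skip timing data
- Mix different conversation topics in one burst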
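The structural parts of these guidelines (sequence length, timing data, functional labels) lend themselves to an automated check. This is a rough sketch with a hypothetical function name, `check_burst`. It deliberately does not restrict `bubble_type` to a fixed set, since the taxonomy is described as a list of common patterns rather than a closed vocabulary.

```python
REQUIRED_MESSAGE_FIELDS = (
    "order", "text", "timestamp_relative_ms", "bubble_type", "function",
)


def check_burst(burst: dict) -> list[str]:
    """Return a list of structural problems found in one burst record."""
    problems = []
    messages = burst.get("messages", [])
    # Bursts are expected to contain 2-5 messages.
    if not 2 <= len(messages) <= 5:
        problems.append(f"expected 2-5 messages, found {len(messages)}")
    for i, message in enumerate(messages):
        for key in REQUIRED_MESSAGE_FIELDS:
            if key not in message:
                problems.append(f"messages[{i}]: missing {key}")
    return problems
```

Whether a burst mixes unrelated topics cannot be detected this way and still needs review.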

Format 3: Full Conversation Context (For Future Fine-Tuning)

Use Case: Train models to understand full conversation flow, relationship building, memory.

Where it goes: data/fine_tuning/conversations/ (not used yet, but preparing format)

Required Format

{
  "conversation_id": "conv_full_001",
  "date": "2024-XX-XX",
  "participants": ["user", "assistant"],
  "relationship_stage": "friend",
  "conversation_context": "Discussing user's dating life and giving supportive advice",
  "messages": [
    {
      "sender": "assistant",
      "text": "focus on yourself for a bit",
      "timestamp": "2024-XX-XXT00:00:00.000Z",
      "intent": "advice",
      "tone": "supportive"
    },
    {
      "sender": "assistant",
      "text": "omg you know what you two should do",
      "timestamp": "2024-XX-XXT00:00:07.000Z",
      "intent": "suggestion_setup",
      "tone": "excited"
    },
    {
      "sender": "assistant",
      "text": "not go out one night and drink wine and watch 'how to be single' together",
      "timestamp": "2024-XX-XXT00:00:12.000Z",
      "intent": "suggestion",
      "tone": "casual"
    },
    {
      "sender": "user",
      "text": "that sounds so horrible",
      "timestamp": "2024-XX-XXT00:00:15.000Z",
      "intent": "rejection",
      "tone": "negative"
    },
    {
      "sender": "assistant",
      "text": "LOL no its empowering",
      "timestamp": "2024-XX-XXT00:00:16.000Z",
      "intent": "pushback",
      "tone": "playful_disagreement"
    }
  ],
  "metadata": {
    "total_messages": 15,
    "duration_minutes": 5,
    "topic_tags": ["dating", "self_care", "movie_recommendation"],
    "emotional_arc": "supportive → playful → encouraging",
    "resolution": "user agrees to consider the idea"
  }
}

Guidelines

DO:
- Keep original timestamps (relative or anonymized dates)
- Label intent and tone for each message
- Include full conversation context
- Note emotional arc and resolution
- Anonymize all PII

DON'T:
- Skip metadata
- Include incomplete conversations
- Include conversations without clear context
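A per-message labeling check for this format could look like the sketch below. The function name `check_conversation` is hypothetical, and treating a sender that is not listed in `participants` as an error is an assumption rather than a stated rule.

```python
REQUIRED_FIELDS = ("sender", "text", "timestamp", "intent", "tone")


def check_conversation(conversation: dict) -> list[str]:
    """Return a list of problems found in one full-conversation record."""
    problems = []
    if "metadata" not in conversation:
        problems.append("missing metadata")
    if not conversation.get("conversation_context"):
        problems.append("missing conversation_context")
    participants = set(conversation.get("participants", []))
    for i, message in enumerate(conversation.get("messages", [])):
        for key in REQUIRED_FIELDS:
            if key not in message:
                problems.append(f"messages[{i}]: missing {key}")
        # Assumption: every sender should be declared in "participants".
        sender = message.get("sender")
        if sender is not None and sender not in participants:
            problems.append(f"messages[{i}]: unknown sender {sender!r}")
    return problems
```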

Data Privacy & Ethics

Before Submitting Data

Required:
- ✅ Consent obtained from all participants
- ✅ All real names replaced with placeholders
- ✅ Specific locations/dates anonymized
- ✅ No sensitive personal information (health, finance, etc.)
- ✅ No minors' data
- ✅ No toxic/harmful content

Anonymization Guide

Replace these:
- Names: "Sarah" → "[FRIEND_1]"
- Places: "Stanford" → "[UNIVERSITY]" or "[SCHOOL]"
- Companies: "Google" → "[COMPANY]"
- Dates: "December 2nd" → "[DATE]" or a generic season
- Venues: "Blue Jay bar" → "[PLACE]" or "[BAR]"

Keep these:
- Emotional tone and energy
- Conversation structure
- Timing patterns
- Language style (lowercase, typos, slang)
- General topics (dating, school, work stress)
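The replacement step can be partly automated once a contributor has listed the identifiers that appear in their data. The sketch below assumes such a hand-written mapping exists; the function name `anonymize` and the mapping contents are illustrative only. It leaves everything else in the message untouched, so lowercase, typos, and slang survive.

```python
import re


def anonymize(text: str, replacements: dict[str, str]) -> str:
    """Swap each known identifier for its placeholder, ignoring case."""
    # Longest identifiers first, so "Blue Jay bar" is handled before any
    # shorter entry that overlaps with it.
    for original in sorted(replacements, key=len, reverse=True):
        placeholder = replacements[original]
        pattern = re.compile(r"\b" + re.escape(original) + r"\b", re.IGNORECASE)
        # A function replacement avoids backslash processing in the placeholder.
        text = pattern.sub(lambda _match: placeholder, text)
    return text
```

For example, `anonymize("met sarah at Blue Jay bar", {"Sarah": "[FRIEND_1]", "Blue Jay bar": "[BAR]"})` returns `"met [FRIEND_1] at [BAR]"`. A mapping like this only catches identifiers someone thought to list, so a manual read-through is still needed before submitting.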

Submission Process

Step 1: Prepare Your Data

Choose which format(s) you're providing:
- Few-shot examples (JSON)
- Burst patterns (JSON)
- Full conversations (JSON)

Step 2: Validate Format

Use our validation script (coming soon):

python tools/validate_conversation_data.py your_file.json

Step 3: Submit

Create a PR with:
- Data files in the appropriate directory
- Brief description of scenarios/sources
- Confirmation of consent/anonymization

Directory structure:

data/
├── few_shot_examples/
│   ├── supportive_advice.json
│   ├── playful_banter.json
│   └── making_plans.json
├── burst_patterns/
│   ├── multi_bubble_explanations.json
│   └── quick_reactions.json
└── fine_tuning/ (future)
    └── full_conversations.json

What We're Looking For

High Priority Scenarios

Emotional Support:
- Comforting friend after breakup
- Helping with school/work stress
- Celebrating achievements
- Dealing with anxiety/overwhelm

Practical Help:
- Making plans with multiple people
- Rescheduling meetings
- Finding information (flights, deadlines)
- Solving logistics problems

Social Interactions:
- Playful teasing between friends
- Inside jokes forming
- Catching up after time apart
- Group chat coordination

Relationship Stages:
- First conversation (stranger)
- Early friendship (acquaintance)
- Established friendship (friend)
- Close friendship (best friend)

What We Already Have

Examples from PRD testing:
- ✅ Simple banter ("how's it going")
- ✅ Task requests ("move my 3pm meeting")
- ✅ Factual questions ("what time is my flight")
- ✅ Pure reactions ("HAHAHA")

Diversity Goals

We want to represent:
- Different age groups (college, early career, etc.)
- Various relationship types (friends, family, romantic)
- Multiple communication styles (verbose, terse, emoji-heavy)
- Different emotional contexts (happy, stressed, neutral)

Examples of Good vs. Bad Submissions

✅ GOOD: Few-Shot Example

{
  "conversation_id": "advice_001",
  "scenario": "Friend stressed about upcoming exam",
  "tone_tags": ["supportive", "practical", "casual"],
  "examples": [
    {
      "user": "I'm so stressed about Thursday's exam",
      "assistant": "wait what exam 😰",
      "context": "Immediate empathetic reaction + seeking info",
      "notes": "Quick acknowledgment with emoji, asks follow-up"
    },
    {
      "user": "the econ midterm, i'm gonna fail",
      "assistant": "ok breathe",
      "context": "User catastrophizing",
      "notes": "Two-word grounding, doesn't lecture"
    }
  ]
}

Why it's good:
- Clear scenario
- Natural language preserved
- Context explained
- Shows multi-bubble potential

❌ BAD: Too Formal

{
  "user": "I am very stressed about my examination on Thursday.",
  "assistant": "I understand that you are feeling anxious. Would you like to discuss strategies for managing your stress?"
}

Why it's bad:
- Not how humans text
- Too formal/robotic
- Loses the natural tone we want to learn
- Missing scenario, tone tags, context, and notes

✅ GOOD: Burst Pattern

{
  "burst_id": "excited_news_001",
  "messages": [
    {
      "order": 1,
      "text": "WAIT",
      "timestamp_relative_ms": 0,
      "bubble_type": "attention",
      "function": "Stop conversation, build anticipation"
    },
    {
      "order": 2,
      "text": "guess what just happened",
      "timestamp_relative_ms": 1200,
      "bubble_type": "setup",
      "function": "Create suspense before news"
    },
    {
      "order": 3,
      "text": "i got the internship 🎉",
      "timestamp_relative_ms": 2400,
      "bubble_type": "reveal",
      "function": "Deliver the actual news"
    }
  ]
}

Why it's good:
- Shows natural build-up pattern
- Timing data included
- Clear functional breakdown

❌ BAD: Missing Key Data

{
  "messages": [
    {"text": "hey"},
    {"text": "what's up"},
    {"text": "not much"}
  ]
}

Why it's bad:
- No timing data
- No context
- No functional analysis
- Too generic/not useful

Technical Notes

Timestamp Formats

We accept:
- ISO 8601: 2024-01-15T14:30:00.000Z
- Unix epoch (ms): 1705329000000
- Relative (ms): 0, 1200, 2400 (preferred for bursts)
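Contributors whose exports carry absolute timestamps may want to convert them to the relative form before submitting bursts. The sketch below is one way to do that with the standard library. The function names `to_epoch_ms` and `gaps_ms` are hypothetical, and it makes two assumptions: timestamps without a timezone are UTC, and relative values are gaps between consecutive messages.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(value) -> int:
    """Convert an ISO 8601 string or an epoch-milliseconds number to epoch ms."""
    if isinstance(value, (int, float)):
        return int(value)
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    # Integer division by one millisecond avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def gaps_ms(timestamps: list) -> list[int]:
    """Return the pause before each message, with 0 for the first one."""
    absolute = [to_epoch_ms(t) for t in timestamps]
    return [0] + [later - earlier for earlier, later in zip(absolute, absolute[1:])]
```

The two absolute examples in the list above denote the same instant, so `to_epoch_ms("2024-01-15T14:30:00.000Z")` returns `1705329000000`.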

Character Limits

For compatibility with our system:
- Individual messages: ≤280 characters (texting length)
- Burst sequences: 2-5 messages
- Full conversations: 5-50 messages
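These limits are simple enough to check in a few lines. The sketch below uses a hypothetical function `check_limits` and hypothetical kind labels ("burst" and "full"). It counts characters with `len()`, which is an assumption about how the system measures message length.

```python
MAX_MESSAGE_CHARS = 280
MESSAGE_COUNT_RANGE = {"burst": (2, 5), "full": (5, 50)}


def check_limits(texts: list[str], kind: str) -> list[str]:
    """Check message count and per-message length for a burst or conversation."""
    low, high = MESSAGE_COUNT_RANGE[kind]
    problems = []
    if not low <= len(texts) <= high:
        problems.append(f"{kind}: expected {low}-{high} messages, found {len(texts)}")
    for i, text in enumerate(texts):
        # len() counts Unicode code points, so some emoji count as more than one.
        if len(text) > MAX_MESSAGE_CHARS:
            problems.append(
                f"message {i}: {len(text)} characters exceeds {MAX_MESSAGE_CHARS}"
            )
    return problems
```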

Language

Currently supporting:
- English (primary)
- Casual/informal register
- Slang, emojis, internet language welcome

Contact & Support

Questions?
- File an issue: github.com/yourorg/archety/issues
- Slack: #conversation-data
- Email: backend-team@yourorg.com

Want to contribute but unsure?
- Start with 2-3 few-shot examples
- We'll provide feedback
- Iterate from there

Appendix: Quick Reference

Minimum Viable Submission

For few-shot examples:
- 3-5 user/assistant exchanges
- Clear scenario description
- Tone tags
- Basic anonymization

For burst patterns:
- 1 multi-message sequence (2-5 bubbles)
- Relative timing in milliseconds
- Functional labels

File Naming Convention

{type}_{scenario}_{optional_id}.json

Examples:
- fewshot_supportive_advice_001.json
- burst_excited_news_001.json
- full_dating_advice_conversation.json
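The convention can be expressed as a regular expression for use in a pre-commit check. This sketch infers the allowed type prefixes (fewshot, burst, full) from the three example names and assumes the optional ID is purely numeric. Both are assumptions, and `parse_filename` is a hypothetical helper.

```python
import re
from typing import Optional

# Assumptions: type prefixes come from the example names; the ID is numeric.
FILENAME_PATTERN = re.compile(
    r"(?P<type>fewshot|burst|full)"
    r"_(?P<scenario>[a-z]+(?:_[a-z]+)*)"
    r"(?:_(?P<id>\d+))?"
    r"\.json"
)


def parse_filename(name: str) -> Optional[dict]:
    """Split a data file name into type, scenario, and optional ID."""
    match = FILENAME_PATTERN.fullmatch(name)
    return match.groupdict() if match else None
```

For example, `parse_filename("burst_excited_news_001.json")` returns `{"type": "burst", "scenario": "excited_news", "id": "001"}`, and a name that does not follow the convention returns `None`.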

Version: 1.0
Last Updated: October 31, 2025
Maintained By: Backend Team (Engineer 2)