Conversation Data Format Specification
Purpose: Guide for engineers contributing conversation data to improve the multi-bubble messaging system.
Last Updated: October 31, 2025
Overview
We need real conversation data to improve three aspects of the system:
- Few-Shot Examples → Add to system prompts for better tone/style
- Burst Pattern Training → Improve multi-bubble timing and flow
- Fine-Tuning Data → Future model fine-tuning (optional)
This document specifies the exact formats needed for each use case.
Format 1: Few-Shot Examples (PRIMARY NEED)
Use Case: Add to persona passport examples and system prompts to guide tone, style, and multi-bubble patterns.
Where it goes: app/persona/passports/sage.json → examples array
Required Format
{
  "conversation_id": "conv_001",
  "scenario": "Friend giving advice about dating/being single",
  "tone_tags": ["supportive", "casual", "teasing", "empowering"],
  "examples": [
    {
      "user": "that sounds so horrible",
      "assistant": "LOL no its empowering",
      "context": "User reacting negatively to movie suggestion",
      "notes": "Quick pushback with humor, doesn't over-explain"
    },
    {
      "user": "I think I would actually cry",
      "assistant": "no its good!",
      "context": "User doubling down on negativity",
      "notes": "Persistent encouragement, stays upbeat"
    },
    {
      "user": "I have no interest in staying in and watching a show w another guy",
      "assistant": "just skip [ACTIVITY] just one night",
      "context": "User resistant to idea",
      "notes": "Gentle persuasion, specific actionable advice"
    }
  ]
}
Guidelines
DO:
- Keep examples short (1-2 sentence exchanges)
- Include diverse scenarios (advice, banter, emotional support, tasks)
- Preserve natural typos/lowercase if they show personality
- Replace specific names with [PERSON], [PLACE], [ACTIVITY]
- Add context to explain the interaction
DON'T:
- Include long multi-paragraph responses
- Remove all personality (some typos/slang are good!)
- Include PII (real names, addresses, phone numbers)
- Use examples where tone is mean, toxic, or inappropriate
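Some of these guidelines can be checked mechanically before a human review. The sketch below is only an illustration: `lint_few_shot`, the 200-character limit, and the email/phone patterns are assumptions made for this example rather than part of the project's tooling, and regular expressions will not catch every kind of PII.

```python
import re

# Hypothetical threshold: the guideline says "1-2 sentences", not a character count.
MAX_TEXT_CHARS = 200

# Rough patterns for obvious PII; illustrative only and far from exhaustive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def lint_few_shot(record):
    """Return a list of warnings for one few-shot record (Format 1)."""
    warnings = []
    for i, example in enumerate(record.get("examples", [])):
        for field in ("user", "assistant"):
            text = example.get(field, "")
            if len(text) > MAX_TEXT_CHARS:
                warnings.append(f"examples[{i}].{field}: longer than {MAX_TEXT_CHARS} chars")
            if "\n\n" in text:
                warnings.append(f"examples[{i}].{field}: multi-paragraph text")
            if EMAIL_RE.search(text) or PHONE_RE.search(text):
                warnings.append(f"examples[{i}].{field}: possible email or phone number")
        if not example.get("context"):
            warnings.append(f"examples[{i}]: missing context")
    return warnings
```

An empty list means none of these checks fired; it does not mean the record is free of PII, so manual review is still needed.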
Example Scenarios We Need
High priority:
- ✅ Supportive advice (dating, career, school stress)
- ⏳ Playful teasing/banter between friends
- ⏳ Making plans (logistics, coordinating)
- ⏳ Emotional support (dealing with loss, anxiety)
- ⏳ Celebrating wins together
- ⏳ Inside jokes developing over time
Format 2: Burst Pattern Examples
Use Case: Understand how humans naturally split ideas across multiple messages (the "multi-bubble" pattern).
Where it goes: data/burst_patterns/ for analysis
Required Format
{
  "burst_id": "burst_001",
  "scenario": "Explaining movie night idea to friend",
  "messages": [
    {
      "order": 1,
      "text": "omg you know what you two should do",
      "timestamp_relative_ms": 0,
      "bubble_type": "setup",
      "function": "Get attention, tease the idea"
    },
    {
      "order": 2,
      "text": "not go out one night and drink wine and watch 'how to be single' together",
      "timestamp_relative_ms": 7009,
      "bubble_type": "main_idea",
      "function": "Deliver the actual suggestion with details"
    },
    {
      "order": 3,
      "text": "that movie",
      "timestamp_relative_ms": 1119,
      "bubble_type": "clarification",
      "function": "Clarify/emphasize one detail"
    }
  ],
  "pattern_notes": "3-bubble burst: attention-getter → full idea → quick clarification",
  "total_duration_ms": 8128,
  "interruption_occurred": false
}
What We Extract
From burst patterns we learn:
- Timing: How long between messages? (helps tune delivery delays)
- Structure: How do people break up complex ideas?
- Function: What role does each bubble play?
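To make the timing point concrete, a burst record can be reduced to a few numbers. This is a minimal sketch that assumes, as the burst_001 example suggests (7009 + 1119 = 8128), that each `timestamp_relative_ms` is the delay since the previous bubble, with 0 for the first. `burst_timing` is a hypothetical helper name, not an existing function in the codebase.

```python
def burst_timing(burst):
    """Summarize timing for one burst record.

    Assumes timestamp_relative_ms is the delay since the previous bubble
    (0 for the first bubble), which is an inference from the example data.
    """
    ordered = sorted(burst["messages"], key=lambda m: m["order"])
    gaps = [m["timestamp_relative_ms"] for m in ordered[1:]]
    return {
        "gaps_ms": gaps,
        "total_duration_ms": sum(gaps),
        "mean_gap_ms": sum(gaps) / len(gaps) if gaps else 0.0,
    }


burst = {
    "burst_id": "burst_001",
    "messages": [
        {"order": 1, "text": "omg you know what you two should do", "timestamp_relative_ms": 0},
        {"order": 2, "text": "not go out one night and drink wine and watch 'how to be single' together", "timestamp_relative_ms": 7009},
        {"order": 3, "text": "that movie", "timestamp_relative_ms": 1119},
    ],
}
print(burst_timing(burst)["total_duration_ms"])  # 8128
```

If the values in a dataset turn out to be offsets from the first bubble instead, the gaps would need to be computed as differences between consecutive values.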
Bubble Type Taxonomy
Common patterns we've observed:
- setup - "omg you know what...", "wait hear me out"
- main_idea - The actual point/suggestion
- clarification - "that movie", "you know what i mean"
- emphasis - "no its good!", "i'm serious"
- reaction - "LOL", "💀", "omg"
- follow_up_question - "have you seen it?"
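Because this list describes common patterns rather than a closed set, a checker should probably flag unfamiliar labels for review instead of rejecting them. A small sketch, where `COMMON_BUBBLE_TYPES` and `unfamiliar_bubble_types` are hypothetical names introduced here:

```python
# Labels taken from the taxonomy in this document; treated as a non-exhaustive set.
COMMON_BUBBLE_TYPES = {
    "setup",
    "main_idea",
    "clarification",
    "emphasis",
    "reaction",
    "follow_up_question",
}


def unfamiliar_bubble_types(burst):
    """Return bubble_type labels in a burst that are not in the common set, sorted."""
    seen = {m.get("bubble_type") for m in burst.get("messages", [])}
    return sorted(label for label in seen if label and label not in COMMON_BUBBLE_TYPES)
```

A non-empty result is a prompt to either relabel the bubble or propose a new taxonomy entry.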
Guidelines
DO:
- Include timestamp differences (milliseconds)
- Label the function of each message
- Note if conversation was interrupted
- Include 2-5 message sequences
DON'T:
- Include single isolated messages
- Skip timing data
- Mix different conversation topics in one burst
Format 3: Full Conversation Context (For Future Fine-Tuning)
Use Case: Train models to understand full conversation flow, relationship building, memory.
Where it goes: data/fine_tuning/conversations/ (not used yet, but preparing format)
Required Format
{
  "conversation_id": "conv_full_001",
  "date": "2024-XX-XX",
  "participants": ["user", "assistant"],
  "relationship_stage": "friend",
  "conversation_context": "Discussing user's dating life and giving supportive advice",
  "messages": [
    {
      "sender": "assistant",
      "text": "focus on yourself for a bit",
      "timestamp": "2024-XX-XXT00:00:00.000Z",
      "intent": "advice",
      "tone": "supportive"
    },
    {
      "sender": "assistant",
      "text": "omg you know what you two should do",
      "timestamp": "2024-XX-XXT00:00:07.000Z",
      "intent": "suggestion_setup",
      "tone": "excited"
    },
    {
      "sender": "assistant",
      "text": "not go out one night and drink wine and watch 'how to be single' together",
      "timestamp": "2024-XX-XXT00:00:12.000Z",
      "intent": "suggestion",
      "tone": "casual"
    },
    {
      "sender": "user",
      "text": "that sounds so horrible",
      "timestamp": "2024-XX-XXT00:00:15.000Z",
      "intent": "rejection",
      "tone": "negative"
    },
    {
      "sender": "assistant",
      "text": "LOL no its empowering",
      "timestamp": "2024-XX-XXT00:00:16.000Z",
      "intent": "pushback",
      "tone": "playful_disagreement"
    }
  ],
  "metadata": {
    "total_messages": 15,
    "duration_minutes": 5,
    "topic_tags": ["dating", "self_care", "movie_recommendation"],
    "emotional_arc": "supportive → playful → encouraging",
    "resolution": "user agrees to consider the idea"
  }
}
Guidelines
DO:
- Keep original timestamps (relative or anonymized dates)
- Label intent and tone for each message
- Include full conversation context
- Note emotional arc and resolution
- Anonymize all PII
DON'T:
- Skip metadata
- Include incomplete conversations
- Include conversations without clear context
Data Privacy & Ethics
Before Submitting Data
Required:
- ✅ Consent obtained from all participants
- ✅ All real names replaced with placeholders
- ✅ Specific locations/dates anonymized
- ✅ No sensitive personal information (health, finance, etc.)
- ✅ No minors' data
- ✅ No toxic/harmful content
Anonymization Guide
Replace these:
- Names: "Sarah" → "[FRIEND_1]"
- Places: "Stanford" → "[UNIVERSITY]" or "[SCHOOL]"
- Companies: "Google" → "[COMPANY]"
- Dates: "December 2nd" → "[DATE]" or generic season
- Venues: "Blue Jay bar" → "[PLACE]" or "[BAR]"
Keep these:
- Emotional tone and energy
- Conversation structure
- Timing patterns
- Language style (lowercase, typos, slang)
- General topics (dating, school, work stress)
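One way to apply these replacements consistently is a lookup table of known names. The sketch below is an assumption about how a contributor might do this locally; `anonymize` and the `replacements` table are hypothetical, and the approach only catches names that have been listed, so a manual read-through is still required.

```python
import re


def anonymize(text, replacements):
    """Replace each known name with its placeholder, matching whole words only."""
    if not replacements:
        return text
    lookup = {name.lower(): placeholder for name, placeholder in replacements.items()}
    # Longest names first so multi-word names win over shorter overlapping ones.
    names = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    # Single pass over the text, so placeholders are never re-scanned.
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


replacements = {
    "Sarah": "[FRIEND_1]",
    "Stanford": "[UNIVERSITY]",
    "Google": "[COMPANY]",
    "Blue Jay bar": "[BAR]",
}
print(anonymize("sarah said meet at blue jay bar after her google interview", replacements))
# [FRIEND_1] said meet at [BAR] after her [COMPANY] interview
```

Matching is case-insensitive so that lowercase texting style is still caught, while the rest of the message, including typos and slang, is left untouched.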
Submission Process
Step 1: Prepare Your Data
Choose which format(s) you're providing:
- Few-shot examples (JSON)
- Burst patterns (JSON)
- Full conversations (JSON)
Step 2: Validate Format
Use our validation script (coming soon).
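Until that script is available, a local pre-check along these lines may catch the most common omissions. This is a sketch and not the project's validator: `REQUIRED_KEYS` is inferred from the Required Format examples in this document, and `missing_keys` is a hypothetical helper.

```python
import json

# Top-level keys inferred from the examples in this document;
# an assumption, not an official schema.
REQUIRED_KEYS = {
    "few_shot": {"conversation_id", "scenario", "tone_tags", "examples"},
    "burst": {"burst_id", "messages"},
    "full": {"conversation_id", "participants", "messages", "metadata"},
}


def missing_keys(record, data_format):
    """Return the required top-level keys absent from a record, sorted."""
    return sorted(REQUIRED_KEYS[data_format] - set(record))


raw = '{"burst_id": "burst_001"}'
print(missing_keys(json.loads(raw), "burst"))  # ['messages']
```

The check only looks at top-level keys; per-message fields such as `order` or `bubble_type` would need their own pass.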
Step 3: Submit
Create a PR with:
- Data files in appropriate directory
- Brief description of scenarios/sources
- Confirmation of consent/anonymization
Directory structure:
data/
├── few_shot_examples/
│   ├── supportive_advice.json
│   ├── playful_banter.json
│   └── making_plans.json
├── burst_patterns/
│   ├── multi_bubble_explanations.json
│   └── quick_reactions.json
└── fine_tuning/ (future)
    └── full_conversations.json
What We're Looking For
High Priority Scenarios
Emotional Support:
- Comforting friend after breakup
- Helping with school/work stress
- Celebrating achievements
- Dealing with anxiety/overwhelm
Practical Help:
- Making plans with multiple people
- Rescheduling meetings
- Finding information (flights, deadlines)
- Solving logistics problems
Social Interactions:
- Playful teasing between friends
- Inside jokes forming
- Catching up after time apart
- Group chat coordination
Relationship Stages:
- First conversation (stranger)
- Early friendship (acquaintance)
- Established friendship (friend)
- Close friendship (best friend)
What We Already Have
Examples from PRD testing:
- ✅ Simple banter ("how's it going")
- ✅ Task requests ("move my 3pm meeting")
- ✅ Factual questions ("what time is my flight")
- ✅ Pure reactions ("HAHAHA")
Diversity Goals
We want to represent:
- Different age groups (college, early career, etc.)
- Various relationship types (friends, family, romantic)
- Multiple communication styles (verbose, terse, emoji-heavy)
- Different emotional contexts (happy, stressed, neutral)
Examples of Good vs. Bad Submissions
✅ GOOD: Few-Shot Example
{
  "conversation_id": "advice_001",
  "scenario": "Friend stressed about upcoming exam",
  "tone_tags": ["supportive", "practical", "casual"],
  "examples": [
    {
      "user": "I'm so stressed about Thursday's exam",
      "assistant": "wait what exam 😰",
      "context": "Immediate empathetic reaction + seeking info",
      "notes": "Quick acknowledgment with emoji, asks follow-up"
    },
    {
      "user": "the econ midterm, i'm gonna fail",
      "assistant": "ok breathe",
      "context": "User catastrophizing",
      "notes": "Two-word grounding, doesn't lecture"
    }
  ]
}
Why it's good:
- Clear scenario
- Natural language preserved
- Context explained
- Shows multi-bubble potential
❌ BAD: Too Formal
{
  "user": "I am very stressed about my examination on Thursday.",
  "assistant": "I understand that you are feeling anxious. Would you like to discuss strategies for managing your stress?"
}
Why it's bad:
- Not how humans text
- Too formal/robotic
- Loses natural tone we want to learn
✅ GOOD: Burst Pattern
{
  "burst_id": "excited_news_001",
  "messages": [
    {
      "order": 1,
      "text": "WAIT",
      "timestamp_relative_ms": 0,
      "bubble_type": "attention",
      "function": "Stop conversation, build anticipation"
    },
    {
      "order": 2,
      "text": "guess what just happened",
      "timestamp_relative_ms": 1200,
      "bubble_type": "setup",
      "function": "Create suspense before news"
    },
    {
      "order": 3,
      "text": "i got the internship 🎉",
      "timestamp_relative_ms": 2400,
      "bubble_type": "reveal",
      "function": "Deliver the actual news"
    }
  ]
}
Why it's good:
- Shows natural build-up pattern
- Timing data included
- Clear functional breakdown
❌ BAD: Missing Key Data
Why it's bad:
- No timing data
- No context
- No functional analysis
- Too generic/not useful
Technical Notes
Timestamp Formats
We accept:
- ISO 8601: 2024-01-15T14:30:00.000Z
- Unix epoch (ms): 1705329000000
- Relative (ms): 0, 1200, 2400 (preferred for bursts)
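Converting between these formats is straightforward with the standard library. The following is a sketch under two assumptions: ISO 8601 values carry an explicit UTC marker (`Z` or an offset), and relative values are delays since the previous message. `iso_to_epoch_ms` and `to_relative_ms` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_epoch_ms(value):
    """Convert an ISO 8601 timestamp with a UTC marker to epoch milliseconds."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def to_relative_ms(epoch_ms_values):
    """Convert absolute timestamps to delays since the previous message (first is 0)."""
    gaps = [0] if epoch_ms_values else []
    for previous, current in zip(epoch_ms_values, epoch_ms_values[1:]):
        gaps.append(current - previous)
    return gaps


print(iso_to_epoch_ms("2024-01-15T14:30:00.000Z"))  # 1705329000000
print(to_relative_ms([1705329000000, 1705329001200, 1705329003600]))  # [0, 1200, 2400]
```

Note that an integer alone does not say whether it is an epoch value or a relative delay, so a submission should use one format consistently within a file.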
Character Limits
For compatibility with our system:
- Individual messages: ≤280 characters (texting length)
- Burst sequences: 2-5 messages
- Full conversations: 5-50 messages
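These limits are simple to check before opening a PR. A minimal sketch, assuming each message is a dict with a `text` key as in the formats above; `limit_violations` and the constant names are hypothetical, and `len()` counts Unicode code points, which may differ from how the system counts emoji.

```python
MAX_MESSAGE_CHARS = 280
BURST_RANGE = (2, 5)
CONVERSATION_RANGE = (5, 50)


def limit_violations(messages, count_range):
    """Return human-readable problems for a list of message dicts with a "text" key."""
    problems = []
    low, high = count_range
    if not low <= len(messages) <= high:
        problems.append(f"expected {low}-{high} messages, found {len(messages)}")
    for index, message in enumerate(messages):
        length = len(message["text"])  # code points, not rendered characters
        if length > MAX_MESSAGE_CHARS:
            problems.append(f"message {index}: {length} characters exceeds {MAX_MESSAGE_CHARS}")
    return problems
```

For example, `limit_violations(burst["messages"], BURST_RANGE)` would check a burst record, and passing `CONVERSATION_RANGE` would check a full conversation.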
Language
Currently supporting:
- English (primary)
- Casual/informal register
- Slang, emojis, internet language welcome
Contact & Support
Questions?
- File an issue: github.com/yourorg/archety/issues
- Slack: #conversation-data
- Email: backend-team@yourorg.com
Want to contribute but unsure?
- Start with 2-3 few-shot examples
- We'll provide feedback
- Iterate from there
Appendix: Quick Reference
Minimum Viable Submission
For few-shot examples:
- 3-5 user/assistant exchanges
- Clear scenario description
- Tone tags
- Basic anonymization
For burst patterns:
- 1 multi-message sequence (2-5 bubbles)
- Relative timing in milliseconds
- Functional labels
File Naming Convention
Examples:
- fewshot_supportive_advice_001.json
- burst_excited_news_001.json
- full_dating_advice_conversation.json
Version: 1.0
Last Updated: October 31, 2025
Maintained By: Backend Team (Engineer 2)