Training Data Integration

Status: ✅ Complete
Added: October 31, 2025
Branch: feature/sage-training-integration


Overview

The Sage persona now learns from real conversation examples at runtime. Instead of hardcoding personality traits, the system dynamically selects few-shot examples and burst patterns from curated training datasets based on conversation context.


Architecture

Components

  1. TrainingDataLoader (app/persona/training_data_loader.py)
     • Loads and caches training datasets at startup
     • Provides query methods for examples and patterns
     • Filters by scenario, tone tags, and quality scores

  2. Training Data Assets (training_data/)
     • conversational_format.md - Specification for data formats
     • sage/fewshot_examples.json - 59,000+ curated conversation examples
     • sage/burst_patterns.json - Multi-bubble messaging patterns

  3. Persona Engine Integration (app/persona/engine.py)
     • Dynamically selects 3-5 few-shot examples per prompt
     • Injects examples based on scenario and tone
     • Falls back to passport defaults if no match

  4. Burst Planner Integration (app/messaging/burst_planner.py)
     • Loads burst pattern guidance for multi-bubble responses
     • Shows the LLM real examples of how to split messages naturally

How It Works

1. Few-Shot Example Selection

When building a system prompt, the PersonaEngine:

  1. Checks whether the persona has a training_data.fewshot config in its passport
  2. Queries TrainingDataLoader with:
     • scenario (e.g., "Making plans", "Emotional support")
     • tone_tags (e.g., ["casual", "supportive", "playful"])
     • limit (default: 3)
  3. Loader returns quality-scored examples sorted by relevance
  4. Examples are injected into the system prompt alongside static examples
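The selection step can be sketched roughly as follows. This is an illustrative sketch, not the actual engine code: the `select_dynamic_examples` helper and the fallback behaviour are assumptions based on the passport config shown next and the `get_fewshot_examples` signature from the Usage Example section.

```python
from typing import Optional, Sequence

def select_dynamic_examples(loader, passport: dict,
                            scenario: Optional[str] = None,
                            tone_tags: Optional[Sequence[str]] = None) -> list:
    """Query training data for a persona, falling back to passport defaults (sketch)."""
    cfg = passport.get("training_data", {}).get("fewshot")
    if cfg is None:
        return []  # no training data configured; caller uses static passport examples
    return loader.get_fewshot_examples(
        persona_id=passport["id"],
        scenario=scenario or cfg["default_scenarios"][0],
        tone_tags=list(tone_tags or cfg["default_tone_tags"]),
        limit=cfg.get("limit", 3),
    )
```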

Example Config (sage.json):

{
  "training_data": {
    "fewshot": {
      "source": "training_data/sage/fewshot_examples.json",
      "default_scenarios": [
        "Making plans and coordinating schedules",
        "Casual conversation and catching up"
      ],
      "default_tone_tags": ["casual", "playful", "supportive"],
      "limit": 3
    }
  }
}

2. Burst Pattern Guidance

When planning multi-bubble responses, the BurstPlanner:

  1. Receives a scenario hint from reasoning (e.g., "emotional support")
  2. Queries TrainingDataLoader for 2 matching burst patterns
  3. Formats patterns as guidance text showing:
     • Scenario description
     • Pattern structure (e.g., "setup → main_idea → clarification")
     • Example messages with bubble types
  4. Injects guidance into the LLM prompt for natural pacing

Example Pattern:

{
  "burst_id": "burst_007",
  "scenario": "Playful teasing or joke",
  "messages": [
    {
      "order": 1,
      "text": "Haha okie",
      "bubble_type": "reaction",
      "function": "Express emotional reaction"
    },
    {
      "order": 2,
      "text": "Are you gonna cook? And we're gonna movie night?",
      "bubble_type": "follow_up_question",
      "function": "Ask follow-up to continue conversation"
    }
  ],
  "pattern_notes": "2-bubble burst: reaction → follow_up_question",
  "total_duration_ms": 13000
}
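A pattern like the one above has to be rendered as prompt text before the LLM sees it. The sketch below shows one plausible rendering; the function name mirrors `_format_burst_guidance()` from the Integration Points section, but the output layout is an assumption. The empty-input fallback matches the "(no pattern data available)" string noted under Rollback.

```python
def format_burst_guidance(patterns: list) -> str:
    """Render burst patterns as guidance text for the LLM prompt (illustrative)."""
    if not patterns:
        return "(no pattern data available)"
    lines = []
    for p in patterns:
        lines.append(f"Scenario: {p['scenario']}")
        lines.append(f"Structure: {p['pattern_notes']}")
        # Emit bubbles in their intended order, tagged with their bubble type
        for msg in sorted(p["messages"], key=lambda m: m["order"]):
            lines.append(f"  [{msg['bubble_type']}] {msg['text']}")
        lines.append("")
    return "\n".join(lines).rstrip()
```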


Data Format

Few-Shot Examples

Each example includes:

  • conversation_id - Unique identifier
  • scenario - Context category (e.g., "Making plans")
  • tone_tags - Tone descriptors (e.g., ["casual", "playful"])
  • quality_scores - Emotional richness, clarity, usefulness (0-1)
  • examples - Array of user/assistant exchanges with context

Quality Scoring:

  • Examples are sorted by combined score (emotional_richness + clarity + usefulness)
  • Higher scores indicate more natural, engaging, useful exchanges
  • The loader prioritizes top-scored examples
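The combined-score ordering amounts to a simple sort over the three quality metrics. A minimal sketch, assuming missing metrics default to zero:

```python
def rank_by_quality(examples: list) -> list:
    """Sort examples by combined quality score, highest first (illustrative)."""
    def combined(ex: dict) -> float:
        q = ex.get("quality_scores", {})
        return (q.get("emotional_richness", 0.0)
                + q.get("clarity", 0.0)
                + q.get("usefulness", 0.0))
    return sorted(examples, key=combined, reverse=True)
```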

Burst Patterns

Each pattern includes:

  • burst_id - Unique identifier
  • scenario - When this pattern applies
  • messages - Array of bubbles with timing, type, and function
  • pattern_notes - Human-readable structure summary
  • total_duration_ms - Real timing from the source conversation


Integration Points

PersonaEngine.build_system_prompt()

New Parameters:

  • scenario: Optional[str] - Current conversation scenario
  • tone_tags: Optional[Sequence[str]] - Desired tone characteristics

Behavior:

  1. Loads static examples from the passport
  2. Calls _select_dynamic_examples() to fetch training data
  3. Merges static + dynamic examples (limit 3 total)
  4. Includes context and notes for each example
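The merge step can be sketched as below. The deduplication key and the ordering (dynamic picks ahead of static defaults) are assumptions for illustration; the document only states that the merged set is capped at 3.

```python
def merge_examples(static: list, dynamic: list, limit: int = 3) -> list:
    """Merge passport examples with dynamic picks, capped at `limit` (sketch)."""
    merged, seen = [], set()
    # Assumption: dynamically selected examples take priority over static defaults
    for ex in dynamic + static:
        key = ex.get("conversation_id") or id(ex)
        if key in seen:
            continue  # skip duplicates already contributed by the other source
        seen.add(key)
        merged.append(ex)
        if len(merged) == limit:
            break
    return merged
```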

BurstPlanner.plan_burst()

New Parameters:

  • persona_id: str - Which persona's patterns to use
  • scenario: Optional[str] - Conversation scenario for pattern matching

Behavior:

  1. Extracts the scenario from reasoning or uses the provided hint
  2. Calls _format_burst_guidance() to fetch patterns
  3. Injects pattern examples into the LLM prompt
  4. The LLM sees real examples of natural multi-bubble flow

MultiBubbleHandler.handle()

Changes:

  • Passes persona_id and scenario to the burst planner
  • Scenario is derived from reasoning.get("user_need")


Benefits

1. Data-Driven Personality

  • Sage's tone comes from real conversations, not guesswork
  • Easy to update personality by adding/curating examples
  • No code changes needed to refine tone

2. Scalable Training

  • New personas just need their own training dataset
  • Reuse loader infrastructure across all personas
  • Quality scoring ensures best examples surface first

3. Context-Aware Responses

  • System picks examples matching current conversation type
  • Emotional support scenarios get supportive examples
  • Planning scenarios get practical, coordinating examples

4. Natural Multi-Bubble Flow

  • Burst patterns teach LLM realistic message pacing
  • Examples show when to split ideas vs. send one bubble
  • Timing data preserves human-like delays

Files Changed

New Files

  • app/persona/training_data_loader.py - Core loader logic
  • training_data/conversational_format.md - Data spec
  • training_data/sage/fewshot_examples.json - 59K+ examples
  • training_data/sage/burst_patterns.json - Multi-bubble patterns
  • TRAINING_DATA_INTEGRATION.md - This document

Modified Files

  • app/persona/engine.py - Dynamic example selection
  • app/persona/passports/sage.json - Training data config
  • app/messaging/burst_planner.py - Pattern guidance injection
  • app/orchestrator/multi_bubble_handler.py - Pass scenario hints
  • test_multi_bubble.py - Updated test calls
  • Claude.md - Documentation updates
  • README.md - Architecture updates

Usage Example

from app.persona.training_data_loader import get_training_data_loader

loader = get_training_data_loader()

# Get few-shot examples for emotional support
examples = loader.get_fewshot_examples(
    persona_id="sage",
    scenario="Emotional support and venting",
    tone_tags=["supportive", "empathetic"],
    limit=3
)

# Get burst patterns for playful teasing
patterns = loader.get_burst_patterns(
    persona_id="sage",
    scenario="Playful teasing or joke",
    limit=2
)

# Access conversation format spec
format_doc = loader.get_conversation_format_notes()

Selection Algorithm (2025-11 Update)

  • Few-shot examples are ranked with weighted scores: quality metrics provide the baseline, exact scenario matches add a strong bonus, partial matches use fuzzy similarity, and tone-tag overlap nudges the ranking.
  • Diversity rules cap duplicate scenarios when no explicit scenario is supplied so planners see a mix of contexts.
  • Burst pattern selection applies similar scoring while de-prioritising interrupted bursts and rewarding rich pattern notes/duration.
  • Deterministic unit coverage lives in test_training_data_selection.py; it locks in behaviour for scenario preference, tone alignment, and diversity caps.
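The weighted scoring described above can be sketched as follows. The specific weights, the use of difflib for fuzzy similarity, and the field names are assumptions for illustration; the real values live in the loader and its unit tests.

```python
from difflib import SequenceMatcher
from typing import Optional

def score_example(ex: dict, scenario: Optional[str], tone_tags: set) -> float:
    """Weighted ranking score for a few-shot example (illustrative weights)."""
    q = ex.get("quality_scores", {})
    # Quality metrics provide the baseline
    score = sum(q.get(k, 0.0) for k in ("emotional_richness", "clarity", "usefulness"))
    if scenario:
        if ex.get("scenario") == scenario:
            score += 3.0  # exact scenario match: strong bonus
        else:
            sim = SequenceMatcher(None, scenario.lower(),
                                  ex.get("scenario", "").lower()).ratio()
            score += 1.5 * sim  # partial match: fuzzy similarity
    # Tone-tag overlap nudges the ranking
    score += 0.2 * len(tone_tags & set(ex.get("tone_tags", [])))
    return score
```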

Future Enhancements

Short-Term

  • Add verification tests for loader functionality
  • Expand Echo persona with group-mode training data
  • Add scenario classification to reasoning system

Long-Term

  • Fine-tune custom model on full dataset
  • A/B test static vs. dynamic examples
  • User feedback loop to score example quality
  • Multi-language training datasets

Testing

The loader is tested implicitly through existing integration tests:

  • test_orchestrator.py - Persona engine loads successfully
  • test_multi_bubble.py - Burst planner accesses patterns

To verify loader directly:

from app.persona.training_data_loader import get_training_data_loader

loader = get_training_data_loader()
examples = loader.get_fewshot_examples("sage", limit=5)
print(f"Loaded {len(examples)} examples")


Deployment Notes

Requirements

  • No new dependencies (uses stdlib json, pathlib)
  • Training data files must be present at runtime
  • Loader caches data in memory after first load
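The load-once, serve-from-memory behaviour can be sketched with a cached accessor. This is an assumed shape: the class internals and the use of `lru_cache` for the singleton are illustrative, with only the `get_training_data_loader` name and the stdlib-only requirement taken from this document.

```python
import json
from functools import lru_cache
from pathlib import Path

class TrainingDataLoader:
    """Illustrative loader: reads each JSON file once, then serves from memory."""
    def __init__(self, root: Path):
        self._root = root
        self._cache = {}

    def _load(self, persona_id: str, name: str) -> list:
        key = f"{persona_id}/{name}"
        if key not in self._cache:  # first access hits disk; later queries are cached
            path = self._root / persona_id / f"{name}.json"
            self._cache[key] = (json.loads(path.read_text(encoding="utf-8"))
                                if path.exists() else [])
        return self._cache[key]

@lru_cache(maxsize=1)
def get_training_data_loader() -> TrainingDataLoader:
    """Process-wide singleton, matching the accessor name used in this doc."""
    return TrainingDataLoader(Path("training_data"))
```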

Performance

  • Initial load: ~500ms (59K examples)
  • Subsequent queries: <1ms (cached)
  • Memory footprint: ~50MB (JSON in RAM)

Rollback

If issues arise, revert to static examples:

  1. Remove the training_data section from sage.json
  2. PersonaEngine falls back to passport examples only
  3. BurstPlanner returns "(no pattern data available)"


Next Steps: Merge to master and monitor Sage's tone consistency in production.