Training Data Integration

Status: ✅ Complete
Added: October 31, 2025
Branch: feature/sage-training-integration


Overview

The Sage persona now learns from real conversation examples at runtime. Instead of hardcoding personality traits, the system dynamically selects few-shot examples and burst patterns from curated training datasets based on conversation context.


Architecture

Components

  1. TrainingDataLoader (app/persona/training_data_loader.py)
     • Loads and caches training datasets at startup
     • Provides query methods for examples and patterns
     • Filters by scenario, tone tags, and quality scores

  2. Training Data Assets (training_data/)
     • conversational_format.md - Specification for data formats
     • sage/fewshot_examples.json - 59,000+ curated conversation examples
     • sage/burst_patterns.json - Multi-bubble messaging patterns

  3. Persona Engine Integration (app/persona/engine.py)
     • Dynamically selects 3-5 few-shot examples per prompt
     • Injects examples based on scenario and tone
     • Falls back to passport defaults if no match

  4. Burst Planner Integration (app/messaging/burst_planner.py)
     • Loads burst pattern guidance for multi-bubble responses
     • Shows the LLM real examples of how to split messages naturally

How It Works

1. Few-Shot Example Selection

When building a system prompt, the PersonaEngine:

  1. Checks whether the persona has a training_data.fewshot config in its passport
  2. Queries TrainingDataLoader with:
     • scenario (e.g., "Making plans", "Emotional support")
     • tone_tags (e.g., ["casual", "supportive", "playful"])
     • limit (default: 3)
  3. Loader returns quality-scored examples sorted by relevance
  4. Examples are injected into the system prompt alongside static examples
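The selection step can be sketched roughly as follows. This is an illustrative sketch, not the actual engine code: the `select_dynamic_examples` helper and the fallback behaviour are assumptions based on the passport config shown next and the `get_fewshot_examples` signature from the Usage Example section.

```python
from typing import Optional, Sequence

def select_dynamic_examples(loader, passport: dict,
                            scenario: Optional[str] = None,
                            tone_tags: Optional[Sequence[str]] = None) -> list:
    """Query training data for a persona, falling back to passport defaults (sketch)."""
    cfg = passport.get("training_data", {}).get("fewshot")
    if cfg is None:
        return []  # no training data configured; caller uses static passport examples
    return loader.get_fewshot_examples(
        persona_id=passport["id"],
        scenario=scenario or cfg["default_scenarios"][0],
        tone_tags=list(tone_tags or cfg["default_tone_tags"]),
        limit=cfg.get("limit", 3),
    )
```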

Example Config (sage.json):

{
  "training_data": {
    "fewshot": {
      "source": "training_data/sage/fewshot_examples.json",
      "default_scenarios": [
        "Making plans and coordinating schedules",
        "Casual conversation and catching up"
      ],
      "default_tone_tags": ["casual", "playful", "supportive"],
      "limit": 3
    }
  }
}

2. Burst Pattern Guidance

When planning multi-bubble responses, the BurstPlanner:

  1. Receives a scenario hint from reasoning (e.g., "emotional support")
  2. Queries TrainingDataLoader for 2 matching burst patterns
  3. Formats patterns as guidance text showing:
     • Scenario description
     • Pattern structure (e.g., "setup → main_idea → clarification")
     • Example messages with bubble types
  4. Injects guidance into the LLM prompt for natural pacing

Example Pattern:

{
  "burst_id": "burst_007",
  "scenario": "Playful teasing or joke",
  "messages": [
    {
      "order": 1,
      "text": "Haha okie",
      "bubble_type": "reaction",
      "function": "Express emotional reaction"
    },
    {
      "order": 2,
      "text": "Are you gonna cook? And we're gonna movie night?",
      "bubble_type": "follow_up_question",
      "function": "Ask follow-up to continue conversation"
    }
  ],
  "pattern_notes": "2-bubble burst: reaction → follow_up_question",
  "total_duration_ms": 13000
}
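A pattern like the one above has to be rendered as prompt text before the LLM sees it. The sketch below shows one plausible rendering; the function name mirrors `_format_burst_guidance()` from the Integration Points section, but the output layout is an assumption. The empty-input fallback matches the "(no pattern data available)" string noted under Rollback.

```python
def format_burst_guidance(patterns: list) -> str:
    """Render burst patterns as guidance text for the LLM prompt (illustrative)."""
    if not patterns:
        return "(no pattern data available)"
    lines = []
    for p in patterns:
        lines.append(f"Scenario: {p['scenario']}")
        lines.append(f"Structure: {p['pattern_notes']}")
        # Emit bubbles in their intended order, tagged with their bubble type
        for msg in sorted(p["messages"], key=lambda m: m["order"]):
            lines.append(f"  [{msg['bubble_type']}] {msg['text']}")
        lines.append("")
    return "\n".join(lines).rstrip()
```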


Data Format

Few-Shot Examples

Each example includes:

  • conversation_id - Unique identifier
  • scenario - Context category (e.g., "Making plans")
  • tone_tags - Tone descriptors (e.g., ["casual", "playful"])
  • quality_scores - Emotional richness, clarity, usefulness (0-1)
  • examples - Array of user/assistant exchanges with context

Quality Scoring:

  • Examples are sorted by combined score (emotional_richness + clarity + usefulness)
  • Higher scores indicate more natural, engaging, useful exchanges
  • The loader prioritizes top-scored examples
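The combined-score ordering amounts to a simple sort over the three quality metrics. A minimal sketch, assuming missing metrics default to zero:

```python
def rank_by_quality(examples: list) -> list:
    """Sort examples by combined quality score, highest first (illustrative)."""
    def combined(ex: dict) -> float:
        q = ex.get("quality_scores", {})
        return (q.get("emotional_richness", 0.0)
                + q.get("clarity", 0.0)
                + q.get("usefulness", 0.0))
    return sorted(examples, key=combined, reverse=True)
```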

Burst Patterns

Each pattern includes:

  • burst_id - Unique identifier
  • scenario - When this pattern applies
  • messages - Array of bubbles with timing, type, and function
  • pattern_notes - Human-readable structure summary
  • total_duration_ms - Real timing from the source conversation


Integration Points

PersonaEngine.build_system_prompt()

New Parameters:

  • scenario: Optional[str] - Current conversation scenario
  • tone_tags: Optional[Sequence[str]] - Desired tone characteristics

Behavior:

  1. Loads static examples from the passport
  2. Calls _select_dynamic_examples() to fetch training data
  3. Merges static + dynamic examples (limit 3 total)
  4. Includes context and notes for each example
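The merge step can be sketched as below. The deduplication key and the ordering (dynamic picks ahead of static defaults) are assumptions for illustration; the document only states that the merged set is capped at 3.

```python
def merge_examples(static: list, dynamic: list, limit: int = 3) -> list:
    """Merge passport examples with dynamic picks, capped at `limit` (sketch)."""
    merged, seen = [], set()
    # Assumption: dynamically selected examples take priority over static defaults
    for ex in dynamic + static:
        key = ex.get("conversation_id") or id(ex)
        if key in seen:
            continue  # skip duplicates already contributed by the other source
        seen.add(key)
        merged.append(ex)
        if len(merged) == limit:
            break
    return merged
```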

BurstPlanner.plan_burst()

New Parameters:

  • persona_id: str - Which persona's patterns to use
  • scenario: Optional[str] - Conversation scenario for pattern matching

Behavior:

  1. Extracts the scenario from reasoning or uses the provided hint
  2. Calls _format_burst_guidance() to fetch patterns
  3. Injects pattern examples into the LLM prompt
  4. The LLM sees real examples of natural multi-bubble flow

MultiBubbleHandler.handle()

Changes:

  • Passes persona_id and scenario to the burst planner
  • Scenario is derived from reasoning.get("user_need")


Benefits

1. Data-Driven Personality

  • Sage's tone comes from real conversations, not guesswork
  • Easy to update personality by adding/curating examples
  • No code changes needed to refine tone

2. Scalable Training

  • New personas just need their own training dataset
  • Reuse loader infrastructure across all personas
  • Quality scoring ensures best examples surface first

3. Context-Aware Responses

  • System picks examples matching current conversation type
  • Emotional support scenarios get supportive examples
  • Planning scenarios get practical, coordinating examples

4. Natural Multi-Bubble Flow

  • Burst patterns teach LLM realistic message pacing
  • Examples show when to split ideas vs. send one bubble
  • Timing data preserves human-like delays

Files Changed

New Files

  • app/persona/training_data_loader.py - Core loader logic
  • training_data/conversational_format.md - Data spec
  • training_data/sage/fewshot_examples.json - 59K+ examples
  • training_data/sage/burst_patterns.json - Multi-bubble patterns
  • TRAINING_DATA_INTEGRATION.md - This document

Modified Files

  • app/persona/engine.py - Dynamic example selection
  • app/persona/passports/sage.json - Training data config
  • app/messaging/burst_planner.py - Pattern guidance injection
  • app/orchestrator/multi_bubble_handler.py - Pass scenario hints
  • test_multi_bubble.py - Updated test calls
  • Claude.md - Documentation updates
  • README.md - Architecture updates

Usage Example

from app.persona.training_data_loader import get_training_data_loader

loader = get_training_data_loader()

# Get few-shot examples for emotional support
examples = loader.get_fewshot_examples(
    persona_id="sage",
    scenario="Emotional support and venting",
    tone_tags=["supportive", "empathetic"],
    limit=3
)

# Get burst patterns for playful teasing
patterns = loader.get_burst_patterns(
    persona_id="sage",
    scenario="Playful teasing or joke",
    limit=2
)

# Access conversation format spec
format_doc = loader.get_conversation_format_notes()

Selection Algorithm (2025-11 Update)

  • Few-shot examples are ranked with weighted scores: quality metrics provide the baseline, exact scenario matches add a strong bonus, partial matches use fuzzy similarity, and tone-tag overlap nudges the ranking.
  • Diversity rules cap duplicate scenarios when no explicit scenario is supplied so planners see a mix of contexts.
  • Burst pattern selection applies similar scoring while de-prioritising interrupted bursts and rewarding rich pattern notes/duration.
  • Deterministic unit coverage lives in test_training_data_selection.py; it locks in behaviour for scenario preference, tone alignment, and diversity caps.
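The weighted scoring described above can be sketched as follows. The specific weights, the use of difflib for fuzzy similarity, and the field names are assumptions for illustration; the real values live in the loader and its unit tests.

```python
from difflib import SequenceMatcher
from typing import Optional

def score_example(ex: dict, scenario: Optional[str], tone_tags: set) -> float:
    """Weighted ranking score for a few-shot example (illustrative weights)."""
    q = ex.get("quality_scores", {})
    # Quality metrics provide the baseline
    score = sum(q.get(k, 0.0) for k in ("emotional_richness", "clarity", "usefulness"))
    if scenario:
        if ex.get("scenario") == scenario:
            score += 3.0  # exact scenario match: strong bonus
        else:
            sim = SequenceMatcher(None, scenario.lower(),
                                  ex.get("scenario", "").lower()).ratio()
            score += 1.5 * sim  # partial match: fuzzy similarity
    # Tone-tag overlap nudges the ranking
    score += 0.2 * len(tone_tags & set(ex.get("tone_tags", [])))
    return score
```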

Future Enhancements

Short-Term

  • Add verification tests for loader functionality
  • Expand Echo persona with group-mode training data
  • Add scenario classification to reasoning system

Long-Term

  • Fine-tune custom model on full dataset
  • A/B test static vs. dynamic examples
  • User feedback loop to score example quality
  • Multi-language training datasets

Testing

The loader is tested implicitly through existing integration tests:

  • test_orchestrator.py - Persona engine loads successfully
  • test_multi_bubble.py - Burst planner accesses patterns

To verify loader directly:

from app.persona.training_data_loader import get_training_data_loader

loader = get_training_data_loader()
examples = loader.get_fewshot_examples("sage", limit=5)
print(f"Loaded {len(examples)} examples")


Deployment Notes

Requirements

  • No new dependencies (uses stdlib json, pathlib)
  • Training data files must be present at runtime
  • Loader caches data in memory after first load
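The load-once, serve-from-memory behaviour can be sketched with a cached accessor. This is an assumed shape: the class internals and the use of `lru_cache` for the singleton are illustrative, with only the `get_training_data_loader` name and the stdlib-only requirement taken from this document.

```python
import json
from functools import lru_cache
from pathlib import Path

class TrainingDataLoader:
    """Illustrative loader: reads each JSON file once, then serves from memory."""
    def __init__(self, root: Path):
        self._root = root
        self._cache = {}

    def _load(self, persona_id: str, name: str) -> list:
        key = f"{persona_id}/{name}"
        if key not in self._cache:  # first access hits disk; later queries are cached
            path = self._root / persona_id / f"{name}.json"
            self._cache[key] = (json.loads(path.read_text(encoding="utf-8"))
                                if path.exists() else [])
        return self._cache[key]

@lru_cache(maxsize=1)
def get_training_data_loader() -> TrainingDataLoader:
    """Process-wide singleton, matching the accessor name used in this doc."""
    return TrainingDataLoader(Path("training_data"))
```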

Performance

  • Initial load: ~500ms (59K examples)
  • Subsequent queries: <1ms (cached)
  • Memory footprint: ~50MB (JSON in RAM)

Rollback

If issues arise, revert to static examples:

  1. Remove the training_data section from sage.json
  2. PersonaEngine falls back to passport examples only
  3. BurstPlanner returns "(no pattern data available)"


Next Steps: Merge to master and monitor Sage's tone consistency in production.