# Conversational Grounding Metrics Plan

## Goal
Measure whether claim-level grounding reduces hallucinations while preserving natural persona conversations.
## Decision Framework

Primary decision metric:

- Hallucination rate on high-risk intents.

Guardrails:

- Trust signals (correction complaints, "that's wrong" rate).
- Engagement (turn depth, 7-day return rate).
- Friction (hard-refusal rate, rephrase loops).
- Performance (latency and cost on high-risk turns).
## Scope
Track all persona replies in direct chats and group chats, segmented by:
- persona_id
- intent
- risk_level
- has_missing_evidence
- has_tool_data
- is_group
## Event Contract

### 1) grounding_outcome
Emit once per assistant turn after final bubbles are produced.
Required properties:
- chat_guid
- persona_id
- intent
- risk_level (low|medium|high)
- required_evidence_count
- grounded_evidence_count
- missing_evidence_count
- missing_evidence (array)
- requested_tools (array)
- executed_tools (array)
- unsupported_tools (array)
- failed_tools (array)
- claim_rewrites (int)
- response_bubble_count (int)
- structured_output_used (bool)
- latency_ms_total (int)
- llm_cost_usd (float, best-effort)
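The required properties above can be captured in a single typed payload. The sketch below is a minimal illustration, not the orchestrator's actual code: `track()` is a hypothetical stand-in for the real analytics sink, and the example field values (persona, intent, evidence names) are made up.

```python
from dataclasses import dataclass, field, asdict

# Sketch of the grounding_outcome payload. Field names mirror the event
# contract above; track() is a hypothetical stand-in for the analytics sink.
@dataclass
class GroundingOutcome:
    chat_guid: str
    persona_id: str
    intent: str
    risk_level: str  # "low" | "medium" | "high"
    required_evidence_count: int
    grounded_evidence_count: int
    missing_evidence_count: int
    missing_evidence: list = field(default_factory=list)
    requested_tools: list = field(default_factory=list)
    executed_tools: list = field(default_factory=list)
    unsupported_tools: list = field(default_factory=list)
    failed_tools: list = field(default_factory=list)
    claim_rewrites: int = 0
    response_bubble_count: int = 1
    structured_output_used: bool = False
    latency_ms_total: int = 0
    llm_cost_usd: float = 0.0

def track(event_name: str, payload: dict) -> None:
    # Stand-in: a real implementation would enqueue to the analytics pipeline.
    print(event_name, payload)

# Illustrative values only; real emission happens once per assistant turn.
outcome = GroundingOutcome(
    chat_guid="abc-123", persona_id="persona-1", intent="factual_question",
    risk_level="high", required_evidence_count=3,
    grounded_evidence_count=2, missing_evidence_count=1,
    missing_evidence=["knowledge_base_article"],
)
track("grounding_outcome", asdict(outcome))
```

Using a dataclass keeps the schema in one place, so required properties cannot silently drift between emit sites.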
### 2) user_correction_signal

Emit when a user message matches a correction pattern within 2 user turns of assistant output.

Correction pattern examples:

- "that's wrong"
- "no that's not right"
- "you made that up"
- "not true"
- "you said X but"
Required properties:
- chat_guid
- persona_id
- prior_intent
- prior_risk_level
- prior_missing_evidence_count
- prior_claim_rewrites
- turns_since_assistant_reply
- matched_pattern
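A minimal detector for the correction patterns listed above might look like this. The regexes and the `match_correction` helper are illustrative assumptions; the production matcher would be validated against the labeled set described in the Validation Plan.

```python
import re
from typing import Optional

# Hypothetical correction detector. Each entry pairs a case-insensitive
# regex with the matched_pattern label emitted on user_correction_signal.
CORRECTION_PATTERNS = [
    (r"that'?s wrong", "that's wrong"),
    (r"no,? that'?s not right", "no that's not right"),
    (r"you made that up", "you made that up"),
    (r"\bnot true\b", "not true"),
    (r"you said .+ but", "you said X but"),
]

def match_correction(user_message: str) -> Optional[str]:
    """Return the matched pattern label, or None if no correction is detected."""
    text = user_message.lower()
    for regex, label in CORRECTION_PATTERNS:
        if re.search(regex, text):
            return label
    return None
```

Returning the label (rather than a bare boolean) directly supplies the `matched_pattern` property.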
### 3) refusal_or_uncertainty_style

Emit when the assistant returns uncertainty-fallback or hard-refusal text.
Required properties:
- chat_guid
- persona_id
- intent
- risk_level
- is_hard_refusal (bool)
- is_uncertainty_fallback (bool)
- has_next_step (bool)
- has_followup_question (bool)
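The boolean properties above could be produced by a simple post-send classifier. This sketch uses phrase lists that are purely illustrative placeholders; the real pattern set would be tuned on labeled assistant replies.

```python
# Hypothetical classifier behind refusal_or_uncertainty_style.
# Phrase lists are illustrative, not the production pattern set.
HARD_REFUSAL_PHRASES = ("i can't help with that", "i won't be able to")
UNCERTAINTY_PHRASES = ("i'm not sure", "i don't have enough information")
NEXT_STEP_PHRASES = ("you could", "try", "let's")

def classify_reply(reply: str) -> dict:
    """Return the boolean properties for a refusal_or_uncertainty_style event."""
    text = reply.lower()
    return {
        "is_hard_refusal": any(p in text for p in HARD_REFUSAL_PHRASES),
        "is_uncertainty_fallback": any(p in text for p in UNCERTAINTY_PHRASES),
        "has_next_step": any(p in text for p in NEXT_STEP_PHRASES),
        "has_followup_question": text.rstrip().endswith("?"),
    }
```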
### 4) conversation_quality_snapshot

Emit a session-level snapshot every 10 turns (or at session end).
Required properties:
- chat_guid
- persona_id
- turn_depth
- avg_bubbles_per_reply
- rephrase_loop_count
- session_duration_seconds
- returned_within_7d (nullable until resolved)
## Metric Definitions

### 1) Hallucination Rate (Primary)
Definition:
- hallucination_rate = correction_signals_on_high_risk / high_risk_assistant_turns
Computation notes:
- Numerator: user_correction_signal where prior_risk_level = high.
- Denominator: grounding_outcome where risk_level = high.
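The numerator/denominator rule above can be computed directly from the two event streams. This is a sketch over in-memory event dicts; in production the same join would run as a warehouse query.

```python
def hallucination_rate(correction_events: list, outcome_events: list) -> float:
    """hallucination_rate = correction_signals_on_high_risk / high_risk_assistant_turns.

    correction_events: user_correction_signal payloads.
    outcome_events: grounding_outcome payloads.
    """
    numerator = sum(
        1 for e in correction_events if e.get("prior_risk_level") == "high"
    )
    denominator = sum(
        1 for e in outcome_events if e.get("risk_level") == "high"
    )
    # Guard against divide-by-zero when no high-risk turns occurred.
    return numerator / denominator if denominator else 0.0
```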
### 2) Trust Signals

- correction_complaint_rate = user_correction_signal / assistant_turns
- that_is_wrong_rate = exact_pattern("that's wrong") / assistant_turns
### 3) Engagement

- avg_turn_depth
- p50_turn_depth, p90_turn_depth
- return_rate_7d = sessions_with_return_within_7d / total_sessions
### 4) Friction

- hard_refusal_rate = hard_refusal_turns / assistant_turns
- rephrase_loop_rate = sessions_with_rephrase_loop / total_sessions

Rephrase loop rule:

- 2+ similar user requests within 4 turns after an assistant fallback/refusal.
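The rephrase-loop rule can be sketched as follows. The word-overlap (Jaccard) similarity and the 0.5 threshold are assumptions standing in for whatever similarity measure the pipeline adopts.

```python
def is_rephrase_loop(user_turns_after_fallback: list, threshold: float = 0.5) -> bool:
    """Rule: 2+ similar user requests within the 4 turns following an
    assistant fallback/refusal.

    Similarity is word-overlap Jaccard here, an illustrative assumption.
    """
    window = user_turns_after_fallback[:4]  # only the 4 turns after the fallback

    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    # One similar pair means two similar requests, satisfying the rule.
    for i in range(len(window)):
        for j in range(i + 1, len(window)):
            if jaccard(window[i], window[j]) >= threshold:
                return True
    return False
```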
### 5) Latency and Cost

- p50_latency_ms_high_risk, p95_latency_ms_high_risk
- avg_cost_usd_high_risk
- structured_output_share_high_risk
## Dashboard Layout

Panels:

1. High-risk volume and hallucination rate trend (daily).
2. Correction signals by persona and intent.
3. Hard-refusal vs uncertainty fallback mix.
4. Turn depth and 7-day return by persona.
5. Latency and cost by risk level.

Filters:

- Date range
- Persona
- Intent
- Group vs direct
## Alert Thresholds

Trigger investigation if either condition holds for 24 hours:

- Hallucination rate increases by >20% vs previous 7-day baseline.
- p95 high-risk latency increases by >25%.

Trigger rollback if either condition holds for 2 consecutive days:

- Hallucination rate increases by >35%.
- Hard-refusal rate increases by >30% with turn depth decline >10%.
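The investigation and rollback conditions can be evaluated mechanically against a baseline window. In this sketch the metric dict keys are illustrative; real values would come from the dashboard's saved queries.

```python
def check_alerts(current: dict, baseline: dict) -> dict:
    """Evaluate the investigation/rollback thresholds.

    current/baseline map metric names to values; keys here are
    illustrative placeholders for the saved-query outputs.
    """
    def pct_increase(key: str) -> float:
        base = baseline[key]
        return (current[key] - base) / base * 100 if base else 0.0

    investigate = (
        pct_increase("hallucination_rate") > 20
        or pct_increase("p95_latency_high_risk") > 25
    )
    rollback = (
        pct_increase("hallucination_rate") > 35
        or (pct_increase("hard_refusal_rate") > 30
            and pct_increase("turn_depth") < -10)
    )
    return {"investigate": investigate, "rollback": rollback}
```

The duration conditions (24 hours, 2 consecutive days) would be enforced by the alerting scheduler, not by this per-window check.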
## Implementation Tasks

- Emit grounding_outcome from orchestrator analytics tracker.
- Add correction detector in message ingest path and emit user_correction_signal.
- Add refusal/fallback classifier in post-send path and emit refusal_or_uncertainty_style.
- Add session aggregator for conversation_quality_snapshot.
- Build dashboard and saved queries.
- Define weekly review cadence with Product + Engineering.
## Validation Plan

Before trusting dashboard values:

1. Sample 100 high-risk turns and manually verify event correctness.
2. Validate correction matching precision on a labeled set.
3. Verify latency/cost joins against raw provider logs.
## Ownership
- Orchestrator instrumentation: Backend
- Event taxonomy and dashboard: Data/Analytics
- Weekly readout + thresholds: Product