# Conversational Grounding Metrics Plan

## Goal
Measure whether claim-level grounding reduces hallucinations while preserving natural persona conversations.
## Decision Framework

Primary decision metric:

- Hallucination rate on high-risk intents.

Guardrails:

- Trust signals (correction complaints, "that's wrong" rate).
- Engagement (turn depth, 7-day return rate).
- Friction (hard-refusal rate, rephrase loops).
- Performance (latency and cost on high-risk turns).
## Scope
Track all persona replies in direct chats and group chats, segmented by:
- persona_id
- intent
- risk_level
- has_missing_evidence
- has_tool_data
- is_group
## Event Contract

### 1) grounding_outcome
Emit once per assistant turn after final bubbles are produced.
Required properties:
- chat_guid
- persona_id
- intent
- risk_level (low|medium|high)
- required_evidence_count
- grounded_evidence_count
- missing_evidence_count
- missing_evidence (array)
- requested_tools (array)
- executed_tools (array)
- unsupported_tools (array)
- failed_tools (array)
- claim_rewrites (int)
- response_bubble_count (int)
- structured_output_used (bool)
- latency_ms_total (int)
- llm_cost_usd (float, best-effort)
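The required properties above can be captured in a single typed payload. The sketch below is a minimal illustration, not the orchestrator's actual code: `track()` is a hypothetical stand-in for the real analytics sink, and the example field values (persona, intent, evidence names) are made up.

```python
from dataclasses import dataclass, field, asdict

# Sketch of the grounding_outcome payload. Field names mirror the event
# contract above; track() is a hypothetical stand-in for the analytics sink.
@dataclass
class GroundingOutcome:
    chat_guid: str
    persona_id: str
    intent: str
    risk_level: str  # "low" | "medium" | "high"
    required_evidence_count: int
    grounded_evidence_count: int
    missing_evidence_count: int
    missing_evidence: list = field(default_factory=list)
    requested_tools: list = field(default_factory=list)
    executed_tools: list = field(default_factory=list)
    unsupported_tools: list = field(default_factory=list)
    failed_tools: list = field(default_factory=list)
    claim_rewrites: int = 0
    response_bubble_count: int = 1
    structured_output_used: bool = False
    latency_ms_total: int = 0
    llm_cost_usd: float = 0.0

def track(event_name: str, payload: dict) -> None:
    # Stand-in: a real implementation would enqueue to the analytics pipeline.
    print(event_name, payload)

# Illustrative values only; real emission happens once per assistant turn.
outcome = GroundingOutcome(
    chat_guid="abc-123", persona_id="persona-1", intent="factual_question",
    risk_level="high", required_evidence_count=3,
    grounded_evidence_count=2, missing_evidence_count=1,
    missing_evidence=["knowledge_base_article"],
)
track("grounding_outcome", asdict(outcome))
```

Using a dataclass keeps the schema in one place, so required properties cannot silently drift between emit sites.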
### 2) user_correction_signal

Emit when a user message matches a correction pattern within 2 user turns of assistant output.

Correction pattern examples:

- "that's wrong"
- "no that's not right"
- "you made that up"
- "not true"
- "you said X but"
Required properties:
- chat_guid
- persona_id
- prior_intent
- prior_risk_level
- prior_missing_evidence_count
- prior_claim_rewrites
- turns_since_assistant_reply
- matched_pattern
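A minimal detector for the correction patterns listed above might look like this. The regexes and the `match_correction` helper are illustrative assumptions; the production matcher would be validated against the labeled set described in the Validation Plan.

```python
import re
from typing import Optional

# Hypothetical correction detector. Each entry pairs a case-insensitive
# regex with the matched_pattern label emitted on user_correction_signal.
CORRECTION_PATTERNS = [
    (r"that'?s wrong", "that's wrong"),
    (r"no,? that'?s not right", "no that's not right"),
    (r"you made that up", "you made that up"),
    (r"\bnot true\b", "not true"),
    (r"you said .+ but", "you said X but"),
]

def match_correction(user_message: str) -> Optional[str]:
    """Return the matched pattern label, or None if no correction is detected."""
    text = user_message.lower()
    for regex, label in CORRECTION_PATTERNS:
        if re.search(regex, text):
            return label
    return None
```

Returning the label (rather than a bare boolean) directly supplies the `matched_pattern` property.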
### 3) refusal_or_uncertainty_style

Emit when the assistant returns uncertainty-fallback or hard-refusal text.
Required properties:
- chat_guid
- persona_id
- intent
- risk_level
- is_hard_refusal (bool)
- is_uncertainty_fallback (bool)
- has_next_step (bool)
- has_followup_question (bool)
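The boolean properties above could be produced by a simple post-send classifier. This sketch uses phrase lists that are purely illustrative placeholders; the real pattern set would be tuned on labeled assistant replies.

```python
# Hypothetical classifier behind refusal_or_uncertainty_style.
# Phrase lists are illustrative, not the production pattern set.
HARD_REFUSAL_PHRASES = ("i can't help with that", "i won't be able to")
UNCERTAINTY_PHRASES = ("i'm not sure", "i don't have enough information")
NEXT_STEP_PHRASES = ("you could", "try", "let's")

def classify_reply(reply: str) -> dict:
    """Return the boolean properties for a refusal_or_uncertainty_style event."""
    text = reply.lower()
    return {
        "is_hard_refusal": any(p in text for p in HARD_REFUSAL_PHRASES),
        "is_uncertainty_fallback": any(p in text for p in UNCERTAINTY_PHRASES),
        "has_next_step": any(p in text for p in NEXT_STEP_PHRASES),
        "has_followup_question": text.rstrip().endswith("?"),
    }
```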
### 4) conversation_quality_snapshot

Emit a session-level snapshot every 10 turns (or at session end).
Required properties:
- chat_guid
- persona_id
- turn_depth
- avg_bubbles_per_reply
- rephrase_loop_count
- session_duration_seconds
- returned_within_7d (nullable until resolved)
## Metric Definitions

### 1) Hallucination Rate (Primary)
Definition:
- hallucination_rate = correction_signals_on_high_risk / high_risk_assistant_turns
Computation notes:
- Numerator: user_correction_signal where prior_risk_level = high.
- Denominator: grounding_outcome where risk_level = high.
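The numerator/denominator rule above can be computed directly from the two event streams. This is a sketch over in-memory event dicts; in production the same join would run as a warehouse query.

```python
def hallucination_rate(correction_events: list, outcome_events: list) -> float:
    """hallucination_rate = correction_signals_on_high_risk / high_risk_assistant_turns.

    correction_events: user_correction_signal payloads.
    outcome_events: grounding_outcome payloads.
    """
    numerator = sum(
        1 for e in correction_events if e.get("prior_risk_level") == "high"
    )
    denominator = sum(
        1 for e in outcome_events if e.get("risk_level") == "high"
    )
    # Guard against divide-by-zero when no high-risk turns occurred.
    return numerator / denominator if denominator else 0.0
```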
### 2) Trust Signals

- correction_complaint_rate = user_correction_signal / assistant_turns
- that_is_wrong_rate = exact_pattern("that's wrong") / assistant_turns
### 3) Engagement

- avg_turn_depth
- p50_turn_depth, p90_turn_depth
- return_rate_7d = sessions_with_return_within_7d / total_sessions
### 4) Friction

- hard_refusal_rate = hard_refusal_turns / assistant_turns
- rephrase_loop_rate = sessions_with_rephrase_loop / total_sessions

Rephrase loop rule:

- 2+ similar user requests within 4 turns after an assistant fallback/refusal.
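The rephrase-loop rule can be sketched as follows. The word-overlap (Jaccard) similarity and the 0.5 threshold are assumptions standing in for whatever similarity measure the pipeline adopts.

```python
def is_rephrase_loop(user_turns_after_fallback: list, threshold: float = 0.5) -> bool:
    """Rule: 2+ similar user requests within the 4 turns following an
    assistant fallback/refusal.

    Similarity is word-overlap Jaccard here, an illustrative assumption.
    """
    window = user_turns_after_fallback[:4]  # only the 4 turns after the fallback

    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    # One similar pair means two similar requests, satisfying the rule.
    for i in range(len(window)):
        for j in range(i + 1, len(window)):
            if jaccard(window[i], window[j]) >= threshold:
                return True
    return False
```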
### 5) Latency and Cost

- p50_latency_ms_high_risk, p95_latency_ms_high_risk
- avg_cost_usd_high_risk
- structured_output_share_high_risk
## Dashboard Layout

Panels:

1. High-risk volume and hallucination rate trend (daily).
2. Correction signals by persona and intent.
3. Hard-refusal vs uncertainty fallback mix.
4. Turn depth and 7-day return by persona.
5. Latency and cost by risk level.

Filters:

- Date range
- Persona
- Intent
- Group vs direct
## Alert Thresholds

Trigger investigation if either condition holds for 24 hours:

- Hallucination rate increases by >20% vs previous 7-day baseline.
- p95 high-risk latency increases by >25%.

Trigger rollback if either condition holds for 2 consecutive days:

- Hallucination rate increases by >35%.
- Hard-refusal rate increases by >30% with turn depth decline >10%.
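The investigation and rollback conditions can be evaluated mechanically against a baseline window. In this sketch the metric dict keys are illustrative; real values would come from the dashboard's saved queries.

```python
def check_alerts(current: dict, baseline: dict) -> dict:
    """Evaluate the investigation/rollback thresholds.

    current/baseline map metric names to values; keys here are
    illustrative placeholders for the saved-query outputs.
    """
    def pct_increase(key: str) -> float:
        base = baseline[key]
        return (current[key] - base) / base * 100 if base else 0.0

    investigate = (
        pct_increase("hallucination_rate") > 20
        or pct_increase("p95_latency_high_risk") > 25
    )
    rollback = (
        pct_increase("hallucination_rate") > 35
        or (pct_increase("hard_refusal_rate") > 30
            and pct_increase("turn_depth") < -10)
    )
    return {"investigate": investigate, "rollback": rollback}
```

The duration conditions (24 hours, 2 consecutive days) would be enforced by the alerting scheduler, not by this per-window check.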
## Implementation Tasks

- Emit grounding_outcome from orchestrator analytics tracker.
- Add correction detector in message ingest path and emit user_correction_signal.
- Add refusal/fallback classifier in post-send path and emit refusal_or_uncertainty_style.
- Add session aggregator for conversation_quality_snapshot.
- Build dashboard and saved queries.
- Define weekly review cadence with Product + Engineering.
## Validation Plan

Before trusting dashboard values:

1. Sample 100 high-risk turns and manually verify event correctness.
2. Validate correction matching precision on a labeled set.
3. Verify latency/cost joins against raw provider logs.
## Ownership
- Orchestrator instrumentation: Backend
- Event taxonomy and dashboard: Data/Analytics
- Weekly readout + thresholds: Product