ADR 001: Event Sourcing vs. Traditional State Management for Multiplayer Mini-Apps

Date: 2025-11-10
Status: Proposed
Deciders: Engineering Team
Related: System Overview, Refactoring Plan

Context

We need to support multiplayer mini-apps where multiple users can simultaneously interact with shared state (trip planning, fitness challenges, shopping). The system must handle:

Concurrent edits - Multiple users modifying the same data
Offline support - Users can queue actions while disconnected
Audit trails - For debugging and potential dispute resolution
State reconstruction - Ability to replay history
Real-time sync - Other users see changes within 100ms

Current State

The existing system uses traditional state management:

- PostgreSQL stores current state only
- Redis caches for fast reads
- Single-user workflows update state directly

This works well for solo workflows but cannot handle concurrent multi-user edits.

Decision

We will use Hybrid Event Sourcing + State Snapshots for multiplayer mini-apps.

Architecture:

1. Event Store (DynamoDB) - Append-only log of all state changes
2. State Snapshots (Redis + PostgreSQL) - Computed current state, rebuilt from events
3. CRDT for Conflicts - Last-write-wins ordered by Lamport clock, with the server timestamp as tie-breaker
4. Dual-Mode System - Solo workflows use traditional state, multiplayer uses events

Alternatives Considered

Option 1: Pure CRDT (Conflict-Free Replicated Data Types)

Pros:

- Automatic conflict resolution
- Works offline-first naturally
- Provably convergent

Cons:

- High memory overhead (stores full operation history)
- Complex to implement correctly
- Harder to debug
- Not all data structures have CRDT equivalents

Verdict: ❌ Too complex for MVP, over-engineered

Option 2: Operational Transform (like Google Docs)

Pros:

- Real-time collaborative editing
- Industry-proven (Google Docs)

Cons:

- Extremely complex to implement
- Requires low-latency server (not ideal for iMessage polling)
- Overkill for our use cases (we're not building a text editor)

Verdict: ❌ Too complex, doesn't fit iMessage async model

Option 3: Pessimistic Locking (Database Locks)

Pros:

- Simple to implement
- No conflicts by definition

Cons:

- Terrible UX - users wait for locks
- Doesn't work well with iMessage's async nature
- Single point of failure

Verdict: ❌ Bad UX, not scalable

Option 4: Traditional State with Optimistic Locking (Version Numbers)

Pros:

- Simpler than event sourcing
- Works with existing PostgreSQL

Cons:

- Frequent conflicts in high-activity rooms
- Writes rejected on a version mismatch must be retried by the client, or the update is lost
- No audit trail
- Can't replay history

Verdict: ❌ Conflicts too frequent, no history

Option 5: Hybrid Event Sourcing (CHOSEN)

Pros:

- ✅ Complete audit trail
- ✅ Can replay events to reconstruct state
- ✅ Offline-friendly (queue events, sync later)
- ✅ Flexible conflict resolution strategies
- ✅ Easy debugging (see exact event sequence)
- ✅ Can add computed views later
- ✅ Solo workflows unchanged (opt-in for multiplayer)

Cons:

- ⚠️ More storage (events + snapshots)
- ⚠️ Slightly higher latency (write event then update snapshot)
- ⚠️ Need to design event schemas carefully

Mitigation:

- Use DynamoDB with TTL (auto-delete old events after 6 months)
- Cache snapshots aggressively in Redis
- Event schemas versioned and validated

Verdict: ✅ Best balance of complexity vs. capability

Implementation Details

Event Structure

from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict


@dataclass
class MiniAppEvent:
    # Identity
    id: str  # UUIDv7 (time-sortable)
    app_id: str  # e.g., "trip_planner"
    room_id: str  # e.g., "tokyo_trip_2026"

    # Who & What
    user_id: str  # Who triggered this event
    action: str  # e.g., "vote_cast", "activity_added"
    payload: Dict[str, Any]  # Action-specific data

    # Ordering & Timing
    client_timestamp: datetime  # When user initiated (for display)
    server_timestamp: datetime  # Authoritative time
    lamport_clock: int  # For causal ordering

    # Security
    signature: str  # HMAC to prevent tampering
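
The signature field could be produced along the following lines. This is a minimal sketch using Python's standard hmac module, not the team's actual signing scheme: the choice of HMAC-SHA256, the canonical JSON encoding, the set of signed fields, and the helper names canonical_bytes, sign_event, and verify_event are all assumptions made for illustration.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def canonical_bytes(fields: Dict[str, Any]) -> bytes:
    """Serialize the signed fields deterministically (sorted keys, no spaces)."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_event(fields: Dict[str, Any], secret: bytes) -> str:
    """Return a hex HMAC-SHA256 over the canonical form of the event fields."""
    return hmac.new(secret, canonical_bytes(fields), hashlib.sha256).hexdigest()


def verify_event(fields: Dict[str, Any], signature: str, secret: bytes) -> bool:
    """Recompute the HMAC and compare in constant time."""
    return hmac.compare_digest(sign_event(fields, secret), signature)


# Hypothetical event fields; timestamps are omitted to keep the example short.
fields = {
    "id": "evt-1",
    "app_id": "trip_planner",
    "room_id": "tokyo_trip_2026",
    "user_id": "alice",
    "action": "vote_cast",
    "payload": {"activity_id": "a1"},
    "lamport_clock": 7,
}
secret = b"demo-secret"  # illustrative only; a real key would come from a secret store
signature = sign_event(fields, secret)
tampered = {**fields, "user_id": "mallory"}
```

Verifying fields against signature succeeds, while verifying tampered against the same signature fails, which is the property the Security field relies on.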

Storage Schema (DynamoDB)

Table: miniapp_events

Partition Key: app_id#room_id  (e.g., "trip_planner#tokyo_trip_2026")
Sort Key: server_timestamp#event_id  (e.g., "2026-01-15T10:30:00.123Z#01JCVW...")

Attributes:
- user_id
- action
- payload (JSON)
- client_timestamp
- lamport_clock
- signature

TTL: expires_at (auto-delete after 6 months)

GSI (Global Secondary Index):
- GSI1: user_id (for "show me all my events")
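
To make the key layout concrete, the sketch below builds one item as a plain dictionary. The attribute names pk and sk, the helper build_event_item, and the 180-day approximation of "6 months" are assumptions for illustration. The one firm constraint is that DynamoDB's TTL attribute must be a number holding Unix epoch seconds.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

EVENT_TTL = timedelta(days=180)  # assumption: "6 months" approximated as 180 days


def build_event_item(
    app_id: str,
    room_id: str,
    event_id: str,
    user_id: str,
    action: str,
    payload: Dict[str, Any],
    server_timestamp: datetime,
) -> Dict[str, Any]:
    """Build the item for miniapp_events; "pk" and "sk" are hypothetical attribute names."""
    utc = server_timestamp.astimezone(timezone.utc)
    # Fixed-width ISO 8601 in UTC, so string order matches chronological order.
    ts = utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return {
        "pk": f"{app_id}#{room_id}",
        "sk": f"{ts}#{event_id}",
        "user_id": user_id,
        "action": action,
        "payload": json.dumps(payload),
        # DynamoDB TTL expects a Number attribute in Unix epoch seconds.
        "expires_at": int((utc + EVENT_TTL).timestamp()),
    }


ts = datetime(2026, 1, 15, 10, 30, 0, 123000, tzinfo=timezone.utc)
item = build_event_item(
    "trip_planner", "tokyo_trip_2026", "01JCVW", "alice",
    "vote_cast", {"activity_id": "a1"}, ts,
)
```

Because every sort key in a room shares the same timestamp format, a range query on the sort key returns events in server-time order.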

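Why DynamoDB:

1. Serverless - no capacity planning in on-demand mode
2. Built-in TTL - auto-cleanup (expired items are removed in the background, not at the exact expiry time)
3. Fast writes - optimized for append-only
4. Auto-scaling
5. Strong consistency option - strongly consistent reads on the base table (GSI reads are eventually consistent)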

Cost: ~$5/month per 1000 active rooms

State Reconstruction

On-Demand Replay:

import json
from typing import Any, Dict

# redis, event_store, is_stale, get_initial_state and apply_event are assumed to be defined elsewhere.
async def get_current_state(app_id: str, room_id: str) -> Dict[str, Any]:
    cache_key = f"room_state:{app_id}:{room_id}"

    # 1. Check cache
    cached = await redis.get(cache_key)
    if cached and not is_stale(cached):
        return json.loads(cached)

    # 2. Replay events
    events = await event_store.get_events(app_id, room_id)

    state = get_initial_state(app_id)  # App-specific default
    for event in events:
        state = apply_event(state, event)

    # 3. Cache result for 5 minutes
    await redis.setex(cache_key, 300, json.dumps(state))

    return state
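
The replay loop depends on apply_event being a deterministic function of the previous state and the event. A minimal sketch of such a reducer for the trip planner is shown here. Events are plain dictionaries rather than MiniAppEvent instances to keep the example self-contained, and the state shape and the payload keys activity_id and name are assumptions.

```python
import copy
from typing import Any, Dict, List

State = Dict[str, Any]
Event = Dict[str, Any]


def get_initial_state(app_id: str) -> State:
    """App-specific default; this sketch only knows the trip planner shape."""
    return {"activities": {}, "votes": {}}


def apply_event(state: State, event: Event) -> State:
    """Return a new state; never mutate the input so replays stay repeatable."""
    new_state = copy.deepcopy(state)
    payload = event["payload"]
    if event["action"] == "activity_added":
        new_state["activities"][payload["activity_id"]] = payload["name"]
    elif event["action"] == "vote_cast":
        voters = new_state["votes"].setdefault(payload["activity_id"], [])
        if event["user_id"] not in voters:  # a repeated vote is a no-op
            voters.append(event["user_id"])
    # Unknown actions are ignored so older code can skip newer event types.
    return new_state


events: List[Event] = [
    {"user_id": "alice", "action": "activity_added",
     "payload": {"activity_id": "a1", "name": "Sushi class"}},
    {"user_id": "bob", "action": "vote_cast", "payload": {"activity_id": "a1"}},
    {"user_id": "bob", "action": "vote_cast", "payload": {"activity_id": "a1"}},
]

initial = get_initial_state("trip_planner")
state = initial
for event in events:
    state = apply_event(state, event)
```

Making duplicate votes a no-op means that replaying an event twice, for example after a retried write, leaves the state unchanged.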

Optimization: Periodic Snapshots

Every 100 events, save a snapshot:

async def save_snapshot(app_id: str, room_id: str, state: Dict, last_event_id: str):
    await db.execute(
        "INSERT INTO room_state_snapshots (app_id, room_id, state, last_event_id, created_at) VALUES ($1,$2,$3,$4,NOW())",
        app_id, room_id, json.dumps(state), last_event_id
    )

Then replay is:

# Get latest snapshot (None if the room has never been snapshotted)
snapshot = await db.query(
    "SELECT * FROM room_state_snapshots WHERE app_id=$1 AND room_id=$2 ORDER BY created_at DESC LIMIT 1",
    app_id, room_id
)

if snapshot:
    state, since = json.loads(snapshot.state), snapshot.last_event_id
else:
    state, since = get_initial_state(app_id), None  # No snapshot yet: full replay

# Replay only events AFTER snapshot
events = await event_store.get_events(app_id, room_id, since=since)
for event in events:
    state = apply_event(state, event)
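
The property that makes snapshots safe is that replaying the tail of the log on top of a snapshot yields the same state as replaying the whole log. The sketch below checks this in memory with a toy counter reducer. The reducer, the should_snapshot helper, and the event shape are illustrative assumptions; only the interval of 100 comes from the text.

```python
from functools import reduce
from typing import Dict, List

SNAPSHOT_INTERVAL = 100  # from the text: snapshot every 100 events

State = Dict[str, int]
Event = Dict[str, int]


def apply_event(state: State, event: Event) -> State:
    """Toy reducer: accumulate deltas and count how many events were applied."""
    return {"total": state["total"] + event["delta"], "applied": state["applied"] + 1}


def replay(state: State, events: List[Event]) -> State:
    """Fold the events over a starting state, in order."""
    return reduce(apply_event, events, state)


def should_snapshot(applied_count: int) -> bool:
    """True on every SNAPSHOT_INTERVAL-th event."""
    return applied_count > 0 and applied_count % SNAPSHOT_INTERVAL == 0


events = [{"delta": i} for i in range(1, 251)]
initial = {"total": 0, "applied": 0}

full = replay(initial, events)

# Process the log once, remembering the latest snapshot and how many events it covers.
snapshot, covered = initial, 0
state = initial
for index, event in enumerate(events, start=1):
    state = apply_event(state, event)
    if should_snapshot(index):
        snapshot, covered = state, index

# Rebuild from the snapshot plus only the events that came after it.
from_snapshot = replay(snapshot, events[covered:])
```

With 250 events the latest snapshot covers the first 200, so reconstruction replays 50 events instead of 250.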

Conflict Resolution

Default: Last-Write-Wins (LWW) with Lamport Clock

def resolve_conflict(event1: MiniAppEvent, event2: MiniAppEvent) -> MiniAppEvent:
    # Order by Lamport clock first. Equal clocks mean neither event can have
    # seen the other, so fall back to the authoritative server timestamp, then
    # to the event id so the result never depends on argument order.
    def sort_key(event: MiniAppEvent):
        return (event.lamport_clock, event.server_timestamp, event.id)

    return max(event1, event2, key=sort_key)
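
For reference, the lamport_clock values compared above follow the standard Lamport rules: increment on every local event, and on receiving a remote event jump to one past the larger of the two clocks. The class below is a minimal sketch of those rules; the class and method names are illustrative, not part of the existing codebase.

```python
class LamportClock:
    """Logical clock: if event a causally precedes b, then clock(a) < clock(b)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Call when this client creates a new local event."""
        self.time += 1
        return self.time

    def observe(self, remote_time: int) -> int:
        """Call when an event from another client is received."""
        self.time = max(self.time, remote_time) + 1
        return self.time


alice, bob = LamportClock(), LamportClock()
a1 = alice.tick()      # Alice edits while Bob has not seen anything yet
b1 = bob.tick()        # Bob edits concurrently and gets the same clock value
b2 = bob.observe(a1)   # Bob receives Alice's event
a2 = alice.observe(b2) # Alice receives Bob's later event
```

The converse does not hold: a higher clock value does not prove that one event saw the other, which is why equal or unrelated clocks still need a deterministic tie-breaker.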

Custom Resolution (Per Mini-App):

Some mini-apps need domain-specific logic:

- Trip Planner: Concurrent activity additions = merge both (no conflict)
- Fitness Challenge: Concurrent stat updates = keep highest (assumes monotonic increase)
- Shopping: Concurrent watchlist adds = merge (set union)

# Mini-app can override
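class TripPlannerApp(MiniApp):  # MiniApp: assumed base class that provides the default LWW resolve_conflict
    def resolve_conflict(self, event1, event2):
        if event1.action == "activity_added" and event2.action == "activity_added":
            # Both activities valid, merge them
            return [event1, event2]  # Apply both

        # Fall back to LWW (returns a single event, so callers must accept either shape)
        return super().resolve_conflict(event1, event2)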
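
The Fitness Challenge and Shopping rules listed above can be sketched as plain merge functions. The function names and data shapes are assumptions for illustration. Both merges are commutative, so the order in which concurrent updates arrive does not change the result.

```python
from typing import Dict, Set


def merge_stats_keep_highest(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Fitness Challenge: per-stat maximum. Assumes stats are non-negative and only increase."""
    return {key: max(a.get(key, 0), b.get(key, 0)) for key in a.keys() | b.keys()}


def merge_watchlists(a: Set[str], b: Set[str]) -> Set[str]:
    """Shopping: concurrent watchlist adds merge by set union."""
    return a | b


alice_stats = {"steps": 8000, "km": 5}
bob_stats = {"steps": 9500}
merged_stats = merge_stats_keep_highest(alice_stats, bob_stats)
merged_watchlist = merge_watchlists({"tent"}, {"stove", "tent"})
```

Set union only covers additions. Removing an item from a watchlist concurrently with an add would need an explicit rule that this sketch does not attempt.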

Trade-offs

Pros of Event Sourcing

Audit Trail: Every change is tracked until its event expires (6-month TTL)
Debugging: Can replay exact sequence that led to bug
Temporal Queries: "What did the trip plan look like last Tuesday?"
Compliance: Potential future need for user action logs
Undo/Redo: Could add "undo last change" feature
Offline Sync: Queue events, sync when reconnected
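
Offline sync could work roughly as sketched here: the client queues events while disconnected and flushes them in order on reconnect, and the store ignores an event id it has already seen so retries are safe. OfflineQueue and InMemoryEventStore are hypothetical stand-ins, not existing components.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict

Event = Dict[str, Any]


class OfflineQueue:
    """Client-side FIFO of events created while disconnected."""

    def __init__(self) -> None:
        self._pending: Deque[Event] = deque()

    def enqueue(self, event: Event) -> None:
        self._pending.append(event)

    def flush(self, send: Callable[[Event], bool]) -> int:
        """Send in order; stop at the first failure and keep the rest queued."""
        sent = 0
        while self._pending and send(self._pending[0]):
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)


class InMemoryEventStore:
    """Stand-in for the real event store; appends are idempotent by event id."""

    def __init__(self) -> None:
        self.events: Dict[str, Event] = {}

    def append(self, event: Event) -> bool:
        self.events.setdefault(event["id"], event)  # duplicate ids are ignored
        return True


store = InMemoryEventStore()
queue = OfflineQueue()
queue.enqueue({"id": "e1", "action": "vote_cast"})
queue.enqueue({"id": "e2", "action": "activity_added"})

sent = queue.flush(store.append)
store.append({"id": "e1", "action": "vote_cast"})  # retry after a lost acknowledgement
```

An event is removed from the queue only after a successful send, so a dropped connection mid-flush cannot lose it, and the idempotent append means the resulting retry cannot duplicate it.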

Cons of Event Sourcing

Storage Cost: Events + snapshots vs. just current state
  Mitigation: DynamoDB TTL, aggressive caching
Eventual Consistency: Brief window where users see different states
  Mitigation: Optimistic UI updates, <100ms sync target
Schema Evolution: Old events need to work with new code
  Mitigation: Event versioning, backward-compatible changes only
Learning Curve: Team needs to understand event-driven architecture
  Mitigation: Good documentation, start with simple apps
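
One common way to realize the event versioning mitigation is to upcast old events to the current schema when they are read, so that apply_event only ever sees one shape. The sketch below assumes a schema_version field, which is not part of MiniAppEvent as defined above, and a made-up v1 to v2 change that renames the payload key name to title.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]

CURRENT_VERSION = 2


def _v1_to_v2(event: Event) -> Event:
    """Hypothetical migration: v1 payloads used "name", v2 payloads use "title"."""
    payload = dict(event["payload"])  # copy so the stored event is not mutated
    payload["title"] = payload.pop("name")
    return {**event, "schema_version": 2, "payload": payload}


# One upcaster per source version; each returns an event at the next version.
UPCASTERS: Dict[int, Callable[[Event], Event]] = {1: _v1_to_v2}


def upcast(event: Event) -> Event:
    """Apply upcasters in sequence until the event reaches CURRENT_VERSION."""
    version = event.get("schema_version", 1)  # events without a version are treated as v1
    while version < CURRENT_VERSION:
        event = UPCASTERS[version](event)
        version = event["schema_version"]
    return event


old_event = {"id": "e1", "action": "activity_added", "payload": {"name": "Sushi class"}}
new_event = upcast(old_event)
```

Stored events stay immutable under this approach. Only the in-memory copy handed to the reducer is migrated.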

Consequences

Positive

Enables Multiplayer: Can finally build Trip Planner, Fitness Challenge, etc.
Better Debugging: See exact event history when users report issues
Flexible: Can change conflict resolution strategy without breaking existing rooms
Scalable: DynamoDB auto-scales, no manual capacity planning

Negative

Complexity: More moving parts than traditional state
Cost: DynamoDB + Redis caching adds ~$50/month initially
Latency: Extra hop to event store (but cached reads mitigate)

Neutral

Dual Systems: Solo workflows use existing state, multiplayer uses events
Could eventually migrate all to events, but not required

Validation

Success Criteria

Correctness: 99.9% of concurrent edits resolve correctly
Performance: p95 state read latency <100ms
Cost: <$200/month for 1000 active rooms
Developer Experience: Engineers can add new mini-apps without deep event sourcing knowledge

Metrics to Monitor

Event throughput (writes/sec)
State reconstruction time
Conflict resolution frequency
Cache hit rate
Storage growth

References

Martin Fowler: Event Sourcing
AWS DynamoDB Best Practices
CRDT Papers (for conflict resolution strategies)

Revision History

2025-11-10: Initial draft