
ADR 001: Event Sourcing vs. Traditional State Management for Multiplayer Mini-Apps

Date: 2025-11-10
Status: Proposed
Deciders: Engineering Team
Related: System Overview, Refactoring Plan


Context

We need to support multiplayer mini-apps where multiple users can simultaneously interact with shared state (trip planning, fitness challenges, shopping). The system must handle:

  1. Concurrent edits - Multiple users modifying the same data
  2. Offline support - Users can queue actions while disconnected
  3. Audit trails - For debugging and potential dispute resolution
  4. State reconstruction - Ability to replay history
  5. Real-time sync - Other users see changes within 100ms

Current State

The existing system uses traditional state management:

  • PostgreSQL stores current state only
  • Redis caches for fast reads
  • Single-user workflows update state directly

This works well for solo workflows but cannot handle concurrent multi-user edits.


Decision

We will use Hybrid Event Sourcing + State Snapshots for multiplayer mini-apps.

Architecture:

  1. Event Store (DynamoDB) - Append-only log of all state changes
  2. State Snapshots (Redis + PostgreSQL) - Computed current state, rebuilt from events
  3. CRDT-style conflict resolution - Last-write-wins ordered by Lamport clock, with server timestamp as tie-breaker
  4. Dual-Mode System - Solo workflows use traditional state, multiplayer uses events


Alternatives Considered

Option 1: Pure CRDT (Conflict-Free Replicated Data Types)

Pros:

  • Automatic conflict resolution
  • Works offline-first naturally
  • Provably convergent

Cons:

  • High memory overhead (stores full operation history)
  • Complex to implement correctly
  • Harder to debug
  • Not all data structures have CRDT equivalents

Verdict: ❌ Too complex for MVP, over-engineered

Option 2: Operational Transform (like Google Docs)

Pros:

  • Real-time collaborative editing
  • Industry-proven (Google Docs)

Cons:

  • Extremely complex to implement
  • Requires low-latency server (not ideal for iMessage polling)
  • Overkill for our use cases (we're not building a text editor)

Verdict: ❌ Too complex, doesn't fit iMessage async model

Option 3: Pessimistic Locking (Database Locks)

Pros:

  • Simple to implement
  • No conflicts by definition

Cons:

  • Terrible UX: users wait for locks
  • Doesn't work well with iMessage's async nature
  • Single point of failure

Verdict: ❌ Bad UX, not scalable

Option 4: Traditional State with Optimistic Locking (Version Numbers)

Pros:

  • Simpler than event sourcing
  • Works with existing PostgreSQL

Cons:

  • Frequent conflicts in high-activity rooms
  • Lost updates if version mismatch
  • No audit trail
  • Can't replay history

Verdict: ❌ Conflicts too frequent, no history
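
To make the rejected trade-off concrete, here is a minimal in-memory sketch of the version-number check behind Option 4. The `VersionedStore` class and `VersionConflict` exception are hypothetical names used only for illustration; a real implementation would do the comparison inside a PostgreSQL `UPDATE ... WHERE version = $n`.

```python
from typing import Any, Dict, Tuple


class VersionConflict(Exception):
    """Raised when a writer's expected version is no longer current."""


class VersionedStore:
    """Toy key/value store that rejects writes based on a stale version."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[int, Any]] = {}  # key -> (version, value)

    def read(self, key: str) -> Tuple[int, Any]:
        # Unknown keys start at version 0 with no value
        return self._rows.get(key, (0, None))

    def write(self, key: str, value: Any, expected_version: int) -> int:
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"expected v{expected_version}, found v{current_version}"
            )
        self._rows[key] = (current_version + 1, value)
        return current_version + 1
```

If two users read the same version and both write, the second write fails and has to be retried or discarded. In a busy room this happens often, which is the "frequent conflicts" drawback listed above.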

Option 5: Hybrid Event Sourcing (CHOSEN)

Pros:

  • ✅ Complete audit trail
  • ✅ Can replay events to reconstruct state
  • ✅ Offline-friendly (queue events, sync later)
  • ✅ Flexible conflict resolution strategies
  • ✅ Easy debugging (see exact event sequence)
  • ✅ Can add computed views later
  • ✅ Solo workflows unchanged (opt-in for multiplayer)

Cons:

  • ⚠️ More storage (events + snapshots)
  • ⚠️ Slightly higher latency (write event then update snapshot)
  • ⚠️ Need to design event schemas carefully

Mitigation:

  • Use DynamoDB with TTL (auto-delete old events after 6 months)
  • Cache snapshots aggressively in Redis
  • Event schemas versioned and validated

Verdict: ✅ Best balance of complexity vs. capability


Implementation Details

Event Structure

from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict

@dataclass
class MiniAppEvent:
    # Identity
    id: str  # UUIDv7 (time-sortable)
    app_id: str  # e.g., "trip_planner"
    room_id: str  # e.g., "tokyo_trip_2026"

    # Who & What
    user_id: str  # Who triggered this event
    action: str  # e.g., "vote_cast", "activity_added"
    payload: Dict[str, Any]  # Action-specific data

    # Ordering & Timing
    client_timestamp: datetime  # When user initiated (for display)
    server_timestamp: datetime  # Authoritative time
    lamport_clock: int  # For causal ordering

    # Security
    signature: str  # HMAC to prevent tampering
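
The `signature` field is described only as an HMAC. One way it could be computed is sketched below, assuming HMAC-SHA256 over a canonical JSON encoding of the signed fields. The helper names, the choice of fields to sign, and the key handling are assumptions, not part of the design above.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def canonical_bytes(fields: Dict[str, Any]) -> bytes:
    """Serialize deterministically: sorted keys and no optional whitespace."""
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_event(fields: Dict[str, Any], secret: bytes) -> str:
    """Return a hex HMAC-SHA256 digest over the canonical form of the fields."""
    return hmac.new(secret, canonical_bytes(fields), hashlib.sha256).hexdigest()


def verify_event(fields: Dict[str, Any], signature: str, secret: bytes) -> bool:
    """Recompute the digest and compare in constant time."""
    return hmac.compare_digest(sign_event(fields, secret), signature)
```

Canonical serialization matters here: if the signer and verifier order keys differently, identical events produce different digests.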

Storage Schema (DynamoDB)

Table: miniapp_events

Partition Key: app_id#room_id  (e.g., "trip_planner#tokyo_trip_2026")
Sort Key: server_timestamp#event_id  (e.g., "2026-01-15T10:30:00.123Z#01JCVW...")

Attributes:
- user_id
- action
- payload (JSON)
- client_timestamp
- lamport_clock
- signature

TTL: expires_at (auto-delete after 6 months)

GSI (Global Secondary Index):
- GSI1: user_id (for "show me all my events")
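
As an illustration of the key layout above, the sketch below builds a plain dict shaped like a `miniapp_events` item. The attribute names `pk` and `sk`, the helper name, and the 180-day approximation of "6 months" are assumptions. DynamoDB TTL expects the expiry attribute to be a number holding Unix epoch seconds, which is why `expires_at` is an integer.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

EVENT_TTL = timedelta(days=180)  # assumption: "6 months" approximated as 180 days


def build_event_item(
    app_id: str,
    room_id: str,
    event_id: str,
    server_timestamp: datetime,  # must be timezone-aware
    attributes: Dict[str, Any],
) -> Dict[str, Any]:
    """Build the item dict for one event (no AWS call is made here)."""
    ts = server_timestamp.astimezone(timezone.utc)
    # Fixed-width UTC timestamp with milliseconds, so string order matches time order
    iso = ts.strftime("%Y-%m-%dT%H:%M:%S") + f".{ts.microsecond // 1000:03d}Z"
    return {
        "pk": f"{app_id}#{room_id}",
        "sk": f"{iso}#{event_id}",
        "expires_at": int((ts + EVENT_TTL).timestamp()),
        **attributes,
    }
```

Because the sort key starts with a fixed-width timestamp, a range query on one partition returns a room's events in server-time order.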

Why DynamoDB:

  1. Serverless - no capacity planning
  2. Built-in TTL - auto-cleanup
  3. Fast writes - optimized for append-only
  4. Auto-scaling
  5. Strong consistency option

Cost: ~$5/month per 1000 active rooms

State Reconstruction

On-Demand Replay:

import json
from typing import Any, Dict

# redis, event_store, is_stale, get_initial_state and apply_event are
# provided elsewhere in the codebase.

async def get_current_state(app_id: str, room_id: str) -> Dict[str, Any]:
    cache_key = f"room_state:{app_id}:{room_id}"

    # 1. Check cache
    cached = await redis.get(cache_key)
    if cached and not is_stale(cached):
        return json.loads(cached)

    # 2. Replay events
    events = await event_store.get_events(app_id, room_id)

    state = get_initial_state(app_id)  # App-specific default
    for event in events:
        state = apply_event(state, event)

    # 3. Cache result for 300 seconds
    await redis.setex(cache_key, 300, json.dumps(state))

    return state
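
The replay loop depends on `apply_event` being a pure function of the previous state and one event. A minimal sketch of such a reducer for a trip-planner room follows. Events are shown as plain dicts, and the state shape and the payload keys `name` and `choice` are assumptions made for illustration.

```python
import copy
from typing import Any, Dict, List


def get_initial_state(app_id: str) -> Dict[str, Any]:
    # Hypothetical default state for a trip-planner room
    return {"activities": [], "votes": {}}


def apply_event(state: Dict[str, Any], event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new state; the input state is never mutated."""
    new_state = copy.deepcopy(state)
    action, payload = event["action"], event["payload"]
    if action == "activity_added":
        new_state["activities"].append(payload["name"])
    elif action == "vote_cast":
        # One vote per user: a later vote replaces the earlier one
        new_state["votes"][event["user_id"]] = payload["choice"]
    # Unknown actions are ignored so older code can skip newer event types
    return new_state


def replay(app_id: str, events: List[Dict[str, Any]]) -> Dict[str, Any]:
    state = get_initial_state(app_id)
    for event in events:
        state = apply_event(state, event)
    return state
```

Keeping the reducer pure means replaying the same events always produces the same state, which is what makes the cache and the snapshots safe to discard and rebuild.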

Optimization: Periodic Snapshots

Every 100 events, save a snapshot:

async def save_snapshot(app_id: str, room_id: str, state: Dict, last_event_id: str):
    await db.execute(
        "INSERT INTO room_state_snapshots (app_id, room_id, state, last_event_id, created_at) VALUES ($1,$2,$3,$4,NOW())",
        app_id, room_id, json.dumps(state), last_event_id
    )

Then replay is:

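# Get latest snapshot (assumes db.query returns None when the room has no snapshot yet)
snapshot = await db.query(
    "SELECT * FROM room_state_snapshots WHERE app_id=$1 AND room_id=$2 ORDER BY created_at DESC LIMIT 1",
    app_id, room_id
)

# Start from the snapshot if there is one, otherwise from the app default
state = json.loads(snapshot.state) if snapshot else get_initial_state(app_id)
since = snapshot.last_event_id if snapshot else None

# Replay only events AFTER snapshot (all events when since is None)
events = await event_store.get_events(app_id, room_id, since=since)
for event in events:
    state = apply_event(state, event)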
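
The snapshot optimization is only valid if restoring from a snapshot and replaying the tail yields the same state as a full replay. The in-memory sketch below checks that property with a toy reducer that counts events per action. All function names are hypothetical, and the snapshots live in a dict rather than in PostgreSQL.

```python
from typing import Any, Dict, List, Tuple

SNAPSHOT_INTERVAL = 100  # from the text: one snapshot every 100 events


def apply_event(state: Dict[str, int], event: Dict[str, Any]) -> Dict[str, int]:
    """Toy reducer: count events per action, returning a new dict."""
    new_state = dict(state)
    new_state[event["action"]] = new_state.get(event["action"], 0) + 1
    return new_state


def full_replay(events: List[Dict[str, Any]]) -> Dict[str, int]:
    state: Dict[str, int] = {}
    for event in events:
        state = apply_event(state, event)
    return state


def build_snapshots(events: List[Dict[str, Any]]) -> Dict[int, Dict[str, int]]:
    """Return {event_count: state}, captured every SNAPSHOT_INTERVAL events."""
    snapshots: Dict[int, Dict[str, int]] = {}
    state: Dict[str, int] = {}
    for count, event in enumerate(events, start=1):
        state = apply_event(state, event)
        if count % SNAPSHOT_INTERVAL == 0:
            snapshots[count] = state
    return snapshots


def restore(
    events: List[Dict[str, Any]], snapshots: Dict[int, Dict[str, int]]
) -> Tuple[Dict[str, int], int]:
    """Start from the newest snapshot and replay only the events after it."""
    start = max(snapshots, default=0)
    state = snapshots.get(start, {})
    tail = events[start:]
    for event in tail:
        state = apply_event(state, event)
    return state, len(tail)
```

With 250 events and an interval of 100, the restore path replays 50 events instead of 250.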

Conflict Resolution

Default: Last-Write-Wins (LWW) with Lamport Clock

def resolve_conflict(event1: MiniAppEvent, event2: MiniAppEvent) -> MiniAppEvent:
    # Different Lamport clocks: use causal ordering (higher Lamport clock wins)
    if event1.lamport_clock != event2.lamport_clock:
        return event1 if event1.lamport_clock > event2.lamport_clock else event2

    # Same Lamport clock: treat as concurrent and use server timestamp
    if event1.server_timestamp != event2.server_timestamp:
        return event1 if event1.server_timestamp > event2.server_timestamp else event2

    # Identical timestamps: break the tie on event id so the result
    # does not depend on argument order
    return event1 if event1.id > event2.id else event2
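
The resolver above compares `lamport_clock` values but does not show how they advance. The standard Lamport rules are: increment on every local event, and on receiving a remote value take the maximum of local and remote, then increment. A minimal sketch, with a hypothetical class name:

```python
class LamportClock:
    """Logical clock following the standard Lamport update rules."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance for a local event and return the value to stamp on it."""
        self.time += 1
        return self.time

    def observe(self, remote_time: int) -> int:
        """Merge a clock value received from another participant."""
        self.time = max(self.time, remote_time) + 1
        return self.time
```

An event created after observing another event always carries a larger clock value. Events created without seeing each other can carry equal values, which is the case the timestamp tie-break handles.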

Custom Resolution (Per Mini-App):

Some mini-apps need domain-specific logic:

  • Trip Planner: Concurrent activity additions = merge both (no conflict)
  • Fitness Challenge: Concurrent stat updates = keep highest (assumes monotonic increase)
  • Shopping: Concurrent watchlist adds = merge (set union)

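# Mini-app can override the default.
# MiniApp is an assumed base class whose resolve_conflict implements the LWW default above.
class TripPlannerApp(MiniApp):
    def resolve_conflict(self, event1, event2):
        if event1.action == "activity_added" and event2.action == "activity_added":
            # Both activities are valid, so keep and apply both
            return [event1, event2]

        # Fall back to LWW; wrap in a list so callers always get a list of events to apply
        return [super().resolve_conflict(event1, event2)]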
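
The other two per-app rules, "keep highest" for Fitness Challenge and "set union" for Shopping, can be written as small merge functions. The sketch below uses hypothetical function names. Both merges give the same result regardless of argument order or repetition, which is what lets replicas converge without coordination.

```python
from typing import FrozenSet


def merge_fitness_stat(a: int, b: int) -> int:
    """Keep the highest value; only valid while the stat never decreases."""
    return max(a, b)


def merge_watchlist(a: FrozenSet[str], b: FrozenSet[str]) -> FrozenSet[str]:
    """Concurrent adds are merged by set union; nothing is dropped."""
    return a | b
```

The fitness rule would silently discard a legitimate correction that lowers a stat, so it fits only counters that increase monotonically, as the list above already notes.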

Trade-offs

Pros of Event Sourcing

  1. Audit Trail: Every change tracked forever (or until TTL)
  2. Debugging: Can replay exact sequence that led to bug
  3. Temporal Queries: "What did the trip plan look like last Tuesday?"
  4. Compliance: Potential future need for user action logs
  5. Undo/Redo: Could add "undo last change" feature
  6. Offline Sync: Queue events, sync when reconnected
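
A temporal query like the one in item 3 amounts to replaying events up to a cutoff time. The sketch below assumes events are dicts with a `server_timestamp`, and it uses a hypothetical `activity_removed` action alongside `activity_added` to show state changing over time.

```python
from datetime import datetime
from typing import Any, Dict, List


def state_as_of(events: List[Dict[str, Any]], cutoff: datetime) -> List[str]:
    """Rebuild the activity list as it stood at `cutoff` (inclusive)."""
    activities: List[str] = []
    for event in sorted(events, key=lambda e: e["server_timestamp"]):
        if event["server_timestamp"] > cutoff:
            break  # events are sorted, so everything after this is too new
        name = event["payload"]["name"]
        if event["action"] == "activity_added":
            activities.append(name)
        elif event["action"] == "activity_removed" and name in activities:
            activities.remove(name)
    return activities
```

Such queries reach back only as far as the retained events, so with the 6-month TTL older states would have to come from stored snapshots.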

Cons of Event Sourcing

  1. Storage Cost: Events + snapshots vs. just current state
     Mitigation: DynamoDB TTL, aggressive caching
  2. Eventual Consistency: Brief window where users see different states
     Mitigation: Optimistic UI updates, <100ms sync target
  3. Schema Evolution: Old events need to work with new code
     Mitigation: Event versioning, backward-compatible changes only
  4. Learning Curve: Team needs to understand event-driven architecture
     Mitigation: Good documentation, start with simple apps
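
One common way to implement the event-versioning mitigation is "upcasting": stored events keep their original shape, and old versions are converted to the current shape at read time. The sketch below is an assumption about how that could look. The `schema_version` field, the v1 and v2 payload shapes, and treating a missing version as 1 are all hypothetical.

```python
from typing import Any, Callable, Dict

CURRENT_VERSION = 2

Event = Dict[str, Any]


def _v1_to_v2(event: Event) -> Event:
    # Hypothetical change: v1 stored the vote under "vote", v2 uses "choice"
    upgraded = dict(event)
    upgraded["payload"] = {"choice": event["payload"]["vote"]}
    upgraded["schema_version"] = 2
    return upgraded


# Maps a version to the function that upgrades it by one step
UPCASTERS: Dict[int, Callable[[Event], Event]] = {1: _v1_to_v2}


def upcast(event: Event) -> Event:
    """Upgrade an event step by step until it reaches CURRENT_VERSION."""
    version = event.get("schema_version", 1)  # assumption: unversioned means v1
    while version < CURRENT_VERSION:
        event = UPCASTERS[version](event)
        version = event["schema_version"]
    return event
```

Running stored events through `upcast` before `apply_event` means the reducer only ever has to handle the current schema.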

Consequences

Positive

  • Enables Multiplayer: Can finally build Trip Planner, Fitness Challenge, etc.
  • Better Debugging: See exact event history when users report issues
  • Flexible: Can change conflict resolution strategy without breaking existing rooms
  • Scalable: DynamoDB auto-scales, no manual capacity planning

Negative

  • Complexity: More moving parts than traditional state
  • Cost: DynamoDB + Redis caching adds ~$50/month initially
  • Latency: Extra hop to event store (but cached reads mitigate)

Neutral

  • Dual Systems: Solo workflows use existing state, multiplayer uses events
  • Could eventually migrate all to events, but not required

Validation

Success Criteria

  1. Correctness: 99.9% of concurrent edits resolve correctly
  2. Performance: p95 state read latency <100ms
  3. Cost: <$200/month for 1000 active rooms
  4. Developer Experience: Engineers can add new mini-apps without deep event sourcing knowledge

Metrics to Monitor

  • Event throughput (writes/sec)
  • State reconstruction time
  • Conflict resolution frequency
  • Cache hit rate
  • Storage growth

References

  • Martin Fowler: Event Sourcing
  • AWS DynamoDB Best Practices
  • CRDT Papers (for conflict resolution strategies)

Revision History

  • 2025-11-10: Initial draft