ADR 001: Event Sourcing vs. Traditional State Management for Multiplayer Mini-Apps
Date: 2025-11-10
Status: Proposed
Deciders: Engineering Team
Related: System Overview, Refactoring Plan
Context
We need to support multiplayer mini-apps where multiple users can simultaneously interact with shared state (trip planning, fitness challenges, shopping). The system must handle:
- Concurrent edits - Multiple users modifying the same data
- Offline support - Users can queue actions while disconnected
- Audit trails - For debugging and potential dispute resolution
- State reconstruction - Ability to replay history
- Real-time sync - Other users see changes within 100ms
Current State
The existing system uses traditional state management:
- PostgreSQL stores current state only
- Redis caches state for fast reads
- Single-user workflows update state directly
This works well for solo workflows but does not handle concurrent multi-user edits.
Decision
We will use Hybrid Event Sourcing + State Snapshots for multiplayer mini-apps.
Architecture:
1. Event Store (DynamoDB) - Append-only log of all state changes
2. State Snapshots (Redis + PostgreSQL) - Computed current state, rebuilt from events
3. CRDT-Style Conflict Resolution - Last-write-wins ordered by Lamport clock, with the server timestamp as tie-breaker
4. Dual-Mode System - Solo workflows use traditional state, multiplayer uses events
Alternatives Considered
Option 1: Pure CRDT (Conflict-Free Replicated Data Types)
Pros:
- Automatic conflict resolution
- Works offline-first naturally
- Provably convergent
Cons:
- High memory overhead (implementations often retain operation history or tombstone metadata)
- Complex to implement correctly
- Harder to debug
- Not all data structures have CRDT equivalents
Verdict: ❌ Too complex for MVP, over-engineered
Option 2: Operational Transform (like Google Docs)
Pros:
- Real-time collaborative editing
- Industry-proven (Google Docs)
Cons:
- Extremely complex to implement
- Requires low-latency server (not ideal for iMessage polling)
- Overkill for our use cases (we're not building a text editor)
Verdict: ❌ Too complex, doesn't fit iMessage async model
Option 3: Pessimistic Locking (Database Locks)
Pros:
- Simple to implement
- No conflicts by definition
Cons:
- Terrible UX - users wait for locks
- Doesn't work well with iMessage's async nature
- Single point of failure
Verdict: ❌ Bad UX, not scalable
Option 4: Traditional State with Optimistic Locking (Version Numbers)
Pros:
- Simpler than event sourcing
- Works with existing PostgreSQL
Cons:
- Frequent conflicts in high-activity rooms
- Writes rejected on version mismatch (the client must re-read and retry, or the user's edit is dropped)
- No audit trail
- Can't replay history
Verdict: ❌ Conflicts too frequent, no history
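To make the Option 4 trade-off concrete, here is a minimal in-memory sketch of version-number optimistic locking. `VersionedStore` and `VersionConflict` are hypothetical names, and the dictionary stands in for a PostgreSQL row with a version column; it only illustrates why a busy room produces rejected writes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


@dataclass
class VersionedStore:
    # room_id -> (version, state); a stand-in for a row with a version column
    rows: Dict[str, Tuple[int, Dict[str, Any]]] = field(default_factory=dict)

    def read(self, room_id: str) -> Tuple[int, Dict[str, Any]]:
        # Rooms that were never written start at version 0 with empty state
        return self.rows.get(room_id, (0, {}))

    def write(self, room_id: str, expected_version: int, state: Dict[str, Any]) -> int:
        current_version, _ = self.read(room_id)
        if current_version != expected_version:
            # Someone else wrote first; the caller must re-read and retry
            raise VersionConflict(
                f"expected v{expected_version}, found v{current_version}"
            )
        self.rows[room_id] = (current_version + 1, state)
        return current_version + 1
```

If two users read the same version and both write, the second write raises `VersionConflict`, which is the failure mode this option would hit often in high-activity rooms.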
Option 5: Hybrid Event Sourcing (CHOSEN)
Pros:
- ✅ Complete audit trail
- ✅ Can replay events to reconstruct state
- ✅ Offline-friendly (queue events, sync later)
- ✅ Flexible conflict resolution strategies
- ✅ Easy debugging (see exact event sequence)
- ✅ Can add computed views later
- ✅ Solo workflows unchanged (opt-in for multiplayer)
Cons:
- ⚠️ More storage (events + snapshots)
- ⚠️ Slightly higher latency (write event then update snapshot)
- ⚠️ Need to design event schemas carefully
Mitigation:
- Use DynamoDB with TTL (auto-delete events older than 6 months; rooms older than that must be rebuilt from their latest snapshot rather than from the full log)
- Cache snapshots aggressively in Redis
- Version and validate event schemas
Verdict: ✅ Best balance of complexity vs. capability
Implementation Details
Event Structure
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Any, Dict

    @dataclass
    class MiniAppEvent:
        # Identity
        id: str                      # UUIDv7 (time-sortable)
        app_id: str                  # e.g., "trip_planner"
        room_id: str                 # e.g., "tokyo_trip_2026"
        # Who & What
        user_id: str                 # Who triggered this event
        action: str                  # e.g., "vote_cast", "activity_added"
        payload: Dict[str, Any]      # Action-specific data
        # Ordering & Timing
        client_timestamp: datetime   # When user initiated (for display)
        server_timestamp: datetime   # Authoritative time
        lamport_clock: int           # For causal ordering
        # Security
        signature: str               # HMAC to prevent tampering
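One way the `signature` field could be computed is an HMAC-SHA256 over a canonical JSON encoding of the event fields. This is a sketch under assumptions: the ADR does not say which fields are signed, how the secret is managed, or which hash is used, and `canonical_bytes`, `sign_event`, and `verify_event` are hypothetical helper names.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def canonical_bytes(fields: Dict[str, Any]) -> bytes:
    # Sorted keys and fixed separators so identical fields always serialize identically
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_event(fields: Dict[str, Any], secret: bytes) -> str:
    # Hex-encoded HMAC-SHA256 over the canonical encoding (hash choice is an assumption)
    return hmac.new(secret, canonical_bytes(fields), hashlib.sha256).hexdigest()


def verify_event(fields: Dict[str, Any], signature: str, secret: bytes) -> bool:
    # compare_digest avoids leaking where the mismatch occurs through timing
    return hmac.compare_digest(sign_event(fields, secret), signature)
```

Timestamps would need to be converted to strings before signing, since `json.dumps` does not serialize `datetime` objects.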
Storage Schema (DynamoDB)
    Table: miniapp_events
    Partition Key: app_id#room_id (e.g., "trip_planner#tokyo_trip_2026")
    Sort Key: server_timestamp#event_id (e.g., "2026-01-15T10:30:00.123Z#01JCVW...")
    Attributes:
      - user_id
      - action
      - payload (JSON)
      - client_timestamp
      - lamport_clock
      - signature
    TTL: expires_at (Number, Unix epoch seconds; auto-delete after 6 months)
    GSI (Global Secondary Index):
      - GSI1: user_id (for "show me all my events")
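The key layout above can be sketched as a small item builder. The attribute names `pk` and `sk`, the function name `build_event_item`, and the 183-day approximation of "6 months" are assumptions for illustration; only the key formats and the `expires_at` attribute come from the schema.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict

EVENT_TTL = timedelta(days=183)  # Assumed approximation of "6 months"


def build_event_item(
    app_id: str,
    room_id: str,
    event_id: str,
    user_id: str,
    action: str,
    payload: Dict[str, Any],
    server_timestamp: datetime,  # Must be timezone-aware
    lamport_clock: int,
) -> Dict[str, Any]:
    ts = server_timestamp.astimezone(timezone.utc)
    # Fixed-width millisecond ISO-8601 keeps lexicographic order equal to time order
    ts_str = ts.strftime("%Y-%m-%dT%H:%M:%S.") + f"{ts.microsecond // 1000:03d}Z"
    return {
        "pk": f"{app_id}#{room_id}",
        "sk": f"{ts_str}#{event_id}",
        "user_id": user_id,
        "action": action,
        "payload": json.dumps(payload),
        "lamport_clock": lamport_clock,
        # DynamoDB TTL expects a Number holding Unix epoch seconds
        "expires_at": int((ts + EVENT_TTL).timestamp()),
    }
```

Because the sort key starts with the server timestamp, a range query on one partition returns a room's events in server-time order.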
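Why DynamoDB:
1. Serverless - no capacity planning in on-demand mode
2. Built-in TTL - automatic cleanup of expired events
3. Fast writes - well suited to append-only workloads
4. Auto-scaling
5. Strong consistency option - strongly consistent reads on the base table (GSI reads are eventually consistent)
Estimated cost: ~$5/month per 1000 active rooms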
State Reconstruction
On-Demand Replay:
    import json
    from typing import Dict

    # redis, event_store, is_stale, get_initial_state, and apply_event are defined elsewhere
    async def get_current_state(app_id: str, room_id: str) -> Dict:
        cache_key = f"room_state:{app_id}:{room_id}"
        # 1. Check cache
        cached = await redis.get(cache_key)
        if cached and not is_stale(cached):
            return json.loads(cached)
        # 2. Replay events in sort-key order
        events = await event_store.get_events(app_id, room_id)
        state = get_initial_state(app_id)  # App-specific default
        for event in events:
            state = apply_event(state, event)
        # 3. Cache result for 300 seconds
        await redis.setex(cache_key, 300, json.dumps(state))
        return state
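The replay loop relies on `apply_event` being a pure function from (state, event) to a new state. A minimal sketch for a trip-planner-style app follows; the state shape, the payload keys `name` and `choice`, and the use of plain dictionaries instead of `MiniAppEvent` are assumptions made to keep the example self-contained.

```python
from typing import Any, Dict


def get_initial_state(app_id: str) -> Dict[str, Any]:
    # Hypothetical default state for a trip-planner room
    return {"activities": [], "votes": {}}


def apply_event(state: Dict[str, Any], event: Dict[str, Any]) -> Dict[str, Any]:
    # Build a new state instead of mutating, so replaying the same log is repeatable
    action, payload = event["action"], event["payload"]
    if action == "activity_added":
        return {**state, "activities": state["activities"] + [payload["name"]]}
    if action == "vote_cast":
        # One vote per user: a later vote replaces the earlier one
        votes = {**state["votes"], event["user_id"]: payload["choice"]}
        return {**state, "votes": votes}
    # Unknown actions are ignored so newer event types do not break replay
    return state
```

Keeping the reducer free of side effects is what makes cached state, snapshots, and full replays interchangeable.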
Optimization: Periodic Snapshots
Every 100 events, save a snapshot:
    async def save_snapshot(app_id: str, room_id: str, state: Dict, last_event_id: str):
        await db.execute(
            "INSERT INTO room_state_snapshots (app_id, room_id, state, last_event_id, created_at) "
            "VALUES ($1, $2, $3, $4, NOW())",
            app_id, room_id, json.dumps(state), last_event_id,
        )
Then replay is:
    # Get the latest snapshot, if any (the query needs its two bind parameters)
    snapshot = await db.query(
        "SELECT * FROM room_state_snapshots WHERE app_id=$1 AND room_id=$2 "
        "ORDER BY created_at DESC LIMIT 1",
        app_id, room_id,
    )
    # Fall back to a full replay when the room has no snapshot yet
    state = json.loads(snapshot.state) if snapshot else get_initial_state(app_id)
    since = snapshot.last_event_id if snapshot else None
    # Replay only events AFTER the snapshot
    events = await event_store.get_events(app_id, room_id, since=since)
    for event in events:
        state = apply_event(state, event)
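The property that snapshots must preserve is that "snapshot plus tail" yields the same state as a full replay. The in-memory sketch below checks that idea with a toy reducer that counts events per action; `should_snapshot`, `replay`, and `rebuild` are hypothetical helpers, and the snapshot is modeled as a (events covered, state) pair rather than a database row.

```python
from functools import reduce
from typing import Any, Dict, List, Optional, Tuple

SNAPSHOT_INTERVAL = 100  # "Every 100 events" from the text

State = Dict[str, int]
Event = Dict[str, Any]


def apply_event(state: State, event: Event) -> State:
    # Toy reducer: count how many times each action occurred
    counts = dict(state)
    counts[event["action"]] = counts.get(event["action"], 0) + 1
    return counts


def replay(events: List[Event], state: Optional[State] = None) -> State:
    return reduce(apply_event, events, {} if state is None else state)


def should_snapshot(event_count: int) -> bool:
    # True on the 100th, 200th, ... event
    return event_count > 0 and event_count % SNAPSHOT_INTERVAL == 0


def rebuild(events: List[Event], snapshot: Optional[Tuple[int, State]]) -> State:
    # snapshot is (number of events it covers, state at that point)
    if snapshot is None:
        return replay(events)
    covered, state = snapshot
    return replay(events[covered:], state)
```

With a snapshot every 100 events, a rebuild replays at most 99 events on top of the latest snapshot instead of the whole log.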
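Conflict Resolution
Default: Last-Write-Wins (LWW) with Lamport Clock
    def resolve_conflict(event1: MiniAppEvent, event2: MiniAppEvent) -> MiniAppEvent:
        # Equal Lamport clocks mean neither event observed the other (concurrent),
        # so fall back to the authoritative server timestamp
        if event1.lamport_clock == event2.lamport_clock:
            # Break exact timestamp ties on event id so every replica picks the same winner
            key1 = (event1.server_timestamp, event1.id)
            key2 = (event2.server_timestamp, event2.id)
            return event1 if key1 > key2 else event2
        # Otherwise the higher Lamport clock wins. This order is consistent with
        # causality, but a higher clock alone does not prove the events are causally related
        return event1 if event1.lamport_clock > event2.lamport_clock else event2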
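The `lamport_clock` values compared above have to be maintained by each client. A minimal sketch of the standard Lamport clock rules follows; the class and method names are hypothetical, and how the counter is persisted across app restarts is left out.

```python
class LamportClock:
    """Logical clock: tick on local events, merge on received events."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Advance for a local event and return the stamp to attach to it
        self.time += 1
        return self.time

    def observe(self, remote_time: int) -> int:
        # Merge a stamp seen on a received event, then advance past it
        self.time = max(self.time, remote_time) + 1
        return self.time
```

Two clients that act without seeing each other can produce the same stamp, which is exactly the case the server-timestamp tie-break handles. Once a client observes a remote event, everything it does afterwards carries a strictly higher stamp.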
Custom Resolution (Per Mini-App):
Some mini-apps need domain-specific logic:
- Trip Planner: Concurrent activity additions = merge both (no conflict)
- Fitness Challenge: Concurrent stat updates = keep highest (assumes monotonic increase)
- Shopping: Concurrent watchlist adds = merge (set union)
    # A mini-app can override the default. MiniApp is an assumed base class
    # whose resolve_conflict implements the LWW default above.
    class TripPlannerApp(MiniApp):
        def resolve_conflict(self, event1, event2) -> list:
            if event1.action == "activity_added" and event2.action == "activity_added":
                # Both activities are valid, so keep and apply both
                return [event1, event2]
            # Fall back to LWW; wrap the winner so callers always get a list
            return [super().resolve_conflict(event1, event2)]
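The Fitness Challenge and Shopping rules listed above can be sketched as merge functions over state rather than over events. The function names and the data shapes (a stat-name-to-value mapping and a set of item ids) are assumptions; stats are assumed to be non-negative and only ever increase.

```python
from typing import Dict, Set


def merge_fitness_stats(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    # Keep the highest value per stat; missing stats count as 0
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def merge_watchlists(a: Set[str], b: Set[str]) -> Set[str]:
    # Set union: concurrent adds never conflict
    return a | b
```

Both merges are commutative and idempotent, so replicas reach the same result regardless of the order in which they see the concurrent updates.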
Trade-offs
Pros of Event Sourcing
- Audit Trail: Every change is retained until its event expires (6-month TTL)
- Debugging: Can replay exact sequence that led to bug
- Temporal Queries: "What did the trip plan look like last Tuesday?"
- Compliance: Potential future need for user action logs
- Undo/Redo: Could add "undo last change" feature
- Offline Sync: Queue events, sync when reconnected
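The offline-sync point can be illustrated with a small client-side queue that buffers events while disconnected and sends them in order on reconnect. `OfflineQueue` and its `send` callback are hypothetical; the sketch assumes the server deduplicates by event id, because a retry after a failed send may deliver the same event twice.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict

Event = Dict[str, Any]


class OfflineQueue:
    def __init__(self, send: Callable[[Event], None]) -> None:
        self._send = send
        self._pending: Deque[Event] = deque()
        self.online = False

    def submit(self, event: Event) -> None:
        # Always enqueue first so ordering is preserved across reconnects
        self._pending.append(event)
        if self.online:
            self.flush()

    def flush(self) -> int:
        sent = 0
        while self._pending:
            # If send raises, the event stays at the head of the queue for a later retry
            self._send(self._pending[0])
            self._pending.popleft()
            sent += 1
        return sent
```

A client would set `online = True` and call `flush()` when connectivity returns; events then reach the event store in the order the user created them.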
Cons of Event Sourcing
- Storage Cost: Events + snapshots vs. just current state
  - Mitigation: DynamoDB TTL, aggressive caching
- Eventual Consistency: Brief window where users see different states
  - Mitigation: Optimistic UI updates, <100ms sync target
- Schema Evolution: Old events need to work with new code
  - Mitigation: Event versioning, backward-compatible changes only
- Learning Curve: Team needs to understand event-driven architecture
  - Mitigation: Good documentation, start with simple apps
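Event versioning is often handled with "upcasters" that convert old events to the current shape at read time, so reducers only ever see one version. The sketch below assumes a `schema_version` field (not part of the event structure in this ADR) and a made-up rename of `payload["title"]` to `payload["name"]`.

```python
from typing import Any, Callable, Dict

Event = Dict[str, Any]
CURRENT_VERSION = 2


def _v1_to_v2(event: Event) -> Event:
    # Hypothetical change: v2 renamed payload["title"] to payload["name"]
    payload = dict(event["payload"])
    payload["name"] = payload.pop("title")
    return {**event, "schema_version": 2, "payload": payload}


# Maps a version to the function that upgrades it by one step
UPCASTERS: Dict[int, Callable[[Event], Event]] = {1: _v1_to_v2}


def upcast(event: Event) -> Event:
    # Events written before versioning existed are treated as version 1
    version = event.get("schema_version", 1)
    while version < CURRENT_VERSION:
        event = UPCASTERS[version](event)
        version = event["schema_version"]
    return event
```

Stored events are never rewritten; the conversion happens on the way into `apply_event`, which keeps the log append-only.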
Consequences
Positive
- Enables Multiplayer: Can finally build Trip Planner, Fitness Challenge, etc.
- Better Debugging: See exact event history when users report issues
- Flexible: Can change conflict resolution strategy without breaking existing rooms
- Scalable: DynamoDB auto-scales, no manual capacity planning
Negative
- Complexity: More moving parts than traditional state
- Cost: DynamoDB + Redis caching adds ~$50/month initially
- Latency: Extra hop to event store (but cached reads mitigate)
Neutral
- Dual Systems: Solo workflows use existing state, multiplayer uses events
  - Could eventually migrate all to events, but not required
Validation
Success Criteria
- Correctness: 99.9% of concurrent edits resolve correctly
- Performance: p95 state read latency <100ms
- Cost: <$200/month for 1000 active rooms
- Developer Experience: Engineers can add new mini-apps without deep event sourcing knowledge
Metrics to Monitor
- Event throughput (writes/sec)
- State reconstruction time
- Conflict resolution frequency
- Cache hit rate
- Storage growth
References
- Martin Fowler: Event Sourcing
- AWS DynamoDB Best Practices
- CRDT Papers (for conflict resolution strategies)
Revision History
- 2025-11-10: Initial draft