WebSocket Connection Loop Fix

Date: November 18, 2025
Issue: Edge client experiencing continuous connect/disconnect loop
Status: ✅ FIXED

Problem

Edge client logs showed a repeating pattern every ~1 second:

[INFO] ✅ WebSocket connected
[INFO] 🔌 WebSocket connected - real-time command delivery enabled
[WARN] WebSocket closed: code=1000, reason=New connection established
[WARN] 🔌 WebSocket disconnected - falling back to HTTP polling
[INFO] Scheduling WebSocket reconnect attempt #1 in 1000ms

Root Cause

Duplicate WebSocket endpoint definitions in app/api/edge_routes.py:

  1. Line 109: @router.websocket("/ws") - Uses ws_manager.handle_connection()
  2. Line 277: @router.websocket("/ws") - Manually accepts connection and registers

When a client connected, both handlers were triggered because they had the same route path.

What Was Happening:

  1. Client connects → FastAPI triggers BOTH WebSocket handlers
  2. Handler #1 calls ws_manager.handle_connection() → Registers connection for edge_agent_id
  3. Handler #2 calls ws_manager.connect() → Sees existing connection for same edge_agent_id
  4. ws_manager.connect() (lines 56-63) closes the old connection with reason "New connection established"
  5. Handler #1's connection gets closed → Client sees disconnect
  6. Client auto-reconnects → Loop repeats

Evidence in Code

From app/edge/websocket_manager.py:56-63:

# If edge agent already connected, close old connection
if edge_agent_id in self.connections:
    old_ws = self.connections[edge_agent_id]
    try:
        await old_ws.close(code=1000, reason="New connection established")
    except Exception:
        # The old socket may already be closed; ignore errors during cleanup
        pass
    logger.info(f"Replaced old connection for {edge_agent_id}")

This logic is correct for replacing stale connections, but became problematic when two handlers were racing to register the same connection.
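The effect of registering the same edge_agent_id twice can be reproduced in isolation. The following is a minimal sketch, not the project's code: FakeWebSocket and MiniManager are hypothetical stand-ins that mirror only the replace-on-connect rule quoted above.

```python
import asyncio


class FakeWebSocket:
    """Hypothetical stand-in for a WebSocket; records how it was closed."""

    def __init__(self, name):
        self.name = name
        self.closed_with = None  # set to (code, reason) once closed

    async def close(self, code=1000, reason=""):
        self.closed_with = (code, reason)


class MiniManager:
    """Simplified manager that keeps one connection per edge_agent_id."""

    def __init__(self):
        self.connections = {}

    async def connect(self, edge_agent_id, ws):
        # Same replace-on-connect rule as the excerpt above.
        old_ws = self.connections.get(edge_agent_id)
        if old_ws is not None:
            await old_ws.close(code=1000, reason="New connection established")
        self.connections[edge_agent_id] = ws


async def simulate():
    manager = MiniManager()
    first = FakeWebSocket("first registration")
    second = FakeWebSocket("second registration")
    await manager.connect("edge-1", first)
    await manager.connect("edge-1", second)  # duplicate registration
    return first.closed_with, second.closed_with


first_state, second_state = asyncio.run(simulate())
print(first_state)   # (1000, 'New connection established')
print(second_state)  # None
```

The first registration is closed with exactly the code and reason seen in the edge client logs, which is why the rule is safe only when each physical connection is registered once.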

Solution

Removed the duplicate endpoint at line 277 (kept the first one at line 109).

Changes Made:

File: app/api/edge_routes.py

  • ❌ Removed: Lines 277-378 (duplicate WebSocket endpoint)
  • ✅ Kept: Lines 109-196 (proper implementation using ws_manager.handle_connection())

Commit: 8019b33 - "fix: Remove duplicate WebSocket endpoint causing connection loop"

Expected Behavior After Fix

Client should connect once and stay connected:

[INFO] ✅ WebSocket connected
[INFO] 🔌 WebSocket connected - real-time command delivery enabled
[INFO] HTTP polling paused (using WebSocket)

No more repeated disconnects unless:

  • Network issues occur
  • Backend redeploys
  • Client intentionally disconnects

Testing

  1. Deploy: Pushed to dev branch → Railway auto-deploys
  2. Monitor: Wait ~60s for deployment
  3. Verify: Check edge client logs for stable connection
  4. Confirm: Should see WebSocket stay connected for >30 seconds
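The verification step can be partly automated by scanning captured edge client logs for the replacement close message. This is a rough sketch that assumes the log lines keep the wording shown in this document; summarize_ws_log is a hypothetical helper, not part of the edge client.

```python
REPLACED_MARKER = "reason=New connection established"
CONNECTED_MARKER = "WebSocket connected - real-time command delivery enabled"


def summarize_ws_log(lines):
    """Count successful connects and replacement closes in log lines."""
    connects = sum(1 for line in lines if CONNECTED_MARKER in line)
    replaced = sum(1 for line in lines if REPLACED_MARKER in line)
    return {"connects": connects, "replaced": replaced}


# Two iterations of the loop pattern from the Problem section.
before_fix = [
    "[INFO] ✅ WebSocket connected",
    "[INFO] 🔌 WebSocket connected - real-time command delivery enabled",
    "[WARN] WebSocket closed: code=1000, reason=New connection established",
    "[WARN] 🔌 WebSocket disconnected - falling back to HTTP polling",
    "[INFO] Scheduling WebSocket reconnect attempt #1 in 1000ms",
] * 2

# The expected log output after the fix.
after_fix = [
    "[INFO] ✅ WebSocket connected",
    "[INFO] 🔌 WebSocket connected - real-time command delivery enabled",
    "[INFO] HTTP polling paused (using WebSocket)",
]

before_summary = summarize_ws_log(before_fix)
after_summary = summarize_ws_log(after_fix)
print(before_summary)  # {'connects': 2, 'replaced': 2}
print(after_summary)   # {'connects': 1, 'replaced': 0}
```

A healthy client should show a "replaced" count of zero over the observation window; a count that grows along with "connects" indicates the loop is still present.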

Metrics to Watch

  • Before Fix: Connection duration ~1 second (then loop)
  • After Fix: Connection duration should be minutes or longer (until a natural disconnect)

Related Files

  • app/api/edge_routes.py - WebSocket endpoints
  • app/edge/websocket_manager.py - Connection manager (working correctly)
  • Edge client - Auto-reconnect logic (working as designed)

Lessons Learned

  1. FastAPI allows duplicate routes - Registering two handlers on the same path raises no error or warning at startup
  2. Connection loops are hard to debug - The symptoms looked like network issues, but the cause was code duplication
  3. WebSocket manager's "replace old connection" logic is correct - It just needs to be called once per connection

Prevention

  • Use route inspection tools to detect duplicates
  • Add integration tests that verify single WebSocket handler per route
  • Code review checklist: Check for duplicate @router decorators
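One possible shape for the route inspection check is sketched below. It relies only on route objects exposing path, name, and (for HTTP routes) methods attributes. find_duplicate_routes and the handler names in the sample data are hypothetical, and SimpleNamespace objects stand in for real route objects so the sketch runs without a web framework.

```python
from collections import defaultdict
from types import SimpleNamespace


def find_duplicate_routes(routes):
    """Return {(path, methods): [handler names]} for keys registered more than once."""
    seen = defaultdict(list)
    for route in routes:
        path = getattr(route, "path", None)
        if path is None:
            continue
        # Routes without HTTP methods (such as WebSocket routes) share one label.
        methods = getattr(route, "methods", None)
        key = (path, tuple(sorted(methods)) if methods else ("<no methods>",))
        seen[key].append(getattr(route, "name", "<unnamed>"))
    return {key: names for key, names in seen.items() if len(names) > 1}


# Stand-in route objects; the handler names are made up for illustration.
sample_routes = [
    SimpleNamespace(path="/ws", name="ws_handler_a"),
    SimpleNamespace(path="/health", name="health", methods={"GET"}),
    SimpleNamespace(path="/ws", name="ws_handler_b"),
]

duplicates = find_duplicate_routes(sample_routes)
print(duplicates)  # {('/ws', ('<no methods>',)): ['ws_handler_a', 'ws_handler_b']}
```

In a FastAPI project the same function could presumably be given app.routes or router.routes inside a unit test that asserts the result is empty, so a duplicate registration fails CI rather than surfacing in production.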