Agent Coordination Protocol

Role-based task delegation for aiagentchat v2.0

Solution v2 Critique-Refined 2026-02-06

At a Glance

New Lines of Code: ~350
New Dependencies: 0
Custom Event Types: 2
Estimated Effort: 5.5d

Critique-Driven Simplification

3-AI Critique Loop Applied

The solution was reviewed by Gemini and Codex, plus a Claude self-critique. Key finding: v1 was over-engineered (complexity 7/10). v2 reduces this to 4/10.

| Aspect | v1 | v2 |
| --- | --- | --- |
| Event types | 5 custom types | 2 (request + status) |
| State events | 2 (role + delegation index) | 1 (role only) |
| Delegation tracking | Matrix state events | In-memory buffer scan |
| Cycle detection | Enforcement machinery | trace_id metadata only |
| CLI commands | 4 new (send, status, wait, read) | 2 new (send, status) |
| New LOC | ~500 | ~350 |

End-to-End: CLI-to-Agent Delegation

Complete flow from human command to agent response, showing every component involved.

Participants: HUMAN, cchat CLI, Matrix (Synapse), claude-administrator container.

1. The human types a command:
   $ cchat send admin "create postgres db for myapp"
2. cchat resolves "admin" to room #aiagentchat-claude-administrator via the Space API (GET /spaces/{space_id}/hierarchy), then PUTs a com.aiagentchat.request event through the Matrix API (PUT /rooms/{room_id}/send/com.aiagentchat.request/{txn}). Matrix returns an event_id and cchat prints "req-a1b2c3 sent to claude-administrator". cchat returns immediately (non-blocking).
3. The event lands in the room timeline of #aiagentchat-claude-administrator.
4. The daemon's sync_loop picks it up (it long-polls /sync every 30s; the next response delivers new events). The sync response includes the request event in the room timeline. _process_sync() sees event type com.aiagentchat.request and enqueues it to _coordination_queue. coordination_loop (fast thread) dequeues it, posts status: ack ("Working on it") with a PUT status event, and enqueues the work to _reply_queue.
5. reply_loop dequeues the work. The LLM generates a response (10-30s for Claude) while the agent does the real work (creates the db, checks the result, formats the response).
6. The daemon posts status: complete with a PUT status event carrying to_agent: "cli-admin" and detail: "db created". The event in the #aiagentchat-claude-administrator timeline reads:
   [complete] req-a1b2c3
   "Database myapp_db created. Connection: postgres://..."

How does the CLI user find out?

7. cchat status polls the agent's gateway:
   $ cchat status req-a1b2c3
   cchat calls GET /delegation-status?id=req-a1b2c3. The gateway reads from its MessageBuffer, scanning for status events matching the request_id, and returns {status: "complete", detail: "db created"}. cchat prints:
   [COMPLETE] req-a1b2c3
   Database myapp_db created.
   Connection: postgres://...
8-9. The human sees the result in the terminal and continues work.
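Step 2 of the flow can be sketched as follows. This is a minimal illustration, not the cchat implementation: the function name `build_send_request`, the `homeserver` parameter, and the choice to seed `trace_id` with the request ID are assumptions. The URL follows the standard Matrix client-server path for sending a room event; the sketch only builds the URL and payload and performs no network call.

```python
import uuid
from urllib.parse import quote

REQUEST_EVENT_TYPE = "com.aiagentchat.request"


def build_send_request(homeserver, room_id, from_agent, to_agent, body,
                       request_id=None, txn_id=None):
    """Return (url, content) for a PUT that posts a delegation request."""
    # Hypothetical ID format modelled on the "req-a1b2c3" example in the flow.
    request_id = request_id or f"req-{uuid.uuid4().hex[:6]}"
    # The transaction ID makes the PUT idempotent if it is retried.
    txn_id = txn_id or uuid.uuid4().hex
    url = (
        f"{homeserver}/_matrix/client/v3/rooms/{quote(room_id, safe='')}"
        f"/send/{REQUEST_EVENT_TYPE}/{txn_id}"
    )
    content = {
        "request_id": request_id,
        "from_agent": from_agent,
        "to_agent": to_agent,
        # Assumption: a fresh request starts a new trace named after itself.
        "trace_id": request_id,
        "body": body,
    }
    return url, content
```

The caller would send `content` as the JSON body of an authenticated PUT to `url` and print the request ID once the homeserver returns an event ID.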

Response Detection Strategy

Addressing the Notification Gap

The CLI user sends a non-blocking request and needs to know when the agent finishes. Three approaches, from simplest to richest:

| Approach | How It Works | Complexity |
| --- | --- | --- |
| cchat status polling | User manually runs cchat status req-xxx. Calls GET /delegation-status on the target agent's gateway, which scans its MessageBuffer for status events matching the request_id. | Low (v1) |
| cchat wait with timeout | CLI polls the gateway in a loop (every 3s) until a complete or fail status appears, or the timeout (default 10 min) expires. Prints the result and exits. Deferred to Phase 2. | Medium |
| Gateway SSE endpoint | New GET /delegation-stream?id=req-xxx on the agent gateway. Server-sent events push status changes as they happen. The CLI opens the connection after send and prints updates live. Future enhancement. | Higher |
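The gateway-side scan behind GET /delegation-status might look like the sketch below. It assumes the MessageBuffer behaves like a bounded deque of Matrix events, each a dict with `type` and `content` keys; the function name `find_delegation_status` is hypothetical.

```python
from collections import deque

STATUS_EVENT_TYPE = "com.aiagentchat.status"


def find_delegation_status(buffer, request_id):
    """Return the newest status for request_id, or None if none is buffered."""
    # Walk newest-first so a later "complete" wins over an earlier "ack".
    for event in reversed(buffer):
        if event.get("type") != STATUS_EVENT_TYPE:
            continue
        content = event.get("content", {})
        if content.get("request_id") == request_id:
            return {"status": content.get("status"),
                    "detail": content.get("detail")}
    return None


# Example buffer; maxlen=200 mirrors the 200-message figure used in this document.
buffer = deque(maxlen=200)
buffer.append({"type": "com.aiagentchat.request",
               "content": {"request_id": "req-a1b2c3"}})
buffer.append({"type": STATUS_EVENT_TYPE,
               "content": {"request_id": "req-a1b2c3", "status": "ack",
                           "detail": "Working on it"}})
buffer.append({"type": STATUS_EVENT_TYPE,
               "content": {"request_id": "req-a1b2c3", "status": "complete",
                           "detail": "db created"}})
```

Because the scan is linear over at most a few hundred events, it needs no index or extra state, which is consistent with the in-memory buffer scan chosen for v2.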

v1 recommendation: cchat status is sufficient — the human is doing other work and checks when ready. cchat wait is the natural Phase 2 follow-up for scripted automation.
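For the deferred cchat wait command, a polling loop along these lines would be enough. This is a sketch under assumptions: `fetch_status` stands in for whatever function calls the gateway, and the injectable `sleep` and `clock` parameters exist only to make the loop testable. The 3s interval and 10 minute timeout come from the table above.

```python
import time

TERMINAL_STATUSES = {"complete", "fail"}


def wait_for_result(fetch_status, request_id, interval=3.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until a terminal status appears; return it, or None on timeout."""
    deadline = clock() + timeout
    while True:
        result = fetch_status(request_id)
        if result and result.get("status") in TERMINAL_STATUSES:
            return result
        if clock() >= deadline:
            return None
        sleep(interval)
```

An "ack" or "notice" status does not end the wait; only complete or fail does, matching the behaviour described for cchat wait.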

Architecture: Visit-Rooms Model

Each agent's Matrix room serves as its inbox. To delegate work, the sender joins and posts in the target's room.

cchat CLI                   Matrix Rooms                            Docker Containers
┌──────────┐                ┌───────────────────────────────────┐   ┌───────────────────────┐
│ cchat    │──PUT request──>│ #aiagentchat-claude-administrator │   │ claude-administrator  │
│ send     │ (Matrix API)   │                                   │<──│ sync_loop polls /sync │
│          │                │ timeline:                         │   │ every 30s             │
│          │                │ [request] req-a1b2c3              │──>│                       │
│          │                │ [status: ack]                     │   │ coordination_loop     │
│          │                │ [status: complete]                │   │ → fast ack (1-2s)     │
│          │                │                                   │   │ → enqueue LLM work    │
│ cchat    │──GET status───>│                                   │   │                       │
│ status   │ (gateway:8870) │                                   │   │ reply_loop            │
│          │<──{complete}───│                                   │   │ → LLM inference       │
└──────────┘                └───────────────────────────────────┘   │ → post complete       │
                                              │                     └───────────────────────┘
                                              │ cross-notification
                                              │ (admin joins dev room, posts notice)
                                              v
                            ┌───────────────────────────────────┐   ┌───────────────────────┐
                            │ #aiagentchat-dev-myapp            │   │ dev-myapp             │
                            │ [status: notice]                  │<──│ sync_loop picks up    │
                            │ "DNS changed for myapp"           │   │ notice event          │
                            └───────────────────────────────────┘   └───────────────────────┘

Protocol: 2 Event Types

com.aiagentchat.request

Delegation request. Contains: request_id, from_agent, to_agent, trace_id, body, gitlab_ref

com.aiagentchat.status

Status update with status field: ack | complete | fail | notice

Contains: request_id, from_agent, to_agent, status, detail, body
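The status payload can be modelled as a small validated record. The sketch below uses only the fields and status values listed above; the class name `StatusContent`, the helper `make_status`, and the exact layout of the human-readable `body` string are assumptions for illustration.

```python
from dataclasses import asdict, dataclass

VALID_STATUSES = ("ack", "complete", "fail", "notice")


@dataclass(frozen=True)
class StatusContent:
    request_id: str
    from_agent: str
    to_agent: str
    status: str
    detail: str
    body: str

    def __post_init__(self):
        # Reject anything outside the four documented status values.
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


def make_status(request_id, from_agent, to_agent, status, detail):
    """Build the content dict for a com.aiagentchat.status event."""
    # Assumed body layout: a plain-text line that generic clients can display.
    body = f"[{status}] {request_id} {detail}"
    return asdict(StatusContent(request_id, from_agent, to_agent,
                                status, detail, body))
```

Validating at construction time keeps a typo such as "completed" from reaching the room timeline, where a status scan would silently never match it.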

com.aiagentchat.role (State Event)

Published once at startup. Contains: role, capabilities[], projects[], instance_name. The only state event written by coordination.

Agent Roles

| Role | Purpose | Container Access |
| --- | --- | --- |
| admin | Infrastructure management | Docker socket, secrets, deploy scripts |
| developer | Application code and CI/CD | Project source, build tools, GitLab API |
| specialist | Domain-specific tasks | Domain-specific tools only |
| cli | Human-initiated delegation | Read-only observer |
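Publishing the com.aiagentchat.role state event at startup could be sketched as below. The URL uses the standard Matrix client-server path for setting room state. Using `instance_name` as the state key, the function name `build_role_state`, and the `homeserver` parameter are assumptions; the sketch builds the URL and payload only.

```python
from urllib.parse import quote

ROLE_EVENT_TYPE = "com.aiagentchat.role"
KNOWN_ROLES = ("admin", "developer", "specialist", "cli")


def build_role_state(homeserver, room_id, role, capabilities, projects,
                     instance_name):
    """Return (url, content) for the PUT that publishes an agent's role."""
    if role not in KNOWN_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    # Assumption: the instance name is the state key, so each agent
    # instance overwrites only its own role entry.
    url = (
        f"{homeserver}/_matrix/client/v3/rooms/{quote(room_id, safe='')}"
        f"/state/{ROLE_EVENT_TYPE}/{quote(instance_name, safe='')}"
    )
    content = {
        "role": role,
        "capabilities": list(capabilities),
        "projects": list(projects),
        "instance_name": instance_name,
    }
    return url, content
```

Because state events are keyed, republishing at every startup replaces the previous value instead of accumulating history in room state.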

Agent Internal Flow (3 Threads)

1. sync_loop (Thread 1): polls /sync every 30s. Receives com.aiagentchat.request in the room timeline and enqueues it to _coordination_queue.
2. coordination_loop (Thread 5): fast, non-blocking. Posts status: ack to the room (1-2s), for example "Working on: create postgres db for myapp", then enqueues the LLM work to _reply_queue.
3. reply_loop (Thread 4): slow, LLM inference. Claude generates a response (10-30s) and the agent executes the task autonomously, then posts the result via coordination.
4. The result is posted as status: complete or status: fail in the room. Both land in the room timeline for cchat status to find.

A dedicated coordination thread keeps LLM inference (10-30s) from blocking acknowledgments (1-2s).
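The two-queue handoff can be illustrated with standard-library threads. This is a simplified sketch, not the daemon's code: `run_pipeline`, `post_status`, and `run_llm` are hypothetical names, the sync thread is replaced by a plain list of requests, and the reply thread posts its result directly instead of routing it back through coordination.

```python
import queue
import threading


def run_pipeline(requests, post_status, run_llm):
    """Ack each request quickly on one thread, do slow work on another."""
    coordination_queue = queue.Queue()
    reply_queue = queue.Queue()
    stop = object()  # sentinel that shuts both loops down in order

    def coordination_loop():
        while True:
            req = coordination_queue.get()
            if req is stop:
                reply_queue.put(stop)
                return
            # Fast path: acknowledge before any slow work starts.
            post_status(req["request_id"], "ack", f"Working on: {req['body']}")
            reply_queue.put(req)

    def reply_loop():
        while True:
            req = reply_queue.get()
            if req is stop:
                return
            try:
                post_status(req["request_id"], "complete", run_llm(req["body"]))
            except Exception as exc:
                post_status(req["request_id"], "fail", str(exc))

    threads = [threading.Thread(target=coordination_loop),
               threading.Thread(target=reply_loop)]
    for thread in threads:
        thread.start()
    for req in requests:
        coordination_queue.put(req)
    coordination_queue.put(stop)
    for thread in threads:
        thread.join()
```

With this split, a second request is acknowledged while the first is still inside `run_llm`, which is the property the dedicated coordination thread is meant to provide.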

Agent-to-Agent: claude-administrator asks claude-websurfinmurf

No human involved. Admin agent needs info about a developer's application. Both agents are persistent daemons running in Docker containers, but their "brains" (Claude SDK sessions) are ephemeral — spawned fresh per message, then gone.

Key Insight: The Daemon Is Persistent, the Brain Is Not

Each agent container runs a persistent daemon (sync_loop, reply_loop, etc.) but each LLM invocation is a stateless SDK session that processes one message and exits. The daemon is the "nervous system" that's always listening. The SDK session is the "thought" that fires and completes.

Step-by-Step Flow

Each step lists the action, followed by the IT assets involved.

1. Something triggers claude-administrator to think (e.g., a human asked it to "check all apps are healthy", or another agent notified it of a change). The daemon's reply_loop spawns a fresh Claude SDK session.
   IT assets: claude-administrator daemon (Python, persistent); Claude SDK session (ephemeral, via claude-administrator:3008)
2. The SDK session decides it needs info from the developer agent. It runs cchat send developer "what version is myapp running?" via its Bash tool. cchat resolves "developer" to #aiagentchat-websurfinmurf via the Space API.
   IT assets: cchat CLI (Bash, inside claude-administrator container); Matrix Space API (room discovery)
3. cchat PUTs a com.aiagentchat.request event into websurfinmurf's room and returns "req-a1b2c3 sent". The SDK session sees this, includes it in its response, and exits. The admin daemon posts the response to its own room and goes idle.
   IT assets: Matrix API (PUT event to room); #aiagentchat-websurfinmurf (room timeline)
4. websurfinmurf's sync_loop picks up the request event on its next /sync poll (within 30s). coordination_loop posts status: ack. The request is enqueued to _reply_queue.
   IT assets: websurfinmurf daemon (sync_loop, coordination_loop); Synapse homeserver (/sync long-poll)
5. websurfinmurf's reply_loop spawns a fresh SDK session. The LLM does the work (checks app version, reads configs, etc.) and returns the answer.
   IT assets: websurfinmurf daemon (reply_loop); Claude SDK session (ephemeral)
6. websurfinmurf's daemon posts status: complete with the answer into its own room (#aiagentchat-websurfinmurf), targeting to_agent: claude-administrator.
   IT assets: websurfinmurf daemon (coordination_loop); #aiagentchat-websurfinmurf (room timeline)
7. The gap: claude-administrator's SDK session is long gone. But the admin daemon is still running and still joined to websurfinmurf's room (it joined in step 3). Its sync_loop sees the status: complete event.
   IT assets: claude-administrator daemon (sync_loop); Synapse homeserver (/sync delivers the event)
8. The daemon closes the loop: coordination_loop matches the request_id to its outbound delegation tracker (in-memory map). It synthesizes an internal message: "Delegation result from websurfinmurf for req-a1b2c3: myapp is on v2.3.1". This gets enqueued to _reply_queue.
   IT assets: claude-administrator daemon (coordination_loop); outbound delegation map (in-memory, keyed by request_id)
9. claude-administrator's reply_loop spawns a new SDK session with the delegation result as context. The LLM sees the answer and decides what to do next: post a summary, take further action, or notify the original requestor.
   IT assets: claude-administrator daemon (reply_loop); Claude SDK session (new, ephemeral); #aiagentchat-claude-administrator (posts result)

New Component: Outbound Delegation Tracker

The daemon needs an in-memory map of outbound delegations: {request_id: {target_room, original_context, timestamp}}. When coordination_loop sees a matching status: complete/fail from a foreign room, it re-injects the result into the reply pipeline. This is the bridge between the ephemeral SDK sessions — the daemon's persistent state carries context across the gap.
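A minimal sketch of such a tracker follows. The class and method names are hypothetical; the stored fields and the synthesized message format are taken from the description above. The room check is an added assumption, so that a status event with a colliding request_id from an unrelated room is ignored.

```python
import time


class OutboundDelegationTracker:
    """In-memory map of delegations this daemon is still waiting on."""

    def __init__(self):
        self._pending = {}

    def register(self, request_id, target_room, original_context, now=None):
        self._pending[request_id] = {
            "target_room": target_room,
            "original_context": original_context,
            "timestamp": time.time() if now is None else now,
        }

    def resolve(self, room_id, content):
        """Return a synthesized message for a matching terminal status, else None."""
        if content.get("status") not in ("complete", "fail"):
            return None  # ack and notice do not close a delegation
        request_id = content.get("request_id")
        entry = self._pending.get(request_id)
        if entry is None or entry["target_room"] != room_id:
            return None
        del self._pending[request_id]  # each delegation resolves at most once
        return (f"Delegation result from {content['from_agent']} "
                f"for {request_id}: {content['detail']}")
```

The returned string is what coordination_loop would enqueue to _reply_queue, so the next SDK session starts with the delegation result as context.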

Simple View

| Step | What Happens | Component |
| --- | --- | --- |
| 1 | Admin LLM decides to delegate | claude-administrator SDK session (ephemeral) |
| 2 | cchat posts request to dev's room | cchat CLI → Matrix API |
| 3 | Request lands in dev's room timeline | #aiagentchat-websurfinmurf (Matrix room) |
| 4 | Dev daemon picks up request via /sync | websurfinmurf sync_loop (persistent daemon) |
| 5 | Dev LLM does the work | websurfinmurf SDK session (ephemeral) |
| 6 | Dev daemon posts complete to its room | websurfinmurf coordination_loop → Matrix API |
| 7 | Admin daemon sees complete via /sync | claude-administrator sync_loop (persistent daemon) |
| 8 | Admin daemon matches request_id, injects | claude-administrator coordination_loop + delegation tracker |
| 9 | Admin LLM processes result, acts on it | claude-administrator SDK session (new, ephemeral) |

Implementation Phases

Phase 1: Foundation (~2 days)
coordination.py, config extensions, Matrix client methods, role publishing, sync filter, unit tests

Phase 2: CLI Delegation + Agent Processing (~2 days)
cchat send/status, coordination thread, REQUEST pipeline, offline detection, foreign request handling

Phase 3: Dev-Admin Coordination (~1 day)
Agent-to-agent delegation, cross-notifications, project-based targeting

Phase 4: GitLab Reference + Polish (~0.5 day)
GL#project/123 parsing, reference passthrough in status events
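The GL#project/123 parsing in Phase 4 could be handled with one regular expression. This is a sketch under assumptions: the document only shows the form GL#project/123, so support for nested paths such as GL#group/project/123 and the function name `parse_gitlab_refs` are illustrative guesses.

```python
import re

# Project path segments are word characters, dots, or hyphens;
# the final segment must be a numeric issue ID.
GITLAB_REF = re.compile(
    r"(?<!\w)GL#(?P<project>[\w.-]+(?:/[\w.-]+)*)/(?P<issue>\d+)\b"
)


def parse_gitlab_refs(text):
    """Return a list of (project, issue_number) tuples found in text."""
    return [(m.group("project"), int(m.group("issue")))
            for m in GITLAB_REF.finditer(text)]
```

For example, `parse_gitlab_refs("see GL#myapp/123")` yields `[("myapp", 123)]`, which a status event could then pass through in its gitlab_ref field.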

Key Design Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Room model | Visit-rooms | Traceability, clarity, minimal contamination |
| Event types | 2 custom types | Prevents false positives; Element-compatible via body field |
| Delegation tracking | In-memory buffer scan | No state bloat; survives restart via Matrix re-sync |
| Offline handling | Fail fast (120s) | Immediate feedback; no queuing |
| Security model | Container isolation | No agent-level ACL needed |
| CLI behavior | Non-blocking send | Human continues working; polls with cchat status |
| GitLab updates | Requestor-responsible | Completing agent references issues, doesn't modify GitLab |

Risk Assessment

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| Custom events don't sync | Low | Test with Synapse first; fallback to m.room.message |
| Buffer insufficient for tracking | Low | 200 msgs holds ~50 delegations; monitor in production |
| Sync latency (30s) | Medium | Acceptable for async; ack confirms within next sync |
| Agent crash mid-task | Low | Client-side TTL marks stale; messages persist in Matrix |
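The client-side TTL mitigation can be sketched as a pure function over the last known status. The document does not state a TTL value, so the 600s default, the function name `classify_delegation`, and the "stale"/"pending" labels are all assumptions for illustration.

```python
TERMINAL_STATUSES = {"complete", "fail"}


def classify_delegation(last_status, last_update_ts, now, ttl_seconds=600.0):
    """Classify a delegation from its latest status and that status's timestamp."""
    if last_status in TERMINAL_STATUSES:
        return last_status  # finished delegations never go stale
    if now - last_update_ts > ttl_seconds:
        return "stale"  # e.g. the agent crashed after posting its ack
    return "pending"
```

Because the check runs on the client, a crashed agent needs no cleanup logic of its own: the request and ack remain in the Matrix timeline, and the caller simply stops waiting.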

Solution Artifacts