Agent Coordination Protocol

At a Glance

Metric	Value
New lines of code	~350
New dependencies	0
Custom event types	2
Estimated effort	5.5 days

Critique-Driven Simplification

3-AI Critique Loop Applied

The solution was reviewed by Gemini, Codex, and a Claude self-critique. Key finding: v1 was over-engineered (complexity 7/10). v2 reduces that to 4/10.

Aspect	v1	v2
Event types	5 custom types	2 (request + status)
State events	2 (role + delegation index)	1 (role only)
Delegation tracking	Matrix state events	In-memory buffer scan
Cycle detection	Enforcement machinery	trace_id metadata only
CLI commands	4 new (send, status, wait, read)	2 new (send, status)
New LOC	~500	~350

End-to-End: CLI-to-Agent Delegation

Complete flow from human command to agent response, showing every component involved.

Participants: the human, the cchat CLI, Matrix (Synapse), and the claude-administrator container.

1. The human types a command:
   $ cchat send admin "create postgres db for myapp"
2. cchat resolves "admin" to the room #aiagentchat-claude-administrator via the Space API (GET /spaces/{space_id}/hierarchy), then PUTs a com.aiagentchat.request event to /rooms/{room_id}/send/com.aiagentchat.request/{txn}. Matrix returns an event_id and cchat prints "req-a1b2c3 sent to claude-administrator". cchat returns immediately; the send is non-blocking.
3. The event lands in the timeline of #aiagentchat-claude-administrator.
4. The daemon's sync_loop picks it up. It long-polls /sync every 30s, and the next response delivers new events, including the request event in the room timeline. _process_sync() sees the event type com.aiagentchat.request and enqueues it to _coordination_queue. coordination_loop (the fast thread) dequeues it, posts status: ack ("Working on it") with a PUT status event, and enqueues the work to _reply_queue.
5. reply_loop dequeues the work and the LLM generates a response (10-30s for Claude). The agent does the real work: it creates the db, checks the result, and formats the response.
6. The daemon posts status: complete with a PUT status event (to_agent: "cli-admin", detail: "db created"). The timeline of #aiagentchat-claude-administrator now shows:
   [complete] req-a1b2c3
   "Database myapp_db created. Connection: postgres://..."

How does the CLI user find out?

7. cchat status polls the agent's gateway:
   $ cchat status req-a1b2c3
   cchat sends GET /delegation-status?id=req-a1b2c3. The gateway reads from its MessageBuffer, scanning for status events that match the request_id, and returns {status: "complete", detail: "db created"}. cchat prints:
   [COMPLETE] req-a1b2c3
   Database myapp_db created.
   Connection: postgres://...
8-9. The human sees the result in the terminal and continues working.
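Step 2 of the flow can be made concrete with a small sketch of how the PUT URL might be assembled. The `/_matrix/client/v3` prefix comes from the Matrix client-server API. The homeserver URL, the room ID, the helper names, and the `req-` plus six hex characters ID scheme are all assumptions for illustration, not the actual cchat implementation.

```python
import uuid
from urllib.parse import quote


def build_send_url(homeserver: str, room_id: str, event_type: str, txn_id: str) -> str:
    """Return the PUT URL for sending one event into a room.

    Room IDs contain '!' and ':', so every path segment is percent-encoded.
    """
    return (
        f"{homeserver.rstrip('/')}/_matrix/client/v3/rooms/"
        f"{quote(room_id, safe='')}/send/"
        f"{quote(event_type, safe='')}/{quote(txn_id, safe='')}"
    )


def new_request_id() -> str:
    """Hypothetical ID scheme that yields values shaped like req-a1b2c3."""
    return "req-" + uuid.uuid4().hex[:6]


url = build_send_url(
    "https://matrix.example.org",   # assumed homeserver URL
    "!abc123:example.org",          # assumed room ID of the admin room
    "com.aiagentchat.request",
    "txn-1",                        # the transaction ID makes retries idempotent
)
print(url)
print(new_request_id())
```

Reusing the same transaction ID on a retry lets the homeserver deduplicate the event, which matters for a CLI that may be re-run after a network error.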

Response Detection Strategy

Addressing the Notification Gap

The CLI user sends a non-blocking request and needs to know when the agent finishes. Three approaches, from simplest to richest:

Approach	How It Works	Complexity
`cchat status` polling	User manually runs `cchat status req-xxx`. Calls `GET /delegation-status` on the target agent's gateway, which scans its MessageBuffer for status events matching the request_id.	Low (v1)
`cchat wait` with timeout	CLI polls the gateway in a loop (every 3s) until `complete` or `fail` status appears, or timeout (default 10min). Prints result and exits. Deferred to Phase 2.	Medium
Gateway SSE endpoint	New `GET /delegation-stream?id=req-xxx` on the agent gateway. Server-sent events push status changes as they happen. CLI opens connection after send and prints updates live. Future enhancement.	Higher

v1 recommendation: cchat status is sufficient — the human is doing other work and checks when ready. cchat wait is the natural Phase 2 follow-up for scripted automation.
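The `cchat wait` row in the table can be sketched as a polling loop. This is a minimal sketch, assuming the gateway call is wrapped in a `fetch_status` callable that returns a dict such as `{"status": "complete", "detail": "..."}` or `None` when nothing matches. The function name and the injectable `sleep` and `clock` parameters are hypothetical and exist only to keep the example testable without a network.

```python
import time
from typing import Callable, Optional

TERMINAL = {"complete", "fail"}


def wait_for_result(
    fetch_status: Callable[[str], Optional[dict]],
    request_id: str,
    interval_s: float = 3.0,      # poll interval from the table
    timeout_s: float = 600.0,     # default timeout of 10 minutes
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Poll until a terminal status appears; return None on timeout."""
    deadline = clock() + timeout_s
    while True:
        result = fetch_status(request_id)
        if result is not None and result.get("status") in TERMINAL:
            return result
        if clock() >= deadline:
            return None
        sleep(interval_s)


# Simulated gateway: nothing yet, then ack, then complete.
responses = iter([None, {"status": "ack"}, {"status": "complete", "detail": "db created"}])
result = wait_for_result(lambda rid: next(responses), "req-a1b2c3", sleep=lambda s: None)
print(result)
```

An `ack` does not end the wait; only `complete` or `fail` does, which matches the table's description.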

Architecture: Visit-Rooms Model

Each agent's Matrix room serves as its inbox. To delegate work, the sender joins and posts in the target's room.

Three layers take part: the cchat CLI, the Matrix rooms, and the Docker containers.

Request path:
- cchat send PUTs a request into #aiagentchat-claude-administrator through the Matrix API.
- The room timeline accumulates [request] req-a1b2c3, then [status: ack], then [status: complete].
- In the claude-administrator container, sync_loop polls /sync every 30s. coordination_loop sends a fast ack (1-2s) and enqueues the LLM work. reply_loop runs LLM inference and posts complete.
- cchat status sends GET status to the gateway (gateway:8870) and receives {complete}.

Cross-notification path:
- The admin agent joins the dev room #aiagentchat-dev-myapp and posts a notice, for example [status: notice] "DNS changed for myapp".
- In the dev-myapp container, sync_loop picks up the notice event.

Protocol: 2 Event Types

com.aiagentchat.request

Delegation request. Contains: request_id, from_agent, to_agent, trace_id, body, gitlab_ref
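As an illustration, the request content could be modeled as a small dataclass. The field names come from the list above. The field types, the optional `gitlab_ref`, the `trace_id` value, and the class name are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional

REQUEST_EVENT = "com.aiagentchat.request"


@dataclass
class DelegationRequest:
    request_id: str
    from_agent: str
    to_agent: str
    trace_id: str                      # carried as metadata only, not enforced
    body: str                          # human-readable text of the request
    gitlab_ref: Optional[str] = None   # assumed optional

    def to_content(self) -> dict:
        """Return the dict to send as the Matrix event content."""
        return asdict(self)


request = DelegationRequest(
    request_id="req-a1b2c3",
    from_agent="cli-admin",
    to_agent="claude-administrator",
    trace_id="trace-0001",             # hypothetical value
    body="create postgres db for myapp",
)
print(REQUEST_EVENT, request.to_content())
```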

com.aiagentchat.status

Status update with status field: ack | complete | fail | notice

Contains: request_id, from_agent, to_agent, status, detail, body
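The buffer scan behind `GET /delegation-status` might look like the sketch below. It assumes the MessageBuffer can be iterated oldest-first as raw Matrix event dicts with `type` and `content` keys. The real buffer structure is not described here, so `latest_status` and the sample events are hypothetical.

```python
from typing import Iterable, Optional

STATUS_EVENT = "com.aiagentchat.status"
VALID_STATUSES = {"ack", "complete", "fail", "notice"}


def latest_status(events: Iterable[dict], request_id: str) -> Optional[dict]:
    """Return the most recent valid status for request_id, or None."""
    found = None
    for event in events:  # oldest first, so later matches overwrite earlier ones
        if event.get("type") != STATUS_EVENT:
            continue
        content = event.get("content", {})
        if content.get("request_id") != request_id:
            continue
        if content.get("status") not in VALID_STATUSES:
            continue
        found = {"status": content["status"], "detail": content.get("detail", "")}
    return found


buffer = [
    {"type": "com.aiagentchat.request", "content": {"request_id": "req-a1b2c3"}},
    {"type": STATUS_EVENT, "content": {"request_id": "req-a1b2c3", "status": "ack"}},
    {"type": "m.room.message", "content": {"body": "unrelated chat"}},
    {"type": STATUS_EVENT,
     "content": {"request_id": "req-a1b2c3", "status": "complete", "detail": "db created"}},
]
print(latest_status(buffer, "req-a1b2c3"))
```

Matching on the custom event type first is what keeps ordinary chat messages from being misread as status updates.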

com.aiagentchat.role (State Event)

Published once at startup. Contains: role, capabilities[], projects[], instance_name. The only state event written by coordination.
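A sketch of the role state event content follows. The four role names are the ones in the Agent Roles table. Using `instance_name` as the Matrix state key, and the capability and project values shown, are assumptions for illustration only.

```python
from typing import List, Tuple

ROLE_EVENT = "com.aiagentchat.role"
KNOWN_ROLES = {"admin", "developer", "specialist", "cli"}


def build_role_state(
    role: str, capabilities: List[str], projects: List[str], instance_name: str
) -> Tuple[str, dict]:
    """Return (state_key, content) for the role state event."""
    if role not in KNOWN_ROLES:
        raise ValueError(f"unknown role: {role}")
    content = {
        "role": role,
        "capabilities": list(capabilities),
        "projects": list(projects),
        "instance_name": instance_name,
    }
    # Assumption: keying by instance_name keeps one role entry per agent.
    return instance_name, content


state_key, content = build_role_state(
    "admin", ["docker", "deploy"], ["myapp"], "claude-administrator"
)
print(ROLE_EVENT, state_key, content)
```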

Agent Roles

Role	Purpose	Container Access
`admin`	Infrastructure management	Docker socket, secrets, deploy scripts
`developer`	Application code and CI/CD	Project source, build tools, GitLab API
`specialist`	Domain-specific tasks	Domain-specific tools only
`cli`	Human-initiated delegation	Read-only observer

Agent Internal Flow (3 Threads)

1. sync_loop (Thread 1) polls /sync every 30s. It receives com.aiagentchat.request in the room timeline and enqueues it to _coordination_queue.
2. coordination_loop (Thread 5) is fast and non-blocking. It posts status: ack to the room within 1-2s ("Working on: create postgres db for myapp"), then enqueues the LLM work to _reply_queue.
3. reply_loop (Thread 4) is slow because it runs LLM inference. Claude generates a response (10-30s), the agent executes the task autonomously, and the result is posted via coordination.
4. The outcome is either status: complete or status: fail in the room. Both land in the room timeline for cchat status to find.

A dedicated coordination thread prevents LLM inference (10-30s) from blocking acknowledgments (1-2s).
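The split between the fast and slow threads can be sketched with two queues. This is a simplified model, not the daemon's code: a plain list stands in for the room timeline, `run_llm` is a hypothetical callable in place of the Claude SDK session, and a sentinel object is used to shut the threads down.

```python
import queue
import threading

coordination_queue = queue.Queue()
reply_queue = queue.Queue()
timeline = []                      # stands in for the Matrix room timeline
timeline_lock = threading.Lock()
STOP = object()                    # sentinel that ends both loops


def post_status(request_id, status, detail):
    with timeline_lock:
        timeline.append((request_id, status, detail))


def coordination_loop():
    """Fast path: acknowledge immediately, then hand off the slow work."""
    while True:
        req = coordination_queue.get()
        if req is STOP:
            reply_queue.put(STOP)
            return
        post_status(req["request_id"], "ack", f"Working on: {req['body']}")
        reply_queue.put(req)


def reply_loop(run_llm):
    """Slow path: run inference and post complete or fail."""
    while True:
        req = reply_queue.get()
        if req is STOP:
            return
        try:
            post_status(req["request_id"], "complete", run_llm(req["body"]))
        except Exception as exc:
            post_status(req["request_id"], "fail", str(exc))


def run_demo():
    threads = [
        threading.Thread(target=coordination_loop),
        threading.Thread(target=reply_loop, args=(lambda body: f"done: {body}",)),
    ]
    for t in threads:
        t.start()
    coordination_queue.put({"request_id": "req-a1b2c3", "body": "create postgres db for myapp"})
    coordination_queue.put(STOP)
    for t in threads:
        t.join()
    return list(timeline)


events = run_demo()
print(events)
```

Because the ack is posted before the request reaches `reply_queue`, the ack always precedes the terminal status for the same request.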

Agent-to-Agent: claude-administrator asks claude-websurfinmurf

No human is involved. The admin agent needs information about a developer's application. Both agents are persistent daemons running in Docker containers, but their "brains" (Claude SDK sessions) are ephemeral: spawned fresh per message, then gone.

Key Insight: The Daemon Is Persistent, the Brain Is Not

Each agent container runs a persistent daemon (sync_loop, reply_loop, etc.) but each LLM invocation is a stateless SDK session that processes one message and exits. The daemon is the "nervous system" that's always listening. The SDK session is the "thought" that fires and completes.

Step-by-Step Flow

#	Action	IT Asset
1	Something triggers claude-administrator to think (e.g., a human asked it to "check all apps are healthy" or another agent notified it of a change). The daemon's `reply_loop` spawns a fresh Claude SDK session.	claude-administrator daemon (Python, persistent); Claude SDK session (ephemeral, via claude-administrator:3008)
2	The SDK session decides it needs info from the developer agent. It runs `cchat send developer "what version is myapp running?"` via its Bash tool. cchat resolves "developer" to `#aiagentchat-websurfinmurf` via Space API.	cchat CLI (Bash, inside claude-administrator container); Matrix Space API (room discovery)
3	cchat PUTs a `com.aiagentchat.request` event into websurfinmurf's room. Returns `req-a1b2c3 sent`. The SDK session sees this, includes it in its response, and exits. The admin daemon posts the response to its own room and goes idle.	Matrix API (PUT event to room); #aiagentchat-websurfinmurf (room timeline)
4	websurfinmurf's `sync_loop` picks up the request event on its next /sync poll (within 30s). `coordination_loop` posts `status: ack`. The request is enqueued to `_reply_queue`.	websurfinmurf daemon (sync_loop, coordination_loop); Synapse homeserver (/sync long-poll)
5	websurfinmurf's `reply_loop` spawns a fresh SDK session. The LLM does the work (checks app version, reads configs, etc.) and returns the answer.	websurfinmurf daemon (reply_loop); Claude SDK session (ephemeral)
6	websurfinmurf's daemon posts `status: complete` with the answer into its own room (`#aiagentchat-websurfinmurf`), targeting `to_agent: claude-administrator`.	websurfinmurf daemon (coordination_loop); #aiagentchat-websurfinmurf (room timeline)
7	The gap. claude-administrator's SDK session is long gone. But the admin daemon is still running and still joined to websurfinmurf's room (it joined in step 3). Its `sync_loop` sees the `status: complete` event.	claude-administrator daemon (sync_loop); Synapse homeserver (/sync delivers the event)
8	Daemon closes the loop. `coordination_loop` matches the request_id to its outbound delegation tracker (in-memory map). It synthesizes an internal message: "Delegation result from websurfinmurf for req-a1b2c3: myapp is on v2.3.1". This gets enqueued to `_reply_queue`.	claude-administrator daemon (coordination_loop); Outbound delegation map (in-memory, keyed by request_id)
9	claude-administrator's `reply_loop` spawns a new SDK session with the delegation result as context. The LLM sees the answer and decides what to do next: post a summary, take further action, or notify the original requestor.	claude-administrator daemon (reply_loop); Claude SDK session (new, ephemeral); #aiagentchat-claude-administrator (posts result)

New Component: Outbound Delegation Tracker

The daemon needs an in-memory map of outbound delegations: {request_id: {target_room, original_context, timestamp}}. When coordination_loop sees a matching status: complete/fail from a foreign room, it re-injects the result into the reply pipeline. This is the bridge between the ephemeral SDK sessions — the daemon's persistent state carries context across the gap.
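A minimal sketch of such a tracker is shown below. The map shape follows the description above. The class and method names, the use of the event's `room_id` to check the foreign room, and the sample room ID are assumptions.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional

STATUS_EVENT = "com.aiagentchat.status"


@dataclass
class OutboundDelegation:
    target_room: str
    original_context: str
    timestamp: float


class DelegationTracker:
    """In-memory map of outbound delegations, keyed by request_id."""

    def __init__(self) -> None:
        self._pending: Dict[str, OutboundDelegation] = {}

    def track(self, request_id: str, target_room: str, original_context: str) -> None:
        self._pending[request_id] = OutboundDelegation(target_room, original_context, time.time())

    def resolve(self, event: dict) -> Optional[str]:
        """Return the internal message to re-inject, or None if the event is not ours."""
        if event.get("type") != STATUS_EVENT:
            return None
        content = event.get("content", {})
        if content.get("status") not in ("complete", "fail"):
            return None
        request_id = content.get("request_id")
        entry = self._pending.get(request_id)
        if entry is None or event.get("room_id") != entry.target_room:
            return None
        del self._pending[request_id]  # each delegation is resolved once
        return (
            f"Delegation result from {content.get('from_agent')} "
            f"for {request_id}: {content.get('detail', '')}"
        )


tracker = DelegationTracker()
tracker.track("req-a1b2c3", "!dev:example.org", "check all apps are healthy")
event = {
    "type": STATUS_EVENT,
    "room_id": "!dev:example.org",   # hypothetical ID of websurfinmurf's room
    "content": {"request_id": "req-a1b2c3", "from_agent": "websurfinmurf",
                "status": "complete", "detail": "myapp is on v2.3.1"},
}
message = tracker.resolve(event)
print(message)
```

Removing the entry on resolution means a replayed status event does not spawn a second SDK session for the same result.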

Simple View

Step	What Happens	Component
1	Admin LLM decides to delegate	claude-administrator SDK session (ephemeral)
2	cchat posts request to dev's room	cchat CLI → Matrix API
3	Request lands in dev's room timeline	#aiagentchat-websurfinmurf (Matrix room)
4	Dev daemon picks up request via /sync	websurfinmurf sync_loop (persistent daemon)
5	Dev LLM does the work	websurfinmurf SDK session (ephemeral)
6	Dev daemon posts complete to its room	websurfinmurf coordination_loop → Matrix API
7	Admin daemon sees complete via /sync	claude-administrator sync_loop (persistent daemon)
8	Admin daemon matches request_id, injects result	claude-administrator coordination_loop + delegation tracker
9	Admin LLM processes result, acts on it	claude-administrator SDK session (new, ephemeral)

Implementation Phases

Phase 1: Foundation

~2 days

coordination.py, config extensions, Matrix client methods, role publishing, sync filter, unit tests

Phase 2: CLI Delegation + Agent Processing

~2 days

cchat send/status, coordination thread, REQUEST pipeline, offline detection, foreign request handling

Phase 3: Dev-Admin Coordination

~1 day

Agent-to-agent delegation, cross-notifications, project-based targeting

Phase 4: GitLab Reference + Polish

~0.5 day

GL#project/123 parsing, reference passthrough in status events
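Parsing the `GL#project/123` form could be done with a regular expression, as in the sketch below. The exact grammar is not specified here, so the assumptions are that the project part may contain nested group paths and that the trailing number is an issue number. The function name is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed grammar: GL#<project path>/<number>, where the path may contain slashes.
GITLAB_REF = re.compile(r"GL#([\w.-]+(?:/[\w.-]+)*)/(\d+)")


def parse_gitlab_refs(text: str) -> List[Tuple[str, int]]:
    """Return (project_path, number) pairs for every reference found in text."""
    return [(m.group(1), int(m.group(2))) for m in GITLAB_REF.finditer(text)]


refs = parse_gitlab_refs("Fix per GL#myapp/42 and GL#infra/dns/7.")
print(refs)
```

The parsed reference would then be passed through unchanged in status events, since the completing agent only references issues and does not modify GitLab.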

Key Design Decisions

Decision	Choice	Rationale
Room model	Visit-rooms	Traceability, clarity, minimal contamination
Event types	2 custom types	Prevents false positives; Element-compatible via body field
Delegation tracking	In-memory buffer scan	No state bloat; survives restart via Matrix re-sync
Offline handling	Fail fast (120s)	Immediate feedback; no queuing
Security model	Container isolation	No agent-level ACL needed
CLI behavior	Non-blocking send	Human continues working; polls with cchat status
GitLab updates	Requestor-responsible	Completing agent references issues, doesn't modify GitLab

Risk Assessment

Risk	Likelihood	Mitigation
Custom events don't sync	Low	Test with Synapse first; fallback to m.room.message
Buffer insufficient for tracking	Low	200 msgs holds ~50 delegations; monitor in production
Sync latency (30s)	Medium	Acceptable for async; ack confirms within next sync
Agent crash mid-task	Low	Client-side TTL marks stale; messages persist in Matrix

Solution Artifacts

Final Solution (v2); Refactoring Plan; Program Ask; Claude Analysis; Gemini Review; Codex Review

Gemini Critique; Codex Critique; Claude Self-Critique; v1 Archive