Role-based task delegation for aiagentchat v2.0
The solution was reviewed by Gemini, Codex, and a Claude self-critique. Key finding: v1 was over-engineered (complexity 7/10); v2 reduces this to 4/10.
| Aspect | v1 | v2 |
|---|---|---|
| Event types | 5 custom types | 2 (request + status) |
| State events | 2 (role + delegation index) | 1 (role only) |
| Delegation tracking | Matrix state events | In-memory buffer scan |
| Cycle detection | Enforcement machinery | trace_id metadata only |
| CLI commands | 4 new (send, status, wait, read) | 2 new (send, status) |
| New LOC | ~500 | ~350 |
Complete flow from human command to agent response, showing every component involved.
The CLI user sends a non-blocking request and needs to know when the agent finishes. Three approaches, from simplest to richest:
| Approach | How It Works | Complexity |
|---|---|---|
| cchat status polling | User manually runs cchat status req-xxx. Calls GET /delegation-status on the target agent's gateway, which scans its MessageBuffer for status events matching the request_id. | Low (v1) |
| cchat wait with timeout | CLI polls the gateway in a loop (every 3s) until complete or fail status appears, or timeout (default 10min). Prints result and exits. Deferred to Phase 2. | Medium |
| Gateway SSE endpoint | New GET /delegation-stream?id=req-xxx on the agent gateway. Server-sent events push status changes as they happen. CLI opens connection after send and prints updates live. Future enhancement. | Higher |
v1 recommendation: cchat status is sufficient — the human is doing other work and checks when ready. cchat wait is the natural Phase 2 follow-up for scripted automation.
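The cchat wait behavior described in the table (poll every 3s, stop on complete or fail, give up after 10 minutes) could look roughly like the sketch below. It is only an illustration: `wait_for_delegation` and the `fetch_status` callable are hypothetical names, and `fetch_status` stands in for whatever call the CLI makes to GET /delegation-status. The sleep and clock functions are injectable so the loop can be exercised without real delays.

```python
import time
from typing import Callable, Optional

# Statuses that end the wait, per the approach table (complete or fail).
TERMINAL_STATUSES = {"complete", "fail"}


def wait_for_delegation(
    request_id: str,
    fetch_status: Callable[[str], Optional[dict]],  # hypothetical gateway call
    poll_interval: float = 3.0,   # "every 3s"
    timeout: float = 600.0,       # "default 10min"
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Poll until a terminal status event appears or the timeout elapses.

    Returns the terminal status event, or None if the timeout was reached.
    """
    deadline = clock() + timeout
    while True:
        event = fetch_status(request_id)
        if event is not None and event.get("status") in TERMINAL_STATUSES:
            return event
        if clock() >= deadline:
            return None
        sleep(poll_interval)
```

An ack or notice status does not end the wait in this sketch; only complete and fail do.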
Each agent's Matrix room serves as its inbox. To delegate work, the sender joins and posts in the target's room.
Delegation request. Contains: request_id, from_agent, to_agent, trace_id, body, gitlab_ref
Status update. The status field is one of: ack | complete | fail | notice. Contains: request_id, from_agent, to_agent, status, detail, body
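To make the two payloads concrete, here is a minimal sketch of how the event content might be assembled. The field names come from the lists above; the helper names (`build_request`, `build_status`, `new_request_id`), the way IDs are generated, and the defaults for optional fields are assumptions for illustration only.

```python
import uuid
from typing import Optional

# Allowed values of the status field, as listed above.
VALID_STATUSES = {"ack", "complete", "fail", "notice"}


def new_request_id() -> str:
    # Assumption: IDs look like "req-a1b2c3" (prefix plus six hex characters).
    return "req-" + uuid.uuid4().hex[:6]


def build_request(
    from_agent: str,
    to_agent: str,
    body: str,
    trace_id: Optional[str] = None,
    gitlab_ref: Optional[str] = None,
    request_id: Optional[str] = None,
) -> dict:
    """Content for a delegation request event."""
    return {
        "request_id": request_id or new_request_id(),
        "from_agent": from_agent,
        "to_agent": to_agent,
        # Assumption: a fresh trace_id is minted when the caller passes none.
        "trace_id": trace_id or uuid.uuid4().hex,
        "body": body,
        "gitlab_ref": gitlab_ref,
    }


def build_status(
    request_id: str,
    from_agent: str,
    to_agent: str,
    status: str,
    detail: str = "",
    body: str = "",
) -> dict:
    """Content for a status event; rejects statuses outside the allowed set."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return {
        "request_id": request_id,
        "from_agent": from_agent,
        "to_agent": to_agent,
        "status": status,
        "detail": detail,
        "body": body,
    }
```

Passing an existing trace_id through `build_request` when one agent delegates onward keeps the whole chain under one trace, which fits the "trace_id metadata only" approach to cycle detection.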
Published once at startup. Contains: role, capabilities[], projects[], instance_name. The only state event written by coordination.
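As a rough illustration, the role state content could be built and queried as follows. The four fields are the ones listed above and the role names come from the role table; the function names and the example capability and project values in the usage note are hypothetical.

```python
from typing import Iterable, List

# Role names taken from the role table in this document.
KNOWN_ROLES = ("admin", "developer", "specialist", "cli")


def build_role_state(
    role: str,
    capabilities: Iterable[str],
    projects: Iterable[str],
    instance_name: str,
) -> dict:
    """Content for the role state event an agent publishes at startup."""
    if role not in KNOWN_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    return {
        "role": role,
        "capabilities": list(capabilities),
        "projects": list(projects),
        "instance_name": instance_name,
    }


def instances_with_role(states: Iterable[dict], role: str) -> List[str]:
    """Names of all instances whose published role matches."""
    return [s["instance_name"] for s in states if s["role"] == role]
```

For example, `build_role_state("developer", ["build"], ["myapp"], "websurfinmurf")` would describe the developer agent used in the walkthrough, with placeholder capability and project values.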
| Role | Purpose | Container Access |
|---|---|---|
| admin | Infrastructure management | Docker socket, secrets, deploy scripts |
| developer | Application code and CI/CD | Project source, build tools, GitLab API |
| specialist | Domain-specific tasks | Domain-specific tools only |
| cli | Human-initiated delegation | Read-only observer |
Dedicated coordination thread prevents LLM inference (10-30s) from blocking acknowledgments (1-2s).
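The separation can be sketched with two threads and two queues. This is a simplified model, not the daemon's actual code: `run_demo`, the `inbound` queue, and the `posted` list are illustrative stand-ins, and a short sleep replaces the 10-30s LLM call. The coordination thread acknowledges each request and hands it on, so a slow reply never delays the next ack.

```python
import queue
import threading
import time


def run_demo(request_ids, inference_seconds=0.05):
    """Run a coordination thread and a reply thread; return posted statuses in order."""
    inbound = queue.Queue()      # requests as delivered by sync (stand-in)
    reply_queue = queue.Queue()  # work waiting for the slow reply loop
    posted = []                  # (request_id, status) in posting order
    lock = threading.Lock()

    def post(request_id, status):
        with lock:
            posted.append((request_id, status))

    def coordination_loop():
        while True:
            req = inbound.get()
            if req is None:           # shutdown sentinel
                reply_queue.put(None)
                return
            post(req, "ack")          # fast path: acknowledge immediately
            reply_queue.put(req)      # then hand off to the slow path

    def reply_loop():
        while True:
            req = reply_queue.get()
            if req is None:
                return
            time.sleep(inference_seconds)  # stand-in for LLM inference
            post(req, "complete")

    threads = [
        threading.Thread(target=coordination_loop),
        threading.Thread(target=reply_loop),
    ]
    for t in threads:
        t.start()
    for req in request_ids:
        inbound.put(req)
    inbound.put(None)
    for t in threads:
        t.join()
    return posted
```

Each request is always acknowledged before it completes, and with several requests queued the acks typically all land before the first complete, because the coordination thread never waits on inference.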
No human is involved. The admin agent needs information about a developer's application. Both agents are persistent daemons running in Docker containers, but their "brains" (Claude SDK sessions) are ephemeral: spawned fresh per message, then gone.
Each agent container runs a persistent daemon (sync_loop, reply_loop, etc.) but each LLM invocation is a stateless SDK session that processes one message and exits. The daemon is the "nervous system" that's always listening. The SDK session is the "thought" that fires and completes.
| # | Action | IT Asset |
|---|---|---|
| 1 | Something triggers claude-administrator to think (e.g., a human asked it to "check all apps are healthy" or another agent notified it of a change). The daemon's reply_loop spawns a fresh Claude SDK session. | claude-administrator daemon (Python, persistent); Claude SDK session (ephemeral, via claude-administrator:3008) |
| 2 | The SDK session decides it needs info from the developer agent. It runs cchat send developer "what version is myapp running?" via its Bash tool. cchat resolves "developer" to #aiagentchat-websurfinmurf via Space API. | cchat CLI (Bash, inside claude-administrator container); Matrix Space API (room discovery) |
| 3 | cchat PUTs a com.aiagentchat.request event into websurfinmurf's room. Returns req-a1b2c3 sent. The SDK session sees this, includes it in its response, and exits. The admin daemon posts the response to its own room and goes idle. | Matrix API (PUT event to room); #aiagentchat-websurfinmurf (room timeline) |
| 4 | websurfinmurf's sync_loop picks up the request event on its next /sync poll (within 30s). coordination_loop posts status: ack. The request is enqueued to _reply_queue. | websurfinmurf daemon (sync_loop, coordination_loop); Synapse homeserver (/sync long-poll) |
| 5 | websurfinmurf's reply_loop spawns a fresh SDK session. The LLM does the work (checks app version, reads configs, etc.) and returns the answer. | websurfinmurf daemon (reply_loop); Claude SDK session (ephemeral) |
| 6 | websurfinmurf's daemon posts status: complete with the answer into its own room (#aiagentchat-websurfinmurf), targeting to_agent: claude-administrator. | websurfinmurf daemon (coordination_loop); #aiagentchat-websurfinmurf (room timeline) |
| 7 | The gap. claude-administrator's SDK session is long gone. But the admin daemon is still running and still joined to websurfinmurf's room (it joined in step 3). Its sync_loop sees the status: complete event. | claude-administrator daemon (sync_loop); Synapse homeserver (/sync delivers the event) |
| 8 | Daemon closes the loop. coordination_loop matches the request_id to its outbound delegation tracker (in-memory map). It synthesizes an internal message: "Delegation result from websurfinmurf for req-a1b2c3: myapp is on v2.3.1". This gets enqueued to _reply_queue. | claude-administrator daemon (coordination_loop); Outbound delegation map (in-memory, keyed by request_id) |
| 9 | claude-administrator's reply_loop spawns a new SDK session with the delegation result as context. The LLM sees the answer and decides what to do next: post a summary, take further action, or notify the original requestor. | claude-administrator daemon (reply_loop); Claude SDK session (new, ephemeral); #aiagentchat-claude-administrator (posts result) |
The daemon needs an in-memory map of outbound delegations: {request_id: {target_room, original_context, timestamp}}. When coordination_loop sees a matching status: complete/fail from a foreign room, it re-injects the result into the reply pipeline. This is the bridge between the ephemeral SDK sessions — the daemon's persistent state carries context across the gap.
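A minimal sketch of such a map is shown below, assuming the entry fields given above (target_room, original_context, timestamp). The class and method names are hypothetical, and the wording of the failure message is an assumption; only the "Delegation result from ..." format appears in the walkthrough.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OutboundDelegation:
    target_room: str
    original_context: str
    timestamp: float


class DelegationTracker:
    """In-memory map of outbound delegations, keyed by request_id."""

    def __init__(self) -> None:
        self._pending: Dict[str, OutboundDelegation] = {}

    def track(self, request_id: str, target_room: str,
              original_context: str, now: Optional[float] = None) -> None:
        """Record a delegation when the request is sent."""
        stamp = time.time() if now is None else now
        self._pending[request_id] = OutboundDelegation(
            target_room, original_context, stamp
        )

    def pending_count(self) -> int:
        return len(self._pending)

    def on_status_event(self, event: dict, room: str) -> Optional[str]:
        """Return a message to enqueue for the reply loop, or None.

        Only complete/fail events that match a tracked request_id and arrive
        from the room the request was sent to are re-injected.
        """
        status = event.get("status")
        if status not in ("complete", "fail"):
            return None
        request_id = event.get("request_id")
        entry = self._pending.get(request_id)
        if entry is None or entry.target_room != room:
            return None
        del self._pending[request_id]  # each delegation is resolved once
        label = "result" if status == "complete" else "failure"
        sender = event.get("from_agent", "unknown")
        return f"Delegation {label} from {sender} for {request_id}: {event.get('body', '')}"
```

In this sketch the returned string is what coordination_loop would enqueue to _reply_queue in step 8; ack and notice events leave the entry in place.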
| Decision | Choice | Rationale |
|---|---|---|
| Room model | Visit-rooms | Traceability, clarity, minimal contamination |
| Event types | 2 custom types | Prevents false positives; Element-compatible via body field |
| Delegation tracking | In-memory buffer scan | No state bloat; survives restart via Matrix re-sync |
| Offline handling | Fail fast (120s) | Immediate feedback; no queuing |
| Security model | Container isolation | No agent-level ACL needed |
| CLI behavior | Non-blocking send | Human continues working; polls with cchat status |
| GitLab updates | Requestor-responsible | Completing agent references issues, doesn't modify GitLab |
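The "in-memory buffer scan" choice can be illustrated with a bounded buffer and a newest-first search. This is a sketch under assumptions: the real MessageBuffer's interface is not shown in this document, so the methods here are hypothetical, and the default capacity of 200 is borrowed from the risk table.

```python
from collections import deque
from typing import Optional


class MessageBuffer:
    """Bounded in-memory event buffer; the oldest events fall off when full."""

    def __init__(self, capacity: int = 200) -> None:
        self._events = deque(maxlen=capacity)

    def append(self, event: dict) -> None:
        self._events.append(event)

    def latest_status(self, request_id: str) -> Optional[dict]:
        """Scan newest-first for a status event matching request_id."""
        for event in reversed(self._events):
            if event.get("request_id") == request_id and "status" in event:
                return event
        return None
```

Because nothing is written to room state, a delegation's status is only visible while its events are still in the buffer; once they are evicted, `latest_status` returns None.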
| Risk | Likelihood | Mitigation |
|---|---|---|
| Custom events don't sync | Low | Test with Synapse first; fallback to m.room.message |
| Buffer insufficient for tracking | Low | 200 msgs holds ~50 delegations; monitor in production |
| Sync latency (30s) | Medium | Acceptable for async; ack confirms within next sync |
| Agent crash mid-task | Low | Client-side TTL marks stale; messages persist in Matrix |