AgentSquad Runtime Operations
This runbook is the single source of truth (SSOT) for driving and observing AgentSquad runs against the live runtime. It was written from the live API so the operational path can be found without guessing.
Runtime access
- API base: http://10.10.10.47:18088 (edge → fractalops-api).
- Auth: edge Pomerium claim headers, NOT a bearer token:

      x-pomerium-claim-email: <you>@yamon.io
      x-pomerium-claim-groups: /org/platform/roles/super_admin

- Health: GET /healthz → {"status":"ok"}. GET /openapi.json is the authoritative endpoint list; grep it, do not guess paths.
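A minimal sketch of an authenticated request, assuming Python's standard urllib and that the edge accepts the two claim headers as ordinary request headers. The helper names (claim_headers, build_request, healthz) are hypothetical; only the base URL, the header names, and /healthz come from this runbook.

```python
import json
import urllib.request

API_BASE = "http://10.10.10.47:18088"  # edge -> fractalops-api, from this runbook


def claim_headers(email: str, groups: str = "/org/platform/roles/super_admin") -> dict:
    """Return the Pomerium claim headers; no bearer token is involved."""
    return {
        "x-pomerium-claim-email": email,
        "x-pomerium-claim-groups": groups,
    }


def build_request(path: str, email: str, method: str = "GET") -> urllib.request.Request:
    """Build (but do not send) a request against the runtime API."""
    return urllib.request.Request(
        API_BASE + path, headers=claim_headers(email), method=method
    )


def healthz(email: str, timeout: float = 5.0) -> dict:
    """Call GET /healthz and decode the JSON body."""
    with urllib.request.urlopen(build_request("/healthz", email), timeout=timeout) as resp:
        return json.load(resp)
```

build_request is kept separate from the network call so the headers can be inspected before anything is sent.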
Studio run API (admin)
Prefix /v1/admin/studio (NOT /v1/studio, which 404s):
| Action | Endpoint |
|---|---|
| Launch a run | POST /v1/admin/studio/runs |
| Get run | GET /v1/admin/studio/runs/{run_id}?tenant_id=default |
| Execute (tick agents) | POST /v1/admin/studio/runs/{run_id}/execute?tenant_id=default |
| Session state (observe) | GET /v1/admin/studio/runs/{run_id}/session-state?tenant_id=default |
| Reports (per tick) | GET /v1/admin/studio/runs/{run_id}/reports?tenant_id=default (returns a LIST) |
| Handoff trail | GET /v1/admin/studio/runs/{run_id}/handoff-trail?tenant_id=default |
| Cleanup / control / replay | POST .../runs/{run_id}/{cleanup,control,replay} |
| Templates | GET /v1/admin/studio/templates |
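The per-run rows of the table can be captured in a small path builder. This is a sketch, not the API's client: run_endpoint and RUN_ACTIONS are hypothetical names, and appending tenant_id to cleanup, control, and replay is an assumption, since the table elides those paths.

```python
from urllib.parse import quote, urlencode

STUDIO_PREFIX = "/v1/admin/studio"  # NOT /v1/studio

# Sub-resource -> HTTP method, taken from the table; "" is the run itself.
RUN_ACTIONS = {
    "": "GET",
    "execute": "POST",
    "session-state": "GET",
    "reports": "GET",
    "handoff-trail": "GET",
    "cleanup": "POST",   # tenant_id on these three is an assumption
    "control": "POST",
    "replay": "POST",
}


def run_endpoint(run_id: str, action: str = "", tenant_id: str = "default") -> tuple:
    """Return (method, path-with-query) for one run-scoped endpoint."""
    if action not in RUN_ACTIONS:
        raise ValueError(f"unknown run action: {action!r}")
    path = f"{STUDIO_PREFIX}/runs/{quote(run_id, safe='')}"
    if action:
        path += f"/{action}"
    return RUN_ACTIONS[action], f"{path}?{urlencode({'tenant_id': tenant_id})}"
```

Rejecting unknown actions up front follows the runbook's advice not to guess paths; /openapi.json remains the authoritative list.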
Templates
ouroboros, pm-ouroboros, agentsquad (project delivery / migration), research-team.
Launch body (StudioRunIn)
    {
      "template_id": "agentsquad",
      "project_slug": "sanmopia-modernization",
      "runtime_kind": "project_delivery",
      "work_scope": "<the task>",
      "next_goal": "<first concrete step>"
    }

- Project slugs are exact. Sanmopia’s slug is sanmopia-modernization, NOT sanmopia; a wrong slug makes the launch hang on provisioning (HTTP 000 / timeout). Confirm the slug from a prior run’s project_slug before launching.
- Launch returns {run_id, status: "ready", agent_roster: [...]} in ~6s. The roster starts as the intake pair (planner + curator); more agents join by handoff.
- execute returns fast with status: "scheduled"; the tick runs ASYNC, so poll session-state to watch progress.
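Because a wrong slug hangs the launch rather than failing fast, it can help to validate the body before sending it. The sketch below assumes a locally maintained allow-list; KNOWN_SLUGS and build_launch_body are hypothetical names, and the list should be filled from slugs seen on prior runs.

```python
import json

# Hypothetical allow-list; populate it from the project_slug of prior runs.
KNOWN_SLUGS = {"sanmopia-modernization"}


def build_launch_body(
    project_slug: str,
    work_scope: str,
    next_goal: str,
    template_id: str = "agentsquad",
    runtime_kind: str = "project_delivery",
) -> dict:
    """Build a StudioRunIn-shaped body, refusing slugs that were never seen."""
    if project_slug not in KNOWN_SLUGS:
        raise ValueError(
            f"unknown project_slug {project_slug!r}; a wrong slug hangs the launch"
        )
    return {
        "template_id": template_id,
        "project_slug": project_slug,
        "runtime_kind": runtime_kind,
        "work_scope": work_scope,
        "next_goal": next_goal,
    }


if __name__ == "__main__":
    body = build_launch_body("sanmopia-modernization", "<the task>", "<first concrete step>")
    print(json.dumps(body, indent=2))  # payload for POST /v1/admin/studio/runs
```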
Agent identity model (as seen on the roster)
Each roster agent carries:
- user_principal_name / subject_key = <role>@yamon.io (e.g. planner@yamon.io), identity_binding_mode: named_agent_microsoft, secret_lease_ref: named-agent/<role>. This is the msgraph-bound named-agent UPN.
- agent_nickname = an auto-generated handle (e.g. dangerous-vector), deterministic per (run, agent, role). No manual assignment.
- The ContextForge gateway token identity is separate: <projcode>.<nickname>@<domain> (so the gateway injects a spoof-proof X-Authenticated-User to upstream MCP servers; AMA binds the project from it).
- skill_ids / mcp_server_ids are EMPTY in the roster snapshot; the loadout resolves them at session launch, not in the roster view.
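The two identities described above can be kept apart with a small value type. This is an illustrative sketch only: RosterIdentity is a hypothetical name, and the projcode and domain values passed in the usage line are placeholders, not real values from the runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RosterIdentity:
    role: str       # e.g. "planner"
    nickname: str   # auto-generated handle, e.g. "dangerous-vector"

    @property
    def upn(self) -> str:
        """msgraph-bound named-agent UPN: <role>@yamon.io."""
        return f"{self.role}@yamon.io"

    @property
    def secret_lease_ref(self) -> str:
        """Lease reference: named-agent/<role>."""
        return f"named-agent/{self.role}"

    def gateway_identity(self, projcode: str, domain: str) -> str:
        """Separate ContextForge identity: <projcode>.<nickname>@<domain>."""
        return f"{projcode}.{self.nickname}@{domain}"


if __name__ == "__main__":
    agent = RosterIdentity(role="planner", nickname="dangerous-vector")
    print(agent.upn, agent.gateway_identity("proj", "example.test"))
```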
Observing a run
Poll session-state; each session reports live_status, current_step,
blocker_code, session_id, claude_session_id, tool_call_count.
Lifecycle: ready → waiting_first_tool (Claude session warming) → first tool
call → working → session_completed. A run is blocked if any agent is blocked.
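The polling rule can be sketched as a loop over session snapshots. Two assumptions are made here and are not confirmed by the API: the caller supplies fetch_sessions, a hypothetical callable that returns the per-agent session dicts from session-state, and a blocked agent reports live_status == "blocked".

```python
import time
from typing import Callable, Iterable


def run_status(sessions: Iterable[dict]) -> str:
    """Collapse per-agent sessions into one run-level status."""
    statuses = [s.get("live_status") for s in sessions]
    if any(st == "blocked" for st in statuses):  # assumed status value
        return "blocked"  # a run is blocked if any agent is blocked
    if statuses and all(st == "session_completed" for st in statuses):
        return "completed"
    return "in_progress"


def poll_until_settled(
    fetch_sessions: Callable[[], Iterable[dict]],
    interval_s: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple:
    """Poll until the run is blocked or completed, or the poll budget runs out."""
    sessions: list = []
    for _ in range(max_polls):
        sessions = list(fetch_sessions())
        status = run_status(sessions)
        if status != "in_progress":
            return status, sessions
        sleep(interval_s)
    return "timeout", sessions
```

On a "blocked" result, the blocker_code of the returned sessions says which blocker to look up.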
Common blockers (and what they mean)
- agent_first_tool_required: the agent’s Claude session started (claude_session_id is set) but it did not make the required first tool call, so the launch contract blocks it. Recurs at the intake phase. Check that the agent’s MCP tools are actually attached (the required first tool is usually an MCP/HUD tool; if the MCP config or gateway auth is broken the agent literally cannot call it), and that the dispatch prompt elicited a tool call.
- See the full blocker set in GET get_hud_snapshot → execution_report_contract.BLOCKERS.
Verified live (2026-06)
- ContextForge gateway is on 1.0.3; backend mints the 1.0.x-shaped admin token (issuer/audience/jti). Agent gateway auth works: no 401s in the gateway log during a run.
- Deployed fractalops-api image auto-pins forward on each merge (e.g. gha-2820-<sha>), so merged backend changes are live within one pin bump.
- The skill poll-pull (/v1/skills) was NOT exercised by intake agents in the observed run (no /v1/skills calls): they block at first-tool before the pull, and the workspace shims carrying the pull may lag the api image.
GitHub writes go through the CodexGate App
Squad PRs and issues are authored by the CodexGate GitHub App
(fractalops-codexgate[bot]), not a raw org PAT. CodexGateGitHubAppService._installation_token
mints a real ghs_… installation token (app id 3249317, installation
120794317 on yamonco, all-repo) for git auth and the REST/GraphQL calls.
- Creds come from the k8s secret fractalops-codexgate-github-app, wired via envFrom on api/worker/studio-worker and individual valueFrom on agent-server (the re-kick config builder).
- The old FRACTALOPS_GITHUB_APP_TOKEN override env (an org PAT) was removed everywhere; it used to win over the App creds inside _installation_token.
- Control-plane egress to api.github.com is confirmed working.
See Agent PR Submission Pipeline for the full actor model.
Error tracking (GlitchTip)
- The tester and compactor roles get the GlitchTip MCP via the observability-triage bundle (servers glitchtip + fractalops-hud; policies error-triage, performance-regression-check, issue-resolution-proof). The official GlitchTip MCP at /mcp exposes 17 tools and is registered in ContextForge as gateway yamon-glitchtip.
- Every scaffolded project gets an auto-provisioned per-project DSN baked into its starter .env (frontend PUBLIC_SENTRY_DSN, backend SENTRY_DSN); provisioning is fail-open and never blocks scaffolding.
- Ingest is public (exposure-scope: public): preview apps POST events to the DSN unauthenticated (the DSN public key is the per-project auth). The UI, management API, and /mcp still require auth.
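To see why the DSN alone is enough for ingest, it helps to split one into its parts. The sketch assumes the Sentry-style DSN layout (scheme://public_key@host/project_id) that GlitchTip accepts; parse_dsn is a hypothetical helper and the DSN in the usage line is a made-up placeholder.

```python
from urllib.parse import urlsplit


def parse_dsn(dsn: str) -> dict:
    """Split a Sentry-style DSN into public key, host, and project id."""
    parts = urlsplit(dsn)
    if not parts.username or not parts.hostname:
        raise ValueError("DSN must contain a public key and a host")
    # The project id is the last path segment.
    project_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
    if not project_id:
        raise ValueError("DSN must end in a project id")
    return {
        "public_key": parts.username,  # the per-project ingest credential
        "host": parts.hostname,
        "project_id": project_id,
    }


if __name__ == "__main__":
    print(parse_dsn("https://abc123@glitchtip.example.test/7"))
```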
See Build Plane Observability for the plane detail.
Manual secrets (OpenBao follow-up pending)
These runtime secrets are provisioned manually today; OpenBao + ExternalSecret migration is a pending follow-up:
- fractalops-codexgate-github-app: CodexGate App creds (same manual pattern as the org-PAT fractalops-github-runtime secret).
- glitchtip-mcp-bearer (armory namespace): the GlitchTip org API token used as Authorization: Bearer … by the armory-registered yamon-glitchtip gateway.
Daytona web and SSH access reproducibility
Daytona access is split across three layers:
- Kubernetes GitOps owns the Traefik daytona-ssh TCP entryPoint, the daytona-ssh-gateway-edge IngressRouteTCP, Daytona API env SSH_GATEWAY_URL=daytona-ssh.yamon.io:2222, and the default workspace image.
- daytona-runner-runtime reconcile owns Daytona DB region drift: region.sshGatewayUrl must be daytona-ssh.yamon.io:2222; otherwise the Daytona dashboard generates an unusable ssh -p 2222 ...@daytona.yamon.io command.
- PVE edge forwarding is reconciled by ops/infra/reconcile_daytona_edge_forwarding.sh. Run it on pve0 with INSTALL_SYSTEMD=true to install the persistent systemd unit. It scopes public DNAT to 192.168.219.10 only, forwarding TCP 443 and 2222 to Traefik on 10.10.10.47, and preserves the Keycloak Microsoft-login passthrough exception.
The upstream router still needs explicit port forwards:
    TCP 443 -> 192.168.219.10:443
    TCP 2222 -> 192.168.219.10:2222
    TCP 30000 -> 192.168.219.10:30000

Cloudflare DNS records for proxy.monstore.io, *.proxy.monstore.io, daytona-ssh.yamon.io, and *.ssh.daytona.yamon.io must remain DNS-only. Do not proxy SSH hosts through Cloudflare.
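The expected values in this section can be written down as data and compared against what is actually configured. This is a sketch for a local checklist, not the reconcile script: both function names are hypothetical, and the caller is assumed to have read the region value and router forwards by some other means.

```python
EXPECTED_SSH_GATEWAY = "daytona-ssh.yamon.io:2222"
EDGE_IP = "192.168.219.10"

# Upstream router forwards required by this runbook: public port -> target.
REQUIRED_FORWARDS = {
    443: f"{EDGE_IP}:443",
    2222: f"{EDGE_IP}:2222",
    30000: f"{EDGE_IP}:30000",
}


def ssh_gateway_drifted(region_ssh_gateway_url: str) -> bool:
    """True when region.sshGatewayUrl differs from the expected gateway."""
    return region_ssh_gateway_url.strip() != EXPECTED_SSH_GATEWAY


def missing_forwards(configured: dict) -> list:
    """Return the required public ports that are absent or point elsewhere."""
    return sorted(
        port
        for port, target in REQUIRED_FORWARDS.items()
        if configured.get(port) != target
    )
```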
VS Code access
Use Daytona SSH Access with VS Code Remote SSH. The current Daytona API creates a
short-lived SSH token at /api/sandbox/{sandboxId}/ssh-access; FractalOps
surfaces that token in the Portal Local VS Code launch flow as:
    ssh -p 2222 <token>@daytona-ssh.yamon.io

Paste that command into VS Code Remote SSH, or copy it from the Portal launch progress panel. This is the supported path for current Daytona releases.
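If a named host entry is preferred over pasting the command, the same command can be turned into an OpenSSH config stanza. This is a convenience sketch: ssh_command_to_config and the daytona-sandbox alias are hypothetical, only the -p option is handled, and the entry stops working once the short-lived token expires.

```python
import shlex


def ssh_command_to_config(command: str, alias: str = "daytona-sandbox") -> str:
    """Convert `ssh -p <port> <user>@<host>` into an ssh_config Host stanza."""
    args = shlex.split(command)
    if not args or args[0] != "ssh":
        raise ValueError("expected a command starting with 'ssh'")
    port, target = "22", None
    i = 1
    while i < len(args):
        if args[i] == "-p" and i + 1 < len(args):
            port = args[i + 1]
            i += 2
        else:
            target = args[i]  # last non-option argument is user@host
            i += 1
    if not target or "@" not in target:
        raise ValueError("expected a <token>@<host> target")
    user, host = target.rsplit("@", 1)
    lines = [
        f"Host {alias}",
        f"    HostName {host}",
        f"    Port {port}",
        f"    User {user}",  # the short-lived SSH token is the user name
    ]
    return "\n".join(lines) + "\n"
```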
Do not treat the legacy daytonaio.daytona VS Code extension as a supported
workspace client for this runtime. Its published extension still calls older
root-level endpoints such as /cluster, /team, and /workspace, while the
deployed Daytona API exposes current /api/sandbox/*, /api/users/me, and
organization/runner endpoints. id.daytona.yamon.io/realms/default, the
Keycloak vscode callback client, sibling hosts, and TCP 30000 are retained
only as compatibility surfaces for callback/profile experiments and must not be
used as proof that the old extension can list or open modern sandboxes.
Diagnosing a zero-output first-tool block (observed 2026-06)
Symptom: agents go blocked with agent_first_tool_required, a stable
claude_session_id (so session resume / #1528 is working), but reports are
{"status":"blocked"} and observability-feed has 0 events / 0 runtime_logs
— the agent’s Claude session resumes but emits nothing.
Where it runs: agent Claude sessions execute in a Daytona sandbox
(execution_plane: daytona-sandbox), NOT in fractalops-api or
fractalops-agent-server — so api/agent-server logs won’t show claude output.
Read the Daytona workspace logs for the real signal.
Check the execution-plane health first (this is usually the cause, not the control plane):
- kubectl get nodes: all Ready?
- Longhorn storage: kubectl get volumes.longhorn.io -n longhorn-system. In the observed run there were 11 faulted/unknown volumes (detached/orphaned). A degraded storage plane can stop new Daytona workspaces from attaching healthy PVCs → claude can’t run → zero events.
- Daytona core pods (-n daytona) Running.
- Skill pull cannot stall the launch (bounded to a 5s timeout).
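The first two checks can be scripted against the JSON form of the same kubectl commands (add -o json). The sketch assumes the standard Kubernetes node Ready condition and the Longhorn volume fields status.state and status.robustness; both function names are hypothetical. Detached volumes may report unknown robustness, so treat the result as a lead, not a verdict.

```python
def not_ready_nodes(node_list: dict) -> list:
    """Names of nodes without a Ready=True condition (kubectl get nodes -o json)."""
    names = []
    for item in node_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(item.get("metadata", {}).get("name", "<unnamed>"))
    return names


def suspect_longhorn_volumes(volume_list: dict) -> list:
    """(name, state, robustness) for volumes reported as faulted or unknown."""
    suspects = []
    for item in volume_list.get("items", []):
        status = item.get("status", {})
        robustness = status.get("robustness", "unknown")
        if robustness in ("faulted", "unknown"):
            name = item.get("metadata", {}).get("name", "<unnamed>")
            suspects.append((name, status.get("state", "unknown"), robustness))
    return suspects
```

Feed each function the output of json.load on the matching kubectl command; an empty list from both points away from the node and storage layers.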
What was verified NOT the cause: the ContextForge 1.0.3 gateway + backend admin token (no 401s) and session resume (#1528) both work. A zero-output block is an execution-plane (Daytona/storage) problem, not the MCP/auth control plane.