AgentSquad Runtime Operations

This runbook is the SSOT for driving and observing AgentSquad runs against the live runtime. It was written from the live API, so the operational path can be found without guessing.

  • API base: http://10.10.10.47:18088 (edge → fractalops-api).
  • Auth: edge Pomerium claim headers, NOT a bearer token:
    • x-pomerium-claim-email: <you>@yamon.io
    • x-pomerium-claim-groups: /org/platform/roles/super_admin
  • Health: GET /healthz → {"status":"ok"}.
  • GET /openapi.json is the authoritative endpoint list — grep it, do not guess paths.
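
A minimal Python sketch of the auth and health-check calls above, using only the standard library. The helper names (pomerium_headers, build_request, call_json) are illustrative and not part of the runtime, and the email passed in is a placeholder for your own.

```python
import json
import urllib.request

API_BASE = "http://10.10.10.47:18088"  # edge -> fractalops-api, from this runbook


def pomerium_headers(email: str,
                     groups: str = "/org/platform/roles/super_admin") -> dict:
    """Edge Pomerium claim headers; no bearer token is involved."""
    return {
        "x-pomerium-claim-email": email,
        "x-pomerium-claim-groups": groups,
    }


def build_request(method: str, path: str, email: str,
                  body=None) -> urllib.request.Request:
    """Build (but do not send) a request against the live API."""
    headers = pomerium_headers(email)
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(API_BASE + path, data=data,
                                  headers=headers, method=method)


def call_json(req: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, call_json(build_request("GET", "/healthz", "you@yamon.io")) should return {"status": "ok"} when the API is healthy, and the same helper can fetch /openapi.json when you need the endpoint list.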

Prefix /v1/admin/studio (NOT /v1/studio — that 404s):

| Action | Endpoint |
| --- | --- |
| Launch a run | POST /v1/admin/studio/runs |
| Get run | GET /v1/admin/studio/runs/{run_id}?tenant_id=default |
| Execute (tick agents) | POST /v1/admin/studio/runs/{run_id}/execute?tenant_id=default |
| Session state (observe) | GET /v1/admin/studio/runs/{run_id}/session-state?tenant_id=default |
| Reports (per tick) | GET /v1/admin/studio/runs/{run_id}/reports?tenant_id=default (returns a LIST) |
| Handoff trail | GET /v1/admin/studio/runs/{run_id}/handoff-trail?tenant_id=default |
| Cleanup / control / replay | POST .../runs/{run_id}/{cleanup,control,replay} |
| Templates | GET /v1/admin/studio/templates |
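
The endpoint paths above can be assembled with a small helper so the prefix and the tenant_id query are never typed by hand. This is a sketch only: run_path and the constant names are hypothetical, and it covers just the per-run sub-resources that the list shows with a tenant_id parameter.

```python
from urllib.parse import quote, urlencode

STUDIO_PREFIX = "/v1/admin/studio"  # NOT /v1/studio, which 404s

# Per-run sub-resources listed above with a tenant_id query parameter.
RUN_SUBRESOURCES = ("execute", "session-state", "reports", "handoff-trail")

LAUNCH_PATH = f"{STUDIO_PREFIX}/runs"          # POST
TEMPLATES_PATH = f"{STUDIO_PREFIX}/templates"  # GET


def run_path(run_id: str, subresource: str = "",
             tenant_id: str = "default") -> str:
    """Return the path for a run, or for one of its sub-resources."""
    if subresource and subresource not in RUN_SUBRESOURCES:
        raise ValueError(f"unknown run sub-resource: {subresource!r}")
    path = f"{STUDIO_PREFIX}/runs/{quote(run_id, safe='')}"
    if subresource:
        path += "/" + subresource
    return path + "?" + urlencode({"tenant_id": tenant_id})
```

Rejecting unknown sub-resources locally mirrors the advice to grep /openapi.json rather than guess paths.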

Available templates: ouroboros, pm-ouroboros, agentsquad (project delivery / migration), research-team.

Launch request body (POST /v1/admin/studio/runs):

{
  "template_id": "agentsquad",
  "project_slug": "sanmopia-modernization",
  "runtime_kind": "project_delivery",
  "work_scope": "<the task>",
  "next_goal": "<first concrete step>"
}
  • Project slugs are exact. Sanmopia’s slug is sanmopia-modernization, NOT sanmopia — a wrong slug makes the launch hang on provisioning (HTTP 000 / timeout). Confirm the slug from a prior run’s project_slug before launching.
  • Launch returns {run_id, status: "ready", agent_roster: [...]} in ~6s. The roster starts as the intake pair (planner + curator); more agents join by handoff.
  • execute returns fast with status: "scheduled" — the tick runs ASYNC; poll session-state to watch progress.
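
The launch, execute, poll sequence described above could be scripted roughly as follows. This is a sketch under assumptions: launch_body, check_slug and poll_session_state are hypothetical helpers, the fetch callable stands in for whatever HTTP client you use to GET session-state, and the completion predicate is left to the caller because the exact response shape is not documented here.

```python
import time
from typing import Callable


def launch_body(project_slug: str, work_scope: str, next_goal: str) -> dict:
    """Body for POST /v1/admin/studio/runs, mirroring the example above."""
    return {
        "template_id": "agentsquad",
        "project_slug": project_slug,
        "runtime_kind": "project_delivery",
        "work_scope": work_scope,
        "next_goal": next_goal,
    }


def check_slug(candidate: str, prior_run: dict) -> str:
    """Compare a slug with a prior run's project_slug before launching."""
    known = prior_run.get("project_slug")
    if candidate != known:
        raise ValueError(
            f"slug {candidate!r} does not match prior run slug {known!r}")
    return candidate


def poll_session_state(fetch: Callable[[], dict],
                       is_done: Callable[[dict], bool],
                       interval_s: float = 5.0,
                       max_polls: int = 60,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until is_done(state) is true; execute only schedules the tick."""
    for attempt in range(max_polls):
        state = fetch()
        if is_done(state):
            return state
        if attempt + 1 < max_polls:
            sleep(interval_s)
    raise TimeoutError(f"no terminal state after {max_polls} polls")
```

Bounding the poll count keeps a hung run from blocking the operator's script indefinitely; the interval and limit shown are arbitrary defaults, not values taken from the runtime.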

Agent identity model (as seen on the roster)

Each roster agent carries:

  • user_principal_name / subject_key = <role>@yamon.io (e.g. planner@yamon.io), identity_binding_mode: named_agent_microsoft, secret_lease_ref: named-agent/<role> — the msgraph-bound named-agent UPN.
  • agent_nickname = an auto-generated handle (e.g. dangerous-vector), deterministic per (run, agent, role). No manual assignment.
  • The ContextForge gateway token identity is separate: <projcode>.<nickname>@<domain>. The gateway uses it to inject a spoof-proof X-Authenticated-User header to upstream MCP servers, and AMA binds the project from it.
  • skill_ids / mcp_server_ids are EMPTY in the roster snapshot — the loadout resolves them at session launch, not in the roster view.
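
To keep the two identities apart, the naming rules above can be written down as a small value type. This is an illustrative sketch, not the runtime's own model: RosterAgent is a hypothetical class, and the project code and gateway domain are caller-supplied because this runbook does not state their values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RosterAgent:
    role: str      # e.g. "planner"
    nickname: str  # auto-generated handle, e.g. "dangerous-vector"

    @property
    def user_principal_name(self) -> str:
        """msgraph-bound named-agent UPN (also the subject_key)."""
        return f"{self.role}@yamon.io"

    @property
    def secret_lease_ref(self) -> str:
        return f"named-agent/{self.role}"

    def gateway_identity(self, project_code: str, domain: str) -> str:
        """Separate ContextForge gateway identity: <projcode>.<nickname>@<domain>."""
        return f"{project_code}.{self.nickname}@{domain}"
```

The UPN depends only on the role, while the gateway identity depends on the project and the nickname, which is why the same role maps to different gateway identities across projects.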

Poll session-state; each session reports live_status, current_step, blocker_code, session_id, claude_session_id, tool_call_count.

Lifecycle: ready → waiting_first_tool (Claude session warming) → first tool call → working → session_completed. A run is blocked if any agent is blocked.
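
A sketch of how the per-session fields could be summarized while polling. It assumes the sessions arrive as a list of dicts keyed by the field names above, and that a blocked session either reports live_status "blocked" or carries a non-empty blocker_code; both are assumptions about the payload, so check a real session-state response before relying on them.

```python
def is_blocked(session: dict) -> bool:
    """Assumed block signal: explicit status or a non-empty blocker_code."""
    return (session.get("live_status") == "blocked"
            or bool(session.get("blocker_code")))


def run_is_blocked(sessions: list) -> bool:
    """A run is blocked if any agent session is blocked."""
    return any(is_blocked(s) for s in sessions)


def describe(session: dict) -> str:
    """One line per session, suitable for a polling log."""
    return "{sid}: {status} step={step} tools={tools} blocker={blocker}".format(
        sid=session.get("session_id", "?"),
        status=session.get("live_status", "?"),
        step=session.get("current_step") or "-",
        tools=session.get("tool_call_count", 0),
        blocker=session.get("blocker_code") or "-",
    )
```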

  • agent_first_tool_required — the agent’s Claude session started (claude_session_id is set) but it did not make the required first tool call, so the launch contract blocks it. Recurs at the intake phase. Check: the agent’s MCP tools are actually attached (the required first tool is usually an MCP/HUD tool — if the MCP config or gateway auth is broken the agent literally cannot call it), and the dispatch prompt elicited a tool call.
  • See the full blocker set in GET get_hud_snapshot → execution_report_contract.BLOCKERS.
  • ContextForge gateway is on 1.0.3; backend mints the 1.0.x-shaped admin token (issuer/audience/jti) — agent gateway auth works, no 401s in the gateway log during a run.
  • Deployed fractalops-api image auto-pins forward on each merge (e.g. gha-2820-<sha>), so merged backend changes are live within one pin bump.
  • The skill poll-pull (/v1/skills) was NOT exercised by intake agents in the observed run (no /v1/skills calls) — they block at first-tool before the pull, and the workspace shims carrying the pull may lag the api image.

GitHub writes go through the CodexGate App

Squad PRs and issues are authored by the CodexGate GitHub App (fractalops-codexgate[bot]), not a raw org PAT. CodexGateGitHubAppService._installation_token mints a real ghs_… installation token (app id 3249317, installation 120794317 on yamonco, all-repo) for git auth and the REST/GraphQL calls.

  • Creds come from the k8s secret fractalops-codexgate-github-app — wired via envFrom on api/worker/studio-worker and individual valueFrom on agent-server (the re-kick config builder).
  • The old FRACTALOPS_GITHUB_APP_TOKEN override env (an org PAT) was removed everywhere; it used to win over the App creds inside _installation_token.
  • Control-plane egress to api.github.com is confirmed working.

See Agent PR Submission Pipeline for the full actor model.

  • The tester and compactor roles get the GlitchTip MCP via the observability-triage bundle (servers glitchtip + fractalops-hud; policies error-triage, performance-regression-check, issue-resolution-proof). The official GlitchTip MCP at /mcp exposes 17 tools and is registered in ContextForge as gateway yamon-glitchtip.
  • Every scaffolded project gets an auto-provisioned per-project DSN baked into its starter .env (frontend PUBLIC_SENTRY_DSN, backend SENTRY_DSN) — fail-open, never blocks scaffolding.
  • Ingest is public (exposure-scope: public): preview apps POST events to the DSN unauthenticated (the DSN public key is the per-project auth). The UI, management API, and /mcp still require auth.

See Build Plane Observability for the plane detail.

Manual secrets (OpenBao follow-up pending)

These runtime secrets are provisioned manually today; OpenBao + ExternalSecret migration is a pending follow-up:

  • fractalops-codexgate-github-app — CodexGate App creds (same manual pattern as the org-PAT fractalops-github-runtime secret).
  • glitchtip-mcp-bearer (armory namespace) — the GlitchTip org API token used as Authorization: Bearer … by the armory-registered yamon-glitchtip gateway.

Daytona web and SSH access reproducibility

Daytona access is split across three layers:

  • Kubernetes GitOps owns the Traefik daytona-ssh TCP entryPoint, the daytona-ssh-gateway-edge IngressRouteTCP, the Daytona API env SSH_GATEWAY_URL=daytona-ssh.yamon.io:2222, and the default workspace image.
  • daytona-runner-runtime reconcile owns Daytona DB region drift: region.sshGatewayUrl must be daytona-ssh.yamon.io:2222; otherwise the Daytona dashboard generates an unusable ssh -p 2222 ...@daytona.yamon.io command.
  • PVE edge forwarding is reconciled by ops/infra/reconcile_daytona_edge_forwarding.sh. Run it on pve0 with INSTALL_SYSTEMD=true to install the persistent systemd unit. It scopes public DNAT to 192.168.219.10 only, forwarding TCP 443 and 2222 to Traefik on 10.10.10.47, and preserves the Keycloak Microsoft-login passthrough exception.

The upstream router still needs explicit port forwards:

TCP 443 -> 192.168.219.10:443
TCP 2222 -> 192.168.219.10:2222
TCP 30000 -> 192.168.219.10:30000
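
One way to sanity-check the forwards listed above is a plain TCP connect probe run from outside the network. This is a sketch: port_open and check_forwards are hypothetical helpers, the host to probe (the router's public address) is not given in this runbook and must be supplied by you, and a successful connect only shows that something accepted the connection, not that it is Traefik.

```python
import socket

ROUTER_FORWARDS = (443, 2222, 30000)  # TCP ports listed above


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_forwards(host: str, ports=ROUTER_FORWARDS,
                   timeout: float = 3.0) -> dict:
    """Map each forwarded port to whether it accepted a connection."""
    return {port: port_open(host, port, timeout) for port in ports}
```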

Cloudflare DNS records for proxy.monstore.io, *.proxy.monstore.io, daytona-ssh.yamon.io, and *.ssh.daytona.yamon.io must remain DNS-only. Do not proxy SSH hosts through Cloudflare.

Use Daytona SSH Access with VS Code Remote SSH. The current Daytona API creates a short-lived SSH token at /api/sandbox/{sandboxId}/ssh-access; FractalOps surfaces that token in the Portal Local VS Code launch flow as:

ssh -p 2222 <token>@daytona-ssh.yamon.io

Paste that command into VS Code Remote SSH, or copy it from the Portal launch progress panel. This is the supported path for current Daytona releases.
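
If you prefer a named host in VS Code Remote SSH over pasting the raw command, the same token can be rendered as an ssh_config entry. This is a sketch: ssh_command and ssh_config_entry are hypothetical helpers, the alias is your choice, and because the token is short-lived the entry goes stale and has to be regenerated with each new token.

```python
DAYTONA_SSH_HOST = "daytona-ssh.yamon.io"
DAYTONA_SSH_PORT = 2222


def _check_token(token: str) -> str:
    """Reject tokens that would break the command or config syntax."""
    if not token or any(ch.isspace() for ch in token) or "@" in token:
        raise ValueError("token must be non-empty, without whitespace or '@'")
    return token


def ssh_command(token: str) -> str:
    """The command form surfaced by the Portal launch flow."""
    return f"ssh -p {DAYTONA_SSH_PORT} {_check_token(token)}@{DAYTONA_SSH_HOST}"


def ssh_config_entry(alias: str, token: str) -> str:
    """Equivalent ~/.ssh/config block; the token is used as the SSH user."""
    return "\n".join([
        f"Host {alias}",
        f"    HostName {DAYTONA_SSH_HOST}",
        f"    Port {DAYTONA_SSH_PORT}",
        f"    User {_check_token(token)}",
    ]) + "\n"
```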

Do not treat the legacy daytonaio.daytona VS Code extension as a supported workspace client for this runtime. Its published extension still calls older root-level endpoints such as /cluster, /team, and /workspace, while the deployed Daytona API exposes current /api/sandbox/*, /api/users/me, and organization/runner endpoints. id.daytona.yamon.io/realms/default, the Keycloak vscode callback client, sibling hosts, and TCP 30000 are retained only as compatibility surfaces for callback/profile experiments and must not be used as proof that the old extension can list or open modern sandboxes.

Diagnosing a zero-output first-tool block (observed 2026-06)

Symptom: agents go blocked with agent_first_tool_required and a stable claude_session_id (so session resume / #1528 is working), but reports are {"status":"blocked"} and observability-feed shows 0 events / 0 runtime_logs. The agent’s Claude session resumes but emits nothing.
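
The symptom can be expressed as a single predicate so it is recognized consistently. This is a sketch under assumptions: is_zero_output_block is a hypothetical helper, the session and report dicts are assumed to use the field names quoted in this runbook, and the event and runtime-log counts are passed in as plain integers because the shape of the observability-feed response is not documented here.

```python
def is_zero_output_block(session: dict, reports: list,
                         event_count: int, runtime_log_count: int) -> bool:
    """True when a session matches the zero-output first-tool pattern."""
    return (
        session.get("blocker_code") == "agent_first_tool_required"
        # A set claude_session_id means the Claude session did start.
        and bool(session.get("claude_session_id"))
        # Every report carries only a blocked status.
        and bool(reports)
        and all(r.get("status") == "blocked" for r in reports)
        # Nothing reached the observability feed.
        and event_count == 0
        and runtime_log_count == 0
    )
```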

Where it runs: agent Claude sessions execute in a Daytona sandbox (execution_plane: daytona-sandbox), NOT in fractalops-api or fractalops-agent-server — so api/agent-server logs won’t show claude output. Read the Daytona workspace logs for the real signal.

Check the execution-plane health first (this is usually the cause, not the control plane):

  • kubectl get nodes — all Ready?
  • Longhorn storage: kubectl get volumes.longhorn.io -n longhorn-system. In the observed run there were 11 faulted/unknown volumes (detached/orphaned). A degraded storage plane can stop new Daytona workspaces from attaching healthy PVCs → claude can’t run → zero events.
  • Daytona core pods (kubectl get pods -n daytona): all Running?
  • Skill pull cannot stall the launch (bounded to a 5s timeout).
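
The Longhorn check above can be automated by parsing the JSON form of the same kubectl command. This is a sketch: unhealthy_volumes and fetch_longhorn_volumes are hypothetical helpers, and reading status.robustness and status.state from each volume is an assumption about the Longhorn Volume resource that should be confirmed against your Longhorn version. A cleanly detached volume may also report an unknown robustness, so treat the result as a lead rather than a verdict.

```python
import json
import subprocess


def unhealthy_volumes(volume_list: dict) -> list:
    """Return (name, state, robustness) for faulted or unknown volumes."""
    bad = []
    for item in volume_list.get("items", []):
        status = item.get("status", {})
        robustness = status.get("robustness", "unknown")
        if robustness in ("faulted", "unknown"):
            name = item.get("metadata", {}).get("name", "?")
            bad.append((name, status.get("state", "unknown"), robustness))
    return bad


def fetch_longhorn_volumes(namespace: str = "longhorn-system") -> dict:
    """Run kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "volumes.longhorn.io", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling unhealthy_volumes(fetch_longhorn_volumes()) from a machine with kubectl access gives a list to compare against the volumes backing Daytona workspaces.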

What was verified NOT the cause: the ContextForge 1.0.3 gateway + backend admin token (no 401s) and session resume (#1528) both work. A zero-output block is an execution-plane (Daytona/storage) problem, not the MCP/auth control plane.