Build Plane Observability
This page defines the ubiquitous language for build duration, cache, resource, and source-transfer telemetry of the platform’s own CI image release pipeline (FractalOps API/worker/agent-server/portal runtime images). Use these names in workflow steps, scripts, metrics, warehouse columns, dashboards, and agent prompts.
This is not the developer dev loop. There is no in-sandbox build/ship plane and no per-project build pipeline; dev previews run as bare processes — see Dev Preview Plane.
Ubiquitous Terms
| Term | Meaning | Do not call it |
|---|---|---|
| Build Plane | The CI runtime surface that turns FractalOps source into pushed runtime images. GitOps pinning is a downstream release step. | Daytona, Dokploy, test runner |
| Build Plane Event | One fact emitted by the build plane for a workflow run and step. | pytest result, smoke test |
| Build Step | Named measured phase such as runtime_image_build or gitops_push. | random job step |
| Telemetry Marker | Start/finish wrapper that records step duration and contextual attributes. | shell log, timing hack |
| Telemetry Sink | Durable event destination. Current sinks are Mimir metrics, Kafka runtime events, and optional ClickHouse direct inserts. | console output |
| Evidence Projection | Queryable ClickHouse row derived from OTLP/Kafka/raw events or direct inserts. | source of domain truth |
Event Flow
GitHub Actions runtime-release
-> ops/ci/record_build_pipeline_event.sh
-> ops/infra/build_pipeline_event_telemetry.py
-> OTLP HTTP through the build-plane edge endpoint
-> OpenTelemetry Collector
-> Mimir metrics store
-> Kafka topic fractalops.runtime.events
-> warehouse.fractalops_events_raw
-> ClickHouse projections and dashboards

Mimir is the central metrics store. ClickHouse remains the event, fact, and warehouse proof plane. Kafka remains the runtime event stream. The collector is the routing boundary, not the durable metrics database.
Direct ClickHouse insertion remains supported for controlled jobs that have
FRACTALOPS_CLICKHOUSE_* credentials. It is a sink, not the canonical transport.
OTLP is the default build-plane transport because every runner can emit it
through topology-derived collector endpoints.
GitHub Actions and other non-cluster runners must use the Build Plane edge
endpoint, currently FRACTALOPS_OTEL_OTLP_HTTP_ENDPOINT=http://10.10.10.47:19010.
In-cluster workers may use
FRACTALOPS_OTEL_OTLP_HTTP_CLUSTER_ENDPOINT=http://opentelemetry-collector.observability.svc.cluster.local:4318.
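The endpoint rule above can be sketched as a small resolver. This is a minimal illustration, not the logic in ops/infra/build_plane_env.py: the function name `resolve_otlp_endpoint` is hypothetical, and detecting a pod through `KUBERNETES_SERVICE_HOST` (which Kubernetes injects into containers) is an assumption; the real resolver may rely on topology data instead.

```python
import os

# Environment variable names taken from this page.
EDGE_ENV = "FRACTALOPS_OTEL_OTLP_HTTP_ENDPOINT"
CLUSTER_ENV = "FRACTALOPS_OTEL_OTLP_HTTP_CLUSTER_ENDPOINT"


def resolve_otlp_endpoint(env=None):
    """Return the OTLP HTTP base endpoint for this runner, or None if unset.

    Assumption: a process is "in-cluster" when KUBERNETES_SERVICE_HOST is set.
    """
    env = os.environ if env is None else env
    in_cluster = bool(env.get("KUBERNETES_SERVICE_HOST"))
    # In-cluster workers may use the cluster-local collector service.
    if in_cluster and env.get(CLUSTER_ENV):
        return env[CLUSTER_ENV].rstrip("/")
    # Everything else (GitHub Actions, external runners) uses the edge endpoint.
    edge = env.get(EDGE_ENV)
    return edge.rstrip("/") if edge else None
```

Passing an explicit mapping instead of reading `os.environ` keeps the sketch easy to exercise outside a runner.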
Trace sink (Studio phase timing)
Control-plane phase-timing traces now flow to ClickHouse. Two faults were corrected:
- The redpanda topic retention was capped while disk was full, which blocked the Kafka produce path. Retention was restored.
- The agent-server OTLP exporter endpoint is now pointed at /v1/traces. LangGraph posts to OTEL_EXPORTER_OTLP_ENDPOINT as-is, so the path must be the full traces route, not the collector root.
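The second fix can be guarded with a small normalizer that guarantees the exporter endpoint ends in the traces route. This is a hedged sketch: `ensure_traces_route` is a hypothetical helper, not part of the agent-server code, and it assumes the behavior described above (the endpoint is posted to as-is).

```python
from urllib.parse import urlsplit, urlunsplit

TRACES_ROUTE = "/v1/traces"


def ensure_traces_route(endpoint: str) -> str:
    """Return `endpoint` with the OTLP HTTP traces route as its path suffix."""
    parts = urlsplit(endpoint)
    path = parts.path.rstrip("/")
    # Append the route only when it is missing, so the call is idempotent.
    if not path.endswith(TRACES_ROUTE):
        path = f"{path}{TRACES_ROUTE}"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

A collector root such as http://opentelemetry-collector.observability.svc.cluster.local:4318 would be rewritten to end in /v1/traces, while an already correct value passes through unchanged.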
Error Tracking Plane
GlitchTip is the error-tracking plane. It is the Sentry-compatible application error and performance surface; it is not a metrics, ontology, or proof authority.
| Term | Meaning | Do not call it |
|---|---|---|
| Error Tracking Plane | GlitchTip error/performance surface for runtime and preview-app exceptions. | Mimir, ClickHouse proof, OTel collector |
| Project DSN | Per-project public ingest key. The DSN public key is the per-project auth for unauthenticated event POSTs. | Bearer API token |
| GlitchTip MCP | The official built-in GlitchTip MCP server, registered in the armory. | A custom MCP wrapper |
Deployment (platform/k8s/apps/glitchtip, ArgoCD Application glitchtip in the
runtime-services project):
- New glitchtip namespace: GlitchTip 6.1.8 web + celery worker + valkey + migrate hook.
- DB is a glitchtip database/role on the shared CloudNativePG cluster fractalops-postgresql. Org slug is fractalops.
- Ingress is exposure-scope: public (the standard self-hosted Sentry model). Preview apps POST events to their DSN unauthenticated; the DSN public key is the per-project auth. GlitchTip’s own auth still gates everything else: the UI requires login (open registration disabled), and the management API and /mcp require a Bearer token.
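To make "the DSN public key is the per-project auth" concrete, the sketch below splits a DSN into its parts, assuming the usual Sentry-compatible layout `scheme://<public_key>@<host>/<project_id>` and the Sentry-style envelope ingest path. The helper name `parse_dsn` and the example host are hypothetical; nothing here is taken from the platform code.

```python
from urllib.parse import urlsplit


def parse_dsn(dsn: str) -> dict:
    """Split a Sentry-compatible DSN into public key, project id, and ingest URL."""
    parts = urlsplit(dsn)
    if not parts.username or not parts.hostname:
        raise ValueError("DSN must look like scheme://<public_key>@<host>/<project_id>")
    # The last path segment is the project id; anything before it is a path prefix.
    prefix, _, project_id = parts.path.rstrip("/").rpartition("/")
    if not project_id:
        raise ValueError("DSN is missing a project id")
    host = parts.hostname + (f":{parts.port}" if parts.port else "")
    return {
        "public_key": parts.username,
        "project_id": project_id,
        # Assumed Sentry-style envelope route; SDKs normally derive this themselves.
        "envelope_url": f"{parts.scheme}://{host}{prefix}/api/{project_id}/envelope/",
    }
```

The public key identifies the project on ingest only; it grants no access to the UI, the management API, or /mcp.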
Official GlitchTip MCP
GlitchTip 6.1+ ships an official built-in MCP server at /mcp (enabled by
GLITCHTIP_ENABLE_MCP=True and GLITCHTIP_ENABLE_LOGS=True). A custom
hand-rolled MCP wrapper that was briefly used was removed in favor of the
official one.
- It exposes 17 tools: list_organizations, list_projects, list_issues, get_issue, get_latest_event, get_event, update_issue, list_transaction_groups, get_transaction_group, list_transaction_spans, list_span_groups, detect_n_plus_one, get_transaction_trend, list_alerts, list_monitors, list_logs, get_log.
- It accepts a static Bearer API token service-to-service. The armory registrar registers it in ContextForge as gateway yamon-glitchtip, URL http://glitchtip-web.glitchtip.svc.cluster.local:8080/mcp, auth_type=authheaders, Authorization: Bearer <org API token> (token in secret glitchtip-mcp-bearer, armory namespace).
- The tester and compactor agent roles get the observability-triage bundle: servers (glitchtip, fractalops-hud) + policies (error-triage, performance-regression-check, issue-resolution-proof). Workspace-shims is bumped to 0.1.98 with the official glitchtip tool names.
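The registration values listed above can be collected into one record, which makes the header-based auth explicit. This is only an illustration: `GatewayRegistration` and `glitchtip_gateway` are hypothetical names, and the record is not the ContextForge API schema; only the gateway name, URL, auth type, and header shape come from this page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GatewayRegistration:
    """Hypothetical container for the values the armory registrar submits."""
    name: str
    url: str
    auth_type: str
    headers: dict = field(default_factory=dict)


def glitchtip_gateway(token: str) -> GatewayRegistration:
    """Build the registration record for the official GlitchTip MCP server."""
    if not token:
        raise ValueError("an org API token is required")
    return GatewayRegistration(
        name="yamon-glitchtip",
        url="http://glitchtip-web.glitchtip.svc.cluster.local:8080/mcp",
        auth_type="authheaders",
        # Static Bearer token, read from the glitchtip-mcp-bearer secret in practice.
        headers={"Authorization": f"Bearer {token}"},
    )
```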
Auto-provisioned per-project DSN
project_factory auto-provisions error tracking for every scaffolded project
with zero manual setup. glitchtip_dsn_provisioning.ensure_project_dsn(project_slug)
get-or-creates the project’s GlitchTip project and returns its public DSN. It is
fail-open — DSN provisioning never blocks scaffolding.
- The DSN is baked into the starter .env files: frontend PUBLIC_SENTRY_DSN and backend SENTRY_DSN. GlitchTip is Sentry-compatible, so projects use @sentry/astro / sentry-sdk.
- Settings: FRACTALOPS_GLITCHTIP_API_URL, FRACTALOPS_GLITCHTIP_API_TOKEN, FRACTALOPS_GLITCHTIP_ORG.
- App-level SDK init is the agents’ responsibility; the baked DSN .env is the control-plane handoff.
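The get-or-create and fail-open behavior described above might look roughly like the sketch below. It is not the body of glitchtip_dsn_provisioning.ensure_project_dsn: the function names, the injected `get_dsn` and `create_project` callables, and the `render_env` helper are all hypothetical stand-ins for the real GlitchTip API calls and scaffolding step.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("dsn_provisioning_sketch")


def ensure_project_dsn_fail_open(
    project_slug: str,
    get_dsn: Callable[[str], Optional[str]],
    create_project: Callable[[str], None],
) -> Optional[str]:
    """Get or create the project's DSN; return None instead of raising."""
    try:
        dsn = get_dsn(project_slug)
        if dsn is None:
            # Project does not exist yet: create it, then read its DSN.
            create_project(project_slug)
            dsn = get_dsn(project_slug)
        return dsn
    except Exception:
        # Fail-open: scaffolding continues without error tracking.
        log.warning("DSN provisioning failed for %s", project_slug, exc_info=True)
        return None


def render_env(dsn: Optional[str]) -> dict:
    """Return the starter .env entries, or nothing when provisioning failed."""
    if dsn is None:
        return {}
    return {"PUBLIC_SENTRY_DSN": dsn, "SENTRY_DSN": dsn}
```

Returning `None` rather than raising is what keeps a GlitchTip outage from blocking project scaffolding.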
Stable Event Names
Event names use <build_step>_started and <build_step>_finished.
| Build Step | Purpose |
|---|---|
| build_plane_probe | Build driver and builder setup. |
| runtime_base_image_build | Base runtime image build and push. |
| runtime_image_build | FractalOps API/runtime image build and push. |
| agent_server_image_build | LangGraph agent server image build and push. |
| gitops_pin_update | Local GitOps manifest/image pin mutation. |
| gitops_push | GitOps commit and remote push. |
| runtime_rollout_convergence | Runtime GitOps commit convergence and deployment image propagation. |
New steps must be named by domain outcome, not implementation detail. Prefer
source_bundle_upload_finished over tar_command_finished.
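The naming rule can be enforced with a small helper. This is a sketch under assumptions: `event_name` is a hypothetical function, and the snake_case pattern is inferred from the step names in the table rather than from a documented rule.

```python
import re

# Step names from the table above.
BUILD_STEPS = (
    "build_plane_probe",
    "runtime_base_image_build",
    "runtime_image_build",
    "agent_server_image_build",
    "gitops_pin_update",
    "gitops_push",
    "runtime_rollout_convergence",
)

# Assumed shape: lowercase snake_case, starting with a letter.
_STEP_RE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def event_name(build_step: str, phase: str) -> str:
    """Return '<build_step>_started' or '<build_step>_finished'."""
    if phase not in ("started", "finished"):
        raise ValueError(f"unknown phase: {phase!r}")
    if not _STEP_RE.fullmatch(build_step):
        raise ValueError(f"build step must be snake_case: {build_step!r}")
    return f"{build_step}_{phase}"
```

The helper accepts any well-formed step name, so a new step such as source_bundle_upload still has to be added to this page by hand.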
Required Attributes
Every Build Plane Event should carry:
| Attribute | Source |
|---|---|
| run_id | GITHUB_RUN_ID or build-plane run id. |
| workflow_id | GITHUB_WORKFLOW or runner workflow id. |
| task_queue | Build task queue, for example github-actions-build-plane. |
| worker_id | Runner host/user identity. |
| runner_name, runner_os, runner_arch | GitHub runner metadata when present. |
| project | Project slug, default fractalops. |
| branch, commit_sha | Source revision. |
| duration_ms | Finish marker duration. |
| cache_key, cache_hit | Build cache identity and hit state when known. |
| image_ref, image_digest | Image target and produced digest when known. |
| resource_cpu, resource_memory | Requested or observed build resource envelope when known. |
| metadata_json | Small structured extension for step-specific facts. |
| status | started, success, failure, or cancelled. |
OpenTelemetry attributes are exported under fractalops.build.*. FractalOps-only
extensions must stay under this namespace; use standard OpenTelemetry semantic
keys only when they already describe the field.
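As an illustration of the namespace rule, the sketch below maps an event's fields onto `fractalops.build.*` attribute keys. It is an assumption-laden example, not the mapping in ops/infra/build_pipeline_event_telemetry.py: `to_otel_attributes` is a hypothetical name, and dropping unknown ("when known") fields and JSON-encoding structured values are choices made here for the example.

```python
import json

PREFIX = "fractalops.build."
ALLOWED_STATUS = {"started", "success", "failure", "cancelled"}


def to_otel_attributes(event: dict) -> dict:
    """Flatten a Build Plane Event into namespaced, scalar attributes."""
    if event.get("status") not in ALLOWED_STATUS:
        raise ValueError(f"invalid status: {event.get('status')!r}")
    attrs = {}
    for key, value in event.items():
        if value is None:
            continue  # "when known" fields are simply omitted
        if isinstance(value, (str, bool, int, float)):
            attrs[PREFIX + key] = value
        else:
            # Structured values such as metadata_json become a JSON string.
            attrs[PREFIX + key] = json.dumps(value, sort_keys=True)
    return attrs
```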
Code Ownership
| Concern | Owner file |
|---|---|
| Topology-derived build-plane env | ops/infra/build_plane_env.py |
| GitHub Actions env export | ops/ci/resolve_build_plane.sh |
| Start/finish markers | ops/ci/record_build_pipeline_event.sh |
| OTLP and ClickHouse emission | ops/infra/build_pipeline_event_telemetry.py |
| Build-plane OTLP HTTP edge route | platform/k8s/apps/opentelemetry-collector/templates/ingressroute.yaml |
| Central metrics store | platform/k8s/apps/mimir/ |
| Runtime image build workflow | .github/workflows/runtime-release.yml |
| Runtime GitOps bump workflow | .github/workflows/runtime-gitops-bump.yml |
| Runtime rollout observe workflow | .github/workflows/runtime-rollout-observe.yml |
| Warehouse raw sink | warehouse.fractalops_events_raw |
| Optional direct projection | warehouse.build_events |
Do not add ad hoc timers directly to workflow bodies when a Telemetry Marker
can express the same phase. Add a named Build Step, emit start/finish, and
extend this page first when the term is new.
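For readers new to the pattern, a Telemetry Marker can be pictured as a context manager that emits the start event, times the wrapped phase, and emits the finish event with duration_ms and a status. This is a conceptual sketch only: the real marker is the shell wrapper ops/ci/record_build_pipeline_event.sh, and `telemetry_marker`, `emit`, and `clock` are hypothetical names. The cancelled status is not modeled here.

```python
import time
from contextlib import contextmanager


@contextmanager
def telemetry_marker(build_step, emit, clock=time.monotonic, **attrs):
    """Emit <build_step>_started, run the body, then emit <build_step>_finished."""
    emit({"event": f"{build_step}_started", "status": "started", **attrs})
    start = clock()
    status = "success"
    try:
        yield
    except BaseException:
        # Any error or interruption in the body is recorded as a failure.
        status = "failure"
        raise
    finally:
        emit({
            "event": f"{build_step}_finished",
            "status": status,
            "duration_ms": int((clock() - start) * 1000),
            **attrs,
        })
```

A workflow-side caller would wrap one named phase, for example `with telemetry_marker("gitops_push", emit=send_event, project="fractalops"): ...`, where `send_event` stands in for whatever sink adapter is in use.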