Build Plane Observability

This page defines the ubiquitous language for build duration, cache, resource, and source-transfer telemetry in the platform’s own CI image release pipeline (FractalOps API/worker/agent-server/portal runtime images). Use these names in workflow steps, scripts, metrics, warehouse columns, dashboards, and agent prompts.

This page does not describe the developer dev loop. There is no in-sandbox build/ship plane and no per-project build pipeline; dev previews run as bare processes (see Dev Preview Plane).

| Term | Meaning | Do not call it |
| --- | --- | --- |
| Build Plane | The CI runtime surface that turns FractalOps source into pushed runtime images. GitOps pinning is a downstream release step. | Daytona, Dokploy, test runner |
| Build Plane Event | One fact emitted by the build plane for a workflow run and step. | pytest result, smoke test |
| Build Step | Named measured phase such as runtime_image_build or gitops_push. | random job step |
| Telemetry Marker | Start/finish wrapper that records step duration and contextual attributes. | shell log, timing hack |
| Telemetry Sink | Durable event destination. Current sinks are Mimir metrics, Kafka runtime events, and optional ClickHouse direct inserts. | console output |
| Evidence Projection | Queryable ClickHouse row derived from OTLP/Kafka/raw events or direct inserts. | source of domain truth |

GitHub Actions runtime-release
-> ops/ci/record_build_pipeline_event.sh
-> ops/infra/build_pipeline_event_telemetry.py
-> OTLP HTTP through the build-plane edge endpoint
-> OpenTelemetry Collector
-> Mimir metrics store
-> Kafka topic fractalops.runtime.events
-> warehouse.fractalops_events_raw
-> ClickHouse projections and dashboards

Mimir is the central metrics store. ClickHouse remains the event, fact, and warehouse proof plane. Kafka remains the runtime event stream. The collector is the routing boundary, not the durable metrics database.

Direct ClickHouse insertion remains supported for controlled jobs that have FRACTALOPS_CLICKHOUSE_* credentials. It is a sink, not the canonical transport. OTLP is the default build-plane transport because every runner can emit it through topology-derived collector endpoints.
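
The sink rule above can be sketched as a small selection function. This is an illustration only: `select_sinks` and the sink labels `otlp` and `clickhouse_direct` are hypothetical names, and the sketch assumes that the presence of any non-empty `FRACTALOPS_CLICKHOUSE_*` variable is what marks a job as having credentials.

```python
from typing import List, Mapping

# Prefix taken from the text above; the concrete variable names are not assumed.
CLICKHOUSE_PREFIX = "FRACTALOPS_CLICKHOUSE_"


def select_sinks(env: Mapping[str, str]) -> List[str]:
    """Return the sinks a job may write to (hypothetical labels).

    OTLP is always selected because it is the default transport.
    Direct ClickHouse insertion is added only when credentials are present.
    """
    sinks = ["otlp"]
    has_clickhouse_credentials = any(
        key.startswith(CLICKHOUSE_PREFIX) and value
        for key, value in env.items()
    )
    if has_clickhouse_credentials:
        sinks.append("clickhouse_direct")
    return sinks
```

Keeping OTLP unconditional mirrors the point that direct insertion is an extra sink, not a replacement transport.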

GitHub Actions and other non-cluster runners must use the Build Plane edge endpoint, currently FRACTALOPS_OTEL_OTLP_HTTP_ENDPOINT=http://10.10.10.47:19010. In-cluster workers may use FRACTALOPS_OTEL_OTLP_HTTP_CLUSTER_ENDPOINT=http://opentelemetry-collector.observability.svc.cluster.local:4318.
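
A minimal sketch of how a runner might pick between the two endpoints. The function name `resolve_otlp_endpoint` is hypothetical, and using `KUBERNETES_SERVICE_HOST` (which Kubernetes injects into pods) as the in-cluster signal is an assumption; the real resolution lives in the topology-derived env scripts.

```python
from typing import Mapping, Optional

EDGE_VAR = "FRACTALOPS_OTEL_OTLP_HTTP_ENDPOINT"
CLUSTER_VAR = "FRACTALOPS_OTEL_OTLP_HTTP_CLUSTER_ENDPOINT"


def resolve_otlp_endpoint(env: Mapping[str, str]) -> Optional[str]:
    """Pick the OTLP HTTP endpoint for the current runner.

    Assumption: a non-empty KUBERNETES_SERVICE_HOST means "in-cluster".
    Non-cluster runners always get the Build Plane edge endpoint.
    """
    in_cluster = bool(env.get("KUBERNETES_SERVICE_HOST"))
    cluster_endpoint = env.get(CLUSTER_VAR)
    if in_cluster and cluster_endpoint:
        return cluster_endpoint.rstrip("/")
    edge_endpoint = env.get(EDGE_VAR)
    return edge_endpoint.rstrip("/") if edge_endpoint else None
```

Calling it with `os.environ` returns `None` when neither variable is set, which lets the caller decide whether missing telemetry config should be fatal.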

Control-plane phase-timing traces now flow to ClickHouse. Two faults were corrected:

  • The Redpanda topic retention was capped while disk was full, which blocked the Kafka produce path. Retention was restored.
  • The agent-server OTLP exporter endpoint is now pointed at /v1/traces. LangGraph posts to OTEL_EXPORTER_OTLP_ENDPOINT as-is, so the path must be the full traces route, not the collector root.
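
The second fix can be guarded against in configuration code. The helper below is a sketch, not the deployed fix: `full_traces_endpoint` is a hypothetical name, and it only normalizes a collector URL so that it ends in the `/v1/traces` route before it is written to `OTEL_EXPORTER_OTLP_ENDPOINT`.

```python
from urllib.parse import urlsplit, urlunsplit

TRACES_PATH = "/v1/traces"


def full_traces_endpoint(endpoint: str) -> str:
    """Return the endpoint with the OTLP HTTP traces route appended if missing."""
    parts = urlsplit(endpoint)
    path = parts.path.rstrip("/")
    if not path.endswith(TRACES_PATH):
        path = path + TRACES_PATH
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```

The function is idempotent, so it is safe to apply to a value that already carries the full route.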

GlitchTip is the error-tracking plane. It is the Sentry-compatible application error and performance surface; it is not a metrics, ontology, or proof authority.

| Term | Meaning | Do not call it |
| --- | --- | --- |
| Error Tracking Plane | GlitchTip error/performance surface for runtime and preview-app exceptions. | Mimir, ClickHouse proof, OTel collector |
| Project DSN | Per-project public ingest key. The DSN public key is the per-project auth for unauthenticated event POSTs. | Bearer API token |
| GlitchTip MCP | The official built-in GlitchTip MCP server, registered in the armory. | A custom MCP wrapper |

Deployment (platform/k8s/apps/glitchtip, ArgoCD Application glitchtip in the runtime-services project):

  • New glitchtip namespace: GlitchTip 6.1.8 web + celery worker + valkey + migrate hook.
  • DB is a glitchtip database/role on the shared CloudNativePG cluster fractalops-postgresql. Org slug is fractalops.
  • Ingress is exposure-scope: public (the standard self-hosted Sentry model). Preview apps POST events to their DSN unauthenticated — the DSN public key is the per-project auth. GlitchTip’s own auth still gates everything else: the UI requires login (open registration disabled), and the management API and /mcp require a Bearer token.
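
To make "the DSN public key is the per-project auth" concrete, the sketch below splits a DSN into its parts, following the Sentry DSN convention of `scheme://public_key@host/project_id`. `ParsedDsn`, `parse_dsn`, `auth_header`, and the client string are hypothetical names for illustration; Sentry SDKs do this internally, so application code normally never needs it.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ParsedDsn:
    public_key: str
    project_id: str
    envelope_url: str


def parse_dsn(dsn: str) -> ParsedDsn:
    """Split a Sentry-style DSN into public key, project id, and ingest URL."""
    parts = urlsplit(dsn)
    if not parts.username or not parts.hostname:
        raise ValueError("DSN must contain a public key and a host")
    # The last path segment is the project id; anything before it is a path prefix.
    prefix, _, project_id = parts.path.rstrip("/").rpartition("/")
    if not project_id:
        raise ValueError("DSN must end with a project id")
    host = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    envelope_url = f"{parts.scheme}://{host}{prefix}/api/{project_id}/envelope/"
    return ParsedDsn(parts.username, project_id, envelope_url)


def auth_header(parsed: ParsedDsn, client: str = "example-client/0.0") -> str:
    """Build the X-Sentry-Auth header value; only the public key is needed."""
    return (
        f"Sentry sentry_version=7, sentry_key={parsed.public_key}, "
        f"sentry_client={client}"
    )
```

Note that no Bearer token appears anywhere in the ingest path; the Bearer token is reserved for the management API and /mcp.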

GlitchTip 6.1+ ships an official built-in MCP server at /mcp (enabled by GLITCHTIP_ENABLE_MCP=True and GLITCHTIP_ENABLE_LOGS=True). A custom hand-rolled MCP wrapper was used briefly and has been removed in favor of the official one.

  • It exposes 17 tools: list_organizations, list_projects, list_issues, get_issue, get_latest_event, get_event, update_issue, list_transaction_groups, get_transaction_group, list_transaction_spans, list_span_groups, detect_n_plus_one, get_transaction_trend, list_alerts, list_monitors, list_logs, get_log.
  • It accepts a static Bearer API token service-to-service. The armory registrar registers it in ContextForge as gateway yamon-glitchtip, URL http://glitchtip-web.glitchtip.svc.cluster.local:8080/mcp, auth_type=authheaders, Authorization: Bearer <org API token> (token in secret glitchtip-mcp-bearer, armory namespace).
  • The tester and compactor agent roles get the observability-triage bundle: servers (glitchtip, fractalops-hud) + policies (error-triage, performance-regression-check, issue-resolution-proof). Workspace-shims is bumped to 0.1.98 with the official glitchtip tool names.

project_factory auto-provisions error tracking for every scaffolded project with zero manual setup. glitchtip_dsn_provisioning.ensure_project_dsn(project_slug) get-or-creates the project’s GlitchTip project and returns its public DSN. It is fail-open — DSN provisioning never blocks scaffolding.

  • The DSN is baked into the starter .env files: frontend PUBLIC_SENTRY_DSN and backend SENTRY_DSN. GlitchTip is Sentry-compatible, so projects use @sentry/astro / sentry-sdk.
  • Settings: FRACTALOPS_GLITCHTIP_API_URL, FRACTALOPS_GLITCHTIP_API_TOKEN, FRACTALOPS_GLITCHTIP_ORG.
  • App-level SDK init is the agents’ responsibility; the baked DSN .env is the control-plane handoff.
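
The fail-open behaviour and the .env handoff can be sketched as follows. This is not the project_factory implementation: `provision_dsn_fail_open` and `starter_env_lines` are hypothetical helpers, the provisioner is passed in as a callable standing in for `ensure_project_dsn`, and writing an empty value when no DSN is available is an assumption.

```python
import logging
from typing import Callable, Dict, Optional

log = logging.getLogger("project_factory.sketch")


def provision_dsn_fail_open(
    project_slug: str,
    ensure_project_dsn: Callable[[str], str],
) -> Optional[str]:
    """Return the project's DSN, or None if provisioning fails.

    Fail-open: any provisioning error is logged and swallowed so that
    scaffolding can continue.
    """
    try:
        return ensure_project_dsn(project_slug)
    except Exception:
        log.warning(
            "DSN provisioning failed for %s; continuing without it",
            project_slug,
            exc_info=True,
        )
        return None


def starter_env_lines(dsn: Optional[str]) -> Dict[str, str]:
    """Render the DSN lines for the frontend and backend starter .env files.

    Assumption: a missing DSN is written as an empty value.
    """
    value = dsn or ""
    return {
        "frontend": f"PUBLIC_SENTRY_DSN={value}\n",
        "backend": f"SENTRY_DSN={value}\n",
    }
```

Catching `Exception` rather than specific errors is deliberate in this sketch, since the stated contract is that provisioning never blocks scaffolding.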

Event names use <build_step>_started and <build_step>_finished.

| Build Step | Purpose |
| --- | --- |
| build_plane_probe | Build driver and builder setup. |
| runtime_base_image_build | Base runtime image build and push. |
| runtime_image_build | FractalOps API/runtime image build and push. |
| agent_server_image_build | LangGraph agent server image build and push. |
| gitops_pin_update | Local GitOps manifest/image pin mutation. |
| gitops_push | GitOps commit and remote push. |
| runtime_rollout_convergence | Runtime GitOps commit convergence and deployment image propagation. |

New steps must be named by domain outcome, not implementation detail. Prefer source_bundle_upload_finished over tar_command_finished.
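
The naming rule can be enforced with a small helper. This is a sketch under assumptions: `event_name` and the `BUILD_STEPS` registry are hypothetical, with the registry simply mirroring the steps listed on this page so that an unregistered step fails loudly.

```python
# Registry mirroring the Build Steps listed on this page (hypothetical constant).
BUILD_STEPS = frozenset(
    {
        "build_plane_probe",
        "runtime_base_image_build",
        "runtime_image_build",
        "agent_server_image_build",
        "gitops_pin_update",
        "gitops_push",
        "runtime_rollout_convergence",
    }
)
PHASES = ("started", "finished")


def event_name(build_step: str, phase: str) -> str:
    """Return '<build_step>_<phase>' for a registered step and a valid phase."""
    if build_step not in BUILD_STEPS:
        raise ValueError(f"unknown Build Step: {build_step!r}")
    if phase not in PHASES:
        raise ValueError(f"phase must be one of {PHASES}, got {phase!r}")
    return f"{build_step}_{phase}"
```

For example, `event_name("gitops_push", "finished")` yields `gitops_push_finished`, while an implementation-detail name such as `tar_command` is rejected until it is registered under a domain-outcome name.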

Every Build Plane Event should carry:

| Attribute | Source |
| --- | --- |
| run_id | GITHUB_RUN_ID or build-plane run id. |
| workflow_id | GITHUB_WORKFLOW or runner workflow id. |
| task_queue | Build task queue, for example github-actions-build-plane. |
| worker_id | Runner host/user identity. |
| runner_name, runner_os, runner_arch | GitHub runner metadata when present. |
| project | Project slug, default fractalops. |
| branch, commit_sha | Source revision. |
| duration_ms | Finish marker duration. |
| cache_key, cache_hit | Build cache identity and hit state when known. |
| image_ref, image_digest | Image target and produced digest when known. |
| resource_cpu, resource_memory | Requested or observed build resource envelope when known. |
| metadata_json | Small structured extension for step-specific facts. |
| status | started, success, failure, or cancelled. |

OpenTelemetry attributes are exported under fractalops.build.*. FractalOps-only extensions must stay under this namespace; use standard OpenTelemetry semantic keys only when they already describe the field.
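
A minimal sketch of the namespacing rule, assuming events are held as flat dictionaries keyed by the attribute names listed above. `to_otel_attributes` is a hypothetical helper; dropping unknown ("when known") values instead of exporting nulls, and filling the default project slug, are choices made for this illustration.

```python
from typing import Any, Dict, Mapping

NAMESPACE = "fractalops.build."
STATUSES = frozenset({"started", "success", "failure", "cancelled"})


def to_otel_attributes(event: Mapping[str, Any]) -> Dict[str, Any]:
    """Prefix Build Plane Event fields with the fractalops.build.* namespace.

    Fields whose value is None are omitted, and an unknown status is rejected.
    """
    status = event.get("status")
    if status not in STATUSES:
        raise ValueError(f"invalid status: {status!r}")
    fields = {"project": "fractalops", **event}  # default project slug
    return {
        f"{NAMESPACE}{key}": value
        for key, value in fields.items()
        if value is not None
    }
```

Standard OpenTelemetry semantic keys, where they already describe a field, would be added alongside these rather than through this helper.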

| Concern | Owner file |
| --- | --- |
| Topology-derived build-plane env | ops/infra/build_plane_env.py |
| GitHub Actions env export | ops/ci/resolve_build_plane.sh |
| Start/finish markers | ops/ci/record_build_pipeline_event.sh |
| OTLP and ClickHouse emission | ops/infra/build_pipeline_event_telemetry.py |
| Build-plane OTLP HTTP edge route | platform/k8s/apps/opentelemetry-collector/templates/ingressroute.yaml |
| Central metrics store | platform/k8s/apps/mimir/ |
| Runtime image build workflow | .github/workflows/runtime-release.yml |
| Runtime GitOps bump workflow | .github/workflows/runtime-gitops-bump.yml |
| Runtime rollout observe workflow | .github/workflows/runtime-rollout-observe.yml |
| Warehouse raw sink | warehouse.fractalops_events_raw |
| Optional direct projection | warehouse.build_events |

Do not add ad hoc timers directly to workflow bodies when a Telemetry Marker can express the same phase. Add a named Build Step, emit start/finish, and extend this page first when the term is new.
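
For Python callers, a Telemetry Marker could look like the context manager below. This is a sketch, not the behaviour of ops/ci/record_build_pipeline_event.sh: `telemetry_marker` and the `emit` callback are hypothetical, and mapping `KeyboardInterrupt` to the `cancelled` status is an assumption.

```python
import time
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator

Emit = Callable[[Dict[str, Any]], None]


@contextmanager
def telemetry_marker(build_step: str, emit: Emit, **attrs: Any) -> Iterator[None]:
    """Emit <build_step>_started, run the body, then emit <build_step>_finished.

    The finish event carries duration_ms and a status derived from how the
    body exited. Exceptions are re-raised after the finish event is emitted.
    """
    start = time.monotonic()
    emit({"event": f"{build_step}_started", "status": "started", **attrs})
    status = "success"
    try:
        yield
    except KeyboardInterrupt:
        status = "cancelled"  # assumption: an interrupt counts as cancellation
        raise
    except Exception:
        status = "failure"
        raise
    finally:
        duration_ms = int((time.monotonic() - start) * 1000)
        emit(
            {
                "event": f"{build_step}_finished",
                "status": status,
                "duration_ms": duration_ms,
                **attrs,
            }
        )
```

Wrapping a phase as `with telemetry_marker("gitops_push", emit, run_id=run_id): ...` keeps timing out of the workflow body and guarantees that every start event has a matching finish event, even on failure.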