
Semantic Plane Redesign & Runtime Hardening

This page records the 2026-06 hardening pass: the DataHub semantic-plane redesign, the central Supabase data plane, a run of AgentSquad runtime fixes, infrastructure resilience work, and cluster resource optimization.

Diagnosis: why the catalog gave no benefit

The semantics → DataHub machinery was fully built but not operational:

  • Write path exists but never fires. DataHubPublishExecutor already emits MetadataChangeProposals to GMS as idempotent UPSERT ingestProposal calls (catalog + lineage + an ontology-fact fallback for unregistered custom aspects). But its only caller is the admin endpoint; there is no scheduled publish, so the catalog never auto-populates. Publishing is also gated on source_health == healthy, and the live DataHub search index was degraded, so even a manual publish was skipped.
  • Read loop is absent. Even with a populated catalog, agents would never query it: there was no agent→DataHub grounding path, so the catalog would still deliver zero benefit.
  • Net effect: the live catalog held only residual identity/access edges (no dataset index at all), so search 500’d and the plane was an empty shell.
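
The write-path behavior described above (idempotent upserts behind a source-health gate) can be sketched in a few lines. This is a minimal in-memory illustration, not the DataHubPublishExecutor code: `FakeCatalog`, `publish`, and the tuple shape of a proposal are hypothetical names chosen for the example, and no real DataHub client is involved.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCatalog:
    """Hypothetical stand-in for GMS: stores one value per (urn, aspect)."""
    aspects: dict = field(default_factory=dict)

    def upsert(self, urn: str, aspect: str, value: dict) -> None:
        # UPSERT semantics: writing the same key again overwrites, never duplicates.
        self.aspects[(urn, aspect)] = value


def publish(catalog: FakeCatalog, proposals: list, source_health: str) -> int:
    """Emit proposals only when the source is healthy; return how many were sent."""
    if source_health != "healthy":
        return 0  # gate closed: the publish is skipped entirely
    for urn, aspect, value in proposals:
        catalog.upsert(urn, aspect, value)
    return len(proposals)


# Usage: a degraded source skips the publish; repeated healthy runs converge.
catalog = FakeCatalog()
proposals = [("urn:li:dataset:orders", "schemaMetadata", {"fields": ["id", "total"]})]
skipped = publish(catalog, proposals, "degraded")
first = publish(catalog, proposals, "healthy")
second = publish(catalog, proposals, "healthy")
```

Because every write is keyed by (urn, aspect), running the same publish on a schedule is safe: repeated runs leave the catalog in the same state.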

DataHub should be the agents’ truth plane:

  1. Grounding — backend/contract agents read the real schema + lineage instead of inventing values (this is the source the anti-invention gate cites).
  2. Reuse / dedup — “does this schema already exist?” prevents per-project DB sprawl.
  3. Impact / safety — downstream lineage before a change avoids breaking dependents.
  4. Provenance — which run/proposal produced which asset (proof + traceability).
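
The impact/safety check in item 3 amounts to a downstream walk over lineage edges before a change is made. The sketch below assumes lineage is already available as a plain adjacency mapping; the dataset names and the `downstream` helper are hypothetical and do not come from the DataHub API.

```python
from collections import deque


def downstream(edges: dict, start: str) -> list:
    """Return every asset reachable downstream of `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:  # guard against cycles and diamond-shaped lineage
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: legacy source -> central schema -> two project assets.
lineage = {
    "legacy.orders": ["central.orders"],
    "central.orders": ["project_a.orders_view", "project_b.report"],
}
affected = downstream(lineage, "legacy.orders")
```

An agent that sees a non-empty result for the asset it is about to change knows which dependents to check first.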

Design: close the loop with the official DataHub MCP

DataHub ships an official Model Context Protocol server (acryldata/mcp-server-datahub), self-hosted for DataHub Core. It exposes discovery (/q structured search with boolean logic + filters), table/column-level multi-hop lineage, schema + real-query/SQL drafting, and metadata (ownership, tags, glossary, quality) to AI agents over the standard MCP. Rather than hand-rolling a brief tool, the read loop is wired through this official MCP into the agent armory (alongside serena / searxnggrid / agent-memory-archive / feature-plane).

The redesign is therefore three operational moves, not new write code:

  1. Write — consolidate the emit into one sound, extensible MetadataChangeProposal emitter (not to be confused with the Model Context Protocol server used for reads) and run it on a schedule so the catalog stays current (datasets, schema, lineage: sanmopia legacy → central Supabase schema → project DBs).
  2. Read — deploy the official DataHub MCP server and add it to the agent tool inventory so agents query catalog/lineage/schema/SQL in natural language.
  3. Stewards — operationalize the existing draft agents that own this plane: scribe (delivery leaves explicit lineage, no hidden joins), cataloger (compact DataHub briefs replacing transcript scanning), ontos (bind delivery to ontology + source-health + warehouse semantics), curator (lineage + memory → Agent Wiki).

The creative core is transcript-scan → structured-truth-query: agents read a compact, grounded context graph instead of re-reading raw history — large token and reasoning savings, plus grounding and self-improvement (Ouroboros decisions from the graph).

A single self-hosted Supabase (Kong gateway, GoTrue, PostgREST, postgres-meta, Studio) was deployed on the existing CNPG-backed Postgres, GitOps-managed via Argo (Helm chart + ExternalSecret from OpenBao), with an idempotent roles-bootstrap Job. It serves both the FractalOps platform and projects as a schema-per-project data plane, removing per-project database sprawl.

  • A supabase connector + SupabasePostgresExecutor provision a per-project schema plus the standard anon/authenticated/service_role grants on the central CNPG; this was verified live against the real cluster.
  • Project guidance routes databases to the central plane (never a per-project DB stack); Dokploy keeps persistent backing services (databases, static-site / vercel-sim hosting, big-facility compose) only. Dev previews run as bare processes in the sandbox, not on Dokploy — see Dev Preview Plane.
  • Agents receive the central connection as a secure asset: address via the run’s dataPlaneConnection, and the anon/service_role keys via an OpenBao lease ref resolved at runtime — key values never land in run metadata, prompts, or the workspace on disk.
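
As a rough illustration of schema-per-project provisioning, the sketch below builds the SQL for one project schema and its role grants. It is not the SupabasePostgresExecutor implementation: the `proj_` prefix, the `provision_schema_sql` helper, and the choice to grant only USAGE are assumptions made for the example; the real grant set may be broader.

```python
import re

# Conservative identifier check so a project slug can never inject SQL.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]{0,62}$")

# The three standard Supabase API roles named in the text.
ROLES = ("anon", "authenticated", "service_role")


def provision_schema_sql(project_slug: str) -> list:
    """Return idempotent SQL statements for one per-project schema."""
    schema = f"proj_{project_slug}"  # hypothetical naming convention
    if not _IDENT.match(schema):
        raise ValueError(f"unsafe schema name: {schema!r}")
    roles = ", ".join(ROLES)
    return [
        f'CREATE SCHEMA IF NOT EXISTS "{schema}";',
        f'GRANT USAGE ON SCHEMA "{schema}" TO {roles};',
    ]


statements = provision_schema_sql("billing")
```

Both statements are safe to re-run, which matches the idempotent style of the roles-bootstrap Job described above.
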

| Area | Fix |
| --- | --- |
| Anti-invention | New backend domain files that hardcode a business-rule constant (fee/rate/commission/…) as a non-trivial decimal with no source citation are walled ungrounded_business_rule, a general provenance rule replacing the reactive token blocklist. Verified against a real invented PR. |
| Stable session identity (CQRS) | A re-run minted a fresh Claude session id, so it could not resume and its slot key changed → slot-exhaustion → the agent hung workspace-less at waiting_first_tool. Rebuilt the dormant stable-id helper as a DDD/CQRS slice (domain stable id + command stabilize + query resolve), wired into session seeding so every launch lands on a deterministic-per-(run, agent) id → resumes its session and reuses its slot. |
| Slot capacity | The shared project workspace capped concurrent agent slots at 3; the 4th+ agent in the full roster hung. Settings-driven cap, default 8, so the roster fits (idle slots reclaimed by hibernation). |
| Hibernation TTL | Agent sandboxes were created auto_stop=0 (never stop). Settings-backed defaults: auto-stop 15 min (hibernate idle, resume on demand), auto-archive 12 h, no destructive auto-delete. |
| Resume self-heal | resume treated a torn-down (cleaned) session as resumable and stranded the agent in cleaned; cleaned/cleanup_run sessions are now non-resumable so resume reseeds them fresh while healthy sessions keep continuity. |
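
The stable session identity fix relies on an id that is a pure function of (run, agent). One common way to get that property is a name-based UUID; the sketch below uses `uuid.uuid5` for illustration. The namespace string, the in-memory store, and the function names (`stable_session_id`, `stabilize`, `resolve`) are hypothetical, loosely mirroring the domain/command/query split and not the actual helper.

```python
import uuid
from typing import Optional

# Hypothetical namespace; any fixed UUID works as long as it never changes.
_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "agent-sessions.example.invalid")

# Hypothetical in-memory store standing in for session persistence.
_sessions: dict = {}


def stable_session_id(run_id: str, agent: str) -> str:
    """Domain rule: the same (run, agent) pair always maps to the same id."""
    # A unit-separator byte keeps ("a/b", "c") distinct from ("a", "b/c").
    return str(uuid.uuid5(_NAMESPACE, f"{run_id}\x1f{agent}"))


def stabilize(run_id: str, agent: str) -> str:
    """Command: record the deterministic id at session seeding time."""
    session_id = stable_session_id(run_id, agent)
    _sessions[(run_id, agent)] = session_id
    return session_id


def resolve(run_id: str, agent: str) -> Optional[str]:
    """Query: look up the id for a pair, or None if it was never seeded."""
    return _sessions.get((run_id, agent))


first_launch = stabilize("run-42", "backend")
rerun = stabilize("run-42", "backend")
```

Since a re-run derives the same id, it can resume the earlier session and its slot key does not change.
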
  • Broken memory-HPA. fractalops-api and fractalops-worker autoscaled on a memory target below each pod’s flat Python import baseline, so the HPA pinned both at maxReplicas forever while CPU idled. Dropped memory-based autoscaling (CPU-only for flat-baseline stateless services), right-sized requests; api 4→2, worker drains to 2 — relieving the node CPU-request saturation behind the recurring pod-scheduling pressure.
  • VM recovery. A k3s control/etcd VM stayed dead because (a) it was stopped with no auto-restart, (b) its kubelet cert was future-dated from a transient clock skew so k3s 401’d on restart, and (c) NTP wasn’t syncing. Fixed by HA-managing the VM (auto-restart), rotating the node certs at the correct time, and restoring the shared-pool NTP source so the clock — and therefore the certs — stay valid.
  • Preview/test envs → on-demand. The fractalops-preview-a/b/c and fractalops-test isolated Portal/API validation envs ran 24/7 (~3.5 GiB) on stale images. Made them on-demand (replicas 0, HPAs removed) — reclaiming the RAM while keeping the GitOps definitions for one-command revival.
  • Argo drift. ESO defaults injected into ExternalSecret remoteRef fields kept the central Supabase app perpetually OutOfSync; added ignoreDifferences so it converges to Synced.
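
The memory-HPA failure in the first bullet follows from the documented Kubernetes scaling rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). When per-pod memory is a flat baseline, adding pods does not lower per-pod utilization, so the ratio stays above 1 and the replica count climbs to the maximum. The sketch below is a simplified simulation (no stabilization windows); the utilization and replica numbers are hypothetical, chosen only to show the shape of the problem.

```python
import math


def desired_replicas(current: int, utilization: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """One HPA evaluation using the standard ratio formula with clamping."""
    ratio = utilization / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # within tolerance: no scaling action
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


def settle(start: int, utilization: float, target: float,
           min_replicas: int, max_replicas: int, steps: int = 10) -> int:
    """Re-evaluate repeatedly with a per-pod utilization that never changes."""
    replicas = start
    for _ in range(steps):
        replicas = desired_replicas(replicas, utilization, target,
                                    min_replicas, max_replicas)
    return replicas


# Memory: flat baseline at 90% of request against a 70% target (hypothetical).
memory_result = settle(start=2, utilization=0.90, target=0.70,
                       min_replicas=2, max_replicas=4)

# CPU: mostly idle at 20% against a 70% target (hypothetical).
cpu_result = settle(start=4, utilization=0.20, target=0.70,
                    min_replicas=2, max_replicas=4)
```

Under these assumed numbers the memory metric pins the deployment at its maximum while the CPU metric alone would let it drain to the minimum, which is the behavior the CPU-only change restores.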

Shipped as a sequence of small, reviewed GitOps/backend pull requests (#1519 to #1530 in yamonco/fractalops), each with co-located tests and a single focused change, merged after the standard admin gate.