Semantic Plane Redesign & Runtime Hardening

This page records the 2026-06 hardening pass: the DataHub semantic-plane redesign, the central Supabase data plane, a run of AgentSquad runtime fixes, infrastructure resilience work, and cluster resource optimization.

DataHub semantic plane — redesign

Diagnosis: why the catalog gave no benefit

The semantics → DataHub machinery was fully built but not operational:

Write path exists but never fires. DataHubPublishExecutor already emits MetadataChangeProposals to GMS as idempotent UPSERT ingestProposal calls (catalog + lineage + an ontology-fact fallback for unregistered custom aspects). But its only caller is the admin endpoint: there is no scheduled publish, so the catalog never auto-populates. It is also gated on source_health == healthy, and the live DataHub search index was degraded, so even a manual publish was skipped.
Read loop is absent. Even when the catalog is populated, agents never query it: there was no agent→DataHub grounding path, so a populated catalog would still deliver zero benefit.
Net effect: the live catalog held only residual identity/access edges (no dataset index at all), so search returned HTTP 500 and the plane was an empty shell.

The benefit, stated plainly

DataHub should be the agents’ truth plane:

Grounding — backend/contract agents read the real schema + lineage instead of inventing values (this is the source the anti-invention gate cites).
Reuse / dedup — “does this schema already exist?” prevents per-project DB sprawl.
Impact / safety — downstream lineage before a change avoids breaking dependents.
Provenance — which run/proposal produced which asset (proof + traceability).

Design: close the loop with the official DataHub MCP

DataHub ships an official Model Context Protocol server (acryldata/mcp-server-datahub), self-hosted for DataHub Core. It exposes discovery (/q structured search with boolean logic + filters), table/column-level multi-hop lineage, schema + real-query/SQL drafting, and metadata (ownership, tags, glossary, quality) to AI agents over the standard MCP. Rather than hand-rolling a brief tool, the read loop is wired through this official MCP into the agent armory (alongside serena / searxnggrid / agent-memory-archive / feature-plane).

The redesign is therefore three operational moves, not new write code:

Write: consolidate the emit into one sound, extensible MetadataChangeProposal emitter (DataHub's own "MCP" abbreviation, distinct from the Model Context Protocol used on the read side) and run it on a schedule so the catalog stays current (datasets, schema, lineage: sanmopia legacy → central Supabase schema → project DBs).
Read: deploy the official DataHub MCP server and add it to the agent tool inventory so agents query catalog/lineage/schema/SQL in natural language.
Stewards: operationalize the existing draft agents that own this plane: scribe (delivery leaves explicit lineage, no hidden joins), cataloger (compact DataHub briefs replacing transcript scanning), ontos (bind delivery to ontology + source-health + warehouse semantics), curator (lineage + memory → Agent Wiki).
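
The write move can be sketched as a health-gated, idempotent publish that a scheduler calls on an interval. This is a minimal sketch under stated assumptions, not the DataHubPublishExecutor code: `Proposal`, `InMemoryCatalog`, and `publish_once` are hypothetical names, and the in-memory dictionary stands in for the real GMS ingestProposal calls.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Proposal:
    """Hypothetical stand-in for one MetadataChangeProposal."""
    urn: str
    aspect: str
    payload: str


@dataclass
class InMemoryCatalog:
    """Stand-in for GMS: an UPSERT keyed by (urn, aspect) is idempotent."""
    aspects: dict = field(default_factory=dict)

    def upsert(self, proposal: Proposal) -> None:
        # Writing the same key twice leaves exactly one entry.
        self.aspects[(proposal.urn, proposal.aspect)] = proposal.payload


def publish_once(catalog: InMemoryCatalog, proposals: list, source_health: str) -> int:
    """Emit every proposal, or skip the whole cycle when the source is unhealthy.

    Returns the number of proposals emitted so a scheduler can log skipped cycles.
    """
    if source_health != "healthy":
        return 0  # the gate described above: a degraded source skips the publish
    for proposal in proposals:
        catalog.upsert(proposal)
    return len(proposals)
```

Because every write is an upsert, a scheduler can call `publish_once` on each tick without tracking what was already sent: repeated cycles converge on the same catalog state, and a skipped cycle is simply retried on the next tick.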

The creative core is transcript-scan → structured-truth-query: agents read a compact, grounded context graph instead of re-reading raw history — large token and reasoning savings, plus grounding and self-improvement (Ouroboros decisions from the graph).

Central Supabase data plane

A single self-hosted Supabase (Kong gateway, GoTrue, PostgREST, postgres-meta, Studio) was deployed on the existing CNPG-backed Postgres, GitOps-managed via Argo (Helm chart + ExternalSecret from OpenBao), with an idempotent roles-bootstrap Job. It serves both the FractalOps platform and projects as a schema-per-project data plane, removing per-project database sprawl.

A supabase connector + SupabasePostgresExecutor provision a per-project schema + the standard anon/authenticated/service_role grants on the central CNPG; verified live against the real cluster.
Project guidance routes databases to the central plane (never a per-project DB stack); Dokploy keeps persistent backing services (databases, static-site / vercel-sim hosting, big-facility compose) only. Dev previews run as bare processes in the sandbox, not on Dokploy — see Dev Preview Plane.
Agents receive the central connection as a secure asset: address via the run’s dataPlaneConnection, and the anon/service_role keys via an OpenBao lease ref resolved at runtime — key values never land in run metadata, prompts, or the workspace on disk.
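
The schema-per-project provisioning above can be sketched as a function that builds idempotent SQL for one project. This is an illustration only: `provision_statements` is a hypothetical helper, and the exact grant set SupabasePostgresExecutor applies is not recorded on this page. The statements below follow the general pattern Supabase documents for exposing a custom schema to its three API roles.

```python
import re

SUPABASE_ROLES = ("anon", "authenticated", "service_role")
# Conservative identifier check so a project name can never inject SQL.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def provision_statements(schema: str) -> list[str]:
    """Build re-runnable SQL for one per-project schema (hypothetical helper)."""
    if not _IDENTIFIER.fullmatch(schema):
        raise ValueError(f"unsafe schema name: {schema!r}")
    roles = ", ".join(SUPABASE_ROLES)
    return [
        # IF NOT EXISTS and GRANT are both safe to repeat.
        f"CREATE SCHEMA IF NOT EXISTS {schema};",
        f"GRANT USAGE ON SCHEMA {schema} TO {roles};",
        f"GRANT ALL ON ALL TABLES IN SCHEMA {schema} TO {roles};",
        # Assumed: tables created later should inherit the same grants.
        f"ALTER DEFAULT PRIVILEGES IN SCHEMA {schema} GRANT ALL ON TABLES TO {roles};",
    ]
```

Every statement can be re-applied without error, which is what lets a provisioning executor run on each project setup without checking prior state. With grants this broad, row-level security policies would be what actually restricts the anon and authenticated roles; whether the executor relies on that is an assumption here.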

AgentSquad runtime fixes

Area	Fix
Anti-invention	New backend domain files that hardcode a business-rule constant (fee/rate/commission/…) as a non-trivial decimal with no source citation are walled `ungrounded_business_rule` — a general provenance rule replacing the reactive token blocklist. Verified against a real invented PR.
Stable session identity (CQRS)	A re-run minted a fresh Claude session id, so it could not resume and its slot key changed → slot-exhaustion → the agent hung workspace-less at `waiting_first_tool`. Rebuilt the dormant stable-id helper as a DDD/CQRS slice (domain stable id + command `stabilize` + query `resolve`), wired into session seeding so every launch lands on a deterministic-per-(run, agent) id → resumes its session and reuses its slot.
Slot capacity	The shared project workspace capped concurrent agent slots at 3; the 4th+ agent in the full roster hung. Settings-driven cap, default 8, so the roster fits (idle slots reclaimed by hibernation).
Hibernation TTL	Agent sandboxes were created `auto_stop=0` (never stop). Settings-backed defaults: auto-stop 15 min (hibernate idle, resume on demand), auto-archive 12 h, no destructive auto-delete.
Resume self-heal	`resume` treated a torn-down (`cleaned`) session as resumable and stranded the agent in `cleaned`; cleaned/`cleanup_run` sessions are now non-resumable so resume reseeds them fresh while healthy sessions keep continuity.
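
The stable session identity fix rests on deriving the id from (run, agent) instead of minting a random one. A minimal sketch of that idea, assuming session ids are UUIDs: `SESSION_NAMESPACE` and `stable_session_id` are hypothetical names, not the helper's actual API, and the real slice also separates the `stabilize` command from the `resolve` query.

```python
import json
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
SESSION_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/agent-sessions")


def stable_session_id(run_id: str, agent: str) -> str:
    """Derive the same session id for the same (run, agent) pair on every launch."""
    # JSON-encode the pair so ("a/b", "c") and ("a", "b/c") cannot collide.
    name = json.dumps([run_id, agent])
    return str(uuid.uuid5(SESSION_NAMESPACE, name))
```

Because `uuid.uuid5` is a pure function of its inputs, a re-run computes the id it used before, so it can resume the existing session, and a slot key derived from that id stays unchanged.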

Infrastructure resilience & optimization

Broken memory-HPA. fractalops-api and fractalops-worker autoscaled on a memory target below each pod’s flat Python import baseline, so the HPA pinned both at maxReplicas forever while CPU idled. Dropped memory-based autoscaling (CPU-only for flat-baseline stateless services), right-sized requests; api 4→2, worker drains to 2 — relieving the node CPU-request saturation behind the recurring pod-scheduling pressure.
VM recovery. A k3s control/etcd VM stayed dead because (a) it was stopped with no auto-restart, (b) its kubelet cert was future-dated from a transient clock skew so k3s 401’d on restart, and (c) NTP wasn’t syncing. Fixed by HA-managing the VM (auto-restart), rotating the node certs at the correct time, and restoring the shared-pool NTP source so the clock — and therefore the certs — stay valid.
Preview/test envs → on-demand. The fractalops-preview-a/b/c and fractalops-test isolated Portal/API validation envs ran 24/7 (~3.5 GiB) on stale images. Made them on-demand (replicas 0, HPAs removed) — reclaiming the RAM while keeping the GitOps definitions for one-command revival.
Argo drift. ESO defaults injected into ExternalSecret remoteRef fields kept the central Supabase app perpetually OutOfSync; added ignoreDifferences so it converges to Synced.
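
The memory-HPA failure follows directly from the autoscaler's documented core formula, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with a default tolerance of 0.1 around a ratio of 1. The sketch below replays that formula with a per-pod metric that stays flat regardless of replica count, as an import baseline does. The memory figures are illustrative, not measurements from the cluster, and the real controller also applies stabilization windows that are omitted here.

```python
import math


def desired_replicas(current: int, current_value: float, target_value: float,
                     min_r: int, max_r: int, tolerance: float = 0.1) -> int:
    """One HPA evaluation: scale by the metric ratio, clamped to [min_r, max_r]."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current  # within tolerance: no scaling
    return max(min_r, min(max_r, math.ceil(current * ratio)))


def settle(start: int, per_pod_value: float, target_value: float,
           min_r: int, max_r: int, steps: int = 20) -> int:
    """Repeat evaluations with a flat per-pod metric until the count stops changing."""
    replicas = start
    for _ in range(steps):
        nxt = desired_replicas(replicas, per_pod_value, target_value, min_r, max_r)
        if nxt == replicas:
            break
        replicas = nxt
    return replicas
```

With a hypothetical flat baseline of 400 MiB per pod against a 300 MiB target, `settle(2, 400, 300, 2, 4)` climbs to 4 and stays there: adding pods never lowers the per-pod average, so the ratio stays above 1 and the result is always clamped at maxReplicas. That is why memory is a poor scaling signal for services whose footprint does not track load.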

Provenance

Shipped as a sequence of small, reviewed GitOps/backend pull requests (numbers #1519–#1530 in yamonco/fractalops), each with co-located tests and a single focused change, merged after the standard admin gate.