Semantic Plane Redesign & Runtime Hardening
This page records the 2026-06 hardening pass: the DataHub semantic-plane redesign, the central Supabase data plane, a run of AgentSquad runtime fixes, infrastructure resilience work, and cluster resource optimization.
DataHub semantic plane — redesign
Diagnosis: why the catalog gave no benefit
The semantics → DataHub machinery was fully built but not operational:
- Write path exists but never fires. `DataHubPublishExecutor` already emits MetadataChangeProposals to GMS as idempotent `UPSERT` `ingestProposal` calls (catalog + lineage + an ontology-fact fallback for unregistered custom aspects). But its only caller is the admin endpoint: there is no scheduled publish, so the catalog never auto-populates. It is also gated on `source_health == healthy`, and the live DataHub search index was degraded, so even a manual publish was skipped.
- Read loop is absent. Agents never query the catalog: there was no agent→DataHub grounding path, so even a populated catalog would deliver zero benefit.
- Net effect: the live catalog held only residual identity/access edges (no `dataset` index at all), so search returned HTTP 500 and the plane was an empty shell.
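The write-path gate described above can be sketched as follows. This is a minimal illustration, not the `DataHubPublishExecutor` code: the function names are hypothetical, and the proposal body follows the commonly documented shape of a GMS `ingestProposal` request, which should be checked against the deployed DataHub version.

```python
import json
from typing import Callable, Iterable


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN in the standard urn:li:dataset form."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_upsert_proposal(urn: str, aspect_name: str, aspect: dict) -> dict:
    """Shape one UPSERT proposal body (field layout is an assumption).

    The aspect is serialized with sorted keys so that re-emitting the same
    metadata produces an identical payload.
    """
    return {
        "proposal": {
            "entityType": "dataset",
            "entityUrn": urn,
            "changeType": "UPSERT",
            "aspectName": aspect_name,
            "aspect": {
                "contentType": "application/json",
                "value": json.dumps(aspect, sort_keys=True),
            },
        }
    }


def publish(proposals: Iterable[dict], source_health: str,
            send: Callable[[dict], None]) -> int:
    """Send proposals only when the source is healthy; return how many were sent."""
    if source_health != "healthy":
        return 0  # degraded source: the whole publish is skipped
    sent = 0
    for proposal in proposals:
        send(proposal)
        sent += 1
    return sent
```

With `send` bound to a list's `append`, a call with `source_health="degraded"` leaves the list empty. That is the failure mode above: even a manual publish produces nothing while the index is unhealthy.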
The benefit, stated plainly
DataHub should be the agents’ truth plane:
- Grounding — backend/contract agents read the real schema + lineage instead of inventing values (this is the source the anti-invention gate cites).
- Reuse / dedup — “does this schema already exist?” prevents per-project DB sprawl.
- Impact / safety — downstream lineage before a change avoids breaking dependents.
- Provenance — which run/proposal produced which asset (proof + traceability).
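The impact/safety benefit comes down to a multi-hop downstream walk over lineage edges. The sketch below uses a plain breadth-first search over an in-memory graph. The asset names and the `LINEAGE` mapping are hypothetical stand-ins for what the catalog would return.

```python
from collections import deque


def downstream_of(edges: dict[str, list[str]], start: str) -> list[str]:
    """Return every asset reachable downstream of `start`, nearest first."""
    seen = {start}
    order: list[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:  # guard against diamonds and cycles
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: upstream asset -> assets that read from it.
LINEAGE = {
    "legacy.orders": ["central.orders"],
    "central.orders": ["project_a.invoices", "project_b.reports"],
    "project_a.invoices": ["project_b.reports"],
}

# Everything that could break if legacy.orders changes.
impacted = downstream_of(LINEAGE, "legacy.orders")
```

An agent proposing a schema change would check `impacted` first; a non-empty result is the signal to review dependents before the change ships.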
Design: close the loop with the official DataHub MCP
DataHub ships an official Model Context Protocol server
(acryldata/mcp-server-datahub), self-hosted for DataHub Core. It exposes
discovery (/q structured search with boolean logic + filters), table/column-level
multi-hop lineage, schema + real-query/SQL drafting, and metadata (ownership, tags,
glossary, quality) to AI agents over the standard MCP. Rather than hand-rolling a
brief tool, the read loop is wired through this official MCP into the agent armory
(alongside serena / searxnggrid / agent-memory-archive / feature-plane).
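On the wire, an agent's catalog read is an MCP `tools/call` request, which is a JSON-RPC 2.0 message. The sketch below only builds that envelope. The tool names (`search`, `get_lineage`) and argument keys are assumptions about what the DataHub MCP server exposes and should be confirmed from the server's `tools/list` response. The query string and URN are placeholders.

```python
import itertools
import json

_request_ids = itertools.count(1)


def tool_call(name: str, arguments: dict) -> str:
    """Serialize one MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Assumed tool names and argument keys; verify against tools/list.
search_request = tool_call("search", {"query": "orders"})
lineage_request = tool_call(
    "get_lineage",
    {"urn": "urn:li:dataset:(urn:li:dataPlatform:postgres,central.orders,PROD)"},
)
```

In practice the agent runtime's MCP client produces these messages; the point is that a grounding read is one small structured request rather than a transcript scan.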
The redesign is therefore three operational moves, not new write code:
- Write: consolidate the emit into one sound, extensible MCP (MetadataChangeProposal) emitter and run it on a schedule so the catalog stays current (datasets, schema, lineage: sanmopia legacy → central Supabase schema → project DBs).
- Read: deploy the official DataHub MCP server and add it to the agent tool inventory so agents query catalog/lineage/schema/SQL in natural language.
- Stewards: operationalize the existing draft agents that own this plane, namely `scribe` (delivery leaves explicit lineage, no hidden joins), `cataloger` (compact DataHub briefs replacing transcript scanning), `ontos` (bind delivery to ontology + source-health + warehouse semantics), and `curator` (lineage + memory → Agent Wiki).
The creative core is transcript-scan → structured-truth-query: agents read a compact, grounded context graph instead of re-reading raw history — large token and reasoning savings, plus grounding and self-improvement (Ouroboros decisions from the graph).
Central Supabase data plane
A single self-hosted Supabase (Kong gateway, GoTrue, PostgREST, postgres-meta,
Studio) was deployed on the existing CNPG-backed Postgres, GitOps-managed via Argo
(Helm chart + ExternalSecret from OpenBao), with an idempotent roles-bootstrap
Job. It serves both the FractalOps platform and projects as a schema-per-project
data plane, removing per-project database sprawl.
- A `supabase` connector + `SupabasePostgresExecutor` provision a per-project schema plus the standard `anon`/`authenticated`/`service_role` grants on the central CNPG, verified live against the real cluster.
- Project guidance routes databases to the central plane (never a per-project DB stack); Dokploy keeps persistent backing services (databases, static-site / vercel-sim hosting, big-facility compose) only. Dev previews run as bare processes in the sandbox, not on Dokploy — see Dev Preview Plane.
- Agents receive the central connection as a secure asset: the address via the run’s `dataPlaneConnection`, and the `anon`/`service_role` keys via an OpenBao lease ref resolved at runtime. Key values never land in run metadata, prompts, or the workspace on disk.
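As a rough picture of schema-per-project provisioning, the sketch below generates idempotent PostgreSQL statements for one project schema and the three Supabase roles. It is an assumption about the kind of SQL involved, not the `SupabasePostgresExecutor` implementation. The function name and the exact grant set are illustrative, modeled on the grants Supabase documents for custom schemas.

```python
import re

ROLES = ("anon", "authenticated", "service_role")

# Conservative identifier check so a project name can never inject SQL.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]{0,62}$")


def provision_schema_sql(schema: str) -> list[str]:
    """Return statements that create a project schema and grant the roles.

    Every statement is safe to re-run: CREATE uses IF NOT EXISTS, and
    repeating a GRANT or ALTER DEFAULT PRIVILEGES is a no-op in PostgreSQL.
    """
    if not _IDENT.match(schema):
        raise ValueError(f"unsafe schema name: {schema!r}")
    roles = ", ".join(ROLES)
    return [
        f'CREATE SCHEMA IF NOT EXISTS "{schema}";',
        f'GRANT USAGE ON SCHEMA "{schema}" TO {roles};',
        f'GRANT ALL ON ALL TABLES IN SCHEMA "{schema}" TO {roles};',
        f'GRANT ALL ON ALL SEQUENCES IN SCHEMA "{schema}" TO {roles};',
        f'ALTER DEFAULT PRIVILEGES IN SCHEMA "{schema}" GRANT ALL ON TABLES TO {roles};',
        f'ALTER DEFAULT PRIVILEGES IN SCHEMA "{schema}" GRANT ALL ON SEQUENCES TO {roles};',
    ]
```

Because the statements are idempotent, the same list can run on every project bootstrap without first checking whether the schema exists. Grants this broad are normally paired with row-level security policies on the tables.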
AgentSquad runtime fixes
| Area | Fix |
|---|---|
| Anti-invention | New backend domain files that hardcode a business-rule constant (fee/rate/commission/…) as a non-trivial decimal with no source citation are walled ungrounded_business_rule — a general provenance rule replacing the reactive token blocklist. Verified against a real invented PR. |
| Stable session identity (CQRS) | A re-run minted a fresh Claude session id, so it could not resume and its slot key changed → slot-exhaustion → the agent hung workspace-less at waiting_first_tool. Rebuilt the dormant stable-id helper as a DDD/CQRS slice (domain stable id + command stabilize + query resolve), wired into session seeding so every launch lands on a deterministic-per-(run, agent) id → resumes its session and reuses its slot. |
| Slot capacity | The shared project workspace capped concurrent agent slots at 3; the 4th+ agent in the full roster hung. Settings-driven cap, default 8, so the roster fits (idle slots reclaimed by hibernation). |
| Hibernation TTL | Agent sandboxes were created auto_stop=0 (never stop). Settings-backed defaults: auto-stop 15 min (hibernate idle, resume on demand), auto-archive 12 h, no destructive auto-delete. |
| Resume self-heal | resume treated a torn-down (cleaned) session as resumable and stranded the agent in cleaned; cleaned/cleanup_run sessions are now non-resumable so resume reseeds them fresh while healthy sessions keep continuity. |
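The stable-identity and resume rows can be illustrated with a small sketch. It assumes the deterministic id is a name-based UUID over (run, agent). The actual helper may derive it differently, and the `agentsquad://` name format, the `SessionIds` class, and `can_resume` are hypothetical names.

```python
import uuid
from typing import Optional

# Session states the fix treats as torn down and therefore not resumable.
NON_RESUMABLE = frozenset({"cleaned", "cleanup_run"})


def stable_session_id(run_id: str, agent: str) -> str:
    """Domain rule: the same (run, agent) pair always maps to the same id."""
    # uuid5 is a deterministic hash of (namespace, name), so a re-run
    # lands on the id it used before instead of minting a fresh one.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"agentsquad://{run_id}/{agent}"))


class SessionIds:
    """Tiny in-memory stand-in for the command/query pair."""

    def __init__(self) -> None:
        self._ids: dict[tuple[str, str], str] = {}

    def stabilize(self, run_id: str, agent: str) -> None:
        """Command: record the deterministic id for this launch."""
        self._ids[(run_id, agent)] = stable_session_id(run_id, agent)

    def resolve(self, run_id: str, agent: str) -> Optional[str]:
        """Query: return the recorded id, or None if never stabilized."""
        return self._ids.get((run_id, agent))


def can_resume(status: str) -> bool:
    """Healthy sessions resume; torn-down ones must be reseeded fresh."""
    return status not in NON_RESUMABLE
```

Since the slot key follows the session id, a deterministic id also means a re-run reuses its existing slot rather than claiming a new one, which is what removes the slot-exhaustion hang.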
Infrastructure resilience & optimization
- Broken memory-HPA. `fractalops-api` and `fractalops-worker` autoscaled on a memory target below each pod’s flat Python import baseline, so the HPA pinned both at `maxReplicas` forever while CPU idled. Dropped memory-based autoscaling (CPU-only for flat-baseline stateless services), right-sized requests; api 4→2, worker drains to 2, relieving the node CPU-request saturation behind the recurring pod-scheduling pressure.
- VM recovery. A k3s control/etcd VM stayed dead because (a) it was stopped with no auto-restart, (b) its kubelet cert was future-dated from a transient clock skew so k3s 401’d on restart, and (c) NTP wasn’t syncing. Fixed by HA-managing the VM (auto-restart), rotating the node certs at the correct time, and restoring the shared-pool NTP source so the clock, and therefore the certs, stay valid.
- Preview/test envs → on-demand. The `fractalops-preview-a/b/c` and `fractalops-test` isolated Portal/API validation envs ran 24/7 (~3.5 GiB) on stale images. Made them on-demand (replicas 0, HPAs removed), reclaiming the RAM while keeping the GitOps definitions for one-command revival.
- Argo drift. ESO defaults injected into `ExternalSecret` `remoteRef` fields kept the central Supabase app perpetually `OutOfSync`; added `ignoreDifferences` so it converges to `Synced`.
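The memory-HPA failure follows from the autoscaler's documented scaling rule, `desired = ceil(current * currentMetric / targetMetric)`. The sketch below applies that rule repeatedly. The utilization and target numbers are illustrative assumptions rather than measured values, and stabilization windows are ignored.

```python
import math


def desired_replicas(current: int, utilization: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """One HPA evaluation: scale by the metric ratio, clamped to the bounds."""
    ratio = utilization / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # within tolerance: no scaling action
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


def settle(start: int, utilization: float, target: float,
           min_replicas: int, max_replicas: int, rounds: int = 10) -> int:
    """Re-evaluate with a per-pod utilization that never changes.

    That models a flat baseline: adding pods does not lower the memory
    each pod uses, so the ratio stays the same after every scale-up.
    """
    replicas = start
    for _ in range(rounds):
        replicas = desired_replicas(replicas, utilization, target,
                                    min_replicas, max_replicas)
    return replicas


# Illustrative numbers: memory baseline at 90% of request vs. a 70% target.
memory_driven = settle(2, 0.90, 0.70, min_replicas=2, max_replicas=4)
# Same service scaled on an idle CPU metric instead (10% vs. a 70% target).
cpu_driven = settle(4, 0.10, 0.70, min_replicas=2, max_replicas=4)
```

With the memory metric the ratio stays above 1 no matter how many pods exist, so the replica count climbs to the maximum and stays there. With the idle CPU metric the same rule drains back to the minimum.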
Provenance
Shipped as a sequence of small, reviewed GitOps/backend pull requests (numbers
#1519–#1530 in yamonco/fractalops), each with co-located tests and a single
focused change, merged after the standard admin gate.