K3s Node Hygiene

FractalOps K3s nodes must not depend on manual container runtime cleanup. Apply the managed K3s drop-in with:

platform/k8s/reconcile_k3s_node_gc.sh

Set RESTART=true only during a maintenance window:

RESTART=true platform/k8s/reconcile_k3s_node_gc.sh

The drop-in configures kubelet image garbage collection, disk eviction thresholds, and container log rotation:

kubelet-arg:
  - "image-gc-high-threshold=85"
  - "image-gc-low-threshold=80"
  - "eviction-hard=nodefs.available<10%,imagefs.available<10%"
  - "eviction-minimum-reclaim=nodefs.available=5Gi,imagefs.available=5Gi"
  - "container-log-max-size=20Mi"
  - "container-log-max-files=3"
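
As a rough illustration of how these thresholds interact, the sketch below maps a disk usage percentage to the kubelet behaviour the drop-in implies. It assumes a single filesystem serving as both nodefs and imagefs, and it approximates the eviction signal (available below 10%) as usage above 90%. The function names classify_disk_pressure and disk_used_pct are hypothetical helpers, not part of the managed script.

```shell
# classify_disk_pressure USED_PCT
# Prints the kubelet action implied by the drop-in thresholds at a given
# integer disk usage percentage. Approximation only: assumes nodefs and
# imagefs are the same filesystem.
classify_disk_pressure() {
  used="$1"
  if [ "$used" -gt 90 ]; then
    # available < 10% corresponds to the eviction-hard signal
    echo "hard-eviction"
  elif [ "$used" -ge 85 ]; then
    # image-gc-high-threshold=85: image GC runs, aiming for 80% usage
    echo "image-gc"
  else
    echo "ok"
  fi
}

# disk_used_pct PATH
# Prints the integer usage percentage of the filesystem that holds PATH,
# taken from the Capacity column of POSIX-format df output.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Example (the path is an assumption; use the node's actual data directory):
#   classify_disk_pressure "$(disk_used_pct /var/lib/rancher)"
```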

Do not add terminated-pod-gc-threshold to kubelet-arg. It is a kube-controller-manager flag, not a kubelet flag; kubelet rejects it at startup and the node will fail readiness.
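
A guard like the following could catch the flag before a restart. It is a minimal sketch that assumes the drop-in lives under /etc/rancher/k3s/config.yaml.d, the default K3s drop-in directory; the file name the reconcile script actually writes is not shown here, so the directory is a parameter. The function name is hypothetical.

```shell
# check_no_terminated_pod_gc [DIR]
# Returns 0 when no file under DIR mentions terminated-pod-gc-threshold,
# otherwise lists the offending files on stderr and returns 1.
# Assumption: DIR defaults to the standard K3s drop-in directory.
check_no_terminated_pod_gc() {
  dir="${1:-/etc/rancher/k3s/config.yaml.d}"
  # -r recurse, -l print file names only, -s stay quiet about unreadable paths
  matches="$(grep -rls -- 'terminated-pod-gc-threshold' "$dir" || true)"
  if [ -n "$matches" ]; then
    printf 'forbidden flag found in:\n%s\n' "$matches" >&2
    return 1
  fi
  return 0
}

# Example:
#   check_no_terminated_pod_gc && RESTART=true platform/k8s/reconcile_k3s_node_gc.sh
```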

Routine validation:

ssh root@10.10.10.41 'kubectl get --raw="/readyz?verbose"'
ssh root@10.10.10.41 'kubectl get nodes -o wide'
ssh root@10.10.10.41 'kubectl get cronjobs -A'
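
For scripted checks, the node listing can be reduced to the nodes that need attention. This is a sketch only: not_ready_nodes is a hypothetical helper that reads kubectl get nodes --no-headers output on stdin. It also flags cordoned nodes, whose status reads Ready,SchedulingDisabled.

```shell
# not_ready_nodes
# Reads `kubectl get nodes --no-headers` rows on stdin and prints the name
# of every node whose STATUS column is anything other than exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Example:
#   ssh root@10.10.10.41 'kubectl get nodes --no-headers' | not_ready_nodes
```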

Daytona runner nodes are intentionally Docker-backed. On Debian-packaged Docker, socket activation can leave /run/docker.sock present while docker info cannot connect. Reconcile runner nodes with:

platform/k8s/reconcile_daytona_runner_docker.sh

The script selects the node labeled daytona-sandbox-c=true, disables docker.socket, and makes docker.service own unix:///run/docker.sock directly. This keeps the Daytona runner chart light: node runtime preparation stays on the node, not in a Kubernetes mutation shim.
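
To confirm whether a runner node is in the stale-socket state before or after reconciling, a diagnostic along these lines may help. It is a sketch, not the logic of the reconcile script; docker_socket_state is a hypothetical helper and the default socket path is taken from the description above.

```shell
# docker_socket_state [SOCKET]
# Prints one of:
#   missing       the socket path does not exist
#   ok            the daemon answers `docker info` on the socket
#   stale-socket  the path exists but the daemon does not answer
docker_socket_state() {
  sock="${1:-/run/docker.sock}"
  if [ ! -e "$sock" ]; then
    echo "missing"
  elif docker -H "unix://$sock" info >/dev/null 2>&1; then
    echo "ok"
  else
    echo "stale-socket"
  fi
}

# Example, run on the runner node:
#   docker_socket_state
#   systemctl is-enabled docker.socket   # expected to report disabled after reconciliation
```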

Keep the Daytona runner region aligned with the live runner registration. The K3s runner currently registers as fractalops-k3s. Do not reintroduce the old dev-workspaces_v7FU region, or the scheduler will report "No available runners" even when the runner pod is healthy.

Keep Kubernetes spec changes in the Daytona Helm values or chart version. Do not use kubectl set env or ad hoc DaemonSet JSON patches as a permanent reconciliation path.

For finished workload buildup, prefer native retention settings:

CronJob.spec.successfulJobsHistoryLimit
CronJob.spec.failedJobsHistoryLimit
Job.spec.ttlSecondsAfterFinished when the chart exposes it
Deployment.spec.revisionHistoryLimit for high-churn Deployments
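
To find where these settings are still missing, one option is to list Jobs that have no TTL set. The sketch below relies on kubectl printing <none> in custom-columns output for an unset field; jobs_without_ttl is a hypothetical helper that filters such rows from stdin.

```shell
# jobs_without_ttl
# Reads rows of "NAMESPACE NAME TTL" on stdin and prints "namespace/name"
# for every Job whose TTL column is "<none>" (ttlSecondsAfterFinished unset).
jobs_without_ttl() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Example:
#   kubectl get jobs -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,TTL:.spec.ttlSecondsAfterFinished' \
#     | jobs_without_ttl
```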

Use direct deletion only for already-stale objects after checking owners:

kubectl delete pod -A --field-selector=status.phase=Succeeded --wait=false
kubectl delete pod -A --field-selector=status.phase=Failed --wait=false
kubectl get rs -A -o jsonpath='{range .items[?(@.spec.replicas==0)]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' \
  | while read -r ns name; do kubectl -n "$ns" delete rs "$name" --wait=false; done
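
Before running the deletions above, the owner check can be made concrete by counting finished pods per owner kind. This is a sketch under the assumption that the first ownerReference is the relevant one; pods_by_owner_kind is a hypothetical helper. Pods reported under <none> have no controller and will not be recreated or cleaned up by one.

```shell
# pods_by_owner_kind
# Reads rows of "NAMESPACE NAME OWNER_KIND" on stdin and prints one line per
# owner kind with the number of pods it owns, sorted bytewise by kind.
pods_by_owner_kind() {
  awk '{ n[$3]++ } END { for (k in n) print k, n[k] }' | LC_ALL=C sort
}

# Example (repeat with status.phase=Failed):
#   kubectl get pods -A --no-headers --field-selector=status.phase=Succeeded \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' \
#     | pods_by_owner_kind
```

Adding --dry-run=client to the kubectl delete commands prints what would be removed without deleting anything.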