K3s Node Hygiene

FractalOps K3s nodes must not depend on manual container runtime cleanup. Apply the managed K3s drop-in with:

platform/k8s/reconcile_k3s_node_gc.sh

Set RESTART=true only during a maintenance window:

RESTART=true platform/k8s/reconcile_k3s_node_gc.sh

The drop-in configures kubelet image garbage collection, disk eviction thresholds, and container log rotation:

kubelet-arg:
- "image-gc-high-threshold=85"
- "image-gc-low-threshold=80"
- "eviction-hard=nodefs.available<10%,imagefs.available<10%"
- "eviction-minimum-reclaim=nodefs.available=5Gi,imagefs.available=5Gi"
- "container-log-max-size=20Mi"
- "container-log-max-files=3"

Do not add terminated-pod-gc-threshold to kubelet-arg. It is a kube-controller-manager option, not a kubelet flag, so the kubelet on this K3s/Kubernetes version rejects it and the node fails readiness.
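
The constraints above can be checked before K3s is restarted. The following is a minimal sketch, not part of the managed script: check_kubelet_args is a hypothetical helper, and the file path passed to it is an assumption (K3s reads drop-ins from /etc/rancher/k3s/config.yaml.d/, but the file name the reconcile script writes is not stated here). It flags the rejected flag and confirms that the low image GC threshold sits below the high one, which the kubelet requires.

```shell
#!/bin/sh
# Sketch: lint a K3s kubelet-arg drop-in before restarting K3s.
# Prints one line per problem and returns non-zero if any were found.
# check_kubelet_args is a hypothetical helper, not part of the repo scripts.
check_kubelet_args() {
  file="$1"
  rc=0

  # terminated-pod-gc-threshold is not accepted as a kubelet argument.
  if grep -q 'terminated-pod-gc-threshold' "$file"; then
    echo "forbidden: terminated-pod-gc-threshold"
    rc=1
  fi

  # Extract the integer after the "=" for each image GC threshold.
  high=$(sed -n 's/.*image-gc-high-threshold=\([0-9][0-9]*\).*/\1/p' "$file" | head -n 1)
  low=$(sed -n 's/.*image-gc-low-threshold=\([0-9][0-9]*\).*/\1/p' "$file" | head -n 1)

  if [ -z "$high" ] || [ -z "$low" ]; then
    echo "missing: image-gc-high-threshold or image-gc-low-threshold"
    rc=1
  elif [ "$low" -ge "$high" ]; then
    echo "invalid: image-gc-low-threshold ($low) must be below image-gc-high-threshold ($high)"
    rc=1
  fi

  return "$rc"
}
```

On a node this might be called as check_kubelet_args /etc/rancher/k3s/config.yaml.d/<drop-in>.yaml, substituting the real file name.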

Routine validation:

ssh root@10.10.10.41 "kubectl get --raw='/readyz?verbose'"
ssh root@10.10.10.41 'kubectl get nodes -o wide'
ssh root@10.10.10.41 'kubectl get cronjobs -A'
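
For unattended checks, the node listing can be reduced to a pass/fail signal. This is a minimal sketch under assumptions: not_ready_nodes is a hypothetical helper, and it relies on the default column layout of kubectl get nodes --no-headers, where the second column is STATUS. A cordoned node ("Ready,SchedulingDisabled") is treated as ready.

```shell
#!/bin/sh
# Sketch: read `kubectl get nodes --no-headers` output on stdin and print
# the name of every node whose STATUS does not start with "Ready".
# not_ready_nodes is a hypothetical helper name.
not_ready_nodes() {
  awk '{ split($2, s, ","); if (s[1] != "Ready") print $1 }'
}
```

Example use: ssh root@10.10.10.41 'kubectl get nodes --no-headers' | not_ready_nodes. Empty output means every node reports Ready.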

Daytona runner nodes are intentionally Docker-backed. On Debian-packaged Docker, socket activation can leave /run/docker.sock present while docker info cannot connect. Reconcile runner nodes with:

platform/k8s/reconcile_daytona_runner_docker.sh

The script selects the daytona-sandbox-c=true node, disables docker.socket, and makes docker.service own unix:///run/docker.sock directly. This keeps the Daytona runner chart light: node runtime preparation stays on the node, not in a Kubernetes mutation shim.
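
The socket-activation symptom described above can be detected before deciding whether a node needs reconciling. This is a diagnostic sketch only, not what the reconcile script does internally: docker_socket_state is a hypothetical helper that takes the socket path and a probe command as arguments.

```shell
#!/bin/sh
# Sketch: classify the Docker endpoint on a runner node.
#   $1    path to the Docker socket
#   $2..  command that probes the daemon (normally: docker info)
# docker_socket_state is a hypothetical helper name.
docker_socket_state() {
  sock="$1"
  shift
  if [ ! -e "$sock" ]; then
    echo "missing"
  elif "$@" >/dev/null 2>&1; then
    echo "healthy"
  else
    # The socket path exists but the probe fails: the symptom left
    # behind by socket activation.
    echo "stale"
  fi
}
```

On a runner node this might be run as docker_socket_state /run/docker.sock docker info. A result of "stale" suggests the node needs the reconcile script.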

Keep the Daytona runner region aligned with the live runner registration. The K3s runner currently registers as fractalops-k3s. Do not reintroduce the old dev-workspaces_v7FU region, or the scheduler will report "No available runners" even when the runner pod is healthy.

Keep Kubernetes spec changes in the Daytona Helm values or chart version. Do not use kubectl set env or ad hoc DaemonSet JSON patches as a permanent reconciliation path.

For finished workload buildup, prefer native retention settings:

  • CronJob.spec.successfulJobsHistoryLimit
  • CronJob.spec.failedJobsHistoryLimit
  • Job.spec.ttlSecondsAfterFinished when the chart exposes it
  • Deployment.spec.revisionHistoryLimit for high-churn Deployments
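
To see which CronJobs already carry these settings, a read-only audit can help. This is a minimal sketch: both function names are hypothetical, and it assumes kubectl prints <none> for an unset field in custom-columns output. The history limits are defaulted by the API server, so in practice the TTL column is the one most likely to be unset.

```shell
#!/bin/sh
# Sketch: list CronJobs with their retention fields, one per line.
# Columns: namespace, name, successful limit, failed limit, job TTL.
# list_cronjob_retention is a hypothetical helper name.
list_cronjob_retention() {
  kubectl get cronjobs -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OK:.spec.successfulJobsHistoryLimit,FAILED:.spec.failedJobsHistoryLimit,TTL:.spec.jobTemplate.spec.ttlSecondsAfterFinished'
}

# Read that listing on stdin and print namespace/name for every CronJob
# whose Job template sets no ttlSecondsAfterFinished.
cronjobs_without_ttl() {
  awk '$5 == "<none>" { print $1 "/" $2 }'
}
```

Example use: list_cronjob_retention | cronjobs_without_ttl. Anything listed is a candidate for a TTL in its chart values, where the chart exposes one.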

Use direct deletion only for already-stale objects after checking owners:

kubectl delete pod -A --field-selector=status.phase=Succeeded --wait=false
kubectl delete pod -A --field-selector=status.phase=Failed --wait=false
kubectl get rs -A -o jsonpath='{range .items[?(@.spec.replicas==0)]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' \
| while read -r ns name; do kubectl -n "$ns" delete rs "$name" --wait=false; done
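
The owner check that should precede those deletions can be sketched as a summary of finished pods grouped by owner kind. The helper names are hypothetical, and the sketch assumes the first ownerReferences entry is the controlling owner. Pods listed under <none> have no owner and will never be cleaned up by a controller.

```shell
#!/bin/sh
# Sketch: list finished pods in a given phase with their owner kind.
#   $1  pod phase, e.g. Succeeded or Failed
# finished_pods_with_owner is a hypothetical helper name.
finished_pods_with_owner() {
  kubectl get pods -A --no-headers \
    --field-selector="status.phase=$1" \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'
}

# Read that listing on stdin and print "<owner kind> <count>" per kind,
# sorted bytewise so the order is stable across locales.
count_by_owner() {
  awk '{ n[$3]++ } END { for (k in n) print k, n[k] }' | LC_ALL=C sort
}
```

Example use: finished_pods_with_owner Succeeded | count_by_owner. A large count under Job usually points at a missing history limit or TTL rather than a need for manual deletion.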