🟩 2. CHEATSHEET.md (Operator Survival Guide) # 🧠 Fester Operator Cheatsheet Distributed Build + Scheduler + Observability System --- # 🚀 STARTING A BUILD ```bash fester build ./project.yaml 🧩 VIEW LIVE SYSTEM Cockpit Module: Fester Live DAG: /ui/live_dag.html Replay View: /ui/replay.html 📡 KEY CONCEPTS Nodes Machines participating in builds (physical, VM, or container) Targets Compilation environments: x86_64-linux-gnu aarch64-linux-gnu riscv64 embedded/toolchain targets Actions Graph nodes representing build steps Scheduler Chooses best node based on: load temperature policy rules historical performance 🧠 DEBUGGING Failure Autopsy GET /api/autopsy/ Timeline Replay GET /api/replay/ 📦 CACHE SYSTEM Supports: ccache MinIO distributed cache Btrfs snapshot cache tmpfs acceleration layer 🧊 SNAPSHOTS Freeze state: qcow2 image snapshots Btrfs copy-on-write states ⚙️ NODE CONTROL fester node list fester node set-policy preferred fester node set-policy avoid 🔥 SCHEDULER MODES unified (default) weighted thermal-aware cache-first target-isolated experimental intelligence mode ⚠️ SAFETY NOTES Do NOT run unrestricted builds on production nodes Monitor thermal load in Grafana Ensure cache integrity for distributed builds Validate cross-compile toolchains before enabling targets 🧠 DESIGN PRINCIPLE "Every build is reproducible, explainable, and replayable." --- # 🟩 3. `PRODUCTION_HARDENING.md` ```markdown # 🛡️ Fester Production Hardening Guide --- # 🌡️ 1. THERMAL SAFETY ## Required - Enable node temperature monitoring - Set max thermal threshold per node ## Recommended - throttle scheduler above 85°C - disable high-parallel builds under sustained load --- # 🧠 2. SCHEDULER SAFETY Avoid: - unrestricted global scheduling in mixed architectures - running full cluster on single target policy Enable: - weighted scheduler - per-node policy constraints - cache-aware routing --- # 📦 3. CACHE INTEGRITY Use: - checksum validation for artifacts - MinIO redundancy if cluster > 3 nodes - avoid mixing tmpfs + persistent cache without sync barriers --- # 🧊 4. SNAPSHOT SAFETY If using: - qcow2 snapshots - Btrfs CoW layers Ensure: - periodic snapshot compaction - rollback testing before production builds --- # 🧩 5. CROSS-COMPILATION RISKS - toolchain mismatch = silent binary corruption risk - always validate ELF output per target arch - isolate toolchains per execution environment --- # 🌐 6. DISTRIBUTED EXECUTION - ensure clock sync (NTP mandatory) - avoid partial node visibility during scheduling - handle node dropouts as first-class events --- # 📊 7. OBSERVABILITY STACK Recommended: - Prometheus → metrics ingestion - Grafana → visualization - Cockpit → control plane - Fester UI → execution DAG + causal graph --- # 🧠 8. GOLDEN RULE > Never trust a build you cannot replay. 🧭 WHAT YOU NOW HAVE (SYSTEM MATURITY STATE) You’ve effectively reached: 🧠 Level 1 — Distributed Compiler System (distcc-like foundation) 🧠 Level 2 — Smart Scheduler (load/thermal/cache aware) 🧠 Level 3 — Observability System (metrics + Grafana + cockpit) 🧠 Level 4 — Causal Execution Graph (why decisions happen) 🧠 Level 5 — Replayable Build Brain (session + timeline + autopsy)