188 lines
3.3 KiB
Plaintext
188 lines
3.3 KiB
Plaintext
🟩 2. CHEATSHEET.md (Operator Survival Guide)
|
||
# 🧠 Fester Operator Cheatsheet
|
||
|
||
Distributed Build + Scheduler + Observability System
|
||
|
||
---
|
||
|
||
# 🚀 STARTING A BUILD
|
||
|
||
```bash
|
||
fester build ./project.yaml
|
||
🧩 VIEW LIVE SYSTEM
|
||
Cockpit Module: Fester
|
||
Live DAG: /ui/live_dag.html
|
||
Replay View: /ui/replay.html
|
||
📡 KEY CONCEPTS
|
||
Nodes
|
||
|
||
Machines participating in builds (physical, VM, or container)
|
||
|
||
Targets
|
||
|
||
Compilation environments:
|
||
|
||
x86_64-linux-gnu
|
||
aarch64-linux-gnu
|
||
riscv64
|
||
embedded/toolchain targets
|
||
Actions
|
||
|
||
Graph nodes representing build steps
|
||
|
||
Scheduler
|
||
|
||
Chooses best node based on:
|
||
|
||
load
|
||
temperature
|
||
policy rules
|
||
historical performance
|
||
🧠 DEBUGGING
|
||
Failure Autopsy
|
||
GET /api/autopsy/<build_id>
|
||
Timeline Replay
|
||
GET /api/replay/<session_id>
|
||
📦 CACHE SYSTEM
|
||
|
||
Supports:
|
||
|
||
ccache
|
||
MinIO distributed cache
|
||
Btrfs snapshot cache
|
||
tmpfs acceleration layer
|
||
🧊 SNAPSHOTS
|
||
|
||
Freeze state:
|
||
|
||
qcow2 image snapshots
|
||
Btrfs copy-on-write states
|
||
⚙️ NODE CONTROL
|
||
fester node list
|
||
fester node set-policy <node> preferred
|
||
fester node set-policy <node> avoid
|
||
🔥 SCHEDULER MODES
|
||
unified (default)
|
||
weighted thermal-aware
|
||
cache-first
|
||
target-isolated
|
||
experimental intelligence mode
|
||
⚠️ SAFETY NOTES
|
||
Do NOT run unrestricted builds on production nodes
|
||
Monitor thermal load in Grafana
|
||
Ensure cache integrity for distributed builds
|
||
Validate cross-compile toolchains before enabling targets
|
||
🧠 DESIGN PRINCIPLE
|
||
|
||
"Every build is reproducible, explainable, and replayable."
|
||
|
||
|
||
---
|
||
|
||
# 🟩 3. `PRODUCTION_HARDENING.md`
|
||
|
||
```markdown
|
||
# 🛡️ Fester Production Hardening Guide
|
||
|
||
---
|
||
|
||
# 🌡️ 1. THERMAL SAFETY
|
||
|
||
## Required
|
||
- Enable node temperature monitoring
|
||
- Set max thermal threshold per node
|
||
|
||
## Recommended
|
||
- throttle scheduler above 85°C
|
||
- disable high-parallel builds under sustained load
|
||
|
||
---
|
||
|
||
# 🧠 2. SCHEDULER SAFETY
|
||
|
||
Avoid:
|
||
- unrestricted global scheduling in mixed architectures
|
||
- running full cluster on single target policy
|
||
|
||
Enable:
|
||
- weighted scheduler
|
||
- per-node policy constraints
|
||
- cache-aware routing
|
||
|
||
---
|
||
|
||
# 📦 3. CACHE INTEGRITY
|
||
|
||
Use:
|
||
- checksum validation for artifacts
|
||
- MinIO redundancy if cluster > 3 nodes
|
||
- avoid mixing tmpfs + persistent cache without sync barriers
|
||
|
||
---
|
||
|
||
# 🧊 4. SNAPSHOT SAFETY
|
||
|
||
If using:
|
||
- qcow2 snapshots
|
||
- Btrfs CoW layers
|
||
|
||
Ensure:
|
||
- periodic snapshot compaction
|
||
- rollback testing before production builds
|
||
|
||
---
|
||
|
||
# 🧩 5. CROSS-COMPILATION RISKS
|
||
|
||
- toolchain mismatch = silent binary corruption risk
|
||
- always validate ELF output per target arch
|
||
- isolate toolchains per execution environment
|
||
|
||
---
|
||
|
||
# 🌐 6. DISTRIBUTED EXECUTION
|
||
|
||
- ensure clock sync (NTP mandatory)
|
||
- avoid partial node visibility during scheduling
|
||
- handle node dropouts as first-class events
|
||
|
||
---
|
||
|
||
# 📊 7. OBSERVABILITY STACK
|
||
|
||
Recommended:
|
||
|
||
- Prometheus → metrics ingestion
|
||
- Grafana → visualization
|
||
- Cockpit → control plane
|
||
- Fester UI → execution DAG + causal graph
|
||
|
||
---
|
||
|
||
# 🧠 8. GOLDEN RULE
|
||
|
||
> Never trust a build you cannot replay.
|
||
🧭 WHAT YOU NOW HAVE (SYSTEM MATURITY STATE)
|
||
|
||
You’ve effectively reached:
|
||
|
||
🧠 Level 1 — Distributed Compiler System
|
||
|
||
(distcc-like foundation)
|
||
|
||
🧠 Level 2 — Smart Scheduler
|
||
|
||
(load/thermal/cache aware)
|
||
|
||
🧠 Level 3 — Observability System
|
||
|
||
(metrics + Grafana + cockpit)
|
||
|
||
🧠 Level 4 — Causal Execution Graph
|
||
|
||
(why decisions happen)
|
||
|
||
🧠 Level 5 — Replayable Build Brain
|
||
|
||
(session + timeline + autopsy)
|