fester/NOTES.text

188 lines
3.3 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

🟩 2. CHEATSHEET.md (Operator Survival Guide)
# 🧠 Fester Operator Cheatsheet
Distributed Build + Scheduler + Observability System
---
# 🚀 STARTING A BUILD
```bash
fester build ./project.yaml
🧩 VIEW LIVE SYSTEM
Cockpit Module: Fester
Live DAG: /ui/live_dag.html
Replay View: /ui/replay.html
📡 KEY CONCEPTS
Nodes
Machines participating in builds (physical, VM, or container)
Targets
Compilation environments:
x86_64-linux-gnu
aarch64-linux-gnu
riscv64
embedded/toolchain targets
Actions
Graph nodes representing build steps
Scheduler
Chooses best node based on:
load
temperature
policy rules
historical performance
🧠 DEBUGGING
Failure Autopsy
GET /api/autopsy/<build_id>
Timeline Replay
GET /api/replay/<session_id>
📦 CACHE SYSTEM
Supports:
ccache
MinIO distributed cache
Btrfs snapshot cache
tmpfs acceleration layer
🧊 SNAPSHOTS
Freeze state:
qcow2 image snapshots
Btrfs copy-on-write states
⚙️ NODE CONTROL
fester node list
fester node set-policy <node> preferred
fester node set-policy <node> avoid
🔥 SCHEDULER MODES
unified (default)
weighted thermal-aware
cache-first
target-isolated
experimental intelligence mode
⚠️ SAFETY NOTES
Do NOT run unrestricted builds on production nodes
Monitor thermal load in Grafana
Ensure cache integrity for distributed builds
Validate cross-compile toolchains before enabling targets
🧠 DESIGN PRINCIPLE
"Every build is reproducible, explainable, and replayable."
---
# 🟩 3. `PRODUCTION_HARDENING.md`
```markdown
# 🛡️ Fester Production Hardening Guide
---
# 🌡️ 1. THERMAL SAFETY
## Required
- Enable node temperature monitoring
- Set max thermal threshold per node
## Recommended
- throttle scheduler above 85°C
- disable high-parallel builds under sustained load
---
# 🧠 2. SCHEDULER SAFETY
Avoid:
- unrestricted global scheduling in mixed architectures
- running full cluster on single target policy
Enable:
- weighted scheduler
- per-node policy constraints
- cache-aware routing
---
# 📦 3. CACHE INTEGRITY
Use:
- checksum validation for artifacts
- MinIO redundancy if cluster > 3 nodes
- avoid mixing tmpfs + persistent cache without sync barriers
---
# 🧊 4. SNAPSHOT SAFETY
If using:
- qcow2 snapshots
- Btrfs CoW layers
Ensure:
- periodic snapshot compaction
- rollback testing before production builds
---
# 🧩 5. CROSS-COMPILATION RISKS
- toolchain mismatch = silent binary corruption risk
- always validate ELF output per target arch
- isolate toolchains per execution environment
---
# 🌐 6. DISTRIBUTED EXECUTION
- ensure clock sync (NTP mandatory)
- avoid partial node visibility during scheduling
- handle node dropouts as first-class events
---
# 📊 7. OBSERVABILITY STACK
Recommended:
- Prometheus → metrics ingestion
- Grafana → visualization
- Cockpit → control plane
- Fester UI → execution DAG + causal graph
---
# 🧠 8. GOLDEN RULE
> Never trust a build you cannot replay.
🧭 WHAT YOU NOW HAVE (SYSTEM MATURITY STATE)
Youve effectively reached:
🧠 Level 1 — Distributed Compiler System
(distcc-like foundation)
🧠 Level 2 — Smart Scheduler
(load/thermal/cache aware)
🧠 Level 3 — Observability System
(metrics + Grafana + cockpit)
🧠 Level 4 — Causal Execution Graph
(why decisions happen)
🧠 Level 5 — Replayable Build Brain
(session + timeline + autopsy)