Orchestrator Monitoring: From Blind to Robust in 45 Minutes

When I built the orchestrator for the Brain system, I felt like someone who builds a car he loves, comes back the next day to find it won’t start, looks up, and realizes there is no dashboard.

Cron is a Single Point of Failure (but we knew that)

Classic dev setup:

# Local cron
0 * * * * /home/claude/brain/tools/agents/orchestrator.py auto-check

Plenty of things can go wrong:

Server goes down → Silence
Script breaks → Silence
An update messes things up → Silence
API token expired → It fails every hour. Guess how? Silently
Disk full → No logs get written, goodbye

The result is an “autonomous” system that nobody monitors. It works until it doesn’t, and then who knows.

This happened almost daily. Two concrete cases: an expired Gmail token (discovered by hand, 2025-11-07) and the daily cron failing 28 times with a 402 from OpenRouter (zero alerts, 2025-11-05). On paper the system is robust, because there is an orchestrator, a small program that acts as the boss and manages all the underlying features. But if it fails or something goes wrong, I hear nothing about it. There are no heartbeats either, meaning automatic calls to an external service that sends me a notification when it hasn’t heard from the system in a while.

Layered Monitoring

The fix is to stack several layers of monitoring, covering both the heartbeats and the cron job.

Layer 1: GitHub Actions (External Execution)

GitHub offers an excellent alternative: instead of local cron, use GitHub Actions. Why?

✅ External execution - Doesn’t depend on the server
✅ Email on failure - GitHub sends an email if the job fails
✅ Persistent logs - Visible in the GitHub UI, independent of the server
✅ Manual trigger - Test with 1 click
✅ Free tier - 2000 min/month free
✅ Status badge - README shows passing/failing

Workflow file (.github/workflows/autonomous-check.yml):

name: Autonomous Health Check

on:
  schedule:
    - cron: '0 * * * *'  # Every hour
  workflow_dispatch:  # Manual trigger

jobs:
  health-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Run orchestrator
        run: python3 tools/agents/orchestrator.py auto-check
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # Other secrets come from GitHub Secrets

      - name: Commit logs
        run: |
          git config user.name "Orchestrator Bot"
          git add log/
          git commit -m "Auto check $(date)" || true
          git push

      - name: Send heartbeat
        if: always()
        run: python3 tools/agents/heartbeat.py ping

Practical benefits:

Job fails → Automatic email
Logs in the Actions tab (even if the server dies)
Manual re-run with 1 click
Free for projects under 2000 min/month
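
The “job fails → email” part only works if the orchestrator actually exits with a non-zero code when something breaks, because that is what makes GitHub Actions mark a run step as failed. Here is a minimal sketch of that idea. The Step class, the critical flag and run_workflow are hypothetical names for illustration, not the real orchestrator code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One unit of work in a workflow (hypothetical structure)."""
    name: str
    run: Callable[[], None]  # expected to raise an exception on failure
    critical: bool = True    # assumption: only critical steps fail the job


def run_workflow(steps: List[Step]) -> int:
    """Run every step and return a process exit code (0 means success)."""
    exit_code = 0
    for step in steps:
        try:
            step.run()
            print(f"ok    {step.name}")
        except Exception as exc:
            print(f"FAIL  {step.name}: {exc}")
            if step.critical:
                # Remember the failure but keep going, so the log is complete
                exit_code = 1
    return exit_code
```

The script entry point would then end with sys.exit(run_workflow(steps)), so a failed critical step turns into a red job and an email.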

Layer 2: Healthchecks.io (Dead Man’s Switch)

Problem: How do I know GitHub Actions is actually running?

Solution: An external heartbeat monitor.

healthchecks.io expects a ping every hour. No ping = alert.

Setup:

Sign up (free tier: 20 checks)
Create the check “Brain Orchestrator”
Schedule: 1 hour, grace period: 15 min
Add the URL to .env

Heartbeat agent:

import requests

HEARTBEAT_URL = "https://hc-ping.com/YOUR-UUID"

# Success ping (the timeout keeps a network stall from hanging the job)
requests.get(HEARTBEAT_URL, timeout=10)

# Or report a failure
requests.get(f"{HEARTBEAT_URL}/fail", timeout=10)
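
The workflow calls heartbeat.py with a ping argument, so here is a rough sketch of what such a script could look like as a small CLI. It is not the actual file from the repo. I am assuming the URL lives in an environment variable called HEARTBEAT_URL and that a fail command maps to the /fail endpoint, and I use urllib from the standard library instead of requests so the sketch has no dependencies.

```python
import os
import sys
import urllib.request
from typing import List


def build_url(base_url: str, command: str) -> str:
    """Map a CLI command to the ping URL ('ping' = success, 'fail' = failure)."""
    base = base_url.rstrip("/")
    if command == "ping":
        return base
    if command == "fail":
        return f"{base}/fail"
    raise ValueError(f"unknown command: {command}")


def send(url: str, timeout: float = 10.0) -> bool:
    """Send the ping; return True on a 2xx response, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers bad URLs
        return False


def main(argv: List[str]) -> int:
    """Return an exit code: 0 = ping sent, 1 = ping failed, 2 = usage error."""
    if len(argv) != 2 or argv[1] not in ("ping", "fail"):
        print("usage: heartbeat.py ping|fail", file=sys.stderr)
        return 2
    base_url = os.environ.get("HEARTBEAT_URL")  # assumed variable name
    if not base_url:
        print("HEARTBEAT_URL is not set", file=sys.stderr)
        return 2
    return 0 if send(build_url(base_url, argv[1])) else 1
```

Run as a script, it would finish with sys.exit(main(sys.argv)).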

What happens:

GitHub Action runs → Sends ping → healthchecks.io is happy
GitHub Action doesn’t run for 1h 15min → Email alert
Dashboard shows last ping time, uptime %, downtime history
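
The 1h 15min figure is just the schedule plus the grace period. To make the dead man’s switch rule concrete, here is a tiny sketch of the timing logic. It only illustrates the rule as I understand it, it is not how healthchecks.io is implemented, and is_overdue is a name I made up.

```python
from datetime import datetime, timedelta


def is_overdue(
    last_ping: datetime,
    now: datetime,
    period: timedelta = timedelta(hours=1),
    grace: timedelta = timedelta(minutes=15),
) -> bool:
    """True when the expected ping is late by more than the grace period."""
    return now - last_ping > period + grace
```

With the defaults, a ping that is 74 minutes old is still fine and one that is 76 minutes old triggers the alert.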

Layer 3: Local Cron Fallback (Optional)

Keep the local cron as a backup, but gate it with a check:

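import time
from datetime import date
from pathlib import Path

def is_github_actions_working():
    """Check if GitHub Actions ran recently."""
    # Assumption: the log name starts with today's ISO date (YYYY-MM-DD)
    today = date.today().isoformat()
    log_file = Path(f"log/{today[:4]}/{today}-orchestrator-auto.md")

    if not log_file.exists():
        return False

    # Age of the last write, in hours
    age_hours = (time.time() - log_file.stat().st_mtime) / 3600
    return age_hours < 2  # Ran in last 2 hours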

Cron with fallback:

0 * * * * orchestrator.py auto-check --fallback

It runs only if GitHub Actions hasn’t run in the last 2 hours.
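
How the --fallback flag decides whether to run could look roughly like this. It is a sketch under assumptions: log_is_fresh, should_run and parse_args are hypothetical helpers, and the only things taken from above are the flag name and the 2 hour threshold.

```python
import argparse
import time
from pathlib import Path
from typing import List, Optional


def log_is_fresh(log_file: Path, max_age_hours: float = 2.0,
                 now: Optional[float] = None) -> bool:
    """True if the log exists and was written within max_age_hours."""
    if not log_file.exists():
        return False
    current = time.time() if now is None else now
    return (current - log_file.stat().st_mtime) / 3600 < max_age_hours


def should_run(fallback: bool, log_file: Path) -> bool:
    """In fallback mode, run only when the primary runner looks dead."""
    return not (fallback and log_is_fresh(log_file))


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse 'auto-check [--fallback]' style arguments."""
    parser = argparse.ArgumentParser(prog="orchestrator.py")
    parser.add_argument("command", choices=["auto-check"])
    parser.add_argument("--fallback", action="store_true",
                        help="skip the run if a recent log already exists")
    return parser.parse_args(argv)
```

One caveat: the Action pushes its logs to the repo, so the local machine only sees them if its checkout is kept up to date, for example with a git pull before the check. Without that, the fallback would probably run every hour anyway.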

Full Test Suite

Before deploying, I stress-tested the orchestrator:

Unit Tests: 13/13 passed

Agent class, registry, workflow execution
Error handling (critical/optional/non-critical)

Integration Tests: 10/10 passed

Real workflow execution
CLI commands
Concurrent execution
Log file creation

Stress Test: 10/10 success (100%)

Sequential execution (10x health workflow)
Avg: 0.39s, variance ±0.03s
Stable performance under load

Total: 33/33 tests passed in 6.22s
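
For reference, a sequential stress test like the one above can be measured with a few lines. This is a sketch, not the real test suite. The stress function is a made-up name, run_once stands in for whatever executes the health workflow, and I report the standard deviation as the spread, which may not be how the ±0.03s figure was computed.

```python
import statistics
import time
from typing import Callable, Dict


def stress(run_once: Callable[[], None], runs: int = 10) -> Dict[str, float]:
    """Run a callable sequentially and summarize success rate and timing."""
    durations = []
    failures = 0
    for _ in range(runs):
        start = time.perf_counter()
        try:
            run_once()
        except Exception:
            failures += 1  # count the failure but keep measuring
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "success_rate": (runs - failures) / runs,
        "mean_s": statistics.mean(durations),
        "stdev_s": statistics.stdev(durations) if runs > 1 else 0.0,
    }
```

Calling stress(lambda: run_health_workflow()) with a real workflow function would give the numbers to compare between releases.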

Costs

Service	Free Tier	Paid
GitHub Actions	2000 min/month	$0.008/min
healthchecks.io	20 checks	$5/month
Total	$0/month	~$5/month

Orchestrator runs: ~720/month (24/day × 30) = ~720 min

The free tier is enough.
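
The 720 minute estimate assumes every run bills as one minute. As far as I know GitHub rounds each job up to the next whole minute, so the real number depends on how long a run takes. A quick sketch to check the budget, using the figures from the table above (the function names are mine):

```python
import math


def monthly_minutes(runs_per_day: int = 24, days: int = 30,
                    seconds_per_run: float = 60.0) -> int:
    """Billed minutes per month, rounding each run up to a whole minute."""
    billed_per_run = max(1, math.ceil(seconds_per_run / 60))
    return runs_per_day * days * billed_per_run


def overage_cost(minutes: int, free_minutes: int = 2000,
                 price_per_min: float = 0.008) -> float:
    """Cost in dollars for the minutes beyond the free tier."""
    return max(0, minutes - free_minutes) * price_per_min
```

With hourly runs, one or two billed minutes per run stays under 2000 (720 or 1440 minutes). At three billed minutes per run the same schedule needs 2160 minutes and goes over the free tier.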

Implementation

Phase 1: Heartbeat (15 min)

Sign up for healthchecks.io
Create the check “Brain Orchestrator”
Send a manual test ping
Verify the alert fires

Phase 2: GitHub Actions (30 min)

Create the workflow file
Add the secrets to the GitHub repo
Manual trigger test
Verify the log is committed and the ping is sent

Phase 3: Disable Local Cron (5 min)

Verify 3-4 successful GitHub Actions runs
Disable the local cron
Monitor for 24h

After 7 days of stable monitoring → The system is robust AND external.

Lessons Learned

Don’t trust local cron for critical tasks:

Server dies → Cron dies
Script crashes → Total silence
Zero monitoring = you find problems too late

External execution + heartbeat = reliability:

GitHub Actions: External execution with built-in alerts
healthchecks.io: Dead man’s switch to detect outages
Layered monitoring: Redundancy instead of a single point of failure

Test before deploying:

33 tests passed before production
The stress test surfaces performance issues
Fallback plan (local cron) if GitHub is down

Cost-effective monitoring:

The free tier is enough for this use case
$0/month for complete monitoring
Alerts via email (no app to install)

Conclusion

The orchestrator is production-ready for execution, but blind without monitoring.

Solution: GitHub Actions (primary) + healthchecks.io (dead man’s switch) + local cron (fallback).

Setup time: 45 minutes total. Cost: $0/month. Reliability: From “works until it doesn’t” to “alerts when it stops”.

Worth it.

Code: github.com/giobi/brain
Docs: docs/monitoring-strategy.md