Monitoring 13 Autonomous Agents in Production: What to Log, Alert, and Ignore

June 15, 2024

I run 13 Claude agents on cron, fully unattended. Last week two of them failed. I found out within 2 minutes — not because I built a monitoring stack, but because I put five lines of JSONL into every task from day one.

Here's what that looks like and why I didn't build a dashboard.

Autonomous doesn't mean invisible

For the first three weeks, the system ran without any logging. Agents fired on cron, did something, finished. I'd check manually: open the folder, see new files, assume it worked.

The problem showed up quietly. One agent ran without errors and wrote files — but with different content than expected. I noticed nine days later, only because I happened to open an old run for comparison. No crash, no exception. Just lost cycles.

Silent degradation is worse than a hard error. It doesn't stop the process — it shifts it. The output arrives, but it's wrong.

After that incident I made two decisions. First: three statuses instead of two. Second: write a log entry at the very start of each run, before any business logic.

Log structure: one line per event

Each agent writes to its own file: 06 - Операции/cron/logs/T{N}.jsonl. A typical line:

{"ts":"2024-06-14T03:10:22Z","task":"T14","status":"ok","slot_date":"2024-06-01","pillar":3,"slug":"ai-agent-roi-before-adoption","score":8.3,"duration_s":412}

Eight fields. ts is a UTC timestamp in ISO 8601. task is the job identifier. status is one of three outcome values. The rest are task-specific. But ts, task, status are required everywhere, no exceptions.
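A minimal sketch of a writer for this format, assuming a Python wrapper around each task. The function name log_event and the cron/logs default directory are illustrative, not the actual implementation:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_DIR = Path("cron/logs")  # assumed location; adjust to your layout


def log_event(task: str, status: str, log_dir: Path = LOG_DIR, **fields) -> dict:
    """Append one JSON object as a single line to <log_dir>/<task>.jsonl."""
    entry = {
        # UTC ISO 8601 with a trailing Z, e.g. 2024-06-14T03:10:22Z
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "task": task,
        "status": status,
        **fields,  # task-specific extras such as slug or duration_s
    }
    log_dir.mkdir(parents=True, exist_ok=True)
    with open(log_dir / f"{task}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False, separators=(",", ":")) + "\n")
    return entry
```

A call such as log_event("T14", "ok", score=8.3, duration_s=412) would append one line with the three required fields first, followed by the extras.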

Why JSONL and not a database? Because grep, jq, and Python work with JSONL out of the box. At 13 tasks writing 1–5 lines a day, that's 5–10 KB per day. A year from now: ~3 MB. Not a storage problem, not a parsing problem.
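To illustrate the "out of the box" point, here is a rough Python equivalent of grepping a log for one status, using only the standard library. The helper names read_events and events_with_status are made up for this sketch:

```python
import json


def read_events(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def events_with_status(path, status):
    """Return every entry whose status field equals the given value."""
    return [e for e in read_events(path) if e.get("status") == status]
```

Because each line is a complete JSON object, no schema or parser setup is needed before querying.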

The three statuses that change everything

Three-status logging (ok, degraded, error) gives autonomous agents a way to report partial success — which binary success/failure logging cannot express.

ok means the task ran cleanly. error means no usable output, needs intervention. Between them is the status that turned out to matter most.

degraded means the task finished partially, but there's still useful output. Example: agent T14 collects topic candidates from external sources. If all WebFetch calls fail due to network errors, it doesn't return error — it pulls a candidate from the local Topic Library and writes degraded. The process continues with less data.
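The fallback logic can be sketched roughly as follows. This is an illustration under assumptions, not the agent's actual code: collect_candidates, the fetchers argument, and the rule "take the first local entry" are all hypothetical, and partial network failures are simply treated as ok here.

```python
def collect_candidates(fetchers, local_library):
    """Return (status, candidates).

    fetchers: callables that return a list of candidates or raise OSError
    local_library: locally stored candidates used as the fallback
    """
    candidates = []
    for fetch in fetchers:
        try:
            candidates.extend(fetch())
        except OSError:
            # Network failures; ConnectionError and TimeoutError are OSError subclasses.
            continue

    if candidates:
        return "ok", candidates
    if local_library:
        # Nothing usable came back from the network: fall back and say so.
        return "degraded", local_library[:1]
    return "error", []  # no output at all, needs intervention
```

The point of the design is that the caller always receives a status alongside the data, so a fallback is never silently reported as a clean run.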

In the first month, seven of the 13 tasks returned degraded at least once. None required manual intervention; they worked through it. With binary success/failure logging, each of those partial successes would have been reported as a failure, and I'd eventually have learned to ignore alerts. That's how alerting culture dies.

There are also dedicated early-exit statuses like inbox_full and no_pending_slots. Neither is error or degraded. Normal completion with a specific reason. Separating these matters — otherwise you get false alerts.
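One way to keep these categories apart in code is a small classifier like the one below. The set names and should_alert are hypothetical, and alerting on unknown statuses is my own assumption (it keeps a typo in a status value from passing silently). The started value is the start-of-run marker described in the next section.

```python
OUTCOMES = {"ok", "degraded", "error"}
EARLY_EXITS = {"inbox_full", "no_pending_slots"}  # normal completion, with a reason
LIFECYCLE = {"started"}  # start-of-run marker


def should_alert(status: str) -> bool:
    """Alert on error, and on any status value that is not recognized."""
    if status == "error":
        return True
    return status not in OUTCOMES | EARLY_EXITS | LIFECYCLE
```

With this split, should_alert("inbox_full") is False, so an early exit never reaches the phone.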

How to decide what to trust in agent *output* (separate from whether the run succeeded) is a different question; I wrote about it in AI agents in production: 3 trust patterns.

The claude_start pattern: separate "didn't start" from "crashed midway"

The started log entry — written before any business logic — distinguishes launch failures from mid-run crashes. Without it, both look the same: an absent ok.

The first line in any log:

{"ts":"2024-06-14T03:10:01Z","task":"T14","status":"started"}

Why? Because there are two different failure modes. First: the agent started but crashed halfway. Second: the agent never started, because cron didn't fire, Python couldn't find a dependency, or a lockfile was left over from a previous run.

Without the started entry, you just see the absence of ok and can't tell if the failure was inside or outside. With it, the split is immediate: no started at the expected time means the issue is in the launch itself. started present but no ok or degraded means the issue is inside the logic.
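The same decision rule as a short sketch. It assumes the caller has already selected the status values logged in one expected run window, in order; diagnose_run and its return strings are illustrative only:

```python
def diagnose_run(statuses):
    """Classify one run from the ordered status values it logged."""
    if "started" not in statuses:
        # Nothing was written at all: cron, dependencies, or a lockfile.
        return "launch failure: never started"
    after_start = statuses[statuses.index("started") + 1:]
    if not after_start:
        # The marker exists but no final status followed it.
        return "crashed midway: started but no final status"
    return f"finished: {after_start[-1]}"
```

For example, diagnose_run(["started"]) points at the task logic, while diagnose_run([]) points at the launch itself.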

Adding this entry takes maybe five seconds. It pays for itself the first time you debug a failure after the fact.

Alerts without Datadog

No dashboards. One shell script on cron, running hourly:

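#!/bin/sh
# Hourly check: alert if any of the last 20 log lines for a task is an error.
# Note: each run writes a started line plus a final status, so 20 lines is about 10 runs.
for task in T14 T15 T16; do
  # grep -c prints the number of matching lines (0 if none)
  recent=$(tail -n 20 "cron/logs/${task}.jsonl" | grep -c '"status":"error"')
  if [ "$recent" -gt 0 ]; then
    ./send_tg_notification.sh "⚠️ ${task}: error in last 20 log lines"
  fi
done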

The notification goes to Telegram. Not email — I don't check email immediately. Telegram hits the phone within a few minutes.

That's the whole monitoring setup. No collector agent, no time-series database, no Grafana. Setup took 20 minutes. Maintenance is zero.

What I deliberately don't monitor

In the first few weeks I wanted to add quality metrics: output length, template conformance, usefulness score. I stopped, because those aren't reliability metrics — they're content quality metrics. Quality is review work, not monitoring.

I also don't monitor execution time. An agent might take 3 minutes or 25, depending on external services and context size. As long as it finishes without error, duration doesn't matter. Add a time threshold and you get false positives every time an API slows down, and eventually you stop paying attention to them. Same failure mode as binary status logging, just slower.

Better to think through this once now than rediscover it six months into production.

Three months out

degraded turned out to be the most useful status. And the most surprising: I expected most failures to be error. They aren't. Most failures are external instabilities — WebFetch, rate limits, network timeouts — that agents handle themselves when there's a fallback in the prompt.

That changed how I write prompts. I used to write "if this doesn't work, stop." Now I write "if this doesn't work, do this instead and mark as degraded." Small change in the prompt, large change in operational behavior.

Three months of JSONL across all tasks is about 12 MB. The error status accounts for roughly 2% of entries. Everything else is ok, degraded, or normal early exits. That's a healthy picture — and it would have been invisible without structured logging.
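A share like that 2% figure can be computed directly from the log files. The sketch below is one possible way to do it, not the script I actually used; status_shares is a hypothetical name, and it counts every entry, including started lines, unless you filter them out first:

```python
import json
from collections import Counter
from pathlib import Path


def status_shares(log_dir):
    """Return {status: percent of entries} across every *.jsonl file in log_dir."""
    counts = Counter()
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    counts[json.loads(line).get("status", "unknown")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: round(100 * n / total, 1) for status, n in counts.items()}
```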

If you're deploying your first autonomous agent, start exactly here: three statuses, one line per event, Telegram alert only on error. Nothing else. On how to write the spec before the first run, there's a separate piece on spec-first AI development.