We Taught an LLM When Not to Act
A story about 3 AM alerts, rogue log files, and what it actually means to reason under pressure.

It's 3:07 AM. Your phone screams. The database is down. The logs say: connection failure.
The obvious move is to restart the database.
So you do. It fails. You restart it again. It fails again.
Because the database isn't the problem.
Somewhere deeper in the system, a rogue process has been quietly writing to a log file for hours. That log file has swallowed every last byte of disk space. The database just happened to fall first — it was the first domino, not the cause.
A good Site Reliability Engineer figures this out in minutes: check disk usage, find the flood, kill the process, then restart the database. In that order.
An LLM agent, left to its own devices, restarts the database. Again. And again. It knows exactly what the command does. It just doesn't know why things broke — or what to check first.
That's the gap Auto-SRE was built to close.

The Problem with AI Agents in Live Systems
Language models are impressive at a lot of things: writing code, summarizing documents, answering questions. But those tasks share something in common — they're static. The model reads, thinks, and responds. Nothing breaks if it gets it wrong.
Real infrastructure is different. Actions have consequences. Restarting a service before freeing disk space wastes precious minutes. Killing the wrong process makes things worse. And sometimes — this is the hard part — the correct move is to do nothing at all.
Current AI agents fail in live systems for one consistent reason: they act on what they see, not what caused it. They fix symptoms, not sources. They skip verification. They never learned that restraint is a skill.
The underlying cause is simple: they've never experienced consequences. Their training had no real system to break.
Auto-SRE changes that.

A Live Simulator, Real Commands, Real Consequences
Auto-SRE is a training environment where an AI agent lives inside a simulated Linux system. Not a toy. Not an abstraction. A real-ish filesystem, running processes, service health states — and hidden failure conditions the agent has to discover.
The agent can only do what a human engineer would do: run shell commands and read what comes back.
$ df -h          # how full is the disk?
$ ps aux         # what's running?
$ cat /var/log/app.log   # what do the logs say?
$ kill 6666      # terminate the rogue process
$ systemctl restart db   # now bring the database back
Here's the crucial design decision: the agent is not rewarded for typing the right command. It's rewarded for the system actually being healthy again.
That single rule forces genuine reasoning. You can't game it. You can't cargo-cult a solution. You have to understand what's broken and why, because only fixing the real cause earns a meaningful reward.
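To make that design decision concrete, here is a minimal Python sketch of an outcome-based reward. Everything in it is a hypothetical stand-in rather than Auto-SRE's actual code: the `SimState` fields, the `is_healthy` check, the 90% disk threshold, and the 0.0/1.0 reward values are all assumptions. The only thing taken from the text is the rule itself: reward depends on the state the system ends up in, not on which commands were typed.

```python
from dataclasses import dataclass


@dataclass
class SimState:
    """Hypothetical snapshot of the simulated machine (not Auto-SRE's real schema)."""
    disk_used_pct: int
    db_running: bool
    rogue_process_alive: bool


def is_healthy(state: SimState) -> bool:
    # Assumed definition of "healthy": database up, disk has headroom,
    # and the process that caused the incident is gone.
    return (
        state.db_running
        and state.disk_used_pct < 90
        and not state.rogue_process_alive
    )


def outcome_reward(state: SimState, commands: list[str]) -> float:
    # The command history is accepted but deliberately never inspected:
    # typing "systemctl restart db" earns nothing unless the system
    # actually ends up healthy.
    return 1.0 if is_healthy(state) else 0.0


# Same final command, two different end states, two different rewards.
still_broken = SimState(disk_used_pct=100, db_running=False, rogue_process_alive=True)
recovered = SimState(disk_used_pct=40, db_running=True, rogue_process_alive=False)

print(outcome_reward(still_broken, ["systemctl restart db"]))  # 0.0
print(outcome_reward(recovered, ["df -h", "kill 6666", "systemctl restart db"]))  # 1.0
```

Because the reward function never looks at the command list, memorizing a "correct" command cannot raise the score on its own. A real environment would likely add shaping or partial credit, which this sketch leaves out.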

Ten Scenarios, Ten Skills
Training runs across ten escalating scenarios, each designed to teach one specific capability:
| Task | Scenario | Skill |
| --- | --- | --- |
| T01 | Misnamed config file prevents app start | Diagnosis + file repair |
| T02 | Rogue process owns the wrong port | Targeted process kill |
| T03 | App crashes because packages aren't installed | Environment setup |
| T04 | System reports failure. Everything is fine. | Do nothing. |
| T05 | Service hung: log file ate all disk space | Root cause over surface symptom |
| T06 | Memory-hungry process starves everything else | Resource monitoring |
| T07 | Logger floods disk → DB crashes → app goes down | Long-horizon causal chain |
| T08 | Service crashes on every restart due to memory leak | Break the loop before recovering |
| T09 | Three services down, must restart in exact order | Ordered, dependency-aware recovery |
| T10 | Auth broken by a wrong value buried in a config | Log-to-config tracing |
Task 7 — the cascading failure — is the hardest. Three services fail from a single root cause, but the agent only sees the final symptom. Restarting the database first fails. Clearing the log without killing the process fails. Only the correct sequence, in the correct order, works.
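The cascade described above can be sketched as a toy state machine. This is an illustration under stated assumptions, not Auto-SRE's simulator: the `CascadeSim` class, its fields, and its output strings are hypothetical, the PID 6666 and the `/var/log/syslog` path are borrowed from the shell examples in this post, and the rule that disk space is reclaimed only once the log is deleted and the logger has stopped is a simplification chosen for the sketch.

```python
class CascadeSim:
    """Toy model of the logger -> disk -> database cascade (hypothetical)."""

    LOGGER_PID = 6666  # PID borrowed from the shell example in this post

    def __init__(self) -> None:
        self.logger_alive = True
        self.log_deleted = False
        self.db_running = False

    @property
    def disk_full(self) -> bool:
        # Assumption: space is reclaimed only once the log is deleted AND
        # the logger has stopped; a live logger keeps the disk full.
        return not (self.log_deleted and not self.logger_alive)

    def run(self, command: str) -> str:
        if command == "df -h":
            return "disk full" if self.disk_full else "disk has free space"
        if command == f"kill {self.LOGGER_PID}":
            self.logger_alive = False
            return "logger stopped"
        if command == "rm /var/log/syslog":
            self.log_deleted = True
            return "log removed"
        if command == "systemctl restart db":
            # The database can only start when it has disk space to write to.
            self.db_running = not self.disk_full
            return "db started" if self.db_running else "db failed: no space left"
        return "command not found"


def recovers(commands: list[str]) -> bool:
    """Replay a command sequence on a fresh simulator and report the outcome."""
    sim = CascadeSim()
    for command in commands:
        sim.run(command)
    return sim.db_running and not sim.disk_full


print(recovers(["systemctl restart db"] * 3))                      # False
print(recovers(["rm /var/log/syslog", "systemctl restart db"]))    # False
print(recovers(["df -h", "kill 6666", "rm /var/log/syslog",
                "systemctl restart db"]))                          # True
```

In this simplified model the restart only succeeds after both the logger and its log file have been dealt with, which is why repeating the obvious command never recovers the system.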

What Actually Improved
Training ran on a modest setup: a 1.5 billion parameter model, a single GPU, 250 steps. Not a frontier system. The goal was to see whether reasoning could actually be shaped — not to set records.
The overall reward curve is noisy. Ten tasks of wildly different difficulty will do that. But the per-task story is where things get interesting.
The standout result: Task 4 — the false alarm.
In this scenario, the system reports a failure. But everything is healthy. The agent must check the system state — and then choose not to act.
Across every training run, this score climbed:

Early training: 0.69
After 10 epochs: 0.90
After 25 epochs: 0.91

A score of 0.91 means the agent correctly identified a healthy system and held back in more than 9 out of 10 episodes.
That's not a trick. That's learned restraint — one of the hardest behaviors to instill in any agent, AI or otherwise.
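One way to picture how restraint could be scored is the sketch below. It is a guess at the shape of such a check, not the reward Auto-SRE actually uses: the `READ_ONLY` and `MUTATING` command prefixes, the `false_alarm_reward` function, and its 0.0/1.0 values are all hypothetical. It encodes only what the scenario description says: the agent must inspect the system and then leave it alone.

```python
# Hypothetical command categories; prefix matching is crude and only
# meant to illustrate the idea.
READ_ONLY = ("df", "ps", "cat", "systemctl status")
MUTATING = ("kill", "rm", "systemctl restart")


def false_alarm_reward(commands: list[str]) -> float:
    """Score one false-alarm episode in which the system is already healthy."""
    inspected = any(c.startswith(READ_ONLY) for c in commands)
    acted = any(c.startswith(MUTATING) for c in commands)
    # Full credit only for looking first and then changing nothing.
    return 1.0 if inspected and not acted else 0.0


print(false_alarm_reward(["df -h", "ps aux"]))                 # 1.0
print(false_alarm_reward(["df -h", "systemctl restart db"]))   # 0.0
print(false_alarm_reward([]))                                  # 0.0
```

Under a binary scheme like this one, the mean reward over many episodes would equal the fraction of episodes in which the agent checked and held back.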
Complex multi-step scenarios (cascading failure, dependency chains) started near zero and climbed into the 0.16–0.24 range. Frontier tasks like precise process kills and OOM scenarios sat around 0.08 — real signal, but more training needed.

Before and After
The difference in behavior is stark.
Before training:
step 1 → systemctl restart db   # fails, disk still full
step 2 → systemctl restart db   # fails again
step 3 → systemctl restart db   # reward: 0.01
After training:
step 1 → df -h                  # diagnose: disk at 100%
step 2 → rm /var/log/syslog     # free the disk space
step 3 → kill 6666              # stop the rogue logger (the root cause)
step 4 → systemctl restart db   # now it works
reward → +0.88 ✓
The agent didn't just learn a sequence of commands. It learned to ask why before asking what.

Where This Goes
The hardest tasks remain unsolved: port conflicts, memory-leak scenarios, and config tracing. That is the honest picture. 250 steps on a small model can only begin to teach multi-step causal reasoning. More compute, longer training, and richer environments are the direct path forward.
On the horizon: multi-agent teams that coordinate during live incidents, connections to real Kubernetes clusters and cloud monitoring, and environments that generate new failure scenarios based on what the agent is still getting wrong.

Why It Matters
Benchmarks test what a model knows. This environment tests what a model does — under uncertainty, across multiple steps, with real consequences for wrong choices.
As infrastructure becomes more complex and more automated, the distance between "knows the right answer" and "takes the right action" is the distance that matters most.
Auto-SRE is a step toward closing it.

Built with OpenEnv · GRPO · Unsloth · Qwen2.5-1.5B · Try the live demo on Hugging Face Spaces


Live Hugging Face preview: https://huggingface.co/spaces/goated1/auto-sre


![image](https://cdn-uploads.huggingface.co/production/uploads/69c422241dd1e042df95aec0/qweIkneAIQ4raVUARJip7.png)

GitHub repo: https://github.com/goatedAreeeb/auto-dev-


README.md: https://github.com/goatedAreeeb/auto-dev-/blob/main/README.md