# ArchitectureEnv — Teaching an LLM to Design Software Systems

*Meta PyTorch × OpenEnv Hackathon Grand Finale, April 2026*

🤗 **Space**: https://huggingface.co/spaces/thepikachu/architecture-env  
🧠 **Model**: https://huggingface.co/thepikachu/architecture-sft-model  
📓 **Notebook**: [ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb](https://huggingface.co/spaces/thepikachu/architecture-env/blob/main/notebooks/ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb)

---

## The problem nobody talks about

Every engineering team that has ever built a distributed system knows the drill.
Someone opens a Confluence doc, someone else draws boxes in Miro, and then the
back-and-forth begins. Do we use Kafka or RabbitMQ here? Should the worker
connect directly to the database or go through a cache first? Does the
WebSocket gateway fan out to the broker, or does the broker fan out to the
gateway? These conversations take hours. Then requirements change, the design
evolves, and they take hours again.

The obvious response was: just ask an LLM. And that works, up to a point. You
describe your system, the model suggests components, you iterate. But there's a
hidden cost — every response still needs a human to read it, validate it against
the actual requirements, catch the missing connections, and push back when the
model picks the wrong technology for the task. You're not removing the review
loop; you're just moving it.

We wanted to close that loop entirely.

---

## The idea: turn architecture design into an RL problem

The key insight is that software architecture design has a structure that most
open-ended LLM tasks don't: **it can be verified**. Given a task description,
there is a set of components that must be present, a set of connections that
must exist between them, and a measurable score for how close a design is to
correct. You don't need a human in the loop to know whether `worker → database`
is connected. The environment can check.

If you can verify it, you can train on it.

We built **ArchitectureEnv** — an OpenEnv environment where an agent designs
software systems one action at a time. The agent can `add` a component,
`connect` two components, or `submit` the finished design. After each action,
the environment returns a reward in `[0, 1]` based on how complete and
technically sound the architecture is.
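A minimal sketch of that action loop, assuming a plain-text action format such as `add cache`, `connect api cache`, and `submit`. The `ToyArchitectureEnv` class, its required sets, and its coverage-only reward are hypothetical illustrations, not the real ArchitectureEnv API:

```python
from dataclasses import dataclass, field


@dataclass
class ToyArchitectureEnv:
    """Hypothetical stand-in for the real environment; names are illustrative."""
    required_components: frozenset
    required_edges: frozenset
    components: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    done: bool = False

    def step(self, action: str) -> float:
        """Apply one text action and return a reward in [0, 1]."""
        parts = action.split()
        if len(parts) == 2 and parts[0] == "add":
            self.components.add(parts[1])
        elif len(parts) == 3 and parts[0] == "connect":
            # Only connect components that are already in the design.
            if set(parts[1:]) <= self.components:
                self.edges.add((parts[1], parts[2]))
        elif parts == ["submit"]:
            self.done = True
        else:
            raise ValueError(f"unrecognised action: {action!r}")
        return self.reward()

    def reward(self) -> float:
        """Fraction of required components and edges present so far."""
        hit = len(self.components & self.required_components)
        hit += len(self.edges & self.required_edges)
        total = len(self.required_components) + len(self.required_edges)
        return hit / total


env = ToyArchitectureEnv(
    required_components=frozenset({"api", "cache", "database"}),
    required_edges=frozenset({("api", "cache"), ("api", "database")}),
)
rewards = [env.step(a) for a in (
    "add api", "add cache", "add database",
    "connect api cache", "connect api database", "submit",
)]
```

The reward rises with every useful action, so the agent gets a signal at each step rather than only at `submit`.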

Six real-world tasks across three difficulty levels:

| Task | Difficulty | Key challenge |
|------|-----------|---------------|
| `url_shortener` | Easy | Load balancer, cache, DB, abuse prevention |
| `chat_system` | Medium | WebSocket fanout, presence, async broker |
| `ecommerce_platform` | Medium | Search, payments, async workers |
| `youtube_platform` | Hard | Transcoding pipeline, CDN, recommendations |
| `ride_sharing` | Hard | Geospatial indexing, real-time matching |
| `ml_platform` | Hard | Feature store, model registry, inference server |

The reward function is deterministic and composable — no LLM judge, no ambiguity:

- **Component coverage**: which required families are present (e.g. `cache`, `broker`, `database`)
- **Connection coverage**: which required edges exist between components
- **Technology-fit bonus**: Kafka over RabbitMQ for streaming workloads, Redis over Memcached when presence tracking is needed
- **Bonus items**: optional cross-cutting concerns — auth, observability, rate limiting, payment gateways
- **Coherence penalty**: deducted for irrelevant or conflicting components

This multi-component reward makes the environment hard to game. An agent can't
just dump every component it knows — irrelevant components incur a penalty, and
bonus items only score if the required core is already complete.
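The five terms above could be composed roughly as follows. This is a sketch under stated assumptions: the weights, the `spec` layout, and the function name `score_design` are invented for illustration, and only the "irrelevant" half of the coherence penalty is modelled:

```python
def score_design(design, edges, spec):
    """Combine the reward terms into one score clamped to [0, 1].

    `design` maps each family to the concrete technology chosen for it.
    The weights (0.4 / 0.4 / 0.1 / 0.1 / 0.05) are assumed for illustration.
    """
    families = set(design)
    comp_cov = len(families & spec["required"]) / len(spec["required"])
    edge_cov = len(edges & spec["required_edges"]) / len(spec["required_edges"])
    core_complete = comp_cov == 1.0 and edge_cov == 1.0

    # Technology fit: share of preferred choices the design actually made.
    preferred = spec["preferred"]
    tech_fit = sum(design.get(f) == t for f, t in preferred.items()) / max(len(preferred), 1)

    # Bonus items only count once the required core is complete.
    bonus = 0.0
    if core_complete:
        bonus = len(families & spec["bonus"]) / max(len(spec["bonus"]), 1)

    # Coherence penalty: families that are neither required nor bonus.
    irrelevant = len(families - spec["required"] - spec["bonus"])

    raw = 0.4 * comp_cov + 0.4 * edge_cov + 0.1 * tech_fit + 0.1 * bonus
    return min(1.0, max(0.0, raw - 0.05 * irrelevant))


SPEC = {  # hypothetical spec for a streaming-style task
    "required": {"api", "broker", "database"},
    "required_edges": {("api", "broker"), ("broker", "database")},
    "preferred": {"broker": "kafka"},
    "bonus": {"auth", "observability"},
}
core = {"api": "fastapi", "broker": "kafka", "database": "postgres"}
all_edges = {("api", "broker"), ("broker", "database")}

complete = score_design({**core, "auth": "oauth", "observability": "prometheus"}, all_edges, SPEC)
core_only = score_design(core, all_edges, SPEC)
padded = score_design({**core, "blockchain": "ethereum"}, all_edges, SPEC)
no_edges = score_design({**core, "auth": "oauth"}, set(), SPEC)
```

In this toy version, `padded` scores below `core_only` because of the irrelevant component, and `no_edges` earns nothing for its `auth` bonus because the required edges are missing.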

---

## The knowledge layer: the System Design Encyclopedia

Before training anything, we needed the model to know *which* components belong
to *which* architectural families. A "cache" family could mean Redis, Memcached,
or Varnish — but for a ride-sharing platform tracking driver locations in memory,
Redis is the right answer. For a YouTube-scale streaming platform, Kafka is the
right broker, not RabbitMQ.

We wrote `encyclopedia_rules.py`, a knowledge layer on top of the
**System Design Components Encyclopedia** that covers 40+ component families. It does four things:

**1. Family → concrete component mappings.** Every abstract family required by the environment
(e.g. `broker`, `cache`, `search`, `storage`) is mapped to the concrete
implementation that scores highest for a given task type. The agent's system
prompt is enriched with these mappings before every episode.

**2. Task-specific technology overrides.** Some mappings differ by task. The
broker family defaults to `rabbitmq` for simple async work but overrides to
`kafka` for streaming-heavy tasks (`youtube_platform`, `ride_sharing`). The
cache family defaults to `redis`, while `ml_platform` uses `feast` as its
feature store. These overrides are resolved at inference time, not hard-coded.

**3. Bonus target lists.** Each task has a set of optional bonus components
(auth, observability, rate-limiting, payment gateway, recommendation workers,
etc.). The encyclopedia maps these back to concrete components the agent can
`add` by name, closing the gap between abstract scoring criteria and executable
actions.

**4. Enriched system prompt.** The system prompt injected into every agent call
includes a structured summary of the relevant component families, their
concrete implementations, and the integration patterns between them — drawn
directly from the encyclopedia's integration pattern catalog.

This is what separates the agentic system from raw LLM inference. The model
doesn't need to hallucinate that Kafka is a good fit for streaming — the
encyclopedia tells it, and the Planner enforces it.
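A minimal sketch of the default-plus-override lookup described in points 1 and 2, and of how its output might be rendered into prompt lines. The dictionary layout and the helper names `resolve` and `prompt_hints` are assumptions for illustration, not the contents of `encyclopedia_rules.py`; the `ml_platform` entry mirrors the text's description of `feast` standing in where `redis` would otherwise be the default:

```python
# Default concrete implementation per family (values taken from the text).
DEFAULTS = {"broker": "rabbitmq", "cache": "redis"}

# Per-task overrides, consulted before the defaults.
TASK_OVERRIDES = {
    "youtube_platform": {"broker": "kafka"},
    "ride_sharing": {"broker": "kafka"},
    "ml_platform": {"cache": "feast"},
}


def resolve(family: str, task: str) -> str:
    """Return the concrete component for a family, honouring task overrides."""
    return TASK_OVERRIDES.get(task, {}).get(family, DEFAULTS[family])


def prompt_hints(task: str, families) -> str:
    """Render one hint line per family for inclusion in a system prompt."""
    return "\n".join(
        f"- {family}: use `{resolve(family, task)}`" for family in sorted(families)
    )


hints = prompt_hints("ride_sharing", {"cache", "broker"})
```

Because the lookup happens per call, adding a new task only means adding an entry to the override table.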

---

## Training: SFT first, then RL

A raw base model given an architecture task will output something that looks
like an architecture discussion — not a sequence of `add` and `connect`
commands. Before we could use reinforcement learning, we needed the model to
speak the language of the environment.

We fine-tuned **Qwen2.5-3B-Instruct** using Unsloth with 4-bit QLoRA on a
supervised dataset of 120 correct action sequences across all six task types.
This is the SFT stage — it doesn't teach strategy, it teaches format. After SFT,
the model reliably outputs `add postgres`, `connect broker worker`, `submit` —
the vocabulary the environment understands.

**SFT training ran for 150 steps (10 epochs, batch size 8) on a T4 GPU in 13 minutes.**
Cross-entropy loss dropped from 3.57 at step 10 to a plateau of ~0.006 by step 70,
with a final average training loss of 0.208 (skewed by the steep early descent).
Evaluated directly after training (single-shot greedy decode, no agentic loop),
the SFT checkpoint achieved **0.997 average reward** across all six tasks: four tasks
at 1.0, and `ecommerce_platform` and `ml_platform` at 0.99.
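The step count and the average follow directly from the numbers above. A quick check, assuming no gradient accumulation (so one optimizer step per batch of 8):

```python
import math

examples, batch_size, epochs = 120, 8, 10

# One optimizer step per batch; assumes no gradient accumulation.
steps_per_epoch = math.ceil(examples / batch_size)
total_steps = steps_per_epoch * epochs

# Four tasks at 1.0 and two at 0.99, as reported above.
sft_rewards = [1.0, 1.0, 1.0, 1.0, 0.99, 0.99]
average_reward = round(sum(sft_rewards) / len(sft_rewards), 3)
```

This gives 15 steps per epoch, 150 steps in total, and an average reward that rounds to 0.997.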

SFT had saturated the task format. We then ran GRPO on top of the SFT checkpoint to
explore whether RL could improve priority ordering — required items before bonus items,
required connections before submitting.

Using HuggingFace TRL's GRPOTrainer, we ran the environment as the reward function:
the model generates an action plan, the environment executes it, and the final score
drives the weight update. We trained for **50 GRPO steps from the SFT checkpoint**
(batch size 4, ~10 minutes on T4). As the reward curve shows, GRPO improved the
raw model's training-time scores. But when both checkpoints were evaluated through
the full agentic inference stack, the SFT-only checkpoint outperformed SFT+GRPO
(0.978 vs 0.928); the GRPO run degraded `ml_platform` significantly. This is
discussed in detail in the SFT vs. SFT + GRPO comparison later in this post.
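"Environment as reward function" can be sketched as a plain callable that replays each generated plan and returns one score per completion. This sketch assumes the custom-reward convention in which the callable receives the sampled completions plus dataset columns as keyword arguments; check the TRL documentation for the exact signature in your version. The `task` column, the `REQUIRED` table, and the coverage-only score are hypothetical simplifications of the real environment:

```python
# Hypothetical required components per task; the real environment also
# scores connections, technology fit, bonus items, and coherence.
REQUIRED = {
    "url_shortener": {"load_balancer", "cache", "database"},
}


def plan_score(plan: str, required: set) -> float:
    """Replay a newline-separated action plan and score component coverage."""
    added = set()
    for line in plan.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "add":
            added.add(parts[1])
        elif parts == ["submit"]:
            break  # nothing after submit counts
    return len(added & required) / len(required)


def environment_reward(completions, task, **kwargs):
    """Return one float reward per completion, as a reward callable would."""
    return [plan_score(c, REQUIRED[t]) for c, t in zip(completions, task)]


scores = environment_reward(
    completions=[
        "add cache\nadd database\nsubmit",
        "add load_balancer\nadd cache\nadd database\nsubmit",
    ],
    task=["url_shortener", "url_shortener"],
)
```

The first plan omits the load balancer and scores lower than the second, which is the difference a policy-gradient update would push on.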

![SFT Loss Curve](https://huggingface.co/spaces/thepikachu/architecture-env/resolve/main/plots/loss_curve.png)
*SFT cross-entropy loss over 150 steps. Loss axis: cross-entropy. Step axis: training step.*

![Reward Curve](https://huggingface.co/spaces/thepikachu/architecture-env/resolve/main/plots/reward_curve.png)
*Average environment reward across all six tasks, comparing SFT-only and SFT+GRPO checkpoints over training. Reward axis: environment score [0, 1]. Step axis: training step.*

---

## The agentic layer: Planner, Critic, Negotiator

Here's where things got interesting. We found that the SFT model, when wrapped
in a structured multi-agent inference loop informed by the encyclopedia,
matched or exceeded the SFT+GRPO checkpoint on five of six tasks, with every decision logged and interpretable.

The loop at every step:

**PlannerAgent** — deterministic, zero-hallucination. It reads the live
environment state (present components, missing required items, missing
connections, available bonus targets) and computes the ground-truth next action
using the encyclopedia's priority ordering: required items → required connections
→ bonus items → submit. The planner's hint is injected into the LLM's prompt.

**LLM (the fine-tuned SFT model)** — generates its own action proposal informed
by the planner hint. This is where domain knowledge from the encyclopedia pays
off: the model has been trained on action sequences that reflect the same
technology choices the encyclopedia prescribes.

**CriticAgent** — validates the LLM's proposal against the live environment
state before execution. It checks: does the component exist before connecting?
Is required work complete before bonus items? Is the missing-items list empty
before submit? If the proposal fails any check, it's rejected without hitting
the environment.

**NegotiatorAgent** — repairs rejected proposals using the planner as ground
truth. Every rejection is logged with its reason (`critic_rejected: 'websocket_gateway' not present; add it before connecting → repair: add websocket_gateway`), creating a fully auditable decision trail.

The Critic was the most important piece. Without it, the LLM would occasionally
submit with missing connections still in the queue, or attempt to connect
components that hadn't been added yet. The Critic catches both, every step,
before the environment ever sees the action.
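The Planner, Critic, and Negotiator roles can be sketched as three small pure functions. This is an illustrative reconstruction from the descriptions above, not the project's code: the `State` fields, the function names, and the exact rejection messages are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical snapshot of the live environment state."""
    components: set
    missing_components: list
    missing_edges: list
    bonus_targets: list = field(default_factory=list)


def plan(state: State) -> str:
    """Planner: required items, then required connections, then bonus, then submit."""
    if state.missing_components:
        return f"add {state.missing_components[0]}"
    if state.missing_edges:
        src, dst = state.missing_edges[0]
        return f"connect {src} {dst}"
    if state.bonus_targets:
        return f"add {state.bonus_targets[0]}"
    return "submit"


def critique(action: str, state: State):
    """Critic: return a rejection reason, or None if the action may run."""
    parts = action.split()
    if not parts:
        return "empty action"
    verb, args = parts[0], parts[1:]
    required_left = bool(state.missing_components or state.missing_edges)
    if verb == "connect":
        for name in args:
            if name not in state.components:
                return f"'{name}' not present; add it before connecting"
    if verb == "add" and args and args[0] in state.bonus_targets and required_left:
        return "required work incomplete; finish it before bonus items"
    if verb == "submit" and required_left:
        return "required items or connections still missing"
    return None


def negotiate(proposal: str, state: State, log: list) -> str:
    """Negotiator: keep valid proposals, otherwise fall back to the planner."""
    reason = critique(proposal, state)
    if reason is None:
        return proposal
    repair = plan(state)
    log.append(f"critic_rejected: {reason} -> repair: {repair}")
    return repair


state = State(
    components={"broker"},
    missing_components=["websocket_gateway"],
    missing_edges=[("websocket_gateway", "broker")],
    bonus_targets=["auth"],
)
log = []
action = negotiate("connect websocket_gateway broker", state, log)
```

Here the proposed `connect` is rejected because `websocket_gateway` has not been added yet, and the repaired action is the planner's `add websocket_gateway`. A premature `submit` or an early `add auth` would be repaired the same way.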

---

## Results

**Agentic inference demo** — real LLM calls, no caching, run on 2026-04-25:

| Task | SFT, no agentic loop | SFT + agentic loop | Δ |
|------|---------------------|--------------------|---|
| chat_system | 0.930 | **1.000** | +0.070 |
| ecommerce_platform | 0.900 | **0.990** | +0.090 |
| youtube_platform | 0.930 | **1.000** | +0.070 |
| ride_sharing | 0.930 | 0.890 | −0.040 |
| ml_platform | 0.930 | **0.990** | +0.060 |
| url_shortener | 0.930 | **1.000** | +0.070 |
| **Average** | **0.925** | **0.978** | **+0.053** |

Five of six tasks hit 0.99 or above. The `ride_sharing` episode scored 0.890: the
agent completed the required architecture correctly, but the Negotiator's deduplication
guard failed to block an extra bonus-item attempt, which incurred a step penalty before
submit. This is a known edge case in the Negotiator, not a model failure.

Bonus items collected in the agentic-loop run show the agent using the
encyclopedia's bonus targets:

- `chat_system` → auth, observability, presence_service, notification_service
- `youtube_platform` → auth, observability, recommendation_worker
- `url_shortener` → auth, observability, rate_limiting
- `ml_platform` → auth, observability

---

## Why SFT + agentic loop, not SFT + GRPO + agentic loop?

We ran both pipelines end-to-end through the same Planner-Critic-Negotiator inference
stack and compared the results directly.

**SFT + agentic loop** averaged **0.978** across all six tasks — five at 0.99 or above.
**SFT + GRPO + agentic loop** averaged **0.928** — GRPO made things worse, not better.

| Task | SFT + agentic | SFT + GRPO + agentic | Δ |
|------|--------------|----------------------|---|
| url_shortener | 1.000 | 1.000 | 0.000 |
| chat_system | 1.000 | 1.000 | 0.000 |
| ecommerce_platform | 0.990 | 0.990 | 0.000 |
| youtube_platform | 1.000 | 1.000 | 0.000 |
| ride_sharing | 0.890 | 1.000 | +0.110 |
| ml_platform | 0.990 | 0.580 | −0.410 |
| **Average** | **0.978** | **0.928** | **−0.050** |
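The averages and per-task deltas in the table can be recomputed from the individual scores. The snippet below only restates the table's numbers; the variable names are illustrative:

```python
# (SFT + agentic, SFT + GRPO + agentic) per task, copied from the table above.
scores = {
    "url_shortener": (1.000, 1.000),
    "chat_system": (1.000, 1.000),
    "ecommerce_platform": (0.990, 0.990),
    "youtube_platform": (1.000, 1.000),
    "ride_sharing": (0.890, 1.000),
    "ml_platform": (0.990, 0.580),
}

sft_avg = sum(sft for sft, _ in scores.values()) / len(scores)
grpo_avg = sum(grpo for _, grpo in scores.values()) / len(scores)

# Positive delta means GRPO helped on that task.
deltas = {task: round(grpo - sft, 3) for task, (sft, grpo) in scores.items()}
changed = {task: d for task, d in deltas.items() if d != 0}
```

Only two tasks change at all: `ride_sharing` improves and `ml_platform` regresses by a larger margin, which accounts for the lower average.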

The regression on `ml_platform` is the clearest signal. The SFT+GRPO checkpoint
submitted with `storage` still missing, zero connections made, and three required
edges completely absent, for a score of 0.58. The SFT checkpoint through the same
agentic loop scored 0.99 on the same task. GRPO disrupted the LoRA weights enough
to break the model's ability to complete this hard task, while leaving four tasks
unchanged. Its only gain was on `ride_sharing`, where the SFT run's lower score came
from a Negotiator edge case rather than a model error.

Why does this happen? GRPO learns from differences in reward among a group of
completions sampled for the same prompt. When the SFT model already knows the correct
action format and ordering, nearly every completion scores the same, so GRPO has little
reward variance to learn from and risks overfitting the adapter weights to minor
fluctuations, which is what happened on `ml_platform`.
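The variance argument can be made concrete with the group-relative advantage used in the standard GRPO formulation, where each completion's reward is centred on the group mean and divided by the group standard deviation. This is a sketch: the `eps` value and the use of the population standard deviation are assumptions, and trainer implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A saturated group: every completion already scores 1.0, so no signal.
saturated = group_advantages([1.0, 1.0, 1.0, 1.0])

# One completion is 0.01 lower; normalisation magnifies that small gap.
noisy = group_advantages([1.0, 1.0, 0.99, 1.0])
```

In the saturated group every advantage is zero, so there is nothing to learn. In the second group a 0.01 reward difference is scaled up to an advantage with magnitude above 1, which illustrates how small fluctuations can end up driving the update.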

The right production system uses both: GRPO for the weights when starting from a weaker
base, Planner-Critic-Negotiator for the inference guard regardless. That's the roadmap.

---

## What this enables

Architecture design that currently takes engineering teams hours of back-and-forth
can be reduced to a structured, verifiable, automatable loop. The environment
exposes a clean interface: describe the system, get a design, get a score. The
agentic layer ensures the design is complete before submission. The trained model
brings domain knowledge. The encyclopedia brings precision about which technologies
fit which workloads.

The deeper point is about the pattern, not the specific task. Any domain where
design decisions can be verified — infrastructure templates, data pipeline
architecture, API schema design, security policy configuration — is a candidate
for this approach. Build an environment with a deterministic composable reward,
compile your domain knowledge into a structured encyclopedia, fine-tune a model
on the action vocabulary, wrap it in a structured inference loop. The review
loop doesn't disappear; it moves into the Critic, where it runs in milliseconds
instead of minutes.

---

## Compliance checklist

| Requirement | Status |
|-------------|--------|
| Built on OpenEnv (latest release) | ✅ `openenv-core`, FastAPI, `openenv.yaml` |
| Working training script (Unsloth + TRL) as Colab notebook | ✅ [`ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb`](https://huggingface.co/spaces/thepikachu/architecture-env/blob/main/notebooks/ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb) |
| Evidence of real training (loss + reward plots) | ✅ `plots/loss_curve.png`, `plots/reward_curve.png` — committed to repo |
| Mini-blog writeup | ✅ This post (also at `Blog.md` in Space repo) |
| Environment pushed to HuggingFace Space | ✅ https://huggingface.co/spaces/thepikachu/architecture-env |
| README with problem motivation, env explanation, results | ✅ Linked from Space |
| All materials linked from README | ✅ Space, model, notebook, blog |

---

## Stack

- **Environment**: OpenEnv + FastAPI + Pydantic
- **Base model**: Qwen2.5-3B-Instruct (4-bit QLoRA via Unsloth)
- **Training**: Unsloth + HuggingFace TRL (SFTTrainer + GRPOTrainer)
- **Knowledge layer**: `encyclopedia_rules.py` with 40+ component families, family→concrete mappings, task-specific technology overrides, bonus targets, enriched system prompts
- **Agents**: PlannerAgent, CriticAgent, NegotiatorAgent (pure Python, deterministic, no extra model calls)
- **Deployment**: HuggingFace Spaces (Docker)

---

*Code, training notebook, and logs: https://huggingface.co/spaces/thepikachu/architecture-env*