ArchitectureEnv: Teaching an LLM to Design Software Systems
Meta PyTorch × OpenEnv Hackathon Grand Finale, April 2026
Space: https://huggingface.co/spaces/thepikachu/architecture-env
Model: https://huggingface.co/thepikachu/architecture-sft-model
Notebook: ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb
The problem nobody talks about
Every engineering team that has ever built a distributed system knows the drill. Someone opens a Confluence doc, someone else draws boxes in Miro, and then the back-and-forth begins. Do we use Kafka or RabbitMQ here? Should the worker connect directly to the database or go through a cache first? Does the WebSocket gateway fan out to the broker, or does the broker fan out to the gateway? These conversations take hours. Then requirements change, the design evolves, and they take hours again.
The obvious response was: just ask an LLM. And that works, up to a point. You describe your system, the model suggests components, you iterate. But there's a hidden cost: every response still needs a human to read it, validate it against the actual requirements, catch the missing connections, and push back when the model picks the wrong technology for the task. You're not removing the review loop; you're just moving it.
We wanted to close that loop entirely.
The idea: turn architecture design into an RL problem
The key insight is that software architecture design has a structure that most
open-ended LLM tasks don't: it can be verified. Given a task description,
there is a set of components that must be present, a set of connections that
must exist between them, and a measurable score for how close a design is to
correct. You don't need a human in the loop to know whether worker → database
is connected. The environment can check.
If you can verify it, you can train on it.
We built ArchitectureEnv, an OpenEnv environment where an agent designs
software systems one action at a time. The agent can add a component,
connect two components, or submit the finished design. After each action,
the environment returns a reward in [0, 1] based on how complete and
technically sound the architecture is.
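To make the action loop concrete, here is a minimal sketch of an environment with the same three verbs. It is not the ArchitectureEnv implementation: the class name `ToyArchitectureEnv`, its fields, and the reward (a plain fraction of required components and edges satisfied) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArchitectureEnv:
    """Hypothetical stand-in for the real environment; every name is illustrative."""
    required_components: frozenset
    required_edges: frozenset  # set of (source, target) pairs
    components: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    done: bool = False

    def _reward(self) -> float:
        # Assumed reward: fraction of required components and edges present so far.
        total = len(self.required_components) + len(self.required_edges)
        hit = len(self.components & self.required_components)
        hit += len(self.edges & self.required_edges)
        return hit / total if total else 0.0

    def step(self, action: str) -> float:
        """Apply one text action and return the current reward in [0, 1]."""
        parts = action.split()
        if not parts or self.done:
            return self._reward()
        verb, args = parts[0], parts[1:]
        if verb == "add" and len(args) == 1:
            self.components.add(args[0])
        elif verb == "connect" and len(args) == 2:
            # Only connect components that have already been added.
            if args[0] in self.components and args[1] in self.components:
                self.edges.add((args[0], args[1]))
        elif verb == "submit":
            self.done = True
        return self._reward()


env = ToyArchitectureEnv(
    required_components=frozenset({"worker", "database"}),
    required_edges=frozenset({("worker", "database")}),
)
for action in ["add worker", "add database", "connect worker database", "submit"]:
    print(action, "->", round(env.step(action), 3))
```

The reward rises as each required piece lands, which is the property that makes per-step feedback usable as a training signal.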
Six real-world tasks across three difficulty levels:
| Task | Difficulty | Key challenge |
|---|---|---|
| url_shortener | Easy | Load balancer, cache, DB, abuse prevention |
| chat_system | Medium | WebSocket fanout, presence, async broker |
| ecommerce_platform | Medium | Search, payments, async workers |
| youtube_platform | Hard | Transcoding pipeline, CDN, recommendations |
| ride_sharing | Hard | Geospatial indexing, real-time matching |
| ml_platform | Hard | Feature store, model registry, inference server |
The reward function is deterministic and composable, with no LLM judge and no ambiguity:
- Component coverage: which required families are present (e.g. cache, broker, database)
- Connection coverage: which required edges exist between components
- Technology-fit bonus: Kafka over RabbitMQ for streaming workloads, Redis over Memcached when presence tracking is needed
- Bonus items: optional cross-cutting concerns such as auth, observability, rate limiting, payment gateways
- Coherence penalty: deducted for irrelevant or conflicting components
This multi-component reward makes the environment hard to game. An agent can't just dump every component it knows: irrelevant components incur a penalty, and bonus items only score if the required core is already complete.
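One way to picture how these terms compose is the sketch below. The weights (0.5, 0.3, 0.1, 0.1) and the 0.05 per-item penalty are assumptions chosen so a perfect design sums to 1.0; they are not the values the environment uses. The function `score_design` and its argument names are likewise hypothetical, and the sketch assumes the required sets are non-empty.

```python
def score_design(design, edges, required, required_edges, preferred_tech, bonus):
    """Toy composite reward in [0, 1].

    design: dict mapping family name -> chosen technology
    edges: set of (source_family, target_family) pairs
    All weights below are illustrative assumptions.
    """
    families = set(design)

    # Component and connection coverage as simple fractions.
    comp_cov = len(families & required) / len(required)
    edge_cov = len(edges & required_edges) / len(required_edges)

    # Technology fit: share of families whose chosen technology matches the preferred one.
    if preferred_tech:
        matches = sum(design.get(fam) == tech for fam, tech in preferred_tech.items())
        fit = matches / len(preferred_tech)
    else:
        fit = 0.0

    # Bonus items only count once the required core is complete.
    core_complete = comp_cov == 1.0 and edge_cov == 1.0
    bonus_cov = len(families & bonus) / len(bonus) if (core_complete and bonus) else 0.0

    # Coherence penalty for families that are neither required nor bonus.
    irrelevant = families - required - bonus

    raw = 0.5 * comp_cov + 0.3 * edge_cov + 0.1 * fit + 0.1 * bonus_cov
    raw -= 0.05 * len(irrelevant)
    return max(0.0, min(1.0, raw))


required = {"cache", "broker", "database"}
required_edges = {("broker", "database"), ("cache", "database")}
complete = {"cache": "redis", "broker": "kafka", "database": "postgres", "auth": "oauth"}
print(score_design(complete, required_edges, required, required_edges,
                   {"broker": "kafka"}, {"auth"}))
```

Gating the bonus term on `core_complete` is what blocks the "dump everything" strategy: bonus components added early earn nothing, and unrelated ones cost points.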
The knowledge layer: the System Design Encyclopedia
Before training anything, we needed the model to know which components belong to which architectural families. A "cache" family could mean Redis, Memcached, or Varnish, but for a ride-sharing platform tracking driver locations in memory, Redis is the right answer. For a YouTube-scale streaming platform, Kafka is the right broker, not RabbitMQ.
We built encyclopedia_rules.py, a knowledge layer built on top of the
System Design Components Encyclopedia covering 40+ component families. It does four things:
1. Family → concrete component mappings. Every abstract family required by the environment
(e.g. broker, cache, search, storage) is mapped to the concrete
implementation that scores highest for a given task type. The agent's system
prompt is enriched with these mappings before every episode.
2. Task-specific technology overrides. Some mappings differ by task. The
broker family defaults to rabbitmq for simple async work but overrides to
kafka for streaming-heavy tasks (youtube_platform, ride_sharing). The
cache family defaults to redis but the ml_platform uses feast as the
feature store. These overrides are resolved at inference time, not hard-coded.
3. Bonus target lists. Each task has a set of optional bonus components
(auth, observability, rate-limiting, payment gateway, recommendation workers,
etc.). The encyclopedia maps these back to concrete components the agent can
add by name, closing the gap between abstract scoring criteria and executable
actions.
4. Enriched system prompt. The system prompt injected into every agent call includes a structured summary of the relevant component families, their concrete implementations, and the integration patterns between them, drawn directly from the encyclopedia's integration pattern catalog.
This is what separates the agentic system from raw LLM inference. The model doesn't need to hallucinate that Kafka is a good fit for streaming; the encyclopedia tells it, and the Planner enforces it.
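The default-plus-override lookup described above can be sketched in a few lines. This is not the contents of encyclopedia_rules.py: the dictionary names, the `resolve` and `prompt_hint` helpers, and the `database` → `postgres` default are assumptions. Only the broker and cache rules are taken from the text.

```python
# Default family -> concrete component mapping (the database entry is an assumed example).
DEFAULTS = {"broker": "rabbitmq", "cache": "redis", "database": "postgres"}

# Task-specific overrides, as described in the text above.
OVERRIDES = {
    "youtube_platform": {"broker": "kafka"},
    "ride_sharing": {"broker": "kafka"},
    "ml_platform": {"cache": "feast"},
}


def resolve(task: str, family: str) -> str:
    """Return the concrete component for a family, honoring task overrides."""
    return OVERRIDES.get(task, {}).get(family, DEFAULTS[family])


def prompt_hint(task: str, families) -> str:
    """Render the mappings as lines that could be appended to a system prompt."""
    return "\n".join(f"{fam}: use {resolve(task, fam)}" for fam in families)


print(prompt_hint("youtube_platform", ["broker", "cache"]))
print(prompt_hint("chat_system", ["broker", "cache"]))
```

Because the lookup runs when the prompt is built, adding a new task means adding one override entry rather than editing prompt text.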
Training: SFT first, then RL
A raw base model given an architecture task will output something that looks
like an architecture discussion, not a sequence of add and connect
commands. Before we could use reinforcement learning, we needed the model to
speak the language of the environment.
We fine-tuned Qwen2.5-3B-Instruct using Unsloth with 4-bit QLoRA on a
supervised dataset of 120 correct action sequences across all six task types.
This is the SFT stage: it doesn't teach strategy, it teaches format. After SFT,
the model reliably outputs add postgres, connect broker worker, submit,
the vocabulary the environment understands.
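A small parser makes the "format, not strategy" point concrete. The grammar here (one action per line, whitespace-separated, lowercase identifiers) is an assumption inferred from the three examples above; the `Action` type and both function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Action(NamedTuple):
    verb: str
    args: tuple


# One pattern per verb; anything else counts as off-format output.
_PATTERNS = {
    "add": re.compile(r"^add\s+(\w+)$"),
    "connect": re.compile(r"^connect\s+(\w+)\s+(\w+)$"),
    "submit": re.compile(r"^submit$"),
}


def parse_action(line: str) -> Optional[Action]:
    """Parse one line into an Action, or return None if it is not a valid command."""
    text = line.strip().lower()
    for verb, pattern in _PATTERNS.items():
        match = pattern.match(text)
        if match:
            return Action(verb, match.groups())
    return None


def parse_plan(completion: str) -> list:
    """Parse a multi-line completion, keeping only well-formed actions."""
    parsed = (parse_action(line) for line in completion.splitlines())
    return [action for action in parsed if action is not None]


print(parse_plan("add postgres\nconnect broker worker\nI think Kafka is nice\nsubmit"))
```

A base model's discursive answer would parse to an empty or near-empty plan under a grammar like this, which is the gap the SFT stage closes.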
SFT training ran for 150 steps (10 epochs, batch size 8) on a T4 GPU in 13 minutes. Cross-entropy loss dropped from 3.57 at step 10 to a plateau of ~0.006 by step 70, with a final average training loss of 0.208 (skewed by the steep early descent). Evaluated directly after training (single-shot greedy decode, no agentic loop), the SFT checkpoint achieved 0.997 average reward across all six tasks: four tasks at 1.0, and ecommerce_platform and ml_platform at 0.99.
SFT had saturated the task format. We then ran GRPO on top of the SFT checkpoint to explore whether RL could improve priority ordering β required items before bonus items, required connections before submitting.
Using HuggingFace TRL's GRPOTrainer, we ran the environment as the reward function:
the model generates an action plan, the environment executes it, and the final score
drives the weight update. We trained for 50 GRPO steps from the SFT checkpoint
(batch size 4, ~10 minutes on T4). As the reward curve shows, GRPO improved the
raw model's training-time scores, but when both checkpoints were evaluated through
the full agentic inference stack, the SFT-only checkpoint outperformed SFT+GRPO
(0.978 vs 0.928); the GRPO run degraded ml_platform significantly. This is
discussed in detail in the section below.
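The "environment as reward function" wiring can be sketched as follows. It relies on TRL's custom reward function convention as we understand it: a callable that receives `completions` (plus `prompts` and any extra dataset columns as keyword arguments) and returns one float per completion. The `task` column, the toy scorer, and the assumption that completions arrive as plain strings are all hypothetical; this is not the notebook's code.

```python
# Hypothetical per-task required actions, standing in for a real environment rollout.
REQUIRED_ACTIONS = {
    "toy_task": ["add worker", "add database", "connect worker database"],
}


def toy_score(task: str, completion: str) -> float:
    """Fraction of required actions that appear as lines of the completion."""
    lines = {line.strip() for line in completion.splitlines()}
    needed = REQUIRED_ACTIONS[task]
    return sum(action in lines for action in needed) / len(needed)


def make_env_reward(score_fn):
    """Wrap a (task, completion) -> score function in a TRL-style reward callable."""
    def env_reward(completions, task=None, **kwargs):
        # `task` is assumed to be a dataset column forwarded as a list, one per completion.
        tasks = task if task is not None else ["toy_task"] * len(completions)
        return [float(score_fn(t, c)) for t, c in zip(tasks, completions)]
    return env_reward


env_reward = make_env_reward(toy_score)

# The callable would then be passed to the trainer, roughly:
#   GRPOTrainer(model=..., reward_funcs=env_reward, args=..., train_dataset=...)
print(env_reward(
    prompts=["design toy_task", "design toy_task"],
    completions=["add worker\nadd database\nconnect worker database\nsubmit", "add worker"],
    task=["toy_task", "toy_task"],
))
```

Keeping the scorer a pure function of the completion means the same code can serve as both the training reward and the offline evaluation metric.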
SFT cross-entropy loss over 150 steps. Loss axis: cross-entropy. Step axis: training step.
Average environment reward across all six tasks, comparing SFT-only and SFT+GRPO checkpoints over training. Reward axis: environment score [0, 1]. Step axis: training step.
The agentic layer: Planner, Critic, Negotiator
Here's where things got interesting. We found that the SFT model, when wrapped in a structured multi-agent inference loop informed by the encyclopedia, matched or exceeded GRPO on most tasks, with full interpretability of every decision.
The loop at every step:
PlannerAgent: deterministic, zero-hallucination. It reads the live environment state (present components, missing required items, missing connections, available bonus targets) and computes the ground-truth next action using the encyclopedia's priority ordering: required items → required connections → bonus items → submit. The planner's hint is injected into the LLM's prompt.
LLM (the fine-tuned SFT model): generates its own action proposal informed by the planner hint. This is where domain knowledge from the encyclopedia pays off: the model has been trained on action sequences that reflect the same technology choices the encyclopedia prescribes.
CriticAgent: validates the LLM's proposal against the live environment state before execution. It checks: does the component exist before connecting? Is required work complete before bonus items? Is the missing-items list empty before submit? If the proposal fails any check, it's rejected without hitting the environment.
NegotiatorAgent: repairs rejected proposals using the planner as ground
truth. Every rejection is logged with its reason (critic_rejected: 'websocket_gateway' not present; add it before connecting → repair: add websocket_gateway), creating a fully auditable decision trail.
The Critic was the most important piece. Without it, the LLM would occasionally submit with missing connections still in the queue, or attempt to connect components that hadn't been added yet. The Critic catches both, every step, before the environment ever sees the action.
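The three roles can be sketched as plain functions over a shared state object. This is a simplified illustration, not the project's PlannerAgent, CriticAgent, or NegotiatorAgent: the `State` fields, function names, and rejection messages are assumptions, and the log line only approximates the format quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Assumed view of the live environment state."""
    present: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    required: tuple = ()
    required_edges: tuple = ()
    bonus: tuple = ()

    def missing_items(self):
        return [c for c in self.required if c not in self.present]

    def missing_edges(self):
        return [e for e in self.required_edges if e not in self.edges]


def plan(state: State) -> str:
    """Planner: required items, then required connections, then bonus, then submit."""
    if state.missing_items():
        return f"add {state.missing_items()[0]}"
    if state.missing_edges():
        src, dst = state.missing_edges()[0]
        return f"connect {src} {dst}"
    for item in state.bonus:
        if item not in state.present:
            return f"add {item}"
    return "submit"


def critique(state: State, proposal: str):
    """Critic: return a rejection reason, or None if the proposal may be executed."""
    parts = proposal.split()
    if not parts:
        return "empty proposal"
    verb, args = parts[0], parts[1:]
    core_done = not state.missing_items() and not state.missing_edges()
    if verb == "connect":
        for name in args:
            if name not in state.present:
                return f"'{name}' not present; add it before connecting"
    if verb == "add" and args and args[0] in state.bonus and not core_done:
        return "required work incomplete; bonus items come later"
    if verb == "submit" and not core_done:
        return "required items or connections still missing"
    return None


def negotiate(state: State, proposal: str, log: list) -> str:
    """Negotiator: pass valid proposals through, replace rejected ones with the plan."""
    reason = critique(state, proposal)
    if reason is None:
        return proposal
    repaired = plan(state)
    log.append(f"critic_rejected: {reason} -> repair: {repaired}")
    return repaired


log = []
state = State(
    present={"message_broker"},
    required=("websocket_gateway", "message_broker"),
    required_edges=(("websocket_gateway", "message_broker"),),
    bonus=("auth",),
)
print(negotiate(state, "connect websocket_gateway message_broker", log))
print(log[0])
```

Since all three functions are deterministic and read only the state, every repair can be replayed from the log without calling the model again.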
Results
Agentic inference demo (real LLM calls, no caching, run on 2026-04-25):
| Task | SFT, no agentic loop | SFT + agentic loop | Δ |
|---|---|---|---|
| chat_system | 0.930 | 1.000 | +0.070 |
| ecommerce_platform | 0.900 | 0.990 | +0.090 |
| youtube_platform | 0.930 | 1.000 | +0.070 |
| ride_sharing | 0.930 | 0.890 | −0.040 |
| ml_platform | 0.930 | 0.990 | +0.060 |
| url_shortener | 0.930 | 1.000 | +0.070 |
| Average | 0.925 | 0.978 | +0.053 |
Five of six tasks hit 0.99 or above. The ride_sharing episode scored 0.890: the
agent completed the required architecture correctly, but the Negotiator's deduplication
guard did not prevent an extra bonus-item attempt, which incurred a step penalty before
submit. This is a known edge case in the Negotiator, not a model failure: the
required architecture was fully complete before the penalty was incurred.
Bonus items collected in the improved run are consistent with the agent following the encyclopedia's scoring model:
- chat_system: auth, observability, presence_service, notification_service
- youtube_platform: auth, observability, recommendation_worker
- url_shortener: auth, observability, rate_limiting
- ml_platform: auth, observability
Why SFT + agentic loop, not SFT + GRPO + agentic loop?
We ran both pipelines end-to-end through the same Planner-Critic-Negotiator inference stack and compared the results directly.
SFT + agentic loop averaged 0.978 across all six tasks, with five at 0.99 or above. SFT + GRPO + agentic loop averaged 0.928: on average, GRPO made things worse, not better.
| Task | SFT + agentic | SFT + GRPO + agentic | Δ |
|---|---|---|---|
| url_shortener | 1.000 | 1.000 | 0.000 |
| chat_system | 1.000 | 1.000 | 0.000 |
| ecommerce_platform | 0.990 | 0.990 | 0.000 |
| youtube_platform | 1.000 | 1.000 | 0.000 |
| ride_sharing | 0.890 | 1.000 | +0.110 |
| ml_platform | 0.990 | 0.580 | −0.410 |
| Average | 0.978 | 0.928 | −0.050 |
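The averages and deltas in this table can be re-derived from the per-task scores. The snippet below is only that arithmetic, with the scores copied from the table; the variable and function names are ours.

```python
# Per-task scores copied from the comparison table.
sft_agentic = {
    "url_shortener": 1.000, "chat_system": 1.000, "ecommerce_platform": 0.990,
    "youtube_platform": 1.000, "ride_sharing": 0.890, "ml_platform": 0.990,
}
grpo_agentic = {
    "url_shortener": 1.000, "chat_system": 1.000, "ecommerce_platform": 0.990,
    "youtube_platform": 1.000, "ride_sharing": 1.000, "ml_platform": 0.580,
}


def mean(scores: dict) -> float:
    """Unweighted mean over the six tasks."""
    return sum(scores.values()) / len(scores)


# Per-task delta: GRPO pipeline minus SFT-only pipeline.
deltas = {task: round(grpo_agentic[task] - sft_agentic[task], 3) for task in sft_agentic}

print("SFT + agentic average:       ", round(mean(sft_agentic), 3))
print("SFT + GRPO + agentic average:", round(mean(grpo_agentic), 3))
print("Average delta:               ", round(mean(grpo_agentic) - mean(sft_agentic), 3))
print("Non-zero deltas:             ", {t: d for t, d in deltas.items() if d != 0})
```

Only two tasks move, and the single large regression on ml_platform accounts for the whole drop in the average.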
The regression on ml_platform is the clearest signal. The SFT+GRPO checkpoint
submitted with storage still missing, zero connections made, and three required
edges completely absent, for a score of 0.58. The SFT checkpoint through the same
agentic loop scored 0.99 on the same task. GRPO disrupted the LoRA weights enough
to break the model's ability to complete this hard task, while leaving four of the
other five tasks unchanged; the only gain was +0.110 on ride_sharing.
Why does this happen? Our working explanation is that GRPO learns from the spread of
rewards within each group of sampled completions. When the SFT model already knows the
correct action format and ordering, there is little reward variance to learn from, and
the update risks overfitting the adapter weights to minor fluctuations, which is
consistent with what we saw on ml_platform.
The right production system uses both: GRPO for the weights when starting from a weaker base, Planner-Critic-Negotiator for the inference guard regardless. That's the roadmap.
What this enables
Architecture design that currently takes engineering teams hours of back-and-forth can be reduced to a structured, verifiable, automatable loop. The environment exposes a clean interface: describe the system, get a design, get a score. The agentic layer ensures the design is complete before submission. The trained model brings domain knowledge. The encyclopedia brings precision about which technologies fit which workloads.
The deeper point is about the pattern, not the specific task. Any domain where design decisions can be verified (infrastructure templates, data pipeline architecture, API schema design, security policy configuration) is a candidate for this approach. Build an environment with a deterministic composable reward, compile your domain knowledge into a structured encyclopedia, fine-tune a model on the action vocabulary, wrap it in a structured inference loop. The review loop doesn't disappear; it moves into the Critic, where it runs in milliseconds instead of minutes.
Compliance checklist
| Requirement | Status |
|---|---|
| Built on OpenEnv (latest release) | ✅ openenv-core, FastAPI, openenv.yaml |
| Working training script (Unsloth + TRL) as Colab notebook | ✅ ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb |
| Evidence of real training (loss + reward plots) | ✅ plots/loss_curve.png, plots/reward_curve.png, committed to repo |
| Mini-blog writeup | ✅ This post (also at Blog.md in Space repo) |
| Environment pushed to HuggingFace Space | ✅ https://huggingface.co/spaces/thepikachu/architecture-env |
| README with problem motivation, env explanation, results | ✅ Linked from Space |
| All materials linked from README | ✅ Space, model, notebook, blog |
Stack
- Environment: OpenEnv + FastAPI + Pydantic
- Base model: Qwen2.5-3B-Instruct (4-bit QLoRA via Unsloth)
- Training: Unsloth + HuggingFace TRL (SFTTrainer + GRPOTrainer)
- Knowledge layer: encyclopedia_rules.py, with 40+ component families, family → concrete mappings, task broker overrides, bonus targets, enriched system prompts
- Agents: PlannerAgent, CriticAgent, NegotiatorAgent (pure Python, deterministic, no extra model calls)
- Deployment: HuggingFace Spaces (Docker)
Code, training notebook, and logs: https://huggingface.co/spaces/thepikachu/architecture-env