ArchitectureEnv — Teaching an LLM to Design Software Systems

Meta PyTorch × OpenEnv Hackathon Grand Finale, April 2026

🤗 Space: https://huggingface.co/spaces/thepikachu/architecture-env
🧠 Model: https://huggingface.co/thepikachu/architecture-sft-model
📓 Notebook: ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb

The problem nobody talks about

Every engineering team that has ever built a distributed system knows the drill. Someone opens a Confluence doc, someone else draws boxes in Miro, and then the back-and-forth begins. Do we use Kafka or RabbitMQ here? Should the worker connect directly to the database or go through a cache first? Does the WebSocket gateway fan out to the broker, or does the broker fan out to the gateway? These conversations take hours. Then requirements change, the design evolves, and they take hours again.

The obvious response was: just ask an LLM. And that works, up to a point. You describe your system, the model suggests components, you iterate. But there's a hidden cost — every response still needs a human to read it, validate it against the actual requirements, catch the missing connections, and push back when the model picks the wrong technology for the task. You're not removing the review loop; you're just moving it.

We wanted to close that loop entirely.

The idea: turn architecture design into an RL problem

The key insight is that software architecture design has a structure that most open-ended LLM tasks don't: it can be verified. Given a task description, there is a set of components that must be present, a set of connections that must exist between them, and a measurable score for how close a design is to correct. You don't need a human in the loop to know whether worker → database is connected. The environment can check.

If you can verify it, you can train on it.

We built ArchitectureEnv — an OpenEnv environment where an agent designs software systems one action at a time. The agent can add a component, connect two components, or submit the finished design. After each action, the environment returns a reward in [0, 1] based on how complete and technically sound the architecture is.
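To make the action interface concrete, here is a minimal sketch of how the three actions could update a design. The `DesignState` class and `apply_action` function are hypothetical names for illustration, not the environment's actual code; only the `add`, `connect`, and `submit` verbs come from the environment described above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DesignState:
    """Hypothetical in-memory design: components plus directed edges."""
    components: set[str] = field(default_factory=set)
    connections: set[tuple[str, str]] = field(default_factory=set)
    submitted: bool = False


def apply_action(state: DesignState, action: str) -> bool:
    """Apply one text action; return True if it was valid and applied."""
    parts = action.strip().split()
    if not parts or state.submitted:
        return False
    verb, args = parts[0], parts[1:]
    if verb == "add" and len(args) == 1:
        state.components.add(args[0])
        return True
    if verb == "connect" and len(args) == 2:
        src, dst = args
        # Both endpoints must already exist in the design.
        if src in state.components and dst in state.components:
            state.connections.add((src, dst))
            return True
        return False
    if verb == "submit" and not args:
        state.submitted = True
        return True
    return False


state = DesignState()
for step in ["add worker", "add postgres", "connect worker postgres", "submit"]:
    apply_action(state, step)
```

Because the state is just a set of names and a set of edges, checking whether worker → database is connected is a set lookup, which is what makes the design verifiable without a human reviewer.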

Six real-world tasks across three difficulty levels:

Task	Difficulty	Key challenge
`url_shortener`	Easy	Load balancer, cache, DB, abuse prevention
`chat_system`	Medium	WebSocket fanout, presence, async broker
`ecommerce_platform`	Medium	Search, payments, async workers
`youtube_platform`	Hard	Transcoding pipeline, CDN, recommendations
`ride_sharing`	Hard	Geospatial indexing, real-time matching
`ml_platform`	Hard	Feature store, model registry, inference server

The reward function is deterministic and composable — no LLM judge, no ambiguity:

Component coverage: which required families are present (e.g. cache, broker, database)
Connection coverage: which required edges exist between components
Technology-fit bonus: Kafka over RabbitMQ for streaming workloads, Redis over Memcached when presence tracking is needed
Bonus items: optional cross-cutting concerns — auth, observability, rate limiting, payment gateways
Coherence penalty: deducted for irrelevant or conflicting components

This multi-component reward makes the environment hard to game. An agent can't just dump every component it knows — irrelevant components incur a penalty, and bonus items only score if the required core is already complete.
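The five reward terms can be sketched as one function. This is an illustration under stated assumptions, not the environment's scoring code: the weights (0.45, 0.35, 0.10, 0.10, 0.05) and the name `score_design` are hypothetical, and components are matched by concrete name rather than by family to keep the example short.

```python
from __future__ import annotations


def _fraction(hits: int, total: int) -> float:
    """Share of targets hit; 0.0 when there are no targets."""
    return hits / total if total else 0.0


def score_design(
    components: set[str],
    connections: set[tuple[str, str]],
    required_components: set[str],
    required_connections: set[tuple[str, str]],
    preferred_tech: set[str],
    bonus_components: set[str],
) -> float:
    """Hypothetical composite reward in [0, 1]; all weights are assumptions."""
    comp_cov = _fraction(len(components & required_components), len(required_components))
    conn_cov = _fraction(len(connections & required_connections), len(required_connections))
    core_complete = comp_cov == 1.0 and conn_cov == 1.0

    tech_fit = _fraction(len(components & preferred_tech), len(preferred_tech))
    # Bonus items only count once the required core is complete.
    bonus = _fraction(len(components & bonus_components), len(bonus_components)) if core_complete else 0.0

    # Coherence penalty: anything outside the known target sets is irrelevant.
    known = required_components | preferred_tech | bonus_components
    irrelevant = len(components - known)

    raw = 0.45 * comp_cov + 0.35 * conn_cov + 0.10 * tech_fit + 0.10 * bonus - 0.05 * irrelevant
    return max(0.0, min(1.0, raw))
```

With this shape, adding a bonus component before the core is complete changes nothing, and every irrelevant component lowers the score, which mirrors the two anti-gaming properties described above.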

The knowledge layer: the System Design Encyclopedia

Before training anything, we needed the model to know which components belong to which architectural families. A "cache" family could mean Redis, Memcached, or Varnish — but for a ride-sharing platform tracking driver locations in memory, Redis is the right answer. For a YouTube-scale streaming platform, Kafka is the right broker, not RabbitMQ.

We wrote encyclopedia_rules.py, a knowledge layer on top of the System Design Components Encyclopedia that covers 40+ component families. It does four things:

1. Family → concrete component mappings. Every abstract family required by the environment (e.g. broker, cache, search, storage) is mapped to the concrete implementation that scores highest for a given task type. The agent's system prompt is enriched with these mappings before every episode.

2. Task-specific technology overrides. Some mappings differ by task. The broker family defaults to rabbitmq for simple async work but is overridden to kafka for streaming-heavy tasks (youtube_platform, ride_sharing). The cache family defaults to redis, but ml_platform uses feast as the feature store. These overrides are resolved at inference time, not hard-coded.

3. Bonus target lists. Each task has a set of optional bonus components (auth, observability, rate-limiting, payment gateway, recommendation workers, etc.). The encyclopedia maps these back to concrete components the agent can add by name, closing the gap between abstract scoring criteria and executable actions.

4. Enriched system prompt. The system prompt injected into every agent call includes a structured summary of the relevant component families, their concrete implementations, and the integration patterns between them — drawn directly from the encyclopedia's integration pattern catalog.
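The default-plus-override lookup and the prompt enrichment can be sketched as follows. The dictionary layout and the names `DEFAULTS`, `TASK_OVERRIDES`, `resolve`, and `prompt_section` are assumptions for illustration; only the rabbitmq, kafka, and redis choices come from the description above, and the real encyclopedia_rules.py may be organised differently.

```python
from __future__ import annotations

# Default concrete component per family (values taken from the text above).
DEFAULTS = {"broker": "rabbitmq", "cache": "redis"}

# Per-task overrides; streaming-heavy tasks switch the broker to kafka.
TASK_OVERRIDES = {
    "youtube_platform": {"broker": "kafka"},
    "ride_sharing": {"broker": "kafka"},
}


def resolve(family: str, task: str) -> str:
    """Return the concrete component for a family, honouring task overrides."""
    return TASK_OVERRIDES.get(task, {}).get(family, DEFAULTS[family])


def prompt_section(task: str, families: list[str]) -> str:
    """Build a hypothetical system-prompt fragment listing the resolved choices."""
    lines = [f"Component choices for {task}:"]
    lines += [f"- {family}: {resolve(family, task)}" for family in families]
    return "\n".join(lines)


print(prompt_section("ride_sharing", ["broker", "cache"]))
```

Resolving at call time rather than baking one table per task keeps the defaults in a single place, so a new task only needs to list the families where it differs.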

This is what separates the agentic system from raw LLM inference. The model doesn't need to hallucinate that Kafka is a good fit for streaming — the encyclopedia tells it, and the Planner enforces it.

Training: SFT first, then RL

A raw base model given an architecture task will output something that looks like an architecture discussion — not a sequence of add and connect commands. Before we could use reinforcement learning, we needed the model to speak the language of the environment.

We fine-tuned Qwen2.5-3B-Instruct using Unsloth with 4-bit QLoRA on a supervised dataset of 120 correct action sequences across all six task types. This is the SFT stage: it doesn't teach strategy, it teaches format. After SFT, the model reliably outputs `add postgres`, `connect broker worker`, and `submit`, the vocabulary the environment understands.

SFT training ran for 150 steps (10 epochs, batch size 8) on a T4 GPU in 13 minutes. Cross-entropy loss dropped from 3.57 at step 10 to a plateau of ~0.006 by step 70, with a final average training loss of 0.208 (skewed by the steep early descent). Evaluated directly after training (single-shot greedy decode, no agentic loop), the SFT checkpoint achieved 0.997 average reward across all six tasks: four tasks at 1.0, and ecommerce_platform and ml_platform at 0.99.
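The step count and the average reward follow from the reported figures. The short check below assumes one optimizer step per batch (no gradient accumulation), which is an assumption on our part rather than something the training configuration above states.

```python
import math

examples, batch_size, epochs = 120, 8, 10

# Assumes one optimizer step per batch of 8 (no gradient accumulation).
steps_per_epoch = math.ceil(examples / batch_size)
total_steps = steps_per_epoch * epochs

# Four tasks at 1.0, two tasks at 0.99, as reported above.
rewards = [1.0] * 4 + [0.99] * 2
average_reward = sum(rewards) / len(rewards)

print(total_steps, round(average_reward, 3))
```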

SFT had saturated the task format. We then ran GRPO on top of the SFT checkpoint to explore whether RL could improve priority ordering — required items before bonus items, required connections before submitting.

Using HuggingFace TRL's GRPOTrainer, we ran the environment as the reward function: the model generates an action plan, the environment executes it, and the final score drives the weight update. We trained for 50 GRPO steps from the SFT checkpoint (batch size 4, ~10 minutes on a T4). As the reward curve shows, GRPO improved the raw (non-agentic) model's training-time scores. But when both checkpoints were evaluated through the full agentic inference stack, the SFT-only checkpoint outperformed SFT+GRPO (0.978 vs 0.928 average), because the GRPO run degraded ml_platform significantly (0.990 to 0.580). The section "Why SFT + agentic loop, not SFT + GRPO + agentic loop?" below discusses this in detail.
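A sketch of what "the environment as the reward function" can look like. TRL's GRPOTrainer accepts custom reward callables that receive the sampled completions and return one float per completion; everything else here is a stand-in. `run_episode`, `REQUIRED`, and `REQUIRED_EDGES` are hypothetical and replace the real ArchitectureEnv with a toy scorer so the example runs on its own.

```python
from __future__ import annotations

# Toy stand-in for one task's requirements (hypothetical values).
REQUIRED = {"worker", "postgres"}
REQUIRED_EDGES = {("worker", "postgres")}


def run_episode(plan: str) -> float:
    """Execute a newline-separated action plan and return a score in [0, 1]."""
    components: set[str] = set()
    edges: set[tuple[str, str]] = set()
    for line in plan.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "add":
            components.add(parts[1])
        elif len(parts) == 3 and parts[0] == "connect":
            if parts[1] in components and parts[2] in components:
                edges.add((parts[1], parts[2]))
        elif parts == ["submit"]:
            break  # Nothing after submit is executed.
    comp = len(components & REQUIRED) / len(REQUIRED)
    conn = len(edges & REQUIRED_EDGES) / len(REQUIRED_EDGES)
    return 0.5 * comp + 0.5 * conn


def environment_reward(completions: list[str], **kwargs) -> list[float]:
    """Reward callable: one environment score per sampled completion."""
    return [run_episode(text) for text in completions]


# Wiring sketch (not executed here; assumes plain-text completions):
#   trainer = GRPOTrainer(model=..., reward_funcs=[environment_reward],
#                         args=GRPOConfig(...), train_dataset=...)
```

Free-form prose scores zero in this sketch because no line parses as an action, which is why the SFT stage has to teach the format before RL can get any signal.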

SFT cross-entropy loss over 150 steps. Loss axis: cross-entropy. Step axis: training step.

Average environment reward across all six tasks, comparing SFT-only and SFT+GRPO checkpoints over training. Reward axis: environment score [0, 1]. Step axis: training step.

The agentic layer: Planner, Critic, Negotiator

Here's where things got interesting. We found that the SFT model, when wrapped in a structured multi-agent inference loop informed by the encyclopedia, matched or exceeded the SFT+GRPO checkpoint on five of six tasks, with full interpretability of every decision.

The loop at every step:

PlannerAgent — deterministic, zero-hallucination. It reads the live environment state (present components, missing required items, missing connections, available bonus targets) and computes the ground-truth next action using the encyclopedia's priority ordering: required items → required connections → bonus items → submit. The planner's hint is injected into the LLM's prompt.

LLM (the fine-tuned SFT model) — generates its own action proposal informed by the planner hint. This is where domain knowledge from the encyclopedia pays off: the model has been trained on action sequences that reflect the same technology choices the encyclopedia prescribes.

CriticAgent — validates the LLM's proposal against the live environment state before execution. It checks: does the component exist before connecting? Is required work complete before bonus items? Is the missing-items list empty before submit? If the proposal fails any check, it's rejected without hitting the environment.

NegotiatorAgent — repairs rejected proposals using the planner as ground truth. Every rejection is logged with its reason (critic_rejected: 'websocket_gateway' not present; add it before connecting → repair: add websocket_gateway), creating a fully auditable decision trail.

The Critic was the most important piece. Without it, the LLM would occasionally submit with missing connections still in the queue, or attempt to connect components that hadn't been added yet. The Critic catches both, every step, before the environment ever sees the action.
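The Planner, Critic, and Negotiator steps can be sketched as three small pure functions. This is a simplified illustration, not the project's agent code: `EnvState`, `plan`, `critique`, and `negotiate` are hypothetical names, and the state fields are assumed to match the items the Planner is described as reading.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class EnvState:
    """Hypothetical view of the live environment state."""
    present: set[str]
    missing_components: list[str]
    missing_connections: list[tuple[str, str]]
    bonus_targets: list[str]


def plan(state: EnvState) -> str:
    """Planner: required items, then required connections, then bonus, then submit."""
    if state.missing_components:
        return f"add {state.missing_components[0]}"
    if state.missing_connections:
        src, dst = state.missing_connections[0]
        return f"connect {src} {dst}"
    if state.bonus_targets:
        return f"add {state.bonus_targets[0]}"
    return "submit"


def critique(state: EnvState, proposal: str) -> Optional[str]:
    """Critic: return a rejection reason, or None if the proposal is acceptable."""
    parts = proposal.split()
    core_done = not state.missing_components and not state.missing_connections
    if len(parts) == 3 and parts[0] == "connect":
        for name in parts[1:]:
            if name not in state.present:
                return f"'{name}' not present; add it before connecting"
        return None
    if len(parts) == 2 and parts[0] == "add":
        if parts[1] in state.bonus_targets and not core_done:
            return "required work incomplete; bonus items come later"
        return None
    if parts == ["submit"]:
        return None if core_done else "required items still missing"
    return "unparseable action"


def negotiate(state: EnvState, proposal: str) -> tuple[str, Optional[str]]:
    """Negotiator: keep valid proposals, otherwise fall back to the planner."""
    reason = critique(state, proposal)
    if reason is None:
        return proposal, None
    return plan(state), reason
```

Returning the rejection reason alongside the repaired action is what makes the decision trail auditable: each step can log both what the LLM proposed and why it was replaced.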

Results

Agentic inference demo — real LLM calls, no caching, run on 2026-04-25:

Task	SFT, no agentic loop	SFT + agentic loop	Δ
chat_system	0.930	1.000	+0.070
ecommerce_platform	0.900	0.990	+0.090
youtube_platform	0.930	1.000	+0.070
ride_sharing	0.930	0.890	−0.040
ml_platform	0.930	0.990	+0.060
url_shortener	0.930	1.000	+0.070
Average	0.925	0.978	+0.053

Five of six tasks hit 0.99 or above. The ride_sharing episode scored 0.890: the agent completed the required architecture correctly, but the Negotiator's deduplication guard failed to block an extra bonus-item attempt, which incurred a step penalty before submit. This is a known edge case in the Negotiator, not a model failure.

Bonus items collected in the agentic run show the agent following the encyclopedia's bonus targets:

chat_system → auth, observability, presence_service, notification_service
youtube_platform → auth, observability, recommendation_worker
url_shortener → auth, observability, rate_limiting
ml_platform → auth, observability

Why SFT + agentic loop, not SFT + GRPO + agentic loop?

We ran both pipelines end-to-end through the same Planner-Critic-Negotiator inference stack and compared the results directly.

SFT + agentic loop averaged 0.978 across all six tasks — five at 0.99 or above. SFT + GRPO + agentic loop averaged 0.928 — GRPO made things worse, not better.

Task	SFT + agentic	SFT + GRPO + agentic	Δ
url_shortener	1.000	1.000	0.000
chat_system	1.000	1.000	0.000
ecommerce_platform	0.990	0.990	0.000
youtube_platform	1.000	1.000	0.000
ride_sharing	0.890	1.000	+0.110
ml_platform	0.990	0.580	−0.410
Average	0.978	0.928	−0.050
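The averages and deltas in the table can be reproduced from the per-task scores. The snippet below only restates the table's numbers; the variable names are ours.

```python
# (SFT + agentic, SFT + GRPO + agentic) per task, copied from the table above.
scores = {
    "url_shortener": (1.000, 1.000),
    "chat_system": (1.000, 1.000),
    "ecommerce_platform": (0.990, 0.990),
    "youtube_platform": (1.000, 1.000),
    "ride_sharing": (0.890, 1.000),
    "ml_platform": (0.990, 0.580),
}

sft_avg = sum(sft for sft, _ in scores.values()) / len(scores)
grpo_avg = sum(grpo for _, grpo in scores.values()) / len(scores)
deltas = {task: round(grpo - sft, 3) for task, (sft, grpo) in scores.items()}

print(round(sft_avg, 3), round(grpo_avg, 3), round(grpo_avg - sft_avg, 3))
```

The whole 0.050 drop in the average comes from two tasks: the loss on ml_platform outweighs the gain on ride_sharing, and the other four are unchanged.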

The regression on ml_platform is the clearest signal. The SFT+GRPO checkpoint submitted with storage still missing, zero connections made, and three required edges absent, for a score of 0.58. The SFT checkpoint scored 0.99 on the same task through the same agentic loop. GRPO disrupted the LoRA weights enough to break the model's ability to complete this hard task, while leaving four of the other tasks unchanged. Its one gain, +0.110 on ride_sharing, came on the task where the SFT run's lower score was a Negotiator edge case rather than a model failure.

Why does this happen? GRPO learns from differences in reward among the completions it samples for the same prompt. When the SFT model already knows the correct action format and ordering, nearly every sample earns the same score, so GRPO has little reward variance to learn from and risks overfitting the adapter weights to minor fluctuations. That is consistent with what we saw on ml_platform.
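The low-variance problem is easy to see in GRPO's group-relative advantage, which standardises each sampled completion's reward against the other samples for the same prompt. The sketch below is a simplified stand-alone version of that normalisation; the function name, the `eps` value, and the use of the population standard deviation are assumptions, and the reward groups are made-up illustrations rather than logged training data.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Standardise rewards within one group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every sample already scores 1.0: no advantage, so no learning signal.
saturated = group_advantages([1.0, 1.0, 1.0, 1.0])

# One sample is 0.01 lower: the tiny gap is amplified by the normalisation.
noisy = group_advantages([1.0, 1.0, 0.99, 1.0])

print(saturated)
print([round(a, 2) for a in noisy])
```

In the second group a 0.01 difference in reward becomes an advantage of magnitude greater than 1, so near-saturated policies can receive full-strength updates driven by very small score differences.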

The right production system uses both: GRPO for the weights when starting from a weaker base, Planner-Critic-Negotiator for the inference guard regardless. That's the roadmap.

What this enables

Architecture design that currently takes engineering teams hours of back-and-forth can be reduced to a structured, verifiable, automatable loop. The environment exposes a clean interface: describe the system, get a design, get a score. The agentic layer ensures the design is complete before submission. The trained model brings domain knowledge. The encyclopedia brings precision about which technologies fit which workloads.

The deeper point is about the pattern, not the specific task. Any domain where design decisions can be verified — infrastructure templates, data pipeline architecture, API schema design, security policy configuration — is a candidate for this approach. Build an environment with a deterministic composable reward, compile your domain knowledge into a structured encyclopedia, fine-tune a model on the action vocabulary, wrap it in a structured inference loop. The review loop doesn't disappear; it moves into the Critic, where it runs in milliseconds instead of minutes.

Compliance checklist

Requirement	Status
Built on OpenEnv (latest release)	✅ `openenv-core`, FastAPI, `openenv.yaml`
Working training script (Unsloth + TRL) as Colab notebook	✅ `ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb`
Evidence of real training (loss + reward plots)	✅ `plots/loss_curve.png`, `plots/reward_curve.png` — committed to repo
Mini-blog writeup	✅ This post (also at `Blog.md` in Space repo)
Environment pushed to HuggingFace Space	✅ https://huggingface.co/spaces/thepikachu/architecture-env
README with problem motivation, env explanation, results	✅ Linked from Space
All materials linked from README	✅ Space, model, notebook, blog

Stack

Environment: OpenEnv + FastAPI + Pydantic
Base model: Qwen2.5-3B-Instruct (4-bit QLoRA via Unsloth)
Training: Unsloth + HuggingFace TRL (SFTTrainer + GRPOTrainer)
Knowledge layer: encyclopedia_rules.py — 40+ component families, family→concrete mappings, task broker overrides, bonus targets, enriched system prompts
Agents: PlannerAgent, CriticAgent, NegotiatorAgent (pure Python, deterministic, no extra model calls)
Deployment: HuggingFace Spaces (Docker)

Code, training notebook, and logs: https://huggingface.co/spaces/thepikachu/architecture-env