	# ArchitectureEnv — Teaching an LLM to Design Software Systems

	Meta PyTorch × OpenEnv Hackathon Grand Finale, April 2026

	🤗 Space: https://huggingface.co/spaces/thepikachu/architecture-env
	🧠 Model: https://huggingface.co/thepikachu/architecture-sft-model
	📓 Notebook: [ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb](https://huggingface.co/spaces/thepikachu/architecture-env/blob/main/notebooks/ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb)

	---

	## The problem nobody talks about

	Every engineering team that has ever built a distributed system knows the drill.
	Someone opens a Confluence doc, someone else draws boxes in Miro, and then the
	back-and-forth begins. Do we use Kafka or RabbitMQ here? Should the worker
	connect directly to the database or go through a cache first? Does the
	WebSocket gateway fan out to the broker, or does the broker fan out to the
	gateway? These conversations take hours. Then requirements change, the design
	evolves, and they take hours again.

	The obvious response was: just ask an LLM. And that works, up to a point. You
	describe your system, the model suggests components, you iterate. But there's a
	hidden cost — every response still needs a human to read it, validate it against
	the actual requirements, catch the missing connections, and push back when the
	model picks the wrong technology for the task. You're not removing the review
	loop; you're just moving it.

	We wanted to close that loop entirely.

	---

	## The idea: turn architecture design into an RL problem

	The key insight is that software architecture design has a structure that most
	open-ended LLM tasks don't: it can be verified. Given a task description,
	there is a set of components that must be present, a set of connections that
	must exist between them, and a measurable score for how close a design is to
	correct. You don't need a human in the loop to know whether `worker → database`
	is connected. The environment can check.

	If you can verify it, you can train on it.

	We built ArchitectureEnv — an OpenEnv environment where an agent designs
	software systems one action at a time. The agent can `add` a component,
	`connect` two components, or `submit` the finished design. After each action,
	the environment returns a reward in `[0, 1]` based on how complete and
	technically sound the architecture is.

	Six real-world tasks across three difficulty levels:

	| Task | Difficulty | Key challenge |
	|------|-----------|---------------|
	| `url_shortener` | Easy | Load balancer, cache, DB, abuse prevention |
	| `chat_system` | Medium | WebSocket fanout, presence, async broker |
	| `ecommerce_platform` | Medium | Search, payments, async workers |
	| `youtube_platform` | Hard | Transcoding pipeline, CDN, recommendations |
	| `ride_sharing` | Hard | Geospatial indexing, real-time matching |
	| `ml_platform` | Hard | Feature store, model registry, inference server |

	The reward function is deterministic and composable — no LLM judge, no ambiguity:

	- Component coverage: which required families are present (e.g. `cache`, `broker`, `database`)
	- Connection coverage: which required edges exist between components
	- Technology-fit bonus: Kafka over RabbitMQ for streaming workloads, Redis over Memcached when presence tracking is needed
	- Bonus items: optional cross-cutting concerns — auth, observability, rate limiting, payment gateways
	- Coherence penalty: deducted for irrelevant or conflicting components

	This multi-component reward makes the environment hard to game. An agent can't
	just dump every component it knows — irrelevant components incur a penalty, and
	bonus items only score if the required core is already complete.
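
To make the composition concrete, here is a minimal sketch of how such a reward could be assembled. The weights, the penalty size, the `TaskSpec` and `Design` types, and the toy task spec are all assumptions for illustration; they are not the environment's actual values or code.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    required_families: set                      # assumed non-empty
    required_edges: set                         # {(source, target), ...}, assumed non-empty
    bonus_items: set = field(default_factory=set)
    preferred_tech: dict = field(default_factory=dict)  # family -> best-fit technology


@dataclass
class Design:
    components: dict                            # family -> concrete technology chosen
    edges: set = field(default_factory=set)


def score(design, spec, w_comp=0.5, w_conn=0.3, w_fit=0.1, w_bonus=0.1, penalty=0.05):
    """Hypothetical composable reward, clamped to [0, 1]."""
    present = set(design.components)
    comp = len(present & spec.required_families) / len(spec.required_families)
    conn = len(design.edges & spec.required_edges) / len(spec.required_edges)
    hits = sum(design.components.get(f) == t for f, t in spec.preferred_tech.items())
    fit = hits / len(spec.preferred_tech) if spec.preferred_tech else 1.0
    # Bonus items only count once the required core is complete.
    core_complete = comp == 1.0 and conn == 1.0
    bonus = 0.0
    if core_complete and spec.bonus_items:
        bonus = len(present & spec.bonus_items) / len(spec.bonus_items)
    # Anything that is neither required nor a bonus target is penalised.
    irrelevant = present - spec.required_families - spec.bonus_items
    raw = (w_comp * comp + w_conn * conn + w_fit * fit + w_bonus * bonus
           - penalty * len(irrelevant))
    return max(0.0, min(1.0, raw))


# Toy spec, loosely modelled on url_shortener (not the real task definition).
spec = TaskSpec(
    required_families={"load_balancer", "cache", "database"},
    required_edges={("load_balancer", "cache"), ("cache", "database")},
    bonus_items={"auth"},
    preferred_tech={"cache": "redis"},
)
full = Design(
    {"load_balancer": "nginx", "cache": "redis", "database": "postgres", "auth": "oauth"},
    {("load_balancer", "cache"), ("cache", "database")},
)
dumped = Design(
    {"load_balancer": "nginx", "cache": "memcached", "database": "postgres",
     "auth": "oauth", "blockchain": "ledger"},
)
print(score(full, spec), score(dumped, spec))
```

In this sketch the `dumped` design has every required family but no edges, so its bonus item earns nothing and the irrelevant component costs it points. That is the anti-gaming behaviour described above.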

	---

	## The knowledge layer: the System Design Encyclopedia

	Before training anything, we needed the model to know which components belong
	to which architectural families. A "cache" family could mean Redis, Memcached,
	or Varnish — but for a ride-sharing platform tracking driver locations in memory,
	Redis is the right answer. For a YouTube-scale streaming platform, Kafka is the
	right broker, not RabbitMQ.

	We wrote `encyclopedia_rules.py`, a knowledge layer on top of the
	System Design Components Encyclopedia that covers 40+ component families. It does four things:

	1. Family → concrete component mappings. Every abstract family required by the environment
	(e.g. `broker`, `cache`, `search`, `storage`) is mapped to the concrete
	implementation that scores highest for a given task type. The agent's system
	prompt is enriched with these mappings before every episode.

	2. Task-specific technology overrides. Some mappings differ by task. The
	broker family defaults to `rabbitmq` for simple async work but overrides to
	`kafka` for streaming-heavy tasks (`youtube_platform`, `ride_sharing`). The
	cache family defaults to `redis`, but `ml_platform` uses `feast` as the
	feature store. These overrides are resolved at inference time, not hard-coded.

	3. Bonus target lists. Each task has a set of optional bonus components
	(auth, observability, rate-limiting, payment gateway, recommendation workers,
	etc.). The encyclopedia maps these back to concrete components the agent can
	`add` by name, closing the gap between abstract scoring criteria and executable
	actions.

	4. Enriched system prompt. The system prompt injected into every agent call
	includes a structured summary of the relevant component families, their
	concrete implementations, and the integration patterns between them — drawn
	directly from the encyclopedia's integration pattern catalog.
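
The default-plus-override lookup described above can be sketched as follows. The dictionary names and the `resolve` and `family_summary` helpers are hypothetical, not the contents of `encyclopedia_rules.py`. Only the broker and cache mappings come from the text, and modelling `feast` as a cache override is one reading of it.

```python
# Defaults per abstract family (only the two families named above).
DEFAULTS = {"broker": "rabbitmq", "cache": "redis"}

# Per-task overrides, resolved at lookup time rather than baked into the defaults.
TASK_OVERRIDES = {
    "youtube_platform": {"broker": "kafka"},
    "ride_sharing": {"broker": "kafka"},
    "ml_platform": {"cache": "feast"},
}


def resolve(family: str, task: str) -> str:
    """Return the concrete component for a family, honouring task overrides."""
    return TASK_OVERRIDES.get(task, {}).get(family, DEFAULTS[family])


def family_summary(task: str) -> str:
    """Build the mapping lines that would be injected into the system prompt."""
    return "\n".join(f"{family} -> {resolve(family, task)}" for family in sorted(DEFAULTS))


print(family_summary("ride_sharing"))
```

Keeping overrides in a separate table means adding a new task only touches `TASK_OVERRIDES`, and any task without an entry falls back to the defaults.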

	This is what separates the agentic system from raw LLM inference. The model
	doesn't need to hallucinate that Kafka is a good fit for streaming — the
	encyclopedia tells it, and the Planner enforces it.

	---

	## Training: SFT first, then RL

	A raw base model given an architecture task will output something that looks
	like an architecture discussion — not a sequence of `add` and `connect`
	commands. Before we could use reinforcement learning, we needed the model to
	speak the language of the environment.

	We fine-tuned Qwen2.5-3B-Instruct using Unsloth with 4-bit QLoRA on a
	supervised dataset of 120 correct action sequences across all six task types.
	This is the SFT stage — it doesn't teach strategy, it teaches format. After SFT,
	the model reliably outputs `add postgres`, `connect broker worker`, `submit` —
	the vocabulary the environment understands.
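
For illustration, the three-verb action vocabulary can be parsed with a few lines of Python. The `Action` type and `parse_action` function are hypothetical; the environment's real parser may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Number of arguments each verb takes in the action vocabulary.
ARITY = {"add": 1, "connect": 2, "submit": 0}


@dataclass(frozen=True)
class Action:
    kind: str
    args: Tuple[str, ...] = ()


def parse_action(line: str) -> Optional[Action]:
    """Parse one model output line; return None if it is not a valid action."""
    tokens = line.strip().split()
    if not tokens:
        return None
    verb, args = tokens[0].lower(), tuple(tokens[1:])
    if verb not in ARITY or len(args) != ARITY[verb]:
        return None
    return Action(verb, args)


for text in ["add postgres", "connect broker worker", "submit", "We should use Kafka."]:
    print(repr(text), "->", parse_action(text))
```

A free-form sentence such as the last example parses to `None`, which is the kind of output SFT is meant to eliminate.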

	SFT training ran for 150 steps (10 epochs, batch size 8) on a T4 GPU in 13 minutes.
	Cross-entropy loss dropped from 3.57 at step 10 to a plateau of ~0.006 by step 70,
	with a final average training loss of 0.208 (skewed by the steep early descent).
	Evaluated directly after training (single-shot greedy decode, no agentic loop),
	the SFT checkpoint achieved 0.997 average reward across all six tasks — 4 tasks
	at 1.0, and ecommerce/ml_platform at 0.99.
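
The step count follows from the dataset size. A quick check, assuming a single device and no gradient accumulation (neither is stated above):

```python
import math


def total_steps(n_examples: int, batch_size: int, epochs: int, grad_accum: int = 1) -> int:
    """Optimizer steps for a run; grad_accum=1 is an assumption, not a reported setting."""
    steps_per_epoch = math.ceil(n_examples / (batch_size * grad_accum))
    return steps_per_epoch * epochs


steps = total_steps(n_examples=120, batch_size=8, epochs=10)
print(steps)  # 15 steps per epoch over 10 epochs
```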

	SFT had saturated the task format. We then ran GRPO on top of the SFT checkpoint to
	explore whether RL could improve priority ordering — required items before bonus items,
	required connections before submitting.

	Using HuggingFace TRL's GRPOTrainer, we ran the environment as the reward function:
	the model generates an action plan, the environment executes it, and the final score
	drives the weight update. We trained for 50 GRPO steps from the SFT checkpoint
	(batch size 4, ~10 minutes on T4). As the reward curve shows, GRPO improved the
	raw model's training-time scores, but when both checkpoints were evaluated through
	the full agentic inference stack, the SFT-only checkpoint outperformed SFT+GRPO
	(0.978 vs 0.928 average reward). The gap comes almost entirely from `ml_platform`,
	where the GRPO run dropped from 0.990 to 0.580. The section "Why SFT + agentic loop,
	not SFT + GRPO + agentic loop?" covers this in detail.

	![SFT Loss Curve](https://huggingface.co/spaces/thepikachu/architecture-env/resolve/main/plots/loss_curve.png)
	SFT cross-entropy loss over 150 steps. Loss axis: cross-entropy. Step axis: training step.

	![Reward Curve](https://huggingface.co/spaces/thepikachu/architecture-env/resolve/main/plots/reward_curve.png)
	Average environment reward across all six tasks, comparing SFT-only and SFT+GRPO checkpoints over training. Reward axis: environment score [0, 1]. Step axis: training step.

	---

	## The agentic layer: Planner, Critic, Negotiator

	Here's where things got interesting. We found that the SFT model, when wrapped
	in a structured multi-agent inference loop informed by the encyclopedia,
	matched or exceeded the SFT+GRPO checkpoint on five of six tasks, with full
	interpretability of every decision.

	The loop at every step:

	PlannerAgent — deterministic, zero-hallucination. It reads the live
	environment state (present components, missing required items, missing
	connections, available bonus targets) and computes the ground-truth next action
	using the encyclopedia's priority ordering: required items → required connections
	→ bonus items → submit. The planner's hint is injected into the LLM's prompt.

	LLM (the fine-tuned SFT model) — generates its own action proposal informed
	by the planner hint. This is where domain knowledge from the encyclopedia pays
	off: the model has been trained on action sequences that reflect the same
	technology choices the encyclopedia prescribes.

	CriticAgent — validates the LLM's proposal against the live environment
	state before execution. It checks: does the component exist before connecting?
	Is required work complete before bonus items? Is the missing-items list empty
	before submit? If the proposal fails any check, it's rejected without hitting
	the environment.

	NegotiatorAgent — repairs rejected proposals using the planner as ground
	truth. Every rejection is logged with its reason (`critic_rejected: 'websocket_gateway' not present; add it before connecting → repair: add websocket_gateway`), creating a fully auditable decision trail.

	The Critic was the most important piece. Without it, the LLM would occasionally
	submit with missing connections still in the queue, or attempt to connect
	components that hadn't been added yet. The Critic catches both, every step,
	before the environment ever sees the action.
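
The Planner, Critic, and Negotiator steps can be sketched as three small functions. This is a simplified illustration: the `EnvState` fields, function names, and rejection messages are assumptions modelled on the description and the log line quoted above, not the project's actual agent classes.

```python
from dataclasses import dataclass, field


@dataclass
class EnvState:
    present: set = field(default_factory=set)
    missing_required: list = field(default_factory=list)
    missing_connections: list = field(default_factory=list)  # [(src, dst), ...]
    bonus_targets: list = field(default_factory=list)


def plan(state):
    """Planner: required items -> required connections -> bonus items -> submit."""
    if state.missing_required:
        return ("add", state.missing_required[0])
    if state.missing_connections:
        return ("connect",) + tuple(state.missing_connections[0])
    if state.bonus_targets:
        return ("add", state.bonus_targets[0])
    return ("submit",)


def critique(state, action):
    """Critic: return a rejection reason, or None if the proposal is acceptable."""
    core_pending = bool(state.missing_required or state.missing_connections)
    kind, args = action[0], action[1:]
    if kind == "connect" and len(args) == 2:
        for name in args:
            if name not in state.present:
                return f"'{name}' not present; add it before connecting"
        return None
    if kind == "add" and len(args) == 1:
        if args[0] in state.bonus_targets and core_pending:
            return "required work incomplete; finish it before bonus items"
        return None
    if kind == "submit" and not args:
        if core_pending:
            return "required work incomplete; finish it before submit"
        return None
    return "malformed action"


def negotiate(state, proposal, log):
    """Negotiator: pass valid proposals through, otherwise fall back to the planner."""
    reason = critique(state, proposal)
    if reason is None:
        return proposal
    repair = plan(state)
    log.append(f"critic_rejected: {reason} → repair: {' '.join(repair)}")
    return repair


state = EnvState(
    present={"broker"},
    missing_required=["websocket_gateway"],
    missing_connections=[("websocket_gateway", "broker")],
)
log = []
print(negotiate(state, ("connect", "websocket_gateway", "broker"), log))
print(log[-1])
```

Because the Critic only reads environment state, a rejected proposal never reaches the environment, and the log line records both the reason and the repair.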

	---

	## Results

	Agentic inference demo — real LLM calls, no caching, run on 2026-04-25:

	| Task | SFT, no agentic loop | SFT + agentic loop | Δ |
	|------|---------------------|--------------------|---|
	| chat_system | 0.930 | 1.000 | +0.070 |
	| ecommerce_platform | 0.900 | 0.990 | +0.090 |
	| youtube_platform | 0.930 | 1.000 | +0.070 |
	| ride_sharing | 0.930 | 0.890 | −0.040 |
	| ml_platform | 0.930 | 0.990 | +0.060 |
	| url_shortener | 0.930 | 1.000 | +0.070 |
	| Average | 0.925 | 0.978 | +0.053 |

	Five of six tasks hit 0.99 or above. The `ride_sharing` episode scored 0.890 — the
	agent completed the required architecture correctly but the Negotiator's deduplication
	guard did not prevent an extra bonus-item attempt that incurred a step penalty before
	submit. This is a known edge case in the Negotiator and not a model failure — the
	required architecture was fully complete before the penalty was incurred.

	Bonus items collected in the agentic run show the agent acting on the
	encyclopedia's scoring model, picking up optional items once the required core is complete:

	- `chat_system` → auth, observability, presence_service, notification_service
	- `youtube_platform` → auth, observability, recommendation_worker
	- `url_shortener` → auth, observability, rate_limiting
	- `ml_platform` → auth, observability

	---

	## Why SFT + agentic loop, not SFT + GRPO + agentic loop?

	We ran both pipelines end-to-end through the same Planner-Critic-Negotiator inference
	stack and compared the results directly.

	SFT + agentic loop averaged 0.978 across all six tasks, with five at 0.99 or above.
	SFT + GRPO + agentic loop averaged 0.928. On average, GRPO made things worse, not better.

	| Task | SFT + agentic | SFT + GRPO + agentic | Δ |
	|------|--------------|----------------------|---|
	| url_shortener | 1.000 | 1.000 | 0.000 |
	| chat_system | 1.000 | 1.000 | 0.000 |
	| ecommerce_platform | 0.990 | 0.990 | 0.000 |
	| youtube_platform | 1.000 | 1.000 | 0.000 |
	| ride_sharing | 0.890 | 1.000 | +0.110 |
	| ml_platform | 0.990 | 0.580 | −0.410 |
	| Average | 0.978 | 0.928 | −0.050 |
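
The averages and deltas can be recomputed from the per-task scores in the table. The variable names are ours; the numbers are copied from the table.

```python
sft_agentic = {
    "url_shortener": 1.000, "chat_system": 1.000, "ecommerce_platform": 0.990,
    "youtube_platform": 1.000, "ride_sharing": 0.890, "ml_platform": 0.990,
}
grpo_agentic = {
    "url_shortener": 1.000, "chat_system": 1.000, "ecommerce_platform": 0.990,
    "youtube_platform": 1.000, "ride_sharing": 1.000, "ml_platform": 0.580,
}


def mean_score(scores: dict) -> float:
    return sum(scores.values()) / len(scores)


# Per-task change introduced by GRPO, rounded to the table's precision.
deltas = {task: round(grpo_agentic[task] - sft_agentic[task], 3) for task in sft_agentic}
changed = {task: d for task, d in deltas.items() if d != 0.0}

print(round(mean_score(sft_agentic), 3), round(mean_score(grpo_agentic), 3))
print(changed)
```

Only two tasks move at all, and the `ml_platform` drop is large enough to outweigh the `ride_sharing` gain in the average.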

	The regression on `ml_platform` is the clearest signal. The SFT+GRPO checkpoint
	submitted with `storage` still missing, zero connections made, and three required
	edges completely absent, for a score of 0.58. The SFT checkpoint through the same
	agentic loop scored 0.99 on the same task. GRPO appears to have disrupted the LoRA
	weights enough to break the model's ability to complete this hard task. It left four
	tasks unchanged, and its only gain, +0.110 on `ride_sharing`, came on the task where
	the SFT run lost points to a Negotiator edge case rather than to a model error.

	Why does this happen? GRPO learns from reward differences within a group of sampled
	completions. When the SFT model already knows the correct action format and ordering,
	most samples score nearly the same, so GRPO has little reward variance to learn from
	and risks overfitting the adapter weights to minor fluctuations. That is our best
	explanation for what happened on `ml_platform`.
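
The variance argument can be illustrated with the group-normalised advantage commonly used to describe GRPO: each sampled completion's reward is centred on the group mean and scaled by the group standard deviation. This is a generic sketch, not TRL's internal code path; the `eps` value and the reward groups are made up for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Centre each reward on the group mean and scale by the group's std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A saturated group: every completion scores the same, so there is no signal.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])

# A nearly saturated group: a 0.01 reward gap becomes a unit-scale advantage.
noisy = group_advantages([0.99, 1.0, 1.0, 1.0])

print(flat)
print([round(a, 2) for a in noisy])
```

When rewards are identical the advantages vanish. When they differ only slightly, normalisation amplifies the small gap, which is one way minor fluctuations could end up driving the update.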

	The right production system uses both: GRPO for the weights when starting from a weaker
	base, Planner-Critic-Negotiator for the inference guard regardless. That's the roadmap.

	---

	## What this enables

	Architecture design that currently takes engineering teams hours of back-and-forth
	can be reduced to a structured, verifiable, automatable loop. The environment
	exposes a clean interface: describe the system, get a design, get a score. The
	agentic layer ensures the design is complete before submission. The trained model
	brings domain knowledge. The encyclopedia brings precision about which technologies
	fit which workloads.

	The deeper point is about the pattern, not the specific task. Any domain where
	design decisions can be verified — infrastructure templates, data pipeline
	architecture, API schema design, security policy configuration — is a candidate
	for this approach. Build an environment with a deterministic composable reward,
	compile your domain knowledge into a structured encyclopedia, fine-tune a model
	on the action vocabulary, wrap it in a structured inference loop. The review
	loop doesn't disappear; it moves into the Critic, where it runs in milliseconds
	instead of minutes.

	---

	## Compliance checklist

	| Requirement | Status |
	|-------------|--------|
	| Built on OpenEnv (latest release) | ✅ `openenv-core`, FastAPI, `openenv.yaml` |
	| Working training script (Unsloth + TRL) as Colab notebook | ✅ [`ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb`](https://huggingface.co/spaces/thepikachu/architecture-env/blob/main/notebooks/ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb) |
	| Evidence of real training (loss + reward plots) | ✅ `plots/loss_curve.png`, `plots/reward_curve.png` (committed to repo) |
	| Mini-blog writeup | ✅ This post (also at `Blog.md` in Space repo) |
	| Environment pushed to HuggingFace Space | ✅ https://huggingface.co/spaces/thepikachu/architecture-env |
	| README with problem motivation, env explanation, results | ✅ Linked from Space |
	| All materials linked from README | ✅ Space, model, notebook, blog |

	---

	## Stack

	- Environment: OpenEnv + FastAPI + Pydantic
	- Base model: Qwen2.5-3B-Instruct (4-bit QLoRA via Unsloth)
	- Training: Unsloth + HuggingFace TRL (SFTTrainer + GRPOTrainer)
	- Knowledge layer: `encyclopedia_rules.py` — 40+ component families, family→concrete mappings, task broker overrides, bonus targets, enriched system prompts
	- Agents: PlannerAgent, CriticAgent, NegotiatorAgent (pure Python, deterministic, no extra model calls)
	- Deployment: HuggingFace Spaces (Docker)

	---

	Code, training notebook, and logs: https://huggingface.co/spaces/thepikachu/architecture-env