| # ArchitectureEnv: Teaching an LLM to Design Software Systems | |
| *Meta PyTorch × OpenEnv Hackathon Grand Finale, April 2026* | |
| **Space**: https://huggingface.co/spaces/thepikachu/architecture-env | |
| **Model**: https://huggingface.co/thepikachu/architecture-sft-model | |
| **Notebook**: [ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb](https://huggingface.co/spaces/thepikachu/architecture-env/blob/main/notebooks/ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb) | |
| --- | |
| ## The problem nobody talks about | |
| Every engineering team that has ever built a distributed system knows the drill. | |
| Someone opens a Confluence doc, someone else draws boxes in Miro, and then the | |
| back-and-forth begins. Do we use Kafka or RabbitMQ here? Should the worker | |
| connect directly to the database or go through a cache first? Does the | |
| WebSocket gateway fan out to the broker, or does the broker fan out to the | |
| gateway? These conversations take hours. Then requirements change, the design | |
| evolves, and they take hours again. | |
| The obvious response was: just ask an LLM. And that works, up to a point. You | |
| describe your system, the model suggests components, you iterate. But there's a | |
| hidden cost: every response still needs a human to read it, validate it against | |
| the actual requirements, catch the missing connections, and push back when the | |
| model picks the wrong technology for the task. You're not removing the review | |
| loop; you're just moving it. | |
| We wanted to close that loop entirely. | |
| --- | |
| ## The idea: turn architecture design into an RL problem | |
| The key insight is that software architecture design has a structure that most | |
| open-ended LLM tasks don't: **it can be verified**. Given a task description, | |
| there is a set of components that must be present, a set of connections that | |
| must exist between them, and a measurable score for how close a design is to | |
| correct. You don't need a human in the loop to know whether `worker → database` | |
| is connected. The environment can check. | |
| If you can verify it, you can train on it. | |
| We built **ArchitectureEnv**, an OpenEnv environment where an agent designs | |
| software systems one action at a time. The agent can `add` a component, | |
| `connect` two components, or `submit` the finished design. After each action, | |
| the environment returns a reward in `[0, 1]` based on how complete and | |
| technically sound the architecture is. | |
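The three-verb action vocabulary can be parsed in a few lines. The sketch below is illustrative only: the `Action` dataclass and `parse_action` helper are hypothetical names, not the environment's actual API, and the text format is assumed from the action strings quoted in this post (`add postgres`, `connect broker worker`, `submit`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical container for one environment action."""
    kind: str                      # "add", "connect", or "submit"
    source: Optional[str] = None   # component to add, or the edge's source
    target: Optional[str] = None   # edge target (connect only)


def parse_action(text: str) -> Action:
    """Parse one action line such as 'add postgres' or 'connect broker worker'."""
    parts = text.strip().lower().split()
    if parts == ["submit"]:
        return Action("submit")
    if len(parts) == 2 and parts[0] == "add":
        return Action("add", parts[1])
    if len(parts) == 3 and parts[0] == "connect":
        return Action("connect", parts[1], parts[2])
    # Anything else is outside the three-verb vocabulary.
    raise ValueError(f"unrecognized action: {text!r}")
```

Rejecting anything outside the three verbs at parse time keeps free-form model output from ever reaching the scoring logic.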
| The environment ships six real-world tasks across three difficulty levels: | |
| | Task | Difficulty | Key challenge | | |
| |------|-----------|---------------| | |
| | `url_shortener` | Easy | Load balancer, cache, DB, abuse prevention | | |
| | `chat_system` | Medium | WebSocket fanout, presence, async broker | | |
| | `ecommerce_platform` | Medium | Search, payments, async workers | | |
| | `youtube_platform` | Hard | Transcoding pipeline, CDN, recommendations | | |
| | `ride_sharing` | Hard | Geospatial indexing, real-time matching | | |
| | `ml_platform` | Hard | Feature store, model registry, inference server | | |
| The reward function is deterministic and composable, with no LLM judge and no ambiguity: | |
| - **Component coverage**: which required families are present (e.g. `cache`, `broker`, `database`) | |
| - **Connection coverage**: which required edges exist between components | |
| - **Technology-fit bonus**: Kafka over RabbitMQ for streaming workloads, Redis over Memcached when presence tracking is needed | |
| - **Bonus items**: optional cross-cutting concerns such as auth, observability, rate limiting, payment gateways | |
| - **Coherence penalty**: deducted for irrelevant or conflicting components | |
| This multi-component reward makes the environment hard to game. An agent can't | |
| just dump every component it knows: irrelevant components incur a penalty, and | |
| bonus items only score if the required core is already complete. | |
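A minimal sketch of how such a composable score could be assembled is shown below. The weights, the penalty size, and the `spec` layout are all assumptions made for illustration. The post does not publish the real values, and the actual reward code may be structured differently.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]

# Hypothetical weights and penalty, chosen only so a perfect design scores 1.0.
W_COMPONENTS, W_EDGES, W_TECH, W_BONUS = 0.4, 0.4, 0.1, 0.1
PENALTY_PER_IRRELEVANT = 0.05


def coverage(required: set, present: set) -> float:
    """Fraction of required items that are present (1.0 if nothing is required)."""
    return len(required & present) / len(required) if required else 1.0


def score_design(spec: dict, components: Dict[str, str], edges: Set[Edge]) -> float:
    """Score a design; `components` maps family name -> chosen implementation."""
    families = set(components)
    comp_cov = coverage(spec["required"], families)
    edge_cov = coverage(spec["edges"], edges)
    # Technology fit: families whose chosen implementation is the preferred one.
    preferred = spec["preferred"]
    matches = {f for f, impl in preferred.items() if components.get(f) == impl}
    tech_fit = coverage(set(preferred), matches)
    # Bonus items only count once the required core is complete.
    core_complete = comp_cov == 1.0 and edge_cov == 1.0
    bonus = coverage(spec["bonus"], families) if core_complete else 0.0
    # Coherence penalty: anything that is neither required nor a bonus target.
    irrelevant = len(families - spec["required"] - spec["bonus"])
    raw = (W_COMPONENTS * comp_cov + W_EDGES * edge_cov
           + W_TECH * tech_fit + W_BONUS * bonus
           - PENALTY_PER_IRRELEVANT * irrelevant)
    return max(0.0, min(1.0, raw))  # clamp to the [0, 1] reward range
```

Because every term is a pure function of the current design, the same design always gets the same score, which is the property that lets the environment stand in for a human reviewer.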
| --- | |
| ## The knowledge layer: the System Design Encyclopedia | |
| Before training anything, we needed the model to know *which* components belong | |
| to *which* architectural families. A "cache" family could mean Redis, Memcached, | |
| or Varnish, but for a ride-sharing platform tracking driver locations in memory, | |
| Redis is the right answer. For a YouTube-scale streaming platform, Kafka is the | |
| right broker, not RabbitMQ. | |
| We built `encyclopedia_rules.py`, a knowledge layer built on top of the | |
| **System Design Components Encyclopedia** covering 40+ component families. It does four things: | |
| **1. Family → concrete component mappings.** Every abstract family required by the environment | |
| (e.g. `broker`, `cache`, `search`, `storage`) is mapped to the concrete | |
| implementation that scores highest for a given task type. The agent's system | |
| prompt is enriched with these mappings before every episode. | |
| **2. Task-specific technology overrides.** Some mappings differ by task. The | |
| broker family defaults to `rabbitmq` for simple async work but overrides to | |
| `kafka` for streaming-heavy tasks (`youtube_platform`, `ride_sharing`). The | |
| cache family defaults to `redis` but the ml_platform uses `feast` as the | |
| feature store. These overrides are resolved at inference time, not hard-coded. | |
| **3. Bonus target lists.** Each task has a set of optional bonus components | |
| (auth, observability, rate-limiting, payment gateway, recommendation workers, | |
| etc.). The encyclopedia maps these back to concrete components the agent can | |
| `add` by name, closing the gap between abstract scoring criteria and executable | |
| actions. | |
| **4. Enriched system prompt.** The system prompt injected into every agent call | |
| includes a structured summary of the relevant component families, their | |
| concrete implementations, and the integration patterns between them, drawn | |
| directly from the encyclopedia's integration pattern catalog. | |
| This is what separates the agentic system from raw LLM inference. The model | |
| doesn't have to guess that Kafka is a good fit for streaming: the | |
| encyclopedia tells it, and the Planner enforces it. | |
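The default-plus-override lookup described above can be sketched as follows. The dictionary names and the `resolve` / `prompt_section` helpers are hypothetical, not the contents of `encyclopedia_rules.py`. Only the `rabbitmq`, `kafka`, and `redis` choices and the two streaming-heavy task names come from this post.

```python
from typing import Dict

# Family-level defaults (only the two families named in the post).
DEFAULTS: Dict[str, str] = {"broker": "rabbitmq", "cache": "redis"}

# Per-task overrides: streaming-heavy tasks switch the broker to Kafka.
TASK_OVERRIDES: Dict[str, Dict[str, str]] = {
    "youtube_platform": {"broker": "kafka"},
    "ride_sharing": {"broker": "kafka"},
}


def resolve(family: str, task: str) -> str:
    """Return the concrete component for a family, honoring task overrides."""
    return TASK_OVERRIDES.get(task, {}).get(family, DEFAULTS[family])


def prompt_section(task: str) -> str:
    """Render the resolved mappings as a block for the system prompt."""
    lines = [f"- {family}: use {resolve(family, task)}" for family in sorted(DEFAULTS)]
    return f"Component mappings for {task}:\n" + "\n".join(lines)
```

Resolving at call time rather than baking the choice into each task definition means a new task only needs an override entry where it departs from the defaults.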
| --- | |
| ## Training: SFT first, then RL | |
| A raw base model given an architecture task will output something that looks | |
| like an architecture discussion, not a sequence of `add` and `connect` | |
| commands. Before we could use reinforcement learning, we needed the model to | |
| speak the language of the environment. | |
| We fine-tuned **Qwen2.5-3B-Instruct** using Unsloth with 4-bit QLoRA on a | |
| supervised dataset of 120 correct action sequences across all six task types. | |
| This is the SFT stage: it doesn't teach strategy, it teaches format. After SFT, | |
| the model reliably outputs `add postgres`, `connect broker worker`, `submit`, | |
| the vocabulary the environment understands. | |
| **SFT training ran for 150 steps (10 epochs of 15 steps each: 120 examples at batch size 8) on a T4 GPU in 13 minutes.** | |
| Cross-entropy loss dropped from 3.57 at step 10 to a plateau of ~0.006 by step 70, | |
| with a final average training loss of 0.208 (skewed by the steep early descent). | |
| Evaluated directly after training (single-shot greedy decode, no agentic loop), | |
| the SFT checkpoint achieved **0.997 average reward** across all six tasks: four tasks | |
| at 1.0, and `ecommerce_platform` and `ml_platform` at 0.99. | |
| SFT had saturated the task format. We then ran GRPO on top of the SFT checkpoint to | |
| explore whether RL could improve priority ordering: required items before bonus items, | |
| required connections before submitting. | |
| Using HuggingFace TRL's GRPOTrainer, we ran the environment as the reward function: | |
| the model generates an action plan, the environment executes it, and the final score | |
| drives the weight update. We trained for **50 GRPO steps from the SFT checkpoint** | |
| (batch size 4, ~10 minutes on T4). As the reward curve shows, GRPO improved the | |
| checkpoint's training-time scores, but when both checkpoints were evaluated through | |
| the full agentic inference stack, the SFT-only checkpoint outperformed SFT+GRPO | |
| (0.978 vs 0.928), because the GRPO run degraded `ml_platform` significantly. This is | |
| discussed in detail under "Why SFT + agentic loop, not SFT + GRPO + agentic loop?" below. | |
|  | |
| *SFT cross-entropy loss over 150 steps. Loss axis: cross-entropy. Step axis: training step.* | |
|  | |
| *Average environment reward across all six tasks, comparing SFT-only and SFT+GRPO checkpoints over training. Reward axis: environment score [0, 1]. Step axis: training step.* | |
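Using the environment as the reward signal amounts to wrapping a rollout in a batch scoring function. The sketch below assumes the general shape TRL documents for custom GRPO reward functions (a callable that receives the prompts and completions and returns one float per completion). The exact keyword arguments vary by TRL version. `toy_episode` is a hypothetical stand-in for a real ArchitectureEnv rollout and its expected action list is invented for illustration.

```python
from typing import Callable, List


def make_reward_func(run_episode: Callable[[str, str], float]):
    """Wrap an environment rollout as a batch reward function."""
    def reward_func(prompts: List[str], completions: List[str], **kwargs) -> List[float]:
        # One deterministic environment score per generated action plan.
        return [float(run_episode(p, c)) for p, c in zip(prompts, completions)]
    return reward_func


def toy_episode(prompt: str, completion: str) -> float:
    """Hypothetical stand-in for the environment: fraction of expected actions emitted."""
    expected = ["add postgres", "add redis", "connect redis postgres", "submit"]
    emitted = [line.strip() for line in completion.splitlines() if line.strip()]
    return sum(action in emitted for action in expected) / len(expected)


reward_func = make_reward_func(toy_episode)
scores = reward_func(
    prompts=["Design a url_shortener"] * 2,
    completions=[
        "add postgres\nadd redis\nconnect redis postgres\nsubmit",
        "add postgres\nsubmit",
    ],
)
```

In a real run the returned callable would be passed to the trainer in place of a learned reward model. Here it is only exercised directly.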
| --- | |
| ## The agentic layer: Planner, Critic, Negotiator | |
| Here's where things got interesting. We found that the SFT model, when wrapped | |
| in a structured multi-agent inference loop informed by the encyclopedia, | |
| matched or exceeded the SFT+GRPO checkpoint on five of six tasks, with full interpretability of every decision. | |
| The loop at every step: | |
| **PlannerAgent**: deterministic, zero-hallucination. It reads the live | |
| environment state (present components, missing required items, missing | |
| connections, available bonus targets) and computes the ground-truth next action | |
| using the encyclopedia's priority ordering: required items → required connections | |
| → bonus items → submit. The planner's hint is injected into the LLM's prompt. | |
| **LLM (the fine-tuned SFT model)**: generates its own action proposal informed | |
| by the planner hint. This is where domain knowledge from the encyclopedia pays | |
| off: the model has been trained on action sequences that reflect the same | |
| technology choices the encyclopedia prescribes. | |
| **CriticAgent**: validates the LLM's proposal against the live environment | |
| state before execution. It checks: does the component exist before connecting? | |
| Is required work complete before bonus items? Is the missing-items list empty | |
| before submit? If the proposal fails any check, it's rejected without hitting | |
| the environment. | |
| **NegotiatorAgent**: repairs rejected proposals using the planner as ground | |
| truth. Every rejection is logged with its reason (`critic_rejected: 'websocket_gateway' not present; add it before connecting → repair: add websocket_gateway`), creating a fully auditable decision trail. | |
| The Critic was the most important piece. Without it, the LLM would occasionally | |
| submit with missing connections still in the queue, or attempt to connect | |
| components that hadn't been added yet. The Critic catches both, every step, | |
| before the environment ever sees the action. | |
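The Planner, Critic, and Negotiator roles described above can be sketched in plain Python. Everything here is a simplified assumption: the `State` layout, the function names, and the tuple-based action encoding are hypothetical, and the real agents track more state (for example, the deduplication guard mentioned in the results).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class State:
    required: List[str]
    required_edges: List[Tuple[str, str]]
    bonus: List[str]
    components: Set[str] = field(default_factory=set)
    edges: Set[Tuple[str, str]] = field(default_factory=set)


def plan(state: State) -> tuple:
    """Planner: required items -> required connections -> bonus items -> submit."""
    for name in state.required:
        if name not in state.components:
            return ("add", name)
    for edge in state.required_edges:
        if edge not in state.edges:
            return ("connect", *edge)
    for name in state.bonus:
        if name not in state.components:
            return ("add", name)
    return ("submit",)


def critique(state: State, action: tuple) -> Optional[str]:
    """Critic: return None if the action is acceptable, else a rejection reason."""
    missing = [c for c in state.required if c not in state.components]
    missing_edges = [e for e in state.required_edges if e not in state.edges]
    if action[0] == "connect":
        for name in action[1:]:
            if name not in state.components:
                return f"'{name}' not present; add it before connecting"
    if action[0] == "add" and action[1] in state.bonus and (missing or missing_edges):
        return "required work incomplete; bonus items must wait"
    if action[0] == "submit" and (missing or missing_edges):
        return "required items or connections still missing"
    return None


def negotiate(state: State, proposal: tuple, log: List[str]) -> tuple:
    """Negotiator: pass valid proposals through, repair rejected ones via the planner."""
    reason = critique(state, proposal)
    if reason is None:
        return proposal
    repaired = plan(state)
    log.append(f"critic_rejected: {reason} → repair: {' '.join(repaired)}")
    return repaired
```

Since the repair always falls back to the planner's next action, a rejected proposal can never stall the episode; the worst case is that the LLM's suggestion is ignored for that step.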
| --- | |
| ## Results | |
| **Agentic inference demo** (real LLM calls, no caching, run on 2026-04-25): | |
| | Task | SFT, no agentic loop | SFT + agentic loop | Δ | | |
| |------|---------------------|--------------------|---| | |
| | chat_system | 0.930 | **1.000** | +0.070 | | |
| | ecommerce_platform | 0.900 | **0.990** | +0.090 | | |
| | youtube_platform | 0.930 | **1.000** | +0.070 | | |
| | ride_sharing | 0.930 | 0.890 | −0.040 | | |
| | ml_platform | 0.930 | **0.990** | +0.060 | | |
| | url_shortener | 0.930 | **1.000** | +0.070 | | |
| | **Average** | **0.925** | **0.978** | **+0.053** | | |
| Five of six tasks hit 0.99 or above. The `ride_sharing` episode scored 0.890: the | |
| agent completed the required architecture correctly, but the Negotiator's deduplication | |
| guard did not prevent an extra bonus-item attempt that incurred a step penalty before | |
| submit. This is a known edge case in the Negotiator, not a model failure: the | |
| required architecture was fully complete before the penalty was incurred. | |
| Bonus items collected in the agentic-loop run show the agent targeting the bonus components that the | |
| encyclopedia maps to each task: | |
| - `chat_system`: auth, observability, presence_service, notification_service | |
| - `youtube_platform`: auth, observability, recommendation_worker | |
| - `url_shortener`: auth, observability, rate_limiting | |
| - `ml_platform`: auth, observability | |
| --- | |
| ## Why SFT + agentic loop, not SFT + GRPO + agentic loop? | |
| We ran both pipelines end-to-end through the same Planner-Critic-Negotiator inference | |
| stack and compared the results directly. | |
| **SFT + agentic loop** averaged **0.978** across all six tasks, with five at 0.99 or above. | |
| **SFT + GRPO + agentic loop** averaged **0.928**: on average, GRPO made things worse, not better. | |
| | Task | SFT + agentic | SFT + GRPO + agentic | Δ | | |
| |------|--------------|----------------------|---| | |
| | url_shortener | 1.000 | 1.000 | 0.000 | | |
| | chat_system | 1.000 | 1.000 | 0.000 | | |
| | ecommerce_platform | 0.990 | 0.990 | 0.000 | | |
| | youtube_platform | 1.000 | 1.000 | 0.000 | | |
| | ride_sharing | 0.890 | 1.000 | +0.110 | | |
| | ml_platform | 0.990 | 0.580 | −0.410 | | |
| | **Average** | **0.978** | **0.928** | **−0.050** | | |
| The regression on `ml_platform` is the clearest signal. The SFT+GRPO checkpoint | |
| submitted with `storage` still missing, zero connections made, and three required | |
| edges completely absent, for a score of 0.58. The SFT checkpoint through the same | |
| agentic loop scored 0.99 on the same task. GRPO disrupted the LoRA weights enough | |
| to break the model's ability to complete this hard task, while leaving four tasks unchanged | |
| and improving only `ride_sharing` (+0.110), where the SFT run had lost points to the Negotiator edge case noted in the results. | |
| Why does this happen? GRPO scores each sampled completion relative to the other | |
| completions sampled for the same prompt. When the SFT model already knows the correct | |
| action format and ordering, nearly every sample earns the same reward, so GRPO has little | |
| reward variance to learn from and risks overfitting the adapter weights to minor fluctuations, | |
| which is our best explanation for what happened on `ml_platform`. | |
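The low-variance problem is easy to see in the group-relative advantage that GRPO is usually described as using: each reward is centered on the group mean and scaled by the group standard deviation. The sketch below is a simplified illustration, not TRL's implementation. The `eps` value and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Center each reward on the group mean and scale by the group spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Saturated policy: every sampled plan scores the same, so there is no signal.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])

# Near-saturated policy: a 0.01 reward difference is rescaled into a
# large advantage, so small fluctuations dominate the update.
noisy = group_advantages([0.99, 1.0, 1.0, 1.0])
```

With identical rewards every advantage is zero and the step carries no gradient signal. With nearly identical rewards, the normalization blows a 0.01 gap up to an advantage larger than 1 in magnitude, which is one plausible route to the kind of regression reported above.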
| The right production system uses both: GRPO for the weights when starting from a weaker | |
| base, Planner-Critic-Negotiator for the inference guard regardless. That's the roadmap. | |
| --- | |
| ## What this enables | |
| Architecture design that currently takes engineering teams hours of back-and-forth | |
| can be reduced to a structured, verifiable, automatable loop. The environment | |
| exposes a clean interface: describe the system, get a design, get a score. The | |
| agentic layer ensures the design is complete before submission. The trained model | |
| brings domain knowledge. The encyclopedia brings precision about which technologies | |
| fit which workloads. | |
| The deeper point is about the pattern, not the specific task. Any domain where | |
| design decisions can be verified (infrastructure templates, data pipeline | |
| architecture, API schema design, security policy configuration) is a candidate | |
| for this approach. Build an environment with a deterministic composable reward, | |
| compile your domain knowledge into a structured encyclopedia, fine-tune a model | |
| on the action vocabulary, wrap it in a structured inference loop. The review | |
| loop doesn't disappear; it moves into the Critic, where it runs in milliseconds | |
| instead of minutes. | |
| --- | |
| ## Compliance checklist | |
| | Requirement | Status | | |
| |-------------|--------| | |
| | Built on OpenEnv (latest release) | ✅ `openenv-core`, FastAPI, `openenv.yaml` | | |
| | Working training script (Unsloth + TRL) as Colab notebook | ✅ [`ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb`](https://huggingface.co/spaces/thepikachu/architecture-env/blob/main/notebooks/ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb) | | |
| | Evidence of real training (loss + reward plots) | ✅ `plots/loss_curve.png`, `plots/reward_curve.png` (committed to repo) | | |
| | Mini-blog writeup | ✅ This post (also at `Blog.md` in Space repo) | | |
| | Environment pushed to HuggingFace Space | ✅ https://huggingface.co/spaces/thepikachu/architecture-env | | |
| | README with problem motivation, env explanation, results | ✅ Linked from Space | | |
| | All materials linked from README | ✅ Space, model, notebook, blog | | |
| --- | |
| ## Stack | |
| - **Environment**: OpenEnv + FastAPI + Pydantic | |
| - **Base model**: Qwen2.5-3B-Instruct (4-bit QLoRA via Unsloth) | |
| - **Training**: Unsloth + HuggingFace TRL (SFTTrainer + GRPOTrainer) | |
| - **Knowledge layer**: `encyclopedia_rules.py`, covering 40+ component families, family→concrete mappings, task broker overrides, bonus targets, enriched system prompts | |
| - **Agents**: PlannerAgent, CriticAgent, NegotiatorAgent (pure Python, deterministic, no extra model calls) | |
| - **Deployment**: HuggingFace Spaces (Docker) | |
| --- | |
| *Code, training notebook, and logs: https://huggingface.co/spaces/thepikachu/architecture-env* |