# ArchitectureEnv: Teaching an LLM to Design Software Systems

*Meta PyTorch × OpenEnv Hackathon Grand Finale, April 2026*

🤗 **Space**: https://huggingface.co/spaces/thepikachu/architecture-env

🧠 **Model**: https://huggingface.co/thepikachu/architecture-sft-model

📓 **Notebook**: [ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb](https://huggingface.co/spaces/thepikachu/architecture-env/blob/main/notebooks/ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb)

---

## The problem nobody talks about

Every engineering team that has ever built a distributed system knows the drill. Someone opens a Confluence doc, someone else draws boxes in Miro, and then the back-and-forth begins. Do we use Kafka or RabbitMQ here? Should the worker connect directly to the database or go through a cache first? Does the WebSocket gateway fan out to the broker, or does the broker fan out to the gateway? These conversations take hours. Then requirements change, the design evolves, and they take hours again.

The obvious response was: just ask an LLM. And that works, up to a point. You describe your system, the model suggests components, you iterate. But there's a hidden cost: every response still needs a human to read it, validate it against the actual requirements, catch the missing connections, and push back when the model picks the wrong technology for the task. You're not removing the review loop; you're just moving it.

We wanted to close that loop entirely.

---

## The idea: turn architecture design into an RL problem

The key insight is that software architecture design has a structure that most open-ended LLM tasks don't: **it can be verified**. Given a task description, there is a set of components that must be present, a set of connections that must exist between them, and a measurable score for how close a design is to correct. You don't need a human in the loop to know whether `worker → database` is connected. The environment can check.

If you can verify it, you can train on it.

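To make "it can be verified" concrete, here is a minimal sketch of such a check. The names `TaskSpec`, `Design`, and `coverage` are hypothetical and chosen for illustration; they are not the environment's actual API, and the equal weighting of components and edges is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task spec: what a correct design must contain."""
    required_components: frozenset[str]
    required_edges: frozenset[tuple[str, str]]


@dataclass
class Design:
    """Hypothetical design state built up by the agent's actions."""
    components: set[str] = field(default_factory=set)
    edges: set[tuple[str, str]] = field(default_factory=set)


def coverage(design: Design, spec: TaskSpec) -> float:
    """Fraction of required components and edges present, in [0, 1].

    Assumption: components and edges are weighted equally.
    """
    total = len(spec.required_components) + len(spec.required_edges)
    if total == 0:
        return 1.0
    have = (len(spec.required_components & design.components)
            + len(spec.required_edges & design.edges))
    return have / total


spec = TaskSpec(
    required_components=frozenset({"worker", "database"}),
    required_edges=frozenset({("worker", "database")}),
)
design = Design(components={"worker", "database"})
partial = coverage(design, spec)           # both components, edge missing
design.edges.add(("worker", "database"))
full = coverage(design, spec)              # everything required is present
```

No human or LLM judge is involved: the score is a pure function of the design and the spec, which is what makes it usable as a training signal.
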
We built **ArchitectureEnv**, an OpenEnv environment where an agent designs software systems one action at a time. The agent can `add` a component, `connect` two components, or `submit` the finished design. After each action, the environment returns a reward in `[0, 1]` based on how complete and technically sound the architecture is.

Six real-world tasks across three difficulty levels:

| Task | Difficulty | Key challenge |
|------|-----------|---------------|
| `url_shortener` | Easy | Load balancer, cache, DB, abuse prevention |
| `chat_system` | Medium | WebSocket fanout, presence, async broker |
| `ecommerce_platform` | Medium | Search, payments, async workers |
| `youtube_platform` | Hard | Transcoding pipeline, CDN, recommendations |
| `ride_sharing` | Hard | Geospatial indexing, real-time matching |
| `ml_platform` | Hard | Feature store, model registry, inference server |

The reward function is deterministic and composable, with no LLM judge and no ambiguity:

- **Component coverage**: which required families are present (e.g. `cache`, `broker`, `database`)
- **Connection coverage**: which required edges exist between components
- **Technology-fit bonus**: Kafka over RabbitMQ for streaming workloads, Redis over Memcached when presence tracking is needed
- **Bonus items**: optional cross-cutting concerns (auth, observability, rate limiting, payment gateways)
- **Coherence penalty**: deducted for irrelevant or conflicting components

This multi-component reward makes the environment hard to game. An agent can't just dump every component it knows: irrelevant components incur a penalty, and bonus items only score if the required core is already complete.

---

## The knowledge layer: the System Design Encyclopedia

Before training anything, we needed the model to know *which* components belong to *which* architectural families. A "cache" family could mean Redis, Memcached, or Varnish, but for a ride-sharing platform tracking driver locations in memory, Redis is the right answer. For a YouTube-scale streaming platform, Kafka is the right broker, not RabbitMQ.

We wrote `encyclopedia_rules.py`, a knowledge layer on top of the **System Design Components Encyclopedia** covering 40+ component families. It does four things:

**1. Family → concrete component mappings.** Every abstract family required by the environment (e.g. `broker`, `cache`, `search`, `storage`) is mapped to the concrete implementation that scores highest for a given task type. The agent's system prompt is enriched with these mappings before every episode.

**2. Task-specific technology overrides.** Some mappings differ by task. The broker family defaults to `rabbitmq` for simple async work but overrides to `kafka` for streaming-heavy tasks (`youtube_platform`, `ride_sharing`). The cache family defaults to `redis`, but `ml_platform` uses `feast` as the feature store. These overrides are resolved at inference time, not hard-coded.

**3. Bonus target lists.** Each task has a set of optional bonus components (auth, observability, rate-limiting, payment gateway, recommendation workers, etc.). The encyclopedia maps these back to concrete components the agent can `add` by name, closing the gap between abstract scoring criteria and executable actions.

**4. Enriched system prompt.** The system prompt injected into every agent call includes a structured summary of the relevant component families, their concrete implementations, and the integration patterns between them, drawn directly from the encyclopedia's integration pattern catalog.

This is what separates the agentic system from raw LLM inference. The model doesn't need to hallucinate that Kafka is a good fit for streaming: the encyclopedia tells it, and the Planner enforces it.

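The default-plus-override lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `encyclopedia_rules.py`: the dictionary layout and the names `DEFAULTS`, `OVERRIDES`, `resolve`, and `prompt_hint` are hypothetical, and mapping `ml_platform`'s cache family to `feast` is our reading of the override described in the text.

```python
from __future__ import annotations

# Defaults and overrides taken from the prose above; the layout is an assumption.
DEFAULTS = {"broker": "rabbitmq", "cache": "redis"}
OVERRIDES = {
    "youtube_platform": {"broker": "kafka"},
    "ride_sharing": {"broker": "kafka"},
    "ml_platform": {"cache": "feast"},
}


def resolve(family: str, task: str) -> str:
    """Return the concrete component for a family, preferring task overrides."""
    task_overrides = OVERRIDES.get(task, {})
    if family in task_overrides:
        return task_overrides[family]
    if family in DEFAULTS:
        return DEFAULTS[family]
    raise KeyError(f"unknown component family: {family!r}")


def prompt_hint(task: str, families: list[str]) -> str:
    """Render family -> component mappings as lines for a system prompt."""
    return "\n".join(f"{fam}: {resolve(fam, task)}" for fam in families)


hint = prompt_hint("ride_sharing", ["broker", "cache"])
```

Because the lookup happens per call with the task name as an argument, the same table serves every task, which matches the "resolved at inference time" behaviour described in point 2.
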

---

## Training: SFT first, then RL

A raw base model given an architecture task will output something that looks like an architecture discussion, not a sequence of `add` and `connect` commands. Before we could use reinforcement learning, we needed the model to speak the language of the environment.

We fine-tuned **Qwen2.5-3B-Instruct** using Unsloth with 4-bit QLoRA on a supervised dataset of 120 correct action sequences across all six task types. This is the SFT stage: it doesn't teach strategy, it teaches format. After SFT, the model reliably outputs `add postgres`, `connect broker worker`, `submit`, the vocabulary the environment understands.

**SFT training ran for 150 steps (10 epochs, batch size 8) on a T4 GPU in 13 minutes.** Cross-entropy loss dropped from 3.57 at step 10 to a plateau of ~0.006 by step 70, with a final average training loss of 0.208 (skewed by the steep early descent). Evaluated directly after training (single-shot greedy decode, no agentic loop), the SFT checkpoint achieved **0.997 average reward** across all six tasks: four tasks at 1.0, and `ecommerce_platform` and `ml_platform` at 0.99. SFT had saturated the task format.

We then ran GRPO on top of the SFT checkpoint to explore whether RL could improve priority ordering: required items before bonus items, required connections before submitting. Using HuggingFace TRL's GRPOTrainer, we ran the environment as the reward function: the model generates an action plan, the environment executes it, and the final score drives the weight update.

We trained for **50 GRPO steps from the SFT checkpoint** (batch size 4, ~10 minutes on T4). As the reward curve shows, GRPO improved the raw model's training-time scores, but when both checkpoints were evaluated through the full agentic inference stack, the SFT-only checkpoint outperformed SFT+GRPO (0.978 vs 0.928); the GRPO run degraded `ml_platform` significantly. This is discussed in detail in the section "Why SFT + agentic loop, not SFT + GRPO + agentic loop?" below.

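The "environment as reward function" idea can be sketched as a plain Python callable. Everything here is a toy stand-in: `REQUIRED`, `REQUIRED_EDGES`, `run_plan`, and `env_reward` are hypothetical names, the scoring is simplified to coverage at submit time, and the callable's shape (a list of completions in, one float per completion out) is our assumption about what a GRPO trainer expects, so check it against the TRL version you use.

```python
from __future__ import annotations

# Toy task spec (hypothetical); the real environment defines its own.
REQUIRED = {"api", "cache", "database"}
REQUIRED_EDGES = {("api", "cache"), ("cache", "database")}


def run_plan(plan: str) -> float:
    """Execute a newline-separated action plan in a toy environment and score it."""
    components: set[str] = set()
    edges: set[tuple[str, str]] = set()
    submitted = False
    for line in plan.strip().splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "add" and len(parts) == 2:
            components.add(parts[1])
        elif parts[0] == "connect" and len(parts) == 3:
            # Only record edges between components that already exist.
            if parts[1] in components and parts[2] in components:
                edges.add((parts[1], parts[2]))
        elif parts[0] == "submit":
            submitted = True
            break
    if not submitted:
        return 0.0  # assumption: a plan that never submits earns nothing
    have = len(REQUIRED & components) + len(REQUIRED_EDGES & edges)
    return have / (len(REQUIRED) + len(REQUIRED_EDGES))


def env_reward(completions: list[str], **kwargs) -> list[float]:
    """Reward callable: one environment score per sampled completion."""
    return [run_plan(c) for c in completions]


good_plan = "\n".join([
    "add api", "add cache", "add database",
    "connect api cache", "connect cache database", "submit",
])
scores = env_reward(completions=[good_plan, "add api\nsubmit", "add api"])
```

A callable like this would be passed to the trainer in place of a learned reward model; the extra `**kwargs` absorbs any additional fields the trainer forwards.
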

![SFT Loss Curve](https://huggingface.co/spaces/thepikachu/architecture-env/resolve/main/plots/loss_curve.png)
*SFT cross-entropy loss over 150 steps. Loss axis: cross-entropy. Step axis: training step.*

![Reward Curve](https://huggingface.co/spaces/thepikachu/architecture-env/resolve/main/plots/reward_curve.png)
*Average environment reward across all six tasks, comparing SFT-only and SFT+GRPO checkpoints over training. Reward axis: environment score [0, 1]. Step axis: training step.*

---

## The agentic layer: Planner, Critic, Negotiator

Here's where things got interesting. We found that the SFT model, when wrapped in a structured multi-agent inference loop informed by the encyclopedia, matched or exceeded GRPO on most tasks, with full interpretability of every decision.

The loop at every step:

**PlannerAgent**: deterministic, zero-hallucination. It reads the live environment state (present components, missing required items, missing connections, available bonus targets) and computes the ground-truth next action using the encyclopedia's priority ordering: required items → required connections → bonus items → submit. The planner's hint is injected into the LLM's prompt.

**LLM (the fine-tuned SFT model)**: generates its own action proposal informed by the planner hint. This is where domain knowledge from the encyclopedia pays off: the model has been trained on action sequences that reflect the same technology choices the encyclopedia prescribes.

**CriticAgent**: validates the LLM's proposal against the live environment state before execution. It checks: does the component exist before connecting? Is required work complete before bonus items? Is the missing-items list empty before submit? If the proposal fails any check, it's rejected without hitting the environment.

**NegotiatorAgent**: repairs rejected proposals using the planner as ground truth. Every rejection is logged with its reason (`critic_rejected: 'websocket_gateway' not present; add it before connecting → repair: add websocket_gateway`), creating a fully auditable decision trail.

The Critic was the most important piece. Without it, the LLM would occasionally submit with missing connections still in the queue, or attempt to connect components that hadn't been added yet. The Critic catches both, every step, before the environment ever sees the action.

---

## Results

**Agentic inference demo** (real LLM calls, no caching, run on 2026-04-25):

| Task | SFT, no agentic loop | SFT + agentic loop | Δ |
|------|---------------------|--------------------|---|
| chat_system | 0.930 | **1.000** | +0.070 |
| ecommerce_platform | 0.900 | **0.990** | +0.090 |
| youtube_platform | 0.930 | **1.000** | +0.070 |
| ride_sharing | 0.930 | 0.890 | −0.040 |
| ml_platform | 0.930 | **0.990** | +0.060 |
| url_shortener | 0.930 | **1.000** | +0.070 |
| **Average** | **0.925** | **0.978** | **+0.053** |

Five of six tasks hit 0.99 or above. The `ride_sharing` episode scored 0.890: the agent completed the required architecture correctly, but the Negotiator's deduplication guard did not prevent an extra bonus-item attempt that incurred a step penalty before submit. This is a known edge case in the Negotiator and not a model failure; the required architecture was fully complete before the penalty was incurred.

Bonus items collected in the improved run confirm the agent fully understands the encyclopedia's scoring model:

- `chat_system` → auth, observability, presence_service, notification_service
- `youtube_platform` → auth, observability, recommendation_worker
- `url_shortener` → auth, observability, rate_limiting
- `ml_platform` → auth, observability

---

## Why SFT + agentic loop, not SFT + GRPO + agentic loop?

We ran both pipelines end-to-end through the same Planner-Critic-Negotiator inference stack and compared the results directly.

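As a reminder of what that shared inference stack does at each step, here is a minimal sketch of the plan, critique, and repair cycle. It is an illustration under assumptions rather than the project's agent code: `State`, `plan`, `critique`, and `negotiate` are hypothetical names, bonus items are left out for brevity, and only two of the Critic's checks are modelled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical snapshot of the live environment state."""
    present: set[str] = field(default_factory=set)
    missing_components: list[str] = field(default_factory=list)
    missing_edges: list[tuple[str, str]] = field(default_factory=list)


def plan(state: State) -> str:
    """Planner: required components first, then required edges, then submit."""
    if state.missing_components:
        return f"add {state.missing_components[0]}"
    if state.missing_edges:
        src, dst = state.missing_edges[0]
        return f"connect {src} {dst}"
    return "submit"


def critique(action: str, state: State) -> str | None:
    """Critic: return a rejection reason, or None if the action is acceptable."""
    parts = action.split()
    if not parts:
        return "empty action"
    if parts[0] == "connect":
        if len(parts) != 3:
            return "malformed connect"
        for name in parts[1:]:
            if name not in state.present:
                return f"'{name}' not present; add it before connecting"
    if parts[0] == "submit" and (state.missing_components or state.missing_edges):
        return "required work incomplete; cannot submit"
    return None


def negotiate(proposal: str, state: State) -> tuple[str, str | None]:
    """Negotiator: keep a valid proposal, otherwise fall back to the planner."""
    reason = critique(proposal, state)
    if reason is None:
        return proposal, None
    return plan(state), reason


state = State(
    present={"broker"},
    missing_components=["websocket_gateway"],
    missing_edges=[("websocket_gateway", "broker")],
)
action, reason = negotiate("connect websocket_gateway broker", state)
```

In this sketch the rejected `connect` is replaced by the planner's `add websocket_gateway`, and the returned reason is what would be written to the audit log.
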

**SFT + agentic loop** averaged **0.978** across all six tasks, five at 0.99 or above. **SFT + GRPO + agentic loop** averaged **0.928**: GRPO made things worse, not better.

| Task | SFT + agentic | SFT + GRPO + agentic | Δ |
|------|--------------|----------------------|---|
| url_shortener | 1.000 | 1.000 | 0.000 |
| chat_system | 1.000 | 1.000 | 0.000 |
| ecommerce_platform | 0.990 | 0.990 | 0.000 |
| youtube_platform | 1.000 | 1.000 | 0.000 |
| ride_sharing | 0.890 | 1.000 | +0.110 |
| ml_platform | 0.990 | 0.580 | −0.410 |
| **Average** | **0.978** | **0.928** | **−0.050** |

The regression on `ml_platform` is the clearest signal. The SFT+GRPO checkpoint submitted with `storage` still missing, zero connections made, and three required edges completely absent, for a score of 0.58. The SFT checkpoint through the same agentic loop scored 0.99 on the same task. GRPO disrupted the LoRA weights enough to break the model's ability to complete hard tasks, while adding nothing on the four tasks where both checkpoints already scored identically.

Why does this happen? GRPO is designed to teach a model from scratch how to explore an environment through reward signals. When the SFT model already knows the correct action format and ordering, GRPO has little reward variance to learn from and risks overfitting the adapter weights to minor fluctuations, which is exactly what happened on `ml_platform`.

The right production system uses both: GRPO for the weights when starting from a weaker base, Planner-Critic-Negotiator for the inference guard regardless. That's the roadmap.

---

## What this enables

Architecture design that currently takes engineering teams hours of back-and-forth can be reduced to a structured, verifiable, automatable loop. The environment exposes a clean interface: describe the system, get a design, get a score. The agentic layer ensures the design is complete before submission. The trained model brings domain knowledge. The encyclopedia brings precision about which technologies fit which workloads.

The deeper point is about the pattern, not the specific task. Any domain where design decisions can be verified (infrastructure templates, data pipeline architecture, API schema design, security policy configuration) is a candidate for this approach. Build an environment with a deterministic composable reward, compile your domain knowledge into a structured encyclopedia, fine-tune a model on the action vocabulary, wrap it in a structured inference loop. The review loop doesn't disappear; it moves into the Critic, where it runs in milliseconds instead of minutes.

---

## Compliance checklist

| Requirement | Status |
|-------------|--------|
| Built on OpenEnv (latest release) | ✅ `openenv-core`, FastAPI, `openenv.yaml` |
| Working training script (Unsloth + TRL) as Colab notebook | ✅ [`ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb`](https://huggingface.co/spaces/thepikachu/architecture-env/blob/main/notebooks/ArchitectureEnv_SFT_HF_Deploy_Notebook.ipynb) |
| Evidence of real training (loss + reward plots) | ✅ `plots/loss_curve.png`, `plots/reward_curve.png`, committed to repo |
| Mini-blog writeup | ✅ This post (also at `Blog.md` in Space repo) |
| Environment pushed to HuggingFace Space | ✅ https://huggingface.co/spaces/thepikachu/architecture-env |
| README with problem motivation, env explanation, results | ✅ Linked from Space |
| All materials linked from README | ✅ Space, model, notebook, blog |

---

## Stack

- **Environment**: OpenEnv + FastAPI + Pydantic
- **Base model**: Qwen2.5-3B-Instruct (4-bit QLoRA via Unsloth)
- **Training**: Unsloth + HuggingFace TRL (SFTTrainer + GRPOTrainer)
- **Knowledge layer**: `encyclopedia_rules.py`: 40+ component families, family→concrete mappings, task broker overrides, bonus targets, enriched system prompts
- **Agents**: PlannerAgent, CriticAgent, NegotiatorAgent (pure Python, deterministic, no extra model calls)
- **Deployment**: HuggingFace Spaces (Docker)

---

*Code, training notebook, and logs: https://huggingface.co/spaces/thepikachu/architecture-env*