# Teaching an LLM to Survive Node.js Dependency Hell using RL and OpenEnv

If you are a JavaScript developer, you have seen this wall of red text:

`npm ERR! code ERESOLVE`
`npm ERR! ERESOLVE unable to resolve dependency tree`

Fixing these peer-dependency conflicts usually involves twenty minutes of frantic Googling, manually downgrading packages, and praying your app still builds. Standard LLMs aren't much help either: they hallucinate versions because they treat Semantic Versioning as text generation rather than as a strict mathematical constraint.
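For context, a minimal `package.json` that reproduces this class of error might look like the following (versions are illustrative): `react-dom@18` declares a peer dependency on `react@^18`, which conflicts with the pinned React 16.

```json
{
  "dependencies": {
    "react": "^16.14.0",
    "react-dom": "^18.2.0"
  }
}
```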
For the **Meta OpenEnv Hackathon**, we decided to fix this by treating package resolution not as a chat prompt, but as a playable game. We built **AutoResolve**.

## The Architecture: OpenEnv + GRPO

We used the OpenEnv framework to build a strict, Gym-style Python environment (`env.py` and `openenv.yaml`) that acts as a mock NPM registry.

Instead of just telling an LLM the answer, we let it play inside this registry. We deployed **Llama-3 8B** (using Unsloth for 4-bit quantization) and trained it with Hugging Face TRL's **Group Relative Policy Optimization (GRPO)**.
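The training setup can be sketched as follows. This is a minimal illustration, not our exact code: the model checkpoint name, config values, and the simple format-only reward shown here are assumptions (the full reward comes from the environment, described below). Check the TRL version you have installed, as the `GRPOTrainer` API has evolved.

```python
# Sketch of the GRPO training setup with Hugging Face TRL.
import json

def format_reward(completions, **kwargs):
    """Minimal reward: the completion must be a strict JSON action
    with the two required keys. TRL reward functions receive a list
    of completions and return one float per completion."""
    rewards = []
    for text in completions:
        try:
            action = json.loads(text)
        except json.JSONDecodeError:
            rewards.append(-1.0)
            continue
        ok = isinstance(action, dict) and \
            {"package_to_update", "new_version"} <= action.keys()
        rewards.append(1.0 if ok else -1.0)
    return rewards

# GPU-only wiring, shown for completeness:
# from trl import GRPOConfig, GRPOTrainer
# trainer = GRPOTrainer(
#     model="unsloth/llama-3-8b-bnb-4bit",   # 4-bit Unsloth checkpoint
#     reward_funcs=[format_reward],           # plus the env-based reward
#     args=GRPOConfig(output_dir="autoresolve-grpo"),
#     train_dataset=dataset,                  # rows with a "prompt" column
# )
# trainer.train()
```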
### How the Environment Works

1. **The State:** The environment generates a broken `package.json` alongside a realistic NPM error trace.
2. **The Action:** The agent acts as a greedy 1-step optimizer. It must output a strict JSON payload, for example: `{"package_to_update": "react", "new_version": "^18.0.0"}`.
3. **The Reward:** The environment validates the action against the mock registry.
   - Fixing the tree grants **+50**.
   - Hallucinating a fake package yields **-100**.
   - Forgetting the caret (`^`) symbol yields **-5**.
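The loop above can be condensed into a Gym-style sketch. The class name, registry contents, and the small penalty for a real-but-wrong version are illustrative assumptions, not copied from `env.py`; the three headline rewards match the table above.

```python
# Condensed sketch of the mock-registry environment (illustrative).
import json

MOCK_REGISTRY = {
    "react": {"versions": ["^17.0.0", "^18.0.0"]},
    "react-dom": {"versions": ["^18.0.0"], "peer": {"react": "^18.0.0"}},
}

class NpmResolveEnv:
    def reset(self):
        """State: a broken package.json plus a realistic error trace."""
        self.state = {
            "package_json": {"dependencies": {"react": "^16.14.0",
                                              "react-dom": "^18.2.0"}},
            "error": "npm ERR! ERESOLVE unable to resolve dependency tree",
        }
        return self.state

    def step(self, action_json: str):
        """Action: one strict JSON update. Returns (state, reward, done)."""
        try:
            action = json.loads(action_json)
            pkg, ver = action["package_to_update"], action["new_version"]
        except (json.JSONDecodeError, KeyError, TypeError):
            return self.state, -100, True   # unparseable action
        if pkg not in MOCK_REGISTRY:
            return self.state, -100, True   # hallucinated package
        if not ver.startswith("^"):
            return self.state, -5, True     # missing caret
        if ver in MOCK_REGISTRY[pkg]["versions"]:
            return self.state, +50, True    # tree fixed
        return self.state, -10, True        # real package, wrong version (assumed penalty)
```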
### Overcoming "Reward Hacking"

During training, we ran into a classic RL failure mode: reward hacking. Our environment initially validated only the major version number, so the model discovered it could earn maximum reward while saving output tokens by dropping the caret symbol (`^`). Rather than spending compute on a full retraining cycle, we added a lightweight post-processing formatter, a standard practice in production LLM pipelines, to re-inject the caret.
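The formatter is a few lines of string hygiene. This is a minimal sketch (function name and the exact regex are ours, not the project's code): it prepends `^` to bare `x.y.z` versions and leaves everything else untouched.

```python
# Sketch of the caret re-injection post-processor.
import json
import re

def normalize_action(raw: str) -> str:
    """Ensure the model's version string carries a caret range."""
    action = json.loads(raw)
    ver = action["new_version"].strip()
    # Prepend "^" only to bare semver versions like "18.0.0".
    if re.fullmatch(r"\d+\.\d+\.\d+", ver):
        action["new_version"] = "^" + ver
    return json.dumps(action)
```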
## The Results: 93.3% Zero-Shot Accuracy

Training an RL agent on the entire 2-million-package NPM registry requires a massive compute cluster. To prove our architecture within the hackathon timeline, we curated a "Mega Registry" containing 5 distinct ecosystems (React, Vue, Express, Webpack, Mongoose).

To evaluate the model, we generated 15 rigorous, blind test cases spanning cross-ecosystem conflicts.
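The evaluation harness itself is simple; a sketch is below. The `agent` callable and the test-case shape are hypothetical stand-ins for our actual evaluation code.

```python
# Hypothetical harness: `agent` maps a broken package.json prompt to a
# proposed fix; each blind case stores the known-good fix for comparison.
def evaluate(agent, test_cases):
    """Return zero-shot accuracy over blind test cases."""
    passed = sum(1 for case in test_cases if agent(case["broken"]) == case["fix"])
    return passed / len(test_cases)
```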
**The agent scored 93.3% (14/15) accuracy.** The single edge-case failure was an attempt to update `babel-loader` instead of its missing peer `@babel/core`, a known artifact of our MVP's "greedy 1-step optimizer" design, which we plan to resolve in V2 using Monte Carlo Tree Search (MCTS) for multi-step rollouts.
## Conclusion

Using OpenEnv, we showed that reinforcement learning can teach an LLM to respect rigid mathematical constraints. The architecture is complete; scaling to the entire NPM registry is a matter of deploying the same pipeline on a production GPU cluster.
**Explore the Project:**

* 🧠 **[Try the Gradio UI / View Training Code](#)** *(<- Insert Colab Link)*
* 📦 **[View the OpenEnv Space](#)** *(<- Insert HF Space Link)*
* 🚀 **[Download the Weights](https://huggingface.co/ArpitBaliyan/npm-resolver-rl-model)**