Teaching an LLM to Survive Node.js Dependency Hell using RL and OpenEnv
If you are a JavaScript developer, you have seen this wall of red text:
```
npm ERR! code ERESOLVE
npm ERR! ERESOLVE unable to resolve dependency tree
```
Fixing these peer-dependency conflicts usually involves twenty minutes of frantic Googling, manually downgrading packages, and praying your app still builds. Standard LLMs aren't much help either; they hallucinate versions because they treat Semantic Versioning as text generation rather than as a strict mathematical constraint.
For the Meta OpenEnv Hackathon, we decided to fix this by treating package resolution not as a chat prompt, but as a playable game. We built AutoResolve.
The Architecture: OpenEnv + GRPO
We utilized the OpenEnv framework to build a strict, Gym-style Python environment (`env.py` and `openenv.yaml`) that acts as a mock NPM registry.
Instead of just telling an LLM the answer, we let it play inside this registry. We deployed Llama-3 8B (using Unsloth for 4-bit quantization) and trained it using Hugging Face TRL’s Group Relative Policy Optimization (GRPO).
How the Environment Works
- The State: The environment generates a broken `package.json` alongside a realistic NPM error trace.
- The Action: The agent acts as a greedy 1-step optimizer. It must output a strict JSON payload, for example: `{"package_to_update": "react", "new_version": "^18.0.0"}`.
- The Reward: The environment validates the action against the mock registry (see the scoring sketch after this list):
  - Fixing the tree grants +50.
  - Hallucinating a fake package yields -100.
  - Forgetting the caret (`^`) symbol yields -5.
Overcoming "Reward Hacking"
During training, we encountered a classic RL phenomenon: Reward Hacking. Our environment initially validated only the major version number, so the AI figured out it could earn maximum points while saving generation tokens by dropping the caret symbol (`^`). Rather than wasting compute on a complete retraining cycle, we implemented a lightweight post-processing formatter (a standard practice in production LLM pipelines) to re-inject the caret.
The Results: 93.3% Zero-Shot Accuracy
Training an RL agent on the entire 2-million-package NPM registry requires a massive compute cluster. To prove our architecture within the hackathon timeline, we curated a "Mega Registry" containing 5 distinct ecosystems (React, Vue, Express, Webpack, Mongoose).
To evaluate the model, we generated 15 rigorous, blind test cases spanning cross-ecosystem conflicts.
The agent scored 93.3% (14/15) accuracy. The single edge-case failure was an attempt to update `babel-loader` instead of its missing peer `@babel/core`, a known artifact of our MVP's "greedy 1-step optimizer" design, which we plan to resolve in V2 using Monte Carlo Tree Search (MCTS) for multi-step rollouts.
Conclusion
By utilizing OpenEnv, we proved that Reinforcement Learning can teach an LLM to respect rigid mathematical constraints. The architecture is complete; scaling to the full 2-million-package NPM registry requires only deploying the same pipeline to a production GPU cluster.
Explore the Project:
- 🧠 Try the Gradio UI / View Training Code (<- Insert Colab Link)
- 📦 View the OpenEnv Space (<- Insert HF Space Link)
- 🚀 Download the Weights