# Teaching an LLM to Survive Node.js Dependency Hell using RL and OpenEnv

If you are a JavaScript developer, you have seen this wall of red text:

```
npm ERR! code ERESOLVE
npm ERR! ERESOLVE unable to resolve dependency tree
```

Fixing these peer-dependency conflicts usually involves twenty minutes of frantic Googling, manually downgrading packages, and praying your app still builds. Standard LLMs aren't much help either; they hallucinate versions because they treat Semantic Versioning as text generation, rather than as a strict mathematical constraint.

For the **Meta OpenEnv Hackathon**, we decided to fix this by treating package resolution not as a chat prompt, but as a playable game. We built **AutoResolve**.

## The Architecture: OpenEnv + GRPO

We used the OpenEnv framework to build a strict, Gym-style Python environment (`env.py` and `openenv.yaml`) that acts as a mock NPM registry. Instead of just telling an LLM the answer, we let it play inside this registry.

We deployed **Llama 3 8B** (using Unsloth for 4-bit quantization) and trained it using Hugging Face TRL's **Group Relative Policy Optimization (GRPO)**.

### How the Environment Works

1. **The State:** The environment generates a broken `package.json` alongside a realistic NPM error trace.
2. **The Action:** The agent acts as a greedy 1-step optimizer. It must output a strict JSON payload, for example: `{"package_to_update": "react", "new_version": "^18.0.0"}`.
3. **The Reward:** The environment validates the action against the mock registry:
   - Fixing the tree grants **+50**.
   - Hallucinating a fake package yields **-100**.
   - Forgetting the caret (`^`) symbol yields **-5**.

(Minimal code sketches of this loop, and of how it plugs into GRPO, appear at the end of this post.)

### Overcoming "Reward Hacking"

During training, we encountered a classic RL phenomenon: reward hacking. Our environment initially validated only the major version numbers, so the AI figured out it could achieve maximum points, while saving token-generation time, by dropping the caret symbol (`^`). Rather than wasting compute on a complete retraining cycle, we implemented a lightweight post-processing formatter (a standard practice in production LLM pipelines) to re-inject the caret; a sketch of it also appears at the end of this post.

## The Results: 93.3% Zero-Shot Accuracy

Training an RL agent on the entire 2-million-package NPM registry requires a massive compute cluster. To prove our architecture within the hackathon timeline, we curated a "Mega Registry" containing 5 distinct ecosystems (React, Vue, Express, Webpack, Mongoose).

To evaluate the model, we generated 15 rigorous, blind test cases spanning cross-ecosystem conflicts. **The agent scored 93.3% (14/15) accuracy.**

The single edge-case failure was an attempt to update `babel-loader` instead of its missing peer `@babel/core`, a known artifact of our MVP's greedy 1-step optimizer design, which we plan to resolve in V2 using Monte Carlo Tree Search (MCTS) for multi-step rollouts.

## Conclusion

By building on OpenEnv, we showed that Reinforcement Learning can teach an LLM to respect rigid mathematical constraints. The architecture is complete end to end; scaling to the entire NPM registry only requires running the same pipeline against the full registry on a production GPU cluster.

**Explore the Project:**

* 🧠 **[Try the Gradio UI / View Training Code](#)** *(<- Insert Colab Link)*
* 📦 **[View the OpenEnv Space](#)** *(<- Insert HF Space Link)*
* 🚀 **[Download the Weights](https://huggingface.co/ArpitBaliyan/npm-resolver-rl-model)**
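
For readers who want to poke at the setup, here is a minimal, self-contained sketch of the single-step environment loop described under "How the Environment Works". The class name, the toy two-package registry, and the `0.0` reward for a legal-but-non-fixing move are our illustrative assumptions, not the project's actual `env.py`:

```python
# Minimal sketch of the Gym-style loop: one episode = one broken
# package.json. Class name, toy registry, and the 0.0 reward for a legal
# but non-fixing move are illustrative assumptions, not the real env.py.
import json

TOY_REGISTRY = {
    "react": {"^17.0.0": {}, "^18.0.0": {}},
    "react-dom": {"^18.0.0": {"react": "^18.0.0"}},  # peer dependency
}

class NpmResolveEnv:
    def reset(self) -> str:
        # Emit a broken package.json plus an ERESOLVE-style error trace.
        self.deps = {"react": "^17.0.0", "react-dom": "^18.0.0"}
        self.fix = ("react", "^18.0.0")  # the move that resolves this tree
        return (
            "npm ERR! ERESOLVE unable to resolve dependency tree\n"
            + json.dumps({"dependencies": self.deps}, indent=2)
        )

    def step(self, action: str):
        """Score one strict-JSON action; episodes are single-step (greedy)."""
        try:
            move = json.loads(action)
            pkg, ver = str(move["package_to_update"]), str(move["new_version"])
        except (json.JSONDecodeError, KeyError, TypeError):
            return None, -100.0, True, {"error": "malformed payload"}

        if pkg not in TOY_REGISTRY:
            reward = -100.0  # hallucinated a fake package
        elif (pkg, ver.lstrip("^")) == (self.fix[0], self.fix[1].lstrip("^")):
            reward = 50.0 - (0.0 if ver.startswith("^") else 5.0)  # caret check
        else:
            reward = 0.0     # real package, wrong move (value is our assumption)
        return None, reward, True, {}

env = NpmResolveEnv()
print(env.reset())
_, reward, done, _ = env.step(
    '{"package_to_update": "react", "new_version": "^18.0.0"}'
)
print(reward, done)  # 50.0 True
```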
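
The caret re-injection from the reward-hacking section can be as small as the function below. The post only says a lightweight post-processor re-injects the caret; the regex and function name here are our assumptions about one obvious way to do it:

```python
# Sketch of a caret re-injection formatter. The regex and function name
# are assumptions, not the project's exact post-processing code.
import json
import re

BARE_SEMVER = re.compile(r"^\d+\.\d+\.\d+$")  # e.g. "18.0.0", no range prefix

def reinject_caret(raw_action: str) -> str:
    """If the model emitted a bare semver, prepend '^' before validation."""
    move = json.loads(raw_action)
    version = move.get("new_version", "")
    if isinstance(version, str) and BARE_SEMVER.match(version):
        move["new_version"] = "^" + version
    return json.dumps(move)

print(reinject_caret('{"package_to_update": "react", "new_version": "18.0.0"}'))
# -> {"package_to_update": "react", "new_version": "^18.0.0"}
```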
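
Finally, a sketch of how such an environment reward can be wired into TRL's `GRPOTrainer`, assuming the `NpmResolveEnv` sketch from this appendix. Exact argument names vary across TRL versions, the model id is a stand-in, and the Unsloth 4-bit loading step is omitted for brevity:

```python
# Sketch: wiring the mock-registry reward into TRL's GRPOTrainer.
# Assumes the NpmResolveEnv sketch in this appendix; Unsloth 4-bit
# loading is omitted, and TRL argument names vary by version.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

env = NpmResolveEnv()
train_dataset = Dataset.from_dict({"prompt": [env.reset() for _ in range(256)]})

def resolver_reward(completions, **kwargs):
    # GRPO compares a group of completions per prompt; we score each one
    # by replaying it against a fresh copy of the mock-registry episode.
    rewards = []
    for completion in completions:
        episode = NpmResolveEnv()
        episode.reset()
        _, reward, _, _ = episode.step(completion)
        rewards.append(reward)
    return rewards

trainer = GRPOTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # stand-in model id
    reward_funcs=resolver_reward,
    args=GRPOConfig(output_dir="npm-resolver-grpo", num_generations=8),
    train_dataset=train_dataset,
)
trainer.train()
```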