# Teaching an AI to Plan a Wedding: Reinforcement Learning Meets Indian Culture
*AR'26 Meta OpenEnv Hackathon: Submission Writeup*
---
## The Setup
We asked a state-of-the-art 7B language model to plan a 3-day Indian wedding. Here's what it did:
- Booked the DJ before finding a venue
- Scheduled the sacred *Pheras* ceremony at 4:15 PM, smack in the middle of *Rahu Kaal* (the most inauspicious time of day)
- Finalized the plan after 3 steps, having left two entire wedding days completely empty
This wasn't a fluke. We ran it 50 times. Same story every time.
The problem isn't that the model is "dumb." The problem is that **zero-shot LLM generation fundamentally cannot handle long-horizon, stateful, constraint-dense planning**. The model doesn't track its remaining budget across 20 decisions. It doesn't know that *Haldi* must happen before *Pheras*. It has no feedback loop to catch its own mistakes.
We needed to change the paradigm. Instead of asking the model to generate a plan, we needed to train it to **act** inside a world.
---
## The Environment
We built `WeddingPlannerEnv`, a stateful, programmatic simulation of a 3-day Indian wedding planning process, using the **OpenEnv** framework.
The agent operates in a sequential loop:
1. **Observe** the current state: remaining budget, city, guest count, auspicious *Muhurat* windows, booked events, and any active conflicts.
2. **Act** by outputting a single structured JSON action (e.g., book a caterer, schedule the priest).
3. **React** to the consequences: the environment deducts from the budget, detects scheduling conflicts, and surfaces a new observation.
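The loop above can be sketched with a toy stand-in. This is a minimal illustration, not the real `WeddingPlannerEnv`: the class `ToyWeddingEnv`, the action fields (`type`, `event`, `cost`), and the scripted policy are all assumptions made up for the example. Only `finalize_plan`, the 20-step cap, and `active_conflicts` come from the writeup.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyWeddingEnv:
    """Hypothetical stand-in for WeddingPlannerEnv (not the real class)."""
    budget: int = 1_000_000
    booked: list = field(default_factory=list)
    steps: int = 0
    max_steps: int = 20  # episode cap mentioned in the writeup

    def observe(self) -> dict:
        # Observation fields are a simplified, assumed subset.
        return {
            "remaining_budget": self.budget,
            "booked_events": list(self.booked),
            "active_conflicts": [],
        }

    def step(self, action_json: str):
        # The agent emits one JSON action per step (assumed schema).
        action = json.loads(action_json)
        self.steps += 1
        if action["type"] == "book":
            self.budget -= action["cost"]
            self.booked.append(action["event"])
        done = action["type"] == "finalize_plan" or self.steps >= self.max_steps
        return self.observe(), done


PLAN = [("haldi", 50_000), ("pheras", 120_000)]  # illustrative costs


def scripted_policy(obs: dict) -> str:
    """Book the next unbooked event, then finalize."""
    for event, cost in PLAN:
        if event not in obs["booked_events"]:
            return json.dumps({"type": "book", "event": event, "cost": cost})
    return json.dumps({"type": "finalize_plan"})


def run_episode(env: ToyWeddingEnv, policy) -> dict:
    obs, done = env.observe(), False
    while not done:
        obs, done = env.step(policy(obs))  # observe -> act -> react
    return obs
```

In training, the scripted policy would be replaced by the language model, which reads the observation as text and emits the JSON action.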
The environment enforces real-world constraints through three internal APIs:
- **VendorAPI**: Checks availability, city coverage, and capacity. Calculates cost dynamically as `price_per_head × guest_count`.
- **CalendarAPI**: Looks up date-specific *Muhurat* windows and flags *Rahu Kaal* violations.
- **ConflictDetector**: Sweeps the full itinerary after every step, checking for double-bookings, ritual ordering violations (*Pheras* before *Haldi* is illegal), and missing catering for large events.
Any detected conflict surfaces immediately in `active_conflicts` of the next observation, and the agent must self-correct. Nothing is auto-resolved.
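As a rough sketch of two of these checks, the snippet below computes a vendor cost and sweeps an itinerary for double-bookings and the *Haldi*-before-*Pheras* rule. The function names, the event dictionary shape (`name`, `day`, `start`, `end` in minutes), and the conflict labels are assumptions for illustration; the real `VendorAPI` and `ConflictDetector` interfaces are not shown in this writeup.

```python
from itertools import combinations


def vendor_cost(price_per_head: int, guest_count: int) -> int:
    """Cost rule from the writeup: price_per_head x guest_count."""
    return price_per_head * guest_count


def detect_conflicts(itinerary: list) -> list:
    """Sweep the full itinerary; each event is an assumed dict with
    'name', 'day', 'start', and 'end' (minutes since midnight)."""
    conflicts = []

    # Double-booking: two events on the same day with overlapping times.
    for a, b in combinations(itinerary, 2):
        if a["day"] == b["day"] and a["start"] < b["end"] and b["start"] < a["end"]:
            conflicts.append(f"double_booking:{a['name']}+{b['name']}")

    # Ritual ordering: Pheras must not start before Haldi has ended.
    by_name = {event["name"]: event for event in itinerary}
    if "haldi" in by_name and "pheras" in by_name:
        haldi, pheras = by_name["haldi"], by_name["pheras"]
        if (pheras["day"], pheras["start"]) < (haldi["day"], haldi["end"]):
            conflicts.append("ritual_order:pheras_before_haldi")

    return conflicts
```

Returning a list of labels rather than raising mirrors the behavior described above: conflicts are reported back to the agent, not resolved for it.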
---
## The Reward Function
At the end of every episode (either on `finalize_plan` or after 20 steps), the environment computes a **composite reward** weighted across 5 objectives:
```python
WEIGHTS = {
    "coverage": 0.35,   # Are all 5 event types booked?
    "budget": 0.25,     # Efficient use of budget without going into debt?
    "muhurat": 0.20,    # Is the ceremony inside the auspicious window?
    "conflicts": 0.10,  # Zero active conflicts at finalization?
    "guest_ux": 0.10,   # Did guests get a diverse, complete experience?
}
```
Why multi-objective? Because a single binary reward is easy to game. With 5 independent signals, the model must genuinely balance trade-offs: it can't spike one metric without paying in another.
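One plausible way to combine these weights is a plain weighted sum. The sketch below assumes each sub-score has already been normalized to the range 0 to 1; the per-objective scoring functions are not described in this writeup, and `composite_reward` is a hypothetical name.

```python
WEIGHTS = {
    "coverage": 0.35,
    "budget": 0.25,
    "muhurat": 0.20,
    "conflicts": 0.10,
    "guest_ux": 0.10,
}


def composite_reward(scores: dict) -> float:
    """Weighted sum of per-objective scores (assumed to lie in [0, 1])."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    # Clamp defensively so one out-of-range score cannot dominate.
    return sum(WEIGHTS[k] * min(max(scores[k], 0.0), 1.0) for k in WEIGHTS)
```

Under this formulation, an agent that maximizes coverage but scores zero everywhere else tops out at 0.35, which is the sense in which no single metric can carry the episode.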
---
## The Training
We used **Group Relative Policy Optimization (GRPO)** from Hugging Face TRL, with **Unsloth** for ultra-fast 4-bit QLoRA training on a single L4 GPU.
The key design decision: GRPO generates multiple rollouts (action completions) for each prompt and uses the *relative* reward difference between them to compute gradients. This means the model is continuously self-comparing: slightly better actions get reinforced, worse ones get suppressed.
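The "relative" part is commonly expressed as a group-normalized advantage: each rollout's reward minus the group mean, divided by the group standard deviation. The sketch below shows that arithmetic in plain Python. It is not TRL's internal code; the choice of population standard deviation and the `eps` value are assumptions, and libraries differ on both.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Normalize rewards within one prompt's group of rollouts.

    Rollouts above the group mean get a positive advantage (reinforced),
    those below get a negative one (suppressed).
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for the same prompt, as in this training run.
advantages = group_relative_advantages([0.2, 0.4, 0.4, 0.6])
```

Note that when every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason a graded, multi-objective reward is helpful here.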
We seeded 200 unique episodes (randomized city, guest count, budget) and trained for 3 epochs / 300 steps. Each prompt step generated 4 competing rollouts.
One important engineering fix: the original plan routed all reward evaluations through a single shared environment server. But GRPO's concurrent rollouts would all be stepping the same stateful environment simultaneously, contaminating each other's rewards. We solved this by instantiating a **completely isolated local environment per rollout**, reconstructing state from the prompt payload.
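The isolation idea can be sketched as follows: every completion is scored against its own freshly built environment, seeded from the state carried in the prompt. `LocalEnv`, the payload fields, and the pass/fail budget reward are hypothetical simplifications; the actual reward function and state format are not shown in this writeup.

```python
import copy
import json


class LocalEnv:
    """Hypothetical minimal environment, standing in for the real one."""

    def __init__(self, state: dict):
        # Deep copy so no two rollouts ever share mutable state.
        self.state = copy.deepcopy(state)

    def step(self, action: dict) -> dict:
        self.state["remaining_budget"] -= action.get("cost", 0)
        self.state["booked_events"].append(action["event"])
        return self.state


def score_rollouts(prompt_payload: str, completions: list) -> list:
    """Score each completion in its own environment instance."""
    base_state = json.loads(prompt_payload)  # state rebuilt from the prompt
    rewards = []
    for completion in completions:
        env = LocalEnv(base_state)  # fresh, isolated env per rollout
        try:
            state = env.step(json.loads(completion))
            # Toy reward: 1.0 if the action keeps the budget non-negative.
            rewards.append(1.0 if state["remaining_budget"] >= 0 else 0.0)
        except (json.JSONDecodeError, KeyError, AttributeError, TypeError):
            rewards.append(0.0)  # malformed output earns nothing
    return rewards
```

With a shared environment, the cost of one rollout's booking would leak into the next rollout's budget; here each completion is judged only against the state in its own prompt.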
---
## The Results
After 300 training steps (~2.5 hours on an L4 GPU):
| Metric | Before | After |
|---|---|---|
| JSON output format | Hallucinated text | 100% strict JSON |
| Muhurat scheduling | Inside Rahu Kaal | Inside the 08:30–11:00 window |
| Budget discipline | 60% wasted or overdrawn | Efficient, within limits |
| Active conflicts | 2–4 per episode | 0–1 per episode |
| **Mean episode reward** | **~0.21** | **~0.44 (+110%)** |
The reward curve showed a clean learning signal: slow initial progress as the model learned the JSON schema, followed by a steeper climb as it internalized the budget and scheduling constraints.
---
## What's Next
The current training used only `difficulty="easy"`. The plan is to run a curriculum, moving to medium and hard modes once the base policy stabilizes, with the goal of pushing reward well past 0.6 on hard scenarios.
The environment is live and open. Anyone can call `/reset` and `/step` against the hosted Space and connect their own agent to it.
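A client for those two endpoints might look roughly like the sketch below, using only the Python standard library. Everything beyond the `/reset` and `/step` paths is an assumption: `BASE_URL` is a placeholder to be replaced with the Space's actual API address, and the use of POST, the JSON body shapes, and the action fields should be checked against the environment's own documentation.

```python
import json
import urllib.request

# Placeholder: replace with the hosted Space's API base URL.
BASE_URL = "http://localhost:8000"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request (HTTP method and body shape are assumed)."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage against a running server (payloads are hypothetical):
# observation = post_json("/reset", {})
# result = post_json("/step", {"action": {"type": "finalize_plan"}})
```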
---
## Links
- 🏟️ **Live Environment**: [huggingface.co/spaces/shivanandh033/wedding-planner-env](https://huggingface.co/spaces/shivanandh033/wedding-planner-env)
- 🤖 **Model Card**: [huggingface.co/shivanandh033/wedding-planner-7b](https://huggingface.co/shivanandh033/wedding-planner-7b)
- 📓 **Training Notebook**: [`train_colab.ipynb`](./train_colab.ipynb)
---
*Built for the AR'26 Meta OpenEnv Hackathon | Theme #3.1 (World Modeling: Professional Tasks)*