# Teaching an AI to Plan a Wedding: Reinforcement Learning Meets Indian Culture

*AR'26 Meta OpenEnv Hackathon — Submission Writeup*

---

## The Setup

We asked a state-of-the-art 7B language model to plan a 3-day Indian wedding. Here's what it did:

- Booked the DJ before finding a venue
- Scheduled the sacred *Pheras* ceremony at 4:15 PM — smack in the middle of *Rahu Kaal*, the most inauspicious time of day
- Finalized the plan after 3 steps, having left two entire wedding days completely empty

This wasn't a fluke. We ran it 50 times. Same story every time.

The problem isn't that the model is "dumb." The problem is that **zero-shot LLM generation fundamentally cannot handle long-horizon, stateful, constraint-dense planning**. The model doesn't track its remaining budget across 20 decisions. It doesn't know that *Haldi* must happen before *Pheras*. It has no feedback loop to catch its own mistakes.

We needed to change the paradigm. Instead of asking the model to generate a plan, we needed to train it to **act** inside a world.

---

## The Environment

We built `WeddingPlannerEnv`, a stateful, programmatic simulation of a 3-day Indian wedding planning process, on top of the **OpenEnv** framework.

The agent operates in a sequential loop:

1. **Observe** the current state: remaining budget, city, guest count, auspicious *Muhurat* windows, booked events, and any active conflicts.
2. **Act** by outputting a single structured JSON action (e.g., book a caterer, schedule the priest).
3. **React** to the consequences: the environment deducts from the budget, detects scheduling conflicts, and surfaces a new observation.
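To make the observe/act/react loop concrete, here is a minimal sketch. `ToyWeddingEnv`, `scripted_policy`, the action fields (`type`, `event`, `cost`) and the observation keys are all hypothetical stand-ins chosen for illustration; they are not the real `WeddingPlannerEnv` schema, and the scripted policy takes the place of the LLM.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyWeddingEnv:
    """Hypothetical stand-in for WeddingPlannerEnv (not the real API)."""
    budget: float = 1_000_000.0
    max_steps: int = 20
    booked: list = field(default_factory=list)
    steps: int = 0

    def observe(self) -> dict:
        # The real observation also carries city, guest count, Muhurat windows, etc.
        return {
            "remaining_budget": self.budget,
            "booked_events": list(self.booked),
            "active_conflicts": [],
        }

    def step(self, action_json: str):
        """Apply one JSON action and return (observation, done)."""
        action = json.loads(action_json)
        self.steps += 1
        if action["type"] == "book_event":
            self.budget -= action["cost"]
            self.booked.append(action["event"])
        done = action["type"] == "finalize_plan" or self.steps >= self.max_steps
        return self.observe(), done


def scripted_policy(obs: dict) -> str:
    """Placeholder for the LLM: book Haldi, then Pheras, then finalize."""
    for event in ("haldi", "pheras"):
        if event not in obs["booked_events"]:
            return json.dumps({"type": "book_event", "event": event, "cost": 50_000})
    return json.dumps({"type": "finalize_plan"})


def run_episode(env: ToyWeddingEnv, policy) -> dict:
    """Observe, act, react until the episode ends; return the last observation."""
    obs, done = env.observe(), False
    while not done:
        obs, done = env.step(policy(obs))
    return obs


final_obs = run_episode(ToyWeddingEnv(), scripted_policy)
print(final_obs)
```

The point of the structure is that the policy only ever sees the latest observation and emits one action at a time, so every decision is made with the updated budget and bookings in view.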

The environment enforces real-world constraints through three internal APIs:

- **VendorAPI**: Checks availability, city coverage, and capacity. Calculates cost dynamically as `price_per_head × guest_count`.
- **CalendarAPI**: Looks up date-specific *Muhurat* windows and flags *Rahu Kaal* violations.
- **ConflictDetector**: Sweeps the full itinerary after every step, checking for double-bookings, ritual ordering violations (*Pheras* before *Haldi* is illegal), and missing catering for large events.
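The cost formula and the conflict sweep described above can be sketched as follows. This is an illustration under stated assumptions, not the environment's actual code: the `Booking` fields, the conflict label strings, and the `catering_threshold` of 100 guests for a "large" event are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Booking:
    event: str          # e.g. "haldi", "pheras"
    day: int            # wedding day, 1..3
    start_min: int      # minutes since midnight
    end_min: int
    guests: int
    has_catering: bool = False


def vendor_cost(price_per_head: float, guest_count: int) -> float:
    """Dynamic vendor cost: price_per_head x guest_count."""
    return price_per_head * guest_count


def detect_conflicts(itinerary, catering_threshold: int = 100) -> list:
    """Sweep the full itinerary and return a list of conflict labels."""
    conflicts = []

    # Double-bookings: two events on the same day whose time ranges overlap.
    for i, a in enumerate(itinerary):
        for b in itinerary[i + 1:]:
            if a.day == b.day and a.start_min < b.end_min and b.start_min < a.end_min:
                conflicts.append(f"double_booking:{a.event}/{b.event}")

    # Ritual ordering: Pheras must not start at or before Haldi.
    start = {b.event: (b.day, b.start_min) for b in itinerary}
    if "pheras" in start and "haldi" in start and start["pheras"] <= start["haldi"]:
        conflicts.append("ritual_order:pheras_before_haldi")

    # Missing catering for large events (threshold is an assumed value).
    for b in itinerary:
        if b.guests >= catering_threshold and not b.has_catering:
            conflicts.append(f"missing_catering:{b.event}")

    return conflicts


plan = [
    Booking("pheras", day=1, start_min=540, end_min=660, guests=300, has_catering=True),
    Booking("haldi", day=2, start_min=600, end_min=720, guests=300),
]
print(vendor_cost(1200.0, 300))
print(detect_conflicts(plan))
```

Running the sweep over the whole itinerary after every step, rather than checking only the newest booking, is what lets a late action expose a conflict with an early one.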

Any conflict detected immediately surfaces in `active_conflicts` of the next observation — the agent must self-correct. Nothing is auto-resolved.

---

## The Reward Function

At the end of every episode (either on `finalize_plan` or after 20 steps), the environment computes a **composite reward** weighted across 5 objectives:

```python
WEIGHTS = {
    "coverage":  0.35,   # Are all 5 event types booked?
    "budget":    0.25,   # Efficient use of budget without going into debt?
    "muhurat":   0.20,   # Is the ceremony inside the auspicious window?
    "conflicts": 0.10,   # Zero active conflicts at finalization?
    "guest_ux":  0.10,   # Did guests get a diverse, complete experience?
}
```

Why multi-objective? Because a single binary reward is easy to game. With 5 separately scored signals, the model has to balance trade-offs: no single objective carries more than 35% of the reward, so maximizing one metric while neglecting the rest caps the total well below 1.0.
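As a rough sketch of how these weights could combine, assume each objective is scored in the range 0 to 1 and the composite is a plain weighted sum. The `composite_reward` function, the clamping, and the example scores are assumptions made for illustration; only the `WEIGHTS` table comes from the environment.

```python
WEIGHTS = {
    "coverage":  0.35,
    "budget":    0.25,
    "muhurat":   0.20,
    "conflicts": 0.10,
    "guest_ux":  0.10,
}


def composite_reward(scores: dict) -> float:
    """Weighted sum of per-objective scores, each clamped to [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing objective scores: {sorted(missing)}")
    return sum(w * min(1.0, max(0.0, scores[k])) for k, w in WEIGHTS.items())


# A plan with perfect timing, budget and no conflicts, but only 1 of 5
# event types booked (coverage = 0.2, an illustrative value).
lopsided = {"coverage": 0.2, "budget": 1.0, "muhurat": 1.0, "conflicts": 1.0, "guest_ux": 1.0}
print(round(composite_reward(lopsided), 2))
```

Under this assumed formulation, the lopsided plan loses 0.28 of the available reward from coverage alone, which is the kind of penalty that discourages finalizing early.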

---

## The Training

We used **Group Relative Policy Optimization (GRPO)** from Hugging Face TRL, with **Unsloth** for 4-bit QLoRA training on a single L4 GPU.

The key design decision: GRPO samples multiple rollouts (action completions) for each prompt and scores each one *relative* to the others in its group, so a rollout's advantage is its reward measured against the group average. Above-average actions get reinforced and below-average ones get suppressed, without training a separate value model.
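The group-relative comparison can be illustrated with a few lines of arithmetic. This sketch uses one common formulation, subtracting the group mean and dividing by the group standard deviation; the exact normalization (population vs. sample standard deviation, the value of `eps`) varies between implementations and is an assumption here, not a description of TRL's internals.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps: float = 1e-4):
    """Normalize each rollout's reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same prompt, with illustrative rewards.
advantages = group_relative_advantages([0.2, 0.4, 0.4, 0.6])
print(advantages)
```

Note that when every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason a graded, multi-objective reward helps.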

We seeded 200 unique episodes (randomized city, guest count, budget) and trained for 3 epochs / 300 steps. Each prompt generated 4 competing rollouts.

One important engineering fix: the original plan routed all reward evaluations through a single shared environment server. But GRPO's concurrent rollouts would all be stepping the same stateful environment simultaneously, contaminating each other's rewards. We solved this by instantiating a **completely isolated local environment per rollout**, reconstructing state from the prompt payload.
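A minimal sketch of the per-rollout isolation idea follows. The `(prompts, completions, **kwargs)` signature matches the shape TRL's `GRPOTrainer` expects from a custom reward function; everything else is hypothetical: `LocalWeddingEnv` is a toy stand-in for the real environment, the prompt is assumed to be the JSON state payload itself, and the +1 / -1 scoring is illustrative only.

```python
import json


class LocalWeddingEnv:
    """Hypothetical toy env rebuilt from a state dict (not the real class)."""

    def __init__(self, state: dict):
        self.budget = state["remaining_budget"]
        self.booked = list(state["booked_events"])  # copy: no shared lists

    def step(self, action: dict) -> float:
        if action.get("type") != "book_event":
            return 0.0
        if action["cost"] > self.budget or action["event"] in self.booked:
            return -1.0  # overdrawn or duplicate booking
        self.budget -= action["cost"]
        self.booked.append(action["event"])
        return 1.0


def wedding_reward(prompts, completions, **kwargs):
    """Score each completion in a fresh environment rebuilt from its prompt."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        try:
            action = json.loads(completion)
        except json.JSONDecodeError:
            rewards.append(-1.0)  # malformed JSON
            continue
        if not isinstance(action, dict):
            rewards.append(-1.0)  # valid JSON, but not an action object
            continue
        env = LocalWeddingEnv(json.loads(prompt))  # one isolated env per rollout
        rewards.append(env.step(action))
    return rewards
```

Because each rollout gets its own environment, two rollouts that both book the same event from the same starting state are scored identically, instead of the second one being penalized for the first one's booking.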

---

## The Results

After 300 training steps (~2.5 hours on an L4 GPU):

| Metric | Before | After |
|---|---|---|
| JSON output format | Hallucinated text | 100% strict JSON |
| Muhurat scheduling | Inside Rahu Kaal | Correctly in 08:30–11:00 window |
| Budget discipline | 60% wasted or overdrawn | Efficient, within limits |
| Active conflicts | 2–4 per episode | 0–1 per episode |
| **Mean episode reward** | **~0.21** | **~0.44 (+110%)** |

The reward curve showed a clean learning signal — slow initial progress as the model learned the JSON schema, followed by a steeper climb as it internalized the budget and scheduling constraints.

---

## What's Next

The current training used only `difficulty="easy"`. The plan is to run a curriculum, moving to medium and hard modes once the base policy stabilizes, with the goal of pushing reward well past 0.6 on hard scenarios.

The environment is live and open. Anyone can call `/reset` and `/step` against the hosted Space and connect their own agent to it.
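For anyone wiring up their own agent, a client might look roughly like the sketch below. Only the `/reset` and `/step` paths come from this writeup. The base URL is a local placeholder, and the POST method, JSON payload shapes, and helper names (`build_request`, `call`) are assumptions to be checked against the Space's actual API before use.

```python
import json
import urllib.request

# Placeholder: replace with the hosted Space's base URL.
BASE_URL = "http://localhost:8000"


def build_request(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request for an environment endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: dict, base_url: str = BASE_URL, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (performs network I/O)."""
    request = build_request(path, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed usage (payload shapes are hypothetical):
#   observation = call("/reset", {})
#   result = call("/step", {"action": {"type": "finalize_plan"}})
```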

---

## Links

- 🏟️ **Live Environment**: [huggingface.co/spaces/shivanandh033/wedding-planner-env](https://huggingface.co/spaces/shivanandh033/wedding-planner-env)
- 🤖 **Model Card**: [huggingface.co/shivanandh033/wedding-planner-7b](https://huggingface.co/shivanandh033/wedding-planner-7b)
- 📓 **Training Notebook**: [`train_colab.ipynb`](./train_colab.ipynb)

---

*Built for the AR'26 Meta OpenEnv Hackathon | Theme #3.1 — World Modeling: Professional Tasks*