# Teaching an AI to Plan a Wedding: Reinforcement Learning Meets Indian Culture

*AR'26 Meta OpenEnv Hackathon: Submission Writeup*

---

## The Setup

We asked a state-of-the-art 7B language model to plan a 3-day Indian wedding. Here's what it did:

- Booked the DJ before finding a venue
- Scheduled the sacred *Pheras* ceremony at 4:15 PM, smack in the middle of *Rahu Kaal*, the most inauspicious time of day
- Finalized the plan after 3 steps, having left two entire wedding days completely empty

This wasn't a fluke. We ran it 50 times. Same story every time.

The problem isn't that the model is "dumb." The problem is that **zero-shot LLM generation fundamentally cannot handle long-horizon, stateful, constraint-dense planning**. The model doesn't track its remaining budget across 20 decisions. It doesn't know that *Haldi* must happen before *Pheras*. It has no feedback loop to catch its own mistakes.

We needed to change the paradigm. Instead of asking the model to generate a plan, we needed to train it to **act** inside a world.

---

## The Environment

We built `WeddingPlannerEnv` using the **OpenEnv** framework: a stateful, programmatic simulation of a 3-day Indian wedding planning process.

The agent operates in a sequential loop:

1. **Observe** the current state: remaining budget, city, guest count, auspicious *Muhurat* windows, booked events, and any active conflicts.
2. **Act** by outputting a single structured JSON action (e.g., book a caterer, schedule the priest).
3. **React** to the consequences: the environment deducts from the budget, detects scheduling conflicts, and surfaces a new observation.

The environment enforces real-world constraints through three internal APIs:

- **VendorAPI**: Checks availability, city coverage, and capacity. Calculates cost dynamically as `price_per_head × guest_count`.
- **CalendarAPI**: Looks up date-specific *Muhurat* windows and flags *Rahu Kaal* violations.
- **ConflictDetector**: Sweeps the full itinerary after every step, checking for double-bookings, ritual ordering violations (*Pheras* before *Haldi* is illegal), and missing catering for large events.

Any detected conflict surfaces immediately in `active_conflicts` of the next observation, and the agent must self-correct. Nothing is auto-resolved.

---

## The Reward Function

At the end of every episode (either on `finalize_plan` or after 20 steps), the environment computes a **composite reward** weighted across 5 objectives:

```python
WEIGHTS = {
    "coverage":  0.35,  # Are all 5 event types booked?
    "budget":    0.25,  # Efficient use of budget without going into debt?
    "muhurat":   0.20,  # Is the ceremony inside the auspicious window?
    "conflicts": 0.10,  # Zero active conflicts at finalization?
    "guest_ux":  0.10,  # Did guests get a diverse, complete experience?
}
```

Why multi-objective? Because a single binary reward is easy to game. With 5 independent signals, the model must genuinely balance trade-offs: it can't spike one metric without paying in another.

---

## The Training

We used **Group Relative Policy Optimization (GRPO)** from Hugging Face TRL, with **Unsloth** for fast 4-bit QLoRA training on a single L4 GPU.

The key design decision: GRPO generates multiple rollouts (action completions) for each prompt and scores each one *relative* to the others in its group when computing gradients. This means the model is continuously self-comparing: slightly better actions get reinforced, worse ones get suppressed.

We seeded 200 unique episodes (randomized city, guest count, budget) and trained for 3 epochs / 300 steps. Each prompt step generated 4 competing rollouts.

One important engineering fix: the original plan routed all reward evaluations through a single shared environment server. But GRPO's concurrent rollouts would all be stepping the same stateful environment simultaneously, contaminating each other's rewards.
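
The contamination described above can be reproduced with a toy example. This is only a sketch: `ToyEnv`, its `step` signature, and the vendor costs are hypothetical stand-ins and not part of `WeddingPlannerEnv`. Two rollouts interleave their steps on one shared instance, and the resulting budget matches neither rollout's own plan.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a stateful planning environment."""
    budget: int = 100
    booked: list = field(default_factory=list)

    def step(self, vendor: str, cost: int) -> int:
        """Book a vendor, deduct its cost, and return the remaining budget."""
        self.booked.append(vendor)
        self.budget -= cost
        return self.budget


def run_alone(actions):
    """Replay one rollout on a fresh environment and return its final budget."""
    env = ToyEnv()
    for vendor, cost in actions:
        env.step(vendor, cost)
    return env.budget


# Two rollouts for the same prompt, each a list of (vendor, cost) actions.
rollout_a = [("venue", 60), ("caterer", 30)]
rollout_b = [("venue", 40), ("dj", 20)]

# One shared environment, with steps interleaved as concurrent rollouts would.
shared = ToyEnv()
for action_a, action_b in zip(rollout_a, rollout_b):
    shared.step(*action_a)
    shared.step(*action_b)
shared_budget = shared.budget

# What each rollout would have ended with on its own.
budget_a = run_alone(rollout_a)
budget_b = run_alone(rollout_b)

print(shared_budget, budget_a, budget_b)  # -50 10 40
```

Any reward computed from the shared state (here, a budget of -50 with two venues booked) would penalize both rollouts for actions neither took alone.
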
We solved this by instantiating a **completely isolated local environment per rollout**, reconstructing state from the prompt payload.

---

## The Results

After 300 training steps (~2.5 hours on an L4 GPU):

| Metric | Before | After |
|---|---|---|
| JSON output format | Hallucinated text | 100% strict JSON |
| Muhurat scheduling | Inside Rahu Kaal | Correctly in 08:30–11:00 window |
| Budget discipline | 60% wasted or overdrawn | Efficient, within limits |
| Active conflicts | 2–4 per episode | 0–1 per episode |
| **Mean episode reward** | **~0.21** | **~0.44 (+110%)** |

The reward curve showed a clean learning signal: slow initial progress as the model learned the JSON schema, followed by a steeper climb as it internalized the budget and scheduling constraints.

---

## What's Next

The current training used only `difficulty="easy"`. The plan is to run a curriculum, moving to medium and hard modes once the base policy stabilizes, and pushing reward well past 0.6 on hard scenarios.

The environment is live and open. Anyone can call `/reset` and `/step` against the hosted Space and connect their own agent to it.

---

## Links

- 🏟️ **Live Environment**: [huggingface.co/spaces/shivanandh033/wedding-planner-env](https://huggingface.co/spaces/shivanandh033/wedding-planner-env)
- 🤖 **Model Card**: [huggingface.co/shivanandh033/wedding-planner-7b](https://huggingface.co/shivanandh033/wedding-planner-7b)
- 📓 **Training Notebook**: [`train_colab.ipynb`](./train_colab.ipynb)

---

*Built for the AR'26 Meta OpenEnv Hackathon | Theme #3.1 (World Modeling: Professional Tasks)*