# Teaching an AI to Plan a Wedding: Reinforcement Learning Meets Indian Culture

*AR'26 Meta OpenEnv Hackathon: Submission Writeup*

---

## The Setup

We asked a state-of-the-art 7B language model to plan a 3-day Indian wedding. Here's what it did:

- Booked the DJ before finding a venue
- Scheduled the sacred *Pheras* ceremony at 4:15 PM, smack in the middle of *Rahu Kaal*, the most inauspicious time of day
- Finalized the plan after 3 steps, having left two entire wedding days completely empty

This wasn't a fluke. We ran it 50 times. Same story every time.

The problem isn't that the model is "dumb." The problem is that **zero-shot LLM generation fundamentally cannot handle long-horizon, stateful, constraint-dense planning**. The model doesn't track its remaining budget across 20 decisions. It doesn't know that *Haldi* must happen before *Pheras*. It has no feedback loop to catch its own mistakes.

We needed to change the paradigm. Instead of asking the model to generate a plan, we needed to train it to **act** inside a world.

---
## The Environment

We built `WeddingPlannerEnv`, a stateful, programmatic simulation of a 3-day Indian wedding planning process, on top of the **OpenEnv** framework.

The agent operates in a sequential loop:

1. **Observe** the current state: remaining budget, city, guest count, auspicious *Muhurat* windows, booked events, and any active conflicts.
2. **Act** by outputting a single structured JSON action (e.g., book a caterer, schedule the priest).
3. **React** to the consequences: the environment deducts from the budget, detects scheduling conflicts, and surfaces a new observation.
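
The loop above can be sketched with a toy environment. This is a minimal illustration, not the real `WeddingPlannerEnv`: the class `ToyWeddingEnv`, the action fields (`type`, `event`, `cost`), the observation keys, and the scripted policy standing in for the LLM are all assumptions made for the example. Only the `finalize_plan` action name and the 20-step cap come from the writeup.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyWeddingEnv:
    """Hypothetical stand-in for WeddingPlannerEnv (not the real implementation)."""
    budget: float = 100_000.0
    max_steps: int = 20
    booked: list = field(default_factory=list)
    steps: int = 0

    def observe(self) -> dict:
        # Observation keys are illustrative; the real schema has more fields.
        return {"remaining_budget": self.budget, "booked_events": list(self.booked)}

    def step(self, action_json: str):
        """Apply one JSON action and return (observation, done)."""
        action = json.loads(action_json)
        self.steps += 1
        if action["type"] == "book":
            self.budget -= action["cost"]
            self.booked.append(action["event"])
        done = action["type"] == "finalize_plan" or self.steps >= self.max_steps
        return self.observe(), done


def scripted_policy(obs: dict) -> str:
    """Placeholder for the LLM: book a venue first, then finalize."""
    if "venue" not in obs["booked_events"]:
        return json.dumps({"type": "book", "event": "venue", "cost": 30_000})
    return json.dumps({"type": "finalize_plan"})


def run_episode(env: ToyWeddingEnv, policy) -> dict:
    """Observe -> act -> react until the episode ends."""
    obs, done = env.observe(), False
    while not done:
        obs, done = env.step(policy(obs))
    return obs


final_obs = run_episode(ToyWeddingEnv(), scripted_policy)
```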
The environment enforces real-world constraints through three internal APIs:

- **VendorAPI**: Checks availability, city coverage, and capacity. Calculates cost dynamically as `price_per_head × guest_count`.
- **CalendarAPI**: Looks up date-specific *Muhurat* windows and flags *Rahu Kaal* violations.
- **ConflictDetector**: Sweeps the full itinerary after every step, checking for double-bookings, ritual ordering violations (*Pheras* before *Haldi* is illegal), and missing catering for large events.

Any detected conflict surfaces immediately in the `active_conflicts` field of the next observation, and the agent must self-correct. Nothing is auto-resolved.
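
As a rough illustration of two of these checks, the sketch below computes a vendor cost and sweeps an itinerary for double-bookings and the *Haldi*/*Pheras* ordering rule. The function names, the itinerary fields (`event`, `day`, `start`, `end` in minutes since midnight), and the conflict strings are assumptions for the example, not the environment's actual schema. The catering check is left out because the writeup does not say what counts as a "large" event.

```python
def vendor_cost(price_per_head: float, guest_count: int) -> float:
    """Cost rule from the writeup: price_per_head x guest_count."""
    return price_per_head * guest_count


def detect_conflicts(itinerary: list) -> list:
    """Sweep the whole itinerary and return a list of conflict labels."""
    conflicts = []

    # Double-bookings: two events on the same day whose time ranges overlap.
    for i, a in enumerate(itinerary):
        for b in itinerary[i + 1:]:
            same_day = a["day"] == b["day"]
            overlap = a["start"] < b["end"] and b["start"] < a["end"]
            if same_day and overlap:
                conflicts.append(f"double_booking:{a['event']}/{b['event']}")

    # Ritual ordering: Haldi must come strictly before Pheras.
    when = {e["event"]: (e["day"], e["start"]) for e in itinerary}
    if "haldi" in when and "pheras" in when and when["pheras"] <= when["haldi"]:
        conflicts.append("ritual_order:pheras_before_haldi")

    return conflicts


# Pheras on day 1, Haldi on day 2: an ordering violation, but no overlap.
bad_order = [
    {"event": "pheras", "day": 1, "start": 540, "end": 660},
    {"event": "haldi", "day": 2, "start": 600, "end": 720},
]
conflicts = detect_conflicts(bad_order)
```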

---

## The Reward Function

At the end of every episode (either on `finalize_plan` or after 20 steps), the environment computes a **composite reward** weighted across 5 objectives:

```python
WEIGHTS = {
    "coverage": 0.35,   # Are all 5 event types booked?
    "budget": 0.25,     # Efficient use of budget without going into debt?
    "muhurat": 0.20,    # Is the ceremony inside the auspicious window?
    "conflicts": 0.10,  # Zero active conflicts at finalization?
    "guest_ux": 0.10,   # Did guests get a diverse, complete experience?
}
```

Why multi-objective? Because a single binary reward is easy to game. With 5 separate signals, the model must genuinely balance trade-offs: it can't spike one metric without paying in another.
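
One plausible way to combine these weights is a plain weighted sum. The sketch below assumes each objective is scored in [0, 1] before weighting. That range, the clamping, and the `composite_reward` helper are assumptions for the example, since the writeup does not show how the per-objective scores are computed. Because the weights sum to 1.0, the result also lies in [0, 1].

```python
WEIGHTS = {
    "coverage": 0.35,
    "budget": 0.25,
    "muhurat": 0.20,
    "conflicts": 0.10,
    "guest_ux": 0.10,
}


def composite_reward(scores: dict) -> float:
    """Weighted sum of per-objective scores (each assumed to be in [0, 1])."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing objective scores: {sorted(missing)}")
    # Clamp each score so one out-of-range value cannot dominate the total.
    return sum(w * min(max(scores[name], 0.0), 1.0) for name, w in WEIGHTS.items())


# Illustrative scores only: good timing, but unresolved conflicts cost the full 0.10.
example_scores = {
    "coverage": 0.6,
    "budget": 0.8,
    "muhurat": 1.0,
    "conflicts": 0.0,
    "guest_ux": 0.5,
}
reward = composite_reward(example_scores)
```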

---

## The Training

We used **Group Relative Policy Optimization (GRPO)** from Hugging Face TRL, with **Unsloth** for fast 4-bit QLoRA training on a single L4 GPU.

The key design decision: GRPO generates multiple rollouts (action completions) for each prompt and scores each one relative to the others in its group. Those relative rewards, not the absolute values, drive the gradient. In effect the model is continuously comparing itself against itself: better-than-average actions get reinforced, worse ones get suppressed.
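
The group-relative idea can be shown in a few lines. This sketch uses a common formulation (subtract the group mean, divide by the group standard deviation). The exact normalization inside TRL may differ, and the `group_relative_advantages` helper and the sample rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Normalize each rollout's reward against the rest of its group.

    Rollouts above the group mean get a positive advantage (reinforced);
    rollouts below it get a negative one (suppressed).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four competing rollouts for one prompt (illustrative rewards).
advantages = group_relative_advantages([0.20, 0.30, 0.40, 0.50])
```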
We seeded 200 unique episodes (randomized city, guest count, budget) and trained for 3 epochs / 300 steps. Each prompt step generated 4 competing rollouts.

One important engineering fix: the original plan routed all reward evaluations through a single shared environment server. But GRPO's concurrent rollouts would all step the same stateful environment at once, contaminating each other's rewards. We solved this by instantiating a **completely isolated local environment per rollout**, reconstructing state from the prompt payload.
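
A minimal sketch of that isolation pattern follows. `LocalEnv`, `from_payload`, the payload keys, and the toy 0/1 reward are all hypothetical. The point is only that every completion is scored against a fresh environment rebuilt from the same payload. In the example, two rollouts each try to book a 40,000 venue from a 50,000 budget. With a shared environment the second booking would fail because the first had already spent the money. With isolated environments both succeed.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LocalEnv:
    """Hypothetical lightweight environment rebuilt for every rollout."""
    budget: float
    booked: list = field(default_factory=list)

    @classmethod
    def from_payload(cls, payload: dict) -> "LocalEnv":
        # Copy the list so no two rollouts share mutable state.
        return cls(budget=payload["remaining_budget"],
                   booked=list(payload["booked_events"]))

    def step(self, action: dict) -> float:
        """Apply one action and return a toy reward (1.0 if the booking fits)."""
        cost = action.get("cost")
        valid = action.get("type") == "book" and isinstance(cost, (int, float))
        if not valid or cost > self.budget:
            return 0.0
        self.budget -= cost
        self.booked.append(action.get("event"))
        return 1.0


def score_rollouts(payload_json: str, completions: list) -> list:
    """Score each completion against its own environment instance."""
    rewards = []
    for completion in completions:
        env = LocalEnv.from_payload(json.loads(payload_json))  # fresh per rollout
        try:
            action = json.loads(completion)
        except json.JSONDecodeError:
            action = None  # malformed JSON earns nothing
        rewards.append(env.step(action) if isinstance(action, dict) else 0.0)
    return rewards


payload = json.dumps({"remaining_budget": 50_000, "booked_events": []})
venue = json.dumps({"type": "book", "event": "venue", "cost": 40_000})
too_expensive = json.dumps({"type": "book", "event": "dj", "cost": 60_000})
rewards = score_rollouts(payload, [venue, venue, "not json", too_expensive])
```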

---

## The Results

After 300 training steps (~2.5 hours on an L4 GPU):

| Metric | Before | After |
|---|---|---|
| JSON output format | Hallucinated text | 100% strict JSON |
| Muhurat scheduling | Inside Rahu Kaal | Correctly in 08:30–11:00 window |
| Budget discipline | 60% wasted or overdrawn | Efficient, within limits |
| Active conflicts | 2–4 per episode | 0–1 per episode |
| **Mean episode reward** | **~0.21** | **~0.44 (+110%)** |

The reward curve showed a clean learning signal: slow initial progress as the model learned the JSON schema, followed by a steeper climb as it internalized the budget and scheduling constraints.

---

## What's Next

The current training used only `difficulty="easy"`. The plan is to run a curriculum, moving to medium and hard modes once the base policy stabilizes, with the aim of pushing reward well past 0.6 on hard scenarios.

The environment is live and open. Anyone can call `/reset` and `/step` against the hosted Space and connect their own agent to it.
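
Connecting an agent could look roughly like the client below. Only the `/reset` and `/step` paths come from the writeup. The request bodies (an empty object for reset, `{"action": ...}` for step), the JSON responses, and the placeholder base URL are assumptions that should be checked against the Space's actual API. The transport is injectable so the client can be exercised without network access.

```python
import json
import urllib.request


def http_post(url: str, payload: dict) -> dict:
    """POST a JSON body and decode a JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


class EnvClient:
    """Thin wrapper around the two endpoints; payload shapes are assumed."""

    def __init__(self, base_url: str, post=http_post):
        self.base_url = base_url.rstrip("/")
        self.post = post  # swap in a fake transport for offline testing

    def reset(self) -> dict:
        return self.post(f"{self.base_url}/reset", {})

    def step(self, action: dict) -> dict:
        return self.post(f"{self.base_url}/step", {"action": action})


# Usage (placeholder URL; substitute the Space's real endpoint):
#   client = EnvClient("https://example.invalid/wedding-planner-env")
#   observation = client.reset()
#   observation = client.step({"type": "finalize_plan"})
```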

---

## Links

- **Live Environment**: [huggingface.co/spaces/shivanandh033/wedding-planner-env](https://huggingface.co/spaces/shivanandh033/wedding-planner-env)
- **Model Card**: [huggingface.co/shivanandh033/wedding-planner-7b](https://huggingface.co/shivanandh033/wedding-planner-7b)
- **Training Notebook**: [`train_colab.ipynb`](./train_colab.ipynb)

---

*Built for the AR'26 Meta OpenEnv Hackathon | Theme #3.1, World Modeling: Professional Tasks*