	# Teaching an AI to Plan a Wedding: Reinforcement Learning Meets Indian Culture

	AR'26 Meta OpenEnv Hackathon — Submission Writeup

	---

	## The Setup

	We asked a state-of-the-art 7B language model to plan a 3-day Indian wedding. Here's what it did:

	- Booked the DJ before finding a venue
	- Scheduled the sacred Pheras ceremony at 4:15 PM — smack in the middle of Rahu Kaal, the most inauspicious time of day
	- Finalized the plan after 3 steps, having left two entire wedding days completely empty

	This wasn't a fluke. We ran it 50 times. Same story every time.

	The problem isn't that the model is "dumb." The problem is that zero-shot LLM generation fundamentally cannot handle long-horizon, stateful, constraint-dense planning. The model doesn't track its remaining budget across 20 decisions. It doesn't know that Haldi must happen before Pheras. It has no feedback loop to catch its own mistakes.

	We needed to change the paradigm. Instead of asking the model to generate a plan, we needed to train it to act inside a world.

	---

	## The Environment

	We built `WeddingPlannerEnv` using the OpenEnv framework — a stateful, programmatic simulation of a 3-day Indian wedding planning process.

	The agent operates in a sequential loop:

	1. Observe the current state: remaining budget, city, guest count, auspicious Muhurat windows, booked events, and any active conflicts.
	2. Act by outputting a single structured JSON action (e.g., book a caterer, schedule the priest).
	3. React to the consequences: the environment deducts from the budget, detects scheduling conflicts, and surfaces a new observation.
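
As a rough sketch of that observe, act, react loop (this is not the actual `WeddingPlannerEnv` code), the toy class below mimics the shape of the interaction. The class name `ToyWeddingEnv`, its fields, and the action keys (`action`, `event`, `cost`) are assumptions made for illustration only.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyWeddingEnv:
    """Hypothetical stand-in for WeddingPlannerEnv; tracks budget and bookings only."""
    budget: float = 1_000_000.0
    booked: list = field(default_factory=list)
    steps: int = 0
    max_steps: int = 20  # episode cap mentioned in this writeup

    def observe(self) -> dict:
        # The real observation also carries city, guest count, Muhurat windows, etc.
        return {
            "remaining_budget": self.budget,
            "booked_events": list(self.booked),
            "active_conflicts": [],
        }

    def step(self, action_json: str):
        """Apply one structured JSON action and return (observation, done)."""
        action = json.loads(action_json)
        self.steps += 1
        if action["action"] == "book_vendor":
            self.budget -= action["cost"]
            self.booked.append(action["event"])
        done = action["action"] == "finalize_plan" or self.steps >= self.max_steps
        return self.observe(), done


env = ToyWeddingEnv()
obs, done = env.step(json.dumps({"action": "book_vendor", "event": "haldi", "cost": 50_000}))
print(obs["remaining_budget"], done)
obs, done = env.step(json.dumps({"action": "finalize_plan"}))
print(obs["booked_events"], done)
```

The point of the sketch is that every action changes state the agent sees on the next turn, which is exactly what one-shot plan generation lacks.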

	The environment enforces real-world constraints through three internal APIs:

	- VendorAPI: Checks availability, city coverage, and capacity. Calculates cost dynamically as `price_per_head × guest_count`.
	- CalendarAPI: Looks up date-specific Muhurat windows and flags Rahu Kaal violations.
	- ConflictDetector: Sweeps the full itinerary after every step, checking for double-bookings, ritual ordering violations (Pheras before Haldi is illegal), and missing catering for large events.
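
To make the first two checks concrete, here is a minimal sketch of the per-head cost formula and a time-window test. The function names are hypothetical, and the Rahu Kaal window used below (15:00 to 16:30) is an illustrative value, since the real window depends on the date and location. The 08:30 to 11:00 Muhurat window is the one reported in the results.

```python
from datetime import time


def vendor_cost(price_per_head: float, guest_count: int) -> float:
    """Dynamic cost as described above: price_per_head x guest_count."""
    return price_per_head * guest_count


def overlaps(start_a: time, end_a: time, start_b: time, end_b: time) -> bool:
    """True when two same-day time intervals share any time."""
    return start_a < end_b and start_b < end_a


def check_ceremony_slot(start: time, end: time, muhurat: tuple, rahu_kaal: tuple) -> dict:
    """Report whether a slot sits fully inside the Muhurat and whether it touches Rahu Kaal."""
    inside_muhurat = muhurat[0] <= start and end <= muhurat[1]
    return {
        "inside_muhurat": inside_muhurat,
        "rahu_kaal_violation": overlaps(start, end, *rahu_kaal),
    }


MUHURAT = (time(8, 30), time(11, 0))
RAHU_KAAL = (time(15, 0), time(16, 30))  # illustrative window, assumed for this example

print(vendor_cost(1200, 300))
print(check_ceremony_slot(time(16, 15), time(17, 15), MUHURAT, RAHU_KAAL))
print(check_ceremony_slot(time(9, 0), time(10, 30), MUHURAT, RAHU_KAAL))
```

Under these assumed windows, a 4:15 PM ceremony is flagged as a violation while a 9:00 AM one passes both checks.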

	Any conflict detected immediately surfaces in `active_conflicts` of the next observation — the agent must self-correct. Nothing is auto-resolved.
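
A full-itinerary sweep of this kind can be sketched as follows. This is a simplified illustration and not the environment's actual `ConflictDetector`: the `Booking` type, the event names, and the conflict strings are assumptions, and the catering check is omitted.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Booking:
    """Hypothetical itinerary entry."""
    event: str
    vendor: str
    start: datetime
    end: datetime


def detect_conflicts(bookings: list) -> list:
    """Sweep the whole itinerary and return a list of conflict labels."""
    conflicts = []
    # Double-booking: the same vendor holds two overlapping slots.
    for i, a in enumerate(bookings):
        for b in bookings[i + 1:]:
            if a.vendor == b.vendor and a.start < b.end and b.start < a.end:
                conflicts.append(f"double_booking:{a.vendor}")
    # Ritual ordering: Haldi must start before Pheras.
    starts = {b.event: b.start for b in bookings}
    if "haldi" in starts and "pheras" in starts and starts["pheras"] <= starts["haldi"]:
        conflicts.append("ritual_order:pheras_before_haldi")
    return conflicts


itinerary = [
    Booking("pheras", "priest_a", datetime(2026, 2, 1, 9, 0), datetime(2026, 2, 1, 11, 0)),
    Booking("haldi", "decor_b", datetime(2026, 2, 2, 10, 0), datetime(2026, 2, 2, 12, 0)),
]
print(detect_conflicts(itinerary))
```

Because the sweep only reports and never repairs, the agent has to issue its own corrective action, for example rescheduling Haldi to an earlier day.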

	---

	## The Reward Function

	At the end of every episode (either on `finalize_plan` or after 20 steps), the environment computes a composite reward weighted across 5 objectives:

	```python
	WEIGHTS = {
	    "coverage": 0.35,   # Are all 5 event types booked?
	    "budget": 0.25,     # Efficient use of budget without going into debt?
	    "muhurat": 0.20,    # Is the ceremony inside the auspicious window?
	    "conflicts": 0.10,  # Zero active conflicts at finalization?
	    "guest_ux": 0.10,   # Did guests get a diverse, complete experience?
	}
	```

	Why multi-objective? Because a single binary reward is easy to game. With 5 independent signals, the model must genuinely balance trade-offs — it can't spike one metric without paying in another.
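
A worked example of the weighted sum, assuming each sub-score is normalized to the range 0 to 1 (the writeup does not spell out how the individual scores are computed, so the values below are made up for illustration):

```python
WEIGHTS = {
    "coverage": 0.35,
    "budget": 0.25,
    "muhurat": 0.20,
    "conflicts": 0.10,
    "guest_ux": 0.10,
}


def composite_reward(scores: dict) -> float:
    """Weighted sum of per-objective scores, each clamped to [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(w * min(max(scores[name], 0.0), 1.0) for name, w in WEIGHTS.items())


# Hypothetical episode: decent coverage, ceremony scheduled in a bad slot.
partial = {"coverage": 0.6, "budget": 0.5, "muhurat": 0.0, "conflicts": 0.0, "guest_ux": 0.4}
# Hypothetical episode that nails the Muhurat and ignores everything else.
muhurat_only = {"coverage": 0.0, "budget": 0.0, "muhurat": 1.0, "conflicts": 0.0, "guest_ux": 0.0}

print(round(composite_reward(partial), 3))       # 0.375
print(round(composite_reward(muhurat_only), 3))  # 0.2
```

Maximizing a single objective caps out at that objective's weight, which is why spiking one metric cannot stand in for a balanced plan.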

	---

	## The Training

	We used Group Relative Policy Optimization (GRPO) as implemented in Hugging Face TRL, with Unsloth for 4-bit QLoRA training on a single L4 GPU.

	The key design decision: GRPO samples several rollouts (action completions) for each prompt and scores each one against the others in its group. A rollout's advantage is its reward relative to the group average, so above-average actions get reinforced and below-average ones get suppressed, with no separate value model required.
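
The group-relative comparison can be sketched in a few lines. This is a simplified illustration of the idea rather than TRL's internal code: it uses the population standard deviation and a small `eps` for stability, both of which are assumptions, and exact normalization details vary between implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Advantage of each rollout relative to the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, with made-up rewards.
group = [0.2, 0.3, 0.5, 0.2]
advantages = group_relative_advantages(group)
print([round(a, 2) for a in advantages])
```

Rollouts above the group mean receive a positive advantage and those below it a negative one. When every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.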

	We seeded 200 unique episodes (randomized city, guest count, budget) and trained for 3 epochs / 300 steps. Each prompt step generated 4 competing rollouts.

	One important engineering fix: the original plan routed all reward evaluations through a single shared environment server. But GRPO's concurrent rollouts would all be stepping the same stateful environment simultaneously, contaminating each other's rewards. We solved this by instantiating a completely isolated local environment per rollout, reconstructing state from the prompt payload.
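
A minimal sketch of that isolation pattern is shown below. The `(prompts, completions, **kwargs)` signature follows the convention TRL uses for custom GRPO reward functions; everything else is assumed for illustration, including the `LocalEnv` class, the `state` key in the prompt, and the toy reward of 1.0 for staying within budget.

```python
import json
from dataclasses import dataclass


@dataclass
class LocalEnv:
    """Hypothetical minimal environment that only tracks remaining budget."""
    remaining_budget: float

    @classmethod
    def from_payload(cls, payload: dict) -> "LocalEnv":
        # Rebuild state from the serialized snapshot carried in the prompt.
        return cls(remaining_budget=float(payload["remaining_budget"]))

    def step(self, action: dict) -> float:
        self.remaining_budget -= float(action.get("cost", 0.0))
        return 1.0 if self.remaining_budget >= 0 else 0.0


def isolated_reward(prompts, completions, **kwargs):
    """Score each completion against its own freshly built environment."""
    rewards = []
    for prompt, completion in zip(prompts, completions):
        env = LocalEnv.from_payload(json.loads(prompt)["state"])  # never shared
        try:
            action = json.loads(completion)
        except json.JSONDecodeError:
            rewards.append(0.0)  # unparseable output earns nothing
            continue
        rewards.append(env.step(action) if isinstance(action, dict) else 0.0)
    return rewards


prompt = json.dumps({"state": {"remaining_budget": 100}})
print(isolated_reward([prompt] * 3, ['{"cost": 60}', '{"cost": 60}', "not json"]))
```

With a single shared environment, the second `{"cost": 60}` rollout would have overdrawn the budget left by the first. With one environment per rollout, both are judged from the same starting state.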

	---

	## The Results

	After 300 training steps (~2.5 hours on an L4 GPU):

	| Metric | Before | After |
	|---|---|---|
	| JSON output format | Hallucinated text | 100% strict JSON |
	| Muhurat scheduling | Inside Rahu Kaal | Correctly in 08:30–11:00 window |
	| Budget discipline | 60% wasted or overdrawn | Efficient, within limits |
	| Active conflicts | 2–4 per episode | 0–1 per episode |
	| Mean episode reward | ~0.21 | ~0.44 (+110%) |

	The reward curve showed a clean learning signal — slow initial progress as the model learned the JSON schema, followed by a steeper climb as it internalized the budget and scheduling constraints.

	---

	## What's Next

	The current training used only `difficulty="easy"`. The plan is to run a curriculum — moving to medium and hard modes once the base policy stabilizes — pushing reward well past 0.6 on hard scenarios.

	The environment is live and open. Anyone can call `/reset` and `/step` against the hosted Space and connect their own agent to it.
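
For anyone wiring up their own agent, a client can be as small as the sketch below, which uses only the Python standard library. Only the `/reset` and `/step` paths come from this writeup. The base URL is a placeholder and the request payloads are assumptions, so check the Space's own documentation for the real schema before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # placeholder; substitute the Space's API URL


def build_post(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


reset_request = build_post(BASE_URL, "/reset", {})
# Hypothetical action payload; the real action schema may differ.
step_request = build_post(BASE_URL, "/step", {"action": {"type": "finalize_plan"}})

# Uncomment once BASE_URL points at a running environment:
# observation = send(reset_request)
# result = send(step_request)
```

Keeping request construction separate from sending makes it easy to log or unit-test the exact payload an agent produces before it ever touches the live environment.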

	---

	## Links

	- 🏟️ Live Environment: [huggingface.co/spaces/shivanandh033/wedding-planner-env](https://huggingface.co/spaces/shivanandh033/wedding-planner-env)
	- 🤖 Model Card: [huggingface.co/shivanandh033/wedding-planner-7b](https://huggingface.co/shivanandh033/wedding-planner-7b)
	- 📓 Training Notebook: [`train_colab.ipynb`](./train_colab.ipynb)

	---

	Built for the AR'26 Meta OpenEnv Hackathon | Theme #3.1 (World Modeling: Professional Tasks)