
Teaching an AI to Plan a Wedding: Reinforcement Learning Meets Indian Culture

AR'26 Meta OpenEnv Hackathon: Submission Writeup


The Setup

We asked a state-of-the-art 7B language model to plan a 3-day Indian wedding. Here's what it did:

  • Booked the DJ before finding a venue
  • Scheduled the sacred Pheras ceremony at 4:15 PM, smack in the middle of Rahu Kaal, the most inauspicious time of day
  • Finalized the plan after 3 steps, having left two entire wedding days completely empty

This wasn't a fluke. We ran it 50 times. Same story every time.

The problem isn't that the model is "dumb." The problem is that zero-shot LLM generation fundamentally cannot handle long-horizon, stateful, constraint-dense planning. The model doesn't track its remaining budget across 20 decisions. It doesn't know that Haldi must happen before Pheras. It has no feedback loop to catch its own mistakes.

We needed to change the paradigm. Instead of asking the model to generate a plan, we needed to train it to act inside a world.


The Environment

We built WeddingPlannerEnv, a stateful, programmatic simulation of a 3-day Indian wedding planning process, using the OpenEnv framework.

The agent operates in a sequential loop:

  1. Observe the current state: remaining budget, city, guest count, auspicious Muhurat windows, booked events, and any active conflicts.
  2. Act by outputting a single structured JSON action (e.g., book a caterer, schedule the priest).
  3. React to the consequences: the environment deducts from the budget, detects scheduling conflicts, and surfaces a new observation.
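The loop above can be sketched in a few lines. This is a minimal illustration, not the real WeddingPlannerEnv: the class `ToyWeddingEnv`, the action fields (`type`, `event`, `cost`), the flat cost, and the scripted policy standing in for the language model are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, field

EVENTS = ["venue", "haldi", "pheras"]  # toy subset, assumed for illustration


@dataclass
class ToyWeddingEnv:
    """Hypothetical stand-in for WeddingPlannerEnv (not the real class)."""
    budget: int = 1_000_000
    max_steps: int = 20
    booked: list = field(default_factory=list)
    steps: int = 0

    def observe(self) -> dict:
        return {
            "remaining_budget": self.budget,
            "booked_events": list(self.booked),
            "active_conflicts": [],  # this toy env never raises conflicts
        }

    def step(self, action_json: str):
        action = json.loads(action_json)  # the agent must emit valid JSON
        self.steps += 1
        if action["type"] == "book":
            self.budget -= action["cost"]
            self.booked.append(action["event"])
        done = action["type"] == "finalize_plan" or self.steps >= self.max_steps
        return self.observe(), done


def scripted_policy(obs: dict) -> str:
    """Book the next missing event, then finalize. Stands in for the LLM."""
    for event in EVENTS:
        if event not in obs["booked_events"]:
            return json.dumps({"type": "book", "event": event, "cost": 100_000})
    return json.dumps({"type": "finalize_plan"})


def run_episode(env: ToyWeddingEnv) -> dict:
    obs, done = env.observe(), False
    while not done:
        obs, done = env.step(scripted_policy(obs))
    return obs
```

Calling `run_episode(ToyWeddingEnv())` books the three toy events in order and then finalizes, so the policy only ever decides one action at a time from the latest observation.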

The environment enforces real-world constraints through three internal APIs:

  • VendorAPI: Checks availability, city coverage, and capacity. Calculates cost dynamically as price_per_head × guest_count.
  • CalendarAPI: Looks up date-specific Muhurat windows and flags Rahu Kaal violations.
  • ConflictDetector: Sweeps the full itinerary after every step, checking for double-bookings, ritual ordering violations (Pheras before Haldi is illegal), and missing catering for large events.
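As a rough sketch of the first two checks, the snippet below computes a per-head vendor cost and classifies a ceremony time against two windows. The function names are hypothetical, and the Rahu Kaal bounds are an assumption chosen only so that 4:15 PM falls inside them; real windows are date-specific and come from the calendar lookup.

```python
from datetime import time

# Assumed windows for a single illustrative date; real values vary by date.
MUHURAT_WINDOW = (time(8, 30), time(11, 0))
RAHU_KAAL = (time(15, 0), time(16, 30))  # hypothetical bounds


def vendor_cost(price_per_head: int, guest_count: int) -> int:
    """Cost scales linearly with the guest count."""
    return price_per_head * guest_count


def _inside(t: time, window: tuple) -> bool:
    start, end = window
    return start <= t < end


def classify_ceremony_time(t: time) -> str:
    """Return a label the environment could surface in an observation."""
    if _inside(t, RAHU_KAAL):
        return "rahu_kaal_violation"
    if _inside(t, MUHURAT_WINDOW):
        return "auspicious"
    return "neutral"
```

Under these assumed windows, `classify_ceremony_time(time(16, 15))` flags the 4:15 PM ceremony from the opening example, while a 9:00 AM slot is labeled auspicious.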

Any conflict detected immediately surfaces in active_conflicts of the next observation, and the agent must self-correct. Nothing is auto-resolved.
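A full-itinerary sweep of this kind might look like the sketch below. The `Booking` fields, the conflict label strings, and the 100-guest threshold for a "large" event are assumptions for illustration; the actual ConflictDetector may represent these differently.

```python
from dataclasses import dataclass

LARGE_EVENT_GUESTS = 100  # assumed threshold for "large" events


@dataclass(frozen=True)
class Booking:
    event: str
    day: int
    start_hour: float
    end_hour: float
    venue: str
    guests: int = 0
    has_catering: bool = False


def detect_conflicts(itinerary: list) -> list:
    conflicts = []
    # Double-booking: same venue, same day, overlapping time ranges.
    for i, a in enumerate(itinerary):
        for b in itinerary[i + 1:]:
            overlap = a.start_hour < b.end_hour and b.start_hour < a.end_hour
            if a.venue == b.venue and a.day == b.day and overlap:
                conflicts.append(f"double_booking:{a.event}/{b.event}")
    # Ritual ordering: Haldi must start strictly before Pheras.
    by_event = {b.event: b for b in itinerary}
    if "haldi" in by_event and "pheras" in by_event:
        haldi, pheras = by_event["haldi"], by_event["pheras"]
        if (pheras.day, pheras.start_hour) <= (haldi.day, haldi.start_hour):
            conflicts.append("ritual_order:pheras_before_haldi")
    # Large events need catering.
    for b in itinerary:
        if b.guests >= LARGE_EVENT_GUESTS and not b.has_catering:
            conflicts.append(f"missing_catering:{b.event}")
    return conflicts
```

Because the sweep only reports and never mutates the itinerary, the returned list can be copied straight into `active_conflicts`, leaving the fix to the agent.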


The Reward Function

At the end of every episode (either on finalize_plan or after 20 steps), the environment computes a composite reward weighted across 5 objectives:

WEIGHTS = {
    "coverage":  0.35,   # Are all 5 event types booked?
    "budget":    0.25,   # Efficient use of budget without going into debt?
    "muhurat":   0.20,   # Is the ceremony inside the auspicious window?
    "conflicts": 0.10,   # Zero active conflicts at finalization?
    "guest_ux":  0.10,   # Did guests get a diverse, complete experience?
}

Why multi-objective? Because a single binary reward is easy to game. With 5 independent signals, the model must genuinely balance trade-offs: it can't spike one metric without paying in another.
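One plausible way to combine these weights is a plain weighted sum. The sketch below assumes each component score is already normalized to the range 0 to 1, which the writeup does not state; the function name `composite_reward` is also hypothetical.

```python
WEIGHTS = {
    "coverage": 0.35,
    "budget": 0.25,
    "muhurat": 0.20,
    "conflicts": 0.10,
    "guest_ux": 0.10,
}


def composite_reward(scores: dict) -> float:
    """Weighted sum of per-objective scores, each assumed to lie in [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing objective scores: {sorted(missing)}")
    # Clamp defensively so one out-of-range score cannot dominate the sum.
    return sum(WEIGHTS[k] * min(max(scores[k], 0.0), 1.0) for k in WEIGHTS)
```

Under this assumption, a plan that maxes out coverage but scores zero everywhere else earns only 0.35, which is why spiking a single metric is not a winning strategy.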


The Training

We used Group Relative Policy Optimization (GRPO) from HuggingFace TRL, with Unsloth for ultra-fast 4-bit QLoRA training on a single L4 GPU.

The key design decision: GRPO generates multiple rollouts (action completions) for each prompt and scores each one relative to the others in its group, using those relative rewards to compute gradients. The model is therefore continuously compared against itself: slightly better actions get reinforced, worse ones get suppressed.
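The group-relative idea can be illustrated without any training library. The sketch below standardizes the rewards within one group of rollouts; the function name, the use of the population standard deviation, and the small `eps` term are assumptions for the example and may differ from what TRL does internally.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With four rollouts scoring 0.2, 0.4, 0.3, and 0.5, the rollouts above the group mean get positive advantages and those below get negative ones. If all four score the same, every advantage is zero and that group contributes no learning signal.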

We seeded 200 unique episodes (randomized city, guest count, budget) and trained for 3 epochs / 300 steps. Each prompt step generated 4 competing rollouts.

One important engineering fix: the original plan routed all reward evaluations through a single shared environment server. But GRPO's concurrent rollouts would all be stepping the same stateful environment simultaneously, contaminating each other's rewards. We solved this by instantiating a completely isolated local environment per rollout, reconstructing state from the prompt payload.
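The isolation fix can be sketched as follows. Everything here is illustrative: the class `LocalWeddingEnv`, the prompt payload layout (a JSON object with a `state` key), and the toy within-budget reward are assumptions, not the project's actual code.

```python
import json


class LocalWeddingEnv:
    """Hypothetical local environment rebuilt from a prompt payload."""

    def __init__(self, state: dict):
        self.budget = state["remaining_budget"]
        self.booked = list(state["booked_events"])  # copy, never share

    @classmethod
    def from_prompt(cls, prompt: str) -> "LocalWeddingEnv":
        # Assumed payload layout: {"state": {...}} serialized as JSON.
        return cls(json.loads(prompt)["state"])

    def step(self, action: dict) -> float:
        self.budget -= action.get("cost", 0)
        self.booked.append(action.get("event"))
        # Toy reward: 1.0 while within budget, 0.0 once overdrawn.
        return 1.0 if self.budget >= 0 else 0.0


def score_rollouts(prompt: str, completions: list) -> list:
    rewards = []
    for completion in completions:
        env = LocalWeddingEnv.from_prompt(prompt)  # fresh env per rollout
        try:
            action = json.loads(completion)
        except json.JSONDecodeError:
            action = None
        if not isinstance(action, dict):
            rewards.append(0.0)  # malformed output earns nothing
            continue
        rewards.append(env.step(action))
    return rewards
```

With a starting budget of 100, two identical rollouts that each spend 80 both score 1.0 here. Against a single shared environment the second would have been charged on top of the first and scored as overdrawn, which is the contamination described above.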


The Results

After 300 training steps (~2.5 hours on an L4 GPU):

| Metric | Before | After |
|---|---|---|
| JSON output format | Hallucinated text | 100% strict JSON |
| Muhurat scheduling | Inside Rahu Kaal | Correctly in 08:30–11:00 window |
| Budget discipline | 60% wasted or overdrawn | Efficient, within limits |
| Active conflicts | 2–4 per episode | 0–1 per episode |
| Mean episode reward | ~0.21 | ~0.44 (+110%) |

The reward curve showed a clean learning signal: slow initial progress as the model learned the JSON schema, followed by a steeper climb as it internalized the budget and scheduling constraints.


What's Next

The current training used only difficulty="easy". The plan is to run a curriculum, moving to medium and hard modes once the base policy stabilizes, with the goal of pushing reward well past 0.6 on hard scenarios.

The environment is live and open. Anyone can call /reset and /step against the hosted Space and connect their own agent to it.


Links


Built for the AR'26 Meta OpenEnv Hackathon | Theme #3.1 - World Modeling: Professional Tasks