💍 Indian Wedding Planner RL Agent

A Qwen2.5-7B model fine-tuned with GRPO Reinforcement Learning to autonomously plan complete 3-day Indian weddings inside a stateful, logically-constrained simulation built on the OpenEnv framework.

🏟️ Live Environment: shivanandh033/wedding-planner-env
📓 Training Notebook: train_colab.ipynb
📊 Mean Episode Reward: improved from ~0.21 → ~0.44 (+110% over 300 GRPO steps)
⚙️ Training Stack: Unsloth + HuggingFace TRL (GRPO), single L4 GPU, ~2.5 hrs

🎯 Model Summary

This model was trained using Group Relative Policy Optimization (GRPO) on the custom WeddingPlannerEnv, an OpenEnv-compliant simulator of a 3-day Indian wedding.

The agent must plan the wedding by:

  • Booking vendors (caterers, photographers, decorators, priests, DJs) within a strict dynamic budget
  • Scheduling events inside auspicious Muhurat windows from the Hindu calendar while avoiding Rahu Kaal
  • Detecting and resolving logistical conflicts (double-bookings, ritual ordering violations, missing catering)

The model outputs structured JSON actions and learns from environment rewards through multi-step sequential interaction.
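As a rough illustration of what "structured JSON actions" implies on the consuming side, the sketch below validates a model response before it is sent to the environment. The four action names are the ones listed in the system prompt further down this card; the helper name `parse_action` and the rule that an action is a JSON object with a `type` field are assumptions made for illustration, not the environment's documented schema.

```python
import json

# Action names come from the system prompt in this card; the rest of the
# schema (a JSON object with a "type" field) is an assumption.
VALID_ACTIONS = {"book_vendor", "resolve_conflict", "negotiate", "finalize_plan"}


def parse_action(text: str) -> dict:
    """Parse model output into an action dict, or raise ValueError."""
    try:
        action = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    if action.get("type") not in VALID_ACTIONS:
        raise ValueError(f"unknown action type: {action.get('type')!r}")
    return action


print(parse_action('{"type": "finalize_plan"}'))
```

Rejecting malformed output early like this keeps a bad generation from consuming an environment step.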


🏋️ Training Details

| Parameter | Value |
| --- | --- |
| Base Model | unsloth/Qwen2.5-7B-Instruct |
| LoRA rank / alpha | 16 / 32 |
| Quantization | 4-bit (Unsloth) |
| Rollouts per prompt | 4 (num_generations=4) |
| Training epochs | 3 |
| Total steps | 300 |
| Learning rate | 5e-6 |
| GPU | NVIDIA L4 (24 GB VRAM) |
| Training time | ~2.5 hours |
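With four rollouts per prompt, GRPO scores each completion relative to the other completions in its group rather than against a learned value function. The sketch below shows that group-relative normalization on made-up rewards. The function name, the `eps` value, and the use of the population standard deviation are assumptions; actual implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each rollout's reward against the mean and spread of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical terminal rewards for the 4 rollouts of one prompt
advantages = group_relative_advantages([0.20, 0.30, 0.40, 0.50])
print([round(a, 2) for a in advantages])
```

Rollouts that beat the group mean get a positive advantage and are reinforced; if all four rollouts score the same, every advantage is zero and the prompt contributes no learning signal.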

Curriculum Stages

| Stage | Guests | Budget Multiplier | Goal |
| --- | --- | --- | --- |
| Easy | 100–150 | 2.0× | Learn JSON schema + basic booking logic |
| Medium | 200–300 | 1.3× | Learn budget discipline under pressure |
| Hard | 350–500 | 1.05× | Master tight constraints at scale |
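One way to picture how the stages differ is as episode configurations sampled per stage. The sketch below uses the guest ranges and multipliers from the table; the per-guest baseline cost, the assumption that the multiplier scales that baseline, and the names `Stage`, `STAGES`, and `sample_episode` are all hypothetical, since the environment's real cost model is not described here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    min_guests: int
    max_guests: int
    budget_multiplier: float


# Guest ranges and multipliers are taken from the curriculum table.
STAGES = {
    "easy": Stage("easy", 100, 150, 2.0),
    "medium": Stage("medium", 200, 300, 1.3),
    "hard": Stage("hard", 350, 500, 1.05),
}

# Hypothetical baseline cost per guest; not a documented environment value.
BASELINE_COST_PER_GUEST = 3500


def sample_episode(stage_name: str, rng: random.Random) -> dict:
    """Draw a guest count for the stage and derive a budget from it."""
    stage = STAGES[stage_name]
    guests = rng.randint(stage.min_guests, stage.max_guests)
    budget = round(guests * BASELINE_COST_PER_GUEST * stage.budget_multiplier)
    return {"difficulty": stage.name, "guest_count": guests, "budget": budget}


print(sample_episode("medium", random.Random(42)))
```

Under this reading, a multiplier near 1.0 leaves almost no slack above the baseline cost, which is what makes the Hard stage tight.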

📊 Reward Function

The environment computes a weighted multi-objective terminal reward:

WEIGHTS = {
    "coverage":  0.35,   # % of required vendor categories booked (5 event types)
    "budget":    0.25,   # Budget efficiency: penalizes deficit and extreme under-spend
    "muhurat":   0.20,   # Ceremony timing compliance: a Rahu Kaal violation scores 0.0
    "conflicts": 0.10,   # Zero active conflicts at finalization
    "guest_ux":  0.10,   # Event diversity: guests get a complete experience
}
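The weights sum to 1.0, so if each component score lies in [0, 1] the terminal reward does too. The sketch below combines per-component scores under that assumption. The function `terminal_reward`, the clamping step, and the example scores are illustrative, not the environment's actual scoring code.

```python
WEIGHTS = {
    "coverage": 0.35,
    "budget": 0.25,
    "muhurat": 0.20,
    "conflicts": 0.10,
    "guest_ux": 0.10,
}


def terminal_reward(scores: dict) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    # Clamp defensively so one out-of-range component cannot dominate
    return sum(w * min(max(scores[k], 0.0), 1.0) for k, w in WEIGHTS.items())


# Hypothetical episode: good coverage, but a Rahu Kaal violation zeroes muhurat
example = {"coverage": 0.8, "budget": 0.6, "muhurat": 0.0,
           "conflicts": 1.0, "guest_ux": 0.5}
print(round(terminal_reward(example), 3))
```

For the example scores the terms are 0.28 + 0.15 + 0.00 + 0.10 + 0.05 = 0.58, which shows how a single Rahu Kaal violation costs the full 0.20 muhurat share.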

🚀 How to Use

Connect to the Live Environment

import requests

# Initialize a new episode
obs = requests.post(
    "https://shivanandh033-wedding-planner-env.hf.space/reset",
    json={"seed": 42, "difficulty": "medium"}
).json()["observation"]

print(obs)
# {
#   "city": "Delhi", "guest_count": 237, "budget_remaining": 1079325,
#   "muhurat_windows": {"pheras": {"start": "08:30", "end": "11:00"}, ...},
#   "booked_events": [], "active_conflicts": [], "step": 0
# }

Run the Agent

from unsloth import FastLanguageModel
import json, re, requests

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="shivanandh033/wedding-planner-7b",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

SYSTEM = """You are an expert Indian wedding planner.
Plan a 3-day wedding by issuing ONE JSON action per turn.
Valid actions: book_vendor, resolve_conflict, negotiate, finalize_plan.
Always output valid JSON only."""

ENV_URL = "https://shivanandh033-wedding-planner-env.hf.space"
obs = requests.post(f"{ENV_URL}/reset", json={"seed": 42, "difficulty": "medium"}).json()["observation"]

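for step in range(20):
    prompt = f"{SYSTEM}\n\nCurrent state:\n{json.dumps(obs, indent=2)}\n\nYour action:"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # do_sample=True is needed for temperature to have any effect
    output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)

    # Parse the first JSON object in the response; raw_decode handles nested braces
    action = {"type": "finalize_plan"}  # fallback when no valid JSON is found
    match = re.search(r"\{", response)
    if match:
        try:
            action, _ = json.JSONDecoder().raw_decode(response, match.start())
        except json.JSONDecodeError:
            pass

    result = requests.post(f"{ENV_URL}/step", json=action, timeout=30).json()
    obs = result["observation"]
    print(f"Step {step+1:2d}: {action.get('type', '?'):20s} | reward: {result['reward']:.4f}")

    if result["done"]:
        print(f"\n✅ Final Score: {result['reward']:.4f}")
        break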

🌪️ The Constraint Engine

The WeddingPlannerEnv sweeps the full itinerary for logical impossibilities after every single step:

  • Ritual Ordering Violations: Pheras booked before Haldi → immediate conflict flag
  • Temporal Double Booking: same vendor at same time slot on same date
  • Missing Infrastructure: 400-guest reception with no catering vendor attached
  • Rahu Kaal Scheduling: any event booked 15:00–16:30 → muhurat score collapses to 0.0

Active conflicts surface in active_conflicts of the very next observation. The agent must self-correct; conflicts are never auto-resolved.
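The checks listed above can be sketched as a single sweep over the booked events. This is a simplified illustration, not the environment's code: the event schema (`name`, `vendor`, `date`, `start`, `end` with zero-padded HH:MM strings), the conflict labels, and the choice to report a Rahu Kaal overlap as a flag are all assumptions, and the missing-catering check is left out for brevity.

```python
from itertools import combinations

RAHU_KAAL = ("15:00", "16:30")  # window quoted in the list above


def overlaps(a_start, a_end, b_start, b_end):
    """True if two [start, end) intervals of zero-padded HH:MM strings intersect."""
    return a_start < b_end and b_start < a_end


def find_conflicts(events):
    """Return a list of conflict labels for a hypothetical list of event dicts."""
    conflicts = []
    by_name = {e["name"]: e for e in events}
    haldi, pheras = by_name.get("haldi"), by_name.get("pheras")
    # Ritual ordering: Pheras must not be scheduled earlier than Haldi
    if haldi and pheras and (pheras["date"], pheras["start"]) < (haldi["date"], haldi["start"]):
        conflicts.append("ritual_order: pheras before haldi")
    # Double booking: same vendor, same date, overlapping time slots
    for a, b in combinations(events, 2):
        if (a["vendor"] == b["vendor"] and a["date"] == b["date"]
                and overlaps(a["start"], a["end"], b["start"], b["end"])):
            conflicts.append(f"double_booking: {a['vendor']} on {a['date']}")
    # Rahu Kaal: any event that touches the inauspicious window
    for e in events:
        if overlaps(e["start"], e["end"], *RAHU_KAAL):
            conflicts.append(f"rahu_kaal: {e['name']}")
    return conflicts


events = [
    {"name": "haldi", "vendor": "decor_a", "date": "2026-02-11", "start": "10:00", "end": "12:00"},
    {"name": "pheras", "vendor": "priest_a", "date": "2026-02-10", "start": "08:30", "end": "11:00"},
    {"name": "sangeet", "vendor": "decor_a", "date": "2026-02-11", "start": "11:00", "end": "16:00"},
]
for conflict in find_conflicts(events):
    print(conflict)
```

Running the sweep after every step, as the environment is described as doing, means a bad booking shows up in the next observation rather than only at finalization.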


📈 Evaluation Results

Figure: GRPO training reward curve (mean episode reward vs. training step). Over 300 GRPO training steps, mean episode reward started at ~0.21 and converged to ~0.44 (+110%).

Tested across 200 unique episodes on Easy difficulty over 3 training epochs:

| Metric | Untrained Baseline | GRPO-Trained Agent |
| --- | --- | --- |
| JSON output format | Frequent hallucinations | 100% strict JSON compliance |
| Muhurat scheduling | Events in Rahu Kaal window | Correctly targets 08:30–11:00 |
| Budget management | ~60% wasted or overdrawn | Efficient allocation within limits |
| Active conflicts at finalization | 2–4 per episode | 0–1 per episode |
| Mean episode reward | ~0.21 | ~0.44 (+110%) |

📄 Citation

@misc{weddingplannerenv2026,
  title  = {WeddingPlannerEnv: GRPO Reinforcement Learning for Culturally-Constrained Event Planning},
  year   = {2026},
  note   = {AR'26 Meta OpenEnv Hackathon Submission},
  url    = {https://huggingface.co/spaces/shivanandh033/wedding-planner-env}
}

AR'26 Meta OpenEnv Hackathon | Theme #3.1 - World Modeling: Professional Tasks
