💍 Indian Wedding Planner RL Agent

A Qwen2.5-7B model fine-tuned with GRPO reinforcement learning to autonomously plan complete 3-day Indian weddings inside a stateful, logically constrained simulation built on the OpenEnv framework.


🏟️ Live Environment	shivanandh033/wedding-planner-env
📓 Training Notebook	`train_colab.ipynb`
📊 Mean Episode Reward	Improved from `~0.21 → ~0.44` (+110% over 300 GRPO steps)
⚙️ Training Stack	Unsloth + HuggingFace TRL (GRPO), single L4 GPU, ~2.5 hrs

🎯 Model Summary

This model was trained using Group Relative Policy Optimization (GRPO) on the custom WeddingPlannerEnv — an OpenEnv-compliant simulator of a 3-day Indian wedding.

The agent must plan the wedding by:

Booking vendors (caterers, photographers, decorators, priests, DJs) within a strict dynamic budget
Scheduling events inside auspicious Muhurat windows from the Hindu calendar while avoiding Rahu Kaal
Detecting and resolving logistical conflicts (double-bookings, ritual ordering violations, missing catering)

The model outputs structured JSON actions and learns from environment rewards through multi-step sequential interaction.
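
A minimal sketch of how a structured action might be validated before it is sent to the environment. The four action names are the ones listed in the agent's system prompt in this card, and the `type` key matches the fallback action used there. The helper `parse_action` and any fields beyond `type` are assumptions for illustration, not the environment's documented schema.

```python
import json

# Action names listed in the agent's system prompt in this card.
VALID_ACTIONS = {"book_vendor", "resolve_conflict", "negotiate", "finalize_plan"}


def parse_action(text: str) -> dict:
    """Parse model output into an action dict (hypothetical helper).

    Falls back to finalize_plan when the text does not contain a JSON
    object whose "type" is one of the known actions.
    """
    fallback = {"type": "finalize_plan"}
    start = text.find("{")
    if start == -1:
        return fallback
    try:
        # raw_decode reads the first complete JSON value and ignores trailing text.
        action, _ = json.JSONDecoder().raw_decode(text[start:])
    except json.JSONDecodeError:
        return fallback
    if not isinstance(action, dict) or action.get("type") not in VALID_ACTIONS:
        return fallback
    return action
```

Decoding from the first `{` keeps nested objects intact, which a non-greedy regular expression would truncate at the first closing brace.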

🏋️ Training Details

Parameter	Value
Base Model	`unsloth/Qwen2.5-7B-Instruct`
LoRA rank / alpha	16 / 32
Quantization	4-bit (Unsloth)
Rollouts per prompt	4 (`num_generations=4`)
Training epochs	3
Total steps	300
Learning rate	5e-6
GPU	NVIDIA L4 (24 GB VRAM)
Training time	~2.5 hours
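
To illustrate what `num_generations=4` means in GRPO: the rewards of the four rollouts sampled for one prompt are normalized against each other, so each rollout is credited relative to its own group rather than to a learned value function. The sketch below uses the population standard deviation and made-up reward values. The exact normalization inside the training library may differ, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each rollout's reward against its own group (same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt (num_generations=4); the reward values are illustrative.
advantages = group_relative_advantages([0.21, 0.35, 0.44, 0.30])
```

Rollouts that score above the group mean get a positive advantage and are reinforced. If all four rollouts earn the same reward, every advantage is zero and the group contributes no learning signal.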

Curriculum Stages

Stage	Guests	Budget Multiplier	Goal
Easy	100–150	2.0×	Learn JSON schema + basic booking logic
Medium	200–300	1.3×	Learn budget discipline under pressure
Hard	350–500	1.05×	Master tight constraints at scale
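
The curriculum table can be expressed as a small configuration, sketched below. The guest ranges and multipliers are copied from the table. The idea that the budget equals a baseline cost scaled by the multiplier is an assumption, and `Stage`, `sample_scenario` and `baseline_cost` are hypothetical names, since the environment's actual budget formula is not given here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    min_guests: int
    max_guests: int
    budget_multiplier: float


# Values copied from the curriculum table.
STAGES = {
    "easy": Stage(100, 150, 2.0),
    "medium": Stage(200, 300, 1.3),
    "hard": Stage(350, 500, 1.05),
}


def sample_scenario(difficulty: str, baseline_cost: float, rng: random.Random) -> dict:
    """Draw a guest count and scale a baseline cost by the stage multiplier.

    baseline_cost (an estimated minimum spend) is a hypothetical input;
    how the real environment derives its budget is not documented here.
    """
    stage = STAGES[difficulty]
    guests = rng.randint(stage.min_guests, stage.max_guests)
    budget = round(baseline_cost * stage.budget_multiplier)
    return {"guest_count": guests, "budget": budget}
```

Under this reading, the Hard stage leaves only 5% of slack above the baseline, which is why budget discipline has to be learned before it.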

📊 Reward Function

The environment computes a weighted multi-objective terminal reward:

WEIGHTS = {
    "coverage":  0.35,   # % of required vendor categories booked (5 event types)
    "budget":    0.25,   # Budget efficiency — penalizes deficit and extreme under-spend
    "muhurat":   0.20,   # Ceremony timing compliance — Rahu Kaal violation = 0.0
    "conflicts": 0.10,   # Zero active conflicts at finalization
    "guest_ux":  0.10,   # Event diversity — guests get a complete experience
}
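
A minimal sketch of how these weights might combine, assuming the terminal reward is a plain weighted sum of per-objective scores that each lie in [0, 1]. The function `terminal_reward` and the example score dictionaries are hypothetical. Only the weights come from the environment.

```python
WEIGHTS = {
    "coverage": 0.35,
    "budget": 0.25,
    "muhurat": 0.20,
    "conflicts": 0.10,
    "guest_ux": 0.10,
}


def terminal_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-objective scores, each assumed to be in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


perfect = {name: 1.0 for name in WEIGHTS}
# Same plan, but one ceremony lands in Rahu Kaal, so the muhurat score is 0.0.
rahu_kaal_hit = {**perfect, "muhurat": 0.0}
```

The weights sum to 1.0, so under this assumption a flawless plan scores 1.0 and a single Rahu Kaal violation caps an otherwise perfect plan at 0.80.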

🚀 How to Use

Connect to the Live Environment

import requests

# Initialize a new episode
obs = requests.post(
    "https://shivanandh033-wedding-planner-env.hf.space/reset",
    json={"seed": 42, "difficulty": "medium"}
).json()["observation"]

print(obs)
# {
#   "city": "Delhi", "guest_count": 237, "budget_remaining": 1079325,
#   "muhurat_windows": {"pheras": {"start": "08:30", "end": "11:00"}, ...},
#   "booked_events": [], "active_conflicts": [], "step": 0
# }
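
The `muhurat_windows` field can be used to sanity-check a proposed ceremony time before booking. The sketch below assumes each window is a dict with `"start"` and `"end"` in `HH:MM` format, as in the sample observation. `fits_window` is a hypothetical helper, not part of the environment API.

```python
from datetime import time


def fits_window(window: dict, start: str, end: str) -> bool:
    """True if the slot [start, end] lies entirely inside the window."""
    w_start = time.fromisoformat(window["start"])
    w_end = time.fromisoformat(window["end"])
    return w_start <= time.fromisoformat(start) <= time.fromisoformat(end) <= w_end


# Window taken from the sample observation above.
pheras_window = {"start": "08:30", "end": "11:00"}
```

For example, `fits_window(pheras_window, "09:00", "10:30")` holds, while a slot starting at 08:00 or ending at 11:30 falls outside the window.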

Run the Agent

from unsloth import FastLanguageModel
import json, re, requests

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="shivanandh033/wedding-planner-7b",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

SYSTEM = """You are an expert Indian wedding planner.
Plan a 3-day wedding by issuing ONE JSON action per turn.
Valid actions: book_vendor, resolve_conflict, negotiate, finalize_plan.
Always output valid JSON only."""

ENV_URL = "https://shivanandh033-wedding-planner-env.hf.space"
obs = requests.post(f"{ENV_URL}/reset", json={"seed": 42, "difficulty": "medium"}).json()["observation"]

for step in range(20):
    prompt = f"{SYSTEM}\n\nCurrent state:\n{json.dumps(obs, indent=2)}\n\nYour action:"
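    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # do_sample=True is required for temperature to take effect
    output = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
    # Decode only the newly generated tokens instead of slicing the decoded string
    prompt_len = inputs["input_ids"].shape[1]
    response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

    # Greedy match so an action containing nested objects is not cut at the first "}"
    match = re.search(r'\{.*\}', response, re.DOTALL)
    try:
        action = json.loads(match.group()) if match else {"type": "finalize_plan"}
    except json.JSONDecodeError:
        action = {"type": "finalize_plan"}  # fall back when the JSON is malformed

    result = requests.post(f"{ENV_URL}/step", json=action).json()
    obs = result["observation"]
    print(f"Step {step+1:2d}: {action.get('type', '?'):20s} | reward: {result['reward']:.4f}")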

    if result["done"]:
        print(f"\n✅ Final Score: {result['reward']:.4f}")
        break

🌪️ The Constraint Engine

The WeddingPlannerEnv sweeps the full itinerary for logical impossibilities after every single step:

Ritual Ordering Violations — Pheras booked before Haldi → immediate conflict flag
Temporal Double Booking — same vendor at same time slot on same date
Missing Infrastructure — 400-guest reception with no catering vendor attached
Rahu Kaal Scheduling — any event booked 15:00–16:30 → muhurat score collapses to 0.0

Active conflicts surface in active_conflicts of the very next observation. The agent must self-correct — conflicts are never auto-resolved.
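
The checks above can be sketched as follows. This is an illustration under stated assumptions, not the environment's implementation: the event schema (`name`, `date`, `start`, `end`, `vendor`), the conflict labels, and the reading of "booked before" as scheduled time are all hypothetical. The Rahu Kaal window is the 15:00–16:30 range stated above.

```python
from datetime import date, time

RAHU_KAAL = (time(15, 0), time(16, 30))  # window stated in the list above


def overlaps_rahu_kaal(start: time, end: time) -> bool:
    """True if the half-open interval [start, end) intersects Rahu Kaal."""
    return start < RAHU_KAAL[1] and end > RAHU_KAAL[0]


def find_conflicts(events: list[dict]) -> list[str]:
    """Return conflict labels for booked events (hypothetical schema)."""
    conflicts = []
    by_name = {e["name"]: e for e in events}
    haldi, pheras = by_name.get("haldi"), by_name.get("pheras")
    # Ritual ordering: Haldi must be scheduled strictly before Pheras.
    if haldi and pheras and (haldi["date"], haldi["start"]) >= (pheras["date"], pheras["start"]):
        conflicts.append("ritual_order")
    # Double booking: same vendor, same date, overlapping time slots.
    for i, a in enumerate(events):
        for b in events[i + 1:]:
            if (a["vendor"] == b["vendor"] and a["date"] == b["date"]
                    and a["start"] < b["end"] and b["start"] < a["end"]):
                conflicts.append("double_booking")
    if any(overlaps_rahu_kaal(e["start"], e["end"]) for e in events):
        conflicts.append("rahu_kaal")
    return conflicts
```

Running a sweep like this after every step, rather than only at finalization, is what lets conflicts appear in the very next observation.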

📈 Evaluation Results

Figure: GRPO training reward curve (mean episode reward vs. training step). Mean episode reward started at ~0.21 and converged to ~0.44 (+110%) over 300 GRPO training steps.

Evaluated on 200 unique Easy-difficulty episodes across 3 training epochs:

Metric	Untrained Baseline	GRPO-Trained Agent
JSON output format	Frequent hallucinations	100% strict JSON compliance
Muhurat scheduling	Events in Rahu Kaal window	Correctly targets 08:30–11:00
Budget management	~60% wasted or overdrawn	Efficient allocation within limits
Active conflicts at finalization	2–4 per episode	0–1 per episode
Mean episode reward	~0.21	~0.44 (+110%)

📄 Citation

@misc{weddingplannerenv2026,
  title  = {WeddingPlannerEnv: GRPO Reinforcement Learning for Culturally-Constrained Event Planning},
  year   = {2026},
  note   = {AR'26 Meta OpenEnv Hackathon Submission},
  url    = {https://huggingface.co/spaces/shivanandh033/wedding-planner-env}
}

AR'26 Meta OpenEnv Hackathon | Theme #3.1 — World Modeling: Professional Tasks

