---
title: KaggleSimEnv
emoji: 🚀
colorFrom: blue
colorTo: purple
sdk: docker
app_file: server/app.py
pinned: false
---

# KaggleSimEnv v3

Production-grade **OpenEnv** RL environment simulating Kaggle competitions with **hierarchical action categories**, **causal dataset properties**, **failure-mode traps**, **contextual strategy scoring**, and **50+ advanced strategies**.

> 🤗 **Live on Hugging Face Spaces:** [https://huggingface.co/spaces/aadi-gupta/kaggle-sim-env](https://huggingface.co/spaces/aadi-gupta/kaggle-sim-env)
>
> 📓 **Training Notebook (Colab):** [train_grpo.ipynb](train_grpo.ipynb) — GRPO training with Unsloth + TRL on a free T4 GPU
>
> 📝 **Writeup:** *(link your HF blog post or YouTube video here once published)*

---

## Training Results

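We trained a `Qwen2.5-0.5B-Instruct` agent using **GRPO** (Group Relative Policy Optimisation) via TRL + Unsloth.
The model learns to generate action plans that score higher in the environment than those of a random agent.
The plots and table below come from a 30-episode run comparing a random agent with the expert baseline.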

### Episode Reward Curve

![Reward curve — random vs expert baseline over 30 episodes](plots/reward_curve.png)
*X-axis: episode number. Y-axis: final grade score (0–1). Smoothed with a rolling window of 8.*

### Loss Curve (score gap to optimal)

![Loss curve — score gap (1 − score) per episode](plots/loss_curve.png)
*Lower is better. Expert baseline (blue) consistently closes the gap faster than the random agent (red).*

### Per-task Score: Random vs Expert Baseline

![Random vs expert baseline comparison per task](plots/baseline_vs_trained.png)
*Expert baseline outperforms random agent across all 5 tasks (30-episode mean scores).*

### Quantitative Results (30-episode run)

| Task | Random agent | Expert baseline | Delta |
|---|---|---|---|
| easy_churn | 0.41 | **1.00** | +0.59 |
| medium_fraud | 0.26 | **0.78** | +0.52 |
| hard_leaky_noisy | 0.13 | **0.64** | +0.51 |
| image_quality | 0.05 | **0.48** | +0.43 |
| trajectory_pred | 0.06 | **0.48** | +0.42 |
| **Mean** | **0.18** | **0.68** | **+0.50** |

> To reproduce plots: `python generate_training_plots_stub.py --episodes 30`
>
> To reproduce full GRPO training: open `train_grpo.ipynb` in Google Colab (T4 GPU, ~25 min).

---

## Quick Start

```bash
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
```

### Baseline agent

```bash
export OPENAI_API_KEY=sk-...
python -m baseline.run_baseline --mode local
```

---

## Architecture

```
openenvHackathon/
├── kaggle_sim_env/
│   ├── models.py         # Hierarchical categories, DatasetProperties, FailureMode
│   ├── environment.py    # Causal logic, trap detection, mitigation tracking
│   ├── tasks.py          # 5 tasks with properties, traps, context relevance
│   ├── grader.py         # 4-axis grading (perf + strategy + combo + trap)
│   ├── leaderboard.py    # Ghost competitor leaderboard
│   ├── hints.py          # Per-task hint dispensing
│   └── rewards.py        # 9-component dense reward
├── api/server.py         # FastAPI (8 endpoints)
├── baseline/run_baseline.py  # Structured phase-based agent
├── openenv.yaml / Dockerfile / requirements.txt
```

---

## Hierarchical Action Space

Each action names a `category` and a `technique` within that category, which narrows the search space:

```json
{
  "action_type": "feature_engineering",
  "parameters": {
    "category": "distribution",
    "technique": "log_transform"
  }
}
```

| Action Type | Categories → Techniques |
|---|---|
| `set_cv` | standard(kfold, repeated_kfold) · group(group_kfold, stratified_group_kfold) · temporal(time_split, combined_group_time) |
| `feature_engineering` | distribution(log_transform, normalize, quantile_features) · interaction(interaction_terms, domain_ratios) · encoding(sin_cos_encoding, target_encoding, spatial_encoding, tfidf_features) · spatial(relative_coordinates, distance_features) · signal(frequency_features, multi_layer_features, fourier_resampling) |
| `detect_shift` | detection(adversarial_validation, feature_importance_shift) · mitigation(remove_identifiers, domain_invariant_features) |
| `train_model` | tree(xgboost, lightgbm, catboost, random_forest) · linear(linear) · neural(neural_network, pretrained_backbone, temporal_cnn, transformer_encoder) |
| `handle_imbalance` | weighting(scale_pos_weight, class_weighted_loss) · calibration(calibrate_probabilities, optimize_threshold) · hierarchy(hierarchical_labels, lower_thresholds_recall) |
| `clean_data` | removal(remove_corrupted, remove_outliers, remove_leaky_features) · reconstruction(analytical_reconstruction, nan_native_model, domain_augmentation, clean_subset_training) |
| `augmentation` | geometric(geometric, rotation_invariant, image_rectification) · color(color_transform, clahe) · noise(gaussian_noise, robustness_augmentation) · domain(camera_simulation, temporal_augmentation, symmetry_augmentation, multi_view_processing) |
| `ensemble` | averaging(weighted_average, multi_seed_averaging, swa) · stacking(stacking) · diversity(diverse_features, heterogeneous) |
| `postprocess` | calibration(bias_correction, prediction_shrinkage, per_group_calibration) · domain(domain_rules, physics_constraints) · inference(tta) |
| `tune_loss` | asymmetric(asymmetric_loss, epsilon_insensitive) · uncertainty(gaussian_nll) · multi_objective(multi_task, interval_regression, quantile_regression) · weighting(sample_weighted, auxiliary_physics_loss) |
| `regularize` | weight(strong_regularization, ema, dropout) · transfer(freeze_backbone) |

Three further action types have no category: `pseudo_label` (iterations), `inspect_top_solution`, and `submit`.
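
As an illustration of how the hierarchy constrains choices, here is a minimal sketch of a client-side validator. `TAXONOMY` copies three rows of the table above, and `make_action` is a hypothetical helper, not part of the environment's API.

```python
# Hypothetical helper; TAXONOMY copies a subset of the table above.
TAXONOMY = {
    "set_cv": {
        "standard": ["kfold", "repeated_kfold"],
        "group": ["group_kfold", "stratified_group_kfold"],
        "temporal": ["time_split", "combined_group_time"],
    },
    "feature_engineering": {
        "distribution": ["log_transform", "normalize", "quantile_features"],
        "interaction": ["interaction_terms", "domain_ratios"],
    },
    "regularize": {
        "weight": ["strong_regularization", "ema", "dropout"],
        "transfer": ["freeze_backbone"],
    },
}


def make_action(action_type: str, category: str, technique: str) -> dict:
    """Build an action payload, rejecting pairs outside the taxonomy."""
    categories = TAXONOMY.get(action_type)
    if categories is None:
        raise ValueError(f"unknown action_type: {action_type!r}")
    if technique not in categories.get(category, []):
        raise ValueError(f"{technique!r} is not in {action_type}/{category}")
    return {
        "action_type": action_type,
        "parameters": {"category": category, "technique": technique},
    }


action = make_action("feature_engineering", "distribution", "log_transform")
print(action)
```

Validating before sending keeps an agent from spending steps on payloads that pair a technique with the wrong category, such as `kfold` under `temporal`.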

---

## Causal Dataset Properties

Each task has ground-truth properties that drive **causal** reward logic:

```python
# DatasetProperties lives in models.py (see the Architecture tree).
from kaggle_sim_env.models import DatasetProperties

DatasetProperties(
    has_shift=True,           # Actions addressing shift are rewarded
    has_leakage=True,         # Cleaning leaky features is critical
    has_noise_features=True,  # Interaction terms on noise amplify it
    has_missing_data=True,    # Reconstruction strategies get a bonus
    has_imbalance=True,       # scale_pos_weight becomes relevant
    has_images=False,         # Image augmentation is irrelevant → penalty
    needs_physics=False,      # Physics loss is irrelevant → penalty
)
```

Actions are scored based on whether they match the dataset:

```
if dataset.has_shift and technique == "adversarial_validation":
    reward += context_bonus       # relevant: positive bonus
elif not dataset.has_images and technique == "geometric":
    reward += irrelevant_penalty  # wrong domain: the penalty is negative
```
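
The matching logic can be sketched as a runnable function. This is an illustration under stated assumptions, not the environment's implementation: `Props` is a local stand-in for `DatasetProperties`, the `REQUIRES` mapping is hypothetical, and the two constants use the `context_bonus` (at relevance 1.0) and `irrelevant_penalty` values from the reward table.

```python
from dataclasses import dataclass


@dataclass
class Props:
    """Local stand-in for the subset of DatasetProperties used here."""
    has_shift: bool = False
    has_images: bool = False


# Values from the reward table; relevance is fixed at 1.0 for simplicity.
CONTEXT_BONUS = 0.03
IRRELEVANT_PENALTY = -0.05

# Hypothetical mapping: technique -> property that makes it relevant.
REQUIRES = {
    "adversarial_validation": "has_shift",
    "geometric": "has_images",
}


def context_reward(props: Props, technique: str) -> float:
    """Return a bonus if the technique matches the dataset, a penalty if not."""
    needed = REQUIRES.get(technique)
    if needed is None:
        return 0.0  # technique is context-neutral in this sketch
    return CONTEXT_BONUS if getattr(props, needed) else IRRELEVANT_PENALTY


tabular_with_shift = Props(has_shift=True, has_images=False)
print(context_reward(tabular_with_shift, "adversarial_validation"))  # 0.03
print(context_reward(tabular_with_shift, "geometric"))  # -0.05
```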

---

## Failure-Mode Traps

The environment contains traps that **punish common mistakes**:

| Trap | Trigger | Effect | Mitigation |
|---|---|---|---|
| kfold_on_temporal_data | Using kfold when has_shift | CV +0.08, **test -0.04** | Use time_split instead |
| ignoring_shift | Training without addressing shift | test **-0.06** | Detect shift first |
| keeping_leaky_feature | Training when has_leakage | test **-0.08** | Clean leaky features first |
| target_encoding_leakage | target_encoding on shifted data | CV +0.05, **test -0.06** | Don't use it |
| interaction_terms_on_noise | interaction_terms when noise | CV +0.05, **test -0.04** | Avoid on noisy data |
| tree_model_on_images | xgboost on image data | CV +0.04, **test -0.02** | Use pretrained_backbone |
| no_augmentation_on_images | Submit without augmentation | test **-0.04** | Apply augmentation |
| raw_heading_without_sincos | Submit without sin_cos_encoding | test **-0.03** | Encode angles properly |

A trap is **mitigated** when the agent takes the corrective action before the triggering one. The environment tracks which mitigations have been applied.
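
A minimal sketch of mitigation tracking, assuming a hypothetical `TrapTracker` class. The test penalties come from two rows of the table above; which techniques count as the mitigation for each trap is an assumption based on the Mitigation column.

```python
# Test-score penalties copied from the trap table.
TEST_PENALTY = {
    "ignoring_shift": -0.06,
    "keeping_leaky_feature": -0.08,
}

# Assumed mapping from trap to the techniques that disarm it.
MITIGATIONS = {
    "ignoring_shift": {"adversarial_validation", "feature_importance_shift"},
    "keeping_leaky_feature": {"remove_leaky_features"},
}


class TrapTracker:
    """Hypothetical tracker: remembers mitigations, penalises the rest."""

    def __init__(self):
        self.mitigated = set()
        self.triggered = []

    def record(self, technique: str) -> None:
        """Mark every trap that this technique mitigates."""
        for trap, fixes in MITIGATIONS.items():
            if technique in fixes:
                self.mitigated.add(trap)

    def on_train(self, armed_traps: list) -> float:
        """Return the total test penalty for armed, unmitigated traps."""
        penalty = 0.0
        for trap in armed_traps:
            if trap not in self.mitigated:
                self.triggered.append(trap)
                penalty += TEST_PENALTY[trap]
        return penalty


tracker = TrapTracker()
tracker.record("remove_leaky_features")  # cleaning happens before training
penalty = tracker.on_train(["ignoring_shift", "keeping_leaky_feature"])
print(penalty, tracker.triggered)
```

Here the leakage trap is disarmed by the earlier cleaning step, so only `ignoring_shift` fires.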

---

## Grading (4 Axes)

```
final = 0.40×performance + 0.25×strategy + 0.20×combo + 0.15×trap_avoidance
```

| Component | Description |
|---|---|
| **performance** | Test score vs ghost competitors |
| **strategy** | Contextual — penalises irrelevant strategies used |
| **combo** | Fraction of synergy combos activated |
| **trap_avoidance** | 1.0 minus fraction of traps triggered |
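
The weighted sum can be worked through in a short sketch. The weights are those in the formula above; the function names, the input values, and the handling of a task with zero traps or combos are assumptions made for illustration.

```python
WEIGHTS = {
    "performance": 0.40,
    "strategy": 0.25,
    "combo": 0.20,
    "trap_avoidance": 0.15,
}


def fraction(done: int, total: int, empty: float) -> float:
    """done / total, or `empty` when total is zero (assumed convention)."""
    return empty if total == 0 else done / total


def final_grade(performance, strategy, combos_done, combos_total,
                traps_triggered, traps_total) -> float:
    """Combine the four axes using the weights from the formula above."""
    axes = {
        "performance": performance,
        "strategy": strategy,
        "combo": fraction(combos_done, combos_total, empty=0.0),
        # 1.0 minus the fraction of traps triggered
        "trap_avoidance": 1.0 - fraction(traps_triggered, traps_total, empty=0.0),
    }
    return sum(WEIGHTS[name] * value for name, value in axes.items())


# Example inputs (made up): 2 of 4 combos activated, 1 of 4 traps triggered.
grade = final_grade(0.8, 0.6, combos_done=2, combos_total=4,
                    traps_triggered=1, traps_total=4)
print(round(grade, 4))  # 0.6825
```

With these inputs the terms are 0.32 + 0.15 + 0.10 + 0.1125.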

---

## Reward Function (9 Components)

| Component | Description |
|---|---|
| `cv_improvement` | Δ CV score |
| `strategy_bonus` | +0.05 for expected strategy |
| `context_bonus` | +0.03 × relevance when relevance is positive; -0.04 × abs(relevance) when it is negative |
| `combo_bonus` | +0.08 per combo completed |
| `redundancy_penalty` | -0.03 × repeat count |
| `irrelevant_penalty` | -0.05 for actions with relevance ≤ -0.8 |
| `trap_penalty` | -0.08 per trap triggered |
| `overfitting_penalty` | -0.5 × gap, where gap = CV score - test score, applied when gap > 0.05 |
| `submission_bonus` | 0.5 × test_score |
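
The sketch below shows how the nine components could combine into one step reward. `StepInfo` and `reward_components` are hypothetical names, the coefficients are taken from the table, and two points are assumptions: a negative relevance is penalised by its magnitude, and the overfitting and submission terms apply only on the submit step.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Hypothetical per-step summary; field names are illustrative."""
    cv_delta: float = 0.0
    expected_strategy: bool = False
    relevance: float = 0.0  # contextual relevance, assumed to lie in [-1, 1]
    combos_completed: int = 0
    repeat_count: int = 0
    traps_triggered: int = 0
    submitted: bool = False
    cv_score: float = 0.0
    test_score: float = 0.0


def reward_components(s: StepInfo) -> dict:
    """Break one step's reward into the nine components from the table."""
    gap = s.cv_score - s.test_score
    return {
        "cv_improvement": s.cv_delta,
        "strategy_bonus": 0.05 if s.expected_strategy else 0.0,
        "context_bonus": (0.03 * s.relevance if s.relevance > 0
                          else -0.04 * abs(s.relevance)),
        "combo_bonus": 0.08 * s.combos_completed,
        "redundancy_penalty": -0.03 * s.repeat_count,
        "irrelevant_penalty": -0.05 if s.relevance <= -0.8 else 0.0,
        "trap_penalty": -0.08 * s.traps_triggered,
        # Assumption: the last two terms apply only on the submit step.
        "overfitting_penalty": -0.5 * gap if s.submitted and gap > 0.05 else 0.0,
        "submission_bonus": 0.5 * s.test_score if s.submitted else 0.0,
    }


step = StepInfo(cv_delta=0.02, expected_strategy=True, relevance=1.0)
print(round(sum(reward_components(step).values()), 4))  # 0.1

final = StepInfo(submitted=True, cv_score=0.90, test_score=0.80)
print(round(sum(reward_components(final).values()), 4))  # 0.35
```

Returning the components as a dict rather than a single float makes it easy to log which term drove a given step's reward.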

---

## Tasks (5)

| Task | Difficulty | Traps | Combos | Key Challenge |
|---|---|---|---|---|
| `easy_churn` | Easy | 3 | 2 | Clean tabular, mild imbalance |
| `medium_fraud` | Medium | 3 | 3 | Shift, heavy imbalance, safety-critical |
| `hard_leaky_noisy` | Hard | 4 | 4 | Leakage, noise, missing data, shift |
| `image_quality` | Hard | 2 | 4 | Heavy-tailed, camera bias, augmentation |
| `trajectory_pred` | Hard | 2 | 4 | Multi-agent, physics, spatial-temporal |

---

## Baseline Agent

Structured multi-phase approach:
1. **Inspect** hints (1-2)
2. **Diagnose** dataset properties
3. **Clean** if needed
4. **CV** appropriate for domain
5. **Features** domain-relevant only
6. **Train** right model family
7. **Tune** imbalance/loss
8. **Ensemble** (1-2 techniques)
9. **Submit**

The agent keeps each episode to 8-15 actions and uses the hints to inform its decisions.
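
The phase ordering can be sketched as a plan builder. This is not the code in `baseline/run_baseline.py`: `build_plan` is a hypothetical function, the hint and feature phases are omitted for brevity, the technique chosen in each phase is illustrative, and the empty `parameters` on `submit` is an assumption.

```python
def build_plan(props: dict, max_actions: int = 15) -> list:
    """Return an ordered list of action payloads for the given properties."""
    plan = []

    def add(action_type: str, category: str, technique: str) -> None:
        plan.append({
            "action_type": action_type,
            "parameters": {"category": category, "technique": technique},
        })

    # Clean only if the dataset needs it.
    if props.get("has_leakage"):
        add("clean_data", "removal", "remove_leaky_features")
    # Pick a CV scheme appropriate for the domain.
    if props.get("has_shift"):
        add("detect_shift", "detection", "adversarial_validation")
        add("set_cv", "temporal", "time_split")
    else:
        add("set_cv", "standard", "kfold")
    # Train, tune, ensemble.
    add("train_model", "tree", "lightgbm")
    if props.get("has_imbalance"):
        add("handle_imbalance", "weighting", "scale_pos_weight")
    add("ensemble", "averaging", "multi_seed_averaging")
    # Reserve the final slot for submission.
    plan = plan[: max_actions - 1]
    plan.append({"action_type": "submit", "parameters": {}})
    return plan


plan = build_plan({"has_shift": True, "has_imbalance": True})
print([step["action_type"] for step in plan])
```

Because this sketch skips the hint and feature phases, it produces fewer actions than the 8-15 the baseline agent uses.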

---

## Docker

```bash
docker build -t kaggle-sim-env .
docker run -p 7860:7860 kaggle-sim-env
```

---

## License

MIT