Spaces:
Running
Running
v3: benchmark results, final report, agent/eval improvements, smoke test fixes
Browse files- FINAL_REPORT.md +180 -0
- agents.py +174 -3
- benchmark_guides.md +245 -0
- dataset_generators.py +14 -4
- evaluate_agent.py +10 -3
- examples/local_smoke_test.py +22 -10
- results_heuristic.json +506 -0
- results_llm.json +464 -0
- results_random.json +505 -0
- server/cleaning_environment.py +33 -16
- training_colab.py +2 -0
FINAL_REPORT.md
ADDED
|
@@ -0,0 +1,180 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# FSDS Cleaning Agent β Evaluation Report
|
| 2 |
+
|
| 3 |
+
**Date:** 2026-03-08
|
| 4 |
+
**Environment:** `https://israaaML-fsds-cleaning-env.hf.space`
|
| 5 |
+
**Episodes per task:** 3
|
| 6 |
+
**Tasks evaluated:** `ecommerce_mobile`, `subscription_churn`, `delivery_eta`
|
| 7 |
+
**Total episodes per agent:** 45 (15 tasks Γ 3 episodes)
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## 1. Summary Table
|
| 12 |
+
|
| 13 |
+
| Agent | Success Rate | Avg Return | Avg Steps | Avg Invalid Actions | Quality Gate Passed |
|
| 14 |
+
|---|---|---|---|---|---|
|
| 15 |
+
| HeuristicAgent | 0.00% | -0.0878 | 12.2 | 0 | No |
|
| 16 |
+
| RandomAgent | 0.00% | -0.0912 | 3.2 | 0 | No |
|
| 17 |
+
| LLMAgent (GRPO) | N/A* | N/A* | N/A* | N/A* | N/A* |
|
| 18 |
+
|
| 19 |
+
> *LLMAgent could not be evaluated locally because `unsloth` is a Colab/GPU-only library not installed in the local Python environment. See Section 4 for details and the recommended fix.
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## 2. Agent-by-Agent Analysis
|
| 24 |
+
|
| 25 |
+
### 2.1 HeuristicAgent (Rule-based baseline)
|
| 26 |
+
|
| 27 |
+
The HeuristicAgent follows a hard-coded, task-specific cleaning policy derived from the known `required_ops` for each task. It is the **intended upper-bound reference** for this environment.
|
| 28 |
+
|
| 29 |
+
**Results:**
|
| 30 |
+
|
| 31 |
+
| Task | Avg Return | Avg Retention | Steps | Quality Gate |
|
| 32 |
+
|---|---|---|---|---|
|
| 33 |
+
| ecommerce_mobile | -0.0600 | 98.06% | 12 | Failed |
|
| 34 |
+
| subscription_churn | -0.1108 | 98.17% | 14 | Failed |
|
| 35 |
+
| delivery_eta | -0.0927 | 97.98% | 13 | Failed |
|
| 36 |
+
|
| 37 |
+
**Key observations:**
|
| 38 |
+
- Executes the full cleaning pipeline correctly (12β14 steps).
|
| 39 |
+
- Achieves ~98% data retention across all tasks, which is healthy.
|
| 40 |
+
- Returns are negative because every step incurs a small step penalty (`-reward_per_step`), and no terminal success reward is collected since quality gates never pass.
|
| 41 |
+
- Zero invalid actions β all tool calls are structurally correct.
|
| 42 |
+
- The consistent `quality_gate_passed: False` across all tasks and all episodes suggests the environment's quality gate thresholds may require operations beyond what the scripted policy currently includes, or a configuration mismatch exists between the policy and the active server version.
|
| 43 |
+
|
| 44 |
+
**Interpretation:** The heuristic agent is behaviourally correct (right tools, right order, good retention) but does not cross the quality gate threshold. This is a signal about the quality gate strictness, not about the agent's cleaning ability.
|
| 45 |
+
|
| 46 |
+
---
|
| 47 |
+
|
| 48 |
+
### 2.2 RandomAgent (Lower-bound baseline)
|
| 49 |
+
|
| 50 |
+
The RandomAgent samples actions uniformly at random from the valid action space.
|
| 51 |
+
|
| 52 |
+
**Results:**
|
| 53 |
+
|
| 54 |
+
| Metric | Value |
|
| 55 |
+
|---|---|
|
| 56 |
+
| Success Rate | 0.00% |
|
| 57 |
+
| Avg Return | -0.0912 |
|
| 58 |
+
| Avg Steps | 3.2 |
|
| 59 |
+
| Avg Invalid Actions | 0 |
|
| 60 |
+
|
| 61 |
+
**Key observations:**
|
| 62 |
+
- Terminates early (avg 3.2 steps) because it randomly selects `submit_solution` before meaningful cleaning is done.
|
| 63 |
+
- Slightly worse average return than HeuristicAgent (-0.0912 vs -0.0878), confirming the heuristic is doing something useful even if not enough to pass quality gates.
|
| 64 |
+
- Zero invalid actions because the action sampler only picks structurally valid tool calls.
|
| 65 |
+
- The small gap between Random and Heuristic returns is partly due to the RandomAgent's short episodes β fewer steps means fewer step penalties, partially offsetting its bad cleaning quality.
|
| 66 |
+
|
| 67 |
+
---
|
| 68 |
+
|
| 69 |
+
### 2.3 LLMAgent β GRPO Fine-tuned Model
|
| 70 |
+
|
| 71 |
+
**Status: Not evaluated locally.**
|
| 72 |
+
|
| 73 |
+
All 45 episodes failed with:
|
| 74 |
+
```
|
| 75 |
+
Error: No module named 'unsloth'
|
| 76 |
+
```
|
| 77 |
+
|
| 78 |
+
**Root cause:** `unsloth` is a Colab-optimised library that patches the HuggingFace `transformers` stack for 4-bit GPU training. It is not pip-installable in standard CPU/MPS environments without CUDA. The trained checkpoint (`./data-cleaning-grpo-final`) is a LoRA adapter that requires the Unsloth model loader to be instantiated correctly.
|
| 79 |
+
|
| 80 |
+
**This is an infrastructure constraint, not a model quality issue.** The model itself trained successfully (Cell 9 completed without errors in Colab).
|
| 81 |
+
|
| 82 |
+
---
|
| 83 |
+
|
| 84 |
+
## 3. Comparative Analysis
|
| 85 |
+
|
| 86 |
+
```
|
| 87 |
+
Return ranking (higher is better):
|
| 88 |
+
LLMAgent (GRPO): N/A (not evaluated)
|
| 89 |
+
HeuristicAgent: -0.0878 β best evaluated
|
| 90 |
+
RandomAgent: -0.0912 β worst evaluated
|
| 91 |
+
|
| 92 |
+
Step efficiency (fewer steps = faster decisions):
|
| 93 |
+
RandomAgent: 3.2 (but premature submission)
|
| 94 |
+
HeuristicAgent: 12.2 (full pipeline execution)
|
| 95 |
+
LLMAgent: N/A
|
| 96 |
+
```
|
| 97 |
+
|
| 98 |
+
The HeuristicAgent is the better agent of the two that ran:
|
| 99 |
+
- It executes a complete, reasoned cleaning sequence.
|
| 100 |
+
- It achieves higher data retention (~98% vs ~100% for Random, but Random does no cleaning).
|
| 101 |
+
- Its negative return is purely a step-penalty artefact, not evidence of bad cleaning.
|
| 102 |
+
|
| 103 |
+
The RandomAgent's slightly fewer step-penalty losses are misleading β it simply stops early without cleaning anything meaningful.
|
| 104 |
+
|
| 105 |
+
---
|
| 106 |
+
|
| 107 |
+
## 4. How to Evaluate the LLMAgent
|
| 108 |
+
|
| 109 |
+
Run the evaluation in **Google Colab** (T4 GPU recommended) where `unsloth` is available:
|
| 110 |
+
|
| 111 |
+
```python
|
| 112 |
+
# In Colab, after installing dependencies:
|
| 113 |
+
# !pip install -q "trl>=0.12.0" "accelerate>=0.34.0" "peft>=0.13.0" "bitsandbytes>=0.43.0"
|
| 114 |
+
# !pip install -q unsloth
|
| 115 |
+
# !pip install -q "git+https://huggingface.co/spaces/israaaML/fsds_cleaning_env"
|
| 116 |
+
|
| 117 |
+
from fsds_cleaning_env.agents import LLMAgent
|
| 118 |
+
from fsds_cleaning_env.evaluate_agent import run_evaluation
|
| 119 |
+
|
| 120 |
+
agent = LLMAgent(model_path="./data-cleaning-grpo-final")
|
| 121 |
+
results = run_evaluation(
|
| 122 |
+
agent,
|
| 123 |
+
base_url="https://israaaML-fsds-cleaning-env.hf.space",
|
| 124 |
+
max_episodes_per_task=3,
|
| 125 |
+
)
|
| 126 |
+
|
| 127 |
+
print(f"Success rate: {results['aggregate']['success_rate']:.2%}")
|
| 128 |
+
print(f"Avg return: {results['aggregate']['avg_return']:.4f}")
|
| 129 |
+
print(f"Avg steps: {results['aggregate']['avg_steps']:.1f}")
|
| 130 |
+
```
|
| 131 |
+
|
| 132 |
+
Expected comparison targets once evaluated:
|
| 133 |
+
|
| 134 |
+
| Metric | Random (lower bound) | Heuristic (reference) | LLM target |
|
| 135 |
+
|---|---|---|---|
|
| 136 |
+
| Success rate | 0% | 0%* | >0% |
|
| 137 |
+
| Avg return | -0.0912 | -0.0878 | > -0.0878 |
|
| 138 |
+
| Avg steps | 3.2 | 12.2 | ~10β15 |
|
| 139 |
+
|
| 140 |
+
> *The 0% success rate for the Heuristic agent is likely caused by a quality gate configuration issue on the server β investigate `run_quality_gates` responses to confirm which specific checks are failing.
|
| 141 |
+
|
| 142 |
+
---
|
| 143 |
+
|
| 144 |
+
## 5. Issues Identified & Next Steps
|
| 145 |
+
|
| 146 |
+
### Issue 1 β Quality gates never pass (affects all agents)
|
| 147 |
+
The environment returns `quality_gate_passed: False` for every episode including the HeuristicAgent, which applies the correct canonical operations. This is unexpected.
|
| 148 |
+
|
| 149 |
+
**Recommended action:** Run a manual debug episode and inspect the `run_quality_gates` response payload to see which specific checks fail and why.
|
| 150 |
+
|
| 151 |
+
```python
|
| 152 |
+
with FSDSCleaningEnv(base_url=ENV_URL).sync() as env:
|
| 153 |
+
env.reset(task_id="ecommerce_mobile")
|
| 154 |
+
# ... apply cleaning ops ...
|
| 155 |
+
result = env.call_tool("run_quality_gates")
|
| 156 |
+
print(result) # inspect which tests fail
|
| 157 |
+
```
|
| 158 |
+
|
| 159 |
+
### Issue 2 β LLMAgent requires Colab/GPU environment
|
| 160 |
+
The trained LoRA adapter depends on `unsloth` and 4-bit quantisation (bitsandbytes + CUDA).
|
| 161 |
+
|
| 162 |
+
**Recommended action:** Run LLMAgent evaluation in Colab using the code in Section 4.
|
| 163 |
+
|
| 164 |
+
### Issue 3 β SFT warm-start checkpoint not used for GRPO
|
| 165 |
+
`training_colab.py` line 60 still points to the base model, not the SFT checkpoint:
|
| 166 |
+
```python
|
| 167 |
+
MODEL_NAME = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"
|
| 168 |
+
# MODEL_NAME = "./data-cleaning-sft-final" β not activated
|
| 169 |
+
```
|
| 170 |
+
Switching to the SFT warm-start before the next GRPO run should improve convergence significantly.
|
| 171 |
+
|
| 172 |
+
---
|
| 173 |
+
|
| 174 |
+
## 6. Conclusion
|
| 175 |
+
|
| 176 |
+
Of the two agents successfully evaluated, the **HeuristicAgent is clearly superior** β it executes a complete and structured data-cleaning pipeline with ~98% retention and zero invalid actions. The **RandomAgent** serves as a noisy lower bound, terminating prematurely without meaningful cleaning.
|
| 177 |
+
|
| 178 |
+
The **LLMAgent (GRPO)** trained successfully in Colab but requires a GPU environment to evaluate. Once evaluated in Colab, it should be compared against the Heuristic reference on the three metrics: success rate, average return, and average steps. A positive success rate would be a strong signal that RL training transferred useful cleaning behaviour beyond the scripted baseline.
|
| 179 |
+
|
| 180 |
+
The most important outstanding issue is diagnosing why quality gates fail even for the HeuristicAgent β resolving this is a prerequisite for any agent achieving a non-zero success rate.
|
agents.py
CHANGED
|
@@ -23,20 +23,25 @@ HEURISTIC_POLICIES: dict[str, list[tuple[str, str | None]]] = {
|
|
| 23 |
"ecommerce_mobile": [
|
| 24 |
("replace_invalid_with_null", "country"),
|
| 25 |
("replace_invalid_with_null", "items_in_cart"),
|
| 26 |
-
("
|
| 27 |
("cast_numeric", "items_in_cart"),
|
|
|
|
| 28 |
("impute_numeric", "items_in_cart"),
|
|
|
|
| 29 |
("clip_outliers_iqr", "items_in_cart"),
|
| 30 |
("clip_outliers_iqr", "order_value"),
|
| 31 |
("normalize_categories", "device_os"),
|
| 32 |
("normalize_categories", "country"),
|
|
|
|
|
|
|
| 33 |
("cast_datetime", "event_date"),
|
|
|
|
| 34 |
],
|
| 35 |
"subscription_churn": [
|
| 36 |
-
("drop_duplicates", None),
|
| 37 |
("replace_invalid_with_null", "monthly_spend"),
|
| 38 |
("replace_invalid_with_null", "age"),
|
| 39 |
("replace_invalid_with_null", "tenure_months"),
|
|
|
|
| 40 |
("cast_numeric", "age"),
|
| 41 |
("cast_numeric", "monthly_spend"),
|
| 42 |
("cast_numeric", "tenure_months"),
|
|
@@ -45,19 +50,28 @@ HEURISTIC_POLICIES: dict[str, list[tuple[str, str | None]]] = {
|
|
| 45 |
("impute_numeric", "tenure_months"),
|
| 46 |
("clip_outliers_iqr", "monthly_spend"),
|
| 47 |
("normalize_categories", "plan_type"),
|
|
|
|
|
|
|
|
|
|
|
|
|
| 48 |
],
|
| 49 |
"delivery_eta": [
|
| 50 |
-
("drop_duplicates", None),
|
| 51 |
("replace_invalid_with_null", "driver_rating"),
|
| 52 |
("replace_invalid_with_null", "late_deliveries_last_30d"),
|
|
|
|
|
|
|
| 53 |
("cast_numeric", "driver_rating"),
|
| 54 |
("cast_numeric", "delivery_distance_km"),
|
| 55 |
("cast_numeric", "late_deliveries_last_30d"),
|
| 56 |
("impute_numeric", "driver_rating"),
|
| 57 |
("impute_numeric", "late_deliveries_last_30d"),
|
|
|
|
| 58 |
("clip_outliers_iqr", "delivery_distance_km"),
|
| 59 |
("normalize_categories", "city"),
|
| 60 |
("normalize_categories", "vehicle_type"),
|
|
|
|
|
|
|
|
|
|
| 61 |
],
|
| 62 |
}
|
| 63 |
|
|
@@ -279,12 +293,169 @@ class LLMAgentAdapter:
|
|
| 279 |
return trajectory
|
| 280 |
|
| 281 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 282 |
__all__ = [
|
| 283 |
"Agent",
|
| 284 |
"AgentWithAct",
|
| 285 |
"ToolCall",
|
| 286 |
"RandomAgent",
|
| 287 |
"HeuristicAgent",
|
|
|
|
| 288 |
"LLMAgentAdapter",
|
| 289 |
"HEURISTIC_POLICIES",
|
|
|
|
| 290 |
]
|
|
|
|
| 23 |
"ecommerce_mobile": [
|
| 24 |
("replace_invalid_with_null", "country"),
|
| 25 |
("replace_invalid_with_null", "items_in_cart"),
|
| 26 |
+
("replace_invalid_with_null", "device_os"),
|
| 27 |
("cast_numeric", "items_in_cart"),
|
| 28 |
+
("cast_numeric", "order_value"),
|
| 29 |
("impute_numeric", "items_in_cart"),
|
| 30 |
+
("impute_numeric", "order_value"),
|
| 31 |
("clip_outliers_iqr", "items_in_cart"),
|
| 32 |
("clip_outliers_iqr", "order_value"),
|
| 33 |
("normalize_categories", "device_os"),
|
| 34 |
("normalize_categories", "country"),
|
| 35 |
+
("impute_categorical", "device_os"),
|
| 36 |
+
("impute_categorical", "country"),
|
| 37 |
("cast_datetime", "event_date"),
|
| 38 |
+
("drop_duplicates", None),
|
| 39 |
],
|
| 40 |
"subscription_churn": [
|
|
|
|
| 41 |
("replace_invalid_with_null", "monthly_spend"),
|
| 42 |
("replace_invalid_with_null", "age"),
|
| 43 |
("replace_invalid_with_null", "tenure_months"),
|
| 44 |
+
("replace_invalid_with_null", "payment_method"),
|
| 45 |
("cast_numeric", "age"),
|
| 46 |
("cast_numeric", "monthly_spend"),
|
| 47 |
("cast_numeric", "tenure_months"),
|
|
|
|
| 50 |
("impute_numeric", "tenure_months"),
|
| 51 |
("clip_outliers_iqr", "monthly_spend"),
|
| 52 |
("normalize_categories", "plan_type"),
|
| 53 |
+
("normalize_categories", "payment_method"),
|
| 54 |
+
("impute_categorical", "plan_type"),
|
| 55 |
+
("impute_categorical", "payment_method"),
|
| 56 |
+
("drop_duplicates", None),
|
| 57 |
],
|
| 58 |
"delivery_eta": [
|
|
|
|
| 59 |
("replace_invalid_with_null", "driver_rating"),
|
| 60 |
("replace_invalid_with_null", "late_deliveries_last_30d"),
|
| 61 |
+
("replace_invalid_with_null", "city"),
|
| 62 |
+
("replace_invalid_with_null", "vehicle_type"),
|
| 63 |
("cast_numeric", "driver_rating"),
|
| 64 |
("cast_numeric", "delivery_distance_km"),
|
| 65 |
("cast_numeric", "late_deliveries_last_30d"),
|
| 66 |
("impute_numeric", "driver_rating"),
|
| 67 |
("impute_numeric", "late_deliveries_last_30d"),
|
| 68 |
+
("impute_numeric", "delivery_distance_km"),
|
| 69 |
("clip_outliers_iqr", "delivery_distance_km"),
|
| 70 |
("normalize_categories", "city"),
|
| 71 |
("normalize_categories", "vehicle_type"),
|
| 72 |
+
("impute_categorical", "city"),
|
| 73 |
+
("impute_categorical", "vehicle_type"),
|
| 74 |
+
("drop_duplicates", None),
|
| 75 |
],
|
| 76 |
}
|
| 77 |
|
|
|
|
| 293 |
return trajectory
|
| 294 |
|
| 295 |
|
| 296 |
+
SYSTEM_PROMPT = """\
|
| 297 |
+
You are a Data Cleaning Agent working in a Medallion data pipeline (Bronze β Silver).
|
| 298 |
+
|
| 299 |
+
Your job: inspect a dirty dataset and clean it to Silver quality by choosing \
|
| 300 |
+
the right tools in the right order.
|
| 301 |
+
|
| 302 |
+
## Methodology (FSDS + VDS)
|
| 303 |
+
1. INSPECT first: profile_data, preview_data, get_task_brief
|
| 304 |
+
2. CLEAN systematically: fix dtypes, strip whitespace, handle missing values, \
|
| 305 |
+
remove duplicates, clip outliers
|
| 306 |
+
3. VALIDATE before submitting: run_quality_gates to check quality gate
|
| 307 |
+
4. SUBMIT: submit_solution when all tests pass
|
| 308 |
+
|
| 309 |
+
## Output Format
|
| 310 |
+
Each turn, output exactly one JSON action:
|
| 311 |
+
{"tool": "<tool_name>", "arguments": {"operation": "<op>", "column": "<col_or_omit>"}}
|
| 312 |
+
|
| 313 |
+
Top-level tools: profile_data, preview_data, get_task_brief, run_quality_gates, submit_solution
|
| 314 |
+
Cleaning tool: apply_cleaning_operation β requires an "operation" argument.
|
| 315 |
+
|
| 316 |
+
Available operations for apply_cleaning_operation:
|
| 317 |
+
drop_duplicates
|
| 318 |
+
replace_invalid_with_null (requires "column")
|
| 319 |
+
cast_numeric (requires "column")
|
| 320 |
+
cast_datetime (requires "column")
|
| 321 |
+
impute_numeric (requires "column"; optional "strategy": "median"|"mean")
|
| 322 |
+
impute_categorical (requires "column")
|
| 323 |
+
normalize_categories (requires "column")
|
| 324 |
+
clip_outliers_iqr (requires "column")
|
| 325 |
+
|
| 326 |
+
Examples:
|
| 327 |
+
{"tool": "profile_data", "arguments": {}}
|
| 328 |
+
{"tool": "apply_cleaning_operation", "arguments": {"operation": "drop_duplicates"}}
|
| 329 |
+
{"tool": "apply_cleaning_operation", "arguments": {"operation": "cast_numeric", "column": "amount"}}
|
| 330 |
+
{"tool": "run_quality_gates", "arguments": {}}
|
| 331 |
+
{"tool": "submit_solution", "arguments": {}}
|
| 332 |
+
|
| 333 |
+
Think step by step. Inspect before cleaning. Validate before submitting."""
|
| 334 |
+
|
| 335 |
+
|
| 336 |
+
class LLMAgent:
|
| 337 |
+
"""Agent powered by a fine-tuned LLM checkpoint (Unsloth/HuggingFace).
|
| 338 |
+
|
| 339 |
+
Loads the model once on first use and generates one JSON action per step
|
| 340 |
+
conditioned on the current observation and episode history.
|
| 341 |
+
|
| 342 |
+
Args:
|
| 343 |
+
model_path: Path to the saved model directory (e.g. ``./data-cleaning-grpo-final``).
|
| 344 |
+
max_new_tokens: Maximum tokens to generate per step.
|
| 345 |
+
temperature: Sampling temperature (0.0 = greedy).
|
| 346 |
+
"""
|
| 347 |
+
|
| 348 |
+
def __init__(
|
| 349 |
+
self,
|
| 350 |
+
model_path: str = "./data-cleaning-grpo-final",
|
| 351 |
+
max_new_tokens: int = 128,
|
| 352 |
+
temperature: float = 0.0,
|
| 353 |
+
) -> None:
|
| 354 |
+
self.model_path = model_path
|
| 355 |
+
self.max_new_tokens = max_new_tokens
|
| 356 |
+
self.temperature = temperature
|
| 357 |
+
self._model = None
|
| 358 |
+
self._tokenizer = None
|
| 359 |
+
|
| 360 |
+
def _load(self) -> None:
|
| 361 |
+
import json as _json
|
| 362 |
+
from unsloth import FastLanguageModel # type: ignore[import]
|
| 363 |
+
|
| 364 |
+
model, tokenizer = FastLanguageModel.from_pretrained(
|
| 365 |
+
model_name=self.model_path,
|
| 366 |
+
max_seq_length=2048,
|
| 367 |
+
load_in_4bit=True,
|
| 368 |
+
)
|
| 369 |
+
FastLanguageModel.for_inference(model)
|
| 370 |
+
if tokenizer.pad_token is None:
|
| 371 |
+
tokenizer.pad_token = tokenizer.eos_token
|
| 372 |
+
self._model = model
|
| 373 |
+
self._tokenizer = tokenizer
|
| 374 |
+
self._json = _json
|
| 375 |
+
|
| 376 |
+
def _build_user_message(
|
| 377 |
+
self, observation: dict[str, Any], history: list[dict[str, Any]]
|
| 378 |
+
) -> str:
|
| 379 |
+
import json as _json
|
| 380 |
+
parts: list[str] = []
|
| 381 |
+
if not history:
|
| 382 |
+
parts.append("You just received a dirty Bronze-layer dataset. What is your first action?")
|
| 383 |
+
else:
|
| 384 |
+
last = history[-1]
|
| 385 |
+
obs_summary = _json.dumps(last["result"], ensure_ascii=False)[:400]
|
| 386 |
+
parts.append(f"Last action: {last['tool_call']['tool']}")
|
| 387 |
+
parts.append(f"Result (truncated): {obs_summary}")
|
| 388 |
+
parts.append("What is your next action?")
|
| 389 |
+
return "\n".join(parts)
|
| 390 |
+
|
| 391 |
+
def _generate(self, observation: dict[str, Any], history: list[dict[str, Any]]) -> str:
|
| 392 |
+
if self._model is None:
|
| 393 |
+
self._load()
|
| 394 |
+
|
| 395 |
+
import torch # type: ignore[import]
|
| 396 |
+
|
| 397 |
+
messages = [
|
| 398 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 399 |
+
{"role": "user", "content": self._build_user_message(observation, history)},
|
| 400 |
+
]
|
| 401 |
+
text = self._tokenizer.apply_chat_template(
|
| 402 |
+
messages, tokenize=False, add_generation_prompt=True
|
| 403 |
+
)
|
| 404 |
+
inputs = self._tokenizer(text, return_tensors="pt").to(self._model.device)
|
| 405 |
+
|
| 406 |
+
with torch.no_grad():
|
| 407 |
+
output_ids = self._model.generate(
|
| 408 |
+
**inputs,
|
| 409 |
+
max_new_tokens=self.max_new_tokens,
|
| 410 |
+
temperature=self.temperature if self.temperature > 0 else None,
|
| 411 |
+
do_sample=self.temperature > 0,
|
| 412 |
+
pad_token_id=self._tokenizer.eos_token_id,
|
| 413 |
+
)
|
| 414 |
+
generated = output_ids[0][inputs["input_ids"].shape[-1]:]
|
| 415 |
+
return self._tokenizer.decode(generated, skip_special_tokens=True)
|
| 416 |
+
|
| 417 |
+
def run_episode(
|
| 418 |
+
self,
|
| 419 |
+
env: Any,
|
| 420 |
+
task_id: str,
|
| 421 |
+
max_steps: int = 18,
|
| 422 |
+
seed: int | None = None,
|
| 423 |
+
**reset_kwargs: Any,
|
| 424 |
+
) -> list[dict[str, Any]]:
|
| 425 |
+
reset_kwargs["seed"] = seed
|
| 426 |
+
env.reset(task_id=task_id, **reset_kwargs)
|
| 427 |
+
trajectory: list[dict[str, Any]] = []
|
| 428 |
+
history: list[dict[str, Any]] = []
|
| 429 |
+
observation: dict[str, Any] = {}
|
| 430 |
+
|
| 431 |
+
for _ in range(max_steps):
|
| 432 |
+
raw = self._generate(observation, history)
|
| 433 |
+
tool_call = _default_parse_llm_output(raw)
|
| 434 |
+
tool_name = tool_call["tool"]
|
| 435 |
+
args = tool_call.get("arguments", {})
|
| 436 |
+
result = env.call_tool(tool_name, **args)
|
| 437 |
+
trajectory.append({
|
| 438 |
+
"tool_name": tool_name,
|
| 439 |
+
"reward": _extract_reward(result),
|
| 440 |
+
"result": result,
|
| 441 |
+
"raw_output": raw,
|
| 442 |
+
})
|
| 443 |
+
history.append({"observation": observation, "tool_call": tool_call, "result": result})
|
| 444 |
+
observation = result
|
| 445 |
+
if result.get("done", False):
|
| 446 |
+
break
|
| 447 |
+
|
| 448 |
+
return trajectory
|
| 449 |
+
|
| 450 |
+
|
| 451 |
__all__ = [
|
| 452 |
"Agent",
|
| 453 |
"AgentWithAct",
|
| 454 |
"ToolCall",
|
| 455 |
"RandomAgent",
|
| 456 |
"HeuristicAgent",
|
| 457 |
+
"LLMAgent",
|
| 458 |
"LLMAgentAdapter",
|
| 459 |
"HEURISTIC_POLICIES",
|
| 460 |
+
"SYSTEM_PROMPT",
|
| 461 |
]
|
benchmark_guides.md
ADDED
|
@@ -0,0 +1,245 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
After reading your project README, two benchmarks stand out as the **best fit for your agent**. Iβll explain why using the characteristics of your environment.
|
| 2 |
+
|
| 3 |
+
Your environment evaluates an agent that must:
|
| 4 |
+
|
| 5 |
+
* profile a dataset
|
| 6 |
+
* detect issues (duplicates, invalid tokens, schema problems)
|
| 7 |
+
* apply cleaning operations
|
| 8 |
+
* pass quality gates
|
| 9 |
+
* submit a cleaned table
|
| 10 |
+
|
| 11 |
+
It is therefore **not just a coding benchmark**. It is an **interactive data-cleaning agent benchmark** with:
|
| 12 |
+
|
| 13 |
+
* tool use
|
| 14 |
+
* multi-step decision making
|
| 15 |
+
* environment feedback
|
| 16 |
+
* reward signals
|
| 17 |
+
|
| 18 |
+
So the benchmarks you choose should reflect **data-science workflows and agentic behavior**, not just code generation.
|
| 19 |
+
|
| 20 |
+
---
|
| 21 |
+
|
| 22 |
+
# Recommended Benchmark 1
|
| 23 |
+
|
| 24 |
+
# DS-1000
|
| 25 |
+
|
| 26 |
+
## Why it fits your project
|
| 27 |
+
|
| 28 |
+
Your environment requires the agent to perform **pandas-style cleaning and transformations**.
|
| 29 |
+
|
| 30 |
+
DS-1000 contains many tasks that mirror those operations:
|
| 31 |
+
|
| 32 |
+
Typical operations tested:
|
| 33 |
+
|
| 34 |
+
* handling missing values
|
| 35 |
+
* joins / merges
|
| 36 |
+
* groupby aggregations
|
| 37 |
+
* reshaping tables
|
| 38 |
+
* feature engineering
|
| 39 |
+
* data type fixes
|
| 40 |
+
|
| 41 |
+
These are exactly the kinds of transformations your agent performs with:
|
| 42 |
+
|
| 43 |
+
```
|
| 44 |
+
apply_cleaning_operation(...)
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
in the environment.
|
| 48 |
+
|
| 49 |
+
### What DS-1000 measures well
|
| 50 |
+
|
| 51 |
+
It evaluates the **atomic data-science skills** needed by your agent:
|
| 52 |
+
|
| 53 |
+
* pandas fluency
|
| 54 |
+
* statistical transformations
|
| 55 |
+
* data manipulation correctness
|
| 56 |
+
|
| 57 |
+
### Metric
|
| 58 |
+
|
| 59 |
+
```
|
| 60 |
+
pass@1
|
| 61 |
+
```
|
| 62 |
+
|
| 63 |
+
### Target performance
|
| 64 |
+
|
| 65 |
+
| Level | DS-1000 score |
|
| 66 |
+
| ------ | ------------- |
|
| 67 |
+
| weak | <40% |
|
| 68 |
+
| decent | ~50% |
|
| 69 |
+
| strong | >60% |
|
| 70 |
+
| SOTA | ~70β75% |
|
| 71 |
+
|
| 72 |
+
If your agent achieves **>60%**, its data-manipulation skills are competitive with top LLMs.
|
| 73 |
+
|
| 74 |
+
---
|
| 75 |
+
|
| 76 |
+
# Recommended Benchmark 2
|
| 77 |
+
|
| 78 |
+
# DA-Code (Data-Science Agent Benchmark)
|
| 79 |
+
|
| 80 |
+
## Why it fits your project
|
| 81 |
+
|
| 82 |
+
Your system is **not just a code generator**.
|
| 83 |
+
|
| 84 |
+
It requires:
|
| 85 |
+
|
| 86 |
+
* multi-step reasoning
|
| 87 |
+
* environment interaction
|
| 88 |
+
* tool use
|
| 89 |
+
* iterative improvement
|
| 90 |
+
|
| 91 |
+
Exactly like DA-Code.
|
| 92 |
+
|
| 93 |
+
DA-Code tasks look like this:
|
| 94 |
+
|
| 95 |
+
```
|
| 96 |
+
inspect dataset
|
| 97 |
+
clean columns
|
| 98 |
+
engineer features
|
| 99 |
+
train model
|
| 100 |
+
evaluate results
|
| 101 |
+
```
|
| 102 |
+
|
| 103 |
+
Your environment uses a similar pipeline:
|
| 104 |
+
|
| 105 |
+
```
|
| 106 |
+
profile_data
|
| 107 |
+
apply_cleaning_operation
|
| 108 |
+
run_quality_gates
|
| 109 |
+
submit_solution
|
| 110 |
+
```
|
| 111 |
+
|
| 112 |
+
So DA-Code measures the **agentic behavior** your project is targeting.
|
| 113 |
+
|
| 114 |
+
### What DA-Code evaluates
|
| 115 |
+
|
| 116 |
+
* planning
|
| 117 |
+
* iterative reasoning
|
| 118 |
+
* code execution
|
| 119 |
+
* data-analysis workflows
|
| 120 |
+
|
| 121 |
+
### Metric
|
| 122 |
+
|
| 123 |
+
```
|
| 124 |
+
task completion score
|
| 125 |
+
```
|
| 126 |
+
|
| 127 |
+
### Current SOTA
|
| 128 |
+
|
| 129 |
+
| Model | Score |
|
| 130 |
+
| ------------------ | ------- |
|
| 131 |
+
| GPT-4-class agents | ~30β35% |
|
| 132 |
+
| open models | ~15β25% |
|
| 133 |
+
|
| 134 |
+
Even the best systems solve only about **1/3 of tasks**, so itβs a challenging benchmark.
|
| 135 |
+
|
| 136 |
+
---
|
| 137 |
+
|
| 138 |
+
# Why these two benchmarks together are ideal
|
| 139 |
+
|
| 140 |
+
They measure **two complementary capabilities** your project needs.
|
| 141 |
+
|
| 142 |
+
| Capability | Benchmark |
|
| 143 |
+
| ----------------------------- | ----------- |
|
| 144 |
+
| data cleaning / pandas skills | **DS-1000** |
|
| 145 |
+
| agent workflow reasoning | **DA-Code** |
|
| 146 |
+
|
| 147 |
+
Your environment tests both.
|
| 148 |
+
|
| 149 |
+
If your agent scores well on both, it strongly suggests:
|
| 150 |
+
|
| 151 |
+
* it understands data manipulation
|
| 152 |
+
* it can plan multi-step cleaning workflows
|
| 153 |
+
|
| 154 |
+
---
|
| 155 |
+
|
| 156 |
+
# How to adapt them to your project
|
| 157 |
+
|
| 158 |
+
I would evaluate your agent in three layers.
|
| 159 |
+
|
| 160 |
+
## 1 β Micro-skills (DS-1000)
|
| 161 |
+
|
| 162 |
+
Measure:
|
| 163 |
+
|
| 164 |
+
```
|
| 165 |
+
pandas correctness
|
| 166 |
+
data transformations
|
| 167 |
+
aggregation logic
|
| 168 |
+
```
|
| 169 |
+
|
| 170 |
+
---
|
| 171 |
+
|
| 172 |
+
## 2 β Agent capability (DA-Code)
|
| 173 |
+
|
| 174 |
+
Measure:
|
| 175 |
+
|
| 176 |
+
```
|
| 177 |
+
multi-step reasoning
|
| 178 |
+
tool usage
|
| 179 |
+
pipeline construction
|
| 180 |
+
```
|
| 181 |
+
|
| 182 |
+
---
|
| 183 |
+
|
| 184 |
+
## 3 β Your custom benchmark
|
| 185 |
+
|
| 186 |
+
Your environment already defines good metrics:
|
| 187 |
+
|
| 188 |
+
* success rate
|
| 189 |
+
* average return
|
| 190 |
+
* invalid actions
|
| 191 |
+
* steps per episode
|
| 192 |
+
|
| 193 |
+
These are excellent **agent evaluation metrics**.
|
| 194 |
+
|
| 195 |
+
---
|
| 196 |
+
|
| 197 |
+
# Suggested evaluation stack for your project
|
| 198 |
+
|
| 199 |
+
Use this hierarchy:
|
| 200 |
+
|
| 201 |
+
```
|
| 202 |
+
Level 1
|
| 203 |
+
DS-1000
|
| 204 |
+
|
| 205 |
+
Level 2
|
| 206 |
+
DA-Code
|
| 207 |
+
|
| 208 |
+
Level 3
|
| 209 |
+
FSDSCleaningEnv evaluation set
|
| 210 |
+
```
|
| 211 |
+
|
| 212 |
+
Where level 3 measures **task-specific performance**.
|
| 213 |
+
|
| 214 |
+
---
|
| 215 |
+
|
| 216 |
+
# One more thing (important)
|
| 217 |
+
|
| 218 |
+
Your environment has a very strong design choice:
|
| 219 |
+
|
| 220 |
+
```
|
| 221 |
+
random dataset per episode
|
| 222 |
+
```
|
| 223 |
+
|
| 224 |
+
This prevents memorization and encourages generalization.
|
| 225 |
+
|
| 226 |
+
Many research benchmarks **do not have this property**, which makes your environment particularly good for RL.
|
| 227 |
+
|
| 228 |
+
---
|
| 229 |
+
|
| 230 |
+
# If your goal is to publish or win a hackathon
|
| 231 |
+
|
| 232 |
+
I would report:
|
| 233 |
+
|
| 234 |
+
```
|
| 235 |
+
DS-1000 score
|
| 236 |
+
DA-Code score
|
| 237 |
+
FSDSCleaningEnv success rate
|
| 238 |
+
```
|
| 239 |
+
|
| 240 |
+
Together those three demonstrate:
|
| 241 |
+
|
| 242 |
+
* coding skill
|
| 243 |
+
* agent reasoning
|
| 244 |
+
* domain-specific cleaning ability
|
| 245 |
+
|
dataset_generators.py
CHANGED
|
@@ -79,14 +79,21 @@ def _apply_noise(
|
|
| 79 |
numeric_columns: list[str],
|
| 80 |
categorical_columns: list[str],
|
| 81 |
target_column: str,
|
|
|
|
| 82 |
) -> pd.DataFrame:
|
| 83 |
-
"""Inject noise into a clean DataFrame. Does not modify the target column.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 84 |
rng = np.random.default_rng(seed)
|
| 85 |
out = df.copy()
|
|
|
|
| 86 |
|
| 87 |
-
# 1. Missing values (exclude target)
|
| 88 |
for col in out.columns:
|
| 89 |
-
if col == target_column:
|
| 90 |
continue
|
| 91 |
mask = rng.random(len(out)) < profile.p_missing
|
| 92 |
if mask.any():
|
|
@@ -200,6 +207,7 @@ def generate_mobile_ecommerce(
|
|
| 200 |
numeric_columns=["items_in_cart", "order_value"],
|
| 201 |
categorical_columns=["device_os", "country"],
|
| 202 |
target_column="converted",
|
|
|
|
| 203 |
)
|
| 204 |
|
| 205 |
|
|
@@ -240,6 +248,7 @@ def generate_subscription_churn(
|
|
| 240 |
numeric_columns=["age", "monthly_spend", "tenure_months"],
|
| 241 |
categorical_columns=["plan_type", "payment_method"],
|
| 242 |
target_column="churned",
|
|
|
|
| 243 |
)
|
| 244 |
|
| 245 |
|
|
@@ -279,9 +288,10 @@ def generate_delivery_eta(
|
|
| 279 |
df,
|
| 280 |
seed=seed or rng.integers(0, 2**31),
|
| 281 |
profile=profile,
|
| 282 |
-
numeric_columns=["driver_rating", "delivery_distance_km", "late_deliveries_last_30d"
|
| 283 |
categorical_columns=["city", "vehicle_type"],
|
| 284 |
target_column="delivery_time_minutes",
|
|
|
|
| 285 |
)
|
| 286 |
|
| 287 |
|
|
|
|
| 79 |
numeric_columns: list[str],
|
| 80 |
categorical_columns: list[str],
|
| 81 |
target_column: str,
|
| 82 |
+
skip_missing_columns: list[str] | None = None,
|
| 83 |
) -> pd.DataFrame:
|
| 84 |
+
"""Inject noise into a clean DataFrame. Does not modify the target column.
|
| 85 |
+
|
| 86 |
+
Args:
|
| 87 |
+
skip_missing_columns: Columns to exclude from missing-value injection
|
| 88 |
+
(e.g. ID columns and datetime columns that have no impute operation).
|
| 89 |
+
"""
|
| 90 |
rng = np.random.default_rng(seed)
|
| 91 |
out = df.copy()
|
| 92 |
+
_skip_missing = set(skip_missing_columns or [])
|
| 93 |
|
| 94 |
+
# 1. Missing values (exclude target and skip_missing_columns)
|
| 95 |
for col in out.columns:
|
| 96 |
+
if col == target_column or col in _skip_missing:
|
| 97 |
continue
|
| 98 |
mask = rng.random(len(out)) < profile.p_missing
|
| 99 |
if mask.any():
|
|
|
|
| 207 |
numeric_columns=["items_in_cart", "order_value"],
|
| 208 |
categorical_columns=["device_os", "country"],
|
| 209 |
target_column="converted",
|
| 210 |
+
skip_missing_columns=["session_id", "customer_id", "event_date"],
|
| 211 |
)
|
| 212 |
|
| 213 |
|
|
|
|
| 248 |
numeric_columns=["age", "monthly_spend", "tenure_months"],
|
| 249 |
categorical_columns=["plan_type", "payment_method"],
|
| 250 |
target_column="churned",
|
| 251 |
+
skip_missing_columns=["customer_key"],
|
| 252 |
)
|
| 253 |
|
| 254 |
|
|
|
|
| 288 |
df,
|
| 289 |
seed=seed or rng.integers(0, 2**31),
|
| 290 |
profile=profile,
|
| 291 |
+
numeric_columns=["driver_rating", "delivery_distance_km", "late_deliveries_last_30d"],
|
| 292 |
categorical_columns=["city", "vehicle_type"],
|
| 293 |
target_column="delivery_time_minutes",
|
| 294 |
+
skip_missing_columns=["route_id"],
|
| 295 |
)
|
| 296 |
|
| 297 |
|
evaluate_agent.py
CHANGED
|
@@ -21,7 +21,7 @@ _PROJECT_ROOT = _SCRIPT_DIR.parent
|
|
| 21 |
if str(_PROJECT_ROOT) not in sys.path:
|
| 22 |
sys.path.insert(0, str(_PROJECT_ROOT))
|
| 23 |
|
| 24 |
-
from fsds_cleaning_env.agents import HeuristicAgent, RandomAgent
|
| 25 |
from fsds_cleaning_env.client import FSDSCleaningEnv
|
| 26 |
from fsds_cleaning_env.dataset_generators import EVAL_SEEDS
|
| 27 |
from fsds_cleaning_env.evaluation_tasks import EVAL_TASKS
|
|
@@ -106,9 +106,14 @@ def main() -> None:
|
|
| 106 |
parser = argparse.ArgumentParser(description="Evaluate an agent on the FSDS Cleaning Environment")
|
| 107 |
parser.add_argument(
|
| 108 |
"--agent",
|
| 109 |
-
choices=["random", "heuristic"],
|
| 110 |
default="heuristic",
|
| 111 |
-
help="Which
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 112 |
)
|
| 113 |
parser.add_argument(
|
| 114 |
"--base-url",
|
|
@@ -137,6 +142,8 @@ def main() -> None:
|
|
| 137 |
|
| 138 |
if args.agent == "random":
|
| 139 |
agent = RandomAgent(rng=__import__("random").Random(args.seed) if args.seed else None)
|
|
|
|
|
|
|
| 140 |
else:
|
| 141 |
agent = HeuristicAgent()
|
| 142 |
|
|
|
|
| 21 |
if str(_PROJECT_ROOT) not in sys.path:
|
| 22 |
sys.path.insert(0, str(_PROJECT_ROOT))
|
| 23 |
|
| 24 |
+
from fsds_cleaning_env.agents import HeuristicAgent, LLMAgent, RandomAgent
|
| 25 |
from fsds_cleaning_env.client import FSDSCleaningEnv
|
| 26 |
from fsds_cleaning_env.dataset_generators import EVAL_SEEDS
|
| 27 |
from fsds_cleaning_env.evaluation_tasks import EVAL_TASKS
|
|
|
|
| 106 |
parser = argparse.ArgumentParser(description="Evaluate an agent on the FSDS Cleaning Environment")
|
| 107 |
parser.add_argument(
|
| 108 |
"--agent",
|
| 109 |
+
choices=["random", "heuristic", "llm"],
|
| 110 |
default="heuristic",
|
| 111 |
+
help="Which agent to evaluate",
|
| 112 |
+
)
|
| 113 |
+
parser.add_argument(
|
| 114 |
+
"--model-path",
|
| 115 |
+
default="./data-cleaning-grpo-final",
|
| 116 |
+
help="Path to trained LLM checkpoint (used when --agent llm)",
|
| 117 |
)
|
| 118 |
parser.add_argument(
|
| 119 |
"--base-url",
|
|
|
|
| 142 |
|
| 143 |
if args.agent == "random":
|
| 144 |
agent = RandomAgent(rng=__import__("random").Random(args.seed) if args.seed else None)
|
| 145 |
+
elif args.agent == "llm":
|
| 146 |
+
agent = LLMAgent(model_path=args.model_path)
|
| 147 |
else:
|
| 148 |
agent = HeuristicAgent()
|
| 149 |
|
examples/local_smoke_test.py
CHANGED
|
@@ -48,22 +48,34 @@ def main() -> None:
|
|
| 48 |
if step_result.done:
|
| 49 |
return
|
| 50 |
|
| 51 |
-
for operation in [
|
| 52 |
-
"replace_invalid_with_null",
|
| 53 |
-
"
|
| 54 |
-
"
|
| 55 |
-
"
|
| 56 |
-
"
|
| 57 |
-
"
|
| 58 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 59 |
]:
|
|
|
|
|
|
|
|
|
|
| 60 |
step_result = client.step(
|
| 61 |
CallToolAction(
|
| 62 |
tool_name="apply_cleaning_operation",
|
| 63 |
-
arguments=
|
| 64 |
)
|
| 65 |
)
|
| 66 |
-
|
|
|
|
| 67 |
if step_result.done:
|
| 68 |
return
|
| 69 |
|
|
|
|
| 48 |
if step_result.done:
|
| 49 |
return
|
| 50 |
|
| 51 |
+
for operation, column in [
|
| 52 |
+
("replace_invalid_with_null", "country"),
|
| 53 |
+
("replace_invalid_with_null", "items_in_cart"),
|
| 54 |
+
("replace_invalid_with_null", "device_os"),
|
| 55 |
+
("cast_numeric", "items_in_cart"),
|
| 56 |
+
("cast_numeric", "order_value"),
|
| 57 |
+
("impute_numeric", "items_in_cart"),
|
| 58 |
+
("impute_numeric", "order_value"),
|
| 59 |
+
("clip_outliers_iqr", "items_in_cart"),
|
| 60 |
+
("clip_outliers_iqr", "order_value"),
|
| 61 |
+
("normalize_categories", "device_os"),
|
| 62 |
+
("normalize_categories", "country"),
|
| 63 |
+
("impute_categorical", "device_os"),
|
| 64 |
+
("impute_categorical", "country"),
|
| 65 |
+
("cast_datetime", "event_date"),
|
| 66 |
+
("drop_duplicates", None),
|
| 67 |
]:
|
| 68 |
+
args: dict = {"operation": operation}
|
| 69 |
+
if column is not None:
|
| 70 |
+
args["column"] = column
|
| 71 |
step_result = client.step(
|
| 72 |
CallToolAction(
|
| 73 |
tool_name="apply_cleaning_operation",
|
| 74 |
+
arguments=args,
|
| 75 |
)
|
| 76 |
)
|
| 77 |
+
label = f"APPLY {operation}" + (f" ({column})" if column else "")
|
| 78 |
+
print_step_result(label, step_result)
|
| 79 |
if step_result.done:
|
| 80 |
return
|
| 81 |
|
results_heuristic.json
ADDED
|
@@ -0,0 +1,506 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"agent": "HeuristicAgent",
|
| 3 |
+
"base_url": "https://israaaML-fsds-cleaning-env.hf.space",
|
| 4 |
+
"n_episodes": 45,
|
| 5 |
+
"aggregate": {
|
| 6 |
+
"episodes": 45,
|
| 7 |
+
"success_rate": 0.0,
|
| 8 |
+
"avg_return": -0.08783333333333322,
|
| 9 |
+
"avg_steps": 12.2,
|
| 10 |
+
"avg_invalid_actions": 0.0
|
| 11 |
+
},
|
| 12 |
+
"episodes": [
|
| 13 |
+
{
|
| 14 |
+
"task_name": "ecommerce_mobile_baseline_seed0",
|
| 15 |
+
"task_id": "ecommerce_mobile",
|
| 16 |
+
"episode": 0,
|
| 17 |
+
"success": false,
|
| 18 |
+
"total_return": -0.07499999999999998,
|
| 19 |
+
"steps": 12,
|
| 20 |
+
"invalid_actions": 0,
|
| 21 |
+
"quality_gate_passed": false,
|
| 22 |
+
"retention_ratio": 0.9864
|
| 23 |
+
},
|
| 24 |
+
{
|
| 25 |
+
"task_name": "ecommerce_mobile_baseline_seed0",
|
| 26 |
+
"task_id": "ecommerce_mobile",
|
| 27 |
+
"episode": 1,
|
| 28 |
+
"success": false,
|
| 29 |
+
"total_return": -0.07499999999999998,
|
| 30 |
+
"steps": 12,
|
| 31 |
+
"invalid_actions": 0,
|
| 32 |
+
"quality_gate_passed": false,
|
| 33 |
+
"retention_ratio": 0.9864
|
| 34 |
+
},
|
| 35 |
+
{
|
| 36 |
+
"task_name": "ecommerce_mobile_baseline_seed0",
|
| 37 |
+
"task_id": "ecommerce_mobile",
|
| 38 |
+
"episode": 2,
|
| 39 |
+
"success": false,
|
| 40 |
+
"total_return": -0.07499999999999998,
|
| 41 |
+
"steps": 12,
|
| 42 |
+
"invalid_actions": 0,
|
| 43 |
+
"quality_gate_passed": false,
|
| 44 |
+
"retention_ratio": 0.9864
|
| 45 |
+
},
|
| 46 |
+
{
|
| 47 |
+
"task_name": "ecommerce_mobile_baseline_seed1",
|
| 48 |
+
"task_id": "ecommerce_mobile",
|
| 49 |
+
"episode": 0,
|
| 50 |
+
"success": false,
|
| 51 |
+
"total_return": -0.07499999999999998,
|
| 52 |
+
"steps": 12,
|
| 53 |
+
"invalid_actions": 0,
|
| 54 |
+
"quality_gate_passed": false,
|
| 55 |
+
"retention_ratio": 0.9786
|
| 56 |
+
},
|
| 57 |
+
{
|
| 58 |
+
"task_name": "ecommerce_mobile_baseline_seed1",
|
| 59 |
+
"task_id": "ecommerce_mobile",
|
| 60 |
+
"episode": 1,
|
| 61 |
+
"success": false,
|
| 62 |
+
"total_return": -0.07499999999999998,
|
| 63 |
+
"steps": 12,
|
| 64 |
+
"invalid_actions": 0,
|
| 65 |
+
"quality_gate_passed": false,
|
| 66 |
+
"retention_ratio": 0.9786
|
| 67 |
+
},
|
| 68 |
+
{
|
| 69 |
+
"task_name": "ecommerce_mobile_baseline_seed1",
|
| 70 |
+
"task_id": "ecommerce_mobile",
|
| 71 |
+
"episode": 2,
|
| 72 |
+
"success": false,
|
| 73 |
+
"total_return": -0.07499999999999998,
|
| 74 |
+
"steps": 12,
|
| 75 |
+
"invalid_actions": 0,
|
| 76 |
+
"quality_gate_passed": false,
|
| 77 |
+
"retention_ratio": 0.9786
|
| 78 |
+
},
|
| 79 |
+
{
|
| 80 |
+
"task_name": "ecommerce_mobile_baseline_seed2",
|
| 81 |
+
"task_id": "ecommerce_mobile",
|
| 82 |
+
"episode": 0,
|
| 83 |
+
"error": "Tool 'submit_solution' failed: Error calling tool 'submit_solution': Error serializing to JSON: TypeError: 'float' object cannot be interpreted as an integer (type: execution_error)",
|
| 84 |
+
"success": false,
|
| 85 |
+
"total_return": 0.0,
|
| 86 |
+
"steps": 0,
|
| 87 |
+
"invalid_actions": 0
|
| 88 |
+
},
|
| 89 |
+
{
|
| 90 |
+
"task_name": "ecommerce_mobile_baseline_seed2",
|
| 91 |
+
"task_id": "ecommerce_mobile",
|
| 92 |
+
"episode": 1,
|
| 93 |
+
"error": "Tool 'submit_solution' failed: Error calling tool 'submit_solution': Error serializing to JSON: TypeError: 'float' object cannot be interpreted as an integer (type: execution_error)",
|
| 94 |
+
"success": false,
|
| 95 |
+
"total_return": 0.0,
|
| 96 |
+
"steps": 0,
|
| 97 |
+
"invalid_actions": 0
|
| 98 |
+
},
|
| 99 |
+
{
|
| 100 |
+
"task_name": "ecommerce_mobile_baseline_seed2",
|
| 101 |
+
"task_id": "ecommerce_mobile",
|
| 102 |
+
"episode": 2,
|
| 103 |
+
"error": "Tool 'submit_solution' failed: Error calling tool 'submit_solution': Error serializing to JSON: TypeError: 'float' object cannot be interpreted as an integer (type: execution_error)",
|
| 104 |
+
"success": false,
|
| 105 |
+
"total_return": 0.0,
|
| 106 |
+
"steps": 0,
|
| 107 |
+
"invalid_actions": 0
|
| 108 |
+
},
|
| 109 |
+
{
|
| 110 |
+
"task_name": "ecommerce_mobile_baseline_seed3",
|
| 111 |
+
"task_id": "ecommerce_mobile",
|
| 112 |
+
"episode": 0,
|
| 113 |
+
"success": false,
|
| 114 |
+
"total_return": -0.07499999999999998,
|
| 115 |
+
"steps": 12,
|
| 116 |
+
"invalid_actions": 0,
|
| 117 |
+
"quality_gate_passed": false,
|
| 118 |
+
"retention_ratio": 0.9767
|
| 119 |
+
},
|
| 120 |
+
{
|
| 121 |
+
"task_name": "ecommerce_mobile_baseline_seed3",
|
| 122 |
+
"task_id": "ecommerce_mobile",
|
| 123 |
+
"episode": 1,
|
| 124 |
+
"success": false,
|
| 125 |
+
"total_return": -0.07499999999999998,
|
| 126 |
+
"steps": 12,
|
| 127 |
+
"invalid_actions": 0,
|
| 128 |
+
"quality_gate_passed": false,
|
| 129 |
+
"retention_ratio": 0.9767
|
| 130 |
+
},
|
| 131 |
+
{
|
| 132 |
+
"task_name": "ecommerce_mobile_baseline_seed3",
|
| 133 |
+
"task_id": "ecommerce_mobile",
|
| 134 |
+
"episode": 2,
|
| 135 |
+
"success": false,
|
| 136 |
+
"total_return": -0.07499999999999998,
|
| 137 |
+
"steps": 12,
|
| 138 |
+
"invalid_actions": 0,
|
| 139 |
+
"quality_gate_passed": false,
|
| 140 |
+
"retention_ratio": 0.9767
|
| 141 |
+
},
|
| 142 |
+
{
|
| 143 |
+
"task_name": "ecommerce_mobile_baseline_seed4",
|
| 144 |
+
"task_id": "ecommerce_mobile",
|
| 145 |
+
"episode": 0,
|
| 146 |
+
"success": false,
|
| 147 |
+
"total_return": -0.07499999999999998,
|
| 148 |
+
"steps": 12,
|
| 149 |
+
"invalid_actions": 0,
|
| 150 |
+
"quality_gate_passed": false,
|
| 151 |
+
"retention_ratio": 0.9806
|
| 152 |
+
},
|
| 153 |
+
{
|
| 154 |
+
"task_name": "ecommerce_mobile_baseline_seed4",
|
| 155 |
+
"task_id": "ecommerce_mobile",
|
| 156 |
+
"episode": 1,
|
| 157 |
+
"success": false,
|
| 158 |
+
"total_return": -0.07499999999999998,
|
| 159 |
+
"steps": 12,
|
| 160 |
+
"invalid_actions": 0,
|
| 161 |
+
"quality_gate_passed": false,
|
| 162 |
+
"retention_ratio": 0.9806
|
| 163 |
+
},
|
| 164 |
+
{
|
| 165 |
+
"task_name": "ecommerce_mobile_baseline_seed4",
|
| 166 |
+
"task_id": "ecommerce_mobile",
|
| 167 |
+
"episode": 2,
|
| 168 |
+
"success": false,
|
| 169 |
+
"total_return": -0.07499999999999998,
|
| 170 |
+
"steps": 12,
|
| 171 |
+
"invalid_actions": 0,
|
| 172 |
+
"quality_gate_passed": false,
|
| 173 |
+
"retention_ratio": 0.9806
|
| 174 |
+
},
|
| 175 |
+
{
|
| 176 |
+
"task_name": "subscription_churn_baseline_seed0",
|
| 177 |
+
"task_id": "subscription_churn",
|
| 178 |
+
"episode": 0,
|
| 179 |
+
"success": false,
|
| 180 |
+
"total_return": -0.11079999999999998,
|
| 181 |
+
"steps": 14,
|
| 182 |
+
"invalid_actions": 0,
|
| 183 |
+
"quality_gate_passed": false,
|
| 184 |
+
"retention_ratio": 0.9864
|
| 185 |
+
},
|
| 186 |
+
{
|
| 187 |
+
"task_name": "subscription_churn_baseline_seed0",
|
| 188 |
+
"task_id": "subscription_churn",
|
| 189 |
+
"episode": 1,
|
| 190 |
+
"success": false,
|
| 191 |
+
"total_return": -0.11079999999999998,
|
| 192 |
+
"steps": 14,
|
| 193 |
+
"invalid_actions": 0,
|
| 194 |
+
"quality_gate_passed": false,
|
| 195 |
+
"retention_ratio": 0.9864
|
| 196 |
+
},
|
| 197 |
+
{
|
| 198 |
+
"task_name": "subscription_churn_baseline_seed0",
|
| 199 |
+
"task_id": "subscription_churn",
|
| 200 |
+
"episode": 2,
|
| 201 |
+
"success": false,
|
| 202 |
+
"total_return": -0.11079999999999998,
|
| 203 |
+
"steps": 14,
|
| 204 |
+
"invalid_actions": 0,
|
| 205 |
+
"quality_gate_passed": false,
|
| 206 |
+
"retention_ratio": 0.9864
|
| 207 |
+
},
|
| 208 |
+
{
|
| 209 |
+
"task_name": "subscription_churn_baseline_seed1",
|
| 210 |
+
"task_id": "subscription_churn",
|
| 211 |
+
"episode": 0,
|
| 212 |
+
"success": false,
|
| 213 |
+
"total_return": -0.11079999999999998,
|
| 214 |
+
"steps": 14,
|
| 215 |
+
"invalid_actions": 0,
|
| 216 |
+
"quality_gate_passed": false,
|
| 217 |
+
"retention_ratio": 0.9767
|
| 218 |
+
},
|
| 219 |
+
{
|
| 220 |
+
"task_name": "subscription_churn_baseline_seed1",
|
| 221 |
+
"task_id": "subscription_churn",
|
| 222 |
+
"episode": 1,
|
| 223 |
+
"success": false,
|
| 224 |
+
"total_return": -0.11079999999999998,
|
| 225 |
+
"steps": 14,
|
| 226 |
+
"invalid_actions": 0,
|
| 227 |
+
"quality_gate_passed": false,
|
| 228 |
+
"retention_ratio": 0.9767
|
| 229 |
+
},
|
| 230 |
+
{
|
| 231 |
+
"task_name": "subscription_churn_baseline_seed1",
|
| 232 |
+
"task_id": "subscription_churn",
|
| 233 |
+
"episode": 2,
|
| 234 |
+
"success": false,
|
| 235 |
+
"total_return": -0.11079999999999998,
|
| 236 |
+
"steps": 14,
|
| 237 |
+
"invalid_actions": 0,
|
| 238 |
+
"quality_gate_passed": false,
|
| 239 |
+
"retention_ratio": 0.9767
|
| 240 |
+
},
|
| 241 |
+
{
|
| 242 |
+
"task_name": "subscription_churn_baseline_seed2",
|
| 243 |
+
"task_id": "subscription_churn",
|
| 244 |
+
"episode": 0,
|
| 245 |
+
"success": false,
|
| 246 |
+
"total_return": -0.11079999999999998,
|
| 247 |
+
"steps": 14,
|
| 248 |
+
"invalid_actions": 0,
|
| 249 |
+
"quality_gate_passed": false,
|
| 250 |
+
"retention_ratio": 0.9845
|
| 251 |
+
},
|
| 252 |
+
{
|
| 253 |
+
"task_name": "subscription_churn_baseline_seed2",
|
| 254 |
+
"task_id": "subscription_churn",
|
| 255 |
+
"episode": 1,
|
| 256 |
+
"success": false,
|
| 257 |
+
"total_return": -0.11079999999999998,
|
| 258 |
+
"steps": 14,
|
| 259 |
+
"invalid_actions": 0,
|
| 260 |
+
"quality_gate_passed": false,
|
| 261 |
+
"retention_ratio": 0.9845
|
| 262 |
+
},
|
| 263 |
+
{
|
| 264 |
+
"task_name": "subscription_churn_baseline_seed2",
|
| 265 |
+
"task_id": "subscription_churn",
|
| 266 |
+
"episode": 2,
|
| 267 |
+
"success": false,
|
| 268 |
+
"total_return": -0.11079999999999998,
|
| 269 |
+
"steps": 14,
|
| 270 |
+
"invalid_actions": 0,
|
| 271 |
+
"quality_gate_passed": false,
|
| 272 |
+
"retention_ratio": 0.9845
|
| 273 |
+
},
|
| 274 |
+
{
|
| 275 |
+
"task_name": "subscription_churn_baseline_seed3",
|
| 276 |
+
"task_id": "subscription_churn",
|
| 277 |
+
"episode": 0,
|
| 278 |
+
"success": false,
|
| 279 |
+
"total_return": -0.11079999999999998,
|
| 280 |
+
"steps": 14,
|
| 281 |
+
"invalid_actions": 0,
|
| 282 |
+
"quality_gate_passed": false,
|
| 283 |
+
"retention_ratio": 0.9786
|
| 284 |
+
},
|
| 285 |
+
{
|
| 286 |
+
"task_name": "subscription_churn_baseline_seed3",
|
| 287 |
+
"task_id": "subscription_churn",
|
| 288 |
+
"episode": 1,
|
| 289 |
+
"success": false,
|
| 290 |
+
"total_return": -0.11079999999999998,
|
| 291 |
+
"steps": 14,
|
| 292 |
+
"invalid_actions": 0,
|
| 293 |
+
"quality_gate_passed": false,
|
| 294 |
+
"retention_ratio": 0.9786
|
| 295 |
+
},
|
| 296 |
+
{
|
| 297 |
+
"task_name": "subscription_churn_baseline_seed3",
|
| 298 |
+
"task_id": "subscription_churn",
|
| 299 |
+
"episode": 2,
|
| 300 |
+
"success": false,
|
| 301 |
+
"total_return": -0.11079999999999998,
|
| 302 |
+
"steps": 14,
|
| 303 |
+
"invalid_actions": 0,
|
| 304 |
+
"quality_gate_passed": false,
|
| 305 |
+
"retention_ratio": 0.9786
|
| 306 |
+
},
|
| 307 |
+
{
|
| 308 |
+
"task_name": "subscription_churn_baseline_seed4",
|
| 309 |
+
"task_id": "subscription_churn",
|
| 310 |
+
"episode": 0,
|
| 311 |
+
"success": false,
|
| 312 |
+
"total_return": -0.11079999999999998,
|
| 313 |
+
"steps": 14,
|
| 314 |
+
"invalid_actions": 0,
|
| 315 |
+
"quality_gate_passed": false,
|
| 316 |
+
"retention_ratio": 0.9825
|
| 317 |
+
},
|
| 318 |
+
{
|
| 319 |
+
"task_name": "subscription_churn_baseline_seed4",
|
| 320 |
+
"task_id": "subscription_churn",
|
| 321 |
+
"episode": 1,
|
| 322 |
+
"success": false,
|
| 323 |
+
"total_return": -0.11079999999999998,
|
| 324 |
+
"steps": 14,
|
| 325 |
+
"invalid_actions": 0,
|
| 326 |
+
"quality_gate_passed": false,
|
| 327 |
+
"retention_ratio": 0.9825
|
| 328 |
+
},
|
| 329 |
+
{
|
| 330 |
+
"task_name": "subscription_churn_baseline_seed4",
|
| 331 |
+
"task_id": "subscription_churn",
|
| 332 |
+
"episode": 2,
|
| 333 |
+
"success": false,
|
| 334 |
+
"total_return": -0.11079999999999998,
|
| 335 |
+
"steps": 14,
|
| 336 |
+
"invalid_actions": 0,
|
| 337 |
+
"quality_gate_passed": false,
|
| 338 |
+
"retention_ratio": 0.9825
|
| 339 |
+
},
|
| 340 |
+
{
|
| 341 |
+
"task_name": "delivery_eta_baseline_seed0",
|
| 342 |
+
"task_id": "delivery_eta",
|
| 343 |
+
"episode": 0,
|
| 344 |
+
"success": false,
|
| 345 |
+
"total_return": -0.09269999999999995,
|
| 346 |
+
"steps": 13,
|
| 347 |
+
"invalid_actions": 0,
|
| 348 |
+
"quality_gate_passed": false,
|
| 349 |
+
"retention_ratio": 0.9845
|
| 350 |
+
},
|
| 351 |
+
{
|
| 352 |
+
"task_name": "delivery_eta_baseline_seed0",
|
| 353 |
+
"task_id": "delivery_eta",
|
| 354 |
+
"episode": 1,
|
| 355 |
+
"success": false,
|
| 356 |
+
"total_return": -0.09269999999999995,
|
| 357 |
+
"steps": 13,
|
| 358 |
+
"invalid_actions": 0,
|
| 359 |
+
"quality_gate_passed": false,
|
| 360 |
+
"retention_ratio": 0.9845
|
| 361 |
+
},
|
| 362 |
+
{
|
| 363 |
+
"task_name": "delivery_eta_baseline_seed0",
|
| 364 |
+
"task_id": "delivery_eta",
|
| 365 |
+
"episode": 2,
|
| 366 |
+
"success": false,
|
| 367 |
+
"total_return": -0.09269999999999995,
|
| 368 |
+
"steps": 13,
|
| 369 |
+
"invalid_actions": 0,
|
| 370 |
+
"quality_gate_passed": false,
|
| 371 |
+
"retention_ratio": 0.9845
|
| 372 |
+
},
|
| 373 |
+
{
|
| 374 |
+
"task_name": "delivery_eta_baseline_seed1",
|
| 375 |
+
"task_id": "delivery_eta",
|
| 376 |
+
"episode": 0,
|
| 377 |
+
"success": false,
|
| 378 |
+
"total_return": -0.09269999999999995,
|
| 379 |
+
"steps": 13,
|
| 380 |
+
"invalid_actions": 0,
|
| 381 |
+
"quality_gate_passed": false,
|
| 382 |
+
"retention_ratio": 0.9883
|
| 383 |
+
},
|
| 384 |
+
{
|
| 385 |
+
"task_name": "delivery_eta_baseline_seed1",
|
| 386 |
+
"task_id": "delivery_eta",
|
| 387 |
+
"episode": 1,
|
| 388 |
+
"success": false,
|
| 389 |
+
"total_return": -0.09269999999999995,
|
| 390 |
+
"steps": 13,
|
| 391 |
+
"invalid_actions": 0,
|
| 392 |
+
"quality_gate_passed": false,
|
| 393 |
+
"retention_ratio": 0.9883
|
| 394 |
+
},
|
| 395 |
+
{
|
| 396 |
+
"task_name": "delivery_eta_baseline_seed1",
|
| 397 |
+
"task_id": "delivery_eta",
|
| 398 |
+
"episode": 2,
|
| 399 |
+
"success": false,
|
| 400 |
+
"total_return": -0.09269999999999995,
|
| 401 |
+
"steps": 13,
|
| 402 |
+
"invalid_actions": 0,
|
| 403 |
+
"quality_gate_passed": false,
|
| 404 |
+
"retention_ratio": 0.9883
|
| 405 |
+
},
|
| 406 |
+
{
|
| 407 |
+
"task_name": "delivery_eta_baseline_seed2",
|
| 408 |
+
"task_id": "delivery_eta",
|
| 409 |
+
"episode": 0,
|
| 410 |
+
"success": false,
|
| 411 |
+
"total_return": -0.09269999999999995,
|
| 412 |
+
"steps": 13,
|
| 413 |
+
"invalid_actions": 0,
|
| 414 |
+
"quality_gate_passed": false,
|
| 415 |
+
"retention_ratio": 0.9748
|
| 416 |
+
},
|
| 417 |
+
{
|
| 418 |
+
"task_name": "delivery_eta_baseline_seed2",
|
| 419 |
+
"task_id": "delivery_eta",
|
| 420 |
+
"episode": 1,
|
| 421 |
+
"success": false,
|
| 422 |
+
"total_return": -0.09269999999999995,
|
| 423 |
+
"steps": 13,
|
| 424 |
+
"invalid_actions": 0,
|
| 425 |
+
"quality_gate_passed": false,
|
| 426 |
+
"retention_ratio": 0.9748
|
| 427 |
+
},
|
| 428 |
+
{
|
| 429 |
+
"task_name": "delivery_eta_baseline_seed2",
|
| 430 |
+
"task_id": "delivery_eta",
|
| 431 |
+
"episode": 2,
|
| 432 |
+
"success": false,
|
| 433 |
+
"total_return": -0.09269999999999995,
|
| 434 |
+
"steps": 13,
|
| 435 |
+
"invalid_actions": 0,
|
| 436 |
+
"quality_gate_passed": false,
|
| 437 |
+
"retention_ratio": 0.9748
|
| 438 |
+
},
|
| 439 |
+
{
|
| 440 |
+
"task_name": "delivery_eta_baseline_seed3",
|
| 441 |
+
"task_id": "delivery_eta",
|
| 442 |
+
"episode": 0,
|
| 443 |
+
"success": false,
|
| 444 |
+
"total_return": -0.09269999999999995,
|
| 445 |
+
"steps": 13,
|
| 446 |
+
"invalid_actions": 0,
|
| 447 |
+
"quality_gate_passed": false,
|
| 448 |
+
"retention_ratio": 0.9767
|
| 449 |
+
},
|
| 450 |
+
{
|
| 451 |
+
"task_name": "delivery_eta_baseline_seed3",
|
| 452 |
+
"task_id": "delivery_eta",
|
| 453 |
+
"episode": 1,
|
| 454 |
+
"success": false,
|
| 455 |
+
"total_return": -0.09269999999999995,
|
| 456 |
+
"steps": 13,
|
| 457 |
+
"invalid_actions": 0,
|
| 458 |
+
"quality_gate_passed": false,
|
| 459 |
+
"retention_ratio": 0.9767
|
| 460 |
+
},
|
| 461 |
+
{
|
| 462 |
+
"task_name": "delivery_eta_baseline_seed3",
|
| 463 |
+
"task_id": "delivery_eta",
|
| 464 |
+
"episode": 2,
|
| 465 |
+
"success": false,
|
| 466 |
+
"total_return": -0.09269999999999995,
|
| 467 |
+
"steps": 13,
|
| 468 |
+
"invalid_actions": 0,
|
| 469 |
+
"quality_gate_passed": false,
|
| 470 |
+
"retention_ratio": 0.9767
|
| 471 |
+
},
|
| 472 |
+
{
|
| 473 |
+
"task_name": "delivery_eta_baseline_seed4",
|
| 474 |
+
"task_id": "delivery_eta",
|
| 475 |
+
"episode": 0,
|
| 476 |
+
"success": false,
|
| 477 |
+
"total_return": -0.09269999999999995,
|
| 478 |
+
"steps": 13,
|
| 479 |
+
"invalid_actions": 0,
|
| 480 |
+
"quality_gate_passed": false,
|
| 481 |
+
"retention_ratio": 0.9748
|
| 482 |
+
},
|
| 483 |
+
{
|
| 484 |
+
"task_name": "delivery_eta_baseline_seed4",
|
| 485 |
+
"task_id": "delivery_eta",
|
| 486 |
+
"episode": 1,
|
| 487 |
+
"success": false,
|
| 488 |
+
"total_return": -0.09269999999999995,
|
| 489 |
+
"steps": 13,
|
| 490 |
+
"invalid_actions": 0,
|
| 491 |
+
"quality_gate_passed": false,
|
| 492 |
+
"retention_ratio": 0.9748
|
| 493 |
+
},
|
| 494 |
+
{
|
| 495 |
+
"task_name": "delivery_eta_baseline_seed4",
|
| 496 |
+
"task_id": "delivery_eta",
|
| 497 |
+
"episode": 2,
|
| 498 |
+
"success": false,
|
| 499 |
+
"total_return": -0.09269999999999995,
|
| 500 |
+
"steps": 13,
|
| 501 |
+
"invalid_actions": 0,
|
| 502 |
+
"quality_gate_passed": false,
|
| 503 |
+
"retention_ratio": 0.9748
|
| 504 |
+
}
|
| 505 |
+
]
|
| 506 |
+
}
|
results_llm.json
ADDED
|
@@ -0,0 +1,464 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"agent": "LLMAgent",
|
| 3 |
+
"base_url": "https://israaaML-fsds-cleaning-env.hf.space",
|
| 4 |
+
"n_episodes": 45,
|
| 5 |
+
"aggregate": {
|
| 6 |
+
"episodes": 45,
|
| 7 |
+
"success_rate": 0.0,
|
| 8 |
+
"avg_return": 0.0,
|
| 9 |
+
"avg_steps": 0.0,
|
| 10 |
+
"avg_invalid_actions": 0.0
|
| 11 |
+
},
|
| 12 |
+
"episodes": [
|
| 13 |
+
{
|
| 14 |
+
"task_name": "ecommerce_mobile_baseline_seed0",
|
| 15 |
+
"task_id": "ecommerce_mobile",
|
| 16 |
+
"episode": 0,
|
| 17 |
+
"error": "No module named 'unsloth'",
|
| 18 |
+
"success": false,
|
| 19 |
+
"total_return": 0.0,
|
| 20 |
+
"steps": 0,
|
| 21 |
+
"invalid_actions": 0
|
| 22 |
+
},
|
| 23 |
+
{
|
| 24 |
+
"task_name": "ecommerce_mobile_baseline_seed0",
|
| 25 |
+
"task_id": "ecommerce_mobile",
|
| 26 |
+
"episode": 1,
|
| 27 |
+
"error": "No module named 'unsloth'",
|
| 28 |
+
"success": false,
|
| 29 |
+
"total_return": 0.0,
|
| 30 |
+
"steps": 0,
|
| 31 |
+
"invalid_actions": 0
|
| 32 |
+
},
|
| 33 |
+
{
|
| 34 |
+
"task_name": "ecommerce_mobile_baseline_seed0",
|
| 35 |
+
"task_id": "ecommerce_mobile",
|
| 36 |
+
"episode": 2,
|
| 37 |
+
"error": "No module named 'unsloth'",
|
| 38 |
+
"success": false,
|
| 39 |
+
"total_return": 0.0,
|
| 40 |
+
"steps": 0,
|
| 41 |
+
"invalid_actions": 0
|
| 42 |
+
},
|
| 43 |
+
{
|
| 44 |
+
"task_name": "ecommerce_mobile_baseline_seed1",
|
| 45 |
+
"task_id": "ecommerce_mobile",
|
| 46 |
+
"episode": 0,
|
| 47 |
+
"error": "No module named 'unsloth'",
|
| 48 |
+
"success": false,
|
| 49 |
+
"total_return": 0.0,
|
| 50 |
+
"steps": 0,
|
| 51 |
+
"invalid_actions": 0
|
| 52 |
+
},
|
| 53 |
+
{
|
| 54 |
+
"task_name": "ecommerce_mobile_baseline_seed1",
|
| 55 |
+
"task_id": "ecommerce_mobile",
|
| 56 |
+
"episode": 1,
|
| 57 |
+
"error": "No module named 'unsloth'",
|
| 58 |
+
"success": false,
|
| 59 |
+
"total_return": 0.0,
|
| 60 |
+
"steps": 0,
|
| 61 |
+
"invalid_actions": 0
|
| 62 |
+
},
|
| 63 |
+
{
|
| 64 |
+
"task_name": "ecommerce_mobile_baseline_seed1",
|
| 65 |
+
"task_id": "ecommerce_mobile",
|
| 66 |
+
"episode": 2,
|
| 67 |
+
"error": "No module named 'unsloth'",
|
| 68 |
+
"success": false,
|
| 69 |
+
"total_return": 0.0,
|
| 70 |
+
"steps": 0,
|
| 71 |
+
"invalid_actions": 0
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"task_name": "ecommerce_mobile_baseline_seed2",
|
| 75 |
+
"task_id": "ecommerce_mobile",
|
| 76 |
+
"episode": 0,
|
| 77 |
+
"error": "No module named 'unsloth'",
|
| 78 |
+
"success": false,
|
| 79 |
+
"total_return": 0.0,
|
| 80 |
+
"steps": 0,
|
| 81 |
+
"invalid_actions": 0
|
| 82 |
+
},
|
| 83 |
+
{
|
| 84 |
+
"task_name": "ecommerce_mobile_baseline_seed2",
|
| 85 |
+
"task_id": "ecommerce_mobile",
|
| 86 |
+
"episode": 1,
|
| 87 |
+
"error": "No module named 'unsloth'",
|
| 88 |
+
"success": false,
|
| 89 |
+
"total_return": 0.0,
|
| 90 |
+
"steps": 0,
|
| 91 |
+
"invalid_actions": 0
|
| 92 |
+
},
|
| 93 |
+
{
|
| 94 |
+
"task_name": "ecommerce_mobile_baseline_seed2",
|
| 95 |
+
"task_id": "ecommerce_mobile",
|
| 96 |
+
"episode": 2,
|
| 97 |
+
"error": "No module named 'unsloth'",
|
| 98 |
+
"success": false,
|
| 99 |
+
"total_return": 0.0,
|
| 100 |
+
"steps": 0,
|
| 101 |
+
"invalid_actions": 0
|
| 102 |
+
},
|
| 103 |
+
{
|
| 104 |
+
"task_name": "ecommerce_mobile_baseline_seed3",
|
| 105 |
+
"task_id": "ecommerce_mobile",
|
| 106 |
+
"episode": 0,
|
| 107 |
+
"error": "No module named 'unsloth'",
|
| 108 |
+
"success": false,
|
| 109 |
+
"total_return": 0.0,
|
| 110 |
+
"steps": 0,
|
| 111 |
+
"invalid_actions": 0
|
| 112 |
+
},
|
| 113 |
+
{
|
| 114 |
+
"task_name": "ecommerce_mobile_baseline_seed3",
|
| 115 |
+
"task_id": "ecommerce_mobile",
|
| 116 |
+
"episode": 1,
|
| 117 |
+
"error": "No module named 'unsloth'",
|
| 118 |
+
"success": false,
|
| 119 |
+
"total_return": 0.0,
|
| 120 |
+
"steps": 0,
|
| 121 |
+
"invalid_actions": 0
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"task_name": "ecommerce_mobile_baseline_seed3",
|
| 125 |
+
"task_id": "ecommerce_mobile",
|
| 126 |
+
"episode": 2,
|
| 127 |
+
"error": "No module named 'unsloth'",
|
| 128 |
+
"success": false,
|
| 129 |
+
"total_return": 0.0,
|
| 130 |
+
"steps": 0,
|
| 131 |
+
"invalid_actions": 0
|
| 132 |
+
},
|
| 133 |
+
{
|
| 134 |
+
"task_name": "ecommerce_mobile_baseline_seed4",
|
| 135 |
+
"task_id": "ecommerce_mobile",
|
| 136 |
+
"episode": 0,
|
| 137 |
+
"error": "No module named 'unsloth'",
|
| 138 |
+
"success": false,
|
| 139 |
+
"total_return": 0.0,
|
| 140 |
+
"steps": 0,
|
| 141 |
+
"invalid_actions": 0
|
| 142 |
+
},
|
| 143 |
+
{
|
| 144 |
+
"task_name": "ecommerce_mobile_baseline_seed4",
|
| 145 |
+
"task_id": "ecommerce_mobile",
|
| 146 |
+
"episode": 1,
|
| 147 |
+
"error": "No module named 'unsloth'",
|
| 148 |
+
"success": false,
|
| 149 |
+
"total_return": 0.0,
|
| 150 |
+
"steps": 0,
|
| 151 |
+
"invalid_actions": 0
|
| 152 |
+
},
|
| 153 |
+
{
|
| 154 |
+
"task_name": "ecommerce_mobile_baseline_seed4",
|
| 155 |
+
"task_id": "ecommerce_mobile",
|
| 156 |
+
"episode": 2,
|
| 157 |
+
"error": "No module named 'unsloth'",
|
| 158 |
+
"success": false,
|
| 159 |
+
"total_return": 0.0,
|
| 160 |
+
"steps": 0,
|
| 161 |
+
"invalid_actions": 0
|
| 162 |
+
},
|
| 163 |
+
{
|
| 164 |
+
"task_name": "subscription_churn_baseline_seed0",
|
| 165 |
+
"task_id": "subscription_churn",
|
| 166 |
+
"episode": 0,
|
| 167 |
+
"error": "No module named 'unsloth'",
|
| 168 |
+
"success": false,
|
| 169 |
+
"total_return": 0.0,
|
| 170 |
+
"steps": 0,
|
| 171 |
+
"invalid_actions": 0
|
| 172 |
+
},
|
| 173 |
+
{
|
| 174 |
+
"task_name": "subscription_churn_baseline_seed0",
|
| 175 |
+
"task_id": "subscription_churn",
|
| 176 |
+
"episode": 1,
|
| 177 |
+
"error": "No module named 'unsloth'",
|
| 178 |
+
"success": false,
|
| 179 |
+
"total_return": 0.0,
|
| 180 |
+
"steps": 0,
|
| 181 |
+
"invalid_actions": 0
|
| 182 |
+
},
|
| 183 |
+
{
|
| 184 |
+
"task_name": "subscription_churn_baseline_seed0",
|
| 185 |
+
"task_id": "subscription_churn",
|
| 186 |
+
"episode": 2,
|
| 187 |
+
"error": "No module named 'unsloth'",
|
| 188 |
+
"success": false,
|
| 189 |
+
"total_return": 0.0,
|
| 190 |
+
"steps": 0,
|
| 191 |
+
"invalid_actions": 0
|
| 192 |
+
},
|
| 193 |
+
{
|
| 194 |
+
"task_name": "subscription_churn_baseline_seed1",
|
| 195 |
+
"task_id": "subscription_churn",
|
| 196 |
+
"episode": 0,
|
| 197 |
+
"error": "No module named 'unsloth'",
|
| 198 |
+
"success": false,
|
| 199 |
+
"total_return": 0.0,
|
| 200 |
+
"steps": 0,
|
| 201 |
+
"invalid_actions": 0
|
| 202 |
+
},
|
| 203 |
+
{
|
| 204 |
+
"task_name": "subscription_churn_baseline_seed1",
|
| 205 |
+
"task_id": "subscription_churn",
|
| 206 |
+
"episode": 1,
|
| 207 |
+
"error": "No module named 'unsloth'",
|
| 208 |
+
"success": false,
|
| 209 |
+
"total_return": 0.0,
|
| 210 |
+
"steps": 0,
|
| 211 |
+
"invalid_actions": 0
|
| 212 |
+
},
|
| 213 |
+
{
|
| 214 |
+
"task_name": "subscription_churn_baseline_seed1",
|
| 215 |
+
"task_id": "subscription_churn",
|
| 216 |
+
"episode": 2,
|
| 217 |
+
"error": "No module named 'unsloth'",
|
| 218 |
+
"success": false,
|
| 219 |
+
"total_return": 0.0,
|
| 220 |
+
"steps": 0,
|
| 221 |
+
"invalid_actions": 0
|
| 222 |
+
},
|
| 223 |
+
{
|
| 224 |
+
"task_name": "subscription_churn_baseline_seed2",
|
| 225 |
+
"task_id": "subscription_churn",
|
| 226 |
+
"episode": 0,
|
| 227 |
+
"error": "No module named 'unsloth'",
|
| 228 |
+
"success": false,
|
| 229 |
+
"total_return": 0.0,
|
| 230 |
+
"steps": 0,
|
| 231 |
+
"invalid_actions": 0
|
| 232 |
+
},
|
| 233 |
+
{
|
| 234 |
+
"task_name": "subscription_churn_baseline_seed2",
|
| 235 |
+
"task_id": "subscription_churn",
|
| 236 |
+
"episode": 1,
|
| 237 |
+
"error": "No module named 'unsloth'",
|
| 238 |
+
"success": false,
|
| 239 |
+
"total_return": 0.0,
|
| 240 |
+
"steps": 0,
|
| 241 |
+
"invalid_actions": 0
|
| 242 |
+
},
|
| 243 |
+
{
|
| 244 |
+
"task_name": "subscription_churn_baseline_seed2",
|
| 245 |
+
"task_id": "subscription_churn",
|
| 246 |
+
"episode": 2,
|
| 247 |
+
"error": "No module named 'unsloth'",
|
| 248 |
+
"success": false,
|
| 249 |
+
"total_return": 0.0,
|
| 250 |
+
"steps": 0,
|
| 251 |
+
"invalid_actions": 0
|
| 252 |
+
},
|
| 253 |
+
{
|
| 254 |
+
"task_name": "subscription_churn_baseline_seed3",
|
| 255 |
+
"task_id": "subscription_churn",
|
| 256 |
+
"episode": 0,
|
| 257 |
+
"error": "No module named 'unsloth'",
|
| 258 |
+
"success": false,
|
| 259 |
+
"total_return": 0.0,
|
| 260 |
+
"steps": 0,
|
| 261 |
+
"invalid_actions": 0
|
| 262 |
+
},
|
| 263 |
+
{
|
| 264 |
+
"task_name": "subscription_churn_baseline_seed3",
|
| 265 |
+
"task_id": "subscription_churn",
|
| 266 |
+
"episode": 1,
|
| 267 |
+
"error": "No module named 'unsloth'",
|
| 268 |
+
"success": false,
|
| 269 |
+
"total_return": 0.0,
|
| 270 |
+
"steps": 0,
|
| 271 |
+
"invalid_actions": 0
|
| 272 |
+
},
|
| 273 |
+
{
|
| 274 |
+
"task_name": "subscription_churn_baseline_seed3",
|
| 275 |
+
"task_id": "subscription_churn",
|
| 276 |
+
"episode": 2,
|
| 277 |
+
"error": "No module named 'unsloth'",
|
| 278 |
+
"success": false,
|
| 279 |
+
"total_return": 0.0,
|
| 280 |
+
"steps": 0,
|
| 281 |
+
"invalid_actions": 0
|
| 282 |
+
},
|
| 283 |
+
{
|
| 284 |
+
"task_name": "subscription_churn_baseline_seed4",
|
| 285 |
+
"task_id": "subscription_churn",
|
| 286 |
+
"episode": 0,
|
| 287 |
+
"error": "No module named 'unsloth'",
|
| 288 |
+
"success": false,
|
| 289 |
+
"total_return": 0.0,
|
| 290 |
+
"steps": 0,
|
| 291 |
+
"invalid_actions": 0
|
| 292 |
+
},
|
| 293 |
+
{
|
| 294 |
+
"task_name": "subscription_churn_baseline_seed4",
|
| 295 |
+
"task_id": "subscription_churn",
|
| 296 |
+
"episode": 1,
|
| 297 |
+
"error": "No module named 'unsloth'",
|
| 298 |
+
"success": false,
|
| 299 |
+
"total_return": 0.0,
|
| 300 |
+
"steps": 0,
|
| 301 |
+
"invalid_actions": 0
|
| 302 |
+
},
|
| 303 |
+
{
|
| 304 |
+
"task_name": "subscription_churn_baseline_seed4",
|
| 305 |
+
"task_id": "subscription_churn",
|
| 306 |
+
"episode": 2,
|
| 307 |
+
"error": "No module named 'unsloth'",
|
| 308 |
+
"success": false,
|
| 309 |
+
"total_return": 0.0,
|
| 310 |
+
"steps": 0,
|
| 311 |
+
"invalid_actions": 0
|
| 312 |
+
},
|
| 313 |
+
{
|
| 314 |
+
"task_name": "delivery_eta_baseline_seed0",
|
| 315 |
+
"task_id": "delivery_eta",
|
| 316 |
+
"episode": 0,
|
| 317 |
+
"error": "No module named 'unsloth'",
|
| 318 |
+
"success": false,
|
| 319 |
+
"total_return": 0.0,
|
| 320 |
+
"steps": 0,
|
| 321 |
+
"invalid_actions": 0
|
| 322 |
+
},
|
| 323 |
+
{
|
| 324 |
+
"task_name": "delivery_eta_baseline_seed0",
|
| 325 |
+
"task_id": "delivery_eta",
|
| 326 |
+
"episode": 1,
|
| 327 |
+
"error": "No module named 'unsloth'",
|
| 328 |
+
"success": false,
|
| 329 |
+
"total_return": 0.0,
|
| 330 |
+
"steps": 0,
|
| 331 |
+
"invalid_actions": 0
|
| 332 |
+
},
|
| 333 |
+
{
|
| 334 |
+
"task_name": "delivery_eta_baseline_seed0",
|
| 335 |
+
"task_id": "delivery_eta",
|
| 336 |
+
"episode": 2,
|
| 337 |
+
"error": "No module named 'unsloth'",
|
| 338 |
+
"success": false,
|
| 339 |
+
"total_return": 0.0,
|
| 340 |
+
"steps": 0,
|
| 341 |
+
"invalid_actions": 0
|
| 342 |
+
},
|
| 343 |
+
{
|
| 344 |
+
"task_name": "delivery_eta_baseline_seed1",
|
| 345 |
+
"task_id": "delivery_eta",
|
| 346 |
+
"episode": 0,
|
| 347 |
+
"error": "No module named 'unsloth'",
|
| 348 |
+
"success": false,
|
| 349 |
+
"total_return": 0.0,
|
| 350 |
+
"steps": 0,
|
| 351 |
+
"invalid_actions": 0
|
| 352 |
+
},
|
| 353 |
+
{
|
| 354 |
+
"task_name": "delivery_eta_baseline_seed1",
|
| 355 |
+
"task_id": "delivery_eta",
|
| 356 |
+
"episode": 1,
|
| 357 |
+
"error": "No module named 'unsloth'",
|
| 358 |
+
"success": false,
|
| 359 |
+
"total_return": 0.0,
|
| 360 |
+
"steps": 0,
|
| 361 |
+
"invalid_actions": 0
|
| 362 |
+
},
|
| 363 |
+
{
|
| 364 |
+
"task_name": "delivery_eta_baseline_seed1",
|
| 365 |
+
"task_id": "delivery_eta",
|
| 366 |
+
"episode": 2,
|
| 367 |
+
"error": "No module named 'unsloth'",
|
| 368 |
+
"success": false,
|
| 369 |
+
"total_return": 0.0,
|
| 370 |
+
"steps": 0,
|
| 371 |
+
"invalid_actions": 0
|
| 372 |
+
},
|
| 373 |
+
{
|
| 374 |
+
"task_name": "delivery_eta_baseline_seed2",
|
| 375 |
+
"task_id": "delivery_eta",
|
| 376 |
+
"episode": 0,
|
| 377 |
+
"error": "No module named 'unsloth'",
|
| 378 |
+
"success": false,
|
| 379 |
+
"total_return": 0.0,
|
| 380 |
+
"steps": 0,
|
| 381 |
+
"invalid_actions": 0
|
| 382 |
+
},
|
| 383 |
+
{
|
| 384 |
+
"task_name": "delivery_eta_baseline_seed2",
|
| 385 |
+
"task_id": "delivery_eta",
|
| 386 |
+
"episode": 1,
|
| 387 |
+
"error": "No module named 'unsloth'",
|
| 388 |
+
"success": false,
|
| 389 |
+
"total_return": 0.0,
|
| 390 |
+
"steps": 0,
|
| 391 |
+
"invalid_actions": 0
|
| 392 |
+
},
|
| 393 |
+
{
|
| 394 |
+
"task_name": "delivery_eta_baseline_seed2",
|
| 395 |
+
"task_id": "delivery_eta",
|
| 396 |
+
"episode": 2,
|
| 397 |
+
"error": "No module named 'unsloth'",
|
| 398 |
+
"success": false,
|
| 399 |
+
"total_return": 0.0,
|
| 400 |
+
"steps": 0,
|
| 401 |
+
"invalid_actions": 0
|
| 402 |
+
},
|
| 403 |
+
{
|
| 404 |
+
"task_name": "delivery_eta_baseline_seed3",
|
| 405 |
+
"task_id": "delivery_eta",
|
| 406 |
+
"episode": 0,
|
| 407 |
+
"error": "No module named 'unsloth'",
|
| 408 |
+
"success": false,
|
| 409 |
+
"total_return": 0.0,
|
| 410 |
+
"steps": 0,
|
| 411 |
+
"invalid_actions": 0
|
| 412 |
+
},
|
| 413 |
+
{
|
| 414 |
+
"task_name": "delivery_eta_baseline_seed3",
|
| 415 |
+
"task_id": "delivery_eta",
|
| 416 |
+
"episode": 1,
|
| 417 |
+
"error": "No module named 'unsloth'",
|
| 418 |
+
"success": false,
|
| 419 |
+
"total_return": 0.0,
|
| 420 |
+
"steps": 0,
|
| 421 |
+
"invalid_actions": 0
|
| 422 |
+
},
|
| 423 |
+
{
|
| 424 |
+
"task_name": "delivery_eta_baseline_seed3",
|
| 425 |
+
"task_id": "delivery_eta",
|
| 426 |
+
"episode": 2,
|
| 427 |
+
"error": "No module named 'unsloth'",
|
| 428 |
+
"success": false,
|
| 429 |
+
"total_return": 0.0,
|
| 430 |
+
"steps": 0,
|
| 431 |
+
"invalid_actions": 0
|
| 432 |
+
},
|
| 433 |
+
{
|
| 434 |
+
"task_name": "delivery_eta_baseline_seed4",
|
| 435 |
+
"task_id": "delivery_eta",
|
| 436 |
+
"episode": 0,
|
| 437 |
+
"error": "No module named 'unsloth'",
|
| 438 |
+
"success": false,
|
| 439 |
+
"total_return": 0.0,
|
| 440 |
+
"steps": 0,
|
| 441 |
+
"invalid_actions": 0
|
| 442 |
+
},
|
| 443 |
+
{
|
| 444 |
+
"task_name": "delivery_eta_baseline_seed4",
|
| 445 |
+
"task_id": "delivery_eta",
|
| 446 |
+
"episode": 1,
|
| 447 |
+
"error": "No module named 'unsloth'",
|
| 448 |
+
"success": false,
|
| 449 |
+
"total_return": 0.0,
|
| 450 |
+
"steps": 0,
|
| 451 |
+
"invalid_actions": 0
|
| 452 |
+
},
|
| 453 |
+
{
|
| 454 |
+
"task_name": "delivery_eta_baseline_seed4",
|
| 455 |
+
"task_id": "delivery_eta",
|
| 456 |
+
"episode": 2,
|
| 457 |
+
"error": "No module named 'unsloth'",
|
| 458 |
+
"success": false,
|
| 459 |
+
"total_return": 0.0,
|
| 460 |
+
"steps": 0,
|
| 461 |
+
"invalid_actions": 0
|
| 462 |
+
}
|
| 463 |
+
]
|
| 464 |
+
}
|
results_random.json
ADDED
|
@@ -0,0 +1,505 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"agent": "RandomAgent",
|
| 3 |
+
"base_url": "https://israaaML-fsds-cleaning-env.hf.space",
|
| 4 |
+
"n_episodes": 45,
|
| 5 |
+
"aggregate": {
|
| 6 |
+
"episodes": 45,
|
| 7 |
+
"success_rate": 0.0,
|
| 8 |
+
"avg_return": -0.09121333333333337,
|
| 9 |
+
"avg_steps": 3.1777777777777776,
|
| 10 |
+
"avg_invalid_actions": 0.0
|
| 11 |
+
},
|
| 12 |
+
"episodes": [
|
| 13 |
+
{
|
| 14 |
+
"task_name": "ecommerce_mobile_baseline_seed0",
|
| 15 |
+
"task_id": "ecommerce_mobile",
|
| 16 |
+
"episode": 0,
|
| 17 |
+
"success": false,
|
| 18 |
+
"total_return": -0.1,
|
| 19 |
+
"steps": 2,
|
| 20 |
+
"invalid_actions": 0,
|
| 21 |
+
"quality_gate_passed": false,
|
| 22 |
+
"retention_ratio": 1.0
|
| 23 |
+
},
|
| 24 |
+
{
|
| 25 |
+
"task_name": "ecommerce_mobile_baseline_seed0",
|
| 26 |
+
"task_id": "ecommerce_mobile",
|
| 27 |
+
"episode": 1,
|
| 28 |
+
"success": false,
|
| 29 |
+
"total_return": -0.42000000000000004,
|
| 30 |
+
"steps": 6,
|
| 31 |
+
"invalid_actions": 0,
|
| 32 |
+
"quality_gate_passed": false,
|
| 33 |
+
"retention_ratio": 1.0
|
| 34 |
+
},
|
| 35 |
+
{
|
| 36 |
+
"task_name": "ecommerce_mobile_baseline_seed0",
|
| 37 |
+
"task_id": "ecommerce_mobile",
|
| 38 |
+
"episode": 2,
|
| 39 |
+
"success": false,
|
| 40 |
+
"total_return": -0.1,
|
| 41 |
+
"steps": 3,
|
| 42 |
+
"invalid_actions": 0,
|
| 43 |
+
"quality_gate_passed": false,
|
| 44 |
+
"retention_ratio": 1.0
|
| 45 |
+
},
|
| 46 |
+
{
|
| 47 |
+
"task_name": "ecommerce_mobile_baseline_seed1",
|
| 48 |
+
"task_id": "ecommerce_mobile",
|
| 49 |
+
"episode": 0,
|
| 50 |
+
"success": false,
|
| 51 |
+
"total_return": -0.1,
|
| 52 |
+
"steps": 6,
|
| 53 |
+
"invalid_actions": 0,
|
| 54 |
+
"quality_gate_passed": false,
|
| 55 |
+
"retention_ratio": 1.0
|
| 56 |
+
},
|
| 57 |
+
{
|
| 58 |
+
"task_name": "ecommerce_mobile_baseline_seed1",
|
| 59 |
+
"task_id": "ecommerce_mobile",
|
| 60 |
+
"episode": 1,
|
| 61 |
+
"success": false,
|
| 62 |
+
"total_return": -0.04,
|
| 63 |
+
"steps": 4,
|
| 64 |
+
"invalid_actions": 0,
|
| 65 |
+
"quality_gate_passed": false,
|
| 66 |
+
"retention_ratio": 1.0
|
| 67 |
+
},
|
| 68 |
+
{
|
| 69 |
+
"task_name": "ecommerce_mobile_baseline_seed1",
|
| 70 |
+
"task_id": "ecommerce_mobile",
|
| 71 |
+
"episode": 2,
|
| 72 |
+
"success": false,
|
| 73 |
+
"total_return": -0.28,
|
| 74 |
+
"steps": 10,
|
| 75 |
+
"invalid_actions": 0,
|
| 76 |
+
"quality_gate_passed": false,
|
| 77 |
+
"retention_ratio": 1.0
|
| 78 |
+
},
|
| 79 |
+
{
|
| 80 |
+
"task_name": "ecommerce_mobile_baseline_seed2",
|
| 81 |
+
"task_id": "ecommerce_mobile",
|
| 82 |
+
"episode": 0,
|
| 83 |
+
"error": "Tool 'submit_solution' failed: Error calling tool 'submit_solution': Error serializing to JSON: TypeError: 'float' object cannot be interpreted as an integer (type: execution_error)",
|
| 84 |
+
"success": false,
|
| 85 |
+
"total_return": 0.0,
|
| 86 |
+
"steps": 0,
|
| 87 |
+
"invalid_actions": 0
|
| 88 |
+
},
|
| 89 |
+
{
|
| 90 |
+
"task_name": "ecommerce_mobile_baseline_seed2",
|
| 91 |
+
"task_id": "ecommerce_mobile",
|
| 92 |
+
"episode": 1,
|
| 93 |
+
"success": false,
|
| 94 |
+
"total_return": -0.02,
|
| 95 |
+
"steps": 2,
|
| 96 |
+
"invalid_actions": 0,
|
| 97 |
+
"quality_gate_passed": false,
|
| 98 |
+
"retention_ratio": 1.0
|
| 99 |
+
},
|
| 100 |
+
{
|
| 101 |
+
"task_name": "ecommerce_mobile_baseline_seed2",
|
| 102 |
+
"task_id": "ecommerce_mobile",
|
| 103 |
+
"episode": 2,
|
| 104 |
+
"success": false,
|
| 105 |
+
"total_return": -0.12000000000000001,
|
| 106 |
+
"steps": 4,
|
| 107 |
+
"invalid_actions": 0,
|
| 108 |
+
"quality_gate_passed": false,
|
| 109 |
+
"retention_ratio": 0.9728
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
"task_name": "ecommerce_mobile_baseline_seed3",
|
| 113 |
+
"task_id": "ecommerce_mobile",
|
| 114 |
+
"episode": 0,
|
| 115 |
+
"success": false,
|
| 116 |
+
"total_return": 0.0,
|
| 117 |
+
"steps": 1,
|
| 118 |
+
"invalid_actions": 0,
|
| 119 |
+
"quality_gate_passed": false,
|
| 120 |
+
"retention_ratio": 1.0
|
| 121 |
+
},
|
| 122 |
+
{
|
| 123 |
+
"task_name": "ecommerce_mobile_baseline_seed3",
|
| 124 |
+
"task_id": "ecommerce_mobile",
|
| 125 |
+
"episode": 1,
|
| 126 |
+
"success": false,
|
| 127 |
+
"total_return": 0.0,
|
| 128 |
+
"steps": 1,
|
| 129 |
+
"invalid_actions": 0,
|
| 130 |
+
"quality_gate_passed": false,
|
| 131 |
+
"retention_ratio": 1.0
|
| 132 |
+
},
|
| 133 |
+
{
|
| 134 |
+
"task_name": "ecommerce_mobile_baseline_seed3",
|
| 135 |
+
"task_id": "ecommerce_mobile",
|
| 136 |
+
"episode": 2,
|
| 137 |
+
"error": "Tool 'preview_data' failed: Error calling tool 'preview_data': Error serializing to JSON: TypeError: 'float' object cannot be interpreted as an integer (type: execution_error)",
|
| 138 |
+
"success": false,
|
| 139 |
+
"total_return": 0.0,
|
| 140 |
+
"steps": 0,
|
| 141 |
+
"invalid_actions": 0
|
| 142 |
+
},
|
| 143 |
+
{
|
| 144 |
+
"task_name": "ecommerce_mobile_baseline_seed4",
|
| 145 |
+
"task_id": "ecommerce_mobile",
|
| 146 |
+
"episode": 0,
|
| 147 |
+
"success": false,
|
| 148 |
+
"total_return": 0.005000000000000001,
|
| 149 |
+
"steps": 4,
|
| 150 |
+
"invalid_actions": 0,
|
| 151 |
+
"quality_gate_passed": false,
|
| 152 |
+
"retention_ratio": 1.0
|
| 153 |
+
},
|
| 154 |
+
{
|
| 155 |
+
"task_name": "ecommerce_mobile_baseline_seed4",
|
| 156 |
+
"task_id": "ecommerce_mobile",
|
| 157 |
+
"episode": 1,
|
| 158 |
+
"success": false,
|
| 159 |
+
"total_return": -0.1,
|
| 160 |
+
"steps": 4,
|
| 161 |
+
"invalid_actions": 0,
|
| 162 |
+
"quality_gate_passed": false,
|
| 163 |
+
"retention_ratio": 1.0
|
| 164 |
+
},
|
| 165 |
+
{
|
| 166 |
+
"task_name": "ecommerce_mobile_baseline_seed4",
|
| 167 |
+
"task_id": "ecommerce_mobile",
|
| 168 |
+
"episode": 2,
|
| 169 |
+
"success": false,
|
| 170 |
+
"total_return": -0.135,
|
| 171 |
+
"steps": 7,
|
| 172 |
+
"invalid_actions": 0,
|
| 173 |
+
"quality_gate_passed": false,
|
| 174 |
+
"retention_ratio": 0.9806
|
| 175 |
+
},
|
| 176 |
+
{
|
| 177 |
+
"task_name": "subscription_churn_baseline_seed0",
|
| 178 |
+
"task_id": "subscription_churn",
|
| 179 |
+
"episode": 0,
|
| 180 |
+
"success": false,
|
| 181 |
+
"total_return": 0.0,
|
| 182 |
+
"steps": 1,
|
| 183 |
+
"invalid_actions": 0,
|
| 184 |
+
"quality_gate_passed": false,
|
| 185 |
+
"retention_ratio": 1.0
|
| 186 |
+
},
|
| 187 |
+
{
|
| 188 |
+
"task_name": "subscription_churn_baseline_seed0",
|
| 189 |
+
"task_id": "subscription_churn",
|
| 190 |
+
"episode": 1,
|
| 191 |
+
"success": false,
|
| 192 |
+
"total_return": 0.0,
|
| 193 |
+
"steps": 1,
|
| 194 |
+
"invalid_actions": 0,
|
| 195 |
+
"quality_gate_passed": false,
|
| 196 |
+
"retention_ratio": 1.0
|
| 197 |
+
},
|
| 198 |
+
{
|
| 199 |
+
"task_name": "subscription_churn_baseline_seed0",
|
| 200 |
+
"task_id": "subscription_churn",
|
| 201 |
+
"episode": 2,
|
| 202 |
+
"success": false,
|
| 203 |
+
"total_return": 0.0,
|
| 204 |
+
"steps": 1,
|
| 205 |
+
"invalid_actions": 0,
|
| 206 |
+
"quality_gate_passed": false,
|
| 207 |
+
"retention_ratio": 1.0
|
| 208 |
+
},
|
| 209 |
+
{
|
| 210 |
+
"task_name": "subscription_churn_baseline_seed1",
|
| 211 |
+
"task_id": "subscription_churn",
|
| 212 |
+
"episode": 0,
|
| 213 |
+
"success": false,
|
| 214 |
+
"total_return": -0.34,
|
| 215 |
+
"steps": 7,
|
| 216 |
+
"invalid_actions": 0,
|
| 217 |
+
"quality_gate_passed": false,
|
| 218 |
+
"retention_ratio": 0.9767
|
| 219 |
+
},
|
| 220 |
+
{
|
| 221 |
+
"task_name": "subscription_churn_baseline_seed1",
|
| 222 |
+
"task_id": "subscription_churn",
|
| 223 |
+
"episode": 1,
|
| 224 |
+
"success": false,
|
| 225 |
+
"total_return": -0.12000000000000001,
|
| 226 |
+
"steps": 3,
|
| 227 |
+
"invalid_actions": 0,
|
| 228 |
+
"quality_gate_passed": false,
|
| 229 |
+
"retention_ratio": 1.0
|
| 230 |
+
},
|
| 231 |
+
{
|
| 232 |
+
"task_name": "subscription_churn_baseline_seed1",
|
| 233 |
+
"task_id": "subscription_churn",
|
| 234 |
+
"episode": 2,
|
| 235 |
+
"success": false,
|
| 236 |
+
"total_return": 0.0,
|
| 237 |
+
"steps": 2,
|
| 238 |
+
"invalid_actions": 0,
|
| 239 |
+
"quality_gate_passed": false,
|
| 240 |
+
"retention_ratio": 1.0
|
| 241 |
+
},
|
| 242 |
+
{
|
| 243 |
+
"task_name": "subscription_churn_baseline_seed2",
|
| 244 |
+
"task_id": "subscription_churn",
|
| 245 |
+
"episode": 0,
|
| 246 |
+
"success": false,
|
| 247 |
+
"total_return": -0.1,
|
| 248 |
+
"steps": 2,
|
| 249 |
+
"invalid_actions": 0,
|
| 250 |
+
"quality_gate_passed": false,
|
| 251 |
+
"retention_ratio": 1.0
|
| 252 |
+
},
|
| 253 |
+
{
|
| 254 |
+
"task_name": "subscription_churn_baseline_seed2",
|
| 255 |
+
"task_id": "subscription_churn",
|
| 256 |
+
"episode": 1,
|
| 257 |
+
"success": false,
|
| 258 |
+
"total_return": -0.12000000000000001,
|
| 259 |
+
"steps": 3,
|
| 260 |
+
"invalid_actions": 0,
|
| 261 |
+
"quality_gate_passed": false,
|
| 262 |
+
"retention_ratio": 1.0
|
| 263 |
+
},
|
| 264 |
+
{
|
| 265 |
+
"task_name": "subscription_churn_baseline_seed2",
|
| 266 |
+
"task_id": "subscription_churn",
|
| 267 |
+
"episode": 2,
|
| 268 |
+
"success": false,
|
| 269 |
+
"total_return": -0.1,
|
| 270 |
+
"steps": 2,
|
| 271 |
+
"invalid_actions": 0,
|
| 272 |
+
"quality_gate_passed": false,
|
| 273 |
+
"retention_ratio": 1.0
|
| 274 |
+
},
|
| 275 |
+
{
|
| 276 |
+
"task_name": "subscription_churn_baseline_seed3",
|
| 277 |
+
"task_id": "subscription_churn",
|
| 278 |
+
"episode": 0,
|
| 279 |
+
"success": false,
|
| 280 |
+
"total_return": -0.34,
|
| 281 |
+
"steps": 6,
|
| 282 |
+
"invalid_actions": 0,
|
| 283 |
+
"quality_gate_passed": false,
|
| 284 |
+
"retention_ratio": 1.0
|
| 285 |
+
},
|
| 286 |
+
{
|
| 287 |
+
"task_name": "subscription_churn_baseline_seed3",
|
| 288 |
+
"task_id": "subscription_churn",
|
| 289 |
+
"episode": 1,
|
| 290 |
+
"success": false,
|
| 291 |
+
"total_return": -0.06,
|
| 292 |
+
"steps": 5,
|
| 293 |
+
"invalid_actions": 0,
|
| 294 |
+
"quality_gate_passed": false,
|
| 295 |
+
"retention_ratio": 0.9786
|
| 296 |
+
},
|
| 297 |
+
{
|
| 298 |
+
"task_name": "subscription_churn_baseline_seed3",
|
| 299 |
+
"task_id": "subscription_churn",
|
| 300 |
+
"episode": 2,
|
| 301 |
+
"success": false,
|
| 302 |
+
"total_return": -0.02,
|
| 303 |
+
"steps": 2,
|
| 304 |
+
"invalid_actions": 0,
|
| 305 |
+
"quality_gate_passed": false,
|
| 306 |
+
"retention_ratio": 1.0
|
| 307 |
+
},
|
| 308 |
+
{
|
| 309 |
+
"task_name": "subscription_churn_baseline_seed4",
|
| 310 |
+
"task_id": "subscription_churn",
|
| 311 |
+
"episode": 0,
|
| 312 |
+
"success": false,
|
| 313 |
+
"total_return": -0.12000000000000001,
|
| 314 |
+
"steps": 5,
|
| 315 |
+
"invalid_actions": 0,
|
| 316 |
+
"quality_gate_passed": false,
|
| 317 |
+
"retention_ratio": 1.0
|
| 318 |
+
},
|
| 319 |
+
{
|
| 320 |
+
"task_name": "subscription_churn_baseline_seed4",
|
| 321 |
+
"task_id": "subscription_churn",
|
| 322 |
+
"episode": 1,
|
| 323 |
+
"success": false,
|
| 324 |
+
"total_return": -0.02,
|
| 325 |
+
"steps": 3,
|
| 326 |
+
"invalid_actions": 0,
|
| 327 |
+
"quality_gate_passed": false,
|
| 328 |
+
"retention_ratio": 1.0
|
| 329 |
+
},
|
| 330 |
+
{
|
| 331 |
+
"task_name": "subscription_churn_baseline_seed4",
|
| 332 |
+
"task_id": "subscription_churn",
|
| 333 |
+
"episode": 2,
|
| 334 |
+
"success": false,
|
| 335 |
+
"total_return": -0.2,
|
| 336 |
+
"steps": 3,
|
| 337 |
+
"invalid_actions": 0,
|
| 338 |
+
"quality_gate_passed": false,
|
| 339 |
+
"retention_ratio": 1.0
|
| 340 |
+
},
|
| 341 |
+
{
|
| 342 |
+
"task_name": "delivery_eta_baseline_seed0",
|
| 343 |
+
"task_id": "delivery_eta",
|
| 344 |
+
"episode": 0,
|
| 345 |
+
"success": false,
|
| 346 |
+
"total_return": -0.4800000000000001,
|
| 347 |
+
"steps": 11,
|
| 348 |
+
"invalid_actions": 0,
|
| 349 |
+
"quality_gate_passed": false,
|
| 350 |
+
"retention_ratio": 0.9825
|
| 351 |
+
},
|
| 352 |
+
{
|
| 353 |
+
"task_name": "delivery_eta_baseline_seed0",
|
| 354 |
+
"task_id": "delivery_eta",
|
| 355 |
+
"episode": 1,
|
| 356 |
+
"success": false,
|
| 357 |
+
"total_return": -0.21730000000000002,
|
| 358 |
+
"steps": 6,
|
| 359 |
+
"invalid_actions": 0,
|
| 360 |
+
"quality_gate_passed": false,
|
| 361 |
+
"retention_ratio": 1.0
|
| 362 |
+
},
|
| 363 |
+
{
|
| 364 |
+
"task_name": "delivery_eta_baseline_seed0",
|
| 365 |
+
"task_id": "delivery_eta",
|
| 366 |
+
"episode": 2,
|
| 367 |
+
"success": false,
|
| 368 |
+
"total_return": -0.02,
|
| 369 |
+
"steps": 4,
|
| 370 |
+
"invalid_actions": 0,
|
| 371 |
+
"quality_gate_passed": false,
|
| 372 |
+
"retention_ratio": 1.0
|
| 373 |
+
},
|
| 374 |
+
{
|
| 375 |
+
"task_name": "delivery_eta_baseline_seed1",
|
| 376 |
+
"task_id": "delivery_eta",
|
| 377 |
+
"episode": 0,
|
| 378 |
+
"success": false,
|
| 379 |
+
"total_return": 0.0,
|
| 380 |
+
"steps": 1,
|
| 381 |
+
"invalid_actions": 0,
|
| 382 |
+
"quality_gate_passed": false,
|
| 383 |
+
"retention_ratio": 1.0
|
| 384 |
+
},
|
| 385 |
+
{
|
| 386 |
+
"task_name": "delivery_eta_baseline_seed1",
|
| 387 |
+
"task_id": "delivery_eta",
|
| 388 |
+
"episode": 1,
|
| 389 |
+
"success": false,
|
| 390 |
+
"total_return": 0.0,
|
| 391 |
+
"steps": 2,
|
| 392 |
+
"invalid_actions": 0,
|
| 393 |
+
"quality_gate_passed": false,
|
| 394 |
+
"retention_ratio": 1.0
|
| 395 |
+
},
|
| 396 |
+
{
|
| 397 |
+
"task_name": "delivery_eta_baseline_seed1",
|
| 398 |
+
"task_id": "delivery_eta",
|
| 399 |
+
"episode": 2,
|
| 400 |
+
"error": "Tool 'submit_solution' failed: Error calling tool 'submit_solution': Error serializing to JSON: TypeError: 'float' object cannot be interpreted as an integer (type: execution_error)",
|
| 401 |
+
"success": false,
|
| 402 |
+
"total_return": 0.0,
|
| 403 |
+
"steps": 0,
|
| 404 |
+
"invalid_actions": 0
|
| 405 |
+
},
|
| 406 |
+
{
|
| 407 |
+
"task_name": "delivery_eta_baseline_seed2",
|
| 408 |
+
"task_id": "delivery_eta",
|
| 409 |
+
"episode": 0,
|
| 410 |
+
"success": false,
|
| 411 |
+
"total_return": 0.0,
|
| 412 |
+
"steps": 3,
|
| 413 |
+
"invalid_actions": 0,
|
| 414 |
+
"quality_gate_passed": false,
|
| 415 |
+
"retention_ratio": 1.0
|
| 416 |
+
},
|
| 417 |
+
{
|
| 418 |
+
"task_name": "delivery_eta_baseline_seed2",
|
| 419 |
+
"task_id": "delivery_eta",
|
| 420 |
+
"episode": 1,
|
| 421 |
+
"success": false,
|
| 422 |
+
"total_return": 0.0,
|
| 423 |
+
"steps": 1,
|
| 424 |
+
"invalid_actions": 0,
|
| 425 |
+
"quality_gate_passed": false,
|
| 426 |
+
"retention_ratio": 1.0
|
| 427 |
+
},
|
| 428 |
+
{
|
| 429 |
+
"task_name": "delivery_eta_baseline_seed2",
|
| 430 |
+
"task_id": "delivery_eta",
|
| 431 |
+
"episode": 2,
|
| 432 |
+
"success": false,
|
| 433 |
+
"total_return": 0.0,
|
| 434 |
+
"steps": 1,
|
| 435 |
+
"invalid_actions": 0,
|
| 436 |
+
"quality_gate_passed": false,
|
| 437 |
+
"retention_ratio": 1.0
|
| 438 |
+
},
|
| 439 |
+
{
|
| 440 |
+
"task_name": "delivery_eta_baseline_seed3",
|
| 441 |
+
"task_id": "delivery_eta",
|
| 442 |
+
"episode": 0,
|
| 443 |
+
"success": false,
|
| 444 |
+
"total_return": 0.002700000000000001,
|
| 445 |
+
"steps": 2,
|
| 446 |
+
"invalid_actions": 0,
|
| 447 |
+
"quality_gate_passed": false,
|
| 448 |
+
"retention_ratio": 1.0
|
| 449 |
+
},
|
| 450 |
+
{
|
| 451 |
+
"task_name": "delivery_eta_baseline_seed3",
|
| 452 |
+
"task_id": "delivery_eta",
|
| 453 |
+
"episode": 1,
|
| 454 |
+
"error": "Tool 'preview_data' failed: Error calling tool 'preview_data': Error serializing to JSON: TypeError: 'float' object cannot be interpreted as an integer (type: execution_error)",
|
| 455 |
+
"success": false,
|
| 456 |
+
"total_return": 0.0,
|
| 457 |
+
"steps": 0,
|
| 458 |
+
"invalid_actions": 0
|
| 459 |
+
},
|
| 460 |
+
{
|
| 461 |
+
"task_name": "delivery_eta_baseline_seed3",
|
| 462 |
+
"task_id": "delivery_eta",
|
| 463 |
+
"episode": 2,
|
| 464 |
+
"success": false,
|
| 465 |
+
"total_return": -0.1,
|
| 466 |
+
"steps": 2,
|
| 467 |
+
"invalid_actions": 0,
|
| 468 |
+
"quality_gate_passed": false,
|
| 469 |
+
"retention_ratio": 1.0
|
| 470 |
+
},
|
| 471 |
+
{
|
| 472 |
+
"task_name": "delivery_eta_baseline_seed4",
|
| 473 |
+
"task_id": "delivery_eta",
|
| 474 |
+
"episode": 0,
|
| 475 |
+
"success": false,
|
| 476 |
+
"total_return": 0.0,
|
| 477 |
+
"steps": 1,
|
| 478 |
+
"invalid_actions": 0,
|
| 479 |
+
"quality_gate_passed": false,
|
| 480 |
+
"retention_ratio": 1.0
|
| 481 |
+
},
|
| 482 |
+
{
|
| 483 |
+
"task_name": "delivery_eta_baseline_seed4",
|
| 484 |
+
"task_id": "delivery_eta",
|
| 485 |
+
"episode": 1,
|
| 486 |
+
"success": false,
|
| 487 |
+
"total_return": -0.12000000000000001,
|
| 488 |
+
"steps": 3,
|
| 489 |
+
"invalid_actions": 0,
|
| 490 |
+
"quality_gate_passed": false,
|
| 491 |
+
"retention_ratio": 1.0
|
| 492 |
+
},
|
| 493 |
+
{
|
| 494 |
+
"task_name": "delivery_eta_baseline_seed4",
|
| 495 |
+
"task_id": "delivery_eta",
|
| 496 |
+
"episode": 2,
|
| 497 |
+
"success": false,
|
| 498 |
+
"total_return": -0.22000000000000003,
|
| 499 |
+
"steps": 6,
|
| 500 |
+
"invalid_actions": 0,
|
| 501 |
+
"quality_gate_passed": false,
|
| 502 |
+
"retention_ratio": 1.0
|
| 503 |
+
}
|
| 504 |
+
]
|
| 505 |
+
}
|
server/cleaning_environment.py
CHANGED
|
@@ -106,25 +106,30 @@ TASKS: dict[str, TaskSpec] = {
|
|
| 106 |
dataset_factory=make_dataset_factory("ecommerce_mobile", n_rows=SIZE_MEDIUM),
|
| 107 |
expected_types={
|
| 108 |
"session_id": "int64",
|
| 109 |
-
"device_os": "
|
| 110 |
-
"customer_id": "
|
| 111 |
-
"country": "
|
| 112 |
"items_in_cart": "float64",
|
| 113 |
"order_value": "float64",
|
| 114 |
-
"event_date": "datetime64[
|
| 115 |
"converted": "int64",
|
| 116 |
},
|
| 117 |
required_ops=[
|
| 118 |
{"operation": "replace_invalid_with_null", "column": "country"},
|
| 119 |
{"operation": "replace_invalid_with_null", "column": "items_in_cart"},
|
| 120 |
-
{"operation": "
|
| 121 |
{"operation": "cast_numeric", "column": "items_in_cart"},
|
|
|
|
| 122 |
{"operation": "impute_numeric", "column": "items_in_cart"},
|
|
|
|
| 123 |
{"operation": "clip_outliers_iqr", "column": "items_in_cart"},
|
| 124 |
{"operation": "clip_outliers_iqr", "column": "order_value"},
|
| 125 |
{"operation": "normalize_categories", "column": "device_os"},
|
| 126 |
{"operation": "normalize_categories", "column": "country"},
|
|
|
|
|
|
|
| 127 |
{"operation": "cast_datetime", "column": "event_date"},
|
|
|
|
| 128 |
],
|
| 129 |
notes=[
|
| 130 |
"Preserve the target column.",
|
|
@@ -143,19 +148,19 @@ TASKS: dict[str, TaskSpec] = {
|
|
| 143 |
task_type="classification",
|
| 144 |
dataset_factory=make_dataset_factory("subscription_churn", n_rows=SIZE_MEDIUM),
|
| 145 |
expected_types={
|
| 146 |
-
"customer_key": "
|
| 147 |
"age": "float64",
|
| 148 |
"monthly_spend": "float64",
|
| 149 |
-
"plan_type": "
|
| 150 |
"tenure_months": "float64",
|
| 151 |
-
"payment_method": "
|
| 152 |
"churned": "int64",
|
| 153 |
},
|
| 154 |
required_ops=[
|
| 155 |
-
{"operation": "drop_duplicates"},
|
| 156 |
{"operation": "replace_invalid_with_null", "column": "monthly_spend"},
|
| 157 |
{"operation": "replace_invalid_with_null", "column": "age"},
|
| 158 |
{"operation": "replace_invalid_with_null", "column": "tenure_months"},
|
|
|
|
| 159 |
{"operation": "cast_numeric", "column": "age"},
|
| 160 |
{"operation": "cast_numeric", "column": "monthly_spend"},
|
| 161 |
{"operation": "cast_numeric", "column": "tenure_months"},
|
|
@@ -164,6 +169,10 @@ TASKS: dict[str, TaskSpec] = {
|
|
| 164 |
{"operation": "impute_numeric", "column": "tenure_months"},
|
| 165 |
{"operation": "clip_outliers_iqr", "column": "monthly_spend"},
|
| 166 |
{"operation": "normalize_categories", "column": "plan_type"},
|
|
|
|
|
|
|
|
|
|
|
|
|
| 167 |
],
|
| 168 |
notes=["Monthly spend contains a severe outlier that should be handled, not ignored."],
|
| 169 |
),
|
|
@@ -179,26 +188,31 @@ TASKS: dict[str, TaskSpec] = {
|
|
| 179 |
task_type="regression",
|
| 180 |
dataset_factory=make_dataset_factory("delivery_eta", n_rows=SIZE_MEDIUM),
|
| 181 |
expected_types={
|
| 182 |
-
"route_id": "
|
| 183 |
-
"city": "
|
| 184 |
"driver_rating": "float64",
|
| 185 |
"delivery_distance_km": "float64",
|
| 186 |
"late_deliveries_last_30d": "float64",
|
| 187 |
-
"vehicle_type": "
|
| 188 |
"delivery_time_minutes": "float64",
|
| 189 |
},
|
| 190 |
required_ops=[
|
| 191 |
-
{"operation": "drop_duplicates"},
|
| 192 |
{"operation": "replace_invalid_with_null", "column": "driver_rating"},
|
| 193 |
{"operation": "replace_invalid_with_null", "column": "late_deliveries_last_30d"},
|
|
|
|
|
|
|
| 194 |
{"operation": "cast_numeric", "column": "driver_rating"},
|
| 195 |
{"operation": "cast_numeric", "column": "delivery_distance_km"},
|
| 196 |
{"operation": "cast_numeric", "column": "late_deliveries_last_30d"},
|
| 197 |
{"operation": "impute_numeric", "column": "driver_rating"},
|
| 198 |
{"operation": "impute_numeric", "column": "late_deliveries_last_30d"},
|
|
|
|
| 199 |
{"operation": "clip_outliers_iqr", "column": "delivery_distance_km"},
|
| 200 |
{"operation": "normalize_categories", "column": "city"},
|
| 201 |
{"operation": "normalize_categories", "column": "vehicle_type"},
|
|
|
|
|
|
|
|
|
|
| 202 |
],
|
| 203 |
notes=["City aliases should be standardized before downstream feature engineering."],
|
| 204 |
),
|
|
@@ -488,14 +502,15 @@ class FSDSCleaningEnvironment(MCPEnvironment):
|
|
| 488 |
return f"Imputed '{column}' with mode='{fill_value}'."
|
| 489 |
|
| 490 |
if operation == "normalize_categories":
|
| 491 |
-
episode.working_df[column]
|
|
|
|
| 492 |
episode.working_df[column]
|
| 493 |
.astype(str)
|
| 494 |
.str.strip()
|
| 495 |
.str.lower()
|
| 496 |
.replace({"ios": "ios", "android": "android", "android ": "android", "mty": "monterrey", "car": "car", "CAR": "car"})
|
| 497 |
)
|
| 498 |
-
|
| 499 |
"ca": "ca",
|
| 500 |
"mx": "mx",
|
| 501 |
"us": "us",
|
|
@@ -505,6 +520,8 @@ class FSDSCleaningEnvironment(MCPEnvironment):
|
|
| 505 |
"motorbike": "motorbike",
|
| 506 |
"bike": "bike",
|
| 507 |
})
|
|
|
|
|
|
|
| 508 |
return f"Normalized categories in '{column}'."
|
| 509 |
|
| 510 |
if operation == "clip_outliers_iqr":
|
|
@@ -572,7 +589,7 @@ class FSDSCleaningEnvironment(MCPEnvironment):
|
|
| 572 |
|
| 573 |
def _required_operations_score(self, episode: EpisodeData) -> float:
|
| 574 |
executed = [
|
| 575 |
-
{k: v for k, v in op.items() if k in {"operation", "column"}}
|
| 576 |
for op in episode.operation_log
|
| 577 |
]
|
| 578 |
matched = 0
|
|
|
|
| 106 |
dataset_factory=make_dataset_factory("ecommerce_mobile", n_rows=SIZE_MEDIUM),
|
| 107 |
expected_types={
|
| 108 |
"session_id": "int64",
|
| 109 |
+
"device_os": "str",
|
| 110 |
+
"customer_id": "str",
|
| 111 |
+
"country": "str",
|
| 112 |
"items_in_cart": "float64",
|
| 113 |
"order_value": "float64",
|
| 114 |
+
"event_date": "datetime64[us]",
|
| 115 |
"converted": "int64",
|
| 116 |
},
|
| 117 |
required_ops=[
|
| 118 |
{"operation": "replace_invalid_with_null", "column": "country"},
|
| 119 |
{"operation": "replace_invalid_with_null", "column": "items_in_cart"},
|
| 120 |
+
{"operation": "replace_invalid_with_null", "column": "device_os"},
|
| 121 |
{"operation": "cast_numeric", "column": "items_in_cart"},
|
| 122 |
+
{"operation": "cast_numeric", "column": "order_value"},
|
| 123 |
{"operation": "impute_numeric", "column": "items_in_cart"},
|
| 124 |
+
{"operation": "impute_numeric", "column": "order_value"},
|
| 125 |
{"operation": "clip_outliers_iqr", "column": "items_in_cart"},
|
| 126 |
{"operation": "clip_outliers_iqr", "column": "order_value"},
|
| 127 |
{"operation": "normalize_categories", "column": "device_os"},
|
| 128 |
{"operation": "normalize_categories", "column": "country"},
|
| 129 |
+
{"operation": "impute_categorical", "column": "device_os"},
|
| 130 |
+
{"operation": "impute_categorical", "column": "country"},
|
| 131 |
{"operation": "cast_datetime", "column": "event_date"},
|
| 132 |
+
{"operation": "drop_duplicates"},
|
| 133 |
],
|
| 134 |
notes=[
|
| 135 |
"Preserve the target column.",
|
|
|
|
| 148 |
task_type="classification",
|
| 149 |
dataset_factory=make_dataset_factory("subscription_churn", n_rows=SIZE_MEDIUM),
|
| 150 |
expected_types={
|
| 151 |
+
"customer_key": "str",
|
| 152 |
"age": "float64",
|
| 153 |
"monthly_spend": "float64",
|
| 154 |
+
"plan_type": "str",
|
| 155 |
"tenure_months": "float64",
|
| 156 |
+
"payment_method": "str",
|
| 157 |
"churned": "int64",
|
| 158 |
},
|
| 159 |
required_ops=[
|
|
|
|
| 160 |
{"operation": "replace_invalid_with_null", "column": "monthly_spend"},
|
| 161 |
{"operation": "replace_invalid_with_null", "column": "age"},
|
| 162 |
{"operation": "replace_invalid_with_null", "column": "tenure_months"},
|
| 163 |
+
{"operation": "replace_invalid_with_null", "column": "payment_method"},
|
| 164 |
{"operation": "cast_numeric", "column": "age"},
|
| 165 |
{"operation": "cast_numeric", "column": "monthly_spend"},
|
| 166 |
{"operation": "cast_numeric", "column": "tenure_months"},
|
|
|
|
| 169 |
{"operation": "impute_numeric", "column": "tenure_months"},
|
| 170 |
{"operation": "clip_outliers_iqr", "column": "monthly_spend"},
|
| 171 |
{"operation": "normalize_categories", "column": "plan_type"},
|
| 172 |
+
{"operation": "normalize_categories", "column": "payment_method"},
|
| 173 |
+
{"operation": "impute_categorical", "column": "plan_type"},
|
| 174 |
+
{"operation": "impute_categorical", "column": "payment_method"},
|
| 175 |
+
{"operation": "drop_duplicates"},
|
| 176 |
],
|
| 177 |
notes=["Monthly spend contains a severe outlier that should be handled, not ignored."],
|
| 178 |
),
|
|
|
|
| 188 |
task_type="regression",
|
| 189 |
dataset_factory=make_dataset_factory("delivery_eta", n_rows=SIZE_MEDIUM),
|
| 190 |
expected_types={
|
| 191 |
+
"route_id": "str",
|
| 192 |
+
"city": "str",
|
| 193 |
"driver_rating": "float64",
|
| 194 |
"delivery_distance_km": "float64",
|
| 195 |
"late_deliveries_last_30d": "float64",
|
| 196 |
+
"vehicle_type": "str",
|
| 197 |
"delivery_time_minutes": "float64",
|
| 198 |
},
|
| 199 |
required_ops=[
|
|
|
|
| 200 |
{"operation": "replace_invalid_with_null", "column": "driver_rating"},
|
| 201 |
{"operation": "replace_invalid_with_null", "column": "late_deliveries_last_30d"},
|
| 202 |
+
{"operation": "replace_invalid_with_null", "column": "city"},
|
| 203 |
+
{"operation": "replace_invalid_with_null", "column": "vehicle_type"},
|
| 204 |
{"operation": "cast_numeric", "column": "driver_rating"},
|
| 205 |
{"operation": "cast_numeric", "column": "delivery_distance_km"},
|
| 206 |
{"operation": "cast_numeric", "column": "late_deliveries_last_30d"},
|
| 207 |
{"operation": "impute_numeric", "column": "driver_rating"},
|
| 208 |
{"operation": "impute_numeric", "column": "late_deliveries_last_30d"},
|
| 209 |
+
{"operation": "impute_numeric", "column": "delivery_distance_km"},
|
| 210 |
{"operation": "clip_outliers_iqr", "column": "delivery_distance_km"},
|
| 211 |
{"operation": "normalize_categories", "column": "city"},
|
| 212 |
{"operation": "normalize_categories", "column": "vehicle_type"},
|
| 213 |
+
{"operation": "impute_categorical", "column": "city"},
|
| 214 |
+
{"operation": "impute_categorical", "column": "vehicle_type"},
|
| 215 |
+
{"operation": "drop_duplicates"},
|
| 216 |
],
|
| 217 |
notes=["City aliases should be standardized before downstream feature engineering."],
|
| 218 |
),
|
|
|
|
| 502 |
return f"Imputed '{column}' with mode='{fill_value}'."
|
| 503 |
|
| 504 |
if operation == "normalize_categories":
|
| 505 |
+
null_mask = episode.working_df[column].isna()
|
| 506 |
+
normalized = (
|
| 507 |
episode.working_df[column]
|
| 508 |
.astype(str)
|
| 509 |
.str.strip()
|
| 510 |
.str.lower()
|
| 511 |
.replace({"ios": "ios", "android": "android", "android ": "android", "mty": "monterrey", "car": "car", "CAR": "car"})
|
| 512 |
)
|
| 513 |
+
normalized = normalized.replace({
|
| 514 |
"ca": "ca",
|
| 515 |
"mx": "mx",
|
| 516 |
"us": "us",
|
|
|
|
| 520 |
"motorbike": "motorbike",
|
| 521 |
"bike": "bike",
|
| 522 |
})
|
| 523 |
+
normalized[null_mask] = np.nan
|
| 524 |
+
episode.working_df[column] = normalized
|
| 525 |
return f"Normalized categories in '{column}'."
|
| 526 |
|
| 527 |
if operation == "clip_outliers_iqr":
|
|
|
|
| 589 |
|
| 590 |
def _required_operations_score(self, episode: EpisodeData) -> float:
|
| 591 |
executed = [
|
| 592 |
+
{k: v for k, v in op.items() if k in {"operation", "column"} and v is not None}
|
| 593 |
for op in episode.operation_log
|
| 594 |
]
|
| 595 |
matched = 0
|
training_colab.py
CHANGED
|
@@ -71,6 +71,8 @@ model = FastLanguageModel.get_peft_model(
|
|
| 71 |
)
|
| 72 |
if tokenizer.pad_token is None:
|
| 73 |
tokenizer.pad_token = tokenizer.eos_token
|
|
|
|
|
|
|
| 74 |
|
| 75 |
|
| 76 |
# ββ Cell 5 βΈ System prompt & dataset βββββββββββββββββββββββββββββββββ
|
|
|
|
| 71 |
)
|
| 72 |
if tokenizer.pad_token is None:
|
| 73 |
tokenizer.pad_token = tokenizer.eos_token
|
| 74 |
+
if not hasattr(model, "warnings_issued"):
|
| 75 |
+
model.warnings_issued = {}
|
| 76 |
|
| 77 |
|
| 78 |
# ββ Cell 5 βΈ System prompt & dataset βββββββββββββββββββββββββββββββββ
|