Spaces: Pulastya0 / Data-Science-Agent

Pulastya B committed on Jan 29

Commit

6f57124

1 Parent(s): b43b5e5

Fixed bugs 2

Files changed (3)

TEST_SCENARIOS.md +0 -300
src/tools/advanced_training.py +19 -0
src/tools/model_training.py +23 -0

TEST_SCENARIOS.md CHANGED

@@ -1,300 +0,0 @@
-# Test Scenarios for Parameter Remapping Fixes
-## Test Case 1: train_baseline_models with invalid 'models' parameter
-### Input (from LLM):
-```json
-{
-  "tool": "train_baseline_models",
-  "arguments": {
-    "file_path": "/tmp/data.csv",
-    "target_column": "price",
-    "models": ["linear_regression", "random_forest", "xgboost"],
-    "test_size": 0.2,
-    "random_state": 42
-  }
-}
-```
-### Expected Output (after remapping):
-```
-   ✓ Parameter remapped: target_column → target_col
-   ✓ Stripped invalid parameter 'models': ['linear_regression', 'random_forest', 'xgboost']
-   ℹ️ train_baseline_models trains all baseline models automatically
-   📋 Final parameters: ['file_path', 'target_col', 'test_size', 'random_state']
-🔧 Executing tool: train_baseline_models
-✅ Tool executed successfully
-```
-### What Gets Called:
-```python
-train_baseline_models(
-    file_path="/tmp/data.csv",
-    target_col="price",  # Remapped from target_column
-    test_size=0.2,
-    random_state=42
-    # models parameter stripped
-)
-```
----
-## Test Case 2: generate_model_report with wrong parameter name
-### Input (from LLM):
-```json
-{
-  "tool": "generate_model_report",
-  "arguments": {
-    "model_path": "/tmp/model.pkl",
-    "file_path": "/tmp/test.csv",
-    "target_column": "price",
-    "output_path": "/tmp/report.json"
-  }
-}
-```
-### Expected Output (after remapping):
-```
-   ✓ Parameter remapped: target_column → target_col
-   ✓ Parameter remapped: file_path → test_data_path
-   📋 Final parameters: ['model_path', 'test_data_path', 'target_col', 'output_path']
-🔧 Executing tool: generate_model_report
-✅ Tool executed successfully
-```
-### What Gets Called:
-```python
-generate_model_report(
-    model_path="/tmp/model.pkl",
-    test_data_path="/tmp/test.csv",  # Remapped from file_path
-    target_col="price",  # Remapped from target_column
-    output_path="/tmp/report.json"
-)
-```
----
-## Test Case 3: detect_model_issues with invalid split parameters
-### Input (from LLM):
-```json
-{
-  "tool": "detect_model_issues",
-  "arguments": {
-    "model_path": "/tmp/model.pkl",
-    "train_data_path": "/tmp/train.csv",
-    "test_data_path": "/tmp/test.csv",
-    "target_column": "price",
-    "train_target_path": "/tmp/y_train.csv",
-    "test_target_path": "/tmp/y_test.csv"
-  }
-}
-```
-### Expected Output (after remapping):
-```
-   ✓ Parameter remapped: target_column → target_col
-   ✓ Stripped invalid parameter 'train_target_path': /tmp/y_train.csv
-   ✓ Stripped invalid parameter 'test_target_path': /tmp/y_test.csv
-   📋 Final parameters: ['model_path', 'train_data_path', 'test_data_path', 'target_col']
-🔧 Executing tool: detect_model_issues
-✅ Tool executed successfully
-```
-### What Gets Called:
-```python
-detect_model_issues(
-    model_path="/tmp/model.pkl",
-    train_data_path="/tmp/train.csv",
-    test_data_path="/tmp/test.csv",
-    target_col="price"  # Remapped from target_column
-    # train_target_path and test_target_path stripped
-)
-```
----
-## Test Case 4: detect_model_issues missing required parameter
-### Input (from LLM):
-```json
-{
-  "tool": "detect_model_issues",
-  "arguments": {
-    "model_path": "/tmp/model.pkl",
-    "test_data_path": "/tmp/test.csv",
-    "target_column": "price"
-  }
-}
-```
-### Expected Output (after remapping):
-```
-   ✓ Parameter remapped: target_column → target_col
-   ⚠️ WARNING: detect_model_issues requires 'train_data_path' parameter
-   📋 Final parameters: ['model_path', 'test_data_path', 'target_col']
-🔧 Executing tool: detect_model_issues
-❌ Error: detect_model_issues() missing 1 required positional argument: 'train_data_path'
-```
-### Result:
-The tool still fails (as expected), but the warning makes clear that `train_data_path` is required, so the LLM can retry with correct parameters.
----
-## Test Case 5: Combined parameter issues
-### Input (from LLM):
-```json
-{
-  "tool": "train_baseline_models",
-  "arguments": {
-    "file_path": "/tmp/data.csv",
-    "target_column": "price",
-    "models": ["xgboost"],
-    "test_size": "0.3",
-    "random_state": "None"
-  }
-}
-```
-### Expected Output (after remapping):
-```
-   ✓ Parameter remapped: target_column → target_col
-   ✓ Stripped invalid parameter 'models': ['xgboost']
-   ℹ️ train_baseline_models trains all baseline models automatically
-   📋 Final parameters: ['file_path', 'target_col', 'test_size', 'random_state']
-🔧 Executing tool: train_baseline_models
-✅ Tool executed successfully
-```
-### What Gets Called:
-```python
-train_baseline_models(
-    file_path="/tmp/data.csv",
-    target_col="price",  # Remapped
-    test_size="0.3",  # String (may cause type error - should be float)
-    random_state=None  # "None" string converted to None
-)
-```
-**Note**: Type conversion from string "None" to None works. String "0.3" to float conversion needs testing.
----
-## Test Case 6: No remapping needed (correct parameters)
-### Input (from LLM):
-```json
-{
-  "tool": "train_baseline_models",
-  "arguments": {
-    "file_path": "/tmp/data.csv",
-    "target_col": "price",
-    "test_size": 0.2,
-    "random_state": 42
-  }
-}
-```
-### Expected Output:
-```
-   📋 Final parameters: ['file_path', 'target_col', 'test_size', 'random_state']
-🔧 Executing tool: train_baseline_models
-✅ Tool executed successfully
-```
-**No remapping messages** - parameters already correct!
----
-## Validation Commands
-### Check logs for parameter remapping:
-```bash
-grep "✓ Parameter remapped" logs.txt
-grep "✓ Stripped invalid parameter" logs.txt
-```
-### Check for remaining errors:
-```bash
-grep "unexpected keyword argument" logs.txt
-grep "missing.*required.*argument" logs.txt
-```
-### Count successful modeling tool executions:
-```bash
-grep -A5 "train_baseline_models" logs.txt | grep "✅ Tool executed successfully" | wc -l
-grep -A5 "generate_model_report" logs.txt | grep "✅ Tool executed successfully" | wc -l
-grep -A5 "detect_model_issues" logs.txt | grep "✅ Tool executed successfully" | wc -l
-```
----
-## Integration Test Flow
-**Complete ML Pipeline Test**:
-1. Load earthquake dataset
-2. Profile data (`profile_dataset`)
-3. Create time features (`create_time_features`)
-4. Create interaction features (`create_interaction_features`)
-5. Encode categorical (`encode_categorical`)
-6. **Train baseline models** (`train_baseline_models` - WITH REMAPPING)
-7. Hyperparameter tuning (`hyperparameter_tuning`)
-8. Cross-validation (`perform_cross_validation`)
-9. **Generate report** (`generate_model_report` - WITH REMAPPING)
-10. **Detect issues** (`detect_model_issues` - WITH REMAPPING)
-**Expected**: All steps succeed without parameter errors.
----
-## Edge Cases to Consider
-### 1. Both old and new parameter provided:
-```json
-{
-  "target_column": "price",
-  "target_col": "sales"
-}
-```
-**Behavior**: Keep `target_col`, ignore `target_column` (remapping checks `target_col not in arguments`)
-### 2. Parameter is None:
-```json
-{
-  "models": null
-}
-```
-**Behavior**: Still stripped (check is `if "models" in arguments`)
-### 3. Empty list parameter:
-```json
-{
-  "models": []
-}
-```
-**Behavior**: Stripped with log showing empty list
-### 4. Multiple invalid parameters:
-```json
-{
-  "train_target_path": "/tmp/y_train.csv",
-  "test_target_path": "/tmp/y_test.csv",
-  "validation_target_path": "/tmp/y_val.csv"
-}
-```
-**Behavior**: Only `train_target_path` and `test_target_path` are stripped; `validation_target_path` is passed through because it is not in the remapping list
----
-## Success Metrics
-After deployment, measure:
-- ✅ Number of parameter remapping logs (should increase)
-- ✅ Successful modeling tool executions (should increase)
-- ✅ Parameter error count (should decrease to near zero)
-- ✅ execute_python_code fallbacks for modeling (should decrease)
-- ✅ Complete workflow success rate (should increase)

src/tools/advanced_training.py CHANGED

@@ -110,6 +110,25 @@ def hyperparameter_tuning(
         n_trials = 30
         print(f"   ⚠️ Medium dataset ({n_rows:,} rows) - reducing trials from {original_trials} to {n_trials}")
+    # ⚠️ PERFORMANCE FIX: Sample large datasets for hyperparameter tuning
+    # Hyperparameters found on sample will be used to train final model on full dataset
+    MAX_TUNING_ROWS = 50000
+    sampled = False
+    if n_rows > MAX_TUNING_ROWS:
+        original_rows = n_rows
+        sample_frac = MAX_TUNING_ROWS / n_rows
+        df = df.sample(n=MAX_TUNING_ROWS, random_state=random_state)
+        sampled = True
+        print(f"   📊 Sampled {MAX_TUNING_ROWS:,} rows ({sample_frac:.1%}) from {original_rows:,} for faster tuning")
+        print(f"   💡 Hyperparameters found on sample will generalize well to full dataset")
+        print(f"   ⏱️ Expected speedup: 3-5x faster tuning")
+    # ⚠️ Auto-reduce CV folds for very large datasets
+    original_cv_folds = cv_folds
+    if n_rows > 100000 and cv_folds > 3:
+        cv_folds = 3
+        print(f"   ⏱️ Using {cv_folds}-fold CV (instead of {original_cv_folds}) for faster tuning on large dataset")
     # ⚠️ SKIP DATETIME CONVERSION: Already handled by create_time_features() in workflow step 7
     # The encoded.csv file should already have time features extracted
     # If datetime columns still exist, they will be handled as regular features
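
Review note: the hunk above makes two independent decisions: cap the tuning data at `MAX_TUNING_ROWS` and drop to 3-fold CV for very large inputs. As far as the hunk shows, `n_rows` is not reassigned after sampling, so the fold check compares against the original row count. The sketch below isolates that decision logic as a pure function so it can be checked without a DataFrame. `TuningPlan` and `plan_tuning` are hypothetical names, not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningPlan:
    sample_rows: int    # rows actually used for tuning
    sample_frac: float  # fraction of the original dataset
    cv_folds: int       # folds after any automatic reduction


def plan_tuning(
    n_rows: int,
    cv_folds: int,
    max_tuning_rows: int = 50_000,      # mirrors MAX_TUNING_ROWS in the hunk
    large_dataset_rows: int = 100_000,  # mirrors the 100000 threshold
) -> TuningPlan:
    """Decide sample size and CV folds from the original row count."""
    sample_rows = min(n_rows, max_tuning_rows)
    sample_frac = sample_rows / n_rows if n_rows else 1.0
    # The fold reduction keys off the original size, not the sampled size.
    if n_rows > large_dataset_rows and cv_folds > 3:
        cv_folds = 3
    return TuningPlan(sample_rows, sample_frac, cv_folds)
```

One consequence worth noting: a dataset between 50,001 and 100,000 rows is sampled down but keeps its requested fold count, while one above 100,000 rows gets both reductions.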

src/tools/model_training.py CHANGED

@@ -286,6 +286,29 @@ def train_baseline_models(file_path: str, target_col: str,
         "model_path": results["models"][best_model_name]["model_path"] if best_model_name else None
     }
+    # ⚠️ Add guidance for hyperparameter tuning on large datasets
+    if results["n_samples"] > 100000:
+        # Recommend faster models for large datasets
+        fast_models = ["xgboost", "lightgbm"]
+        if best_model_name in fast_models:
+            results["tuning_recommendation"] = {
+                "suggested_model": best_model_name,
+                "reason": f"{best_model_name} is optimal for large datasets - fast training and good performance"
+            }
+        elif best_model_name == "random_forest":
+            # Find next best fast model
+            fast_model_scores = {name: results["models"][name]["test_metrics"].get("r2" if task_type == "regression" else "f1", 0)
+                               for name in fast_models if name in results["models"]}
+            if fast_model_scores:
+                alt_model = max(fast_model_scores, key=fast_model_scores.get)
+                alt_score = fast_model_scores[alt_model]
+                score_diff = abs(best_score - alt_score)
+                if score_diff < 0.05:  # Absolute score difference below 0.05
+                    results["tuning_recommendation"] = {
+                        "suggested_model": alt_model,
+                        "reason": f"For large datasets, {alt_model} is 5-10x faster than {best_model_name} with similar performance (score difference: {score_diff:.4f})"
+                    }
     # Generate visualizations for best model
     if VISUALIZATION_AVAILABLE and best_model_name:
         try:
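
Review note: the recommendation logic added in this hunk can be read as a small decision function. The sketch below restates it outside `train_baseline_models`, assuming a flat `scores` mapping of model name to test score (R² or F1) in place of the nested `results["models"][name]["test_metrics"]` structure. The name `recommend_tuning_model` and its parameters are hypothetical. The 0.05 threshold is an absolute difference in score, not a relative one.

```python
from typing import Optional

FAST_MODELS = ("xgboost", "lightgbm")


def recommend_tuning_model(
    best_model: str,
    scores: dict[str, float],
    n_samples: int,
    large_dataset_rows: int = 100_000,
    max_score_gap: float = 0.05,
) -> Optional[dict]:
    """Suggest which model to tune on a large dataset, or None."""
    if n_samples <= large_dataset_rows or best_model not in scores:
        return None
    if best_model in FAST_MODELS:
        return {"suggested_model": best_model,
                "reason": f"{best_model} is already a fast model"}
    if best_model != "random_forest":
        return None  # the hunk only handles random_forest as a slow winner
    fast_scores = {m: scores[m] for m in FAST_MODELS if m in scores}
    if not fast_scores:
        return None
    alt_model = max(fast_scores, key=fast_scores.get)
    gap = abs(scores[best_model] - fast_scores[alt_model])
    if gap < max_score_gap:
        return {"suggested_model": alt_model,
                "reason": f"{alt_model} scores within {gap:.4f} of {best_model}"}
    return None
```

As in the hunk, no recommendation is produced when the best model is neither a fast model nor `random_forest`, or when the fast alternative trails by 0.05 or more.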