Spaces: Pulastya0/Data-Science-Agent
Pulastya B committed on Dec 28, 2025

Commit 03b24f8

1 Parent(s): d6ba760

fix: Handle output_dir parameter mismatch in ydata_profiling

- LLM hallucinates 'output_dir' instead of correct 'output_path'
- Auto-convert output_dir to output_path with filename
- Fixes: TypeError unexpected keyword argument 'output_dir'
- Add STORAGE.md documenting ephemeral vs persistent storage options
- Current: Ephemeral /tmp (fine for hackathon)
- Future: Cloudflare R2 or Render Persistent Disks

Files changed (2)

STORAGE.md +108 -0
src/orchestrator.py +9 -0

STORAGE.md ADDED

	@@ -0,0 +1,108 @@

+# Storage Strategy for Render Deployment
+## Current Status (Ephemeral Storage)
+**Render uses ephemeral `/tmp` storage** - files are deleted on:
+- Container restart
+- New deployment
+- Service scaling
+**Current behavior:**
+- Reports generated during analysis are accessible during the session
+- Files disappear after 10-30 minutes or on redeploy
+- Fine for **hackathon demos** where users view reports immediately
+## For Production (If Needed)
+### Option 1: Cloudflare R2 (Recommended)
+**Best for:** Production deployment with persistent storage
+```bash
+# Install the S3-compatible SDK used to talk to R2
+pip install boto3
+# Configuration (set as environment variables, read via os.getenv)
+export R2_ENDPOINT="https://<account-id>.r2.cloudflarestorage.com"
+export R2_ACCESS_KEY="<access-key>"
+export R2_SECRET_KEY="<secret-key>"
+export R2_BUCKET="ds-agent-reports"
+```
+**Code changes needed:**
+```python
+# In src/storage/artifact_store.py
+import os
+import boto3
+def upload_to_r2(local_path: str, r2_key: str) -> str:
+    # R2 exposes an S3-compatible API, so the standard boto3 S3 client works
+    s3 = boto3.client(
+        's3',
+        endpoint_url=os.getenv('R2_ENDPOINT'),
+        aws_access_key_id=os.getenv('R2_ACCESS_KEY'),
+        aws_secret_access_key=os.getenv('R2_SECRET_KEY')
+    )
+    s3.upload_file(local_path, os.getenv('R2_BUCKET'), r2_key)
+    # Return public URL (placeholder domain; assumes a custom domain is mapped to the bucket)
+    return f"https://reports.yourdomain.com/{r2_key}"
+```
+**Cost:** ~$0.015/GB-month storage + $0.36/million Class B (read) operations; uploads are billed separately as Class A operations
+### Option 2: Render Persistent Disks
+**Best for:** Simple persistent storage without external dependencies
+- Add persistent disk in Render dashboard
+- Mount at `/data`
+- Change `OUTPUT_DIR` to `/data/outputs`
+- **Cost:** $0.25/GB/month (more expensive than R2)
+- **Limitation:** Disk size is fixed, can't easily scale
+### Option 3: Browser-Side Download (Current + Enhancement)
+**Best for:** Hackathon/Demo where users download immediately
+```typescript
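+// Auto-download reports after generation
+const downloadReport = async (reportPath: string) => {
+  const response = await fetch(reportPath);
+  if (!response.ok) {
+    throw new Error(`Failed to fetch report: ${response.status}`);
+  }
+  const blob = await response.blob();
+  const url = window.URL.createObjectURL(blob);
+  const a = document.createElement('a');
+  a.href = url;
+  a.download = reportPath.split('/').pop() || 'report.html';
+  a.click();
+  // Release the blob URL once the download has been triggered
+  window.URL.revokeObjectURL(url);
+};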
+```
+**Pros:**
+- No storage costs
+- Works with ephemeral Render storage
+- User has permanent copy
+**Cons:**
+- Large files (reports can be 5-50MB)
+- Can't re-access after browser close
+## Recommendation for DevSprint Hackathon
+**Keep current ephemeral storage** because:
+1. ✅ No cost or setup complexity
+2. ✅ Reports accessible during demo session
+3. ✅ Judges can view reports immediately after generation
+4. ✅ If needed, add "Download Report" button for permanent copy
+**After hackathon** (if going to production):
+- Use **Cloudflare R2** for cost-effective persistent storage
+- Keep reports for 30 days with auto-cleanup
+- Estimated cost: ~$1-5/month for typical usage
+## Current File Serving
+Reports are served via FastAPI endpoint:
+```python
+# src/api/app.py
+@app.get("/outputs/{file_path:path}")
+async def serve_output_file(file_path: str):
+    file_full_path = Path(f"./outputs/{file_path}")
+    return FileResponse(file_full_path, media_type="text/html")
+```
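+This is sufficient for ephemeral storage during active sessions. Note that the handler does not check that the file exists or that the resolved path stays inside `./outputs`.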
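
The `serve_output_file` handler documented in STORAGE.md passes `file_path` straight into a `Path` and always answers with `text/html`. A minimal sketch of how the lookup could be hardened, assuming all reports live under a single output directory. `resolve_output_file` and `OUTPUT_ROOT` are hypothetical names for illustration, not part of this commit or the repository:

```python
import mimetypes
from pathlib import Path
from typing import Optional, Tuple

OUTPUT_ROOT = Path("./outputs")  # assumed location, mirrors the documented endpoint


def resolve_output_file(
    file_path: str, root: Path = OUTPUT_ROOT
) -> Optional[Tuple[Path, str]]:
    """Return (absolute path, media type) for a servable file, or None.

    None means the request should be answered with a 404: the path escapes
    the output root or does not point at an existing regular file.
    """
    root_resolved = root.resolve()
    candidate = (root_resolved / file_path).resolve()
    # Reject traversal such as "../secret.txt" and absolute paths.
    if root_resolved not in candidate.parents:
        return None
    if not candidate.is_file():
        return None
    # Guess the type from the extension instead of hard-coding text/html.
    media_type, _ = mimetypes.guess_type(candidate.name)
    return candidate, media_type or "application/octet-stream"
```

In the FastAPI handler, a `None` result would map to `HTTPException(status_code=404)` and a tuple to `FileResponse(path, media_type=media_type)`. Keeping the check in a plain function makes it testable without starting the app.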

src/orchestrator.py CHANGED

@@ -829,6 +829,15 @@ You are a DOER. Complete workflows based on user intent."""
         try:
             tool_func = self.tool_functions[tool_name]
+            # Fix common parameter mismatches from LLM hallucinations
+            if tool_name == "generate_ydata_profiling_report":
+                # LLM often calls with 'output_dir' instead of 'output_path'
+                if "output_dir" in arguments and "output_path" not in arguments:
+                    output_dir = arguments.pop("output_dir")
+                    # Convert directory to full file path
+                    arguments["output_path"] = f"{output_dir}/ydata_profile.html"
             result = tool_func(**arguments)
             # Check if tool itself returned an error (some tools return dict with 'status': 'error')
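
The change to `src/orchestrator.py` special-cases a single tool inline. If other tools turn out to receive mis-named parameters as well, the rewrite could be made table-driven. A minimal sketch under that assumption: `PARAM_ALIASES` and `normalize_arguments` are hypothetical names, and only the `output_dir` to `output_path` mapping and the `ydata_profile.html` filename come from the diff.

```python
import os
from typing import Any, Callable, Dict, Tuple

# Hypothetical alias table: tool name -> {wrong_param: (right_param, converter)}.
PARAM_ALIASES: Dict[str, Dict[str, Tuple[str, Callable[[Any], Any]]]] = {
    "generate_ydata_profiling_report": {
        # Turn a directory into a full file path, as the diff does.
        "output_dir": (
            "output_path",
            lambda d: os.path.join(d, "ydata_profile.html"),
        ),
    },
}


def normalize_arguments(tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of arguments with known wrong parameter names rewritten."""
    fixed = dict(arguments)  # do not mutate the caller's dict
    for wrong, (right, convert) in PARAM_ALIASES.get(tool_name, {}).items():
        if wrong in fixed:
            value = fixed.pop(wrong)
            # An explicitly supplied correct parameter always wins.
            if right not in fixed:
                fixed[right] = convert(value)
    return fixed
```

The call site would then become `result = tool_func(**normalize_arguments(tool_name, arguments))`. One deliberate difference from the committed code: when both `output_dir` and `output_path` are supplied, this sketch drops `output_dir` rather than passing it through, since passing it through would still raise the `TypeError`.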