Pulastya B committed on
Commit
03b24f8
·
1 Parent(s): d6ba760

fix: Handle output_dir parameter mismatch in ydata_profiling

- LLM hallucinates 'output_dir' instead of correct 'output_path'
- Auto-convert output_dir to output_path with filename
- Fixes: TypeError unexpected keyword argument 'output_dir'
- Add STORAGE.md documenting ephemeral vs persistent storage options
- Current: Ephemeral /tmp (fine for hackathon)
- Future: Cloudflare R2 or Render Persistent Disks

Files changed (2)
  1. STORAGE.md +108 -0
  2. src/orchestrator.py +9 -0
STORAGE.md ADDED
@@ -0,0 +1,108 @@
+ # Storage Strategy for Render Deployment
+
+ ## Current Status (Ephemeral Storage)
+
+ **Render uses ephemeral `/tmp` storage** - files are deleted on:
+ - Container restart
+ - New deployment
+ - Service scaling
+
+ **Current behavior:**
+ - Reports generated during analysis are accessible during the session
+ - Files disappear after 10-30 minutes or on redeploy
+ - Fine for **hackathon demos** where users view reports immediately
+
+ ## For Production (If Needed)
+
+ ### Option 1: Cloudflare R2 (Recommended)
+ **Best for:** Production deployment with persistent storage
+
+ ```bash
+ # Install boto3 (R2 exposes an S3-compatible API)
+ pip install boto3
+
+ # Configuration (environment variables read by the code below)
+ export R2_ENDPOINT="https://<account-id>.r2.cloudflarestorage.com"
+ export R2_ACCESS_KEY="<access-key>"
+ export R2_SECRET_KEY="<secret-key>"
+ export R2_BUCKET="ds-agent-reports"
+ ```
+
+ **Code changes needed:**
+ ```python
+ # In src/storage/artifact_store.py
+ import os
+
+ import boto3
+
+ def upload_to_r2(local_path: str, r2_key: str) -> str:
+     """Upload a local file to the R2 bucket and return its public URL."""
+     s3 = boto3.client(
+         's3',
+         endpoint_url=os.getenv('R2_ENDPOINT'),
+         aws_access_key_id=os.getenv('R2_ACCESS_KEY'),
+         aws_secret_access_key=os.getenv('R2_SECRET_KEY')
+     )
+     s3.upload_file(local_path, os.getenv('R2_BUCKET'), r2_key)
+     # Return public URL (placeholder domain; the bucket must be exposed publicly)
+     return f"https://reports.yourdomain.com/{r2_key}"
+ ```
+
+ **Cost:** ~$0.015/GB-month storage + $0.36/million Class B (read) operations (very cheap)
+
+ ### Option 2: Render Persistent Disks
+ **Best for:** Simple persistent storage without external dependencies
+
+ - Add persistent disk in Render dashboard
+ - Mount at `/data`
+ - Change `OUTPUT_DIR` to `/data/outputs`
+ - **Cost:** $0.25/GB/month (more expensive than R2)
+ - **Limitation:** Disk size is fixed, can't easily scale
+
+ ### Option 3: Browser-Side Download (Current + Enhancement)
+ **Best for:** Hackathon/Demo where users download immediately
+
+ ```typescript
+ // Auto-download reports after generation
+ const downloadReport = async (reportPath: string): Promise<void> => {
+   const response = await fetch(reportPath);
+   if (!response.ok) {
+     throw new Error(`Failed to fetch report: ${response.status}`);
+   }
+   const blob = await response.blob();
+   const url = window.URL.createObjectURL(blob);
+   const a = document.createElement('a');
+   a.href = url;
+   a.download = reportPath.split('/').pop() || 'report.html';
+   a.click();
+   // Release the object URL once the download has been triggered
+   window.URL.revokeObjectURL(url);
+ };
+ ```
+
+ **Pros:**
+ - No storage costs
+ - Works with ephemeral Render storage
+ - User has permanent copy
+
+ **Cons:**
+ - Large files (reports can be 5-50MB)
+ - Can't re-access after browser close
+
+ ## Recommendation for DevSprint Hackathon
+
+ **Keep current ephemeral storage** because:
+ 1. ✅ No cost or setup complexity
+ 2. ✅ Reports accessible during demo session
+ 3. ✅ Judges can view reports immediately after generation
+ 4. ✅ If needed, add "Download Report" button for permanent copy
+
+ **After hackathon** (if going to production):
+ - Use **Cloudflare R2** for cost-effective persistent storage
+ - Keep reports for 30 days with auto-cleanup
+ - Estimated cost: ~$1-5/month for typical usage
+
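The 30-day auto-cleanup mentioned above is not implemented in this commit. On R2 it would more likely be handled by a bucket lifecycle rule than by application code. For a local directory such as the `/data/outputs` mount from Option 2, a minimal sketch could look like the following. The function name `remove_old_reports` and its parameters are hypothetical and do not exist in the repository.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_old_reports(
    directory: str, max_age_days: int = 30, now: Optional[float] = None
) -> List[str]:
    """Delete regular files older than max_age_days; return the removed names.

    Hypothetical helper: `now` exists only to make the cutoff testable.
    """
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    removed = []
    for path in Path(directory).iterdir():
        # Compare the last-modified time against the cutoff; skip subdirectories
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Such a helper would need to be called periodically, for example from a scheduled job. It only looks at the top level of the directory and uses modification time as a proxy for report age.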
+ ## Current File Serving
+
+ Reports are served via a FastAPI endpoint:
+ ```python
+ # src/api/app.py (excerpt; `app` is the FastAPI instance defined elsewhere in this file)
+ from pathlib import Path
+
+ from fastapi.responses import FileResponse
+
+ @app.get("/outputs/{file_path:path}")
+ async def serve_output_file(file_path: str):
+     file_full_path = Path(f"./outputs/{file_path}")
+     return FileResponse(file_full_path, media_type="text/html")
+ ```
+
+ This is sufficient for ephemeral storage during active sessions.
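The endpoint above builds the path directly from the URL, so a request containing `..` segments could point outside `./outputs`, and a file already cleared from ephemeral storage is not checked for before `FileResponse` is built. A minimal sketch of a guard the endpoint could call first is shown below. The helper name `resolve_output_file` is hypothetical and not part of the repository, and `Path.is_relative_to` assumes Python 3.9 or later.

```python
from pathlib import Path
from typing import Optional


def resolve_output_file(base_dir: str, requested: str) -> Optional[Path]:
    """Return the resolved file path if it lies inside base_dir, else None."""
    base = Path(base_dir).resolve()
    candidate = (base / requested).resolve()
    # Reject anything that escapes the outputs directory (e.g. "../secrets.env")
    if not candidate.is_relative_to(base):
        return None
    # Only accept regular files that still exist (ephemeral storage may have cleared them)
    if not candidate.is_file():
        return None
    return candidate
```

The endpoint could then return a 404 response whenever the helper yields `None`, and pass the resolved path to `FileResponse` otherwise.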
src/orchestrator.py CHANGED
@@ -829,6 +829,15 @@ You are a DOER. Complete workflows based on user intent."""
 
  try:
      tool_func = self.tool_functions[tool_name]
+
+     # Fix common parameter mismatches from LLM hallucinations
+     if tool_name == "generate_ydata_profiling_report":
+         # LLM often calls with 'output_dir' instead of 'output_path'
+         if "output_dir" in arguments and "output_path" not in arguments:
+             output_dir = arguments.pop("output_dir")
+             # Convert directory to full file path
+             arguments["output_path"] = f"{output_dir}/ydata_profile.html"
+
      result = tool_func(**arguments)
 
      # Check if tool itself returned an error (some tools return dict with 'status': 'error')
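The conversion in this hunk only runs when `output_path` is absent, so a call that supplies both keys would still pass `output_dir` through and raise the same `TypeError`. As a sketch of one possible variant (not the code in this commit), the mapping could be pulled into a standalone helper that always drops the unsupported key. The function name `normalize_profiling_arguments` and the `file_path` key in the usage example are hypothetical.

```python
import posixpath
from typing import Any, Dict


def normalize_profiling_arguments(arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of arguments with 'output_dir' mapped onto 'output_path'."""
    fixed = dict(arguments)  # leave the caller's dict untouched
    # Always remove the unsupported key, even when 'output_path' is also present
    output_dir = fixed.pop("output_dir", None)
    if output_dir is not None and "output_path" not in fixed:
        # posixpath.join avoids a doubled slash when the directory ends with "/"
        fixed["output_path"] = posixpath.join(str(output_dir), "ydata_profile.html")
    return fixed


# Example: the hallucinated 'output_dir' becomes a full 'output_path'
example = normalize_profiling_arguments(
    {"file_path": "data.csv", "output_dir": "./outputs/"}
)
```

Returning a copy keeps the original tool-call arguments intact for logging, which may help when diagnosing how often the model produces the wrong parameter name.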