Spaces: SustainabilityLabIITGN / VayuChat

Nipun Claude committed on Aug 23, 2025

Commit 6fa1a97

1 Parent(s): f1701d3

Significantly strengthen system prompts for robustness

Major improvements to system prompt:
- Add mandatory safety & robustness rules with clear validation steps
- Strengthen plot generation requirements with critical saving sequence
- Add comprehensive data validation checks (empty df, missing columns, sufficient data)
- Add operation safety rules (try-except, bounds checking, type conversion)
- Make plot requirements more explicit with step-by-step instructions
- Add debugging helpers for plot issues

This should fix plots not displaying and make code generation much more robust.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
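
The "critical saving sequence" named in the commit message can be sketched as below. This is a minimal illustration of the four steps the new prompt lists, not code taken from the repository: the sample data, the bar chart, and the Agg backend are assumptions made so the snippet runs on its own.

```python
import uuid

import matplotlib
matplotlib.use("Agg")  # assumption: headless server, so use a non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data; the real app supplies its own DataFrame.
plot_data = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Ahmedabad"],
    "PM2.5": [98.4, 41.2, 55.7],
})

if plot_data.empty:
    # Guard from the prompt: never plot an empty frame.
    answer = "No data to plot"
else:
    plt.figure(figsize=(12, 8))
    plt.bar(plot_data["city"], plot_data["PM2.5"])
    plt.title("Mean PM2.5 by city")
    plt.xlabel("City")
    plt.ylabel("PM2.5 (μg/m³)")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()

    # The saving sequence: unique name, save, close, hand the name back.
    filename = f"plot_{uuid.uuid4().hex[:8]}.png"
    plt.savefig(filename, dpi=300, bbox_inches="tight")
    plt.close()
    answer = filename
```

Closing the figure after saving keeps figures from accumulating across requests, and storing only the filename in answer lets the caller decide how to display the image.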

Files changed (1)

src.py +48 -21

@@ -287,7 +287,7 @@ df["Timestamp"] = pd.to_datetime(df["Timestamp"])
         system_prompt = """Generate Python code to answer the user's question about air quality data.
-IMPORTANT: Only generate Python code - no explanations, no thinking, just clean code.
+CRITICAL: Only generate Python code - no explanations, no thinking, just clean executable code.
 AVAILABLE LIBRARIES:
 You can use these pre-installed libraries:
@@ -299,26 +299,53 @@ You can use these pre-installed libraries:
 Use appropriate libraries for trend analysis, regression, statistical tests, etc.
 For simple trends, prefer numpy.polyfit() over complex statistical libraries when possible.
-WHEN TO USE DIFFERENT OUTPUT TYPES:
-- Simple questions asking "Which city", "What month" (1-2 values) → TEXT ANSWERS (store text in 'answer')
-- Questions asking "Plot", "Show chart", "Visualize" → PLOTS (store filename in 'answer')
-- Questions with tabular data (lists of cities, rates, rankings, comparisons) → DATAFRAMES (store dataframe in 'answer')
-- Examples of DATAFRAME outputs:
-  * Lists of cities with values (pollution levels, improvement rates)
-  * Rankings or comparisons across multiple entities
-  * Any result that would be >5 rows of data
-  * Calculate/List/Compare operations with multiple results
-SAFETY & ROBUSTNESS RULES:
-- Always check if data exists before processing: if df.empty: answer = "No data available"
-- Handle missing values: use .dropna() or .fillna() appropriately
-- Use try-except blocks for risky operations like indexing
-- Validate city/location names exist in data before filtering
-- Check for empty results after filtering: if filtered_df.empty: answer = "No data found for specified criteria"
-- Use .round(2) for numerical results to avoid long decimals
-- Handle division by zero: check denominators before division
-- Validate date ranges exist in data
-- Use proper string formatting for answers with units (μg/m³)
+OUTPUT TYPE REQUIREMENTS:
+1. PLOT GENERATION (for "plot", "chart", "visualize", "show trend", "graph"):
+   - MUST create matplotlib figure with proper labels, title, legend
+   - MUST save plot: filename = f"plot_{uuid.uuid4().hex[:8]}.png"
+   - MUST call plt.savefig(filename, dpi=300, bbox_inches='tight')
+   - MUST call plt.close() to prevent memory leaks
+   - MUST store filename in 'answer' variable: answer = filename
+   - Handle empty data gracefully before plotting
+2. TEXT ANSWERS (for simple "Which", "What", single values):
+   - Store direct string answer in 'answer' variable
+   - Example: answer = "December had the highest pollution"
+3. DATAFRAMES (for lists, rankings, comparisons, multiple results):
+   - Create clean DataFrame with descriptive column names
+   - Sort appropriately for readability
+   - Store DataFrame in 'answer' variable: answer = result_df
+MANDATORY SAFETY & ROBUSTNESS RULES:
+DATA VALIDATION (ALWAYS CHECK):
+- Check if DataFrame exists and not empty: if df.empty: answer = "No data available"; return
+- Validate required columns exist: if 'PM2.5' not in df.columns: answer = "Required data not available"; return
+- Check for sufficient data: if len(df) < 10: answer = "Insufficient data for analysis"; return
+- Remove invalid/missing values: df = df.dropna(subset=['PM2.5', 'city', 'Timestamp'])
+- Validate date ranges: ensure timestamps are within expected range
+OPERATION SAFETY (PREVENT CRASHES):
+- Wrap risky operations in try-except blocks
+- Check denominators before division: if denominator == 0: continue
+- Validate indexing bounds: if idx >= len(array): continue
+- Check for empty results after filtering: if result_df.empty: answer = "No data found"; return
+- Convert data types explicitly: pd.to_numeric(), .astype(int), .astype(str)
+- Handle timezone issues with datetime operations
+PLOT GENERATION (MANDATORY FOR PLOTS):
+- Check data exists before plotting: if plot_data.empty: answer = "No data to plot"; return
+- Always create new figure: plt.figure(figsize=(12, 8))
+- Add comprehensive labels: plt.title(), plt.xlabel(), plt.ylabel()
+- Handle long city names: plt.xticks(rotation=45, ha='right')
+- Use tight layout: plt.tight_layout()
+- CRITICAL PLOT SAVING SEQUENCE:
+  1. filename = f"plot_{uuid.uuid4().hex[:8]}.png"
+  2. plt.savefig(filename, dpi=300, bbox_inches='tight')
+  3. plt.close()
+  4. answer = filename
+- Debug plot issues: print(f"Plot saved: {filename}") for testing
 CRITICAL CODING PRACTICES:
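
The DATA VALIDATION rules added in this hunk use early exits of the form answer = "..."; return. A bare return is only valid inside a function, and this diff does not show how the app executes the generated code, so the sketch below wraps the checks in a helper. The names validate, REQUIRED_COLUMNS, and MIN_ROWS are hypothetical and do not appear in src.py.

```python
import pandas as pd

# Columns named in the prompt's dropna(subset=...) rule.
REQUIRED_COLUMNS = ["PM2.5", "city", "Timestamp"]
# Threshold taken from the prompt's "len(df) < 10" rule.
MIN_ROWS = 10


def validate(df):
    """Return (clean_df, None) on success or (None, message) on failure.

    Hypothetical helper: it mirrors the prompt's validation order, except
    that the row-count check runs after missing values are dropped.
    """
    if df is None or df.empty:
        return None, "No data available"
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        return None, "Required data not available"
    clean = df.dropna(subset=REQUIRED_COLUMNS)
    if len(clean) < MIN_ROWS:
        return None, "Insufficient data for analysis"
    return clean, None


# Usage with made-up readings: 12 rows, 3 of them missing PM2.5.
sample = pd.DataFrame({
    "PM2.5": [50.0] * 9 + [None] * 3,
    "city": ["Delhi"] * 12,
    "Timestamp": pd.date_range("2024-01-01", periods=12, freq="D"),
})
clean_df, error = validate(sample)
answer = error if error is not None else clean_df
```

Counting rows after dropna is a deliberate deviation from the order in the prompt: a frame can pass len(df) < 10 and still have too few usable rows once missing values are removed.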