VEDAGI1 committed on
Commit 2f37ded · verified · 1 Parent(s): c8c6f45

Update app.py

Files changed (1)
  1. app.py +31 -23
app.py CHANGED
@@ -100,12 +100,12 @@ def safe_log(event_name: str, meta: dict | None = None):
 
 def _create_python_script(user_scenario: str, schema_context: str) -> str:
     """
-    IMPROVED: Generates a Python script using a "Plan-and-Execute" approach.
-    The AI first creates a step-by-step plan, then writes code to execute it.
-    This ensures the analysis is logical, correctly aggregated, and aligned with the user's goal.
     """
     prompt_for_coder = f"""\
-You are an expert-level Python data scientist acting as a consultant. Your task is to analyze data to answer a user's business request.
 
 --- USER'S SCENARIO ---
 {user_scenario}
@@ -115,32 +115,40 @@ You are an expert-level Python data scientist acting as a consultant. Your task
 {schema_context}
 --- END DATA SCHEMA ---
 
-You must follow a rigorous two-step process:
 
-**Step 1: Create a Detailed Analysis Plan.**
-First, think step-by-step. Deconstruct the user's request into a clear, logical plan.
-The plan must identify the key metrics, necessary data manipulations (cleaning, grouping, aggregation), and the final outputs required.
-- **CRITICAL for aggregation:** If the user asks for analysis by category (e.g., "specialty," "department"), you MUST identify the correct high-level categorical column for grouping. DO NOT aggregate by granular, free-text procedure descriptions unless explicitly asked. Your goal is to find meaningful, strategic trends.
 
-**Step 2: Write the Python Script.**
-Based on your plan, write a complete Python script.
 
 CRITICAL SCRIPTING RULES:
-1. **NO FILE READING:** The data is already loaded into a list of pandas DataFrames called `dfs`. You MUST use this variable. Do not include `pd.read_csv`.
-2. **STRICTLY JSON OUTPUT:** The script's ONLY output to stdout MUST be a single, well-structured JSON object containing all the raw data findings from your plan.
-3. **ROBUST DATA CLEANING:** Before performing calculations, clean data robustly. Convert numeric columns to numbers using `pd.to_numeric(..., errors='coerce')`. Handle missing values (`NaN`) appropriately (e.g., by excluding them from averages).
-4. **JSON SERIALIZATION:** Ensure all data in the final dictionary is JSON-serializable. Use `.item()` for single numpy values and `.tolist()` for arrays/series.
 
 Now, provide your response in the following format:
 
 **ANALYSIS PLAN:**
 ```text
-1. **Objective:** [Briefly state the main goal]
-2. **Data Cleaning:** [Describe steps to clean and prepare the data]
-3. **Analysis Step A:** [e.g., "Calculate average wait times per hospital by grouping `dfs[0]` by 'Facility' and averaging 'Surgery_Median'."]
-4. **Analysis Step B:** [e.g., "Identify top 5 specialties by grouping `dfs[0]` by the 'Specialty' column and calculating the mean of 'Surgery_Median'."]
-5. **Analysis Step C:** [e.g., "Determine zone-level performance by grouping by 'Zone' and comparing to the overall provincial average."]
-6. **JSON Output Structure:** [Describe the keys and values of the final JSON object]
 
 # Your complete Python script starts here
 import pandas as pd
@@ -150,8 +158,8 @@ import re
 # Main analysis logic...
 # ...
 # Final print statement
-print(json.dumps(final_data_structure, indent=4))"""
-
     generated_text = cohere_chat(prompt_for_coder)
     # This regex is more robust for extracting the final code block
     match = re2.search(r"PYTHON SCRIPT:\s*```python\n(.*?)```", generated_text, re2.DOTALL)
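The extraction step above can be exercised in isolation. Below is a minimal sketch using the standard `re` module as a stand-in for `re2` (the pyre2 binding mirrors `re`'s API); the model response text is invented for illustration:

```python
import re

FENCE = "`" * 3  # triple backtick, built programmatically to keep the example self-contained

# Invented stand-in for a model response in the format the prompt requests.
generated_text = (
    "**ANALYSIS PLAN:**\n...plan text...\n\n"
    f"PYTHON SCRIPT:\n{FENCE}python\nprint('hello')\n{FENCE}\n"
)

# Same pattern shape the app uses; DOTALL lets (.*?) span newlines in the code block.
pattern = rf"PYTHON SCRIPT:\s*{FENCE}python\n(.*?){FENCE}"
match = re.search(pattern, generated_text, re.DOTALL)
script = match.group(1) if match else ""
```

Note the non-greedy `(.*?)` stops at the first closing fence, so trailing commentary after the code block is not swallowed.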
 
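The cleaning and serialization rules in the prompt above (coerce numerics, keep stdout strictly JSON, convert numpy values) can be sketched as follows; the `dfs` contents, column names, and output keys are invented for illustration:

```python
import json
import pandas as pd

# Invented stand-in for the pre-loaded list of DataFrames the prompt describes.
dfs = [pd.DataFrame({
    "Facility": ["A", "A", "B"],
    "Surgery_Median": ["10", "12", "n/a"],  # dirty numeric column
})]

df = dfs[0].copy()
# Rule 3: coerce to numeric; "n/a" becomes NaN and drops out of the averages.
df["Surgery_Median"] = pd.to_numeric(df["Surgery_Median"], errors="coerce")
per_facility = df.groupby("Facility")["Surgery_Median"].mean().dropna()

# Rule 4: everything in the final dict must be JSON-serializable.
final_data_structure = {
    "avg_wait_by_facility": per_facility.to_dict(),
    "overall_avg": per_facility.mean().item(),  # numpy scalar -> Python float
}
# Rule 2: the only stdout output is a single JSON object.
print(json.dumps(final_data_structure, indent=4))
```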
@@ -100,12 +100,12 @@ def safe_log(event_name: str, meta: dict | None = None):
 
 def _create_python_script(user_scenario: str, schema_context: str) -> str:
     """
+    IMPROVED: Generates a Python script using a universal "Map, Plan, Execute" approach.
+    The AI first maps user concepts to data columns, then plans and executes the analysis.
+    This ensures the logic is robust, dynamic, and not hardcoded to a specific dataset.
     """
     prompt_for_coder = f"""\
+You are an expert-level, universal Python data scientist. Your task is to dynamically analyze any provided dataset(s) to answer a user's business request.
 
 --- USER'S SCENARIO ---
 {user_scenario}
@@ -115,32 +115,40 @@ You are an expert-level Python data scientist acting as a consultant. Your task
 {schema_context}
 --- END DATA SCHEMA ---
 
+You must follow a rigorous three-step "Map, Plan, Execute" process:
 
+**Step 1: Map Concepts to Data.**
+First, analyze the user's scenario and the provided data schemas. Identify the key business concepts (e.g., "hospitals", "sales", "regions") and metrics (e.g., "wait times", "revenue", "population"). Then, create a logical mapping from these concepts to the actual column names in the provided DataFrames. State these mappings clearly. This is the most critical step to ensure your analysis is relevant.
 
+**Step 2: Create a Detailed Analysis Plan.**
+Based on your mapping, formulate a step-by-step plan. Describe the data cleaning, merging, grouping, and aggregation steps needed to answer the user's request using the columns you identified.
+
+**Step 3: Write the Python Script.**
+Based on your plan, write a complete Python script that performs the analysis.
 
 CRITICAL SCRIPTING RULES:
+1. **DYNAMIC DATAFRAME IDENTIFICATION:** The order of DataFrames in the `dfs` list is NOT guaranteed. Your script MUST identify the correct DataFrame to use for each part of the analysis by checking for the presence of the columns you mapped in Step 1. Do NOT use hardcoded indices like `dfs[0]`.
+2. **VERIFY COLUMN EXISTENCE:** Only use columns that you have explicitly identified and mapped in your plan. This will prevent `KeyError`.
+3. **NO FILE READING:** The data is already in the `dfs` list.
+4. **STRICTLY JSON OUTPUT:** The script's ONLY output must be a single JSON object.
+5. **ROBUST & GENERIC:** Write robust code that can handle potential missing data (`errors='coerce'`, checking for `None`) and is not hardcoded to specific values from this single request.
 
 Now, provide your response in the following format:
 
 **ANALYSIS PLAN:**
 ```text
+**1. Concept-to-Column Mapping:**
+- Concept: [e.g., 'Hospitals' or 'Facilities'] -> Mapped Column: [e.g., `Facility` from the wait times dataframe]
+- Concept: [e.g., 'Surgical Wait Time' Metric] -> Mapped Column: [e.g., `Surgery_Median` from the wait times dataframe]
+- Concept: [e.g., 'Geographic Locations'] -> Mapped Columns: [e.g., `latitude`, `longitude` from the facilities dataframe]
+
+**2. Step-by-Step Analysis:**
+1. **Data Identification:** Identify the necessary dataframes by checking for the mapped columns (e.g., find the DF with 'Surgery_Median', find the DF with 'facility_name').
+2. **Data Cleaning:** [Describe steps, e.g., "Convert metric columns to numeric using `pd.to_numeric`..."]
+3. **Analysis Step A:** [e.g., "Group the primary dataframe by the 'Facility' column and calculate the mean of the 'Surgery_Median' column..."]
+4. ...
+
+**3. JSON Output Structure:** [Describe the keys and values of the final JSON object]
 
 # Your complete Python script starts here
 import pandas as pd
@@ -150,8 +158,8 @@ import re
 # Main analysis logic...
 # ...
 # Final print statement
+print(json.dumps(final_data_structure, indent=4))
+    """
     generated_text = cohere_chat(prompt_for_coder)
     # This regex is more robust for extracting the final code block
     match = re2.search(r"PYTHON SCRIPT:\s*```python\n(.*?)```", generated_text, re2.DOTALL)
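The new rule 1 (dynamic DataFrame identification) can be sketched with a small helper; the frames, their order, and the column names below are invented assumptions, not part of the app:

```python
import pandas as pd

def find_df(dfs, required_columns):
    """Return the first DataFrame that contains every required column, else None."""
    for df in dfs:
        if set(required_columns).issubset(df.columns):
            return df
    return None

# Deliberately arbitrary order: per the rule, a script must not rely on dfs[0].
dfs = [
    pd.DataFrame({"facility_name": ["General"], "latitude": [53.5], "longitude": [-113.5]}),
    pd.DataFrame({"Facility": ["General", "Royal"], "Surgery_Median": [11.0, 14.0]}),
]

# Select frames by the columns mapped in Step 1, not by position.
wait_times_df = find_df(dfs, ["Facility", "Surgery_Median"])
locations_df = find_df(dfs, ["facility_name", "latitude", "longitude"])
```

Checking membership against `df.columns` also satisfies rule 2: any unmapped column simply fails the lookup instead of raising a `KeyError` mid-analysis.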