Update app.py

app.py
CHANGED
@@ -100,12 +100,12 @@ def safe_log(event_name: str, meta: dict | None = None):
 
 def _create_python_script(user_scenario: str, schema_context: str) -> str:
     """
-    IMPROVED: Generates a Python script using a "Plan
-    The AI first
-    This ensures the
+    IMPROVED: Generates a Python script using a universal "Map, Plan, Execute" approach.
+    The AI first maps user concepts to data columns, then plans and executes the analysis.
+    This ensures the logic is robust, dynamic, and not hardcoded to a specific dataset.
     """
     prompt_for_coder = f"""\
-You are an expert-level Python data scientist
+You are an expert-level, universal Python data scientist. Your task is to dynamically analyze any provided dataset(s) to answer a user's business request.
 
 --- USER'S SCENARIO ---
 {user_scenario}
@@ -115,32 +115,40 @@ You are an expert-level Python data scientist acting as a consultant. Your task
 {schema_context}
 --- END DATA SCHEMA ---
 
-You must follow a rigorous
+You must follow a rigorous three-step "Map, Plan, Execute" process:
 
-**Step 1:
-First,
-The plan must identify the key metrics, necessary data manipulations (cleaning, grouping, aggregation), and the final outputs required.
-- **CRITICAL for aggregation:** If the user asks for analysis by category (e.g., "specialty," "department"), you MUST identify the correct high-level categorical column for grouping. DO NOT aggregate by granular, free-text procedure descriptions unless explicitly asked. Your goal is to find meaningful, strategic trends.
+**Step 1: Map Concepts to Data.**
+First, analyze the user's scenario and the provided data schemas. Identify the key business concepts (e.g., "hospitals", "sales", "regions") and metrics (e.g., "wait times", "revenue", "population"). Then, create a logical mapping from these concepts to the actual column names in the provided DataFrames. State these mappings clearly. This is the most critical step to ensure your analysis is relevant.
 
-**Step 2:
-Based on your
+**Step 2: Create a Detailed Analysis Plan.**
+Based on your mapping, formulate a step-by-step plan. Describe the data cleaning, merging, grouping, and aggregation steps needed to answer the user's request using the columns you identified.
+
+**Step 3: Write the Python Script.**
+Based on your plan, write a complete Python script that performs the analysis.
 
 CRITICAL SCRIPTING RULES:
-1. **
-2. **
-3. **
-4. **JSON
+1. **DYNAMIC DATAFRAME IDENTIFICATION:** The order of DataFrames in the `dfs` list is NOT guaranteed. Your script MUST identify the correct DataFrame to use for each part of the analysis by checking for the presence of the columns you mapped in Step 1. Do NOT use hardcoded indices like `dfs[0]`.
+2. **VERIFY COLUMN EXISTENCE:** Only use columns that you have explicitly identified and mapped in your plan. This will prevent `KeyError`.
+3. **NO FILE READING:** The data is already in the `dfs` list.
+4. **STRICTLY JSON OUTPUT:** The script's ONLY output must be a single JSON object.
+5. **ROBUST & GENERIC:** Write robust code that can handle potential missing data (`errors='coerce'`, checking for `None`) and is not hardcoded to specific values from this single request.
 
 Now, provide your response in the following format:
 
 **ANALYSIS PLAN:**
 ```text
-1.
-
-
-
-
-
+**1. Concept-to-Column Mapping:**
+- Concept: [e.g., 'Hospitals' or 'Facilities'] -> Mapped Column: [e.g., `Facility` from the wait times dataframe]
+- Concept: [e.g., 'Surgical Wait Time' Metric] -> Mapped Column: [e.g., `Surgery_Median` from the wait times dataframe]
+- Concept: [e.g., 'Geographic Locations'] -> Mapped Columns: [e.g., `latitude`, `longitude` from the facilities dataframe]
+
+**2. Step-by-Step Analysis:**
+1. **Data Identification:** Identify the necessary dataframes by checking for the mapped columns (e.g., find the DF with 'Surgery_Median', find the DF with 'facility_name').
+2. **Data Cleaning:** [Describe steps, e.g., "Convert metric columns to numeric using `pd.to_numeric`..."]
+3. **Analysis Step A:** [e.g., "Group the primary dataframe by the 'Facility' column and calculate the mean of the 'Surgery_Median' column..."]
+4. ...
+
+the final JSON object]
 
 # Your complete Python script starts here
 import pandas as pd
@@ -150,8 +158,8 @@ import re
 # Main analysis logic...
 # ...
 # Final print statement
-print(json.dumps(final_data_structure, indent=4))
-
+print(json.dumps(final_data_structure, indent=4))
+"""
 generated_text = cohere_chat(prompt_for_coder)
 # This regex is more robust for extracting the final code block
 match = re2.search(r"PYTHON SCRIPT:\s*```python\n(.*?)```", generated_text, re2.DOTALL)
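As a sanity check on the new scripting rules, here is a minimal sketch of the pattern the updated prompt asks the model to produce: select a DataFrame from `dfs` by checking its columns rather than by index, coerce messy values with `errors='coerce'`, and emit a single JSON object. This is illustrative only and not part of app.py; the `dfs` contents, `pick_df` helper, and column names (`Facility`, `Surgery_Median`, `facility_name`) are hypothetical.

```python
import json

import pandas as pd


def pick_df(dfs, required_cols):
    """Return the first DataFrame containing all required columns, else None."""
    for df in dfs:
        if all(col in df.columns for col in required_cols):
            return df
    return None


# Hypothetical inputs: the order of DataFrames in `dfs` is not guaranteed.
dfs = [
    pd.DataFrame({"facility_name": ["A", "B"], "latitude": [49.2, 50.1]}),
    pd.DataFrame({"Facility": ["A", "B"], "Surgery_Median": ["12", "not available"]}),
]

# Dynamic identification: find the wait-times DataFrame by its mapped columns,
# never via a hardcoded index like dfs[0].
waits = pick_df(dfs, ["Facility", "Surgery_Median"])

# Robust cleaning: coerce non-numeric entries to NaN instead of raising.
waits["Surgery_Median"] = pd.to_numeric(waits["Surgery_Median"], errors="coerce")

# Aggregate by the high-level categorical column, dropping groups with no data.
result = waits.groupby("Facility")["Surgery_Median"].mean().dropna().to_dict()

# Strictly JSON output: the script's only output is one JSON object.
print(json.dumps(result, indent=4))
```

Identifying the frame by column membership is what lets the same generated script survive callers that shuffle the `dfs` list.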