Commit 2712881 · Parent(s): 993cfb9

Nebius Integration
README.md
CHANGED

@@ -17,7 +17,7 @@ tags:
 - gradio
 - mcp-server-track
 - agent-demo-track
-short_description: Automated ML model comparison with LazyPredict and MCP
+short_description: Automated ML model comparison with LazyPredict and MCP integration
 ---
 
 # AutoML - MCP Hackathon Submission
@@ -27,48 +27,10 @@ short_description: Automated ML model comparison with LazyPredict and MCP integ
 ## Hackathon Track
 **Agents & MCP Hackathon - Track 1: MCP Tool / Server**
 
-## Key Features
-
-### Core ML Capabilities
-- **Dual Data Input**: Support for both local CSV file uploads and public URL data sources
-- **Auto Problem Detection**: Automatically determines regression vs classification tasks
-- **Multi-Algorithm Comparison**: LazyPredict-powered comparison of 20+ ML algorithms
-- **Automated EDA**: Comprehensive dataset profiling with ydata-profiling
-- **Best Model Export**: Download top-performing model as pickle file
-- **Performance Visualization**: Interactive charts showing model comparison results
-
-### Advanced Features
-- **URL Data Loading**: Direct data loading from public CSV URLs with robust error handling
-- **Agent-Friendly Interface**: Designed for both human users and AI agent interactions
-- **Interactive Dashboards**: Real-time model performance metrics and visualizations
-- **Smart Error Handling**: Comprehensive validation and user feedback system
-- **MCP Server Integration**: Full Model Context Protocol server implementation
-
 ## How It Works
 
 The AutoML provides a streamlined pipeline for automated machine learning:
 
-### Core Functions
-
-1. **`load_data(file_input)`** - Universal data loader that handles:
-   - Local CSV file uploads through Gradio's file component
-   - Public CSV URLs with HTTP/HTTPS support
-   - Robust error handling and validation
-   - Automatic format detection and parsing
-
-2. **`analyze_and_model(df, target_column)`** - Core ML pipeline that:
-   - Generates comprehensive EDA reports using ydata-profiling
-   - Automatically detects task type (classification vs regression) based on target variable uniqueness
-   - Trains and evaluates multiple models using LazyPredict
-   - Selects the best performing model based on appropriate metrics
-   - Creates publication-ready visualizations comparing model performance
-   - Exports the best model as a serialized pickle file
-
-3. **`run_pipeline(data_source, target_column)`** - Main orchestration function:
-   - Validates all inputs and provides clear error messages
-   - Coordinates the entire ML workflow from data loading to model export
-   - Generates AI-powered explanations of results
-   - Returns all outputs in a format optimized for both UI and API consumption
 
 ### Agent-Friendly Design
 - **Single Entry Point**: The `run_pipeline()` function serves as the primary interface for AI agents
@@ -78,34 +40,7 @@ The AutoML provides a streamlined pipeline for automated machine learning:
 
 ## Quick Start
 
-### Running the Application
-
-The project includes two main application files:
-
-#### Primary Application: `app.py` (Recommended)
-```bash
-# Install dependencies
-pip install -r requirements.txt
-
-# Run the main application
-python app.py
-```
-
-
-### Web Interface
-1. **Choose Data Source**:
-   - **Local Upload**: Use the file upload component to select a CSV file from your computer
-   - **URL Input**: Enter a public CSV URL (e.g., from GitHub, data repositories, or cloud storage)
-2. **Specify Target**: Enter the exact name of your target column (case-sensitive)
-3. **Run Analysis**: Click "Run Analysis & AutoML" to start the AutoML pipeline
-4. **Review Results**:
-   - View detected task type (classification/regression)
-   - Examine model performance metrics in the interactive table
-   - Download comprehensive EDA report (HTML format)
-   - Download the best performing model (pickle format)
-   - View model comparison visualization
-
-### Installation & Setup
+### Installation & Running the Application
 ```bash
 # Clone the repository
 git clone [repository-url]
@@ -113,67 +48,20 @@ cd AutoML
 
 # Install dependencies
 pip install -r requirements.txt
-```
-
-### Server Configuration
-The application launches with the following settings:
-- **Host**: `0.0.0.0` (accessible from any network interface)
-- **Port**: `7860` (default Gradio port)
-- **MCP Server**: Enabled for AI agent integration
-- **API Documentation**: Available at `/docs` endpoint
-- **Browser Launch**: Automatic browser opening enabled
-
-## Current Implementation
-
-### 1. LazyPredict Integration
-- **Automated Model Training**: Trains 20+ algorithms automatically
-- **Performance Comparison**: Side-by-side evaluation of all models
-- **Best Model Selection**: Automatically selects top performer based on accuracy/R² score
-
-### 2. Comprehensive EDA
-- **ydata-profiling**: Generates detailed dataset analysis reports
-- **Automatic Insights**: Data quality, distributions, correlations, and missing values
-- **Interactive Reports**: Downloadable HTML reports with comprehensive statistics
+
+# Run the main application
+python app.py
+```
 
-#
-
-
-- **Adaptive Metrics**: Uses appropriate evaluation metrics for each task type
-
-### 4. Model Persistence
-- **Pickle Export**: Save trained models for future use
-- **Model Reuse**: Load and apply models to new datasets
-- **Production Ready**: Serialized models ready for deployment
 
 
 ## Demo Scenarios
 
-
 ### College Placement Analysis
 - Upload `collegePlace.csv` included in the project with url: (https://raw.githubusercontent.com/daniel-was-taken/Placement-Prediction/refs/heads/master/collegePlace.csv)
 - Analyze student placement outcomes
 - Automatic feature analysis and model comparison
 - Export trained model for future predictions
 
-### URL-Based Data Analysis
-- Use public dataset URLs for instant analysis
-- Example: Government open data, research datasets, cloud-hosted files
-- No file size limitations with URL-based loading
-
-
-## Technologies Used
-
-- **Frontend**: Gradio 4.0+ with soft theme and MCP server integration
-- **AutoML Engine**: LazyPredict for automated model comparison and evaluation
-- **EDA Framework**: ydata-profiling for comprehensive dataset analysis and reporting
-- **ML Libraries**: scikit-learn, XGBoost, LightGBM (via LazyPredict ecosystem)
-- **Visualization**: Matplotlib and Seaborn for model comparison charts and statistical plots
-- **Data Processing**: pandas and numpy for efficient data manipulation and preprocessing
-- **Model Persistence**: pickle for secure model serialization and export
-- **Web Requests**: requests library for robust URL-based data loading
-- **MCP Integration**: Model Context Protocol server for AI agent compatibility
-- **File Handling**: tempfile for secure temporary file management
-
 ## Current Features
 
 - **Dual Input Support**: Upload local CSV files or provide public URLs for data loading
@@ -187,17 +75,5 @@ The application launches with the following settings:
 - **Robust Error Handling**: Comprehensive validation with informative user feedback
 - **Modern UI**: Clean, responsive interface optimized for both human and agent interactions
 
-## Hackathon Submission Highlights
-
-1. **LazyPredict Integration**: Automated comparison of 20+ ML algorithms with minimal configuration
-2. **Smart Automation**: Intelligent task detection, data validation, and model selection
-3. **Comprehensive Analysis**: ydata-profiling powered EDA reports with statistical insights
-4. **Dual Interface Design**: Optimized for both human users and AI agent interactions
-5. **MCP Server Implementation**: Full Model Context Protocol integration for seamless agent workflows
-6. **Flexible Data Loading**: Support for both local uploads and URL-based data sources
-7. **Production Ready**: Exportable models, comprehensive documentation, and robust error handling
-8. **Modern UI/UX**: Clean Gradio interface with intuitive workflow and clear feedback systems
-
-
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
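The removed "Core Functions" text above says the task type is detected from target variable uniqueness (classification vs regression). A minimal sketch of that heuristic — the `max_classes` threshold and the dtype check are illustrative assumptions, since app.py's actual rule is not part of this diff:

```python
import pandas as pd

def detect_task(y: pd.Series, max_classes: int = 10) -> str:
    """Classification if the target looks categorical, else regression.

    The max_classes threshold is an assumed cut-off for illustration only.
    """
    if y.dtype == object or y.nunique() <= max_classes:
        return "classification"
    return "regression"

print(detect_task(pd.Series(["yes", "no", "yes"])))  # classification
print(detect_task(pd.Series(range(100)) / 7))        # regression
```

Any real implementation would also need to handle edge cases such as integer-coded labels with many classes, which this sketch deliberately ignores.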
app.py
CHANGED

@@ -9,33 +9,53 @@ from sklearn.model_selection import train_test_split
 from ydata_profiling import ProfileReport
 import tempfile
 import requests
+import json
+from openai import OpenAI  # Added for Nebius AI Studio LLM integration
 
 def load_data(file_input):
     """Loads CSV data from either a local file upload or a public URL."""
     if file_input is None:
-        return None
+        return None, None
+
     try:
-        # For local file uploads, file_input is a temporary file object
         if hasattr(file_input, 'name'):
             file_path = file_input.name
             with open(file_path, 'rb') as f:
                 file_bytes = f.read()
             df = pd.read_csv(io.BytesIO(file_bytes))
-        # For URL text input
         elif isinstance(file_input, str) and file_input.startswith('http'):
             response = requests.get(file_input)
             response.raise_for_status()
             df = pd.read_csv(io.StringIO(response.text))
         else:
-            return None
-
+            return None, None
+
+        # Extract column names here
+        column_names = ", ".join(df.columns.tolist())
+        return df, column_names
     except Exception as e:
         gr.Warning(f"Failed to load or parse data: {e}")
-        return None
+        return None, None
+
+
+def update_detected_columns_display(file_data, url_data):
+    """
+    Detects and displays column names from the uploaded file or URL
+    as soon as the input changes, before the main analysis button is pressed.
+    """
+    data_source = file_data if file_data is not None else url_data
+    if data_source is None:
+        return ""
+
+    df, column_names = load_data(data_source)
+    if column_names:
+        return column_names
+    else:
+        return "No columns detected or error loading file. Please check the file format."
+
 
 def analyze_and_model(df, target_column):
     """Internal function to perform EDA, model training, and visualization."""
-    # ... (This function's content is unchanged)
     profile = ProfileReport(df, title="EDA Report", minimal=True)
     with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as temp_html:
         profile.to_file(temp_html.name)
@@ -50,7 +70,7 @@ def analyze_and_model(df, target_column):
     models, _ = model.fit(X_train, X_test, y_train, y_test)
 
     sort_metric = "Accuracy" if task == "classification" else "R-Squared"
-    best_model_name = models.sort_values(by=sort_metric, ascending=False).index[0]
+    best_model_name = models.sort_values(by=sort_metric, ascending=False).index[0]  # Corrected indexing
     best_model = model.models[best_model_name]
 
     with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as temp_pkl:
@@ -68,60 +88,96 @@ def analyze_and_model(df, target_column):
     plt.close()
 
     models_reset = models.reset_index().rename(columns={'index': 'Model'})
-    return profile_path, task, models_reset, plot_path, pickle_path
+    return profile, profile_path, task, models_reset, plot_path, pickle_path
 
-def run_pipeline(data_source, target_column):
+def run_pipeline(data_source, target_column, nebius_api_key):
     """
     This single function drives the entire application.
     It's exposed as the primary tool for the MCP server.
 
     :param data_source: A local file path (from gr.File) or a URL (from gr.Textbox).
     :param target_column: The name of the target column for prediction.
+    :param nebius_api_key: The API key for Nebius AI Studio.
     """
     # --- 1. Input Validation ---
     if not data_source or not target_column:
         error_msg = "Error: Data source and target column must be provided."
         gr.Warning(error_msg)
-        return None, error_msg, None, None, None, "Please provide all inputs."
+        return None, error_msg, None, None, None, "Please provide all inputs.", "No columns loaded."
 
     gr.Info("Starting analysis...")
 
     # --- 2. Data Loading ---
-    df = load_data(data_source)
+    df, column_names = load_data(data_source)
     if df is None:
-        return None, "Error: Could not load data.", None, None, None, None
+        return None, "Error: Could not load data.", None, None, None, None, "No columns loaded."
 
     if target_column not in df.columns:
-        error_msg = f"Error: Target column '{target_column}' not found in the dataset. Available: {
+        error_msg = f"Error: Target column '{target_column}' not found in the dataset. Available: {column_names}"
         gr.Warning(error_msg)
-        return None, error_msg, None, None, None, None
+        return None, error_msg, None, None, None, None, column_names
 
     # --- 3. Analysis and Modeling ---
-    profile_path, task, models_df, plot_path, pickle_path = analyze_and_model(df, target_column)
-
-    # --- 4. Explanation ---
-    best_model_name = models_df.iloc[0]['Model']
-
+    profile, profile_path, task, models_df, plot_path, pickle_path = analyze_and_model(df, target_column)
+
+    # --- 4. Explanation with Nebius AI Studio LLM ---
+    best_model_name = models_df.iloc[0]['Model']  # Corrected indexing
+
+    llm_explanation = "AI explanation is unavailable. Please provide a Nebius AI Studio API key to enable this feature."  # Generic fallback
+
+    if nebius_api_key:
+        try:
+            client = OpenAI(
+                base_url="https://api.studio.nebius.com/v1/",
+                api_key=nebius_api_key
+            )
+
+            # Craft a prompt for the LLM
+            prompt_text = f"Explain and Summarize the significance of the top performing model, '{best_model_name}', for a {task} task in a data analysis context. Keep the explanation concise and professional. Analyse the report: {profile}."
+
+            # Make the LLM call
+            response = client.chat.completions.create(
+                model="meta-llama/Llama-3.3-70B-Instruct",
+                messages=[
+                    {"role": "system", "content": "You are a helpful AI assistant that explains data science concepts."},
+                    {"role": "user", "content": prompt_text}
+                ],
+                temperature=0.6,
+                max_tokens=512,
+                top_p=0.9,
+                extra_body={
+                    "top_k": 50
+                }
+            )
+            message_content = response.to_json()
+            data = json.loads(message_content)
+            llm_explanation = data['choices'][0]['message']['content']
+
+        except Exception as e:
+            gr.Warning(f"Failed to get AI explanation: {e}. Please check your API key or try again later.")
+            llm_explanation = "An error occurred while fetching AI explanation. Please check your API key or try again later."
+
     gr.Info("Analysis complete!")
+    gr.Info(f'Profile report saved to: {profile_path}')
+    return profile_path, task, models_df, plot_path, pickle_path, llm_explanation, column_names
 
 # --- Gradio UI ---
 with gr.Blocks(title="AutoML Trainer", theme=gr.themes.Soft()) as demo:
     gr.Markdown("## AutoML Trainer")
-    gr.Markdown("Enter a CSV data source (local file or public URL) and a target column to run the analysis. This interface is now friendly for both humans and AI agents.")
 
     with gr.Row():
         with gr.Column(scale=1):
-            # Using gr.File allows for both upload and is compatible with agents
             file_input = gr.File(label="Upload Local CSV File")
             url_input = gr.Textbox(label="Or Enter Public CSV URL", placeholder="e.g., https://.../data.csv")
             target_column_input = gr.Textbox(label="Enter Target Column Name", placeholder="e.g., approved")
+            nebius_api_key_input = gr.Textbox(label="Nebius AI Studio API Key (Optional)", type="password", placeholder="Enter your API key for AI explanations")
            run_button = gr.Button("Run Analysis & AutoML", variant="primary")
 
        with gr.Column(scale=2):
+           column_names_output = gr.Textbox(label="Detected Columns", interactive=False, lines=2)  # New Textbox for column names
           task_output = gr.Textbox(label="Detected Task", interactive=False)
-          llm_output = gr.
+          llm_output = gr.Markdown(label="AI Explanation")
           metrics_output = gr.Dataframe(label="Model Performance Metrics")
 
    with gr.Row():
@@ -130,24 +186,33 @@ with gr.Blocks(title="AutoML Trainer", theme=gr.themes.Soft()) as demo:
        eda_output = gr.File(label="Download Full EDA Report")
        model_output = gr.File(label="Download Best Model (.pkl)")
 
-    # A helper function decides whether to use the file or URL input
-    def process_inputs(file_data, url_data, target):
+    def process_inputs(file_data, url_data, target, api_key):
         data_source = file_data if file_data is not None else url_data
-        return run_pipeline(data_source, target)
+        return run_pipeline(data_source, target, api_key)
+
+    file_input.change(
+        fn=update_detected_columns_display,
+        inputs=[file_input, url_input],
+        outputs=column_names_output
+    )
+    url_input.change(
+        fn=update_detected_columns_display,
+        inputs=[file_input, url_input],
+        outputs=column_names_output
+    )
 
     run_button.click(
         fn=process_inputs,
-        inputs=[file_input, url_input, target_column_input],
-        outputs=[eda_output, task_output, metrics_output, vis_output, model_output, llm_output]
+        inputs=[file_input, url_input, target_column_input, nebius_api_key_input],
+        outputs=[eda_output, task_output, metrics_output, vis_output, model_output, llm_output, column_names_output]
     )
 
 demo.launch(
     server_name="0.0.0.0",
     server_port=7860,
-    share=
+    share=False,
     show_api=True,
     inbrowser=True,
     mcp_server=True
-
 )
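The Nebius explanation step in this commit recovers the message text by serializing the response with `to_json()` and re-parsing it with `json.loads`. The traversal it performs looks like this on a hand-built payload (values are illustrative; with the OpenAI client the same text is also available directly as `response.choices[0].message.content`):

```python
import json

# Illustrative stand-in for an OpenAI-compatible chat completion payload;
# a real response carries the same "choices" -> "message" -> "content" shape.
raw = json.dumps({
    "choices": [
        {"message": {"role": "assistant", "content": "ExtraTreesClassifier led on accuracy."}}
    ]
})

data = json.loads(raw)
llm_explanation = data['choices'][0]['message']['content']
print(llm_explanation)
```

Using the typed attribute access avoids the JSON round trip entirely, at the cost of coupling the code to the client's response objects.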
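The pipeline exports the best model as a plain pickle file, so reusing it later is a load-and-predict round trip. A sketch with a stand-in object (the real file is the `.pkl` downloaded from the UI; a fitted scikit-learn estimator pickles and unpickles the same way, and pickles should only ever be loaded from trusted sources):

```python
import pickle
import tempfile

# Stand-in for the exported estimator; any fitted scikit-learn model would
# take its place in real use.
best_model = {"name": "RandomForestClassifier", "task": "classification"}

# Write the model out the same way analyze_and_model does.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as f:
    pickle.dump(best_model, f)
    path = f.name

# Later: load the serialized model back for reuse.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["name"])
```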