Commit 2712881 · Parent(s): 993cfb9

Nebius Integration
README.md
CHANGED

@@ -17,7 +17,7 @@ tags:
 - gradio
 - mcp-server-track
 - agent-demo-track
-short_description: Automated ML model comparison with LazyPredict and MCP
+short_description: Automated ML model comparison with LazyPredict and MCP integration
 ---
 
 # AutoML - MCP Hackathon Submission
@@ -27,48 +27,10 @@ short_description: Automated ML model comparison with LazyPredict and MCP integ
 ## Hackathon Track
 **Agents & MCP Hackathon - Track 1: MCP Tool / Server**
 
-## Key Features
-
-### Core ML Capabilities
-- **Dual Data Input**: Support for both local CSV file uploads and public URL data sources
-- **Auto Problem Detection**: Automatically determines regression vs classification tasks
-- **Multi-Algorithm Comparison**: LazyPredict-powered comparison of 20+ ML algorithms
-- **Automated EDA**: Comprehensive dataset profiling with ydata-profiling
-- **Best Model Export**: Download top-performing model as pickle file
-- **Performance Visualization**: Interactive charts showing model comparison results
-
-### Advanced Features
-- **URL Data Loading**: Direct data loading from public CSV URLs with robust error handling
-- **Agent-Friendly Interface**: Designed for both human users and AI agent interactions
-- **Interactive Dashboards**: Real-time model performance metrics and visualizations
-- **Smart Error Handling**: Comprehensive validation and user feedback system
-- **MCP Server Integration**: Full Model Context Protocol server implementation
-
 ## How It Works
 
 The AutoML provides a streamlined pipeline for automated machine learning:
 
-### Core Functions
-
-1. **`load_data(file_input)`** - Universal data loader that handles:
-   - Local CSV file uploads through Gradio's file component
-   - Public CSV URLs with HTTP/HTTPS support
-   - Robust error handling and validation
-   - Automatic format detection and parsing
-
-2. **`analyze_and_model(df, target_column)`** - Core ML pipeline that:
-   - Generates comprehensive EDA reports using ydata-profiling
-   - Automatically detects task type (classification vs regression) based on target variable uniqueness
-   - Trains and evaluates multiple models using LazyPredict
-   - Selects the best performing model based on appropriate metrics
-   - Creates publication-ready visualizations comparing model performance
-   - Exports the best model as a serialized pickle file
-
-3. **`run_pipeline(data_source, target_column)`** - Main orchestration function:
-   - Validates all inputs and provides clear error messages
-   - Coordinates the entire ML workflow from data loading to model export
-   - Generates AI-powered explanations of results
-   - Returns all outputs in a format optimized for both UI and API consumption
 
 ### Agent-Friendly Design
 - **Single Entry Point**: The `run_pipeline()` function serves as the primary interface for AI agents
@@ -78,34 +40,7 @@ The AutoML provides a streamlined pipeline for automated machine learning:
 
 ## Quick Start
 
-### Running the Application
-
-The project includes two main application files:
-
-#### Primary Application: `app.py` (Recommended)
-```bash
-# Install dependencies
-pip install -r requirements.txt
-
-# Run the main application
-python app.py
-```
-
-
-### Web Interface
-1. **Choose Data Source**:
-   - **Local Upload**: Use the file upload component to select a CSV file from your computer
-   - **URL Input**: Enter a public CSV URL (e.g., from GitHub, data repositories, or cloud storage)
-2. **Specify Target**: Enter the exact name of your target column (case-sensitive)
-3. **Run Analysis**: Click "Run Analysis & AutoML" to start the AutoML pipeline
-4. **Review Results**:
-   - View detected task type (classification/regression)
-   - Examine model performance metrics in the interactive table
-   - Download comprehensive EDA report (HTML format)
-   - Download the best performing model (pickle format)
-   - View model comparison visualization
-
-### Installation & Setup
+### Installation & Running the Application
 ```bash
 # Clone the repository
 git clone [repository-url]
@@ -113,67 +48,20 @@ cd AutoML
 
 # Install dependencies
 pip install -r requirements.txt
-```
-
-### Server Configuration
-The application launches with the following settings:
-- **Host**: `0.0.0.0` (accessible from any network interface)
-- **Port**: `7860` (default Gradio port)
-- **MCP Server**: Enabled for AI agent integration
-- **API Documentation**: Available at `/docs` endpoint
-- **Browser Launch**: Automatic browser opening enabled
-
-## Current Implementation
-
-### 1. LazyPredict Integration
-- **Automated Model Training**: Trains 20+ algorithms automatically
-- **Performance Comparison**: Side-by-side evaluation of all models
-- **Best Model Selection**: Automatically selects top performer based on accuracy/R² score
-
-### 2. Comprehensive EDA
-- **ydata-profiling**: Generates detailed dataset analysis reports
-- **Automatic Insights**: Data quality, distributions, correlations, and missing values
-- **Interactive Reports**: Downloadable HTML reports with comprehensive statistics
+
+# Run the main application
+python app.py
+```
 
-#
-
-
-- **Adaptive Metrics**: Uses appropriate evaluation metrics for each task type
-
-### 4. Model Persistence
-- **Pickle Export**: Save trained models for future use
-- **Model Reuse**: Load and apply models to new datasets
-- **Production Ready**: Serialized models ready for deployment
 
 
 ## Demo Scenarios
 
-
 ### College Placement Analysis
 - Upload `collegePlace.csv` included in the project with url: (https://raw.githubusercontent.com/daniel-was-taken/Placement-Prediction/refs/heads/master/collegePlace.csv)
 - Analyze student placement outcomes
 - Automatic feature analysis and model comparison
 - Export trained model for future predictions
 
-### URL-Based Data Analysis
-- Use public dataset URLs for instant analysis
-- Example: Government open data, research datasets, cloud-hosted files
-- No file size limitations with URL-based loading
-
-
-## Technologies Used
-
-- **Frontend**: Gradio 4.0+ with soft theme and MCP server integration
-- **AutoML Engine**: LazyPredict for automated model comparison and evaluation
-- **EDA Framework**: ydata-profiling for comprehensive dataset analysis and reporting
-- **ML Libraries**: scikit-learn, XGBoost, LightGBM (via LazyPredict ecosystem)
-- **Visualization**: Matplotlib and Seaborn for model comparison charts and statistical plots
-- **Data Processing**: pandas and numpy for efficient data manipulation and preprocessing
-- **Model Persistence**: pickle for secure model serialization and export
-- **Web Requests**: requests library for robust URL-based data loading
-- **MCP Integration**: Model Context Protocol server for AI agent compatibility
-- **File Handling**: tempfile for secure temporary file management
-
 ## Current Features
 
 - **Dual Input Support**: Upload local CSV files or provide public URLs for data loading
@@ -187,17 +75,5 @@ The application launches with the following settings:
 - **Robust Error Handling**: Comprehensive validation with informative user feedback
 - **Modern UI**: Clean, responsive interface optimized for both human and agent interactions
 
-## Hackathon Submission Highlights
-
-1. **LazyPredict Integration**: Automated comparison of 20+ ML algorithms with minimal configuration
-2. **Smart Automation**: Intelligent task detection, data validation, and model selection
-3. **Comprehensive Analysis**: ydata-profiling powered EDA reports with statistical insights
-4. **Dual Interface Design**: Optimized for both human users and AI agent interactions
-5. **MCP Server Implementation**: Full Model Context Protocol integration for seamless agent workflows
-6. **Flexible Data Loading**: Support for both local uploads and URL-based data sources
-7. **Production Ready**: Exportable models, comprehensive documentation, and robust error handling
-8. **Modern UI/UX**: Clean Gradio interface with intuitive workflow and clear feedback systems
-
-
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
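The removed "Core Functions" text above says the task type is detected from target variable uniqueness (classification vs regression). A minimal sketch of that heuristic — the `max_classes` threshold and the dtype check are illustrative assumptions, since app.py's actual rule is not part of this diff:

```python
import pandas as pd

def detect_task(y: pd.Series, max_classes: int = 10) -> str:
    """Classification if the target looks categorical, else regression.

    The max_classes threshold is an assumed cut-off for illustration only.
    """
    if y.dtype == object or y.nunique() <= max_classes:
        return "classification"
    return "regression"

print(detect_task(pd.Series(["yes", "no", "yes"])))  # classification
print(detect_task(pd.Series(range(100)) / 7))        # regression
```

Any real implementation would also need to handle edge cases such as integer-coded labels with many classes, which this sketch deliberately ignores.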
app.py
CHANGED

@@ -9,33 +9,53 @@ from sklearn.model_selection import train_test_split
 from ydata_profiling import ProfileReport
 import tempfile
 import requests
+import json
+from openai import OpenAI  # Added for Nebius AI Studio LLM integration
 
 def load_data(file_input):
     """Loads CSV data from either a local file upload or a public URL."""
     if file_input is None:
-        return None
+        return None, None
+
     try:
-        # For local file uploads, file_input is a temporary file object
         if hasattr(file_input, 'name'):
             file_path = file_input.name
             with open(file_path, 'rb') as f:
                 file_bytes = f.read()
             df = pd.read_csv(io.BytesIO(file_bytes))
-        # For URL text input
         elif isinstance(file_input, str) and file_input.startswith('http'):
             response = requests.get(file_input)
             response.raise_for_status()
             df = pd.read_csv(io.StringIO(response.text))
         else:
-            return None
-
+            return None, None
+
+        # Extract column names here
+        column_names = ", ".join(df.columns.tolist())
+        return df, column_names
     except Exception as e:
         gr.Warning(f"Failed to load or parse data: {e}")
-        return None
+        return None, None
+
+
+def update_detected_columns_display(file_data, url_data):
+    """
+    Detects and displays column names from the uploaded file or URL
+    as soon as the input changes, before the main analysis button is pressed.
+    """
+    data_source = file_data if file_data is not None else url_data
+    if data_source is None:
+        return ""
+
+    df, column_names = load_data(data_source)
+    if column_names:
+        return column_names
+    else:
+        return "No columns detected or error loading file. Please check the file format."
+
 
 def analyze_and_model(df, target_column):
     """Internal function to perform EDA, model training, and visualization."""
-    # ... (This function's content is unchanged)
     profile = ProfileReport(df, title="EDA Report", minimal=True)
     with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as temp_html:
         profile.to_file(temp_html.name)
@@ -50,7 +70,7 @@ def analyze_and_model(df, target_column):
     models, _ = model.fit(X_train, X_test, y_train, y_test)
 
     sort_metric = "Accuracy" if task == "classification" else "R-Squared"
-    best_model_name = models.sort_values(by=sort_metric, ascending=False).index[0]
+    best_model_name = models.sort_values(by=sort_metric, ascending=False).index[0]  # Corrected indexing
     best_model = model.models[best_model_name]
 
     with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as temp_pkl:
@@ -68,60 +88,96 @@ def analyze_and_model(df, target_column):
     plt.close()
 
     models_reset = models.reset_index().rename(columns={'index': 'Model'})
-    return profile_path, task, models_reset, plot_path, pickle_path
+    return profile, profile_path, task, models_reset, plot_path, pickle_path
 
-def run_pipeline(data_source, target_column):
+def run_pipeline(data_source, target_column, nebius_api_key):
     """
     This single function drives the entire application.
     It's exposed as the primary tool for the MCP server.
 
     :param data_source: A local file path (from gr.File) or a URL (from gr.Textbox).
     :param target_column: The name of the target column for prediction.
+    :param nebius_api_key: The API key for Nebius AI Studio.
     """
     # --- 1. Input Validation ---
     if not data_source or not target_column:
         error_msg = "Error: Data source and target column must be provided."
         gr.Warning(error_msg)
-        return None, error_msg, None, None, None, "Please provide all inputs."
+        return None, error_msg, None, None, None, "Please provide all inputs.", "No columns loaded."
 
     gr.Info("Starting analysis...")
 
     # --- 2. Data Loading ---
-    df = load_data(data_source)
+    df, column_names = load_data(data_source)
     if df is None:
-        return None, "Error: Could not load data.", None, None, None, None
+        return None, "Error: Could not load data.", None, None, None, None, "No columns loaded."
 
     if target_column not in df.columns:
-        error_msg = f"Error: Target column '{target_column}' not found in the dataset. Available: {
+        error_msg = f"Error: Target column '{target_column}' not found in the dataset. Available: {column_names}"
         gr.Warning(error_msg)
-        return None, error_msg, None, None, None, None
+        return None, error_msg, None, None, None, None, column_names
 
     # --- 3. Analysis and Modeling ---
-    profile_path, task, models_df, plot_path, pickle_path = analyze_and_model(df, target_column)
-
-    # --- 4. Explanation ---
-    best_model_name = models_df.iloc[0]['Model']
-
+    profile, profile_path, task, models_df, plot_path, pickle_path = analyze_and_model(df, target_column)
+
+    # --- 4. Explanation with Nebius AI Studio LLM ---
+    best_model_name = models_df.iloc[0]['Model']  # Corrected indexing
+
+    llm_explanation = "AI explanation is unavailable. Please provide a Nebius AI Studio API key to enable this feature."  # Generic fallback
+
+    if nebius_api_key:
+        try:
+            client = OpenAI(
+                base_url="https://api.studio.nebius.com/v1/",
+                api_key=nebius_api_key
+            )
+
+            # Craft a prompt for the LLM
+            prompt_text = f"Explain and Summarize the significance of the top performing model, '{best_model_name}', for a {task} task in a data analysis context. Keep the explanation concise and professional. Analyse the report: {profile}."
+
+            # Make the LLM call
+            response = client.chat.completions.create(
+                model="meta-llama/Llama-3.3-70B-Instruct",
+                messages=[
+                    {"role": "system", "content": "You are a helpful AI assistant that explains data science concepts."},
+                    {"role": "user", "content": prompt_text}
+                ],
+                temperature=0.6,
+                max_tokens=512,
+                top_p=0.9,
+                extra_body={
+                    "top_k": 50
+                }
+            )
+            message_content = response.to_json()
+            data = json.loads(message_content)
+            llm_explanation = data['choices'][0]['message']['content']
+
+        except Exception as e:
+            gr.Warning(f"Failed to get AI explanation: {e}. Please check your API key or try again later.")
+            llm_explanation = "An error occurred while fetching AI explanation. Please check your API key or try again later."
+
     gr.Info("Analysis complete!")
+    gr.Info(f'Profile report saved to: {profile_path}')
+    return profile_path, task, models_df, plot_path, pickle_path, llm_explanation, column_names
 
 # --- Gradio UI ---
 with gr.Blocks(title="AutoML Trainer", theme=gr.themes.Soft()) as demo:
     gr.Markdown("## AutoML Trainer")
-    gr.Markdown("Enter a CSV data source (local file or public URL) and a target column to run the analysis. This interface is now friendly for both humans and AI agents.")
 
     with gr.Row():
         with gr.Column(scale=1):
-            # Using gr.File allows for both upload and is compatible with agents
             file_input = gr.File(label="Upload Local CSV File")
             url_input = gr.Textbox(label="Or Enter Public CSV URL", placeholder="e.g., https://.../data.csv")
             target_column_input = gr.Textbox(label="Enter Target Column Name", placeholder="e.g., approved")
+            nebius_api_key_input = gr.Textbox(label="Nebius AI Studio API Key (Optional)", type="password", placeholder="Enter your API key for AI explanations")
            run_button = gr.Button("Run Analysis & AutoML", variant="primary")
 
        with gr.Column(scale=2):
+           column_names_output = gr.Textbox(label="Detected Columns", interactive=False, lines=2)  # New Textbox for column names
           task_output = gr.Textbox(label="Detected Task", interactive=False)
-          llm_output = gr.
+          llm_output = gr.Markdown(label="AI Explanation")
           metrics_output = gr.Dataframe(label="Model Performance Metrics")
 
    with gr.Row():
@@ -130,24 +186,33 @@ with gr.Blocks(title="AutoML Trainer", theme=gr.themes.Soft()) as demo:
        eda_output = gr.File(label="Download Full EDA Report")
        model_output = gr.File(label="Download Best Model (.pkl)")
 
-    # A helper function decides whether to use the file or URL input
-    def process_inputs(file_data, url_data, target):
+    def process_inputs(file_data, url_data, target, api_key):
         data_source = file_data if file_data is not None else url_data
-        return run_pipeline(data_source, target)
+        return run_pipeline(data_source, target, api_key)
+
+    file_input.change(
+        fn=update_detected_columns_display,
+        inputs=[file_input, url_input],
+        outputs=column_names_output
+    )
+    url_input.change(
+        fn=update_detected_columns_display,
+        inputs=[file_input, url_input],
+        outputs=column_names_output
+    )
 
     run_button.click(
         fn=process_inputs,
-        inputs=[file_input, url_input, target_column_input],
-        outputs=[eda_output, task_output, metrics_output, vis_output, model_output, llm_output]
+        inputs=[file_input, url_input, target_column_input, nebius_api_key_input],
+        outputs=[eda_output, task_output, metrics_output, vis_output, model_output, llm_output, column_names_output]
     )
 
 demo.launch(
     server_name="0.0.0.0",
     server_port=7860,
-    share=
+    share=False,
     show_api=True,
     inbrowser=True,
     mcp_server=True
-
 )
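The Nebius explanation step in this commit recovers the message text by serializing the response with `to_json()` and re-parsing it with `json.loads`. The traversal it performs looks like this on a hand-built payload (values are illustrative; with the OpenAI client the same text is also available directly as `response.choices[0].message.content`):

```python
import json

# Illustrative stand-in for an OpenAI-compatible chat completion payload;
# a real response carries the same "choices" -> "message" -> "content" shape.
raw = json.dumps({
    "choices": [
        {"message": {"role": "assistant", "content": "ExtraTreesClassifier led on accuracy."}}
    ]
})

data = json.loads(raw)
llm_explanation = data['choices'][0]['message']['content']
print(llm_explanation)
```

Using the typed attribute access avoids the JSON round trip entirely, at the cost of coupling the code to the client's response objects.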
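The pipeline exports the best model as a plain pickle file, so reusing it later is a load-and-predict round trip. A sketch with a stand-in object (the real file is the `.pkl` downloaded from the UI; a fitted scikit-learn estimator pickles and unpickles the same way, and pickles should only ever be loaded from trusted sources):

```python
import pickle
import tempfile

# Stand-in for the exported estimator; any fitted scikit-learn model would
# take its place in real use.
best_model = {"name": "RandomForestClassifier", "task": "classification"}

# Write the model out the same way analyze_and_model does.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as f:
    pickle.dump(best_model, f)
    path = f.name

# Later: load the serialized model back for reuse.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["name"])
```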