title: Student Performance Predictor
emoji: 🎓
colorFrom: blue
colorTo: purple
sdk: docker
sdk_version: latest
app_file: app.py
pinned: false
license: mit

🎓 Student Performance Prediction System


A machine learning project that predicts student math scores from demographic and academic factors using ensemble learning techniques.


🎯 Overview

This project implements an end-to-end machine learning pipeline to predict student mathematics performance based on various socio-economic and educational factors. The system uses advanced ensemble learning algorithms and provides a user-friendly web interface for real-time predictions.

🔍 Problem Statement

The project examines how student performance in mathematics is influenced by factors such as:

  • Demographic factors: Gender, Race/Ethnicity
  • Socio-economic factors: Lunch type (indicator of economic status)
  • Educational background: Parental education level, Test preparation course completion
  • Academic performance: Reading and Writing scores

The goal is to build a robust prediction model that can help educators and institutions identify students who might need additional support.

📊 Dataset

Source: Kaggle - Students Performance in Exams

Dataset Characteristics:

  • Size: 1,000 student records
  • Columns: 8 (5 categorical, 3 numerical, including the target)
  • Target Variable: math_score (0-100)

Feature Description:

| Feature | Type | Description |
|---|---|---|
| gender | Categorical | Student's gender (male/female) |
| race_ethnicity | Categorical | Student's ethnic group (A, B, C, D, E) |
| parental_level_of_education | Categorical | Highest education level of parents |
| lunch | Categorical | Lunch type (standard/free or reduced) |
| test_preparation_course | Categorical | Test prep course completion status |
| reading_score | Numerical | Reading test score (0-100) |
| writing_score | Numerical | Writing test score (0-100) |
| math_score | Numerical | Target - Mathematics test score (0-100) |
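
The seven predictor columns in the table can be mirrored in code as a small typed record. The sketch below is illustrative only: `StudentRecord` and `validate` are hypothetical names that do not come from the project, and the checks cover only what the table states explicitly (the two gender values and the 0-100 score range).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudentRecord:
    """One input row: the seven predictor columns from the feature table."""
    gender: str
    race_ethnicity: str
    parental_level_of_education: str
    lunch: str
    test_preparation_course: str
    reading_score: float
    writing_score: float

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means nothing was flagged."""
        problems = []
        if self.gender not in {"male", "female"}:
            problems.append(f"unexpected gender: {self.gender!r}")
        for name in ("reading_score", "writing_score"):
            value = getattr(self, name)
            if not 0 <= value <= 100:
                problems.append(f"{name} out of range 0-100: {value}")
        return problems


record = StudentRecord(
    gender="male",
    race_ethnicity="group B",
    parental_level_of_education="bachelor's degree",
    lunch="standard",
    test_preparation_course="completed",
    reading_score=85,
    writing_score=78,
)
print(record.validate())  # prints [] for this record
```

The other categorical columns are left unchecked here because the table does not enumerate every allowed value.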

🏗️ Architecture

The project follows a modular, production-ready architecture:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Data Source   │───▶│  Data Ingestion  │───▶│ Data Transform  │
│   (CSV File)    │    │   Component      │    │   Component     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
┌─────────────────┐    ┌──────────────────┐             │
│  Web Interface  │◀───│  Flask App       │             │
│   (HTML/CSS)    │    │  (Prediction)    │             │
└─────────────────┘    └──────────────────┘             │
                                ▲                       │
                                │                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Artifacts     │◀───│  Model Trainer   │◀───│  Preprocessed   │
│ (model.pkl,     │    │   Component      │    │     Data        │
│preprocessor.pkl)│    │                  │    │                 │
└─────────────────┘    └──────────────────┘    └─────────────────┘

✨ Features

🤖 Machine Learning Pipeline

  • Data Ingestion: Automated data loading and train-test splitting
  • Data Transformation:
    • Numerical features: Median imputation + Standard scaling
    • Categorical features: Mode imputation + One-hot encoding + Scaling
  • Model Training: Multi-algorithm comparison with hyperparameter tuning
  • Model Selection: Automated best model selection based on R² score
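
The transformation steps listed above come down to a few lines of arithmetic. The project lists scikit-learn for preprocessing; the dependency-free sketch below only illustrates what median imputation, standard scaling, mode imputation, and one-hot encoding do to a column. The function names and the toy values are made up for illustration, and the final scaling of the encoded categorical columns is omitted.

```python
from statistics import mean, median, mode, pstdev


def impute_and_scale(values):
    """Fill missing numbers (None) with the median, then standardize."""
    fill = median(v for v in values if v is not None)
    filled = [fill if v is None else v for v in values]
    mu, sigma = mean(filled), pstdev(filled)
    # Guard against a constant column, where the standard deviation is zero.
    return [(v - mu) / sigma if sigma else 0.0 for v in filled]


def impute_and_one_hot(values):
    """Fill missing categories (None) with the mode, then one-hot encode."""
    fill = mode(v for v in values if v is not None)
    filled = [fill if v is None else v for v in values]
    categories = sorted(set(filled))
    rows = [[1 if v == c else 0 for c in categories] for v in filled]
    return categories, rows


# Toy columns with one missing entry each (not real dataset rows).
reading = [70, None, 90, 80]
lunch = ["standard", "free or reduced", None, "standard"]

scaled = impute_and_scale(reading)
categories, encoded = impute_and_one_hot(lunch)
print(categories)  # ['free or reduced', 'standard']
print(encoded)     # [[0, 1], [1, 0], [0, 1], [0, 1]]
```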

🧠 Advanced Algorithms

  • Random Forest Regressor
  • Gradient Boosting Regressor
  • XGBoost Regressor
  • CatBoost Regressor
  • AdaBoost Regressor
  • Decision Tree Regressor
  • Linear Regression

🌐 Web Application

  • Modern UI/UX: Responsive design with gradient styling
  • Real-time Predictions: Instant math score predictions
  • Form Validation: Client-side and server-side validation
  • Error Handling: Comprehensive exception handling with custom logging

🔧 Production Features

  • Custom Exception Handling: Detailed error tracking and logging
  • Logging System: Timestamped logs for debugging and monitoring
  • Modular Design: Reusable components for easy maintenance
  • Configuration Management: Centralized configuration using dataclasses

🚀 Installation

Prerequisites

  • Python 3.11+
  • pip package manager

Setup Instructions

  1. Clone the repository

    git clone https://github.com/yashpinjarkar10/mlproject.git
    cd mlproject
    
  2. Create virtual environment

    python -m venv venv
    
    # Windows
    venv\Scripts\activate
    
    # Linux/Mac
    source venv/bin/activate
    
  3. Install dependencies

    pip install -r requirements.txt
    
  4. Install the project in development mode

    pip install -e .
    

💻 Usage

🎯 Training the Model

Run the complete ML pipeline (data ingestion → transformation → model training):

python src/components/data_ingestion.py

This will:

  • Load and split the dataset (80% train, 20% test)
  • Apply data transformations
  • Train multiple models with hyperparameter tuning
  • Save the best model and preprocessor

🌐 Running the Web Application

python app.py

Access the application at: http://localhost:5000

📝 Making Predictions

  1. Navigate to the prediction page
  2. Fill in the student information:
    • Personal details (Gender, Ethnicity)
    • Educational background (Parent education, Test prep)
    • Academic scores (Reading & Writing)
  3. Click "Predict Math Score"
  4. View the predicted mathematics score

📈 Model Performance

The system automatically selects the best-performing model based on R² score evaluation:

  • Minimum Acceptable Performance: R² ≥ 0.6
  • Cross-validation: 3-fold CV during hyperparameter tuning
  • Evaluation Metrics: R² Score on test set
  • Model Comparison: Comprehensive evaluation of 7 different algorithms
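
For reference, R² is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below computes it by hand and applies the 0.6 threshold from the list above. The model names and predictions are made up, and `select_best` is a hypothetical helper, not the project's function.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def select_best(scores, threshold=0.6):
    """Return (name, score) of the top model, or raise if it misses the threshold."""
    name, score = max(scores.items(), key=lambda item: item[1])
    if score < threshold:
        raise ValueError(f"no model reached R^2 >= {threshold}")
    return name, score


y_true = [60, 70, 80, 90]
predictions = {  # made-up predictions from two hypothetical models
    "model_a": [62, 69, 79, 91],
    "model_b": [75, 75, 75, 75],  # always predicts the mean, so R^2 is 0
}
scores = {name: r2_score(y_true, y_pred) for name, y_pred in predictions.items()}
best_name, best_score = select_best(scores)
print(best_name, round(best_score, 3))  # prints: model_a 0.986
```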

Hyperparameter Optimization

Each algorithm is tuned with GridSearchCV using an algorithm-specific parameter grid:

| Algorithm | Key Parameters Tuned |
|---|---|
| Random Forest | n_estimators, max_features |
| Gradient Boosting | learning_rate, n_estimators, subsample |
| XGBoost | learning_rate, n_estimators |
| CatBoost | depth, learning_rate, iterations |
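
To show what a grid search with 3-fold cross-validation does, the sketch below implements the idea from scratch on a toy problem. It is not the project's tuning code: the one-parameter gradient-descent model, the interleaved fold assignment, and the grid values are all assumptions chosen to keep the example small, with `n_steps` loosely standing in for parameters such as `n_estimators`.

```python
from itertools import product


def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, val_idx); fold i holds out every k-th sample from i."""
    for fold in range(k):
        val = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, val


def fit_slope(xs, ys, learning_rate, n_steps):
    """Toy model: fit y = w * x by gradient descent on mean squared error."""
    w = 0.0
    for _ in range(n_steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w


def neg_mse(w, xs, ys):
    """Negative mean squared error, so that higher is better."""
    return -sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(xs, ys, param_grid, k=3):
    """Try every parameter combination; keep the best mean validation score."""
    best_params, best_score = None, float("-inf")
    names = list(param_grid)
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, val in k_fold_indices(len(xs), k):
            w = fit_slope([xs[i] for i in train], [ys[i] for i in train], **params)
            fold_scores.append(neg_mse(w, [xs[i] for i in val], [ys[i] for i in val]))
        mean_score = sum(fold_scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


xs = [0.1 * i for i in range(1, 13)]
ys = [3.0 * x for x in xs]  # noiseless toy data with true slope 3
grid = {"learning_rate": [0.01, 0.1], "n_steps": [10, 100]}
best_params, best_score = grid_search(xs, ys, grid)
print(best_params)  # {'learning_rate': 0.1, 'n_steps': 100}
```

The cost grows with the product of the grid sizes times the number of folds (here 2 × 2 × 3 = 12 fits), which is why grids are usually kept small.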

🔌 API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| / | GET | Landing page with project overview |
| /predictdata | GET | Display prediction form |
| /predictdata | POST | Process prediction request and return result |

Request Format (POST /predictdata)

{
  "gender": "male",
  "ethnicity": "group B",
  "parental_level_of_education": "bachelor's degree",
  "lunch": "standard",
  "test_preparation_course": "completed",
  "reading_score": 85,
  "writing_score": 78
}
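
A request with this body could be built from Python's standard library as sketched below. The sketch assumes the endpoint accepts a JSON body as shown; if the app reads HTML form fields instead, the body would need to be form-encoded. The network call is left commented out because it needs the app running locally.

```python
import json
import urllib.request

payload = {
    "gender": "male",
    "ethnicity": "group B",
    "parental_level_of_education": "bachelor's degree",
    "lunch": "standard",
    "test_preparation_course": "completed",
    "reading_score": 85,
    "writing_score": 78,
}

request = urllib.request.Request(
    "http://localhost:5000/predictdata",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending requires the app to be running on localhost:5000:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read()[:200])
```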

📁 Project Structure

mlproject/
├── 📱 app.py                          # Flask web application
├── 📋 requirements.txt                # Project dependencies
├── ⚙️ setup.py                        # Package configuration
├── 📚 README.md                       # Project documentation
│
├── 📊 artifacts/                      # Generated model artifacts
│   ├── 📈 data.csv                   # Raw dataset
│   ├── 🔧 preprocessor.pkl           # Data transformation pipeline
│   ├── 🤖 model.pkl                  # Trained best model
│   ├── 📝 train.csv                  # Training dataset
│   └── ✅ test.csv                   # Testing dataset
│
├── 📓 notebook/                       # Jupyter notebooks
│   ├── 🔍 1. EDA STUDENT PERFORMANCE.ipynb   # Exploratory Data Analysis
│   ├── 🎯 2. MODEL TRAINING.ipynb            # Model development
│   └── 📁 data/
│       └── 📊 stud.csv               # Original dataset
│
├── 🎨 templates/                      # HTML templates
│   ├── 🏠 index.html                # Landing page
│   └── 📋 home.html                 # Prediction form
│
├── 📦 src/                           # Source code package
│   ├── 🔧 components/               # ML pipeline components
│   │   ├── 📥 data_ingestion.py     # Data loading and splitting
│   │   ├── 🔄 data_transformation.py # Feature engineering
│   │   └── 🎯 model_trainer.py      # Model training and selection
│   │
│   ├── 🔀 pipeline/                 # Prediction pipelines
│   │   ├── 🚀 predict_pipeline.py   # Inference pipeline
│   │   └── 🎓 train_pipeline.py     # Training pipeline
│   │
│   ├── 🛠️ utils.py                  # Utility functions
│   ├── ⚠️ exception.py              # Custom exception handling
│   └── 📝 logger.py                 # Logging configuration
│
└── 📋 logs/                         # Application logs
    └── 📅 [timestamp].log          # Timestamped log files

🛠️ Technologies Used

Core Framework

  • Python 3.11+: Main programming language
  • Flask: Web framework for the user interface
  • scikit-learn 1.2.1: Machine learning algorithms and preprocessing

Data Science Stack

  • pandas: Data manipulation and analysis
  • numpy: Numerical computing
  • matplotlib & seaborn: Data visualization

Machine Learning Libraries

  • XGBoost: Gradient boosting framework
  • CatBoost: Categorical feature boosting
  • dill: Advanced object serialization

Development Tools

  • setuptools: Package management
  • Custom logging: Application monitoring
  • Exception handling: Error management

Frontend

  • HTML5 & CSS3: Modern responsive web interface
  • Jinja2: Template engine for dynamic content

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Areas for Contribution

  • 🔧 Additional ML algorithms
  • 📊 Enhanced data visualization
  • 🌐 API improvements
  • 📱 Mobile responsiveness
  • 🧪 Unit testing
  • 📚 Documentation improvements

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👨‍💻 Author

Yash Pinjarkar


⭐ If you found this project helpful, please consider giving it a star!

Built with ❤️ for the ML community