
# 🤖 MLOps Training Platform

A beginner-friendly, multilingual MLOps platform for training text classification models. Built with Streamlit and PyTorch, this platform enables users to train, evaluate, and deploy machine learning models for content detection across multiple languages.

*Platform Preview*

## ✨ Features

### 🌍 Multilingual Support

- **English** 🇬🇧 - Standard NLP preprocessing
- **Chinese (中文)** 🇨🇳 - Jieba word segmentation
- **Khmer (ភាសាខ្មែរ)** 🇰🇭 - Unicode normalization for Khmer script
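As a rough illustration of what the Khmer path involves, here is a minimal, stdlib-only normalization pass. `normalize_khmer` is a hypothetical helper, not the platform's actual API; the real `preprocessor.py` may apply additional language-specific steps.

```python
import unicodedata

def normalize_khmer(text: str) -> str:
    # Compose combining marks into canonical form (NFC), treat the
    # zero-width space that conventionally separates Khmer words as a
    # plain space, and collapse runs of whitespace.
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u200b", " ")
    return " ".join(text.split())
```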

### 🎯 Key Capabilities

- **Easy Data Upload** - Upload CSV files with your training data
- **Interactive Configuration** - Adjust hyperparameters with sliders and dropdowns
- **Real-time Training Monitoring** - Watch progress and metrics as your model trains
- **Comprehensive Evaluation** - View confusion matrices, accuracy, precision, recall, and F1
- **Model Export** - Download trained models for deployment

## 🧠 Supported Model Architectures

| Model | Languages | Description |
|---|---|---|
| mBERT | EN, ZH, KM | Multilingual BERT supporting 104 languages |
| XLM-RoBERTa | EN, ZH, KM | Cross-lingual model with excellent performance |
| DistilBERT | EN, ZH, KM | Lightweight, faster training |
| RoBERTa | EN | Optimized for English text |

## 🚀 Quick Start

### Prerequisites

- Python 3.8 or higher
- pip package manager
- 4 GB+ RAM recommended

### Installation

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd "90. Content Detection Template"
   ```

2. **Create a virtual environment** (recommended)

   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # Linux/Mac
   source venv/bin/activate
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Run the Streamlit app**

   ```bash
   streamlit run streamlit_app.py
   ```

5. **Open your browser** and navigate to http://localhost:8501

## 📖 User Guide

### Step 1: Upload Your Data

Your CSV file should have at least two columns:

- `text` - The text content to classify
- `label` - Binary labels (0 or 1)

Example CSV format:

```csv
text,label
"This is a legitimate message.",0
"URGENT: Click here to claim your prize!",1
"Meeting scheduled for tomorrow at 3pm.",0
"Your account has been compromised!",1
```
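Before uploading, you can sanity-check a file against this format yourself. The following stdlib-only checker is a hypothetical helper (the platform performs its own validation internally):

```python
import csv
import io

def validate_dataset(csv_text: str) -> list[str]:
    """Return a list of problems found in a training CSV; empty means OK."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or not {"text", "label"} <= set(reader.fieldnames):
        return ["missing required columns: text, label"]
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        if not row["text"].strip():
            problems.append(f"row {i}: empty text")
        if row["label"] not in {"0", "1"}:
            problems.append(f"row {i}: label must be 0 or 1")
    return problems
```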

### Step 2: Configure Training

1. **Select Target Language** - Choose the language of your data
2. **Choose Model Architecture** - Select based on your needs:
   - Use **DistilBERT** for faster training on CPU
   - Use **mBERT** or **XLM-RoBERTa** for the best multilingual performance
3. **Set Hyperparameters**:
   - Learning Rate: Start with 2e-5 (the default)
   - Epochs: 3-5 is usually sufficient
   - Batch Size: Use 8-16 on CPU; larger on GPU

### Step 3: Train Your Model

1. Click **Start Training**
2. Monitor progress in real time
3. View training metrics as they update
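Behind live metric charts like these there is usually nothing more than per-epoch bookkeeping. A tiny illustrative tracker, similar in spirit to what a training UI plots (`MetricHistory` is not the platform's actual API):

```python
class MetricHistory:
    """Accumulate per-epoch metrics and report the best epoch so far."""

    def __init__(self):
        self.history = []

    def log(self, epoch: int, loss: float, accuracy: float) -> None:
        # Append one row per epoch; a UI would re-plot after each call.
        self.history.append({"epoch": epoch, "loss": loss, "accuracy": accuracy})

    def best_epoch(self) -> int:
        # Epoch with the highest validation accuracy logged so far.
        return max(self.history, key=lambda m: m["accuracy"])["epoch"]
```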

### Step 4: Evaluate & Download

1. Review the final metrics (accuracy, precision, recall, F1)
2. Test the model on new text samples
3. Download the trained model as a ZIP file

πŸ“ Project Structure

β”œβ”€β”€ streamlit_app.py          # Main Streamlit application
β”œβ”€β”€ mlops/                    # MLOps backend modules
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ config.py             # Configuration classes
β”‚   β”œβ”€β”€ preprocessor.py       # Language-specific preprocessing
β”‚   β”œβ”€β”€ trainer.py            # Model training logic
β”‚   └── evaluator.py          # Model evaluation & visualization
β”œβ”€β”€ app/                      # FastAPI backend (original)
β”‚   └── main.py
β”œβ”€β”€ models/                   # Pre-trained models
β”œβ”€β”€ trained_models/           # Output directory for trained models
β”œβ”€β”€ requirements.txt          # Python dependencies
└── README.md                 # This file

## 🔧 Configuration Options

### Training Parameters

| Parameter | Default | Description |
|---|---|---|
| Learning Rate | 2e-5 | Model learning rate |
| Epochs | 3 | Number of training epochs |
| Batch Size | 16 | Training batch size |
| Max Length | 256 | Maximum token sequence length |
| Train Split | 80% | Percentage of data used for training |
| Validation Split | 10% | Percentage used for validation |
| Test Split | 10% | Percentage used for testing |
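The 80/10/10 split defaults can be reproduced with a seeded shuffle. A stdlib sketch (illustrative only; the platform's actual splitting logic lives in its backend):

```python
import random

def split_dataset(rows, train=0.8, val=0.1, seed=42):
    """Shuffle deterministically, then slice into train/validation/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]
```

Because the shuffle is driven by a fixed seed, re-running the split on the same data always yields the same partitions.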

### Advanced Options

| Parameter | Default | Description |
|---|---|---|
| Warmup Ratio | 0.1 | Fraction of steps for LR warmup |
| Weight Decay | 0.01 | L2 regularization strength |
| Random Seed | 42 | For reproducibility |
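The tables above map naturally onto a single configuration object. A hypothetical dataclass sketch with matching defaults (the actual classes in `mlops/config.py` may differ):

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Defaults mirror the parameter tables above.
    learning_rate: float = 2e-5
    epochs: int = 3
    batch_size: int = 16
    max_length: int = 256
    train_split: float = 0.8
    val_split: float = 0.1
    test_split: float = 0.1
    warmup_ratio: float = 0.1
    weight_decay: float = 0.01
    seed: int = 42
```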

## 🌐 API Integration

The platform also includes a FastAPI backend for programmatic access:

```python
import requests

# Make a prediction
response = requests.post(
    "http://localhost:8000/predict",
    json={"text": "Your text here"}
)
print(response.json())
```

## 🐳 Docker Support

```bash
# Build the image
docker build -t mlops-platform .

# Run the container
docker run -p 8501:8501 mlops-platform
```
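The build command above implies a Dockerfile at the repository root. If you need to recreate one, a minimal sketch along these lines is typical for a Streamlit app (the base image and paths here are assumptions, not the repository's actual file):

```dockerfile
# Assumed layout; adjust to the repository's actual Dockerfile.
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "streamlit_app.py", "--server.address=0.0.0.0"]
```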

## 📊 Sample Datasets

Load sample data directly from the UI by clicking the **Sample** button in the sidebar. Sample data is available for:

- English phishing detection
- Chinese phishing detection (中文钓鱼检测)
- Khmer content classification (ការចាត់ថ្នាក់មាតិកាខ្មែរ)

πŸ› οΈ Troubleshooting

Common Issues

Out of Memory Error

  • Reduce batch size
  • Use a smaller model (DistilBERT)
  • Reduce max sequence length
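A common mitigation pattern is to retry with a progressively smaller batch. Here is a generic, framework-agnostic sketch of the idea (with PyTorch you would catch `torch.cuda.OutOfMemoryError` rather than `MemoryError`; `run_with_batch_fallback` is a hypothetical helper):

```python
def run_with_batch_fallback(train_fn, batch_size=16, min_batch=1):
    """Call train_fn(batch_size), halving the batch after each
    out-of-memory failure until it fits or min_batch is passed."""
    while batch_size >= min_batch:
        try:
            return train_fn(batch_size)
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("could not fit even the minimum batch size")
```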

**Slow Training**

- Make sure you are using a GPU if one is available
- Reduce the number of epochs
- Use a smaller dataset while experimenting

**Model Loading Errors**

- Ensure an internet connection is available to download models
- Check available disk space
- Try a different model architecture

πŸ“ License

This project is licensed under the MIT License.

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“§ Support

For issues and feature requests, please open a GitHub issue.


*Built with ❤️ using Streamlit & PyTorch*

**Supports:** 🇬🇧 English | 🇨🇳 中文 | 🇰🇭 ភាសាខ្មែរ