
# 🤖 MLOps Training Platform

A beginner-friendly, multilingual MLOps platform for training text classification models. Built with Streamlit and PyTorch, this platform enables users to train, evaluate, and deploy machine learning models for content detection across multiple languages.

*Platform Preview*

## ✨ Features

### 🌍 Multilingual Support

- **English** 🇬🇧 - Standard NLP preprocessing
- **Chinese (中文)** 🇨🇳 - Jieba word segmentation
- **Khmer (ភាសាខ្មែរ)** 🇰🇭 - Unicode normalization for Khmer script
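As a rough illustration of what the Khmer path involves, here is a minimal, stdlib-only normalization pass. `normalize_khmer` is a hypothetical helper, not the platform's actual API; the real `preprocessor.py` may apply additional language-specific steps.

```python
import unicodedata

def normalize_khmer(text: str) -> str:
    # Compose combining marks into canonical form (NFC), treat the
    # zero-width space that conventionally separates Khmer words as a
    # plain space, and collapse runs of whitespace.
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u200b", " ")
    return " ".join(text.split())
```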

### 🎯 Key Capabilities

- **Easy Data Upload** - Upload CSV files with your training data
- **Interactive Configuration** - Adjust hyperparameters with sliders and dropdowns
- **Real-time Training Monitoring** - Watch progress and metrics as your model trains
- **Comprehensive Evaluation** - View confusion matrices, accuracy, precision, recall, and F1
- **Model Export** - Download trained models for deployment

## 🧠 Supported Model Architectures

| Model | Languages | Description |
|---|---|---|
| mBERT | EN, ZH, KM | Multilingual BERT supporting 104 languages |
| XLM-RoBERTa | EN, ZH, KM | Cross-lingual model with excellent performance |
| DistilBERT | EN, ZH, KM | Lightweight, faster training |
| RoBERTa | EN | Optimized for English text |

## 🚀 Quick Start

### Prerequisites

- Python 3.8 or higher
- pip package manager
- 4 GB+ RAM recommended

### Installation

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd "90. Content Detection Template"
   ```

2. **Create a virtual environment** (recommended)

   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # Linux/Mac
   source venv/bin/activate
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Run the Streamlit app**

   ```bash
   streamlit run streamlit_app.py
   ```

5. **Open your browser** and navigate to http://localhost:8501

## 📖 User Guide

### Step 1: Upload Your Data

Your CSV file should have at least two columns:

- `text` - The text content to classify
- `label` - Binary labels (0 or 1)

Example CSV format:

```csv
text,label
"This is a legitimate message.",0
"URGENT: Click here to claim your prize!",1
"Meeting scheduled for tomorrow at 3pm.",0
"Your account has been compromised!",1
```
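Before uploading, you can sanity-check a file against this format yourself. The following stdlib-only checker is a hypothetical helper (the platform performs its own validation internally):

```python
import csv
import io

def validate_dataset(csv_text: str) -> list[str]:
    """Return a list of problems found in a training CSV; empty means OK."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or not {"text", "label"} <= set(reader.fieldnames):
        return ["missing required columns: text, label"]
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        if not row["text"].strip():
            problems.append(f"row {i}: empty text")
        if row["label"] not in {"0", "1"}:
            problems.append(f"row {i}: label must be 0 or 1")
    return problems
```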

### Step 2: Configure Training

1. **Select Target Language** - Choose the language of your data
2. **Choose Model Architecture** - Select based on your needs:
   - Use **DistilBERT** for faster training on CPU
   - Use **mBERT** or **XLM-RoBERTa** for the best multilingual performance
3. **Set Hyperparameters**:
   - Learning Rate: Start with 2e-5 (the default)
   - Epochs: 3-5 is usually sufficient
   - Batch Size: Use 8-16 on CPU; larger on GPU

### Step 3: Train Your Model

1. Click **Start Training**
2. Monitor progress in real time
3. View training metrics as they update
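Behind live metric charts like these there is usually nothing more than per-epoch bookkeeping. A tiny illustrative tracker, similar in spirit to what a training UI plots (`MetricHistory` is not the platform's actual API):

```python
class MetricHistory:
    """Accumulate per-epoch metrics and report the best epoch so far."""

    def __init__(self):
        self.history = []

    def log(self, epoch: int, loss: float, accuracy: float) -> None:
        # Append one row per epoch; a UI would re-plot after each call.
        self.history.append({"epoch": epoch, "loss": loss, "accuracy": accuracy})

    def best_epoch(self) -> int:
        # Epoch with the highest validation accuracy logged so far.
        return max(self.history, key=lambda m: m["accuracy"])["epoch"]
```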

### Step 4: Evaluate & Download

1. Review the final metrics (accuracy, precision, recall, F1)
2. Test the model on new text samples
3. Download the trained model as a ZIP file

πŸ“ Project Structure

β”œβ”€β”€ streamlit_app.py          # Main Streamlit application
β”œβ”€β”€ mlops/                    # MLOps backend modules
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ config.py             # Configuration classes
β”‚   β”œβ”€β”€ preprocessor.py       # Language-specific preprocessing
β”‚   β”œβ”€β”€ trainer.py            # Model training logic
β”‚   └── evaluator.py          # Model evaluation & visualization
β”œβ”€β”€ app/                      # FastAPI backend (original)
β”‚   └── main.py
β”œβ”€β”€ models/                   # Pre-trained models
β”œβ”€β”€ trained_models/           # Output directory for trained models
β”œβ”€β”€ requirements.txt          # Python dependencies
└── README.md                 # This file

## 🔧 Configuration Options

### Training Parameters

| Parameter | Default | Description |
|---|---|---|
| Learning Rate | 2e-5 | Model learning rate |
| Epochs | 3 | Number of training epochs |
| Batch Size | 16 | Training batch size |
| Max Length | 256 | Maximum token sequence length |
| Train Split | 80% | Percentage of data used for training |
| Validation Split | 10% | Percentage used for validation |
| Test Split | 10% | Percentage used for testing |
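The 80/10/10 split defaults can be reproduced with a seeded shuffle. A stdlib sketch (illustrative only; the platform's actual splitting logic lives in its backend):

```python
import random

def split_dataset(rows, train=0.8, val=0.1, seed=42):
    """Shuffle deterministically, then slice into train/validation/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]
```

Because the shuffle is driven by a fixed seed, re-running the split on the same data always yields the same partitions.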

### Advanced Options

| Parameter | Default | Description |
|---|---|---|
| Warmup Ratio | 0.1 | Fraction of steps for LR warmup |
| Weight Decay | 0.01 | L2 regularization strength |
| Random Seed | 42 | For reproducibility |
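The tables above map naturally onto a single configuration object. A hypothetical dataclass sketch with matching defaults (the actual classes in `mlops/config.py` may differ):

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Defaults mirror the parameter tables above.
    learning_rate: float = 2e-5
    epochs: int = 3
    batch_size: int = 16
    max_length: int = 256
    train_split: float = 0.8
    val_split: float = 0.1
    test_split: float = 0.1
    warmup_ratio: float = 0.1
    weight_decay: float = 0.01
    seed: int = 42
```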

## 🌐 API Integration

The platform also includes a FastAPI backend for programmatic access:

```python
import requests

# Make a prediction
response = requests.post(
    "http://localhost:8000/predict",
    json={"text": "Your text here"}
)
print(response.json())
```

## 🐳 Docker Support

```bash
# Build the image
docker build -t mlops-platform .

# Run the container
docker run -p 8501:8501 mlops-platform
```
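The build command above implies a Dockerfile at the repository root. If you need to recreate one, a minimal sketch along these lines is typical for a Streamlit app (the base image and paths here are assumptions, not the repository's actual file):

```dockerfile
# Assumed layout; adjust to the repository's actual Dockerfile.
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "streamlit_app.py", "--server.address=0.0.0.0"]
```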

## 📊 Sample Datasets

Load sample data directly from the UI by clicking the **Sample** button in the sidebar. Sample data is available for:

- English phishing detection
- Chinese phishing detection (中文钓鱼检测)
- Khmer content classification (ការចាត់ថ្នាក់មាតិកាខ្មែរ)

πŸ› οΈ Troubleshooting

Common Issues

Out of Memory Error

  • Reduce batch size
  • Use a smaller model (DistilBERT)
  • Reduce max sequence length
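A common mitigation pattern is to retry with a progressively smaller batch. Here is a generic, framework-agnostic sketch of the idea (with PyTorch you would catch `torch.cuda.OutOfMemoryError` rather than `MemoryError`; `run_with_batch_fallback` is a hypothetical helper):

```python
def run_with_batch_fallback(train_fn, batch_size=16, min_batch=1):
    """Call train_fn(batch_size), halving the batch after each
    out-of-memory failure until it fits or min_batch is passed."""
    while batch_size >= min_batch:
        try:
            return train_fn(batch_size)
        except MemoryError:
            batch_size //= 2
    raise RuntimeError("could not fit even the minimum batch size")
```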

**Slow Training**

- Make sure you are using a GPU if one is available
- Reduce the number of epochs
- Use a smaller dataset while experimenting

**Model Loading Errors**

- Ensure an internet connection is available to download models
- Check available disk space
- Try a different model architecture

πŸ“ License

This project is licensed under the MIT License.

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“§ Support

For issues and feature requests, please open a GitHub issue.


*Built with ❤️ using Streamlit & PyTorch*

**Supports:** 🇬🇧 English | 🇨🇳 中文 | 🇰🇭 ភាសាខ្មែរ