# MLOps Training Platform
A beginner-friendly, multilingual MLOps platform for training text classification models. Built with Streamlit and PyTorch, it lets users train, evaluate, and deploy machine learning models for content detection across multiple languages.
## Features
### Multilingual Support
- **English** 🇬🇧 - Standard NLP preprocessing
- **Chinese (中文)** 🇨🇳 - Jieba word segmentation
- **Khmer (ភាសាខ្មែរ)** 🇰🇭 - Unicode normalization for Khmer script
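
As a rough illustration of the Unicode normalization step, the sketch below applies standard NFC normalization and collapses whitespace using only the Python standard library. The function name `normalize_text` is hypothetical, and this is not necessarily what `mlops/preprocessor.py` does. NFC is only a generic first step, and any Khmer-specific reordering rules are outside its scope.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in NFC form with runs of whitespace collapsed to one space.

    Hypothetical helper for illustration; zero-width spaces are left untouched
    because str.split() does not treat them as whitespace.
    """
    composed = unicodedata.normalize("NFC", text)
    return " ".join(composed.split())


# "e" followed by a combining acute accent is composed into a single code point.
print(normalize_text("Cafe\u0301   menu"))
```

Normalizing before tokenization means visually identical strings map to the same token sequence, whichever way they were typed.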
### Key Capabilities
- **Easy Data Upload** - Upload CSV files with your training data
- **Interactive Configuration** - Adjust hyperparameters with sliders and dropdowns
- **Real-time Training Monitoring** - Watch progress and metrics as your model trains
- **Comprehensive Evaluation** - View confusion matrices, accuracy, precision, recall, and F1
- **Model Export** - Download trained models for deployment
### Supported Model Architectures
| Model | Languages | Description |
|-------|-----------|-------------|
| mBERT | EN, ZH, KM | Multilingual BERT supporting 104 languages |
| XLM-RoBERTa | EN, ZH, KM | Cross-lingual model pre-trained on 100 languages; strong multilingual performance |
| DistilBERT | EN, ZH, KM | Lightweight, faster training |
| RoBERTa | EN | Optimized for English text |
## Quick Start
### Prerequisites
- Python 3.8 or higher
- pip package manager
- 4GB+ RAM recommended
### Installation
1. **Clone the repository**
   ```bash
   git clone <repository-url>
   cd "90. Content Detection Template"
   ```
2. **Create a virtual environment** (recommended)
   ```bash
   python -m venv venv
   # Windows
   venv\Scripts\activate
   # Linux/Mac
   source venv/bin/activate
   ```
3. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```
4. **Run the Streamlit app**
   ```bash
   streamlit run streamlit_app.py
   ```
5. **Open your browser**
   Navigate to `http://localhost:8501`.
## User Guide
### Step 1: Upload Your Data
Your CSV file should have at least two columns:
- `text` - The text content to classify
- `label` - Binary label (0 or 1)

**Example CSV format:**
```csv
text,label
"This is a legitimate message.",0
"URGENT: Click here to claim your prize!",1
"Meeting scheduled for tomorrow at 3pm.",0
"Your account has been compromised!",1
```
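
Before uploading, it can help to check a file against the format above. The following is a minimal sketch using only the standard library. The function name `validate_rows` and its error messages are hypothetical, and the platform's own validation may differ.

```python
import csv
import io
from typing import List, Tuple

REQUIRED_COLUMNS = {"text", "label"}


def validate_rows(csv_text: str) -> List[Tuple[str, int]]:
    """Parse CSV text into (text, label) pairs; raise ValueError on bad input."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        label = (row["label"] or "").strip()
        text = (row["text"] or "").strip()
        if label not in {"0", "1"}:
            raise ValueError(f"line {reader.line_num}: label must be 0 or 1, got {label!r}")
        if not text:
            raise ValueError(f"line {reader.line_num}: empty text")
        rows.append((text, int(label)))
    return rows


sample = '''text,label
"This is a legitimate message.",0
"URGENT: Click here to claim your prize!",1
'''
print(validate_rows(sample))
```

To check a file on disk, read it with `open(path, newline="", encoding="utf-8")` and pass the contents to `validate_rows`.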
### Step 2: Configure Training
1. **Select Target Language** - Choose the language of your data
2. **Choose Model Architecture** - Select based on your needs:
   - Use **DistilBERT** for faster training on CPU
   - Use **mBERT** or **XLM-RoBERTa** for best multilingual performance
3. **Set Hyperparameters**:
   - **Learning Rate**: Start with 2e-5 (default)
   - **Epochs**: 3 to 5 is usually sufficient
   - **Batch Size**: Use 8 to 16 on CPU, larger on GPU
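
The same choices can be expressed in code. The sketch below is a hypothetical `TrainingConfig` dataclass that uses the documented defaults. Its field names, the `"distilbert"` identifier, and the validation rules are illustrative assumptions, not the contents of `mlops/config.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical settings container; field names are illustrative only."""

    model_name: str = "distilbert"
    learning_rate: float = 2e-5
    epochs: int = 3
    batch_size: int = 16
    max_length: int = 256

    def __post_init__(self) -> None:
        # Reject values that would make a training run meaningless.
        if not 0 < self.learning_rate < 1:
            raise ValueError("learning_rate must be between 0 and 1")
        if self.epochs < 1:
            raise ValueError("epochs must be at least 1")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if self.max_length < 1:
            raise ValueError("max_length must be at least 1")


# A smaller batch size for CPU-only training, as suggested above.
cpu_config = TrainingConfig(batch_size=8)
print(cpu_config)
```

Validating in `__post_init__` surfaces a bad value when the configuration is built, before any model is downloaded or trained.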
### Step 3: Train Your Model
1. Click **Start Training**
2. Monitor progress in real time
3. View training metrics as they update
### Step 4: Evaluate & Download
1. Review final metrics (accuracy, precision, recall, F1)
2. Test the model with new text samples
3. Download the trained model as a ZIP file
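
For reference, the four metrics reported in Step 4 can be computed from binary labels as in the sketch below. It uses the standard definitions with label 1 as the positive class. The function name `binary_metrics` is hypothetical, and the platform's evaluator may compute these differently, for example through a library.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 with label 1 as positive."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = binary_metrics([0, 1, 0, 1, 1, 0], [0, 1, 1, 1, 0, 0])
print(metrics)
```

In this example there are two true positives, one false positive, one false negative, and two true negatives, so all four metrics come out to 2/3.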
## Project Structure
```
├── streamlit_app.py          # Main Streamlit application
├── mlops/                    # MLOps backend modules
│   ├── __init__.py
│   ├── config.py             # Configuration classes
│   ├── preprocessor.py       # Language-specific preprocessing
│   ├── trainer.py            # Model training logic
│   └── evaluator.py          # Model evaluation & visualization
├── app/                      # FastAPI backend (original)
│   └── main.py
├── models/                   # Pre-trained models
├── trained_models/           # Output directory for trained models
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
## Configuration Options
### Training Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| Learning Rate | 2e-5 | Optimizer learning rate |
| Epochs | 3 | Number of training epochs |
| Batch Size | 16 | Number of examples per training batch |
| Max Length | 256 | Maximum token sequence length |
| Train Split | 80% | Share of the data used for training |
| Validation Split | 10% | Share of the data used for validation |
| Test Split | 10% | Share of the data used for testing |
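
To make the split percentages concrete, here is a minimal sketch of a seeded 80/10/10 shuffle-and-slice split. The function name `split_dataset` is hypothetical, and the platform's trainer may split differently, for instance with stratification by label.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    rows: Sequence[T], train: float = 0.8, val: float = 0.1, seed: int = 42
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle rows with a fixed seed and slice into train/validation/test."""
    if not (0 < train < 1 and 0 <= val < 1 and train + val < 1):
        raise ValueError("train and val must be fractions that sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    # Whatever remains after train and validation becomes the test set.
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


train_rows, val_rows, test_rows = split_dataset(list(range(100)))
print(len(train_rows), len(val_rows), len(test_rows))
```

Because the shuffle uses its own seeded generator, repeated calls with the same seed return the same partition.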
### Advanced Options
| Parameter | Default | Description |
|-----------|---------|-------------|
| Warmup Ratio | 0.1 | Fraction of training steps used for learning-rate warmup |
| Weight Decay | 0.01 | L2 regularization |
| Random Seed | 42 | Seed for reproducibility |
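
The warmup ratio is easiest to understand as a schedule. The sketch below assumes linear warmup followed by linear decay to zero, a common choice when fine-tuning transformer models. The source does not state which schedule the platform uses, so treat the shape and the function name `lr_at_step` as assumptions.

```python
def lr_at_step(
    step: int, total_steps: int, base_lr: float = 2e-5, warmup_ratio: float = 0.1
) -> float:
    """Learning rate at a given step under linear warmup, then linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# With 1000 total steps and a 0.1 ratio, the first 100 steps are warmup.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

Under these assumptions the rate reaches its peak of 2e-5 at step 100 and returns to zero at step 1000.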
## API Integration
The platform also includes a FastAPI backend for programmatic access:
```python
import requests

# Send one text to the prediction endpoint and print the JSON response
response = requests.post(
    "http://localhost:8000/predict",
    json={"text": "Your text here"},
    timeout=30,
)
response.raise_for_status()  # Fail loudly on HTTP errors
print(response.json())
```
## Docker Support
```bash
# Build the image
docker build -t mlops-platform .
# Run the container
docker run -p 8501:8501 mlops-platform
```
## Sample Datasets
Load sample data directly from the UI by clicking the **Sample** button in the sidebar. Sample data is available for:
- English phishing detection
- Chinese phishing detection (中文钓鱼检测)
- Khmer content classification (ការចាត់ថ្នាក់មាតិកាខ្មែរ)
## Troubleshooting
### Common Issues
**Out of Memory Error**
- Reduce the batch size
- Use a smaller model (DistilBERT)
- Reduce the max sequence length

**Slow Training**
- Make sure training runs on a GPU if one is available
- Reduce the number of epochs
- Use a smaller dataset for testing

**Model Loading Errors**
- Make sure you have an internet connection so the models can be downloaded
- Check available disk space
- Try a different model architecture
## License
This project is licensed under the MIT License.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Support
For issues and feature requests, please open a GitHub issue.

---
<div align="center">
<p>Built with ❤️ using Streamlit & PyTorch</p>
<p>Supports: 🇬🇧 English | 🇨🇳 中文 | 🇰🇭 ភាសាខ្មែរ</p>
</div>