# MLOps Platform Startup Guide

Welcome to the MLOps Training Platform! This guide will help you get started quickly.

## Quick Launch

### Option 1: Streamlit Web Interface (Recommended)

```bash
# Activate your virtual environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# Launch the Streamlit app
streamlit run streamlit_app.py

# The app will open in your browser at http://localhost:8501
```

### Option 2: Programmatic Usage

```bash
# Run the example script
python example_usage.py
```

### Option 3: FastAPI Backend (Original)

```bash
# Run the FastAPI server
python -m uvicorn app.main:app --reload

# API will be available at http://localhost:8000
# Interactive docs at http://localhost:8000/docs
```
## First-Time Setup Checklist

- [ ] Python 3.8+ installed
- [ ] Virtual environment created (`python -m venv venv`)
- [ ] Virtual environment activated
- [ ] Dependencies installed (`pip install -r requirements.txt`)
- [ ] At least 4GB RAM available
- [ ] Internet connection (for downloading models)

## Your First Training Session

### 1. Prepare Your Data

Create a CSV file with these columns:

- `text` - Your text samples
- `label` - Binary labels (0 or 1)

**Example: phishing_data.csv**

```csv
text,label
"Legitimate business email content",0
"URGENT: Click here to claim prize!",1
"Meeting scheduled for tomorrow",0
"Your account is compromised! Act now!",1
```
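Before uploading, it can help to sanity-check the file programmatically. A minimal sketch using only the standard library; the column names and sample rows follow the example above, and `validate_rows` is an illustrative helper, not part of the platform:

```python
import csv
import io

# Sample rows in the same shape as phishing_data.csv above
sample = '''text,label
"Legitimate business email content",0
"URGENT: Click here to claim prize!",1
"Meeting scheduled for tomorrow",0
"Your account is compromised! Act now!",1
'''

def validate_rows(fileobj):
    """Check that the CSV has text/label columns and binary labels."""
    reader = csv.DictReader(fileobj)
    assert reader.fieldnames == ["text", "label"], "expected columns: text,label"
    counts = {"0": 0, "1": 0}
    for row in reader:
        assert row["label"] in counts, f"non-binary label: {row['label']!r}"
        assert row["text"].strip(), "empty text sample"
        counts[row["label"]] += 1
    return counts

print(validate_rows(io.StringIO(sample)))  # class distribution per label
```

Running this on your own file (pass `open("phishing_data.csv", encoding="utf-8")` instead of the in-memory sample) catches header and label mistakes before they surface as confusing training errors.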
### 2. Launch the Platform

```bash
streamlit run streamlit_app.py
```

### 3. Follow the Workflow

1. **Data Upload Tab**
   - Upload your CSV file
   - Or click the "Sample" button to load example data
   - Verify the data structure and class distribution
2. **Training Config Tab**
   - Select the target language (English, Chinese, Khmer)
   - Choose a model architecture (start with DistilBERT for CPU)
   - Adjust hyperparameters:
     - Epochs: 3-5 for most tasks
     - Batch size: 8-16 for CPU, 32-64 for GPU
     - Learning rate: 2e-5 (the default is a good starting point)
3. **Training Tab**
   - Click "Start Training"
   - Monitor progress in real time
   - Watch metrics update live
4. **Evaluation Tab**
   - Review the final metrics
   - Test the model with new text
   - Download the trained model
## Language-Specific Tips

### English 🇬🇧

- Use RoBERTa or DistilBERT
- Standard preprocessing works well
- Fast training on CPU

### Chinese 🇨🇳

- Use mBERT or XLM-RoBERTa
- Automatic word segmentation with jieba
- May need more training time

### Khmer 🇰🇭

- Use mBERT or XLM-RoBERTa
- Unicode normalization applied
- Ensure UTF-8 encoding in CSV
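The platform applies Unicode normalization for Khmer; the exact form it uses is not specified here, but you can normalize text yourself before upload with the standard library. This sketch uses NFC, a common choice, illustrated with a Latin example for readability:

```python
import unicodedata

def normalize_text(s, form="NFC"):
    """Normalize Unicode so visually identical strings compare equal."""
    return unicodedata.normalize(form, s)

# The same character can be encoded two ways:
decomposed = "e\u0301"   # base letter + combining accent (two code points)
precomposed = "\u00e9"   # single precomposed code point
assert decomposed != precomposed                   # raw strings differ
assert normalize_text(decomposed) == precomposed   # NFC makes them equal
```

The same issue arises with Khmer combining signs, so normalizing your CSV text up front keeps duplicate-looking samples from being treated as distinct tokens.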
## Pro Tips

### For CPU Training

```text
# In Training Config:
- Model: distilbert-base-multilingual-cased
- Batch size: 8
- Max length: 128
- Epochs: 3
```

### For GPU Training

```text
# In Training Config:
- Model: xlm-roberta-base
- Batch size: 32
- Max length: 256
- Epochs: 5
```
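If you script your runs instead of using the UI, the same recommendations can be captured as plain dictionaries. The key names below are illustrative, not the platform's actual API; the values mirror the CPU/GPU settings above:

```python
# Illustrative config dicts; key names are hypothetical,
# values mirror the recommended CPU/GPU settings.
cpu_config = {
    "model_name": "distilbert-base-multilingual-cased",
    "batch_size": 8,
    "max_length": 128,
    "epochs": 3,
}

# GPU config overrides only what differs from the CPU baseline
gpu_config = {
    **cpu_config,
    "model_name": "xlm-roberta-base",
    "batch_size": 32,
    "max_length": 256,
    "epochs": 5,
}

print(cpu_config["model_name"], "->", gpu_config["model_name"])
```

Deriving the GPU config from the CPU baseline with `**cpu_config` keeps shared settings (like the learning rate, if you add one) defined in exactly one place.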
### Dealing with Imbalanced Data

- Ensure both classes have sufficient samples (at least 20-30 each)
- Consider using stratified sampling
- Monitor precision and recall, not just accuracy
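Stratified sampling keeps the class ratio the same in the train and test splits, so a small minority class is never dropped from either side. A minimal pure-Python sketch (real pipelines would more likely use `sklearn.model_selection.train_test_split` with its `stratify=` argument):

```python
import random
from collections import defaultdict

def stratified_split(samples, test_fraction=0.2, seed=42):
    """Split (text, label) pairs so each class keeps the same ratio."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        # At least one test sample per class, even for tiny classes
        n_test = max(1, int(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Imbalanced toy data: 8 negatives, 2 positives
data = [(f"ham {i}", 0) for i in range(8)] + [(f"spam {i}", 1) for i in range(2)]
train, test = stratified_split(data)
# Both splits contain the minority class
assert any(lbl == 1 for _, lbl in train)
assert any(lbl == 1 for _, lbl in test)
```

With a plain random split on data this small, the two positive samples could easily both land in the test set, leaving the model nothing to learn the positive class from.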
## Common Issues & Solutions

### Issue: "Out of Memory"

**Solutions:**
- Reduce the batch size to 4 or 8
- Use DistilBERT instead of larger models
- Reduce the max sequence length to 128

### Issue: "Model download fails"

**Solutions:**
- Check your internet connection
- Try a VPN if the download is blocked
- Manually download the model from the Hugging Face Hub

### Issue: "Training too slow"

**Solutions:**
- Use a smaller model (DistilBERT)
- Reduce the dataset size for testing
- Check if a GPU is available: `torch.cuda.is_available()`
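A defensive version of that GPU check, which also behaves sensibly when PyTorch is not installed in the current environment:

```python
def describe_device():
    """Report whether training can use a GPU; safe if torch is absent."""
    try:
        import torch
    except ImportError:
        return "torch not installed (CPU-only)"
    if torch.cuda.is_available():
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "no GPU detected; training will run on CPU"

print(describe_device())
```

If this reports CPU-only, the CPU settings under Pro Tips above are the ones to use.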
### Issue: "Low accuracy"

**Solutions:**
- Increase the number of epochs
- Try a different learning rate (3e-5 or 5e-5)
- Ensure data quality and correct labels
- Use more training data

## Understanding Metrics

| Metric | What it means | When to focus on it |
|--------|---------------|---------------------|
| **Accuracy** | Overall fraction of correct predictions | Balanced datasets |
| **Precision** | Of predicted positives, how many are correct | Minimizing false alarms |
| **Recall** | Of actual positives, how many were found | Not missing positives |
| **F1 Score** | Harmonic mean of precision and recall | General performance |
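All four metrics in the table derive directly from the confusion-matrix counts (true/false positives and negatives); a self-contained sketch:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# 8 true positives, 2 false alarms, 1 missed positive, 9 true negatives
m = classification_metrics(tp=8, fp=2, fn=1, tn=9)
print(round(m["precision"], 2), round(m["recall"], 2))  # 0.8 0.89
```

Note how precision and recall tell different stories from the same counts: the two false alarms pull precision down, while the one missed positive pulls recall down, and accuracy alone would hide which failure mode dominates.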
## Useful Resources

- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [Streamlit Documentation](https://docs.streamlit.io)
- [PyTorch Tutorials](https://pytorch.org/tutorials/)

## Getting Help

1. Check the troubleshooting section in MLOPS_README.md
2. Review the logs in the Training tab
3. Run `example_usage.py` to test programmatically
4. Check the console output for detailed error messages

## Next Steps

After successfully training your first model:

1. **Export the Model**: Download it from the Evaluation tab
2. **Deploy**: Serve it with the FastAPI backend or integrate it elsewhere
3. **Iterate**: Try different languages, models, and hyperparameters
4. **Scale**: Train on larger datasets with a GPU

---

**Happy Training!**

For detailed documentation, see [MLOPS_README.md](MLOPS_README.md)