# 🚀 MLOps Platform Startup Guide
Welcome to the MLOps Training Platform! This guide will help you get started quickly.
## ⚡ Quick Launch
### Option 1: Streamlit Web Interface (Recommended)
```bash
# Activate your virtual environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate
# Launch the Streamlit app
streamlit run streamlit_app.py
# The app will open in your browser at http://localhost:8501
```
### Option 2: Programmatic Usage
```bash
# Run the example script
python example_usage.py
```
### Option 3: FastAPI Backend (Original)
```bash
# Run the FastAPI server
python -m uvicorn app.main:app --reload
# API will be available at http://localhost:8000
# Interactive docs at http://localhost:8000/docs
```
## 📦 First-Time Setup Checklist
- [ ] Python 3.8+ installed
- [ ] Virtual environment created (`python -m venv venv`)
- [ ] Virtual environment activated
- [ ] Dependencies installed (`pip install -r requirements.txt`)
- [ ] At least 4GB RAM available
- [ ] Internet connection (for downloading models)
## 🎯 Your First Training Session
### 1. Prepare Your Data
Create a CSV file with these columns:
- `text` - Your text samples
- `label` - Binary labels (0 or 1)
**Example: phishing_data.csv**
```csv
text,label
"Legitimate business email content",0
"URGENT: Click here to claim prize!",1
"Meeting scheduled for tomorrow",0
"Your account is compromised! Act now!",1
```
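Before uploading, it can save a round trip to sanity-check the file locally. A minimal sketch using only the standard library (the file name matches the example above; the check itself is not part of the platform):

```python
import csv
from collections import Counter

def validate_dataset(path):
    """Check that a CSV has 'text' and 'label' columns with binary labels."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or {"text", "label"} - set(reader.fieldnames):
            raise ValueError("CSV must have 'text' and 'label' columns")
        counts = Counter()
        for row in reader:
            if row["label"] not in ("0", "1"):
                raise ValueError(f"Non-binary label: {row['label']!r}")
            counts[row["label"]] += 1
    return dict(counts)

# Write the sample rows from above and validate them
rows = [
    ("Legitimate business email content", 0),
    ("URGENT: Click here to claim prize!", 1),
    ("Meeting scheduled for tomorrow", 0),
    ("Your account is compromised! Act now!", 1),
]
with open("phishing_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    writer.writerows(rows)

print(validate_dataset("phishing_data.csv"))  # {'0': 2, '1': 2}
```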
### 2. Launch the Platform
```bash
streamlit run streamlit_app.py
```
### 3. Follow the Workflow
1. **Data Upload Tab**
- Upload your CSV file
   - Or click the "Sample" button to load example data
- Verify data structure and class distribution
2. **Training Config Tab**
- Select target language (English, Chinese, Khmer)
- Choose model architecture (start with DistilBERT for CPU)
- Adjust hyperparameters:
- Epochs: 3-5 for most tasks
- Batch size: 8-16 for CPU, 32-64 for GPU
- Learning rate: 2e-5 (default is good)
3. **Training Tab**
- Click "Start Training"
- Monitor progress in real-time
- Watch metrics update live
4. **Evaluation Tab**
- Review final metrics
- Test model with new text
- Download trained model
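The starting values recommended in step 2 can be written down as a plain config dict so you can keep track of what you tried between runs. Key names here are illustrative, not the platform's actual config schema:

```python
# Illustrative config capturing the recommended starting values from step 2.
# Key names are hypothetical; the platform's own schema may differ.
training_config = {
    "language": "english",                          # or "chinese", "khmer"
    "model": "distilbert-base-multilingual-cased",  # good CPU starting point
    "epochs": 3,            # 3-5 for most tasks
    "batch_size": 8,        # 8-16 on CPU, 32-64 on GPU
    "learning_rate": 2e-5,  # the default is a good starting point
}
```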
## 🌍 Language-Specific Tips
### English 🇬🇧
- Use RoBERTa or DistilBERT
- Standard preprocessing works well
- Fast training on CPU
### Chinese 🇨🇳
- Use mBERT or XLM-RoBERTa
- Automatic word segmentation with jieba
- May need more training time
### Khmer 🇰🇭
- Use mBERT or XLM-RoBERTa
- Unicode normalization applied
- Ensure UTF-8 encoding in CSV
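The normalization mentioned above can be reproduced locally with Python's standard `unicodedata` module. A minimal sketch, assuming NFC form (the guide does not specify which form the platform applies) and an example file name:

```python
import unicodedata

def normalize_text(s: str) -> str:
    """NFC normalization: composes equivalent code-point sequences into one form."""
    return unicodedata.normalize("NFC", s)

# NFC composes 'e' + combining acute (two code points) into 'é' (one code point)
assert normalize_text("e\u0301") == "\u00e9"

# When preparing Khmer CSVs, always write with an explicit UTF-8 encoding
with open("khmer_data.csv", "w", encoding="utf-8") as f:
    f.write("text,label\n")
    f.write("សួស្តី,0\n")  # "hello" in Khmer
```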
## 💡 Pro Tips
### For CPU Training
In the Training Config tab, set:
- Model: `distilbert-base-multilingual-cased`
- Batch size: 8
- Max length: 128
- Epochs: 3
### For GPU Training
In the Training Config tab, set:
- Model: `xlm-roberta-base`
- Batch size: 32
- Max length: 256
- Epochs: 5
### Dealing with Imbalanced Data
- Ensure both classes have sufficient samples (min 20-30 each)
- Consider using stratified sampling
- Monitor precision and recall, not just accuracy
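Stratified sampling means splitting the data so each class keeps the same proportion in the train and test halves. A minimal pure-Python sketch (libraries like scikit-learn offer this built in):

```python
import random

def stratified_split(rows, labels, test_frac=0.2, seed=0):
    """Split (rows, labels) so each class keeps the same proportion in both halves."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = int(len(items) * test_frac)
        test += [(r, label) for r in items[:cut]]
        train += [(r, label) for r in items[cut:]]
    return train, test

texts = [f"sample {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20   # imbalanced: 80/20
train, test = stratified_split(texts, labels)
# Both splits preserve the 80/20 class ratio
```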
## 🐛 Common Issues & Solutions
### Issue: "Out of Memory"
**Solutions:**
- Reduce batch size to 4 or 8
- Use DistilBERT instead of larger models
- Reduce max sequence length to 128
### Issue: "Model download fails"
**Solutions:**
- Check internet connection
- Try with VPN if blocked
- Manually download model from Hugging Face Hub
### Issue: "Training too slow"
**Solutions:**
- Use smaller model (DistilBERT)
- Reduce dataset size for testing
- Check if GPU is available: `torch.cuda.is_available()`
### Issue: "Low accuracy"
**Solutions:**
- Increase number of epochs
- Try different learning rate (3e-5 or 5e-5)
- Ensure data quality and labels are correct
- Use more training data
## 📊 Understanding Metrics
| Metric | What it means | When to focus on it |
|--------|---------------|---------------------|
| **Accuracy** | Overall correct predictions | Balanced datasets |
| **Precision** | Of predicted positives, how many are correct | Minimize false alarms |
| **Recall** | Of actual positives, how many are found | Don't miss any positives |
| **F1 Score** | Balance of precision and recall | General performance |
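All four metrics in the table derive from the four confusion-matrix counts (true/false positives and negatives). A small sketch that makes the formulas concrete:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# e.g. 8 phishing mails caught, 2 false alarms, 2 missed, 88 legit mails correct
m = classification_metrics(tp=8, fp=2, fn=2, tn=88)
# High accuracy (0.96) despite only 0.8 precision/recall: this is why accuracy
# alone can mislead on imbalanced data.
```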
## 📚 Useful Resources
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [Streamlit Documentation](https://docs.streamlit.io)
- [PyTorch Tutorials](https://pytorch.org/tutorials/)
## 🆘 Getting Help
1. Check the troubleshooting section in MLOPS_README.md
2. Review the logs in the training tab
3. Run `example_usage.py` to test programmatically
4. Check console output for detailed error messages
## 🚀 Next Steps
After successfully training your first model:
1. **Export Model**: Download from Evaluation tab
2. **Deploy**: Use with FastAPI backend or integrate elsewhere
3. **Iterate**: Try different languages, models, hyperparameters
4. **Scale**: Train on larger datasets with GPU
---
**Happy Training! 🎉**
For detailed documentation, see [MLOPS_README.md](MLOPS_README.md)