# πŸš€ MLOps Platform Startup Guide

Welcome to the MLOps Training Platform! This guide will help you get started quickly.

## ⚑ Quick Launch

### Option 1: Streamlit Web Interface (Recommended)

```bash
# Activate your virtual environment
# Windows:
venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# Launch the Streamlit app
streamlit run streamlit_app.py

# The app will open in your browser at http://localhost:8501
```

### Option 2: Programmatic Usage

```bash
# Run the example script
python example_usage.py
```

### Option 3: FastAPI Backend (Original)

```bash
# Run the FastAPI server
python -m uvicorn app.main:app --reload

# API will be available at http://localhost:8000
# Interactive docs at http://localhost:8000/docs
```

## πŸ“¦ First-Time Setup Checklist

- [ ] Python 3.8+ installed
- [ ] Virtual environment created (`python -m venv venv`)
- [ ] Virtual environment activated
- [ ] Dependencies installed (`pip install -r requirements.txt`)
- [ ] At least 4GB RAM available
- [ ] Internet connection (for downloading models)

## 🎯 Your First Training Session

### 1. Prepare Your Data

Create a CSV file with these columns:
- `text` - Your text samples
- `label` - Binary labels (0 or 1)

**Example: phishing_data.csv**

```csv
text,label
"Legitimate business email content",0
"URGENT: Click here to claim prize!",1
"Meeting scheduled for tomorrow",0
"Your account is compromised! Act now!",1
```

### 2. Launch the Platform

```bash
streamlit run streamlit_app.py
```

### 3. Follow the Workflow

1. **Data Upload Tab**
   - Upload your CSV file
   - Or click the "Sample" button to load example data
   - Verify data structure and class distribution

2. **Training Config Tab**
   - Select target language (English, Chinese, Khmer)
   - Choose model architecture (start with DistilBERT for CPU)
   - Adjust hyperparameters:
     - Epochs: 3-5 for most tasks
     - Batch size: 8-16 for CPU, 32-64 for GPU
     - Learning rate: 2e-5 (default is good)

3. **Training Tab**
   - Click "Start Training"
   - Monitor progress in real-time
   - Watch metrics update live

4. **Evaluation Tab**
   - Review final metrics
   - Test model with new text
   - Download trained model
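If you prefer to keep the Training Config choices in code, they map naturally to a plain dictionary. The key names below are illustrative only, not the platform's actual schema:

```python
# Illustrative config sketch; key names are assumptions, not the app's schema.
training_config = {
    "language": "english",  # english | chinese | khmer
    "model_name": "distilbert-base-multilingual-cased",
    "epochs": 3,            # 3-5 for most tasks
    "batch_size": 8,        # 8-16 on CPU, 32-64 on GPU
    "learning_rate": 2e-5,  # the default is a good starting point
    "max_length": 128,
}
```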

## 🌍 Language-Specific Tips

### English πŸ‡¬πŸ‡§
- Use RoBERTa or DistilBERT
- Standard preprocessing works well
- Fast training on CPU

### Chinese πŸ‡¨πŸ‡³
- Use mBERT or XLM-RoBERTa
- Automatic word segmentation with jieba
- May need more training time

### Khmer πŸ‡°πŸ‡­
- Use mBERT or XLM-RoBERTa
- Unicode normalization applied
- Ensure UTF-8 encoding in CSV
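If you want to pre-normalize text yourself, the standard library's `unicodedata` module covers it. NFC is shown here as a common choice; the platform's exact normalization form is not specified in this guide:

```python
import unicodedata

def normalize_text(text):
    """Normalize to NFC so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)
```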

## πŸ’‘ Pro Tips

### For CPU Training
```text
# In Training Config:
- Model: distilbert-base-multilingual-cased
- Batch size: 8
- Max length: 128
- Epochs: 3
```

### For GPU Training
```text
# In Training Config:
- Model: xlm-roberta-base
- Batch size: 32
- Max length: 256
- Epochs: 5
```

### Dealing with Imbalanced Data
- Ensure both classes have sufficient samples (min 20-30 each)
- Consider using stratified sampling
- Monitor precision and recall, not just accuracy
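Stratified sampling just means splitting each class separately so both sides keep the original class balance. A minimal plain-Python sketch (no scikit-learn), offered as an illustration rather than the platform's own splitter:

```python
import random

def stratified_split(samples, labels, test_frac=0.2, seed=42):
    """Split (samples, labels) so each class appears on both sides in proportion."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    train, test = [], []
    for y, items in by_class.items():
        items = items[:]          # shuffle a copy, not the caller's data
        rng.shuffle(items)
        cut = max(1, int(len(items) * test_frac))
        test += [(s, y) for s in items[:cut]]
        train += [(s, y) for s in items[cut:]]
    return train, test
```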

## πŸ› Common Issues & Solutions

### Issue: "Out of Memory"
**Solutions:**
- Reduce batch size to 4 or 8
- Use DistilBERT instead of larger models
- Reduce max sequence length to 128

### Issue: "Model download fails"
**Solutions:**
- Check internet connection
- Try with VPN if blocked
- Manually download model from Hugging Face Hub

### Issue: "Training too slow"
**Solutions:**
- Use smaller model (DistilBERT)
- Reduce dataset size for testing
- Check if GPU is available: `torch.cuda.is_available()`
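A small helper makes that check safe to run even on machines where PyTorch is not installed (the try/except is just defensive; adapt as needed):

```python
def pick_device():
    """Return "cuda" if PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"  # PyTorch not installed; fall back to CPU

print(pick_device())
```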

### Issue: "Low accuracy"
**Solutions:**
- Increase number of epochs
- Try different learning rate (3e-5 or 5e-5)
- Ensure data quality and labels are correct
- Use more training data

## πŸ“Š Understanding Metrics

| Metric | What it means | When to focus on it |
|--------|---------------|---------------------|
| **Accuracy** | Overall correct predictions | Balanced datasets |
| **Precision** | Of predicted positives, how many are correct | Minimize false alarms |
| **Recall** | Of actual positives, how many found | Don't miss any positives |
| **F1 Score** | Balance of precision and recall | General performance |

## πŸ”— Useful Resources

- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [Streamlit Documentation](https://docs.streamlit.io)
- [PyTorch Tutorials](https://pytorch.org/tutorials/)

## πŸ†˜ Getting Help

1. Check the troubleshooting section in MLOPS_README.md
2. Review the logs in the Training tab
3. Run `example_usage.py` to test programmatically
4. Check console output for detailed error messages

## πŸŽ‰ Next Steps

After successfully training your first model:

1. **Export Model**: Download from Evaluation tab
2. **Deploy**: Use with FastAPI backend or integrate elsewhere
3. **Iterate**: Try different languages, models, hyperparameters
4. **Scale**: Train on larger datasets with GPU

---

**Happy Training! πŸš€**

For detailed documentation, see [MLOPS_README.md](MLOPS_README.md)