# Local Model Setup - Solution Summary

## 🎯 Problem Resolved

**Issue**: Training failed with `OSError: [Errno 116] Stale file handle` when trying to download or use models from the HuggingFace cache.

**Root Cause**: A corrupted NFS file handle in the HuggingFace cache directory prevented model access.

**Solution**: Downloaded the Mistral-7B-v0.1 model directly to the workspace, bypassing the corrupted cache.

---

## πŸ“¦ Model Location

```
/workspace/ftt/base_models/Mistral-7B-v0.1
```

**Size**: 28 GB (includes both PyTorch and SafeTensors formats)

**Contents**:
- βœ“ Model weights (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors)
- βœ“ Tokenizer (tokenizer.model, tokenizer.json)
- βœ“ Configuration files (config.json, generation_config.json)

---

## πŸ”§ Changes Made

### 1. Downloaded Model Locally
Used `huggingface-cli` to download the model directly to the workspace:
```bash
huggingface-cli download mistralai/Mistral-7B-v0.1 \
  --local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --local-dir-use-symlinks False
```

### 2. Updated Gradio Interface
**File**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`

**Change**: Updated default base model path from HuggingFace ID to local path:
```python
# Before:
value="mistralai/Mistral-7B-v0.1"

# After:
value="/workspace/ftt/base_models/Mistral-7B-v0.1"
```

### 3. Restarted Interface
Killed the old Gradio process and started a fresh instance with the updated configuration.

---

## πŸš€ How to Use

### Starting Training

1. **Access Gradio Interface**:
   - The interface is running on port 7860
   - Access via the public link displayed in the terminal

2. **Fine-tuning Tab**:
   - Base Model field now defaults to: `/workspace/ftt/base_models/Mistral-7B-v0.1`
   - You can still use HuggingFace model IDs if needed
   - Upload your dataset or use HuggingFace datasets
   - Configure training parameters
   - Click "Start Fine-tuning"

3. **Monitor Training**:
   - Status updates in real-time
   - Progress bar shows epoch and loss
   - Logs are scrollable with copy functionality

### Using Other Models

If you want to use a different base model:

**Option 1: Download Another Model Locally**
```bash
cd /workspace/ftt
source /venv/main/bin/activate

# Download model
huggingface-cli download <model-id> \
  --local-dir /workspace/ftt/base_models/<model-name> \
  --local-dir-use-symlinks False

# Use the path in Gradio:
# /workspace/ftt/base_models/<model-name>
```

**Option 2: Use HuggingFace ID Directly**
- Simply enter the model ID in the Base Model field (e.g., `mistralai/Mistral-7B-Instruct-v0.2`)
- The script will download it if it is not already cached (this may hit the same stale-file-handle error if the cache issue persists)
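
If you do take this route, one way to sidestep the corrupted cache is to point the standard `HF_HOME` environment variable at a fresh directory before launching the interface. A minimal sketch (the directory name below is an example choice, not an existing path):

```bash
# Redirect the HuggingFace cache away from the corrupted location.
# HF_HOME is the standard Hugging Face cache environment variable;
# the target directory here is illustrative.
export HF_HOME=/workspace/ftt/hf_cache
mkdir -p "$HF_HOME"
```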

---

## πŸ” Verification

### Check Model is Accessible
```bash
python3 << 'EOF'
from transformers import AutoTokenizer, AutoConfig

model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
config = AutoConfig.from_pretrained(model_path, local_files_only=True)

print(f"βœ“ Tokenizer: {len(tokenizer)} tokens")
print(f"βœ“ Model: {config.model_type}")
EOF
```

### Check Gradio Status
```bash
# Check process
ps aux | grep interface_app.py

# Check port
lsof -i :7860

# View logs (if started with nohup)
tail -f /tmp/gradio_interface.log
```

---

## πŸ“Š Interface Features

### Fine-tuning Section
- βœ“ File upload support (JSON/JSONL)
- βœ“ HuggingFace dataset integration
- βœ“ Automatic train/validation/test split
- βœ“ Max sequence length up to 6000
- βœ“ GPU-based parameter recommendations
- βœ“ Detailed tooltips for all parameters
- βœ“ Real-time progress tracking
- βœ“ Checkpoint/resume functionality

### API Hosting Section
- βœ“ Host fine-tuned models from local paths
- βœ“ Host models from HuggingFace repositories
- βœ“ FastAPI with automatic documentation
- βœ“ Health checks and status monitoring

### Test Inference Section
- βœ“ Test local fine-tuned models
- βœ“ Test HuggingFace models
- βœ“ Adjustable max-length (up to 6000)
- βœ“ Temperature control with tooltips
- βœ“ Uses API if running, otherwise direct loading

### UI Controls
- βœ“ Stop Training button
- βœ“ Refresh Status button
- βœ“ Scrollable logs with copy functionality
- βœ“ Progress bars for training
- βœ“ πŸ›‘ Shutdown Gradio Server button (System Controls)

---

## πŸ› Troubleshooting

### Issue: Cache errors persist
**Solution**: Always use local model paths from `/workspace/ftt/base_models/`

### Issue: Training logs not updating
**Solution**: 
1. Click "Refresh Status" button
2. Check that training process is running: `ps aux | grep finetune_mistral`

### Issue: Interface not accessible
**Solution**:
```bash
# Check if running
lsof -i :7860

# Restart if needed
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
```

### Issue: Out of memory during training
**Solution**:
1. Reduce batch size
2. Reduce max sequence length
3. Enable gradient checkpointing (already enabled in script)
4. Use LoRA with lower rank (r=8 instead of r=16)
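
The steps above can be combined into a single invocation. A hedged sketch, reusing the flags from the interface-generated fine-tuning command documented below (the dataset and output paths are placeholders):

```bash
# Same script and flags as the interface-generated command, but with
# memory-saving values: batch size 1, shorter sequences, lower LoRA rank.
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
  --base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --dataset /path/to/your/dataset.jsonl \
  --output-dir ./your-finetuned-model \
  --max-length 1024 \
  --num-epochs 3 \
  --batch-size 1 \
  --learning-rate 2e-4 \
  --lora-r 8 \
  --lora-alpha 16
```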

---

## πŸ“ Technical Details

### Training Script
**Location**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`

**Key Features**:
- LoRA fine-tuning for memory efficiency
- Gradient checkpointing enabled
- Automatic device detection (CUDA/MPS/CPU)
- Resume from checkpoint support
- JSON configuration export

### Fine-tuning Command (Generated by Interface)
```bash
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
  --base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --dataset /path/to/your/dataset.jsonl \
  --output-dir ./your-finetuned-model \
  --max-length 2048 \
  --num-epochs 3 \
  --batch-size 4 \
  --learning-rate 2e-4 \
  --lora-r 16 \
  --lora-alpha 32
```

---

## πŸŽ‰ Success Criteria

You'll know everything is working when:

1. βœ… Gradio interface loads without errors
2. βœ… Base model field shows local path
3. βœ… Training starts without cache errors
4. βœ… Progress updates appear in UI
5. βœ… Model weights are saved to output directory

---

## πŸ“š Related Files

- **Interface**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`
- **Training Script**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`
- **Base Model**: `/workspace/ftt/base_models/Mistral-7B-v0.1/`
- **Startup Script**: `/workspace/ftt/semicon-finetuning-scripts/start_interface.sh`
- **Requirements**: `/workspace/ftt/semicon-finetuning-scripts/requirements_interface.txt`

---

## πŸ†˜ Support

If you encounter any issues:

1. Check this document's troubleshooting section
2. Review the training logs in the UI
3. Check process status: `ps aux | grep -E "interface_app|finetune_mistral"`
4. Verify the cache directory is clean: `ls -lh /workspace/.hf_home/hub/`

---

*Last Updated: 2025-11-24*
*Solution: Local model download to bypass corrupted HuggingFace cache*