Prithvik-1 committed on
Commit 6d94a60 · verified · 1 Parent(s): 3ba49d5

Upload docs/LOCAL_MODEL_SETUP.md with huggingface_hub

Files changed (1): docs/LOCAL_MODEL_SETUP.md ADDED (+257, −0)

# Local Model Setup - Solution Summary

## 🎯 Problem Resolved

**Issue**: Training failed with `OSError: [Errno 116] Stale file handle` when trying to download/use models from the HuggingFace cache.

**Root Cause**: A corrupted NFS file handle in the HuggingFace cache directory prevented model access.

**Solution**: Downloaded the Mistral-7B-v0.1 model directly to the workspace, bypassing the corrupted cache.

---
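This failure mode can be detected up front. A minimal sketch (a hypothetical helper, not part of the repo's scripts) that probes a cache directory before handing it to `transformers`:

```python
import os


def cache_is_readable(path: str) -> bool:
    """Probe a cache directory before use.

    A corrupted NFS mount typically surfaces as OSError errno 116
    (ESTALE, "Stale file handle") on the first directory listing;
    a missing or unreadable directory raises OSError as well.
    """
    try:
        os.listdir(path)
        return True
    except OSError:
        return False
```

If the probe fails, fall back to a local model directory instead of the shared cache.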

## 📦 Model Location

```
/workspace/ftt/base_models/Mistral-7B-v0.1
```

**Size**: 28 GB (includes both PyTorch and SafeTensors formats)

**Contents**:
- ✓ Model weights (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors)
- ✓ Tokenizer (tokenizer.model, tokenizer.json)
- ✓ Configuration files (config.json, generation_config.json)

---

## 🔧 Changes Made

### 1. Downloaded Model Locally
Used `huggingface-cli` to download the model directly to the workspace:
```bash
huggingface-cli download mistralai/Mistral-7B-v0.1 \
  --local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --local-dir-use-symlinks False
```

### 2. Updated Gradio Interface
**File**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`

**Change**: Updated the default base model path from a HuggingFace ID to the local path:
```python
# Before:
value="mistralai/Mistral-7B-v0.1"

# After:
value="/workspace/ftt/base_models/Mistral-7B-v0.1"
```

### 3. Restarted Interface
Killed the old Gradio process and started a fresh instance with the updated configuration.

---

## 🚀 How to Use

### Starting Training

1. **Access Gradio Interface**:
   - The interface is running on port 7860
   - Access it via the public link displayed in the terminal

2. **Fine-tuning Tab**:
   - The Base Model field now defaults to: `/workspace/ftt/base_models/Mistral-7B-v0.1`
   - You can still use HuggingFace model IDs if needed
   - Upload your dataset or use HuggingFace datasets
   - Configure training parameters
   - Click "Start Fine-tuning"

3. **Monitor Training**:
   - Status updates in real time
   - Progress bar shows epoch and loss
   - Logs are scrollable with copy functionality

### Using Other Models

If you want to use a different base model:

**Option 1: Download Another Model Locally**
```bash
cd /workspace/ftt
source /venv/main/bin/activate

# Download model
huggingface-cli download <model-id> \
  --local-dir /workspace/ftt/base_models/<model-name> \
  --local-dir-use-symlinks False

# Use the path in Gradio:
# /workspace/ftt/base_models/<model-name>
```

**Option 2: Use a HuggingFace ID Directly**
- Simply enter the model ID in the Base Model field (e.g., `mistralai/Mistral-7B-Instruct-v0.2`)
- The script will download it if it is not cached (this may hit the same cache issues if they persist)

---
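Both options can share the one Base Model text field. A sketch of how the value might be classified (`resolve_base_model` is a hypothetical helper; the actual logic in `interface_app.py` may differ):

```python
import os


def resolve_base_model(ref: str) -> tuple[str, str]:
    """Treat an existing directory as a local model; anything else
    is passed through as a HuggingFace Hub model ID."""
    kind = "local" if os.path.isdir(ref) else "hub"
    return kind, ref
```

For the "local" case, `from_pretrained` can then be called with `local_files_only=True` so the shared cache is never touched.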

## 🔍 Verification

### Check Model is Accessible
```bash
python3 << 'EOF'
from transformers import AutoTokenizer, AutoConfig

model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
config = AutoConfig.from_pretrained(model_path, local_files_only=True)

print(f"✓ Tokenizer: {len(tokenizer)} tokens")
print(f"✓ Model: {config.model_type}")
EOF
```

### Check Gradio Status
```bash
# Check process
ps aux | grep interface_app.py

# Check port
lsof -i :7860

# View logs (if started with nohup)
tail -f /tmp/gradio_interface.log
```

---

## 📊 Interface Features

### Fine-tuning Section
- ✓ File upload support (JSON/JSONL)
- ✓ HuggingFace dataset integration
- ✓ Automatic train/validation/test split
- ✓ Max sequence length up to 6000
- ✓ GPU-based parameter recommendations
- ✓ Detailed tooltips for all parameters
- ✓ Real-time progress tracking
- ✓ Checkpoint/resume functionality
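The automatic train/validation/test split above can be sketched as follows (illustrative only; the fractions and seed are assumptions, not the interface's actual defaults):

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle once, then slice into train/validation/test;
    whatever remains after train and validation becomes the test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Fixing the seed keeps the split reproducible across checkpoint resumes.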

### API Hosting Section
- ✓ Host fine-tuned models from local paths
- ✓ Host models from HuggingFace repositories
- ✓ FastAPI with automatic documentation
- ✓ Health checks and status monitoring

### Test Inference Section
- ✓ Test local fine-tuned models
- ✓ Test HuggingFace models
- ✓ Adjustable max length (up to 6000)
- ✓ Temperature control with tooltips
- ✓ Uses the API if running, otherwise direct loading

### UI Controls
- ✓ Stop Training button
- ✓ Refresh Status button
- ✓ Scrollable logs with copy functionality
- ✓ Progress bars for training
- ✓ 🛑 Shutdown Gradio Server button (System Controls)

---

## 🐛 Troubleshooting

### Issue: Cache errors persist
**Solution**: Always use local model paths from `/workspace/ftt/base_models/`

### Issue: Training logs not updating
**Solution**:
1. Click the "Refresh Status" button
2. Check that the training process is running: `ps aux | grep finetune_mistral`

### Issue: Interface not accessible
**Solution**:
```bash
# Check if running
lsof -i :7860

# Restart if needed
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
```

### Issue: Out of memory during training
**Solution**:
1. Reduce the batch size
2. Reduce the max sequence length
3. Enable gradient checkpointing (already enabled in the script)
4. Use LoRA with a lower rank (r=8 instead of r=16)
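Why step 4 helps: LoRA's adapter size scales linearly with the rank, since each adapted weight of shape (d_out, d_in) trains two small matrices of r·d_in and d_out·r parameters. A quick back-of-the-envelope calculation (the 4096-dimension projection shapes below are illustrative, not Mistral's exact layout):

```python
def lora_trainable_params(target_shapes, r):
    """Adapter parameters added by LoRA: each adapted weight of shape
    (d_out, d_in) gains A (r x d_in) plus B (d_out x r)."""
    return sum(r * (d_in + d_out) for d_out, d_in in target_shapes)


# Four hypothetical 4096x4096 projections in one transformer layer:
layer_shapes = [(4096, 4096)] * 4
```

Halving r from 16 to 8 halves the adapter's trainable parameters (and optimizer state), at some cost in adapter capacity.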

---

## 📝 Technical Details

### Training Script
**Location**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`

**Key Features**:
- LoRA fine-tuning for memory efficiency
- Gradient checkpointing enabled
- Automatic device detection (CUDA/MPS/CPU)
- Resume-from-checkpoint support
- JSON configuration export

### Fine-tuning Command (Generated by Interface)
```bash
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
  --base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --dataset /path/to/your/dataset.jsonl \
  --output-dir ./your-finetuned-model \
  --max-length 2048 \
  --num-epochs 3 \
  --batch-size 4 \
  --learning-rate 2e-4 \
  --lora-r 16 \
  --lora-alpha 32
```
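A hypothetical reconstruction of how the interface could assemble that command from its form fields (the real `interface_app.py` may build it differently):

```python
def build_finetune_cmd(cfg: dict) -> list[str]:
    """Turn interface settings into an argv list for the training script."""
    cmd = ["python3", "-u", cfg["script"],
           "--base-model", cfg["base_model"],
           "--dataset", cfg["dataset"],
           "--output-dir", cfg["output_dir"]]
    # Optional numeric flags, emitted only when the field was set:
    for flag in ("max-length", "num-epochs", "batch-size",
                 "learning-rate", "lora-r", "lora-alpha"):
        key = flag.replace("-", "_")
        if key in cfg:
            cmd += ["--" + flag, str(cfg[key])]
    return cmd
```

Passing an argv list (rather than a shell string) to `subprocess` avoids quoting problems with spaces in dataset paths.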

---

## 🎉 Success Criteria

You'll know everything is working when:

1. ✅ Gradio interface loads without errors
2. ✅ Base Model field shows the local path
3. ✅ Training starts without cache errors
4. ✅ Progress updates appear in the UI
5. ✅ Model weights are saved to the output directory

---

## 📚 Related Files

- **Interface**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`
- **Training Script**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`
- **Base Model**: `/workspace/ftt/base_models/Mistral-7B-v0.1/`
- **Startup Script**: `/workspace/ftt/semicon-finetuning-scripts/start_interface.sh`
- **Requirements**: `/workspace/ftt/semicon-finetuning-scripts/requirements_interface.txt`

---

## 🆘 Support

If you encounter any issues:

1. Check this document's troubleshooting section
2. Review the training logs in the UI
3. Check process status: `ps aux | grep -E "interface_app|finetune_mistral"`
4. Verify the cache directories are clean: `ls -lh /workspace/.hf_home/hub/`

---

*Last Updated: 2025-11-24*
*Solution: Local model download to bypass corrupted HuggingFace cache*