Prithvik-1 committed 244a62f (verified, parent: 6d94a60)

Upload docs/MODEL_INFERENCE_FIXES.md with huggingface_hub

Files changed (1): docs/MODEL_INFERENCE_FIXES.md (added, +428 lines)
# Model Inference Fixes - Complete Guide

## 🎉 Issues Resolved

### Issue 1: New Fine-tuned Model Not Showing in UI
**Status**: ✅ FIXED

**Problem**: After completing fine-tuning, the new model `mistral-finetuned-fifo1` was not appearing in the dropdown lists for API Hosting or Test Inference.

**Root Cause**: The `list_models()` function was only checking:
- `/workspace/ftt/` (parent directory)
- `/workspace/ftt/semicon-finetuning-scripts/models/msp/` (MODELS_DIR)

But the new model was saved to:
- `/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1` (BASE_DIR)

**Solution**: Updated the `list_models()` function to also scan `BASE_DIR`:

```python
def list_models():
    """List available fine-tuned models"""
    models = []

    # Check in BASE_DIR (semicon-finetuning-scripts directory) - NEW!
    for item in BASE_DIR.iterdir():
        if item.is_dir() and "mistral" in item.name.lower() and not item.name.startswith('.'):
            models.append(str(item))

    # Check in BASE_DIR parent (ftt directory)
    ftt_dir = BASE_DIR.parent
    for item in ftt_dir.iterdir():
        if item.is_dir() and "mistral" in item.name.lower():
            models.append(str(item))

    # Check in MODELS_DIR
    if MODELS_DIR.exists():
        for item in MODELS_DIR.iterdir():
            if item.is_dir() and "mistral" in item.name.lower():
                models.append(str(item))

    return sorted(set(models)) if models else ["No models found"]
```
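The new BASE_DIR scan can be exercised in isolation against a temporary directory tree, without launching the app (a sketch; `scan_for_models` is a hypothetical standalone copy of the filter above):

```python
import tempfile
from pathlib import Path

def scan_for_models(base_dir: Path) -> list[str]:
    """Standalone version of the new BASE_DIR scan: any non-hidden
    directory whose name contains 'mistral' counts as a model."""
    return sorted(
        str(item) for item in base_dir.iterdir()
        if item.is_dir()
        and "mistral" in item.name.lower()
        and not item.name.startswith(".")
    )

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "mistral-finetuned-fifo1").mkdir()  # should be found
    (base / "datasets").mkdir()                 # no 'mistral' in the name
    (base / ".mistral-cache").mkdir()           # hidden, skipped
    found = [Path(p).name for p in scan_for_models(base)]
    print(found)  # → ['mistral-finetuned-fifo1']
```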
43
+
44
+ **File Modified**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py` (lines 116-133)
45
+
46
+ ---
47
+
48
+ ### Issue 2: API Hosting Server Not Starting
49
+ **Status**: βœ… FIXED
50
+
51
+ **Problem**: When trying to start the API hosting server with the fine-tuned model, it failed with:
52
+ ```
53
+ OSError: [Errno 116] Stale file handle:
54
+ '/workspace/.hf_home/hub/models--mistralai--Mistral-7B-v0.1/blobs/...'
55
+ ```
56
+
57
+ **Root Cause**:
58
+ 1. The fine-tuned model is a **LoRA adapter** (not a full model)
59
+ 2. To use it, the API server must load the **base model** first, then apply the LoRA adapter
60
+ 3. The inference script was hardcoded to load `mistralai/Mistral-7B-v0.1` from HuggingFace
61
+ 4. This triggered the corrupted cache issue again
62
+
63
+ **Solution**: Updated the inference script to use the local base model we downloaded earlier:
64
+
65
+ ```python
66
+ if is_lora:
67
+ # Load base model - prefer local model to avoid cache issues
68
+ local_base_model = "/workspace/ftt/base_models/Mistral-7B-v0.1"
69
+
70
+ # Check if local model exists, otherwise use HuggingFace
71
+ if os.path.exists(local_base_model):
72
+ base_model_name = local_base_model
73
+ print(f"Loading base model from local: {base_model_name}")
74
+ else:
75
+ base_model_name = "mistralai/Mistral-7B-v0.1"
76
+ print(f"Loading base model from HuggingFace: {base_model_name}")
77
+
78
+ base_model = AutoModelForCausalLM.from_pretrained(
79
+ base_model_name,
80
+ local_files_only=os.path.exists(local_base_model),
81
+ **get_model_kwargs(use_quantization)
82
+ )
83
+
84
+ # Load LoRA adapter
85
+ print("Loading LoRA adapter...")
86
+ model = PeftModel.from_pretrained(base_model, model_path)
87
+ model = model.merge_and_unload() # Merge adapter weights
88
+ ```
89
+
90
+ **File Modified**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/inference/inference_mistral7b.py` (lines 96-109)
91
+
92
+ ---
93
+
94
+ ## πŸ“¦ Your Fine-tuned Model
95
+
96
+ **Location**: `/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1`
97
+
98
+ **Type**: LoRA Adapter (161 MB)
99
+
100
+ **Contents**:
101
+ ```
102
+ mistral-finetuned-fifo1/
103
+ β”œβ”€β”€ adapter_model.safetensors # LoRA weights (161 MB)
104
+ β”œβ”€β”€ adapter_config.json # LoRA configuration
105
+ β”œβ”€β”€ tokenizer.json # Tokenizer
106
+ β”œβ”€β”€ tokenizer_config.json # Tokenizer config
107
+ β”œβ”€β”€ special_tokens_map.json # Special tokens
108
+ β”œβ”€β”€ training_args.bin # Training arguments
109
+ β”œβ”€β”€ training_config.json # Training configuration
110
+ β”œβ”€β”€ checkpoint-24/ # Best checkpoint
111
+ └── README.md # Model card
112
+ ```
113
+
114
+ **How it works**:
115
+ - Your model is a **LoRA adapter** (Low-Rank Adaptation)
116
+ - It contains only the **fine-tuned weights** (161 MB)
117
+ - To use it, it needs the **base model** (Mistral-7B-v0.1, 28 GB)
118
+ - The adapter is merged with the base model at inference time
119
+
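In code, the reliable way to tell an adapter directory like this one apart from a full model is the presence of `adapter_config.json` (a minimal sketch; the helper name is hypothetical):

```python
import json
import tempfile
from pathlib import Path

def is_lora_adapter(model_dir: str) -> bool:
    """A LoRA adapter directory ships adapter_config.json;
    a full model ships config.json plus weight shards."""
    return (Path(model_dir) / "adapter_config.json").is_file()

# Demonstrate against a throwaway directory shaped like the adapter above
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text(
        json.dumps({"peft_type": "LORA", "r": 16})
    )
    adapter_detected = is_lora_adapter(tmp)

print(adapter_detected)  # → True
```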

---
121
+
122
+ ## πŸš€ Using Your Model
123
+
124
+ ### Option 1: Via Gradio UI (Recommended)
125
+
126
+ #### For API Hosting:
127
+
128
+ 1. **Access Gradio Interface**:
129
+ - URL: https://3833be2ce50507322f.gradio.live
130
+ - Or: http://0.0.0.0:7860 (if local)
131
+
132
+ 2. **Go to "🌐 API Hosting" Tab**
133
+
134
+ 3. **Select Your Model**:
135
+ - Model Source: **Local Model**
136
+ - Dropdown: Select `/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1`
137
+
138
+ 4. **Configure** (optional):
139
+ - Host: 0.0.0.0 (default)
140
+ - Port: 8000 (default)
141
+
142
+ 5. **Start Server**:
143
+ - Click "πŸš€ Start API Server"
144
+ - Wait 15-20 seconds for model loading
145
+ - Status will show "βœ… API server started!"
146
+
147
+ 6. **Access API**:
148
+ - API: http://0.0.0.0:8000
149
+ - Docs: http://0.0.0.0:8000/docs
150
+
151
+ #### For Direct Inference:
152
+
153
+ 1. **Go to "πŸ§ͺ Test Inference" Tab**
154
+
155
+ 2. **Select Your Model**:
156
+ - Model Source: **Local Model**
157
+ - Dropdown: Select `/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1`
158
+
159
+ 3. **Configure Parameters**:
160
+ - Max Length: 512 (default) or up to 6000
161
+ - Temperature: 0.7 (default) or adjust for creativity
162
+
163
+ 4. **Enter Prompt**:
164
+ - Type your test prompt in the text box
165
+
166
+ 5. **Run Inference**:
167
+ - Click "πŸ”„ Run Inference"
168
+ - Results will appear below
169
+
170
+ ---
171
+
172
+ ### Option 2: Via Python Script
173
+
174
+ ```python
175
+ from transformers import AutoTokenizer, AutoModelForCausalLM
176
+ from peft import PeftModel
177
+ import torch
178
+
179
+ # Load base model
180
+ base_model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
181
+ base_model = AutoModelForCausalLM.from_pretrained(
182
+ base_model_path,
183
+ torch_dtype=torch.float16,
184
+ device_map="auto",
185
+ local_files_only=True
186
+ )
187
+
188
+ # Load LoRA adapter
189
+ adapter_path = "/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1"
190
+ model = PeftModel.from_pretrained(base_model, adapter_path)
191
+ model = model.merge_and_unload() # Merge weights
192
+ model.eval()
193
+
194
+ # Load tokenizer
195
+ tokenizer = AutoTokenizer.from_pretrained(adapter_path)
196
+
197
+ # Run inference
198
+ prompt = "Your prompt here"
199
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
200
+ outputs = model.generate(**inputs, max_length=512)
201
+ result = tokenizer.decode(outputs[0], skip_special_tokens=True)
202
+ print(result)
203
+ ```
204
+
205
+ ---
206
+
207
+ ### Option 3: Via API (After Starting Server)
208
+
209
+ ```bash
210
+ # Start API server first via Gradio UI or:
211
+ cd /workspace/ftt/semicon-finetuning-scripts
212
+ python3 models/msp/api/api_server.py \
213
+ --model-path /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1 \
214
+ --host 0.0.0.0 \
215
+ --port 8000
216
+
217
+ # Then call the API:
218
+ curl -X POST "http://localhost:8000/generate" \
219
+ -H "Content-Type: application/json" \
220
+ -d '{
221
+ "prompt": "Your prompt here",
222
+ "max_length": 512,
223
+ "temperature": 0.7
224
+ }'
225
+ ```
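The same call can be made from Python with only the standard library (a sketch; the `/generate` payload shape is taken from the curl example above, and the server's response schema is not assumed beyond being JSON):

```python
import json
from urllib import request

def build_payload(prompt: str, max_length: int = 512, temperature: float = 0.7) -> bytes:
    """JSON body matching the curl example above."""
    return json.dumps({
        "prompt": prompt,
        "max_length": max_length,
        "temperature": temperature,
    }).encode("utf-8")

def generate(prompt: str, host: str = "localhost", port: int = 8000, **params) -> dict:
    """POST a generation request to the running API server."""
    req = request.Request(
        f"http://{host}:{port}/generate",
        data=build_payload(prompt, **params),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the server from the snippet above running:
#   result = generate("Your prompt here")
```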

---

## 🔍 Verification

### Check Models are Listed:

```bash
cd /workspace/ftt/semicon-finetuning-scripts
python3 << 'EOF'
from pathlib import Path

BASE_DIR = Path("/workspace/ftt/semicon-finetuning-scripts")
models = [
    str(item) for item in BASE_DIR.iterdir()
    if item.is_dir() and "mistral" in item.name.lower()
]
print("Models found in BASE_DIR:")
for m in sorted(models):
    print(f"  - {Path(m).name}")
EOF
```

Expected output should include: `mistral-finetuned-fifo1`

### Test API Server Manually:

```bash
cd /workspace/ftt/semicon-finetuning-scripts
source /venv/main/bin/activate

python3 models/msp/api/api_server.py \
    --model-path /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1 \
    --host 0.0.0.0 \
    --port 8001
```

Expected output should include:
- ✓ Loading base model from local: /workspace/ftt/base_models/Mistral-7B-v0.1
- ✓ Loading LoRA adapter...
- ✓ Model loaded successfully on cuda!
- ✓ Server ready to accept requests

---

## 🐛 Troubleshooting

### Model Not Appearing in Dropdown

**Check 1**: Verify the model exists
```bash
ls -lh /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1/
```

**Check 2**: Restart the Gradio interface
```bash
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
```

**Check 3**: Manually verify the list_models() function
```bash
cd /workspace/ftt/semicon-finetuning-scripts
python3 -c "from interface_app import list_models; print('\n'.join(list_models()))"
```

### API Server Fails to Start

**Check 1**: Verify the base model exists
```bash
ls -lh /workspace/ftt/base_models/Mistral-7B-v0.1/
```

If it is missing, re-download it:
```bash
huggingface-cli download mistralai/Mistral-7B-v0.1 \
    --local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
    --local-dir-use-symlinks False
```

**Check 2**: Test model loading manually
```bash
cd /workspace/ftt/semicon-finetuning-scripts
python3 << 'EOF'
from models.msp.inference.inference_mistral7b import load_local_model

model_path = "/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1"
print("Testing model load...")
model, tokenizer = load_local_model(model_path)
print("✓ Model loaded successfully!")
EOF
```

**Check 3**: Check GPU memory
```bash
nvidia-smi
```

If the GPU is full, free up memory:
```bash
pkill -f python3  # Kills ALL other Python processes, including the Gradio app
python3 -c "import torch; torch.cuda.empty_cache()"  # Only frees cached memory held by the current process
```
331
+ ### Inference Takes Too Long
332
+
333
+ **Option 1**: Reduce max_length
334
+ - Set max_length to 128 or 256 instead of 512+
335
+
336
+ **Option 2**: Use quantization
337
+ - The server automatically uses 4-bit quantization if GPU memory is low
338
+ - This makes it faster but slightly less accurate
339
+
340
+ **Option 3**: Adjust temperature
341
+ - Lower temperature (0.1-0.5) = faster, more deterministic
342
+ - Higher temperature (0.7-1.0) = slower, more creative
343
+
344
+ ---

## 📊 Performance Notes

### Model Loading Time:
- **Base Model Load**: ~15-20 seconds (28 GB from disk)
- **LoRA Adapter Load**: ~2-3 seconds (161 MB)
- **Merge & Unload**: ~5 seconds
- **Total**: ~20-30 seconds
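Timings like these are easy to reproduce by wrapping each loading stage in a small timer (a sketch using only the standard library; shown here with a trivial stand-in workload):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str):
    """Print how long a stage takes, e.g. around from_pretrained() calls."""
    start = time.perf_counter()
    yield
    print(f"{stage}: {time.perf_counter() - start:.1f}s")

# In the real script:
#   with timed("Base model load"):
#       base_model = AutoModelForCausalLM.from_pretrained(...)
# Stand-in demonstration:
with timed("demo stage"):
    total = sum(range(10**6))
```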

### Inference Speed (A100 GPU):
- **Short prompts** (<100 tokens): 1-2 seconds
- **Medium prompts** (100-500 tokens): 3-8 seconds
- **Long prompts** (500+ tokens): 10-30 seconds

### Memory Usage:
- **Base Model**: ~14 GB GPU RAM (FP16)
- **With LoRA**: ~14.5 GB GPU RAM
- **With Quantization**: ~7-8 GB GPU RAM (4-bit)

---
365
+
366
+ ## πŸ“š Technical Details
367
+
368
+ ### LoRA Configuration (from adapter_config.json):
369
+ ```json
370
+ {
371
+ "r": 16, # LoRA rank
372
+ "lora_alpha": 32, # LoRA scaling
373
+ "target_modules": [ # Layers fine-tuned
374
+ "q_proj",
375
+ "v_proj"
376
+ ],
377
+ "lora_dropout": 0.05,
378
+ "bias": "none",
379
+ "task_type": "CAUSAL_LM"
380
+ }
381
+ ```
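Under this configuration, the adapter size can be sanity-checked by counting trainable parameters: each targeted projection of shape (d_out, d_in) gains two low-rank matrices, A (r × d_in) and B (d_out × r). The Mistral-7B dimensions below (hidden size 4096, 32 layers, v_proj output 1024 under grouped-query attention) are taken from the base model's published config, not from this repo, and the count assumes only the two modules shown above were targeted:

```python
def lora_param_count(r: int, shapes: list[tuple[int, int]], n_layers: int) -> int:
    """Trainable LoRA parameters: per target module of shape (d_out, d_in),
    A contributes r * d_in entries and B contributes d_out * r entries."""
    per_layer = sum(r * d_in + d_out * r for d_out, d_in in shapes)
    return per_layer * n_layers

# Mistral-7B-v0.1: q_proj is 4096x4096, v_proj is 1024x4096
total = lora_param_count(r=16, shapes=[(4096, 4096), (1024, 4096)], n_layers=32)
print(f"{total:,} trainable parameters")  # → 6,815,744 trainable parameters
print(f"~{total * 4 / 1e6:.0f} MB at FP32")  # → ~27 MB at FP32
```

If the adapter file on disk is substantially larger than this estimate, the trained run likely targeted more modules than the two listed, or saved extra state alongside the weights.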

### Training Configuration (from training_config.json):
- **Base Model**: mistralai/Mistral-7B-v0.1
- **Dataset**: 100 samples (FIFO-related)
- **Max Length**: 2048 tokens
- **Epochs**: 3
- **Batch Size**: 4
- **Learning Rate**: 2e-4
- **Device**: CUDA (A100 GPU)
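These numbers imply a short run. Assuming no gradient accumulation (not stated in this repo), the step count works out as:

```python
import math

samples, batch_size, epochs = 100, 4, 3

steps_per_epoch = math.ceil(samples / batch_size)  # 100 / 4 = 25 optimizer steps per epoch
total_steps = steps_per_epoch * epochs             # 25 * 3 = 75 steps for the whole run

print(steps_per_epoch, total_steps)  # → 25 75
```

Under these assumptions, the saved `checkpoint-24/` sits just before the end of the first epoch.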
392
+ ---
393
+
394
+ ## 🎯 Summary
395
+
396
+ ### What Was Fixed:
397
+
398
+ 1. βœ… **Model Listing**: Updated to scan BASE_DIR where models are saved
399
+ 2. βœ… **API Server**: Updated to use local base model instead of HuggingFace cache
400
+ 3. βœ… **Inference**: Now works both directly and via API
401
+
402
+ ### What's Working Now:
403
+
404
+ 1. βœ… Your model appears in all dropdowns
405
+ 2. βœ… API server starts successfully
406
+ 3. βœ… Inference works via UI
407
+ 4. βœ… Inference works via API
408
+ 5. βœ… No more cache errors!
409
+
410
+ ### Files Modified:
411
+
412
+ 1. `/workspace/ftt/semicon-finetuning-scripts/interface_app.py` - Model listing
413
+ 2. `/workspace/ftt/semicon-finetuning-scripts/models/msp/inference/inference_mistral7b.py` - Inference
414
+
415
+ ---

## 🌐 Access Links

**Gradio Interface**: https://3833be2ce50507322f.gradio.live
**Local Port**: 7860
**API Port** (when started): 8000

---

*Last Updated: 2024-11-24*
*Model: mistral-finetuned-fifo1 (LoRA Adapter)*
*Base: Mistral-7B-v0.1 (Local)*