Spaces:

Prithvik-1
/

mistral-finetuning-interface

Paused

App Files Files Community

mistral-finetuning-interface / docs /MODEL_INFERENCE_FIXES.md

Prithvik-1

Upload docs/MODEL_INFERENCE_FIXES.md with huggingface_hub

244a62f verified 3 months ago

preview code

raw

history blame contribute delete

11.9 kB

	# Model Inference Fixes - Complete Guide

	## 🎉 Issues Resolved

	### Issue 1: New Fine-tuned Model Not Showing in UI
	Status: ✅ FIXED

	Problem: After completing fine-tuning, the new model `mistral-finetuned-fifo1` was not appearing in the dropdown lists for API Hosting or Test Inference.

	Root Cause: The `list_models()` function was only checking:
	- `/workspace/ftt/` (parent directory)
	- `/workspace/ftt/semicon-finetuning-scripts/models/msp/` (MODELS_DIR)

	But the new model was saved to:
	- `/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1` (BASE_DIR)

	Solution: Updated `list_models()` function to also scan `BASE_DIR`:

	```python
	def list_models():
	"""List available fine-tuned models"""
	models = []

	# Check in BASE_DIR (semicon-finetuning-scripts directory) - NEW!
	for item in BASE_DIR.iterdir():
	if item.is_dir() and "mistral" in item.name.lower() and not item.name.startswith('.'):
	models.append(str(item))

	# Check in BASE_DIR parent (ftt directory)
	ftt_dir = BASE_DIR.parent
	for item in ftt_dir.iterdir():
	if item.is_dir() and "mistral" in item.name.lower():
	models.append(str(item))

	# Check in MODELS_DIR
	if MODELS_DIR.exists():
	for item in MODELS_DIR.iterdir():
	if item.is_dir() and "mistral" in item.name.lower():
	models.append(str(item))

	return sorted(list(set(models))) if models else ["No models found"]
	```

	File Modified: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py` (lines 116-133)

	---

	### Issue 2: API Hosting Server Not Starting
	Status: ✅ FIXED

	Problem: When trying to start the API hosting server with the fine-tuned model, it failed with:
	```
	OSError: [Errno 116] Stale file handle:
	'/workspace/.hf_home/hub/models--mistralai--Mistral-7B-v0.1/blobs/...'
	```

	Root Cause:
	1. The fine-tuned model is a LoRA adapter (not a full model)
	2. To use it, the API server must load the base model first, then apply the LoRA adapter
	3. The inference script was hardcoded to load `mistralai/Mistral-7B-v0.1` from HuggingFace
	4. This triggered the corrupted cache issue again

	Solution: Updated the inference script to use the local base model we downloaded earlier:

	```python
	if is_lora:
	# Load base model - prefer local model to avoid cache issues
	local_base_model = "/workspace/ftt/base_models/Mistral-7B-v0.1"

	# Check if local model exists, otherwise use HuggingFace
	if os.path.exists(local_base_model):
	base_model_name = local_base_model
	print(f"Loading base model from local: {base_model_name}")
	else:
	base_model_name = "mistralai/Mistral-7B-v0.1"
	print(f"Loading base model from HuggingFace: {base_model_name}")

	base_model = AutoModelForCausalLM.from_pretrained(
	base_model_name,
	local_files_only=os.path.exists(local_base_model),
	**get_model_kwargs(use_quantization)
	)

	# Load LoRA adapter
	print("Loading LoRA adapter...")
	model = PeftModel.from_pretrained(base_model, model_path)
	model = model.merge_and_unload() # Merge adapter weights
	```

	File Modified: `/workspace/ftt/semicon-finetuning-scripts/models/msp/inference/inference_mistral7b.py` (lines 96-109)

	---

	## 📦 Your Fine-tuned Model

	Location: `/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1`

	Type: LoRA Adapter (161 MB)

	Contents:
	```
	mistral-finetuned-fifo1/
	├── adapter_model.safetensors # LoRA weights (161 MB)
	├── adapter_config.json # LoRA configuration
	├── tokenizer.json # Tokenizer
	├── tokenizer_config.json # Tokenizer config
	├── special_tokens_map.json # Special tokens
	├── training_args.bin # Training arguments
	├── training_config.json # Training configuration
	├── checkpoint-24/ # Best checkpoint
	└── README.md # Model card
	```

	How it works:
	- Your model is a LoRA adapter (Low-Rank Adaptation)
	- It contains only the fine-tuned weights (161 MB)
	- To use it, it needs the base model (Mistral-7B-v0.1, 28 GB)
	- The adapter is merged with the base model at inference time

	---

	## 🚀 Using Your Model

	### Option 1: Via Gradio UI (Recommended)

	#### For API Hosting:

	1. Access Gradio Interface:
	- URL: https://3833be2ce50507322f.gradio.live
	- Or: http://0.0.0.0:7860 (if local)

	2. Go to "🌐 API Hosting" Tab

	3. Select Your Model:
	- Model Source: Local Model
	- Dropdown: Select `/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1`

	4. Configure (optional):
	- Host: 0.0.0.0 (default)
	- Port: 8000 (default)

	5. Start Server:
	- Click "🚀 Start API Server"
	- Wait 15-20 seconds for model loading
	- Status will show "✅ API server started!"

	6. Access API:
	- API: http://0.0.0.0:8000
	- Docs: http://0.0.0.0:8000/docs

	#### For Direct Inference:

	1. Go to "🧪 Test Inference" Tab

	2. Select Your Model:
	- Model Source: Local Model
	- Dropdown: Select `/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1`

	3. Configure Parameters:
	- Max Length: 512 (default) or up to 6000
	- Temperature: 0.7 (default) or adjust for creativity

	4. Enter Prompt:
	- Type your test prompt in the text box

	5. Run Inference:
	- Click "🔄 Run Inference"
	- Results will appear below

	---

	### Option 2: Via Python Script

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	from peft import PeftModel
	import torch

	# Load base model
	base_model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
	base_model = AutoModelForCausalLM.from_pretrained(
	base_model_path,
	torch_dtype=torch.float16,
	device_map="auto",
	local_files_only=True
	)

	# Load LoRA adapter
	adapter_path = "/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1"
	model = PeftModel.from_pretrained(base_model, adapter_path)
	model = model.merge_and_unload() # Merge weights
	model.eval()

	# Load tokenizer
	tokenizer = AutoTokenizer.from_pretrained(adapter_path)

	# Run inference
	prompt = "Your prompt here"
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_length=512)
	result = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(result)
	```

	---

	### Option 3: Via API (After Starting Server)

	```bash
	# Start API server first via Gradio UI or:
	cd /workspace/ftt/semicon-finetuning-scripts
	python3 models/msp/api/api_server.py \
	--model-path /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1 \
	--host 0.0.0.0 \
	--port 8000

	# Then call the API:
	curl -X POST "http://localhost:8000/generate" \
	-H "Content-Type: application/json" \
	-d '{
	"prompt": "Your prompt here",
	"max_length": 512,
	"temperature": 0.7
	}'
	```

	---

	## 🔍 Verification

	### Check Models are Listed:

	```bash
	cd /workspace/ftt/semicon-finetuning-scripts
	python3 << 'EOF'
	from pathlib import Path

	BASE_DIR = Path("/workspace/ftt/semicon-finetuning-scripts")
	models = [
	str(item) for item in BASE_DIR.iterdir()
	if item.is_dir() and "mistral" in item.name.lower()
	]
	print("Models found in BASE_DIR:")
	for m in sorted(models):
	print(f" - {Path(m).name}")
	EOF
	```

	Expected output should include: `mistral-finetuned-fifo1`

	### Test API Server Manually:

	```bash
	cd /workspace/ftt/semicon-finetuning-scripts
	source /venv/main/bin/activate

	python3 models/msp/api/api_server.py \
	--model-path /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1 \
	--host 0.0.0.0 \
	--port 8001
	```

	Expected output should include:
	- ✓ Loading base model from local: /workspace/ftt/base_models/Mistral-7B-v0.1
	- ✓ Loading LoRA adapter...
	- ✓ Model loaded successfully on cuda!
	- ✓ Server ready to accept requests

	---

	## 🐛 Troubleshooting

	### Model Not Appearing in Dropdown

	Check 1: Verify model exists
	```bash
	ls -lh /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1/
	```

	Check 2: Restart Gradio interface
	```bash
	pkill -f interface_app.py
	cd /workspace/ftt/semicon-finetuning-scripts
	python3 interface_app.py
	```

	Check 3: Manually verify list_models() function
	```bash
	cd /workspace/ftt/semicon-finetuning-scripts
	python3 -c "from interface_app import list_models; print('\n'.join(list_models()))"
	```

	### API Server Fails to Start

	Check 1: Verify base model exists
	```bash
	ls -lh /workspace/ftt/base_models/Mistral-7B-v0.1/
	```

	If missing, re-download:
	```bash
	huggingface-cli download mistralai/Mistral-7B-v0.1 \
	--local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
	--local-dir-use-symlinks False
	```

	Check 2: Test model loading manually
	```bash
	cd /workspace/ftt/semicon-finetuning-scripts
	python3 << 'EOF'
	from models.msp.inference.inference_mistral7b import load_local_model

	model_path = "/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1"
	print("Testing model load...")
	model, tokenizer = load_local_model(model_path)
	print("✓ Model loaded successfully!")
	EOF
	```

	Check 3: Check GPU memory
	```bash
	nvidia-smi
	```

	If GPU is full, free up memory:
	```bash
	pkill -f python3 # Kill other Python processes
	python3 -c "import torch; torch.cuda.empty_cache()"
	```

	### Inference Takes Too Long

	Option 1: Reduce max_length
	- Set max_length to 128 or 256 instead of 512+

	Option 2: Use quantization
	- The server automatically uses 4-bit quantization if GPU memory is low
	- This makes it faster but slightly less accurate

	Option 3: Adjust temperature
	- Lower temperature (0.1-0.5) = faster, more deterministic
	- Higher temperature (0.7-1.0) = slower, more creative

	---

	## 📊 Performance Notes

	### Model Loading Time:
	- Base Model Load: ~15-20 seconds (28 GB from disk)
	- LoRA Adapter Load: ~2-3 seconds (161 MB)
	- Merge & Unload: ~5 seconds
	- Total: ~20-30 seconds

	### Inference Speed (A100 GPU):
	- Short prompts (<100 tokens): 1-2 seconds
	- Medium prompts (100-500 tokens): 3-8 seconds
	- Long prompts (500+ tokens): 10-30 seconds

	### Memory Usage:
	- Base Model: ~14 GB GPU RAM (FP16)
	- With LoRA: ~14.5 GB GPU RAM
	- With Quantization: ~7-8 GB GPU RAM (4-bit)

	---

	## 📚 Technical Details

	### LoRA Configuration (from adapter_config.json):
	```json
	{
	"r": 16, # LoRA rank
	"lora_alpha": 32, # LoRA scaling
	"target_modules": [ # Layers fine-tuned
	"q_proj",
	"v_proj"
	],
	"lora_dropout": 0.05,
	"bias": "none",
	"task_type": "CAUSAL_LM"
	}
	```

	### Training Configuration (from training_config.json):
	- Base Model: mistralai/Mistral-7B-v0.1
	- Dataset: 100 samples (FIFO-related)
	- Max Length: 2048 tokens
	- Epochs: 3
	- Batch Size: 4
	- Learning Rate: 2e-4
	- Device: CUDA (A100 GPU)

	---

	## 🎯 Summary

	### What Was Fixed:

	1. ✅ Model Listing: Updated to scan BASE_DIR where models are saved
	2. ✅ API Server: Updated to use local base model instead of HuggingFace cache
	3. ✅ Inference: Now works both directly and via API

	### What's Working Now:

	1. ✅ Your model appears in all dropdowns
	2. ✅ API server starts successfully
	3. ✅ Inference works via UI
	4. ✅ Inference works via API
	5. ✅ No more cache errors!

	### Files Modified:

	1. `/workspace/ftt/semicon-finetuning-scripts/interface_app.py` - Model listing
	2. `/workspace/ftt/semicon-finetuning-scripts/models/msp/inference/inference_mistral7b.py` - Inference

	---

	## 🌐 Access Links

	Gradio Interface: https://3833be2ce50507322f.gradio.live
	Local Port: 7860
	API Port (when started): 8000

	---

	Last Updated: 2024-11-24
	Model: mistral-finetuned-fifo1 (LoRA Adapter)
	Base: Mistral-7B-v0.1 (Local)