# Local Model Setup - Solution Summary

## 🎯 Problem Resolved

**Issue**: Training failed with `OSError: [Errno 116] Stale file handle` when trying to download or use models from the HuggingFace cache.

**Root Cause**: A corrupted NFS file handle in the HuggingFace cache directory prevented model access.

**Solution**: Downloaded the Mistral-7B-v0.1 model directly into the workspace, bypassing the corrupted cache.
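One way to confirm the workspace copy really bypasses NFS is to check which filesystem backs it (a quick sanity check; mount layout varies by machine):

```shell
# The Type column should show a local filesystem (e.g. ext4 or overlay), not nfs
df -T /workspace/ftt/base_models
```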
---

## 📦 Model Location

```
/workspace/ftt/base_models/Mistral-7B-v0.1
```

**Size**: 28 GB (includes both PyTorch and SafeTensors formats)

**Contents**:
- ✅ Model weights (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors)
- ✅ Tokenizer (tokenizer.model, tokenizer.json)
- ✅ Configuration files (config.json, generation_config.json)

---
## 🔧 Changes Made

### 1. Downloaded Model Locally

Used `huggingface-cli` to download the model directly to the workspace:

```bash
huggingface-cli download mistralai/Mistral-7B-v0.1 \
  --local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --local-dir-use-symlinks False
```
### 2. Updated Gradio Interface

**File**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`

**Change**: Updated the default base model path from a HuggingFace ID to the local path:

```python
# Before:
value="mistralai/Mistral-7B-v0.1"

# After:
value="/workspace/ftt/base_models/Mistral-7B-v0.1"
```

### 3. Restarted Interface

Killed the old Gradio process and started a fresh instance with the updated configuration.

---
## 🚀 How to Use

### Starting Training

1. **Access the Gradio Interface**:
   - The interface is running on port 7860
   - Access it via the public link displayed in the terminal
2. **Fine-tuning Tab**:
   - The Base Model field now defaults to `/workspace/ftt/base_models/Mistral-7B-v0.1`
   - You can still use HuggingFace model IDs if needed
   - Upload your dataset or use a HuggingFace dataset
   - Configure the training parameters
   - Click "Start Fine-tuning"
3. **Monitor Training**:
   - Status updates in real time
   - The progress bar shows epoch and loss
   - Logs are scrollable, with copy functionality
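Uploaded datasets are expected as JSON/JSONL. The exact field names depend on the training script, but a typical instruction-tuning record (keys here are illustrative, not confirmed from the script) round-trips like this:

```python
import json

# Hypothetical record layout -- check finetune_mistral7b.py for the exact keys it expects
record = {
    "instruction": "Explain what a wafer stepper does.",
    "response": "A wafer stepper projects circuit patterns onto silicon wafers one die at a time.",
}

# JSONL: one JSON object per line, so a file is just lines like this joined by newlines
line = json.dumps(record)
```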
### Using Other Models

If you want to use a different base model:

**Option 1: Download Another Model Locally**

```bash
cd /workspace/ftt
source /venv/main/bin/activate

# Download the model
huggingface-cli download <model-id> \
  --local-dir /workspace/ftt/base_models/<model-name> \
  --local-dir-use-symlinks False

# Then use this path in Gradio:
# /workspace/ftt/base_models/<model-name>
```

**Option 2: Use a HuggingFace ID Directly**
- Simply enter the model ID in the Base Model field (e.g., `mistralai/Mistral-7B-Instruct-v0.2`)
- The script will download it if it is not cached (this may hit the cache issues if they persist)
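If you script this choice, a small helper (hypothetical, not part of the interface) can prefer a workspace copy when one exists and pass the hub ID through otherwise:

```python
from pathlib import Path

def resolve_base_model(model_id: str, base_dir: str = "/workspace/ftt/base_models") -> str:
    """Return a local copy of the model if one exists under base_dir, else the hub ID.

    Assumes local copies are named after the repo,
    e.g. mistralai/Mistral-7B-v0.1 -> <base_dir>/Mistral-7B-v0.1.
    """
    local = Path(base_dir) / model_id.split("/")[-1]
    # config.json is always present in a complete HuggingFace model directory
    return str(local) if (local / "config.json").exists() else model_id
```

With the workspace copy in place this returns the local path; otherwise the ID passes through unchanged so `transformers` downloads it as usual.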
---

## 🔍 Verification

### Check the Model Is Accessible

```bash
python3 << 'EOF'
from transformers import AutoTokenizer, AutoConfig

model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
config = AutoConfig.from_pretrained(model_path, local_files_only=True)
print(f"✅ Tokenizer: {len(tokenizer)} tokens")
print(f"✅ Model: {config.model_type}")
EOF
```

### Check Gradio Status

```bash
# Check the process
ps aux | grep interface_app.py

# Check the port
lsof -i :7860

# View logs (if started with nohup)
tail -f /tmp/gradio_interface.log
```

---
## 📊 Interface Features

### Fine-tuning Section
- ✅ File upload support (JSON/JSONL)
- ✅ HuggingFace dataset integration
- ✅ Automatic train/validation/test split
- ✅ Max sequence length up to 6000
- ✅ GPU-based parameter recommendations
- ✅ Detailed tooltips for all parameters
- ✅ Real-time progress tracking
- ✅ Checkpoint/resume functionality
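The automatic train/validation/test split amounts to a seeded shuffle-and-slice; a minimal sketch (fractions and function name are illustrative, and the script may instead use the `datasets` library):

```python
import random

def split_records(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once with a fixed seed, carve off test and validation, keep the rest as train."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

The fixed seed keeps the split reproducible across runs, which matters when resuming from a checkpoint.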
### API Hosting Section
- ✅ Host fine-tuned models from local paths
- ✅ Host models from HuggingFace repositories
- ✅ FastAPI with automatic documentation
- ✅ Health checks and status monitoring

### Test Inference Section
- ✅ Test local fine-tuned models
- ✅ Test HuggingFace models
- ✅ Adjustable max length (up to 6000)
- ✅ Temperature control with tooltips
- ✅ Uses the API if it is running, otherwise loads the model directly
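The "API if running, otherwise direct loading" behavior hinges on a quick reachability probe; a sketch of that check (the `/health` endpoint path is an assumption based on the health checks listed above):

```python
import requests
from requests.exceptions import RequestException

def api_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the hosted API answers its health endpoint, False on any network error."""
    try:
        return requests.get(f"{base_url}/health", timeout=timeout).status_code == 200
    except RequestException:
        return False
```

When this returns `False`, the inference tab falls back to loading the model weights directly.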
### UI Controls
- ✅ Stop Training button
- ✅ Refresh Status button
- ✅ Scrollable logs with copy functionality
- ✅ Progress bars for training
- ✅ 🔌 Shutdown Gradio Server button (System Controls)

---
## 🐛 Troubleshooting

### Issue: Cache errors persist

**Solution**: Always use local model paths from `/workspace/ftt/base_models/`
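If something must still go through the hub, you can also point the cache itself at workspace storage. The path below matches the cache directory referenced elsewhere in this document, but verify it for your setup:

```shell
# Redirect all HuggingFace caching away from the broken mount.
# Set this before launching the interface or any training run.
export HF_HOME=/workspace/.hf_home
echo "HF_HOME is now $HF_HOME"
```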
### Issue: Training logs not updating

**Solution**:
1. Click the "Refresh Status" button
2. Check that the training process is running: `ps aux | grep finetune_mistral`

### Issue: Interface not accessible

**Solution**:

```bash
# Check if it is running
lsof -i :7860

# Restart if needed
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
```

### Issue: Out of memory during training

**Solution**:
1. Reduce the batch size
2. Reduce the max sequence length
3. Enable gradient checkpointing (already enabled in the script)
4. Use LoRA with a lower rank (r=8 instead of r=16)
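Point 4 works because LoRA's added parameters scale linearly with the rank; a back-of-envelope count (matrix counts and widths here are illustrative of a 7B model, not read from the script):

```python
def lora_added_params(d_in: int, d_out: int, r: int, n_matrices: int) -> int:
    """LoRA adds a pair A (r x d_in) and B (d_out x r) next to each adapted weight matrix."""
    return n_matrices * r * (d_in + d_out)

# e.g. adapting two projections per layer across 32 layers at width 4096
params_r16 = lora_added_params(4096, 4096, r=16, n_matrices=64)
params_r8 = lora_added_params(4096, 4096, r=8, n_matrices=64)
```

Halving the rank halves both the adapter weights and their optimizer state, which is where the memory savings come from.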
---

## 📚 Technical Details

### Training Script

**Location**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`

**Key Features**:
- LoRA fine-tuning for memory efficiency
- Gradient checkpointing enabled
- Automatic device detection (CUDA/MPS/CPU)
- Resume-from-checkpoint support
- JSON configuration export
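The device auto-detection typically reduces to a cascade like the following (a sketch of the common pattern, not the script's exact code):

```python
import torch

def detect_device() -> str:
    """Prefer a CUDA GPU, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists on builds with MPS support
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```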
### Fine-tuning Command (Generated by the Interface)

```bash
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
  --base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --dataset /path/to/your/dataset.jsonl \
  --output-dir ./your-finetuned-model \
  --max-length 2048 \
  --num-epochs 3 \
  --batch-size 4 \
  --learning-rate 2e-4 \
  --lora-r 16 \
  --lora-alpha 32
```

---
## 🎉 Success Criteria

You'll know everything is working when:
1. ✅ The Gradio interface loads without errors
2. ✅ The Base Model field shows the local path
3. ✅ Training starts without cache errors
4. ✅ Progress updates appear in the UI
5. ✅ Model weights are saved to the output directory

---
## 📁 Related Files

- **Interface**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`
- **Training Script**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`
- **Base Model**: `/workspace/ftt/base_models/Mistral-7B-v0.1/`
- **Startup Script**: `/workspace/ftt/semicon-finetuning-scripts/start_interface.sh`
- **Requirements**: `/workspace/ftt/semicon-finetuning-scripts/requirements_interface.txt`

---
## 📞 Support

If you encounter any issues:
1. Check the troubleshooting section of this document
2. Review the training logs in the UI
3. Check process status: `ps aux | grep -E "interface_app|finetune_mistral"`
4. Verify the cache directory: `ls -lh /workspace/.hf_home/hub/`

---

*Last Updated: 2025-11-24*
*Solution: Local model download to bypass the corrupted HuggingFace cache*