# Local Model Setup - Solution Summary

## 🎯 Problem Resolved

**Issue**: Training failed with `OSError: [Errno 116] Stale file handle` when trying to download/use models from the HuggingFace cache.

**Root Cause**: Corrupted NFS file handle in the HuggingFace cache directory preventing model access.

**Solution**: Downloaded the Mistral-7B-v0.1 model directly to the workspace, bypassing the corrupted cache.

---

## 📦 Model Location

```
/workspace/ftt/base_models/Mistral-7B-v0.1
```

**Size**: 28 GB (includes both PyTorch and SafeTensors formats)

**Contents**:
- ✓ Model weights (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors)
- ✓ Tokenizer (tokenizer.model, tokenizer.json)
- ✓ Configuration files (config.json, generation_config.json)

---

## 🔧 Changes Made

### 1. Downloaded Model Locally

Used `huggingface-cli` to download the model directly to the workspace:

```bash
huggingface-cli download mistralai/Mistral-7B-v0.1 \
  --local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --local-dir-use-symlinks False
```

### 2. Updated Gradio Interface

**File**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`

**Change**: Updated the default base model path from the HuggingFace ID to the local path:

```python
# Before:
value="mistralai/Mistral-7B-v0.1"

# After:
value="/workspace/ftt/base_models/Mistral-7B-v0.1"
```

### 3. Restarted Interface

Killed the old Gradio process and started a fresh instance with the updated configuration.

---

## 🚀 How to Use

### Starting Training

1. **Access Gradio Interface**:
   - The interface is running on port 7860
   - Access it via the public link displayed in the terminal

2. **Fine-tuning Tab**:
   - The Base Model field now defaults to: `/workspace/ftt/base_models/Mistral-7B-v0.1`
   - You can still use HuggingFace model IDs if needed
   - Upload your dataset or use HuggingFace datasets
   - Configure training parameters
   - Click "Start Fine-tuning"
3. **Monitor Training**:
   - Status updates in real-time
   - Progress bar shows epoch and loss
   - Logs are scrollable with copy functionality

### Using Other Models

If you want to use a different base model:

**Option 1: Download Another Model Locally**

```bash
cd /workspace/ftt
source /venv/main/bin/activate

# Download model (replace <model-id> and <model-name> with your choices)
huggingface-cli download <model-id> \
  --local-dir /workspace/ftt/base_models/<model-name> \
  --local-dir-use-symlinks False

# Use the path in Gradio:
# /workspace/ftt/base_models/<model-name>
```

**Option 2: Use HuggingFace ID Directly**
- Simply enter the model ID in the Base Model field (e.g., `mistralai/Mistral-7B-Instruct-v0.2`)
- The script will download it if it is not cached (this may hit the cache issues if they persist)

---

## 🔍 Verification

### Check Model is Accessible

```bash
python3 << 'EOF'
from transformers import AutoTokenizer, AutoConfig

model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
config = AutoConfig.from_pretrained(model_path, local_files_only=True)

print(f"✓ Tokenizer: {len(tokenizer)} tokens")
print(f"✓ Model: {config.model_type}")
EOF
```

### Check Gradio Status

```bash
# Check process
ps aux | grep interface_app.py

# Check port
lsof -i :7860

# View logs (if started with nohup)
tail -f /tmp/gradio_interface.log
```

---

## 📊 Interface Features

### Fine-tuning Section
- ✓ File upload support (JSON/JSONL)
- ✓ HuggingFace dataset integration
- ✓ Automatic train/validation/test split
- ✓ Max sequence length up to 6000
- ✓ GPU-based parameter recommendations
- ✓ Detailed tooltips for all parameters
- ✓ Real-time progress tracking
- ✓ Checkpoint/resume functionality

### API Hosting Section
- ✓ Host fine-tuned models from local paths
- ✓ Host models from HuggingFace repositories
- ✓ FastAPI with automatic documentation
- ✓ Health checks and status monitoring

### Test Inference Section
- ✓ Test local fine-tuned models
- ✓ Test HuggingFace models
- ✓ Adjustable max-length (up to 6000)
- ✓ Temperature control with tooltips
- ✓ Uses the API if it is running, otherwise loads the model directly

### UI Controls
- ✓ Stop Training button
- ✓ Refresh Status button
- ✓ Scrollable logs with copy functionality
- ✓ Progress bars for training
- ✓ 🛑 Shutdown Gradio Server button (System Controls)

---

## 🐛 Troubleshooting

### Issue: Cache errors persist
**Solution**: Always use local model paths from `/workspace/ftt/base_models/`

### Issue: Training logs not updating
**Solution**:
1. Click the "Refresh Status" button
2. Check that the training process is running: `ps aux | grep finetune_mistral`

### Issue: Interface not accessible
**Solution**:
```bash
# Check if running
lsof -i :7860

# Restart if needed
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
```

### Issue: Out of memory during training
**Solution**:
1. Reduce the batch size
2. Reduce the max sequence length
3. Enable gradient checkpointing (already enabled in the script)
4. Use LoRA with a lower rank (r=8 instead of r=16)

---

## 📝 Technical Details

### Training Script
**Location**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`

**Key Features**:
- LoRA fine-tuning for memory efficiency
- Gradient checkpointing enabled
- Automatic device detection (CUDA/MPS/CPU)
- Resume from checkpoint support
- JSON configuration export

### Fine-tuning Command (Generated by Interface)

```bash
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
  --base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --dataset /path/to/your/dataset.jsonl \
  --output-dir ./your-finetuned-model \
  --max-length 2048 \
  --num-epochs 3 \
  --batch-size 4 \
  --learning-rate 2e-4 \
  --lora-r 16 \
  --lora-alpha 32
```

---

## 🎉 Success Criteria

You'll know everything is working when:

1. ✅ Gradio interface loads without errors
2. ✅ Base model field shows the local path
3. ✅ Training starts without cache errors
4. ✅ Progress updates appear in the UI
5. ✅ Model weights are saved to the output directory

---

## 📚 Related Files

- **Interface**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`
- **Training Script**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`
- **Base Model**: `/workspace/ftt/base_models/Mistral-7B-v0.1/`
- **Startup Script**: `/workspace/ftt/semicon-finetuning-scripts/start_interface.sh`
- **Requirements**: `/workspace/ftt/semicon-finetuning-scripts/requirements_interface.txt`

---

## 🆘 Support

If you encounter any issues:

1. Check this document's troubleshooting section
2. Review the training logs in the UI
3. Check process status: `ps aux | grep -E "interface_app|finetune_mistral"`
4. Verify the cache directories are clear: `ls -lh /workspace/.hf_home/hub/`

---

*Last Updated: 2025-11-24*
*Solution: Local model download to bypass corrupted HuggingFace cache*
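
---

## 🧪 Appendix: Checkpoint Sanity Check (Sketch)

Before pointing the interface at a local directory, it can save a failed run to confirm the checkpoint download is complete. This is a minimal sketch: the `missing_model_files` helper is illustrative (not part of the scripts above), and the shard names in `REQUIRED_FILES` match Mistral-7B-v0.1 specifically; other models split their weights differently.

```python
import os

# Files the local checkpoint should contain, per the "Model Location"
# section above (safetensors shard names vary between models).
REQUIRED_FILES = [
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer.model",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]

def missing_model_files(model_dir, required=tuple(REQUIRED_FILES)):
    """Return the names of required files not present in model_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(model_dir, name))]

if __name__ == "__main__":
    model_dir = "/workspace/ftt/base_models/Mistral-7B-v0.1"
    missing = missing_model_files(model_dir)
    if missing:
        print(f"✗ Incomplete checkpoint, missing: {', '.join(missing)}")
    else:
        print("✓ All required files present - safe to use as the base model")
```

An interrupted `huggingface-cli download` typically leaves a partial directory; this check catches that before a multi-hour training job does.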
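
---

## 🔁 Appendix: API-or-Direct Fallback (Sketch)

The Test Inference section's "uses API if running, otherwise direct loading" behavior can be reproduced with a small health probe. A sketch under assumptions: the hosting API is presumed to listen on port 8000 and expose a `/health` endpoint (both are assumptions, not confirmed by the scripts above; adjust `base_url` to your deployment).

```python
import urllib.error
import urllib.request

def api_is_up(base_url="http://localhost:8000", timeout=2.0):
    """Return True if the hosting API answers its (assumed) health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat the API as down.
        return False

def choose_backend(base_url="http://localhost:8000"):
    """Pick 'api' when the server responds, else fall back to 'direct' loading."""
    return "api" if api_is_up(base_url) else "direct"
```

The probe degrades gracefully: any network error simply routes inference to direct model loading instead of raising.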