# Local Model Setup - Solution Summary

## 🎯 Problem Resolved

**Issue**: Training failed with `OSError: [Errno 116] Stale file handle` when trying to download or use models from the HuggingFace cache.

**Root Cause**: A corrupted NFS file handle in the HuggingFace cache directory prevented model access.

**Solution**: Downloaded the Mistral-7B-v0.1 model directly into the workspace, bypassing the corrupted cache.
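One way to confirm the workspace copy really bypasses NFS is to check which filesystem backs it (a quick sanity check; mount layout varies by machine):

```shell
# The Type column should show a local filesystem (e.g. ext4 or overlay), not nfs
df -T /workspace/ftt/base_models
```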
---

## 📦 Model Location

```
/workspace/ftt/base_models/Mistral-7B-v0.1
```

**Size**: 28 GB (includes both PyTorch and SafeTensors formats)

**Contents**:
- ✅ Model weights (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors)
- ✅ Tokenizer (tokenizer.model, tokenizer.json)
- ✅ Configuration files (config.json, generation_config.json)

---
## 🔧 Changes Made

### 1. Downloaded Model Locally

Used `huggingface-cli` to download the model directly to the workspace:

```bash
huggingface-cli download mistralai/Mistral-7B-v0.1 \
  --local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --local-dir-use-symlinks False
```
### 2. Updated Gradio Interface

**File**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`

**Change**: Updated the default base model path from a HuggingFace ID to the local path:

```python
# Before:
value="mistralai/Mistral-7B-v0.1"

# After:
value="/workspace/ftt/base_models/Mistral-7B-v0.1"
```

### 3. Restarted Interface

Killed the old Gradio process and started a fresh instance with the updated configuration.

---
## 🚀 How to Use

### Starting Training

1. **Access the Gradio Interface**:
   - The interface is running on port 7860
   - Access it via the public link displayed in the terminal
2. **Fine-tuning Tab**:
   - The Base Model field now defaults to `/workspace/ftt/base_models/Mistral-7B-v0.1`
   - You can still use HuggingFace model IDs if needed
   - Upload your dataset or use a HuggingFace dataset
   - Configure the training parameters
   - Click "Start Fine-tuning"
3. **Monitor Training**:
   - Status updates in real time
   - The progress bar shows epoch and loss
   - Logs are scrollable, with copy functionality
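Uploaded datasets are expected as JSON/JSONL. The exact field names depend on the training script, but a typical instruction-tuning record (keys here are illustrative, not confirmed from the script) round-trips like this:

```python
import json

# Hypothetical record layout -- check finetune_mistral7b.py for the exact keys it expects
record = {
    "instruction": "Explain what a wafer stepper does.",
    "response": "A wafer stepper projects circuit patterns onto silicon wafers one die at a time.",
}

# JSONL: one JSON object per line, so a file is just lines like this joined by newlines
line = json.dumps(record)
```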
### Using Other Models

If you want to use a different base model:

**Option 1: Download Another Model Locally**

```bash
cd /workspace/ftt
source /venv/main/bin/activate

# Download the model
huggingface-cli download <model-id> \
  --local-dir /workspace/ftt/base_models/<model-name> \
  --local-dir-use-symlinks False

# Then use this path in Gradio:
# /workspace/ftt/base_models/<model-name>
```

**Option 2: Use a HuggingFace ID Directly**
- Simply enter the model ID in the Base Model field (e.g., `mistralai/Mistral-7B-Instruct-v0.2`)
- The script will download it if it is not cached (this may hit the cache issues if they persist)
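If you script this choice, a small helper (hypothetical, not part of the interface) can prefer a workspace copy when one exists and pass the hub ID through otherwise:

```python
from pathlib import Path

def resolve_base_model(model_id: str, base_dir: str = "/workspace/ftt/base_models") -> str:
    """Return a local copy of the model if one exists under base_dir, else the hub ID.

    Assumes local copies are named after the repo,
    e.g. mistralai/Mistral-7B-v0.1 -> <base_dir>/Mistral-7B-v0.1.
    """
    local = Path(base_dir) / model_id.split("/")[-1]
    # config.json is always present in a complete HuggingFace model directory
    return str(local) if (local / "config.json").exists() else model_id
```

With the workspace copy in place this returns the local path; otherwise the ID passes through unchanged so `transformers` downloads it as usual.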
---

## 🔍 Verification

### Check the Model Is Accessible

```bash
python3 << 'EOF'
from transformers import AutoTokenizer, AutoConfig

model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
config = AutoConfig.from_pretrained(model_path, local_files_only=True)
print(f"✅ Tokenizer: {len(tokenizer)} tokens")
print(f"✅ Model: {config.model_type}")
EOF
```

### Check Gradio Status

```bash
# Check the process
ps aux | grep interface_app.py

# Check the port
lsof -i :7860

# View logs (if started with nohup)
tail -f /tmp/gradio_interface.log
```

---
## 📊 Interface Features

### Fine-tuning Section
- ✅ File upload support (JSON/JSONL)
- ✅ HuggingFace dataset integration
- ✅ Automatic train/validation/test split
- ✅ Max sequence length up to 6000
- ✅ GPU-based parameter recommendations
- ✅ Detailed tooltips for all parameters
- ✅ Real-time progress tracking
- ✅ Checkpoint/resume functionality
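The automatic train/validation/test split amounts to a seeded shuffle-and-slice; a minimal sketch (fractions and function name are illustrative, and the script may instead use the `datasets` library):

```python
import random

def split_records(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once with a fixed seed, carve off test and validation, keep the rest as train."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

The fixed seed keeps the split reproducible across runs, which matters when resuming from a checkpoint.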
### API Hosting Section
- ✅ Host fine-tuned models from local paths
- ✅ Host models from HuggingFace repositories
- ✅ FastAPI with automatic documentation
- ✅ Health checks and status monitoring

### Test Inference Section
- ✅ Test local fine-tuned models
- ✅ Test HuggingFace models
- ✅ Adjustable max length (up to 6000)
- ✅ Temperature control with tooltips
- ✅ Uses the API if it is running, otherwise loads the model directly
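The "API if running, otherwise direct loading" behavior hinges on a quick reachability probe; a sketch of that check (the `/health` endpoint path is an assumption based on the health checks listed above):

```python
import requests
from requests.exceptions import RequestException

def api_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the hosted API answers its health endpoint, False on any network error."""
    try:
        return requests.get(f"{base_url}/health", timeout=timeout).status_code == 200
    except RequestException:
        return False
```

When this returns `False`, the inference tab falls back to loading the model weights directly.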
### UI Controls
- ✅ Stop Training button
- ✅ Refresh Status button
- ✅ Scrollable logs with copy functionality
- ✅ Progress bars for training
- ✅ 🔌 Shutdown Gradio Server button (System Controls)

---
## 🐛 Troubleshooting

### Issue: Cache errors persist

**Solution**: Always use local model paths from `/workspace/ftt/base_models/`
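If something must still go through the hub, you can also point the cache itself at workspace storage. The path below matches the cache directory referenced elsewhere in this document, but verify it for your setup:

```shell
# Redirect all HuggingFace caching away from the broken mount.
# Set this before launching the interface or any training run.
export HF_HOME=/workspace/.hf_home
echo "HF_HOME is now $HF_HOME"
```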
### Issue: Training logs not updating

**Solution**:
1. Click the "Refresh Status" button
2. Check that the training process is running: `ps aux | grep finetune_mistral`

### Issue: Interface not accessible

**Solution**:

```bash
# Check if it is running
lsof -i :7860

# Restart if needed
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
```

### Issue: Out of memory during training

**Solution**:
1. Reduce the batch size
2. Reduce the max sequence length
3. Enable gradient checkpointing (already enabled in the script)
4. Use LoRA with a lower rank (r=8 instead of r=16)
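Point 4 works because LoRA's added parameters scale linearly with the rank; a back-of-envelope count (matrix counts and widths here are illustrative of a 7B model, not read from the script):

```python
def lora_added_params(d_in: int, d_out: int, r: int, n_matrices: int) -> int:
    """LoRA adds a pair A (r x d_in) and B (d_out x r) next to each adapted weight matrix."""
    return n_matrices * r * (d_in + d_out)

# e.g. adapting two projections per layer across 32 layers at width 4096
params_r16 = lora_added_params(4096, 4096, r=16, n_matrices=64)
params_r8 = lora_added_params(4096, 4096, r=8, n_matrices=64)
```

Halving the rank halves both the adapter weights and their optimizer state, which is where the memory savings come from.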
---

## 📚 Technical Details

### Training Script

**Location**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`

**Key Features**:
- LoRA fine-tuning for memory efficiency
- Gradient checkpointing enabled
- Automatic device detection (CUDA/MPS/CPU)
- Resume-from-checkpoint support
- JSON configuration export
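The device auto-detection typically reduces to a cascade like the following (a sketch of the common pattern, not the script's exact code):

```python
import torch

def detect_device() -> str:
    """Prefer a CUDA GPU, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists on builds with MPS support
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```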
### Fine-tuning Command (Generated by the Interface)

```bash
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
  --base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --dataset /path/to/your/dataset.jsonl \
  --output-dir ./your-finetuned-model \
  --max-length 2048 \
  --num-epochs 3 \
  --batch-size 4 \
  --learning-rate 2e-4 \
  --lora-r 16 \
  --lora-alpha 32
```

---
## 🎉 Success Criteria

You'll know everything is working when:
1. ✅ The Gradio interface loads without errors
2. ✅ The Base Model field shows the local path
3. ✅ Training starts without cache errors
4. ✅ Progress updates appear in the UI
5. ✅ Model weights are saved to the output directory

---
## 📁 Related Files

- **Interface**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`
- **Training Script**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`
- **Base Model**: `/workspace/ftt/base_models/Mistral-7B-v0.1/`
- **Startup Script**: `/workspace/ftt/semicon-finetuning-scripts/start_interface.sh`
- **Requirements**: `/workspace/ftt/semicon-finetuning-scripts/requirements_interface.txt`

---
## 📞 Support

If you encounter any issues:
1. Check the troubleshooting section of this document
2. Review the training logs in the UI
3. Check process status: `ps aux | grep -E "interface_app|finetune_mistral"`
4. Verify the cache directory: `ls -lh /workspace/.hf_home/hub/`

---

*Last Updated: 2025-11-24*
*Solution: Local model download to bypass the corrupted HuggingFace cache*