Upload docs/LOCAL_MODEL_SETUP.md with huggingface_hub

docs/LOCAL_MODEL_SETUP.md (ADDED, +257 lines)
# Local Model Setup - Solution Summary

## Problem Resolved

**Issue**: Training failed with `OSError: [Errno 116] Stale file handle` when trying to download or load models from the HuggingFace cache.

**Root Cause**: A stale NFS file handle in the HuggingFace cache directory was preventing model access.

**Solution**: Downloaded the Mistral-7B-v0.1 model directly into the workspace, bypassing the corrupted cache.

---

## Model Location

```
/workspace/ftt/base_models/Mistral-7B-v0.1
```

**Size**: 28 GB (includes both PyTorch and SafeTensors formats)

**Contents**:
- ✅ Model weights (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors)
- ✅ Tokenizer (tokenizer.model, tokenizer.json)
- ✅ Configuration files (config.json, generation_config.json)

---

## Changes Made

### 1. Downloaded Model Locally
Used `huggingface-cli` to download the model directly into the workspace:
```bash
huggingface-cli download mistralai/Mistral-7B-v0.1 \
  --local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --local-dir-use-symlinks False
```
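
The same download can also be scripted with `huggingface_hub` from Python; a minimal sketch, assuming the package is installed in the active environment:

```python
from huggingface_hub import snapshot_download

# Download the full repository snapshot straight into the workspace,
# sidestepping the shared (and here, corrupted) HF cache directory.
snapshot_download(
    repo_id="mistralai/Mistral-7B-v0.1",
    local_dir="/workspace/ftt/base_models/Mistral-7B-v0.1",
)
```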

### 2. Updated Gradio Interface
**File**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`

**Change**: Updated the default base model path from a HuggingFace ID to the local path:
```python
# Before:
value="mistralai/Mistral-7B-v0.1"

# After:
value="/workspace/ftt/base_models/Mistral-7B-v0.1"
```
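
For context, this `value=` is the default of the Base Model textbox. Since `interface_app.py` is not reproduced in this document, the surrounding component is shown only as a hypothetical sketch:

```python
import gradio as gr

# Hypothetical sketch of the Base Model field in interface_app.py;
# only the default `value` was changed to the local path.
base_model = gr.Textbox(
    label="Base Model",
    value="/workspace/ftt/base_models/Mistral-7B-v0.1",  # was "mistralai/Mistral-7B-v0.1"
    info="Local path or HuggingFace model ID",
)
```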

### 3. Restarted Interface
Killed the old Gradio process and started a fresh instance with the updated configuration.

---

## How to Use

### Starting Training

1. **Access the Gradio Interface**:
   - The interface is running on port 7860
   - Access it via the public link displayed in the terminal

2. **Fine-tuning Tab**:
   - The Base Model field now defaults to `/workspace/ftt/base_models/Mistral-7B-v0.1`
   - You can still use HuggingFace model IDs if needed
   - Upload your dataset or use HuggingFace datasets
   - Configure the training parameters
   - Click "Start Fine-tuning"

3. **Monitor Training**:
   - Status updates in real time
   - The progress bar shows epoch and loss
   - Logs are scrollable, with copy functionality

### Using Other Models

If you want to use a different base model:

**Option 1: Download Another Model Locally**
```bash
cd /workspace/ftt
source /venv/main/bin/activate

# Download the model
huggingface-cli download <model-id> \
  --local-dir /workspace/ftt/base_models/<model-name> \
  --local-dir-use-symlinks False

# Then use this path in Gradio:
# /workspace/ftt/base_models/<model-name>
```

**Option 2: Use a HuggingFace ID Directly**
- Simply enter the model ID in the Base Model field (e.g., `mistralai/Mistral-7B-Instruct-v0.2`)
- The script will download it if it is not cached; this may hit the cache issues again if they persist, so consider redirecting the cache as sketched below
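
One way to avoid the corrupted cache when pulling by HuggingFace ID is to point `HF_HOME` at a healthy directory before anything reads it; a minimal sketch, where `/workspace/ftt/hf_cache` is an illustrative path:

```python
import os

# Redirect the HuggingFace cache away from the corrupted location.
# This must happen before importing transformers/huggingface_hub,
# which read HF_HOME when first imported.
os.environ["HF_HOME"] = "/workspace/ftt/hf_cache"  # illustrative path

from transformers import AutoTokenizer

# Downloads now land under the fresh cache directory.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
```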

---

## Verification

### Check the Model Is Accessible
```bash
python3 << 'EOF'
from transformers import AutoTokenizer, AutoConfig

model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
config = AutoConfig.from_pretrained(model_path, local_files_only=True)

print(f"✅ Tokenizer: {len(tokenizer)} tokens")
print(f"✅ Model: {config.model_type}")
EOF
```

### Check Gradio Status
```bash
# Check the process
ps aux | grep interface_app.py

# Check the port
lsof -i :7860

# View logs (if started with nohup)
tail -f /tmp/gradio_interface.log
```

---

## Interface Features

### Fine-tuning Section
- ✅ File upload support (JSON/JSONL)
- ✅ HuggingFace dataset integration
- ✅ Automatic train/validation/test split
- ✅ Max sequence length up to 6000
- ✅ GPU-based parameter recommendations
- ✅ Detailed tooltips for all parameters
- ✅ Real-time progress tracking
- ✅ Checkpoint/resume functionality

### API Hosting Section
- ✅ Host fine-tuned models from local paths
- ✅ Host models from HuggingFace repositories
- ✅ FastAPI with automatic documentation
- ✅ Health checks and status monitoring (see the sketch below)
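
The hosting server itself is not reproduced in this document, so the health-check pattern is shown only as a minimal hypothetical sketch (the endpoint name and state flag are illustrative):

```python
from fastapi import FastAPI

app = FastAPI(title="Model API")  # interactive docs auto-served at /docs

MODEL_LOADED = False  # illustrative flag, flipped once weights finish loading

@app.get("/health")
def health():
    # Report whether the hosted model is ready to serve requests.
    return {"status": "ok" if MODEL_LOADED else "loading"}
```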

### Test Inference Section
- ✅ Test local fine-tuned models
- ✅ Test HuggingFace models
- ✅ Adjustable max-length (up to 6000)
- ✅ Temperature control with tooltips
- ✅ Uses API if running, otherwise direct loading

### UI Controls
- ✅ Stop Training button
- ✅ Refresh Status button
- ✅ Scrollable logs with copy functionality
- ✅ Progress bars for training
- ✅ Shutdown Gradio Server button (System Controls)

---

## Troubleshooting

### Issue: Cache errors persist
**Solution**: Always use local model paths under `/workspace/ftt/base_models/`.

### Issue: Training logs not updating
**Solution**:
1. Click the "Refresh Status" button
2. Check that the training process is running: `ps aux | grep finetune_mistral`

### Issue: Interface not accessible
**Solution**:
```bash
# Check if it is running
lsof -i :7860

# Restart if needed
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
```

### Issue: Out of memory during training
**Solution**:
1. Reduce the batch size
2. Reduce the max sequence length
3. Enable gradient checkpointing (already enabled in the script)
4. Use LoRA with a lower rank (r=8 instead of r=16); see the sketch below
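
A lower rank shrinks the LoRA adapter and its optimizer state, cutting memory use. A minimal sketch, assuming the script builds its adapter with `peft` (the `target_modules` list is illustrative for Mistral-style attention layers):

```python
from peft import LoraConfig

# Lower-rank LoRA configuration: roughly half the trainable adapter
# parameters of r=16, at some cost in adapter capacity.
lora_config = LoraConfig(
    r=8,            # reduced from 16
    lora_alpha=32,  # matches the --lora-alpha value used in the command below
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    task_type="CAUSAL_LM",
)
```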

---

## Technical Details

### Training Script
**Location**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`

**Key Features**:
- LoRA fine-tuning for memory efficiency
- Gradient checkpointing enabled
- Automatic device detection (CUDA/MPS/CPU); see the sketch after this list
- Resume-from-checkpoint support
- JSON configuration export
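
The script's own detection logic is not shown in this document; the usual CUDA/MPS/CPU fallback looks like this minimal sketch:

```python
import torch

def detect_device() -> str:
    """Pick the best available backend: CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(f"Using device: {detect_device()}")
```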

### Fine-tuning Command (Generated by the Interface)
```bash
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
  --base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --dataset /path/to/your/dataset.jsonl \
  --output-dir ./your-finetuned-model \
  --max-length 2048 \
  --num-epochs 3 \
  --batch-size 4 \
  --learning-rate 2e-4 \
  --lora-r 16 \
  --lora-alpha 32
```
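
With LoRA fine-tuning, the output directory typically holds adapter weights rather than a full model. A minimal inference sketch, assuming the script saves in standard `peft` adapter format (the adapter path matches `--output-dir` above):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
adapter_path = "./your-finetuned-model"  # --output-dir from the command above

# Load the local base model, then attach the trained LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base_path, local_files_only=True)
base = AutoModelForCausalLM.from_pretrained(
    base_path, torch_dtype=torch.float16, device_map="auto", local_files_only=True
)
model = PeftModel.from_pretrained(base, adapter_path)

inputs = tokenizer("Test prompt", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```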

---

## Success Criteria

You'll know everything is working when:

1. ✅ Gradio interface loads without errors
2. ✅ Base Model field shows the local path
3. ✅ Training starts without cache errors
4. ✅ Progress updates appear in the UI
5. ✅ Model weights are saved to the output directory

---

## Related Files

- **Interface**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`
- **Training Script**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`
- **Base Model**: `/workspace/ftt/base_models/Mistral-7B-v0.1/`
- **Startup Script**: `/workspace/ftt/semicon-finetuning-scripts/start_interface.sh`
- **Requirements**: `/workspace/ftt/semicon-finetuning-scripts/requirements_interface.txt`

---

## Support

If you encounter any issues:

1. Check the Troubleshooting section of this document
2. Review the training logs in the UI
3. Check process status: `ps aux | grep -E "interface_app|finetune_mistral"`
4. Verify that the cache directories are clear: `ls -lh /workspace/.hf_home/hub/`

---

*Last Updated: 2025-11-24*
*Solution: Local model download to bypass the corrupted HuggingFace cache*