# Local Model Setup - Solution Summary
## 🎯 Problem Resolved
**Issue**: Training failed with `OSError: [Errno 116] Stale file handle` when trying to download/use models from HuggingFace cache.
**Root Cause**: A stale NFS file handle in the HuggingFace cache directory prevented model access.
**Solution**: Downloaded Mistral-7B-v0.1 model directly to workspace, bypassing the corrupted cache.
---
## πŸ“¦ Model Location
```
/workspace/ftt/base_models/Mistral-7B-v0.1
```
**Size**: 28 GB (includes both PyTorch and SafeTensors formats)
**Contents**:
- βœ“ Model weights (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors)
- βœ“ Tokenizer (tokenizer.model, tokenizer.json)
- βœ“ Configuration files (config.json, generation_config.json)
---
## πŸ”§ Changes Made
### 1. Downloaded Model Locally
Used `huggingface-cli` to download the model directly to the workspace:
```bash
huggingface-cli download mistralai/Mistral-7B-v0.1 \
  --local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --local-dir-use-symlinks False
```
### 2. Updated Gradio Interface
**File**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`
**Change**: Updated default base model path from HuggingFace ID to local path:
```python
# Before:
value="mistralai/Mistral-7B-v0.1"
# After:
value="/workspace/ftt/base_models/Mistral-7B-v0.1"
```
### 3. Restarted Interface
Killed the old Gradio process and started a fresh instance with the updated configuration.
---
## πŸš€ How to Use
### Starting Training
1. **Access Gradio Interface**:
- The interface is running on port 7860
- Access via the public link displayed in the terminal
2. **Fine-tuning Tab**:
- Base Model field now defaults to: `/workspace/ftt/base_models/Mistral-7B-v0.1`
- You can still use HuggingFace model IDs if needed
- Upload your dataset or use HuggingFace datasets
- Configure training parameters
- Click "Start Fine-tuning"
3. **Monitor Training**:
- Status updates in real-time
- Progress bar shows epoch and loss
- Logs are scrollable with copy functionality
### Using Other Models
If you want to use a different base model:
**Option 1: Download Another Model Locally**
```bash
cd /workspace/ftt
source /venv/main/bin/activate
# Download model
huggingface-cli download <model-id> \
--local-dir /workspace/ftt/base_models/<model-name> \
--local-dir-use-symlinks False
# Use the path in Gradio:
# /workspace/ftt/base_models/<model-name>
```
**Option 2: Use HuggingFace ID Directly**
- Simply enter the model ID in the Base Model field (e.g., `mistralai/Mistral-7B-Instruct-v0.2`)
- The script will download the model if it is not already cached (this may hit the same stale-handle errors if they persist)
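Both options feed the same Base Model field, so a small resolver can decide which kind of value it received. The following is a minimal sketch (the `resolve_base_model` helper and the `base_models` lookup convention are illustrative assumptions, not code from `interface_app.py`): if the value is an existing directory it is loaded locally, otherwise it is treated as a HuggingFace model ID.

```python
import os

LOCAL_MODELS_DIR = "/workspace/ftt/base_models"  # layout assumed from this doc

def resolve_base_model(value: str) -> str:
    """Return a reference usable with from_pretrained().

    An existing directory is used as-is (local load, no HF cache
    involved); otherwise the value is assumed to be a HuggingFace
    model ID and returned unchanged.
    """
    if os.path.isdir(value):
        return value
    # Also check the conventional base_models folder for a bare name.
    candidate = os.path.join(LOCAL_MODELS_DIR, value)
    if os.path.isdir(candidate):
        return candidate
    return value  # assume HuggingFace model ID
```

Passing the result to `AutoModelForCausalLM.from_pretrained()` then behaves the same for both cases.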
---
## πŸ” Verification
### Check Model is Accessible
```bash
python3 << 'EOF'
from transformers import AutoTokenizer, AutoConfig
model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
config = AutoConfig.from_pretrained(model_path, local_files_only=True)
print(f"βœ“ Tokenizer: {len(tokenizer)} tokens")
print(f"βœ“ Model: {config.model_type}")
EOF
```
### Check Gradio Status
```bash
# Check process
ps aux | grep interface_app.py
# Check port
lsof -i :7860
# View logs (if started with nohup)
tail -f /tmp/gradio_interface.log
```
---
## πŸ“Š Interface Features
### Fine-tuning Section
- βœ“ File upload support (JSON/JSONL)
- βœ“ HuggingFace dataset integration
- βœ“ Automatic train/validation/test split
- βœ“ Max sequence length up to 6000
- βœ“ GPU-based parameter recommendations
- βœ“ Detailed tooltips for all parameters
- βœ“ Real-time progress tracking
- βœ“ Checkpoint/resume functionality
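The automatic train/validation/test split can be sketched as a seeded shuffle followed by ratio slicing. The 80/10/10 ratios and the `split_dataset` helper below are illustrative assumptions; the interface exposes its own split settings.

```python
import random

def split_dataset(records, train=0.8, val=0.1, seed=42):
    """Shuffle records deterministically and slice into three splits.

    The remainder after the train and validation slices becomes the
    test split, so the three parts always cover every record exactly once.
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

data = [{"text": f"example {i}"} for i in range(100)]
train_set, val_set, test_set = split_dataset(data)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```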
### API Hosting Section
- βœ“ Host fine-tuned models from local paths
- βœ“ Host models from HuggingFace repositories
- βœ“ FastAPI with automatic documentation
- βœ“ Health checks and status monitoring
### Test Inference Section
- βœ“ Test local fine-tuned models
- βœ“ Test HuggingFace models
- βœ“ Adjustable max-length (up to 6000)
- βœ“ Temperature control with tooltips
- βœ“ Uses API if running, otherwise direct loading
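The "API if running, otherwise direct loading" behavior can be approximated with a plain TCP probe. This is a sketch, not the interface's actual implementation, and port 8000 is an assumed default for the FastAPI server:

```python
import socket

def api_available(host="127.0.0.1", port=8000, timeout=0.5):
    """Return True if something is accepting connections on the API port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Assumed dispatch: call the hosted API when it is up, otherwise
# load the model weights directly in-process.
mode = "api" if api_available() else "direct"
```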
### UI Controls
- βœ“ Stop Training button
- βœ“ Refresh Status button
- βœ“ Scrollable logs with copy functionality
- βœ“ Progress bars for training
- βœ“ πŸ›‘ Shutdown Gradio Server button (System Controls)
---
## πŸ› Troubleshooting
### Issue: Cache errors persist
**Solution**: Always use local model paths under `/workspace/ftt/base_models/` so the HuggingFace cache is never touched.
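If stale-handle errors reappear for models that must be downloaded, a possible workaround is relocating the HuggingFace cache to local disk via the standard `HF_HOME` environment variable. The path below is an assumption; any writable non-NFS directory works, and it must be set *before* `transformers` is imported:

```python
import os

# Assumed workaround: point the HF cache at local disk so a bad
# NFS-backed cache directory is never consulted.
os.environ["HF_HOME"] = "/workspace/ftt/hf_home"
```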
### Issue: Training logs not updating
**Solution**:
1. Click "Refresh Status" button
2. Check that training process is running: `ps aux | grep finetune_mistral`
### Issue: Interface not accessible
**Solution**:
```bash
# Check if running
lsof -i :7860
# Restart if needed
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
```
### Issue: Out of memory during training
**Solution**:
1. Reduce batch size
2. Reduce max sequence length
3. Enable gradient checkpointing (already enabled in script)
4. Use LoRA with lower rank (r=8 instead of r=16)
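Why step 4 helps: a LoRA adapter for a single weight matrix adds roughly `r * (d_in + d_out)` trainable parameters, so halving the rank halves adapter memory. The back-of-envelope below uses 4096, matching Mistral-7B's `hidden_size` from `config.json`; treating the projection as square 4096x4096 is an illustrative assumption.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters a rank-r LoRA adapter adds to one d_in x d_out matrix."""
    return r * (d_in + d_out)

r16 = lora_params(4096, 4096, r=16)
r8 = lora_params(4096, 4096, r=8)
print(r16, r8, r16 // r8)  # 131072 65536 2
```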
---
## πŸ“ Technical Details
### Training Script
**Location**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`
**Key Features**:
- LoRA fine-tuning for memory efficiency
- Gradient checkpointing enabled
- Automatic device detection (CUDA/MPS/CPU)
- Resume from checkpoint support
- JSON configuration export
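The automatic device detection can be sketched as a CUDA -> MPS -> CPU preference order. This is an assumed ordering, not code lifted from `finetune_mistral7b.py`, and it degrades to CPU when `torch` is unavailable:

```python
def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU (assumed preference order)."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # getattr guards against torch builds that predate the MPS backend.
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```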
### Fine-tuning Command (Generated by Interface)
```bash
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
  --base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --dataset /path/to/your/dataset.jsonl \
  --output-dir ./your-finetuned-model \
  --max-length 2048 \
  --num-epochs 3 \
  --batch-size 4 \
  --learning-rate 2e-4 \
  --lora-r 16 \
  --lora-alpha 32
```
---
## πŸŽ‰ Success Criteria
You'll know everything is working when:
1. βœ… Gradio interface loads without errors
2. βœ… Base model field shows local path
3. βœ… Training starts without cache errors
4. βœ… Progress updates appear in UI
5. βœ… Model weights are saved to output directory
---
## πŸ“š Related Files
- **Interface**: `/workspace/ftt/semicon-finetuning-scripts/interface_app.py`
- **Training Script**: `/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py`
- **Base Model**: `/workspace/ftt/base_models/Mistral-7B-v0.1/`
- **Startup Script**: `/workspace/ftt/semicon-finetuning-scripts/start_interface.sh`
- **Requirements**: `/workspace/ftt/semicon-finetuning-scripts/requirements_interface.txt`
---
## πŸ†˜ Support
If you encounter any issues:
1. Check this document's troubleshooting section
2. Review the training logs in the UI
3. Check process status: `ps aux | grep -E "interface_app|finetune_mistral"`
4. Inspect the state of the cache directory: `ls -lh /workspace/.hf_home/hub/`
---
*Last Updated: 2025-11-24*
*Solution: Local model download to bypass corrupted HuggingFace cache*