Local Model Setup - Solution Summary
Problem Resolved
Issue: Training failed with OSError: [Errno 116] Stale file handle when trying to download/use models from HuggingFace cache.
Root Cause: Corrupted NFS file handle in HuggingFace cache directory preventing model access.
Solution: Downloaded Mistral-7B-v0.1 model directly to workspace, bypassing the corrupted cache.
Model Location
/workspace/ftt/base_models/Mistral-7B-v0.1
Size: 28 GB (includes both PyTorch and SafeTensors formats)
Contents:
- Model weights (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors)
- Tokenizer (tokenizer.model, tokenizer.json)
- Configuration files (config.json, generation_config.json)
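As a quick sanity check, a short helper like the following (a sketch; the directory path and expected file list are taken from the layout above) can confirm those files are actually present:

```python
import os

# Expected artifacts for the locally downloaded Mistral-7B-v0.1
# (file names taken from the contents list above)
EXPECTED = [
    "config.json",
    "generation_config.json",
    "tokenizer.model",
    "tokenizer.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]

def missing_files(model_dir, expected=EXPECTED):
    """Return the subset of expected files not present in model_dir."""
    return [f for f in expected if not os.path.isfile(os.path.join(model_dir, f))]

if __name__ == "__main__":
    gaps = missing_files("/workspace/ftt/base_models/Mistral-7B-v0.1")
    print("All files present" if not gaps else f"Missing: {gaps}")
```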
Changes Made
1. Downloaded Model Locally
Used huggingface-cli to download model directly to workspace:
huggingface-cli download mistralai/Mistral-7B-v0.1 \
--local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
--local-dir-use-symlinks False
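If you prefer Python over the CLI, huggingface_hub's snapshot_download can perform the same download; the target_dir helper here is just an illustrative way to map a model ID onto the workspace layout used above:

```python
def target_dir(base: str, model_id: str) -> str:
    """Map a hub model ID ("org/name") onto a workspace-local directory."""
    return f"{base.rstrip('/')}/{model_id.split('/')[-1]}"

def download_model(model_id: str, base: str = "/workspace/ftt/base_models") -> str:
    """Download a model into the workspace, bypassing the shared cache.

    Imported lazily so the path helper above stays usable without
    huggingface_hub installed.
    """
    from huggingface_hub import snapshot_download
    path = target_dir(base, model_id)
    snapshot_download(repo_id=model_id, local_dir=path)
    return path
```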
2. Updated Gradio Interface
File: /workspace/ftt/semicon-finetuning-scripts/interface_app.py
Change: Updated default base model path from HuggingFace ID to local path:
# Before:
value="mistralai/Mistral-7B-v0.1"
# After:
value="/workspace/ftt/base_models/Mistral-7B-v0.1"
3. Restarted Interface
Killed old Gradio process and started fresh instance with updated configuration.
How to Use
Starting Training
Access Gradio Interface:
- The interface is running on port 7860
- Access via the public link displayed in the terminal
Fine-tuning Tab:
- Base Model field now defaults to /workspace/ftt/base_models/Mistral-7B-v0.1
- You can still use HuggingFace model IDs if needed
- Upload your dataset or use HuggingFace datasets
- Configure training parameters
- Click "Start Fine-tuning"
Monitor Training:
- Status updates in real-time
- Progress bar shows epoch and loss
- Logs are scrollable with copy functionality
Using Other Models
If you want to use a different base model:
Option 1: Download Another Model Locally
cd /workspace/ftt
source /venv/main/bin/activate
# Download model
huggingface-cli download <model-id> \
--local-dir /workspace/ftt/base_models/<model-name> \
--local-dir-use-symlinks False
# Use the path in Gradio:
# /workspace/ftt/base_models/<model-name>
Option 2: Use HuggingFace ID Directly
- Simply enter the model ID in the Base Model field (e.g., mistralai/Mistral-7B-Instruct-v0.2)
- The script will download the model if it is not cached (this may hit the same cache issues if they persist)
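One way the local-path-versus-hub-ID distinction can be handled (a sketch, not the actual interface_app.py code) is to force offline loading whenever the configured value is an existing directory:

```python
import os

def load_kwargs(base_model: str) -> dict:
    """Build kwargs for transformers' from_pretrained().

    For a workspace-local directory, set local_files_only=True so the
    (possibly corrupted) shared cache is never touched; anything else is
    passed through as a hub model ID and may trigger a download.
    """
    if os.path.isdir(base_model):
        return {
            "pretrained_model_name_or_path": base_model,
            "local_files_only": True,  # never consult the shared cache
        }
    return {"pretrained_model_name_or_path": base_model}
```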
Verification
Check Model is Accessible
python3 << 'EOF'
from transformers import AutoTokenizer, AutoConfig
model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
config = AutoConfig.from_pretrained(model_path, local_files_only=True)
print(f"Tokenizer: {len(tokenizer)} tokens")
print(f"Model: {config.model_type}")
EOF
Check Gradio Status
# Check process
ps aux | grep interface_app.py
# Check port
lsof -i :7860
# View logs (if started with nohup)
tail -f /tmp/gradio_interface.log
Interface Features
Fine-tuning Section
- File upload support (JSON/JSONL)
- HuggingFace dataset integration
- Automatic train/validation/test split
- Max sequence length up to 6000
- GPU-based parameter recommendations
- Detailed tooltips for all parameters
- Real-time progress tracking
- Checkpoint/resume functionality
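The automatic train/validation/test split can be sketched roughly as follows; the 80/10/10 proportions and fixed seed are illustrative assumptions, not necessarily what the interface uses:

```python
import random

def split_dataset(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle records deterministically and cut them into three portions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    train = shuffled[n_test + n_val:]
    val = shuffled[n_test:n_test + n_val]
    test = shuffled[:n_test]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```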
API Hosting Section
- Host fine-tuned models from local paths
- Host models from HuggingFace repositories
- FastAPI with automatic documentation
- Health checks and status monitoring
Test Inference Section
- Test local fine-tuned models
- Test HuggingFace models
- Adjustable max-length (up to 6000)
- Temperature control with tooltips
- Uses API if running, otherwise direct loading
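The API-if-running fallback can be sketched like this; the health-endpoint URL and port are assumptions for illustration, not confirmed details of the hosting script:

```python
import urllib.request
import urllib.error

def api_available(url="http://localhost:8000/health", timeout=1.0) -> bool:
    """Probe the FastAPI health endpoint; any connection error means 'down'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def choose_backend() -> str:
    """Prefer the running API server; fall back to loading the model directly."""
    return "api" if api_available() else "direct"
```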
UI Controls
- Stop Training button
- Refresh Status button
- Scrollable logs with copy functionality
- Progress bars for training
- Shutdown Gradio Server button (System Controls)
Troubleshooting
Issue: Cache errors persist
Solution: Always use local model paths from /workspace/ftt/base_models/
Issue: Training logs not updating
Solution:
- Click "Refresh Status" button
- Check that training process is running:
ps aux | grep finetune_mistral
Issue: Interface not accessible
Solution:
# Check if running
lsof -i :7860
# Restart if needed
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
Issue: Out of memory during training
Solution:
- Reduce batch size
- Reduce max sequence length
- Enable gradient checkpointing (already enabled in script)
- Use LoRA with lower rank (r=8 instead of r=16)
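Taken together, the suggestions above amount to a low-memory configuration along these lines. The values are illustrative, and gradient_accumulation_steps is an assumed extra option, not a flag confirmed by the training script:

```python
# Illustrative low-memory overrides for an out-of-memory situation.
LOW_MEMORY = {
    "batch_size": 1,                   # fewer activations per step
    "gradient_accumulation_steps": 8,  # recover an effective batch of 8 (assumed option)
    "max_length": 1024,                # shorter sequences, smaller tensors
    "lora_r": 8,                       # lower LoRA rank (vs. the default 16)
    "lora_alpha": 16,                  # commonly scaled ~2x the rank
}
```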
Technical Details
Training Script
Location: /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py
Key Features:
- LoRA fine-tuning for memory efficiency
- Gradient checkpointing enabled
- Automatic device detection (CUDA/MPS/CPU)
- Resume from checkpoint support
- JSON configuration export
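The device-detection order can be sketched as pure logic; in the real script this would be driven by torch.cuda.is_available() and torch.backends.mps.is_available():

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Device priority: CUDA first, then Apple MPS, then CPU fallback."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

print(pick_device(cuda_available=False, mps_available=True))  # mps
```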
Fine-tuning Command (Generated by Interface)
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
--base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
--dataset /path/to/your/dataset.jsonl \
--output-dir ./your-finetuned-model \
--max-length 2048 \
--num-epochs 3 \
--batch-size 4 \
--learning-rate 2e-4 \
--lora-r 16 \
--lora-alpha 32
Success Criteria
You'll know everything is working when:
- Gradio interface loads without errors
- Base model field shows local path
- Training starts without cache errors
- Progress updates appear in UI
- Model weights are saved to output directory
Related Files
- Interface: /workspace/ftt/semicon-finetuning-scripts/interface_app.py
- Training Script: /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py
- Base Model: /workspace/ftt/base_models/Mistral-7B-v0.1/
- Startup Script: /workspace/ftt/semicon-finetuning-scripts/start_interface.sh
- Requirements: /workspace/ftt/semicon-finetuning-scripts/requirements_interface.txt
Support
If you encounter any issues:
- Check this document's troubleshooting section
- Review the training logs in the UI
- Check process status:
ps aux | grep -E "interface_app|finetune_mistral"
- Check that the cache directories are clean:
ls -lh /workspace/.hf_home/hub/
Last Updated: 2025-11-24
Solution: Local model download to bypass the corrupted HuggingFace cache