Local Model Setup - Solution Summary
Problem Resolved
Issue: Training failed with OSError: [Errno 116] Stale file handle when trying to download/use models from HuggingFace cache.
Root Cause: Corrupted NFS file handle in HuggingFace cache directory preventing model access.
Solution: Downloaded Mistral-7B-v0.1 model directly to workspace, bypassing the corrupted cache.
Model Location
/workspace/ftt/base_models/Mistral-7B-v0.1
Size: 28 GB (includes both PyTorch and SafeTensors formats)
Contents:
- Model weights (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors)
- Tokenizer (tokenizer.model, tokenizer.json)
- Configuration files (config.json, generation_config.json)
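As a quick sanity check, a short helper like the following (a sketch; the directory path and expected file list are taken from the layout above) can confirm those files are actually present:

```python
import os

# Expected artifacts for the locally downloaded Mistral-7B-v0.1
# (file names taken from the contents list above)
EXPECTED = [
    "config.json",
    "generation_config.json",
    "tokenizer.model",
    "tokenizer.json",
    "model-00001-of-00002.safetensors",
    "model-00002-of-00002.safetensors",
]

def missing_files(model_dir, expected=EXPECTED):
    """Return the subset of expected files not present in model_dir."""
    return [f for f in expected if not os.path.isfile(os.path.join(model_dir, f))]

if __name__ == "__main__":
    gaps = missing_files("/workspace/ftt/base_models/Mistral-7B-v0.1")
    print("All files present" if not gaps else f"Missing: {gaps}")
```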
Changes Made
1. Downloaded Model Locally
Used huggingface-cli to download model directly to workspace:
huggingface-cli download mistralai/Mistral-7B-v0.1 \
--local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
--local-dir-use-symlinks False
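If you prefer Python over the CLI, huggingface_hub's snapshot_download can perform the same download; the target_dir helper here is just an illustrative way to map a model ID onto the workspace layout used above:

```python
def target_dir(base: str, model_id: str) -> str:
    """Map a hub model ID ("org/name") onto a workspace-local directory."""
    return f"{base.rstrip('/')}/{model_id.split('/')[-1]}"

def download_model(model_id: str, base: str = "/workspace/ftt/base_models") -> str:
    """Download a model into the workspace, bypassing the shared cache.

    Imported lazily so the path helper above stays usable without
    huggingface_hub installed.
    """
    from huggingface_hub import snapshot_download
    path = target_dir(base, model_id)
    snapshot_download(repo_id=model_id, local_dir=path)
    return path
```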
2. Updated Gradio Interface
File: /workspace/ftt/semicon-finetuning-scripts/interface_app.py
Change: Updated default base model path from HuggingFace ID to local path:
# Before:
value="mistralai/Mistral-7B-v0.1"
# After:
value="/workspace/ftt/base_models/Mistral-7B-v0.1"
3. Restarted Interface
Killed old Gradio process and started fresh instance with updated configuration.
How to Use
Starting Training
Access Gradio Interface:
- The interface is running on port 7860
- Access via the public link displayed in the terminal
Fine-tuning Tab:
- Base Model field now defaults to /workspace/ftt/base_models/Mistral-7B-v0.1
- You can still use HuggingFace model IDs if needed
- Upload your dataset or use HuggingFace datasets
- Configure training parameters
- Click "Start Fine-tuning"
Monitor Training:
- Status updates in real-time
- Progress bar shows epoch and loss
- Logs are scrollable with copy functionality
Using Other Models
If you want to use a different base model:
Option 1: Download Another Model Locally
cd /workspace/ftt
source /venv/main/bin/activate
# Download model
huggingface-cli download <model-id> \
--local-dir /workspace/ftt/base_models/<model-name> \
--local-dir-use-symlinks False
# Use the path in Gradio:
# /workspace/ftt/base_models/<model-name>
Option 2: Use HuggingFace ID Directly
- Simply enter the model ID in the Base Model field (e.g., mistralai/Mistral-7B-Instruct-v0.2)
- The script will download the model if it is not cached (this may hit the same cache issues if they persist)
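One way the local-path-versus-hub-ID distinction can be handled (a sketch, not the actual interface_app.py code) is to force offline loading whenever the configured value is an existing directory:

```python
import os

def load_kwargs(base_model: str) -> dict:
    """Build kwargs for transformers' from_pretrained().

    For a workspace-local directory, set local_files_only=True so the
    (possibly corrupted) shared cache is never touched; anything else is
    passed through as a hub model ID and may trigger a download.
    """
    if os.path.isdir(base_model):
        return {
            "pretrained_model_name_or_path": base_model,
            "local_files_only": True,  # never consult the shared cache
        }
    return {"pretrained_model_name_or_path": base_model}
```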
Verification
Check Model is Accessible
python3 << 'EOF'
from transformers import AutoTokenizer, AutoConfig
model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
config = AutoConfig.from_pretrained(model_path, local_files_only=True)
print(f"Tokenizer: {len(tokenizer)} tokens")
print(f"Model: {config.model_type}")
EOF
Check Gradio Status
# Check process
ps aux | grep interface_app.py
# Check port
lsof -i :7860
# View logs (if started with nohup)
tail -f /tmp/gradio_interface.log
Interface Features
Fine-tuning Section
- File upload support (JSON/JSONL)
- HuggingFace dataset integration
- Automatic train/validation/test split
- Max sequence length up to 6000
- GPU-based parameter recommendations
- Detailed tooltips for all parameters
- Real-time progress tracking
- Checkpoint/resume functionality
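The automatic train/validation/test split can be sketched roughly as follows; the 80/10/10 proportions and fixed seed are illustrative assumptions, not necessarily what the interface uses:

```python
import random

def split_dataset(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle records deterministically and cut them into three portions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    train = shuffled[n_test + n_val:]
    val = shuffled[n_test:n_test + n_val]
    test = shuffled[:n_test]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```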
API Hosting Section
- Host fine-tuned models from local paths
- Host models from HuggingFace repositories
- FastAPI with automatic documentation
- Health checks and status monitoring
Test Inference Section
- Test local fine-tuned models
- Test HuggingFace models
- Adjustable max-length (up to 6000)
- Temperature control with tooltips
- Uses API if running, otherwise direct loading
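The API-if-running fallback can be sketched like this; the health-endpoint URL and port are assumptions for illustration, not confirmed details of the hosting script:

```python
import urllib.request
import urllib.error

def api_available(url="http://localhost:8000/health", timeout=1.0) -> bool:
    """Probe the FastAPI health endpoint; any connection error means 'down'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def choose_backend() -> str:
    """Prefer the running API server; fall back to loading the model directly."""
    return "api" if api_available() else "direct"
```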
UI Controls
- Stop Training button
- Refresh Status button
- Scrollable logs with copy functionality
- Progress bars for training
- Shutdown Gradio Server button (System Controls)
Troubleshooting
Issue: Cache errors persist
Solution: Always use local model paths from /workspace/ftt/base_models/
Issue: Training logs not updating
Solution:
- Click "Refresh Status" button
- Check that training process is running:
ps aux | grep finetune_mistral
Issue: Interface not accessible
Solution:
# Check if running
lsof -i :7860
# Restart if needed
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
Issue: Out of memory during training
Solution:
- Reduce batch size
- Reduce max sequence length
- Enable gradient checkpointing (already enabled in script)
- Use LoRA with lower rank (r=8 instead of r=16)
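Taken together, the suggestions above amount to a low-memory configuration along these lines. The values are illustrative, and gradient_accumulation_steps is an assumed extra option, not a flag confirmed by the training script:

```python
# Illustrative low-memory overrides for an out-of-memory situation.
LOW_MEMORY = {
    "batch_size": 1,                   # fewer activations per step
    "gradient_accumulation_steps": 8,  # recover an effective batch of 8 (assumed option)
    "max_length": 1024,                # shorter sequences, smaller tensors
    "lora_r": 8,                       # lower LoRA rank (vs. the default 16)
    "lora_alpha": 16,                  # commonly scaled ~2x the rank
}
```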
Technical Details
Training Script
Location: /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py
Key Features:
- LoRA fine-tuning for memory efficiency
- Gradient checkpointing enabled
- Automatic device detection (CUDA/MPS/CPU)
- Resume from checkpoint support
- JSON configuration export
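The device-detection order can be sketched as pure logic; in the real script this would be driven by torch.cuda.is_available() and torch.backends.mps.is_available():

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Device priority: CUDA first, then Apple MPS, then CPU fallback."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

print(pick_device(cuda_available=False, mps_available=True))  # mps
```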
Fine-tuning Command (Generated by Interface)
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
--base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
--dataset /path/to/your/dataset.jsonl \
--output-dir ./your-finetuned-model \
--max-length 2048 \
--num-epochs 3 \
--batch-size 4 \
--learning-rate 2e-4 \
--lora-r 16 \
--lora-alpha 32
Success Criteria
You'll know everything is working when:
- Gradio interface loads without errors
- Base model field shows local path
- Training starts without cache errors
- Progress updates appear in UI
- Model weights are saved to output directory
Related Files
- Interface: /workspace/ftt/semicon-finetuning-scripts/interface_app.py
- Training Script: /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py
- Base Model: /workspace/ftt/base_models/Mistral-7B-v0.1/
- Startup Script: /workspace/ftt/semicon-finetuning-scripts/start_interface.sh
- Requirements: /workspace/ftt/semicon-finetuning-scripts/requirements_interface.txt
Support
If you encounter any issues:
- Check this document's troubleshooting section
- Review the training logs in the UI
- Check process status:
ps aux | grep -E "interface_app|finetune_mistral"
- Check that the cache directories are clean:
ls -lh /workspace/.hf_home/hub/
Last Updated: 2025-11-24
Solution: Local model download to bypass the corrupted HuggingFace cache