
Local Model Setup - Solution Summary

🎯 Problem Resolved

Issue: Training failed with OSError: [Errno 116] Stale file handle when trying to download/use models from HuggingFace cache.

Root Cause: A stale NFS file handle in the HuggingFace cache directory prevented model access.

Solution: Downloaded Mistral-7B-v0.1 model directly to workspace, bypassing the corrupted cache.


📦 Model Location

```
/workspace/ftt/base_models/Mistral-7B-v0.1
```

Size: 28 GB (includes both PyTorch and SafeTensors formats)

Contents:

  • ✓ Model weights (model-00001-of-00002.safetensors, model-00002-of-00002.safetensors)
  • ✓ Tokenizer (tokenizer.model, tokenizer.json)
  • ✓ Configuration files (config.json, generation_config.json)

🔧 Changes Made

1. Downloaded Model Locally

Used huggingface-cli to download model directly to workspace:

```bash
huggingface-cli download mistralai/Mistral-7B-v0.1 \
  --local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --local-dir-use-symlinks False
```

2. Updated Gradio Interface

File: /workspace/ftt/semicon-finetuning-scripts/interface_app.py

Change: Updated default base model path from HuggingFace ID to local path:

```python
# Before:
value="mistralai/Mistral-7B-v0.1"

# After:
value="/workspace/ftt/base_models/Mistral-7B-v0.1"
```
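If you want the Base Model field to accept either form, a small helper can decide at runtime whether a value is a usable local snapshot or should be passed through as a hub ID. This is a hypothetical sketch, not code from interface_app.py:

```python
from pathlib import Path

def resolve_base_model(value: str) -> str:
    """Return the value as a local directory if it looks like a complete
    snapshot, otherwise pass it through as a HuggingFace model ID.
    Hypothetical helper, not part of interface_app.py."""
    p = Path(value)
    if p.is_dir() and (p / "config.json").exists():
        return str(p)  # usable local snapshot
    return value       # assume a hub ID like "mistralai/Mistral-7B-v0.1"
```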

3. Restarted Interface

Killed old Gradio process and started fresh instance with updated configuration.


🚀 How to Use

Starting Training

  1. Access Gradio Interface:

    • The interface is running on port 7860
    • Access via the public link displayed in the terminal
  2. Fine-tuning Tab:

    • Base Model field now defaults to: /workspace/ftt/base_models/Mistral-7B-v0.1
    • You can still use HuggingFace model IDs if needed
    • Upload your dataset or use HuggingFace datasets
    • Configure training parameters
    • Click "Start Fine-tuning"
  3. Monitor Training:

    • Status updates in real-time
    • Progress bar shows epoch and loss
    • Logs are scrollable with copy functionality

Using Other Models

If you want to use a different base model:

Option 1: Download Another Model Locally

```bash
cd /workspace/ftt
source /venv/main/bin/activate

# Download model
huggingface-cli download <model-id> \
  --local-dir /workspace/ftt/base_models/<model-name> \
  --local-dir-use-symlinks False

# Use the path in Gradio:
# /workspace/ftt/base_models/<model-name>
```

Option 2: Use HuggingFace ID Directly

  • Simply enter the model ID in the Base Model field (e.g., mistralai/Mistral-7B-Instruct-v0.2)
  • The script will download it if it is not already cached (note that the stale-cache issue may recur)

πŸ” Verification

Check Model is Accessible

```bash
python3 << 'EOF'
from transformers import AutoTokenizer, AutoConfig

model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
config = AutoConfig.from_pretrained(model_path, local_files_only=True)

print(f"✓ Tokenizer: {len(tokenizer)} tokens")
print(f"✓ Model: {config.model_type}")
EOF
```
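Beyond loading the tokenizer, you can also sanity-check that the download itself is complete. A minimal sketch; the exact set of required files varies by model, so treat the list below as an assumption:

```python
from pathlib import Path

REQUIRED = ["config.json", "tokenizer.json", "tokenizer_config.json"]

def missing_files(model_dir: str) -> list:
    """Report files that appear to be missing from a local model
    directory. Illustrative check: the required-file list and weight
    extensions are assumptions, not transformers internals."""
    root = Path(model_dir)
    missing = [name for name in REQUIRED if not (root / name).exists()]
    # Weights may be sharded SafeTensors or legacy PyTorch .bin files.
    if not list(root.glob("*.safetensors")) and not list(root.glob("*.bin")):
        missing.append("model weights (*.safetensors or *.bin)")
    return missing
```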

Check Gradio Status

```bash
# Check process
ps aux | grep interface_app.py

# Check port
lsof -i :7860

# View logs (if started with nohup)
tail -f /tmp/gradio_interface.log
```

📊 Interface Features

Fine-tuning Section

  • ✓ File upload support (JSON/JSONL)
  • ✓ HuggingFace dataset integration
  • ✓ Automatic train/validation/test split
  • ✓ Max sequence length up to 6000
  • ✓ GPU-based parameter recommendations
  • ✓ Detailed tooltips for all parameters
  • ✓ Real-time progress tracking
  • ✓ Checkpoint/resume functionality
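The automatic train/validation/test split can be sketched as follows; the real script's ratios and shuffling may differ, and `split_dataset` is a hypothetical name:

```python
import random

def split_dataset(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle records and split them into train/validation/test.
    A minimal sketch of the interface's automatic split, assuming
    a 80/10/10 default; the actual implementation may differ."""
    rng = random.Random(seed)       # fixed seed for reproducible splits
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```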

API Hosting Section

  • ✓ Host fine-tuned models from local paths
  • ✓ Host models from HuggingFace repositories
  • ✓ FastAPI with automatic documentation
  • ✓ Health checks and status monitoring

Test Inference Section

  • ✓ Test local fine-tuned models
  • ✓ Test HuggingFace models
  • ✓ Adjustable max-length (up to 6000)
  • ✓ Temperature control with tooltips
  • ✓ Uses API if running, otherwise direct loading
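The API-or-direct fallback boils down to probing a health endpoint before deciding how to run inference. A sketch, where the port 8000 and the /health path are assumptions:

```python
import urllib.error
import urllib.request

def api_is_up(base_url="http://127.0.0.1:8000", timeout=1.0):
    """Return True if the hosted API's health endpoint answers.
    Sketch of the 'use API if running, otherwise load the model
    directly' decision; URL and endpoint are assumptions."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # caller falls back to loading the model in-process
```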

UI Controls

  • ✓ Stop Training button
  • ✓ Refresh Status button
  • ✓ Scrollable logs with copy functionality
  • ✓ Progress bars for training
  • ✓ 🛑 Shutdown Gradio Server button (System Controls)

πŸ› Troubleshooting

Issue: Cache errors persist

Solution: Always use local model paths from /workspace/ftt/base_models/

Issue: Training logs not updating

Solution:

  1. Click "Refresh Status" button
  2. Check that training process is running: ps aux | grep finetune_mistral

Issue: Interface not accessible

Solution:

```bash
# Check if running
lsof -i :7860

# Restart if needed
pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py
```

Issue: Out of memory during training

Solution:

  1. Reduce batch size
  2. Reduce max sequence length
  3. Enable gradient checkpointing (already enabled in script)
  4. Use LoRA with lower rank (r=8 instead of r=16)
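The r=8 suggestion works because a LoRA adapter adds r·(d_in + d_out) trainable parameters per targeted weight matrix, so halving the rank halves adapter size. A quick sanity check, using Mistral-7B's 4096-dimensional q_proj as an example (which modules the script actually targets is an assumption):

```python
def lora_param_count(r, d_in, d_out):
    """Trainable parameters a LoRA adapter adds to one weight matrix:
    A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out)."""
    return r * (d_in + d_out)

# Mistral-7B q_proj is 4096 -> 4096; per-matrix adapter sizes:
per_matrix_r16 = lora_param_count(16, 4096, 4096)  # 131,072 params
per_matrix_r8 = lora_param_count(8, 4096, 4096)    # half of the above
```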

πŸ“ Technical Details

Training Script

Location: /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py

Key Features:

  • LoRA fine-tuning for memory efficiency
  • Gradient checkpointing enabled
  • Automatic device detection (CUDA/MPS/CPU)
  • Resume from checkpoint support
  • JSON configuration export
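The automatic device detection listed above typically reduces to a preference order of CUDA, then Apple MPS, then CPU. A sketch of that logic, not the script's exact code:

```python
def pick_device():
    """Pick a compute device, preferring CUDA, then Apple MPS, then CPU.
    Sketch of the training script's automatic detection (assumed order)."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch not installed: CPU is the only option
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```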

Fine-tuning Command (Generated by Interface)

```bash
python3 -u /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py \
  --base-model /workspace/ftt/base_models/Mistral-7B-v0.1 \
  --dataset /path/to/your/dataset.jsonl \
  --output-dir ./your-finetuned-model \
  --max-length 2048 \
  --num-epochs 3 \
  --batch-size 4 \
  --learning-rate 2e-4 \
  --lora-r 16 \
  --lora-alpha 32
```
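The interface presumably assembles this command from the UI fields; a hedged sketch of how that could look, using shlex to keep paths with spaces safe (`build_finetune_command` is a hypothetical helper, not interface_app.py's actual code):

```python
import shlex

SCRIPT = "/workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py"

def build_finetune_command(base_model, dataset, output_dir,
                           max_length=2048, num_epochs=3, batch_size=4,
                           learning_rate="2e-4", lora_r=16, lora_alpha=32):
    """Assemble the training command shown above from UI values.
    Hypothetical helper; flag defaults mirror the documented command."""
    args = [
        "python3", "-u", SCRIPT,
        "--base-model", base_model,
        "--dataset", dataset,
        "--output-dir", output_dir,
        "--max-length", str(max_length),
        "--num-epochs", str(num_epochs),
        "--batch-size", str(batch_size),
        "--learning-rate", learning_rate,
        "--lora-r", str(lora_r),
        "--lora-alpha", str(lora_alpha),
    ]
    return shlex.join(args)  # quotes any arguments containing spaces
```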

🎉 Success Criteria

You'll know everything is working when:

  1. ✅ Gradio interface loads without errors
  2. ✅ Base model field shows local path
  3. ✅ Training starts without cache errors
  4. ✅ Progress updates appear in UI
  5. ✅ Model weights are saved to output directory

📚 Related Files

  • Interface: /workspace/ftt/semicon-finetuning-scripts/interface_app.py
  • Training Script: /workspace/ftt/semicon-finetuning-scripts/models/msp/ft/finetune_mistral7b.py
  • Base Model: /workspace/ftt/base_models/Mistral-7B-v0.1/
  • Startup Script: /workspace/ftt/semicon-finetuning-scripts/start_interface.sh
  • Requirements: /workspace/ftt/semicon-finetuning-scripts/requirements_interface.txt

🆘 Support

If you encounter any issues:

  1. Check this document's troubleshooting section
  2. Review the training logs in the UI
  3. Check process status: ps aux | grep -E "interface_app|finetune_mistral"
  4. Inspect the HuggingFace cache directory for leftover corrupted entries: ls -lh /workspace/.hf_home/hub/

Last Updated: 2025-11-24
Solution: Local model download to bypass corrupted HuggingFace cache