mistral-finetuning-interface / docs /MODEL_INFERENCE_FIXES.md
Prithvik-1's picture
Upload docs/MODEL_INFERENCE_FIXES.md with huggingface_hub
244a62f verified

A newer version of the Gradio SDK is available: 6.9.0

Upgrade

Model Inference Fixes - Complete Guide

πŸŽ‰ Issues Resolved

Issue 1: New Fine-tuned Model Not Showing in UI

Status: βœ… FIXED

Problem: After completing fine-tuning, the new model mistral-finetuned-fifo1 was not appearing in the dropdown lists for API Hosting or Test Inference.

Root Cause: The list_models() function was only checking:

  • /workspace/ftt/ (parent directory)
  • /workspace/ftt/semicon-finetuning-scripts/models/msp/ (MODELS_DIR)

But the new model was saved to:

  • /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1 (BASE_DIR)

Solution: Updated list_models() function to also scan BASE_DIR:

def list_models():
    """List available fine-tuned models"""
    models = []
    
    # Check in BASE_DIR (semicon-finetuning-scripts directory) - NEW!
    for item in BASE_DIR.iterdir():
        if item.is_dir() and "mistral" in item.name.lower() and not item.name.startswith('.'):
            models.append(str(item))
    
    # Check in BASE_DIR parent (ftt directory)
    ftt_dir = BASE_DIR.parent
    for item in ftt_dir.iterdir():
        if item.is_dir() and "mistral" in item.name.lower():
            models.append(str(item))
    
    # Check in MODELS_DIR
    if MODELS_DIR.exists():
        for item in MODELS_DIR.iterdir():
            if item.is_dir() and "mistral" in item.name.lower():
                models.append(str(item))
    
    return sorted(list(set(models))) if models else ["No models found"]

File Modified: /workspace/ftt/semicon-finetuning-scripts/interface_app.py (lines 116-133)


Issue 2: API Hosting Server Not Starting

Status: βœ… FIXED

Problem: When trying to start the API hosting server with the fine-tuned model, it failed with:

OSError: [Errno 116] Stale file handle: 
'/workspace/.hf_home/hub/models--mistralai--Mistral-7B-v0.1/blobs/...'

Root Cause:

  1. The fine-tuned model is a LoRA adapter (not a full model)
  2. To use it, the API server must load the base model first, then apply the LoRA adapter
  3. The inference script was hardcoded to load mistralai/Mistral-7B-v0.1 from HuggingFace
  4. This triggered the corrupted cache issue again

Solution: Updated the inference script to use the local base model we downloaded earlier:

if is_lora:
    # Load base model - prefer local model to avoid cache issues
    local_base_model = "/workspace/ftt/base_models/Mistral-7B-v0.1"
    
    # Check if local model exists, otherwise use HuggingFace
    if os.path.exists(local_base_model):
        base_model_name = local_base_model
        print(f"Loading base model from local: {base_model_name}")
    else:
        base_model_name = "mistralai/Mistral-7B-v0.1"
        print(f"Loading base model from HuggingFace: {base_model_name}")
    
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        local_files_only=os.path.exists(local_base_model),
        **get_model_kwargs(use_quantization)
    )
    
    # Load LoRA adapter
    print("Loading LoRA adapter...")
    model = PeftModel.from_pretrained(base_model, model_path)
    model = model.merge_and_unload()  # Merge adapter weights

File Modified: /workspace/ftt/semicon-finetuning-scripts/models/msp/inference/inference_mistral7b.py (lines 96-109)


πŸ“¦ Your Fine-tuned Model

Location: /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1

Type: LoRA Adapter (161 MB)

Contents:

mistral-finetuned-fifo1/
β”œβ”€β”€ adapter_model.safetensors    # LoRA weights (161 MB)
β”œβ”€β”€ adapter_config.json          # LoRA configuration
β”œβ”€β”€ tokenizer.json               # Tokenizer
β”œβ”€β”€ tokenizer_config.json        # Tokenizer config
β”œβ”€β”€ special_tokens_map.json      # Special tokens
β”œβ”€β”€ training_args.bin            # Training arguments
β”œβ”€β”€ training_config.json         # Training configuration
β”œβ”€β”€ checkpoint-24/               # Best checkpoint
└── README.md                    # Model card

How it works:

  • Your model is a LoRA adapter (Low-Rank Adaptation)
  • It contains only the fine-tuned weights (161 MB)
  • To use it, it needs the base model (Mistral-7B-v0.1, 28 GB)
  • The adapter is merged with the base model at inference time

πŸš€ Using Your Model

Option 1: Via Gradio UI (Recommended)

For API Hosting:

  1. Access Gradio Interface:

  2. Go to "🌐 API Hosting" Tab

  3. Select Your Model:

    • Model Source: Local Model
    • Dropdown: Select /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1
  4. Configure (optional):

    • Host: 0.0.0.0 (default)
    • Port: 8000 (default)
  5. Start Server:

    • Click "πŸš€ Start API Server"
    • Wait 15-20 seconds for model loading
    • Status will show "βœ… API server started!"
  6. Access API:

For Direct Inference:

  1. Go to "πŸ§ͺ Test Inference" Tab

  2. Select Your Model:

    • Model Source: Local Model
    • Dropdown: Select /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1
  3. Configure Parameters:

    • Max Length: 512 (default) or up to 6000
    • Temperature: 0.7 (default) or adjust for creativity
  4. Enter Prompt:

    • Type your test prompt in the text box
  5. Run Inference:

    • Click "πŸ”„ Run Inference"
    • Results will appear below

Option 2: Via Python Script

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model_path = "/workspace/ftt/base_models/Mistral-7B-v0.1"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    local_files_only=True
)

# Load LoRA adapter
adapter_path = "/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1"
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload()  # Merge weights
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(adapter_path)

# Run inference
prompt = "Your prompt here"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=512)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Option 3: Via API (After Starting Server)

# Start API server first via Gradio UI or:
cd /workspace/ftt/semicon-finetuning-scripts
python3 models/msp/api/api_server.py \
    --model-path /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1 \
    --host 0.0.0.0 \
    --port 8000

# Then call the API:
curl -X POST "http://localhost:8000/generate" \
     -H "Content-Type: application/json" \
     -d '{
       "prompt": "Your prompt here",
       "max_length": 512,
       "temperature": 0.7
     }'

πŸ” Verification

Check Models are Listed:

cd /workspace/ftt/semicon-finetuning-scripts
python3 << 'EOF'
from pathlib import Path

BASE_DIR = Path("/workspace/ftt/semicon-finetuning-scripts")
models = [
    str(item) for item in BASE_DIR.iterdir()
    if item.is_dir() and "mistral" in item.name.lower()
]
print("Models found in BASE_DIR:")
for m in sorted(models):
    print(f"  - {Path(m).name}")
EOF

Expected output should include: mistral-finetuned-fifo1

Test API Server Manually:

cd /workspace/ftt/semicon-finetuning-scripts
source /venv/main/bin/activate

python3 models/msp/api/api_server.py \
    --model-path /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1 \
    --host 0.0.0.0 \
    --port 8001

Expected output should include:

  • βœ“ Loading base model from local: /workspace/ftt/base_models/Mistral-7B-v0.1
  • βœ“ Loading LoRA adapter...
  • βœ“ Model loaded successfully on cuda!
  • βœ“ Server ready to accept requests

πŸ› Troubleshooting

Model Not Appearing in Dropdown

Check 1: Verify model exists

ls -lh /workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1/

Check 2: Restart Gradio interface

pkill -f interface_app.py
cd /workspace/ftt/semicon-finetuning-scripts
python3 interface_app.py

Check 3: Manually verify list_models() function

cd /workspace/ftt/semicon-finetuning-scripts
python3 -c "from interface_app import list_models; print('\n'.join(list_models()))"

API Server Fails to Start

Check 1: Verify base model exists

ls -lh /workspace/ftt/base_models/Mistral-7B-v0.1/

If missing, re-download:

huggingface-cli download mistralai/Mistral-7B-v0.1 \
    --local-dir /workspace/ftt/base_models/Mistral-7B-v0.1 \
    --local-dir-use-symlinks False

Check 2: Test model loading manually

cd /workspace/ftt/semicon-finetuning-scripts
python3 << 'EOF'
from models.msp.inference.inference_mistral7b import load_local_model

model_path = "/workspace/ftt/semicon-finetuning-scripts/mistral-finetuned-fifo1"
print("Testing model load...")
model, tokenizer = load_local_model(model_path)
print("βœ“ Model loaded successfully!")
EOF

Check 3: Check GPU memory

nvidia-smi

If GPU is full, free up memory:

pkill -f python3  # Kill other Python processes
python3 -c "import torch; torch.cuda.empty_cache()"

Inference Takes Too Long

Option 1: Reduce max_length

  • Set max_length to 128 or 256 instead of 512+

Option 2: Use quantization

  • The server automatically uses 4-bit quantization if GPU memory is low
  • This makes it faster but slightly less accurate

Option 3: Adjust temperature

  • Lower temperature (0.1-0.5) = faster, more deterministic
  • Higher temperature (0.7-1.0) = slower, more creative

πŸ“Š Performance Notes

Model Loading Time:

  • Base Model Load: ~15-20 seconds (28 GB from disk)
  • LoRA Adapter Load: ~2-3 seconds (161 MB)
  • Merge & Unload: ~5 seconds
  • Total: ~20-30 seconds

Inference Speed (A100 GPU):

  • Short prompts (<100 tokens): 1-2 seconds
  • Medium prompts (100-500 tokens): 3-8 seconds
  • Long prompts (500+ tokens): 10-30 seconds

Memory Usage:

  • Base Model: ~14 GB GPU RAM (FP16)
  • With LoRA: ~14.5 GB GPU RAM
  • With Quantization: ~7-8 GB GPU RAM (4-bit)

πŸ“š Technical Details

LoRA Configuration (from adapter_config.json):

{
  "r": 16,                    # LoRA rank
  "lora_alpha": 32,           # LoRA scaling
  "target_modules": [         # Layers fine-tuned
    "q_proj",
    "v_proj"
  ],
  "lora_dropout": 0.05,
  "bias": "none",
  "task_type": "CAUSAL_LM"
}

Training Configuration (from training_config.json):

  • Base Model: mistralai/Mistral-7B-v0.1
  • Dataset: 100 samples (FIFO-related)
  • Max Length: 2048 tokens
  • Epochs: 3
  • Batch Size: 4
  • Learning Rate: 2e-4
  • Device: CUDA (A100 GPU)

🎯 Summary

What Was Fixed:

  1. βœ… Model Listing: Updated to scan BASE_DIR where models are saved
  2. βœ… API Server: Updated to use local base model instead of HuggingFace cache
  3. βœ… Inference: Now works both directly and via API

What's Working Now:

  1. βœ… Your model appears in all dropdowns
  2. βœ… API server starts successfully
  3. βœ… Inference works via UI
  4. βœ… Inference works via API
  5. βœ… No more cache errors!

Files Modified:

  1. /workspace/ftt/semicon-finetuning-scripts/interface_app.py - Model listing
  2. /workspace/ftt/semicon-finetuning-scripts/models/msp/inference/inference_mistral7b.py - Inference

🌐 Access Links

Gradio Interface: https://3833be2ce50507322f.gradio.live
Local Port: 7860
API Port (when started): 8000


Last Updated: 2024-11-24
Model: mistral-finetuned-fifo1 (LoRA Adapter)
Base: Mistral-7B-v0.1 (Local)