
# Running Qwen-2.5-32B Locally with Ollama

This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.

## Why Run Locally?

- ✅ **FREE** - No API costs ($0 per query)
- ✅ **FAST** - Local inference on an A100 (5-10 tokens/sec)
- ✅ **PRIVATE** - Data never leaves your machine
- ✅ **OFFLINE** - Works without internet (after the model download)
- ✅ **HIGH QUALITY** - 32B-parameter model with strong multilingual support

## System Requirements

### Minimum Specs

- **GPU**: NVIDIA A100 80GB (or a similar high-end GPU)
- **VRAM**: 22-25GB during inference
- **RAM**: 32GB system RAM (you have 265GB - more than enough!)
- **Storage**: ~20GB for the model download
- **OS**: Linux (you're on Ubuntu)

### Your Setup

- ✅ NVIDIA A100 80GB - perfect for Qwen-2.5-32B
- ✅ 265GB RAM - excellent
- ✅ Linux (Ubuntu) - supported
- ✅ Ollama already installed at `/usr/local/bin/ollama`

## Installation Steps

### 1. Verify Ollama Installation

```bash
# Check if Ollama is installed
which ollama
# Should output: /usr/local/bin/ollama

# Check the Ollama version
ollama --version
```

If not installed, install with:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

### 2. Pull the Qwen-2.5-32B-Instruct Model

```bash
# This will download ~20GB
ollama pull qwen2.5:32b-instruct

# Alternative: use the base model (not instruct-tuned)
# ollama pull qwen2.5:32b
```

Download time: ~10-30 minutes depending on your internet speed.

**Model cache location**: By default, models are cached at:

- Linux: `~/.ollama/models/`

To use a custom cache location (e.g., `data/models/`):

```bash
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
ollama pull qwen2.5:32b-instruct
```

### 3. Verify the Model Is Ready

```bash
# List all installed models
ollama list

# Test the model
ollama run qwen2.5:32b-instruct "Hello, who are you?"
```

You should see a response from Qwen!

### 4. Start the Ollama Server (If Needed)

Ollama runs as a background service by default. If you need to start it manually:

```bash
# Start the Ollama server
ollama serve

# Or run it in the background
nohup ollama serve > /dev/null 2>&1 &
```
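From Python, you can confirm the server is up before kicking off a long annotation run. A minimal sketch using only the standard library, probing Ollama's `/api/tags` endpoint (the function name is our own, not part of any library):

```python
import urllib.request
import urllib.error


def is_ollama_running(host="http://localhost:11434", timeout=2):
    """Return True if an Ollama server answers at `host`.

    Ollama exposes the list of installed models at GET /api/tags;
    any successful response means the server is reachable.
    """
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("Ollama running:", is_ollama_running())
```

Calling this once at the top of the notebook gives a clearer error than a failed request 10 minutes into a run.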

## Using Qwen-2.5-32B in the Notebook

### Cell 20: Qwen-2.5-32B Local Annotation

The notebook cell handles everything automatically:

  1. Checks Ollama installation
  2. Verifies model availability
  3. Runs inference locally
  4. Saves progress every 10 rows

### Configuration

```python
# In Cell 20
TEST_MODE = True        # Start with a small test
TEST_SIZE = 10          # Test on 10 samples first
MAX_ROWS = 20000        # Full dataset size
SAVE_INTERVAL = 10      # Save every 10 rows

MODEL_NAME = "qwen2.5:32b-instruct"     # Model to use
OLLAMA_HOST = "http://localhost:11434"  # Default Ollama port
```
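Under the hood, inference goes through Ollama's HTTP API. A minimal sketch of how a request body for the `POST /api/generate` endpoint can be built (the helper and the example prompt are illustrative, not the notebook's exact code):

```python
import json


def build_generate_payload(prompt, model="qwen2.5:32b-instruct", temperature=0.0):
    """Build the JSON body for a POST to Ollama's /api/generate endpoint.

    stream=False makes Ollama return one JSON object with a "response"
    field instead of a stream of chunks, which is simpler to parse
    in a row-by-row annotation loop.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }


payload = build_generate_payload("Annotate the profession in: 'Marie Curie, physicist'")
body = json.dumps(payload).encode("utf-8")  # ready to POST with urllib or requests
```

Setting `temperature` to 0 keeps annotations as deterministic as possible across reruns.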

### Running the Pipeline

1. **Test run first (recommended):**

   ```python
   TEST_MODE = True
   TEST_SIZE = 10
   ```

   Run Cell 20 to test on 10 samples (~1-2 minutes).

2. **Check results.** Output is saved to:

   ```text
   data/CSV/qwen_local_annotated_POI_test.csv
   ```

3. **Full run:**

   ```python
   TEST_MODE = False
   MAX_ROWS = 20000  # or None for all rows
   ```

   Run Cell 20 on the full dataset. At 100-200 samples/hour, expect ~50-100 hours for 10k samples on the A100.
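The save-every-N-rows behavior described above can be sketched as follows; the `annotate` callable and the `profession_llm` output column stand in for the notebook's actual per-row logic:

```python
import pandas as pd


def annotate_with_checkpoints(df, annotate, out_csv, save_interval=10):
    """Run `annotate` on each row, writing a checkpoint CSV every
    `save_interval` rows so an interrupted run loses little work."""
    results = []
    for i, (_, row) in enumerate(df.iterrows(), start=1):
        results.append(annotate(row))
        # Checkpoint on every Nth row and always on the last row
        if i % save_interval == 0 or i == len(df):
            df.iloc[:i].assign(profession_llm=results).to_csv(out_csv, index=False)
    return results
```

If a run dies mid-way, the checkpoint file holds everything annotated so far; restarting can skip rows already present in the CSV.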

## Performance Expectations

On an NVIDIA A100 80GB:

- **Speed**: 5-10 tokens/second
- **Throughput**: 100-200 samples/hour (depends on prompt length)
- **Memory**: ~22-25GB VRAM during inference
- **Time for 10k samples**: ~50-100 hours (can run overnight or over a weekend)
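The time estimate is just throughput arithmetic; a quick sanity check:

```python
def estimated_hours(n_samples, samples_per_hour):
    """Total wall-clock hours at a given throughput."""
    return n_samples / samples_per_hour


# 10,000 samples at the quoted 100-200 samples/hour:
best_case = estimated_hours(10_000, 200)   # 50.0 hours
worst_case = estimated_hours(10_000, 100)  # 100.0 hours
```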

## Monitoring

The cell shows progress updates:

```text
Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
✅ Saved after 10 rows (~240.0 samples/hour)

✅ Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
Total time: 2.5 minutes
Average speed: 240.0 samples/hour
```

## Troubleshooting

### Model Not Found

```bash
# Check if the model is installed
ollama list

# If not listed, pull it
ollama pull qwen2.5:32b-instruct
```

### Ollama Server Not Running

```bash
# Check if Ollama is running
ps aux | grep ollama

# If not running, start it
ollama serve
```

### GPU Not Detected

```bash
# Check the NVIDIA GPU
nvidia-smi

# Check CUDA
nvcc --version

# Ollama should detect the GPU automatically.
# If not, check the Ollama logs:
journalctl -u ollama
```

### Out of Memory (OOM)

If you get OOM errors:

1. Check VRAM usage:

   ```bash
   watch -n 1 nvidia-smi
   ```

2. Try a smaller batch size (not applicable here - we process one sample at a time).

3. Try a quantized version (smaller memory footprint):

   ```bash
   # 4-bit quantized version (~12GB VRAM)
   ollama pull qwen2.5:32b-instruct-q4_0
   ```

   Then update `MODEL_NAME` in the notebook:

   ```python
   MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
   ```

### Slow Inference

If inference is very slow (<1 token/sec):

1. Check GPU utilization:

   ```bash
   nvidia-smi
   ```

   The GPU should show ~90%+ utilization during inference.

2. Check CPU vs. GPU: Ollama might be running the model on the CPU instead of the GPU.

   ```bash
   # Show loaded models and whether they sit on GPU or CPU
   ollama ps

   # If the model is on CPU, check the server logs and restart Ollama
   journalctl -u ollama
   ```

## Model Variants

Ollama provides several Qwen-2.5 variants:

| Model | Size | VRAM | Speed | Quality |
|-------|------|------|-------|---------|
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very fast | OK |

For your A100 80GB, `qwen2.5:32b-instruct` is recommended (best quality, no VRAM issues).
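The VRAM column roughly follows a back-of-the-envelope rule: the weights alone take about `parameters × bits-per-weight / 8` bytes, and the KV cache and activations add several GB on top. A sketch of that arithmetic (a rule of thumb, not an Ollama figure, so treat published VRAM numbers as approximate):

```python
def approx_weight_gb(n_params_billion, bits_per_weight):
    """Approximate GPU memory for model weights alone, in GB."""
    return n_params_billion * bits_per_weight / 8


# 32B parameters at 4-bit quantization -> 16.0 GB of weights
# 7B parameters at 4-bit quantization  -> 3.5 GB of weights
print(approx_weight_gb(32, 4))
print(approx_weight_gb(7, 4))
```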

## Custom Model Cache Location

To store models in the `data/models/` directory:

```bash
# Set the environment variable
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"

# Add it to ~/.bashrc for persistence
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc

# Pull the model (it will download to data/models/)
ollama pull qwen2.5:32b-instruct

# Verify
ls -lh $OLLAMA_MODELS/
```

## Comparing Results

After running both the API and local versions, compare the results:

```python
import pandas as pd

# Load results
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')

# Compare professions
print("API professions:", qwen_api['profession_llm'].value_counts().head())
print("Local professions:", qwen_local['profession_llm'].value_counts().head())

# Check agreement (assumes both files cover the same rows in the same order)
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
print(f"Agreement rate: {agreement*100:.1f}%")
```

## Cost Comparison (10,000 samples)

| Method | Cost | Time | Privacy |
|--------|------|------|---------|
| Qwen local (A100) | $0 | ~50-100 hours | ✅ Full |
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
| Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
| Deepseek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to Deepseek |

**Recommendation:**

- For small tests (<100 samples): use an API (faster)
- For large datasets (>1000 samples): run locally (free, private)
- For research papers: run locally to avoid data-privacy concerns

## Advanced: Parallel Processing

For faster processing on a multi-GPU setup (not implemented yet, but possible):

- Run multiple Ollama instances pinned to different GPUs
- Use Ray or Dask to distribute rows across instances
- Expect roughly a 4x speedup with 4 GPUs
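The distribution step could be sketched as below; the host list and the `annotate(host, rows)` callable are hypothetical placeholders for per-instance Ollama clients:

```python
from concurrent.futures import ThreadPoolExecutor


def shard_round_robin(rows, hosts):
    """Assign rows to hosts round-robin; returns one row list per host."""
    shards = [[] for _ in hosts]
    for i, row in enumerate(rows):
        shards[i % len(hosts)].append(row)
    return shards


def annotate_parallel(rows, hosts, annotate):
    """Annotate shards concurrently, one worker thread per host.

    `annotate(host, rows)` stands in for the per-host annotation loop
    (e.g., the Cell 20 logic pointed at a different OLLAMA_HOST).
    """
    shards = shard_round_robin(rows, hosts)
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        per_host_results = pool.map(annotate, hosts, shards)
    return [r for shard in per_host_results for r in shard]
```

Threads suffice here because the work is I/O-bound: each worker mostly waits on its Ollama instance's HTTP responses.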

## Summary

- ✅ Ollama already installed
- ✅ A100 80GB GPU - perfect for Qwen-2.5-32B
- ✅ Free inference - no API costs
- ✅ Privacy - data stays local

**Next steps:**

1. Pull the model: `ollama pull qwen2.5:32b-instruct`
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
3. Run the full dataset: `TEST_MODE = False`

Estimated time for 10,000 samples: ~50-100 hours. Cost: $0.

Good luck! 🚀