# Running Qwen-2.5-32B Locally with Ollama
This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.
## Why Run Locally?

- ✅ **FREE** - No API costs ($0 per query)
- ✅ **FAST** - Local inference on A100 (5-10 tokens/sec)
- ✅ **PRIVATE** - Data never leaves your machine
- ✅ **OFFLINE** - Works without internet (after model download)
- ✅ **HIGH QUALITY** - 32B parameter model with strong multilingual support
## System Requirements

### Minimum Specs
- GPU: NVIDIA A100 80GB (or similar high-end GPU)
- VRAM: 22-25GB during inference
- RAM: 32GB system RAM (you have 265GB - more than enough!)
- Storage: ~20GB for model download
- OS: Linux (you're on Ubuntu)
### Your Setup

- ✅ NVIDIA A100 80GB - Perfect for Qwen-2.5-32B
- ✅ 265GB RAM - Excellent
- ✅ Linux (Ubuntu) - Supported
- ✅ Ollama already installed at `/usr/local/bin/ollama`
## Installation Steps

### 1. Verify Ollama Installation

```bash
# Check if Ollama is installed
which ollama
# Should output: /usr/local/bin/ollama

# Check the Ollama version
ollama --version
```

If not installed, install with:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
### 2. Pull Qwen-2.5-32B-Instruct Model

```bash
# This will download ~20GB
ollama pull qwen2.5:32b-instruct

# Alternative: use the base model (not instruct-tuned)
# ollama pull qwen2.5:32b
```

**Download time:** ~10-30 minutes depending on your internet speed.

**Model cache location:** By default, models are cached at:

- Linux: `~/.ollama/models/`

To use a custom cache location (e.g., `data/models/`):

```bash
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
ollama pull qwen2.5:32b-instruct
```
### 3. Verify Model is Ready

```bash
# List all installed models
ollama list

# Test the model
ollama run qwen2.5:32b-instruct "Hello, who are you?"
```
You should see a response from Qwen!
### 4. Start Ollama Server (if needed)

Ollama runs as a background service by default. If you need to start it manually:

```bash
# Start the Ollama server
ollama serve

# Or run it in the background
nohup ollama serve > /dev/null 2>&1 &
```
## Using Qwen-2.5-32B in the Notebook

### Cell 20: Qwen-2.5-32B Local Annotation
The notebook cell handles everything automatically:
- Checks Ollama installation
- Verifies model availability
- Runs inference locally
- Saves progress every 10 rows
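Under the hood, the cell talks to Ollama's local REST API. Here is a minimal sketch of one such call (the function names are illustrative, not the notebook's actual code; `/api/generate` is Ollama's documented endpoint):

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "qwen2.5:32b-instruct"


def build_payload(prompt: str, model: str = MODEL_NAME) -> dict:
    """Request body for Ollama's /api/generate endpoint.
    stream=False returns the whole completion in one JSON object."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt: str) -> str:
    """Send one prompt to the local Ollama server and return the completion."""
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running Ollama server):
# print(generate("Hello, who are you?"))
```

The notebook wraps a loop and periodic CSV saves around a call like `generate()`; you can reuse this snippet to smoke-test the server from any Python session.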
### Configuration

```python
# In Cell 20
TEST_MODE = True      # Start with a small test
TEST_SIZE = 10        # Test on 10 samples first
MAX_ROWS = 20000      # Full dataset size
SAVE_INTERVAL = 10    # Save every 10 rows
MODEL_NAME = "qwen2.5:32b-instruct"     # Model to use
OLLAMA_HOST = "http://localhost:11434"  # Default Ollama port
```
### Running the Pipeline

**Test run first (recommended):**

```python
TEST_MODE = True
TEST_SIZE = 10
```

Run Cell 20 to test on 10 samples (~1-2 minutes).

**Check results:**

```python
# Output saved to: data/CSV/qwen_local_annotated_POI_test.csv
```

**Full run:**

```python
TEST_MODE = False
MAX_ROWS = 20000  # or None for all rows
```

Run Cell 20 for the full dataset (~50-100 hours for 10k samples on A100; see Performance Expectations below).
## Performance Expectations
On NVIDIA A100 80GB:
- Speed: 5-10 tokens/second
- Throughput: 100-200 samples/hour (depends on prompt length)
- Memory: ~22-25GB VRAM during inference
- Time for 10k samples: ~50-100 hours (can run overnight/over weekend)
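These estimates hang together arithmetically. As a back-of-envelope check, take the midpoint of the token rate and an assumed per-sample token budget (the 180-token figure below is an assumption for illustration, not a measurement):

```python
# Back-of-envelope check of the throughput and total-time estimates above.
tokens_per_sec = 7.5        # midpoint of the 5-10 tokens/sec range
tokens_per_sample = 180     # assumed prompt + completion budget per row

samples_per_hour = 3600 * tokens_per_sec / tokens_per_sample
hours_for_10k = 10_000 / samples_per_hour

print(samples_per_hour)          # 150.0 -> inside the 100-200 samples/hour range
print(round(hours_for_10k, 1))   # 66.7  -> inside the 50-100 hour range
```

Longer prompts shift `tokens_per_sample` up and throughput down proportionally, which is why the ranges above are wide.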
### Monitoring

The cell shows progress updates:

```
Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
✅ Saved after 10 rows (~240.0 samples/hour)
✅ Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
   Total time: 2.5 minutes
   Average speed: 240.0 samples/hour
```
## Troubleshooting

### Model Not Found

```bash
# Check if the model is installed
ollama list

# If not listed, pull it
ollama pull qwen2.5:32b-instruct
```
### Ollama Server Not Running

```bash
# Check if Ollama is running
ps aux | grep ollama

# If not running, start it
ollama serve
```
### GPU Not Detected

```bash
# Check the NVIDIA GPU
nvidia-smi

# Check CUDA
nvcc --version

# Ollama should detect the GPU automatically
# If not, check the Ollama logs
journalctl -u ollama
```
### Out of Memory (OOM)

If you get OOM errors:

1. Check VRAM usage:

   ```bash
   watch -n 1 nvidia-smi
   ```

2. Try a smaller batch size (not applicable here - we process one row at a time).

3. Try a quantized version (smaller model):

   ```bash
   # 4-bit quantized version (~12GB VRAM)
   ollama pull qwen2.5:32b-instruct-q4_0
   ```

   Then update `MODEL_NAME` in the notebook:

   ```python
   MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
   ```
### Slow Inference

If inference is very slow (<1 token/sec):

1. Check GPU utilization:

   ```bash
   nvidia-smi
   ```

   The GPU should show ~90%+ utilization during inference.

2. Check CPU vs GPU - Ollama might be using the CPU instead of the GPU:

   ```bash
   # Force GPU usage
   OLLAMA_GPU=1 ollama serve
   ```
## Model Variants

Ollama provides several Qwen-2.5 variants:

| Model | Size | VRAM | Speed | Quality |
|---|---|---|---|---|
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very Fast | OK |

For your A100 80GB, `qwen2.5:32b-instruct` is recommended (best quality, no VRAM issues).
## Custom Model Cache Location

To store models in the `data/models/` directory:

```bash
# Set the environment variable
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"

# Add it to ~/.bashrc for persistence
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc

# Pull the model (it will download to data/models/)
ollama pull qwen2.5:32b-instruct

# Verify
ls -lh $OLLAMA_MODELS/
```
## Comparing Results

After running both the API and local versions, compare the results:

```python
import pandas as pd

# Load results
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')

# Compare professions
print("API professions:", qwen_api['profession_llm'].value_counts().head())
print("Local professions:", qwen_local['profession_llm'].value_counts().head())

# Check agreement (assumes both files cover the same rows in the same order)
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
print(f"Agreement rate: {agreement*100:.1f}%")
```
## Cost Comparison (10,000 samples)

| Method | Cost | Time | Privacy |
|---|---|---|---|
| Qwen Local (A100) | $0 | ~50-100 hours | ✅ Full |
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
| Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
| Deepseek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to Deepseek |
Recommendation:
- For small tests (<100 samples): Use API (faster)
- For large datasets (>1000 samples): Use local (free, private)
- For research papers: Use local to avoid data privacy concerns
## Advanced: Parallel Processing

For faster processing on a multi-GPU setup (not implemented yet, but possible):

- Multiple Ollama instances on different GPUs
- Ray or Dask for parallel processing
- ~4x speedup with 4 GPUs
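A hedged sketch of the multi-instance idea: one Ollama server per GPU, each bound to its own port. This is untested here and assumes `CUDA_VISIBLE_DEVICES` pinning and the `OLLAMA_HOST` bind address behave as documented; the helper names are made up for illustration.

```python
import os
import subprocess


def instance_env(gpu_index: int, base_port: int = 11434) -> dict:
    """Environment for one Ollama server pinned to a single GPU.

    GPU 0 keeps the default port 11434; GPU 1 gets 11435, and so on.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)       # restrict to one GPU
    env["OLLAMA_HOST"] = f"127.0.0.1:{base_port + gpu_index}"  # bind address
    return env


def launch_instances(num_gpus: int) -> list:
    """Start one `ollama serve` per GPU; returns the Popen handles."""
    return [
        subprocess.Popen(["ollama", "serve"], env=instance_env(i))
        for i in range(num_gpus)
    ]
```

A driver script would then split the dataset into `num_gpus` shards and point each worker's `OLLAMA_HOST` at a different port.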
## Summary

- ✅ Ollama already installed
- ✅ A100 80GB GPU - perfect for Qwen-2.5-32B
- ✅ Free inference - no API costs
- ✅ Privacy - data stays local
**Next steps:**

1. Pull the model: `ollama pull qwen2.5:32b-instruct`
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
3. Run the full dataset: `TEST_MODE = False`
**Estimated time for 10,000 samples:** ~50-100 hours
**Cost:** $0
Good luck! 🚀