
# Running Qwen-2.5-32B Locally with Ollama

This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.

## Why Run Locally?

- ✅ **FREE** - No API costs ($0 per query)
- ✅ **FAST** - Local inference on an A100 (5-10 tokens/sec)
- ✅ **PRIVATE** - Data never leaves your machine
- ✅ **OFFLINE** - Works without internet (after the model download)
- ✅ **HIGH QUALITY** - 32B-parameter model with strong multilingual support

## System Requirements

### Minimum Specs

- **GPU**: NVIDIA A100 80GB (or a similar high-end GPU)
- **VRAM**: 22-25GB during inference
- **RAM**: 32GB system RAM (you have 265GB - more than enough!)
- **Storage**: ~20GB for the model download
- **OS**: Linux (you're on Ubuntu)

### Your Setup

- ✅ NVIDIA A100 80GB - perfect for Qwen-2.5-32B
- ✅ 265GB RAM - excellent
- ✅ Linux (Ubuntu) - supported
- ✅ Ollama already installed at `/usr/local/bin/ollama`

## Installation Steps

### 1. Verify Ollama Installation

```bash
# Check if Ollama is installed
which ollama
# Should output: /usr/local/bin/ollama

# Check the Ollama version
ollama --version
```

If not installed, install with:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

### 2. Pull the Qwen-2.5-32B-Instruct Model

```bash
# This will download ~20GB
ollama pull qwen2.5:32b-instruct

# Alternative: use the base model (not instruct-tuned)
# ollama pull qwen2.5:32b
```

Download time: ~10-30 minutes depending on your internet speed.

**Model cache location**: By default, models are cached at:

- Linux: `~/.ollama/models/`

To use a custom cache location (e.g., `data/models/`):

```bash
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
ollama pull qwen2.5:32b-instruct
```

### 3. Verify the Model Is Ready

```bash
# List all installed models
ollama list

# Test the model
ollama run qwen2.5:32b-instruct "Hello, who are you?"
```

You should see a response from Qwen!

### 4. Start the Ollama Server (If Needed)

Ollama runs as a background service by default. If you need to start it manually:

```bash
# Start the Ollama server
ollama serve

# Or run it in the background
nohup ollama serve > /dev/null 2>&1 &
```
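From Python, you can confirm the server is up before kicking off a long annotation run. A minimal sketch using only the standard library, probing Ollama's `/api/tags` endpoint (the function name is our own, not part of any library):

```python
import urllib.request
import urllib.error


def is_ollama_running(host="http://localhost:11434", timeout=2):
    """Return True if an Ollama server answers at `host`.

    Ollama exposes the list of installed models at GET /api/tags;
    any successful response means the server is reachable.
    """
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("Ollama running:", is_ollama_running())
```

Calling this once at the top of the notebook gives a clearer error than a failed request 10 minutes into a run.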

## Using Qwen-2.5-32B in the Notebook

### Cell 20: Qwen-2.5-32B Local Annotation

The notebook cell handles everything automatically:

  1. Checks Ollama installation
  2. Verifies model availability
  3. Runs inference locally
  4. Saves progress every 10 rows

### Configuration

```python
# In Cell 20
TEST_MODE = True        # Start with a small test
TEST_SIZE = 10          # Test on 10 samples first
MAX_ROWS = 20000        # Full dataset size
SAVE_INTERVAL = 10      # Save every 10 rows

MODEL_NAME = "qwen2.5:32b-instruct"     # Model to use
OLLAMA_HOST = "http://localhost:11434"  # Default Ollama port
```
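Under the hood, inference goes through Ollama's HTTP API. A minimal sketch of how a request body for the `POST /api/generate` endpoint can be built (the helper and the example prompt are illustrative, not the notebook's exact code):

```python
import json


def build_generate_payload(prompt, model="qwen2.5:32b-instruct", temperature=0.0):
    """Build the JSON body for a POST to Ollama's /api/generate endpoint.

    stream=False makes Ollama return one JSON object with a "response"
    field instead of a stream of chunks, which is simpler to parse
    in a row-by-row annotation loop.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }


payload = build_generate_payload("Annotate the profession in: 'Marie Curie, physicist'")
body = json.dumps(payload).encode("utf-8")  # ready to POST with urllib or requests
```

Setting `temperature` to 0 keeps annotations as deterministic as possible across reruns.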

### Running the Pipeline

1. **Test run first (recommended):**

   ```python
   TEST_MODE = True
   TEST_SIZE = 10
   ```

   Run Cell 20 to test on 10 samples (~1-2 minutes).

2. **Check results.** Output is saved to:

   ```text
   data/CSV/qwen_local_annotated_POI_test.csv
   ```

3. **Full run:**

   ```python
   TEST_MODE = False
   MAX_ROWS = 20000  # or None for all rows
   ```

   Run Cell 20 on the full dataset. At 100-200 samples/hour, expect ~50-100 hours for 10k samples on the A100.
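The save-every-N-rows behavior described above can be sketched as follows; the `annotate` callable and the `profession_llm` output column stand in for the notebook's actual per-row logic:

```python
import pandas as pd


def annotate_with_checkpoints(df, annotate, out_csv, save_interval=10):
    """Run `annotate` on each row, writing a checkpoint CSV every
    `save_interval` rows so an interrupted run loses little work."""
    results = []
    for i, (_, row) in enumerate(df.iterrows(), start=1):
        results.append(annotate(row))
        # Checkpoint on every Nth row and always on the last row
        if i % save_interval == 0 or i == len(df):
            df.iloc[:i].assign(profession_llm=results).to_csv(out_csv, index=False)
    return results
```

If a run dies mid-way, the checkpoint file holds everything annotated so far; restarting can skip rows already present in the CSV.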

## Performance Expectations

On an NVIDIA A100 80GB:

- **Speed**: 5-10 tokens/second
- **Throughput**: 100-200 samples/hour (depends on prompt length)
- **Memory**: ~22-25GB VRAM during inference
- **Time for 10k samples**: ~50-100 hours (can run overnight or over a weekend)
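The time estimate is just throughput arithmetic; a quick sanity check:

```python
def estimated_hours(n_samples, samples_per_hour):
    """Total wall-clock hours at a given throughput."""
    return n_samples / samples_per_hour


# 10,000 samples at the quoted 100-200 samples/hour:
best_case = estimated_hours(10_000, 200)   # 50.0 hours
worst_case = estimated_hours(10_000, 100)  # 100.0 hours
```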

## Monitoring

The cell shows progress updates:

```text
Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
✅ Saved after 10 rows (~240.0 samples/hour)

✅ Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
Total time: 2.5 minutes
Average speed: 240.0 samples/hour
```

## Troubleshooting

### Model Not Found

```bash
# Check if the model is installed
ollama list

# If not listed, pull it
ollama pull qwen2.5:32b-instruct
```

### Ollama Server Not Running

```bash
# Check if Ollama is running
ps aux | grep ollama

# If not running, start it
ollama serve
```

### GPU Not Detected

```bash
# Check the NVIDIA GPU
nvidia-smi

# Check CUDA
nvcc --version

# Ollama should detect the GPU automatically.
# If not, check the Ollama logs:
journalctl -u ollama
```

### Out of Memory (OOM)

If you get OOM errors:

1. Check VRAM usage:

   ```bash
   watch -n 1 nvidia-smi
   ```

2. Try a smaller batch size (not applicable here - we process one sample at a time).

3. Try a quantized version (smaller memory footprint):

   ```bash
   # 4-bit quantized version (~12GB VRAM)
   ollama pull qwen2.5:32b-instruct-q4_0
   ```

   Then update `MODEL_NAME` in the notebook:

   ```python
   MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
   ```

### Slow Inference

If inference is very slow (<1 token/sec):

1. Check GPU utilization:

   ```bash
   nvidia-smi
   ```

   The GPU should show ~90%+ utilization during inference.

2. Check CPU vs. GPU: Ollama might be running the model on the CPU instead of the GPU.

   ```bash
   # Show loaded models and whether they sit on GPU or CPU
   ollama ps

   # If the model is on CPU, check the server logs and restart Ollama
   journalctl -u ollama
   ```

## Model Variants

Ollama provides several Qwen-2.5 variants:

| Model | Size | VRAM | Speed | Quality |
|-------|------|------|-------|---------|
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very fast | OK |

For your A100 80GB, `qwen2.5:32b-instruct` is recommended (best quality, no VRAM issues).
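The VRAM column roughly follows a back-of-the-envelope rule: the weights alone take about `parameters × bits-per-weight / 8` bytes, and the KV cache and activations add several GB on top. A sketch of that arithmetic (a rule of thumb, not an Ollama figure, so treat published VRAM numbers as approximate):

```python
def approx_weight_gb(n_params_billion, bits_per_weight):
    """Approximate GPU memory for model weights alone, in GB."""
    return n_params_billion * bits_per_weight / 8


# 32B parameters at 4-bit quantization -> 16.0 GB of weights
# 7B parameters at 4-bit quantization  -> 3.5 GB of weights
print(approx_weight_gb(32, 4))
print(approx_weight_gb(7, 4))
```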

## Custom Model Cache Location

To store models in the `data/models/` directory:

```bash
# Set the environment variable
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"

# Add it to ~/.bashrc for persistence
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc

# Pull the model (it will download to data/models/)
ollama pull qwen2.5:32b-instruct

# Verify
ls -lh $OLLAMA_MODELS/
```

## Comparing Results

After running both the API and local versions, compare the results:

```python
import pandas as pd

# Load results
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')

# Compare professions
print("API professions:", qwen_api['profession_llm'].value_counts().head())
print("Local professions:", qwen_local['profession_llm'].value_counts().head())

# Check agreement (assumes both files cover the same rows in the same order)
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
print(f"Agreement rate: {agreement*100:.1f}%")
```

## Cost Comparison (10,000 samples)

| Method | Cost | Time | Privacy |
|--------|------|------|---------|
| Qwen local (A100) | $0 | ~50-100 hours | ✅ Full |
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
| Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
| Deepseek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to Deepseek |

**Recommendation:**

- For small tests (<100 samples): use an API (faster)
- For large datasets (>1000 samples): run locally (free, private)
- For research papers: run locally to avoid data-privacy concerns

## Advanced: Parallel Processing

For faster processing on a multi-GPU setup (not implemented yet, but possible):

- Run multiple Ollama instances pinned to different GPUs
- Use Ray or Dask to distribute rows across instances
- Expect roughly a 4x speedup with 4 GPUs
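The distribution step could be sketched as below; the host list and the `annotate(host, rows)` callable are hypothetical placeholders for per-instance Ollama clients:

```python
from concurrent.futures import ThreadPoolExecutor


def shard_round_robin(rows, hosts):
    """Assign rows to hosts round-robin; returns one row list per host."""
    shards = [[] for _ in hosts]
    for i, row in enumerate(rows):
        shards[i % len(hosts)].append(row)
    return shards


def annotate_parallel(rows, hosts, annotate):
    """Annotate shards concurrently, one worker thread per host.

    `annotate(host, rows)` stands in for the per-host annotation loop
    (e.g., the Cell 20 logic pointed at a different OLLAMA_HOST).
    """
    shards = shard_round_robin(rows, hosts)
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        per_host_results = pool.map(annotate, hosts, shards)
    return [r for shard in per_host_results for r in shard]
```

Threads suffice here because the work is I/O-bound: each worker mostly waits on its Ollama instance's HTTP responses.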

## Summary

- ✅ Ollama already installed
- ✅ A100 80GB GPU - perfect for Qwen-2.5-32B
- ✅ Free inference - no API costs
- ✅ Privacy - data stays local

**Next steps:**

1. Pull the model: `ollama pull qwen2.5:32b-instruct`
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
3. Run the full dataset: `TEST_MODE = False`

Estimated time for 10,000 samples: ~50-100 hours. Cost: $0.

Good luck! 🚀