# Running Qwen-2.5-32B Locally with Ollama
This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.
## Why Run Locally?
βœ… **FREE** - No API costs ($0 per query)
βœ… **FAST** - Local inference on A100 (5-10 tokens/sec)
βœ… **PRIVATE** - Data never leaves your machine
βœ… **OFFLINE** - Works without internet (after model download)
βœ… **HIGH QUALITY** - 32B parameter model with strong multilingual support
## System Requirements
### Minimum Specs
- **GPU**: NVIDIA A100 80GB (or similar high-end GPU)
- **VRAM**: 22-25GB during inference
- **RAM**: 32GB system RAM (you have 265GB - more than enough!)
- **Storage**: ~20GB for model download
- **OS**: Linux (you're on Ubuntu)
### Your Setup
βœ… NVIDIA A100 80GB - Perfect for Qwen-2.5-32B
βœ… 265GB RAM - Excellent
βœ… Linux (Ubuntu) - Supported
βœ… Ollama already installed at `/usr/local/bin/ollama`
## Installation Steps
### 1. Verify Ollama Installation
```bash
# Check if Ollama is installed
which ollama
# Should output: /usr/local/bin/ollama
# Check Ollama version
ollama --version
```
If not installed, install with:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
### 2. Pull Qwen-2.5-32B-Instruct Model
```bash
# This will download ~20GB
ollama pull qwen2.5:32b-instruct
# Alternative: Use the base model (not instruct-tuned)
# ollama pull qwen2.5:32b
```
**Download time**: ~10-30 minutes depending on your internet speed.
**Model cache location**: By default, models are cached at:
- Linux: `~/.ollama/models/`
To use custom cache location (e.g., `data/models/`):
```bash
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
ollama pull qwen2.5:32b-instruct
```
### 3. Verify Model is Ready
```bash
# List all installed models
ollama list
# Test the model
ollama run qwen2.5:32b-instruct "Hello, who are you?"
```
You should see a response from Qwen!
### 4. Start Ollama Server (if needed)
Ollama runs as a background service by default. If you need to start it manually:
```bash
# Start Ollama server
ollama serve
# Or run in background
nohup ollama serve > /dev/null 2>&1 &
```
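If you want the notebook to fail fast when the server is down, a small health check against Ollama's `/api/tags` endpoint (which lists installed models) works well. This is a sketch of my own, not code from Cell 20; the helper name `ollama_is_up` is mine:

```python
import json
import urllib.request
import urllib.error

def ollama_is_up(host: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers with valid JSON on `host`."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # a healthy server returns a JSON model list
            return True
    except (urllib.error.URLError, OSError, ValueError):
        return False

if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```

Calling this at the top of the annotation cell turns a cryptic connection error mid-run into an immediate, readable failure.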
## Using Qwen-2.5-32B in the Notebook
### Cell 20: Qwen-2.5-32B Local Annotation
The notebook cell handles everything automatically:
1. **Checks Ollama installation**
2. **Verifies model availability**
3. **Runs inference locally**
4. **Saves progress every 10 rows**
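The cell's actual code isn't reproduced in this guide, but a single annotation call against Ollama's REST API (`/api/generate`, non-streaming) can be sketched like this; `build_payload` and `annotate` are my own names, not the notebook's:

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "qwen2.5:32b-instruct"

def build_payload(prompt: str) -> bytes:
    """Encode one non-streaming request for Ollama's /api/generate endpoint."""
    return json.dumps({
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False,  # one JSON object back instead of a token stream
    }).encode("utf-8")

def annotate(prompt: str) -> str:
    """Send one prompt to the local model and return its completion text."""
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

With `"stream": False`, the server returns a single JSON object whose `response` field holds the full completion, which is simpler to checkpoint than assembling a token stream.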
### Configuration
```python
# In Cell 20
TEST_MODE = True # Start with small test
TEST_SIZE = 10 # Test on 10 samples first
MAX_ROWS = 20000 # Full dataset size
SAVE_INTERVAL = 10 # Save every 10 rows
MODEL_NAME = "qwen2.5:32b-instruct" # Model to use
OLLAMA_HOST = "http://localhost:11434" # Default Ollama port
```
### Running the Pipeline
1. **Test run first** (recommended):
```python
TEST_MODE = True
TEST_SIZE = 10
```
Run Cell 20 to test on 10 samples (~2-3 minutes).
2. **Check results**: output is saved to `data/CSV/qwen_local_annotated_POI_test.csv`
3. **Full run**:
```python
TEST_MODE = False
MAX_ROWS = 20000 # or None for all rows
```
Run Cell 20 for the full dataset (~50-100 hours for 10k samples on the A100; see Performance Expectations below)
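The "save every 10 rows" behavior described above can be sketched with the standard library alone; the function and field names here are illustrative, not the notebook's, and `annotate` stands in for the real model call:

```python
import csv
from pathlib import Path

SAVE_INTERVAL = 10  # flush results to disk every N rows

def run_with_checkpoints(rows, annotate, out_path):
    """Annotate `rows` one at a time, rewriting the output CSV every
    SAVE_INTERVAL rows so an interrupted run loses at most that many."""
    out = Path(out_path)
    done = []
    for i, row in enumerate(rows, start=1):
        done.append({**row, "profession_llm": annotate(row["text"])})
        if i % SAVE_INTERVAL == 0 or i == len(rows):
            with out.open("w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=list(done[0].keys()))
                writer.writeheader()
                writer.writerows(done)
    return done
```

Rewriting the whole file at each checkpoint keeps the CSV valid at all times, which matters when a 50+ hour run gets killed partway through.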
### Performance Expectations
On NVIDIA A100 80GB:
- **Speed**: 5-10 tokens/second
- **Throughput**: 100-200 samples/hour (depends on prompt length)
- **Memory**: ~22-25GB VRAM during inference
- **Time for 10k samples**: ~50-100 hours (can run overnight/over weekend)
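The runtime figures above follow directly from the throughput estimate; a two-line helper (my own, for illustration) makes the arithmetic explicit:

```python
def estimated_hours(n_samples: int, samples_per_hour: float) -> float:
    """Wall-clock estimate for a sequential annotation run."""
    return n_samples / samples_per_hour

# 10,000 samples at the A100's estimated 100-200 samples/hour:
best_case = estimated_hours(10_000, 200)   # optimistic bound
worst_case = estimated_hours(10_000, 100)  # pessimistic bound
```

That gives the 50-100 hour window quoted above and below, so the same helper can sanity-check timing for other dataset sizes.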
### Monitoring
The cell shows progress updates:
```
Qwen Local: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [02:30<00:00, 15.0s/it]
βœ… Saved after 10 rows (~240.0 samples/hour)
βœ… Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
Total time: 2.5 minutes
Average speed: 240.0 samples/hour
```
## Troubleshooting
### Model Not Found
```bash
# Check if model is installed
ollama list
# If not listed, pull it
ollama pull qwen2.5:32b-instruct
```
### Ollama Server Not Running
```bash
# Check if Ollama is running
ps aux | grep ollama
# If not running, start it
ollama serve
```
### GPU Not Detected
```bash
# Check NVIDIA GPU
nvidia-smi
# Check CUDA
nvcc --version
# Ollama should automatically detect GPU
# If not, check Ollama logs
journalctl -u ollama
```
### Out of Memory (OOM)
If you get OOM errors:
1. **Check VRAM usage**:
```bash
watch -n 1 nvidia-smi
```
2. **Try smaller batch size** (not applicable here - we process 1 at a time)
3. **Try quantized version** (smaller model):
```bash
# 4-bit quantized version (~12GB VRAM)
ollama pull qwen2.5:32b-instruct-q4_0
# Update MODEL_NAME in notebook
MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
```
### Slow Inference
If inference is very slow (<1 token/sec):
1. **Check GPU utilization**:
```bash
nvidia-smi
```
GPU should show ~90%+ utilization during inference
2. **Check CPU vs GPU**:
Ollama may have fallen back to CPU. Check the server logs (`journalctl -u ollama`) for GPU detection messages, and pin a specific GPU if needed:
```bash
# Restrict Ollama to the first GPU
CUDA_VISIBLE_DEVICES=0 ollama serve
```
## Model Variants
Ollama provides several Qwen-2.5 variants:
| Model | Size | VRAM | Speed | Quality |
|-------|------|------|-------|---------|
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very Fast | OK |
For your A100 80GB, **`qwen2.5:32b-instruct`** is recommended (best quality, no VRAM issues).
## Custom Model Cache Location
To store models in `data/models/` directory:
```bash
# Set environment variable
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
# Add to ~/.bashrc for persistence
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc
# Pull model (will download to data/models/)
ollama pull qwen2.5:32b-instruct
# Verify
ls -lh $OLLAMA_MODELS/
```
## Comparing Results
After running both API and local versions, compare results:
```python
import pandas as pd
# Load results
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')
# Compare professions
print("API professions:", qwen_api['profession_llm'].value_counts().head())
print("Local professions:", qwen_local['profession_llm'].value_counts().head())
# Check agreement
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
print(f"Agreement rate: {agreement*100:.1f}%")
```
## Cost Comparison (10,000 samples)
| Method | Cost | Time | Privacy |
|--------|------|------|---------|
| **Qwen Local (A100)** | **$0** | ~50-100 hours | βœ… Full |
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
| Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
| DeepSeek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to DeepSeek |
**Recommendation**:
- For **small tests** (<100 samples): Use API (faster)
- For **large datasets** (>1000 samples): Use local (free, private)
- For **research papers**: Use local to avoid data privacy concerns
## Advanced: Parallel Processing
For faster processing on multi-GPU setup:
Not implemented yet, but possible with:
- Multiple Ollama instances bound to different GPUs
- Ray or Dask to dispatch rows in parallel
- Roughly 4x speedup with 4 GPUs
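As a sketch of the multi-instance idea, assuming one Ollama server per GPU on consecutive ports (a setup of my own, not something this repo configures), rows can be round-robined across hosts with a thread pool; `annotate_on` is a stand-in for the real HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: one Ollama server per GPU, each on its own port.
HOSTS = [f"http://localhost:{11434 + gpu}" for gpu in range(4)]

def annotate_on(host: str, text: str) -> str:
    """Stand-in for a real request to `host`/api/generate."""
    return f"annotated[{text}]"  # replace with the actual HTTP call

def parallel_annotate(texts, hosts=HOSTS, worker=annotate_on):
    """Round-robin each text to a host; one thread per host keeps every
    GPU busy without overloading any single server."""
    def job(args):
        idx, text = args
        return worker(hosts[idx % len(hosts)], text)
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(job, enumerate(texts)))
```

Threads (rather than processes) are enough here because each worker just blocks on network I/O while the GPUs do the actual compute.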
## Summary
βœ… **Ollama** already installed
βœ… **A100 80GB** GPU - perfect for Qwen-2.5-32B
βœ… **Free inference** - no API costs
βœ… **Privacy** - data stays local
**Next steps:**
1. Pull model: `ollama pull qwen2.5:32b-instruct`
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
3. Run full dataset: `TEST_MODE = False`
**Estimated time for 10,000 samples**: ~50-100 hours
**Cost**: $0
Good luck! πŸš€