# Running Qwen-2.5-32B Locally with Ollama

This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.

## Why Run Locally?

✅ **FREE** - No API costs ($0 per query)
✅ **FAST** - Local inference on A100 (5-10 tokens/sec)
✅ **PRIVATE** - Data never leaves your machine
✅ **OFFLINE** - Works without internet (after model download)
✅ **HIGH QUALITY** - 32B parameter model with strong multilingual support

## System Requirements

### Minimum Specs

- **GPU**: NVIDIA A100 80GB (or similar high-end GPU)
- **VRAM**: 22-25GB during inference
- **RAM**: 32GB system RAM (you have 265GB - more than enough!)
- **Storage**: ~20GB for the model download
- **OS**: Linux (you're on Ubuntu)

### Your Setup

✅ NVIDIA A100 80GB - Perfect for Qwen-2.5-32B
✅ 265GB RAM - Excellent
✅ Linux (Ubuntu) - Supported
✅ Ollama already installed at `/usr/local/bin/ollama`

## Installation Steps

### 1. Verify Ollama Installation

```bash
# Check if Ollama is installed
which ollama
# Should output: /usr/local/bin/ollama

# Check the Ollama version
ollama --version
```

If Ollama is not installed, install it with:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

### 2. Pull the Qwen-2.5-32B-Instruct Model

```bash
# This will download ~20GB
ollama pull qwen2.5:32b-instruct

# Alternative: use the base model (not instruct-tuned)
# ollama pull qwen2.5:32b
```

**Download time**: ~10-30 minutes, depending on your internet speed.

**Model cache location**: By default, models are cached at:

- Linux: `~/.ollama/models/`

To use a custom cache location (e.g., `data/models/`):

```bash
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
ollama pull qwen2.5:32b-instruct
```

### 3. Verify the Model Is Ready

```bash
# List all installed models
ollama list

# Test the model
ollama run qwen2.5:32b-instruct "Hello, who are you?"
```

You should see a response from Qwen!

### 4. Start the Ollama Server (if needed)

Ollama runs as a background service by default.
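Before starting the server by hand, you can probe whether it is already listening. Below is a minimal stdlib-only Python sketch (an assumption of this guide, not notebook code): the root endpoint of a running Ollama server answers with HTTP 200.

```python
# Check whether a local Ollama server is reachable.
# Uses only the Python standard library; the default Ollama port is 11434.
import urllib.request
import urllib.error


def is_ollama_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused / timeout / DNS failure -> server not reachable
        return False


if __name__ == "__main__":
    print("Ollama up:", is_ollama_running())
```

If this returns `False`, start the server manually as shown next.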
If you need to start it manually:

```bash
# Start the Ollama server
ollama serve

# Or run it in the background
nohup ollama serve > /dev/null 2>&1 &
```

## Using Qwen-2.5-32B in the Notebook

### Cell 20: Qwen-2.5-32B Local Annotation

The notebook cell handles everything automatically:

1. **Checks the Ollama installation**
2. **Verifies model availability**
3. **Runs inference locally**
4. **Saves progress every 10 rows**

### Configuration

```python
# In Cell 20
TEST_MODE = True        # Start with a small test
TEST_SIZE = 10          # Test on 10 samples first
MAX_ROWS = 20000        # Full dataset size
SAVE_INTERVAL = 10      # Save every 10 rows
MODEL_NAME = "qwen2.5:32b-instruct"     # Model to use
OLLAMA_HOST = "http://localhost:11434"  # Default Ollama port
```

### Running the Pipeline

1. **Test run first** (recommended):

   ```python
   TEST_MODE = True
   TEST_SIZE = 10
   ```

   Run Cell 20 to test on 10 samples (~1-2 minutes).

2. **Check the results**:

   ```python
   # Output saved to: data/CSV/qwen_local_annotated_POI_test.csv
   ```

3. **Full run**:

   ```python
   TEST_MODE = False
   MAX_ROWS = 20000  # or None for all rows
   ```

   Run Cell 20 for the full dataset (~50-100 hours for 10k samples on A100).

### Performance Expectations

On an NVIDIA A100 80GB:

- **Speed**: 5-10 tokens/second
- **Throughput**: 100-200 samples/hour (depends on prompt length)
- **Memory**: ~22-25GB VRAM during inference
- **Time for 10k samples**: ~50-100 hours (can run overnight/over a weekend)

### Monitoring

The cell shows progress updates:

```
Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
✅ Saved after 10 rows (~240.0 samples/hour)
✅ Done!
Results: data/CSV/qwen_local_annotated_POI_test.csv
Total time: 2.5 minutes
Average speed: 240.0 samples/hour
```

## Troubleshooting

### Model Not Found

```bash
# Check if the model is installed
ollama list

# If it is not listed, pull it
ollama pull qwen2.5:32b-instruct
```

### Ollama Server Not Running

```bash
# Check if Ollama is running
ps aux | grep ollama

# If it is not running, start it
ollama serve
```

### GPU Not Detected

```bash
# Check the NVIDIA GPU
nvidia-smi

# Check CUDA
nvcc --version

# Ollama should detect the GPU automatically.
# If not, check the Ollama logs:
journalctl -u ollama
```

### Out of Memory (OOM)

If you get OOM errors:

1. **Check VRAM usage**:

   ```bash
   watch -n 1 nvidia-smi
   ```

2. **Try a smaller batch size** (not applicable here - we process one sample at a time)

3. **Try a quantized version** (smaller model):

   ```bash
   # 4-bit quantized version (~12GB VRAM)
   ollama pull qwen2.5:32b-instruct-q4_0

   # Then update MODEL_NAME in the notebook:
   # MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
   ```

### Slow Inference

If inference is very slow (<1 token/sec):

1. **Check GPU utilization**:

   ```bash
   nvidia-smi
   ```

   The GPU should show ~90%+ utilization during inference.

2. **Check CPU vs GPU**: Ollama might be running on the CPU instead of the GPU.

   ```bash
   # Check where the loaded model is running (GPU vs CPU)
   ollama ps

   # Pin the server to the first GPU
   CUDA_VISIBLE_DEVICES=0 ollama serve
   ```

## Model Variants

Ollama provides several Qwen-2.5 variants:

| Model | Size | VRAM | Speed | Quality |
|-------|------|------|-------|---------|
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very fast | OK |

For your A100 80GB, **`qwen2.5:32b-instruct`** is recommended (best quality, no VRAM issues).
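Independent of the notebook, any of these variants can be exercised directly over Ollama's REST API. Below is a minimal Python sketch (an illustration, not the actual Cell 20 code) that sends one prompt to `/api/generate` with streaming disabled and derives a tokens/sec figure from the response counters; `MODEL_NAME` and `OLLAMA_HOST` mirror the notebook's configuration, and `eval_duration` is reported by Ollama in nanoseconds.

```python
# One-shot request against the local Ollama HTTP API (/api/generate).
# Requires a running Ollama server with the model already pulled.
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "qwen2.5:32b-instruct"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval counters (tokens, nanoseconds) into tokens/sec."""
    return eval_count / (eval_duration_ns / 1e9)


def generate(prompt: str) -> dict:
    """Send one non-streaming generation request and return the parsed JSON."""
    payload = json.dumps(
        {"model": MODEL_NAME, "prompt": prompt, "stream": False}
    ).encode()
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    out = generate("Hello, who are you?")
    print(out["response"])
    speed = tokens_per_second(out["eval_count"], out["eval_duration"])
    print(f"{speed:.1f} tokens/sec")
```

The tokens/sec figure printed here is the same number to watch when diagnosing the "Slow Inference" case above.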
## Custom Model Cache Location

To store models in the `data/models/` directory:

```bash
# Set the environment variable
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"

# Add it to ~/.bashrc for persistence
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc

# Pull the model (it will download to data/models/)
ollama pull qwen2.5:32b-instruct

# Verify
ls -lh $OLLAMA_MODELS/
```

## Comparing Results

After running both the API and local versions, compare the results:

```python
import pandas as pd

# Load results
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')

# Compare professions
print("API professions:", qwen_api['profession_llm'].value_counts().head())
print("Local professions:", qwen_local['profession_llm'].value_counts().head())

# Check agreement
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
print(f"Agreement rate: {agreement*100:.1f}%")
```

## Cost Comparison (10,000 samples)

| Method | Cost | Time | Privacy |
|--------|------|------|---------|
| **Qwen Local (A100)** | **$0** | ~50-100 hours | ✅ Full |
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
| Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
| Deepseek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to Deepseek |

**Recommendation**:

- For **small tests** (<100 samples): use the API (faster)
- For **large datasets** (>1000 samples): run locally (free, private)
- For **research papers**: run locally to avoid data privacy concerns

## Advanced: Parallel Processing

For faster processing on a multi-GPU setup:

```python
# Not implemented yet, but possible with:
# - Multiple Ollama instances on different GPUs
# - Ray or Dask for parallel processing
# - ~4x speedup with 4 GPUs
```

## Summary

✅ **Ollama** already installed
✅ **A100 80GB** GPU - perfect for Qwen-2.5-32B
✅ **Free inference** - no API costs
✅ **Privacy** - data stays local

**Next steps:**

1. Pull the model: `ollama pull qwen2.5:32b-instruct`
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
3. Run the full dataset: `TEST_MODE = False`

**Estimated time for 10,000 samples**: ~50-100 hours
**Cost**: $0

Good luck! 🚀