# Running Qwen-2.5-32B Locally with Ollama
This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.
## Why Run Locally?
βœ… **FREE** - No API costs ($0 per query)
βœ… **FAST** - Local inference on A100 (5-10 tokens/sec)
βœ… **PRIVATE** - Data never leaves your machine
βœ… **OFFLINE** - Works without internet (after model download)
βœ… **HIGH QUALITY** - 32B parameter model with strong multilingual support
## System Requirements
### Minimum Specs
- **GPU**: NVIDIA A100 80GB (or similar high-end GPU)
- **VRAM**: 22-25GB during inference
- **RAM**: 32GB system RAM (you have 265GB - more than enough!)
- **Storage**: ~20GB for model download
- **OS**: Linux (you're on Ubuntu)
### Your Setup
βœ… NVIDIA A100 80GB - Perfect for Qwen-2.5-32B
βœ… 265GB RAM - Excellent
βœ… Linux (Ubuntu) - Supported
βœ… Ollama already installed at `/usr/local/bin/ollama`
## Installation Steps
### 1. Verify Ollama Installation
```bash
# Check if Ollama is installed
which ollama
# Should output: /usr/local/bin/ollama
# Check Ollama version
ollama --version
```
If not installed, install with:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
### 2. Pull Qwen-2.5-32B-Instruct Model
```bash
# This will download ~20GB
ollama pull qwen2.5:32b-instruct
# Alternative: Use the base model (not instruct-tuned)
# ollama pull qwen2.5:32b
```
**Download time**: ~10-30 minutes depending on your internet speed.
**Model cache location**: By default, models are cached at:
- Linux: `~/.ollama/models/`
To use custom cache location (e.g., `data/models/`):
```bash
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
ollama pull qwen2.5:32b-instruct
```
### 3. Verify Model is Ready
```bash
# List all installed models
ollama list
# Test the model
ollama run qwen2.5:32b-instruct "Hello, who are you?"
```
You should see a response from Qwen!
### 4. Start Ollama Server (if needed)
Ollama runs as a background service by default. If you need to start it manually:
```bash
# Start Ollama server
ollama serve
# Or run in background
nohup ollama serve > /dev/null 2>&1 &
```
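If you want the notebook to fail fast when the server is down, a small health check against Ollama's `/api/tags` endpoint (which lists installed models) works well. This is a sketch of my own, not code from Cell 20; the helper name `ollama_is_up` is mine:

```python
import json
import urllib.request
import urllib.error

def ollama_is_up(host: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers with valid JSON on `host`."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # a healthy server returns a JSON model list
            return True
    except (urllib.error.URLError, OSError, ValueError):
        return False

if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```

Calling this at the top of the annotation cell turns a cryptic connection error mid-run into an immediate, readable failure.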
## Using Qwen-2.5-32B in the Notebook
### Cell 20: Qwen-2.5-32B Local Annotation
The notebook cell handles everything automatically:
1. **Checks Ollama installation**
2. **Verifies model availability**
3. **Runs inference locally**
4. **Saves progress every 10 rows**
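The cell's actual code isn't reproduced in this guide, but a single annotation call against Ollama's REST API (`/api/generate`, non-streaming) can be sketched like this; `build_payload` and `annotate` are my own names, not the notebook's:

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "qwen2.5:32b-instruct"

def build_payload(prompt: str) -> bytes:
    """Encode one non-streaming request for Ollama's /api/generate endpoint."""
    return json.dumps({
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False,  # one JSON object back instead of a token stream
    }).encode("utf-8")

def annotate(prompt: str) -> str:
    """Send one prompt to the local model and return its completion text."""
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

With `"stream": False`, the server returns a single JSON object whose `response` field holds the full completion, which is simpler to checkpoint than assembling a token stream.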
### Configuration
```python
# In Cell 20
TEST_MODE = True # Start with small test
TEST_SIZE = 10 # Test on 10 samples first
MAX_ROWS = 20000 # Full dataset size
SAVE_INTERVAL = 10 # Save every 10 rows
MODEL_NAME = "qwen2.5:32b-instruct" # Model to use
OLLAMA_HOST = "http://localhost:11434" # Default Ollama port
```
### Running the Pipeline
1. **Test run first** (recommended):
```python
TEST_MODE = True
TEST_SIZE = 10
```
Run Cell 20 to test on 10 samples (~2-3 minutes).
2. **Check results**: output is saved to `data/CSV/qwen_local_annotated_POI_test.csv`
3. **Full run**:
```python
TEST_MODE = False
MAX_ROWS = 20000 # or None for all rows
```
Run Cell 20 for the full dataset (~50-100 hours for 10k samples on the A100; see Performance Expectations below)
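The "save every 10 rows" behavior described above can be sketched with the standard library alone; the function and field names here are illustrative, not the notebook's, and `annotate` stands in for the real model call:

```python
import csv
from pathlib import Path

SAVE_INTERVAL = 10  # flush results to disk every N rows

def run_with_checkpoints(rows, annotate, out_path):
    """Annotate `rows` one at a time, rewriting the output CSV every
    SAVE_INTERVAL rows so an interrupted run loses at most that many."""
    out = Path(out_path)
    done = []
    for i, row in enumerate(rows, start=1):
        done.append({**row, "profession_llm": annotate(row["text"])})
        if i % SAVE_INTERVAL == 0 or i == len(rows):
            with out.open("w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=list(done[0].keys()))
                writer.writeheader()
                writer.writerows(done)
    return done
```

Rewriting the whole file at each checkpoint keeps the CSV valid at all times, which matters when a 50+ hour run gets killed partway through.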
### Performance Expectations
On NVIDIA A100 80GB:
- **Speed**: 5-10 tokens/second
- **Throughput**: 100-200 samples/hour (depends on prompt length)
- **Memory**: ~22-25GB VRAM during inference
- **Time for 10k samples**: ~50-100 hours (can run overnight/over weekend)
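The runtime figures above follow directly from the throughput estimate; a two-line helper (my own, for illustration) makes the arithmetic explicit:

```python
def estimated_hours(n_samples: int, samples_per_hour: float) -> float:
    """Wall-clock estimate for a sequential annotation run."""
    return n_samples / samples_per_hour

# 10,000 samples at the A100's estimated 100-200 samples/hour:
best_case = estimated_hours(10_000, 200)   # optimistic bound
worst_case = estimated_hours(10_000, 100)  # pessimistic bound
```

That gives the 50-100 hour window quoted above and below, so the same helper can sanity-check timing for other dataset sizes.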
### Monitoring
The cell shows progress updates:
```
Qwen Local: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [02:30<00:00, 15.0s/it]
βœ… Saved after 10 rows (~240.0 samples/hour)
βœ… Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
Total time: 2.5 minutes
Average speed: 240.0 samples/hour
```
## Troubleshooting
### Model Not Found
```bash
# Check if model is installed
ollama list
# If not listed, pull it
ollama pull qwen2.5:32b-instruct
```
### Ollama Server Not Running
```bash
# Check if Ollama is running
ps aux | grep ollama
# If not running, start it
ollama serve
```
### GPU Not Detected
```bash
# Check NVIDIA GPU
nvidia-smi
# Check CUDA
nvcc --version
# Ollama should automatically detect GPU
# If not, check Ollama logs
journalctl -u ollama
```
### Out of Memory (OOM)
If you get OOM errors:
1. **Check VRAM usage**:
```bash
watch -n 1 nvidia-smi
```
2. **Try smaller batch size** (not applicable here - we process 1 at a time)
3. **Try quantized version** (smaller model):
```bash
# 4-bit quantized version (~12GB VRAM)
ollama pull qwen2.5:32b-instruct-q4_0
# Update MODEL_NAME in notebook
MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
```
### Slow Inference
If inference is very slow (<1 token/sec):
1. **Check GPU utilization**:
```bash
nvidia-smi
```
GPU should show ~90%+ utilization during inference
2. **Check CPU vs GPU**:
Ollama may have fallen back to CPU. Check the server logs (`journalctl -u ollama`) for GPU detection messages, and pin a specific GPU if needed:
```bash
# Restrict Ollama to the first GPU
CUDA_VISIBLE_DEVICES=0 ollama serve
```
## Model Variants
Ollama provides several Qwen-2.5 variants:
| Model | Size | VRAM | Speed | Quality |
|-------|------|------|-------|---------|
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very Fast | OK |
For your A100 80GB, **`qwen2.5:32b-instruct`** is recommended (best quality, no VRAM issues).
## Custom Model Cache Location
To store models in `data/models/` directory:
```bash
# Set environment variable
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
# Add to ~/.bashrc for persistence
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc
# Pull model (will download to data/models/)
ollama pull qwen2.5:32b-instruct
# Verify
ls -lh $OLLAMA_MODELS/
```
## Comparing Results
After running both API and local versions, compare results:
```python
import pandas as pd
# Load results
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')
# Compare professions
print("API professions:", qwen_api['profession_llm'].value_counts().head())
print("Local professions:", qwen_local['profession_llm'].value_counts().head())
# Check agreement
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
print(f"Agreement rate: {agreement*100:.1f}%")
```
## Cost Comparison (10,000 samples)
| Method | Cost | Time | Privacy |
|--------|------|------|---------|
| **Qwen Local (A100)** | **$0** | ~50-100 hours | βœ… Full |
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
| Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
| DeepSeek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to DeepSeek |
**Recommendation**:
- For **small tests** (<100 samples): Use API (faster)
- For **large datasets** (>1000 samples): Use local (free, private)
- For **research papers**: Use local to avoid data privacy concerns
## Advanced: Parallel Processing
For faster processing on multi-GPU setup:
Not implemented yet, but possible with:
- Multiple Ollama instances bound to different GPUs
- Ray or Dask to dispatch rows in parallel
- Roughly 4x speedup with 4 GPUs
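As a sketch of the multi-instance idea, assuming one Ollama server per GPU on consecutive ports (a setup of my own, not something this repo configures), rows can be round-robined across hosts with a thread pool; `annotate_on` is a stand-in for the real HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: one Ollama server per GPU, each on its own port.
HOSTS = [f"http://localhost:{11434 + gpu}" for gpu in range(4)]

def annotate_on(host: str, text: str) -> str:
    """Stand-in for a real request to `host`/api/generate."""
    return f"annotated[{text}]"  # replace with the actual HTTP call

def parallel_annotate(texts, hosts=HOSTS, worker=annotate_on):
    """Round-robin each text to a host; one thread per host keeps every
    GPU busy without overloading any single server."""
    def job(args):
        idx, text = args
        return worker(hosts[idx % len(hosts)], text)
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(job, enumerate(texts)))
```

Threads (rather than processes) are enough here because each worker just blocks on network I/O while the GPUs do the actual compute.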
## Summary
βœ… **Ollama** already installed
βœ… **A100 80GB** GPU - perfect for Qwen-2.5-32B
βœ… **Free inference** - no API costs
βœ… **Privacy** - data stays local
**Next steps:**
1. Pull model: `ollama pull qwen2.5:32b-instruct`
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
3. Run full dataset: `TEST_MODE = False`
**Estimated time for 10,000 samples**: ~50-100 hours
**Cost**: $0
Good luck! πŸš€