# Running Qwen-2.5-32B Locally with Ollama
This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.
## Why Run Locally?
✅ **FREE** - No API costs ($0 per query)
✅ **FAST** - Local inference on A100 (5-10 tokens/sec)
✅ **PRIVATE** - Data never leaves your machine
✅ **OFFLINE** - Works without internet (after model download)
✅ **HIGH QUALITY** - 32B parameter model with strong multilingual support
## System Requirements
### Minimum Specs
- **GPU**: NVIDIA A100 80GB (or similar high-end GPU)
- **VRAM**: 22-25GB during inference
- **RAM**: 32GB system RAM (you have 265GB - more than enough!)
- **Storage**: ~20GB for model download
- **OS**: Linux (you're on Ubuntu)
### Your Setup
✅ NVIDIA A100 80GB - Perfect for Qwen-2.5-32B
✅ 265GB RAM - Excellent
✅ Linux (Ubuntu) - Supported
✅ Ollama already installed at `/usr/local/bin/ollama`
## Installation Steps
### 1. Verify Ollama Installation
```bash
# Check if Ollama is installed
which ollama
# Should output: /usr/local/bin/ollama
# Check Ollama version
ollama --version
```
If not installed, install with:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
### 2. Pull Qwen-2.5-32B-Instruct Model
```bash
# This will download ~20GB
ollama pull qwen2.5:32b-instruct
# Alternative: Use the base model (not instruct-tuned)
# ollama pull qwen2.5:32b
```
**Download time**: ~10-30 minutes depending on your internet speed.
**Model cache location**: By default, models are cached at:
- Linux: `~/.ollama/models/`
To use a custom cache location (e.g., `data/models/`):
```bash
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
ollama pull qwen2.5:32b-instruct
```
### 3. Verify Model is Ready
```bash
# List all installed models
ollama list
# Test the model
ollama run qwen2.5:32b-instruct "Hello, who are you?"
```
You should see a response from Qwen!
### 4. Start Ollama Server (if needed)
Ollama runs as a background service by default. If you need to start it manually:
```bash
# Start Ollama server
ollama serve
# Or run in background
nohup ollama serve > /dev/null 2>&1 &
```
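Before launching the notebook, it can help to confirm the server is reachable from Python. This is a minimal sketch using only the standard library; `/api/tags` is Ollama's model-listing endpoint, and the helper names here are illustrative, not part of the notebook:

```python
import json
from typing import Optional
from urllib.error import URLError
from urllib.request import urlopen

OLLAMA_HOST = "http://localhost:11434"

def model_installed(tags_response: dict, name: str) -> bool:
    """Check whether a model name appears in an /api/tags response."""
    return any(m.get("name", "").startswith(name)
               for m in tags_response.get("models", []))

def check_server(host: str = OLLAMA_HOST) -> Optional[dict]:
    """Return the parsed /api/tags response, or None if the server is down."""
    try:
        with urlopen(f"{host}/api/tags", timeout=5) as resp:
            return json.load(resp)
    except URLError:
        return None

if __name__ == "__main__":
    tags = check_server()
    if tags is None:
        print("Ollama server not reachable - run `ollama serve` first")
    else:
        print("qwen2.5:32b-instruct installed:",
              model_installed(tags, "qwen2.5:32b-instruct"))
```

If `check_server` returns `None`, start the server as shown above before running any annotation cells.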
## Using Qwen-2.5-32B in the Notebook
### Cell 20: Qwen-2.5-32B Local Annotation
The notebook cell handles everything automatically:
1. **Checks Ollama installation**
2. **Verifies model availability**
3. **Runs inference locally**
4. **Saves progress every 10 rows**
### Configuration
```python
# In Cell 20
TEST_MODE = True # Start with small test
TEST_SIZE = 10 # Test on 10 samples first
MAX_ROWS = 20000 # Full dataset size
SAVE_INTERVAL = 10 # Save every 10 rows
MODEL_NAME = "qwen2.5:32b-instruct" # Model to use
OLLAMA_HOST = "http://localhost:11434" # Default Ollama port
```
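Under the hood, each row is annotated with one blocking call to Ollama's `/api/generate` endpoint. A minimal standard-library sketch of that call is below; it is not the notebook's actual Cell 20 code, and the function names are illustrative:

```python
import json
from urllib.request import Request, urlopen

OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "qwen2.5:32b-instruct"

def build_payload(prompt: str, model: str = MODEL_NAME) -> dict:
    """Payload for a single non-streaming /api/generate call."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, host: str = OLLAMA_HOST) -> str:
    """Send one prompt to the local model and return its text response."""
    req = Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["response"]
```

With `stream: False` the server returns one JSON object whose `response` field holds the full completion, which is simpler for batch annotation than consuming the streamed chunks.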
### Running the Pipeline
1. **Test run first** (recommended):
```python
TEST_MODE = True
TEST_SIZE = 10
```
Run Cell 20 to test on 10 samples (~1-2 minutes)
2. **Check results**:
```python
# Output saved to:
data/CSV/qwen_local_annotated_POI_test.csv
```
3. **Full run**:
```python
TEST_MODE = False
MAX_ROWS = 20000 # or None for all rows
```
Run Cell 20 for the full dataset (~50-100 hours for 10k samples on A100; see Performance Expectations below)
### Performance Expectations
On NVIDIA A100 80GB:
- **Speed**: 5-10 tokens/second
- **Throughput**: 100-200 samples/hour (depends on prompt length)
- **Memory**: ~22-25GB VRAM during inference
- **Time for 10k samples**: ~50-100 hours (can run overnight/over weekend)
### Monitoring
The cell shows progress updates:
```
Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
✅ Saved after 10 rows (~24.0 samples/hour)
✅ Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
Total time: 2.5 minutes
Average speed: 240.0 samples/hour
```
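The save-every-`SAVE_INTERVAL`-rows behaviour shown above can be sketched with the standard library alone. The column names and the `annotate` callable are illustrative, not the notebook's exact schema:

```python
import csv

SAVE_INTERVAL = 10

def annotate_rows(rows, annotate, out_path, save_interval=SAVE_INTERVAL):
    """Annotate rows one by one, flushing results to CSV every save_interval rows.

    `rows` is a list of dicts with a "text" key; `annotate` is whatever
    function maps a text to a label (e.g. a call to the local Qwen model).
    """
    buffer = []
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "profession_llm"])
        writer.writeheader()
        for i, row in enumerate(rows, start=1):
            buffer.append({"text": row["text"],
                           "profession_llm": annotate(row["text"])})
            if i % save_interval == 0 or i == len(rows):
                writer.writerows(buffer)
                buffer.clear()
                f.flush()  # progress survives an interrupted run
    return out_path
```

Flushing at a fixed interval means a crash or reboot mid-run loses at most `save_interval` rows of work, which matters for runs measured in days.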
## Troubleshooting
### Model Not Found
```bash
# Check if model is installed
ollama list
# If not listed, pull it
ollama pull qwen2.5:32b-instruct
```
### Ollama Server Not Running
```bash
# Check if Ollama is running
ps aux | grep ollama
# If not running, start it
ollama serve
```
### GPU Not Detected
```bash
# Check NVIDIA GPU
nvidia-smi
# Check CUDA
nvcc --version
# Ollama should automatically detect GPU
# If not, check Ollama logs
journalctl -u ollama
```
### Out of Memory (OOM)
If you get OOM errors:
1. **Check VRAM usage**:
```bash
watch -n 1 nvidia-smi
```
2. **Try smaller batch size** (not applicable here - we process 1 at a time)
3. **Try quantized version** (smaller model):
```bash
# 4-bit quantized version (~12GB VRAM)
ollama pull qwen2.5:32b-instruct-q4_0
# Update MODEL_NAME in notebook
MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
```
### Slow Inference
If inference is very slow (<1 token/sec):
1. **Check GPU utilization**:
```bash
nvidia-smi
```
GPU should show ~90%+ utilization during inference
2. **Check CPU vs GPU**:
Ollama might be using CPU instead of GPU
```bash
# Check which device the loaded model is running on
ollama ps
# If the PROCESSOR column shows CPU, restart the server with the GPU visible
CUDA_VISIBLE_DEVICES=0 ollama serve
```
## Model Variants
Ollama provides several Qwen-2.5 variants:
| Model | Size | VRAM | Speed | Quality |
|-------|------|------|-------|---------|
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very Fast | OK |
For your A100 80GB, **`qwen2.5:32b-instruct`** is recommended (best quality, no VRAM issues).
## Custom Model Cache Location
To store models in `data/models/` directory:
```bash
# Set environment variable
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
# Add to ~/.bashrc for persistence
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc
# Pull model (will download to data/models/)
ollama pull qwen2.5:32b-instruct
# Verify
ls -lh $OLLAMA_MODELS/
```
## Comparing Results
After running both API and local versions, compare results:
```python
import pandas as pd
# Load results
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')
# Compare professions
print("API professions:", qwen_api['profession_llm'].value_counts().head())
print("Local professions:", qwen_local['profession_llm'].value_counts().head())
# Check agreement
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
print(f"Agreement rate: {agreement*100:.1f}%")
```
## Cost Comparison (10,000 samples)
| Method | Cost | Time | Privacy |
|--------|------|------|---------|
| **Qwen Local (A100)** | **$0** | ~50-100 hours | ✅ Full |
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | β οΈ Data sent to Alibaba |
| Llama API (Together) | ~$5-10 | ~5-10 hours | β οΈ Data sent to Together AI |
| Deepseek API | ~$1-2 | ~5-10 hours | β οΈ Data sent to Deepseek |
**Recommendation**:
- For **small tests** (<100 samples): Use API (faster)
- For **large datasets** (>1000 samples): Use local (free, private)
- For **research papers**: Use local to avoid data privacy concerns
## Advanced: Parallel Processing
For faster processing on multi-GPU setup:
```python
# Not implemented yet, but possible with:
# - Multiple Ollama instances on different GPUs
# - Ray or Dask for parallel processing
# - ~4x speedup with 4 GPUs
```
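One possible shape for the multi-instance approach: run one Ollama server per GPU on different ports (e.g. `OLLAMA_HOST=127.0.0.1:11435 CUDA_VISIBLE_DEVICES=1 ollama serve` for a second GPU), then fan rows out across the endpoints with a thread pool. The endpoints and the `annotate_one` callable here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# One endpoint per GPU -- hypothetical ports for a 2-GPU machine
ENDPOINTS = ["http://localhost:11434", "http://localhost:11435"]

def assign_endpoint(i: int, endpoints=ENDPOINTS) -> str:
    """Round-robin rows across the available Ollama instances."""
    return endpoints[i % len(endpoints)]

def annotate_parallel(texts, annotate_one, endpoints=ENDPOINTS):
    """Annotate texts concurrently; annotate_one(text, host) does one
    /api/generate call against the given Ollama instance."""
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        futures = [
            pool.submit(annotate_one, text, assign_endpoint(i, endpoints))
            for i, text in enumerate(texts)
        ]
        return [f.result() for f in futures]
```

Since each request blocks on a different server process, threads are enough here; no multiprocessing is needed, and the speedup is roughly linear in the number of GPUs until they are all saturated.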
## Summary
✅ **Ollama** already installed
✅ **A100 80GB** GPU - perfect for Qwen-2.5-32B
✅ **Free inference** - no API costs
✅ **Privacy** - data stays local
**Next steps:**
1. Pull model: `ollama pull qwen2.5:32b-instruct`
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
3. Run full dataset: `TEST_MODE = False`
**Estimated time for 10,000 samples**: ~50-100 hours
**Cost**: $0
Good luck! π