# Running Qwen-2.5-32B Locally with Ollama

This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.

## Why Run Locally?

βœ… **FREE** - No API costs ($0 per query)
βœ… **FAST** - Local inference on A100 (5-10 tokens/sec)
βœ… **PRIVATE** - Data never leaves your machine
βœ… **OFFLINE** - Works without internet (after model download)
βœ… **HIGH QUALITY** - 32B parameter model with strong multilingual support

## System Requirements

### Minimum Specs
- **GPU**: NVIDIA A100 80GB (or similar high-end GPU)
- **VRAM**: 22-25GB during inference
- **RAM**: 32GB system RAM (you have 265GB - more than enough!)
- **Storage**: ~20GB for model download
- **OS**: Linux (you're on Ubuntu)

### Your Setup
βœ… NVIDIA A100 80GB - Perfect for Qwen-2.5-32B
βœ… 265GB RAM - Excellent
βœ… Linux (Ubuntu) - Supported
βœ… Ollama already installed at `/usr/local/bin/ollama`

## Installation Steps

### 1. Verify Ollama Installation

```bash
# Check if Ollama is installed
which ollama
# Should output: /usr/local/bin/ollama

# Check Ollama version
ollama --version
```

If not installed, install with:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

### 2. Pull Qwen-2.5-32B-Instruct Model

```bash
# This will download ~20GB
ollama pull qwen2.5:32b-instruct

# Alternative: Use the base model (not instruct-tuned)
# ollama pull qwen2.5:32b
```

**Download time**: ~10-30 minutes depending on your internet speed.

**Model cache location**: By default, models are cached at:
- Linux: `~/.ollama/models/`

To use custom cache location (e.g., `data/models/`):
```bash
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
ollama pull qwen2.5:32b-instruct
```

### 3. Verify Model is Ready

```bash
# List all installed models
ollama list

# Test the model
ollama run qwen2.5:32b-instruct "Hello, who are you?"
```

You should see a response from Qwen!

### 4. Start Ollama Server (if needed)

Ollama runs as a background service by default. If you need to start it manually:

```bash
# Start Ollama server
ollama serve

# Or run in background
nohup ollama serve > /dev/null 2>&1 &
```
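Once the server is up, you can confirm it is reachable from Python before running the notebook. A minimal sketch using only the standard library; it queries Ollama's `/api/tags` endpoint, which lists installed models (the host is the default from this guide):

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"

def extract_model_names(payload: dict) -> list:
    """Pull model names out of an /api/tags response payload."""
    return [m["name"] for m in payload.get("models", [])]

def list_local_models(host: str = OLLAMA_HOST) -> list:
    """Ask the running Ollama server which models are installed."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
        return extract_model_names(json.load(resp))

if __name__ == "__main__":
    try:
        print("Installed models:", list_local_models())
    except OSError as e:
        print("Ollama server not reachable:", e)
```

If `qwen2.5:32b-instruct` is missing from the output, re-run the `ollama pull` step above.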

## Using Qwen-2.5-32B in the Notebook

### Cell 20: Qwen-2.5-32B Local Annotation

The notebook cell handles everything automatically:

1. **Checks Ollama installation**
2. **Verifies model availability**
3. **Runs inference locally**
4. **Saves progress every 10 rows**

### Configuration

```python
# In Cell 20
TEST_MODE = True        # Start with small test
TEST_SIZE = 10          # Test on 10 samples first
MAX_ROWS = 20000        # Full dataset size
SAVE_INTERVAL = 10      # Save every 10 rows

MODEL_NAME = "qwen2.5:32b-instruct"  # Model to use
OLLAMA_HOST = "http://localhost:11434"  # Default Ollama port
```
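The annotation loop in Cell 20 can be sketched roughly as follows. This is not the notebook's actual code: the row format, prompt construction, and `save_fn` are stand-ins, while the HTTP call uses Ollama's documented `/api/generate` endpoint with `stream: false`:

```python
import json
import urllib.request

MODEL_NAME = "qwen2.5:32b-instruct"
OLLAMA_HOST = "http://localhost:11434"
SAVE_INTERVAL = 10

def ollama_generate(prompt: str, model: str = MODEL_NAME,
                    host: str = OLLAMA_HOST) -> str:
    """Single non-streaming completion via Ollama's /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def annotate_rows(rows, generate_fn, save_fn, save_interval=SAVE_INTERVAL):
    """Annotate each row, checkpointing every `save_interval` rows."""
    results = []
    for i, row in enumerate(rows, start=1):
        results.append(generate_fn(row))
        if i % save_interval == 0:
            save_fn(results)   # periodic checkpoint, survives interruption
    save_fn(results)           # final save
    return results
```

In the real notebook, `rows` would come from the POI DataFrame and `save_fn` would write the partial CSV, so an interrupted run loses at most `SAVE_INTERVAL` rows of work.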

### Running the Pipeline

1. **Test run first** (recommended):
   ```python
   TEST_MODE = True
   TEST_SIZE = 10
   ```
   Run Cell 20 to test on 10 samples (~1-2 minutes)

2. **Check results**:
   ```python
   # Output saved to:
   data/CSV/qwen_local_annotated_POI_test.csv
   ```

3. **Full run**:
   ```python
   TEST_MODE = False
   MAX_ROWS = 20000  # or None for all rows
   ```
   Run Cell 20 for full dataset (~2-3 hours for 10k samples on A100)

### Performance Expectations

On NVIDIA A100 80GB:
- **Speed**: 5-10 tokens/second
- **Throughput**: 100-200 samples/hour (depends on prompt length)
- **Memory**: ~22-25GB VRAM during inference
- **Time for 10k samples**: ~50-100 hours (can run overnight/over weekend)
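The time estimate above follows directly from the throughput range; a quick helper for projecting runtime at an observed rate (the function name is illustrative):

```python
def eta_hours(n_samples: int, samples_per_hour: float) -> float:
    """Projected wall-clock hours at a steady throughput."""
    return n_samples / samples_per_hour

# 10,000 samples at the A100's 100-200 samples/hour range:
print(eta_hours(10_000, 100))  # 100.0 hours (worst case)
print(eta_hours(10_000, 200))  # 50.0 hours (best case)
```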

### Monitoring

The cell shows progress updates:
```
Qwen Local: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10/10 [02:30<00:00, 15.0s/it]
βœ… Saved after 10 rows (~240.0 samples/hour)

βœ… Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
Total time: 2.5 minutes
Average speed: 240.0 samples/hour
```

## Troubleshooting

### Model Not Found

```bash
# Check if model is installed
ollama list

# If not listed, pull it
ollama pull qwen2.5:32b-instruct
```

### Ollama Server Not Running

```bash
# Check if Ollama is running
ps aux | grep ollama

# If not running, start it
ollama serve
```

### GPU Not Detected

```bash
# Check NVIDIA GPU
nvidia-smi

# Check CUDA
nvcc --version

# Ollama should automatically detect GPU
# If not, check Ollama logs
journalctl -u ollama
```

### Out of Memory (OOM)

If you get OOM errors:

1. **Check VRAM usage**:
   ```bash
   watch -n 1 nvidia-smi
   ```

2. **Reduce batch size** (not applicable here, since the notebook processes one row at a time)

3. **Try quantized version** (smaller model):
   ```bash
   # 4-bit quantized version (~12GB VRAM)
   ollama pull qwen2.5:32b-instruct-q4_0

   # Update MODEL_NAME in notebook
   MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
   ```
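If you want to pick a variant automatically, you can parse free VRAM from `nvidia-smi` and fall back to the quantized tag. A sketch; the thresholds are rough estimates based on the VRAM figures above, and the tag names are the ones used in this guide:

```python
import subprocess

def free_vram_mib(smi_output: str) -> int:
    """Parse the first GPU's free memory (MiB) from the output of
    `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`."""
    return int(smi_output.strip().splitlines()[0])

def pick_model(free_mib: int) -> str:
    """Rough thresholds: full 32B needs ~25GB, q4_0 ~12GB."""
    if free_mib >= 26_000:
        return "qwen2.5:32b-instruct"
    if free_mib >= 13_000:
        return "qwen2.5:32b-instruct-q4_0"
    return "qwen2.5:7b-instruct"

if __name__ == "__main__":
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    print("Using model:", pick_model(free_vram_mib(out)))
```

On an otherwise idle A100 80GB this should always select the full-precision tag.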

### Slow Inference

If inference is very slow (<1 token/sec):

1. **Check GPU utilization**:
   ```bash
   nvidia-smi
   ```
   GPU should show ~90%+ utilization during inference

2. **Check CPU vs GPU**:
   Ollama may be falling back to CPU instead of GPU
   ```bash
   # Shows loaded models and whether they run on GPU or CPU
   ollama ps
   ```
   The `PROCESSOR` column should read `100% GPU`. If it shows CPU, restart the server and check the logs (`journalctl -u ollama`) for CUDA errors.

## Model Variants

Ollama provides several Qwen-2.5 variants:

| Model | Size | VRAM | Speed | Quality |
|-------|------|------|-------|---------|
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very Fast | OK |

For your A100 80GB, **`qwen2.5:32b-instruct`** is recommended (best quality, no VRAM issues).

## Custom Model Cache Location

To store models in `data/models/` directory:

```bash
# Set environment variable
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"

# Add to ~/.bashrc for persistence
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc

# Pull model (will download to data/models/)
ollama pull qwen2.5:32b-instruct

# Verify
ls -lh $OLLAMA_MODELS/
```

## Comparing Results

After running both API and local versions, compare results:

```python
import pandas as pd

# Load results
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')

# Compare professions
print("API professions:", qwen_api['profession_llm'].value_counts().head())
print("Local professions:", qwen_local['profession_llm'].value_counts().head())

# Check agreement
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
print(f"Agreement rate: {agreement*100:.1f}%")
```

## Cost Comparison (10,000 samples)

| Method | Cost | Time | Privacy |
|--------|------|------|---------|
| **Qwen Local (A100)** | **$0** | ~50-100 hours | βœ… Full |
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
| Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
| Deepseek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to Deepseek |

**Recommendation**:
- For **small tests** (<100 samples): Use API (faster)
- For **large datasets** (>1000 samples): Use local (free, private)
- For **research papers**: Use local to avoid data privacy concerns

## Advanced: Parallel Processing

For faster processing on multi-GPU setup:

```python
# Not implemented yet, but possible with:
# - Multiple Ollama instances on different GPUs
# - Ray or Dask for parallel processing
# - ~4x speedup with 4 GPUs
```
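One way the idea above could be realized, assuming several Ollama instances are already listening on different ports (e.g. one per GPU, started with `CUDA_VISIBLE_DEVICES` and a distinct `OLLAMA_HOST` bind address); `query_fn` is a stand-in for the single-row annotation call and the second host is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

HOSTS = [
    "http://localhost:11434",  # GPU 0 (default instance)
    "http://localhost:11435",  # GPU 1 (hypothetical second instance)
]

def annotate_parallel(rows, query_fn, hosts=HOSTS):
    """Round-robin rows across Ollama instances, one worker per host.

    `query_fn(host, row)` performs the actual request and returns the
    annotation. Results come back in the original row order.
    """
    def worker(args):
        i, row = args
        return query_fn(hosts[i % len(hosts)], row)

    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return list(pool.map(worker, enumerate(rows)))
```

Threads are enough here because each worker spends its time waiting on HTTP responses, not on Python computation.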

## Summary

βœ… **Ollama** already installed
βœ… **A100 80GB** GPU - perfect for Qwen-2.5-32B
βœ… **Free inference** - no API costs
βœ… **Privacy** - data stays local

**Next steps:**
1. Pull model: `ollama pull qwen2.5:32b-instruct`
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
3. Run full dataset: `TEST_MODE = False`

**Estimated time for 10,000 samples**: ~50-100 hours
**Cost**: $0

Good luck! πŸš€