# Running Qwen-2.5-32B Locally with Ollama

This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.
## Why Run Locally?

- ✅ **FREE** - No API costs ($0 per query)
- ✅ **FAST** - Local inference on an A100 (5-10 tokens/sec)
- ✅ **PRIVATE** - Data never leaves your machine
- ✅ **OFFLINE** - Works without internet (after the model download)
- ✅ **HIGH QUALITY** - 32B-parameter model with strong multilingual support
## System Requirements

### Minimum Specs

- **GPU**: NVIDIA A100 80GB (or a similar high-end GPU)
- **VRAM**: 22-25GB during inference
- **RAM**: 32GB system RAM (you have 265GB - more than enough!)
- **Storage**: ~20GB for the model download
- **OS**: Linux (you're on Ubuntu)

### Your Setup

- ✅ NVIDIA A100 80GB - perfect for Qwen-2.5-32B
- ✅ 265GB RAM - excellent
- ✅ Linux (Ubuntu) - supported
- ✅ Ollama already installed at `/usr/local/bin/ollama`
## Installation Steps

### 1. Verify Ollama Installation

```bash
# Check if Ollama is installed
which ollama
# Should output: /usr/local/bin/ollama

# Check the Ollama version
ollama --version
```

If Ollama is not installed, install it with:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
### 2. Pull the Qwen-2.5-32B-Instruct Model

```bash
# This will download ~20GB
ollama pull qwen2.5:32b-instruct

# Alternative: use the base model (not instruct-tuned)
# ollama pull qwen2.5:32b
```

**Download time**: ~10-30 minutes depending on your internet speed.

**Model cache location**: By default, models are cached at:

- Linux: `~/.ollama/models/`
To use a custom cache location (e.g., `data/models/`), set the `OLLAMA_MODELS` environment variable before pulling; see "Custom Model Cache Location" below for the full steps.
### 3. Verify the Model is Ready

```bash
# List all installed models
ollama list

# Test the model
ollama run qwen2.5:32b-instruct "Hello, who are you?"
```

You should see a response from Qwen.
### 4. Start the Ollama Server (if needed)

Ollama runs as a background service by default. If you need to start it manually:

```bash
# Start the Ollama server
ollama serve

# Or run it in the background
nohup ollama serve > /dev/null 2>&1 &
```
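Before launching a long run, it can help to confirm the server is reachable from Python. A minimal sketch using Ollama's `/api/tags` endpoint (which lists installed models); the function name is illustrative, not part of the notebook:

```python
import json
import urllib.request

def ollama_up(host="http://localhost:11434"):
    """Return True if an Ollama server answers on the given host."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=2) as resp:
            json.loads(resp.read())  # payload lists the installed models
        return True
    except OSError:
        return False

print(ollama_up())  # True when `ollama serve` is running locally
```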
## Using Qwen-2.5-32B in the Notebook

### Cell 20: Qwen-2.5-32B Local Annotation

The notebook cell handles everything automatically:

1. **Checks the Ollama installation**
2. **Verifies model availability**
3. **Runs inference locally**
4. **Saves progress every 10 rows**
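The save-every-N-rows logic can be sketched as follows; `annotate` and `save` stand in for the notebook's actual inference and CSV-writing code:

```python
SAVE_INTERVAL = 10  # matches the notebook's configuration

def annotate_with_checkpoints(rows, annotate, save):
    """Annotate items one at a time, checkpointing every SAVE_INTERVAL rows.

    `annotate` maps one input row to one label; `save` persists the
    partial results (e.g. writes them to the output CSV), so an
    interrupted run loses at most SAVE_INTERVAL rows of work.
    """
    results = []
    for i, row in enumerate(rows, start=1):
        results.append(annotate(row))
        if i % SAVE_INTERVAL == 0 or i == len(rows):
            save(results)
    return results
```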
### Configuration

```python
# In Cell 20
TEST_MODE = True       # Start with a small test
TEST_SIZE = 10         # Test on 10 samples first
MAX_ROWS = 20000       # Full dataset size
SAVE_INTERVAL = 10     # Save every 10 rows
MODEL_NAME = "qwen2.5:32b-instruct"     # Model to use
OLLAMA_HOST = "http://localhost:11434"  # Default Ollama port
```
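Under the hood, each annotation boils down to one POST to Ollama's `/api/generate` endpoint using the configuration above. A minimal sketch (the helper names are illustrative, not the notebook's actual code):

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "qwen2.5:32b-instruct"

def build_request(prompt, model=MODEL_NAME, temperature=0.0):
    """Assemble the JSON payload for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete JSON response
        "options": {"temperature": temperature},  # deterministic output
    }

def annotate(prompt):
    """Send one prompt to the local Ollama server, return the generated text."""
    data = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```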
### Running the Pipeline

1. **Test run first** (recommended):

   ```python
   TEST_MODE = True
   TEST_SIZE = 10
   ```

   Run Cell 20 to test on 10 samples (~1-2 minutes).

2. **Check results**. Output is saved to:

   ```
   data/CSV/qwen_local_annotated_POI_test.csv
   ```

3. **Full run**:

   ```python
   TEST_MODE = False
   MAX_ROWS = 20000  # or None for all rows
   ```

   Run Cell 20 on the full dataset (~50-100 hours for 10k samples on an A100; see Performance Expectations below).
### Performance Expectations

On an NVIDIA A100 80GB:

- **Speed**: 5-10 tokens/second
- **Throughput**: 100-200 samples/hour (depends on prompt length)
- **Memory**: ~22-25GB VRAM during inference
- **Time for 10k samples**: ~50-100 hours (can run overnight or over a weekend)
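The time estimate follows directly from the throughput figures:

```python
def eta_hours(n_samples, samples_per_hour):
    """Hours needed to annotate n_samples at a given throughput."""
    return n_samples / samples_per_hour

# At the observed 100-200 samples/hour, 10,000 samples take:
print(eta_hours(10_000, 200))  # 50.0 hours (best case)
print(eta_hours(10_000, 100))  # 100.0 hours (worst case)
```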
### Monitoring

The cell shows progress updates:

```
Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
✅ Saved after 10 rows (~240.0 samples/hour)
✅ Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
Total time: 2.5 minutes
Average speed: 240.0 samples/hour
```
## Troubleshooting

### Model Not Found

```bash
# Check if the model is installed
ollama list

# If it is not listed, pull it
ollama pull qwen2.5:32b-instruct
```

### Ollama Server Not Running

```bash
# Check whether Ollama is running
ps aux | grep ollama

# If not, start it
ollama serve
```
### GPU Not Detected

```bash
# Check the NVIDIA GPU
nvidia-smi

# Check CUDA
nvcc --version

# Ollama should detect the GPU automatically.
# If not, check the Ollama logs:
journalctl -u ollama
```
### Out of Memory (OOM)

If you hit OOM errors:

1. **Check VRAM usage**:

   ```bash
   watch -n 1 nvidia-smi
   ```

2. **Try a smaller batch size** (not applicable here - we process one sample at a time)

3. **Try a quantized version** (smaller memory footprint):

   ```bash
   # 4-bit quantized version (~12GB VRAM)
   ollama pull qwen2.5:32b-instruct-q4_0
   ```

   Then update the model name in the notebook:

   ```python
   MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
   ```
### Slow Inference

If inference is very slow (<1 token/sec):

1. **Check GPU utilization**:

   ```bash
   nvidia-smi
   ```

   The GPU should show ~90%+ utilization during inference.

2. **Check whether the model is running on the CPU instead of the GPU**:

   ```bash
   # Shows loaded models and whether they run on GPU or CPU
   ollama ps

   # If the model is on the CPU, check the server logs for GPU errors
   journalctl -u ollama
   ```
## Model Variants

Ollama provides several Qwen-2.5 variants:

| Model | Size | VRAM | Speed | Quality |
|-------|------|------|-------|---------|
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very fast | OK |

For your A100 80GB, **`qwen2.5:32b-instruct`** is recommended (best quality, no VRAM issues).
## Custom Model Cache Location

To store models in the `data/models/` directory:

```bash
# Set the environment variable
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"

# Add it to ~/.bashrc for persistence
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc

# Pull the model (it will download to data/models/)
ollama pull qwen2.5:32b-instruct

# Verify
ls -lh $OLLAMA_MODELS/
```

Note: it is the server process, not the CLI client, that stores models, so `OLLAMA_MODELS` must be set in the environment of `ollama serve` (e.g., restart the server from a shell where the variable is exported, or set it in the systemd unit).
## Comparing Results

After running both the API and local versions, compare the results:

```python
import pandas as pd

# Load results
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')

# Compare professions
print("API professions:", qwen_api['profession_llm'].value_counts().head())
print("Local professions:", qwen_local['profession_llm'].value_counts().head())

# Check agreement (assumes both files cover the same rows in the same order)
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
print(f"Agreement rate: {agreement*100:.1f}%")
```
## Cost Comparison (10,000 samples)

| Method | Cost | Time | Privacy |
|--------|------|------|---------|
| **Qwen Local (A100)** | **$0** | ~50-100 hours | ✅ Full |
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
| Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
| Deepseek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to Deepseek |

**Recommendation**:

- For **small tests** (<100 samples): use the API (faster)
- For **large datasets** (>1000 samples): run locally (free, private)
- For **research papers**: run locally to avoid data-privacy concerns
## Advanced: Parallel Processing

For faster processing on a multi-GPU setup:

```python
# Not implemented yet, but possible with:
# - Multiple Ollama instances, one per GPU
# - Ray or Dask for parallel processing
# - ~4x speedup with 4 GPUs
```
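A minimal sketch of the multi-instance idea, assuming one Ollama server per GPU on hypothetical ports 11434 and 11435; `annotate_on` stands in for whatever single-request function you already use, pointed at a specific server:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: one Ollama instance per GPU, each on its own port
HOSTS = ["http://localhost:11434", "http://localhost:11435"]

def annotate_parallel(prompts, annotate_on, hosts=HOSTS):
    """Distribute prompts round-robin across hosts, with as many
    worker threads as there are hosts.

    `annotate_on(host, prompt)` performs one annotation request
    against the given server.
    """
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        futures = [
            pool.submit(annotate_on, hosts[i % len(hosts)], prompt)
            for i, prompt in enumerate(prompts)
        ]
        return [f.result() for f in futures]  # results keep input order
```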
## Summary

- ✅ **Ollama** already installed
- ✅ **A100 80GB** GPU - perfect for Qwen-2.5-32B
- ✅ **Free inference** - no API costs
- ✅ **Privacy** - data stays local

**Next steps:**

1. Pull the model: `ollama pull qwen2.5:32b-instruct`
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
3. Run the full dataset: `TEST_MODE = False`

**Estimated time for 10,000 samples**: ~50-100 hours
**Cost**: $0

Good luck!