# Running Qwen-2.5-32B Locally with Ollama

This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.

## Why Run Locally?

✅ **FREE** - No API costs ($0 per query)
✅ **FAST** - Local inference on A100 (5-10 tokens/sec)
✅ **PRIVATE** - Data never leaves your machine
✅ **OFFLINE** - Works without internet (after model download)
✅ **HIGH QUALITY** - 32B parameter model with strong multilingual support

## System Requirements

### Minimum Specs

- **GPU**: NVIDIA A100 80GB (or similar high-end GPU)
- **VRAM**: 22-25GB during inference
- **RAM**: 32GB system RAM (you have 265GB - more than enough!)
- **Storage**: ~20GB for the model download
- **OS**: Linux (you're on Ubuntu)

### Your Setup

✅ NVIDIA A100 80GB - Perfect for Qwen-2.5-32B
✅ 265GB RAM - Excellent
✅ Linux (Ubuntu) - Supported
✅ Ollama already installed at `/usr/local/bin/ollama`

## Installation Steps

### 1. Verify Ollama Installation

```bash
# Check if Ollama is installed
which ollama
# Should output: /usr/local/bin/ollama

# Check the Ollama version
ollama --version
```

If Ollama is not installed, install it with:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

### 2. Pull the Qwen-2.5-32B-Instruct Model

```bash
# This will download ~20GB
ollama pull qwen2.5:32b-instruct

# Alternative: use the base model (not instruct-tuned)
# ollama pull qwen2.5:32b
```

**Download time**: ~10-30 minutes, depending on your internet speed.

**Model cache location**: By default, models are cached at:

- Linux: `~/.ollama/models/`

To use a custom cache location (e.g., `data/models/`):

```bash
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
ollama pull qwen2.5:32b-instruct
```

### 3. Verify the Model Is Ready

```bash
# List all installed models
ollama list

# Test the model
ollama run qwen2.5:32b-instruct "Hello, who are you?"
```

You should see a response from Qwen!

### 4. Start the Ollama Server (if needed)

Ollama runs as a background service by default.
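Before starting the server by hand, you can probe whether it is already listening. Below is a minimal stdlib-only Python sketch (an assumption of this guide, not notebook code): the root endpoint of a running Ollama server answers with HTTP 200.

```python
# Check whether a local Ollama server is reachable.
# Uses only the Python standard library; the default Ollama port is 11434.
import urllib.request
import urllib.error


def is_ollama_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused / timeout / DNS failure -> server not reachable
        return False


if __name__ == "__main__":
    print("Ollama up:", is_ollama_running())
```

If this returns `False`, start the server manually as shown next.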
If you need to start it manually:

```bash
# Start the Ollama server
ollama serve

# Or run it in the background
nohup ollama serve > /dev/null 2>&1 &
```

## Using Qwen-2.5-32B in the Notebook

### Cell 20: Qwen-2.5-32B Local Annotation

The notebook cell handles everything automatically:

1. **Checks the Ollama installation**
2. **Verifies model availability**
3. **Runs inference locally**
4. **Saves progress every 10 rows**

### Configuration

```python
# In Cell 20
TEST_MODE = True        # Start with a small test
TEST_SIZE = 10          # Test on 10 samples first
MAX_ROWS = 20000        # Full dataset size
SAVE_INTERVAL = 10      # Save every 10 rows
MODEL_NAME = "qwen2.5:32b-instruct"     # Model to use
OLLAMA_HOST = "http://localhost:11434"  # Default Ollama port
```

### Running the Pipeline

1. **Test run first** (recommended):

   ```python
   TEST_MODE = True
   TEST_SIZE = 10
   ```

   Run Cell 20 to test on 10 samples (~1-2 minutes).

2. **Check the results**:

   ```python
   # Output saved to: data/CSV/qwen_local_annotated_POI_test.csv
   ```

3. **Full run**:

   ```python
   TEST_MODE = False
   MAX_ROWS = 20000  # or None for all rows
   ```

   Run Cell 20 for the full dataset (~50-100 hours for 10k samples on A100).

### Performance Expectations

On an NVIDIA A100 80GB:

- **Speed**: 5-10 tokens/second
- **Throughput**: 100-200 samples/hour (depends on prompt length)
- **Memory**: ~22-25GB VRAM during inference
- **Time for 10k samples**: ~50-100 hours (can run overnight/over a weekend)

### Monitoring

The cell shows progress updates:

```
Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
✅ Saved after 10 rows (~240.0 samples/hour)
✅ Done!
Results: data/CSV/qwen_local_annotated_POI_test.csv
Total time: 2.5 minutes
Average speed: 240.0 samples/hour
```

## Troubleshooting

### Model Not Found

```bash
# Check if the model is installed
ollama list

# If it is not listed, pull it
ollama pull qwen2.5:32b-instruct
```

### Ollama Server Not Running

```bash
# Check if Ollama is running
ps aux | grep ollama

# If it is not running, start it
ollama serve
```

### GPU Not Detected

```bash
# Check the NVIDIA GPU
nvidia-smi

# Check CUDA
nvcc --version

# Ollama should detect the GPU automatically.
# If not, check the Ollama logs:
journalctl -u ollama
```

### Out of Memory (OOM)

If you get OOM errors:

1. **Check VRAM usage**:

   ```bash
   watch -n 1 nvidia-smi
   ```

2. **Try a smaller batch size** (not applicable here - we process one sample at a time)

3. **Try a quantized version** (smaller model):

   ```bash
   # 4-bit quantized version (~12GB VRAM)
   ollama pull qwen2.5:32b-instruct-q4_0

   # Then update MODEL_NAME in the notebook:
   # MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
   ```

### Slow Inference

If inference is very slow (<1 token/sec):

1. **Check GPU utilization**:

   ```bash
   nvidia-smi
   ```

   The GPU should show ~90%+ utilization during inference.

2. **Check CPU vs GPU**: Ollama might be running on the CPU instead of the GPU.

   ```bash
   # Check where the loaded model is running (GPU vs CPU)
   ollama ps

   # Pin the server to the first GPU
   CUDA_VISIBLE_DEVICES=0 ollama serve
   ```

## Model Variants

Ollama provides several Qwen-2.5 variants:

| Model | Size | VRAM | Speed | Quality |
|-------|------|------|-------|---------|
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very fast | OK |

For your A100 80GB, **`qwen2.5:32b-instruct`** is recommended (best quality, no VRAM issues).
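Independent of the notebook, any of these variants can be exercised directly over Ollama's REST API. Below is a minimal Python sketch (an illustration, not the actual Cell 20 code) that sends one prompt to `/api/generate` with streaming disabled and derives a tokens/sec figure from the response counters; `MODEL_NAME` and `OLLAMA_HOST` mirror the notebook's configuration, and `eval_duration` is reported by Ollama in nanoseconds.

```python
# One-shot request against the local Ollama HTTP API (/api/generate).
# Requires a running Ollama server with the model already pulled.
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "qwen2.5:32b-instruct"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval counters (tokens, nanoseconds) into tokens/sec."""
    return eval_count / (eval_duration_ns / 1e9)


def generate(prompt: str) -> dict:
    """Send one non-streaming generation request and return the parsed JSON."""
    payload = json.dumps(
        {"model": MODEL_NAME, "prompt": prompt, "stream": False}
    ).encode()
    req = urllib.request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    out = generate("Hello, who are you?")
    print(out["response"])
    speed = tokens_per_second(out["eval_count"], out["eval_duration"])
    print(f"{speed:.1f} tokens/sec")
```

The tokens/sec figure printed here is the same number to watch when diagnosing the "Slow Inference" case above.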
## Custom Model Cache Location

To store models in the `data/models/` directory:

```bash
# Set the environment variable
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"

# Add it to ~/.bashrc for persistence
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc

# Pull the model (it will download to data/models/)
ollama pull qwen2.5:32b-instruct

# Verify
ls -lh $OLLAMA_MODELS/
```

## Comparing Results

After running both the API and local versions, compare the results:

```python
import pandas as pd

# Load results
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')

# Compare professions
print("API professions:", qwen_api['profession_llm'].value_counts().head())
print("Local professions:", qwen_local['profession_llm'].value_counts().head())

# Check agreement
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
print(f"Agreement rate: {agreement*100:.1f}%")
```

## Cost Comparison (10,000 samples)

| Method | Cost | Time | Privacy |
|--------|------|------|---------|
| **Qwen Local (A100)** | **$0** | ~50-100 hours | ✅ Full |
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
| Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
| Deepseek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to Deepseek |

**Recommendation**:

- For **small tests** (<100 samples): use the API (faster)
- For **large datasets** (>1000 samples): run locally (free, private)
- For **research papers**: run locally to avoid data privacy concerns

## Advanced: Parallel Processing

For faster processing on a multi-GPU setup:

```python
# Not implemented yet, but possible with:
# - Multiple Ollama instances on different GPUs
# - Ray or Dask for parallel processing
# - ~4x speedup with 4 GPUs
```

## Summary

✅ **Ollama** already installed
✅ **A100 80GB** GPU - perfect for Qwen-2.5-32B
✅ **Free inference** - no API costs
✅ **Privacy** - data stays local

**Next steps:**

1. Pull the model: `ollama pull qwen2.5:32b-instruct`
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
3. Run the full dataset: `TEST_MODE = False`

**Estimated time for 10,000 samples**: ~50-100 hours
**Cost**: $0

Good luck! 🚀