# Open Source LLM Configuration Guide (HuggingFace & Ollama)
## Overview
The Recipe Recommendation Bot supports open source models through both HuggingFace and Ollama. This guide explains how to configure these providers for optimal performance, with recommended models under 20B parameters.
> 📚 **For comprehensive model comparisons including closed source options (OpenAI, Google), see [Comprehensive Model Guide](./comprehensive-model-guide.md)**
## Quick Model Recommendations
| Use Case | Model | Download Size | RAM Required | Quality |
|----------|-------|---------------|--------------|---------|
| **Development** | `gemma2:2b` | 1.6GB | 4GB | Good |
| **Production** | `llama3.1:8b` | 4.7GB | 8GB | Excellent |
| **High Quality** | `qwen2.5:14b` | 9.0GB | 16GB | Outstanding |
| **API (Free)** | `deepseek-ai/DeepSeek-V3.1` | 0GB | N/A | Very Good |
## 🤗 HuggingFace Configuration
### Environment Variables
Add these variables to your `.env` file:
```bash
# LLM Provider Configuration
LLM_PROVIDER=huggingface
# HuggingFace Configuration
HUGGINGFACE_API_TOKEN=your_hf_token_here # Optional for public models
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1 # Current recommended model
HUGGINGFACE_API_URL=https://api-inference.huggingface.co/models/
HUGGINGFACE_USE_API=true # Use API vs local inference
HUGGINGFACE_USE_GPU=false # Set to true for local GPU inference
# Embedding Configuration
HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
```
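The backend presumably reads these variables at startup. As a minimal sketch of such a loader (the function name is hypothetical; defaults mirror the block above, and the boolean flags arrive as strings and must be parsed):

```python
import os

def load_hf_config() -> dict:
    """Read the HuggingFace settings from the environment, falling back
    to the defaults this guide recommends. Illustrative helper only."""
    truthy = ("true", "1", "yes")
    return {
        "api_token": os.environ.get("HUGGINGFACE_API_TOKEN", ""),
        "model": os.environ.get("HUGGINGFACE_MODEL", "deepseek-ai/DeepSeek-V3.1"),
        "api_url": os.environ.get(
            "HUGGINGFACE_API_URL", "https://api-inference.huggingface.co/models/"
        ),
        # Env vars are strings, so "true"/"1"/"yes" all count as enabled.
        "use_api": os.environ.get("HUGGINGFACE_USE_API", "true").lower() in truthy,
        "use_gpu": os.environ.get("HUGGINGFACE_USE_GPU", "false").lower() in truthy,
    }

config = load_hf_config()
print(config["model"], config["use_api"])
```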
### Deployment Options
#### Option 1: API Inference (Recommended)
```bash
HUGGINGFACE_USE_API=true
```
- **Pros**: No local downloads, fast startup, always latest models
- **Cons**: Requires internet connection, API rate limits
- **Download Size**: 0 bytes (no local storage needed)
- **Best for**: Development, testing, quick prototyping
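In API mode the provider sends HTTP requests against `HUGGINGFACE_API_URL` plus the model ID, with the token (if any) in an `Authorization: Bearer` header. A sketch that only builds such a request without sending it (the helper name and example model are illustrative):

```python
import json
import os

def build_inference_request(model: str, prompt: str) -> tuple:
    """Assemble (url, headers, body) for a hosted-inference call,
    following the HUGGINGFACE_API_URL convention above."""
    base = os.environ.get(
        "HUGGINGFACE_API_URL", "https://api-inference.huggingface.co/models/"
    )
    url = base.rstrip("/") + "/" + model
    headers = {"Content-Type": "application/json"}
    token = os.environ.get("HUGGINGFACE_API_TOKEN")
    if token:  # the token is optional for public models
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"inputs": prompt})
    return url, headers, body

url, headers, body = build_inference_request(
    "google/flan-t5-small", "Suggest a quick dinner using chicken and rice."
)
print(url)
```

The exact response shape varies by model and task, so inspect the JSON the endpoint returns before wiring it into the bot.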
#### Option 2: Local Inference
```bash
HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=false # CPU-only
```
- **Pros**: No internet required, no rate limits, private
- **Cons**: Large model downloads, slower inference on CPU
- **Best for**: Production, offline deployments
#### Option 3: Local GPU Inference
```bash
HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=true # Requires CUDA GPU
```
- **Pros**: Fast inference, no internet required, no rate limits
- **Cons**: Large downloads, requires GPU with sufficient VRAM
- **Best for**: Production with GPU resources
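The three options reduce to two boolean flags, with API mode taking precedence. A sketch of how a service might resolve them into a single mode (function name hypothetical):

```python
import os

def select_inference_mode() -> str:
    """Map HUGGINGFACE_USE_API / HUGGINGFACE_USE_GPU to one of
    'api', 'local-gpu', or 'local-cpu'. Illustrative only."""
    truthy = ("true", "1", "yes")
    use_api = os.environ.get("HUGGINGFACE_USE_API", "true").lower() in truthy
    use_gpu = os.environ.get("HUGGINGFACE_USE_GPU", "false").lower() in truthy
    if use_api:
        return "api"        # Option 1: hosted inference, no downloads
    if use_gpu:
        return "local-gpu"  # Option 3: requires a CUDA-capable GPU
    return "local-cpu"      # Option 2: CPU-only local inference

print(select_inference_mode())
```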
### Recommended HuggingFace Models
#### Lightweight Models (Good for CPU)
```bash
HUGGINGFACE_MODEL=microsoft/DialoGPT-small # ~117MB download
HUGGINGFACE_MODEL=distilgpt2 # ~319MB download
HUGGINGFACE_MODEL=google/flan-t5-small # ~242MB download
```
#### Balanced Performance Models
```bash
HUGGINGFACE_MODEL=microsoft/DialoGPT-medium # ~863MB download
HUGGINGFACE_MODEL=google/flan-t5-base # ~990MB download
HUGGINGFACE_MODEL=microsoft/CodeGPT-small-py # ~510MB download
```
#### High Quality Models (GPU Recommended)
```bash
HUGGINGFACE_MODEL=mistralai/Mistral-7B-Instruct-v0.2 # ~14GB download (7B params, fp16)
HUGGINGFACE_MODEL=microsoft/DialoGPT-large # ~3.2GB download
HUGGINGFACE_MODEL=google/flan-t5-large # ~2.8GB download (770M params)
```
> **Note:** `deepseek-ai/DeepSeek-V3.1` is a 671B-parameter MoE model and is far too large to download and run locally; use it through the API option instead.
#### Specialized Recipe/Cooking Models
Community fine-tunes for cooking appear and disappear frequently; treat the IDs below as placeholders and verify a model actually exists on the Hub before configuring it:
```bash
HUGGINGFACE_MODEL=recipe-nlg/recipe-nlg-base # verify availability on the Hub first
HUGGINGFACE_MODEL=cooking-assistant/chef-gpt # verify availability on the Hub first
```
## 🦙 Ollama Configuration
### Installation
First, install Ollama on your system:
```bash
# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download the installer from https://ollama.com/download
```
### Environment Variables
```bash
# LLM Provider Configuration
LLM_PROVIDER=ollama
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b
OLLAMA_TEMPERATURE=0.7
# Embedding Configuration
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
```
### Starting Ollama Service
```bash
# Start Ollama server
ollama serve
# In another terminal, pull your desired model
ollama pull llama3.1:8b
```
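Once the server is up, the bot talks to it over Ollama's REST API. A minimal stdlib sketch of a single non-streaming call to the `/api/generate` endpoint, using the env names configured above (error handling omitted; the helper names are illustrative):

```python
import json
import os
import urllib.request

def build_generate_payload(prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint
    from the OLLAMA_* variables above."""
    return {
        "model": os.environ.get("OLLAMA_MODEL", "llama3.1:8b"),
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"temperature": float(os.environ.get("OLLAMA_TEMPERATURE", "0.7"))},
    }

def ollama_generate(prompt: str) -> str:
    """Send one completion request to a running local Ollama server."""
    base = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")
    req = urllib.request.Request(
        base + "/api/generate",
        data=json.dumps(build_generate_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running, try: print(ollama_generate("Name one pantry staple."))
```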
### Recommended Ollama Models
#### Lightweight Models (4GB RAM or less)
```bash
OLLAMA_MODEL=phi3:mini # ~2.3GB download (3.8B params)
OLLAMA_MODEL=gemma2:2b # ~1.6GB download (2B params)
OLLAMA_MODEL=qwen2:1.5b # ~934MB download (1.5B params)
```
#### Balanced Performance Models (8GB RAM)
```bash
OLLAMA_MODEL=llama3.1:8b # ~4.7GB download (8B params)
OLLAMA_MODEL=gemma2:9b # ~5.4GB download (9B params)
OLLAMA_MODEL=mistral:7b # ~4.1GB download (7B params)
OLLAMA_MODEL=qwen2:7b # ~4.4GB download (7B params)
```
#### High Quality Models (16GB+ RAM)
```bash
OLLAMA_MODEL=qwen2.5:14b # ~9.0GB download (14B params)
OLLAMA_MODEL=mixtral:8x7b # ~26GB download (47B params - sparse MoE)
```
> **Note:** Llama 3.1 ships only in 8B, 70B, and 405B sizes; there is no `llama3.1:13b` tag.
#### Code/Instruction Following Models
```bash
OLLAMA_MODEL=codellama:7b # ~3.8GB download (7B params)
OLLAMA_MODEL=deepseek-coder:6.7b # ~3.8GB download (6.7B params)
OLLAMA_MODEL=qwen2.5-coder:7b # ~4.7GB download (7B params)
```
### Ollama Model Management
```bash
# List available models
ollama list
# Pull a specific model
ollama pull llama3.1:8b
# Remove a model to free space
ollama rm old-model:tag
# Check model information
ollama show llama3.1:8b
```
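If the bot needs to check that a model is present before using it, one approach is to parse the output of `ollama list` (assumed here to be a header row followed by one line per model, name in the first column; the sample listing is illustrative):

```python
import subprocess

def parse_model_names(listing: str) -> list:
    """Extract model names (first column) from `ollama list` output,
    skipping the header row and blank lines."""
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    return [ln.split()[0] for ln in lines[1:]]  # drop the NAME/ID/SIZE header

def installed_models() -> list:
    """Run `ollama list` and return installed model names (needs Ollama on PATH)."""
    out = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
    return parse_model_names(out.stdout)

# Illustrative sample of the listing format:
sample = """NAME            ID              SIZE    MODIFIED
llama3.1:8b     aaaaaaaaaaaa    4.7 GB  2 days ago
gemma2:2b       bbbbbbbbbbbb    1.6 GB  5 days ago
"""
print(parse_model_names(sample))  # → ['llama3.1:8b', 'gemma2:2b']
```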
## Installation Requirements
### HuggingFace Setup
#### For API Usage (No Downloads)
```bash
pip install -r requirements.txt
# No additional setup needed
```
#### For Local CPU Inference
```bash
pip install -r requirements.txt
# Models will be downloaded automatically on first use
```
#### For Local GPU Inference
```bash
# Install CUDA version of PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install other requirements
pip install -r requirements.txt
# Verify GPU availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
### Ollama Setup
#### Installation
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
ollama serve
# Pull your first model (in another terminal)
ollama pull llama3.1:8b
```
## Storage Requirements & Download Sizes
### HuggingFace Local Models
- **Storage Location**: `~/.cache/huggingface/hub/`
- **Small Models**: 100MB - 1GB (good for development)
- **Medium Models**: 1GB - 5GB (balanced performance)
- **Large Models**: 5GB - 15GB (high quality, under 20B params)
### Ollama Models
- **Storage Location**: `~/.ollama/models/`
- **Quantized Storage**: Models use efficient quantization (4-bit, 8-bit)
- **2B Models**: ~1-2GB download
- **7-8B Models**: ~4-5GB download
- **13-14B Models**: ~7-9GB download
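The size tiers above lend themselves to a simple picker. A sketch that maps available system RAM to the tiers this guide uses (thresholds taken from the tables here; the function name is hypothetical):

```python
def recommend_tier(ram_gb: float) -> str:
    """Map available system RAM to the model tiers used in this guide."""
    if ram_gb >= 16:
        return "high-quality"  # 13-14B class, ~7-9GB download
    if ram_gb >= 8:
        return "balanced"      # 7-9B class, ~4-5GB download
    if ram_gb >= 4:
        return "lightweight"   # 2-4B class, ~1-2GB download
    return "api"               # too little RAM for local inference; use hosted API

print(recommend_tier(8))  # → balanced
```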
### Embedding Models
```bash
# HuggingFace Embeddings (auto-downloaded)
sentence-transformers/all-MiniLM-L6-v2 # ~80MB
sentence-transformers/all-mpnet-base-v2 # ~420MB
# Ollama Embeddings
ollama pull nomic-embed-text # ~274MB
ollama pull mxbai-embed-large # ~669MB
```
## Performance & Hardware Recommendations
### System Requirements
#### Minimum (API Usage)
- **RAM**: 2GB
- **Storage**: 100MB
- **Internet**: Required for API calls
#### CPU Inference
- **RAM**: 8GB+ (16GB for larger models)
- **CPU**: 4+ cores recommended
- **Storage**: 5GB+ for the model cache
#### GPU Inference
- **GPU**: 8GB+ VRAM (for 7B models)
- **RAM**: 16GB+ system RAM
- **Storage**: 10GB+ for models
### Performance Tips
1. **Start Small**: Begin with lightweight models and upgrade based on quality needs
2. **Use API First**: Test with the HuggingFace API before committing to local inference
3. **Monitor Resources**: Check CPU/GPU/RAM usage during inference
4. **Model Caching**: The first run downloads models; subsequent runs are faster
## Troubleshooting
### HuggingFace Issues
#### "accelerate package required"
```bash
pip install accelerate
```
#### GPU not detected
```bash
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
# If it prints False, install the CUDA build of PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
#### Out of memory errors
- Switch to a smaller model
- Set `HUGGINGFACE_USE_GPU=false` for CPU inference
- Use the API instead: `HUGGINGFACE_USE_API=true`
### Ollama Issues
#### Ollama service not starting
```bash
# Check if port 11434 is already in use
lsof -i :11434
# Restart Ollama
ollama serve
```
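The same check can be done programmatically; a stdlib sketch that tests whether anything is accepting connections on Ollama's default port (host and port match the `OLLAMA_BASE_URL` default above):

```python
import socket

def port_open(host: str = "localhost", port: int = 11434, timeout: float = 1.0) -> bool:
    """Return True if something is accepting connections on host:port,
    i.e. the Ollama server is likely up."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open())  # False if `ollama serve` is not running
```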
#### Model not found
```bash
# List available models
ollama list
# Pull the model
ollama pull llama3.1:8b
```
#### Slow inference
- Try a smaller model
- Check available RAM
- Consider using GPU if available
## Quick Tests
### Test HuggingFace Configuration
```bash
cd backend
python -c "
import os
os.environ['LLM_PROVIDER'] = 'huggingface'
from services.llm_service import LLMService
service = LLMService()
response = service.simple_chat_completion('Hello')
print(f'Response: {response}')
print('✅ HuggingFace LLM working!')
"
```
### Test Ollama Configuration
```bash
# First ensure Ollama is running
ollama serve &
# Test the service
cd backend
python -c "
import os
os.environ['LLM_PROVIDER'] = 'ollama'
from services.llm_service import LLMService
service = LLMService()
response = service.simple_chat_completion('Hello')
print(f'Response: {response}')
print('✅ Ollama LLM working!')
"
```
## Configuration Examples
### Development Setup (Fast Start)
```bash
# Use the HuggingFace API for quick testing
LLM_PROVIDER=huggingface
HUGGINGFACE_USE_API=true
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1
HUGGINGFACE_API_TOKEN=your_token_here
```
### Local CPU Setup
```bash
# Local inference on CPU
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.1:8b
OLLAMA_BASE_URL=http://localhost:11434
```
### Local GPU Setup
```bash
# Local inference with GPU acceleration
LLM_PROVIDER=huggingface
HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=true
HUGGINGFACE_MODEL=mistralai/Mistral-7B-Instruct-v0.2
```
### Production Setup (High Performance)
```bash
# Ollama with a higher-quality model
LLM_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:14b # Higher quality tier
OLLAMA_BASE_URL=http://localhost:11434
# Ensure 16GB+ RAM is available
```