Jesse Johnson
New commit for backend deployment: 2025-09-25_13-24-03
c59d808

Open Source LLM Configuration Guide (HuggingFace & Ollama)

Overview

The Recipe Recommendation Bot supports open source models through both HuggingFace and Ollama. This guide explains how to configure these providers for optimal performance, with recommended models under 20B parameters.

📚 For comprehensive model comparisons including closed source options (OpenAI, Google), see the Comprehensive Model Guide.

Quick Model Recommendations

Use Case       Model                       Download Size   RAM Required   Quality
Development    gemma2:2b                   1.6GB           4GB            Good
Production     llama3.1:8b                 4.7GB           8GB            Excellent
High Quality   llama2:13b                  7.4GB           16GB           Outstanding
API (Free)     deepseek-ai/DeepSeek-V3.1   0GB (API)       N/A            Very Good

🤗 HuggingFace Configuration

Environment Variables

Add these variables to your .env file:

# LLM Provider Configuration
LLM_PROVIDER=huggingface

# HuggingFace Configuration
HUGGINGFACE_API_TOKEN=your_hf_token_here        # Optional for public models
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1    # Current recommended model
HUGGINGFACE_API_URL=https://api-inference.huggingface.co/models/
HUGGINGFACE_USE_API=true                        # Use API vs local inference
HUGGINGFACE_USE_GPU=false                       # Set to true for local GPU inference

# Embedding Configuration
HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
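At startup, the backend can fold these variables into a single typed settings object so a missing or misspelled value surfaces immediately rather than at the first request. A minimal sketch, assuming nothing about the real backend (the `HFConfig` dataclass and `load_hf_config` helper are illustrative, not existing code):

```python
import os
from dataclasses import dataclass


@dataclass
class HFConfig:
    """Illustrative container for the HuggingFace settings above."""
    api_token: str
    model: str
    api_url: str
    use_api: bool
    use_gpu: bool
    embedding_model: str


def load_hf_config(env=os.environ) -> HFConfig:
    """Read the HUGGINGFACE_* variables, defaulting to the values in this guide."""
    def truthy(v):
        return str(v).strip().lower() in ("1", "true", "yes")

    return HFConfig(
        api_token=env.get("HUGGINGFACE_API_TOKEN", ""),
        model=env.get("HUGGINGFACE_MODEL", "deepseek-ai/DeepSeek-V3.1"),
        api_url=env.get("HUGGINGFACE_API_URL",
                        "https://api-inference.huggingface.co/models/"),
        use_api=truthy(env.get("HUGGINGFACE_USE_API", "true")),
        use_gpu=truthy(env.get("HUGGINGFACE_USE_GPU", "false")),
        embedding_model=env.get("HUGGINGFACE_EMBEDDING_MODEL",
                                "sentence-transformers/all-MiniLM-L6-v2"),
    )
```

Accepting a plain dict in place of `os.environ` keeps the loader easy to exercise without touching the real environment.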

Deployment Options

Option 1: API Inference (Recommended)

HUGGINGFACE_USE_API=true
  • Pros: No local downloads, fast startup, access to up-to-date hosted models
  • Cons: Requires internet connection, API rate limits
  • Download Size: 0 bytes (no local storage needed)
  • Best for: Development, testing, quick prototyping
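Under this option, a chat request is a single HTTPS POST to the Inference API. A sketch using only the standard library (the `build_request` helper is illustrative; the model and token come from the variables above, and the network call requires an internet connection):

```python
import json
import os
import urllib.request

API_URL = "https://api-inference.huggingface.co/models/"


def build_request(model: str, prompt: str, token: str = ""):
    """Assemble the URL, headers, and JSON body for an Inference API call."""
    headers = {"Content-Type": "application/json"}
    if token:  # public models work without a token, but rate limits are tighter
        headers["Authorization"] = f"Bearer {token}"
    return API_URL + model, headers, {"inputs": prompt}


if __name__ == "__main__":
    url, headers, payload = build_request(
        os.environ.get("HUGGINGFACE_MODEL", "distilgpt2"),
        "Suggest a quick pasta recipe.",
        os.environ.get("HUGGINGFACE_API_TOKEN", ""),
    )
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers=headers)
    with urllib.request.urlopen(req) as resp:  # network call
        print(resp.read().decode())
```

Keeping request construction separate from the network call means the interesting logic can be checked without hitting the API.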

Option 2: Local Inference

HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=false  # CPU-only
  • Pros: No internet required, no rate limits, private
  • Cons: Large model downloads, slower inference on CPU
  • Best for: Production, offline deployments

Option 3: Local GPU Inference

HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=true   # Requires CUDA GPU
  • Pros: Fast inference, no internet required, no rate limits
  • Cons: Large downloads, requires GPU with sufficient VRAM
  • Best for: Production with GPU resources

Recommended HuggingFace Models

Lightweight Models (Good for CPU)

HUGGINGFACE_MODEL=microsoft/DialoGPT-small       # ~117MB download
HUGGINGFACE_MODEL=distilgpt2                     # ~319MB download
HUGGINGFACE_MODEL=google/flan-t5-small           # ~242MB download

Balanced Performance Models

HUGGINGFACE_MODEL=microsoft/DialoGPT-medium      # ~863MB download
HUGGINGFACE_MODEL=google/flan-t5-base            # ~990MB download
HUGGINGFACE_MODEL=microsoft/CodeGPT-small-py     # ~510MB download

High Quality Models (GPU Recommended)

HUGGINGFACE_MODEL=microsoft/DialoGPT-large       # ~3.2GB download
HUGGINGFACE_MODEL=google/flan-t5-large           # ~2.8GB download (770M params)
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1      # large mixture-of-experts model; practical only via the API (HUGGINGFACE_USE_API=true)
HUGGINGFACE_MODEL=huggingface/CodeBERTa-small-v1 # encoder model; suited to code understanding, not chat generation

Specialized Recipe/Cooking Models

HUGGINGFACE_MODEL=recipe-nlg/recipe-nlg-base     # community model; verify it exists on the Hub before use
HUGGINGFACE_MODEL=cooking-assistant/chef-gpt     # community model; verify it exists on the Hub before use

🦙 Ollama Configuration

Installation

First, install Ollama on your system:

# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh

# Windows
# Download installer from https://ollama.ai/download

Environment Variables

# LLM Provider Configuration
LLM_PROVIDER=ollama

# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b
OLLAMA_TEMPERATURE=0.7

# Embedding Configuration
OLLAMA_EMBEDDING_MODEL=nomic-embed-text

Starting Ollama Service

# Start Ollama server
ollama serve

# In another terminal, pull your desired model
ollama pull llama3.1:8b
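With the server up, the bot talks to Ollama over its local REST API; `/api/generate` takes a JSON body with the model name, the prompt, and sampling options. A sketch using only the standard library (`generate_payload` is an illustrative helper; model and temperature mirror the variables above, and the `__main__` call requires `ollama serve` to be running):

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # matches OLLAMA_BASE_URL above


def generate_payload(model: str, prompt: str, temperature: float = 0.7) -> dict:
    """Body for Ollama's /api/generate endpoint (streaming disabled
    so the server returns one complete reply)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }


if __name__ == "__main__":
    body = json.dumps(generate_payload("llama3.1:8b",
                                       "Name one pantry staple.")).encode()
    req = urllib.request.Request(BASE_URL + "/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:  # requires a running Ollama server
        print(json.loads(resp.read())["response"])
```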

Recommended Ollama Models

Lightweight Models (4GB RAM or less)

OLLAMA_MODEL=phi3:mini                  # ~2.3GB download (3.8B params)
OLLAMA_MODEL=gemma2:2b                  # ~1.6GB download (2B params)
OLLAMA_MODEL=qwen2:1.5b                 # ~934MB download (1.5B params)

Balanced Performance Models (8GB RAM)

OLLAMA_MODEL=llama3.1:8b               # ~4.7GB download (8B params)
OLLAMA_MODEL=gemma2:9b                 # ~5.4GB download (9B params)
OLLAMA_MODEL=mistral:7b                # ~4.1GB download (7B params)
OLLAMA_MODEL=qwen2:7b                  # ~4.4GB download (7B params)

High Quality Models (16GB+ RAM)

OLLAMA_MODEL=llama2:13b                # ~7.4GB download (13B params; Llama 3.1 ships only in 8B/70B/405B)
OLLAMA_MODEL=mixtral:8x7b              # ~26GB download (47B params - sparse)
OLLAMA_MODEL=qwen2:14b                 # ~8.2GB download (14B params)

Code/Instruction Following Models

OLLAMA_MODEL=codellama:7b              # ~3.8GB download (7B params)
OLLAMA_MODEL=deepseek-coder:6.7b       # ~3.8GB download (6.7B params)
OLLAMA_MODEL=wizardcoder:7b-python     # ~4GB download (7B params, Python-focused)

Ollama Model Management

# List available models
ollama list

# Pull a specific model
ollama pull llama3.1:8b

# Remove a model to free space
ollama rm old-model:tag

# Check model information
ollama show llama3.1:8b

Installation Requirements

HuggingFace Setup

For API Usage (No Downloads)

pip install -r requirements.txt
# No additional setup needed

For Local CPU Inference

pip install -r requirements.txt
# Models will be downloaded automatically on first use

For Local GPU Inference

# Install CUDA version of PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install other requirements
pip install -r requirements.txt

# Verify GPU availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

Ollama Setup

Installation

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service
ollama serve

# Pull your first model (in another terminal)
ollama pull llama3.1:8b

Storage Requirements & Download Sizes

HuggingFace Local Models

  • Storage Location: ~/.cache/huggingface/hub/ (older transformers versions used ~/.cache/huggingface/transformers/)
  • Small Models: 100MB - 1GB (good for development)
  • Medium Models: 1GB - 5GB (balanced performance)
  • Large Models: 5GB - 15GB (high quality, under 20B params)

Ollama Models

  • Storage Location: ~/.ollama/models/
  • Quantized Storage: Models use efficient quantization (4-bit, 8-bit)
  • 2B Models: ~1-2GB download
  • 7-8B Models: ~4-5GB download
  • 13-14B Models: ~7-8GB download
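These download figures follow a simple rule of thumb: size ≈ parameters × effective bits per weight ÷ 8. A rough estimator (the ~4.5 effective bits for Ollama's default 4-bit quantization is an approximation; real files vary with embedding layers and metadata):

```python
def approx_download_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Rough quantized download size in GB: params x bits / 8.

    4.5 effective bits approximates a default 4-bit (Q4) quantization,
    since quantization scales add a little over the nominal 4 bits.
    """
    return round(params_billions * bits_per_weight / 8, 1)
```

For example, an 8B model comes out near 4.5GB, close to the 4.7GB listed for llama3.1:8b above; treat the result as a planning estimate, not an exact figure.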

Embedding Models

# HuggingFace Embeddings (auto-downloaded)
sentence-transformers/all-MiniLM-L6-v2     # ~80MB
sentence-transformers/all-mpnet-base-v2    # ~420MB

# Ollama Embeddings
ollama pull nomic-embed-text               # ~274MB
ollama pull mxbai-embed-large              # ~669MB
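Whichever embedding model you pick, recipe and query vectors are typically compared with cosine similarity. A self-contained sketch of that comparison (illustrative; in practice the vectors would come from one of the embedding models above):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors.

    1.0 means identical direction, 0.0 means orthogonal (unrelated).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```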

Performance & Hardware Recommendations

System Requirements

Minimum (API Usage)

  • RAM: 2GB
  • Storage: 100MB
  • Internet: Required for API calls

CPU Inference

  • RAM: 8GB+ (16GB for larger models)
  • CPU: 4+ cores recommended
  • Storage: 5GB+ for models cache

GPU Inference

  • GPU: 8GB+ VRAM (for 7B models)
  • RAM: 16GB+ system RAM
  • Storage: 10GB+ for models

Performance Tips

  1. Start Small: Begin with lightweight models and upgrade based on quality needs
  2. Use API First: Test with HuggingFace API before committing to local inference
  3. Monitor Resources: Check CPU/GPU/RAM usage during inference
  4. Model Caching: First run downloads models, subsequent runs are faster

Troubleshooting

HuggingFace Issues

"accelerate package required"

pip install accelerate

GPU not detected

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# If false, install CUDA PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Out of memory errors

  • Switch to a smaller model
  • Set HUGGINGFACE_USE_GPU=false for CPU inference
  • Use API instead: HUGGINGFACE_USE_API=true

Ollama Issues

Ollama service not starting

# Check if port 11434 is available
lsof -i :11434

# Restart Ollama
ollama serve

Model not found

# List available models
ollama list

# Pull the model
ollama pull llama3.1:8b

Slow inference

  • Try a smaller model
  • Check available RAM
  • Consider using GPU if available

Quick Tests

Test HuggingFace Configuration

cd backend
python -c "
from services.llm_service import LLMService
import os
os.environ['LLM_PROVIDER'] = 'huggingface'
service = LLMService()
response = service.simple_chat_completion('Hello')
print(f'Response: {response}')
print('✅ HuggingFace LLM working!')
"

Test Ollama Configuration

# First ensure Ollama is running
ollama serve &

# Test the service
cd backend
python -c "
from services.llm_service import LLMService
import os
os.environ['LLM_PROVIDER'] = 'ollama'
service = LLMService()
response = service.simple_chat_completion('Hello')
print(f'Response: {response}')
print('✅ Ollama LLM working!')
"

Configuration Examples

Development Setup (Fast Start)

# Use HuggingFace API for quick testing
LLM_PROVIDER=huggingface
HUGGINGFACE_USE_API=true
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1
HUGGINGFACE_API_TOKEN=your_token_here

Local CPU Setup

# Local inference on CPU
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.1:8b
OLLAMA_BASE_URL=http://localhost:11434

Local GPU Setup

# Local inference with GPU acceleration
LLM_PROVIDER=huggingface
HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=true
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1

Production Setup (High Performance)

# Ollama with optimized model
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama2:13b  # Higher quality (13B params)
OLLAMA_BASE_URL=http://localhost:11434
# Ensure 16GB+ RAM available
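Whichever setup you choose, it can help to validate LLM_PROVIDER once at startup so a typo fails fast instead of surfacing mid-request. A minimal sketch (the `resolve_provider` helper and the open-source-only provider set are illustrative; the bot itself may accept additional providers):

```python
import os

SUPPORTED_PROVIDERS = {"huggingface", "ollama"}  # open source providers in this guide


def resolve_provider(env=os.environ) -> str:
    """Normalize and validate LLM_PROVIDER, raising early on misconfiguration."""
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"LLM_PROVIDER must be one of {sorted(SUPPORTED_PROVIDERS)}, "
            f"got {provider!r}"
        )
    return provider
```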