Jesse Johnson
# Open Source LLM Configuration Guide (HuggingFace & Ollama)
## Overview
The Recipe Recommendation Bot supports open source models through both HuggingFace and Ollama. This guide explains how to configure these providers for optimal performance, with recommended models under 20B parameters.
> 📚 **For comprehensive model comparisons including closed source options (OpenAI, Google), see [Comprehensive Model Guide](./comprehensive-model-guide.md)**
## Quick Model Recommendations
| Use Case | Model | Download Size | RAM Required | Quality |
|----------|-------|---------------|--------------|---------|
| **Development** | `gemma2:2b` | 1.6GB | 4GB | Good |
| **Production** | `llama3.1:8b` | 4.7GB | 8GB | Excellent |
| **High Quality** | `llama2:13b` | 7.4GB | 16GB | Outstanding |
| **API (Free)** | `deepseek-ai/DeepSeek-V3.1` | 0GB | N/A | Very Good |
## 🤗 HuggingFace Configuration
### Environment Variables
Add these variables to your `.env` file:
```bash
# LLM Provider Configuration
LLM_PROVIDER=huggingface
# HuggingFace Configuration
HUGGINGFACE_API_TOKEN=your_hf_token_here # Optional for public models
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1 # Current recommended model
HUGGINGFACE_API_URL=https://api-inference.huggingface.co/models/
HUGGINGFACE_USE_API=true # Use API vs local inference
HUGGINGFACE_USE_GPU=false # Set to true for local GPU inference
# Embedding Configuration
HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
```
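How the backend consumes these variables depends on `services/llm_service.py`, but note that the boolean flags arrive as strings in the environment and need explicit parsing. A minimal sketch (the `env_bool` helper is hypothetical, not part of the backend):

```python
import os

def env_bool(name: str, default: bool = False) -> bool:
    """Parse a boolean environment variable ('true'/'false', case-insensitive)."""
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

# Fall back to the defaults from this guide if the .env values are absent
os.environ.setdefault("HUGGINGFACE_USE_API", "true")
os.environ.setdefault("HUGGINGFACE_MODEL", "deepseek-ai/DeepSeek-V3.1")

use_api = env_bool("HUGGINGFACE_USE_API")
use_gpu = env_bool("HUGGINGFACE_USE_GPU")  # unset -> False
model = os.environ["HUGGINGFACE_MODEL"]

print(use_api, use_gpu, model)
```

Parsing the strings explicitly avoids the classic pitfall where the non-empty string `"false"` is truthy in Python.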
### Deployment Options
#### Option 1: API Inference (Recommended)
```bash
HUGGINGFACE_USE_API=true
```
- **Pros**: No local downloads, fast startup, always latest models
- **Cons**: Requires internet connection, API rate limits
- **Download Size**: 0 bytes (no local storage needed)
- **Best for**: Development, testing, quick prototyping
#### Option 2: Local Inference
```bash
HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=false # CPU-only
```
- **Pros**: No internet required, no rate limits, private
- **Cons**: Large model downloads, slower inference on CPU
- **Best for**: Production, offline deployments
#### Option 3: Local GPU Inference
```bash
HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=true # Requires CUDA GPU
```
- **Pros**: Fast inference, no internet required, no rate limits
- **Cons**: Large downloads, requires GPU with sufficient VRAM
- **Best for**: Production with GPU resources
### Recommended HuggingFace Models
#### Lightweight Models (Good for CPU)
```bash
HUGGINGFACE_MODEL=microsoft/DialoGPT-small # ~117MB download
HUGGINGFACE_MODEL=distilgpt2 # ~319MB download
HUGGINGFACE_MODEL=google/flan-t5-small # ~242MB download
```
#### Balanced Performance Models
```bash
HUGGINGFACE_MODEL=microsoft/DialoGPT-medium # ~863MB download
HUGGINGFACE_MODEL=google/flan-t5-base # ~990MB download
HUGGINGFACE_MODEL=microsoft/CodeGPT-small-py # ~510MB download
```
#### High Quality Models (GPU Recommended)
```bash
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1 # very large MoE; use via the API, local weights run to hundreds of GB
HUGGINGFACE_MODEL=microsoft/DialoGPT-large # ~3.2GB download
HUGGINGFACE_MODEL=google/flan-t5-large # ~2.8GB download (770M params)
HUGGINGFACE_MODEL=huggingface/CodeBERTa-small-v1 # ~1.1GB download
```
#### Specialized Recipe/Cooking Models
```bash
HUGGINGFACE_MODEL=recipe-nlg/recipe-nlg-base # ~450MB download (if available)
HUGGINGFACE_MODEL=cooking-assistant/chef-gpt # ~2.1GB download (if available)
```
> ⚠️ Community models come and go; confirm these repositories still exist on the Hugging Face Hub before configuring them.
## 🦙 Ollama Configuration
### Installation
First, install Ollama on your system:
```bash
# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh
# Windows
# Download installer from https://ollama.ai/download
```
### Environment Variables
```bash
# LLM Provider Configuration
LLM_PROVIDER=ollama
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b
OLLAMA_TEMPERATURE=0.7
# Embedding Configuration
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
```
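Under the hood, a chat request to Ollama is a plain HTTP POST built from these variables. A sketch of what the service might assemble (the `build_ollama_chat_request` helper is hypothetical; the endpoint and payload shape follow Ollama's `/api/chat` REST API):

```python
import json
import os

def build_ollama_chat_request(prompt: str) -> tuple[str, dict]:
    """Assemble the URL and JSON body for a single-turn Ollama chat request."""
    base_url = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")
    payload = {
        "model": os.environ.get("OLLAMA_MODEL", "llama3.1:8b"),
        "messages": [{"role": "user", "content": prompt}],
        "options": {"temperature": float(os.environ.get("OLLAMA_TEMPERATURE", "0.7"))},
        "stream": False,  # one complete response instead of a token stream
    }
    return f"{base_url}/api/chat", payload

url, payload = build_ollama_chat_request("Suggest a quick pasta recipe.")
print(url)
print(json.dumps(payload, indent=2))
```

Setting `"stream": False` keeps the example simple; in production you may prefer streaming for lower perceived latency.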
### Starting Ollama Service
```bash
# Start Ollama server
ollama serve
# In another terminal, pull your desired model
ollama pull llama3.1:8b
```
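Before sending requests, it can help to verify the server is actually reachable. A small health check, assuming the default port (Ollama's root endpoint answers with HTTP 200 when the server is up):

```python
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, ...
        return False

print(ollama_is_up())  # True once `ollama serve` is running
```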
### Recommended Ollama Models
#### Lightweight Models (4GB RAM or less)
```bash
OLLAMA_MODEL=phi3:mini # ~2.3GB download (3.8B params)
OLLAMA_MODEL=gemma2:2b # ~1.6GB download (2B params)
OLLAMA_MODEL=qwen2:1.5b # ~934MB download (1.5B params)
```
#### Balanced Performance Models (8GB RAM)
```bash
OLLAMA_MODEL=llama3.1:8b # ~4.7GB download (8B params)
OLLAMA_MODEL=gemma2:9b # ~5.4GB download (9B params)
OLLAMA_MODEL=mistral:7b # ~4.1GB download (7B params)
OLLAMA_MODEL=qwen2:7b # ~4.4GB download (7B params)
```
#### High Quality Models (16GB+ RAM)
```bash
OLLAMA_MODEL=llama2:13b # ~7.4GB download (13B params)
OLLAMA_MODEL=mixtral:8x7b # ~26GB download (47B params - sparse)
OLLAMA_MODEL=qwen2.5:14b # ~9.0GB download (14B params)
```
#### Code/Instruction Following Models
```bash
OLLAMA_MODEL=codellama:7b # ~3.8GB download (7B params)
OLLAMA_MODEL=deepseek-coder:6.7b # ~3.8GB download (6.7B params)
OLLAMA_MODEL=wizardcoder:7b-python # ~3.8GB download (7B params)
```
### Ollama Model Management
```bash
# List available models
ollama list
# Pull a specific model
ollama pull llama3.1:8b
# Remove a model to free space
ollama rm old-model:tag
# Check model information
ollama show llama3.1:8b
```
## Installation Requirements
### HuggingFace Setup
#### For API Usage (No Downloads)
```bash
pip install -r requirements.txt
# No additional setup needed
```
#### For Local CPU Inference
```bash
pip install -r requirements.txt
# Models will be downloaded automatically on first use
```
#### For Local GPU Inference
```bash
# Install CUDA version of PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install other requirements
pip install -r requirements.txt
# Verify GPU availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
### Ollama Setup
#### Installation
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Start Ollama service
ollama serve
# Pull your first model (in another terminal)
ollama pull llama3.1:8b
```
## Storage Requirements & Download Sizes
### HuggingFace Local Models
- **Storage Location**: `~/.cache/huggingface/hub/`
- **Small Models**: 100MB - 1GB (good for development)
- **Medium Models**: 1GB - 5GB (balanced performance)
- **Large Models**: 5GB - 15GB (high quality, under 20B params)
### Ollama Models
- **Storage Location**: `~/.ollama/models/`
- **Quantized Storage**: Models use efficient quantization (4-bit, 8-bit)
- **2B Models**: ~1-2GB download
- **7-8B Models**: ~4-5GB download
- **13-14B Models**: ~7-8GB download
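These download sizes follow from simple arithmetic: parameter count times effective bits per weight. Ollama's default 4-bit quantization lands at roughly 4.5-5 effective bits per weight once scales and metadata are included (the exact per-model figures below are assumptions for illustration):

```python
def approx_download_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough download size in GB: parameter count times bits per weight."""
    return params_billions * bits_per_weight / 8

print(round(approx_download_gb(8.0, 4.7), 1))  # 8B model at ~4.7 bits/weight -> 4.7
print(round(approx_download_gb(2.6, 4.9), 1))  # ~2.6B model at ~4.9 bits/weight -> 1.6
```

This rule of thumb also explains why an unquantized FP16 download is roughly four times larger than its 4-bit counterpart.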
### Embedding Models
```bash
# HuggingFace Embeddings (auto-downloaded)
sentence-transformers/all-MiniLM-L6-v2 # ~80MB
sentence-transformers/all-mpnet-base-v2 # ~420MB
# Ollama Embeddings
ollama pull nomic-embed-text # ~274MB
ollama pull mxbai-embed-large # ~669MB
```
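Whichever embedding model you choose, recipe retrieval ultimately compares vectors, typically by cosine similarity. A self-contained illustration with toy 3-dimensional vectors (real outputs are 384-dimensional for `all-MiniLM-L6-v2`, 768-dimensional for `all-mpnet-base-v2` and `nomic-embed-text`):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # identical -> 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal -> 0.0
```

Because the two embedding providers use different dimensionalities, switching embedding models means re-embedding any stored recipe vectors.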
## Performance & Hardware Recommendations
### System Requirements
#### Minimum (API Usage)
- **RAM**: 2GB
- **Storage**: 100MB
- **Internet**: Required for API calls
#### CPU Inference
- **RAM**: 8GB+ (16GB for larger models)
- **CPU**: 4+ cores recommended
- **Storage**: 5GB+ for models cache
#### GPU Inference
- **GPU**: 8GB+ VRAM (for 7B models)
- **RAM**: 16GB+ system RAM
- **Storage**: 10GB+ for models
### Performance Tips
1. **Start Small**: Begin with lightweight models and upgrade based on quality needs
2. **Use API First**: Test with HuggingFace API before committing to local inference
3. **Monitor Resources**: Check CPU/GPU/RAM usage during inference
4. **Model Caching**: First run downloads models, subsequent runs are faster
## Troubleshooting
### HuggingFace Issues
#### "accelerate package required"
```bash
pip install accelerate
```
#### GPU not detected
```bash
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
# If false, install CUDA PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
#### Out of memory errors
- Switch to a smaller model
- Set `HUGGINGFACE_USE_GPU=false` for CPU inference
- Use API instead: `HUGGINGFACE_USE_API=true`
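If memory pressure is a recurring problem, you can encode the RAM-based tiers from the Quick Model Recommendations table directly. A hypothetical picker (the tier data and `pick_model` helper are illustrative, not part of the backend):

```python
# (required RAM in GB, model tag) -- smallest tier first
MODELS_BY_RAM_GB = [
    (4, "gemma2:2b"),
    (8, "llama3.1:8b"),
]

def pick_model(available_ram_gb: float) -> str:
    """Pick the largest model whose RAM requirement fits the available memory."""
    chosen = MODELS_BY_RAM_GB[0][1]  # fall back to the smallest model
    for required, name in MODELS_BY_RAM_GB:
        if available_ram_gb >= required:
            chosen = name
    return chosen

print(pick_model(6))   # -> gemma2:2b
print(pick_model(12))  # -> llama3.1:8b
```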
### Ollama Issues
#### Ollama service not starting
```bash
# Check if port 11434 is available
lsof -i :11434
# Restart Ollama
ollama serve
```
#### Model not found
```bash
# List available models
ollama list
# Pull the model
ollama pull llama3.1:8b
```
#### Slow inference
- Try a smaller model
- Check available RAM
- Consider using GPU if available
## Quick Tests
### Test HuggingFace Configuration
```bash
cd backend
python -c "
from services.llm_service import LLMService
import os
os.environ['LLM_PROVIDER'] = 'huggingface'
service = LLMService()
response = service.simple_chat_completion('Hello')
print(f'Response: {response}')
print('✅ HuggingFace LLM working!')
"
```
### Test Ollama Configuration
```bash
# First ensure Ollama is running
ollama serve &
# Test the service
cd backend
python -c "
from services.llm_service import LLMService
import os
os.environ['LLM_PROVIDER'] = 'ollama'
service = LLMService()
response = service.simple_chat_completion('Hello')
print(f'Response: {response}')
print('✅ Ollama LLM working!')
"
```
## Configuration Examples
### Development Setup (Fast Start)
```bash
# Use HuggingFace API for quick testing
LLM_PROVIDER=huggingface
HUGGINGFACE_USE_API=true
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1
HUGGINGFACE_API_TOKEN=your_token_here
```
### Local CPU Setup
```bash
# Local inference on CPU
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.1:8b
OLLAMA_BASE_URL=http://localhost:11434
```
### Local GPU Setup
```bash
# Local inference with GPU acceleration
LLM_PROVIDER=huggingface
HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=true
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1
```
### Production Setup (High Performance)
```bash
# Ollama with optimized model
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama2:13b # Larger 13B model
OLLAMA_BASE_URL=http://localhost:11434
# Ensure 16GB+ RAM available
```
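Whichever example you adopt, a startup sanity check catches missing variables early. A hypothetical validator (the required-key lists are assumptions based on the variables used in this guide, not the backend's actual schema):

```python
# Minimum variables each provider needs before the service can start
REQUIRED_KEYS = {
    "huggingface": ["HUGGINGFACE_MODEL"],
    "ollama": ["OLLAMA_BASE_URL", "OLLAMA_MODEL"],
}

def missing_keys(config: dict) -> list[str]:
    """Return the configuration keys that are absent or empty for the chosen provider."""
    provider = config.get("LLM_PROVIDER", "")
    if provider not in REQUIRED_KEYS:
        return ["LLM_PROVIDER"]
    return [k for k in REQUIRED_KEYS[provider] if not config.get(k)]

print(missing_keys({"LLM_PROVIDER": "ollama", "OLLAMA_MODEL": "llama3.1:8b"}))
# -> ['OLLAMA_BASE_URL']
```

Running a check like this at startup turns a confusing runtime failure into an immediate, actionable error message.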