# Open Source LLM Configuration Guide (HuggingFace & Ollama)

## Overview
The Recipe Recommendation Bot supports open source models through both HuggingFace and Ollama. This guide explains how to configure these providers for optimal performance, recommending local models under 20B parameters.

> 📚 **For comprehensive model comparisons including closed source options (OpenAI, Google), see [Comprehensive Model Guide](./comprehensive-model-guide.md)**

## Quick Model Recommendations

| Use Case | Model | Download Size | RAM Required | Quality |
|----------|-------|---------------|--------------|---------|
| **Development** | `gemma2:2b` | 1.6GB | 4GB | Good |
| **Production** | `llama3.1:8b` | 4.7GB | 8GB | Excellent |
| **High Quality** | `qwen2.5:14b` | 9.0GB | 16GB | Outstanding |
| **API (Free)** | `deepseek-ai/DeepSeek-V3.1` | 0GB | N/A | Very Good |

## 🤗 HuggingFace Configuration

### Environment Variables

Add these variables to your `.env` file:

```bash
# LLM Provider Configuration
LLM_PROVIDER=huggingface

# HuggingFace Configuration
HUGGINGFACE_API_TOKEN=your_hf_token_here        # Optional for public models
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1    # Current recommended model
HUGGINGFACE_API_URL=https://api-inference.huggingface.co/models/
HUGGINGFACE_USE_API=true                        # Use API vs local inference
HUGGINGFACE_USE_GPU=false                       # Set to true for local GPU inference

# Embedding Configuration
HUGGINGFACE_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
```
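If you need to consume these variables in your own scripts, a minimal sketch may help (the helper name and defaults here are illustrative, not the bot's actual loader):

```python
import os

def load_hf_config() -> dict:
    """Read the HuggingFace settings above from the environment,
    falling back to the documented defaults."""
    return {
        "token": os.environ.get("HUGGINGFACE_API_TOKEN"),  # optional for public models
        "model": os.environ.get("HUGGINGFACE_MODEL", "deepseek-ai/DeepSeek-V3.1"),
        "api_url": os.environ.get(
            "HUGGINGFACE_API_URL", "https://api-inference.huggingface.co/models/"
        ),
        "use_api": os.environ.get("HUGGINGFACE_USE_API", "true").lower() == "true",
        "use_gpu": os.environ.get("HUGGINGFACE_USE_GPU", "false").lower() == "true",
    }
```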

### Deployment Options

#### Option 1: API Inference (Recommended)
```bash
HUGGINGFACE_USE_API=true
```
- **Pros**: No local downloads, fast startup, always latest models
- **Cons**: Requires internet connection, API rate limits
- **Download Size**: 0 bytes (no local storage needed)
- **Best for**: Development, testing, quick prototyping

#### Option 2: Local Inference
```bash
HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=false  # CPU-only
```
- **Pros**: No internet required, no rate limits, private
- **Cons**: Large model downloads, slower inference on CPU
- **Best for**: Production, offline deployments

#### Option 3: Local GPU Inference
```bash
HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=true   # Requires CUDA GPU
```
- **Pros**: Fast inference, no internet required, no rate limits
- **Cons**: Large downloads, requires GPU with sufficient VRAM
- **Best for**: Production with GPU resources
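The two `HUGGINGFACE_USE_*` flags fully determine which of the three options is in effect; a small sketch of that mapping (the function name is illustrative, not from the codebase):

```python
def deployment_mode(use_api: bool, use_gpu: bool) -> str:
    """Map the HUGGINGFACE_USE_API / HUGGINGFACE_USE_GPU flags
    to the deployment option described above."""
    if use_api:
        return "api"  # Option 1: hosted inference; the GPU flag is ignored
    return "local-gpu" if use_gpu else "local-cpu"  # Options 3 / 2
```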

### Recommended HuggingFace Models

#### Lightweight Models (Good for CPU)
```bash
HUGGINGFACE_MODEL=microsoft/DialoGPT-small       # ~117MB download
HUGGINGFACE_MODEL=distilgpt2                     # ~319MB download
HUGGINGFACE_MODEL=google/flan-t5-small           # ~242MB download
```

#### Balanced Performance Models
```bash
HUGGINGFACE_MODEL=microsoft/DialoGPT-medium      # ~863MB download
HUGGINGFACE_MODEL=google/flan-t5-base            # ~990MB download
HUGGINGFACE_MODEL=microsoft/CodeGPT-small-py     # ~510MB download
```

#### High Quality Models (GPU Recommended)
```bash
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1      # Very large MoE (~685B params); use via the API, not a local download
HUGGINGFACE_MODEL=microsoft/DialoGPT-large       # ~3.2GB download
HUGGINGFACE_MODEL=google/flan-t5-large           # ~2.8GB download (770M params)
HUGGINGFACE_MODEL=huggingface/CodeBERTa-small-v1 # Masked-LM for code; not suited to chat generation
```

#### Specialized Recipe/Cooking Models
```bash
# Community recipe models: verify these repository IDs exist on the Hub
# before use, as community models may be renamed or removed
HUGGINGFACE_MODEL=recipe-nlg/recipe-nlg-base     # ~450MB download (if available)
HUGGINGFACE_MODEL=cooking-assistant/chef-gpt     # ~2.1GB download (if available)
```

## 🦙 Ollama Configuration

### Installation

First, install Ollama on your system:

```bash
# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh

# Windows
# Download installer from https://ollama.ai/download
```

### Environment Variables

```bash
# LLM Provider Configuration
LLM_PROVIDER=ollama

# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.1:8b
OLLAMA_TEMPERATURE=0.7

# Embedding Configuration
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
```
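Under the hood, chat requests go to Ollama's REST endpoint `POST /api/generate`. A sketch of how a request body could be assembled from the variables above (the helper is illustrative; the payload is built here, never sent):

```python
import json
import os

def build_generate_payload(prompt: str) -> dict:
    """Assemble a body for Ollama's POST /api/generate endpoint
    from the OLLAMA_* environment variables."""
    return {
        "model": os.environ.get("OLLAMA_MODEL", "llama3.1:8b"),
        "prompt": prompt,
        "stream": False,  # ask for one JSON response instead of a token stream
        "options": {
            "temperature": float(os.environ.get("OLLAMA_TEMPERATURE", "0.7")),
        },
    }

# The body would be POSTed to f"{OLLAMA_BASE_URL}/api/generate"
print(json.dumps(build_generate_payload("Suggest a quick pasta dish"), indent=2))
```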

### Starting Ollama Service

```bash
# Start Ollama server
ollama serve

# In another terminal, pull your desired model
ollama pull llama3.1:8b
```

### Recommended Ollama Models

#### Lightweight Models (4GB RAM or less)
```bash
OLLAMA_MODEL=phi3:mini                  # ~2.3GB download (3.8B params)
OLLAMA_MODEL=gemma2:2b                  # ~1.6GB download (2B params)
OLLAMA_MODEL=qwen2:1.5b                 # ~934MB download (1.5B params)
```

#### Balanced Performance Models (8GB RAM)
```bash
OLLAMA_MODEL=llama3.1:8b               # ~4.7GB download (8B params)
OLLAMA_MODEL=gemma2:9b                 # ~5.4GB download (9B params)
OLLAMA_MODEL=mistral:7b                # ~4.1GB download (7B params)
OLLAMA_MODEL=qwen2:7b                  # ~4.4GB download (7B params)
```

#### High Quality Models (16GB+ RAM)
```bash
OLLAMA_MODEL=llama2:13b                # ~7.4GB download (13B params)
OLLAMA_MODEL=mixtral:8x7b              # ~26GB download (47B sparse params; needs 32GB+ RAM)
OLLAMA_MODEL=qwen2.5:14b               # ~9.0GB download (14B params)
```

#### Code/Instruction Following Models
```bash
OLLAMA_MODEL=codellama:7b              # ~3.8GB download (7B params)
OLLAMA_MODEL=deepseek-coder:6.7b       # ~3.8GB download (6.7B params)
OLLAMA_MODEL=wizardcoder:7b-python     # ~3.8GB download (7B params, Python-focused)
```

### Ollama Model Management

```bash
# List available models
ollama list

# Pull a specific model
ollama pull llama3.1:8b

# Remove a model to free space
ollama rm old-model:tag

# Check model information
ollama show llama3.1:8b
```

## Installation Requirements

### HuggingFace Setup

#### For API Usage (No Downloads)
```bash
pip install -r requirements.txt
# No additional setup needed
```

#### For Local CPU Inference
```bash
pip install -r requirements.txt
# Models will be downloaded automatically on first use
```

#### For Local GPU Inference
```bash
# Install CUDA version of PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install other requirements
pip install -r requirements.txt

# Verify GPU availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
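A small helper pattern for choosing the device at runtime, degrading to CPU when no CUDA build of PyTorch is present (illustrative, not the bot's actual code):

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch install sees a GPU,
    otherwise "cpu". Also falls back to "cpu" when torch is absent."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"

print(f"Running inference on: {pick_device()}")
```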

### Ollama Setup

#### Installation
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service
ollama serve

# Pull your first model (in another terminal)
ollama pull llama3.1:8b
```

## Storage Requirements & Download Sizes

### HuggingFace Local Models
- **Storage Location**: `~/.cache/huggingface/transformers/`
- **Small Models**: 100MB - 1GB (good for development)
- **Medium Models**: 1GB - 5GB (balanced performance)
- **Large Models**: 5GB - 15GB (high quality, under 20B params)
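To see how much disk the cache is actually using, a quick sketch (pure standard library; reports 0 if nothing has been downloaded yet):

```python
import os

def cache_size_bytes(root: str = os.path.expanduser("~/.cache/huggingface")) -> int:
    """Total bytes stored under the HuggingFace cache directory."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):  # skip dangling symlinks
                total += os.path.getsize(path)
    return total

print(f"{cache_size_bytes() / 1e9:.2f} GB cached")
```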

### Ollama Models
- **Storage Location**: `~/.ollama/models/`
- **Quantized Storage**: Models use efficient quantization (4-bit, 8-bit)
- **2B Models**: ~1-2GB download
- **7-8B Models**: ~4-5GB download  
- **13-14B Models**: ~7-8GB download
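These sizes follow a simple rule of thumb: parameters times bits-per-weight, divided by 8. A sketch of that estimate (real downloads run somewhat larger because of tokenizer files and metadata):

```python
def approx_download_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    """Back-of-envelope model size for a quantized download:
    parameters x bits / 8, ignoring per-file overhead."""
    return params_billions * bits_per_weight / 8

print(approx_download_gb(8))  # 8B model at 4-bit: 4.0 (Ollama lists ~4.7GB with overhead)
```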

### Embedding Models
```bash
# HuggingFace Embeddings (auto-downloaded)
sentence-transformers/all-MiniLM-L6-v2     # ~80MB
sentence-transformers/all-mpnet-base-v2    # ~420MB

# Ollama Embeddings
ollama pull nomic-embed-text               # ~274MB
ollama pull mxbai-embed-large              # ~669MB
```
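Both embedding families are used the same way downstream: recipes are compared by the cosine similarity of their vectors. A dependency-free sketch, with toy vectors standing in for real model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors:
    1.0 means identical direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```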

## Performance & Hardware Recommendations

### System Requirements

#### Minimum (API Usage)
- **RAM**: 2GB
- **Storage**: 100MB
- **Internet**: Required for API calls

#### CPU Inference
- **RAM**: 8GB+ (16GB for larger models)
- **CPU**: 4+ cores recommended
- **Storage**: 5GB+ for models cache

#### GPU Inference
- **GPU**: 8GB+ VRAM (for 7B models)
- **RAM**: 16GB+ system RAM
- **Storage**: 10GB+ for models

### Performance Tips

1. **Start Small**: Begin with lightweight models and upgrade based on quality needs
2. **Use API First**: Test with HuggingFace API before committing to local inference
3. **Monitor Resources**: Check CPU/GPU/RAM usage during inference
4. **Model Caching**: First run downloads models, subsequent runs are faster
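For tip 3, a tiny wrapper can be put around any inference call to measure wall-clock latency (illustrative helper, not part of the project):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds),
    e.g. timed(service.simple_chat_completion, "Hello")."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```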

## Troubleshooting

### HuggingFace Issues

#### "accelerate package required"
```bash
pip install accelerate
```

#### GPU not detected
```bash
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# If false, install CUDA PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

#### Out of memory errors
- Switch to a smaller model
- Set `HUGGINGFACE_USE_GPU=false` for CPU inference
- Use API instead: `HUGGINGFACE_USE_API=true`

### Ollama Issues

#### Ollama service not starting
```bash
# Check if port 11434 is available
lsof -i :11434

# Restart Ollama
ollama serve
```

#### Model not found
```bash
# List available models
ollama list

# Pull the model
ollama pull llama3.1:8b
```

#### Slow inference
- Try a smaller model
- Check available RAM
- Consider using GPU if available

## Quick Tests

### Test HuggingFace Configuration
```bash
cd backend
python -c "
import os
os.environ['LLM_PROVIDER'] = 'huggingface'  # set before importing the service
from services.llm_service import LLMService
service = LLMService()
response = service.simple_chat_completion('Hello')
print(f'Response: {response}')
print('✅ HuggingFace LLM working!')
"
```

### Test Ollama Configuration
```bash
# First ensure Ollama is running
ollama serve &

# Test the service
cd backend
python -c "
import os
os.environ['LLM_PROVIDER'] = 'ollama'  # set before importing the service
from services.llm_service import LLMService
service = LLMService()
response = service.simple_chat_completion('Hello')
print(f'Response: {response}')
print('✅ Ollama LLM working!')
"
```

## Configuration Examples

### Development Setup (Fast Start)
```bash
# Use HuggingFace API for quick testing
LLM_PROVIDER=huggingface
HUGGINGFACE_USE_API=true
HUGGINGFACE_MODEL=deepseek-ai/DeepSeek-V3.1
HUGGINGFACE_API_TOKEN=your_token_here
```

### Local CPU Setup
```bash
# Local inference on CPU
LLM_PROVIDER=ollama
OLLAMA_MODEL=llama3.1:8b
OLLAMA_BASE_URL=http://localhost:11434
```

### Local GPU Setup
```bash
# Local inference with GPU acceleration
LLM_PROVIDER=huggingface
HUGGINGFACE_USE_API=false
HUGGINGFACE_USE_GPU=true
HUGGINGFACE_MODEL=microsoft/DialoGPT-large  # Fits a single consumer GPU; DeepSeek-V3.1 is API-only
```

### Production Setup (High Performance)
```bash
# Ollama with optimized model
LLM_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:14b  # Higher quality
OLLAMA_BASE_URL=http://localhost:11434
# Ensure 16GB+ RAM available
```