# AI Integration

MiniSearch supports four AI inference backends, each with different trade-offs between privacy, performance, and setup complexity.

## Inference Types Overview

| Type | Privacy | Speed | Setup | Best For |
|------|---------|-------|-------|----------|
| **Browser** (WebLLM/Wllama) | Maximum (no data leaves device) | Fast (WebGPU) / Slow (CPU) | None | Personal use, privacy-critical scenarios |
| **OpenAI** | Low (data sent to OpenAI) | Very Fast | API Key | Maximum quality, convenience |
| **AI Horde** | Medium (distributed volunteers) | Variable | Anonymous | Free GPU access, no setup |
| **Internal** | High (your infrastructure) | Depends on hardware | Self-hosted API | Teams, compliance requirements |
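The trade-offs in the table can be expressed as a small decision helper. This is an illustrative sketch, not a MiniSearch API: the type names and the priority order (privacy first, then quality, then the free network) are assumptions layered on the table above.

```typescript
// Illustrative only: a decision helper mirroring the trade-off table above.
type InferenceType = "browser" | "openai" | "horde" | "internal";

interface Requirements {
  dataIsSensitive: boolean;   // must data stay off third-party servers?
  hasSelfHostedApi: boolean;  // is an internal OpenAI-compatible API available?
  hasOpenAiKey: boolean;      // is an OpenAI (or compatible) key configured?
}

function chooseInferenceType(req: Requirements): InferenceType {
  // Sensitive data: prefer infrastructure you control.
  if (req.dataIsSensitive) {
    return req.hasSelfHostedApi ? "internal" : "browser";
  }
  // Otherwise favor quality/convenience, falling back to the free network.
  if (req.hasOpenAiKey) return "openai";
  if (req.hasSelfHostedApi) return "internal";
  return "horde";
}
```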

## Browser-Based Inference

Runs AI models entirely in the browser using WebAssembly or WebGPU. No data leaves the user's device.

### WebLLM (WebGPU Accelerated)

Uses `@mlc-ai/web-llm` for GPU-accelerated inference.

**Requirements:**
- Modern browser with WebGPU support (Chrome 113+, Edge 113+, Firefox Nightly)
- ~500MB-2GB free RAM
- GPU with F16 shader support (for optimal models)

**How It Works:**
1. User searches with "Enable AI Response" on
2. Library checks WebGPU availability and F16 shader support
3. Downloads model weights from HuggingFace (cached in IndexedDB)
4. Loads model into GPU memory
5. Generates the response, streaming tokens as they are produced

**Model Selection:**
```typescript
// WebLLM model IDs from MLC registry
const models = {
  fast: 'Qwen3-0.6B-q4f16_1-MLC',      // 600M params, ~400MB
  balanced: 'SmolLM2-1.7B-q4f16_1-MLC', // 1.7B params, ~1GB
  capable: 'Llama-3.2-1B-q4f16_1-MLC'   // 1B params, ~600MB
};
```

**Configuration:**
- Settings β†’ Inference Type: `Browser`
- Settings β†’ Browser Model: Select from dropdown
- Settings β†’ Enable WebGPU: Toggle (auto-detected)

**Limitations:**
- First load requires model download (progressive via sharded files)
- Limited to smaller models (3B params max due to browser memory)
- Requires modern browser with WebGPU

### Wllama (CPU-Based)

Uses `@wllama/wllama` for CPU inference via WebAssembly.

**Requirements:**
- Any modern browser
- ~300MB-1GB free RAM
- No WebGPU required

**How It Works:**
1. Downloads model from HuggingFace (GGUF format)
2. Runs inference in WebAssembly (slower but universally compatible)
3. Supports 40+ pre-configured models

**Pre-configured Models:**
All stored at `Felladrin/gguf-sharded-*` on HuggingFace:

| Model | Params | Size | Speed | Quality |
|-------|--------|------|-------|---------|
| qwen-3-0.6b | 600M | ~400MB | Fast | Good |
| smollm2-1.7b | 1.7B | ~1.1GB | Medium | Better |
| llama-3.2-1b | 1B | ~650MB | Fast | Good |
| gemma-3-1b | 1B | ~650MB | Fast | Good |
| phi-4-mini | 3.8B | ~2.2GB | Slow | Best |
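Since all of these models compete for the same limited browser memory, one practical approach is to pick the largest model that fits the available RAM. The helper below is hypothetical (not part of MiniSearch); the sizes come from the table above, and the 20% headroom factor is an assumption.

```typescript
// Hypothetical helper: pick the largest Wllama model from the table above
// that fits the available RAM. Sizes are approximate download sizes in MB.
interface WllamaModel { id: string; sizeMb: number; }

const WLLAMA_MODELS: WllamaModel[] = [
  { id: "qwen-3-0.6b", sizeMb: 400 },
  { id: "llama-3.2-1b", sizeMb: 650 },
  { id: "gemma-3-1b", sizeMb: 650 },
  { id: "smollm2-1.7b", sizeMb: 1100 },
  { id: "phi-4-mini", sizeMb: 2200 },
];

function pickModelForRam(freeRamMb: number): WllamaModel | undefined {
  // Leave headroom: the runtime needs memory beyond the raw weights.
  const budget = freeRamMb * 0.8;
  return [...WLLAMA_MODELS]
    .sort((a, b) => b.sizeMb - a.sizeMb) // largest first
    .find((m) => m.sizeMb <= budget);
}
```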

**Configuration:**
- Settings β†’ Inference Type: `Browser`
- Settings β†’ Use WebGPU: OFF
- Settings β†’ Wllama Model: Select from dropdown

**Limitations:**
- 2-5x slower than WebGPU inference
- Same memory constraints
- No GPU acceleration

### WebLLM vs Wllama Decision Matrix

```
WebGPU Available?
β”œβ”€β”€ Yes β†’ WebLLM (F16 if supported, else F32)
└── No  β†’ Wllama (CPU inference)
```

**Code Detection:**
```typescript
// client/modules/webGpu.ts
export async function isWebGpuAvailable(): Promise<boolean> {
  if (!navigator.gpu) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return !!adapter;
  } catch {
    return false;
  }
}

export async function isF16ShaderSupported(): Promise<boolean> {
  const adapter = await navigator.gpu?.requestAdapter();
  return adapter?.features.has('shader-f16') ?? false;
}
```

## OpenAI API Integration

Uses OpenAI's API or any OpenAI-compatible service.

**Setup:**
1. Get API key from OpenAI or compatible provider
2. Settings β†’ Inference Type: `OpenAI`
3. Settings β†’ OpenAI API Key: Enter key
4. Settings β†’ OpenAI Model: Select or enter model ID

**Supported Providers:**
- OpenAI (gpt-4, gpt-3.5-turbo)
- Anthropic (via OpenAI-compatible endpoint)
- Google (Gemini via OpenAI-compatible endpoint)
- Any custom provider with OpenAI-compatible API

**Features:**
- Streaming responses
- Auto model selection (if blank)
- Retry logic with fallback models
- Reasoning content support
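Streaming responses from OpenAI-compatible endpoints arrive as server-sent events, one JSON chunk per `data:` line. A minimal sketch of extracting the content deltas from such a stream, assuming the common `data: {json}` chunk format terminated by `data: [DONE]` (a real client would also buffer partial lines across network reads):

```typescript
// Sketch: extract content deltas from OpenAI-style SSE lines.
function extractDeltas(sseText: string): string[] {
  const deltas: string[] = [];
  for (const line of sseText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue;
    const payload = trimmed.slice(5).trim();
    if (payload === "[DONE]") break;
    try {
      const chunk = JSON.parse(payload);
      const content = chunk.choices?.[0]?.delta?.content;
      if (typeof content === "string") deltas.push(content);
    } catch {
      // Ignore malformed chunks; a real client would buffer partial lines.
    }
  }
  return deltas;
}
```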

**Configuration:**
```typescript
{
  inferenceType: 'openai',
  openaiApiKey: 'sk-xxx',
  openaiModel: 'gpt-4', // Optional: auto-detected if empty
  inferenceTemperature: 0.7,
  inferenceMaxTokens: 4096
}
```

**Privacy Considerations:**
- Search queries and results sent to OpenAI
- Not suitable for sensitive data
- Consider internal API for private data

## AI Horde Integration

Uses aihorde.net, a distributed volunteer GPU network.

**Setup:**
1. Settings β†’ Inference Type: `AI Horde`
2. (Optional) Settings β†’ AI Horde API Key: Get from aihorde.net
3. Settings β†’ AI Horde Model: Select preferred model

**How It Works:**
1. Request sent to AI Horde API
2. Distributed to volunteer workers
3. Multiple workers may process in parallel
4. The first completed response is used; duplicate results are discarded
5. Results streamed back

**Features:**
- Free to use (anonymous or authenticated)
- Kudos-based priority system
- Large model selection (70B+ params available)
- No API key required (but recommended for priority)
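Queue-based APIs like AI Horde are asynchronous: you submit a job, then poll for its status. The helper below sketches that submit-then-poll pattern with the status check injected, so it stays transport-agnostic; the `JobStatus` shape and the default timings are assumptions, not the real Horde schema.

```typescript
// Sketch of the poll-until-done pattern used by queue-based APIs.
interface JobStatus { done: boolean; text?: string; }

async function pollUntilDone(
  check: () => Promise<JobStatus>,
  { intervalMs = 10, maxAttempts = 50 } = {},
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await check();
    if (status.done) return status.text ?? "";
    // Wait before the next status check.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Generation timed out while queued");
}
```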

**Configuration:**
```typescript
{
  inferenceType: 'horde',
  aiHordeApiKey: '', // Optional
  aiHordeModel: 'koboldcpp/LLaMA2-70B-Psyfighter2' // Optional
}
```

**Limitations:**
- Variable latency (depends on worker availability)
- Quality varies by worker
- May queue during high demand
- Requires internet connection

## Internal API Integration

Self-hosted OpenAI-compatible API for teams and compliance.

**Setup:**
1. Host an OpenAI-compatible API (e.g., vLLM, llama.cpp server, Ollama with OpenAI compat)
2. Configure environment variables (see `docs/configuration.md`)
3. Settings β†’ Inference Type: `Internal`

**Environment Variables:**
```bash
INTERNAL_OPENAI_COMPATIBLE_API_BASE_URL="https://llm.company.com/v1"
INTERNAL_OPENAI_COMPATIBLE_API_KEY="sk-internal-xxx"
INTERNAL_OPENAI_COMPATIBLE_API_MODEL="llama-3.1-8b"
INTERNAL_OPENAI_COMPATIBLE_API_NAME="Company LLM"
```

**Server-Side Proxy:**
The internal API uses a server-side proxy to:
- Hide API keys from client
- Add request logging/auditing
- Apply rate limiting
- Enable token-based authentication
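The proxy's central job, keeping the upstream key server-side, can be sketched as a pure request builder (the Express/server glue is omitted). The environment variable values match the configuration above; the function name, the `/chat/completions` path convention, and pinning the model server-side are illustrative assumptions.

```typescript
// Sketch: build the upstream request for the /inference proxy.
// The client only ever presents VITE_SEARCH_TOKEN; the upstream API key
// is injected server-side and never reaches the browser.
interface UpstreamRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildUpstreamRequest(
  clientBody: { messages: unknown[]; stream?: boolean },
  env: { baseUrl: string; apiKey: string; model: string },
): UpstreamRequest {
  return {
    url: `${env.baseUrl}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${env.apiKey}`, // server-side secret
    },
    // The proxy pins the model, so clients cannot request arbitrary ones.
    body: JSON.stringify({ ...clientBody, model: env.model }),
  };
}
```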

**Endpoint:**
```
POST /inference
Content-Type: application/json
Authorization: Bearer <VITE_SEARCH_TOKEN>

{
  "messages": [...],
  "model": "llama-3.1-8b",
  "stream": true
}
```

**Features:**
- Private data stays in your infrastructure
- Custom model selection
- Server-side logging
- Compatible with any OpenAI-compatible API

**Recommended Self-Hosted Options:**
- **vLLM**: High-performance, production-ready
- **llama.cpp server**: Single binary, easy setup
- **Ollama**: Simple, Docker-friendly
- **text-generation-webui**: Feature-rich, UI included

## Text Generation Flow

### Search-Triggered Generation

```
User Query
    ↓
searchAndRespond() [client/modules/textGeneration.ts]
    ↓
startTextSearch() β†’ searchText() [search.ts]
    ↓
Wait for search results
    ↓
canStartResponding() checks state
    ↓
Load AI model (if browser-based)
    ↓
Generate system prompt with search results
    ↓
Stream response via selected inference type
    ↓
Update PubSub channels (response, textGenerationState)
```
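The pipeline above can be sketched as an orchestrator with its dependencies injected. The function and parameter names here are illustrative, not MiniSearch's actual signatures; in the real module these would be wired to `search.ts`, the model loader, and the PubSub channels.

```typescript
// Illustrative orchestration of the search-then-respond pipeline.
interface Pipeline {
  searchText: (query: string) => Promise<string[]>;
  loadModel: () => Promise<void>;
  generate: (systemPrompt: string, query: string) => Promise<string>;
  publish: (channel: string, value: string) => void;
}

async function searchAndRespond(query: string, deps: Pipeline): Promise<string> {
  deps.publish("textGenerationState", "searching");
  const results = await deps.searchText(query);

  deps.publish("textGenerationState", "loadingModel");
  await deps.loadModel(); // no-op for API-based inference

  const systemPrompt = `Answer using these search results:\n${results.join("\n")}`;
  deps.publish("textGenerationState", "generating");
  const response = await deps.generate(systemPrompt, query);

  deps.publish("response", response);
  deps.publish("textGenerationState", "completed");
  return response;
}
```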

### Chat Generation

```
User sends message
    ↓
generateChatResponse() [textGeneration.ts]
    ↓
Manage token budget (75% of 4096 = 3072 tokens)
    ↓
Create conversation summary if needed (800-token limit)
    ↓
Build context: System prompt + Summary + Recent turns
    ↓
Call inference API (streaming)
    ↓
Update PubSub (chatMessages, response)
    ↓
Save to history database
```

## Conversation Memory

### Token Budget Management

- **Context Window:** 4096 tokens
- **Reserved for Response:** 25% (1024 tokens)
- **Available for Context:** 75% (3072 tokens)

**Allocation Priority:**
1. System prompt (with search results)
2. Conversation summary (if exists)
3. Recent chat messages (newest first)
4. Older messages (summarized or dropped)
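The allocation priority can be sketched as a budget-fill function: the system prompt and summary are always kept, then recent messages are added newest-first until the budget is spent. This is a sketch under assumptions; token counts are supplied by the caller, since real tokenization varies by model.

```typescript
// Sketch of the allocation priority: system prompt and summary always fit;
// recent messages are added newest-first while the budget allows.
interface Counted { text: string; tokens: number; }

function buildContext(
  systemPrompt: Counted,
  summary: Counted | null,
  messages: Counted[], // oldest -> newest
  budgetTokens: number,
): string[] {
  let used = systemPrompt.tokens + (summary?.tokens ?? 0);
  const recent: Counted[] = [];
  // Walk newest-first, keeping messages while they fit.
  for (let i = messages.length - 1; i >= 0; i--) {
    if (used + messages[i].tokens > budgetTokens) break;
    used += messages[i].tokens;
    recent.unshift(messages[i]); // restore chronological order
  }
  return [
    systemPrompt.text,
    ...(summary ? [summary.text] : []),
    ...recent.map((m) => m.text),
  ];
}
```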

### Rolling Summaries

When conversation exceeds token budget:

1. **Detect Overflow:** Current tokens > 3072
2. **Generate Summary:** Call LLM with 800-token limit
3. **Store Summary:** Save in conversationSummaryPubSub
4. **Drop Old Messages:** Remove summarized messages from chatMessages
5. **Continue:** Use summary + remaining messages for context

**Summary Prompt:**
```
Summarize this conversation in 3-5 sentences, preserving key facts
and user intent. Be concise but informative.
```

## Error Handling and Fallbacks

### Browser Inference Failures

```typescript
// If WebLLM fails, fall back to Wllama
try {
  await generateWithWebLLM();
} catch (error) {
  // `error` is typed `unknown` in a TypeScript catch clause
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('WebGPU')) {
    // Auto-switch to Wllama (CPU inference)
    settings.enableWebGpu = false;
    await generateWithWllama();
  } else {
    throw error; // Non-WebGPU failures propagate to the caller
  }
}
```

### API Failures

- **OpenAI:** Retry with exponential backoff, fallback to cheaper model
- **AI Horde:** Queue with timeout, retry with different model
- **Internal:** Log error, return user-friendly message
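The retry behavior can be sketched as a generic backoff helper. The attempt count and delays below are illustrative defaults, not values taken from MiniSearch's implementation.

```typescript
// Sketch: retry an async operation with exponential backoff.
async function withRetry<T>(
  fn: () => Promise<T>,
  { attempts = 3, baseDelayMs = 1 } = {},
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Exponential backoff: base, 2x base, 4x base, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError; // all attempts exhausted
}
```

A fallback to a cheaper model would wrap this again: if `withRetry` rejects for the primary model, invoke it once more with the fallback model's request.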

### State Recovery

If generation fails mid-stream:
1. Set `textGenerationState` to `failed`
2. Preserve partial response in `responsePubSub`
3. Allow user to retry or modify query

## Performance Optimization

### Model Caching

All browser-based models cached in IndexedDB:
- WebLLM: `webllm/model-cache`
- Wllama: `wllama/model-cache`
- Subsequent loads: Instant (no re-download)

### Streaming Strategy

- Tokens streamed at 12 updates/second max (throttled)
- UI updates batched via React's automatic batching
- Web Workers used for non-blocking inference
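The 12-updates/second cap can be sketched as a simple throttle. The clock is injected so the behavior is deterministic; this is an illustrative sketch, not MiniSearch's actual throttling code, and it drops (rather than defers) updates that arrive too soon.

```typescript
// Sketch: allow at most `maxPerSecond` emissions; extra updates are dropped.
function createThrottle(maxPerSecond: number, now: () => number) {
  const minIntervalMs = 1000 / maxPerSecond;
  let lastEmit = -Infinity;
  return (emit: () => void): boolean => {
    const t = now();
    if (t - lastEmit < minIntervalMs) return false; // drop this update
    lastEmit = t;
    emit();
    return true;
  };
}
```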

### Progressive Model Loading

Wllama models are sharded (split into chunks):
1. Download metadata first (small, fast)
2. Download required shards progressively
3. Start inference when first shards available
4. Continue downloading remaining shards in background
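The steps above can be sketched with an injected shard downloader. The `fetchShard` signature, the "ready at half the shards" threshold, and the sequential download order are all illustrative assumptions, not Wllama's real loading API.

```typescript
// Sketch of progressive shard loading: signal readiness once enough
// shards are present, then keep downloading the rest.
interface ShardedModel { shardUrls: string[]; }

async function loadProgressively(
  model: ShardedModel,
  fetchShard: (url: string) => Promise<Uint8Array>,
  onReady: (loadedShards: number) => void,
  readyFraction = 0.5,
): Promise<Uint8Array[]> {
  const shards: Uint8Array[] = [];
  let readyFired = false;
  for (const url of model.shardUrls) {
    shards.push(await fetchShard(url));
    // Fire "ready" once enough shards are present to begin inference.
    if (!readyFired && shards.length >= model.shardUrls.length * readyFraction) {
      readyFired = true;
      onReady(shards.length);
    }
  }
  return shards;
}
```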

## Best Practices

### For Privacy-Critical Use
- Use Browser inference (WebLLM/Wllama)
- Disable `shareModelDownloads`
- Set `historyRetentionDays: 0` (no persistence)

### For Maximum Quality
- Use OpenAI GPT-4 or Internal API with large model
- Set `searchResultsToConsider: 5-10`
- Adjust temperature: 0.5-0.7 for factual, 0.8-1.0 for creative

### For Cost Efficiency
- Use AI Horde (free) or Browser inference (one-time download)
- Set `searchResultsToConsider: 3` (default)
- Limit `inferenceMaxTokens: 2048`

### For Teams/Enterprise
- Deploy Internal API with vLLM
- Set `ACCESS_KEYS` for access control
- Enable server-side logging
- Use consistent `INTERNAL_OPENAI_COMPATIBLE_API_MODEL`

## Related Topics

- **Configuration**: `docs/configuration.md` - Environment variables and settings
- **Conversation Memory**: `docs/conversation-memory.md` - Detailed token budgeting
- **UI Components**: `docs/ui-components.md` - How AI response UI works
- **Security**: `docs/security.md` - Privacy implications of each inference type