# AI Integration
MiniSearch supports four AI inference backends, each with different trade-offs between privacy, performance, and setup complexity.
## Inference Types Overview
| Type | Privacy | Speed | Setup | Best For |
|------|---------|-------|-------|----------|
| **Browser** (WebLLM/Wllama) | Maximum (no data leaves device) | Fast (WebGPU) / Slow (CPU) | None | Personal use, privacy-critical scenarios |
| **OpenAI** | Low (data sent to OpenAI) | Very Fast | API Key | Maximum quality, convenience |
| **AI Horde** | Medium (distributed volunteers) | Variable | Anonymous | Free GPU access, no setup |
| **Internal** | High (your infrastructure) | Depends on hardware | Self-hosted API | Teams, compliance requirements |
## Browser-Based Inference
Runs AI models entirely in the browser using WebAssembly or WebGPU. No data leaves the user's device.
### WebLLM (WebGPU Accelerated)
Uses `@mlc-ai/web-llm` for GPU-accelerated inference.
**Requirements:**
- Modern browser with WebGPU support (Chrome 113+, Edge 113+, Firefox Nightly)
- ~500MB-2GB free RAM
- GPU with F16 shader support (for optimal models)
**How It Works:**
1. User searches with "Enable AI Response" on
2. Library checks WebGPU availability and F16 shader support
3. Downloads model weights from HuggingFace (cached in IndexedDB)
4. Loads model into GPU memory
5. Generates response streaming tokens
**Model Selection:**
```typescript
// WebLLM model IDs from MLC registry
const models = {
fast: 'Qwen3-0.6B-q4f16_1-MLC', // 600M params, ~400MB
balanced: 'SmolLM2-1.7B-q4f16_1-MLC', // 1.7B params, ~1GB
capable: 'Llama-3.2-1B-q4f16_1-MLC' // 1B params, ~600MB
};
```
**Configuration:**
- Settings → Inference Type: `Browser`
- Settings → Browser Model: Select from dropdown
- Settings → Enable WebGPU: Toggle (auto-detected)
**Limitations:**
- First load requires model download (progressive via sharded files)
- Limited to smaller models (3B params max due to browser memory)
- Requires modern browser with WebGPU
### Wllama (CPU-Based)
Uses `@wllama/wllama` for CPU inference via WebAssembly.
**Requirements:**
- Any modern browser
- ~300MB-1GB free RAM
- No WebGPU required
**How It Works:**
1. Downloads model from HuggingFace (GGUF format)
2. Runs inference in WebAssembly (slower but universally compatible)
3. Supports 40+ pre-configured models
**Pre-configured Models:**
All stored at `Felladrin/gguf-sharded-*` on HuggingFace:
| Model | Params | Size | Speed | Quality |
|-------|--------|------|-------|---------|
| qwen-3-0.6b | 600M | ~400MB | Fast | Good |
| smollm2-1.7b | 1.7B | ~1.1GB | Medium | Better |
| llama-3.2-1b | 1B | ~650MB | Fast | Good |
| gemma-3-1b | 1B | ~650MB | Fast | Good |
| phi-4-mini | 3.8B | ~2.2GB | Slow | Best |
**Configuration:**
- Settings → Inference Type: `Browser`
- Settings → Use WebGPU: OFF
- Settings → Wllama Model: Select from dropdown
**Limitations:**
- Slower than WebGPU (2-5x slower)
- Same memory constraints
- No GPU acceleration
### WebLLM vs Wllama Decision Matrix
```
WebGPU Available?
├── Yes → WebLLM (F16 if supported, else F32)
└── No  → Wllama (CPU inference)
```
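The matrix above can be expressed as a small selection function. This is an illustrative sketch; `selectBackend` and the backend names are not actual MiniSearch exports.

```typescript
// Backend choice per the decision matrix: WebGPU → WebLLM (F16 when
// the shader feature is supported, else F32); otherwise Wllama on CPU.
type InferenceBackend = "webllm-f16" | "webllm-f32" | "wllama";

function selectBackend(
  webGpuAvailable: boolean,
  f16ShaderSupported: boolean,
): InferenceBackend {
  if (!webGpuAvailable) return "wllama"; // CPU inference via WebAssembly
  return f16ShaderSupported ? "webllm-f16" : "webllm-f32";
}
```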
**Code Detection:**
```typescript
// client/modules/webGpu.ts
export async function isWebGpuAvailable(): Promise<boolean> {
if (!navigator.gpu) return false;
try {
const adapter = await navigator.gpu.requestAdapter();
return !!adapter;
} catch {
return false;
}
}
export async function isF16ShaderSupported(): Promise<boolean> {
const adapter = await navigator.gpu?.requestAdapter();
return adapter?.features.has('shader-f16') ?? false;
}
```
## OpenAI API Integration
Uses OpenAI's API or any OpenAI-compatible service.
**Setup:**
1. Get API key from OpenAI or compatible provider
2. Settings → Inference Type: `OpenAI`
3. Settings → OpenAI API Key: Enter key
4. Settings → OpenAI Model: Select or enter model ID
**Supported Providers:**
- OpenAI (gpt-4, gpt-3.5-turbo)
- Anthropic (via OpenAI-compatible endpoint)
- Google (Gemini via OpenAI-compatible endpoint)
- Any custom provider with OpenAI-compatible API
**Features:**
- Streaming responses
- Auto model selection (if blank)
- Retry logic with fallback models
- Reasoning content support
**Configuration:**
```typescript
{
inferenceType: 'openai',
openaiApiKey: 'sk-xxx',
openaiModel: 'gpt-4', // Optional: auto-detected if empty
inferenceTemperature: 0.7,
inferenceMaxTokens: 4096
}
```
**Privacy Considerations:**
- Search queries and results sent to OpenAI
- Not suitable for sensitive data
- Consider internal API for private data
## AI Horde Integration
Uses aihorde.net, a distributed volunteer GPU network.
**Setup:**
1. Settings → Inference Type: `AI Horde`
2. (Optional) Settings → AI Horde API Key: Get from aihorde.net
3. Settings → AI Horde Model: Select preferred model
**How It Works:**
1. Request sent to AI Horde API
2. Distributed to volunteer workers
3. Multiple workers may process in parallel
4. First response wins (race condition handling)
5. Results streamed back
**Features:**
- Free to use (anonymous or authenticated)
- Kudos-based priority system
- Large model selection (70B+ params available)
- No API key required (but recommended for priority)
**Configuration:**
```typescript
{
inferenceType: 'horde',
aiHordeApiKey: '', // Optional
aiHordeModel: 'koboldcpp/LLaMA2-70B-Psyfighter2' // Optional
}
```
**Limitations:**
- Variable latency (depends on worker availability)
- Quality varies by worker
- May queue during high demand
- Requires internet connection
## Internal API Integration
Self-hosted OpenAI-compatible API for teams and compliance.
**Setup:**
1. Host an OpenAI-compatible API (e.g., vLLM, llama.cpp server, Ollama with OpenAI compat)
2. Configure environment variables (see `docs/configuration.md`)
3. Settings → Inference Type: `Internal`
**Environment Variables:**
```bash
INTERNAL_OPENAI_COMPATIBLE_API_BASE_URL="https://llm.company.com/v1"
INTERNAL_OPENAI_COMPATIBLE_API_KEY="sk-internal-xxx"
INTERNAL_OPENAI_COMPATIBLE_API_MODEL="llama-3.1-8b"
INTERNAL_OPENAI_COMPATIBLE_API_NAME="Company LLM"
```
**Server-Side Proxy:**
The internal API uses a server-side proxy to:
- Hide API keys from client
- Add request logging/auditing
- Apply rate limiting
- Enable token-based authentication
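The token-based authentication step of such a proxy could look like the following sketch; `authorize` is a hypothetical helper, not MiniSearch's actual server code.

```typescript
// Check the incoming Authorization header against the expected
// bearer token before forwarding anything upstream.
function authorize(
  authorizationHeader: string | undefined,
  expectedToken: string,
): boolean {
  if (!authorizationHeader?.startsWith("Bearer ")) return false;
  // A constant-time comparison would be preferable in production.
  return authorizationHeader.slice("Bearer ".length) === expectedToken;
}
```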
**Endpoint:**
```
POST /inference
Content-Type: application/json
Authorization: Bearer <VITE_SEARCH_TOKEN>
{
"messages": [...],
"model": "llama-3.1-8b",
"stream": true
}
```
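A client-side helper that assembles the request shown above might look like this sketch (`buildInferenceRequest` and the `ChatMessage` shape are illustrative):

```typescript
// Build the POST /inference request: JSON body with messages, model,
// and stream flag, plus the bearer token header.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function buildInferenceRequest(
  token: string,
  messages: ChatMessage[],
  model: string,
): { url: string; method: string; headers: Record<string, string>; body: string } {
  return {
    url: "/inference",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ messages, model, stream: true }),
  };
}
```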
**Features:**
- Private data stays in your infrastructure
- Custom model selection
- Server-side logging
- Compatible with any OpenAI-compatible API
**Recommended Self-Hosted Options:**
- **vLLM**: High-performance, production-ready
- **llama.cpp server**: Single binary, easy setup
- **Ollama**: Simple, Docker-friendly
- **text-generation-webui**: Feature-rich, UI included
## Text Generation Flow
### Search-Triggered Generation
```
User Query
↓
searchAndRespond() [client/modules/textGeneration.ts]
↓
startTextSearch() → searchText() [search.ts]
↓
Wait for search results
↓
canStartResponding() checks state
↓
Load AI model (if browser-based)
↓
Generate system prompt with search results
↓
Stream response via selected inference type
↓
Update PubSub channels (response, textGenerationState)
```
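Condensed, the pipeline above amounts to: wait for search results, build a system prompt from them, then generate. The sketch below injects the search and inference steps; all names are illustrative, not the actual `textGeneration.ts` functions.

```typescript
type SearchFn = (query: string) => Promise<string[]>;
type GenerateFn = (systemPrompt: string) => Promise<string>;

async function searchAndRespondSketch(
  query: string,
  search: SearchFn,
  generate: GenerateFn,
): Promise<string> {
  const results = await search(query); // wait for search results
  if (results.length === 0) return "No results to summarize.";
  // Generate the system prompt from the search results.
  const systemPrompt = `Answer using these search results:\n${results.join("\n")}`;
  return generate(systemPrompt); // stream via selected inference type
}
```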
### Chat Generation
```
User sends message
↓
generateChatResponse() [textGeneration.ts]
↓
Manage token budget (75% of 4096 = ~3072 tokens)
↓
Create conversation summary if needed (800-token limit)
↓
Build context: System prompt + Summary + Recent turns
↓
Call inference API (streaming)
↓
Update PubSub (chatMessages, response)
↓
Save to history database
```
## Conversation Memory
### Token Budget Management
- **Context Window:** 4096 tokens
- **Reserved for Response:** 25% (~1024 tokens)
- **Available for Context:** 75% (~3072 tokens)
**Allocation Priority:**
1. System prompt (with search results)
2. Conversation summary (if exists)
3. Recent chat messages (newest first)
4. Older messages (summarized or dropped)
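The budget split and allocation priority can be sketched as arithmetic. The constants mirror the documented 4096-token window with 25% reserved for the response; the function names are hypothetical.

```typescript
const CONTEXT_WINDOW = 4096;
const RESPONSE_RESERVE_RATIO = 0.25;

// Tokens available for context: 75% of the window.
function contextBudget(window = CONTEXT_WINDOW): number {
  return Math.floor(window * (1 - RESPONSE_RESERVE_RATIO));
}

// Fit messages newest-first into what remains after the system prompt
// and summary (priorities 1–2 consume `used`); older messages drop.
function fitMessages(
  tokenCounts: number[], // oldest → newest
  used: number,
  budget = contextBudget(),
): number[] {
  const kept: number[] = [];
  let remaining = budget - used;
  for (let i = tokenCounts.length - 1; i >= 0; i--) {
    if (tokenCounts[i] > remaining) break;
    remaining -= tokenCounts[i];
    kept.unshift(tokenCounts[i]);
  }
  return kept;
}
```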
### Rolling Summaries
When conversation exceeds token budget:
1. **Detect Overflow:** Current tokens > 3072
2. **Generate Summary:** Call LLM with 800-token limit
3. **Store Summary:** Save in conversationSummaryPubSub
4. **Drop Old Messages:** Remove summarized messages from chatMessages
5. **Continue:** Use summary + remaining messages for context
**Summary Prompt:**
```
Summarize this conversation in 3-5 sentences, preserving key facts
and user intent. Be concise but informative.
```
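The overflow path can be sketched with the summarizer injected. The 3072-token budget comes from the text above; the `Turn` shape, the `maybeSummarize` name, and the "summarize the older half" split are illustrative assumptions, not MiniSearch's exact policy.

```typescript
interface Turn {
  content: string;
  tokens: number;
}

async function maybeSummarize(
  turns: Turn[],
  budget: number,
  summarize: (turns: Turn[]) => Promise<string>,
): Promise<{ summary: string | null; turns: Turn[] }> {
  const total = turns.reduce((sum, turn) => sum + turn.tokens, 0);
  if (total <= budget) return { summary: null, turns }; // no overflow
  // Summarize the older half, keep the recent half verbatim.
  const cut = Math.floor(turns.length / 2);
  const summary = await summarize(turns.slice(0, cut));
  return { summary, turns: turns.slice(cut) };
}
```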
## Error Handling and Fallbacks
### Browser Inference Failures
```typescript
// If WebLLM fails with a WebGPU-related error, fall back to Wllama
try {
  await generateWithWebLLM();
} catch (error) {
  if (error instanceof Error && error.message.includes('WebGPU')) {
    // Auto-switch to Wllama
    settings.enableWebGpu = false;
    await generateWithWllama();
  } else {
    throw error;
  }
}
```
```
### API Failures
- **OpenAI:** Retry with exponential backoff, fallback to cheaper model
- **AI Horde:** Queue with timeout, retry with different model
- **Internal:** Log error, return user-friendly message
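One way to implement "retry with exponential backoff, fallback to cheaper model" is sketched below. The API call is injected so the sketch stays provider-agnostic; `retryWithFallback` is a hypothetical helper, not MiniSearch's actual retry code.

```typescript
async function retryWithFallback<T>(
  models: string[], // primary first, cheaper fallbacks after
  call: (model: string) => Promise<T>,
  baseDelayMs = 0, // exponential backoff base; 0 keeps the sketch fast
): Promise<T> {
  let lastError: unknown;
  for (const [attempt, model] of models.entries()) {
    try {
      return await call(model);
    } catch (error) {
      lastError = error;
      // Back off exponentially before trying the next (cheaper) model.
      await new Promise((res) => setTimeout(res, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```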
### State Recovery
If generation fails mid-stream:
1. Set `textGenerationState` to `failed`
2. Preserve partial response in `responsePubSub`
3. Allow user to retry or modify query
## Performance Optimization
### Model Caching
All browser-based models cached in IndexedDB:
- WebLLM: `webllm/model-cache`
- Wllama: `wllama/model-cache`
- Subsequent loads: Instant (no re-download)
### Streaming Strategy
- Tokens streamed at 12 updates/second max (throttled)
- UI updates batched via React's automatic batching
- Web Workers used for non-blocking inference
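The "12 updates/second max" throttle can be sketched as follows: an update is emitted only when at least 1000/12 ≈ 83 ms have passed since the last emit. Time is passed in as a parameter so the sketch stays deterministic; `createThrottle` is an illustrative name.

```typescript
function createThrottle(maxPerSecond: number) {
  const minIntervalMs = 1000 / maxPerSecond;
  let lastEmit = -Infinity;
  // Returns true if an update should be emitted at time `now` (ms).
  return (now: number): boolean => {
    if (now - lastEmit < minIntervalMs) return false;
    lastEmit = now;
    return true;
  };
}
```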
### Progressive Model Loading
Wllama models are sharded (split into chunks):
1. Download metadata first (small, fast)
2. Download required shards progressively
3. Start inference when first shards available
4. Continue downloading remaining shards in background
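The steps above can be sketched with the shard fetcher injected. The function shape is illustrative and does not reflect the actual GGUF sharding scheme; in particular, the real implementation continues downloading in the background rather than sequentially blocking.

```typescript
async function loadShards(
  shardUrls: string[],
  fetchShard: (url: string) => Promise<Uint8Array>,
  onFirstShard: () => void, // e.g. start inference early
): Promise<Uint8Array[]> {
  const shards: Uint8Array[] = [];
  for (const [index, url] of shardUrls.entries()) {
    shards.push(await fetchShard(url)); // progressive download
    if (index === 0) onFirstShard(); // inference can begin here
  }
  return shards;
}
```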
## Best Practices
### For Privacy-Critical Use
- Use Browser inference (WebLLM/Wllama)
- Disable `shareModelDownloads`
- Set `historyRetentionDays: 0` (no persistence)
### For Maximum Quality
- Use OpenAI GPT-4 or Internal API with large model
- Set `searchResultsToConsider: 5-10`
- Adjust temperature: 0.5-0.7 for factual, 0.8-1.0 for creative
### For Cost Efficiency
- Use AI Horde (free) or Browser inference (one-time download)
- Set `searchResultsToConsider: 3` (default)
- Limit `inferenceMaxTokens: 2048`
### For Teams/Enterprise
- Deploy Internal API with vLLM
- Set `ACCESS_KEYS` for access control
- Enable server-side logging
- Use consistent `INTERNAL_OPENAI_COMPATIBLE_API_MODEL`
## Related Topics
- **Configuration**: `docs/configuration.md` - Environment variables and settings
- **Conversation Memory**: `docs/conversation-memory.md` - Detailed token budgeting
- **UI Components**: `docs/ui-components.md` - How AI response UI works
- **Security**: `docs/security.md` - Privacy implications of each inference type