# AI Integration

MiniSearch supports four AI inference backends, each with different trade-offs between privacy, performance, and setup complexity.

## Inference Types Overview

| Type | Privacy | Speed | Setup | Best For |
|------|---------|-------|-------|----------|
| **Browser** (WebLLM/Wllama) | Maximum (no data leaves device) | Fast (WebGPU) / Slow (CPU) | None | Personal use, privacy-critical scenarios |
| **OpenAI** | Low (data sent to OpenAI) | Very Fast | API Key | Maximum quality, convenience |
| **AI Horde** | Medium (distributed volunteers) | Variable | Anonymous | Free GPU access, no setup |
| **Internal** | High (your infrastructure) | Depends on hardware | Self-hosted API | Teams, compliance requirements |

## Browser-Based Inference

Runs AI models entirely in the browser using WebAssembly or WebGPU. No data leaves the user's device.

### WebLLM (WebGPU Accelerated)

Uses `@mlc-ai/web-llm` for GPU-accelerated inference.

**Requirements:**

- Modern browser with WebGPU support (Chrome 113+, Edge 113+, Firefox Nightly)
- ~500MB-2GB free RAM
- GPU with F16 shader support (for optimal models)

**How It Works:**

1. User searches with "Enable AI Response" on
2. Library checks WebGPU availability and F16 shader support
3. Downloads model weights from HuggingFace (cached in IndexedDB)
4. Loads model into GPU memory
5. Generates the response, streaming tokens as they are produced

**Model Selection:**

```typescript
// WebLLM model IDs from the MLC registry
const models = {
  fast: 'Qwen3-0.6B-q4f16_1-MLC',       // 600M params, ~400MB
  balanced: 'SmolLM2-1.7B-q4f16_1-MLC', // 1.7B params, ~1GB
  capable: 'Llama-3.2-1B-q4f16_1-MLC'   // 1B params, ~600MB
};
```
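
To make the download/load/stream steps concrete, here is a minimal sketch using the `@mlc-ai/web-llm` engine API. The chosen model ID and prompt are illustrative, and the exact options may differ from MiniSearch's actual integration:

```typescript
// Minimal sketch (assumes @mlc-ai/web-llm); not MiniSearch's actual code.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function runWebLlmExample(): Promise<string> {
  // Downloads/caches the model weights, then loads them into GPU memory.
  const engine = await CreateMLCEngine("Qwen3-0.6B-q4f16_1-MLC", {
    initProgressCallback: (report) => console.log(report.text),
  });

  // Stream tokens as they are generated.
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Summarize these search results..." }],
    stream: true,
  });

  let response = "";
  for await (const chunk of chunks) {
    response += chunk.choices[0]?.delta?.content ?? "";
  }
  return response;
}
```
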
**Configuration:**

- Settings → Inference Type: `Browser`
- Settings → Browser Model: Select from dropdown
- Settings → Enable WebGPU: Toggle (auto-detected)

**Limitations:**

- First load requires model download (progressive via sharded files)
- Limited to smaller models (3B params max due to browser memory)
- Requires modern browser with WebGPU

### Wllama (CPU-Based)

Uses `@wllama/wllama` for CPU inference via WebAssembly.

**Requirements:**

- Any modern browser
- ~300MB-1GB free RAM
- No WebGPU required

**How It Works:**

1. Downloads model from HuggingFace (GGUF format)
2. Runs inference in WebAssembly (slower but universally compatible)
3. Supports 40+ pre-configured models

**Pre-configured Models:**

All stored at `Felladrin/gguf-sharded-*` on HuggingFace:

| Model | Params | Size | Speed | Quality |
|-------|--------|------|-------|---------|
| qwen-3-0.6b | 600M | ~400MB | Fast | Good |
| smollm2-1.7b | 1.7B | ~1.1GB | Medium | Better |
| llama-3.2-1b | 1B | ~650MB | Fast | Good |
| gemma-3-1b | 1B | ~650MB | Fast | Good |
| phi-4-mini | 3.8B | ~2.2GB | Slow | Best |
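
As a rough illustration of the Wllama flow (fetch a sharded GGUF model, then run CPU inference in WebAssembly), a sketch along these lines is possible. The WASM paths, repository URL, and option values are assumptions and may not match MiniSearch's actual wrapper:

```typescript
// Illustrative sketch (assumes the @wllama/wllama API); not MiniSearch's actual code.
import { Wllama } from "@wllama/wllama";

// Paths to the WASM binaries shipped with the package (illustrative locations).
const wasmPaths = {
  "single-thread/wllama.wasm": "/wasm/single-thread/wllama.wasm",
  "multi-thread/wllama.wasm": "/wasm/multi-thread/wllama.wasm",
};

async function runWllamaExample(): Promise<string> {
  const wllama = new Wllama(wasmPaths);

  // Sharded GGUF models are fetched shard by shard and cached by the library.
  // The repository name below is a placeholder, not a real model path.
  await wllama.loadModelFromUrl(
    "https://huggingface.co/Felladrin/gguf-sharded-example/resolve/main/model-00001-of-00005.gguf",
  );

  // Plain text completion; nPredict caps the number of generated tokens.
  return wllama.createCompletion("Summarize these search results...", {
    nPredict: 256,
    sampling: { temp: 0.7 },
  });
}
```
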
**Configuration:**

- Settings → Inference Type: `Browser`
- Settings → Use WebGPU: OFF
- Settings → Wllama Model: Select from dropdown

**Limitations:**

- Typically 2-5x slower than WebGPU
- Same memory constraints
- No GPU acceleration

### WebLLM vs Wllama Decision Matrix

```
WebGPU Available?
├── Yes → WebLLM (F16 if supported, else F32)
└── No  → Wllama (CPU inference)
```

**Code Detection:**

```typescript
// client/modules/webGpu.ts
export async function isWebGpuAvailable(): Promise<boolean> {
  if (!navigator.gpu) return false;
  try {
    // An adapter is only returned when a usable GPU is present.
    const adapter = await navigator.gpu.requestAdapter();
    return !!adapter;
  } catch {
    return false;
  }
}

export async function isF16ShaderSupported(): Promise<boolean> {
  const adapter = await navigator.gpu?.requestAdapter();
  return adapter?.features.has('shader-f16') ?? false;
}
```
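
Putting the decision matrix and the detection helpers together, backend selection might look roughly like the following. The `selectBrowserBackend` function and its return shape are illustrative, not part of the codebase:

```typescript
// Illustrative only: combines the detection helpers above into a backend choice.
import { isWebGpuAvailable, isF16ShaderSupported } from "./webGpu";

type BrowserBackend =
  | { engine: "webllm"; precision: "f16" | "f32" }
  | { engine: "wllama" };

export async function selectBrowserBackend(): Promise<BrowserBackend> {
  if (await isWebGpuAvailable()) {
    // Prefer F16 model variants when the GPU exposes the shader-f16 feature.
    const precision = (await isF16ShaderSupported()) ? "f16" : "f32";
    return { engine: "webllm", precision };
  }
  // No WebGPU: fall back to CPU inference via WebAssembly.
  return { engine: "wllama" };
}
```
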
## OpenAI API Integration

Uses OpenAI's API or any OpenAI-compatible service.

**Setup:**

1. Get API key from OpenAI or compatible provider
2. Settings → Inference Type: `OpenAI`
3. Settings → OpenAI API Key: Enter key
4. Settings → OpenAI Model: Select or enter model ID

**Supported Providers:**

- OpenAI (gpt-4, gpt-3.5-turbo)
- Anthropic (via OpenAI-compatible endpoint)
- Google (Gemini via OpenAI-compatible endpoint)
- Any custom provider with OpenAI-compatible API

**Features:**

- Streaming responses
- Auto model selection (if blank)
- Retry logic with fallback models
- Reasoning content support

**Configuration:**

```typescript
{
  inferenceType: 'openai',
  openaiApiKey: 'sk-xxx',
  openaiModel: 'gpt-4', // Optional: auto-detected if empty
  inferenceTemperature: 0.7,
  inferenceMaxTokens: 4096
}
```
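
For reference, these settings map onto a standard OpenAI-compatible streaming call. A minimal sketch using the official `openai` client follows; the `baseURL`, model, and function name are illustrative:

```typescript
// Illustrative sketch using the official openai client; not MiniSearch's actual code.
import OpenAI from "openai";

async function streamOpenAiResponse(apiKey: string, prompt: string): Promise<string> {
  const client = new OpenAI({
    apiKey,
    baseURL: "https://api.openai.com/v1", // any OpenAI-compatible provider works here
    dangerouslyAllowBrowser: true,        // required when calling directly from the browser
  });

  const stream = await client.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
    temperature: 0.7,
    max_tokens: 4096,
    stream: true,
  });

  let text = "";
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content ?? "";
  }
  return text;
}
```
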
**Privacy Considerations:**

- Search queries and results are sent to OpenAI
- Not suitable for sensitive data
- Consider the internal API for private data

## AI Horde Integration

Uses aihorde.net, a distributed volunteer GPU network.

**Setup:**

1. Settings → Inference Type: `AI Horde`
2. (Optional) Settings → AI Horde API Key: Get from aihorde.net
3. Settings → AI Horde Model: Select preferred model

**How It Works:**

1. Request sent to AI Horde API
2. Distributed to volunteer workers
3. Multiple workers may process in parallel
4. The first completed response wins
5. Results streamed back

**Features:**

- Free to use (anonymous or authenticated)
- Kudos-based priority system
- Large model selection (70B+ params available)
- No API key required (but recommended for priority)

**Configuration:**

```typescript
{
  inferenceType: 'horde',
  aiHordeApiKey: '', // Optional
  aiHordeModel: 'koboldcpp/LLaMA2-70B-Psyfighter2' // Optional
}
```
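
Under the hood this is an asynchronous submit-and-poll API. A rough sketch against the public aihorde.net endpoints is shown below; the anonymous key, parameter names, and response fields are assumptions based on the public API documentation and may drift:

```typescript
// Rough sketch of the AI Horde async text API; not MiniSearch's actual code.
const HORDE_BASE = "https://aihorde.net/api/v2";
const ANONYMOUS_KEY = "0000000000"; // anonymous usage; a real key gets higher priority

async function generateViaAiHorde(prompt: string, apiKey = ANONYMOUS_KEY): Promise<string> {
  // 1. Submit the request; the Horde returns a job id immediately.
  const submit = await fetch(`${HORDE_BASE}/generate/text/async`, {
    method: "POST",
    headers: { "Content-Type": "application/json", apikey: apiKey },
    body: JSON.stringify({ prompt, params: { max_length: 512 } }),
  });
  const { id } = await submit.json();

  // 2. Poll until a volunteer worker has finished the job.
  for (;;) {
    await new Promise((resolve) => setTimeout(resolve, 2000));
    const status = await fetch(`${HORDE_BASE}/generate/text/status/${id}`);
    const data = await status.json();
    if (data.done) return data.generations?.[0]?.text ?? "";
  }
}
```
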
**Limitations:**

- Variable latency (depends on worker availability)
- Quality varies by worker
- May queue during high demand
- Requires internet connection

## Internal API Integration

Self-hosted OpenAI-compatible API for teams and compliance.

**Setup:**

1. Host an OpenAI-compatible API (e.g., vLLM, llama.cpp server, Ollama with OpenAI compat)
2. Configure environment variables (see `docs/configuration.md`)
3. Settings → Inference Type: `Internal`

**Environment Variables:**

```bash
INTERNAL_OPENAI_COMPATIBLE_API_BASE_URL="https://llm.company.com/v1"
INTERNAL_OPENAI_COMPATIBLE_API_KEY="sk-internal-xxx"
INTERNAL_OPENAI_COMPATIBLE_API_MODEL="llama-3.1-8b"
INTERNAL_OPENAI_COMPATIBLE_API_NAME="Company LLM"
```

**Server-Side Proxy:**

The internal API uses a server-side proxy to:

- Hide API keys from client
- Add request logging/auditing
- Apply rate limiting
- Enable token-based authentication

**Endpoint:**

```
POST /inference
Content-Type: application/json
Authorization: Bearer <VITE_SEARCH_TOKEN>

{
  "messages": [...],
  "model": "llama-3.1-8b",
  "stream": true
}
```
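
A minimal client-side sketch of calling this proxy and consuming the streamed body, assuming the proxy streams plain text chunks (the exact wire format may differ):

```typescript
// Illustrative only; assumes the /inference proxy streams plain text chunks.
async function callInternalInference(
  messages: Array<{ role: string; content: string }>,
): Promise<string> {
  const response = await fetch("/inference", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${import.meta.env.VITE_SEARCH_TOKEN}`,
    },
    body: JSON.stringify({ messages, model: "llama-3.1-8b", stream: true }),
  });
  if (!response.ok || !response.body) throw new Error(`Inference failed: ${response.status}`);

  // Read the streamed response incrementally.
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let text = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    text += decoder.decode(value, { stream: true });
  }
  return text;
}
```
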
**Features:**

- Private data stays in your infrastructure
- Custom model selection
- Server-side logging
- Compatible with any OpenAI-compatible API

**Recommended Self-Hosted Options:**

- **vLLM**: High-performance, production-ready
- **llama.cpp server**: Single binary, easy setup
- **Ollama**: Simple, Docker-friendly
- **text-generation-webui**: Feature-rich, UI included

## Text Generation Flow

### Search-Triggered Generation

```
User Query
    ↓
searchAndRespond() [client/modules/textGeneration.ts]
    ↓
startTextSearch() → searchText() [search.ts]
    ↓
Wait for search results
    ↓
canStartResponding() checks state
    ↓
Load AI model (if browser-based)
    ↓
Generate system prompt with search results
    ↓
Stream response via selected inference type
    ↓
Update PubSub channels (response, textGenerationState)
```
### Chat Generation

```
User sends message
    ↓
generateChatResponse() [textGeneration.ts]
    ↓
Manage token budget (75% of 4096 = 3072 tokens)
    ↓
Create conversation summary if needed (800-token limit)
    ↓
Build context: System prompt + Summary + Recent turns
    ↓
Call inference API (streaming)
    ↓
Update PubSub (chatMessages, response)
    ↓
Save to history database
```

## Conversation Memory

### Token Budget Management

- **Context Window:** 4096 tokens
- **Reserved for Response:** 25% (1024 tokens)
- **Available for Context:** 75% (3072 tokens)

**Allocation Priority:**

1. System prompt (with search results)
2. Conversation summary (if exists)
3. Recent chat messages (newest first)
4. Older messages (summarized or dropped)
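
Combining the budget numbers and the allocation priority above, context assembly can be pictured roughly like this. The helper names and the crude token counter are illustrative, not the project's real implementation:

```typescript
// Illustrative token budget sketch; countTokens() stands in for a real tokenizer.
const CONTEXT_WINDOW = 4096;
const RESPONSE_RESERVE = Math.floor(CONTEXT_WINDOW * 0.25); // 1024 tokens kept for the reply
const CONTEXT_BUDGET = CONTEXT_WINDOW - RESPONSE_RESERVE;   // 3072 tokens for the prompt

interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }

// Crude approximation of a real tokenizer (illustrative only).
const countTokens = (text: string): number => Math.ceil(text.length / 4);

function buildContext(
  systemPrompt: string,          // 1. system prompt (with search results)
  summary: string | null,        // 2. conversation summary, if one exists
  turns: ChatMessage[],          // 3./4. chat messages, oldest first
): ChatMessage[] {
  const head: ChatMessage[] = [{ role: "system", content: systemPrompt }];
  let used = countTokens(systemPrompt);

  if (summary) {
    head.push({ role: "system", content: `Conversation summary: ${summary}` });
    used += countTokens(summary);
  }

  // Keep the newest messages that still fit; older ones are dropped (or summarized).
  const recent: ChatMessage[] = [];
  for (const turn of [...turns].reverse()) {
    const cost = countTokens(turn.content);
    if (used + cost > CONTEXT_BUDGET) break;
    recent.unshift(turn);
    used += cost;
  }
  return [...head, ...recent];
}
```
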
### Rolling Summaries

When the conversation exceeds the token budget:

1. **Detect Overflow:** Current tokens > 3072
2. **Generate Summary:** Call LLM with 800-token limit
3. **Store Summary:** Save in conversationSummaryPubSub
4. **Drop Old Messages:** Remove summarized messages from chatMessages
5. **Continue:** Use summary + remaining messages for context

**Summary Prompt:**

```
Summarize this conversation in 3-5 sentences, preserving key facts
and user intent. Be concise but informative.
```

## Error Handling and Fallbacks

### Browser Inference Failures

```typescript
// If WebLLM fails, fall back to Wllama
try {
  await generateWithWebLLM();
} catch (error) {
  if (error instanceof Error && error.message.includes('WebGPU')) {
    // WebGPU-related failure: auto-switch to CPU inference via Wllama
    settings.enableWebGpu = false;
    await generateWithWllama();
  }
}
```

### API Failures

- **OpenAI:** Retry with exponential backoff, fall back to a cheaper model
- **AI Horde:** Queue with timeout, retry with a different model
- **Internal:** Log error, return a user-friendly message
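
As an illustration of the retry-with-exponential-backoff behavior, a generic helper could look like the sketch below. The attempt count, delays, and fallback pattern are assumptions, not the project's actual values:

```typescript
// Generic retry-with-exponential-backoff sketch; not MiniSearch's actual implementation.
async function withRetries<T>(
  attempt: () => Promise<T>,
  { maxAttempts = 3, baseDelayMs = 500 } = {},
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (error) {
      lastError = error;
      // Exponential backoff: 500ms, 1000ms, 2000ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}

// Usage (illustrative): wrap the primary model call in withRetries(), and chain a
// .catch() that retries once more with a cheaper fallback model.
```
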
### State Recovery

If generation fails mid-stream:

1. Set `textGenerationState` to `failed`
2. Preserve partial response in `responsePubSub`
3. Allow user to retry or modify query

## Performance Optimization

### Model Caching

All browser-based models cached in IndexedDB:

- WebLLM: `webllm/model-cache`
- Wllama: `wllama/model-cache`
- Subsequent loads: Instant (no re-download)

### Streaming Strategy

- Tokens streamed at 12 updates/second max (throttled)
- UI updates batched via React's automatic batching
- Web Workers used for non-blocking inference
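
For the 12-updates-per-second cap, a simple time-based throttle is enough. A sketch follows; the interval constant is derived from the rate above and the rest (function name, PubSub hook) is illustrative:

```typescript
// Illustrative throttle: forward at most ~12 UI updates per second.
const MIN_UPDATE_INTERVAL_MS = 1000 / 12; // ≈83ms between UI updates

function createThrottledUpdater(publish: (text: string) => void) {
  let pending = "";
  let lastUpdate = 0;

  const flush = () => {
    publish(pending); // push the accumulated text to the UI (e.g. a PubSub channel)
    lastUpdate = Date.now();
  };

  const push = (token: string) => {
    pending += token;
    if (Date.now() - lastUpdate >= MIN_UPDATE_INTERVAL_MS) flush();
  };

  return { push, flush }; // call flush() once the token stream completes
}
```
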
### Progressive Model Loading

Wllama models are sharded (split into chunks):

1. Download metadata first (small, fast)
2. Download required shards progressively
3. Start inference when first shards available
4. Continue downloading remaining shards in background

## Best Practices

### For Privacy-Critical Use

- Use Browser inference (WebLLM/Wllama)
- Disable `shareModelDownloads`
- Set `historyRetentionDays: 0` (no persistence)

### For Maximum Quality

- Use OpenAI GPT-4 or Internal API with a large model
- Set `searchResultsToConsider: 5-10`
- Adjust temperature: 0.5-0.7 for factual, 0.8-1.0 for creative

### For Cost Efficiency

- Use AI Horde (free) or Browser inference (one-time download)
- Set `searchResultsToConsider: 3` (default)
- Limit `inferenceMaxTokens: 2048`

### For Teams/Enterprise

- Deploy Internal API with vLLM
- Set `ACCESS_KEYS` for access control
- Enable server-side logging
- Use consistent `INTERNAL_OPENAI_COMPATIBLE_API_MODEL`

## Related Topics

- **Configuration**: `docs/configuration.md` - Environment variables and settings
- **Conversation Memory**: `docs/conversation-memory.md` - Detailed token budgeting
- **UI Components**: `docs/ui-components.md` - How AI response UI works
- **Security**: `docs/security.md` - Privacy implications of each inference type