AI Integration
MiniSearch supports four AI inference backends, each with different trade-offs between privacy, performance, and setup complexity.
Inference Types Overview
| Type | Privacy | Speed | Setup | Best For |
|---|---|---|---|---|
| Browser (WebLLM/Wllama) | Maximum (no data leaves device) | Fast (WebGPU) / Slow (CPU) | None | Personal use, privacy-critical scenarios |
| OpenAI | Low (data sent to OpenAI) | Very Fast | API Key | Maximum quality, convenience |
| AI Horde | Medium (distributed volunteers) | Variable | Anonymous | Free GPU access, no setup |
| Internal | High (your infrastructure) | Depends on hardware | Self-hosted API | Teams, compliance requirements |
Browser-Based Inference
Runs AI models entirely in the browser using WebAssembly or WebGPU. No data leaves the user's device.
WebLLM (WebGPU Accelerated)
Uses @mlc-ai/web-llm for GPU-accelerated inference.
Requirements:
- Modern browser with WebGPU support (Chrome 113+, Edge 113+, Firefox Nightly)
- ~500MB-2GB free RAM
- GPU with F16 shader support (for optimal models)
How It Works:
- User searches with "Enable AI Response" on
- Library checks WebGPU availability and F16 shader support
- Downloads model weights from HuggingFace (cached in IndexedDB)
- Loads model into GPU memory
- Generates the response, streaming tokens as they are produced
Model Selection:
// WebLLM model IDs from MLC registry
const models = {
  fast: 'Qwen3-0.6B-q4f16_1-MLC', // 600M params, ~400MB
  balanced: 'SmolLM2-1.7B-q4f16_1-MLC', // 1.7B params, ~1GB
  capable: 'Llama-3.2-1B-q4f16_1-MLC' // 1B params, ~600MB
};
Configuration:
- Settings → Inference Type: Browser
- Settings → Browser Model: Select from dropdown
- Settings → Enable WebGPU: Toggle (auto-detected)
Limitations:
- First load requires model download (progressive via sharded files)
- Limited to smaller models (3B params max due to browser memory)
- Requires modern browser with WebGPU
Wllama (CPU-Based)
Uses @wllama/wllama for CPU inference via WebAssembly.
Requirements:
- Any modern browser
- ~300MB-1GB free RAM
- No WebGPU required
How It Works:
- Downloads model from HuggingFace (GGUF format)
- Runs inference in WebAssembly (slower but universally compatible)
- Supports 40+ pre-configured models
Pre-configured Models:
All stored at Felladrin/gguf-sharded-* on HuggingFace:
| Model | Params | Size | Speed | Quality |
|---|---|---|---|---|
| qwen-3-0.6b | 600M | ~400MB | Fast | Good |
| smollm2-1.7b | 1.7B | ~1.1GB | Medium | Better |
| llama-3.2-1b | 1B | ~650MB | Fast | Good |
| gemma-3-1b | 1B | ~650MB | Fast | Good |
| phi-4-mini | 3.8B | ~2.2GB | Slow | Best |
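As a rough illustration of how the table might guide model choice, the sketch below picks the largest listed model that fits a given memory budget. The helper `largestModelWithin` and the size-based heuristic are assumptions made for this example and are not part of MiniSearch; the sizes are the approximate figures from the table.

```typescript
// Approximate download sizes taken from the table above (in MB).
interface WllamaModel {
  id: string;
  sizeMb: number;
}

const wllamaModels: WllamaModel[] = [
  { id: "qwen-3-0.6b", sizeMb: 400 },
  { id: "smollm2-1.7b", sizeMb: 1100 },
  { id: "llama-3.2-1b", sizeMb: 650 },
  { id: "gemma-3-1b", sizeMb: 650 },
  { id: "phi-4-mini", sizeMb: 2200 },
];

// Hypothetical helper: returns the largest model whose size fits the budget,
// or undefined when nothing fits. Ties keep the earlier table entry.
function largestModelWithin(budgetMb: number): WllamaModel | undefined {
  let best: WllamaModel | undefined;
  for (const model of wllamaModels) {
    if (model.sizeMb > budgetMb) continue;
    if (best === undefined || model.sizeMb > best.sizeMb) best = model;
  }
  return best;
}
```

With a 700 MB budget this selects `llama-3.2-1b`; a real selector would probably also weigh speed and quality, not size alone.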
Configuration:
- Settings → Inference Type: Browser
- Settings → Use WebGPU: OFF
- Settings → Wllama Model: Select from dropdown
Limitations:
- Roughly 2-5x slower than WebGPU inference
- Same memory constraints
- No GPU acceleration
WebLLM vs Wllama Decision Matrix
WebGPU Available?
├── Yes → WebLLM (F16 if supported, else F32)
└── No → Wllama (CPU inference)
Code Detection:
// client/modules/webGpu.ts
export async function isWebGpuAvailable(): Promise<boolean> {
  if (!navigator.gpu) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return !!adapter;
  } catch {
    return false;
  }
}

export async function isF16ShaderSupported(): Promise<boolean> {
  const adapter = await navigator.gpu?.requestAdapter();
  return adapter?.features.has('shader-f16') ?? false;
}
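The decision matrix can be expressed as a small pure function that takes the two detection results as input. This is a minimal sketch: `chooseBrowserBackend`, `GpuCapabilities`, and the backend labels are hypothetical names introduced here, not identifiers from the MiniSearch codebase.

```typescript
type BrowserBackend = "webllm-f16" | "webllm-f32" | "wllama";

interface GpuCapabilities {
  webGpuAvailable: boolean;
  f16ShaderSupported: boolean;
}

// Hypothetical helper mirroring the decision tree:
// WebGPU + F16 -> WebLLM (F16), WebGPU only -> WebLLM (F32), otherwise Wllama.
function chooseBrowserBackend(caps: GpuCapabilities): BrowserBackend {
  if (!caps.webGpuAvailable) return "wllama";
  return caps.f16ShaderSupported ? "webllm-f16" : "webllm-f32";
}
```

In the browser, the two flags would presumably come from `isWebGpuAvailable()` and `isF16ShaderSupported()`; keeping the decision separate from detection makes it easy to test without a GPU.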
OpenAI API Integration
Uses OpenAI's API or any OpenAI-compatible service.
Setup:
- Get API key from OpenAI or compatible provider
- Settings → Inference Type: OpenAI
- Settings → OpenAI API Key: Enter key
- Settings → OpenAI Model: Select or enter model ID
Supported Providers:
- OpenAI (gpt-4, gpt-3.5-turbo)
- Anthropic (via OpenAI-compatible endpoint)
- Google (Gemini via OpenAI-compatible endpoint)
- Any custom provider with OpenAI-compatible API
Features:
- Streaming responses
- Auto model selection (if blank)
- Retry logic with fallback models
- Reasoning content support
Configuration:
{
  inferenceType: 'openai',
  openaiApiKey: 'sk-xxx',
  openaiModel: 'gpt-4', // Optional: auto-detected if empty
  inferenceTemperature: 0.7,
  inferenceMaxTokens: 4096
}
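To show how these settings might map onto an OpenAI-style chat completion call, the sketch below builds the request without sending it. The helper `buildChatCompletionRequest`, its types, and the default base URL are assumptions for illustration; how MiniSearch actually issues the request is not shown in this document.

```typescript
interface OpenAiSettings {
  openaiApiKey: string;
  openaiModel: string;
  inferenceTemperature: number;
  inferenceMaxTokens: number;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed default; OpenAI-compatible providers expose the same path under their own base URL.
const DEFAULT_BASE_URL = "https://api.openai.com/v1";

// Hypothetical helper: builds the URL and fetch options for a streaming
// chat completion request. It does not perform the request itself.
function buildChatCompletionRequest(
  settings: OpenAiSettings,
  messages: ChatMessage[],
  baseUrl: string = DEFAULT_BASE_URL,
) {
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${settings.openaiApiKey}`,
      },
      body: JSON.stringify({
        model: settings.openaiModel,
        messages,
        temperature: settings.inferenceTemperature,
        max_tokens: settings.inferenceMaxTokens,
        stream: true, // responses are streamed token by token
      }),
    },
  };
}
```

The returned `url` and `init` can be passed to `fetch(url, init)`. Unlike the app, which auto-detects a model when the field is empty, this sketch expects an explicit model ID.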
Privacy Considerations:
- Search queries and results sent to OpenAI
- Not suitable for sensitive data
- Consider internal API for private data
AI Horde Integration
Uses aihorde.net, a distributed volunteer GPU network.
Setup:
- Settings → Inference Type: AI Horde
- (Optional) Settings → AI Horde API Key: Get from aihorde.net
- Settings → AI Horde Model: Select preferred model
How It Works:
- Request sent to AI Horde API
- Distributed to volunteer workers
- Multiple workers may process in parallel
- First response wins (race condition handling)
- Results streamed back
Features:
- Free to use (anonymous or authenticated)
- Kudos-based priority system
- Large model selection (70B+ params available)
- No API key required (but recommended for priority)
Configuration:
{
  inferenceType: 'horde',
  aiHordeApiKey: '', // Optional
  aiHordeModel: 'koboldcpp/LLaMA2-70B-Psyfighter2' // Optional
}
Limitations:
- Variable latency (depends on worker availability)
- Quality varies by worker
- May queue during high demand
- Requires internet connection
Internal API Integration
Self-hosted OpenAI-compatible API for teams and compliance.
Setup:
- Host an OpenAI-compatible API (e.g., vLLM, llama.cpp server, Ollama with OpenAI compat)
- Configure environment variables (see docs/configuration.md)
- Settings → Inference Type: Internal
Environment Variables:
INTERNAL_OPENAI_COMPATIBLE_API_BASE_URL="https://llm.company.com/v1"
INTERNAL_OPENAI_COMPATIBLE_API_KEY="sk-internal-xxx"
INTERNAL_OPENAI_COMPATIBLE_API_MODEL="llama-3.1-8b"
INTERNAL_OPENAI_COMPATIBLE_API_NAME="Company LLM"
Server-Side Proxy: The internal API uses a server-side proxy to:
- Hide API keys from client
- Add request logging/auditing
- Apply rate limiting
- Enable token-based authentication
Endpoint:
POST /inference
Content-Type: application/json
Authorization: Bearer <VITE_SEARCH_TOKEN>
{
  "messages": [...],
  "model": "llama-3.1-8b",
  "stream": true
}
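The proxy idea (check the client's token, then attach the server-held key before forwarding) can be sketched as a pure function. Everything here is illustrative: `planUpstreamRequest`, `ProxyEnv`, and the assumption that the upstream path is `/chat/completions` are not taken from the MiniSearch server code.

```typescript
interface ProxyEnv {
  baseUrl: string; // e.g. INTERNAL_OPENAI_COMPATIBLE_API_BASE_URL
  apiKey: string; // e.g. INTERNAL_OPENAI_COMPATIBLE_API_KEY
  model: string; // e.g. INTERNAL_OPENAI_COMPATIBLE_API_MODEL
}

interface IncomingRequest {
  authorization?: string;
  body: { messages: unknown[]; model?: string; stream?: boolean };
}

type ProxyPlan =
  | { ok: false; status: 401 }
  | { ok: true; upstreamUrl: string; headers: Record<string, string>; body: string };

// Hypothetical handler core: rejects requests without the expected bearer
// token, otherwise builds the upstream request with the server-side API key.
function planUpstreamRequest(
  env: ProxyEnv,
  searchToken: string,
  req: IncomingRequest,
): ProxyPlan {
  // Plain comparison for brevity; a production check may prefer a constant-time compare.
  if (req.authorization !== `Bearer ${searchToken}`) {
    return { ok: false, status: 401 };
  }
  return {
    ok: true,
    upstreamUrl: `${env.baseUrl}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${env.apiKey}`, // the key never reaches the client
    },
    body: JSON.stringify({
      messages: req.body.messages,
      model: req.body.model ?? env.model,
      stream: req.body.stream ?? true,
    }),
  };
}
```

Because the client only ever sends the search token, the internal API key stays on the server, which is the main point of routing requests through the proxy.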
Features:
- Private data stays in your infrastructure
- Custom model selection
- Server-side logging
- Compatible with any OpenAI-compatible API
Recommended Self-Hosted Options:
- vLLM: High-performance, production-ready
- llama.cpp server: Single binary, easy setup
- Ollama: Simple, Docker-friendly
- text-generation-webui: Feature-rich, UI included
Text Generation Flow
Search-Triggered Generation
User Query
↓
searchAndRespond() [client/modules/textGeneration.ts]
↓
startTextSearch() → searchText() [search.ts]
↓
Wait for search results
↓
canStartResponding() checks state
↓
Load AI model (if browser-based)
↓
Generate system prompt with search results
↓
Stream response via selected inference type
↓
Update PubSub channels (response, textGenerationState)
Chat Generation
User sends message
↓
generateChatResponse() [textGeneration.ts]
↓
Manage token budget (75% of 4096 = 3072 tokens)
↓
Create conversation summary if needed (800-token limit)
↓
Build context: System prompt + Summary + Recent turns
↓
Call inference API (streaming)
↓
Update PubSub (chatMessages, response)
↓
Save to history database
Conversation Memory
Token Budget Management
- Context Window: 4096 tokens
- Reserved for Response: 25% (1024 tokens)
- Available for Context: 75% (3072 tokens)
Allocation Priority:
- System prompt (with search results)
- Conversation summary (if exists)
- Recent chat messages (newest first)
- Older messages (summarized or dropped)
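The budget arithmetic and the newest-first allocation can be sketched as follows. The function names and the assumption that token counts are precomputed are illustrative only; the actual logic lives in the project's text generation module and may differ.

```typescript
interface Budget {
  responseTokens: number;
  contextTokens: number;
}

// Splits a context window into a response share and a context share.
function computeBudget(contextWindow = 4096, responseShare = 0.25): Budget {
  const responseTokens = Math.floor(contextWindow * responseShare);
  return { responseTokens, contextTokens: contextWindow - responseTokens };
}

interface CountedMessage {
  text: string;
  tokens: number; // assumed to be precomputed by a tokenizer
}

// Hypothetical allocator: the system prompt and summary are reserved first,
// then chat messages are admitted newest-first until the budget runs out.
function selectRecentMessages(
  messages: CountedMessage[], // ordered oldest first
  systemPromptTokens: number,
  summaryTokens: number,
  contextTokens: number,
): CountedMessage[] {
  let remaining = contextTokens - systemPromptTokens - summaryTokens;
  const kept: CountedMessage[] = [];
  for (const message of [...messages].reverse()) {
    if (message.tokens > remaining) break; // older messages are dropped or summarized
    remaining -= message.tokens;
    kept.unshift(message); // restore chronological order
  }
  return kept;
}
```

With the default 4096-token window, `computeBudget()` yields 1024 tokens for the response and 3072 for context, matching the figures above.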
Rolling Summaries
When conversation exceeds token budget:
- Detect Overflow: Current tokens > 3072
- Generate Summary: Call LLM with 800-token limit
- Store Summary: Save in conversationSummaryPubSub
- Drop Old Messages: Remove summarized messages from chatMessages
- Continue: Use summary + remaining messages for context
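One way to picture the overflow step is a planner that decides which turns go to the summarizer and which stay verbatim. This is a simplified sketch under stated assumptions: `planRollingSummary` is a hypothetical name, token counts are precomputed, and the system prompt is left out of the accounting for brevity.

```typescript
interface Turn {
  text: string;
  tokens: number;
}

interface SummaryPlan {
  toSummarize: Turn[]; // oldest turns, to be condensed by the LLM
  toKeep: Turn[]; // most recent turns, kept verbatim
}

const totalTokens = (turns: Turn[]): number =>
  turns.reduce((sum, turn) => sum + turn.tokens, 0);

// Hypothetical planner: when the conversation exceeds the budget, move the
// oldest turns into the summary batch until the kept turns plus the
// summary allowance fit within the budget again.
function planRollingSummary(
  turns: Turn[], // ordered oldest first
  budget = 3072,
  summaryLimit = 800,
): SummaryPlan {
  if (totalTokens(turns) <= budget) return { toSummarize: [], toKeep: turns };
  const toKeep = [...turns];
  const toSummarize: Turn[] = [];
  while (toKeep.length > 0 && totalTokens(toKeep) + summaryLimit > budget) {
    toSummarize.push(toKeep.shift() as Turn);
  }
  return { toSummarize, toKeep };
}
```

The `toSummarize` batch would then be sent to the model with the summary prompt, and the resulting summary stored alongside the `toKeep` turns.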
Summary Prompt:
Summarize this conversation in 3-5 sentences, preserving key facts
and user intent. Be concise but informative.
Error Handling and Fallbacks
Browser Inference Failures
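// If WebLLM fails with a WebGPU error, fall back to Wllama
try {
  await generateWithWebLLM();
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  if (!message.includes('WebGPU')) throw error; // unrelated failure: surface it
  // Auto-switch to Wllama
  settings.enableWebGpu = false;
  await generateWithWllama();
}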
API Failures
- OpenAI: Retry with exponential backoff, fallback to cheaper model
- AI Horde: Queue with timeout, retry with different model
- Internal: Log error, return user-friendly message
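For the OpenAI case, retrying with exponential backoff and then falling back to another model could be planned like this. The delay values, attempt counts, and function names are assumptions chosen for the example; the document does not state the real parameters.

```typescript
// Hypothetical defaults: 500 ms base delay, doubled per attempt, capped at 8 s.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

interface Attempt {
  model: string;
  delayMs: number; // how long to wait before this attempt
}

// Builds an ordered retry schedule: each model gets `attemptsPerModel` tries
// with growing delays before the next (fallback) model is used.
function buildRetrySchedule(models: string[], attemptsPerModel = 3): Attempt[] {
  const schedule: Attempt[] = [];
  for (const model of models) {
    for (let attempt = 0; attempt < attemptsPerModel; attempt++) {
      // The first try for each model is immediate; later tries back off.
      const delayMs = attempt === 0 ? 0 : backoffDelayMs(attempt - 1);
      schedule.push({ model, delayMs });
    }
  }
  return schedule;
}
```

A caller would walk the schedule in order, waiting `delayMs` before each attempt and stopping at the first success. For example, `buildRetrySchedule(['gpt-4', 'gpt-3.5-turbo'])` tries the first model three times before switching to the second.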
State Recovery
If generation fails mid-stream:
- Set textGenerationState to failed
- Preserve partial response in responsePubSub
- Allow user to retry or modify query
Performance Optimization
Model Caching
All browser-based models cached in IndexedDB:
- WebLLM: webllm/model-cache
- Wllama: wllama/model-cache
- Subsequent loads: Read from cache (no re-download)
Streaming Strategy
- Tokens streamed at 12 updates/second max (throttled)
- UI updates batched via React's automatic batching
- Web Workers used for non-blocking inference
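The 12 updates/second cap can be illustrated with a small throttle that always keeps the latest text and lets the caller flush at the end of the stream. `createThrottledUpdater` and its injectable clock are hypothetical; the actual throttling code in MiniSearch is not shown in this document.

```typescript
// Hypothetical throttle: forwards at most `maxPerSecond` updates to `emit`,
// always holding on to the latest text so a final flush() delivers it.
function createThrottledUpdater(
  emit: (text: string) => void,
  maxPerSecond = 12,
  now: () => number = Date.now, // injectable clock, useful for tests
) {
  const minIntervalMs = 1000 / maxPerSecond;
  let lastEmit = -Infinity;
  let pending: string | undefined;

  // Emits the pending text, if any, and records the emission time.
  const flush = (): void => {
    if (pending === undefined) return;
    emit(pending);
    pending = undefined;
    lastEmit = now();
  };

  // Records the latest text and emits it only if enough time has passed.
  const update = (text: string): void => {
    pending = text;
    if (now() - lastEmit >= minIntervalMs) flush();
  };

  return { update, flush };
}
```

Each streamed token would call `update` with the accumulated response, and `flush` would run once when the stream ends so the final text is never lost.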
Progressive Model Loading
Wllama models are sharded (split into chunks):
- Download metadata first (small, fast)
- Download required shards progressively
- Start inference when first shards available
- Continue downloading remaining shards in background
Best Practices
For Privacy-Critical Use
- Use Browser inference (WebLLM/Wllama)
- Disable shareModelDownloads
- Set historyRetentionDays: 0 (no persistence)
For Maximum Quality
- Use OpenAI GPT-4 or Internal API with large model
- Set searchResultsToConsider: 5-10
- Adjust temperature: 0.5-0.7 for factual, 0.8-1.0 for creative
For Cost Efficiency
- Use AI Horde (free) or Browser inference (one-time download)
- Set searchResultsToConsider: 3 (default)
- Limit inferenceMaxTokens: 2048
For Teams/Enterprise
- Deploy Internal API with vLLM
- Set ACCESS_KEYS for access control
- Enable server-side logging
- Use consistent INTERNAL_OPENAI_COMPATIBLE_API_MODEL
Related Topics
- Configuration: docs/configuration.md - Environment variables and settings
- Conversation Memory: docs/conversation-memory.md - Detailed token budgeting
- UI Components: docs/ui-components.md - How AI response UI works
- Security: docs/security.md - Privacy implications of each inference type