AI Integration

MiniSearch supports four AI inference backends, each with different trade-offs between privacy, performance, and setup complexity.

Inference Types Overview

| Type | Privacy | Speed | Setup | Best For |
| --- | --- | --- | --- | --- |
| Browser (WebLLM/Wllama) | Maximum (no data leaves device) | Fast (WebGPU) / Slow (CPU) | None | Personal use, privacy-critical scenarios |
| OpenAI | Low (data sent to OpenAI) | Very Fast | API Key | Maximum quality, convenience |
| AI Horde | Medium (distributed volunteers) | Variable | Anonymous | Free GPU access, no setup |
| Internal | High (your infrastructure) | Depends on hardware | Self-hosted API | Teams, compliance requirements |

Browser-Based Inference

Runs AI models entirely in the browser using WebAssembly or WebGPU. No data leaves the user's device.

WebLLM (WebGPU Accelerated)

Uses @mlc-ai/web-llm for GPU-accelerated inference.

Requirements:

Modern browser with WebGPU support (Chrome 113+, Edge 113+, Firefox Nightly)
~500MB-2GB free RAM
GPU with F16 shader support (for optimal models)

How It Works:

User searches with "Enable AI Response" on
Library checks WebGPU availability and F16 shader support
Downloads model weights from HuggingFace (cached in IndexedDB)
Loads model into GPU memory
Generates the response, streaming tokens as they are produced

Model Selection:

// WebLLM model IDs from MLC registry
const models = {
  fast: 'Qwen3-0.6B-q4f16_1-MLC',      // 600M params, ~400MB
  balanced: 'SmolLM2-1.7B-q4f16_1-MLC', // 1.7B params, ~1GB
  capable: 'Llama-3.2-1B-q4f16_1-MLC'   // 1B params, ~600MB
};

Configuration:

Settings → Inference Type: Browser
Settings → Browser Model: Select from dropdown
Settings → Enable WebGPU: Toggle (auto-detected)

Limitations:

First load requires model download (progressive via sharded files)
Limited to smaller models (roughly 3B parameters at most, due to browser memory limits)
Requires modern browser with WebGPU

Wllama (CPU-Based)

Uses @wllama/wllama for CPU inference via WebAssembly.

Requirements:

Any modern browser
~300MB-1GB free RAM
No WebGPU required

How It Works:

Downloads model from HuggingFace (GGUF format)
Runs inference in WebAssembly (slower but universally compatible)
Supports 40+ pre-configured models

Pre-configured Models: All stored at Felladrin/gguf-sharded-* on HuggingFace:

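| Model | Params | Size | Speed | Quality |
| --- | --- | --- | --- | --- |
| qwen-3-0.6b | 600M | ~400MB | Fast | Good |
| smollm2-1.7b | 1.7B | ~1.1GB | Medium | Better |
| llama-3.2-1b | 1B | ~650MB | Fast | Good |
| gemma-3-1b | 1B | ~650MB | Fast | Good |
| phi-4-mini | 3.8B | ~2.2GB | Slow | Best |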

Configuration:

Settings → Inference Type: Browser
Settings → Use WebGPU: OFF
Settings → Wllama Model: Select from dropdown

Limitations:

Typically 2-5x slower than WebGPU inference
Same browser memory constraints as WebLLM
No GPU acceleration

WebLLM vs Wllama Decision Matrix

WebGPU Available?
├── Yes → WebLLM (F16 if supported, else F32)
└── No  → Wllama (CPU inference)

Code Detection:

// client/modules/webGpu.ts
export async function isWebGpuAvailable(): Promise<boolean> {
  if (!navigator.gpu) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return !!adapter;
  } catch {
    return false;
  }
}

export async function isF16ShaderSupported(): Promise<boolean> {
  const adapter = await navigator.gpu?.requestAdapter();
  return adapter?.features.has('shader-f16') ?? false;
}
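
The decision matrix above can be expressed as a small pure function. This is a minimal sketch, not code from the MiniSearch repository: the `BrowserBackend` type, the `GpuCapabilities` interface, and `chooseBrowserBackend` are hypothetical names, and the two flags are assumed to come from `isWebGpuAvailable()` and `isF16ShaderSupported()`.

```typescript
// Hypothetical names for illustration; not part of the MiniSearch codebase.
type BrowserBackend = "webllm-f16" | "webllm-f32" | "wllama";

interface GpuCapabilities {
  webGpuAvailable: boolean; // assumed result of isWebGpuAvailable()
  f16ShaderSupported: boolean; // assumed result of isF16ShaderSupported()
}

// Mirrors the decision tree: no WebGPU means CPU inference via Wllama,
// otherwise WebLLM with F16 models when the shader feature is present.
function chooseBrowserBackend(caps: GpuCapabilities): BrowserBackend {
  if (!caps.webGpuAvailable) return "wllama";
  return caps.f16ShaderSupported ? "webllm-f16" : "webllm-f32";
}
```

Keeping the choice separate from the detection calls makes it easy to unit-test without a GPU.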

OpenAI API Integration

Uses OpenAI's API or any OpenAI-compatible service.

Setup:

Get API key from OpenAI or compatible provider
Settings → Inference Type: OpenAI
Settings → OpenAI API Key: Enter key
Settings → OpenAI Model: Select or enter model ID

Supported Providers:

OpenAI (gpt-4, gpt-3.5-turbo)
Anthropic (via OpenAI-compatible endpoint)
Google (Gemini via OpenAI-compatible endpoint)
Any custom provider with OpenAI-compatible API

Features:

Streaming responses
Auto model selection (if blank)
Retry logic with fallback models
Reasoning content support
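
To illustrate what streaming looks like on the wire, the sketch below parses one chunk of an OpenAI-compatible server-sent-events stream. It is an illustration rather than MiniSearch's actual client: `parseSseChunk` and `StreamDelta` are hypothetical names, and the `reasoning_content` field is an assumption that holds only for providers that expose reasoning text that way.

```typescript
// Hypothetical helper; not taken from the MiniSearch codebase.
interface StreamDelta {
  content: string; // answer text found in this chunk
  reasoning: string; // reasoning text, if the provider sends any
  done: boolean; // true once the "[DONE]" sentinel is seen
}

function parseSseChunk(chunk: string): StreamDelta {
  const result: StreamDelta = { content: "", reasoning: "", done: false };
  for (const rawLine of chunk.split("\n")) {
    const line = rawLine.trim();
    if (!line.startsWith("data:")) continue; // skip blank lines and SSE comments
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      result.done = true;
      break;
    }
    try {
      const parsed = JSON.parse(payload);
      const delta = parsed?.choices?.[0]?.delta ?? {};
      if (typeof delta.content === "string") result.content += delta.content;
      // Assumption: some providers put reasoning text in `reasoning_content`.
      if (typeof delta.reasoning_content === "string") {
        result.reasoning += delta.reasoning_content;
      }
    } catch {
      // JSON split across network chunks is skipped here;
      // a production client would buffer the partial line instead.
    }
  }
  return result;
}
```

A caller would feed each decoded chunk from the response body into `parseSseChunk` and append `content` to the visible answer.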

Configuration:

{
  inferenceType: 'openai',
  openaiApiKey: 'sk-xxx',
  openaiModel: 'gpt-4', // Optional: auto-detected if empty
  inferenceTemperature: 0.7,
  inferenceMaxTokens: 4096
}

Privacy Considerations:

Search queries and results sent to OpenAI
Not suitable for sensitive data
Consider internal API for private data

AI Horde Integration

Uses aihorde.net, a distributed volunteer GPU network.

Setup:

Settings → Inference Type: AI Horde
(Optional) Settings → AI Horde API Key: Get from aihorde.net
Settings → AI Horde Model: Select preferred model

How It Works:

Request sent to AI Horde API
Distributed to volunteer workers
Multiple workers may process in parallel
First response wins (race condition handling)
Results streamed back

Features:

Free to use (anonymous or authenticated)
Kudos-based priority system
Large model selection (70B+ params available)
No API key required (but recommended for priority)

Configuration:

{
  inferenceType: 'horde',
  aiHordeApiKey: '', // Optional
  aiHordeModel: 'koboldcpp/LLaMA2-70B-Psyfighter2' // Optional
}

Limitations:

Variable latency (depends on worker availability)
Quality varies by worker
May queue during high demand
Requires internet connection

Internal API Integration

Self-hosted OpenAI-compatible API for teams and compliance.

Setup:

Host an OpenAI-compatible API (e.g., vLLM, llama.cpp server, or Ollama's OpenAI-compatible endpoint)
Configure environment variables (see docs/configuration.md)
Settings → Inference Type: Internal

Environment Variables:

INTERNAL_OPENAI_COMPATIBLE_API_BASE_URL="https://llm.company.com/v1"
INTERNAL_OPENAI_COMPATIBLE_API_KEY="sk-internal-xxx"
INTERNAL_OPENAI_COMPATIBLE_API_MODEL="llama-3.1-8b"
INTERNAL_OPENAI_COMPATIBLE_API_NAME="Company LLM"

Server-Side Proxy: The internal API uses a server-side proxy to:

Hide API keys from client
Add request logging/auditing
Apply rate limiting
Enable token-based authentication

Endpoint:

POST /inference
Content-Type: application/json
Authorization: Bearer <VITE_SEARCH_TOKEN>

{
  "messages": [...],
  "model": "llama-3.1-8b",
  "stream": true
}
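
A client call to this endpoint might be assembled as follows. This is a hedged sketch: `buildInferenceRequest`, `ProxyMessage`, and the `baseUrl` parameter are hypothetical, and only the path, headers, and body fields shown above come from this document.

```typescript
// Hypothetical helper; names are illustrative, not from the MiniSearch codebase.
interface ProxyMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface InferenceRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildInferenceRequest(
  baseUrl: string, // assumed origin of the MiniSearch server
  searchToken: string, // value of VITE_SEARCH_TOKEN
  messages: ProxyMessage[],
  model: string,
): InferenceRequest {
  return {
    // Strip trailing slashes so the path is not doubled.
    url: `${baseUrl.replace(/\/+$/, "")}/inference`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${searchToken}`,
      },
      body: JSON.stringify({ messages, model, stream: true }),
    },
  };
}
```

The result can be passed straight to `fetch(request.url, request.init)`; because the proxy holds the upstream API key, only the search token appears on the client.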

Features:

Private data stays in your infrastructure
Custom model selection
Server-side logging
Compatible with any OpenAI-compatible API

Recommended Self-Hosted Options:

vLLM: High-performance, production-ready
llama.cpp server: Single binary, easy setup
Ollama: Simple, Docker-friendly
text-generation-webui: Feature-rich, UI included

Text Generation Flow

Search-Triggered Generation

User Query
    ↓
searchAndRespond() [client/modules/textGeneration.ts]
    ↓
startTextSearch() → searchText() [search.ts]
    ↓
Wait for search results
    ↓
canStartResponding() checks state
    ↓
Load AI model (if browser-based)
    ↓
Generate system prompt with search results
    ↓
Stream response via selected inference type
    ↓
Update PubSub channels (response, textGenerationState)

Chat Generation

User sends message
    ↓
generateChatResponse() [textGeneration.ts]
    ↓
Manage token budget (75% of 4096 = 3072 tokens)
    ↓
Create conversation summary if needed (800-token limit)
    ↓
Build context: System prompt + Summary + Recent turns
    ↓
Call inference API (streaming)
    ↓
Update PubSub (chatMessages, response)
    ↓
Save to history database

Conversation Memory

Token Budget Management

Context Window: 4096 tokens
Reserved for Response: 25% (1024 tokens)
Available for Context: 75% (3072 tokens)

Allocation Priority:

System prompt (with search results)
Conversation summary (if exists)
Recent chat messages (newest first)
Older messages (summarized or dropped)
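
The budget arithmetic and allocation order can be sketched as below. This is an illustration under stated assumptions, not the code in textGeneration.ts: `estimateTokens` uses a rough 4-characters-per-token heuristic, and `contextBudget`, `selectContext`, and `Turn` are hypothetical names.

```typescript
// Hypothetical helpers; names and the token heuristic are assumptions.
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Assumption: about 4 characters per token. A real tokenizer is more accurate.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// 4096-token window with 25% reserved for the response leaves 3072 for context.
function contextBudget(contextWindow = 4096, responseShare = 0.25): number {
  return Math.floor(contextWindow * (1 - responseShare));
}

// Spend the budget in priority order: system prompt, summary, then turns
// from newest to oldest. Turns that do not fit are returned as `dropped`.
function selectContext(
  systemPrompt: string,
  summary: string | null,
  turns: Turn[],
  budget: number = contextBudget(),
): { kept: Turn[]; dropped: Turn[] } {
  let remaining = budget - estimateTokens(systemPrompt);
  if (summary) remaining -= estimateTokens(summary);

  const kept: Turn[] = [];
  let index = turns.length - 1;
  for (; index >= 0; index--) {
    const cost = estimateTokens(turns[index].content);
    if (cost > remaining) break; // stop at the first turn that does not fit
    remaining -= cost;
    kept.unshift(turns[index]); // keep chronological order in the output
  }
  return { kept, dropped: turns.slice(0, index + 1) };
}
```

The `dropped` turns are the candidates for summarization or removal.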

Rolling Summaries

When conversation exceeds token budget:

Detect Overflow: Current tokens > 3072
Generate Summary: Call LLM with 800-token limit
Store Summary: Save in conversationSummaryPubSub
Drop Old Messages: Remove summarized messages from chatMessages
Continue: Use summary + remaining messages for context

Summary Prompt:

Summarize this conversation in 3-5 sentences, preserving key facts
and user intent. Be concise but informative.
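
The rolling-summary steps can be sketched as pure state transitions around the LLM call. This is a minimal sketch, assuming a 4-characters-per-token estimate and a hypothetical `keepRecent` parameter; `ConversationState`, `needsSummary`, `buildSummaryInput`, and `applySummary` are illustrative names, not the actual MiniSearch functions.

```typescript
// Hypothetical types and helpers for illustration only.
interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}

interface ConversationState {
  summary: string | null;
  messages: ChatTurn[];
}

const CONTEXT_BUDGET = 3072; // 75% of the 4096-token window
const SUMMARY_PROMPT =
  "Summarize this conversation in 3-5 sentences, preserving key facts " +
  "and user intent. Be concise but informative.";

// Assumption: about 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Step 1: detect overflow.
function needsSummary(state: ConversationState, budget = CONTEXT_BUDGET): boolean {
  const summaryTokens = state.summary ? estimateTokens(state.summary) : 0;
  const messageTokens = state.messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return summaryTokens + messageTokens > budget;
}

function splitMessages(messages: ChatTurn[], keepRecent: number) {
  const cut = Math.max(0, messages.length - keepRecent);
  return { toSummarize: messages.slice(0, cut), toKeep: messages.slice(cut) };
}

// Step 2: build the text sent to the LLM (with an 800-token output limit).
function buildSummaryInput(state: ConversationState, keepRecent: number): string {
  const { toSummarize } = splitMessages(state.messages, keepRecent);
  const previous = state.summary ? `Previous summary: ${state.summary}\n` : "";
  const transcript = toSummarize.map((m) => `${m.role}: ${m.content}`).join("\n");
  return `${SUMMARY_PROMPT}\n\n${previous}${transcript}`;
}

// Steps 3-5: store the summary and drop the messages it replaces.
function applySummary(state: ConversationState, summary: string, keepRecent: number): ConversationState {
  return { summary, messages: splitMessages(state.messages, keepRecent).toKeep };
}
```

In use, a caller would check `needsSummary(state)`, send `buildSummaryInput(state, keepRecent)` to the selected inference backend, and pass the returned text to `applySummary`.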

Error Handling and Fallbacks

Browser Inference Failures

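// If WebLLM fails with a WebGPU-related error, fall back to Wllama
try {
  await generateWithWebLLM();
} catch (error) {
  // `error` is `unknown` in TypeScript, so narrow it before reading `.message`
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('WebGPU')) {
    // Auto-switch to Wllama (CPU inference)
    settings.enableWebGpu = false;
    await generateWithWllama();
  } else {
    // Do not silently swallow unrelated failures
    throw error;
  }
}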

API Failures

OpenAI: Retry with exponential backoff, fallback to cheaper model
AI Horde: Queue with timeout, retry with different model
Internal: Log error, return user-friendly message
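
The "retry with exponential backoff, fallback to another model" strategy can be sketched generically. The code below is an assumption-laden illustration, not MiniSearch's implementation: `backoffDelayMs`, `generateWithFallback`, the 500 ms base delay, the 8 s cap, and the retry count are all hypothetical choices.

```typescript
// Hypothetical helpers; delays and retry counts are illustrative defaults.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  // 500 ms, 1 s, 2 s, 4 s, ... capped at maxMs
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Tries each model in order. Each model gets `retriesPerModel` retries with
// growing delays before the next (for example, cheaper) model is attempted.
async function generateWithFallback<T>(
  models: string[],
  call: (model: string) => Promise<T>,
  retriesPerModel = 2,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let lastError: unknown = new Error("No models configured");
  for (const model of models) {
    for (let attempt = 0; attempt <= retriesPerModel; attempt++) {
      try {
        return await call(model);
      } catch (error) {
        lastError = error;
        if (attempt < retriesPerModel) await wait(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError; // every model and retry failed
}
```

The `wait` parameter is injectable so the function can be tested without real delays. A production version would likely also distinguish retryable errors (rate limits, timeouts) from permanent ones such as an invalid API key.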

State Recovery

If generation fails mid-stream:

Set textGenerationState to failed
Preserve partial response in responsePubSub
Allow user to retry or modify query
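
The recovery behaviour can be modelled as a small reducer. This is a sketch only: apart from `failed`, the state names (`idle`, `generating`, `completed`) and the event names are hypothetical, and the real application stores these values in PubSub channels rather than a single object.

```typescript
// Hypothetical state machine; only the "failed" state is named in this document.
type TextGenerationState = "idle" | "generating" | "completed" | "failed";

interface GenerationSnapshot {
  state: TextGenerationState;
  response: string; // stands in for the content of responsePubSub
}

type GenerationEvent =
  | { type: "start" }
  | { type: "token"; text: string }
  | { type: "finish" }
  | { type: "error" }
  | { type: "retry" };

function reduceGeneration(s: GenerationSnapshot, e: GenerationEvent): GenerationSnapshot {
  switch (e.type) {
    case "start":
    case "retry":
      // A new attempt begins with an empty response.
      return { state: "generating", response: "" };
    case "token":
      // Tokens arriving outside an active generation are ignored.
      return s.state === "generating" ? { ...s, response: s.response + e.text } : s;
    case "finish":
      return { ...s, state: "completed" };
    case "error":
      // Mark the failure but keep whatever was streamed so far.
      return { ...s, state: "failed" };
  }
}
```

Because the `error` branch leaves `response` untouched, the UI can still show the partial answer next to a retry button.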

Performance Optimization

Model Caching

All browser-based models are cached in IndexedDB:

WebLLM: webllm/model-cache
Wllama: wllama/model-cache
Subsequent loads: no re-download; the model is read from the local cache

Streaming Strategy

UI updates are throttled to at most 12 per second while tokens stream in
UI updates batched via React's automatic batching
Web Workers used for non-blocking inference
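
A throttle of this kind can be sketched as follows. It is an illustration, not the MiniSearch implementation: `createThrottledUpdater` is a hypothetical name, and the injectable `now` clock exists only to make the behaviour testable.

```typescript
// Hypothetical helper; accumulates tokens and limits how often `emit` runs.
function createThrottledUpdater(
  emit: (fullText: string) => void,
  maxUpdatesPerSecond = 12,
  now: () => number = Date.now,
) {
  const minIntervalMs = 1000 / maxUpdatesPerSecond; // about 83 ms at 12 updates/s
  let buffered = "";
  let lastEmitAt = -Infinity;
  let dirty = false; // true when `buffered` has text not yet emitted

  // Emits the accumulated text if anything changed since the last emit.
  function flush(): void {
    if (!dirty) return;
    emit(buffered);
    lastEmitAt = now();
    dirty = false;
  }

  // Appends a token; emits only if enough time has passed since the last emit.
  function push(token: string): void {
    buffered += token;
    dirty = true;
    if (now() - lastEmitAt >= minIntervalMs) flush();
  }

  return { push, flush };
}
```

Call `push` for every streamed token and `flush` once when the stream ends, so the final tokens are not held back by the throttle.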

Progressive Model Loading

Wllama models are sharded (split into chunks):

Download metadata first (small, fast)
Download required shards progressively
Start inference when first shards available
Continue downloading remaining shards in background
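
Sharded downloads can be sketched as below. Two assumptions are worth flagging: the file names follow the `<prefix>-00001-of-0000N.gguf` pattern produced by llama.cpp's gguf-split tool, which may not match every repository, and `shardFileNames`, `downloadShards`, and the injected `fetchShard` function are hypothetical rather than part of the Wllama API.

```typescript
// Hypothetical helpers; the naming pattern is an assumption about the hosted files.
function shardFileNames(prefix: string, shardCount: number): string[] {
  const pad = (n: number): string => String(n).padStart(5, "0");
  return Array.from(
    { length: shardCount },
    (_, i) => `${prefix}-${pad(i + 1)}-of-${pad(shardCount)}.gguf`,
  );
}

// Downloads shards in order and reports progress after each one, so the UI
// can show how much of the model has arrived.
async function downloadShards(
  names: string[],
  fetchShard: (name: string) => Promise<Uint8Array>,
  onProgress: (loaded: number, total: number) => void,
): Promise<Uint8Array[]> {
  const shards: Uint8Array[] = [];
  for (const name of names) {
    shards.push(await fetchShard(name));
    onProgress(shards.length, names.length);
  }
  return shards;
}
```

Small shards also make caching more robust: an interrupted download only needs to re-fetch the shards that are missing.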

Best Practices

For Privacy-Critical Use

Use Browser inference (WebLLM/Wllama)
Disable shareModelDownloads
Set historyRetentionDays: 0 (no persistence)

For Maximum Quality

Use OpenAI GPT-4 or Internal API with large model
Set searchResultsToConsider: 5-10
Adjust temperature: 0.5-0.7 for factual, 0.8-1.0 for creative

For Cost Efficiency

Use AI Horde (free) or Browser inference (one-time download)
Set searchResultsToConsider: 3 (default)
Limit inferenceMaxTokens: 2048

For Teams/Enterprise

Deploy Internal API with vLLM
Set ACCESS_KEYS for access control
Enable server-side logging
Use consistent INTERNAL_OPENAI_COMPATIBLE_API_MODEL
