widgettdc-api / docs /technical /SEMANTIC_SEARCH_GUIDE.md
Kraft102's picture
fix: sql.js Docker/Alpine compatibility layer for PatternMemory and FailureMemory
5a81b95

🧠 Semantic Search Implementation Complete

What Was Implemented

1. Unified Embedding Service

Location: apps/backend/src/services/embeddings/EmbeddingService.ts

Features:

  • Auto-provider detection - Tries providers in order: OpenAI β†’ HuggingFace β†’ Local Transformers.js
  • Multiple providers supported:
    • OpenAI (text-embedding-3-small, 1536 dimensions)
    • HuggingFace (all-MiniLM-L6-v2, 768 dimensions)
    • Transformers.js (local, 384 dimensions, no API key needed)
  • Singleton pattern - One instance shared across application
  • Automatic fallback - If one provider fails, tries the next

2. Enhanced PgVectorStoreAdapter

Location: apps/backend/src/platform/vector/PgVectorStoreAdapter.ts

New Capabilities:

  • βœ… Auto-embedding generation - Pass content without embedding, it generates it for you
  • βœ… Text-based search - Search using natural language queries
  • βœ… Vector-based search - Still supports raw vector queries
  • βœ… Cosine similarity - Native PostgreSQL pgvector similarity search

3. Updated Compatibility Layer

Location: apps/backend/src/platform/vector/ChromaVectorStoreAdapter.ts

Features:

  • βœ… Transparent upgrade - Old code works without changes
  • βœ… Semantic search enabled - Text queries now actually work
  • βœ… API compatibility - Maintains ChromaDB interface

Usage Examples

Text-Based Semantic Search

import { getPgVectorStore } from './platform/vector/PgVectorStoreAdapter.js';

const vectorStore = getPgVectorStore();
await vectorStore.initialize();

// Search using natural language
const results = await vectorStore.search({
  text: "What is artificial intelligence?",
  limit: 5,
  namespace: "knowledge_base"
});

// Results contain semantically similar documents
results.forEach(result => {
  console.log(`Similarity: ${result.similarity}`);
  console.log(`Content: ${result.content}`);
});

Auto-Embedding on Insert

// Just provide content - embedding is generated automatically
await vectorStore.upsert({
  id: "doc-123",
  content: "Artificial intelligence is the simulation of human intelligence processes by machines.",
  metadata: {
    source: "wikipedia",
    category: "AI"
  },
  namespace: "knowledge_base"
});

Batch Insert with Auto-Embeddings

await vectorStore.batchUpsert({
  records: [
    { id: "1", content: "Machine learning is a subset of AI" },
    { id: "2", content: "Deep learning uses neural networks" },
    { id: "3", content: "NLP processes human language" }
  ],
  namespace: "ai_concepts"
});
// All embeddings generated automatically!

Using with Existing Code (ChromaDB API)

import { getChromaVectorStore } from './platform/vector/ChromaVectorStoreAdapter.js';

const vectorStore = getChromaVectorStore();

// Old code continues to work, now with real semantic search
const results = await vectorStore.search({
  query: "machine learning concepts",
  limit: 10
});

Configuration

Option 1: OpenAI (Recommended for Production)

# .env
EMBEDDING_PROVIDER=openai
OPENAI_API_KEY=sk-...

Pros:

  • Highest quality embeddings (1536D)
  • Fast inference
  • Production-ready

Cons:

  • Costs money (~$0.00002 per 1K tokens)
  • Requires API key

Option 2: HuggingFace (Good Middle Ground)

# .env
EMBEDDING_PROVIDER=huggingface
HUGGINGFACE_API_KEY=hf_...

Pros:

  • Free tier available
  • Good quality (768D)
  • Many models available

Cons:

  • Slower than OpenAI
  • Rate limits on free tier

Option 3: Local Transformers.js (Development)

# .env
EMBEDDING_PROVIDER=transformers
# No API key needed!
# Install dependency
npm install @xenova/transformers

Pros:

  • 100% free
  • No API calls (works offline)
  • Privacy (data never leaves server)

Cons:

  • Smaller dimensions (384D)
  • Slower first run (downloads model)
  • Uses more memory

Option 4: Auto-Select (Default)

# .env
# No EMBEDDING_PROVIDER set
# Tries: OpenAI β†’ HuggingFace β†’ Transformers.js

Testing

1. Quick Test

cd apps/backend
npm install @xenova/transformers  # If using local embeddings

# Start services
docker-compose up -d
npx prisma migrate dev --name init
npm run build
npm start

2. Test Ingestion

The IngestionPipeline now automatically generates embeddings:

// When data is ingested, embeddings are auto-generated
// No code changes needed!

3. Test Search

# Via MCP tool (use in frontend or API)
POST /api/mcp/route
{
  "tool": "vidensarkiv.search",
  "payload": {
    "query": "How do I configure the system?",
    "limit": 5
  }
}

Performance

Embedding Generation Speed

  • OpenAI: ~100ms per text
  • HuggingFace: ~300ms per text
  • Transformers.js: ~500ms per text (first run slower)

Batch Processing

All providers support batch generation for better performance:

// Generate 100 embeddings at once
const texts = [...]; // 100 texts
const embeddings = await embeddingService.generateEmbeddings(texts);

Troubleshooting

"No embedding provider available"

Solution: Configure at least one provider:

npm install @xenova/transformers
# Or set OPENAI_API_KEY or HUGGINGFACE_API_KEY

Slow first search with Transformers.js

Solution: Model downloads on first use (~50MB). Subsequent calls are fast.

Vector dimension mismatch

Solution: If you change providers, you may need to re-embed existing data:

// Delete old embeddings
await vectorStore.deleteNamespace("your_namespace");

// Re-ingest data (will use new provider)

Next Steps

  1. Test semantic search - Try querying your knowledge base
  2. Configure provider - Choose OpenAI for best quality
  3. Monitor usage - Check logs for embedding generation
  4. Optimize - Batch similar operations

Status: βœ… Semantic search fully operational. Vector database is now intelligent.