Spaces:

Kraft102
/

widgettdc-api

Paused

File size: 6,040 Bytes

5a81b95

# 🧠 Semantic Search Implementation Complete

## What Was Implemented

### 1. Unified Embedding Service
**Location:** `apps/backend/src/services/embeddings/EmbeddingService.ts`

**Features:**
- **Auto-provider detection** - Tries providers in order: OpenAI → HuggingFace → Local Transformers.js
- **Multiple providers supported:**
  - **OpenAI** (text-embedding-3-small, 1536 dimensions)
  - **HuggingFace** (all-MiniLM-L6-v2, 768 dimensions)  
  - **Transformers.js** (local, 384 dimensions, no API key needed)
- **Singleton pattern** - One instance shared across application
- **Automatic fallback** - If one provider fails, tries the next

### 2. Enhanced PgVectorStoreAdapter
**Location:** `apps/backend/src/platform/vector/PgVectorStoreAdapter.ts`

**New Capabilities:**
- ✅ **Auto-embedding generation** - Pass `content` without `embedding`, it generates it for you
- ✅ **Text-based search** - Search using natural language queries
- ✅ **Vector-based search** - Still supports raw vector queries
- ✅ **Cosine similarity** - Native PostgreSQL pgvector similarity search

### 3. Updated Compatibility Layer
**Location:** `apps/backend/src/platform/vector/ChromaVectorStoreAdapter.ts`

**Features:**
- ✅ **Transparent upgrade** - Old code works without changes
- ✅ **Semantic search enabled** - Text queries now actually work
- ✅ **API compatibility** - Maintains ChromaDB interface

## Usage Examples

### Text-Based Semantic Search
```typescript
import { getPgVectorStore } from './platform/vector/PgVectorStoreAdapter.js';

const vectorStore = getPgVectorStore();
await vectorStore.initialize();

// Search using natural language
const results = await vectorStore.search({
  text: "What is artificial intelligence?",
  limit: 5,
  namespace: "knowledge_base"
});

// Results contain semantically similar documents
results.forEach(result => {
  console.log(`Similarity: ${result.similarity}`);
  console.log(`Content: ${result.content}`);
});
```

### Auto-Embedding on Insert
```typescript
// Just provide content - embedding is generated automatically
await vectorStore.upsert({
  id: "doc-123",
  content: "Artificial intelligence is the simulation of human intelligence processes by machines.",
  metadata: {
    source: "wikipedia",
    category: "AI"
  },
  namespace: "knowledge_base"
});
```

### Batch Insert with Auto-Embeddings
```typescript
await vectorStore.batchUpsert({
  records: [
    { id: "1", content: "Machine learning is a subset of AI" },
    { id: "2", content: "Deep learning uses neural networks" },
    { id: "3", content: "NLP processes human language" }
  ],
  namespace: "ai_concepts"
});
// All embeddings generated automatically!
```

### Using with Existing Code (ChromaDB API)
```typescript
import { getChromaVectorStore } from './platform/vector/ChromaVectorStoreAdapter.js';

const vectorStore = getChromaVectorStore();

// Old code continues to work, now with real semantic search
const results = await vectorStore.search({
  query: "machine learning concepts",
  limit: 10
});
```

## Configuration

### Option 1: OpenAI (Recommended for Production)
```bash
# .env
EMBEDDING_PROVIDER=openai
OPENAI_API_KEY=sk-...
```

**Pros:**
- Highest quality embeddings (1536D)
- Fast inference
- Production-ready

**Cons:**
- Costs money (~$0.00002 per 1K tokens)
- Requires API key

### Option 2: HuggingFace (Good Middle Ground)
```bash
# .env
EMBEDDING_PROVIDER=huggingface
HUGGINGFACE_API_KEY=hf_...
```

**Pros:**
- Free tier available
- Good quality (768D)
- Many models available

**Cons:**
- Slower than OpenAI
- Rate limits on free tier

### Option 3: Local Transformers.js (Development)
```bash
# .env
EMBEDDING_PROVIDER=transformers
# No API key needed!
```

```bash
# Install dependency
npm install @xenova/transformers
```

**Pros:**
- 100% free
- No API calls (works offline)
- Privacy (data never leaves server)

**Cons:**
- Smaller dimensions (384D)
- Slower first run (downloads model)
- Uses more memory

### Option 4: Auto-Select (Default)
```bash
# .env
# No EMBEDDING_PROVIDER set
# Tries: OpenAI → HuggingFace → Transformers.js
```

## Testing

### 1. Quick Test
```bash
cd apps/backend
npm install @xenova/transformers  # If using local embeddings

# Start services
docker-compose up -d
npx prisma migrate dev --name init
npm run build
npm start
```

### 2. Test Ingestion
The `IngestionPipeline` now automatically generates embeddings:
```typescript
// When data is ingested, embeddings are auto-generated
// No code changes needed!
```

### 3. Test Search
```bash
# Via MCP tool (use in frontend or API)
POST /api/mcp/route
{
  "tool": "vidensarkiv.search",
  "payload": {
    "query": "How do I configure the system?",
    "limit": 5
  }
}
```

## Performance

### Embedding Generation Speed
- **OpenAI:** ~100ms per text
- **HuggingFace:** ~300ms per text
- **Transformers.js:** ~500ms per text (first run slower)

### Batch Processing
All providers support batch generation for better performance:
```typescript
// Generate 100 embeddings at once
const texts = [...]; // 100 texts
const embeddings = await embeddingService.generateEmbeddings(texts);
```

## Troubleshooting

### "No embedding provider available"
**Solution:** Configure at least one provider:
```bash
npm install @xenova/transformers
# Or set OPENAI_API_KEY or HUGGINGFACE_API_KEY
```

### Slow first search with Transformers.js
**Solution:** Model downloads on first use (~50MB). Subsequent calls are fast.

### Vector dimension mismatch
**Solution:** If you change providers, you may need to re-embed existing data:
```typescript
// Delete old embeddings
await vectorStore.deleteNamespace("your_namespace");

// Re-ingest data (will use new provider)
```

## Next Steps

1. **Test semantic search** - Try querying your knowledge base
2. **Configure provider** - Choose OpenAI for best quality
3. **Monitor usage** - Check logs for embedding generation
4. **Optimize** - Batch similar operations

---

**Status:** ✅ Semantic search fully operational. Vector database is now intelligent.