	# 🧠 Semantic Search Implementation Complete

	## What Was Implemented

	### 1. Unified Embedding Service
	Location: `apps/backend/src/services/embeddings/EmbeddingService.ts`

	Features:
	- Auto-provider detection - Tries providers in order: OpenAI → HuggingFace → Local Transformers.js
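	- Multiple providers supported:
	  - OpenAI (text-embedding-3-small, 1536 dimensions)
	  - HuggingFace (all-MiniLM-L6-v2, listed here as 768 dimensions; the published all-MiniLM-L6-v2 model outputs 384-dimensional vectors, so verify the dimension against the configured model)
	  - Transformers.js (local, 384 dimensions, no API key needed)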
	- Singleton pattern - One instance shared across application
	- Automatic fallback - If one provider fails, tries the next
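The fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual `EmbeddingService` code: the `EmbeddingProvider` interface, the `embedWithFallback` function, and the two stub providers are hypothetical names introduced here.

```typescript
// Hypothetical provider shape; the real EmbeddingService interface may differ.
interface EmbeddingProvider {
  name: string;
  dimensions: number;
  embed(text: string): Promise<number[]>;
}

// Try each provider in order and return the first embedding that succeeds.
async function embedWithFallback(
  providers: EmbeddingProvider[],
  text: string
): Promise<{ provider: string; embedding: number[] }> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const embedding = await provider.embed(text);
      return { provider: provider.name, embedding };
    } catch (err) {
      // Record the failure and move on to the next provider.
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${provider.name}: ${message}`);
    }
  }
  throw new Error(`No embedding provider available (${errors.join("; ")})`);
}

// Stub providers for illustration only; they do not call any real API.
const failingProvider: EmbeddingProvider = {
  name: "openai",
  dimensions: 1536,
  embed: async () => {
    throw new Error("missing API key");
  },
};

const localProvider: EmbeddingProvider = {
  name: "transformers",
  dimensions: 384,
  embed: async () => new Array<number>(384).fill(0),
};
```

Calling `embedWithFallback([failingProvider, localProvider], "hello")` would skip the failing stub and resolve with the local stub's vector.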

	### 2. Enhanced PgVectorStoreAdapter
	Location: `apps/backend/src/platform/vector/PgVectorStoreAdapter.ts`

	New Capabilities:
	- ✅ Auto-embedding generation - Pass `content` without `embedding` and the adapter generates the embedding for you
	- ✅ Text-based search - Search using natural language queries
	- ✅ Vector-based search - Still supports raw vector queries
	- ✅ Cosine similarity - Native PostgreSQL pgvector similarity search
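For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. In pgvector, the `<=>` operator returns cosine distance, so similarity is `1 - distance`. The sketch below computes the same quantity in application code; `cosineSimilarity` is a hypothetical helper for illustration and is not part of the adapter.

```typescript
// Cosine similarity between two equal-length vectors.
// Returns a value in [-1, 1]; returns 0 if either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical directions score 1, orthogonal vectors score 0, and opposite directions score -1.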

	### 3. Updated Compatibility Layer
	Location: `apps/backend/src/platform/vector/ChromaVectorStoreAdapter.ts`

	Features:
	- ✅ Transparent upgrade - Old code works without changes
	- ✅ Semantic search enabled - Text queries now actually work
	- ✅ API compatibility - Maintains ChromaDB interface
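One way such a compatibility layer might work is by translating the ChromaDB-style request into the text-search request used by the pgvector adapter. The sketch below is an assumption based on the request shapes in the usage examples in this guide; `ChromaStyleQuery`, `PgTextSearch`, and `toPgTextSearch` are hypothetical names, and the real adapter may do this differently.

```typescript
// Hypothetical request shapes, modeled on the usage examples in this guide.
interface ChromaStyleQuery {
  query: string;
  limit?: number;
}

interface PgTextSearch {
  text: string;
  limit: number;
  namespace?: string;
}

// Map a ChromaDB-style query onto a pgvector text search request.
function toPgTextSearch(
  request: ChromaStyleQuery,
  namespace?: string,
  defaultLimit: number = 10
): PgTextSearch {
  const search: PgTextSearch = {
    text: request.query,
    limit: request.limit ?? defaultLimit,
  };
  if (namespace !== undefined) {
    search.namespace = namespace;
  }
  return search;
}
```

With a mapping like this, callers keep passing `query` and `limit`, and only the adapter needs to know about `text` and `namespace`.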

	## Usage Examples

	### Text-Based Semantic Search
	```typescript
	import { getPgVectorStore } from './platform/vector/PgVectorStoreAdapter.js';

	const vectorStore = getPgVectorStore();
	await vectorStore.initialize();

	// Search using natural language
	const results = await vectorStore.search({
	  text: "What is artificial intelligence?",
	  limit: 5,
	  namespace: "knowledge_base"
	});

	// Results contain semantically similar documents
	results.forEach(result => {
	  console.log(`Similarity: ${result.similarity}`);
	  console.log(`Content: ${result.content}`);
	});
	```

	### Auto-Embedding on Insert
	```typescript
	// Just provide content - embedding is generated automatically
	await vectorStore.upsert({
	  id: "doc-123",
	  content: "Artificial intelligence is the simulation of human intelligence processes by machines.",
	  metadata: {
	    source: "wikipedia",
	    category: "AI"
	  },
	  namespace: "knowledge_base"
	});
	```

	### Batch Insert with Auto-Embeddings
	```typescript
	await vectorStore.batchUpsert({
	  records: [
	    { id: "1", content: "Machine learning is a subset of AI" },
	    { id: "2", content: "Deep learning uses neural networks" },
	    { id: "3", content: "NLP processes human language" }
	  ],
	  namespace: "ai_concepts"
	});
	// All embeddings generated automatically!
	```

	### Using with Existing Code (ChromaDB API)
	```typescript
	import { getChromaVectorStore } from './platform/vector/ChromaVectorStoreAdapter.js';

	const vectorStore = getChromaVectorStore();

	// Old code continues to work, now with real semantic search
	const results = await vectorStore.search({
	  query: "machine learning concepts",
	  limit: 10
	});
	```

	## Configuration

	### Option 1: OpenAI (Recommended for Production)
	```bash
	# .env
	EMBEDDING_PROVIDER=openai
	OPENAI_API_KEY=sk-...
	```

	Pros:
	- Highest quality embeddings (1536D)
	- Fast inference
	- Production-ready

	Cons:
	- Costs money (~$0.00002 per 1K tokens)
	- Requires API key
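To put the per-token price in context, here is a rough cost estimate that assumes the ~$0.00002 per 1K tokens figure quoted above. `estimateEmbeddingCostUsd` is a hypothetical helper, and current pricing should be checked with the provider.

```typescript
// Price quoted in this guide; treat it as an approximation.
const USD_PER_1K_TOKENS = 0.00002;

// Estimate the embedding cost in USD for a given number of tokens.
function estimateEmbeddingCostUsd(
  totalTokens: number,
  usdPer1kTokens: number = USD_PER_1K_TOKENS
): number {
  if (totalTokens < 0) {
    throw new Error("totalTokens must be non-negative");
  }
  return (totalTokens / 1000) * usdPer1kTokens;
}

// Example: 10,000 documents averaging 500 tokens each (5 million tokens).
const estimate = estimateEmbeddingCostUsd(10000 * 500);
console.log(`Estimated cost: $${estimate.toFixed(2)}`);
```

At the quoted rate, embedding 5 million tokens comes to roughly ten cents, so cost only starts to matter for very large corpora.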

	### Option 2: HuggingFace (Good Middle Ground)
	```bash
	# .env
	EMBEDDING_PROVIDER=huggingface
	HUGGINGFACE_API_KEY=hf_...
	```

	Pros:
	- Free tier available
	- Good quality (768D)
	- Many models available

	Cons:
	- Slower than OpenAI
	- Rate limits on free tier

	### Option 3: Local Transformers.js (Development)
	```bash
	# .env
	EMBEDDING_PROVIDER=transformers
	# No API key needed!
	```

	```bash
	# Install dependency
	npm install @xenova/transformers
	```

	Pros:
	- 100% free
	- No API calls (works offline)
	- Privacy (data never leaves server)

	Cons:
	- Smaller dimensions (384D)
	- Slower first run (downloads model)
	- Uses more memory

	### Option 4: Auto-Select (Default)
	```bash
	# .env
	# No EMBEDDING_PROVIDER set
	# Tries: OpenAI → HuggingFace → Transformers.js
	```
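The selection order above could be resolved from environment variables along these lines. This is a sketch under two assumptions that are not confirmed by the source: an explicit `EMBEDDING_PROVIDER` disables fallback, and providers without an API key are skipped up front. `resolveProviderOrder` is a hypothetical function name.

```typescript
type ProviderName = "openai" | "huggingface" | "transformers";

// Decide which providers to try, and in what order, from environment variables.
function resolveProviderOrder(
  env: Record<string, string | undefined>
): ProviderName[] {
  const explicit = env["EMBEDDING_PROVIDER"];
  if (
    explicit === "openai" ||
    explicit === "huggingface" ||
    explicit === "transformers"
  ) {
    // Assumption: an explicit choice means no automatic fallback.
    return [explicit];
  }

  // Auto-select: OpenAI, then HuggingFace, then local Transformers.js.
  const order: ProviderName[] = [];
  if (env["OPENAI_API_KEY"]) {
    order.push("openai");
  }
  if (env["HUGGINGFACE_API_KEY"]) {
    order.push("huggingface");
  }
  // The local provider needs no API key, so it is always the last resort.
  order.push("transformers");
  return order;
}
```

In a Node.js backend this would typically be called with `process.env`.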

	## Testing

	### 1. Quick Test
	```bash
	cd apps/backend
	npm install @xenova/transformers # If using local embeddings

	# Start services
	docker-compose up -d
	npx prisma migrate dev --name init
	npm run build
	npm start
	```

	### 2. Test Ingestion
	The `IngestionPipeline` now automatically generates embeddings:
	```typescript
	// When data is ingested, embeddings are auto-generated
	// No code changes needed!
	```

	### 3. Test Search
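	```bash
	# Via MCP tool (use in frontend or API).
	# BASE_URL is a placeholder for wherever the backend is listening.
	curl -X POST "$BASE_URL/api/mcp/route" \
	  -H "Content-Type: application/json" \
	  -d '{
	    "tool": "vidensarkiv.search",
	    "payload": {
	      "query": "How do I configure the system?",
	      "limit": 5
	    }
	  }'
	```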

	## Performance

	### Embedding Generation Speed
	- OpenAI: ~100ms per text
	- HuggingFace: ~300ms per text
	- Transformers.js: ~500ms per text (first run slower)

	### Batch Processing
	All providers support batch generation for better performance:
	```typescript
	// Generate 100 embeddings at once.
	// `embeddingService` is the shared EmbeddingService instance.
	const texts: string[] = [/* ... 100 texts ... */];
	const embeddings = await embeddingService.generateEmbeddings(texts);
	```
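For inputs larger than a single request should carry, one option is to split the texts into fixed-size groups and embed each group in turn. The sketch below assumes a batch function with the same shape as `generateEmbeddings(texts)`; `chunk`, `embedInChunks`, and the default chunk size of 100 are hypothetical choices for illustration.

```typescript
// Split an array into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("size must be a positive integer");
  }
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Any function that turns a list of texts into a list of vectors.
type BatchEmbedFn = (texts: string[]) => Promise<number[][]>;

// Embed texts group by group, preserving the input order.
async function embedInChunks(
  texts: string[],
  embedBatch: BatchEmbedFn,
  chunkSize: number = 100
): Promise<number[][]> {
  const all: number[][] = [];
  for (const group of chunk(texts, chunkSize)) {
    const vectors = await embedBatch(group);
    all.push(...vectors);
  }
  return all;
}
```

Groups are processed sequentially here to stay friendly to rate limits; running them in parallel would be faster but could trigger throttling on hosted providers.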

	## Troubleshooting

	### "No embedding provider available"
	Solution: Configure at least one provider:
	```bash
	npm install @xenova/transformers
	# Or set OPENAI_API_KEY or HUGGINGFACE_API_KEY
	```

	### Slow first search with Transformers.js
	Solution: The model (~50MB) is downloaded on first use, so the first call is slow. Subsequent calls are fast.

	### Vector dimension mismatch
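	Solution: Providers produce vectors of different dimensions (for example, 1536 for OpenAI versus 384 for Transformers.js), so vectors stored under one provider cannot be compared with queries embedded by another. If you change providers, re-embed the existing data: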
	```typescript
	// Delete old embeddings
	await vectorStore.deleteNamespace("your_namespace");

	// Re-ingest data (will use new provider)
	```
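A small guard can surface a mismatch early, before a vector reaches the database. This is a sketch only: `assertDimension` and `needsReembedding` are hypothetical helpers, and the expected dimension would come from however the vector column is actually defined.

```typescript
// Throw a descriptive error if a vector does not have the expected length.
function assertDimension(embedding: number[], expected: number): void {
  if (embedding.length !== expected) {
    throw new Error(
      `Vector dimension mismatch: got ${embedding.length}, expected ${expected}`
    );
  }
}

// True when stored vectors were produced at a different dimension
// than the currently configured provider emits.
function needsReembedding(
  storedDimension: number,
  providerDimension: number
): boolean {
  return storedDimension !== providerDimension;
}
```

For example, data stored at 1536 dimensions would need re-embedding after switching to a 384-dimension local model.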

	## Next Steps

	1. Test semantic search - Try querying your knowledge base
	2. Configure provider - Choose OpenAI for best quality
	3. Monitor usage - Check logs for embedding generation
	4. Optimize - Batch similar operations

	---

	Status: ✅ Semantic search fully operational. Vector database is now intelligent.