| # Phase 6 Implementation Spec: Embeddings & Semantic Search | |
| **Goal**: Add vector search for semantic evidence retrieval. | |
| **Philosophy**: "Find what you mean, not just what you type." | |
| **Prerequisite**: Phase 5 complete (Magentic working) | |
| --- | |
| ## 1. Why Embeddings? | |
| Current limitation: **Keyword-only search misses semantically related papers.** | |
| Example problem: | |
| - User searches: "metformin alzheimer" | |
| - PubMed returns: Papers with exact keywords | |
| - MISSED: Papers about "AMPK activation neuroprotection" (same mechanism, different words) | |
| With embeddings: | |
| - Embed the query AND all evidence | |
| - Find semantically similar papers even without keyword match | |
| - Deduplicate by meaning, not just URL | |
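The idea in the list above can be illustrated without any model: once texts are mapped to vectors, retrieval is a ranking by cosine similarity. The sketch below uses hand-made 3-dimensional vectors as stand-ins for real embeddings. The numbers and the `rank_by_similarity` helper are invented for illustration only; a real embedding model would produce much longer vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: list[float], docs: dict[str, list[float]]) -> list[str]:
    """Return document titles sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in docs.items()]
    return [title for _, title in sorted(scored, reverse=True)]


# Toy vectors (made up): axes loosely mean [metformin/AMPK, neuro, weather].
toy_docs = {
    "AMPK activation neuroprotection": [0.8, 0.6, 0.0],
    "Metformin in type 2 diabetes": [0.9, 0.1, 0.0],
    "Regional rainfall patterns": [0.0, 0.1, 0.9],
}
toy_query = [0.7, 0.7, 0.0]  # stand-in for embed("metformin alzheimer")

ranking = rank_by_similarity(toy_query, toy_docs)
print(ranking)
```

With these toy vectors the AMPK paper ranks first even though its title shares no keyword with the query, which is the behavior Phase 6 is meant to add.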
| --- | |
| ## 2. Architecture | |
| ### Current (Phase 5) | |
| ``` | |
| Query → SearchAgent → PubMed/Web (keyword) → Evidence | |
| ``` | |
| ### Phase 6 | |
| ``` | |
| Query → Embed(Query) → SearchAgent | |
|                          ├── PubMed/Web (keyword) → Evidence | |
|                          └── VectorDB (semantic) → Related Evidence | |
|                                       ↑ | |
|                          Evidence → Embed → Store | |
| ``` | |
| ### Shared Context Enhancement | |
| ```python | |
| # Current | |
| evidence_store = {"current": []} | |
| # Phase 6 | |
| evidence_store = { | |
|     "current": [],         # Raw evidence | |
|     "embeddings": {},      # URL -> embedding vector | |
|     "vector_index": None,  # ChromaDB collection | |
| } | |
| ``` | |
| --- | |
| ## 3. Technology Choice | |
| ### ChromaDB (Recommended) | |
| - **Free**, open-source, local-first | |
| - No API keys, no cloud dependency | |
| - Supports sentence-transformers out of the box | |
| - Well suited to a hackathon (no infrastructure to set up) | |
| ### Embedding Model | |
| - `sentence-transformers/all-MiniLM-L6-v2` (small and fast, good quality, 384-dimensional embeddings) | |
| - Or `BAAI/bge-small-en-v1.5` (better retrieval quality, still fast, also 384-dimensional) | |
| --- | |
| ## 4. Implementation | |
| ### 4.1 Dependencies | |
| Add to `pyproject.toml`: | |
| ```toml | |
| [project.optional-dependencies] | |
| embeddings = [ | |
|     "chromadb>=0.4.0", | |
|     "sentence-transformers>=2.2.0", | |
| ] | |
| ``` | |
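Because these packages are declared as an optional extra (installed, for example, with `pip install ".[embeddings]"`), the rest of the code may want to check for them before building an `EmbeddingService`. The following is a minimal sketch of such a guard; the helper name `embeddings_available` is hypothetical and not part of the spec.

```python
import importlib.util


def embeddings_available() -> bool:
    """Return True only if both optional packages can be located."""
    required = ("chromadb", "sentence_transformers")
    # find_spec returns None for a top-level module that is not installed
    return all(importlib.util.find_spec(name) is not None for name in required)


if __name__ == "__main__":
    if embeddings_available():
        print("Embeddings extra installed: semantic search can be enabled.")
    else:
        print("Embeddings extra missing: falling back to keyword-only search.")
```

A check like this pairs naturally with the optional `embedding_service` argument on `SearchAgent`: when it returns `False`, the agent can simply be constructed with `embedding_service=None`.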
| ### 4.2 Embedding Service (`src/services/embeddings.py`) | |
| ```python | |
| """Embedding service for semantic search.""" | |
| from typing import List | |
| import chromadb | |
| from sentence_transformers import SentenceTransformer | |
| class EmbeddingService: | |
|     """Handles text embedding and vector storage.""" | |
|     def __init__(self, model_name: str = "all-MiniLM-L6-v2"): | |
|         self._model = SentenceTransformer(model_name) | |
|         self._client = chromadb.Client()  # In-memory for hackathon | |
|         # get_or_create avoids an error if "evidence" already exists in this process | |
|         self._collection = self._client.get_or_create_collection( | |
|             name="evidence", | |
|             metadata={"hnsw:space": "cosine"}, | |
|         ) | |
|     def embed(self, text: str) -> List[float]: | |
|         """Embed a single text.""" | |
|         return self._model.encode(text).tolist() | |
|     def add_evidence(self, evidence_id: str, content: str, metadata: dict) -> None: | |
|         """Add evidence to vector store.""" | |
|         embedding = self.embed(content) | |
|         self._collection.add( | |
|             ids=[evidence_id], | |
|             embeddings=[embedding], | |
|             metadatas=[metadata], | |
|             documents=[content], | |
|         ) | |
|     def search_similar(self, query: str, n_results: int = 5) -> List[dict]: | |
|         """Find semantically similar evidence.""" | |
|         count = self._collection.count() | |
|         if count == 0: | |
|             return []  # Nothing stored yet, so nothing can match | |
|         query_embedding = self.embed(query) | |
|         results = self._collection.query( | |
|             query_embeddings=[query_embedding], | |
|             n_results=min(n_results, count),  # Never ask for more than is stored | |
|         ) | |
|         return [ | |
|             {"id": doc_id, "content": doc, "metadata": meta, "distance": dist} | |
|             for doc_id, doc, meta, dist in zip( | |
|                 results["ids"][0], | |
|                 results["documents"][0], | |
|                 results["metadatas"][0], | |
|                 results["distances"][0], | |
|             ) | |
|         ] | |
|     def deduplicate(self, new_evidence: List, threshold: float = 0.9) -> List: | |
|         """Remove semantically duplicate evidence.""" | |
|         unique = [] | |
|         for evidence in new_evidence: | |
|             similar = self.search_similar(evidence.content, n_results=1) | |
|             # Cosine distance = 1 - similarity: keep items farther than (1 - threshold) | |
|             if not similar or similar[0]["distance"] > (1 - threshold): | |
|                 unique.append(evidence) | |
|                 self.add_evidence( | |
|                     evidence_id=evidence.citation.url, | |
|                     content=evidence.content, | |
|                     metadata={"source": evidence.citation.source}, | |
|                 ) | |
|         return unique | |
| ``` | |
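The comparison `distance > (1 - threshold)` in `deduplicate` relies on the collection using cosine distance, which is 1 minus cosine similarity. With `threshold=0.9`, an item is therefore treated as a duplicate when its nearest stored neighbor has similarity of 0.9 or more (distance of 0.1 or less). The sketch below reproduces only that decision rule in plain Python, with made-up 2-dimensional vectors in place of real embeddings and a list in place of ChromaDB; `deduplicate_vectors` is a hypothetical helper, not part of `EmbeddingService`.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity, the quantity compared against (1 - threshold)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def deduplicate_vectors(items: list[tuple[str, list[float]]], threshold: float = 0.9) -> list[str]:
    """Keep an item only if no already-kept item is within (1 - threshold)."""
    kept: list[tuple[str, list[float]]] = []
    for label, vec in items:
        # Distance to the nearest kept item, or None when nothing is kept yet
        nearest = min((cosine_distance(vec, kept_vec) for _, kept_vec in kept), default=None)
        if nearest is None or nearest > (1 - threshold):
            kept.append((label, vec))
    return [label for label, _ in kept]


# Toy data (made up): the second item points in almost the same direction as the first.
toy_items = [
    ("paper A", [1.0, 0.0]),
    ("paper A (near-identical abstract)", [0.99, 0.05]),
    ("paper B", [0.0, 1.0]),
]
print(deduplicate_vectors(toy_items))
```

Raising `threshold` toward 1.0 makes deduplication stricter about what counts as "the same", so fewer items are dropped; lowering it merges more loosely related evidence.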
| ### 4.3 Enhanced SearchAgent (`src/agents/search_agent.py`) | |
| Update SearchAgent to use embeddings: | |
| ```python | |
| class SearchAgent(BaseAgent): | |
|     def __init__( | |
|         self, | |
|         search_handler: SearchHandlerProtocol, | |
|         evidence_store: dict, | |
|         embedding_service: EmbeddingService | None = None,  # NEW | |
|     ): | |
|         # ... existing init ... | |
|         self._embeddings = embedding_service | |
|     async def run(self, messages, *, thread=None, **kwargs) -> AgentRunResponse: | |
|         # ... extract query ... | |
|         # Execute keyword search | |
|         result = await self._handler.execute(query, max_results_per_tool=10) | |
|         # Semantic deduplication (NEW) | |
|         if self._embeddings: | |
|             unique_evidence = self._embeddings.deduplicate(result.evidence) | |
|             # Also search for semantically related evidence | |
|             related = self._embeddings.search_similar(query, n_results=5) | |
|             # Add related evidence not already in results | |
|             # ... merge logic ... | |
|         # ... rest of method ... | |
| ``` | |
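The merge step is left open in the snippet above. One thing to account for is that `deduplicate` has just added the new evidence to the vector store, so `search_similar(query)` will often return items that are already in the keyword results. The following is one possible sketch of the filtering, not the spec's own merge logic: the helper `select_new_related` and the `max_distance` cutoff are assumptions, and the URLs are placeholders.

```python
def select_new_related(
    seen_urls: list[str],
    related: list[dict],
    max_distance: float = 0.5,  # hypothetical cutoff, to be tuned
) -> list[dict]:
    """Return related hits that are not already present, closest first."""
    seen = set(seen_urls)
    fresh = []
    for hit in sorted(related, key=lambda h: h["distance"]):
        # Skip hits already in the results and hits that are too far away
        if hit["id"] in seen or hit["distance"] > max_distance:
            continue
        seen.add(hit["id"])
        fresh.append(hit)
    return fresh


# Placeholder data shaped like the dicts returned by search_similar()
keyword_urls = ["https://example.org/paper-1"]
related_hits = [
    {"id": "https://example.org/paper-1", "content": "...", "metadata": {}, "distance": 0.05},
    {"id": "https://example.org/paper-2", "content": "...", "metadata": {}, "distance": 0.30},
    {"id": "https://example.org/paper-3", "content": "...", "metadata": {}, "distance": 0.80},
]
new_hits = select_new_related(keyword_urls, related_hits)
print([hit["id"] for hit in new_hits])
```

This works on IDs because `deduplicate` stores each piece of evidence under its citation URL, so the URL can serve as the join key between keyword results and vector hits.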
| ### 4.4 Semantic Expansion in Orchestrator | |
| The MagenticOrchestrator can take advantage of embeddings for query expansion by describing the semantic search behavior in its task instruction: | |
| ```python | |
| # In task instruction | |
| task = f"""Research drug repurposing opportunities for: {query} | |
| The system has semantic search enabled. When evidence is found: | |
| 1. Related concepts will be automatically surfaced | |
| 2. Duplicates are removed by meaning, not just URL | |
| 3. Use the surfaced related concepts to refine searches | |
| """ | |
| ``` | |
| --- | |
| ## 5. Directory Structure After Phase 6 | |
| ``` | |
| src/ | |
| ├── services/                # NEW | |
| │   ├── __init__.py | |
| │   └── embeddings.py        # EmbeddingService | |
| ├── agents/ | |
| │   ├── search_agent.py      # Updated with embeddings | |
| │   └── judge_agent.py | |
| └── ... | |
| ``` | |
| --- | |
| ## 6. Tests | |
| ### 6.1 Unit Tests (`tests/unit/services/test_embeddings.py`) | |
| ```python | |
| """Unit tests for EmbeddingService.""" | |
| import numpy as np | |
| from src.services.embeddings import EmbeddingService | |
| def cosine(a, b) -> float: | |
|     """Cosine similarity between two vectors.""" | |
|     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))) | |
| class TestEmbeddingService: | |
|     def test_embed_returns_vector(self): | |
|         """Embedding should return a float vector.""" | |
|         service = EmbeddingService() | |
|         embedding = service.embed("metformin diabetes") | |
|         assert isinstance(embedding, list) | |
|         assert len(embedding) > 0 | |
|         assert all(isinstance(x, float) for x in embedding) | |
|     def test_similar_texts_have_close_embeddings(self): | |
|         """Semantically similar texts should have similar embeddings.""" | |
|         service = EmbeddingService() | |
|         e1 = service.embed("metformin treats diabetes") | |
|         e2 = service.embed("metformin is used for diabetes treatment") | |
|         e3 = service.embed("the weather is sunny today") | |
|         # Similar texts should be closer | |
|         assert cosine(e1, e2) > cosine(e1, e3) | |
|     def test_add_and_search(self): | |
|         """Should be able to add evidence and search for similar.""" | |
|         service = EmbeddingService() | |
|         service.add_evidence( | |
|             evidence_id="test1", | |
|             content="Metformin activates AMPK pathway", | |
|             metadata={"source": "pubmed"}, | |
|         ) | |
|         results = service.search_similar("AMPK activation drugs", n_results=1) | |
|         assert len(results) == 1 | |
|         assert "AMPK" in results[0]["content"] | |
| ``` | |
| --- | |
| ## 7. Definition of Done | |
| Phase 6 is **COMPLETE** when: | |
| 1. `EmbeddingService` implemented with ChromaDB | |
| 2. SearchAgent uses embeddings for deduplication | |
| 3. Semantic search surfaces related evidence | |
| 4. All unit tests pass | |
| 5. Integration test shows improved recall (finds related papers) | |
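Item 5 needs a concrete measure of "improved recall". One common choice is recall@k: the fraction of known-relevant papers that appear in the top k results. The sketch below shows the calculation only; the paper IDs are placeholders rather than measured results, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


# Placeholder IDs: "p*" are relevant papers, "x1" is an irrelevant one.
relevant_ids = {"p1", "p2", "p3", "p4"}
keyword_only = ["p1", "p2", "x1"]
with_semantic = ["p1", "p2", "p3", "x1"]

baseline = recall_at_k(keyword_only, relevant_ids, k=5)
improved = recall_at_k(with_semantic, relevant_ids, k=5)
print(f"keyword-only recall@5: {baseline:.2f}, with semantic search: {improved:.2f}")
```

An integration test could then assert that recall with the embedding service enabled is at least as high as the keyword-only baseline on a small hand-labeled query set.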
| --- | |
| ## 8. Value Delivered | |
| | Before (Phase 5) | After (Phase 6) | | |
| |------------------|-----------------| | |
| | Keyword-only search | Semantic + keyword search | | |
| | URL-based deduplication | Meaning-based deduplication | | |
| | Miss related papers | Surface related concepts | | |
| | Exact match required | Fuzzy semantic matching | | |
| **Expected improvement on the running example:** | |
| - Query: "metformin alzheimer" | |
| - Before: Only papers mentioning both words | |
| - After: Also finds "AMPK neuroprotection", "biguanide cognitive", etc. | |