Spaces:
Running on Zero
Running on Zero
File size: 6,092 Bytes
6f9a5fd | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 | # M11 β Embedding Service
**Spec version:** v1.0
**Depends on:** M03 (bus), X04 (config), X03 (observability), `sentence-transformers`, `torch`
**Depended on by:** M05 (RAG uses embed.text), M06 (marketplace.search uses embed.text)
---
## 1. Responsibility
Provide capabilities `embed.text@1.0` (and Phase 2 `embed.image@1.0`). Wrap one or more embedding backends, register them with the bus, batch incoming requests for throughput.
Embeddings are separated from `llm.*` because their workload is different: small models, high throughput, batchable, often CPU-runnable.
---
## 2. File layout
```
hearthnet/services/embedding/
βββ __init__.py
βββ service.py # EmbeddingService
βββ backends.py # SentenceTransformerBackend, OllamaEmbedBackend (Phase 2)
```
---
## 3. Public API
### 3.1 `backends.py`
```python
# hearthnet/services/embedding/backends.py
from typing import Protocol
class EmbeddingBackend(Protocol):
name: str # "sentence_transformers" | "ollama" | "hf_api"
model: str # "BAAI/bge-small-en-v1.5"
dim: int # 384 for bge-small
max_input: int # max chars per text
async def embed(self, texts: list[str], *, normalize: bool = True) -> list[list[float]]: ...
async def warm(self) -> None: ...
async def close(self) -> None: ...
def health(self) -> dict: ...
class SentenceTransformerBackend:
"""Local backend using sentence-transformers + torch."""
def __init__(self, model: str, device: str = "auto"):
"""device: 'auto' picks cuda if available else cpu."""
async def embed(self, texts, *, normalize=True): ...
async def warm(self): ...
async def close(self): ...
def health(self): ...
```
### 3.2 `service.py`
```python
# hearthnet/services/embedding/service.py
class EmbeddingService:
name = "embedding"
version = "1.0"
def __init__(self, config: EmbeddingConfig):
self._backend: EmbeddingBackend = SentenceTransformerBackend(
model=config.model, device=config.device
)
def capabilities(self) -> list[tuple[CapabilityDescriptor, Callable, ParamsPredicate]]:
"""Returns one entry for embed.text@1.0."""
async def start(self) -> None: ...
async def stop(self) -> None: ...
def health(self) -> dict: ...
# --- handler ---
async def handle_embed_text(self, req: RouteRequest) -> dict:
"""Implements embed.text@1.0 (CONTRACT Β§4.3)."""
```
### 3.3 Capability descriptor (`embed.text@1.0`)
```python
descriptor = CapabilityDescriptor(
name="embed.text",
version=(1, 0),
stability="stable",
request_schema={
"type": "object",
"required": ["params", "input"],
"properties": {
"params": {"type": "object", "properties": {
"model": {"type": "string"},
}, "required": ["model"]},
"input": {"type": "object", "required": ["texts"], "properties": {
"texts": {"type": "array", "items": {"type": "string"}, "minItems": 1, "maxItems": 256},
"normalize": {"type": "boolean", "default": True},
}},
},
},
response_schema={
"type": "object",
"required": ["output", "meta"],
"properties": {
"output": {"type": "object", "required": ["embeddings", "dim"], "properties": {
"embeddings": {"type": "array", "items": {"type": "array", "items": {"type": "number"}}},
"dim": {"type": "integer"},
}},
"meta": {"type": "object", "required": ["model", "ms"]},
},
},
stream_schema=None,
params={"model": "<from backend>"},
max_concurrent=8,
trust_required="member",
timeout_seconds=15,
idempotent=True,
)
```
### 3.4 `params_compatible` predicate
```python
def params_compatible(offered: dict, requested: dict) -> bool:
# request must specify model; must match offered exactly
return requested.get("model") == offered.get("model")
```
---
## 4. Behaviour
### 4.1 Batching (optional optimisation; Phase 1.5)
If multiple requests arrive within a small window (e.g. 20 ms), combine their `texts` arrays into one backend call. Demultiplex results back. MVP: no batching (per-request), simpler.
### 4.2 Validation
- > 256 texts β `bad_request`
- Any text > 8192 chars β `bad_request`
- Unknown model β `not_found` (model not loaded; consider asking another node)
### 4.3 Resource sizing
`max_concurrent = 8` is a sensible default on CPU. On GPU, increase via subclass that overrides `max_concurrent`. The number is declared in the manifest so the bus can throttle correctly.
---
## 5. Errors
Only the universal codes from [CONTRACT Β§9](../CAPABILITY_CONTRACT.md):
- `bad_request` β texts too long, too many, etc.
- `not_found` β model not loaded
- `internal_error` β backend crash
---
## 6. Configuration
From [X04 Β§3](../cross-cutting/X04-config.md):
```python
config.embedding.model # "BAAI/bge-small-en-v1.5"
config.embedding.device # "auto"
```
---
## 7. Tests
### Unit
- `test_descriptor_schema_validates_meta_schema`
- `test_handler_rejects_oversized_text`
- `test_handler_rejects_too_many_texts`
- `test_params_compatible_exact_model_match`
- `test_embed_normalises_to_unit_length`
### Integration
- `test_rag_calls_embed_via_bus` β RAG ingests, embeds via bus.call(), no direct service-to-service import
- `test_remote_embed_fallback` β local backend not loaded, bus routes to peer
---
## 8. Cross-references
| What | Where |
|------|-------|
| `embed.text@1.0` wire spec | [CONTRACT Β§4.3](../CAPABILITY_CONTRACT.md) |
| Service protocol | [M03 Β§4](M03-bus.md) |
| Consumed by RAG | [M05 Β§5](M05-rag.md) |
| Consumed by marketplace.search | [M06 Β§5](M06-marketplace.md) |
---
## 9. Open questions
1. **Phase 2 `embed.image@1.0` (CLIP)** β adds an image backend; `params` includes modality. Reserved.
2. **Batching policy** β measured trade-off; deferred. Defaults to immediate dispatch in MVP.
|