chmielvu committed 502fed8 (verified) · 1 parent: 54d3a50

Replace with FastEmbed ONNX models (jina-embeddings-v2-base-code + BM25 + reranker)

Files changed (4)
  1. Dockerfile +6 -7
  2. README.md +41 -42
  3. app.py +205 -346
  4. requirements.txt +5 -8
Dockerfile CHANGED
@@ -1,17 +1,16 @@
  FROM python:3.11-slim

- ENV PYTHONDONTWRITEBYTECODE=1 \
- PYTHONUNBUFFERED=1 \
- PIP_NO_CACHE_DIR=1 \
- PORT=7860
-
  WORKDIR /app

+ # Install dependencies
  COPY requirements.txt .
- RUN pip install --upgrade pip && pip install -r requirements.txt
+ RUN pip install --no-cache-dir -r requirements.txt

+ # Copy application
  COPY app.py .

+ # Expose port
  EXPOSE 7860

- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
+ # Run server
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -1,68 +1,67 @@
  ---
- title: Code-Embed-Qwen-rerank-sentiment
- colorFrom: gray
- colorTo: indigo
+ title: FastEmbed Code Embeddings
+ emoji: 🚀
+ colorFrom: blue
+ colorTo: green
  sdk: docker
- app_port: 7860
- pinned: true
+ pinned: false
+ license: apache-2.0
  ---

- # Code-Embed-Qwen-rerank-sentiment
+ # FastEmbed Code Embeddings Server

- Live API: `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space`
+ CPU-optimized embedding server using **FastEmbed** with ONNX quantized models.

  ## Models

- - Code embeddings: `jinaai/jina-code-embeddings-0.5b`
- Served name: `code-embed`
- Vector dimension: `896`
- - Reranker: `Qwen/Qwen3-Reranker-0.6B`
- Served name: `code-rerank`
- - Classifier: `clapAI/modernBERT-base-multilingual-sentiment`
- Served name: `code-sentiment`
- - Image embeddings: `sentence-transformers/clip-ViT-B-32`
- Served name: `clip-image`
- Vector dimension: `512`
+ | Type | Model | Dimensions | Size |
+ |------|-------|------------|------|
+ | **Dense** | `jinaai/jina-embeddings-v2-base-code` | 768 | 0.64 GB |
+ | **Sparse** | `Qdrant/bm25` | BM25 | 0.01 GB |
+ | **Reranker** | `jinaai/jina-reranker-v1-tiny-en` | - | 0.13 GB |

- ## Endpoints
+ **Total: ~0.78 GB** - Fits easily in CPU Basic (2 vCPU, 16GB RAM)

- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/health`
- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/models`
- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/embeddings`
- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/embeddings_image`
- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/rerank`
- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/classify`
- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/openapi.json`
-
- ## OpenAI-Style Aliases
-
- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/v1/models`
- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/v1/embeddings`
- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/v1/rerank`
- - `https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/v1/classify`
-
- ## Example Requests
+ ## API Endpoints

+ ### Dense Embeddings
  ```bash
- curl -X POST "https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/embeddings" \
+ curl -X POST https://YOUR_SPACE.hf.space/v1/embeddings \
  -H "Content-Type: application/json" \
- -d '{"model":"code-embed","input":["def quick_sort(arr): return sorted(arr)"]}'
+ -d '{"input": ["def hello(): pass", "class Foo: ..."], "model": "code-embed"}'
  ```

+ ### Sparse BM25 Embeddings
  ```bash
- curl -X POST "https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/embeddings_image" \
+ curl -X POST https://YOUR_SPACE.hf.space/v1/sparse/embeddings \
  -H "Content-Type: application/json" \
- -d '{"model":"clip-image","input":["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png"]}'
+ -d '{"input": ["search query", "document text"]}'
  ```

+ ### Hybrid Search Embeddings
  ```bash
- curl -X POST "https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/rerank" \
+ curl -X POST https://YOUR_SPACE.hf.space/v1/hybrid/embeddings \
  -H "Content-Type: application/json" \
- -d '{"model":"code-rerank","query":"python quick sort implementation","documents":["def quick_sort(arr): return sorted(arr)","SELECT * FROM users ORDER BY created_at DESC"],"return_documents":true}'
+ -d '{"input": ["code snippet"]}'
  ```

+ ### Reranking
  ```bash
- curl -X POST "https://chmielvu-code-embed-qwen-rerank-sentiment.hf.space/classify" \
+ curl -X POST https://YOUR_SPACE.hf.space/v1/rerank \
  -H "Content-Type: application/json" \
- -d '{"model":"code-sentiment","input":["The API is fast and easy to use."]}'
+ -d '{"query": "python async function", "documents": ["doc1", "doc2", "doc3"]}'
  ```
+
+ ## Features
+
+ - **ONNX Runtime**: Optimized CPU inference, no PyTorch overhead
+ - **Model Caching**: Models loaded once, reused across requests
+ - **Hybrid Search**: Dense + sparse (BM25) for better retrieval
+ - **Code-Optimized**: `jina-embeddings-v2-base-code` specifically trained for code
+
+ ## Performance
+
+ Compared to PyTorch-based SentenceTransformers:
+ - **5-10x faster** on CPU
+ - **5x smaller** model footprint
+ - **Lower latency**: ONNX quantization + caching
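
Review note on the README's "Hybrid Search" feature: the hybrid endpoint returns a dense vector and a sparse `indices`/`values` pair per input, but the commit does not say how a client should combine the two signals. One common option is reciprocal rank fusion (RRF), sketched below. This is an assumption about client-side usage, not something the commit implements; the helper names `sparse_dot` and `reciprocal_rank_fusion` and the constant `k=60` are illustrative only.

```python
from collections import defaultdict


def sparse_dot(a_indices, a_values, b_indices, b_values) -> float:
    """Dot product of two sparse vectors given as parallel index/value lists."""
    b_lookup = dict(zip(b_indices, b_values))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a_indices, a_values))


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); documents are returned ordered by the summed score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings: one from dense cosine similarity, one from sparse scores.
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

RRF only needs the two orderings, so it sidesteps the fact that dense similarities and BM25-style scores live on different scales.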
app.py CHANGED
@@ -1,396 +1,208 @@
1
- import base64
2
- import gc
3
- import io
4
- import math
 
 
 
 
 
 
5
  import time
6
  import uuid
7
  from typing import Any, Literal
8
 
9
  import numpy as np
10
- import requests
11
- import torch
12
- import torch.nn.functional as F
13
- from fastapi import FastAPI, HTTPException
14
- from fastapi.responses import PlainTextResponse
15
- from PIL import Image
16
  from pydantic import BaseModel, ConfigDict, Field
17
- from sentence_transformers import SentenceTransformer
18
- from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
19
-
20
- torch.set_grad_enabled(False)
21
- torch.set_num_threads(2)
22
-
23
- OWNER = "chmielvu"
24
- APP_TITLE = "Code-Embed-Qwen-rerank-sentiment"
25
- DEFAULT_MODEL = "default/not-specified"
26
-
27
- MODEL_CONFIG = {
28
- "code-embed": {
29
- "repo_id": "jinaai/jina-code-embeddings-0.5b",
30
- "kind": "sentence-transformer",
31
- },
32
- "clip-image": {
33
- "repo_id": "sentence-transformers/clip-ViT-B-32",
34
- "kind": "sentence-transformer",
35
- },
36
- "code-rerank": {
37
- "repo_id": "Qwen/Qwen3-Reranker-0.6B",
38
- "kind": "qwen-reranker",
39
- },
40
- "code-sentiment": {
41
- "repo_id": "clapAI/modernBERT-base-multilingual-sentiment",
42
- "kind": "sequence-classification",
43
- },
44
- }
45
-
46
- QWEN_RERANK_INSTRUCTION = (
47
- "Given a developer or code search query, retrieve relevant passages, issue text, "
48
- "or code snippets that answer the query."
49
- )
50
 
51
  app = FastAPI(
52
- title=APP_TITLE,
53
- summary=(
54
- "CPU-first lazy-loading inference API for code embeddings, reranking, "
55
- "classification, and CLIP image embeddings."
56
- ),
57
- version="1.0.0",
58
  )
59
 
60
- _loaded_name: str | None = None
61
- _loaded_kind: str | None = None
62
- _loaded_bundle: dict[str, Any] = {}
63
 
 
 
 
 
 
 
64
 
65
- class CompatibleRequest(BaseModel):
66
- model_config = ConfigDict(extra="allow")
67
 
 
 
 
 
 
 
68
 
69
- class EmbeddingRequest(CompatibleRequest):
70
- input: str | list[str]
71
- model: str = DEFAULT_MODEL
72
- encoding_format: Literal["float", "base64"] = "float"
73
- user: str | None = None
74
- dimensions: int = 0
75
- modality: Literal["text", "image"] = "text"
76
 
 
 
 
 
 
 
77
 
78
- class RerankRequest(CompatibleRequest):
79
- query: str = Field(..., max_length=122880)
80
- documents: list[str] = Field(..., min_length=1, max_length=2048)
81
- return_documents: bool = False
82
- raw_scores: bool = False
83
- model: str = DEFAULT_MODEL
84
- top_n: int | None = None
85
 
 
86
 
87
- class ClassifyRequest(CompatibleRequest):
88
- input: list[str] = Field(..., min_length=1, max_length=2048)
89
- model: str = DEFAULT_MODEL
90
- raw_scores: bool = False
91
 
 
 
92
 
93
- def _now_ts() -> int:
94
- return int(time.time())
 
 
95
 
96
 
97
- def _make_id(prefix: str) -> str:
98
- return f"{prefix}-{uuid.uuid4().hex}"
99
 
 
 
100
 
101
- def _resolve_model_name(route: str, requested: str, modality: str | None = None) -> str:
102
- if requested != DEFAULT_MODEL:
103
- if requested not in MODEL_CONFIG:
104
- raise HTTPException(status_code=400, detail=f"Unknown model '{requested}'")
105
- return requested
106
- if route == "embeddings" and modality == "image":
107
- return "clip-image"
108
- defaults = {
109
- "embeddings": "code-embed",
110
- "rerank": "code-rerank",
111
- "classify": "code-sentiment",
112
- }
113
- return defaults[route]
114
-
115
-
116
- def _unload_current_model() -> None:
117
- global _loaded_name, _loaded_kind, _loaded_bundle
118
- _loaded_name = None
119
- _loaded_kind = None
120
- _loaded_bundle = {}
121
- gc.collect()
122
-
123
-
124
- def _load_sentence_transformer(repo_id: str) -> dict[str, Any]:
125
- model = SentenceTransformer(repo_id, trust_remote_code=True, device="cpu")
126
- return {"model": model}
127
-
128
-
129
- def _load_qwen_reranker(repo_id: str) -> dict[str, Any]:
130
- tokenizer = AutoTokenizer.from_pretrained(repo_id, padding_side="left")
131
- if tokenizer.pad_token is None:
132
- tokenizer.pad_token = tokenizer.eos_token
133
- model = AutoModelForCausalLM.from_pretrained(repo_id).eval()
134
- token_false_id = tokenizer.convert_tokens_to_ids("no")
135
- token_true_id = tokenizer.convert_tokens_to_ids("yes")
136
- prefix = (
137
- "<|im_start|>system\n"
138
- 'Judge whether the Document meets the requirements based on the Query and '
139
- 'the Instruct provided. Note that the answer can only be "yes" or "no".'
140
- "<|im_end|>\n<|im_start|>user\n"
141
- )
142
- suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
143
- prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
144
- suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
145
- return {
146
- "model": model,
147
- "tokenizer": tokenizer,
148
- "token_false_id": token_false_id,
149
- "token_true_id": token_true_id,
150
- "prefix_tokens": prefix_tokens,
151
- "suffix_tokens": suffix_tokens,
152
- "max_length": 4096,
153
- }
154
 
 
 
155
 
156
- def _load_sequence_classifier(repo_id: str) -> dict[str, Any]:
157
- tokenizer = AutoTokenizer.from_pretrained(repo_id)
158
- model = AutoModelForSequenceClassification.from_pretrained(repo_id).eval()
159
- return {"model": model, "tokenizer": tokenizer}
 
 
160
 
161
 
162
- def _get_model_bundle(name: str) -> tuple[str, dict[str, Any]]:
163
- global _loaded_name, _loaded_kind, _loaded_bundle
164
- if _loaded_name == name:
165
- return _loaded_kind or "", _loaded_bundle
166
 
167
- _unload_current_model()
168
- config = MODEL_CONFIG[name]
169
- kind = config["kind"]
170
- repo_id = config["repo_id"]
171
 
172
- if kind == "sentence-transformer":
173
- bundle = _load_sentence_transformer(repo_id)
174
- elif kind == "qwen-reranker":
175
- bundle = _load_qwen_reranker(repo_id)
176
- elif kind == "sequence-classification":
177
- bundle = _load_sequence_classifier(repo_id)
178
- else:
179
- raise HTTPException(status_code=500, detail=f"Unsupported kind '{kind}'")
180
 
181
- _loaded_name = name
182
- _loaded_kind = kind
183
- _loaded_bundle = bundle
184
- return kind, bundle
185
 
186
 
187
- def _usage_from_strings(values: list[str], tokenizer: Any | None = None) -> dict[str, int]:
188
- if tokenizer is None:
189
- total = sum(max(1, len(value.split())) for value in values)
190
- return {"prompt_tokens": total, "total_tokens": total}
191
- total = 0
192
- for value in values:
193
- total += len(tokenizer.encode(value, add_special_tokens=True))
194
- return {"prompt_tokens": total, "total_tokens": total}
 
 
 
 
195
 
196
 
197
  def _truncate_embedding(vector: np.ndarray, dimensions: int) -> np.ndarray:
198
- if dimensions and 0 < dimensions < vector.shape[0]:
199
- vector = vector[:dimensions]
200
- norm = np.linalg.norm(vector)
201
- if norm > 0:
202
- vector = vector / norm
203
  return vector
204
 
205
 
206
  def _vector_to_payload(vector: np.ndarray, encoding_format: str) -> list[float] | str:
207
- vector = vector.astype(np.float32)
208
  if encoding_format == "base64":
209
- return base64.b64encode(vector.tobytes()).decode("ascii")
 
210
  return vector.tolist()
211
 
212
 
213
- def _normalize_inputs(value: str | list[str]) -> list[str]:
214
- return value if isinstance(value, list) else [value]
215
-
216
-
217
- def _load_image_from_input(value: str) -> Image.Image:
218
- if value.startswith("data:"):
219
- _, data = value.split(",", 1)
220
- raw = base64.b64decode(data)
221
- return Image.open(io.BytesIO(raw)).convert("RGB")
222
- response = requests.get(value, timeout=30)
223
- response.raise_for_status()
224
- return Image.open(io.BytesIO(response.content)).convert("RGB")
225
-
226
-
227
- def _format_rerank_pair(query: str, document: str) -> str:
228
- return f"<Instruct>: {QWEN_RERANK_INSTRUCTION}\n<Query>: {query}\n<Document>: {document}"
229
-
230
-
231
- def _score_rerank(query: str, documents: list[str], raw_scores: bool, bundle: dict[str, Any]) -> list[float]:
232
- tokenizer = bundle["tokenizer"]
233
- model = bundle["model"]
234
- prefix_tokens = bundle["prefix_tokens"]
235
- suffix_tokens = bundle["suffix_tokens"]
236
- token_true_id = bundle["token_true_id"]
237
- token_false_id = bundle["token_false_id"]
238
- max_length = bundle["max_length"]
239
-
240
- pairs = [_format_rerank_pair(query, document) for document in documents]
241
- inputs = tokenizer(
242
- pairs,
243
- padding=False,
244
- truncation="longest_first",
245
- return_attention_mask=False,
246
- max_length=max_length - len(prefix_tokens) - len(suffix_tokens),
247
- )
248
-
249
- for idx, token_ids in enumerate(inputs["input_ids"]):
250
- inputs["input_ids"][idx] = prefix_tokens + token_ids + suffix_tokens
251
-
252
- padded = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
253
- logits = model(**padded).logits[:, -1, :]
254
- true_logits = logits[:, token_true_id]
255
- false_logits = logits[:, token_false_id]
256
-
257
- if raw_scores:
258
- return (true_logits - false_logits).detach().cpu().tolist()
259
-
260
- stacked = torch.stack([false_logits, true_logits], dim=1)
261
- probs = torch.nn.functional.softmax(stacked, dim=1)[:, 1]
262
- return probs.detach().cpu().tolist()
263
-
264
-
265
- def _classify_scores(texts: list[str], raw_scores: bool, bundle: dict[str, Any]) -> list[list[dict[str, float | str]]]:
266
- tokenizer = bundle["tokenizer"]
267
- model = bundle["model"]
268
- encoded = tokenizer(
269
- texts,
270
- padding=True,
271
- truncation=True,
272
- max_length=1024,
273
- return_tensors="pt",
274
- )
275
- logits = model(**encoded).logits.detach().cpu()
276
- problem_type = getattr(model.config, "problem_type", None)
277
-
278
- if problem_type == "multi_label_classification":
279
- score_tensor = torch.sigmoid(logits)
280
- else:
281
- score_tensor = torch.softmax(logits, dim=-1)
282
-
283
- label_lookup = model.config.id2label
284
- results: list[list[dict[str, float | str]]] = []
285
- for row_idx in range(logits.shape[0]):
286
- values = logits[row_idx] if raw_scores else score_tensor[row_idx]
287
- row = [
288
- {
289
- "label": label_lookup[col_idx],
290
- "score": float(values[col_idx].item()),
291
- }
292
- for col_idx in range(values.shape[0])
293
- ]
294
- row.sort(key=lambda item: item["score"], reverse=True)
295
- results.append(row)
296
- return results
297
-
298
-
299
- @app.get("/")
300
- def root() -> dict[str, str]:
301
- return {"message": APP_TITLE}
302
 
303
 
304
  @app.get("/health")
305
- def health() -> dict[str, float]:
306
- return {"unix": time.time()}
307
-
308
-
309
- @app.get("/models")
310
- @app.get("/v1/models")
311
- @app.get("/openai/v1/models")
312
- def models() -> dict[str, Any]:
313
- created = _now_ts()
314
- return {
315
- "object": "list",
316
- "data": [
317
- {
318
- "id": model_name,
319
- "object": "model",
320
- "created": created,
321
- "owned_by": OWNER,
322
- "root": config["repo_id"],
323
- }
324
- for model_name, config in MODEL_CONFIG.items()
325
- ],
326
- }
327
 
328
 
329
  @app.post("/embeddings")
330
  @app.post("/v1/embeddings")
331
- @app.post("/openai/v1/embeddings")
332
  def embeddings(request: EmbeddingRequest) -> dict[str, Any]:
333
- model_name = _resolve_model_name("embeddings", request.model, request.modality)
334
- kind, bundle = _get_model_bundle(model_name)
335
- if kind != "sentence-transformer":
336
- raise HTTPException(status_code=400, detail=f"Model '{model_name}' does not support embeddings")
337
-
338
- values = _normalize_inputs(request.input)
339
- model = bundle["model"]
340
-
341
- if request.modality == "image":
342
- images = [_load_image_from_input(value) for value in values]
343
- embeddings_np = np.asarray(model.encode(images, convert_to_numpy=True))
344
- usage = {"prompt_tokens": 0, "total_tokens": 0}
345
- else:
346
- embeddings_np = np.asarray(model.encode(values, convert_to_numpy=True))
347
- tokenizer = getattr(model, "tokenizer", None)
348
- usage = _usage_from_strings(values, tokenizer)
349
 
350
  data = []
351
- for idx, vector in enumerate(embeddings_np):
352
- vector = _truncate_embedding(vector, request.dimensions)
353
- data.append(
354
- {
355
- "object": "embedding",
356
- "embedding": _vector_to_payload(vector, request.encoding_format),
357
- "index": idx,
358
- }
359
- )
360
 
361
  return {
362
  "object": "list",
363
  "data": data,
364
- "model": model_name,
365
- "usage": usage,
366
  "id": _make_id("emb"),
367
  "created": _now_ts(),
368
  }
369
 
370
 
371
- @app.post("/embeddings_image")
372
- def embeddings_image(request: EmbeddingRequest) -> dict[str, Any]:
373
- image_request = EmbeddingRequest(
374
- input=request.input,
375
- model="clip-image" if request.model == DEFAULT_MODEL else request.model,
376
- encoding_format=request.encoding_format,
377
- user=request.user,
378
- dimensions=request.dimensions,
379
- modality="image",
380
- )
381
- return embeddings(image_request)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
382
 
383
 
384
  @app.post("/rerank")
385
  @app.post("/v1/rerank")
386
- @app.post("/openai/v1/rerank")
387
  def rerank(request: RerankRequest) -> dict[str, Any]:
388
- model_name = _resolve_model_name("rerank", request.model)
389
- kind, bundle = _get_model_bundle(model_name)
390
- if kind != "qwen-reranker":
391
- raise HTTPException(status_code=400, detail=f"Model '{model_name}' does not support reranking")
 
392
 
393
- scores = _score_rerank(request.query, request.documents, request.raw_scores, bundle)
394
  results = []
395
  for idx, score in enumerate(scores):
396
  item = {"index": idx, "relevance_score": float(score)}
@@ -398,42 +210,89 @@ def rerank(request: RerankRequest) -> dict[str, Any]:
398
  item["document"] = request.documents[idx]
399
  results.append(item)
400
 
401
- results.sort(key=lambda item: item["relevance_score"], reverse=True)
 
 
402
  if request.top_n is not None:
403
- results = results[: request.top_n]
404
 
405
- usage = _usage_from_strings([request.query] + request.documents, bundle["tokenizer"])
406
  return {
407
  "object": "rerank",
408
  "results": results,
409
- "model": model_name,
410
- "usage": usage,
 
 
 
411
  "id": _make_id("rerank"),
412
  "created": _now_ts(),
413
  }
414
 
415
 
416
- @app.post("/classify")
417
- @app.post("/v1/classify")
418
- @app.post("/openai/v1/classify")
419
- def classify(request: ClassifyRequest) -> dict[str, Any]:
420
- model_name = _resolve_model_name("classify", request.model)
421
- kind, bundle = _get_model_bundle(model_name)
422
- if kind != "sequence-classification":
423
- raise HTTPException(status_code=400, detail=f"Model '{model_name}' does not support classification")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
424
 
425
- data = _classify_scores(request.input, request.raw_scores, bundle)
426
- usage = _usage_from_strings(request.input, bundle["tokenizer"])
427
  return {
428
- "object": "classify",
429
  "data": data,
430
- "model": model_name,
431
- "usage": usage,
432
- "id": _make_id("classify"),
433
  "created": _now_ts(),
434
  }
435
 
436
 
437
- @app.get("/metrics", response_class=PlainTextResponse)
438
- def metrics() -> str:
439
- return ""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ FastEmbed-based Code Embedding Server
3
+ Optimized for CPU Basic (2 vCPU, 16GB RAM)
4
+
5
+ Models:
6
+ - Dense: jinaai/jina-embeddings-v2-base-code (768 dim, 0.64GB)
7
+ - Sparse: Qdrant/bm25 (BM25, 0.01GB)
8
+ - Reranker: jinaai/jina-reranker-v1-tiny-en (0.13GB)
9
+ """
10
+
11
  import time
12
  import uuid
13
  from typing import Any, Literal
14
 
15
  import numpy as np
16
+ from fastapi import FastAPI
 
 
 
 
 
17
  from pydantic import BaseModel, ConfigDict, Field
18
+
19
+ from fastembed import TextEmbedding, SparseTextEmbedding
20
+ from fastembed.rerank.cross_encoder import TextCrossEncoder
21
+
22
+ # Model names
23
+ DENSE_MODEL = "jinaai/jina-embeddings-v2-base-code"
24
+ SPARSE_MODEL = "Qdrant/bm25"
25
+ RERANKER_MODEL = "jinaai/jina-reranker-v1-tiny-en"
26
+
27
+ # Global model cache (loaded once, reused)
28
+ _dense_model: TextEmbedding | None = None
29
+ _sparse_model: SparseTextEmbedding | None = None
30
+ _reranker_model: TextCrossEncoder | None = None
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31
 
32
  app = FastAPI(
33
+ title="FastEmbed Code Embeddings",
34
+ summary="CPU-optimized code embeddings with BM25 sparse and reranking",
35
+ version="2.0.0",
 
 
 
36
  )
37
 
 
 
 
38
 
39
+ def _get_dense_model() -> TextEmbedding:
40
+ """Lazy-load dense model (cached globally)."""
41
+ global _dense_model
42
+ if _dense_model is None:
43
+ _dense_model = TextEmbedding(model_name=DENSE_MODEL)
44
+ return _dense_model
45
 
 
 
46
 
47
+ def _get_sparse_model() -> SparseTextEmbedding:
48
+ """Lazy-load sparse BM25 model (cached globally)."""
49
+ global _sparse_model
50
+ if _sparse_model is None:
51
+ _sparse_model = SparseTextEmbedding(model_name=SPARSE_MODEL)
52
+ return _sparse_model
53
 
 
 
 
 
 
 
 
54
 
55
+ def _get_reranker() -> TextCrossEncoder:
56
+ """Lazy-load reranker model (cached globally)."""
57
+ global _reranker_model
58
+ if _reranker_model is None:
59
+ _reranker_model = TextCrossEncoder(model_name=RERANKER_MODEL)
60
+ return _reranker_model
61
 
 
 
 
 
 
 
 
62
 
63
+ # ==================== Request Models ====================
64
 
 
 
 
 
65
 
66
+ class EmbeddingRequest(BaseModel):
67
+ model_config = ConfigDict(extra="allow")
68
 
69
+ input: str | list[str]
70
+ model: str = "code-embed"
71
+ encoding_format: Literal["float", "base64"] = "float"
72
+ dimensions: int = 0 # 0 = full dimensions
73
 
74
 
75
+ class SparseEmbeddingRequest(BaseModel):
76
+ model_config = ConfigDict(extra="allow")
77
 
78
+ input: str | list[str]
79
+ model: str = "bm25"
80
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
81
 
82
+ class RerankRequest(BaseModel):
83
+ model_config = ConfigDict(extra="allow")
84
 
85
+ query: str = Field(..., max_length=8192)
86
+ documents: list[str] = Field(..., min_length=1, max_length=256)
87
+ return_documents: bool = False
88
+ raw_scores: bool = False
89
+ model: str = "code-rerank"
90
+ top_n: int | None = None
91
 
92
 
93
+ class HybridRequest(BaseModel):
94
+ """Request for hybrid search embeddings (dense + sparse)."""
95
+ model_config = ConfigDict(extra="allow")
 
96
 
97
+ input: str | list[str]
98
+ dense_model: str = "code-embed"
99
+ sparse_model: str = "bm25"
 
100
 
 
 
 
 
 
 
 
 
101
 
102
+ # ==================== Helper Functions ====================
 
 
 
103
 
104
 
105
+ def _now_ts() -> int:
106
+ return int(time.time())
107
+
108
+
109
+ def _make_id(prefix: str) -> str:
110
+ return f"{prefix}-{uuid.uuid4().hex}"
111
+
112
+
113
+ def _normalize_input(input: str | list[str]) -> list[str]:
114
+ if isinstance(input, str):
115
+ return [input]
116
+ return input
117
 
118
 
119
  def _truncate_embedding(vector: np.ndarray, dimensions: int) -> np.ndarray:
120
+ if dimensions > 0 and dimensions < len(vector):
121
+ return vector[:dimensions]
 
 
 
122
  return vector
123
 
124
 
125
  def _vector_to_payload(vector: np.ndarray, encoding_format: str) -> list[float] | str:
 
126
  if encoding_format == "base64":
127
+ import base64
128
+ return base64.b64encode(vector.astype(np.float32).tobytes()).decode()
129
  return vector.tolist()
130
 
131
 
132
+ # ==================== API Endpoints ====================
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
133
 
134
 
135
  @app.get("/health")
136
+ def health() -> dict[str, str]:
137
+ return {"status": "ok", "models": f"{DENSE_MODEL} + {SPARSE_MODEL} + {RERANKER_MODEL}"}
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
138
 
139
 
140
  @app.post("/embeddings")
141
  @app.post("/v1/embeddings")
 
142
  def embeddings(request: EmbeddingRequest) -> dict[str, Any]:
143
+ """Generate dense embeddings using jina-embeddings-v2-base-code."""
144
+ texts = _normalize_input(request.input)
145
+ model = _get_dense_model()
146
+
147
+ # Generate embeddings (ONNX-optimized, cached)
148
+ embeddings_list = list(model.embed(texts))
 
 
 
 
 
 
 
 
 
 
149
 
150
  data = []
151
+ for idx, embedding in enumerate(embeddings_list):
152
+ embedding = _truncate_embedding(embedding, request.dimensions)
153
+ data.append({
154
+ "object": "embedding",
155
+ "embedding": _vector_to_payload(embedding, request.encoding_format),
156
+ "index": idx,
157
+ })
 
 
158
 
159
  return {
160
  "object": "list",
161
  "data": data,
162
+ "model": request.model,
163
+ "usage": {"prompt_tokens": sum(len(t.split()) for t in texts), "total_tokens": sum(len(t.split()) for t in texts)},
164
  "id": _make_id("emb"),
165
  "created": _now_ts(),
166
  }
167
 
168
 
169
+ @app.post("/sparse/embeddings")
170
+ @app.post("/v1/sparse/embeddings")
171
+ def sparse_embeddings(request: SparseEmbeddingRequest) -> dict[str, Any]:
172
+ """Generate sparse BM25 embeddings."""
173
+ texts = _normalize_input(request.input)
174
+ model = _get_sparse_model()
175
+
176
+ # Generate sparse embeddings
177
+ sparse_embeddings = list(model.embed(texts))
178
+
179
+ data = []
180
+ for idx, emb in enumerate(sparse_embeddings):
181
+ data.append({
182
+ "object": "sparse_embedding",
183
+ "indices": emb.indices.tolist(),
184
+ "values": emb.values.tolist(),
185
+ "index": idx,
186
+ })
187
+
188
+ return {
189
+ "object": "list",
190
+ "data": data,
191
+ "model": request.model,
192
+ "id": _make_id("sparse"),
193
+ "created": _now_ts(),
194
+ }
195
 
196
 
197
  @app.post("/rerank")
198
  @app.post("/v1/rerank")
 
199
  def rerank(request: RerankRequest) -> dict[str, Any]:
200
+ """Rerank documents using cross-encoder."""
201
+ reranker = _get_reranker()
202
+
203
+ # Compute rerank scores
204
+ scores = reranker.rerank(request.query, request.documents)
205
 
 
206
  results = []
207
  for idx, score in enumerate(scores):
208
  item = {"index": idx, "relevance_score": float(score)}
 
210
  item["document"] = request.documents[idx]
211
  results.append(item)
212
 
213
+ # Sort by relevance
214
+ results.sort(key=lambda x: x["relevance_score"], reverse=True)
215
+
216
  if request.top_n is not None:
217
+ results = results[:request.top_n]
218
 
 
219
  return {
220
  "object": "rerank",
221
  "results": results,
222
+ "model": request.model,
223
+ "usage": {
224
+ "prompt_tokens": len(request.query.split()) + sum(len(d.split()) for d in request.documents),
225
+ "total_tokens": len(request.query.split()) + sum(len(d.split()) for d in request.documents),
226
+ },
227
  "id": _make_id("rerank"),
228
  "created": _now_ts(),
229
  }
230
 
231
 
232
+ @app.post("/hybrid/embeddings")
233
+ @app.post("/v1/hybrid/embeddings")
234
+ def hybrid_embeddings(request: HybridRequest) -> dict[str, Any]:
235
+ """Generate both dense and sparse embeddings for hybrid search."""
236
+ texts = _normalize_input(request.input)
237
+
238
+ dense_model = _get_dense_model()
239
+ sparse_model = _get_sparse_model()
240
+
241
+ # Generate both
242
+ dense_embeddings = list(dense_model.embed(texts))
243
+ sparse_embeddings = list(sparse_model.embed(texts))
244
+
245
+ data = []
246
+ for idx, (dense, sparse) in enumerate(zip(dense_embeddings, sparse_embeddings)):
247
+ data.append({
248
+ "object": "hybrid_embedding",
249
+ "dense": {
250
+ "vector": dense.tolist(),
251
+ "dim": len(dense),
252
+ },
253
+ "sparse": {
254
+ "indices": sparse.indices.tolist(),
255
+ "values": sparse.values.tolist(),
256
+ },
257
+ "index": idx,
258
+ })
259
 
 
 
260
  return {
261
+ "object": "list",
262
  "data": data,
263
+ "model": f"{request.dense_model} + {request.sparse_model}",
264
+ "id": _make_id("hybrid"),
 
265
  "created": _now_ts(),
266
  }
267
 
268
 
269
+ # ==================== Model Info ====================
270
+
271
+
272
+ @app.get("/models")
273
+ def list_models() -> dict[str, Any]:
274
+ """List supported models and their specs."""
275
+ return {
276
+ "dense": {
277
+ "model": DENSE_MODEL,
278
+ "dim": 768,
279
+ "size_gb": 0.64,
280
+ "type": "code-optimized",
281
+ },
282
+ "sparse": {
283
+ "model": SPARSE_MODEL,
284
+ "type": "bm25",
285
+ "size_gb": 0.01,
286
+ "requires_idf": True,
287
+ },
288
+ "reranker": {
289
+ "model": RERANKER_MODEL,
290
+ "size_gb": 0.13,
291
+ "type": "cross-encoder",
292
+ },
293
+ }
294
+
295
+
296
+ if __name__ == "__main__":
297
+ import uvicorn
298
+ uvicorn.run(app, host="0.0.0.0", port=7860)
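
Review note on the lazy-loading helpers in app.py: the endpoints are plain `def` functions, which FastAPI runs in a thread pool, so two first requests arriving together could both see `None` and each construct the model. A minimal sketch of a lock-guarded loader follows. `LazyModel` is a hypothetical helper, not part of this commit or of FastEmbed, and the factory passed to it stands in for a call such as `TextEmbedding(model_name=DENSE_MODEL)`.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Build an expensive object once, on first use, even under concurrency."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._instance: Optional[T] = None

    def get(self) -> T:
        # Fast path: no locking once the instance exists.
        if self._instance is None:
            with self._lock:
                # Re-check inside the lock so only one thread runs the factory.
                if self._instance is None:
                    self._instance = self._factory()
        return self._instance


# Usage with a stand-in factory that records how often it runs.
build_calls = []
lazy = LazyModel(lambda: build_calls.append(1) or object())

threads = [threading.Thread(target=lazy.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With this shape, each of the three module-level globals would become one `LazyModel` instance and the `_get_*` helpers would reduce to a `.get()` call.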
requirements.txt CHANGED
@@ -1,8 +1,5 @@
- fastapi==0.128.0
- uvicorn[standard]==0.35.0
- torch>=2.3.0
- transformers>=4.57.0
- sentence-transformers>=3.0.0
- pillow>=10.0.0
- requests>=2.32.0
- numpy>=1.26.0
+ fastembed>=0.4.0
+ fastembed-rerank>=0.1.0
+ fastapi>=0.109.0
+ uvicorn>=0.27.0
+ numpy>=1.24.0
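
Client-side note on `encoding_format: "base64"`: the new `_vector_to_payload` in app.py base64-encodes the raw bytes of a `float32` array. A client could decode that payload with the standard library alone, as sketched below. The function name `decode_embedding` is illustrative, and little-endian byte order is an assumption (NumPy's `tobytes()` uses the server's native order, which is little-endian on common x86 and ARM hosts).

```python
import base64
import struct


def decode_embedding(payload: str) -> list[float]:
    """Decode a base64 string of packed little-endian float32 values."""
    raw = base64.b64decode(payload)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))


# Round trip with values that float32 represents exactly.
sample = base64.b64encode(struct.pack("<3f", 1.0, -0.5, 0.25)).decode("ascii")
decoded = decode_embedding(sample)
```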