JacobLinCool Codex committed
Commit ca766b5 · verified · 1 Parent(s): b03e3b9

fix: stabilize llama embedding runtime


Co-authored-by: Codex <noreply@openai.com>

DESIGN.md CHANGED
@@ -126,7 +126,7 @@ investigate → ideate → score loop — the experience collapses without the m
126
  |---|---|---|---|---|---|
127
  | STT (batch voice input) | **`nvidia/nemotron-speech-streaming-en-0.6b`** | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | 🟩 NVIDIA Nemotron Quest |
128
  | LLM brain | **`openbmb/MiniCPM5-1B`** ("OpenCPM5") | 1.08B | **transformers** (self-parse XML) / llama.cpp | **Apache-2.0** | 🏮 OpenBMB |
129
- | Embedder | **`ggml-org/embeddinggemma-300M-qat-q4_0-GGUF`** | ~300M | llama.cpp / llama-cpp-python | Gemma | 🔌 Off the Grid · 🦙 Llama Champion · 🟢 Modal |
130
  | Fine-tune | LoRA on MiniCPM5 → published to Hub | — | PEFT / HF Jobs | — | 🎯 Well-Tuned |
131
 
132
  **Total ≈ 1.98B params → ≤4B → 🐜 Tiny Titan eligible.** All open-weight, all runnable locally → 🔌 Off the Grid.
@@ -191,9 +191,9 @@ With **text-first + batch ASR**, the old "streaming ASR vs ZeroGPU" Config A/B t
191
  tool-calling is a pending PR — verify before relying on it for the badge runtime.)
192
  - **1B discipline:** small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
193
 
194
- ### 5.4 EmbeddingGemma GGUF — `ggml-org/embeddinggemma-300M-qat-q4_0-GGUF`
195
 
196
- - Active retrieval model: `embeddinggemma-300M-qat-Q4_0.gguf`, 768-dimensional normalized embeddings.
197
  - Build-time path: Modal remote function runs `llama-cpp-python` with mean pooling and writes `data/project_index.json`.
198
  - Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search
199
  over checked-in project vectors.
@@ -208,10 +208,10 @@ llama.cpp on Modal, and runtime query embeddings use the same llama.cpp path.
208
  | Model | llama.cpp? | Runtime | Notes |
209
  |---|---|---|---|
210
  | `openbmb/MiniCPM5-1B` | ✅ planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
211
- | `ggml-org/embeddinggemma-300M-qat-q4_0-GGUF` | ✅ active | llama.cpp / llama-cpp-python | Builds project vectors on Modal and embeds runtime queries in the Space. |
212
  | ASR (Nemotron) | ❌ | NeMo | FastConformer-RNNT |
213
 
214
- If retrieval quality becomes the bottleneck, compare Q4_0 against Q8_0, but do not keep two runtime retrieval paths.
215
 
216
  ---
217
 
@@ -310,7 +310,7 @@ canonical command is:
310
  ```
311
 
312
  The remote function installs `llama-cpp-python`, downloads
313
- `ggml-org/embeddinggemma-300M-qat-q4_0-GGUF/embeddinggemma-300M-qat-Q4_0.gguf`, embeds every project card through
314
  llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.
315
 
316
  Latest successful run: `hackathon-advisor-llama-index` on Modal, producing a 100-document, 768-dimensional normalized
 
126
  |---|---|---|---|---|---|
127
  | STT (batch voice input) | **`nvidia/nemotron-speech-streaming-en-0.6b`** | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | 🟩 NVIDIA Nemotron Quest |
128
  | LLM brain | **`openbmb/MiniCPM5-1B`** ("OpenCPM5") | 1.08B | **transformers** (self-parse XML) / llama.cpp | **Apache-2.0** | 🏮 OpenBMB |
129
+ | Embedder | **`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`** | ~300M | llama.cpp / llama-cpp-python | Gemma | 🔌 Off the Grid · 🦙 Llama Champion · 🟢 Modal |
130
  | Fine-tune | LoRA on MiniCPM5 → published to Hub | — | PEFT / HF Jobs | — | 🎯 Well-Tuned |
131
 
132
  **Total ≈ 1.98B params → ≤4B → 🐜 Tiny Titan eligible.** All open-weight, all runnable locally → 🔌 Off the Grid.
 
191
  tool-calling is a pending PR — verify before relying on it for the badge runtime.)
192
  - **1B discipline:** small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
193
 
194
+ ### 5.4 EmbeddingGemma GGUF — `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`
195
 
196
+ - Active retrieval model: `embeddinggemma-300m-qat-Q8_0.gguf`, 768-dimensional normalized embeddings.
197
  - Build-time path: Modal remote function runs `llama-cpp-python` with mean pooling and writes `data/project_index.json`.
198
  - Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search
199
  over checked-in project vectors.
 
208
  | Model | llama.cpp? | Runtime | Notes |
209
  |---|---|---|---|
210
  | `openbmb/MiniCPM5-1B` | ✅ planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
211
+ | `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | ✅ active | llama.cpp / llama-cpp-python | Builds project vectors on Modal and embeds runtime queries in the Space. |
212
  | ASR (Nemotron) | ❌ | NeMo | FastConformer-RNNT |
213
 
214
+ The checked-in index and runtime query embedder must stay on the same GGUF file.
215
 
216
  ---
217
 
 
310
  ```
311
 
312
  The remote function installs `llama-cpp-python`, downloads
313
+ `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/embeddinggemma-300m-qat-Q8_0.gguf`, embeds every project card through
314
  llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.
315
 
316
  Latest successful run: `hackathon-advisor-llama-index` on Modal, producing a 100-document, 768-dimensional normalized
README.md CHANGED
@@ -59,7 +59,7 @@ python scripts/generate_sample_trace.py --projects data/projects.json --index da
59
 
60
  The app uses `data/projects.json` and `data/project_index.json` at runtime. The index validates the snapshot timestamp,
61
  source, project order, digest, embedding dimensions, and normalized vector shape before the app starts. The canonical
62
- index is built on Modal with `ggml-org/embeddinggemma-300M-qat-q4_0-GGUF` through llama.cpp; runtime search embeds the
63
  user query with the same GGUF model and performs local cosine search over the checked-in vectors.
64
 
65
  ## Trace Artifact
@@ -186,8 +186,8 @@ ADVISOR_MODEL_BACKEND=minicpm-transformers
186
  ADVISOR_MODEL_ID=openbmb/MiniCPM5-1B
187
  ADVISOR_ADAPTER_ID=build-small-hackathon/hackathon-advisor-minicpm5-lora
188
  ADVISOR_ADAPTER_REVISION=25de69bcde397e1bcdd852923b56a42f10222650
189
- ADVISOR_EMBEDDING_MODEL_REPO=ggml-org/embeddinggemma-300M-qat-q4_0-GGUF
190
- ADVISOR_EMBEDDING_MODEL_FILE=embeddinggemma-300M-qat-Q4_0.gguf
191
  ADVISOR_ASR_MODEL_ID=nvidia/nemotron-speech-streaming-en-0.6b
192
  ```
193
 
 
59
 
60
  The app uses `data/projects.json` and `data/project_index.json` at runtime. The index validates the snapshot timestamp,
61
  source, project order, digest, embedding dimensions, and normalized vector shape before the app starts. The canonical
62
+ index is built on Modal with `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` through llama.cpp; runtime search embeds the
63
  user query with the same GGUF model and performs local cosine search over the checked-in vectors.
64
 
65
  ## Trace Artifact
 
186
  ADVISOR_MODEL_ID=openbmb/MiniCPM5-1B
187
  ADVISOR_ADAPTER_ID=build-small-hackathon/hackathon-advisor-minicpm5-lora
188
  ADVISOR_ADAPTER_REVISION=25de69bcde397e1bcdd852923b56a42f10222650
189
+ ADVISOR_EMBEDDING_MODEL_REPO=ggml-org/embeddinggemma-300m-qat-q8_0-GGUF
190
+ ADVISOR_EMBEDDING_MODEL_FILE=embeddinggemma-300m-qat-Q8_0.gguf
191
  ADVISOR_ASR_MODEL_ID=nvidia/nemotron-speech-streaming-en-0.6b
192
  ```
193
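Note on the retrieval path described in the README hunk above: because the index stores normalized vectors and the query is embedded with `normalize=True`, cosine similarity reduces to a plain dot product. The following is a minimal sketch under that assumption, not the repository's search code; `normalize` and `top_k` are hypothetical names, and the 2-dimensional vectors stand in for the real 768-dimensional ones.

```python
import math
from typing import List, Tuple


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(query: List[float], index: List[List[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (position, score) pairs for the k best matches.

    For unit-length vectors the dot product equals the cosine similarity,
    so no per-document norm is needed at query time.
    """
    scored = [
        (sum(q * d for q, d in zip(query, doc)), position)
        for position, doc in enumerate(index)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(position, score) for score, position in scored[:k]]


# Toy index: three "documents" pointing along x, along y, and along the diagonal.
index = [normalize([1.0, 0.0]), normalize([0.0, 1.0]), normalize([1.0, 1.0])]
query = normalize([1.0, 0.1])
ranking = top_k(query, index, k=2)
```

This shortcut only holds if the index and the query come from the same model file with the same pooling, which is why the commit keeps both on one GGUF.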
 
app.py CHANGED
@@ -21,6 +21,7 @@ from hackathon_advisor.lora_dataset import build_lora_dataset_jsonl
21
  from hackathon_advisor.lora_training_kit import TRAINING_KIT_FILENAME, build_lora_training_kit_zip
22
  from hackathon_advisor.png_export import artifact_png_filename, render_artifact_png
23
  from hackathon_advisor.prize_ledger import prize_ledger
 
24
  from hackathon_advisor.submission_packet import build_submission_packet_markdown
25
  from hackathon_advisor.tool_contracts import resolve_tool_call, tool_schemas
26
  from hackathon_advisor.tools import GOALS, goal_profiles
@@ -28,6 +29,8 @@ from hackathon_advisor.trace_export import build_trace_jsonl, trace_metadata
28
  from hackathon_advisor.zerogpu import gpu_task
29
 
30
 
 
 
31
  ROOT = Path(__file__).parent
32
  STATIC_DIR = ROOT / "static"
33
  DATA_PATH = ROOT / "data" / "projects.json"
 
21
  from hackathon_advisor.lora_training_kit import TRAINING_KIT_FILENAME, build_lora_training_kit_zip
22
  from hackathon_advisor.png_export import artifact_png_filename, render_artifact_png
23
  from hackathon_advisor.prize_ledger import prize_ledger
24
+ from hackathon_advisor.runtime_hooks import install_asyncio_cleanup_hook
25
  from hackathon_advisor.submission_packet import build_submission_packet_markdown
26
  from hackathon_advisor.tool_contracts import resolve_tool_call, tool_schemas
27
  from hackathon_advisor.tools import GOALS, goal_profiles
 
29
  from hackathon_advisor.zerogpu import gpu_task
30
 
31
 
32
+ install_asyncio_cleanup_hook()
33
+
34
  ROOT = Path(__file__).parent
35
  STATIC_DIR = ROOT / "static"
36
  DATA_PATH = ROOT / "data" / "projects.json"
data/project_index.json CHANGED
The diff for this file is too large to render.
 
hackathon_advisor/data.py CHANGED
@@ -25,8 +25,8 @@ GENERIC_PUBLIC_SUMMARY_RE = re.compile(
25
 
26
  INDEX_SCHEMA_VERSION = 2
27
  INDEX_ALGORITHM = "llama-cpp-embedding-v1"
28
- DEFAULT_EMBEDDING_MODEL_REPO = "ggml-org/embeddinggemma-300M-qat-q4_0-GGUF"
29
- DEFAULT_EMBEDDING_MODEL_FILE = "embeddinggemma-300M-qat-Q4_0.gguf"
30
  DEFAULT_EMBEDDING_RUNTIME = "llama.cpp via llama-cpp-python"
31
 
32
 
 
25
 
26
  INDEX_SCHEMA_VERSION = 2
27
  INDEX_ALGORITHM = "llama-cpp-embedding-v1"
28
+ DEFAULT_EMBEDDING_MODEL_REPO = "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF"
29
+ DEFAULT_EMBEDDING_MODEL_FILE = "embeddinggemma-300m-qat-Q8_0.gguf"
30
  DEFAULT_EMBEDDING_RUNTIME = "llama.cpp via llama-cpp-python"
31
 
32
 
hackathon_advisor/llama_embedding.py CHANGED
@@ -12,7 +12,7 @@ from hackathon_advisor.data import (
12
 
13
 
14
  TRUE_VALUES = {"1", "true", "yes", "on"}
15
- DEFAULT_N_CTX = 512
16
 
17
 
18
  class LlamaCppEmbedder:
@@ -23,6 +23,7 @@ class LlamaCppEmbedder:
23
  model_file: str = DEFAULT_EMBEDDING_MODEL_FILE,
24
  model_path: str = "",
25
  n_ctx: int = DEFAULT_N_CTX,
 
26
  n_threads: int | None = None,
27
  n_gpu_layers: int = 0,
28
  verbose: bool = False,
@@ -31,6 +32,7 @@ class LlamaCppEmbedder:
31
  self.model_file = model_file.strip() or DEFAULT_EMBEDDING_MODEL_FILE
32
  self.model_path = model_path.strip()
33
  self.n_ctx = n_ctx
 
34
  self.n_threads = n_threads
35
  self.n_gpu_layers = n_gpu_layers
36
  self.verbose = verbose
@@ -63,6 +65,8 @@ class LlamaCppEmbedder:
63
  embedding=True,
64
  pooling_type=LLAMA_POOLING_TYPE_MEAN,
65
  n_ctx=self.n_ctx,
 
 
66
  n_threads=self.n_threads,
67
  n_gpu_layers=self.n_gpu_layers,
68
  verbose=self.verbose,
@@ -82,6 +86,7 @@ def create_llama_cpp_embedder(metadata: dict[str, Any]) -> LlamaCppEmbedder:
82
  ),
83
  model_path=os.environ.get("ADVISOR_EMBEDDING_MODEL_PATH", ""),
84
  n_ctx=_int_env("ADVISOR_EMBEDDING_N_CTX", DEFAULT_N_CTX),
 
85
  n_threads=_optional_int_env("ADVISOR_EMBEDDING_THREADS"),
86
  n_gpu_layers=_int_env("ADVISOR_EMBEDDING_GPU_LAYERS", 0),
87
  verbose=os.environ.get("ADVISOR_EMBEDDING_VERBOSE", "").strip().lower() in TRUE_VALUES,
 
12
 
13
 
14
  TRUE_VALUES = {"1", "true", "yes", "on"}
15
+ DEFAULT_N_CTX = 2048
16
 
17
 
18
  class LlamaCppEmbedder:
 
23
  model_file: str = DEFAULT_EMBEDDING_MODEL_FILE,
24
  model_path: str = "",
25
  n_ctx: int = DEFAULT_N_CTX,
26
+ n_batch: int | None = None,
27
  n_threads: int | None = None,
28
  n_gpu_layers: int = 0,
29
  verbose: bool = False,
 
32
  self.model_file = model_file.strip() or DEFAULT_EMBEDDING_MODEL_FILE
33
  self.model_path = model_path.strip()
34
  self.n_ctx = n_ctx
35
+ self.n_batch = n_batch or n_ctx
36
  self.n_threads = n_threads
37
  self.n_gpu_layers = n_gpu_layers
38
  self.verbose = verbose
 
65
  embedding=True,
66
  pooling_type=LLAMA_POOLING_TYPE_MEAN,
67
  n_ctx=self.n_ctx,
68
+ n_batch=self.n_batch,
69
+ n_ubatch=self.n_batch,
70
  n_threads=self.n_threads,
71
  n_gpu_layers=self.n_gpu_layers,
72
  verbose=self.verbose,
 
86
  ),
87
  model_path=os.environ.get("ADVISOR_EMBEDDING_MODEL_PATH", ""),
88
  n_ctx=_int_env("ADVISOR_EMBEDDING_N_CTX", DEFAULT_N_CTX),
89
+ n_batch=_optional_int_env("ADVISOR_EMBEDDING_BATCH"),
90
  n_threads=_optional_int_env("ADVISOR_EMBEDDING_THREADS"),
91
  n_gpu_layers=_int_env("ADVISOR_EMBEDDING_GPU_LAYERS", 0),
92
  verbose=os.environ.get("ADVISOR_EMBEDDING_VERBOSE", "").strip().lower() in TRUE_VALUES,
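Note on the batch default introduced above: `self.n_batch = n_batch or n_ctx` treats both `None` and `0` as "unset", so the batch size falls back to the context size, and the same value is passed as `n_ubatch`. The test name `test_llama_embedder_uses_q8_defaults_and_full_context` suggests the intent is for a full-context input to fit in one batch. A minimal sketch of that resolution follows; `optional_int_env` is a hypothetical stand-in for the repo's `_optional_int_env`, whose body is not shown in this diff, and `resolve_batch` is an illustrative name.

```python
import os
from typing import Optional


def optional_int_env(name: str) -> Optional[int]:
    """Hypothetical stand-in: parse an integer env var, or return None if unset or blank."""
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else None


def resolve_batch(n_batch: Optional[int], n_ctx: int) -> int:
    """Mirror `n_batch or n_ctx`: None and 0 both fall back to the context size."""
    return n_batch or n_ctx


# Unset variable: the batch follows the context size.
os.environ.pop("ADVISOR_EMBEDDING_BATCH", None)
default_batch = resolve_batch(optional_int_env("ADVISOR_EMBEDDING_BATCH"), 2048)

# Explicit override: the batch is taken from the environment.
os.environ["ADVISOR_EMBEDDING_BATCH"] = "256"
explicit_batch = resolve_batch(optional_int_env("ADVISOR_EMBEDDING_BATCH"), 2048)
os.environ.pop("ADVISOR_EMBEDDING_BATCH", None)
```

One consequence of the `or` idiom is that an explicit `0` cannot be used to request a zero batch. That is probably the intended behavior here, but it is worth knowing when setting `ADVISOR_EMBEDDING_BATCH`.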
hackathon_advisor/model_runtime.py CHANGED
@@ -128,13 +128,14 @@ class MiniCPMTransformersPlanner:
128
  )
129
  model = AutoModelForCausalLM.from_pretrained(
130
  base_model_id,
131
- torch_dtype="auto",
132
  device_map="auto",
133
  trust_remote_code=True,
134
  )
135
  if self.adapter_id:
136
  model = PeftModel.from_pretrained(model, self.adapter_id, **adapter_kwargs)
137
  model.eval()
 
138
  self._model = model
139
  if hasattr(torch, "inference_mode"):
140
  self._inference_mode = torch.inference_mode
@@ -228,6 +229,15 @@ def _strip_unused_generation_inputs(inputs: dict[str, Any]) -> None:
228
  inputs.pop("token_type_ids", None)
229
 
230

231
  def _normalize_xml_tool_output(output: str) -> str:
232
  stripped = output.strip()
233
  if stripped.startswith('name="'):
 
128
  )
129
  model = AutoModelForCausalLM.from_pretrained(
130
  base_model_id,
131
+ dtype="auto",
132
  device_map="auto",
133
  trust_remote_code=True,
134
  )
135
  if self.adapter_id:
136
  model = PeftModel.from_pretrained(model, self.adapter_id, **adapter_kwargs)
137
  model.eval()
138
+ _disable_sampling_generation_defaults(model)
139
  self._model = model
140
  if hasattr(torch, "inference_mode"):
141
  self._inference_mode = torch.inference_mode
 
229
  inputs.pop("token_type_ids", None)
230
 
231
 
232
+ def _disable_sampling_generation_defaults(model: Any) -> None:
233
+ generation_config = getattr(model, "generation_config", None)
234
+ if generation_config is None:
235
+ return
236
+ generation_config.do_sample = False
237
+ generation_config.temperature = None
238
+ generation_config.top_p = None
239
+
240
+
241
  def _normalize_xml_tool_output(output: str) -> str:
242
  stripped = output.strip()
243
  if stripped.startswith('name="'):
hackathon_advisor/prize_ledger.py CHANGED
@@ -14,7 +14,7 @@ MODEL_STACK = [
14
  },
15
  {
16
  "role": "Embedding retriever",
17
- "model": "ggml-org/embeddinggemma-300M-qat-q4_0-GGUF",
18
  "params_b": 0.30,
19
  "status": "deployed",
20
  "runtime": "Modal-built llama.cpp GGUF index + runtime llama.cpp query embeddings",
 
14
  },
15
  {
16
  "role": "Embedding retriever",
17
+ "model": "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF",
18
  "params_b": 0.30,
19
  "status": "deployed",
20
  "runtime": "Modal-built llama.cpp GGUF index + runtime llama.cpp query embeddings",
hackathon_advisor/runtime_hooks.py ADDED
@@ -0,0 +1,31 @@
1
+ from __future__ import annotations
2
+
3
+ import sys
4
+ from typing import Any
5
+
6
+
7
+ _HOOK_INSTALLED = False
8
+
9
+
10
+ def install_asyncio_cleanup_hook() -> None:
11
+ global _HOOK_INSTALLED
12
+ if _HOOK_INSTALLED:
13
+ return
14
+ previous_hook = sys.unraisablehook
15
+
16
+ def hook(args: Any) -> None:
17
+ if _is_asyncio_invalid_fd_cleanup(args):
18
+ return
19
+ previous_hook(args)
20
+
21
+ sys.unraisablehook = hook
22
+ _HOOK_INSTALLED = True
23
+
24
+
25
+ def _is_asyncio_invalid_fd_cleanup(args: Any) -> bool:
26
+ if getattr(args, "exc_type", None) is not ValueError:
27
+ return False
28
+ if str(getattr(args, "exc_value", "")) != "Invalid file descriptor: -1":
29
+ return False
30
+ owner = getattr(args, "object", None)
31
+ return getattr(owner, "__qualname__", "") == "BaseEventLoop.__del__"
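Note on the hook above: `sys.unraisablehook` receives reports for exceptions that Python cannot propagate, such as those raised inside `__del__`. The new module wraps the previous hook so that one known-noisy report is dropped and everything else is still forwarded. The sketch below shows that wrap-and-forward pattern in isolation; `make_filtering_hook` and `is_invalid_fd_report` are hypothetical names, and the predicate is deliberately simpler than the commit's, which also checks the owner's `__qualname__`.

```python
from types import SimpleNamespace
from typing import Any, Callable, List

Hook = Callable[[Any], None]


def make_filtering_hook(previous_hook: Hook, predicate: Callable[[Any], bool]) -> Hook:
    """Return a hook that drops reports matching `predicate` and forwards the rest."""

    def hook(args: Any) -> None:
        if predicate(args):
            return  # swallow the known-noisy report
        previous_hook(args)  # every other report still reaches the previous hook

    return hook


def is_invalid_fd_report(args: Any) -> bool:
    # Simplified illustration: match on exception type and message only.
    return (
        getattr(args, "exc_type", None) is ValueError
        and str(getattr(args, "exc_value", "")) == "Invalid file descriptor: -1"
    )


# Record forwarded reports in a list instead of touching the real sys.unraisablehook.
forwarded: List[Any] = []
hook = make_filtering_hook(forwarded.append, is_invalid_fd_report)

noisy = SimpleNamespace(
    exc_type=ValueError, exc_value=ValueError("Invalid file descriptor: -1"), object=None
)
other = SimpleNamespace(
    exc_type=ValueError, exc_value=ValueError("something else"), object=None
)

hook(noisy)  # filtered out
hook(other)  # forwarded to the previous hook
```

In real use the previous hook would be `sys.unraisablehook` itself, captured before assignment as the commit does. The `_HOOK_INSTALLED` guard matters for the same reason: installing twice would wrap the filter around itself.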
scripts/build_project_index.py CHANGED
@@ -16,7 +16,7 @@ from hackathon_advisor.data import (
16
  Project,
17
  build_index_payload,
18
  )
19
- from hackathon_advisor.llama_embedding import LlamaCppEmbedder
20
 
21
 
22
  def main() -> None:
@@ -28,7 +28,7 @@ def main() -> None:
28
  parser.add_argument("--model-repo", default=DEFAULT_EMBEDDING_MODEL_REPO)
29
  parser.add_argument("--model-file", default=DEFAULT_EMBEDDING_MODEL_FILE)
30
  parser.add_argument("--model-path", default="")
31
- parser.add_argument("--n-ctx", type=int, default=512)
32
  parser.add_argument("--n-threads", type=int, default=0)
33
  args = parser.parse_args()
34
 
@@ -58,7 +58,7 @@ def build_payload(
58
  model_repo: str,
59
  model_file: str,
60
  model_path: str = "",
61
- n_ctx: int = 512,
62
  n_threads: int | None = None,
63
  build_source: str,
64
  builder: str,
 
16
  Project,
17
  build_index_payload,
18
  )
19
+ from hackathon_advisor.llama_embedding import DEFAULT_N_CTX, LlamaCppEmbedder
20
 
21
 
22
  def main() -> None:
 
28
  parser.add_argument("--model-repo", default=DEFAULT_EMBEDDING_MODEL_REPO)
29
  parser.add_argument("--model-file", default=DEFAULT_EMBEDDING_MODEL_FILE)
30
  parser.add_argument("--model-path", default="")
31
+ parser.add_argument("--n-ctx", type=int, default=DEFAULT_N_CTX)
32
  parser.add_argument("--n-threads", type=int, default=0)
33
  args = parser.parse_args()
34
 
 
58
  model_repo: str,
59
  model_file: str,
60
  model_path: str = "",
61
+ n_ctx: int = DEFAULT_N_CTX,
62
  n_threads: int | None = None,
63
  build_source: str,
64
  builder: str,
scripts/modal_build_project_index.py CHANGED
@@ -54,8 +54,8 @@ def build_project_index_remote(
54
  def main(
55
  projects: str = "data/projects.json",
56
  out: str = "data/project_index.json",
57
- model_repo: str = "ggml-org/embeddinggemma-300M-qat-q4_0-GGUF",
58
- model_file: str = "embeddinggemma-300M-qat-Q4_0.gguf",
59
  ) -> None:
60
  project_snapshot = json.loads(Path(projects).read_text(encoding="utf-8"))
61
  payload = build_project_index_remote.remote(project_snapshot, model_repo, model_file)
@@ -73,8 +73,8 @@ if __name__ == "__main__":
73
  parser = argparse.ArgumentParser(description="Build the llama.cpp embedding index on Modal.")
74
  parser.add_argument("--projects", default="data/projects.json")
75
  parser.add_argument("--out", default="data/project_index.json")
76
- parser.add_argument("--model-repo", default="ggml-org/embeddinggemma-300M-qat-q4_0-GGUF")
77
- parser.add_argument("--model-file", default="embeddinggemma-300M-qat-Q4_0.gguf")
78
  args = parser.parse_args()
79
  with app.run():
80
  payload = build_project_index_remote.remote(
 
54
  def main(
55
  projects: str = "data/projects.json",
56
  out: str = "data/project_index.json",
57
+ model_repo: str = "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF",
58
+ model_file: str = "embeddinggemma-300m-qat-Q8_0.gguf",
59
  ) -> None:
60
  project_snapshot = json.loads(Path(projects).read_text(encoding="utf-8"))
61
  payload = build_project_index_remote.remote(project_snapshot, model_repo, model_file)
 
73
  parser = argparse.ArgumentParser(description="Build the llama.cpp embedding index on Modal.")
74
  parser.add_argument("--projects", default="data/projects.json")
75
  parser.add_argument("--out", default="data/project_index.json")
76
+ parser.add_argument("--model-repo", default="ggml-org/embeddinggemma-300m-qat-q8_0-GGUF")
77
+ parser.add_argument("--model-file", default="embeddinggemma-300m-qat-Q8_0.gguf")
78
  args = parser.parse_args()
79
  with app.run():
80
  payload = build_project_index_remote.remote(
tests/test_llama_embedding.py ADDED
@@ -0,0 +1,62 @@
1
+ from pathlib import Path
2
+ import sys
3
+ from types import ModuleType
4
+
5
+ from hackathon_advisor.data import DEFAULT_EMBEDDING_MODEL_FILE, DEFAULT_EMBEDDING_MODEL_REPO
6
+ from hackathon_advisor.llama_embedding import DEFAULT_N_CTX, LlamaCppEmbedder, create_llama_cpp_embedder
7
+
8
+
9
+ def test_llama_embedder_uses_q8_defaults_and_full_context(
10
+ monkeypatch,
11
+ tmp_path: Path,
12
+ ) -> None:
13
+ model_path = tmp_path / "embedding.gguf"
14
+ model_path.write_bytes(b"gguf")
15
+ captured: dict = {}
16
+
17
+ hub = ModuleType("huggingface_hub")
18
+
19
+ def fake_hf_hub_download(repo_id: str, filename: str, repo_type: str) -> str:
20
+ captured["download"] = {
21
+ "repo_id": repo_id,
22
+ "filename": filename,
23
+ "repo_type": repo_type,
24
+ }
25
+ return str(model_path)
26
+
27
+ hub.hf_hub_download = fake_hf_hub_download
28
+ llama_cpp = ModuleType("llama_cpp")
29
+ llama_cpp.LLAMA_POOLING_TYPE_MEAN = 1
30
+
31
+ class FakeLlama:
32
+ def __init__(self, **kwargs) -> None:
33
+ captured["llama_kwargs"] = kwargs
34
+
35
+ def embed(self, text: str, normalize: bool) -> list[float]:
36
+ captured["embed"] = {"text": text, "normalize": normalize}
37
+ return [1.0, 0.0]
38
+
39
+ llama_cpp.Llama = FakeLlama
40
+ monkeypatch.setitem(sys.modules, "huggingface_hub", hub)
41
+ monkeypatch.setitem(sys.modules, "llama_cpp", llama_cpp)
42
+
43
+ vector = LlamaCppEmbedder().embed("private archive")
44
+
45
+ assert vector == [1.0, 0.0]
46
+ assert captured["download"] == {
47
+ "repo_id": DEFAULT_EMBEDDING_MODEL_REPO,
48
+ "filename": DEFAULT_EMBEDDING_MODEL_FILE,
49
+ "repo_type": "model",
50
+ }
51
+ assert captured["llama_kwargs"]["n_ctx"] == DEFAULT_N_CTX
52
+ assert captured["llama_kwargs"]["n_batch"] == DEFAULT_N_CTX
53
+ assert captured["llama_kwargs"]["n_ubatch"] == DEFAULT_N_CTX
54
+ assert captured["embed"] == {"text": "private archive", "normalize": True}
55
+
56
+
57
+ def test_create_llama_embedder_accepts_explicit_batch(monkeypatch) -> None:
58
+ monkeypatch.setenv("ADVISOR_EMBEDDING_BATCH", "256")
59
+
60
+ embedder = create_llama_cpp_embedder({"dimensions": 768})
61
+
62
+ assert embedder.n_batch == 256
tests/test_model_runtime.py CHANGED
@@ -8,6 +8,7 @@ from hackathon_advisor.model_runtime import (
8
  render_context,
9
  runtime_status,
10
  system_prompt,
 
11
  _normalize_xml_tool_output,
12
  _strip_unused_generation_inputs,
13
  )
@@ -194,6 +195,22 @@ def test_generation_inputs_drop_token_type_ids() -> None:
194
  assert inputs == {"input_ids": [1], "attention_mask": [1]}
195
 
196

197
  def test_model_xml_fragment_is_normalized() -> None:
198
  output = 'name="save_idea">{"title":"A","pitch":"B"}'
199
 
 
8
  render_context,
9
  runtime_status,
10
  system_prompt,
11
+ _disable_sampling_generation_defaults,
12
  _normalize_xml_tool_output,
13
  _strip_unused_generation_inputs,
14
  )
 
195
  assert inputs == {"input_ids": [1], "attention_mask": [1]}
196
 
197
 
198
+ def test_generation_config_drops_sampling_defaults() -> None:
199
+ class GenerationConfig:
200
+ do_sample = True
201
+ temperature = 0.7
202
+ top_p = 0.95
203
+
204
+ class Model:
205
+ generation_config = GenerationConfig()
206
+
207
+ _disable_sampling_generation_defaults(Model())
208
+
209
+ assert Model.generation_config.do_sample is False
210
+ assert Model.generation_config.temperature is None
211
+ assert Model.generation_config.top_p is None
212
+
213
+
214
  def test_model_xml_fragment_is_normalized() -> None:
215
  output = 'name="save_idea">{"title":"A","pitch":"B"}'
216
 
tests/test_runtime_hooks.py ADDED
@@ -0,0 +1,35 @@
1
+ from types import SimpleNamespace
2
+
3
+ from hackathon_advisor.runtime_hooks import _is_asyncio_invalid_fd_cleanup
4
+
5
+
6
+ def test_asyncio_invalid_fd_cleanup_hook_matches_only_event_loop_destructor() -> None:
7
+ def event_loop_del() -> None:
8
+ pass
9
+
10
+ event_loop_del.__qualname__ = "BaseEventLoop.__del__"
11
+
12
+ def other_function() -> None:
13
+ pass
14
+
15
+ assert _is_asyncio_invalid_fd_cleanup(
16
+ SimpleNamespace(
17
+ exc_type=ValueError,
18
+ exc_value=ValueError("Invalid file descriptor: -1"),
19
+ object=event_loop_del,
20
+ )
21
+ )
22
+ assert not _is_asyncio_invalid_fd_cleanup(
23
+ SimpleNamespace(
24
+ exc_type=ValueError,
25
+ exc_value=ValueError("Invalid file descriptor: -1"),
26
+ object=other_function,
27
+ )
28
+ )
29
+ assert not _is_asyncio_invalid_fd_cleanup(
30
+ SimpleNamespace(
31
+ exc_type=RuntimeError,
32
+ exc_value=RuntimeError("Invalid file descriptor: -1"),
33
+ object=event_loop_del,
34
+ )
35
+ )