Space: build-small-hackathon/hackathon-advisor


JacobLinCool, Codex committed on Jun 7

Commit ca766b5 (verified) · 1 parent: b03e3b9

fix: stabilize llama embedding runtime


Co-authored-by: Codex <noreply@openai.com>

Files changed (14)

DESIGN.md +6 -6
README.md +3 -3
app.py +3 -0
data/project_index.json +0 -0
hackathon_advisor/data.py +2 -2
hackathon_advisor/llama_embedding.py +6 -1
hackathon_advisor/model_runtime.py +11 -1
hackathon_advisor/prize_ledger.py +1 -1
hackathon_advisor/runtime_hooks.py +31 -0
scripts/build_project_index.py +3 -3
scripts/modal_build_project_index.py +4 -4
tests/test_llama_embedding.py +62 -0
tests/test_model_runtime.py +17 -0
tests/test_runtime_hooks.py +35 -0

DESIGN.md CHANGED

@@ -126,7 +126,7 @@ investigate → ideate → score loop — the experience collapses without the m
 |---|---|---|---|---|---|
 | STT (batch voice input) | **`nvidia/nemotron-speech-streaming-en-0.6b`** | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | 🟩 NVIDIA Nemotron Quest |
 | LLM brain | **`openbmb/MiniCPM5-1B`** ("OpenCPM5") | 1.08B | **transformers** (self-parse XML) / llama.cpp | **Apache-2.0** | 🏮 OpenBMB |
-| Embedder | **`ggml-org/embeddinggemma-300M-qat-q4_0-GGUF`** | ~300M | llama.cpp / llama-cpp-python | Gemma | 🔌 Off the Grid · 🦙 Llama Champion · 🟢 Modal |
 | Fine-tune | LoRA on MiniCPM5 → published to Hub | — | PEFT / HF Jobs | — | 🎯 Well-Tuned |
 **Total ≈ 1.98B params → ≤4B → 🐜 Tiny Titan eligible.** All open-weight, all runnable locally → 🔌 Off the Grid.
@@ -191,9 +191,9 @@ With **text-first + batch ASR**, the old "streaming ASR vs ZeroGPU" Config A/B t
   tool-calling is a pending PR — verify before relying on it for the badge runtime.)
 - **1B discipline:** small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
-### 5.4 EmbeddingGemma GGUF — `ggml-org/embeddinggemma-300M-qat-q4_0-GGUF`
-- Active retrieval model: `embeddinggemma-300M-qat-Q4_0.gguf`, 768-dimensional normalized embeddings.
 - Build-time path: Modal remote function runs `llama-cpp-python` with mean pooling and writes `data/project_index.json`.
 - Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search
   over checked-in project vectors.
@@ -208,10 +208,10 @@ llama.cpp on Modal, and runtime query embeddings use the same llama.cpp path.
 | Model | llama.cpp? | Runtime | Notes |
 |---|---|---|---|
 | `openbmb/MiniCPM5-1B` | ✅ planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
-| `ggml-org/embeddinggemma-300M-qat-q4_0-GGUF` | ✅ active | llama.cpp / llama-cpp-python | Builds project vectors on Modal and embeds runtime queries in the Space. |
 | ASR (Nemotron) | ❌ | NeMo | FastConformer-RNNT |
-If retrieval quality becomes the bottleneck, compare Q4_0 against Q8_0, but do not keep two runtime retrieval paths.
 ---
@@ -310,7 +310,7 @@ canonical command is:
 ```
 The remote function installs `llama-cpp-python`, downloads
-`ggml-org/embeddinggemma-300M-qat-q4_0-GGUF/embeddinggemma-300M-qat-Q4_0.gguf`, embeds every project card through
 llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.
 Latest successful run: `hackathon-advisor-llama-index` on Modal, producing a 100-document, 768-dimensional normalized

 |---|---|---|---|---|---|
 | STT (batch voice input) | **`nvidia/nemotron-speech-streaming-en-0.6b`** | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | 🟩 NVIDIA Nemotron Quest |
 | LLM brain | **`openbmb/MiniCPM5-1B`** ("OpenCPM5") | 1.08B | **transformers** (self-parse XML) / llama.cpp | **Apache-2.0** | 🏮 OpenBMB |
+| Embedder | **`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`** | ~300M | llama.cpp / llama-cpp-python | Gemma | 🔌 Off the Grid · 🦙 Llama Champion · 🟢 Modal |
 | Fine-tune | LoRA on MiniCPM5 → published to Hub | — | PEFT / HF Jobs | — | 🎯 Well-Tuned |
 **Total ≈ 1.98B params → ≤4B → 🐜 Tiny Titan eligible.** All open-weight, all runnable locally → 🔌 Off the Grid.
   tool-calling is a pending PR — verify before relying on it for the badge runtime.)
 - **1B discipline:** small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
+### 5.4 EmbeddingGemma GGUF — `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`
+- Active retrieval model: `embeddinggemma-300m-qat-Q8_0.gguf`, 768-dimensional normalized embeddings.
 - Build-time path: Modal remote function runs `llama-cpp-python` with mean pooling and writes `data/project_index.json`.
 - Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search
   over checked-in project vectors.
 | Model | llama.cpp? | Runtime | Notes |
 |---|---|---|---|
 | `openbmb/MiniCPM5-1B` | ✅ planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
+| `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | ✅ active | llama.cpp / llama-cpp-python | Builds project vectors on Modal and embeds runtime queries in the Space. |
 | ASR (Nemotron) | ❌ | NeMo | FastConformer-RNNT |
+The checked-in index and runtime query embedder must stay on the same GGUF file.
 ---
 ```
 The remote function installs `llama-cpp-python`, downloads
+`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/embeddinggemma-300m-qat-Q8_0.gguf`, embeds every project card through
 llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.
 Latest successful run: `hackathon-advisor-llama-index` on Modal, producing a 100-document, 768-dimensional normalized
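
The rule that the checked-in index and the runtime query embedder stay on the same GGUF file could be enforced with a startup check along the following lines. This is a minimal sketch, not the repository's validator: the `model_repo` and `model_file` metadata keys and the `assert_same_embedding_model` helper are assumptions made for illustration, and the real schema-v2 layout may differ.

```python
from typing import Any

DEFAULT_EMBEDDING_MODEL_REPO = "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF"
DEFAULT_EMBEDDING_MODEL_FILE = "embeddinggemma-300m-qat-Q8_0.gguf"


def assert_same_embedding_model(
    index_metadata: dict[str, Any],
    runtime_repo: str = DEFAULT_EMBEDDING_MODEL_REPO,
    runtime_file: str = DEFAULT_EMBEDDING_MODEL_FILE,
) -> None:
    """Raise if the index was built with a different GGUF file than the runtime loads.

    The "model_repo" and "model_file" keys are hypothetical; adapt them to the
    actual index metadata.
    """
    built_with = (index_metadata.get("model_repo"), index_metadata.get("model_file"))
    runtime = (runtime_repo, runtime_file)
    if built_with != runtime:
        raise ValueError(
            f"Index built with {built_with}, but runtime embeds with {runtime}; "
            "rebuild data/project_index.json before deploying."
        )
```

Failing fast here is a design choice: an index built with Q4_0 and queried with Q8_0 would still return results, only with quietly shifted similarity scores.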

README.md CHANGED

@@ -59,7 +59,7 @@ python scripts/generate_sample_trace.py --projects data/projects.json --index da
 The app uses `data/projects.json` and `data/project_index.json` at runtime. The index validates the snapshot timestamp,
 source, project order, digest, embedding dimensions, and normalized vector shape before the app starts. The canonical
-index is built on Modal with `ggml-org/embeddinggemma-300M-qat-q4_0-GGUF` through llama.cpp; runtime search embeds the
 user query with the same GGUF model and performs local cosine search over the checked-in vectors.
 ## Trace Artifact
@@ -186,8 +186,8 @@ ADVISOR_MODEL_BACKEND=minicpm-transformers
 ADVISOR_MODEL_ID=openbmb/MiniCPM5-1B
 ADVISOR_ADAPTER_ID=build-small-hackathon/hackathon-advisor-minicpm5-lora
 ADVISOR_ADAPTER_REVISION=25de69bcde397e1bcdd852923b56a42f10222650
-ADVISOR_EMBEDDING_MODEL_REPO=ggml-org/embeddinggemma-300M-qat-q4_0-GGUF
-ADVISOR_EMBEDDING_MODEL_FILE=embeddinggemma-300M-qat-Q4_0.gguf
 ADVISOR_ASR_MODEL_ID=nvidia/nemotron-speech-streaming-en-0.6b
 ```

 The app uses `data/projects.json` and `data/project_index.json` at runtime. The index validates the snapshot timestamp,
 source, project order, digest, embedding dimensions, and normalized vector shape before the app starts. The canonical
+index is built on Modal with `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` through llama.cpp; runtime search embeds the
 user query with the same GGUF model and performs local cosine search over the checked-in vectors.
 ## Trace Artifact
 ADVISOR_MODEL_ID=openbmb/MiniCPM5-1B
 ADVISOR_ADAPTER_ID=build-small-hackathon/hackathon-advisor-minicpm5-lora
 ADVISOR_ADAPTER_REVISION=25de69bcde397e1bcdd852923b56a42f10222650
+ADVISOR_EMBEDDING_MODEL_REPO=ggml-org/embeddinggemma-300m-qat-q8_0-GGUF
+ADVISOR_EMBEDDING_MODEL_FILE=embeddinggemma-300m-qat-Q8_0.gguf
 ADVISOR_ASR_MODEL_ID=nvidia/nemotron-speech-streaming-en-0.6b
 ```
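
The README describes runtime retrieval as a local cosine search over checked-in, normalized vectors. The sketch below illustrates that idea under stated assumptions: `normalize` and `cosine_search` are hypothetical names, and toy 2-dimensional vectors stand in for the real 768-dimensional embeddings. It is not the repository's search code.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_search(
    query: list[float],
    index: list[list[float]],
    top_k: int = 3,
) -> list[tuple[int, float]]:
    """Return (row, score) pairs for the top_k most similar index rows.

    Query and index rows are assumed to be unit length, so the dot product
    equals the cosine similarity.
    """
    scores = [
        (row, sum(q * d for q, d in zip(query, doc)))
        for row, doc in enumerate(index)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy vectors standing in for project embeddings and a query embedding.
index = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
query = normalize([0.9, 0.1])
results = cosine_search(query, index, top_k=2)
```

Because the vectors are normalized at build time, the runtime needs only a dot product per document, which keeps a 100-document search cheap on CPU.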

app.py CHANGED

@@ -21,6 +21,7 @@ from hackathon_advisor.lora_dataset import build_lora_dataset_jsonl
 from hackathon_advisor.lora_training_kit import TRAINING_KIT_FILENAME, build_lora_training_kit_zip
 from hackathon_advisor.png_export import artifact_png_filename, render_artifact_png
 from hackathon_advisor.prize_ledger import prize_ledger
 from hackathon_advisor.submission_packet import build_submission_packet_markdown
 from hackathon_advisor.tool_contracts import resolve_tool_call, tool_schemas
 from hackathon_advisor.tools import GOALS, goal_profiles
@@ -28,6 +29,8 @@ from hackathon_advisor.trace_export import build_trace_jsonl, trace_metadata
 from hackathon_advisor.zerogpu import gpu_task
 ROOT = Path(__file__).parent
 STATIC_DIR = ROOT / "static"
 DATA_PATH = ROOT / "data" / "projects.json"

 from hackathon_advisor.lora_training_kit import TRAINING_KIT_FILENAME, build_lora_training_kit_zip
 from hackathon_advisor.png_export import artifact_png_filename, render_artifact_png
 from hackathon_advisor.prize_ledger import prize_ledger
+from hackathon_advisor.runtime_hooks import install_asyncio_cleanup_hook
 from hackathon_advisor.submission_packet import build_submission_packet_markdown
 from hackathon_advisor.tool_contracts import resolve_tool_call, tool_schemas
 from hackathon_advisor.tools import GOALS, goal_profiles
 from hackathon_advisor.zerogpu import gpu_task
+install_asyncio_cleanup_hook()
 ROOT = Path(__file__).parent
 STATIC_DIR = ROOT / "static"
 DATA_PATH = ROOT / "data" / "projects.json"

data/project_index.json CHANGED

The diff for this file is too large to render. See raw diff

hackathon_advisor/data.py CHANGED

@@ -25,8 +25,8 @@ GENERIC_PUBLIC_SUMMARY_RE = re.compile(
 INDEX_SCHEMA_VERSION = 2
 INDEX_ALGORITHM = "llama-cpp-embedding-v1"
-DEFAULT_EMBEDDING_MODEL_REPO = "ggml-org/embeddinggemma-300M-qat-q4_0-GGUF"
-DEFAULT_EMBEDDING_MODEL_FILE = "embeddinggemma-300M-qat-Q4_0.gguf"
 DEFAULT_EMBEDDING_RUNTIME = "llama.cpp via llama-cpp-python"

 INDEX_SCHEMA_VERSION = 2
 INDEX_ALGORITHM = "llama-cpp-embedding-v1"
+DEFAULT_EMBEDDING_MODEL_REPO = "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF"
+DEFAULT_EMBEDDING_MODEL_FILE = "embeddinggemma-300m-qat-Q8_0.gguf"
 DEFAULT_EMBEDDING_RUNTIME = "llama.cpp via llama-cpp-python"

hackathon_advisor/llama_embedding.py CHANGED

@@ -12,7 +12,7 @@ from hackathon_advisor.data import (
 TRUE_VALUES = {"1", "true", "yes", "on"}
-DEFAULT_N_CTX = 512
 class LlamaCppEmbedder:
@@ -23,6 +23,7 @@ class LlamaCppEmbedder:
         model_file: str = DEFAULT_EMBEDDING_MODEL_FILE,
         model_path: str = "",
         n_ctx: int = DEFAULT_N_CTX,
         n_threads: int | None = None,
         n_gpu_layers: int = 0,
         verbose: bool = False,
@@ -31,6 +32,7 @@ class LlamaCppEmbedder:
         self.model_file = model_file.strip() or DEFAULT_EMBEDDING_MODEL_FILE
         self.model_path = model_path.strip()
         self.n_ctx = n_ctx
         self.n_threads = n_threads
         self.n_gpu_layers = n_gpu_layers
         self.verbose = verbose
@@ -63,6 +65,8 @@ class LlamaCppEmbedder:
             embedding=True,
             pooling_type=LLAMA_POOLING_TYPE_MEAN,
             n_ctx=self.n_ctx,
             n_threads=self.n_threads,
             n_gpu_layers=self.n_gpu_layers,
             verbose=self.verbose,
@@ -82,6 +86,7 @@ def create_llama_cpp_embedder(metadata: dict[str, Any]) -> LlamaCppEmbedder:
         ),
         model_path=os.environ.get("ADVISOR_EMBEDDING_MODEL_PATH", ""),
         n_ctx=_int_env("ADVISOR_EMBEDDING_N_CTX", DEFAULT_N_CTX),
         n_threads=_optional_int_env("ADVISOR_EMBEDDING_THREADS"),
         n_gpu_layers=_int_env("ADVISOR_EMBEDDING_GPU_LAYERS", 0),
         verbose=os.environ.get("ADVISOR_EMBEDDING_VERBOSE", "").strip().lower() in TRUE_VALUES,

 TRUE_VALUES = {"1", "true", "yes", "on"}
+DEFAULT_N_CTX = 2048
 class LlamaCppEmbedder:
         model_file: str = DEFAULT_EMBEDDING_MODEL_FILE,
         model_path: str = "",
         n_ctx: int = DEFAULT_N_CTX,
+        n_batch: int | None = None,
         n_threads: int | None = None,
         n_gpu_layers: int = 0,
         verbose: bool = False,
         self.model_file = model_file.strip() or DEFAULT_EMBEDDING_MODEL_FILE
         self.model_path = model_path.strip()
         self.n_ctx = n_ctx
+        self.n_batch = n_batch or n_ctx
         self.n_threads = n_threads
         self.n_gpu_layers = n_gpu_layers
         self.verbose = verbose
             embedding=True,
             pooling_type=LLAMA_POOLING_TYPE_MEAN,
             n_ctx=self.n_ctx,
+            n_batch=self.n_batch,
+            n_ubatch=self.n_batch,
             n_threads=self.n_threads,
             n_gpu_layers=self.n_gpu_layers,
             verbose=self.verbose,
         ),
         model_path=os.environ.get("ADVISOR_EMBEDDING_MODEL_PATH", ""),
         n_ctx=_int_env("ADVISOR_EMBEDDING_N_CTX", DEFAULT_N_CTX),
+        n_batch=_optional_int_env("ADVISOR_EMBEDDING_BATCH"),
         n_threads=_optional_int_env("ADVISOR_EMBEDDING_THREADS"),
         n_gpu_layers=_int_env("ADVISOR_EMBEDDING_GPU_LAYERS", 0),
         verbose=os.environ.get("ADVISOR_EMBEDDING_VERBOSE", "").strip().lower() in TRUE_VALUES,
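
The change above makes `n_batch` and `n_ubatch` follow `n_ctx` unless `ADVISOR_EMBEDDING_BATCH` is set. The commit does not state the reason; a plausible one, offered here as an assumption, is that pooled embedding runs need the whole input to fit in a single batch, so a batch smaller than the context would fail on long project cards. The sketch below mirrors only the fallback logic. `optional_int_env` is a hypothetical stand-in for the repository's `_optional_int_env`, whose body is not shown in the diff.

```python
import os
from typing import Optional

DEFAULT_N_CTX = 2048


def optional_int_env(name: str) -> Optional[int]:
    """Hypothetical stand-in for `_optional_int_env`: unset or blank means None."""
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else None


def resolve_batch_sizes(
    n_ctx: int = DEFAULT_N_CTX,
    n_batch: Optional[int] = None,
) -> dict[str, int]:
    """Mirror the diff: an unset (or zero) n_batch falls back to n_ctx."""
    batch = n_batch or n_ctx
    return {"n_ctx": n_ctx, "n_batch": batch, "n_ubatch": batch}
```

Note that `n_batch or n_ctx` treats `0` the same as `None`, so an explicit zero cannot be used to request a zero-sized batch.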

hackathon_advisor/model_runtime.py CHANGED

@@ -128,13 +128,14 @@ class MiniCPMTransformersPlanner:
         )
         model = AutoModelForCausalLM.from_pretrained(
             base_model_id,
-            torch_dtype="auto",
             device_map="auto",
             trust_remote_code=True,
         )
         if self.adapter_id:
             model = PeftModel.from_pretrained(model, self.adapter_id, **adapter_kwargs)
         model.eval()
         self._model = model
         if hasattr(torch, "inference_mode"):
             self._inference_mode = torch.inference_mode
@@ -228,6 +229,15 @@ def _strip_unused_generation_inputs(inputs: dict[str, Any]) -> None:
     inputs.pop("token_type_ids", None)
 def _normalize_xml_tool_output(output: str) -> str:
     stripped = output.strip()
     if stripped.startswith('name="'):

         )
         model = AutoModelForCausalLM.from_pretrained(
             base_model_id,
+            dtype="auto",
             device_map="auto",
             trust_remote_code=True,
         )
         if self.adapter_id:
             model = PeftModel.from_pretrained(model, self.adapter_id, **adapter_kwargs)
         model.eval()
+        _disable_sampling_generation_defaults(model)
         self._model = model
         if hasattr(torch, "inference_mode"):
             self._inference_mode = torch.inference_mode
     inputs.pop("token_type_ids", None)
+def _disable_sampling_generation_defaults(model: Any) -> None:
+    generation_config = getattr(model, "generation_config", None)
+    if generation_config is None:
+        return
+    generation_config.do_sample = False
+    generation_config.temperature = None
+    generation_config.top_p = None
 def _normalize_xml_tool_output(output: str) -> str:
     stripped = output.strip()
     if stripped.startswith('name="'):

hackathon_advisor/prize_ledger.py CHANGED

@@ -14,7 +14,7 @@ MODEL_STACK = [
     },
     {
         "role": "Embedding retriever",
-        "model": "ggml-org/embeddinggemma-300M-qat-q4_0-GGUF",
         "params_b": 0.30,
         "status": "deployed",
         "runtime": "Modal-built llama.cpp GGUF index + runtime llama.cpp query embeddings",

     },
     {
         "role": "Embedding retriever",
+        "model": "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF",
         "params_b": 0.30,
         "status": "deployed",
         "runtime": "Modal-built llama.cpp GGUF index + runtime llama.cpp query embeddings",

hackathon_advisor/runtime_hooks.py ADDED

	@@ -0,0 +1,31 @@

+from __future__ import annotations
+import sys
+from typing import Any
+_HOOK_INSTALLED = False
+def install_asyncio_cleanup_hook() -> None:
+    global _HOOK_INSTALLED
+    if _HOOK_INSTALLED:
+        return
+    previous_hook = sys.unraisablehook
+    def hook(args: Any) -> None:
+        if _is_asyncio_invalid_fd_cleanup(args):
+            return
+        previous_hook(args)
+    sys.unraisablehook = hook
+    _HOOK_INSTALLED = True
+def _is_asyncio_invalid_fd_cleanup(args: Any) -> bool:
+    if getattr(args, "exc_type", None) is not ValueError:
+        return False
+    if str(getattr(args, "exc_value", "")) != "Invalid file descriptor: -1":
+        return False
+    owner = getattr(args, "object", None)
+    return getattr(owner, "__qualname__", "") == "BaseEventLoop.__del__"

scripts/build_project_index.py CHANGED

@@ -16,7 +16,7 @@ from hackathon_advisor.data import (
     Project,
     build_index_payload,
 )
-from hackathon_advisor.llama_embedding import LlamaCppEmbedder
 def main() -> None:
@@ -28,7 +28,7 @@ def main() -> None:
     parser.add_argument("--model-repo", default=DEFAULT_EMBEDDING_MODEL_REPO)
     parser.add_argument("--model-file", default=DEFAULT_EMBEDDING_MODEL_FILE)
     parser.add_argument("--model-path", default="")
-    parser.add_argument("--n-ctx", type=int, default=512)
     parser.add_argument("--n-threads", type=int, default=0)
     args = parser.parse_args()
@@ -58,7 +58,7 @@ def build_payload(
     model_repo: str,
     model_file: str,
     model_path: str = "",
-    n_ctx: int = 512,
     n_threads: int | None = None,
     build_source: str,
     builder: str,

     Project,
     build_index_payload,
 )
+from hackathon_advisor.llama_embedding import DEFAULT_N_CTX, LlamaCppEmbedder
 def main() -> None:
     parser.add_argument("--model-repo", default=DEFAULT_EMBEDDING_MODEL_REPO)
     parser.add_argument("--model-file", default=DEFAULT_EMBEDDING_MODEL_FILE)
     parser.add_argument("--model-path", default="")
+    parser.add_argument("--n-ctx", type=int, default=DEFAULT_N_CTX)
     parser.add_argument("--n-threads", type=int, default=0)
     args = parser.parse_args()
     model_repo: str,
     model_file: str,
     model_path: str = "",
+    n_ctx: int = DEFAULT_N_CTX,
     n_threads: int | None = None,
     build_source: str,
     builder: str,

scripts/modal_build_project_index.py CHANGED

@@ -54,8 +54,8 @@ def build_project_index_remote(
 def main(
     projects: str = "data/projects.json",
     out: str = "data/project_index.json",
-    model_repo: str = "ggml-org/embeddinggemma-300M-qat-q4_0-GGUF",
-    model_file: str = "embeddinggemma-300M-qat-Q4_0.gguf",
 ) -> None:
     project_snapshot = json.loads(Path(projects).read_text(encoding="utf-8"))
     payload = build_project_index_remote.remote(project_snapshot, model_repo, model_file)
@@ -73,8 +73,8 @@ if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Build the llama.cpp embedding index on Modal.")
     parser.add_argument("--projects", default="data/projects.json")
     parser.add_argument("--out", default="data/project_index.json")
-    parser.add_argument("--model-repo", default="ggml-org/embeddinggemma-300M-qat-q4_0-GGUF")
-    parser.add_argument("--model-file", default="embeddinggemma-300M-qat-Q4_0.gguf")
     args = parser.parse_args()
     with app.run():
         payload = build_project_index_remote.remote(

 def main(
     projects: str = "data/projects.json",
     out: str = "data/project_index.json",
+    model_repo: str = "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF",
+    model_file: str = "embeddinggemma-300m-qat-Q8_0.gguf",
 ) -> None:
     project_snapshot = json.loads(Path(projects).read_text(encoding="utf-8"))
     payload = build_project_index_remote.remote(project_snapshot, model_repo, model_file)
     parser = argparse.ArgumentParser(description="Build the llama.cpp embedding index on Modal.")
     parser.add_argument("--projects", default="data/projects.json")
     parser.add_argument("--out", default="data/project_index.json")
+    parser.add_argument("--model-repo", default="ggml-org/embeddinggemma-300m-qat-q8_0-GGUF")
+    parser.add_argument("--model-file", default="embeddinggemma-300m-qat-Q8_0.gguf")
     args = parser.parse_args()
     with app.run():
         payload = build_project_index_remote.remote(

tests/test_llama_embedding.py ADDED

	@@ -0,0 +1,62 @@

+from pathlib import Path
+import sys
+from types import ModuleType
+from hackathon_advisor.data import DEFAULT_EMBEDDING_MODEL_FILE, DEFAULT_EMBEDDING_MODEL_REPO
+from hackathon_advisor.llama_embedding import DEFAULT_N_CTX, LlamaCppEmbedder, create_llama_cpp_embedder
+def test_llama_embedder_uses_q8_defaults_and_full_context(
+    monkeypatch,
+    tmp_path: Path,
+) -> None:
+    model_path = tmp_path / "embedding.gguf"
+    model_path.write_bytes(b"gguf")
+    captured: dict = {}
+    hub = ModuleType("huggingface_hub")
+    def fake_hf_hub_download(repo_id: str, filename: str, repo_type: str) -> str:
+        captured["download"] = {
+            "repo_id": repo_id,
+            "filename": filename,
+            "repo_type": repo_type,
+        }
+        return str(model_path)
+    hub.hf_hub_download = fake_hf_hub_download
+    llama_cpp = ModuleType("llama_cpp")
+    llama_cpp.LLAMA_POOLING_TYPE_MEAN = 1
+    class FakeLlama:
+        def __init__(self, **kwargs) -> None:
+            captured["llama_kwargs"] = kwargs
+        def embed(self, text: str, normalize: bool) -> list[float]:
+            captured["embed"] = {"text": text, "normalize": normalize}
+            return [1.0, 0.0]
+    llama_cpp.Llama = FakeLlama
+    monkeypatch.setitem(sys.modules, "huggingface_hub", hub)
+    monkeypatch.setitem(sys.modules, "llama_cpp", llama_cpp)
+    vector = LlamaCppEmbedder().embed("private archive")
+    assert vector == [1.0, 0.0]
+    assert captured["download"] == {
+        "repo_id": DEFAULT_EMBEDDING_MODEL_REPO,
+        "filename": DEFAULT_EMBEDDING_MODEL_FILE,
+        "repo_type": "model",
+    }
+    assert captured["llama_kwargs"]["n_ctx"] == DEFAULT_N_CTX
+    assert captured["llama_kwargs"]["n_batch"] == DEFAULT_N_CTX
+    assert captured["llama_kwargs"]["n_ubatch"] == DEFAULT_N_CTX
+    assert captured["embed"] == {"text": "private archive", "normalize": True}
+def test_create_llama_embedder_accepts_explicit_batch(monkeypatch) -> None:
+    monkeypatch.setenv("ADVISOR_EMBEDDING_BATCH", "256")
+    embedder = create_llama_cpp_embedder({"dimensions": 768})
+    assert embedder.n_batch == 256

tests/test_model_runtime.py CHANGED

@@ -8,6 +8,7 @@ from hackathon_advisor.model_runtime import (
     render_context,
     runtime_status,
     system_prompt,
     _normalize_xml_tool_output,
     _strip_unused_generation_inputs,
 )
@@ -194,6 +195,22 @@ def test_generation_inputs_drop_token_type_ids() -> None:
     assert inputs == {"input_ids": [1], "attention_mask": [1]}
 def test_model_xml_fragment_is_normalized() -> None:
     output = 'name="save_idea">{"title":"A","pitch":"B"}'

     render_context,
     runtime_status,
     system_prompt,
+    _disable_sampling_generation_defaults,
     _normalize_xml_tool_output,
     _strip_unused_generation_inputs,
 )
     assert inputs == {"input_ids": [1], "attention_mask": [1]}
+def test_generation_config_drops_sampling_defaults() -> None:
+    class GenerationConfig:
+        do_sample = True
+        temperature = 0.7
+        top_p = 0.95
+    class Model:
+        generation_config = GenerationConfig()
+    _disable_sampling_generation_defaults(Model())
+    assert Model.generation_config.do_sample is False
+    assert Model.generation_config.temperature is None
+    assert Model.generation_config.top_p is None
 def test_model_xml_fragment_is_normalized() -> None:
     output = 'name="save_idea">{"title":"A","pitch":"B"}'

tests/test_runtime_hooks.py ADDED

	@@ -0,0 +1,35 @@

+from types import SimpleNamespace
+from hackathon_advisor.runtime_hooks import _is_asyncio_invalid_fd_cleanup
+def test_asyncio_invalid_fd_cleanup_hook_matches_only_event_loop_destructor() -> None:
+    def event_loop_del() -> None:
+        pass
+    event_loop_del.__qualname__ = "BaseEventLoop.__del__"
+    def other_function() -> None:
+        pass
+    assert _is_asyncio_invalid_fd_cleanup(
+        SimpleNamespace(
+            exc_type=ValueError,
+            exc_value=ValueError("Invalid file descriptor: -1"),
+            object=event_loop_del,
+        )
+    )
+    assert not _is_asyncio_invalid_fd_cleanup(
+        SimpleNamespace(
+            exc_type=ValueError,
+            exc_value=ValueError("Invalid file descriptor: -1"),
+            object=other_function,
+        )
+    )
+    assert not _is_asyncio_invalid_fd_cleanup(
+        SimpleNamespace(
+            exc_type=RuntimeError,
+            exc_value=RuntimeError("Invalid file descriptor: -1"),
+            object=event_loop_del,
+        )
+    )