Spaces:
Running on Zero
fix: stabilize llama embedding runtime
Co-authored-by: Codex <noreply@openai.com>
- DESIGN.md +6 -6
- README.md +3 -3
- app.py +3 -0
- data/project_index.json +0 -0
- hackathon_advisor/data.py +2 -2
- hackathon_advisor/llama_embedding.py +6 -1
- hackathon_advisor/model_runtime.py +11 -1
- hackathon_advisor/prize_ledger.py +1 -1
- hackathon_advisor/runtime_hooks.py +31 -0
- scripts/build_project_index.py +3 -3
- scripts/modal_build_project_index.py +4 -4
- tests/test_llama_embedding.py +62 -0
- tests/test_model_runtime.py +17 -0
- tests/test_runtime_hooks.py +35 -0
DESIGN.md
CHANGED
|
@@ -126,7 +126,7 @@ investigate → ideate → score loop — the experience collapses without the m
|
|
| 126 |
|---|---|---|---|---|---|
|
| 127 |
| STT (batch voice input) | **`nvidia/nemotron-speech-streaming-en-0.6b`** | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | 🟩 NVIDIA Nemotron Quest |
|
| 128 |
| LLM brain | **`openbmb/MiniCPM5-1B`** ("OpenCPM5") | 1.08B | **transformers** (self-parse XML) / llama.cpp | **Apache-2.0** | 🏮 OpenBMB |
|
| 129 |
-
| Embedder | **`ggml-org/embeddinggemma-
|
| 130 |
| Fine-tune | LoRA on MiniCPM5 → published to Hub | — | PEFT / HF Jobs | — | 🎯 Well-Tuned |
|
| 131 |
|
| 132 |
**Total ≈ 1.98B params → ≤4B → 🐜 Tiny Titan eligible.** All open-weight, all runnable locally → 🔌 Off the Grid.
|
|
@@ -191,9 +191,9 @@ With **text-first + batch ASR**, the old "streaming ASR vs ZeroGPU" Config A/B t
|
|
| 191 |
tool-calling is a pending PR — verify before relying on it for the badge runtime.)
|
| 192 |
- **1B discipline:** small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
|
| 193 |
|
| 194 |
-
### 5.4 EmbeddingGemma GGUF — `ggml-org/embeddinggemma-
|
| 195 |
|
| 196 |
-
- Active retrieval model: `embeddinggemma-
|
| 197 |
- Build-time path: Modal remote function runs `llama-cpp-python` with mean pooling and writes `data/project_index.json`.
|
| 198 |
- Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search
|
| 199 |
over checked-in project vectors.
|
|
@@ -208,10 +208,10 @@ llama.cpp on Modal, and runtime query embeddings use the same llama.cpp path.
|
|
| 208 |
| Model | llama.cpp? | Runtime | Notes |
|
| 209 |
|---|---|---|---|
|
| 210 |
| `openbmb/MiniCPM5-1B` | ✅ planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
|
| 211 |
-
| `ggml-org/embeddinggemma-
|
| 212 |
| ASR (Nemotron) | ❌ | NeMo | FastConformer-RNNT |
|
| 213 |
|
| 214 |
-
|
| 215 |
|
| 216 |
---
|
| 217 |
|
|
@@ -310,7 +310,7 @@ canonical command is:
|
|
| 310 |
```
|
| 311 |
|
| 312 |
The remote function installs `llama-cpp-python`, downloads
|
| 313 |
-
`ggml-org/embeddinggemma-
|
| 314 |
llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.
|
| 315 |
|
| 316 |
Latest successful run: `hackathon-advisor-llama-index` on Modal, producing a 100-document, 768-dimensional normalized
|
|
|
|
| 126 |
|---|---|---|---|---|---|
|
| 127 |
| STT (batch voice input) | **`nvidia/nemotron-speech-streaming-en-0.6b`** | 0.6B | NeMo, GPU+CUDA | NVIDIA Open Model (commercial OK) | 🟩 NVIDIA Nemotron Quest |
|
| 128 |
| LLM brain | **`openbmb/MiniCPM5-1B`** ("OpenCPM5") | 1.08B | **transformers** (self-parse XML) / llama.cpp | **Apache-2.0** | 🏮 OpenBMB |
|
| 129 |
+
| Embedder | **`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`** | ~300M | llama.cpp / llama-cpp-python | Gemma | 🔌 Off the Grid · 🦙 Llama Champion · 🟢 Modal |
|
| 130 |
| Fine-tune | LoRA on MiniCPM5 → published to Hub | — | PEFT / HF Jobs | — | 🎯 Well-Tuned |
|
| 131 |
|
| 132 |
**Total ≈ 1.98B params → ≤4B → 🐜 Tiny Titan eligible.** All open-weight, all runnable locally → 🔌 Off the Grid.
|
|
|
|
| 191 |
tool-calling is a pending PR — verify before relying on it for the badge runtime.)
|
| 192 |
- **1B discipline:** small tool schemas, few params each, clear descriptions, low temp, single-hop tool calls.
|
| 193 |
|
| 194 |
+
### 5.4 EmbeddingGemma GGUF — `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`
|
| 195 |
|
| 196 |
+
- Active retrieval model: `embeddinggemma-300m-qat-Q8_0.gguf`, 768-dimensional normalized embeddings.
|
| 197 |
- Build-time path: Modal remote function runs `llama-cpp-python` with mean pooling and writes `data/project_index.json`.
|
| 198 |
- Runtime path: Space embeds each user query through the same GGUF model via llama.cpp, then performs local cosine search
|
| 199 |
over checked-in project vectors.
|
|
|
|
| 208 |
| Model | llama.cpp? | Runtime | Notes |
|
| 209 |
|---|---|---|---|
|
| 210 |
| `openbmb/MiniCPM5-1B` | ✅ planned only | llama.cpp / Ollama | Not used for deployed tool-calling; Transformers + LoRA is the deployed brain. |
|
| 211 |
+
| `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` | ✅ active | llama.cpp / llama-cpp-python | Builds project vectors on Modal and embeds runtime queries in the Space. |
|
| 212 |
| ASR (Nemotron) | ❌ | NeMo | FastConformer-RNNT |
|
| 213 |
|
| 214 |
+
The checked-in index and runtime query embedder must stay on the same GGUF file.
|
| 215 |
|
| 216 |
---
|
| 217 |
|
|
|
|
| 310 |
```
|
| 311 |
|
| 312 |
The remote function installs `llama-cpp-python`, downloads
|
| 313 |
+
`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF/embeddinggemma-300m-qat-Q8_0.gguf`, embeds every project card through
|
| 314 |
llama.cpp, and returns a schema-v2 JSON index. The local entrypoint writes that payload into the repo for Space runtime.
|
| 315 |
|
| 316 |
Latest successful run: `hackathon-advisor-llama-index` on Modal, producing a 100-document, 768-dimensional normalized
|
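Review note on the "same GGUF file" rule in DESIGN.md: one way to enforce it is a start-up guard that compares the model recorded in the index with the model the runtime is about to load. The sketch below is illustrative only. The function name and the metadata keys `model_repo` and `model_file` are assumptions, not names confirmed by this diff.

```python
from typing import Any, Mapping


def check_embedding_model_matches(
    index_metadata: Mapping[str, Any],
    runtime_repo: str,
    runtime_file: str,
) -> None:
    """Raise ValueError if the index was built with a different GGUF file.

    The keys "model_repo" and "model_file" are hypothetical; adapt them to
    whatever the real index schema stores.
    """
    built_with = (index_metadata.get("model_repo"), index_metadata.get("model_file"))
    running = (runtime_repo, runtime_file)
    if built_with != running:
        raise ValueError(
            f"index built with {built_with!r}, but runtime embedder uses {running!r}"
        )
```

Failing loudly here is cheaper than debugging poor retrieval later: vectors from two different embedding models are not comparable, even when their dimensions happen to match.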
README.md
CHANGED
|
@@ -59,7 +59,7 @@ python scripts/generate_sample_trace.py --projects data/projects.json --index da
|
|
| 59 |
|
| 60 |
The app uses `data/projects.json` and `data/project_index.json` at runtime. The index validates the snapshot timestamp,
|
| 61 |
source, project order, digest, embedding dimensions, and normalized vector shape before the app starts. The canonical
|
| 62 |
-
index is built on Modal with `ggml-org/embeddinggemma-
|
| 63 |
user query with the same GGUF model and performs local cosine search over the checked-in vectors.
|
| 64 |
|
| 65 |
## Trace Artifact
|
|
@@ -186,8 +186,8 @@ ADVISOR_MODEL_BACKEND=minicpm-transformers
|
|
| 186 |
ADVISOR_MODEL_ID=openbmb/MiniCPM5-1B
|
| 187 |
ADVISOR_ADAPTER_ID=build-small-hackathon/hackathon-advisor-minicpm5-lora
|
| 188 |
ADVISOR_ADAPTER_REVISION=25de69bcde397e1bcdd852923b56a42f10222650
|
| 189 |
-
ADVISOR_EMBEDDING_MODEL_REPO=ggml-org/embeddinggemma-
|
| 190 |
-
ADVISOR_EMBEDDING_MODEL_FILE=embeddinggemma-
|
| 191 |
ADVISOR_ASR_MODEL_ID=nvidia/nemotron-speech-streaming-en-0.6b
|
| 192 |
```
|
| 193 |
|
|
|
|
| 59 |
|
| 60 |
The app uses `data/projects.json` and `data/project_index.json` at runtime. The index validates the snapshot timestamp,
|
| 61 |
source, project order, digest, embedding dimensions, and normalized vector shape before the app starts. The canonical
|
| 62 |
+
index is built on Modal with `ggml-org/embeddinggemma-300m-qat-q8_0-GGUF` through llama.cpp; runtime search embeds the
|
| 63 |
user query with the same GGUF model and performs local cosine search over the checked-in vectors.
|
| 64 |
|
| 65 |
## Trace Artifact
|
|
|
|
| 186 |
ADVISOR_MODEL_ID=openbmb/MiniCPM5-1B
|
| 187 |
ADVISOR_ADAPTER_ID=build-small-hackathon/hackathon-advisor-minicpm5-lora
|
| 188 |
ADVISOR_ADAPTER_REVISION=25de69bcde397e1bcdd852923b56a42f10222650
|
| 189 |
+
ADVISOR_EMBEDDING_MODEL_REPO=ggml-org/embeddinggemma-300m-qat-q8_0-GGUF
|
| 190 |
+
ADVISOR_EMBEDDING_MODEL_FILE=embeddinggemma-300m-qat-Q8_0.gguf
|
| 191 |
ADVISOR_ASR_MODEL_ID=nvidia/nemotron-speech-streaming-en-0.6b
|
| 192 |
```
|
| 193 |
|
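Review note on the README wording "local cosine search over the checked-in vectors": because the index vectors are stored normalized, cosine similarity reduces to a dot product once the query is normalized as well. A minimal pure-Python sketch follows. The function names and the list-of-lists index layout are assumptions for illustration, not the project's actual search code.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_top_k(
    query: Sequence[float],
    index_vectors: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (position, score) pairs for the k best matches.

    Assumes every vector in index_vectors is already unit length, so the
    dot product with the normalized query equals the cosine similarity.
    """
    if k <= 0:
        return []
    q = normalize(query)
    scored = []
    for position, vector in enumerate(index_vectors):
        if len(vector) != len(q):
            raise ValueError(f"dimension mismatch at position {position}")
        scored.append((position, sum(a * b for a, b in zip(q, vector))))
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

For an index of about 100 documents, as mentioned in DESIGN.md, a linear scan like this is likely fast enough without a vector database.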
app.py
CHANGED
|
@@ -21,6 +21,7 @@ from hackathon_advisor.lora_dataset import build_lora_dataset_jsonl
|
|
| 21 |
from hackathon_advisor.lora_training_kit import TRAINING_KIT_FILENAME, build_lora_training_kit_zip
|
| 22 |
from hackathon_advisor.png_export import artifact_png_filename, render_artifact_png
|
| 23 |
from hackathon_advisor.prize_ledger import prize_ledger
|
|
|
|
| 24 |
from hackathon_advisor.submission_packet import build_submission_packet_markdown
|
| 25 |
from hackathon_advisor.tool_contracts import resolve_tool_call, tool_schemas
|
| 26 |
from hackathon_advisor.tools import GOALS, goal_profiles
|
|
@@ -28,6 +29,8 @@ from hackathon_advisor.trace_export import build_trace_jsonl, trace_metadata
|
|
| 28 |
from hackathon_advisor.zerogpu import gpu_task
|
| 29 |
|
| 30 |
|
|
|
|
|
|
|
| 31 |
ROOT = Path(__file__).parent
|
| 32 |
STATIC_DIR = ROOT / "static"
|
| 33 |
DATA_PATH = ROOT / "data" / "projects.json"
|
|
|
|
| 21 |
from hackathon_advisor.lora_training_kit import TRAINING_KIT_FILENAME, build_lora_training_kit_zip
|
| 22 |
from hackathon_advisor.png_export import artifact_png_filename, render_artifact_png
|
| 23 |
from hackathon_advisor.prize_ledger import prize_ledger
|
| 24 |
+
from hackathon_advisor.runtime_hooks import install_asyncio_cleanup_hook
|
| 25 |
from hackathon_advisor.submission_packet import build_submission_packet_markdown
|
| 26 |
from hackathon_advisor.tool_contracts import resolve_tool_call, tool_schemas
|
| 27 |
from hackathon_advisor.tools import GOALS, goal_profiles
|
|
|
|
| 29 |
from hackathon_advisor.zerogpu import gpu_task
|
| 30 |
|
| 31 |
|
| 32 |
+
install_asyncio_cleanup_hook()
|
| 33 |
+
|
| 34 |
ROOT = Path(__file__).parent
|
| 35 |
STATIC_DIR = ROOT / "static"
|
| 36 |
DATA_PATH = ROOT / "data" / "projects.json"
|
data/project_index.json
CHANGED
|
The diff for this file is too large to render.
See raw diff
|
|
|
hackathon_advisor/data.py
CHANGED
|
@@ -25,8 +25,8 @@ GENERIC_PUBLIC_SUMMARY_RE = re.compile(
|
|
| 25 |
|
| 26 |
INDEX_SCHEMA_VERSION = 2
|
| 27 |
INDEX_ALGORITHM = "llama-cpp-embedding-v1"
|
| 28 |
-
DEFAULT_EMBEDDING_MODEL_REPO = "ggml-org/embeddinggemma-
|
| 29 |
-
DEFAULT_EMBEDDING_MODEL_FILE = "embeddinggemma-
|
| 30 |
DEFAULT_EMBEDDING_RUNTIME = "llama.cpp via llama-cpp-python"
|
| 31 |
|
| 32 |
|
|
|
|
| 25 |
|
| 26 |
INDEX_SCHEMA_VERSION = 2
|
| 27 |
INDEX_ALGORITHM = "llama-cpp-embedding-v1"
|
| 28 |
+
DEFAULT_EMBEDDING_MODEL_REPO = "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF"
|
| 29 |
+
DEFAULT_EMBEDDING_MODEL_FILE = "embeddinggemma-300m-qat-Q8_0.gguf"
|
| 30 |
DEFAULT_EMBEDDING_RUNTIME = "llama.cpp via llama-cpp-python"
|
| 31 |
|
| 32 |
|
hackathon_advisor/llama_embedding.py
CHANGED
|
@@ -12,7 +12,7 @@ from hackathon_advisor.data import (
|
|
| 12 |
|
| 13 |
|
| 14 |
TRUE_VALUES = {"1", "true", "yes", "on"}
|
| 15 |
-
DEFAULT_N_CTX =
|
| 16 |
|
| 17 |
|
| 18 |
class LlamaCppEmbedder:
|
|
@@ -23,6 +23,7 @@ class LlamaCppEmbedder:
|
|
| 23 |
model_file: str = DEFAULT_EMBEDDING_MODEL_FILE,
|
| 24 |
model_path: str = "",
|
| 25 |
n_ctx: int = DEFAULT_N_CTX,
|
|
|
|
| 26 |
n_threads: int | None = None,
|
| 27 |
n_gpu_layers: int = 0,
|
| 28 |
verbose: bool = False,
|
|
@@ -31,6 +32,7 @@ class LlamaCppEmbedder:
|
|
| 31 |
self.model_file = model_file.strip() or DEFAULT_EMBEDDING_MODEL_FILE
|
| 32 |
self.model_path = model_path.strip()
|
| 33 |
self.n_ctx = n_ctx
|
|
|
|
| 34 |
self.n_threads = n_threads
|
| 35 |
self.n_gpu_layers = n_gpu_layers
|
| 36 |
self.verbose = verbose
|
|
@@ -63,6 +65,8 @@ class LlamaCppEmbedder:
|
|
| 63 |
embedding=True,
|
| 64 |
pooling_type=LLAMA_POOLING_TYPE_MEAN,
|
| 65 |
n_ctx=self.n_ctx,
|
|
|
|
|
|
|
| 66 |
n_threads=self.n_threads,
|
| 67 |
n_gpu_layers=self.n_gpu_layers,
|
| 68 |
verbose=self.verbose,
|
|
@@ -82,6 +86,7 @@ def create_llama_cpp_embedder(metadata: dict[str, Any]) -> LlamaCppEmbedder:
|
|
| 82 |
),
|
| 83 |
model_path=os.environ.get("ADVISOR_EMBEDDING_MODEL_PATH", ""),
|
| 84 |
n_ctx=_int_env("ADVISOR_EMBEDDING_N_CTX", DEFAULT_N_CTX),
|
|
|
|
| 85 |
n_threads=_optional_int_env("ADVISOR_EMBEDDING_THREADS"),
|
| 86 |
n_gpu_layers=_int_env("ADVISOR_EMBEDDING_GPU_LAYERS", 0),
|
| 87 |
verbose=os.environ.get("ADVISOR_EMBEDDING_VERBOSE", "").strip().lower() in TRUE_VALUES,
|
|
|
|
| 12 |
|
| 13 |
|
| 14 |
TRUE_VALUES = {"1", "true", "yes", "on"}
|
| 15 |
+
DEFAULT_N_CTX = 2048
|
| 16 |
|
| 17 |
|
| 18 |
class LlamaCppEmbedder:
|
|
|
|
| 23 |
model_file: str = DEFAULT_EMBEDDING_MODEL_FILE,
|
| 24 |
model_path: str = "",
|
| 25 |
n_ctx: int = DEFAULT_N_CTX,
|
| 26 |
+
n_batch: int | None = None,
|
| 27 |
n_threads: int | None = None,
|
| 28 |
n_gpu_layers: int = 0,
|
| 29 |
verbose: bool = False,
|
|
|
|
| 32 |
self.model_file = model_file.strip() or DEFAULT_EMBEDDING_MODEL_FILE
|
| 33 |
self.model_path = model_path.strip()
|
| 34 |
self.n_ctx = n_ctx
|
| 35 |
+
self.n_batch = n_batch or n_ctx
|
| 36 |
self.n_threads = n_threads
|
| 37 |
self.n_gpu_layers = n_gpu_layers
|
| 38 |
self.verbose = verbose
|
|
|
|
| 65 |
embedding=True,
|
| 66 |
pooling_type=LLAMA_POOLING_TYPE_MEAN,
|
| 67 |
n_ctx=self.n_ctx,
|
| 68 |
+
n_batch=self.n_batch,
|
| 69 |
+
n_ubatch=self.n_batch,
|
| 70 |
n_threads=self.n_threads,
|
| 71 |
n_gpu_layers=self.n_gpu_layers,
|
| 72 |
verbose=self.verbose,
|
|
|
|
| 86 |
),
|
| 87 |
model_path=os.environ.get("ADVISOR_EMBEDDING_MODEL_PATH", ""),
|
| 88 |
n_ctx=_int_env("ADVISOR_EMBEDDING_N_CTX", DEFAULT_N_CTX),
|
| 89 |
+
n_batch=_optional_int_env("ADVISOR_EMBEDDING_BATCH"),
|
| 90 |
n_threads=_optional_int_env("ADVISOR_EMBEDDING_THREADS"),
|
| 91 |
n_gpu_layers=_int_env("ADVISOR_EMBEDDING_GPU_LAYERS", 0),
|
| 92 |
verbose=os.environ.get("ADVISOR_EMBEDDING_VERBOSE", "").strip().lower() in TRUE_VALUES,
|
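Review note on the new `n_batch` parameter: the diff resolves it with `n_batch or n_ctx`, so an unset value falls back to the full context size, and both `n_batch` and `n_ubatch` receive the result. The sketch below isolates that resolution logic. The helper `optional_int` is a hypothetical stand-in, since the body of `_optional_int_env` is not shown in this diff.

```python
from typing import Mapping, Optional


def optional_int(env: Mapping[str, str], name: str) -> Optional[int]:
    """Parse an integer setting, returning None when it is unset or blank.

    Hypothetical stand-in for the project's `_optional_int_env` helper.
    """
    raw = env.get(name, "").strip()
    return int(raw) if raw else None


def resolve_batch(n_ctx: int, n_batch: Optional[int]) -> int:
    """Mirror `n_batch or n_ctx`: both None and 0 fall back to n_ctx."""
    return n_batch or n_ctx
```

One consequence of using `or`: an explicit `ADVISOR_EMBEDDING_BATCH=0` is treated the same as an unset variable. That seems reasonable here, since a zero batch size would not be usable anyway.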
hackathon_advisor/model_runtime.py
CHANGED
|
@@ -128,13 +128,14 @@ class MiniCPMTransformersPlanner:
|
|
| 128 |
)
|
| 129 |
model = AutoModelForCausalLM.from_pretrained(
|
| 130 |
base_model_id,
|
| 131 |
-
|
| 132 |
device_map="auto",
|
| 133 |
trust_remote_code=True,
|
| 134 |
)
|
| 135 |
if self.adapter_id:
|
| 136 |
model = PeftModel.from_pretrained(model, self.adapter_id, **adapter_kwargs)
|
| 137 |
model.eval()
|
|
|
|
| 138 |
self._model = model
|
| 139 |
if hasattr(torch, "inference_mode"):
|
| 140 |
self._inference_mode = torch.inference_mode
|
|
@@ -228,6 +229,15 @@ def _strip_unused_generation_inputs(inputs: dict[str, Any]) -> None:
|
|
| 228 |
inputs.pop("token_type_ids", None)
|
| 229 |
|
| 230 |
|
| 231 |
def _normalize_xml_tool_output(output: str) -> str:
|
| 232 |
stripped = output.strip()
|
| 233 |
if stripped.startswith('name="'):
|
|
|
|
| 128 |
)
|
| 129 |
model = AutoModelForCausalLM.from_pretrained(
|
| 130 |
base_model_id,
|
| 131 |
+
dtype="auto",
|
| 132 |
device_map="auto",
|
| 133 |
trust_remote_code=True,
|
| 134 |
)
|
| 135 |
if self.adapter_id:
|
| 136 |
model = PeftModel.from_pretrained(model, self.adapter_id, **adapter_kwargs)
|
| 137 |
model.eval()
|
| 138 |
+
_disable_sampling_generation_defaults(model)
|
| 139 |
self._model = model
|
| 140 |
if hasattr(torch, "inference_mode"):
|
| 141 |
self._inference_mode = torch.inference_mode
|
|
|
|
| 229 |
inputs.pop("token_type_ids", None)
|
| 230 |
|
| 231 |
|
| 232 |
+
def _disable_sampling_generation_defaults(model: Any) -> None:
|
| 233 |
+
generation_config = getattr(model, "generation_config", None)
|
| 234 |
+
if generation_config is None:
|
| 235 |
+
return
|
| 236 |
+
generation_config.do_sample = False
|
| 237 |
+
generation_config.temperature = None
|
| 238 |
+
generation_config.top_p = None
|
| 239 |
+
|
| 240 |
+
|
| 241 |
def _normalize_xml_tool_output(output: str) -> str:
|
| 242 |
stripped = output.strip()
|
| 243 |
if stripped.startswith('name="'):
|
hackathon_advisor/prize_ledger.py
CHANGED
|
@@ -14,7 +14,7 @@ MODEL_STACK = [
|
|
| 14 |
},
|
| 15 |
{
|
| 16 |
"role": "Embedding retriever",
|
| 17 |
-
"model": "ggml-org/embeddinggemma-
|
| 18 |
"params_b": 0.30,
|
| 19 |
"status": "deployed",
|
| 20 |
"runtime": "Modal-built llama.cpp GGUF index + runtime llama.cpp query embeddings",
|
|
|
|
| 14 |
},
|
| 15 |
{
|
| 16 |
"role": "Embedding retriever",
|
| 17 |
+
"model": "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF",
|
| 18 |
"params_b": 0.30,
|
| 19 |
"status": "deployed",
|
| 20 |
"runtime": "Modal-built llama.cpp GGUF index + runtime llama.cpp query embeddings",
|
hackathon_advisor/runtime_hooks.py
ADDED
|
@@ -0,0 +1,31 @@
|
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
import sys
|
| 4 |
+
from typing import Any
|
| 5 |
+
|
| 6 |
+
|
| 7 |
+
_HOOK_INSTALLED = False
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
def install_asyncio_cleanup_hook() -> None:
|
| 11 |
+
global _HOOK_INSTALLED
|
| 12 |
+
if _HOOK_INSTALLED:
|
| 13 |
+
return
|
| 14 |
+
previous_hook = sys.unraisablehook
|
| 15 |
+
|
| 16 |
+
def hook(args: Any) -> None:
|
| 17 |
+
if _is_asyncio_invalid_fd_cleanup(args):
|
| 18 |
+
return
|
| 19 |
+
previous_hook(args)
|
| 20 |
+
|
| 21 |
+
sys.unraisablehook = hook
|
| 22 |
+
_HOOK_INSTALLED = True
|
| 23 |
+
|
| 24 |
+
|
| 25 |
+
def _is_asyncio_invalid_fd_cleanup(args: Any) -> bool:
|
| 26 |
+
if getattr(args, "exc_type", None) is not ValueError:
|
| 27 |
+
return False
|
| 28 |
+
if str(getattr(args, "exc_value", "")) != "Invalid file descriptor: -1":
|
| 29 |
+
return False
|
| 30 |
+
owner = getattr(args, "object", None)
|
| 31 |
+
return getattr(owner, "__qualname__", "") == "BaseEventLoop.__del__"
|
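Review note on why `runtime_hooks.py` patches `sys.unraisablehook`: an exception raised inside `__del__` cannot propagate to a caller, so Python (3.8+) reports it through that hook instead. The sketch below shows the mechanism with a stand-in class. `Noisy` and `collect_unraisable` are hypothetical names, and the immediate finalization on `del` relies on CPython reference counting.

```python
import gc
import sys
from typing import List, Tuple


class Noisy:
    """Stand-in object whose finalizer fails, similar to a loop closing a bad fd."""

    def __del__(self) -> None:
        raise ValueError("Invalid file descriptor: -1")


def collect_unraisable() -> List[Tuple[str, str]]:
    """Temporarily replace sys.unraisablehook and record what it receives."""
    seen: List[Tuple[str, str]] = []
    previous_hook = sys.unraisablehook

    def hook(args) -> None:
        # Store plain strings only, so the exception object is not kept alive.
        seen.append((args.exc_type.__name__, str(args.exc_value)))

    sys.unraisablehook = hook
    try:
        obj = Noisy()
        del obj  # CPython runs __del__ here; the ValueError goes to the hook
        gc.collect()  # helps interpreters without reference counting
    finally:
        sys.unraisablehook = previous_hook  # always restore the original hook
    return seen
```

The hook in this commit goes one step further than the sketch: it swallows only the one known-noisy case and forwards everything else to the previous hook, so unrelated finalizer errors stay visible.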
scripts/build_project_index.py
CHANGED
|
@@ -16,7 +16,7 @@ from hackathon_advisor.data import (
|
|
| 16 |
Project,
|
| 17 |
build_index_payload,
|
| 18 |
)
|
| 19 |
-
from hackathon_advisor.llama_embedding import LlamaCppEmbedder
|
| 20 |
|
| 21 |
|
| 22 |
def main() -> None:
|
|
@@ -28,7 +28,7 @@ def main() -> None:
|
|
| 28 |
parser.add_argument("--model-repo", default=DEFAULT_EMBEDDING_MODEL_REPO)
|
| 29 |
parser.add_argument("--model-file", default=DEFAULT_EMBEDDING_MODEL_FILE)
|
| 30 |
parser.add_argument("--model-path", default="")
|
| 31 |
-
parser.add_argument("--n-ctx", type=int, default=
|
| 32 |
parser.add_argument("--n-threads", type=int, default=0)
|
| 33 |
args = parser.parse_args()
|
| 34 |
|
|
@@ -58,7 +58,7 @@ def build_payload(
|
|
| 58 |
model_repo: str,
|
| 59 |
model_file: str,
|
| 60 |
model_path: str = "",
|
| 61 |
-
n_ctx: int =
|
| 62 |
n_threads: int | None = None,
|
| 63 |
build_source: str,
|
| 64 |
builder: str,
|
|
|
|
| 16 |
Project,
|
| 17 |
build_index_payload,
|
| 18 |
)
|
| 19 |
+
from hackathon_advisor.llama_embedding import DEFAULT_N_CTX, LlamaCppEmbedder
|
| 20 |
|
| 21 |
|
| 22 |
def main() -> None:
|
|
|
|
| 28 |
parser.add_argument("--model-repo", default=DEFAULT_EMBEDDING_MODEL_REPO)
|
| 29 |
parser.add_argument("--model-file", default=DEFAULT_EMBEDDING_MODEL_FILE)
|
| 30 |
parser.add_argument("--model-path", default="")
|
| 31 |
+
parser.add_argument("--n-ctx", type=int, default=DEFAULT_N_CTX)
|
| 32 |
parser.add_argument("--n-threads", type=int, default=0)
|
| 33 |
args = parser.parse_args()
|
| 34 |
|
|
|
|
| 58 |
model_repo: str,
|
| 59 |
model_file: str,
|
| 60 |
model_path: str = "",
|
| 61 |
+
n_ctx: int = DEFAULT_N_CTX,
|
| 62 |
n_threads: int | None = None,
|
| 63 |
build_source: str,
|
| 64 |
builder: str,
|
scripts/modal_build_project_index.py
CHANGED
|
@@ -54,8 +54,8 @@ def build_project_index_remote(
|
|
| 54 |
def main(
|
| 55 |
projects: str = "data/projects.json",
|
| 56 |
out: str = "data/project_index.json",
|
| 57 |
-
model_repo: str = "ggml-org/embeddinggemma-
|
| 58 |
-
model_file: str = "embeddinggemma-
|
| 59 |
) -> None:
|
| 60 |
project_snapshot = json.loads(Path(projects).read_text(encoding="utf-8"))
|
| 61 |
payload = build_project_index_remote.remote(project_snapshot, model_repo, model_file)
|
|
@@ -73,8 +73,8 @@ if __name__ == "__main__":
|
|
| 73 |
parser = argparse.ArgumentParser(description="Build the llama.cpp embedding index on Modal.")
|
| 74 |
parser.add_argument("--projects", default="data/projects.json")
|
| 75 |
parser.add_argument("--out", default="data/project_index.json")
|
| 76 |
-
parser.add_argument("--model-repo", default="ggml-org/embeddinggemma-
|
| 77 |
-
parser.add_argument("--model-file", default="embeddinggemma-
|
| 78 |
args = parser.parse_args()
|
| 79 |
with app.run():
|
| 80 |
payload = build_project_index_remote.remote(
|
|
|
|
| 54 |
def main(
|
| 55 |
projects: str = "data/projects.json",
|
| 56 |
out: str = "data/project_index.json",
|
| 57 |
+
model_repo: str = "ggml-org/embeddinggemma-300m-qat-q8_0-GGUF",
|
| 58 |
+
model_file: str = "embeddinggemma-300m-qat-Q8_0.gguf",
|
| 59 |
) -> None:
|
| 60 |
project_snapshot = json.loads(Path(projects).read_text(encoding="utf-8"))
|
| 61 |
payload = build_project_index_remote.remote(project_snapshot, model_repo, model_file)
|
|
|
|
| 73 |
parser = argparse.ArgumentParser(description="Build the llama.cpp embedding index on Modal.")
|
| 74 |
parser.add_argument("--projects", default="data/projects.json")
|
| 75 |
parser.add_argument("--out", default="data/project_index.json")
|
| 76 |
+
parser.add_argument("--model-repo", default="ggml-org/embeddinggemma-300m-qat-q8_0-GGUF")
|
| 77 |
+
parser.add_argument("--model-file", default="embeddinggemma-300m-qat-Q8_0.gguf")
|
| 78 |
args = parser.parse_args()
|
| 79 |
with app.run():
|
| 80 |
payload = build_project_index_remote.remote(
|
tests/test_llama_embedding.py
ADDED
|
@@ -0,0 +1,62 @@
|
|
| 1 |
+
from pathlib import Path
|
| 2 |
+
import sys
|
| 3 |
+
from types import ModuleType
|
| 4 |
+
|
| 5 |
+
from hackathon_advisor.data import DEFAULT_EMBEDDING_MODEL_FILE, DEFAULT_EMBEDDING_MODEL_REPO
|
| 6 |
+
from hackathon_advisor.llama_embedding import DEFAULT_N_CTX, LlamaCppEmbedder, create_llama_cpp_embedder
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
def test_llama_embedder_uses_q8_defaults_and_full_context(
|
| 10 |
+
monkeypatch,
|
| 11 |
+
tmp_path: Path,
|
| 12 |
+
) -> None:
|
| 13 |
+
model_path = tmp_path / "embedding.gguf"
|
| 14 |
+
model_path.write_bytes(b"gguf")
|
| 15 |
+
captured: dict = {}
|
| 16 |
+
|
| 17 |
+
hub = ModuleType("huggingface_hub")
|
| 18 |
+
|
| 19 |
+
def fake_hf_hub_download(repo_id: str, filename: str, repo_type: str) -> str:
|
| 20 |
+
captured["download"] = {
|
| 21 |
+
"repo_id": repo_id,
|
| 22 |
+
"filename": filename,
|
| 23 |
+
"repo_type": repo_type,
|
| 24 |
+
}
|
| 25 |
+
return str(model_path)
|
| 26 |
+
|
| 27 |
+
hub.hf_hub_download = fake_hf_hub_download
|
| 28 |
+
llama_cpp = ModuleType("llama_cpp")
|
| 29 |
+
llama_cpp.LLAMA_POOLING_TYPE_MEAN = 1
|
| 30 |
+
|
| 31 |
+
class FakeLlama:
|
| 32 |
+
def __init__(self, **kwargs) -> None:
|
| 33 |
+
captured["llama_kwargs"] = kwargs
|
| 34 |
+
|
| 35 |
+
def embed(self, text: str, normalize: bool) -> list[float]:
|
| 36 |
+
captured["embed"] = {"text": text, "normalize": normalize}
|
| 37 |
+
return [1.0, 0.0]
|
| 38 |
+
|
| 39 |
+
llama_cpp.Llama = FakeLlama
|
| 40 |
+
monkeypatch.setitem(sys.modules, "huggingface_hub", hub)
|
| 41 |
+
monkeypatch.setitem(sys.modules, "llama_cpp", llama_cpp)
|
| 42 |
+
|
| 43 |
+
vector = LlamaCppEmbedder().embed("private archive")
|
| 44 |
+
|
| 45 |
+
assert vector == [1.0, 0.0]
|
| 46 |
+
assert captured["download"] == {
|
| 47 |
+
"repo_id": DEFAULT_EMBEDDING_MODEL_REPO,
|
| 48 |
+
"filename": DEFAULT_EMBEDDING_MODEL_FILE,
|
| 49 |
+
"repo_type": "model",
|
| 50 |
+
}
|
| 51 |
+
assert captured["llama_kwargs"]["n_ctx"] == DEFAULT_N_CTX
|
| 52 |
+
assert captured["llama_kwargs"]["n_batch"] == DEFAULT_N_CTX
|
| 53 |
+
assert captured["llama_kwargs"]["n_ubatch"] == DEFAULT_N_CTX
|
| 54 |
+
assert captured["embed"] == {"text": "private archive", "normalize": True}
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
def test_create_llama_embedder_accepts_explicit_batch(monkeypatch) -> None:
|
| 58 |
+
monkeypatch.setenv("ADVISOR_EMBEDDING_BATCH", "256")
|
| 59 |
+
|
| 60 |
+
embedder = create_llama_cpp_embedder({"dimensions": 768})
|
| 61 |
+
|
| 62 |
+
assert embedder.n_batch == 256
|
tests/test_model_runtime.py
CHANGED
|
@@ -8,6 +8,7 @@ from hackathon_advisor.model_runtime import (
|
|
| 8 |
render_context,
|
| 9 |
runtime_status,
|
| 10 |
system_prompt,
|
|
|
|
| 11 |
_normalize_xml_tool_output,
|
| 12 |
_strip_unused_generation_inputs,
|
| 13 |
)
|
|
@@ -194,6 +195,22 @@ def test_generation_inputs_drop_token_type_ids() -> None:
|
|
| 194 |
assert inputs == {"input_ids": [1], "attention_mask": [1]}
|
| 195 |
|
| 196 |
|
| 197 |
def test_model_xml_fragment_is_normalized() -> None:
|
| 198 |
output = 'name="save_idea">{"title":"A","pitch":"B"}'
|
| 199 |
|
|
|
|
| 8 |
render_context,
|
| 9 |
runtime_status,
|
| 10 |
system_prompt,
|
| 11 |
+
_disable_sampling_generation_defaults,
|
| 12 |
_normalize_xml_tool_output,
|
| 13 |
_strip_unused_generation_inputs,
|
| 14 |
)
|
|
|
|
| 195 |
assert inputs == {"input_ids": [1], "attention_mask": [1]}
|
| 196 |
|
| 197 |
|
| 198 |
+
def test_generation_config_drops_sampling_defaults() -> None:
|
| 199 |
+
class GenerationConfig:
|
| 200 |
+
do_sample = True
|
| 201 |
+
temperature = 0.7
|
| 202 |
+
top_p = 0.95
|
| 203 |
+
|
| 204 |
+
class Model:
|
| 205 |
+
generation_config = GenerationConfig()
|
| 206 |
+
|
| 207 |
+
_disable_sampling_generation_defaults(Model())
|
| 208 |
+
|
| 209 |
+
assert Model.generation_config.do_sample is False
|
| 210 |
+
assert Model.generation_config.temperature is None
|
| 211 |
+
assert Model.generation_config.top_p is None
|
| 212 |
+
|
| 213 |
+
|
| 214 |
def test_model_xml_fragment_is_normalized() -> None:
|
| 215 |
output = 'name="save_idea">{"title":"A","pitch":"B"}'
|
| 216 |
|
tests/test_runtime_hooks.py
ADDED
|
@@ -0,0 +1,35 @@
|
|
| 1 |
+
from types import SimpleNamespace
|
| 2 |
+
|
| 3 |
+
from hackathon_advisor.runtime_hooks import _is_asyncio_invalid_fd_cleanup
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
def test_asyncio_invalid_fd_cleanup_hook_matches_only_event_loop_destructor() -> None:
|
| 7 |
+
def event_loop_del() -> None:
|
| 8 |
+
pass
|
| 9 |
+
|
| 10 |
+
event_loop_del.__qualname__ = "BaseEventLoop.__del__"
|
| 11 |
+
|
| 12 |
+
def other_function() -> None:
|
| 13 |
+
pass
|
| 14 |
+
|
| 15 |
+
assert _is_asyncio_invalid_fd_cleanup(
|
| 16 |
+
SimpleNamespace(
|
| 17 |
+
exc_type=ValueError,
|
| 18 |
+
exc_value=ValueError("Invalid file descriptor: -1"),
|
| 19 |
+
object=event_loop_del,
|
| 20 |
+
)
|
| 21 |
+
)
|
| 22 |
+
assert not _is_asyncio_invalid_fd_cleanup(
|
| 23 |
+
SimpleNamespace(
|
| 24 |
+
exc_type=ValueError,
|
| 25 |
+
exc_value=ValueError("Invalid file descriptor: -1"),
|
| 26 |
+
object=other_function,
|
| 27 |
+
)
|
| 28 |
+
)
|
| 29 |
+
assert not _is_asyncio_invalid_fd_cleanup(
|
| 30 |
+
SimpleNamespace(
|
| 31 |
+
exc_type=RuntimeError,
|
| 32 |
+
exc_value=RuntimeError("Invalid file descriptor: -1"),
|
| 33 |
+
object=event_loop_del,
|
| 34 |
+
)
|
| 35 |
+
)
|