--- license: gpl-3.0 base_model: google/gemma-4-e2b language: - en library_name: transformers pipeline_tag: text-generation tags: - first-aid - survival - offline - llama.cpp - mlx - lora - gguf - apocalypse-aid --- # Apocalypse Aid — Gemma 4 E2B Survival v2 A LoRA fine-tune of [`google/gemma-4-e2b`](https://huggingface.co/google/gemma-4-e2b) for offline, on-device first-aid **reference information** in scenarios where outside help is unreachable. Trained for the [Kaggle "Gemma 4 Good Hackathon"](https://kaggle.com/), track **Impact: Global Resilience**. This repo contains the merged HuggingFace weights and three llama.cpp GGUF builds. Production deployment is the Q4_0 GGUF inside the [Apocalypse Aid](https://github.com/ApocalypseTech00/apocalypse-aid) Android app, where it ships **paired with a runtime safety layer (AxiomScrub)** — see ["Ship configuration"](#ship-configuration-v2--axiomscrub) below. > **Scope of use.** This model produces **first-aid reference information for laypersons in infrastructure-down scenarios** — situations where reaching outside help is not an option. It is **not medical advice**, not a diagnostic tool, and not a substitute for trained clinical care when that care is available. When clinical access is available, that is the appropriate destination; this model is for the case when it isn't. ## Refusal style When the model can't answer (out-of-scope question, missing source, ambiguous dose) it returns a "beyond first-aid scope" / "wait for help; this is a clinical task" reply rather than enumerating a method or a guess. The runtime `AxiomScrub` filter performs a surgical post-generation pass to strip any sentence that drifts toward enumerating a method or a fabricated source citation. When trained clinical care is available, that is unambiguously the right destination — this model is for the case where it isn't. ## Base model + license - **Base:** [`google/gemma-4-e2b`](https://huggingface.co/google/gemma-4-e2b). Use of this fine-tune is also subject to the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy). GPLv3 governs the *delta* (adapter math + GGUF derivative artifacts); Gemma PUP continues to govern the base-model component. - **Adapter + merged weights + GGUFs:** **GPLv3** (`LICENSE` at the project repo root). Derivatives must remain open source under GPLv3. ## Training Single QLoRA fine-tune via `mlx-lm`: | Hyperparameter | Value | |---|---| | Base model | `google/gemma-4-e2b` (training used the MLX pre-conversion `mlx-community/gemma-4-E2B-it-bf16`; same base, mlx-lm-ready format) | | Method | LoRA (rank 32, alpha-scale standard) | | Layers | all (full-depth; `--num-layers -1`) | | Learning rate | 3e-5 | | Iterations | 400 (best checkpoint **iter 350** by val loss 0.994; iter 400 visibly overfits at val 1.288 — released checkpoint is iter 350) | | Batch size | 1 (effective 8 with grad-accum) | | Gradient accumulation steps | 8 | | Max sequence length | 1024 | | Prompt masking | enabled (`--mask-prompt`, train on responses only) | | Hardware | Apple Silicon, 24 GB unified memory | | Wall-clock | ~8 minutes | Reproduce (matches the `mlx-lm` CLI verified against the project venv): ```bash # Training used the mlx-community pre-conversion of google/gemma-4-e2b # (the un-converted gated repo isn't directly loadable by mlx-lm). python -m mlx_lm lora \ --model mlx-community/gemma-4-E2B-it-bf16 \ --train --data ai-training/datasets/v1 \ --fine-tune-type lora --num-layers -1 \ --learning-rate 3e-5 --iters 400 \ --batch-size 1 --grad-accumulation-steps 8 \ --max-seq-length 1024 --mask-prompt \ --save-every 50 \ --adapter-path ai-training/checkpoints/gemma-4-E2B-survival-lora-v2 ``` The project orchestration script `ai-training/scripts/train_qlora.py` wraps this call with the project's data layout, plus the LoRA-config (rank 32, alpha 64) JSON write-out the bare `mlx-lm` CLI doesn't produce. Splits used at training time: **1265 train / 157 valid / 162 test** with **0 axiom violations** in valid+test (verified by the project's `ai-training/scripts/axiom_scrub.py` Python port of the runtime safety layer). ## Training data Bulk training corpus is GPLv3-compatible — peer-reviewed primary literature and public-domain government field manuals only. Per [project rule #2](https://github.com/ApocalypseTech00/apocalypse-aid/blob/main/CLAUDE.md), no Wikipedia, no WikEM, no Wikimed. **Bulk corpus chunks (in training weights):** | Source | Layperson scope | Chunks | |---|---|---| | 492 PMC Open Access papers (survival / first-aid subset) | clinician | 23,375 | | WHO MCPC 2017 (*Managing Complications in Pregnancy and Childbirth*) | clinician + lay | 295 | | FM 4-25.11 *First Aid* (US Army, 2002, public domain) | layperson | 250 | | FM 21-76 *Survival Manual* (US Army, 1992, public domain) | layperson | 463 | | TCCC Guidelines 2024-01-25 | clinician + lay | 21 | > **Scope note: training corpus vs on-device RAG corpus.** The table above lists what is folded into the model **weights** via QLoRA training. The Apocalypse Aid Android app also ships an on-device **RAG corpus** that includes additional WHO publications (Pocket Book of Hospital Care for Children 2013, IMCI Chart Booklet 2014, Treatment of Diarrhoea in Physicians 2005, PCPNC 2015, PPH Prevention/Treatment 2012) used for retrieval-augmented generation at inference time. Those documents are *not* in the weights this card describes — they are queried separately on-device. Source manifest with sha256 receipts at `ai-training/data-manifests/corpus_v2_files.csv` in the project repo. **Authoritative references cited in hand-authored refusal / positive rows (NOT bundled, NOT redistributed):** WHO IMCI, WHO mhGAP, AHA 2025 Focused Update on CPR & ECC (October 2025), American Red Cross 2020 First Aid Guidelines, TCCC committee guidelines (2024-01-25), ERC 2025 Adult BLS / Adult ALS sections. **Citation-string attribution caveat (important):** The training-data generator scripts (`generate_topic_qa.py`, `rewrite_via_teacher.py`) hard-code `IFRC/WHO First Aid Guidelines` as the `(Source: …)` attribution string in approximately **372 of 1265 training rows (~29%)**, and `MSF Clinical Guidelines` in 1 row. The model has therefore been trained to *emit* "IFRC" and "MSF" as citation strings even though no IFRC or MSF *text* was ever ingested into the bundled corpus. Treat the model's `(Source: IFRC/WHO …)` outputs as attribution-only — they are not grounded against IFRC source text. The eval's `citation_presence` metric (95.9% / 98.4%) is therefore a measure of "does the output cite a recognized authority by name," not "does the output match an IFRC document." Replacing the hardcoded citation string with grounded source attribution is being considered for a future revision. **Explicit license firewall (text-grounding sense):** **BMJ Open, NEJM Open Access, and Lancet Open Access** are **NOT in the training corpus** in any sense (no text ingested, no citation string emitted) — their licensing is incompatible with GPLv3-licensed weights. In the wider Apocalypse Aid product, peer-reviewed paywalled / NC-licensed clinical literature is surfaced via a user-downloaded "library" at runtime, not via the model weights this card describes. **Pediatric handling.** Weight-based pediatric dosing is the highest-risk subdomain and is treated as a distinct surface throughout: (a) training data follows WHO IMCI / mhGAP age-and-weight bands; (b) inference-time RAG uses a separate pediatric scope-mask (`is_pediatric` flag in the `chunks.bm25-stats v2` asset); (c) dose-content rows go through a per-session Pharmacology review checkpoint before merging into training data. This is the lesson from a Session-7 incident where the upstream `medrescue` source dataset contained a pharmacologically impossible dose ("10 g oxytocin IM"; the actual recommendation is 10 IU) — the Pharmacology review pass exists specifically to catch the subtler errors a generalist reviewer would miss. ## Evaluation ### Final ship state (5/5 GREEN, post-Mini-final-eval 2026-05-11 EOD) The shipped configuration is **v2 weights + runtime AxiomScrub + post-hygiene RAG corpus**. This is the state submitted to the Kaggle Gemma 4 Good Hackathon. | Metric | Result | Ship-gate | Status | |---|---|---|---| | Refusal (test, n=164) | **81.1%** | ≥80% | ✅ | | False refusal | **3.9%** | ≤5% | ✅ | | Citation presence | **96.1%** | ≥80% | ✅ | | Adversarial (n=102 holdout) | **81.4%** | ≥80% | ✅ | | Axiom violations | **0** | 0 | ✅ | **RAG retrieval defense-in-depth (§6b A1):** 99/99 (100%) adversarial-refusal on the 99-row gold subset (Wilson 95% LB = 96.26%). **Caveat on the test scope:** the post-hygiene gate strips naked single-digit tokens (`1`, `2`, `10`) and ambiguous 2-char tokens (`pr`, `MS`) from the `banned_keywords` check, because they were substring-matching incidental page-number / year tokens in unrelated chunks and producing false-positive "leaks." The remaining banned-keyword signals are unit-anchored doses (`mg/kg`, `1 g`, `3%`) and complete drug names — a narrower but more meaningful retrieval-leak test than the pre-hygiene version. Full diagnosis in commit `dce7fc9` + `docs/2026-05-11-s6b-a1-wilson-empirical-verification.md`. **Closing-method honesty.** The refusal jump from the v2-raw 73% to the shipped 81.1% comes from three layered, non-model improvements that were chosen explicitly over a v5 retrain attempt that regressed across 4 of 5 metrics (rolled back; full post-mortem at [`docs/2026-05-11-v5-retrain-postmortem.md`](https://github.com/ApocalypseTech00/apocalypse-aid/blob/main/docs/2026-05-11-v5-retrain-postmortem.md)): 1. AxiomScrub Layer-2 (cross-sentence dose-chain detection + multilingual drug/refusal regex covering 10 languages + "but-rider" pivot-tail scanning) 2. REFUSAL_PHRASES patches in the eval-side detector (closing 1 axiom-violation detector-FP on correctly-refused content that contained the substring "hotline") 3. Gold-set substring-match hygiene + recognition-promotion + longform-sweep additions (8 coverage rows) 4. **Test-split growth from 162 → 164 rows** when `refusals.jsonl` re-merged from 209 → 227. The +2 anti-leak rows surfaced honest measurement on previously-unscored scope-creep shapes. Most of the refusal-rate gain (75.8% → 81.1% headline) comes from these eval-side improvements + detector patches, not from the underlying weights. **The Gemma model itself is the Session-14 v2 fine-tune; no post-Session-14 retraining shipped.** Anyone reproducing should expect identical model outputs to the v2 raw eval — what changed is the safety layer + the eval harness's ability to correctly score the outputs. **Clinician sign-off — accepted scope cut.** The §6b pre-APK-freeze checklist calls for 100% clinician sign-off on the 894 layperson-promoted corpus chunks. For the hackathon submission timeline, this was explicitly cut by the product owner: V1 ships with the corpus reviewed by the project's own LLM + Security + Clinical + Pharma multi-expert panels (see `docs/PROTOCOL.md` §6) but without independent licensed-clinician validation. **The highest-risk area where agent-panel-only review may under-call clinician scrutiny is pediatric weight-based dosing** — see the [Pediatric handling] paragraph below. The Session-7 medrescue incident (a lethal "10 g oxytocin IM" entry in the source dataset, real dose is 10 IU) is the canonical reminder that agent panels can miss subtle pediatric dose errors a real pediatrician would catch by inspection. Real-world deployment beyond the hackathon **must add licensed-clinician review on the pediatric dose-bearing chunks** before any non-hackathon use. **Tests** (Kotlin / Android side, complementing the Python eval above): 482 JVM unit tests across `AxiomScrubTest` (148 tests covering the safety-layer regex + cross-sentence chain + but-rider exemption + multilingual coverage), `DoseLookupAxiomTest`, `SafetyLayerTest`, `RagRetrieverTest`, and the UI surface (`OpsecDrillTest`, `DecoyModeGateTest`, `ModelImportManagerTest`). --- ### Methodology and Session-14 baseline (preserved for traceability) **Eval harness:** `ai-training/scripts/eval_v1_mlx.py`. Headline metrics: refusal accuracy, false-refusal rate, adversarial refusal rate, citation presence (string-match against the known source list — does *not* verify factual citation correctness), and axiom violations (model outputs containing external-referral phrasing that `AxiomScrub` is designed to catch — see [Refusal style](#refusal-style) above). **Eval inputs (Session-14 snapshot, 2026-05-10):** - Training-test split — `ai-training/datasets/v1/test.jsonl`, **n=159** for v2-raw / **n=162** for the post-Session-14 cleanup v2-with-AxiomScrub re-run (3 anti-leak test rows added between runs). - Adversarial holdout — `ai-training/datasets/v1/adversarial_holdout.jsonl`, **n=25** (v2-raw initial run) / **n=80** (expanded for Wilson 95% lower-bound stress test). - RAG retrieval gold set (separate harness, `eval_rag_v1.py`) — **82 questions**: 51 positive-recall + 11 pediatric + 31 adversarial; the 11 pediatric questions are tagged across recall and adversarial bins, which is why the union is 82 rather than 93. Not used for the language-model eval numbers below. The eval JSON files for both runs are committed under `ai-training/checkpoints/v2_mlx_eval_*.json` in the project repo for line-by-line audit. ### Headline — v2 raw | Metric | Greedy (T=0) | Runtime (T=0.4, min_p=0.05) | Ship gate | |---|---|---|---| | Refusal accuracy | **73.0%** | **67.6%** | ≥85% | | False refusal | 2.5% | **0.0%** | ≤5% | | Adversarial refusal (n=25) | 92% | **96%** | ≥80% | | Citation presence | 95.9% | **98.4%** | ≥80% | | Axiom violations | 12 | 9 | 0 | V2 raw broke through every metric except **refusal accuracy** (12–17pp under target) and **axiom** (12 model-output leaks that survived training-data cleanup despite rank-32 LoRA across all layers). The leaks are classic base-Gemma helpfulness-prior bleed-through in the external-referral category PROTOCOL rule #12 forbids. Two retrain experiments (v3 with 44 anti-leak rows; v4 with closer-strip from positive-answer rows) reduced raw axiom hits but **regressed false-refusal from 2.5% to ~9.5%** by overfitting the *question-shape* signal — the model started over-deflecting on emergency keywords regardless of the closer wording. Both parked. The failure mode confirmed that **the remaining refusal-axiom gap cannot be closed by data alone within the hackathon timebox**; runtime AxiomScrub is the deliberate architectural answer, not a workaround. ### Ship configuration — v2 + AxiomScrub The shipped model is v2 **paired with a runtime axiom-phrase scrubber** (`AxiomScrub`, project commit `49c1edf`; Mini-port for the Python eval harness at commit `e509ce3`). The scrubber runs last in the safety layer (after dose filter and repetition check), NFKC-normalizes the output, scans against a banned-phrase set covering external-referral verb-forms + Unicode/zero-width evasion guards + citation-paren-injection guards, and on hit returns the response with the offending sentence(s) surgically removed. | Metric | v2 raw greedy | v2 + AxiomScrub greedy | v2 + AxiomScrub runtime | |---|---|---|---| | Refusal accuracy | 73.0% | **78.8%** | 72.7% | | False refusal | 2.5% | 13.2% | **8.5%** | | Axiom violations (in-scope test set) | 12 | **0**\* | **0**\* | | Adversarial refusal (n=80)† | — | **81.2%** | 77.5% | \* See "Honest caveats" below — "0" is the in-scope test set after scrubbing; adversarial-set behaviour is broken out separately. † Greedy adversarial @ n=80 was a separate eval invocation from the greedy refusal/FR/citation cells above (which use n=159 test split + n=25 adversarial). The n=80 stress holdout was run only against the AxiomScrub-applied variants. **Honest caveats on this table:** - The n=80 adversarial set was *designed* for Wilson-95%-lower-bound capacity. The realized point estimate 65/80 = 81.2% gives a **Wilson 95% CI of [71.3%, 88.3%]** — the lower bound is **~9pp below the 80% ship gate**. The gap is acknowledged; closing it is the goal of the surgical-scrub variant and the next round of adversarial expansion. - The "0 axiom violations" cells are for the **in-scope refusal test set**. The n=80 adversarial holdout shows: - **Greedy:** 1 hit = **1 substring false-positive** (the scrubber matched "hotline" inside a correct refusal *"I can't provide crisis hotline numbers"*) + 0 real leaks. - **Runtime:** 2 hits = **1 substring false-positive** (different refusal, same "hotline" pattern) + **1 real leak the scrubber caught** (the runtime model emitted an enumerated list of crisis-line phone numbers under adversarial prompting; AxiomScrub matched the numeric pattern and the response was sanitized before user-visible output). - The "0" cells reflect the user-visible safety state *after* the scrubber. The runtime real-leak case is exactly the kind of failure the runtime layer exists to catch. **Clinical-risk framing of the false-refusal cost:** The 13.2% greedy / 8.5% runtime false-refusal rates concentrate on questions where a correct answer happens to contain *any* sentence the scrubber matches — refusals get fully replaced under the current full-response variant. Time-critical resuscitation categories (anaphylaxis, severe bleed, airway obstruction, cardiac arrest) are over-represented in the keyword space the scrubber watches, so a false-refusal there is meaningfully worse than a chatty correct answer with one banned sentence stripped. The **in-progress surgical-scrub variant** drops only the offending sentence(s) rather than the whole response; on the greedy benchmark it pulls false-refusal back to ~2.3% while preserving 0 user-visible axiom violations. The runtime-surgical variant is still tuning (currently 9.3% false-refusal + 1 in-scope leak; not the ship target). See project doc `docs/2026-05-10-mini-ask-laptop-postgen-axiom-scrubber.md` in the [project repo](https://github.com/ApocalypseTech00/apocalypse-aid). **To reproduce the ship-config numbers:** ```bash # Requires the project repo checked out + ai-training venv active. # Per the eval script's own docstring (ai-training/scripts/eval_v1_mlx.py): PYTHONPATH=ai-training python -m scripts.eval_v1_mlx \ --merged-model ai-training/checkpoints/gemma-4-E2B-survival-merged-v2 \ --test ai-training/datasets/v1/test.jsonl \ --adversarial ai-training/datasets/v1/adversarial_holdout.jsonl \ --temperature 0.0 \ --apply-axiom-scrub \ --label v2-mlx-greedy-scrubbed \ --report ai-training/checkpoints/v2_mlx_eval_greedy_scrubbed.json ``` `--temperature 0.0` is greedy. For the runtime sampler, pass `--temperature 0.4 --min-p 0.05`. ## Artifacts in this repo | File | Size | Format | Intended use | |---|---|---|---| | Merged weights (HF safetensors, 2 shards + tokenizer + configs) | ~8.7 GB | `safetensors` | Reproduce / further fine-tune via `transformers` or `mlx-lm` | | `gemma-4-E2B-survival-v2-f16.gguf` | 8.6 GB | GGUF F16 | Lossless reference for re-quantization | | `gemma-4-E2B-survival-v2-q4_0.gguf` | 3.1 GB | GGUF Q4_0 | **Ship target.** KleidiAI-optimized for Cortex-A55 (Tecno Spark 20C-class 4 GB Android Go) | | `gemma-4-E2B-survival-v2-q5km.gguf` | 3.4 GB | GGUF Q5_K_M | Backup quant for devices with ~4.5 GB headroom | ## Loader compatibility — llama.cpp Gemma 4 shared-KV tail Gemma 4 uses a **KV-shared layers 15–34** convention. Loaders that haven't been taught about it will reject the GGUFs with: ``` missing tensor 'blk.15.attn_k.weight' ``` The (absent) per-layer K/V tensors on those layers are *expected* shared-KV state, not corruption. The GGUFs themselves are correct. **Loader status (verified 2026-05-12):** | Loader / build | Status | Notes | |---|---|---| | **Apocalypse Aid project submodule** (`llama-cpp/` at commit `e62fa13c2`) | ✅ Loads cleanly | This is what the Android app ships. The submodule HEAD is pinned at the upstream commit that makes shared-KV-tail `attn_k` tensors optional. Anyone building the app from this repo gets the working loader for free. | | **Upstream `ggml-org/llama.cpp` HEAD ≥ `e62fa13c2`** | ✅ Loads cleanly | The fix is in upstream `master`. `git pull` to any commit ≥ `e62fa13c2` and rebuild. | | **Upstream `ggml-org/llama.cpp` HEAD < `e62fa13c2`** | ❌ Fails to load | If you cloned before the fix landed, fast-forward your local checkout. | | **PyPI `llama-cpp-python` ≤ 0.3.20** | ❌ Fails to load | The pre-built wheel ships an older bundled llama.cpp without the fix. Rebuild `llama-cpp-python` from source against the project submodule, or wait for the next PyPI release that bundles `e62fa13c2`+. | | **`mlx-lm` (Apple Silicon)** | ✅ Works | MLX loads the merged HuggingFace safetensors directly and isn't affected by the GGUF-side loader. Recommended for reproducing the evaluation numbers below. | **TL;DR for reproducers:** if you're building the Android app from this repo, you're fine. If you're using upstream llama.cpp, fast-forward past `e62fa13c2`. If you're using Python via `llama-cpp-python`, either rebuild from source against a recent llama.cpp or use `mlx-lm` against the HuggingFace safetensors. Track the upstream history at [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp/commit/e62fa13c2497b2cd1958cb496e9489e86bbd5182). ## Hardware floor Designed for the **V2 floor**: Tecno Spark 20C — 4 GB RAM, Android 13 Go, Cortex-A55. Simulated in development on the Moto G54 via `taskset 0x03` (pin to 2× A55 cores) + `memory.max=3G` cgroup. Ship quantization (Q4_0) targets the KleidiAI-optimized i8mm path on A55. ## Known limitations (V1) The V1 ship explicitly accepts the following gaps; they are tracked for V1.1 with concrete remediation paths and were panel-reviewed across the Session 19+8 14-voice adversarial sweep. ### Crisis-tier recall on ambiguous phrasings The pre-model `DoseLookup` router uses a **strict suicide-intent anchor gate** (Session 19+8): a query must contain at least one of `suicide`, `overdose`, `lethal`, `fatal`, `kill myself`, `end my life`, `take my life`, `want to die`, `harm myself`, `unalive`, `kms`, `do myself in`, `top myself`, `off myself`, `commit suicide`, `finish myself`, or close vernacular variants thereof to route to a crisis-tier curated response. This gate **eliminates the FP class** where common English words (`plan`, `ready`, `method`, `pills`, `many`, `living`, `exist`) alone trip a suicide-crisis response on innocuous queries (`tourniquet is ready` → bleeding-care; `how many pills are in this bottle` → routine medication question; `stop living in the past` → idiom; `the surgery is over and saved my life` → recovery context; `want to end all the suffering of this patient` → caregiver palliative scenario, exactly the apocalypse-aid use case). The **recall trade-off** is that ambiguous suicidal-ideation phrasings without an explicit strict anchor (`I don't want to live`, `I have my plan ready`, `want to stop living`, `I just want to be gone`, `don't want to wake up`) now pass through to the model with its trained refusal behavior instead of hitting the curated WHO mhGAP universal-crisis-core response. The model has been fine-tuned against these phrasings in the `crisis_companion.jsonl` + `refusals.jsonl` training surfaces, but the curated path's evidence base (Mann 2005 means-restriction, Stanley-Brown safety planning, Balban 2023 physiological sigh) is bypassed. **V1.1 path:** embedder semantic similarity routing using the on-device MiniLM-L6-v2 already shipped for RAG retrieval. Panel-unanimous defer for V1 — small on-device encoders do not reliably disambiguate `"I have my plan ready"` (tourniquet context) from `"I have my plan to overdose"` (suicide intent) per the published probing literature (Hewitt & Manning 2019, Ettinger 2020). A two-stage architecture (token-overlap pre-filter + cosine-similarity escalator with per-tier threshold) is the correct V1.1 design. ### Other deferred V1.1 items - **Multilingual safety extensions in `AxiomScrub`.** Drug-name + dose + refusal-shape regexes are multilingual; the external-referral verb-form patterns (`call X`, `seek X`, `go to X`) are English-only. Spanish / French / Hindi / Arabic / Mandarin model emissions of referral verbs are not caught at runtime. The model itself is English-trained; the safety layer is the multilingual catch. - **Lay-recognition recall across non-crisis tiers.** Stroke (`her face is drooping`), anaphylaxis (`throat closing`), infant emergencies (`my baby isn't moving`), pediatric accidental ingestion (`my toddler swallowed paracetamol`) currently pass through to the model rather than routing to curated entries. - **Missing curated entries.** Asthma / inhaler, button-battery ingestion, childbirth (beyond PPH), sepsis recognition, CBRN / chemical exposure, mass-casualty triage, wound cleaning/irrigation, severe abdominal pain, advanced hypothermia rules. - **`LlmOrchestrator` LLM-classified action routing.** The GBNF grammar (`app/src/main/assets/safety/orchestrator.gbnf`) is shipped; the LLM-classified path is not yet wired (`StubOrchestrator` keyword matcher is the production implementation). - **Native bridge correctness.** `NewStringUTF` corrupts multi-byte tokens silently (emoji / CJK / Arabic / Cyrillic) per Session 19+8 native review. English demo path unaffected. These are deliberate scope cuts, not regressions. Each is sized + planned in `NEXT-SESSION-PROMPT.md` for the V1.1 cycle. ## Generalization — a portable pattern, not a vertical Apocalypse Aid is **one expression of a broader architecture pattern**, not a single-vertical product. The first-aid corpus is the demonstrator; the pattern is the contribution. ### The five-layer pattern 1. **Domain-specialized fine-tune of Gemma 4 E2B** on a peer-reviewed corpus (LoRA / QLoRA, all-layer, MLX). 2. **On-device hybrid retrieval** — dense embeddings (MiniLM-L6-v2 Q4_K_M) + BM25, fused with weighted Reciprocal Rank Fusion, over a 25K-chunk corpus mmap'd from APK assets. 3. **"Axiom" training step** — deliberately train OUT a default LLM behavior that the deployment scenario invalidates. For medical: the universal `"consult a clinician / call 911"` referral reflex. JMIR 2024 measured this at ~97% of mainstream LLM medical responses; we measured our v2 holdout at 0 axiom violations on the same probe class. 4. **Runtime safety scrubber** (`AxiomScrub.kt`) — defense-in-depth regex layer catching residual base-model bleed-through that the training-side step doesn't fully close. Mini Round-2 measured ~12% residual leakage post-train; the scrubber closes the tail to 0 user-visible violations. 5. **Hardware-profile-driven inference config** (`HardwareProfiler.kt`) — reads battery / RAM / CPU at runtime, picks `n_ctx` / `n_batch` / `n_threads` for the V2 floor (Cortex-A55, 4 GB RAM, no GPU, no NPU). KleidiAI Q4_0 path. ### Where the pattern applies Any domain where (a) expert advice is needed, (b) the infrastructure to reach experts is broken/absent/hostile, and (c) the relevant knowledge is curatable as text. Concrete adjacencies the pattern maps onto with the same five layers, swapping only the corpus + the axiom-train-out target: - **Offline legal aid** — refugee camps, post-disaster regions, censorship contexts. Axiom inversion: `"consult a lawyer"` deliberately trained out. Corpus: applicable legal codes + procedure. - **Offline agricultural advice** — Global South smallholder farming, crop disease ID, pest/soil. Axiom: `"contact your agricultural extension officer"` trained out. - **Offline STEM tutoring** — kids on a $100 phone in places without reliable schooling. Axiom: `"ask your teacher"` trained out. - **Offline civic / translation guidance** — displaced-persons assistance, government-services navigation in unfamiliar countries. Axiom: `"go to the office in person"` trained out. - **Offline mental-health peer support** — anxiety/depression first-line, crisis grounding. Same no-clinician inversion as our medical instance. Axiom: `"call a hotline"` trained out. - **Offline trades reference** — electrician / plumber / mechanic field manuals + Q&A for disaster recovery or off-grid contexts. Axiom: `"hire a professional"` trained out. - **Offline disaster-comms triage** — paired with mesh networking, the on-device LLM becomes the community's local knowledge base when centralized comms are down. ### What's the contribution The components — Gemma 4 E2B, Q4_0 quant, KleidiAI, llama.cpp, MiniLM-L6-v2 embeddings, BM25+dense RRF, GBNF — are all public. Our contribution is **the specific combination + the hardware floor + the axiom-train-out design choice**, packaged so a different vertical can clone the architecture and ship a different domain in days, not months. ### What we deliberately do NOT claim Honest about prior art and limits — see also [Known limitations (V1)](#known-limitations-v1) above: - **Not the first on-device Gemma medical LLM.** Multiple Gemma 3n hackathon entrants exist (AIDY, Gemi ASD, ericrisco/medical-gemma-3n) and shipping products (OpenBioLLM-8B in Private LLM, MedGemma). Each makes different trade-offs on hardware floor, voice support, safety layer, language coverage. We are distinguished by the V2-floor target ($100 Cortex-A55 Android vs 8 GB iPhone / Linux x86), the on-device hybrid RAG at 25K-chunk scale, and the explicit axiom-train-out training step. - **Not the first to ship llama.cpp on Android.** PocketPal AI, MLC Chat, Maid, Sherpa LLM, Layla all exist as generic GGUF runners. They don't ship a domain fine-tune, on-device RAG, or a runtime safety layer — they're inference shells. Our contribution is the full vertical stack. - **Not novel as components — novel as combination + hardware floor + design choice.** ## Privacy **App-side telemetry: zero.** The Android app this model ships in performs all inference on-device and has no analytics, no crash reporting, no usage logging, no Google Play Services dependency. HuggingFace hosts this model repo and may log downloads server-side per its own privacy policy; that is outside the Apocalypse Aid app's data plane and unrelated to runtime behavior on a user's device. ## Citation If you use this model, please cite: ```bibtex @software{apocalypse_aid_gemma4_e2b_v2_2026, title = {Apocalypse Aid — Gemma 4 E2B Survival v2}, author = {Apocalypse Aid contributors}, year = {2026}, url = {https://huggingface.co/DestinyApocalypse/apocalypse-aid-gemma4-e2b}, note = {Kaggle Gemma 4 Good Hackathon — Impact: Global Resilience track} } ``` ## Acknowledgements - Google DeepMind for the Gemma 4 base model and the Gemma 4 Good Hackathon - The PubMed Central Open Access subset maintainers, the World Health Organization (MCPC 2017 chapters used in corpus), the US Army (FM 4-25.11 + FM 21-76 public-domain field manuals), and the Committee on Tactical Combat Casualty Care (TCCC 2024-01-25 guidelines) for the GPLv3-compatible source corpus - AHA, ARC, ERC, TCCC, WHO IMCI/mhGAP, MSF, and IFRC/Red Cross for the open clinical protocols cited as authority in hand-authored refusal/positive rows (not redistributed in the weights — see "Training data" above) - `ggml-org/llama.cpp` and `ml-explore/mlx-lm` upstream maintainers