Chimera.APE — v0.1.0-alpha (single-file build)

Three queries, one binary, zero regrets. Runs on anything, answers to no one.

One Actually Portable Executable. One download. Everything inside.

chimera-full.ape (~7.5 GB) bundles, in a single self-contained file that runs unmodified on Linux / macOS / Windows / BSD:

the orchestrator (C++/Cosmopolitan),
a llamafile + Gemma 4 12B QAT q4_0 (embeddings and chat from one server),
the multimodal projector (image + audio understanding),
QLever (SPARQL knowledge graph + BM25 text index), and
TurboVec (quantized approximate-nearest-neighbor vector search).

Point it at a directory of files (text, code, images, audio) and it digests everything into a hybrid graph-vector database. Ask a question and it returns a synthesized answer with citations whose provenance is checksum-verified. No network, no sidecar downloads, no runtime dependencies.

GitHub (source, smaller organ-only build, full docs): https://github.com/SEBK4C/Chimera.APE

Quick start

# Download this one file (no weights to fetch separately — they're inside):
hf download SEBK4C/Chimera.APE chimera-full.ape --local-dir .
chmod +x chimera-full.ape

# Ingest a directory. First run unpacks the embedded organs + weights into
# <dir>/.chimera/runtime/ (one-time, a few GB):
./chimera-full.ape ingest ~/notes

# Ask:
./chimera-full.ape --search "what did we decide about the billing rewrite?" \
    --db ~/notes/.chimera

Maria Chen leads Project Phoenix [1]. It is a rewrite of the billing system [1].

Sources:
  [1] phoenix.md#1  ✓ verified

✓ verified means the cited file is byte-identical to what was ingested; ⚠ drifted / ⚠ missing tell you when it isn't. Citations are promises the checksum keeps.
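The three statuses can be reproduced by hand. This is a minimal sketch, assuming SHA-256 as the digest and a hypothetical helper named `citation_status`; the source does not say which checksum Chimera uses or where it stores the recorded value.

```shell
# Hypothetical helper: compare a cited file with the digest recorded at ingest.
# Assumption: SHA-256 via sha256sum (GNU coreutils); Chimera's real checksum may differ.
citation_status() {
    path=$1
    expected=$2
    if [ ! -f "$path" ]; then
        echo "missing"
        return 0
    fi
    actual=$(sha256sum "$path" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "verified"
    else
        echo "drifted"
    fi
}

# Demo on a throwaway file standing in for an ingested document.
doc=$(mktemp)
printf 'Maria Chen leads Project Phoenix.\n' > "$doc"
digest=$(sha256sum "$doc" | cut -d' ' -f1)   # what "ingest" would record

citation_status "$doc" "$digest"   # prints: verified
printf 'edited later\n' >> "$doc"
citation_status "$doc" "$digest"   # prints: drifted
rm -f "$doc"
citation_status "$doc" "$digest"   # prints: missing
```

On macOS, `shasum -a 256` is the usual stand-in for `sha256sum`.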

GPU (NVIDIA / Metal) — interactive ingest & search

CPU works everywhere but is slow (~7 tok/s — minutes per document). On a GPU, ingest and search become interactive. The orchestrator passes offload flags straight through to the embedded llamafile:

./chimera-full.ape ingest ~/notes --gpu auto       # offload all layers (the default; falls back to CPU without a GPU)
./chimera-full.ape ingest ~/notes --gpu nvidia     # pin the CUDA backend
./chimera-full.ape ingest ~/notes --gpu 24         # partial offload, N layers (small VRAM)
./chimera-full.ape ingest ~/notes --gpu off        # force CPU
./chimera-full.ape --search "..." --db ... --gpu auto

| `--gpu` | llamafile flags | meaning |
|---|---|---|
| `auto` (default) | `-ngl 999` | offload all layers; falls back to CPU if no GPU |
| `off` / `disable` | `--gpu disable` | force CPU |
| integer `N` | `-ngl N` | offload N layers (VRAM-limited cards) |
| `nvidia` / `amd` / `apple` | `--gpu <vendor> -ngl 999` | pin the backend vendor |
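
The table above can be read as a small translation function. This is a minimal sketch in shell, assuming a hypothetical function name `gpu_to_llamafile_flags`; the real orchestrator is C++ and may treat edge cases (such as invalid values) differently.

```shell
# Hypothetical sketch of the --gpu translation in the table above.
# Prints the llamafile flags for one --gpu value; an empty value means "auto".
gpu_to_llamafile_flags() {
    case "${1:-auto}" in
        auto)             echo "-ngl 999" ;;
        off|disable)      echo "--gpu disable" ;;
        nvidia|amd|apple) echo "--gpu $1 -ngl 999" ;;
        *[!0-9]*)         echo "unknown --gpu value: $1" >&2; return 1 ;;  # assumed error handling
        *)                echo "-ngl $1" ;;                                 # all digits: layer count
    esac
}

gpu_to_llamafile_flags auto      # prints: -ngl 999
gpu_to_llamafile_flags 24        # prints: -ngl 24
gpu_to_llamafile_flags nvidia    # prints: --gpu nvidia -ngl 999
```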

CUDA prerequisites: a working NVIDIA driver is enough (llamafile ships a prebuilt tinyBLAS path). With the CUDA toolkit installed (nvcc on PATH), it JIT-compiles an optimized ggml-cuda module once and caches it under ~/.llamafile/. The first GPU run logs the device(s) and throughput to <dir>/.chimera/logs/llamafile.log.

Verified on this build: 2× NVIDIA RTX 4090 (driver 580 / CUDA 12.8). With --gpu auto, Gemma 4 12B is offloaded across both cards, and ingest + search run end-to-end with ✓ verified citations at ~90 tok/s generation (vs ~7 tok/s on CPU). Multimodal embeddings run on GPU too: images and audio are embedded natively as the model's final hidden state over the projector+interleave forward pass (LAST pooling), in the same 3840-d space as text, so --search-file (image→image, audio→audio) works on GPU. See docs/GEMMA4-EMBEDDINGS.md and docs/GPU.md.
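
For a sense of scale, the two throughput figures can be turned into rough wall-clock estimates. The sketch below assumes a hypothetical workload of 1000 generated tokens per document (the source gives no per-document token count) and ignores prompt processing and embedding time; `estimate_seconds` is an illustrative name.

```shell
# Rough estimate: seconds = tokens / (tokens per second), rounded to the nearest second.
# The 1000-token workload is an assumption, not a measurement from this build.
estimate_seconds() {
    awk -v tokens="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", tokens / rate }'
}

estimate_seconds 1000 7     # CPU at ~7 tok/s:         prints 143
estimate_seconds 1000 90    # 2x RTX 4090 at ~90 tok/s: prints 11
```

Under that assumption a document costs a couple of minutes on CPU and about ten seconds on the tested GPU pair.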

Images and audio

PNG/JPEG/WAV/MP3 are first-class documents. At ingest the model transcribes legible text or describes the scene/sound, indexes that derived text, and stores the raw media embedding for query-by-example:

./chimera-full.ape --search "the budget figure on the banner" --db ~/notes/.chimera
./chimera-full.ape --search-file query.png --db ~/notes/.chimera

Other commands

./chimera-full.ape status  --db DIR/.chimera     # counts, dims, index staleness
./chimera-full.ape verify  --db ... [--paranoid] # re-checksum the corpus
./chimera-full.ape vacuum  --db ...              # purge superseded data, rebuild text index
./chimera-full.ape sparql  "SELECT ..." --db ... # raw SPARQL into the live graph

Hardware

Runs CPU-only (slow — minutes per document at ingest, ~7 tok/s on a fast CPU) or on a GPU (--gpu auto, interactive — see above). Needs ≥16 GB RAM (the model maps ~8 GB) and ~8 GB free disk for the one-time runtime extraction.

Two flavors

| File | Size | Use |
|---|---|---|
| `chimera-full.ape` (here) | ~7.5 GB | true single file; weights embedded |
| `chimera.ape` (on GitHub releases) | ~315 MB | organs embedded, weights sidecar via `--model` |

Known alpha limitations

Ingest is sequential (and CPU-bound on CPU-only hosts); the §5 bounded-queue concurrency is designed but not yet wired in.
Incremental ingests don't extend the BM25 text index (vector + graph search unaffected); vacuum rebuilds it.
Linux x86_64 is the tested platform; turbovec-server carries Linux ABI assumptions inside its APE shell, so other OSes are expected to work but remain unverified.
Dense rendered-text OCR has a known upstream vision-pipeline bug; photos/scenes describe well.
Embeddings use LAST pooling (the final hidden state of Gemma 4 12B's projector+interleave forward pass) for text, image, and audio alike. This gives one shared 3840-d space and is what makes native multimodal embedding work on GPU. The embedded llamafile carries the patch that makes this GPU-safe. If you indexed with an earlier (mean-pooled) build, re-ingest: the dimensionality (3840) is unchanged, but vectors from the two pooling modes are not comparable.
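
To see why a mean-pooled index should not be mixed with LAST-pooled queries, here is a toy sketch. It assumes each input line is one token's hidden state written as space-separated numbers, uses 3 dimensions instead of 3840, and the helper names `last_pool` and `mean_pool` are hypothetical; it illustrates the two pooling rules only, not the patched llamafile code path.

```shell
# LAST pooling: the embedding is the final token's hidden state (the last input line).
last_pool() {
    tail -n 1
}

# Mean pooling: the embedding is the per-dimension average over all tokens.
mean_pool() {
    awk '{ for (i = 1; i <= NF; i++) sum[i] += $i; dims = NF }
         END { for (i = 1; i <= dims; i++) printf "%s%g", (i > 1 ? " " : ""), sum[i] / NR; print "" }'
}

# Three made-up token states, one per line.
states='1 0 2
3 4 0
5 2 4'

printf '%s\n' "$states" | last_pool    # prints: 5 2 4
printf '%s\n' "$states" | mean_pool    # prints: 3 2 2
```

Both outputs have the same length, yet they are different vectors for the same input, which is why matching dimensionality alone does not make an old index reusable.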
