Chimera.APE — v0.1.0-alpha (single-file build)

Three queries, one binary, zero regrets. Runs on anything, answers to no one.

One Actually Portable Executable. One download. Everything inside.

chimera-full.ape (~7.5 GB) bundles, in a single self-contained file that runs unmodified on Linux / macOS / Windows / BSD:

  • the orchestrator (C++/Cosmopolitan),
  • a llamafile + Gemma 4 12B QAT q4_0 (embeddings and chat from one server),
  • the multimodal projector (image + audio understanding),
  • QLever (SPARQL knowledge graph + BM25 text index), and
  • TurboVec (quantized approximate-nearest-neighbor vector search).

Point it at a directory of files (text, code, images, audio) and it digests everything into a hybrid graph-vector database. Ask a question and it returns a synthesized answer with citations whose provenance is checksum-verified. No network, no sidecar downloads, no runtime dependencies.

GitHub (source, smaller organ-only build, full docs): https://github.com/SEBK4C/Chimera.APE

Quick start

# Download this one file (no weights to fetch separately — they're inside):
hf download SEBK4C/Chimera.APE chimera-full.ape --local-dir .
chmod +x chimera-full.ape

# Ingest a directory. First run unpacks the embedded organs + weights into
# <dir>/.chimera/runtime/ (one-time, a few GB):
./chimera-full.ape ingest ~/notes

# Ask:
./chimera-full.ape --search "what did we decide about the billing rewrite?" \
    --db ~/notes/.chimera
Maria Chen leads Project Phoenix [1]. It is a rewrite of the billing system [1].

Sources:
  [1] phoenix.md#1  ✓ verified

✓ verified means the cited file is byte-identical to what was ingested; ⚠ drifted / ⚠ missing tell you when it isn't. Citations are promises the checksum keeps.
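
The three citation states can be illustrated with a minimal sketch. It assumes a SHA-256 digest recorded at ingest and the coreutils sha256sum tool; the hash Chimera actually uses is not stated here, and cite_status is a hypothetical helper, not part of the CLI:

```shell
# Hypothetical helper: classify a cited file against a digest recorded at ingest.
# Assumption: SHA-256 via coreutils' sha256sum; Chimera's real algorithm may differ.
cite_status() {
    file=$1
    recorded=$2
    # The file no longer exists at the cited path.
    if [ ! -f "$file" ]; then
        echo "missing"
        return
    fi
    # Hash from stdin so the output carries no filename to parse around.
    current=$(sha256sum < "$file" | cut -d ' ' -f 1)
    if [ "$current" = "$recorded" ]; then
        echo "verified"   # byte-identical to what was recorded
    else
        echo "drifted"    # the content changed after the digest was taken
    fi
}
```

Called as cite_status phoenix.md "$recorded_digest", it prints verified for as long as the file's bytes are unchanged.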

GPU (NVIDIA / Metal) — interactive ingest & search

CPU works everywhere but is slow (~7 tok/s — minutes per document). On a GPU, ingest and search become interactive. The orchestrator passes offload flags straight through to the embedded llamafile:

./chimera-full.ape ingest ~/notes --gpu auto       # default: offload all layers, CPU fallback if no GPU
./chimera-full.ape ingest ~/notes --gpu nvidia     # pin the CUDA backend
./chimera-full.ape ingest ~/notes --gpu 24         # partial offload: 24 layers (small VRAM)
./chimera-full.ape ingest ~/notes --gpu off        # force CPU
./chimera-full.ape --search "..." --db ... --gpu auto

--gpu               llamafile flags           meaning
auto (default)      -ngl 999                  offload all layers; falls back to CPU if no GPU
off / disable       --gpu disable             force CPU
integer N           -ngl N                    offload N layers (VRAM-limited cards)
nvidia/amd/apple    --gpu <vendor> -ngl 999   pin the backend vendor
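
The mapping in the table can be expressed as a small shell function. This is only a sketch of the documented behavior, not the orchestrator's actual code; gpu_to_llamafile_flags is a hypothetical name, and rejecting unrecognized values is an assumption:

```shell
# Hypothetical sketch: translate a --gpu value into the llamafile flags
# listed in the table. Not the orchestrator's real implementation.
gpu_to_llamafile_flags() {
    case "$1" in
        ""|auto)           echo "-ngl 999" ;;            # default: offload all layers
        off|disable)       echo "--gpu disable" ;;       # force CPU
        nvidia|amd|apple)  echo "--gpu $1 -ngl 999" ;;   # pin the backend vendor
        *[!0-9]*)          echo "unknown --gpu value: $1" >&2
                           return 1 ;;                   # assumption: reject anything else
        *)                 echo "-ngl $1" ;;             # integer N: offload N layers
    esac
}
```

For instance, gpu_to_llamafile_flags 24 yields the partial-offload flags for a card with limited VRAM.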

CUDA prereqs: a working NVIDIA driver is enough (llamafile ships a prebuilt tinyBLAS path); with the CUDA toolkit (nvcc on PATH) it JITs an optimized ggml-cuda module once and caches it under ~/.llamafile/. The first GPU run logs the device(s) and throughput to <db>/.chimera/logs/llamafile.log.

Verified on this build: 2× NVIDIA RTX 4090 (driver 580 / CUDA 12.8) — --gpu auto offloads Gemma 4 12B across both cards and runs ingest + search end-to-end with ✓ verified citations at ~90 tok/s generation (vs ~7 tok/s on CPU). Multimodal embeddings run on GPU too: image and audio embed natively as the model's end hidden state over the projector+interleave forward pass (LAST pooling), in the same 3840-d space as text — so --search-file (image→image, audio→audio) works on GPU. See docs/GEMMA4-EMBEDDINGS.md and docs/GPU.md.

Images and audio

PNG/JPEG/WAV/MP3 are first-class documents. At ingest the model transcribes legible text or describes the scene/sound, indexes that derived text, and stores the raw media embedding for query-by-example:

./chimera-full.ape --search "the budget figure on the banner" --db ~/notes/.chimera
./chimera-full.ape --search-file query.png --db ~/notes/.chimera

Other commands

./chimera-full.ape status  --db DIR/.chimera     # counts, dims, index staleness
./chimera-full.ape verify  --db ... [--paranoid] # re-checksum the corpus
./chimera-full.ape vacuum  --db ...              # purge superseded data, rebuild text index
./chimera-full.ape sparql  "SELECT ..." --db ... # raw SPARQL into the live graph

Hardware

Runs CPU-only (slow — minutes per document at ingest, ~7 tok/s on a fast CPU) or on a GPU (--gpu auto, interactive — see above). Needs ≥16 GB RAM (the model maps ~8 GB) and ~8 GB free disk for the one-time runtime extraction.
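
A preflight check along these lines can catch an undersized host before the multi-gigabyte extraction starts. It is a sketch under stated assumptions: it is Linux-only (it reads /proc/meminfo), the function names are hypothetical, and the RAM threshold is set to 15 GiB because a machine sold as 16 GB typically reports slightly less:

```shell
# Hypothetical preflight check based on the figures above; not part of Chimera.
# Arguments are in KiB so the comparison can be exercised without real hardware.
meets_minimums() {
    ram_kb=$1
    disk_kb=$2
    rc=0
    # Assumption: allow 1 GiB of slack below the nominal 16 GB.
    if [ "$ram_kb" -lt $((15 * 1024 * 1024)) ]; then
        echo "RAM is below the 16 GB guideline" >&2
        rc=1
    fi
    if [ "$disk_kb" -lt $((8 * 1024 * 1024)) ]; then
        echo "less than ~8 GB of free disk" >&2
        rc=1
    fi
    return $rc
}

# Linux-only wrapper: total RAM from /proc/meminfo, free space from POSIX df.
preflight() {
    dir=${1:-.}
    ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    disk_kb=$(df -Pk "$dir" | awk 'NR == 2 {print $4}')
    meets_minimums "$ram_kb" "$disk_kb"
}
```

Running preflight ~/notes before the first ingest returns a non-zero status and prints a reason if either guideline is not met.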

Two flavors

File                               Size      Use
chimera-full.ape (here)            ~7.5 GB   true single file; weights embedded
chimera.ape (on GitHub releases)   ~315 MB   organs embedded, weights sidecar via --model

Known alpha limitations

  • Sequential ingest (CPU-bound on CPU hosts); §5 bounded-queue concurrency is designed, not yet wired.
  • Incremental ingests don't extend the BM25 text index (vector + graph search unaffected); vacuum rebuilds it.
  • Linux x86_64 is the tested platform; turbovec-server carries Linux ABI assumptions inside its APE shell, so other OSes are expected-but-unverified.
  • Dense rendered-text OCR has a known upstream vision-pipeline bug; photos/scenes describe well.
  • Embeddings use LAST pooling — the end hidden state of Gemma 4 12B's projector+interleave forward pass — for text, image, and audio alike (one shared 3840-d space; this is what makes native multimodal embedding work on GPU). The embedded llamafile carries the patch that makes this GPU-safe. If you indexed with an earlier (mean-pooled) build, re-ingest; dimensionality (3840) is unchanged.
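
Until incremental ingests extend the BM25 index, one workaround consistent with the notes above is to follow each ingest with vacuum. A minimal sketch, assuming the database lives at <dir>/.chimera as in the quick start; refresh_corpus and the CHIMERA variable are hypothetical conveniences:

```shell
# Path to the binary; override CHIMERA to point elsewhere (hypothetical convention).
CHIMERA=${CHIMERA:-./chimera-full.ape}

# Ingest a directory, then run vacuum so the rebuilt text index covers the new files.
# vacuum only runs if the ingest succeeded.
refresh_corpus() {
    dir=$1
    "$CHIMERA" ingest "$dir" &&
        "$CHIMERA" vacuum --db "$dir/.chimera"
}
```

Usage would be refresh_corpus ~/notes after adding files to the directory.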

Built with Cosmopolitan Libc. Gemma 4 weights © Google, Apache 2.0.
