vindex-infer: Run LLMs without CUDA, without PyTorch, from flat binary files
cronos3k
Decomposed weights for google/gemma-3-4b-it in LarQL vindex format.
Use with vindex-infer for vendor-free LLM inference: no CUDA, no PyTorch, just a Rust binary.
```bash
# Download
huggingface-cli download cronos3k/gemma-3-4b-it-vindex --local-dir gemma3-4b.vindex

# Run inference (CPU, works on any machine)
vindex-infer --vindex gemma3-4b.vindex --token-ids "818,5279,529,7001,563"
# 1. Paris (+21.24)
# 2. a (+17.69)
# 3. the (+17.51)
```
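vindex-infer takes pre-tokenized input via `--token-ids`. A minimal sketch of producing that argument from the bundled tokenizer.json, assuming the HF `tokenizers` Python package and the download path above (the example prompt is illustrative, not necessarily the one behind the IDs shown):

```python
# Sketch: build the comma-separated string --token-ids expects.

def to_token_ids_arg(ids):
    """Format a list of token IDs for vindex-infer's --token-ids flag."""
    return ",".join(str(i) for i in ids)

# Usage (requires `pip install tokenizers` and the downloaded files):
# from tokenizers import Tokenizer
# tok = Tokenizer.from_file("gemma3-4b.vindex/tokenizer.json")
# print(to_token_ids_arg(tok.encode("The capital of France is").ids))
```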
| File | Size | Contents |
|---|---|---|
| gate_vectors.bin | 1.66 GB | FFN gate projections [34 layers × 10240 × 2560] f16 |
| up_weights.bin | 1.66 GB | FFN up projections [34 × 10240 × 2560] f16 |
| down_weights.bin | 1.66 GB | FFN down projections [34 × 2560 × 10240] f16 |
| attn_weights.bin | 1.02 GB | Q/K/V/O + QK norms per layer, f16 |
| embeddings.bin | 1.25 GB | Token embeddings [262208 × 2560] f16 |
| norms.bin | 0.7 MB | RMSNorm gammas (4 per layer + final) f16 |
| tokenizer.json | 32 MB | HuggingFace tokenizer |
| index.json | 5 KB | Model config, layer info |
| **Total** | **7.29 GB** | |
Extracted using LarQL with `--level all --f16`:

```bash
larql extract-index google/gemma-3-4b-it -o gemma3-4b.vindex --level all --f16
```
Output matches HuggingFace Transformers to 5 significant figures across all 34 layers; for example, the layer-34 residual norm is 67806.5 vs. 67806 in HF.
Verified prompts: Paris, Jupiter, blue, Ulm, Pound, all correct.
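The 5-significant-figure criterion above can be phrased as a relative-error bound and checked mechanically; a hedged sketch (illustrative only, not LarQL's actual verification code):

```python
def agrees_to_sig_figs(a: float, b: float, n: int = 5) -> bool:
    """True if a and b agree to n significant figures,
    i.e. their relative difference is below 0.5 * 10**(1 - n)."""
    ref = max(abs(a), abs(b))
    if ref == 0.0:
        return True  # both exactly zero
    return abs(a - b) / ref < 0.5 * 10 ** (1 - n)

# The layer-34 residual norms quoted above pass this check:
# agrees_to_sig_figs(67806.5, 67806.0)  ->  True
```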