QORA-LLM-2B

Pure Rust ternary inference engine based on BitNet b1.58-2B-4T. No Python, no CUDA, no external ML frameworks. Single executable + model weights = portable AI that runs on any machine.

Zero-multiplication inference: ternary weights {-1, 0, +1} mean the inner GEMV loop uses only addition and subtraction, with no floating-point multiplies. Smart system awareness: the engine detects RAM and CPU at startup and adjusts generation limits automatically.

License

This project is licensed under Apache 2.0. The base model BitNet b1.58-2B-4T is released by Microsoft under the MIT license.

What It Does

QORA-LLM-2B is a 2-billion-parameter language model. It can:

  • Text generation: answer questions, write code, explain concepts
  • Chat mode: multi-turn conversation with the LLaMA 3 chat template
  • Raw mode: direct text completion without chat formatting

Architecture

BitNet b1.58 uses a modified transformer with ternary quantized projections and SubLN normalization:

Component    Details
Parameters   2B total
Hidden dim   2560
Layers       30
Attention    GQA (20 query / 5 KV heads), head_dim=128
FFN          6912 intermediate, squared ReLU activation
Vocabulary   128,256 tokens (LLaMA 3)
Context      4096 tokens
RoPE         rotate_half, theta=500,000
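The rotate_half RoPE variant in the table can be sketched as below, with theta = 500,000 as listed. This is an illustrative implementation, not the actual code in gemv.rs:

```rust
/// rotate_half RoPE sketch: split the head dim in half and rotate each pair
/// (x[i], x[i + half]) by an angle that depends on position and frequency.
fn apply_rope(x: &mut [f32], pos: usize, theta: f32) {
    let half = x.len() / 2;
    for i in 0..half {
        // Frequency falls off exponentially with the pair index.
        let freq = 1.0 / theta.powf(2.0 * i as f32 / x.len() as f32);
        let angle = pos as f32 * freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[i], x[i + half]);
        x[i] = a * cos - b * sin;
        x[i + half] = b * cos + a * sin;
    }
}
```

At position 0 every angle is zero, so the vector passes through unchanged; later positions rotate low-index pairs fastest.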

SubLN Pattern (4 norms per layer)

Unlike standard LLaMA (2 norms per layer), BitNet uses SubLN with extra normalization before output projections:

residual = x
x = input_layernorm(x)           # RMSNorm [2560]
q, k, v = q/k/v_proj(x)          # Ternary linear (add/sub only)
q, k = apply_rope(q, k)
attn = attention(q, k, v)        # GQA: 20Q/5KV
attn = attn_sub_norm(attn)       # SubLN RMSNorm [2560]
attn = o_proj(attn)              # Ternary linear
x = residual + attn

residual = x
x = post_attention_layernorm(x)  # RMSNorm [2560]
gate = relu2(gate_proj(x))       # Squared ReLU: max(0,x)^2
up = up_proj(x)
x = ffn_sub_norm(gate * up)      # SubLN RMSNorm [6912]
x = down_proj(x)
x = residual + x
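The two primitives the pseudocode above leans on, RMSNorm and squared ReLU, can be sketched as follows, assuming the SubLN norms are standard RMSNorm with a learned scale vector:

```rust
/// RMSNorm sketch: scale x by 1/sqrt(mean(x^2) + eps), then by a learned weight.
fn rms_norm(x: &mut [f32], weight: &[f32], eps: f32) {
    let ms = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (ms + eps).sqrt();
    for (v, w) in x.iter_mut().zip(weight) {
        *v = *v * inv * w;
    }
}

/// Squared ReLU: max(0, x)^2, the FFN activation used by BitNet b1.58.
fn relu2(x: f32) -> f32 {
    let r = x.max(0.0);
    r * r
}
```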

Ternary GEMV (No Multiplication)

Each weight is one of {-1, 0, +1}, packed 4 per byte (2 bits each). The inner loop:

+1 -> output += input
-1 -> output -= input
 0 -> skip

A single scalar multiply by the layer scale factor happens only at the end. This makes BitNet inference fundamentally different from traditional float/quantized models.
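A minimal sketch of that inner loop in Rust, assuming 2-bit codes 0b01 = +1, 0b10 = -1, 0b00 = 0 (the actual .qor2b bit layout may differ):

```rust
/// One row of a ternary GEMV: weights packed 4 per byte, accumulated with
/// only adds and subtracts; the single scalar multiply is the final scale.
fn ternary_dot(packed: &[u8], input: &[f32], scale: f32) -> f32 {
    let mut acc = 0.0f32;
    for (byte_idx, &byte) in packed.iter().enumerate() {
        for slot in 0..4 {
            let code = (byte >> (slot * 2)) & 0b11;
            let x = input[byte_idx * 4 + slot];
            match code {
                0b01 => acc += x, // +1: add
                0b10 => acc -= x, // -1: subtract
                _ => {}           //  0: skip
            }
        }
    }
    acc * scale // the only multiply in the whole row
}
```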

Smart System Awareness

QORA-LLM-2B detects your system at startup and automatically adjusts generation limits:

QORA - Native Rust LLM Inference Engine
System: 16101 MB RAM (8271 MB free), 12 threads

Available RAM   Max Tokens       Behavior
< 4 GB          256 (cap 512)    Minimal generation, warning displayed
4-8 GB          512 (cap 1024)   Constrained, warning displayed
8-12 GB         1024 (cap 2048)  Normal operation
>= 12 GB        2048 (cap 8192)  Full capability

Hard caps apply even when the user passes an explicit --max-tokens value. Windows, Linux, and macOS are supported.
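The tiering above can be sketched as a small lookup; the function names are illustrative, and the real detection logic lives in system.rs:

```rust
/// Map available RAM (MB) to (default max tokens, hard cap), per the table.
fn token_limits(free_ram_mb: u64) -> (usize, usize) {
    match free_ram_mb {
        r if r < 4 * 1024 => (256, 512),
        r if r < 8 * 1024 => (512, 1024),
        r if r < 12 * 1024 => (1024, 2048),
        _ => (2048, 8192),
    }
}

/// Hard caps apply even to explicit user values: clamp user input to the cap.
fn effective_max_tokens(user: Option<usize>, free_ram_mb: u64) -> usize {
    let (default, cap) = token_limits(free_ram_mb);
    user.unwrap_or(default).min(cap)
}
```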

Platform Support

Platform         Binary      Status
Windows x86_64   qor2b.exe   Tested
Linux x86_64     qor2b       Supported
macOS aarch64    qor2b       Supported

Quick Start

  1. Download from the Releases page:

    • model.qor2b (~1.13 GB)
    • tokenizer.json
    • qor2b.exe (Windows) or build from source
  2. Place all files in the same folder and run:

# Chat mode (default)
qor2b --prompt "Explain how ternary neural networks work"

# With token limit
qor2b --prompt "Write a haiku about Rust" --max-tokens 100

# Raw text completion (no chat template)
qor2b --prompt "Once upon a time" --raw

# Greedy decoding (deterministic)
qor2b --prompt "What is 2+2?" --greedy

CLI Flags

Flag             Description
--prompt TEXT    Input prompt (default: "Hello, how are you?")
--max-tokens N   Max tokens to generate (default: auto, based on RAM)
--raw            Raw text completion (skip chat template)
--greedy         Greedy decoding (temperature=0)
--load PATH      Custom model path (default: model.qor2b next to exe)
--convert DIR    Convert safetensors from DIR to .qor2b format
--save PATH      Output path for conversion (default: model.qor2b)

Sampling Defaults

Parameter            Value
temperature          0.7
top_k                40
top_p                0.95
repetition_penalty   1.1
presence_penalty     0.6
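These defaults feed a fairly standard sampling pipeline. A minimal sketch of the top_k step under an assumed helper name (the real sampler in generate.rs also applies top_p and the repetition/presence penalties):

```rust
/// Return the indices of the k highest logits, best first. The remaining
/// candidates would then be renormalized and sampled at the given temperature.
fn top_k_filter(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    // Sort token indices by descending logit.
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    idx.truncate(k);
    idx
}
```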

Converting from Safetensors

To convert the original bf16 weights yourself:

# Download the model from HuggingFace
# (requires: pip install huggingface_hub)
python -c "from huggingface_hub import snapshot_download; snapshot_download('microsoft/bitnet-b1.58-2B-4T-bf16', local_dir='bitnet-bf16')"

# Convert to .qor2b format
qor2b --convert bitnet-bf16 --save model.qor2b

Conversion takes ~2 minutes and compresses 4.8 GB bf16 safetensors to a 1.13 GB ternary binary.
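Most of that shrink comes from packing four 2-bit ternary weights per byte. A sketch of such a packer, with illustrative 2-bit codes that are not necessarily the actual .qor2b encoding:

```rust
/// Pack ternary values {-1, 0, +1} four per byte, 2 bits each.
/// Codes assumed here: 0b01 = +1, 0b10 = -1, 0b00 = 0.
fn pack_ternary(w: &[i8]) -> Vec<u8> {
    w.chunks(4)
        .map(|chunk| {
            let mut byte = 0u8;
            for (slot, &t) in chunk.iter().enumerate() {
                let code = match t {
                    1 => 0b01,
                    -1 => 0b10,
                    _ => 0b00,
                };
                byte |= code << (slot * 2);
            }
            byte
        })
        .collect()
}
```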

Building from Source

cargo build --release

Dependencies

  • Language: Pure Rust (2024 edition)
  • rayon: thread pool for parallel GEMV and attention
  • half: f16 support for embeddings
  • tokenizers: HuggingFace tokenizer (LLaMA 3)
  • safetensors: model conversion from HuggingFace format
  • serde_json: config parsing
  • No ML framework: all matrix ops are hand-written Rust

File Structure

src/
  main.rs       - CLI entry point, argument parsing, smart system awareness
  config.rs     - BitNet model configuration
  gemv.rs       - Ternary GEMV kernel, forward pass, attention, RoPE
  generate.rs   - Text generation loop with sampling
  tokenizer.rs  - LLaMA 3 tokenizer and chat template
  save.rs       - Binary model format (.qor2b) save/load
  convert.rs    - Safetensors bf16 -> ternary .qor2b converter
  system.rs     - System resource detection and smart limits
  lib.rs        - Module exports

Model Binary Format (.qor2b)

Custom binary format for fast loading:

Header:   "QR2B" magic + version(u32)
Metadata: layers, vocab, hidden, intermediate, heads, kv_heads,
          head_dim, kv_groups, half_dim, rms_eps
Layers:   30x (7 ternary weights + 4 norm vectors)
Global:   f16 embedding + f32 final norm + f32 RoPE tables
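Loading can begin by validating this header. A minimal sketch, assuming the version field is a little-endian u32 immediately after the magic:

```rust
use std::io::{self, Read};

/// Check the "QR2B" magic and return the format version.
fn read_header<R: Read>(r: &mut R) -> io::Result<u32> {
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic)?;
    if &magic != b"QR2B" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }
    let mut ver = [0u8; 4];
    r.read_exact(&mut ver)?;
    Ok(u32::from_le_bytes(ver))
}
```

The same reader would then continue through the metadata fields in the order listed above.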

Size Breakdown

Component                   Size
Embedding (f16)             ~656 MB
30 layers ternary (2-bit)   ~470 MB
Norms + RoPE tables         ~3 MB
Total                       ~1.13 GB

Performance

Tested on an i5-11500 (6C/12T, AVX-512) with 16 GB RAM:

Metric            Value
Decode speed      ~2.5 tok/s
Model load time   ~2 s
Model size        1.13 GB
RAM usage         ~1.5 GB

Comparison with Other QORA Models

               QORA-LLM-2B                   QORA-3B        QORA-4B
Parameters     2B                            3B             4B
Quantization   Ternary (1.58-bit)            Q4 (4-bit)     Q4 (4-bit)
Model size     1.13 GB                       1.7 GB         3.5 GB
Decode (CPU)   ~2.5 tok/s                    ~0.86 tok/s    ~1.3 tok/s
RAM usage      ~1.5 GB                       ~2.5 GB        ~3.5 GB
GPU support    No                            Yes            Yes
Vision         No                            No             Yes
Best for       Fast CPU inference, low RAM   General text   Multimodal, reasoning