TachBit-285

A ~285M-parameter decoder-only Transformer language model trained on English Wikipedia, implemented in PyTorch and optimized for deployment on Raspberry Pi 5 with ONNX Runtime.

Model Architecture

| Parameter | Value |
|---|---|
| Layers | 24 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| KV Heads | 4 |
| Head Dimension | 64 |
| Intermediate Size | 3072 |
| Context Length | 2048 |
| Vocabulary Size | 50,272 |
| Parameters | ~285M |
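
The ~285M figure can be reproduced from the values in the table. The sketch below is an estimate, not the repository's own counting code: it assumes bias-free projections, three SwiGLU matrices per FFN, two RMSNorm gain vectors per layer plus a final norm, and (an assumption not stated in this card) an output projection that is not tied to the input embedding.

```python
# Architecture values taken from the table above.
LAYERS = 24
HIDDEN = 768
Q_HEADS = 12
KV_HEADS = 4
HEAD_DIM = 64
INTERMEDIATE = 3072
VOCAB = 50_272


def count_parameters(tied_embeddings: bool = False) -> int:
    """Estimate the parameter count of a bias-free GQA + SwiGLU decoder."""
    # Attention: Q and O are full width; K and V are narrower because of GQA.
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    attention = q_proj + k_proj + v_proj + o_proj

    # SwiGLU FFN: gate, up, and down projections.
    ffn = 3 * HIDDEN * INTERMEDIATE

    # Two RMSNorm gain vectors per layer (pre-attention and pre-FFN).
    norms = 2 * HIDDEN
    per_layer = attention + ffn + norms

    embedding = VOCAB * HIDDEN
    # ASSUMPTION: the output projection is a separate matrix unless tied.
    lm_head = 0 if tied_embeddings else VOCAB * HIDDEN
    final_norm = HIDDEN
    return LAYERS * per_layer + embedding + lm_head + final_norm


if __name__ == "__main__":
    print(f"untied: {count_parameters(False) / 1e6:.1f}M")
    print(f"tied:   {count_parameters(True) / 1e6:.1f}M")
```

Under these assumptions the untied count is 284,873,472 (about 285M), while tying the embeddings would give about 246M, which suggests the published figure counts the input embedding and output projection separately.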

Architecture Features

  • Grouped Query Attention (GQA): 12 query heads, 4 KV heads (3:1 ratio)
  • SwiGLU Activation: Swish-Gated Linear Unit for FFN layers
  • RoPE Position Embeddings: Rotary position embeddings (base=10,000)
  • RMSNorm Pre-Norm: Root mean square layer normalization before sublayers
  • No Bias Terms: All linear layers have bias=False (LLaMA style)
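
The features listed above can be illustrated with a few scalar-level functions. This is a pure-Python sketch for reading, not the model's PyTorch code: the function names are hypothetical, and the RoPE variant shown pairs element i with element i + d/2 (one common convention; whether TachBit uses this or the interleaved pairing is not stated here).

```python
import math

Q_HEADS, KV_HEADS, ROPE_BASE = 12, 4, 10_000.0


def kv_head_for(query_head: int) -> int:
    """GQA: each group of Q_HEADS // KV_HEADS query heads shares one KV head."""
    return query_head // (Q_HEADS // KV_HEADS)


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square, then apply a learned gain."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]


def silu(a: float) -> float:
    """Swish / SiLU activation: a * sigmoid(a)."""
    return a / (1.0 + math.exp(-a))


def swiglu_gate(gate, up):
    """Elementwise part of SwiGLU: silu(gate) * up, before the down projection."""
    return [silu(g) * u for g, u in zip(gate, up)]


def rope(x, position: int, base: float = ROPE_BASE):
    """Rotate each pair (x[i], x[i + d/2]) by the angle position * base**(-2i/d)."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out
```

With 12 query heads and 4 KV heads, `kv_head_for` maps query heads 0-2 to KV head 0, 3-5 to KV head 1, and so on. The rotation in `rope` preserves the vector norm and is the identity at position 0.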

Training

  • Dataset: Wikimedia English Wikipedia (2023-11-01 snapshot, ~6B tokens)
  • Tokenizer: GPT-2 BPE, reused without retraining (vocabulary size 50,272)
  • Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
  • Learning Rate: 3e-4 peak, cosine decay with 5% warmup
  • Batch Size: 512 sequences × 2048 tokens
  • Training Steps: 6,000 (about 6.3B tokens at 512 × 2048 tokens per step, roughly one pass over the dataset)
  • Precision: BF16 mixed-precision training on NVIDIA GB10
  • Gradient Checkpointing: Enabled for memory efficiency
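
The learning-rate schedule and token budget implied by these settings can be sketched as follows. The linear warmup shape and the final learning rate of zero are assumptions (the card only states the peak, the cosine decay, and the 5% warmup), and the function names are illustrative.

```python
import math

PEAK_LR = 3e-4
TOTAL_STEPS = 6000
WARMUP_STEPS = TOTAL_STEPS * 5 // 100  # 5% warmup -> 300 steps
MIN_LR = 0.0  # ASSUMPTION: the card does not state the final learning rate


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        # ASSUMPTION: warmup is linear in the step index.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


def tokens_seen(steps: int = TOTAL_STEPS, batch: int = 512, seq_len: int = 2048) -> int:
    """Total training tokens for a fixed batch of `batch` sequences of `seq_len` tokens."""
    return steps * batch * seq_len
```

Each step processes 512 × 2048 = 1,048,576 tokens, so 6,000 steps cover 6,291,456,000 tokens, which is consistent with a single pass over a ~6B-token dataset.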

Inference Performance

Raspberry Pi 5 (ONNX Runtime CPU)

| Model | Quantization | Speed | File Size |
|---|---|---|---|
| FP32 ONNX | No quantization | ~6.6 tok/s | ~1GB |
| INT8 ONNX | Dynamic quantization | ~11 tok/s | ~200MB |

The INT8 quantized model is about 1.7× faster than FP32 (~11 vs. ~6.6 tok/s); its integer matrix multiplications use the ARM NEON sdot/udot dot-product instructions.
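
To show where the integer arithmetic comes from, here is a simplified pure-Python sketch of a dynamically quantized dot product. It is not ONNX Runtime's kernel or its exact quantization scheme: it assumes symmetric per-tensor scales for both operands, and the function names are hypothetical. The integer multiply-accumulate in the middle is the kind of operation that sdot/udot perform on groups of four 8-bit values with a 32-bit accumulator.

```python
def quantize_symmetric(values):
    """Map floats to integers in [-127, 127] using one per-tensor scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(weights_q, w_scale, activations):
    """Dot product with int8-range operands and an integer accumulator."""
    # "Dynamic": the activation scale is computed at call time from the data,
    # while the weights were quantized once ahead of time.
    acts_q, a_scale = quantize_symmetric(activations)
    acc = sum(w * a for w, a in zip(weights_q, acts_q))  # integer accumulate
    # Rescale the integer result back to floating point.
    return acc * w_scale * a_scale


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.75]
    activations = [1.0, 2.0, -3.0, 4.0]
    w_q, w_s = quantize_symmetric(weights)  # done once, offline
    exact = sum(w * a for w, a in zip(weights, activations))
    print(exact, int8_dot(w_q, w_s, activations))
```

The quantized result differs from the exact one by a small rounding error, which is the accuracy cost paid for the smaller file and the faster integer kernels.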

KV Cache

Both GPU and edge CPU inference use a KV cache for efficient autoregressive generation:

  • Static-shape KV cache: All input/output shapes are fixed constants for ONNX Runtime
  • Decode graph: Handles single-token steps with fully static shapes (no re-planning)
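
The idea behind a static-shape cache can be sketched in pure Python. This is an illustration under assumptions, not the repository's implementation: the class name and its methods are hypothetical, it covers a single attention head, and it assumes unused slots are masked out based on the current cache length.

```python
class StaticKVCache:
    """Fixed-capacity key/value buffer for one attention head (illustrative)."""

    def __init__(self, max_seq_len: int, head_dim: int):
        self.max_seq_len = max_seq_len
        # Buffers are allocated once at full size; their shape never changes.
        self.keys = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.values = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.cache_len = 0  # number of slots that hold real tokens

    def append(self, key, value) -> None:
        """Write one token's key/value into the next free slot."""
        if self.cache_len >= self.max_seq_len:
            raise ValueError("KV cache is full")
        self.keys[self.cache_len] = list(key)
        self.values[self.cache_len] = list(value)
        self.cache_len += 1

    def attention_mask(self):
        """1 for filled slots, 0 for padding; length is always max_seq_len."""
        return [1 if i < self.cache_len else 0 for i in range(self.max_seq_len)]
```

Because the buffers always have `max_seq_len` rows, every decode step sees tensors of the same shape, and only `cache_len` and the mask change from step to step.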

The static-shape KV cache eliminates ONNX Runtime re-planning overhead; with it, the FP32 model reaches ~6.6 tokens/sec on the Pi 5 CPU with --max-seq-len 512.
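
A static cache reserves its full size up front, so its memory footprint follows directly from the architecture. The sketch below assumes the cache stores FP32 values (4 bytes each) and one key and one value tensor per layer; the function name is illustrative.

```python
def kv_cache_bytes(
    max_seq_len: int,
    layers: int = 24,
    kv_heads: int = 4,
    head_dim: int = 64,
    bytes_per_value: int = 4,  # ASSUMPTION: FP32 cache entries
) -> int:
    """Size of a preallocated KV cache for one sequence."""
    # 2 = one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * max_seq_len * bytes_per_value


if __name__ == "__main__":
    mib = 1024 * 1024
    print(kv_cache_bytes(512) / mib)               # GQA, 4 KV heads
    print(kv_cache_bytes(512, kv_heads=12) / mib)  # hypothetical 12 KV heads
    print(kv_cache_bytes(2048) / mib)              # full context length
```

Under these assumptions the cache takes 24 MiB at a 512-token limit and 96 MiB at the full 2048-token context. With 12 KV heads instead of 4 it would be three times larger, which is the memory saving GQA provides.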

Files in this Repository

| File | Description |
|---|---|
| model.safetensors | PyTorch model weights (FP32) |
| config.json | Model configuration |
| tokenizer.json | GPT-2 BPE tokenizer |
| tokenizer_config.json | Tokenizer configuration |
| onnx/tachbit_kvcache.onnx | ONNX model (FP32, with KV cache, static shapes) |
| onnx/tachbit_kvcache_int8_dynamic.onnx | ONNX model (INT8 quantized, Pi 5 optimized) |

Usage

PyTorch (General Purpose)

from transformers import AutoTokenizer
from tachbit import TachBitModel

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("tachyonx-ai/TachBit-285")

# Load model
model = TachBitModel.from_pretrained("tachyonx-ai/TachBit-285")
model.eval()

# Generate text
inputs = tokenizer("The Eiffel Tower is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

ONNX Runtime (Edge Deployment)

For Raspberry Pi 5 deployment with INT8 quantization:

import onnxruntime as ort
import numpy as np

# Load INT8 quantized ONNX model
session = ort.InferenceSession("onnx/tachbit_kvcache_int8_dynamic.onnx")

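# Prepare inputs for one decode step (see GitHub repo for complete KV cache management).
# token_id, position, cache_len, and the k_cache_*/v_cache_* arrays are placeholders
# that the caller maintains across steps.
inputs = {
    "input_ids": np.array([[token_id]]),
    "positions": np.array([position]),
    "cache_len": np.array(cache_len),
    "full_key_0": k_cache_0,
    "full_value_0": v_cache_0,
    # ... all 24 transformer layers
}

# Run inference
outputs = session.run(None, inputs)
logits = outputs[0]
# Greedy decoding: pick the highest-scoring token
next_token = np.argmax(logits, axis=-1)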
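
Writing out 48 cache entries by hand is error-prone, so the input dictionary can be built in a loop. The helper below is a hypothetical sketch: it assumes the remaining layers follow the `full_key_{i}` / `full_value_{i}` naming shown for layer 0, which should be checked against the names reported by `session.get_inputs()`. It does not create or update the cache arrays themselves.

```python
NUM_LAYERS = 24


def build_decode_feed(input_ids, positions, cache_len, k_caches, v_caches):
    """Assemble the ONNX Runtime input dict for one decode step.

    k_caches and v_caches are sequences with one array per layer.
    """
    if len(k_caches) != NUM_LAYERS or len(v_caches) != NUM_LAYERS:
        raise ValueError(f"expected {NUM_LAYERS} key and value caches")
    feed = {
        "input_ids": input_ids,
        "positions": positions,
        "cache_len": cache_len,
    }
    # ASSUMPTION: per-layer inputs are named full_key_<i> and full_value_<i>.
    for layer in range(NUM_LAYERS):
        feed[f"full_key_{layer}"] = k_caches[layer]
        feed[f"full_value_{layer}"] = v_caches[layer]
    return feed
```

The result can be passed as the second argument to `session.run(None, feed)`, with `input_ids`, `positions`, and `cache_len` built as NumPy arrays as in the snippet above.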

Limitations

  • Base model only: This is a base language model trained on Wikipedia text. It is NOT instruction-tuned and will not reliably follow chat instructions.
  • Wikipedia style: Outputs are Wikipedia-style text continuations, not helpful assistant responses.
  • Limited coherence: At 285M parameters, the model may lose coherence over long generations (>100 tokens).

License

This project is released under the MIT license.

Citation

@misc{tachbit,
    title={TachBit-285: A Transformer LLM for Edge Deployment},
    author={TachyonX Ltd},
    year={2026},
}