TachBit-285
A ~285M-parameter decoder-only Transformer language model trained on English Wikipedia, implemented in PyTorch and optimized for deployment on the Raspberry Pi 5 with ONNX Runtime.
Model Architecture
| Parameter | Value |
|---|---|
| Layers | 24 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| KV Heads | 4 |
| Head Dimension | 64 |
| Intermediate Size | 3072 |
| Context Length | 2048 |
| Vocabulary Size | 50,272 |
| Parameters | ~285M |
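The ~285M figure can be cross-checked from the other rows of the table. The sketch below is a back-of-the-envelope count, assuming untied input and output embeddings and one weight vector per RMSNorm; neither detail is stated in this card, so treat the breakdown as illustrative.

```python
# Parameter count from the table values. Untied input/output embeddings and
# one RMSNorm weight vector per norm are assumptions, not stated in the card.
layers, hidden, inter = 24, 768, 3072
q_heads, kv_heads, head_dim = 12, 4, 64
vocab = 50_272

attn = (
    hidden * q_heads * head_dim          # Q projection
    + 2 * hidden * kv_heads * head_dim   # K and V projections (GQA: 4 KV heads)
    + q_heads * head_dim * hidden        # output projection
)
ffn = 3 * hidden * inter                 # SwiGLU: gate, up and down matrices
norms = 2 * hidden                       # two RMSNorm weight vectors per layer
per_layer = attn + ffn + norms

embeddings = 2 * vocab * hidden          # assumed untied: token embedding + LM head
total = layers * per_layer + embeddings + hidden  # + final RMSNorm weight

print(f"{total:,}")  # 284,873,472 (about 285M)
```

With tied embeddings the same arithmetic gives roughly 246M, so the untied assumption is the one that matches the stated size.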
Architecture Features
- Grouped Query Attention (GQA): 12 query heads, 4 KV heads (3:1 ratio)
- SwiGLU Activation: Swish-Gated Linear Unit for FFN layers
- RoPE Position Embeddings: Rotary position embeddings (base=10,000)
- RMSNorm Pre-Norm: Root mean square layer normalization before sublayers
- No Bias Terms: All linear layers have bias=False (LLaMA style)
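As a rough illustration of two of the features above, the following sketch shows how 12 query heads could share 4 KV heads and how the RoPE frequencies follow from base=10,000 and a head dimension of 64. The consecutive grouping rule (`q_head // group`) and the helper names are assumptions for illustration, not taken from the TachBit source.

```python
# GQA head sharing and RoPE frequencies from the listed values.
# Consecutive grouping (q_head // group) is an assumed convention.
Q_HEADS, KV_HEADS, HEAD_DIM, ROPE_BASE = 12, 4, 64, 10_000.0

group = Q_HEADS // KV_HEADS  # 3 query heads share each KV head


def kv_head_for(q_head: int) -> int:
    """Index of the KV head that a given query head reads from."""
    return q_head // group


mapping = [kv_head_for(h) for h in range(Q_HEADS)]
print(mapping)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

# RoPE rotates each pair of channels at its own frequency:
# inv_freq[i] = base ** (-2i / head_dim) for i in 0 .. head_dim/2 - 1
inv_freq = [ROPE_BASE ** (-2 * i / HEAD_DIM) for i in range(HEAD_DIM // 2)]


def rope_angles(position: int) -> list[float]:
    """Rotation angle of every channel pair at a given token position."""
    return [position * f for f in inv_freq]
```

Because only the 4 KV heads are cached, the K and V projections and the KV cache are a third of the size they would be with 12 KV heads.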
Training
- Dataset: Wikimedia English Wikipedia (2023-11-01 snapshot, ~6B tokens)
- Tokenizer: Reused GPT-2 BPE tokenizer (50,272-token vocabulary)
- Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
- Learning Rate: 3e-4 peak, cosine decay with 5% warmup
- Batch Size: 512 sequences × 2048 tokens
- Training Steps: 6,000
- Precision: BF16 mixed-precision training on NVIDIA GB10
- Gradient Checkpointing: Enabled for memory efficiency
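The schedule and token budget implied by these settings can be sketched as follows. The linear warmup shape and the decay floor of zero are assumptions, since the card only states "cosine decay with 5% warmup"; `lr_at` is a hypothetical helper, not part of the training code.

```python
import math

PEAK_LR = 3e-4
TOTAL_STEPS = 6000
WARMUP_STEPS = TOTAL_STEPS * 5 // 100  # 5% warmup -> 300 steps
MIN_LR = 0.0  # assumption: the card does not state a floor


def lr_at(step: int) -> float:
    """Linear warmup (assumed shape) followed by cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


tokens_per_step = 512 * 2048                  # 1,048,576 tokens per optimizer step
total_tokens = tokens_per_step * TOTAL_STEPS  # 6,291,456,000 tokens
```

At about 6.3B tokens in total, the run is consistent with roughly one pass over the ~6B-token dataset.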
Inference Performance
Raspberry Pi 5 (ONNX Runtime CPU)
| Model | Quantization | Speed | File Size |
|---|---|---|---|
| FP32 ONNX | No quantization | ~6.6 tok/s | ~1GB |
| INT8 ONNX | Dynamic quantization | ~11 tok/s | ~200MB |
The INT8 quantized model provides a ~1.7× speedup over FP32 by using the ARM NEON sdot/udot instructions.
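For readers unfamiliar with INT8 quantization, the idea can be sketched in a few lines: each FP32 weight is stored as a one-byte integer plus a shared scale, and in the dynamic variant activations are quantized at run time. The symmetric per-tensor scheme below is an assumption chosen for simplicity; the scheme ONNX Runtime actually applies to this model may differ (for example per-channel scales or zero points).

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the int8 codes."""
    return [c * scale for c in codes]


# Illustrative values, not real model weights.
weights = [0.50, -1.27, 0.03, 1.00]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is the price paid for the smaller file and the integer dot-product path.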
KV Cache
Both GPU and edge CPU inference use a KV cache for efficient autoregressive generation:
- Static-shape KV cache: All input/output shapes are fixed constants for ONNX Runtime
- Decode graph: Handles single-token steps with fully static shapes (no re-planning)
The KV cache implementation eliminates ONNX Runtime re-planning overhead and achieves ~6.6 tokens/sec on the Pi 5 CPU (the FP32 figure in the table above) with --max-seq-len 512.
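To make the static-shape idea concrete, the sketch below estimates the cache footprint at the 512-token limit and shows a toy fixed-length buffer whose size never changes while a `cache_len` counter advances. FP32 cache entries, batch size 1 and the `StaticCache` class are assumptions for illustration; the real implementation lives in the ONNX graph.

```python
LAYERS, KV_HEADS, HEAD_DIM = 24, 4, 64
MAX_SEQ_LEN = 512      # matches --max-seq-len 512
BYTES_PER_VALUE = 4    # assumption: FP32 cache entries, batch size 1


def kv_cache_bytes(kv_heads: int) -> int:
    """Bytes for keys and values across all layers at the full static length."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * MAX_SEQ_LEN * BYTES_PER_VALUE


gqa_mib = kv_cache_bytes(KV_HEADS) / 2**20  # 4 KV heads (this model)
mha_mib = kv_cache_bytes(12) / 2**20        # hypothetical: 12 KV heads, no GQA
print(gqa_mib, mha_mib)  # 24.0 72.0


class StaticCache:
    """Toy fixed-length buffer: the allocation is constant, only cache_len moves."""

    def __init__(self, max_len: int) -> None:
        self.slots = [None] * max_len
        self.cache_len = 0

    def append(self, entry) -> None:
        if self.cache_len >= len(self.slots):
            raise IndexError("KV cache is full")
        self.slots[self.cache_len] = entry
        self.cache_len += 1

    def valid(self) -> list:
        """Entries written so far; later slots are padding to be masked out."""
        return self.slots[: self.cache_len]
```

Keeping the buffer at a constant size is what lets the runtime reuse one execution plan for every decode step, at the cost of attending over (and masking) unused slots.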
Files in this Repository
| File | Description |
|---|---|
| model.safetensors | PyTorch model weights (FP32) |
| config.json | Model configuration |
| tokenizer.json | GPT-2 BPE tokenizer |
| tokenizer_config.json | Tokenizer configuration |
| onnx/tachbit_kvcache.onnx | ONNX model (FP32, with KV cache, static shapes) |
| onnx/tachbit_kvcache_int8_dynamic.onnx | ONNX model (INT8 quantized, Pi 5 optimized) |
Usage
PyTorch (General Purpose)
from transformers import AutoTokenizer
from tachbit import TachBitModel
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("tachyonx-ai/TachBit-285")
# Load model
model = TachBitModel.from_pretrained("tachyonx-ai/TachBit-285")
model.eval()
# Generate text
inputs = tokenizer("The Eiffel Tower is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
ONNX Runtime (Edge Deployment)
For Raspberry Pi 5 deployment with INT8 quantization:
import onnxruntime as ort
import numpy as np
# Load INT8 quantized ONNX model
session = ort.InferenceSession("onnx/tachbit_kvcache_int8_dynamic.onnx")
# Prepare inputs for one decode step (see GitHub repo for complete KV cache management).
# token_id, position, cache_len, k_cache_0 and v_cache_0 are supplied by the generation loop.
inputs = {
    "input_ids": np.array([[token_id]]),
    "positions": np.array([position]),
    "cache_len": np.array(cache_len),
    "full_key_0": k_cache_0,
    "full_value_0": v_cache_0,
    # ... all 24 transformer layers
}
# Run inference
outputs = session.run(None, inputs)
logits = outputs[0]
next_token = np.argmax(logits, axis=-1)
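The snippet above shows only layer 0. A minimal sketch of assembling the full feed dict might look like the following, assuming the remaining layers follow the same `full_key_{i}` / `full_value_{i}` naming, that caches are FP32 arrays laid out as (batch, kv_heads, seq, head_dim), and that integer inputs are int64. None of these details are confirmed here; the real names, shapes and types can be checked with `session.get_inputs()`. `build_decode_feed` is a hypothetical helper.

```python
import numpy as np

NUM_LAYERS, KV_HEADS, HEAD_DIM, MAX_SEQ_LEN = 24, 4, 64, 512


def build_decode_feed(token_id: int, position: int, cache_len: int,
                      k_caches: list, v_caches: list) -> dict:
    """Assemble the feed dict for one decode step (names past layer 0 are assumed)."""
    feed = {
        "input_ids": np.array([[token_id]], dtype=np.int64),
        "positions": np.array([position], dtype=np.int64),
        "cache_len": np.array(cache_len, dtype=np.int64),
    }
    for i in range(NUM_LAYERS):
        feed[f"full_key_{i}"] = k_caches[i]
        feed[f"full_value_{i}"] = v_caches[i]
    return feed


# Zero-initialised caches with an assumed (batch, kv_heads, seq, head_dim) layout.
shape = (1, KV_HEADS, MAX_SEQ_LEN, HEAD_DIM)
k_caches = [np.zeros(shape, dtype=np.float32) for _ in range(NUM_LAYERS)]
v_caches = [np.zeros(shape, dtype=np.float32) for _ in range(NUM_LAYERS)]

# token_id=464 is an arbitrary placeholder, not a meaningful prompt token.
feed = build_decode_feed(token_id=464, position=0, cache_len=0,
                         k_caches=k_caches, v_caches=v_caches)
```

The resulting dict can be passed as the second argument to `session.run(None, feed)` once the names and shapes have been verified against the exported graph.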
Limitations
- Base model only: This is a base language model trained on Wikipedia text. It is NOT instruction-tuned and will not reliably follow chat instructions.
- Wikipedia style: Outputs are Wikipedia-style text continuations, not helpful assistant responses.
- Limited coherence: At 285M parameters, the model may lose coherence over long generations (>100 tokens).
License
This project is released under the MIT license.
Citation
@misc{tachbit,
title={TachBit-285: A Transformer LLM for Edge Deployment},
author={TachyonX Ltd},
year={2026},
}