| --- |
| model-index: |
| - name: TachBit-285 |
| results: [] |
| tasks: |
| - type: text-generation |
| name: Text Generation |
| tags: |
| - transformer |
| - llm |
| - language-modeling |
| - tachbit |
| - text-generation |
| - pytorch |
| - onnx |
| --- |
| |
| # TachBit-285 |
|
|
| A ~285M-parameter decoder-only Transformer language model trained on English Wikipedia, |
| implemented in PyTorch and optimized for deployment on Raspberry Pi 5 with ONNX Runtime. |
|
|
| ## Model Architecture |
|
|
| | Parameter | Value | |
| |-------------------|--------| |
| | Layers | 24 | |
| | Hidden Size | 768 | |
| | Attention Heads | 12 | |
| | KV Heads | 4 | |
| | Head Dimension | 64 | |
| | Intermediate Size | 3072 | |
| | Context Length | 2048 | |
| | Vocabulary Size | 50,272 | |
| | Parameters | ~285M | |
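
The ~285M figure can be cross-checked against the table with a short count. This is a rough sketch rather than the project's own accounting: it assumes three weight matrices per SwiGLU feed-forward block, two RMSNorm weight vectors per layer plus a final one, no biases, and (not stated anywhere in this card) untied input and output embeddings. The function name is hypothetical.

```python
# Rough parameter count derived from the architecture table.
# Assumption (not stated in the card): the input embedding and output head are untied.

LAYERS = 24
HIDDEN = 768
Q_HEADS = 12
KV_HEADS = 4
HEAD_DIM = 64
INTERMEDIATE = 3072
VOCAB = 50_272


def count_parameters(tied_embeddings: bool = False) -> int:
    """Count weights for a bias-free GQA + SwiGLU decoder stack."""
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM          # query projection
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM    # key and value projections
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN          # attention output projection
    attention = q_proj + kv_proj + o_proj
    ffn = 3 * HIDDEN * INTERMEDIATE               # gate, up and down matrices
    norms = 2 * HIDDEN                            # two RMSNorm weight vectors
    per_layer = attention + ffn + norms
    embeddings = VOCAB * HIDDEN * (1 if tied_embeddings else 2)
    final_norm = HIDDEN
    return LAYERS * per_layer + embeddings + final_norm


if __name__ == "__main__":
    print(f"untied embeddings: {count_parameters(False):,}")
    print(f"tied embeddings:   {count_parameters(True):,}")
```

Under these assumptions the untied variant comes to 284,873,472 weights, in line with the ~285M in the table, while tying the embeddings would give 246,264,576.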
|
|
| ## Architecture Features |
|
|
| - **Grouped Query Attention (GQA)**: 12 query heads, 4 KV heads (3:1 ratio) |
| - **SwiGLU Activation**: Swish-Gated Linear Unit for FFN layers |
| - **RoPE Position Embeddings**: Rotary position embeddings (base=10,000) |
| - **RMSNorm Pre-Norm**: Root mean square normalization applied before each sublayer |
| - **No Bias Terms**: All linear layers use `bias=False` (LLaMA style) |
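
Two of the features above reduce to simple bookkeeping, sketched here in plain Python. The helper names are hypothetical, and two details are assumptions rather than facts from this card: that query heads are grouped consecutively onto KV heads, and that RoPE rotates adjacent pairs of dimensions (some implementations pair the first and second halves of the head instead).

```python
import math

Q_HEADS = 12
KV_HEADS = 4
HEAD_DIM = 64
ROPE_BASE = 10_000.0


def kv_head_for(query_head: int) -> int:
    """Map a query head to the KV head it shares (consecutive grouping assumed)."""
    group_size = Q_HEADS // KV_HEADS  # 3 query heads per KV head
    return query_head // group_size


def rope_inverse_frequencies() -> list[float]:
    """One rotation frequency per pair of head dimensions: base^(-2i/d)."""
    return [ROPE_BASE ** (-2 * i / HEAD_DIM) for i in range(HEAD_DIM // 2)]


def rotate_pair(x0: float, x1: float, position: int, inv_freq: float) -> tuple[float, float]:
    """Rotate one 2-D pair of a query or key vector by position * inv_freq radians."""
    angle = position * inv_freq
    cos, sin = math.cos(angle), math.sin(angle)
    return x0 * cos - x1 * sin, x0 * sin + x1 * cos


if __name__ == "__main__":
    print([kv_head_for(h) for h in range(Q_HEADS)])
    print(len(rope_inverse_frequencies()))
```

Because each pair is only rotated, vector norms are preserved, and the dot product between a rotated query and a rotated key depends on their relative position.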
|
|
| ## Training |
|
|
| - **Dataset**: Wikimedia English Wikipedia (2023-11-01 snapshot, ~6B tokens) |
| - **Tokenizer**: Re-used GPT-2 BPE tokenizer (50,272-token vocabulary) |
| - **Optimizer**: AdamW (β1=0.9, β2=0.95, weight_decay=0.1) |
| - **Learning Rate**: 3e-4 peak, cosine decay with 5% warmup |
| - **Batch Size**: 512 sequences × 2048 tokens (≈1.05M tokens per step) |
| - **Training Steps**: 6,000 (≈6.3B tokens in total, consistent with the ~6B-token dataset) |
| - **Precision**: BF16 mixed-precision training on NVIDIA GB10 |
| - **Gradient Checkpointing**: Enabled for memory efficiency |
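
The schedule and token budget implied by these settings can be sketched as follows. The card gives only the peak rate, the cosine decay and the 5% warmup, so the linear warmup shape and the final learning rate of zero are assumptions, and `learning_rate` is a hypothetical helper, not a function from the training code.

```python
import math

PEAK_LR = 3e-4
TOTAL_STEPS = 6000
WARMUP_STEPS = TOTAL_STEPS * 5 // 100  # 5% warmup -> 300 steps
MIN_LR = 0.0                           # assumption: decay all the way to zero
BATCH_SEQUENCES = 512
SEQ_LEN = 2048


def learning_rate(step: int) -> float:
    """Linear warmup (assumed shape) followed by cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


tokens_per_step = BATCH_SEQUENCES * SEQ_LEN
total_tokens = tokens_per_step * TOTAL_STEPS

if __name__ == "__main__":
    print(f"tokens per step: {tokens_per_step:,}")
    print(f"total tokens:    {total_tokens:,}")
    for step in (0, 299, 300, 3150, 5999):
        print(step, f"{learning_rate(step):.2e}")
```

At 1,048,576 tokens per step, 6,000 steps cover 6,291,456,000 tokens, which is roughly one pass over a ~6B-token corpus.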
| |
| ## Inference Performance |
| |
| ### Raspberry Pi 5 (ONNX Runtime CPU) |
| |
| | Model | Quantization | Speed | File Size | |
| |-----------|----------------------|------------|------------| |
| | FP32 ONNX | No quantization      | ~6.6 tok/s | ~1 GB      | |
| | INT8 ONNX | Dynamic quantization | ~11 tok/s  | ~200 MB    | |
| |
| The INT8 quantized model provides a ~1.7× speedup over FP32 (~11 vs. ~6.6 tok/s) using ARM NEON `sdot`/`udot` instructions. |
| |
| ## KV Cache |
| |
| Both GPU and edge CPU inference use a **KV cache** for efficient autoregressive generation: |
| |
| - **Static-shape KV cache**: All input/output shapes are fixed constants for ONNX Runtime |
| - **Decode graph**: Handles single-token steps with fully static shapes (no re-planning) |
| |
| The static-shape KV cache eliminates ONNX Runtime re-planning overhead and reaches |
| ~6.6 tokens/sec on the Pi 5 CPU with `--max-seq-len 512` (the FP32 figure in the table above). |
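
A fixed-shape cache means its memory cost is known up front. The estimate below is a sketch under stated assumptions: one key and one value tensor per layer, each holding `kv_heads × max_seq_len × head_dim` values, stored as FP32 with batch size 1. The actual layout and dtype of the exported graph are not given in this card, and both helper names are hypothetical.

```python
LAYERS = 24
KV_HEADS = 4
HEAD_DIM = 64


def kv_cache_bytes(max_seq_len: int, bytes_per_value: int = 4, batch: int = 1) -> int:
    """Total size of all key and value buffers (FP32 assumed by default)."""
    per_tensor = batch * KV_HEADS * max_seq_len * HEAD_DIM
    return 2 * LAYERS * per_tensor * bytes_per_value


def write_step(buffer: list, cache_len: int, new_entry) -> int:
    """Store one token's entry in a preallocated buffer and return the new length.

    The buffer itself never grows, which is what keeps every shape static.
    """
    if cache_len >= len(buffer):
        raise ValueError("cache is full; max_seq_len reached")
    buffer[cache_len] = new_entry
    return cache_len + 1


if __name__ == "__main__":
    for seq_len in (512, 2048):
        print(seq_len, f"{kv_cache_bytes(seq_len) / 2**20:.0f} MiB")
```

With grouped-query attention only the 4 KV heads are cached, so under these assumptions a 512-token cache takes 24 MiB and the full 2048-token context takes 96 MiB.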
| |
| ## Files in this Repository |
| |
| | File | Description | |
| |------------------------------------------|-------------------------------------------------| |
| | `model.safetensors` | PyTorch model weights (FP32) | |
| | `config.json` | Model configuration | |
| | `tokenizer.json` | GPT-2 BPE tokenizer | |
| | `tokenizer_config.json` | Tokenizer configuration | |
| | `onnx/tachbit_kvcache.onnx` | ONNX model (FP32, with KV cache, static shapes) | |
| | `onnx/tachbit_kvcache_int8_dynamic.onnx` | ONNX model (INT8 quantized, Pi 5 optimized) | |
|
|
| ## Usage |
|
|
| ### PyTorch (General Purpose) |
|
|
| ```python |
| import torch |
| from transformers import AutoTokenizer |
| from tachbit import TachBitModel |
| |
| # Load tokenizer |
| tokenizer = AutoTokenizer.from_pretrained("tachyonx-ai/TachBit-285") |
| |
| # Load model and switch to inference mode |
| model = TachBitModel.from_pretrained("tachyonx-ai/TachBit-285") |
| model.eval() |
| |
| # Generate a text continuation without tracking gradients |
| inputs = tokenizer("The Eiffel Tower is", return_tensors="pt") |
| with torch.no_grad(): |
|     outputs = model.generate(**inputs, max_new_tokens=100) |
| print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
| ``` |
|
|
| ### ONNX Runtime (Edge Deployment) |
|
|
| For Raspberry Pi 5 deployment with INT8 quantization: |
|
|
| ```python |
| import onnxruntime as ort |
| import numpy as np |
| |
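| # Load the INT8 quantized ONNX model on the CPU execution provider |
| session = ort.InferenceSession( |
|     "onnx/tachbit_kvcache_int8_dynamic.onnx", |
|     providers=["CPUExecutionProvider"], |
| ) |
| |
| # Prepare inputs for one decode step (see GitHub repo for complete KV cache management). |
| # token_id, position, cache_len and the k_cache_*/v_cache_* arrays are placeholders that |
| # the caller must define; their dtypes and shapes must match the exported graph. |
| inputs = { |
|     "input_ids": np.array([[token_id]]), |
|     "positions": np.array([position]), |
|     "cache_len": np.array(cache_len), |
|     "full_key_0": k_cache_0, |
|     "full_value_0": v_cache_0, |
|     # ... all 24 transformer layers |
| } |
| |
| # Run inference; the first output holds the logits |
| outputs = session.run(None, inputs) |
| logits = outputs[0] |
| # Greedy decoding: pick the highest-scoring token |
| next_token = np.argmax(logits, axis=-1) |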
| ``` |
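
Writing out 48 cache entries by hand is error-prone, so the feed dictionary can be assembled in a loop. The sketch below assumes the `full_key_{i}` / `full_value_{i}` pattern shown for layer 0 continues through layer 23, which this card does not confirm; the real input names, shapes and dtypes can be listed with `session.get_inputs()`. Both helper names are hypothetical.

```python
def cache_input_names(num_layers: int = 24) -> list[str]:
    """Assumed cache input names, extrapolated from the layer-0 names in the example."""
    names = []
    for layer in range(num_layers):
        names.append(f"full_key_{layer}")
        names.append(f"full_value_{layer}")
    return names


def build_feed(step_inputs: dict, key_caches: list, value_caches: list) -> dict:
    """Merge per-step inputs (input_ids, positions, cache_len) with per-layer caches."""
    if len(key_caches) != len(value_caches):
        raise ValueError("need one key cache and one value cache per layer")
    feed = dict(step_inputs)  # copy so the caller's dict is not modified
    for layer, (key, value) in enumerate(zip(key_caches, value_caches)):
        feed[f"full_key_{layer}"] = key
        feed[f"full_value_{layer}"] = value
    return feed


if __name__ == "__main__":
    print(cache_input_names()[:4])
```

Comparing `cache_input_names()` with the names reported by the session before the first run is a cheap way to catch a mismatch early.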
|
|
| ## Limitations |
|
|
| - **Base model only**: This is a base language model trained on Wikipedia text. It is NOT instruction-tuned and will not reliably follow chat instructions. |
| - **Wikipedia style**: Outputs are Wikipedia-style text continuations, not helpful assistant responses. |
| - **Limited coherence**: At 285M parameters, the model may lose coherence over long generations (>100 tokens). |
|
|
| ## License |
|
|
| This project is released under the MIT license. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{tachbit, |
| title={TachBit-285: A Transformer LLM for Edge Deployment}, |
| author={TachyonX Ltd}, |
| year={2026}, |
| } |
| ``` |
|
|