---
model-index:
- name: TachBit-285
  results: []
tasks:
- type: text-generation
  name: Text Generation
tags:
- transformer
- llm
- language-modeling
- tachbit
- text-generation
- pytorch
- onnx
---

# TachBit-285

A ~285M parameter Transformer decoder-only language model trained on English Wikipedia, implemented in PyTorch and optimized for deployment on Raspberry Pi 5 with ONNX Runtime.

## Model Architecture

| Parameter         | Value  |
|-------------------|--------|
| Layers            | 24     |
| Hidden Size       | 768    |
| Attention Heads   | 12     |
| KV Heads          | 4      |
| Head Dimension    | 64     |
| Intermediate Size | 3072   |
| Context Length    | 2048   |
| Vocabulary Size   | 50,272 |
| Parameters        | ~285M  |

## Architecture Features

- **Grouped Query Attention (GQA)**: 12 query heads, 4 KV heads (3:1 ratio)
- **SwiGLU Activation**: Swish-Gated Linear Unit for FFN layers
- **RoPE Position Embeddings**: Rotary position embeddings (base=10,000)
- **RMSNorm Pre-Norm**: Root mean square layer normalization before sublayers
- **No Bias Terms**: All linear layers have bias=False (LLaMA style)

## Training

- **Dataset**: Wikimedia English Wikipedia (2023-11-01 snapshot, ~6B tokens)
- **Tokenizer**: GPT-2 BPE, reused (50,272 tokens)
- **Optimizer**: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
- **Learning Rate**: 3e-4 peak, cosine decay with 5% warmup
- **Batch Size**: 512 sequences × 2048 tokens
- **Training Steps**: 6,000
- **Precision**: BF16 mixed-precision training on NVIDIA GB10
- **Gradient Checkpointing**: Enabled for memory efficiency

## Inference Performance

### Raspberry Pi 5 (ONNX Runtime CPU)

| Model     | Quantization         | Speed      | File Size |
|-----------|----------------------|------------|-----------|
| FP32 ONNX | No quantization      | ~6.6 tok/s | ~1 GB     |
| INT8 ONNX | Dynamic quantization | ~11 tok/s  | ~200 MB   |

The INT8 quantized model provides a ~1.7× speedup using ARM NEON `sdot`/`udot` instructions.

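The headline figures in this card can be cross-checked with a little arithmetic. The sketch below is not taken from the TachBit codebase: the function and variable names are hypothetical, and it assumes the input embedding and the output projection are separate (untied) matrices, which the card does not state. Under that assumption the architecture table gives about 285M weights, and the training settings give about 6.3B tokens seen, roughly one pass over the dataset.

```python
# Values copied from the architecture and training sections of this card.
VOCAB, HIDDEN, LAYERS = 50_272, 768, 24
Q_HEADS, KV_HEADS, HEAD_DIM, INTERMEDIATE = 12, 4, 64, 3072


def count_parameters(tied_embeddings: bool = False) -> int:
    """Count weights of a bias-free GQA + SwiGLU + RMSNorm decoder.

    ASSUMPTION: untied input/output embeddings by default; the card
    does not say whether the two matrices are shared.
    """
    q_dim = Q_HEADS * HEAD_DIM    # 768, width of the query projection
    kv_dim = KV_HEADS * HEAD_DIM  # 256, width of each key/value projection
    # Q and output projections are HIDDEN x q_dim; K and V are HIDDEN x kv_dim.
    attention = 2 * HIDDEN * q_dim + 2 * HIDDEN * kv_dim
    ffn = 3 * HIDDEN * INTERMEDIATE  # SwiGLU: gate, up, and down matrices
    norms = 2 * HIDDEN               # two RMSNorm weight vectors per layer
    per_layer = attention + ffn + norms
    embeddings = VOCAB * HIDDEN
    lm_head = 0 if tied_embeddings else VOCAB * HIDDEN
    final_norm = HIDDEN
    return embeddings + LAYERS * per_layer + final_norm + lm_head


params = count_parameters()
fp32_gb = params * 4 / 1e9            # 4 bytes per FP32 weight
tokens_per_step = 512 * 2048          # batch size x context length
total_tokens = tokens_per_step * 6000
warmup_steps = 6000 * 5 // 100        # 5% warmup
speedup = 11 / 6.6                    # INT8 vs FP32 tokens per second

print(f"parameters (untied): {params:,}")        # 284,873,472
print(f"FP32 weights: {fp32_gb:.2f} GB")         # 1.14 GB
print(f"tokens per step: {tokens_per_step:,}")   # 1,048,576
print(f"tokens seen: {total_tokens:,}")          # 6,291,456,000
print(f"warmup steps: {warmup_steps}")           # 300
print(f"INT8 speedup: {speedup:.1f}x")           # 1.7x
```

Calling `count_parameters(tied_embeddings=True)` gives about 246M instead, so the ~285M figure is more consistent with untied embeddings, though only the model code can confirm that.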
## KV Cache

Both GPU and edge CPU inference use a **KV cache** for efficient autoregressive generation:

- **Static-shape KV cache**: All input/output shapes are fixed constants for ONNX Runtime
- **Decode graph**: Handles single-token steps with fully static shapes (no re-planning)

The KV cache implementation eliminates ONNX Runtime re-planning overhead and achieves ~6.6 tokens/sec on Pi 5 CPU with `--max-seq-len 512`.

## Files in this Repository

| File                                     | Description                                     |
|------------------------------------------|-------------------------------------------------|
| `model.safetensors`                      | PyTorch model weights (FP32)                    |
| `config.json`                            | Model configuration                             |
| `tokenizer.json`                         | GPT-2 BPE tokenizer                             |
| `tokenizer_config.json`                  | Tokenizer configuration                         |
| `onnx/tachbit_kvcache.onnx`              | ONNX model (FP32, with KV cache, static shapes) |
| `onnx/tachbit_kvcache_int8_dynamic.onnx` | ONNX model (INT8 quantized, Pi 5 optimized)     |

## Usage

### PyTorch (General Purpose)

```python
from transformers import AutoTokenizer
from tachbit import TachBitModel

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("tachyonx-ai/TachBit-285")

# Load model
model = TachBitModel.from_pretrained("tachyonx-ai/TachBit-285")
model.eval()

# Generate text
inputs = tokenizer("The Eiffel Tower is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### ONNX Runtime (Edge Deployment)

For Raspberry Pi 5 deployment with INT8 quantization:

```python
import onnxruntime as ort
import numpy as np

# Load INT8 quantized ONNX model on the CPU execution provider
session = ort.InferenceSession(
    "onnx/tachbit_kvcache_int8_dynamic.onnx",
    providers=["CPUExecutionProvider"],
)

# Prepare inputs (see GitHub repo for complete KV cache management).
# token_id, position, cache_len, and the k_cache_*/v_cache_* arrays are
# placeholders that the surrounding generation loop must supply.
inputs = {
    "input_ids": np.array([[token_id]]),
    "positions": np.array([position]),
    "cache_len": np.array(cache_len),
    "full_key_0": k_cache_0,
    "full_value_0": v_cache_0,
    # ... all 24 transformer layers
}

# Run inference and pick the most likely next token (greedy decoding)
outputs = session.run(None, inputs)
logits = outputs[0]
next_token = np.argmax(logits, axis=-1)
```

## Limitations

- **Base model only**: This is a base language model trained on Wikipedia text. It is NOT instruction-tuned and will not reliably follow chat instructions.
- **Wikipedia style**: Outputs are Wikipedia-style text continuations, not helpful assistant responses.
- **Limited coherence**: At 285M parameters, the model may lose coherence over long generations (>100 tokens).

## License

This project is released under the MIT license.

## Citation

```bibtex
@misc{tachbit,
  title={TachBit-285: A Transformer LLM for Edge Deployment},
  author={TachyonX Ltd},
  year={2026},
}
```