---
model-index:
  - name: TachBit-285
    results: []
    tasks:
      - type: text-generation
        name: Text Generation
tags:
  - transformer
  - llm
  - language-modeling
  - tachbit
  - text-generation
  - pytorch
  - onnx
---

# TachBit-285

A ~285M-parameter decoder-only Transformer language model trained on English Wikipedia,
implemented in PyTorch and optimized for deployment on Raspberry Pi 5 with ONNX Runtime.

## Model Architecture

| Parameter         | Value  |
|-------------------|--------|
| Layers            | 24     |
| Hidden Size       | 768    |
| Attention Heads   | 12     |
| KV Heads          | 4      |
| Head Dimension    | 64     |
| Intermediate Size | 3072   |
| Context Length    | 2048   |
| Vocabulary Size   | 50,272 |
| Parameters        | ~285M  |
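
The ~285M figure can be roughly reproduced from the table. The sketch below assumes bias-free linear layers, a three-matrix SwiGLU FFN, two RMSNorm weight vectors per layer plus a final norm, and an output projection that is *not* tied to the input embedding; the last two points are assumptions, not stated in this card.

```python
# Dimensions from the architecture table
LAYERS, HIDDEN, N_HEADS, N_KV_HEADS, HEAD_DIM = 24, 768, 12, 4, 64
INTERMEDIATE, VOCAB = 3072, 50_272

# Attention: Q and O projections use all 12 heads, K and V only the 4 KV heads
attn = 2 * HIDDEN * (N_HEADS * HEAD_DIM) + 2 * HIDDEN * (N_KV_HEADS * HEAD_DIM)
# SwiGLU FFN: gate, up and down projections
ffn = 3 * HIDDEN * INTERMEDIATE
# Two RMSNorm weight vectors per layer (assumed)
norms = 2 * HIDDEN
per_layer = attn + ffn + norms

embedding = VOCAB * HIDDEN
# Assumption: separate input embedding and output projection (untied)
total_untied = LAYERS * per_layer + HIDDEN + 2 * embedding
total_tied = total_untied - embedding

print(f"untied: {total_untied / 1e6:.1f}M, tied: {total_tied / 1e6:.1f}M")
```

Under these assumptions the untied count comes out at about 284.9M, which matches the reported size; with tied embeddings it would be closer to 246M.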

## Architecture Features

- **Grouped Query Attention (GQA)**: 12 query heads, 4 KV heads (3:1 ratio)
- **SwiGLU Activation**: Swish-Gated Linear Unit for FFN layers
- **RoPE Position Embeddings**: Rotary position embeddings (base=10,000)
- **RMSNorm Pre-Norm**: Root mean square layer normalization before sublayers
- **No Bias Terms**: All linear layers have bias=False (LLaMA style)
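
For readers unfamiliar with these components, here is a minimal NumPy sketch of each one. It is illustrative only and is not the repository's code: the function names, the `eps` value, and the half-split RoPE layout (rotating element `i` with element `i + head_dim/2`) are assumptions.

```python
import numpy as np

N_HEADS, N_KV_HEADS, HEAD_DIM = 12, 4, 64

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square over the last axis, then scale."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU FFN without bias: (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))
    return (silu * (x @ w_up)) @ w_down

def rope(x, positions, base=10_000.0):
    """Rotary embedding. x: (seq, heads, head_dim), positions: (seq,)."""
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)      # one frequency per pair
    angles = positions[:, None] * inv_freq[None, :]   # (seq, half)
    cos = np.cos(angles)[:, None, :]                  # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def expand_kv(kv):
    """GQA: repeat each KV head so it serves N_HEADS // N_KV_HEADS query heads.

    kv: (seq, N_KV_HEADS, head_dim) -> (seq, N_HEADS, head_dim)
    """
    return np.repeat(kv, N_HEADS // N_KV_HEADS, axis=1)
```

Because RoPE is a pure rotation, it leaves the norm of each head vector unchanged and is the identity at position 0.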

## Training

- **Dataset**: Wikimedia English Wikipedia (2023-11-01 snapshot, ~6B tokens)
- **Tokenizer**: Reused GPT-2 BPE tokenizer (50,272 tokens)
- **Optimizer**: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
- **Learning Rate**: 3e-4 peak, cosine decay with 5% warmup
- **Batch Size**: 512 sequences × 2048 tokens
- **Training Steps**: 6,000 (about 6.3B tokens at this batch size)
- **Precision**: BF16 mixed-precision training on NVIDIA GB10
- **Gradient Checkpointing**: Enabled for memory efficiency
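
The schedule and token budget implied by these settings can be sketched as below. Linear warmup, decay to a minimum of zero, and warmup measured as 5% of total steps are assumptions; the card only states the peak rate, the cosine shape, and the warmup fraction.

```python
import math

PEAK_LR, TOTAL_STEPS, WARMUP_FRAC = 3e-4, 6000, 0.05
BATCH_SEQS, SEQ_LEN = 512, 2048
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_FRAC)  # 300 steps

def lr_at(step, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))

tokens_per_step = BATCH_SEQS * SEQ_LEN           # 1,048,576
total_tokens = tokens_per_step * TOTAL_STEPS     # 6,291,456,000

print(f"{tokens_per_step:,} tokens/step, {total_tokens / 1e9:.2f}B tokens total")
```

At roughly one million tokens per step, 6,000 steps cover about 6.3B tokens, in line with the ~6B-token dataset size.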

## Inference Performance

### Raspberry Pi 5 (ONNX Runtime CPU)

| Model     | Quantization         | Speed      | File Size  |
|-----------|----------------------|------------|------------|
| FP32 ONNX | None                 | ~6.6 tok/s | ~1 GB      |
| INT8 ONNX | Dynamic quantization | ~11 tok/s  | ~200 MB    |

The INT8 quantized model is ~1.7× faster than FP32 (~11 vs. ~6.6 tok/s), using the ARM NEON `sdot`/`udot` dot-product instructions.
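
To illustrate the idea behind dynamic quantization, the NumPy sketch below stores weights as INT8 ahead of time, quantizes activations at run time, and accumulates INT8 products in INT32, which is the kind of operation `sdot`/`udot` accelerate. It uses a simplified symmetric per-tensor scheme; ONNX Runtime's actual kernels may differ in details such as zero points and per-channel scales, and the function names here are hypothetical.

```python
import numpy as np

def quantize_symmetric(x):
    """Map a float array to int8 using a single max-abs scale."""
    scale = float(np.max(np.abs(x))) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dynamic_int8_matmul(x, w_q, w_scale):
    """Multiply float activations by weights that were quantized offline."""
    x_q, x_scale = quantize_symmetric(x)                 # done per call ("dynamic")
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)    # integer accumulate
    return acc.astype(np.float32) * (x_scale * w_scale)  # rescale to float

rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)).astype(np.float32)
x = rng.standard_normal((1, 768)).astype(np.float32)

w_q, w_scale = quantize_symmetric(w)   # stored weights: 1 byte each instead of 4
y_ref = x @ w
y_int8 = dynamic_int8_matmul(x, w_q, w_scale)
rel_err = np.linalg.norm(y_int8 - y_ref) / np.linalg.norm(y_ref)
print(f"weight bytes: {w.nbytes} -> {w_q.nbytes}, relative error: {rel_err:.4f}")
```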

## KV Cache

Both GPU and edge CPU inference use a **KV cache** for efficient autoregressive generation:

- **Static-shape KV cache**: All input/output shapes are fixed constants for ONNX Runtime
- **Decode graph**: Handles single-token steps with fully static shapes (no re-planning)

Because every shape is fixed, ONNX Runtime does not have to re-plan between decode steps.
With `--max-seq-len 512`, this configuration reaches ~6.6 tokens/sec on the Pi 5 CPU (the FP32 figure in the table above).
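
A static-shape cache of this kind can be sketched in NumPy as follows. The buffer layout (layers, batch, KV heads, positions, head dim), the class name `StaticKVCache`, and the additive mask are assumptions for illustration, not the repository's implementation.

```python
import numpy as np

LAYERS, N_KV_HEADS, HEAD_DIM, MAX_SEQ_LEN = 24, 4, 64, 512

class StaticKVCache:
    """Preallocated K/V buffers whose shapes never change during decoding."""

    def __init__(self):
        shape = (LAYERS, 1, N_KV_HEADS, MAX_SEQ_LEN, HEAD_DIM)
        self.keys = np.zeros(shape, dtype=np.float32)
        self.values = np.zeros(shape, dtype=np.float32)
        self.cache_len = 0  # number of filled positions

    def append(self, layer, k, v):
        """Write one token's K/V, each of shape (1, N_KV_HEADS, HEAD_DIM)."""
        if self.cache_len >= MAX_SEQ_LEN:
            raise IndexError("KV cache is full")
        self.keys[layer, :, :, self.cache_len] = k
        self.values[layer, :, :, self.cache_len] = v

    def advance(self):
        """Call once per token, after all layers have been written."""
        self.cache_len += 1

    def mask(self):
        """Additive attention mask: 0 for filled slots, -inf for empty ones."""
        filled = np.arange(MAX_SEQ_LEN) < self.cache_len
        return np.where(filled, 0.0, -np.inf).astype(np.float32)

cache = StaticKVCache()
total_bytes = cache.keys.nbytes + cache.values.nbytes
print(f"FP32 cache size: {total_bytes / 2**20:.0f} MiB")
```

With 4 KV heads instead of 12, the FP32 cache under this layout takes 24 MiB at a 512-token limit, one third of what full multi-head attention would need.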

## Files in this Repository

| File                                     | Description                                     |
|------------------------------------------|-------------------------------------------------|
| `model.safetensors`                      | PyTorch model weights (FP32)                    |
| `config.json`                            | Model configuration                             |
| `tokenizer.json`                         | GPT-2 BPE tokenizer                             |
| `tokenizer_config.json`                  | Tokenizer configuration                         |
| `onnx/tachbit_kvcache.onnx`              | ONNX model (FP32, with KV cache, static shapes) |
| `onnx/tachbit_kvcache_int8_dynamic.onnx` | ONNX model (INT8 quantized, Pi 5 optimized)     |

## Usage

### PyTorch (General Purpose)

```python
from transformers import AutoTokenizer
from tachbit import TachBitModel

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("tachyonx-ai/TachBit-285")

# Load model
model = TachBitModel.from_pretrained("tachyonx-ai/TachBit-285")
model.eval()

# Generate text
inputs = tokenizer("The Eiffel Tower is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### ONNX Runtime (Edge Deployment)

For Raspberry Pi 5 deployment with INT8 quantization:

```python
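import numpy as np
import onnxruntime as ort

# Load the INT8 quantized ONNX model
session = ort.InferenceSession("onnx/tachbit_kvcache_int8_dynamic.onnx")

# Placeholder values for a single decode step; real values come from the
# tokenizer and the decode loop (see GitHub repo for complete KV cache management)
token_id, position, cache_len = 0, 0, 0

# Zero-initialise every graph input from its declared static shape and dtype.
# Inputs include input_ids, positions, cache_len, and full_key_0 / full_value_0
# ... for all 24 transformer layers. Extend the dtype map if the graph uses others.
dtypes = {"tensor(float)": np.float32, "tensor(int64)": np.int64, "tensor(int32)": np.int32}
inputs = {i.name: np.zeros(i.shape, dtype=dtypes[i.type]) for i in session.get_inputs()}
inputs["input_ids"][...] = token_id
inputs["positions"][...] = position
inputs["cache_len"][...] = cache_len

# Run inference and pick the most likely next token (greedy decoding)
outputs = session.run(None, inputs)
logits = outputs[0]
next_token = np.argmax(logits, axis=-1)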
```

## Limitations

- **Base model only**: This is a base language model trained on Wikipedia text. It is NOT instruction-tuned and will not reliably follow chat instructions.
- **Wikipedia style**: Outputs are Wikipedia-style text continuations, not helpful assistant responses.
- **Limited coherence**: At 285M parameters, the model may lose coherence over long generations (>100 tokens).

## License

This project is released under the MIT license.

## Citation

```bibtex
@misc{tachbit,
    title={TachBit-285: A Transformer LLM for Edge Deployment},
    author={TachyonX Ltd},
    year={2026},
}
```