	---
	model-index:
	- name: TachBit-285
	  results: []
	tasks:
	- type: text-generation
	  name: Text Generation
	tags:
	- transformer
	- llm
	- language-modeling
	- tachbit
	- text-generation
	- pytorch
	- onnx
	---

	# TachBit-285

	A ~285M-parameter decoder-only Transformer language model trained on English Wikipedia,
	implemented in PyTorch and optimized for deployment on a Raspberry Pi 5 with ONNX Runtime.

	## Model Architecture

	| Parameter         | Value  |
	|-------------------|--------|
	| Layers            | 24     |
	| Hidden Size       | 768    |
	| Attention Heads   | 12     |
	| KV Heads          | 4      |
	| Head Dimension    | 64     |
	| Intermediate Size | 3072   |
	| Context Length    | 2048   |
	| Vocabulary Size   | 50,272 |
	| Parameters        | ~285M  |
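
As a sanity check, the ~285M figure can be reproduced from the table. The sketch below is back-of-the-envelope arithmetic, not code from the repository. It assumes separate (untied) input-embedding and output-projection matrices and a final RMSNorm, neither of which this card states; under those assumptions the total comes to about 284.9M.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumptions (not stated in the card): untied input/output embeddings
# and one final RMSNorm after the last layer.

layers = 24
hidden = 768
q_heads = 12
kv_heads = 4
head_dim = 64
intermediate = 3072
vocab = 50_272

# Attention: Q and output projections are hidden x (12 * 64);
# K and V projections are narrower because there are only 4 KV heads.
attn = 2 * hidden * (q_heads * head_dim) + 2 * hidden * (kv_heads * head_dim)

# SwiGLU feed-forward: gate, up, and down projections.
ffn = 3 * hidden * intermediate

# Two RMSNorm weight vectors per layer; no bias terms anywhere.
norms = 2 * hidden

per_layer = attn + ffn + norms
embeddings = 2 * vocab * hidden                    # untied input + output matrices
total = layers * per_layer + hidden + embeddings   # "+ hidden" is the final RMSNorm

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,} (~{total / 1e6:.0f}M)")
```

With tied embeddings the same arithmetic would give roughly 246M, so the stated size is more consistent with untied matrices.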

	## Architecture Features

	- Grouped Query Attention (GQA): 12 query heads, 4 KV heads (3:1 ratio)
	- SwiGLU Activation: Swish-Gated Linear Unit for FFN layers
	- RoPE Position Embeddings: Rotary position embeddings (base=10,000)
	- RMSNorm Pre-Norm: Root mean square layer normalization applied before each sublayer
	- No Bias Terms: All linear layers use bias=False (LLaMA-style)
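
To make the features above concrete, here is a minimal NumPy sketch of each building block. It is an illustration under stated assumptions, not the TachBit implementation: the function names, the `eps` value, and the "rotate-half" RoPE channel layout are all choices made for this example.

```python
import numpy as np

HIDDEN, INTERMEDIATE = 768, 3072
Q_HEADS, KV_HEADS, HEAD_DIM = 12, 4, 64

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square over the last axis, then scale."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (silu(x @ w_gate) * (x @ w_up)) @ w_down, no biases."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))
    return (silu * (x @ w_up)) @ w_down

def rope(x, positions, base=10_000.0):
    """Rotary embeddings. x: (heads, seq, head_dim); positions: (seq,).
    Assumes the rotate-half layout: channel i is paired with channel i + head_dim/2."""
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = positions[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def gqa(q, k, v):
    """Causal grouped-query attention. q: (12, seq, 64); k, v: (4, seq, 64)."""
    group = q.shape[0] // k.shape[0]          # 3 query heads per KV head
    k = np.repeat(k, group, axis=0)           # query heads 0-2 share KV head 0, etc.
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    seq = q.shape[1]
    future = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Tiny demo with random tensors in place of real projections.
rng = np.random.default_rng(0)
seq = 5
pos = np.arange(seq)
x = rng.standard_normal((seq, HIDDEN))
h = rms_norm(x, np.ones(HIDDEN))
ffn_out = swiglu(
    h,
    0.02 * rng.standard_normal((HIDDEN, INTERMEDIATE)),
    0.02 * rng.standard_normal((HIDDEN, INTERMEDIATE)),
    0.02 * rng.standard_normal((INTERMEDIATE, HIDDEN)),
)
q = rng.standard_normal((Q_HEADS, seq, HEAD_DIM))
k = rng.standard_normal((KV_HEADS, seq, HEAD_DIM))
v = rng.standard_normal((KV_HEADS, seq, HEAD_DIM))
out = gqa(rope(q, pos), rope(k, pos), v)
print(out.shape, ffn_out.shape)
```

Storing 4 KV heads instead of 12 cuts the K/V projection weights and the KV cache to one third of their multi-head size, which matters on a memory-constrained device.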

	## Training

	- Dataset: Wikimedia English Wikipedia (2023-11-01 snapshot, ~6B tokens)
	- Tokenizer: Reused GPT-2 BPE tokenizer (50,272-token vocabulary)
	- Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
	- Learning Rate: 3e-4 peak, cosine decay with 5% warmup
	- Batch Size: 512 sequences × 2048 tokens (~1.05M tokens per step)
	- Training Steps: 6,000
	- Precision: BF16 mixed-precision training on NVIDIA GB10
	- Gradient Checkpointing: Enabled for memory efficiency
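
The schedule and token budget above can be worked through in a few lines. This is a sketch of a common warmup-plus-cosine formulation, not the training script: linear warmup shape and decay to a minimum of zero are assumptions, since the card gives only the peak rate and the 5% warmup fraction.

```python
import math

PEAK_LR = 3e-4
TOTAL_STEPS = 6000
WARMUP_STEPS = TOTAL_STEPS * 5 // 100   # 5% warmup -> 300 steps
MIN_LR = 0.0                            # assumption: decay all the way to zero

def lr_at(step):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp that reaches the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

tokens_per_step = 512 * 2048
total_tokens = tokens_per_step * TOTAL_STEPS

for step in (0, 299, 300, 3150, 5999):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
print(f"tokens seen: {total_tokens:,}")
```

At 512 × 2048 tokens per step, 6,000 steps cover about 6.29B tokens, which is roughly one pass over the ~6B-token dataset.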

	## Inference Performance

	### Raspberry Pi 5 (ONNX Runtime CPU)

	| Model     | Quantization         | Speed      | File Size |
	|-----------|----------------------|------------|-----------|
	| FP32 ONNX | No quantization      | ~6.6 tok/s | ~1 GB     |
	| INT8 ONNX | Dynamic quantization | ~11 tok/s  | ~200 MB   |

	The INT8 quantized model provides a ~1.7× speedup over FP32 (~11 vs. ~6.6 tok/s) using the ARM NEON `sdot`/`udot` dot-product instructions.
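
This card does not say how the INT8 file was produced. One common route is ONNX Runtime's `quantize_dynamic`; the sketch below assumes that route and signed 8-bit weights (`QuantType.QInt8`), neither of which is confirmed here. The helper name `quantize_to_int8` is hypothetical.

```python
from pathlib import Path

def quantize_to_int8(fp32_path, int8_path):
    """Write a dynamically quantized INT8 copy of an FP32 ONNX model.

    Weights are quantized ahead of time; activation scales are computed
    at run time, so no calibration dataset is needed.
    """
    # Imported lazily so this helper can be defined without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=str(fp32_path),
        model_output=str(int8_path),
        weight_type=QuantType.QInt8,   # assumption: signed 8-bit weights
    )
    return Path(int8_path)

# Example call, using the file names listed in this repository:
# quantize_to_int8("onnx/tachbit_kvcache.onnx",
#                  "onnx/tachbit_kvcache_int8_dynamic.onnx")
```

Dynamic quantization is a reasonable fit for CPU-only targets because it needs no calibration pass, at the cost of computing activation scales on every run.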

	## KV Cache

	Both GPU and edge CPU inference use a KV cache for efficient autoregressive generation:

	- Static-shape KV cache: All input/output shapes are fixed constants for ONNX Runtime
	- Decode graph: Handles single-token steps with fully static shapes (no re-planning)

	Because every shape is fixed, ONNX Runtime does not have to re-plan between decode steps.
	With `--max-seq-len 512`, this reaches ~6.6 tokens/sec on the Pi 5 CPU (the FP32 figure in the table above).
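
The idea of a static-shape cache can be sketched in NumPy: allocate the full buffer once, write each new token into the next slot, and mask out the unused tail. The layout `(batch, kv_heads, max_seq_len, head_dim)` and the helper names are assumptions for illustration; the exported graph may organize its tensors differently.

```python
import numpy as np

LAYERS, KV_HEADS, HEAD_DIM, MAX_SEQ_LEN = 24, 4, 64, 512

# Assumed layout: (batch, kv_heads, max_seq_len, head_dim), FP32.
cache_shape = (1, KV_HEADS, MAX_SEQ_LEN, HEAD_DIM)
k_cache = np.zeros(cache_shape, dtype=np.float32)
v_cache = np.zeros(cache_shape, dtype=np.float32)

def append_token(k_cache, v_cache, k_new, v_new, cache_len):
    """Write one token's K/V into slot cache_len; buffer shapes never change."""
    if cache_len >= k_cache.shape[2]:
        raise ValueError("KV cache is full")
    k_cache[:, :, cache_len, :] = k_new
    v_cache[:, :, cache_len, :] = v_new
    return cache_len + 1

def valid_mask(cache_len, max_seq_len=MAX_SEQ_LEN):
    """True for filled slots, False for the unused tail of the buffer."""
    return np.arange(max_seq_len) < cache_len

# Simulate three decode steps for a single layer with random K/V.
rng = np.random.default_rng(0)
cache_len = 0
for _ in range(3):
    k_new = rng.standard_normal((1, KV_HEADS, HEAD_DIM)).astype(np.float32)
    v_new = rng.standard_normal((1, KV_HEADS, HEAD_DIM)).astype(np.float32)
    cache_len = append_token(k_cache, v_cache, k_new, v_new, cache_len)

# Memory for all layers at this sequence length (keys + values).
total_bytes = LAYERS * (k_cache.nbytes + v_cache.nbytes)
print(f"filled slots: {int(valid_mask(cache_len).sum())} of {MAX_SEQ_LEN}")
print(f"KV cache for {LAYERS} layers: {total_bytes / 2**20:.0f} MiB")
```

Under these assumptions the whole cache costs 24 MiB at a 512-token limit, so the fixed allocation is cheap compared with the model weights.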

	## Files in this Repository

	| File                                     | Description                                     |
	|------------------------------------------|-------------------------------------------------|
	| `model.safetensors`                      | PyTorch model weights (FP32)                    |
	| `config.json`                            | Model configuration                             |
	| `tokenizer.json`                         | GPT-2 BPE tokenizer                             |
	| `tokenizer_config.json`                  | Tokenizer configuration                         |
	| `onnx/tachbit_kvcache.onnx`              | ONNX model (FP32, with KV cache, static shapes) |
	| `onnx/tachbit_kvcache_int8_dynamic.onnx` | ONNX model (INT8 quantized, Pi 5 optimized)     |

	## Usage

	### PyTorch (General Purpose)

	```python
	from transformers import AutoTokenizer
	from tachbit import TachBitModel

	# Load tokenizer
	tokenizer = AutoTokenizer.from_pretrained("tachyonx-ai/TachBit-285")

	# Load model
	model = TachBitModel.from_pretrained("tachyonx-ai/TachBit-285")
	model.eval()

	# Generate text
	inputs = tokenizer("The Eiffel Tower is", return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=100)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	### ONNX Runtime (Edge Deployment)

	For Raspberry Pi 5 deployment with INT8 quantization:

	```python
	import onnxruntime as ort
	import numpy as np

	# Load INT8 quantized ONNX model
	session = ort.InferenceSession("onnx/tachbit_kvcache_int8_dynamic.onnx")

	# Prepare inputs for one decode step (see GitHub repo for complete KV cache management).
	# token_id, position, cache_len, and the per-layer cache arrays are placeholders
	# that the caller must supply.
	inputs = {
	    "input_ids": np.array([[token_id]]),
	    "positions": np.array([position]),
	    "cache_len": np.array(cache_len),
	    "full_key_0": k_cache_0,
	    "full_value_0": v_cache_0,
	    # ... all 24 transformer layers
	}

	# Run inference and pick the next token greedily
	outputs = session.run(None, inputs)
	logits = outputs[0]
	next_token = np.argmax(logits, axis=-1)
	```
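
Rather than hard-coding every per-layer cache name, the cache inputs can be discovered from the session itself. The sketch below reads the metadata that `session.get_inputs()` returns (each entry has `.name`, `.shape`, and `.type`) and allocates zero-filled arrays for inputs whose names start with `full_key_` or `full_value_`. The helper name `empty_cache_feeds` is hypothetical, and the stand-in metadata in the demo uses an assumed shape so the block runs without the model file.

```python
from types import SimpleNamespace

import numpy as np

# ONNX Runtime type strings mapped to NumPy dtypes (only those needed here).
ORT_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(int64)": np.int64,
    "tensor(int32)": np.int32,
}

def empty_cache_feeds(input_meta, prefixes=("full_key_", "full_value_")):
    """Allocate zero-filled arrays for every cache input the model declares.

    input_meta: the list returned by session.get_inputs().
    """
    feeds = {}
    for meta in input_meta:
        if not meta.name.startswith(prefixes):
            continue
        # Static-shape graphs report plain integers for every dimension.
        if not all(isinstance(dim, int) for dim in meta.shape):
            raise ValueError(f"{meta.name} has a dynamic shape: {meta.shape}")
        feeds[meta.name] = np.zeros(meta.shape, dtype=ORT_TO_NUMPY[meta.type])
    return feeds

# Stand-in metadata (hypothetical shapes) so the sketch runs without the model.
# With a real session, pass session.get_inputs() instead.
fake_inputs = [
    SimpleNamespace(name="input_ids", shape=[1, 1], type="tensor(int64)"),
    SimpleNamespace(name="full_key_0", shape=[1, 4, 512, 64], type="tensor(float)"),
    SimpleNamespace(name="full_value_0", shape=[1, 4, 512, 64], type="tensor(float)"),
]
feeds = empty_cache_feeds(fake_inputs)
print(sorted(feeds))
```

Reading names, shapes, and dtypes from the graph keeps the calling code in step with the exported model and surfaces a mismatch before `session.run` is called.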

	## Limitations

	- Base model only: This is a base language model trained on Wikipedia text. It is NOT instruction-tuned and will not reliably follow chat instructions.
	- Wikipedia style: Outputs are Wikipedia-style text continuations, not helpful assistant responses.
	- Limited coherence: At 285M parameters, the model may lose coherence over long generations (>100 tokens).

	## License

	This project is released under the MIT license.

	## Citation

	```bibtex
	@misc{tachbit,
	  title={TachBit-285: A Transformer LLM for Edge Deployment},
	  author={TachyonX Ltd},
	  year={2026},
	}
	```