T4NT-0.5B
Tanh 4-bit Neural Transformer — a 535M parameter language model trained entirely from scratch with 4-bit quantization-aware training (QAT) using symmetric quantization and tanh soft weight clipping.
What is T4NT?
T4NT is a decoder-only transformer with a modern architecture (RoPE + SwiGLU + RMSNorm) where every linear layer is 4-bit quantization-constrained throughout training via straight-through estimators (STE). Weights pass through symmetric 4-bit quantization (15 levels, -7 to +7) during every forward pass, while latent weights are maintained in floating point for gradient updates. A novel tanh soft clipping method bounds weight magnitudes between optimizer steps.
No pretrained weights were used. No fine-tuning of existing models. Trained from random initialization on a single NVIDIA T4 GPU.
Architecture
| Component | T4NT-0.5B |
|---|---|
| Parameters | 535,033,600 (0.54B) |
| Embedding dim | 1280 |
| Attention heads | 16 |
| Layers | 20 |
| FFN dim | 3584 |
| Max sequence length | 256 |
| Position encoding | Rotary (RoPE) |
| FFN activation | SwiGLU |
| Normalization | RMSNorm |
| Weight quantization | 4-bit symmetric (15 levels) |
| Weight clipping | tanh(w/3) * 3 |
| Vocab size | 50,257 (GPT-2 tokenizer) |
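The parameter count can be reproduced from the table above. Assuming untied input/output embeddings, four attention projections, a three-matrix SwiGLU FFN, and two RMSNorms per layer plus a final norm (all assumptions consistent with the architecture listed), the figures add up exactly:

```python
d, n_layers, d_ffn, vocab = 1280, 20, 3584, 50257

attn = 4 * d * d              # Q, K, V and output projections
ffn = 3 * d * d_ffn           # SwiGLU: gate, up and down matrices
norms = 2 * n_layers * d + d  # two RMSNorms per layer + final norm
embeds = 2 * vocab * d        # untied token embedding + LM head

total = n_layers * (attn + ffn) + norms + embeds
print(total)  # 535033600, matching the reported 535,033,600
```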
Training Details
| Detail | Value |
|---|---|
| Dataset | WikiText-103 |
| Tokens seen | ~15M |
| Epochs | 50 |
| Batch size | 2 (effective 32 with gradient accumulation) |
| Learning rate | 3e-4 (cosine annealing) |
| Optimizer | AdamW (betas 0.9, 0.95) |
| Mixed precision | FP16 autocast + GradScaler |
| Hardware | NVIDIA T4 15GB |
| Training time | 2.92 hours |
| FP32 size | 2.04 GB |
| INT4 deployment size | 255 MB |
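The deployment size follows directly from packing two 4-bit weights per byte (a sketch of the arithmetic; the actual packing format is not specified in this card):

```python
params = 535_033_600
packed_bytes = params // 2              # two 4-bit weights per byte
print(round(packed_bytes / 2**20, 1))   # 255.1 MiB, matching the 255 MB figure
```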
Results
Perplexity (WikiText-103)
| Epoch | Train Loss | Val Loss | PPL |
|---|---|---|---|
| 1 | 7.985 | 7.254 | 1413.75 |
| 5 | 6.116 | 5.931 | 376.41 |
| 10 | 5.630 | 5.508 | 246.76 |
| 15 | 5.365 | 5.280 | 196.34 |
| 20 | 5.182 | 4.995 | 147.72 |
| 25 | 5.035 | 4.848 | 127.47 |
| 30 | 4.879 | 4.760 | 116.80 |
| 35 | 4.781 | 4.672 | 106.86 |
| 40 | 4.760 | 4.612 | 100.73 |
| 45 | 4.716 | 4.574 | 96.91 |
| 50 | 4.726 | 4.610 | 100.46 |
| Best | - | 4.533 | 93.04 |
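The PPL column is the exponential of the validation cross-entropy loss, which can be checked against the table:

```python
import math

# Perplexity = exp(cross-entropy loss); e.g. the best checkpoint:
print(round(math.exp(4.533), 2))  # 93.04, the best reported perplexity
```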
Weight Stability (50 epochs)
| Metric | Epoch 1 | Epoch 25 | Epoch 50 |
|---|---|---|---|
| Weight std | 0.0201 | 0.0205 | 0.0199 |
| Weight abs_max | 0.1183 | 0.1607 | 0.1130 |
| Quantization levels used | 14.8 / 15 | 14.9 / 15 | 15.0 / 15 |
| Weights in [-3, 3] | 100% | 100% | 100% |
| Gradient norm | 2.22 | 0.61 | 0.60 |
Key observations:
- Weight standard deviation remained essentially constant throughout training (0.020 ± 0.001)
- Weight abs_max peaked at epoch 20 then decreased as training stabilized, demonstrating the self-correcting property of tanh clipping
- All 15 quantization levels remained in use throughout training (no level collapse)
- Gradient norms decreased naturally from 2.22 to 0.60, indicating healthy convergence
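The level-usage metric above can be computed per weight tensor roughly as follows (a sketch, assuming per-tensor scales; the averaging that produces fractional values like 14.8 is an assumption):

```python
import torch

def levels_used(w: torch.Tensor, qmax: int = 7) -> int:
    # Count how many of the 15 integer codes in [-7, +7] a tensor occupies
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return int(q.unique().numel())
```

Averaging this count over all quantized linear layers would yield per-epoch figures like those in the table.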
Generation Samples (Epoch 50)
The model demonstrates syntactic fluency and structural coherence, though semantic depth remains limited due to dataset scale (15M tokens). Generated text follows grammatical English patterns with proper sentence structure.
Prompt: "The meaning of life is"
The meaning of life is unclear, but it is still a significant, mature, obvious, a man of other males who would find the way of the female. The man died on April 8, 1820, and the father of the family, John, died on March 2
Prompt: "In the history of"
In the history of the first two years of the period. In his second career in 1848, Mary was named for his junior year by the University of North America, and later, in 1875, was named after his mother, William, a member of the University
Key Contributions
- 4-bit QAT at scale: We demonstrate that a modern transformer (535M params) can be trained from scratch under 4-bit quantization constraints, converging stably with no divergence over 50 epochs
- Tanh soft weight clipping: A smooth, differentiable weight constraint method that bounds weight magnitudes during quantized training, showing self-correcting behavior (weights expand during learning, contract during convergence)
- Quantization level preservation: All 15 quantization levels remained fully utilized throughout 50 epochs of training with no level collapse, a non-trivial result for low-bit training
- Consumer hardware training: Entire training completed on a single NVIDIA T4 (15GB) in under 3 hours, demonstrating accessibility of low-bit training research
Related Work
This model builds on the research presented in:
"True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity", Shivnath Tathe, arXiv:2603.13931.
How to Use
This is a custom architecture (not a standard HuggingFace model). Load with PyTorch:
```python
import torch

# Load the checkpoint and inspect its stored metrics
ckpt = torch.load('T4NT_0.5B_tanh_50ep.pt', map_location='cpu')
print(f"Val Loss: {ckpt['val_loss']:.4f}")
print(f"PPL: {ckpt['ppl']:.2f}")
print(f"Config: {ckpt['config']}")

# Rebuild the model architecture and load the weights:
# see config.json for architecture parameters.
# Full training code is available in the training logs.
```
Files
| File | Description |
|---|---|
| `config.json` | Model architecture and training configuration |
| `generation_config.json` | Default generation parameters |
| `T4NT_0.5B_tanh_50ep.pt` | Model checkpoint (FP32 training weights, 2.14 GB) |
| `t4nt_tanh_50ep_log.json` | Detailed training logs with per-epoch metrics and weight statistics |
Limitations
- Limited training data: Trained on ~15M tokens from WikiText-103. Production language models use 1T+ tokens. Perplexity reflects this data constraint, not an architectural ceiling.
- Semantic depth: The model generates syntactically correct text but lacks deep semantic coherence due to limited data exposure.
- Wikipedia distribution only: Training data is exclusively Wikipedia. The model has no exposure to conversational, code, or instructional text.
- No instruction tuning: This is a base language model (next-token prediction only). It does not follow instructions or engage in dialogue.
- No FP32 baseline comparison: This work focuses on demonstrating the feasibility and stability of 4-bit constrained training, not on matching FP32 accuracy. A matched FP32 baseline at the same scale and data budget is planned for the full paper.
- Latent weights in floating point: While forward-pass computation uses 4-bit quantized weights, the optimizer maintains floating-point latent weights for gradient updates. This is standard for QAT but means training memory is not reduced to 4-bit levels.
Ablation Studies
Ablation experiments comparing tanh clipping vs hard clipping vs no clipping are in progress. Results will be added upon completion.
Citation
```bibtex
@misc{tathe2025t4nt,
  author = {Tathe, Shivnath},
  title  = {T4NT-0.5B: Tanh 4-bit Neural Transformer},
  year   = {2025},
  url    = {https://huggingface.co/shivnathtathe/T4NT-0.5B}
}

@misc{tathe2026true4bit,
  author        = {Tathe, Shivnath},
  title         = {True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity},
  year          = {2026},
  eprint        = {2603.13931},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}
```
Author
Shivnath Tathe, Software Engineer at ISG eSolutions | Independent AI Researcher
Google Scholar | arXiv | GitHub | LinkedIn