T4NT-0.5B

Tanh 4-bit Neural Transformer: a 535M-parameter language model trained entirely from scratch with 4-bit quantization-aware training (QAT), using symmetric quantization and tanh soft weight clipping.

What is T4NT?

T4NT is a decoder-only transformer with a modern architecture (RoPE + SwiGLU + RMSNorm) in which every linear layer is constrained to 4-bit weights throughout training via a straight-through estimator (STE). On every forward pass, weights pass through symmetric 4-bit quantization (15 levels, -7 to +7), while latent weights are maintained in floating point for gradient updates. A novel tanh soft clipping method bounds weight magnitudes between optimizer steps.

No pretrained weights were used. No fine-tuning of existing models. Trained from random initialization on a single NVIDIA T4 GPU.
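The mechanism described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the training code; in particular, the per-tensor max-abs scale shown here is an assumption, since the card does not specify how the quantization scale is computed.

```python
import torch

def quantize_4bit_ste(w: torch.Tensor) -> torch.Tensor:
    """Symmetric 4-bit fake quantization with a straight-through estimator.

    Forward: snap weights to 15 integer levels (-7..+7) times a scale.
    Backward: gradients flow to the latent FP weights as if this were identity.
    """
    qmax = 7  # 2**(4-1) - 1, giving levels -7..+7 (15 levels)
    scale = w.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale (assumed)
    w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()  # STE: value of w_q, gradient of identity
```

In training, each linear layer would apply this to its weight on every forward pass while the optimizer keeps updating the latent floating-point tensor.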

Architecture

Component T4NT-0.5B
Parameters 535,033,600 (0.54B)
Embedding dim 1280
Attention heads 16
Layers 20
FFN dim 3584
Max sequence length 256
Position encoding Rotary (RoPE)
FFN activation SwiGLU
Normalization RMSNorm
Weight quantization 4-bit symmetric (15 levels)
Weight clipping tanh(w/3) * 3
Vocab size 50,257 (GPT-2 tokenizer)
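The weight-clipping row above, tanh(w/3) * 3, is a smooth squashing of the latent weights rather than a hard clamp (a minimal sketch):

```python
import torch

def tanh_soft_clip(w: torch.Tensor, bound: float = 3.0) -> torch.Tensor:
    # tanh(w / 3) * 3: approximately identity for small |w|,
    # smoothly saturating toward +/-3 for large |w|; fully differentiable
    return bound * torch.tanh(w / bound)
```

Unlike a hard clamp, the mapping is differentiable everywhere and leaves weights well inside the bound almost untouched (tanh(x) ≈ x near 0), while large magnitudes are pulled back toward ±3.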

Training Details

Detail Value
Dataset WikiText-103
Tokens seen ~15M
Epochs 50
Batch size 2 (effective 32 with gradient accumulation)
Learning rate 3e-4 (cosine annealing)
Optimizer AdamW (betas 0.9, 0.95)
Mixed precision FP16 autocast + GradScaler
Hardware NVIDIA T4 15GB
Training time 2.92 hours
FP32 size 2.14 GB (535,033,600 params × 4 bytes)
INT4 deployment size 255 MB
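The optimizer, schedule, and mixed-precision settings above correspond to a standard PyTorch setup. The following is a sketch under stated assumptions only: the `model` is a stand-in module, the loss is a placeholder, and stepping the cosine schedule once per epoch is assumed.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16  # card used FP16 on the T4

model = torch.nn.Linear(1280, 1280).to(device)   # stand-in for the full transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accum_steps = 16  # micro-batch 2 x 16 accumulation steps = effective batch 32

for step in range(accum_steps):
    batch = torch.randn(2, 1280, device=device)  # placeholder micro-batch
    with torch.autocast(device, dtype=amp_dtype):
        loss = model(batch).pow(2).mean()        # placeholder loss
    scaler.scale(loss / accum_steps).backward()  # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                   # one optimizer step per effective batch
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
scheduler.step()  # cosine annealing stepped once per epoch (assumed)
```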

Results

Perplexity (WikiText-103)

Epoch Train Loss Val Loss PPL
1 7.985 7.254 1413.75
5 6.116 5.931 376.41
10 5.630 5.508 246.76
15 5.365 5.280 196.34
20 5.182 4.995 147.72
25 5.035 4.848 127.47
30 4.879 4.760 116.80
35 4.781 4.672 106.86
40 4.760 4.612 100.73
45 4.716 4.574 96.91
50 4.726 4.610 100.46
Best - 4.533 93.04
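The PPL column is simply the exponential of the validation cross-entropy loss (measured in nats):

```python
import math

# Best validation loss 4.533 corresponds to perplexity exp(4.533)
best_ppl = math.exp(4.533)
print(round(best_ppl, 2))
```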

Weight Stability (50 epochs)

Metric Epoch 1 Epoch 25 Epoch 50
Weight std 0.0201 0.0205 0.0199
Weight abs_max 0.1183 0.1607 0.1130
Quantization levels used 14.8 / 15 14.9 / 15 15.0 / 15
Weights in [-3, 3] 100% 100% 100%
Gradient norm 2.22 0.61 0.60

Key observations:

  • Weight standard deviation remained essentially constant throughout training (0.020 ± 0.001)
  • Weight abs_max peaked at epoch 20 then decreased as training stabilized, demonstrating the self-correcting property of tanh clipping
  • All 15 quantization levels remained in use throughout training (no level collapse)
  • Gradient norms decreased naturally from 2.22 to 0.60, indicating healthy convergence
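The statistics in the table above can be computed per epoch with a few tensor operations. This is an illustrative sketch; the function and key names are not from the training code, and the per-tensor max-abs scale is an assumption.

```python
import torch

def weight_stats(w: torch.Tensor, qmax: int = 7, bound: float = 3.0) -> dict:
    """Per-tensor diagnostics mirroring the weight-stability table (illustrative)."""
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax, qmax)   # integer codes -7..+7
    return {
        "std": w.std().item(),
        "abs_max": w.abs().max().item(),
        "levels_used": torch.unique(q).numel(),     # out of 15 possible levels
        "frac_in_bound": (w.abs() <= bound).float().mean().item(),
    }
```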

Generation Samples (Epoch 50)

The model demonstrates syntactic fluency and structural coherence, though semantic depth remains limited due to dataset scale (15M tokens). Generated text follows grammatical English patterns with proper sentence structure.

Prompt: "The meaning of life is"

The meaning of life is unclear, but it is still a significant, mature, obvious, a man of other males who would find the way of the female. The man died on April 8, 1820, and the father of the family, John, died on March 2

Prompt: "In the history of"

In the history of the first two years of the period. In his second career in 1848, Mary was named for his junior year by the University of North America, and later, in 1875, was named after his mother, William, a member of the University

Key Contributions

  1. 4-bit QAT at scale: We demonstrate that a modern transformer (535M params) can be trained from scratch under 4-bit quantization constraints, with stable convergence across all 50 epochs
  2. Tanh soft weight clipping: A smooth, differentiable weight constraint method that bounds weight magnitudes during quantized training, showing self-correcting behavior (weights expand during learning, contract during convergence)
  3. Quantization level preservation: All 15 quantization levels remained fully utilized throughout 50 epochs of training with no level collapse, a non-trivial result for low-bit training
  4. Consumer hardware training: Entire training completed on a single NVIDIA T4 (15GB) in under 3 hours, demonstrating accessibility of low-bit training research

Related Work

This model builds on the research presented in:

"True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity" Shivnath Tathe, arXiv:2603.13931 Paper | Code

How to Use

This is a custom architecture (not a standard HuggingFace model). Load with PyTorch:

import torch

# Load checkpoint
ckpt = torch.load('T4NT_0.5B_tanh_50ep.pt', map_location='cpu')
print(f"Val Loss: {ckpt['val_loss']:.4f}")
print(f"PPL: {ckpt['ppl']:.2f}")
print(f"Config: {ckpt['config']}")

# Rebuild model architecture and load weights
# See config.json for architecture parameters
# Full training code available in the training logs
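The 255 MB INT4 deployment size implies packing two 4-bit codes per byte. The card does not include packing code; the following is a hypothetical roundtrip sketch (function names and the per-tensor scale are assumptions):

```python
import torch

def pack_int4(w: torch.Tensor, qmax: int = 7):
    """Quantize to 15 symmetric levels and pack two 4-bit codes per byte (sketch)."""
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = (torch.round(w / scale).clamp(-qmax, qmax) + 8).to(torch.int16)  # codes 1..15
    flat = q.flatten()
    if flat.numel() % 2:                              # pad to an even count
        flat = torch.cat([flat, flat.new_zeros(1)])
    packed = ((flat[0::2] << 4) | flat[1::2]).to(torch.uint8)
    return packed, scale

def unpack_int4(packed: torch.Tensor, scale: torch.Tensor, numel: int):
    """Invert pack_int4, dequantizing back to float32."""
    hi = (packed.to(torch.int16) >> 4) & 0xF
    lo = packed.to(torch.int16) & 0xF
    q = torch.stack([hi, lo], dim=1).flatten()[:numel] - 8
    return q.to(torch.float32) * scale
```

The roundtrip error is bounded by half the quantization step, since every weight lands within half a step of one of the 15 levels.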

Files

File Description
config.json Model architecture and training configuration
generation_config.json Default generation parameters
T4NT_0.5B_tanh_50ep.pt Model checkpoint (FP32 training weights, 2.14 GB)
t4nt_tanh_50ep_log.json Detailed training logs with per-epoch metrics and weight statistics

Limitations

  • Limited training data: Trained on ~15M tokens from WikiText-103. Production language models use 1T+ tokens. Perplexity reflects this data constraint, not an architectural ceiling.
  • Semantic depth: The model generates syntactically correct text but lacks deep semantic coherence due to limited data exposure.
  • Wikipedia distribution only: Training data is exclusively Wikipedia. The model has no exposure to conversational, code, or instructional text.
  • No instruction tuning: This is a base language model (next-token prediction only). It does not follow instructions or engage in dialogue.
  • No FP32 baseline comparison: This work focuses on demonstrating the feasibility and stability of 4-bit constrained training, not on matching FP32 accuracy. A matched FP32 baseline at the same scale and data budget is planned for the full paper.
  • Latent weights in floating point: While forward-pass computation uses 4-bit quantized weights, the optimizer maintains floating-point latent weights for gradient updates. This is standard for QAT but means training memory is not reduced to 4-bit levels.

Ablation Studies

Ablation experiments comparing tanh clipping vs hard clipping vs no clipping are in progress. Results will be added upon completion.

Citation

@misc{tathe2025t4nt,
  author = {Tathe, Shivnath},
  title = {T4NT-0.5B: Tanh 4-bit Neural Transformer},
  year = {2025},
  url = {https://huggingface.co/shivnathtathe/T4NT-0.5B}
}

@misc{tathe2026true4bit,
  author = {Tathe, Shivnath},
  title = {True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity},
  year = {2026},
  eprint = {2603.13931},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}

Author

Shivnath Tathe
Software Engineer at ISG eSolutions | Independent AI Researcher

