T4NT-0.5B
Tanh 4-bit Neural Transformer — a 535M parameter language model trained entirely from scratch with 4-bit quantization-aware training (QAT) using symmetric quantization and tanh soft weight clipping.
What is T4NT?
T4NT is a decoder-only transformer with a modern architecture (RoPE + SwiGLU + RMSNorm) where every linear layer is 4-bit quantization-constrained throughout training via straight-through estimators (STE). Weights pass through symmetric 4-bit quantization (15 levels, -7 to +7) during every forward pass, while latent weights are maintained in floating point for gradient updates. A novel tanh soft clipping method bounds weight magnitudes between optimizer steps.
No pretrained weights were used. No fine-tuning of existing models. Trained from random initialization on a single NVIDIA T4 GPU.
Architecture
| Component | T4NT-0.5B |
|---|---|
| Parameters | 535,033,600 (0.54B) |
| Embedding dim | 1280 |
| Attention heads | 16 |
| Layers | 20 |
| FFN dim | 3584 |
| Max sequence length | 256 |
| Position encoding | Rotary (RoPE) |
| FFN activation | SwiGLU |
| Normalization | RMSNorm |
| Weight quantization | 4-bit symmetric (15 levels) |
| Weight clipping | tanh(w/3) * 3 |
| Vocab size | 50,257 (GPT-2 tokenizer) |
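The parameter count can be reproduced from the table above. Assuming untied input/output embeddings, four attention projections, a three-matrix SwiGLU FFN, and two RMSNorms per layer plus a final norm (all assumptions consistent with the architecture listed), the figures add up exactly:

```python
d, n_layers, d_ffn, vocab = 1280, 20, 3584, 50257

attn = 4 * d * d              # Q, K, V and output projections
ffn = 3 * d * d_ffn           # SwiGLU: gate, up and down matrices
norms = 2 * n_layers * d + d  # two RMSNorms per layer + final norm
embeds = 2 * vocab * d        # untied token embedding + LM head

total = n_layers * (attn + ffn) + norms + embeds
print(total)  # 535033600, matching the reported 535,033,600
```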
Training Details
| Detail | Value |
|---|---|
| Dataset | WikiText-103 |
| Tokens seen | ~15M |
| Epochs | 50 |
| Batch size | 2 (effective 32 with gradient accumulation) |
| Learning rate | 3e-4 (cosine annealing) |
| Optimizer | AdamW (betas 0.9, 0.95) |
| Mixed precision | FP16 autocast + GradScaler |
| Hardware | NVIDIA T4 15GB |
| Training time | 2.92 hours |
| FP32 size | 2.04 GB |
| INT4 deployment size | 255 MB |
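The deployment size follows directly from packing two 4-bit weights per byte (a sketch of the arithmetic; the actual packing format is not specified in this card):

```python
params = 535_033_600
packed_bytes = params // 2              # two 4-bit weights per byte
print(round(packed_bytes / 2**20, 1))   # 255.1 MiB, matching the 255 MB figure
```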
Results
Perplexity (WikiText-103)
| Epoch | Train Loss | Val Loss | PPL |
|---|---|---|---|
| 1 | 7.985 | 7.254 | 1413.75 |
| 5 | 6.116 | 5.931 | 376.41 |
| 10 | 5.630 | 5.508 | 246.76 |
| 15 | 5.365 | 5.280 | 196.34 |
| 20 | 5.182 | 4.995 | 147.72 |
| 25 | 5.035 | 4.848 | 127.47 |
| 30 | 4.879 | 4.760 | 116.80 |
| 35 | 4.781 | 4.672 | 106.86 |
| 40 | 4.760 | 4.612 | 100.73 |
| 45 | 4.716 | 4.574 | 96.91 |
| 50 | 4.726 | 4.610 | 100.46 |
| Best | - | 4.533 | 93.04 |
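The PPL column is the exponential of the validation cross-entropy loss, which can be checked against the table:

```python
import math

# Perplexity = exp(cross-entropy loss); e.g. the best checkpoint:
print(round(math.exp(4.533), 2))  # 93.04, the best reported perplexity
```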
Weight Stability (50 epochs)
| Metric | Epoch 1 | Epoch 25 | Epoch 50 |
|---|---|---|---|
| Weight std | 0.0201 | 0.0205 | 0.0199 |
| Weight abs_max | 0.1183 | 0.1607 | 0.1130 |
| Quantization levels used | 14.8 / 15 | 14.9 / 15 | 15.0 / 15 |
| Weights in [-3, 3] | 100% | 100% | 100% |
| Gradient norm | 2.22 | 0.61 | 0.60 |
Key observations:
- Weight standard deviation remained essentially constant throughout training (0.020 ± 0.001)
- Weight abs_max peaked at epoch 20 then decreased as training stabilized, demonstrating the self-correcting property of tanh clipping
- All 15 quantization levels remained in use throughout training (no level collapse)
- Gradient norms decreased naturally from 2.22 to 0.60, indicating healthy convergence
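The level-usage metric above can be computed per weight tensor roughly as follows (a sketch, assuming per-tensor scales; the averaging that produces fractional values like 14.8 is an assumption):

```python
import torch

def levels_used(w: torch.Tensor, qmax: int = 7) -> int:
    # Count how many of the 15 integer codes in [-7, +7] a tensor occupies
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return int(q.unique().numel())
```

Averaging this count over all quantized linear layers would yield per-epoch figures like those in the table.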
Generation Samples (Epoch 50)
The model demonstrates syntactic fluency and structural coherence, though semantic depth remains limited due to dataset scale (15M tokens). Generated text follows grammatical English patterns with proper sentence structure.
Prompt: "The meaning of life is"
The meaning of life is unclear, but it is still a significant, mature, obvious, a man of other males who would find the way of the female. The man died on April 8, 1820, and the father of the family, John, died on March 2
Prompt: "In the history of"
In the history of the first two years of the period. In his second career in 1848, Mary was named for his junior year by the University of North America, and later, in 1875, was named after his mother, William, a member of the University
Key Contributions
- 4-bit QAT at scale: We demonstrate that a modern transformer (535M params) can be trained from scratch under 4-bit quantization constraints, converging stably with no divergence over 50 epochs
- Tanh soft weight clipping: A smooth, differentiable weight constraint method that bounds weight magnitudes during quantized training, showing self-correcting behavior (weights expand during learning, contract during convergence)
- Quantization level preservation: All 15 quantization levels remained fully utilized throughout 50 epochs of training with no level collapse, a non-trivial result for low-bit training
- Consumer hardware training: Entire training completed on a single NVIDIA T4 (15GB) in under 3 hours, demonstrating accessibility of low-bit training research
Related Work
This model builds on the research presented in:
"True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity", Shivnath Tathe, arXiv:2603.13931.
How to Use
This is a custom architecture (not a standard HuggingFace model). Load with PyTorch:
```python
import torch

# Load the checkpoint and inspect its stored metrics
ckpt = torch.load('T4NT_0.5B_tanh_50ep.pt', map_location='cpu')
print(f"Val Loss: {ckpt['val_loss']:.4f}")
print(f"PPL: {ckpt['ppl']:.2f}")
print(f"Config: {ckpt['config']}")

# Rebuild the model architecture and load the weights:
# see config.json for architecture parameters.
# Full training code is available in the training logs.
```
Files
| File | Description |
|---|---|
| `config.json` | Model architecture and training configuration |
| `generation_config.json` | Default generation parameters |
| `T4NT_0.5B_tanh_50ep.pt` | Model checkpoint (FP32 training weights, 2.14 GB) |
| `t4nt_tanh_50ep_log.json` | Detailed training logs with per-epoch metrics and weight statistics |
Limitations
- Limited training data: Trained on ~15M tokens from WikiText-103. Production language models use 1T+ tokens. Perplexity reflects this data constraint, not an architectural ceiling.
- Semantic depth: The model generates syntactically correct text but lacks deep semantic coherence due to limited data exposure.
- Wikipedia distribution only: Training data is exclusively Wikipedia. The model has no exposure to conversational, code, or instructional text.
- No instruction tuning: This is a base language model (next-token prediction only). It does not follow instructions or engage in dialogue.
- No FP32 baseline comparison: This work focuses on demonstrating the feasibility and stability of 4-bit constrained training, not on matching FP32 accuracy. A matched FP32 baseline at the same scale and data budget is planned for the full paper.
- Latent weights in floating point: While forward-pass computation uses 4-bit quantized weights, the optimizer maintains floating-point latent weights for gradient updates. This is standard for QAT but means training memory is not reduced to 4-bit levels.
Ablation Studies
Ablation experiments comparing tanh clipping vs hard clipping vs no clipping are in progress. Results will be added upon completion.
Citation
```bibtex
@misc{tathe2025t4nt,
  author = {Tathe, Shivnath},
  title  = {T4NT-0.5B: Tanh 4-bit Neural Transformer},
  year   = {2025},
  url    = {https://huggingface.co/shivnathtathe/T4NT-0.5B}
}

@misc{tathe2026true4bit,
  author        = {Tathe, Shivnath},
  title         = {True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity},
  year          = {2026},
  eprint        = {2603.13931},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}
```
Author
Shivnath Tathe, Software Engineer at ISG eSolutions | Independent AI Researcher
Google Scholar | arXiv | GitHub | LinkedIn