---
license: mit
language:
- en
base_model: microsoft/bitnet-b1.58-2B-4T-bf16
tags:
- bitnet
- ternary
- pruning
- quantization
- efficient-inference
- rpt
datasets:
- wikitext
pipeline_tag: text-generation
model-index:
- name: rpt-bitnet-2b-pruned
results:
- task:
type: text-generation
dataset:
name: WikiText-2
type: wikitext
metrics:
- name: Perplexity
type: perplexity
value: 16.39
---
# RPT BitNet 2B Pruned
**Ternary (1.58-bit) language model with 42.6% sparsity, improved via progressive pruning + QAT/STE.**
Based on [microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16), this model was fine-tuned using:
1. **Progressive magnitude pruning** (5% then 10%)
2. **Quantization-Aware Training with Straight-Through Estimator (QAT/STE)** - 300 steps per level
3. **Ternary snap** to {-1, 0, +1}
The result is a model that achieves **lower WikiText-2 perplexity than the baseline** after pruning and ternary quantization.
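The prune-then-snap steps above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the project's actual training code: `magnitude_prune` and `ternary_snap` are hypothetical names, and the per-tensor absmean scaling follows the BitNet b1.58 recipe.

```python
import torch

def magnitude_prune(w: torch.Tensor, frac: float) -> torch.Tensor:
    """Return a mask that zeroes the `frac` smallest-magnitude entries of w."""
    k = max(int(frac * w.numel()), 1)
    thresh = w.abs().flatten().kthvalue(k).values
    return w.abs() > thresh

def ternary_snap(w: torch.Tensor) -> torch.Tensor:
    """Absmean quantization to {-1, 0, +1} times a per-tensor scale."""
    scale = w.abs().mean().clamp(min=1e-8)
    return (w / scale).round().clamp(-1, 1) * scale

w = torch.randn(512, 512)                 # stand-in for a layer's latent weights
w_pruned = w * magnitude_prune(w, 0.10)   # second pruning level: 10%
w_q = ternary_snap(w_pruned)
print(f"sparsity: {(w_q == 0).float().mean():.1%}")
```

Note that the final sparsity exceeds the pruned fraction: the absmean round also snaps small surviving weights to zero, which is where the extra zeros in the results table come from.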
## Results
| Metric | Baseline | This Model | Change |
|--------|----------|------------|--------|
| PPL (WikiText-2) | 25.13 | **16.39** | **-34.8%** |
| Ternary weights | 100% | **100%** | - |
| Sparsity (zeros) | ~33% (natural) | **42.6%** | +9.6pp |
| GGUF size (I2_S) | ~1.3 GB | **~1.1 GB** | -15% |
| CPU inference | Coherent | **Coherent** | - |
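A minimal sliding-window perplexity routine (an illustrative sketch, not the script that produced the numbers above) might look like the following. It assumes a HuggingFace-style causal LM whose output exposes a mean next-token cross-entropy as `.loss` when called with `labels` (e.g. a model loaded as in the Usage section), with the token stream coming from `load_dataset("wikitext", "wikitext-2-raw-v1", split="test")`.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor, window: int = 2048) -> float:
    """Non-overlapping sliding-window perplexity over one long token stream.

    Assumes `model(chunk, labels=chunk).loss` is the mean next-token
    cross-entropy, as with HuggingFace causal-LM heads.
    """
    total_nll, total_tokens = 0.0, 0
    for i in range(0, input_ids.size(1), window):
        chunk = input_ids[:, i : i + window]
        if chunk.size(1) < 2:  # need at least one next-token target
            break
        n_targets = chunk.size(1) - 1
        total_nll += model(chunk, labels=chunk).loss.item() * n_targets
        total_tokens += n_targets
    return math.exp(total_nll / total_tokens)
```

Non-overlapping windows give a slightly pessimistic PPL compared to a strided evaluation; either way, the same protocol must be used for baseline and pruned model for the comparison to hold.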
## Key Finding
Removing 10% of weights by magnitude from a ternary model **improves** perplexity by 34.8% after QAT/STE fine-tuning. This is counter-intuitive: pruning typically degrades models. In the ternary regime, low-magnitude weights appear to be noise that harms performance.
## Sample Outputs (GGUF I2_S, bitnet.cpp CPU)
```
Prompt: "The capital of France is"
Output: "Paris. There are also some cities that can be considered as their main cities,
such as the city that has been capital of France since the 17th century."
Prompt: "Water boils at"
Output: "100 degrees Celsius (212 degrees Fahrenheit) at standard atmospheric pressure."
Prompt: "The largest planet in the solar system is"
Output: "Jupiter. It is a gas giant planet that is about 318 Earths in size."
```
## Usage
### With PyTorch (HuggingFace Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"CesarFavero/rpt-bitnet-2b-pruned",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("CesarFavero/rpt-bitnet-2b-pruned")
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With bitnet.cpp (CPU inference)
For CPU inference, use the GGUF I2_S file with [bitnet.cpp](https://github.com/microsoft/BitNet):
```bash
# Clone BitNet fork
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
# Build (requires clang, not gcc)
python setup_env.py --hf-repo CesarFavero/rpt-bitnet-2b-pruned-GGUF -q i2_s
# Run
python run_inference.py -m models/rpt-bitnet-2b-pruned/ggml-model-i2_s.gguf \
-p "The capital of France is" -n 50
```
**Important**: The I2_S format requires the BitNet fork of llama.cpp. Standard llama.cpp and llama-cpp-python do NOT support this format.
## Training Details
| Parameter | Value |
|-----------|-------|
| Base model | microsoft/bitnet-b1.58-2B-4T-bf16 |
| Parameters | 2.4B (100% ternary) |
| Optimizer | AdamW (lr=5e-4, wd=0.01) |
| Technique | QAT with STE |
| Pruning | Progressive magnitude (5% -> 10%) |
| Steps/level | 300 |
| Batch size | 8 |
| Seq length | 128 |
| Hardware | NVIDIA A100 (~40GB) |
| Training time | ~7 minutes of GPU time |
| Dataset | WikiText-2 |
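The QAT/STE step from the table can be sketched as follows. This is illustrative: `TernarySTE` is a hypothetical name, and the real training loop operates on the full model rather than a single matrix. The key idea is that the forward pass sees ternary weights while gradients flow unchanged to the latent full-precision weights that AdamW updates.

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Forward: absmean ternary quantization. Backward: identity (straight-through)."""
    @staticmethod
    def forward(ctx, w):
        scale = w.abs().mean().clamp(min=1e-8)
        return (w / scale).round().clamp(-1, 1) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # pass gradients straight through the quantizer

# One QAT step on a toy weight matrix: the optimizer updates the latent
# full-precision weights, while the loss is computed with ternary weights.
w = torch.randn(64, 64, requires_grad=True)
opt = torch.optim.AdamW([w], lr=5e-4, weight_decay=0.01)  # hyperparams from the table
x = torch.randn(8, 64)
loss = (x @ TernarySTE.apply(w)).pow(2).mean()
loss.backward()
opt.step()
```

Without the straight-through trick, the `round()` in the forward pass would have zero gradient almost everywhere and the latent weights would never move.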
### Pipeline
```
microsoft/bitnet-b1.58-2B-4T-bf16
-> Progressive pruning (5% then 10% by magnitude)
-> QAT/STE fine-tune (300 steps per level)
-> Ternary snap to {-1, 0, +1}
-> Save HuggingFace format (this model)
-> Convert to GGUF I2_S (see GGUF variant)
-> Inference via bitnet.cpp (CPU)
```
## Limitations
- **Evaluation limited to WikiText-2**: The PPL improvement needs validation on broader benchmarks (MMLU, HellaSwag, ARC)
- **Short context**: only tested with sequences up to 128 tokens during training
- **I2_S format support**: GGUF variant requires BitNet fork of llama.cpp (not standard llama.cpp)
- **Language**: Primarily tested on English text
- **PPL improvement caveat**: The dramatic PPL improvement after ternary snap (33.07 -> 16.39) may reflect implicit regularization rather than genuine capability improvement. Broader benchmarks needed to confirm.
## Part of RPT (Redes Preditivas Termodinamicas, "Thermodynamic Predictive Networks")
This model was produced as part of the RPT project, which validates physics-inspired principles for neural network efficiency:
- **Landauer's principle**: Sparsity (removing information) improves model quality
- **Self-Organized Criticality**: The model naturally operates at the edge of chaos (Lyapunov exponent ~ 0)
- **Predictive Coding**: Correction ratios decrease with depth (39.87 -> 0.21)
Full documentation: [RPT Project](https://github.com/CesarFavero/rpt-bitnet-2b-pruned)
## Citation
```bibtex
@misc{rpt2026,
  title={Sparsity Improves Ternary Language Models: Evidence from BitNet b1.58},
  author={Cesar and Claude},
  year={2026},
  note={RPT - Redes Preditivas Termodinamicas}
}
```
## License
MIT (same as base model)