---
license: mit
language:
- en
base_model: microsoft/bitnet-b1.58-2B-4T-bf16
tags:
- bitnet
- ternary
- pruning
- quantization
- efficient-inference
- rpt
datasets:
- wikitext
pipeline_tag: text-generation
model-index:
- name: rpt-bitnet-2b-pruned
  results:
  - task:
      type: text-generation
    dataset:
      name: WikiText-2
      type: wikitext
    metrics:
    - name: Perplexity
      type: perplexity
      value: 16.39
---

# RPT BitNet 2B Pruned

**Ternary (1.58-bit) language model with 42.6% sparsity, improved via progressive pruning + QAT/STE.**

Based on [microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16), this model was fine-tuned using:

1. **Progressive magnitude pruning** (5% then 10%)
2. **Quantization-Aware Training with Straight-Through Estimator (QAT/STE)** - 300 steps per level
3. **Ternary snap** to {-1, 0, +1}

The result is a model with **lower WikiText-2 perplexity than the baseline**, despite pruning and ternary quantization.
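The three steps above can be sketched in a few lines. This is an illustrative minimal version, not the project's actual training code; the snap follows the absmean recipe described for BitNet b1.58:

```python
import torch

def magnitude_prune(weight: torch.Tensor, frac: float) -> torch.Tensor:
    """Zero out the `frac` smallest-magnitude entries of `weight`."""
    k = int(frac * weight.numel())
    if k == 0:
        return weight.clone()
    # kthvalue is 1-indexed; the threshold is the k-th smallest magnitude
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() <= threshold,
                       torch.zeros_like(weight), weight)

def ternary_snap(weight: torch.Tensor) -> torch.Tensor:
    """Snap to {-1, 0, +1} using the absmean scale (BitNet b1.58 style)."""
    scale = weight.abs().mean().clamp(min=1e-8)
    return (weight / scale).round().clamp(-1, 1)
```

Progressive pruning applies `magnitude_prune` at increasing fractions (5%, then 10%), with a QAT/STE fine-tuning phase between levels before the final snap.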

## Results

| Metric | Baseline | This Model | Change |
|--------|----------|------------|--------|
| PPL (WikiText-2) | 25.13 | **16.39** | **-34.8%** |
| Ternary weights | 100% | **100%** | - |
| Sparsity (zeros) | ~33% (natural) | **42.6%** | +9.6pp |
| GGUF size (I2_S) | ~1.3 GB | **~1.1 GB** | -15% |
| CPU inference | Coherent | **Coherent** | - |
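Numbers like the sparsity figure can be sanity-checked directly from a checkpoint. A minimal sketch, assuming the quantized weights are materialized as unscaled {-1, 0, +1} tensors (which may differ from how a given checkpoint packs or scales them):

```python
import torch

def ternary_stats(state_dict):
    """Verify weights are ternary and return the zero fraction (sparsity)."""
    zeros, total = 0, 0
    for name, w in state_dict.items():
        vals = torch.unique(w)
        # Every distinct value must be -1, 0, or +1
        assert torch.all((vals == -1) | (vals == 0) | (vals == 1)), name
        zeros += int((w == 0).sum())
        total += w.numel()
    return zeros / total

# Toy example: 4 zeros out of 8 weights -> sparsity 0.5
sd = {"layer.weight": torch.tensor([[-1., 0., 1., 0.],
                                    [0., 1., -1., 0.]])}
print(ternary_stats(sd))
```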

## Key Finding

Removing 10% of weights by magnitude from a ternary model **improves** perplexity by 34.8% after QAT/STE fine-tuning. This is counter-intuitive: pruning typically degrades models. In the ternary regime, low-magnitude weights appear to be noise that harms performance.

## Sample Outputs (GGUF I2_S, bitnet.cpp CPU)

```
Prompt: "The capital of France is"
Output: "Paris. There are also some cities that can be considered as their main cities,
such as the city that has been capital of France since the 17th century."

Prompt: "Water boils at"
Output: "100 degrees Celsius (212 degrees Fahrenheit) at standard atmospheric pressure."

Prompt: "The largest planet in the solar system is"
Output: "Jupiter. It is a gas giant planet that is about 318 Earths in size."
```

## Usage

### With PyTorch (HuggingFace Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "CesarFavero/rpt-bitnet-2b-pruned",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("CesarFavero/rpt-bitnet-2b-pruned")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With bitnet.cpp (CPU inference)

For CPU inference, use the GGUF I2_S file with [bitnet.cpp](https://github.com/microsoft/BitNet):

```bash
# Clone BitNet fork
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Set up the environment and fetch the GGUF model (building requires clang, not gcc)
python setup_env.py --hf-repo CesarFavero/rpt-bitnet-2b-pruned-GGUF -q i2_s

# Run
python run_inference.py -m models/rpt-bitnet-2b-pruned/ggml-model-i2_s.gguf \
    -p "The capital of France is" -n 50
```

**Important**: The I2_S format requires the BitNet fork of llama.cpp. Standard llama.cpp and llama-cpp-python do NOT support this format.

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | microsoft/bitnet-b1.58-2B-4T-bf16 |
| Parameters | 2.4B (100% ternary) |
| Optimizer | AdamW (lr=5e-4, wd=0.01) |
| Technique | QAT with STE |
| Pruning | Progressive magnitude (5% -> 10%) |
| Steps/level | 300 |
| Batch size | 8 |
| Seq length | 128 |
| Hardware | NVIDIA A100 (~40GB) |
| Training time | ~7 min (GPU time) |
| Dataset | WikiText-2 |

### Pipeline

```
microsoft/bitnet-b1.58-2B-4T-bf16
    -> Progressive pruning (5% then 10% by magnitude)
    -> QAT/STE fine-tune (300 steps per level)
    -> Ternary snap to {-1, 0, +1}
    -> Save HuggingFace format (this model)
    -> Convert to GGUF I2_S (see GGUF variant)
    -> Inference via bitnet.cpp (CPU)
```
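The QAT/STE stage of this pipeline can be sketched with a custom autograd function: ternary (absmean) quantization in the forward pass, identity gradient in the backward pass, so updates flow to the latent full-precision weights. An illustrative sketch, not the project's training code:

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Ternary quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, w):
        # Absmean ternarization (BitNet b1.58 style), rescaled back
        scale = w.abs().mean().clamp(min=1e-8)
        return (w / scale).round().clamp(-1, 1) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat quantization as the identity so gradients reach
        # the latent full-precision weights
        return grad_output

w = torch.randn(4, 4, requires_grad=True)
loss = TernarySTE.apply(w).sum()
loss.backward()
# Every entry of w.grad is 1: round/clamp were skipped in the backward pass
```

During QAT, the forward pass always sees quantized weights, while the optimizer updates the full-precision copy; the final ternary snap then makes the quantization permanent.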

## Limitations

- **Evaluation limited to WikiText-2**: The PPL improvement needs validation on broader benchmarks (MMLU, HellaSwag, ARC)
- **Short context tested**: Only tested with sequences up to 128 tokens during training
- **I2_S format support**: GGUF variant requires BitNet fork of llama.cpp (not standard llama.cpp)
- **Language**: Primarily tested on English text
- **PPL improvement caveat**: The dramatic PPL improvement after ternary snap (33.07 -> 16.39) may reflect implicit regularization rather than genuine capability improvement. Broader benchmarks needed to confirm.

## Part of RPT (Redes Preditivas Termodinamicas)

This model was produced as part of the RPT project (Redes Preditivas Termodinamicas, "Thermodynamic Predictive Networks"), which validates physics-inspired principles for neural network efficiency:

- **Landauer's principle**: Sparsity (removing information) improves model quality
- **Self-Organized Criticality**: The model naturally operates at the edge of chaos (Lyapunov exponent ~ 0)
- **Predictive Coding**: Correction ratios decrease with depth (39.87 -> 0.21)

Full documentation: [RPT Project](https://github.com/CesarFavero/rpt-bitnet-2b-pruned)

## Citation

```bibtex
@misc{rpt2026,
    title={Sparsity Improves Ternary Language Models: Evidence from BitNet b1.58},
    author={Cesar and Claude},
    year={2026},
    note={RPT - Redes Preditivas Termodinamicas}
}
```

## License

MIT (same as base model)