---
license: mit
language:
- en
base_model: microsoft/bitnet-b1.58-2B-4T-bf16
tags:
- bitnet
- ternary
- pruning
- quantization
- efficient-inference
- rpt
datasets:
- wikitext
pipeline_tag: text-generation
model-index:
- name: rpt-bitnet-2b-pruned
  results:
  - task:
      type: text-generation
    dataset:
      name: WikiText-2
      type: wikitext
    metrics:
    - name: Perplexity
      type: perplexity
      value: 16.39
---

# RPT BitNet 2B Pruned

**Ternary (1.58-bit) language model with 42.6% sparsity, improved via progressive pruning + QAT/STE.**

Based on [microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16), this model was fine-tuned using:

1. **Progressive magnitude pruning** (5%, then 10%)
2. **Quantization-Aware Training with a Straight-Through Estimator (QAT/STE)**: 300 steps per pruning level
3. **Ternary snap** to {-1, 0, +1}

The result is a model that outperforms the baseline after pruning and ternary quantization.

## Results

| Metric | Baseline | This Model | Change |
|--------|----------|------------|--------|
| PPL (WikiText-2) | 25.13 | **16.39** | **-34.8%** |
| Ternary weights | 100% | **100%** | - |
| Sparsity (zeros) | ~33% (natural) | **42.6%** | +9.6 pp |
| GGUF size (I2_S) | ~1.3 GB | **~1.1 GB** | -15% |
| CPU inference | Coherent | **Coherent** | - |

## Key Finding

Removing 10% of the weights by magnitude from a ternary model **improves** perplexity by 34.8% after QAT/STE fine-tuning. This is counter-intuitive: pruning typically degrades models. In the ternary regime, low-magnitude weights appear to act as noise that harms performance.

## Sample Outputs (GGUF I2_S, bitnet.cpp CPU)

```
Prompt: "The capital of France is"
Output: "Paris. There are also some cities that can be considered as their main cities, such as the city that has been capital of France since the 17th century."

Prompt: "Water boils at"
Output: "100 degrees Celsius (212 degrees Fahrenheit) at standard atmospheric pressure."
Prompt: "The largest planet in the solar system is"
Output: "Jupiter. It is a gas giant planet that is about 318 Earths in size."
```

## Usage

### With PyTorch (Hugging Face Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "CesarFavero/rpt-bitnet-2b-pruned",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("CesarFavero/rpt-bitnet-2b-pruned")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With bitnet.cpp (CPU inference)

For CPU inference, use the GGUF I2_S file with [bitnet.cpp](https://github.com/microsoft/BitNet):

```bash
# Clone the BitNet repository
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Build and fetch the model (requires clang, not gcc)
python setup_env.py --hf-repo CesarFavero/rpt-bitnet-2b-pruned-GGUF -q i2_s

# Run inference
python run_inference.py -m models/rpt-bitnet-2b-pruned/ggml-model-i2_s.gguf \
    -p "The capital of France is" -n 50
```

**Important**: The I2_S format requires the BitNet fork of llama.cpp. Standard llama.cpp and llama-cpp-python do NOT support this format.
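The sparsity and ternary-weight figures reported above can be checked directly from a loaded checkpoint. Below is a minimal sketch; the `ternary_stats` helper is illustrative (not part of this repository) and assumes ternary layers store exact zeros in their 2-D weight matrices:

```python
import torch

def ternary_stats(state_dict):
    """Return the fraction of exactly-zero entries across 2-D weight matrices.

    Biases and norm vectors (ndim < 2) are skipped, since only the
    matmul weights are ternary.
    """
    total = zeros = 0
    for name, w in state_dict.items():
        if w.ndim < 2:
            continue
        total += w.numel()
        zeros += (w == 0).sum().item()
    return zeros / max(total, 1)

# Synthetic example: a small ternary weight matrix with 3 zeros out of 8.
w = torch.tensor([[1.0, 0.0, -1.0, 0.0],
                  [0.0, 1.0, 1.0, -1.0]])
print(ternary_stats({"layer.weight": w}))  # 3/8 -> 0.375
```

With the real model, pass `model.state_dict()` from the PyTorch usage snippet above; a value near 0.426 would match the reported sparsity.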
## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | microsoft/bitnet-b1.58-2B-4T-bf16 |
| Parameters | 2.4B (100% ternary) |
| Optimizer | AdamW (lr=5e-4, wd=0.01) |
| Technique | QAT with STE |
| Pruning | Progressive magnitude (5% -> 10%) |
| Steps per level | 300 |
| Batch size | 8 |
| Sequence length | 128 |
| Hardware | NVIDIA A100 (~40 GB) |
| Training time | ~7 minutes of GPU time |
| Dataset | WikiText-2 |

### Pipeline

```
microsoft/bitnet-b1.58-2B-4T-bf16
  -> Progressive pruning (5%, then 10%, by magnitude)
  -> QAT/STE fine-tune (300 steps per level)
  -> Ternary snap to {-1, 0, +1}
  -> Save in HuggingFace format (this model)
  -> Convert to GGUF I2_S (see the GGUF variant)
  -> Inference via bitnet.cpp (CPU)
```

## Limitations

- **Evaluation limited to WikiText-2**: the PPL improvement needs validation on broader benchmarks (MMLU, HellaSwag, ARC)
- **Short context**: only tested with sequences up to 128 tokens during training
- **I2_S format support**: the GGUF variant requires the BitNet fork of llama.cpp (not standard llama.cpp)
- **Language**: primarily tested on English text
- **PPL improvement caveat**: the dramatic PPL improvement after the ternary snap (33.07 -> 16.39) may reflect implicit regularization rather than a genuine capability gain; broader benchmarks are needed to confirm
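The two core operations in the pipeline above can be sketched as follows. This is an illustrative reconstruction, not the training script used for this model: `TernarySTE` quantizes weights to {-s, 0, +s} in the forward pass while letting gradients flow through unchanged (the straight-through estimator), and `magnitude_prune` zeros the smallest-magnitude fraction of a weight tensor. The per-tensor absmean scale follows the BitNet b1.58 convention, but the exact scaling used in training is an assumption here.

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Ternary quantization with a straight-through gradient estimator."""

    @staticmethod
    def forward(ctx, w):
        s = w.abs().mean()  # per-tensor absmean scale (assumed, as in BitNet b1.58)
        return torch.clamp(torch.round(w / (s + 1e-8)), -1, 1) * s

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through: ignore the non-differentiable round()

def magnitude_prune(w, frac):
    """Zero out the `frac` smallest-magnitude entries of w."""
    k = int(frac * w.numel())
    if k == 0:
        return w
    thresh = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() <= thresh, torch.zeros_like(w), w)

w = torch.randn(64, 64, requires_grad=True)
pruned = magnitude_prune(w.detach(), 0.10)  # one 10% pruning level
q = TernarySTE.apply(w)                     # forward pass sees ternary weights
q.sum().backward()                          # gradients still reach w
```

In an actual QAT loop, each linear layer's weights would pass through `TernarySTE.apply` on every forward step, with pruning masks reapplied between levels.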
## Part of RPT (Redes Preditivas Termodinamicas)

This model was produced as part of the RPT project ("Thermodynamic Predictive Networks"), which explores physics-inspired principles for neural network efficiency:

- **Landauer's principle**: sparsity (removing information) improves model quality
- **Self-Organized Criticality**: the model naturally operates at the edge of chaos (Lyapunov exponent ~ 0)
- **Predictive Coding**: correction ratios decrease with depth (39.87 -> 0.21)

Full documentation: [RPT Project](https://github.com/CesarFavero/rpt-bitnet-2b-pruned)

## Citation

```bibtex
@misc{rpt2026,
  title={Sparsity Improves Ternary Language Models: Evidence from BitNet b1.58},
  author={Cesar and Claude},
  year={2026},
  note={RPT - Redes Preditivas Termodinamicas}
}
```

## License

MIT (same as the base model)