---
license: mit
language:
- en
base_model: microsoft/bitnet-b1.58-2B-4T-bf16
tags:
- bitnet
- ternary
- pruning
- quantization
- efficient-inference
- rpt
datasets:
- wikitext
pipeline_tag: text-generation
model-index:
- name: rpt-bitnet-2b-pruned
  results:
  - task:
      type: text-generation
    dataset:
      name: WikiText-2
      type: wikitext
    metrics:
    - name: Perplexity
      type: perplexity
      value: 16.39
---

# RPT BitNet 2B Pruned

**Ternary (1.58-bit) language model with 42.6% sparsity, improved via progressive pruning plus QAT/STE.**

Based on [microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16), this model was fine-tuned using:

1. **Progressive magnitude pruning** (5%, then 10%)
2. **Quantization-Aware Training with a Straight-Through Estimator (QAT/STE)**: 300 steps per pruning level
3. **Ternary snap** to {-1, 0, +1}

The result is a model that **outperforms the baseline** after pruning and ternary quantization.
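Step 2 hinges on the straight-through estimator: the forward pass uses snapped ternary weights, while the backward pass bypasses the non-differentiable round so gradients reach the latent full-precision weights. A minimal sketch, assuming the absmean-style ternary quantizer described in the BitNet b1.58 literature (not the project's actual training code):

```python
import torch

def ternary_quantize_ste(w: torch.Tensor) -> torch.Tensor:
    """Absmean ternary quantization with a straight-through estimator.

    Forward: snap weights to scale * {-1, 0, +1}.
    Backward: round/clamp are bypassed, so gradients flow to `w` unchanged.
    """
    scale = w.abs().mean().clamp(min=1e-8)
    q = (w / scale).round().clamp(-1, 1) * scale
    return w + (q - w).detach()  # value of q, gradient of w

w = torch.randn(4, 4, requires_grad=True)
q = ternary_quantize_ste(w)
q.sum().backward()
print(torch.allclose(w.grad, torch.ones_like(w)))  # True: gradients pass straight through
```

The `w + (q - w).detach()` idiom is what makes QAT possible here: autograd sees the identity function, while the loss is computed on the snapped ternary values.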

## Results

| Metric | Baseline | This Model | Change |
|--------|----------|------------|--------|
| PPL (WikiText-2) | 25.13 | **16.39** | **-34.8%** |
| Ternary weights | 100% | **100%** | - |
| Sparsity (zeros) | ~33% (natural) | **42.6%** | +9.6 pp |
| GGUF size (I2_S) | ~1.3 GB | **~1.1 GB** | -15% |
| CPU inference | Coherent | **Coherent** | - |
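Perplexity here is the exponential of the mean per-token cross-entropy (in nats) over WikiText-2. A toy illustration of the relationship, with made-up loss values chosen only to show the formula (they are not the actual measurements):

```python
import math

# Perplexity = exp(mean per-token negative log-likelihood, in nats).
# The values below are hypothetical, purely to illustrate the formula.
nll_per_token = [2.9, 2.7, 2.8, 2.8]
ppl = math.exp(sum(nll_per_token) / len(nll_per_token))
print(round(ppl, 2))  # 16.44
```

A mean loss of roughly 2.8 nats/token corresponds to a perplexity in the mid-16s, the range reported in the table.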

## Key Finding

Removing 10% of weights by magnitude from a ternary model **improves** perplexity by 34.8% after QAT/STE fine-tuning. This is counter-intuitive: pruning typically degrades models. In the ternary regime, low-magnitude weights appear to act as noise that harms performance.
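The pruning step itself can be pictured with a toy tensor (an illustrative sketch, not the project's training code): magnitude pruning zeroes the smallest-magnitude weights, which raises the fraction of exact zeros reported as sparsity in the Results table.

```python
import torch

def magnitude_prune(w: torch.Tensor, fraction: float) -> torch.Tensor:
    """Zero out the `fraction` of weights with the smallest magnitudes."""
    k = int(w.numel() * fraction)
    if k == 0:
        return w.clone()
    # k-th smallest absolute value serves as the pruning threshold
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() <= threshold, torch.zeros_like(w), w)

torch.manual_seed(0)
w = torch.randn(1000)
pruned = magnitude_prune(w, 0.10)
sparsity = (pruned == 0).float().mean().item()
print(f"{sparsity:.1%}")  # 10.0%
```

In the actual pipeline this is applied progressively (5%, then 10%) with QAT/STE fine-tuning after each level, so the model can recover from each round of removals.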

## Sample Outputs (GGUF I2_S, bitnet.cpp CPU)

```
Prompt: "The capital of France is"
Output: "Paris. There are also some cities that can be considered as their main cities,
such as the city that has been capital of France since the 17th century."

Prompt: "Water boils at"
Output: "100 degrees Celsius (212 degrees Fahrenheit) at standard atmospheric pressure."

Prompt: "The largest planet in the solar system is"
Output: "Jupiter. It is a gas giant planet that is about 318 Earths in size."
```

## Usage

### With PyTorch (Hugging Face Transformers)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "CesarFavero/rpt-bitnet-2b-pruned",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("CesarFavero/rpt-bitnet-2b-pruned")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With bitnet.cpp (CPU inference)

For CPU inference, use the GGUF I2_S file with [bitnet.cpp](https://github.com/microsoft/BitNet):

```bash
# Clone the BitNet repository (a fork of llama.cpp)
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Download the model and build (requires clang, not gcc)
python setup_env.py --hf-repo CesarFavero/rpt-bitnet-2b-pruned-GGUF -q i2_s

# Run inference
python run_inference.py -m models/rpt-bitnet-2b-pruned/ggml-model-i2_s.gguf \
    -p "The capital of France is" -n 50
```

**Important**: The I2_S format requires the BitNet fork of llama.cpp. Standard llama.cpp and llama-cpp-python do **not** support this format.

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | microsoft/bitnet-b1.58-2B-4T-bf16 |
| Parameters | 2.4B (100% ternary) |
| Optimizer | AdamW (lr=5e-4, wd=0.01) |
| Technique | QAT with STE |
| Pruning | Progressive magnitude (5% -> 10%) |
| Steps per level | 300 |
| Batch size | 8 |
| Sequence length | 128 |
| Hardware | NVIDIA A100 (~40 GB) |
| Training time | ~7 GPU-minutes |
| Dataset | WikiText-2 |
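Pieced together from the table, one pruning level's QAT loop might look like the toy sketch below. The tiny linear model, random data, and STE layer are illustrative stand-ins, not the project's actual training script; only the optimizer settings and step count come from the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy QAT/STE loop for one pruning level, mirroring the table's
# hyperparameters (AdamW, lr=5e-4, wd=0.01, 300 steps per level).
class STELinear(nn.Linear):
    def forward(self, x):
        # Absmean ternary snap with a straight-through estimator
        scale = self.weight.abs().mean().clamp(min=1e-8)
        q = (self.weight / scale).round().clamp(-1, 1) * scale
        w = self.weight + (q - self.weight).detach()
        return F.linear(x, w, self.bias)

torch.manual_seed(0)
model = STELinear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)
x, y = torch.randn(128, 8), torch.randn(128, 1)
for _ in range(300):  # 300 steps per pruning level
    loss = F.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the real pipeline the same pattern repeats once per pruning level (after the 5% prune, then after the 10% prune), followed by a final ternary snap.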

### Pipeline

```
microsoft/bitnet-b1.58-2B-4T-bf16
-> Progressive pruning (5%, then 10%, by magnitude)
-> QAT/STE fine-tune (300 steps per level)
-> Ternary snap to {-1, 0, +1}
-> Save in Hugging Face format (this model)
-> Convert to GGUF I2_S (see the GGUF variant)
-> Inference via bitnet.cpp (CPU)
```

## Limitations

- **Evaluation limited to WikiText-2**: the PPL improvement needs validation on broader benchmarks (MMLU, HellaSwag, ARC).
- **Short context**: only exercised with sequences up to 128 tokens during training.
- **I2_S format support**: the GGUF variant requires the BitNet fork of llama.cpp, not standard llama.cpp.
- **Language**: primarily tested on English text.
- **PPL improvement caveat**: the dramatic PPL improvement after the ternary snap (33.07 -> 16.39) may reflect implicit regularization rather than a genuine capability gain; broader benchmarks are needed to confirm.

## Part of RPT (Redes Preditivas Termodinamicas)

This model was produced as part of the RPT project, which validates physics-inspired principles for neural network efficiency:

- **Landauer's principle**: sparsity (removing information) improves model quality
- **Self-Organized Criticality**: the model naturally operates at the edge of chaos (Lyapunov exponent ~ 0)
- **Predictive Coding**: correction ratios decrease with depth (39.87 -> 0.21)

Full documentation: [RPT Project](https://github.com/CesarFavero/rpt-bitnet-2b-pruned)

## Citation

```bibtex
@misc{rpt2026,
  title={Sparsity Improves Ternary Language Models: Evidence from BitNet b1.58},
  author={Cesar and Claude},
  year={2026},
  note={RPT - Redes Preditivas Termodinamicas}
}
```

## License

MIT (same as the base model)