---
license: mit
language:
- en
base_model: microsoft/bitnet-b1.58-2B-4T-bf16
tags:
- bitnet
- ternary
- pruning
- quantization
- efficient-inference
- rpt
datasets:
- wikitext
pipeline_tag: text-generation
model-index:
- name: rpt-bitnet-2b-pruned
  results:
  - task:
      type: text-generation
    dataset:
      name: WikiText-2
      type: wikitext
    metrics:
    - name: Perplexity
      type: perplexity
      value: 16.39
---

# RPT BitNet 2B Pruned

**Ternary (1.58-bit) language model with 42.6% sparsity, improved via progressive pruning + QAT/STE.**

Based on [microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16), this model was fine-tuned using:

1. **Progressive magnitude pruning** (5%, then 10%)
2. **Quantization-Aware Training with a Straight-Through Estimator (QAT/STE)**: 300 steps per pruning level
3. **Ternary snap** to {-1, 0, +1}

The result is a model that outperforms the baseline after pruning and ternary quantization.

## Results

| Metric | Baseline | This Model | Change |
|--------|----------|------------|--------|
| PPL (WikiText-2) | 25.13 | **16.39** | **-34.8%** |
| Ternary weights | 100% | **100%** | - |
| Sparsity (zeros) | ~33% (natural) | **42.6%** | +9.6 pp |
| GGUF size (I2_S) | ~1.3 GB | **~1.1 GB** | -15% |
| CPU inference | Coherent | **Coherent** | - |

## Key Finding

Removing 10% of the weights by magnitude from a ternary model **improves** perplexity by 34.8% after QAT/STE fine-tuning. This is counter-intuitive: pruning typically degrades models. In the ternary regime, low-magnitude weights appear to act as noise that harms performance.

## Sample Outputs (GGUF I2_S, bitnet.cpp CPU)

```
Prompt: "The capital of France is"
Output: "Paris. There are also some cities that can be considered as their main cities, such as the city that has been capital of France since the 17th century."

Prompt: "Water boils at"
Output: "100 degrees Celsius (212 degrees Fahrenheit) at standard atmospheric pressure."
Prompt: "The largest planet in the solar system is"
Output: "Jupiter. It is a gas giant planet that is about 318 Earths in size."
```

## Usage

### With PyTorch (Hugging Face Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "CesarFavero/rpt-bitnet-2b-pruned",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("CesarFavero/rpt-bitnet-2b-pruned")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With bitnet.cpp (CPU inference)

For CPU inference, use the GGUF I2_S file with [bitnet.cpp](https://github.com/microsoft/BitNet):

```bash
# Clone the BitNet repository
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Build and fetch the model (requires clang, not gcc)
python setup_env.py --hf-repo CesarFavero/rpt-bitnet-2b-pruned-GGUF -q i2_s

# Run inference
python run_inference.py -m models/rpt-bitnet-2b-pruned/ggml-model-i2_s.gguf \
    -p "The capital of France is" -n 50
```

**Important**: The I2_S format requires the BitNet fork of llama.cpp. Standard llama.cpp and llama-cpp-python do NOT support this format.
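The sparsity and ternary-weight figures reported above can be checked directly from a loaded checkpoint. Below is a minimal sketch; the `ternary_stats` helper is illustrative (not part of this repository) and assumes ternary layers store exact zeros in their 2-D weight matrices:

```python
import torch

def ternary_stats(state_dict):
    """Return the fraction of exactly-zero entries across 2-D weight matrices.

    Biases and norm vectors (ndim < 2) are skipped, since only the
    matmul weights are ternary.
    """
    total = zeros = 0
    for name, w in state_dict.items():
        if w.ndim < 2:
            continue
        total += w.numel()
        zeros += (w == 0).sum().item()
    return zeros / max(total, 1)

# Synthetic example: a small ternary weight matrix with 3 zeros out of 8.
w = torch.tensor([[1.0, 0.0, -1.0, 0.0],
                  [0.0, 1.0, 1.0, -1.0]])
print(ternary_stats({"layer.weight": w}))  # 3/8 -> 0.375
```

With the real model, pass `model.state_dict()` from the PyTorch usage snippet above; a value near 0.426 would match the reported sparsity.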
## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | microsoft/bitnet-b1.58-2B-4T-bf16 |
| Parameters | 2.4B (100% ternary) |
| Optimizer | AdamW (lr=5e-4, wd=0.01) |
| Technique | QAT with STE |
| Pruning | Progressive magnitude (5% -> 10%) |
| Steps per level | 300 |
| Batch size | 8 |
| Sequence length | 128 |
| Hardware | NVIDIA A100 (~40 GB) |
| Training time | ~7 minutes of GPU time |
| Dataset | WikiText-2 |

### Pipeline

```
microsoft/bitnet-b1.58-2B-4T-bf16
  -> Progressive pruning (5%, then 10%, by magnitude)
  -> QAT/STE fine-tune (300 steps per level)
  -> Ternary snap to {-1, 0, +1}
  -> Save in HuggingFace format (this model)
  -> Convert to GGUF I2_S (see the GGUF variant)
  -> Inference via bitnet.cpp (CPU)
```

## Limitations

- **Evaluation limited to WikiText-2**: the PPL improvement needs validation on broader benchmarks (MMLU, HellaSwag, ARC)
- **Short context**: only tested with sequences up to 128 tokens during training
- **I2_S format support**: the GGUF variant requires the BitNet fork of llama.cpp (not standard llama.cpp)
- **Language**: primarily tested on English text
- **PPL improvement caveat**: the dramatic PPL improvement after the ternary snap (33.07 -> 16.39) may reflect implicit regularization rather than a genuine capability gain; broader benchmarks are needed to confirm
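The two core operations in the pipeline above can be sketched as follows. This is an illustrative reconstruction, not the training script used for this model: `TernarySTE` quantizes weights to {-s, 0, +s} in the forward pass while letting gradients flow through unchanged (the straight-through estimator), and `magnitude_prune` zeros the smallest-magnitude fraction of a weight tensor. The per-tensor absmean scale follows the BitNet b1.58 convention, but the exact scaling used in training is an assumption here.

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Ternary quantization with a straight-through gradient estimator."""

    @staticmethod
    def forward(ctx, w):
        s = w.abs().mean()  # per-tensor absmean scale (assumed, as in BitNet b1.58)
        return torch.clamp(torch.round(w / (s + 1e-8)), -1, 1) * s

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through: ignore the non-differentiable round()

def magnitude_prune(w, frac):
    """Zero out the `frac` smallest-magnitude entries of w."""
    k = int(frac * w.numel())
    if k == 0:
        return w
    thresh = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() <= thresh, torch.zeros_like(w), w)

w = torch.randn(64, 64, requires_grad=True)
pruned = magnitude_prune(w.detach(), 0.10)  # one 10% pruning level
q = TernarySTE.apply(w)                     # forward pass sees ternary weights
q.sum().backward()                          # gradients still reach w
```

In an actual QAT loop, each linear layer's weights would pass through `TernarySTE.apply` on every forward step, with pruning masks reapplied between levels.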
## Part of RPT (Redes Preditivas Termodinamicas)

This model was produced as part of the RPT project ("Thermodynamic Predictive Networks"), which explores physics-inspired principles for neural network efficiency:

- **Landauer's principle**: sparsity (removing information) improves model quality
- **Self-Organized Criticality**: the model naturally operates at the edge of chaos (Lyapunov exponent ~ 0)
- **Predictive Coding**: correction ratios decrease with depth (39.87 -> 0.21)

Full documentation: [RPT Project](https://github.com/CesarFavero/rpt-bitnet-2b-pruned)

## Citation

```bibtex
@misc{rpt2026,
  title={Sparsity Improves Ternary Language Models: Evidence from BitNet b1.58},
  author={Cesar and Claude},
  year={2026},
  note={RPT - Redes Preditivas Termodinamicas}
}
```

## License

MIT (same as the base model)