---
license: mit
language:
- en
base_model: microsoft/bitnet-b1.58-2B-4T-bf16
tags:
- bitnet
- ternary
- pruning
- quantization
- efficient-inference
- rpt
datasets:
- wikitext
pipeline_tag: text-generation
model-index:
- name: rpt-bitnet-2b-pruned
results:
- task:
type: text-generation
dataset:
name: WikiText-2
type: wikitext
metrics:
- name: Perplexity
type: perplexity
value: 16.39
---
# RPT BitNet 2B Pruned
**Ternary (1.58-bit) language model with 42.6% sparsity, improved via progressive pruning + QAT/STE.**
Based on [microsoft/bitnet-b1.58-2B-4T-bf16](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16), this model was fine-tuned using:
1. **Progressive magnitude pruning** (5% then 10%)
2. **Quantization-Aware Training with Straight-Through Estimator (QAT/STE)** - 300 steps per level
3. **Ternary snap** to {-1, 0, +1}
The result is a model that achieves **lower WikiText-2 perplexity than the baseline** after pruning and ternary quantization.
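The prune-then-snap steps above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the project's actual training code: `magnitude_prune` and `ternary_snap` are hypothetical names, and the per-tensor absmean scaling follows the BitNet b1.58 recipe.

```python
import torch

def magnitude_prune(w: torch.Tensor, frac: float) -> torch.Tensor:
    """Return a mask that zeroes the `frac` smallest-magnitude entries of w."""
    k = max(int(frac * w.numel()), 1)
    thresh = w.abs().flatten().kthvalue(k).values
    return w.abs() > thresh

def ternary_snap(w: torch.Tensor) -> torch.Tensor:
    """Absmean quantization to {-1, 0, +1} times a per-tensor scale."""
    scale = w.abs().mean().clamp(min=1e-8)
    return (w / scale).round().clamp(-1, 1) * scale

w = torch.randn(512, 512)                 # stand-in for a layer's latent weights
w_pruned = w * magnitude_prune(w, 0.10)   # second pruning level: 10%
w_q = ternary_snap(w_pruned)
print(f"sparsity: {(w_q == 0).float().mean():.1%}")
```

Note that the final sparsity exceeds the pruned fraction: the absmean round also snaps small surviving weights to zero, which is where the extra zeros in the results table come from.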
## Results
| Metric | Baseline | This Model | Change |
|--------|----------|------------|--------|
| PPL (WikiText-2) | 25.13 | **16.39** | **-34.8%** |
| Ternary weights | 100% | **100%** | - |
| Sparsity (zeros) | ~33% (natural) | **42.6%** | +9.6pp |
| GGUF size (I2_S) | ~1.3 GB | **~1.1 GB** | -15% |
| CPU inference | Coherent | **Coherent** | - |
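A minimal sliding-window perplexity routine (an illustrative sketch, not the script that produced the numbers above) might look like the following. It assumes a HuggingFace-style causal LM whose output exposes a mean next-token cross-entropy as `.loss` when called with `labels` (e.g. a model loaded as in the Usage section), with the token stream coming from `load_dataset("wikitext", "wikitext-2-raw-v1", split="test")`.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor, window: int = 2048) -> float:
    """Non-overlapping sliding-window perplexity over one long token stream.

    Assumes `model(chunk, labels=chunk).loss` is the mean next-token
    cross-entropy, as with HuggingFace causal-LM heads.
    """
    total_nll, total_tokens = 0.0, 0
    for i in range(0, input_ids.size(1), window):
        chunk = input_ids[:, i : i + window]
        if chunk.size(1) < 2:  # need at least one next-token target
            break
        n_targets = chunk.size(1) - 1
        total_nll += model(chunk, labels=chunk).loss.item() * n_targets
        total_tokens += n_targets
    return math.exp(total_nll / total_tokens)
```

Non-overlapping windows give a slightly pessimistic PPL compared to a strided evaluation; either way, the same protocol must be used for baseline and pruned model for the comparison to hold.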
## Key Finding
Removing 10% of weights by magnitude from a ternary model **improves** perplexity by 34.8% after QAT/STE fine-tuning. This is counter-intuitive: pruning typically degrades models. In the ternary regime, low-magnitude weights appear to be noise that harms performance.
## Sample Outputs (GGUF I2_S, bitnet.cpp CPU)
```
Prompt: "The capital of France is"
Output: "Paris. There are also some cities that can be considered as their main cities,
such as the city that has been capital of France since the 17th century."
Prompt: "Water boils at"
Output: "100 degrees Celsius (212 degrees Fahrenheit) at standard atmospheric pressure."
Prompt: "The largest planet in the solar system is"
Output: "Jupiter. It is a gas giant planet that is about 318 Earths in size."
```
## Usage
### With PyTorch (HuggingFace Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"CesarFavero/rpt-bitnet-2b-pruned",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("CesarFavero/rpt-bitnet-2b-pruned")
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With bitnet.cpp (CPU inference)
For CPU inference, use the GGUF I2_S file with [bitnet.cpp](https://github.com/microsoft/BitNet):
```bash
# Clone BitNet fork
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
# Build (requires clang, not gcc)
python setup_env.py --hf-repo CesarFavero/rpt-bitnet-2b-pruned-GGUF -q i2_s
# Run
python run_inference.py -m models/rpt-bitnet-2b-pruned/ggml-model-i2_s.gguf \
-p "The capital of France is" -n 50
```
**Important**: The I2_S format requires the BitNet fork of llama.cpp. Standard llama.cpp and llama-cpp-python do NOT support this format.
## Training Details
| Parameter | Value |
|-----------|-------|
| Base model | microsoft/bitnet-b1.58-2B-4T-bf16 |
| Parameters | 2.4B (100% ternary) |
| Optimizer | AdamW (lr=5e-4, wd=0.01) |
| Technique | QAT with STE |
| Pruning | Progressive magnitude (5% -> 10%) |
| Steps/level | 300 |
| Batch size | 8 |
| Seq length | 128 |
| Hardware | NVIDIA A100 (~40GB) |
| Training time | ~7 minutes of GPU time |
| Dataset | WikiText-2 |
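The QAT/STE step from the table can be sketched as follows. This is illustrative: `TernarySTE` is a hypothetical name, and the real training loop operates on the full model rather than a single matrix. The key idea is that the forward pass sees ternary weights while gradients flow unchanged to the latent full-precision weights that AdamW updates.

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Forward: absmean ternary quantization. Backward: identity (straight-through)."""
    @staticmethod
    def forward(ctx, w):
        scale = w.abs().mean().clamp(min=1e-8)
        return (w / scale).round().clamp(-1, 1) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # pass gradients straight through the quantizer

# One QAT step on a toy weight matrix: the optimizer updates the latent
# full-precision weights, while the loss is computed with ternary weights.
w = torch.randn(64, 64, requires_grad=True)
opt = torch.optim.AdamW([w], lr=5e-4, weight_decay=0.01)  # hyperparams from the table
x = torch.randn(8, 64)
loss = (x @ TernarySTE.apply(w)).pow(2).mean()
loss.backward()
opt.step()
```

Without the straight-through trick, the `round()` in the forward pass would have zero gradient almost everywhere and the latent weights would never move.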
### Pipeline
```
microsoft/bitnet-b1.58-2B-4T-bf16
-> Progressive pruning (5% then 10% by magnitude)
-> QAT/STE fine-tune (300 steps per level)
-> Ternary snap to {-1, 0, +1}
-> Save HuggingFace format (this model)
-> Convert to GGUF I2_S (see GGUF variant)
-> Inference via bitnet.cpp (CPU)
```
## Limitations
- **Evaluation limited to WikiText-2**: The PPL improvement needs validation on broader benchmarks (MMLU, HellaSwag, ARC)
- **Short context**: only tested with sequences up to 128 tokens during training
- **I2_S format support**: GGUF variant requires BitNet fork of llama.cpp (not standard llama.cpp)
- **Language**: Primarily tested on English text
- **PPL improvement caveat**: The dramatic PPL improvement after ternary snap (33.07 -> 16.39) may reflect implicit regularization rather than genuine capability improvement. Broader benchmarks needed to confirm.
## Part of RPT (Redes Preditivas Termodinamicas, "Thermodynamic Predictive Networks")
This model was produced as part of the RPT project, which validates physics-inspired principles for neural network efficiency:
- **Landauer's principle**: Sparsity (removing information) improves model quality
- **Self-Organized Criticality**: The model naturally operates at the edge of chaos (Lyapunov exponent ~ 0)
- **Predictive Coding**: Correction ratios decrease with depth (39.87 -> 0.21)
Full documentation: [RPT Project](https://github.com/CesarFavero/rpt-bitnet-2b-pruned)
## Citation
```bibtex
@misc{rpt2026,
  title={Sparsity Improves Ternary Language Models: Evidence from BitNet b1.58},
  author={Cesar and Claude},
  year={2026},
  note={RPT - Redes Preditivas Termodinamicas}
}
```
## License
MIT (same as base model)