# BitMamba-2-1B
Mirror repository of Zhayr1/BitMamba-2-1B, maintained by Aquantic Research for the GPU-to-CPU/ARM neural network transposition programme.
BitMamba-2-1B is a scalable, hybrid architecture that integrates 1.58-bit ternary quantization (BitNet) into the Mamba-2 state space model framework. Trained from scratch on 150B tokens of high-quality data, it demonstrates that ternary SSMs follow predictable scaling laws, achieving competitive reasoning capabilities with a drastically reduced memory footprint.
## ARM NEON Port: Cross-Platform CPU Inference
An ARM NEON port of the BitMamba-2 inference engine has been developed by Aquantic Research, enabling native inference on Apple Silicon (M1/M2/M3/M4) and ARM-based processors.
| Model | Hardware | Speed | Latency/token | RAM |
|---|---|---|---|---|
| BitMamba-2 1B | Intel Core i3-12100F (AVX2) | ~53 tok/s | n/a | 621 MB |
| BitMamba-2 1B | Apple M1 (ARM NEON) | 27.9 tok/s | 35.9 ms | 614 MB |
Key finding: throughput stays effectively constant regardless of sequence length (50, 200, or more tokens). This experimentally validates the constant per-token cost and O(1) state memory of SSM architectures, unlike Transformers, whose KV cache and attention cost grow with sequence length.
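The constant-memory property above can be illustrated with a toy recurrence. This is a minimal sketch, not the BitMamba-2 kernels: a diagonal SSM carries only a fixed-size state vector between tokens, so per-token work is independent of how much context came before. All names and sizes here (`ssm_generate`, `d_state=16`) are illustrative choices, not values from the model.

```python
# Toy diagonal SSM recurrence: the only carried memory is a fixed-size
# state vector h, so generating token 200 costs the same as token 50.

def ssm_generate(n_tokens, d_state=16):
    """Run a toy scalar-input SSM for n_tokens steps; return the final state."""
    a = [0.9] * d_state          # fixed per-channel decay
    b = [0.1] * d_state          # fixed per-channel input weight
    h = [0.0] * d_state          # carried state: O(d_state), never O(T)
    for _ in range(n_tokens):
        x = 1.0                  # stand-in for the current token's feature
        h = [a[i] * h[i] + b[i] * x for i in range(d_state)]
    return h

# Same state size after 50 or 200 tokens, unlike a Transformer KV cache,
# which grows linearly with sequence length.
h50, h200 = ssm_generate(50), ssm_generate(200)
assert len(h50) == len(h200) == 16
```

With `a = 0.9` and constant input, each channel converges toward `b / (1 - a) = 1.0`; the point is only that the memory touched per step never grows.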
### Comparison with Transformer baselines (same hardware)
| Type | Model | Weights | Quant | tok/s | Hardware |
|---|---|---|---|---|---|
| SSM | BitMamba-2 1B | 614 MB | 1.58-bit | 27.9 | Apple M1 |
| Transformer | TinyLlama 1.1B | 638 MB | Q4_K_M | ~30-40 | Apple M1 |
| Transformer | Llama-7B | 3.8 GB | Q4 | ~15 | Apple M1 |
| Cloud GPU | Claude 3.5 Haiku | n/a | n/a | 61 | GPU cloud |
At comparable weight sizes (~600 MB), the SSM achieves throughput competitive with quantized Transformers, but with constant memory (no KV cache growth) and 1.58-bit compression (vs. 4-bit for Transformers).
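A back-of-envelope check helps situate the ~614 MB figure against the ternary compression claim. The numbers below are assumptions for illustration, not measured internals: ideal ternary storage needs log2(3) ≈ 1.58 bits per weight, and a common practical packing uses 2 bits (4 weights per byte); the observed footprint is larger because embeddings, norms, per-tensor scales, and activation buffers presumably remain in higher precision.

```python
# Rough sizing for ~1B ternary weights (illustrative assumptions only).
import math

params = 1_000_000_000
ideal_mb = params * math.log2(3) / 8 / 1e6   # information-theoretic floor
packed_mb = params * 2 / 8 / 1e6             # 2 bits/weight packing
print(f"ideal: {ideal_mb:.0f} MB, 2-bit packed: {packed_mb:.0f} MB")
# → roughly 198 MB ideal, 250 MB packed; the remaining gap to ~614 MB
#   would come from higher-precision tensors and runtime buffers.
```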
### ARM NEON Port Resources
- Code: rasata/bitmamba.cpp, an ARM NEON fork with cross-platform dispatch (x86 AVX2 + ARM NEON)
- Preprint: "State Space Models as CPU-Native Neural Network Architectures: Experimental Evidence from ARM NEON Inference with 1.58-bit Quantized Mamba", Gabriel Zo-Hasina Rasatavohary, Aquantic Research, March 2026. To be published on engrXiv (DOI pending).
- Research programme: GPU-to-CPU/ARM Neural Network Transposition
## Quick Start (ARM)
```shell
# Clone the ARM NEON fork
git clone https://github.com/rasata/bitmamba.cpp
cd bitmamba.cpp

# Build (macOS Apple Silicon)
brew install libomp
cmake -B build && cmake --build build

# Download weights from this repo
wget https://huggingface.co/rasatavohary/BitMamba-2-1B/resolve/main/bitmamba_cpp/bitmamba_1b.bin

# Run inference
cd build && cp ../tokenizer.bin .
./bitmamba ../bitmamba_1b.bin "The future of AI is" tokenizer 0.7 1.1 0.05 0.9 40 200
```
## Key Features
- Architecture: Mamba-2 SSM + BitNet b1.58 (Ternary Weights).
- Parameters: 1B.
- Precision: 1.58-bit (weights {-1, 0, 1}).
- Training Tokens: 150 Billion (FineWeb-Edu, Cosmopedia, Stack-Dedup).
- Hardware: Trained on Google Cloud TPU v6e.
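The ternary precision listed above can be sketched with the absmean quantization scheme popularized by BitNet b1.58. This is an illustrative sketch, not the project's training code (which lives in `src/` on GitHub): each weight is divided by the tensor's mean absolute value, then rounded and clipped into {-1, 0, 1}, with the scale kept alongside the quantized weights. The function name and sample weights are hypothetical.

```python
# Sketch of absmean ternary (b1.58-style) quantization: scale by the mean
# absolute weight, round to the nearest integer, clip to {-1, 0, 1}.

def ternary_quantize(w, eps=1e-8):
    """Quantize a flat list of floats to ternary values plus one scale."""
    scale = sum(abs(x) for x in w) / len(w) + eps   # absmean scale
    q = [max(-1, min(1, round(x / scale))) for x in w]
    return q, scale

w = [0.42, -1.3, 0.05, 0.9, -0.02, -0.7]
q, scale = ternary_quantize(w)
assert set(q) <= {-1, 0, 1}
# Dequantization is simply q[i] * scale, which is what makes 1.58-bit
# inference a matter of additions and sign flips rather than multiplies.
```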
## Benchmark Results
| Benchmark | Metric | BitMamba-2-1B | vs. 255M Baseline |
|---|---|---|---|
| ARC-Easy | Accuracy | 63.30% | +7.8% |
| PIQA | Accuracy | 68.77% | +4.4% |
| BoolQ | Accuracy | 62.35% | +3.1% |
| HellaSwag | Acc Norm | 45.59% | +10.4% |
| WikiText-2 | Perplexity | 29.62 | -22.1 |
Scaling from 255M to 1B parameters yields consistent improvements across all benchmarks.
## Usage (Inference)
This model is optimized for edge deployment using our custom C++ inference engine.
### 1. Download the Quantized Model
Download the bitmamba_1b.bin file from the Files tab (or the bitmamba_cpp folder).
### 2. Run with C++
Use the original GitHub repository for x86 AVX2 inference, or rasata/bitmamba.cpp for cross-platform (x86 + ARM NEON) inference.
```shell
# Example usage after compiling bitmamba.cpp
./bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200
```
### 3. JAX/Flax Usage
The bitmamba_1b.msgpack file contains the raw JAX weights for research purposes. You can load them using the source code provided in src/ on GitHub.
## Efficient Deployment
| Platform | Hardware | RAM Usage | Speed |
|---|---|---|---|
| x86 (original) | Intel Core i3-12100F (AVX2) | 621 MB | ~53 tok/s |
| ARM (NEON port) | Apple M1 | 614 MB | 27.9 tok/s |
## Citations
### Original model
```bibtex
@misc{salazar2026bitmamba2,
  author    = {Salazar, Jesus},
  title     = {{BitMamba}-2: Efficient Scaling of 1.58-bit State Space Models},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18394665},
  url       = {https://doi.org/10.5281/zenodo.18394665}
}
```
### ARM NEON port and CPU-native research
```bibtex
@misc{rasatavohary2026ssm,
  author       = {Rasatavohary, Gabriel Zo-Hasina},
  title        = {State Space Models as {CPU}-Native Neural Network Architectures:
                  Experimental Evidence from {ARM NEON} Inference with 1.58-bit
                  Quantized {Mamba}},
  year         = {2026},
  howpublished = {engrXiv preprint (DOI pending)},
  note         = {Aquantic Research. First ARM NEON port of BitMamba-2.
                  Code: \url{https://github.com/rasata/bitmamba.cpp}},
}
```
## Links
- Original paper (Zenodo): Salazar, 2026
- Original GitHub: Zhayr1
- ARM NEON fork: Aquantic Research
- Research programme: GPU-to-CPU/ARM transposition
- Interactive Demo