BitMamba-2-1B


Mirror repository of Zhayr1/BitMamba-2-1B, maintained by Aquantic Research for the GPU-to-CPU/ARM neural network transposition programme.

BitMamba-2-1B is a scalable, hybrid architecture that integrates 1.58-bit ternary quantization (BitNet) into the Mamba-2 state space model framework. Trained from scratch on 150B tokens of high-quality data, it demonstrates that ternary SSMs follow predictable scaling laws, achieving competitive reasoning capabilities with a drastically reduced memory footprint.
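The ternary scheme can be sketched as follows. This is an illustrative absmean quantizer in the style of BitNet b1.58, not the actual training code used for this model:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-6):
    """Absmean ternary quantization in the style of BitNet b1.58:
    scale by the mean absolute value, then round each weight to {-1, 0, 1}."""
    scale = np.abs(w).mean() + eps           # per-tensor absmean scale
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes in {-1, 0, 1}
    return q.astype(np.int8), scale          # dequantize as q * scale

w = np.random.randn(4, 8).astype(np.float32)
q, scale = ternary_quantize(w)
assert set(np.unique(q).tolist()).issubset({-1, 0, 1})
```

Matrix multiplies against such weights reduce to additions and subtractions plus one scalar rescale, which is what makes the format attractive for CPU inference.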


ARM NEON Port: Cross-Platform CPU Inference

An ARM NEON port of the BitMamba-2 inference engine has been developed by Aquantic Research, enabling native inference on Apple Silicon (M1/M2/M3/M4) and ARM-based processors.

| Model | Hardware | Speed | Latency/token | RAM |
|---|---|---|---|---|
| BitMamba-2 1B | Intel Core i3-12100F (AVX2) | ~53 tok/s | N/A | 621 MB |
| BitMamba-2 1B | Apple M1 (ARM NEON) | 27.9 tok/s | 35.9 ms | 614 MB |

Key finding: Throughput remains constant regardless of sequence length (50, 200, or more tokens). This is experimental evidence for the O(1)-memory property of SSM architectures: the model carries a fixed-size state, unlike Transformers, whose KV cache (and thus memory) grows with sequence length.
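A minimal sketch of why the state, and hence memory, stays constant. This toy diagonal linear recurrence is illustrative only and is not the fused Mamba-2 kernel:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy diagonal linear SSM recurrence: the hidden state h has a fixed
    size (d_state) no matter how long the input sequence is."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # one step per token
        h = A * h + B * x_t       # state update: same memory at every step
        ys.append(C @ h)          # readout
    return np.array(ys), h

A = np.full(16, 0.9); B = np.ones(16); C = np.ones(16) / 16
for T in (50, 200, 800):
    ys, h = ssm_scan(np.random.randn(T), A, B, C)
    assert h.shape == (16,)       # state size independent of sequence length
```

Per-token work and memory are identical at step 50 and step 800, which matches the constant tok/s observed above.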

Comparison with Transformer baselines (same hardware)

| Type | Model | Weights | Quant | tok/s | Hardware |
|---|---|---|---|---|---|
| SSM | BitMamba-2 1B | 614 MB | 1.58-bit | 27.9 | Apple M1 |
| Transformer | TinyLlama 1.1B | 638 MB | Q4_K_M | ~30-40 | Apple M1 |
| Transformer | Llama-7B | 3.8 GB | Q4 | ~15 | Apple M1 |
| Cloud GPU | Claude 3.5 Haiku | N/A | N/A | 61 | GPU cloud |

At comparable weight sizes (~600 MB), the SSM achieves throughput competitive with quantized Transformers, but with constant memory (no KV cache growth) and 1.58-bit compression (vs. 4-bit for Transformers).
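A back-of-envelope illustration of the memory contrast. The Transformer dimensions below are assumptions modeled on TinyLlama 1.1B (22 layers, 4 KV heads via GQA, head dim 64, fp16 cache), not measurements from these benchmarks:

```python
# Transformer KV-cache growth vs. the SSM's fixed state (back-of-envelope).
# All dimensions are assumptions modeled on TinyLlama 1.1B, not measured here.
layers, kv_heads, head_dim, bytes_fp16 = 22, 4, 64, 2

def kv_cache_bytes(seq_len):
    # One K and one V entry per layer per cached token
    return 2 * layers * kv_heads * head_dim * bytes_fp16 * seq_len

for seq_len in (512, 2048, 8192):
    print(seq_len, kv_cache_bytes(seq_len) / 2**20, "MiB")
# The KV cache grows linearly with context; the SSM state does not grow at all.
```

Even with grouped-query attention, the cache cost compounds with context length, while the SSM's working set is the same at any position.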

ARM NEON Port Resources

  • Code: rasata/bitmamba.cpp, an ARM NEON fork with cross-platform dispatch (x86 AVX2 + ARM NEON)
  • Preprint: "State Space Models as CPU-Native Neural Network Architectures: Experimental Evidence from ARM NEON Inference with 1.58-bit Quantized Mamba", Gabriel Zo-Hasina Rasatavohary, Aquantic Research, March 2026. To be published on engrXiv (DOI pending).
  • Research programme: GPU-to-CPU/ARM Neural Network Transposition

Quick Start (ARM)

# Clone the ARM NEON fork
git clone https://github.com/rasata/bitmamba.cpp
cd bitmamba.cpp

# Build (macOS Apple Silicon)
brew install libomp
cmake -B build && cmake --build build

# Download weights from this repo
wget https://huggingface.co/rasatavohary/BitMamba-2-1B/resolve/main/bitmamba_cpp/bitmamba_1b.bin

# Run inference
cd build && cp ../tokenizer.bin .
./bitmamba ../bitmamba_1b.bin "The future of AI is" tokenizer 0.7 1.1 0.05 0.9 40 200

⚑ Key Features

  • Architecture: Mamba-2 SSM + BitNet b1.58 (Ternary Weights).
  • Parameters: 1B.
  • Precision: 1.58-bit (weights {-1, 0, 1}).
  • Training Tokens: 150 Billion (FineWeb-Edu, Cosmopedia, Stack-Dedup).
  • Hardware: Trained on Google Cloud TPU v6e.
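The "1.58-bit" figure follows from the information content of a ternary weight. The packing arithmetic below is a sketch; attributing the gap up to the measured 614 MB of RAM to unquantized layers and runtime buffers is an assumption, not a published breakdown:

```python
import math

# Why "1.58-bit": a ternary weight in {-1, 0, 1} carries log2(3) bits.
bits_per_weight = math.log2(3)
print(round(bits_per_weight, 2))  # 1.58

# A practical packing uses 2 bits per weight (4 ternary values per byte).
n_params = 1_000_000_000
packed_mib = n_params * 2 / 8 / 2**20
print(round(packed_mib))  # 238 (MiB of packed ternary weights, in theory)
```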

📊 Benchmark Results

| Benchmark | Metric | BitMamba-2-1B | vs. 255M Baseline |
|---|---|---|---|
| ARC-Easy | Accuracy | 63.30% | +7.8% |
| PIQA | Accuracy | 68.77% | +4.4% |
| BoolQ | Accuracy | 62.35% | +3.1% |
| HellaSwag | Acc Norm | 45.59% | +10.4% |
| WikiText-2 | Perplexity | 29.62 | -22.1 |

Scaling from 255M to 1B parameters yields consistent improvements across all reported benchmarks.

Scaling Laws

🚀 Usage (Inference)

This model is optimized for edge deployment using our custom C++ inference engine.

1. Download the Quantized Model

Download the bitmamba_1b.bin file located in the files tab (or bitmamba_cpp folder).

2. Run with C++

Go to the original GitHub Repository for x86 AVX2 inference, or rasata/bitmamba.cpp for cross-platform (x86 + ARM NEON) inference.

# Example usage after compiling bitmamba.cpp
./bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200

3. JAX/Flax Usage

The bitmamba_1b.msgpack file contains the raw JAX/Flax weights for research purposes. You can load them using the source code provided in src/ on GitHub.

🛠️ Efficient Deployment

| Platform | Hardware | RAM Usage | Speed |
|---|---|---|---|
| x86 (original) | Intel Core i3-12100F (AVX2) | 621 MB | ~53 tok/s |
| ARM (NEON port) | Apple M1 | 614 MB | 27.9 tok/s |

📜 Citations

Original model

@misc{salazar2026bitmamba2,
  author       = {Salazar, Jesus},
  title        = {{BitMamba}-2: Efficient Scaling of 1.58-bit State Space Models},
  year         = {2026},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.18394665},
  url          = {https://doi.org/10.5281/zenodo.18394665}
}

ARM NEON port and CPU-native research

@misc{rasatavohary2026ssm,
  author       = {Rasatavohary, Gabriel Zo-Hasina},
  title        = {State Space Models as {CPU}-Native Neural Network Architectures:
                   Experimental Evidence from {ARM NEON} Inference with 1.58-bit
                   Quantized {Mamba}},
  year         = {2026},
  howpublished = {engrXiv preprint (DOI pending)},
  note         = {Aquantic Research. First ARM NEON port of BitMamba-2.
                   Code: \url{https://github.com/rasata/bitmamba.cpp}},
}
