Add ARM NEON port reference, preprint citation, and mirror notice

README.md (changed)

tags:
- 1.58-bit
- ternary
- efficient-inference
- arm-neon
- apple-silicon
- cpu-inference
datasets:
- HuggingFaceFW/fineweb-edu
- bigcode/the-stack-dedup

[...]

[Demo](https://huggingface.co/spaces/Zhayr1/Bitmamba-2-1B)
[Paper (DOI)](https://doi.org/10.5281/zenodo.18394665)
[GitHub](https://github.com/Zhayr1/BitMamba-2)
[bitmamba.cpp (ARM NEON)](https://github.com/rasata/bitmamba.cpp)
[engrXiv preprint](https://engrxiv.org/)

</div>

> **Mirror repository** of [Zhayr1/BitMamba-2-1B](https://huggingface.co/Zhayr1/BitMamba-2-1B), maintained by [Aquantic Research](https://github.com/rasata/zonova-research-gpu-to-cpu-transposition) for the GPU-to-CPU/ARM neural network transposition programme.

**BitMamba-2-1B** is a scalable, hybrid architecture that integrates **1.58-bit ternary quantization** (BitNet) into the **Mamba-2** state space model framework. Trained from scratch on 150B tokens of high-quality data, it demonstrates that ternary SSMs follow predictable scaling laws, achieving competitive reasoning capabilities with a drastically reduced memory footprint.
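
As background on the quantization side: BitNet b1.58 constrains every weight to {-1, 0, +1} with a single scale per tensor. Below is a minimal NumPy sketch of the commonly described absmean quantizer; it is illustrative only (`absmean_ternary` is not from this repo, whose released research weights are JAX checkpoints):

```python
import numpy as np

def absmean_ternary(w: np.ndarray):
    """Quantize a weight tensor to {-1, 0, +1} plus one fp scale,
    in the style of the BitNet b1.58 absmean scheme."""
    gamma = np.abs(w).mean() + 1e-8                  # per-tensor scale
    w_q = np.clip(np.rint(w / gamma), -1, 1).astype(np.int8)
    return w_q, gamma                                # inference uses gamma * w_q

w = np.random.randn(256, 256) * 0.02
w_q, gamma = absmean_ternary(w)
print(np.unique(w_q), gamma)                         # values drawn from [-1, 0, 1]
```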

---

## ARM NEON Port – Cross-Platform CPU Inference

An **ARM NEON port** of the BitMamba-2 inference engine has been developed by Aquantic Research, enabling native inference on **Apple Silicon** (M1/M2/M3/M4) and ARM-based processors.

| Model | Hardware | Speed | Latency/token | RAM |
|-------|----------|-------|---------------|-----|
| BitMamba-2 1B | Intel Core i3-12100F (AVX2) | ~53 tok/s | – | 621 MB |
| **BitMamba-2 1B** | **Apple M1 (ARM NEON)** | **27.9 tok/s** | 35.9 ms | 614 MB |
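
One reason 1.58-bit weights suit plain CPU SIMD (AVX2, NEON): a ternary matvec needs no weight multiplications at all, only sign-masked adds and subtracts. A NumPy sketch of that identity follows; the production kernels live in the C++ of bitmamba.cpp, so this is purely illustrative:

```python
import numpy as np

def ternary_matvec(w_q: np.ndarray, gamma: float, x: np.ndarray) -> np.ndarray:
    """Compute gamma * (w_q @ x) for w_q with entries in {-1, 0, +1}.
    Grouping by sign shows only adds/subtracts of x are required."""
    pos = np.where(w_q == 1, x, 0.0).sum(axis=1)     # add x[j] where w == +1
    neg = np.where(w_q == -1, x, 0.0).sum(axis=1)    # subtract where w == -1
    return gamma * (pos - neg)

rng = np.random.default_rng(0)
w_q = rng.integers(-1, 2, size=(4, 8)).astype(np.int8)
x = rng.standard_normal(8)
assert np.allclose(ternary_matvec(w_q, 0.5, x), 0.5 * (w_q @ x))
```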

**Key finding**: Speed is **perfectly constant** regardless of sequence length (50, 200, or more tokens). This experimentally validates the **O(1) memory** property of SSM architectures, unlike Transformers, whose memory grows with sequence length.
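
The flat speed curve follows directly from the SSM update rule: the whole history is folded into a fixed-size state, so token 10,000 costs the same as token 1. A toy diagonal recurrence in NumPy, sketching the principle rather than the Mamba-2 selective-scan kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_model = 16, 64
A = rng.uniform(0.9, 0.99, size=d_state)    # diagonal state transition
B = rng.standard_normal((d_state, d_model)) * 0.01
C = rng.standard_normal((d_model, d_state)) * 0.01

h = np.zeros(d_state)                       # fixed-size state: O(1) memory
for x_t in rng.standard_normal((10_000, d_model)):
    h = A * h + B @ x_t                     # same cost at step 1 and step 10,000
    y_t = C @ h                             # a Transformer would scan a growing KV cache here
```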

### Comparison with Transformer baselines (same hardware)

| Type | Model | Weights | Quant | tok/s | Hardware |
|------|-------|---------|-------|-------|----------|
| **SSM** | **BitMamba-2 1B** | **614 MB** | **1.58-bit** | **27.9** | **Apple M1** |
| Transformer | TinyLlama 1.1B | 638 MB | Q4_K_M | ~30-40 | Apple M1 |
| Transformer | Llama-7B | 3.8 GB | Q4 | ~15 | Apple M1 |
| Cloud GPU | Claude 3.5 Haiku | – | – | 61 | GPU cloud |

At comparable weight sizes (~600 MB), the SSM achieves throughput competitive with quantized Transformers, but with **constant memory** (no KV cache growth) and **1.58-bit** compression (vs. 4-bit for Transformers).
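
To make the "no KV cache growth" point concrete, a back-of-the-envelope calculation. The Transformer shape below is hypothetical, chosen only to be roughly 1B-scale; it is not a measurement of TinyLlama:

```python
# Hypothetical ~1B Transformer: 22 layers, 2048-wide, fp16 K and V per token
layers, d_model, bytes_per_value = 22, 2048, 2

def kv_cache_bytes(seq_len: int) -> int:
    return 2 * layers * seq_len * d_model * bytes_per_value  # 2 = K and V

for n in (512, 4096, 32768):
    print(f"{n:>6} tokens -> {kv_cache_bytes(n) / 2**20:7.1f} MiB of KV cache")
# 512 -> 88 MiB, 4096 -> 704 MiB, 32768 -> 5632 MiB, on top of the weights.
# The SSM's recurrent state is fixed-size, so its footprint stays flat.
```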

### ARM NEON Port Resources

- **Code**: [rasata/bitmamba.cpp](https://github.com/rasata/bitmamba.cpp) – ARM NEON fork with cross-platform dispatch (x86 AVX2 + ARM NEON)
- **Preprint**: *"State Space Models as CPU-Native Neural Network Architectures: Experimental Evidence from ARM NEON Inference with 1.58-bit Quantized Mamba"* – Gabriel Zo-Hasina Rasatavohary, Aquantic Research, March 2026. To be published on [engrXiv](https://engrxiv.org/) (DOI pending).
- **Research programme**: [GPU-to-CPU/ARM Neural Network Transposition](https://github.com/rasata/zonova-research-gpu-to-cpu-transposition)

### Quick Start (ARM)

```bash
# Clone the ARM NEON fork
git clone https://github.com/rasata/bitmamba.cpp
cd bitmamba.cpp

# Build (macOS Apple Silicon)
brew install libomp
cmake -B build && cmake --build build

# Download weights from this repo
wget https://huggingface.co/rasatavohary/BitMamba-2-1B/resolve/main/bitmamba_cpp/bitmamba_1b.bin

# Run inference
cd build && cp ../tokenizer.bin .
./bitmamba ../bitmamba_1b.bin "The future of AI is" tokenizer 0.7 1.1 0.05 0.9 40 200
```

---

## ⚡ Key Features

- **Architecture:** Mamba-2 SSM + BitNet b1.58 (Ternary Weights).

[...]

Download the `bitmamba_1b.bin` file located in the files tab (or `bitmamba_cpp` …)

### 2. Run with C++

Go to the original [GitHub Repository](https://github.com/Zhayr1/bitmamba.cpp) for x86 AVX2 inference, or [rasata/bitmamba.cpp](https://github.com/rasata/bitmamba.cpp) for cross-platform (x86 + ARM NEON) inference.

```bash
# Example usage after compiling bitmamba.cpp
# (same invocation as the ARM quick start above)
./bitmamba bitmamba_1b.bin "The future of AI is" tokenizer 0.7 1.1 0.05 0.9 40 200
```

The `bitmamba_1b.msgpack` contains the raw JAX weights for research purposes. …
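
If you want to poke at the research checkpoint rather than the C++ engine, here is a minimal loading sketch. It assumes the file is a plain `msgpack` payload, as the name suggests; the key layout inside is not documented here, so inspect the tree before relying on any structure:

```python
import msgpack  # pip install msgpack

with open("bitmamba_1b.msgpack", "rb") as f:
    params = msgpack.unpack(f, raw=False)   # assumed: a nested dict of arrays

def walk(tree, prefix=""):
    """Print the parameter tree so the real layout can be verified."""
    if isinstance(tree, dict):
        for key, value in tree.items():
            walk(value, f"{prefix}/{key}")
    else:
        print(prefix, type(tree))

walk(params)
```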

## 🛠️ Efficient Deployment

| Platform | Hardware | RAM Usage | Speed |
|----------|----------|-----------|-------|
| x86 (original) | Intel Core i3-12100F (AVX2) | 621 MB | ~53 tok/s |
| **ARM (NEON port)** | **Apple M1** | **614 MB** | **27.9 tok/s** |

## 📚 Citations

### Original model

```bibtex
@misc{salazar2026bitmamba2,
  ...
  doi = {10.5281/zenodo.18394665},
  url = {https://doi.org/10.5281/zenodo.18394665}
}
```

### ARM NEON port and CPU-native research

```bibtex
@misc{rasatavohary2026ssm,
  author = {Rasatavohary, Gabriel Zo-Hasina},
  title = {State Space Models as {CPU}-Native Neural Network Architectures:
           Experimental Evidence from {ARM NEON} Inference with 1.58-bit
           Quantized {Mamba}},
  year = {2026},
  howpublished = {engrXiv preprint (DOI pending)},
  note = {Aquantic Research. First ARM NEON port of BitMamba-2.
          Code: \url{https://github.com/rasata/bitmamba.cpp}}
}
```

## Training Datasets

- [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- [HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
- [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup)

## Links

- [Original paper (Zenodo)](https://doi.org/10.5281/zenodo.18394665) – Salazar, 2026
- [Original GitHub](https://github.com/Zhayr1/BitMamba-2) – Zhayr1
- [ARM NEON fork](https://github.com/rasata/bitmamba.cpp) – Aquantic Research
- [Research programme](https://github.com/rasata/zonova-research-gpu-to-cpu-transposition) – GPU-to-CPU/ARM transposition
- [Interactive Demo](https://huggingface.co/spaces/Zhayr1/Bitmamba-2-1B)