rasatavohary committed on
Commit 8b26eae · verified · 1 Parent(s): 0a50554

Add ARM NEON port reference, preprint citation, and mirror notice

Files changed (1): README.md (+95 −7)

README.md CHANGED
@@ -9,6 +9,9 @@ tags:
  - 1.58-bit
  - ternary
  - efficient-inference
 datasets:
  - HuggingFaceFW/fineweb-edu
  - bigcode/the-stack-dedup

@@ -28,11 +31,66 @@ inference: false
[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/Zhayr1/Bitmamba-2-1B)
[![Paper](https://img.shields.io/badge/Paper-Zenodo-00649C.svg)](https://doi.org/10.5281/zenodo.18394665)
[![GitHub](https://img.shields.io/badge/GitHub-Source%20Code-black)](https://github.com/Zhayr1/BitMamba-2)

</div>

**BitMamba-2-1B** is a scalable, hybrid architecture that integrates **1.58-bit ternary quantization** (BitNet) into the **Mamba-2** state space model framework. Trained from scratch on 150B tokens of high-quality data, it demonstrates that ternary SSMs follow predictable scaling laws, achieving competitive reasoning capabilities with a drastically reduced memory footprint.

## ⚡ Key Features

- **Architecture:** Mamba-2 SSM + BitNet b1.58 (Ternary Weights).

@@ -65,7 +123,7 @@ Download the `bitmamba_1b.bin` file located in the files tab (or `bitmamba_cpp`

### 2. Run with C++

- Go to our [GitHub Repository](https://github.com/Zhayr1/bitmamba.cpp) to get the inference code.

```bash
# Example usage after compiling bitmamba.cpp

@@ -78,13 +136,14 @@ The `bitmamba_1b.msgpack` contains the raw JAX weights for research purposes. Yo

## 🛠️ Efficient Deployment

- Running on a consumer **Intel Core i3-12100F CPU**:

- | Model | RAM Usage | Speed |
- | ----------------- | ---------- | ------------- |
- | **BitMamba-2-1B** | **621 MB** | **~53 tok/s** |

- ## 📜 Citation

```bibtex
@misc{salazar2026bitmamba2,
@@ -95,4 +154,33 @@ Running on a consumer **Intel Core i3-12100F CPU**:
  doi = {10.5281/zenodo.18394665},
  url = {https://doi.org/10.5281/zenodo.18394665}
}
- ```
  - 1.58-bit
  - ternary
  - efficient-inference
+ - arm-neon
+ - apple-silicon
+ - cpu-inference
 datasets:
  - HuggingFaceFW/fineweb-edu
  - bigcode/the-stack-dedup
 
[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm-dark.svg)](https://huggingface.co/spaces/Zhayr1/Bitmamba-2-1B)
[![Paper](https://img.shields.io/badge/Paper-Zenodo-00649C.svg)](https://doi.org/10.5281/zenodo.18394665)
[![GitHub](https://img.shields.io/badge/GitHub-Source%20Code-black)](https://github.com/Zhayr1/BitMamba-2)
+ [![ARM NEON Port](https://img.shields.io/badge/ARM%20NEON-Port-green)](https://github.com/rasata/bitmamba.cpp)
+ [![Preprint](https://img.shields.io/badge/Preprint-engrXiv-blue)](https://engrxiv.org/)

</div>

+ > **Mirror repository** of [Zhayr1/BitMamba-2-1B](https://huggingface.co/Zhayr1/BitMamba-2-1B), maintained by [Aquantic Research](https://github.com/rasata/zonova-research-gpu-to-cpu-transposition) for the GPU-to-CPU/ARM neural network transposition programme.
+
**BitMamba-2-1B** is a scalable, hybrid architecture that integrates **1.58-bit ternary quantization** (BitNet) into the **Mamba-2** state space model framework. Trained from scratch on 150B tokens of high-quality data, it demonstrates that ternary SSMs follow predictable scaling laws, achieving competitive reasoning capabilities with a drastically reduced memory footprint.
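As a rough sketch of the quantization idea (illustrative only, not the model's training code; the function name and example values below are made up), BitNet-style b1.58 quantization maps each weight to {-1, 0, +1} using a per-tensor absmean scale:

```python
import math

def ternary_quantize(weights):
    """Absmean ternary quantization in the style of BitNet b1.58:
    scale by the mean absolute weight, then round each weight to the
    nearest value in {-1, 0, +1}.  Dequantize as code * scale."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1e-8
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

weights = [0.42, -1.31, 0.07, 0.88, -0.05, -0.96]
codes, scale = ternary_quantize(weights)
print(codes)                    # every entry is -1, 0, or +1
print(round(math.log2(3), 2))   # -> 1.58 bits of information per ternary weight
```

Storing three states instead of sixteen (int4) or 65,536 (fp16) is what drives the small memory footprint reported below.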
+ ---
+
+ ## ARM NEON Port – Cross-Platform CPU Inference
+
+ An **ARM NEON port** of the BitMamba-2 inference engine has been developed by Aquantic Research, enabling native inference on **Apple Silicon** (M1/M2/M3/M4) and other ARM-based processors.
+
+ | Model | Hardware | Speed | Latency/token | RAM |
+ |-------|----------|-------|---------------|-----|
+ | BitMamba-2 1B | Intel Core i3-12100F (AVX2) | ~53 tok/s | n/a | 621 MB |
+ | **BitMamba-2 1B** | **Apple M1 (ARM NEON)** | **27.9 tok/s** | 35.9 ms | 614 MB |
+
+ **Key finding**: Decode speed remains constant regardless of sequence length (50, 200, or more tokens). This is consistent with the fixed-size recurrent state of SSM architectures, which gives O(1) memory and O(1) compute per generated token, unlike Transformers, whose KV cache and per-token attention cost grow with context length.
+
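The constant-memory claim can be made concrete with a back-of-envelope model (the layer counts and dimensions below are illustrative placeholders, not BitMamba-2's actual configuration):

```python
def transformer_kv_cache_bytes(seq_len, n_layers=24, n_heads=16,
                               head_dim=64, bytes_per_elem=2):
    # A Transformer caches one K and one V vector per token, per layer:
    # memory grows linearly with the generated sequence length.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers=24, d_inner=2048, d_state=64, bytes_per_elem=2):
    # A Mamba-style SSM carries a fixed-size recurrent state instead:
    # the footprint is identical at 50 tokens or 50,000 tokens.
    return n_layers * d_inner * d_state * bytes_per_elem

for seq_len in (50, 200, 2000):
    print(seq_len, transformer_kv_cache_bytes(seq_len), ssm_state_bytes())
```

The Transformer column grows linearly with `seq_len`, while the SSM column never changes, which is the behavior the benchmark above observes.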
+ ### Comparison with Transformer baselines (same hardware)
+
+ | Type | Model | Weights | Quant | tok/s | Hardware |
+ |------|-------|---------|-------|-------|----------|
+ | **SSM** | **BitMamba-2 1B** | **614 MB** | **1.58-bit** | **27.9** | **Apple M1** |
+ | Transformer | TinyLlama 1.1B | 638 MB | Q4_K_M | ~30–40 | Apple M1 |
+ | Transformer | Llama-7B | 3.8 GB | Q4 | ~15 | Apple M1 |
+ | Cloud GPU | Claude 3.5 Haiku | n/a | n/a | 61 | GPU cloud |
+
+ At comparable weight sizes (~600 MB), the SSM achieves throughput competitive with quantized Transformers, but with **constant memory** (no KV-cache growth) and **1.58-bit** compression (vs. 4-bit for the Transformers).
+
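For intuition on the Weights column: a ternary weight carries log2(3) ≈ 1.58 bits, which is where the "1.58-bit" name comes from. Idealized weight-only sizes for 1B parameters are sketched below (real checkpoints such as the ~614 MB file are larger because of per-tensor scales, non-ternary layers, and packing overhead):

```python
import math

TERNARY_BITS = math.log2(3)      # information content of one ternary weight
print(round(TERNARY_BITS, 3))    # -> 1.585, hence "1.58-bit"

n_params = 1_000_000_000
for name, bits in [("ternary (b1.58)", TERNARY_BITS), ("int4", 4), ("fp16", 16)]:
    mib = n_params * bits / 8 / 2**20   # idealized packing, weights only
    print(f"{name:>15}: {mib:.0f} MiB")
```

Even against 4-bit quantization, the ternary encoding is roughly 2.5x smaller per weight, at the cost of a much coarser codebook.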
+ ### ARM NEON Port Resources
+
+ - **Code**: [rasata/bitmamba.cpp](https://github.com/rasata/bitmamba.cpp) – ARM NEON fork with cross-platform dispatch (x86 AVX2 + ARM NEON)
+ - **Preprint**: *"State Space Models as CPU-Native Neural Network Architectures: Experimental Evidence from ARM NEON Inference with 1.58-bit Quantized Mamba"*, Gabriel Zo-Hasina Rasatavohary, Aquantic Research, March 2026. To be published on [engrXiv](https://engrxiv.org/) (DOI pending).
+ - **Research programme**: [GPU-to-CPU/ARM Neural Network Transposition](https://github.com/rasata/zonova-research-gpu-to-cpu-transposition)
+
+ ### Quick Start (ARM)
+
+ ```bash
+ # Clone the ARM NEON fork
+ git clone https://github.com/rasata/bitmamba.cpp
+ cd bitmamba.cpp
+
+ # Build (macOS Apple Silicon)
+ brew install libomp
+ cmake -B build && cmake --build build
+
+ # Download weights from this repo
+ wget https://huggingface.co/rasatavohary/BitMamba-2-1B/resolve/main/bitmamba_cpp/bitmamba_1b.bin
+
+ # Run inference
+ cd build && cp ../tokenizer.bin .
+ ./bitmamba ../bitmamba_1b.bin "The future of AI is" tokenizer 0.7 1.1 0.05 0.9 40 200
+ ```
+
+ ---
+

## ⚡ Key Features

- **Architecture:** Mamba-2 SSM + BitNet b1.58 (Ternary Weights).
 

### 2. Run with C++

+ Go to the original [GitHub Repository](https://github.com/Zhayr1/bitmamba.cpp) for x86 AVX2 inference, or to [rasata/bitmamba.cpp](https://github.com/rasata/bitmamba.cpp) for cross-platform (x86 + ARM NEON) inference.

```bash
# Example usage after compiling bitmamba.cpp
 

## 🛠️ Efficient Deployment

+ | Platform | Hardware | RAM Usage | Speed |
+ |----------|----------|-----------|-------|
+ | x86 (original) | Intel Core i3-12100F (AVX2) | 621 MB | ~53 tok/s |
+ | **ARM (NEON port)** | **Apple M1** | **614 MB** | **27.9 tok/s** |

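As a sanity check on the reported figures, throughput and per-token latency are reciprocals; the M1's 27.9 tok/s matches the ~35.9 ms/token quoted earlier (to rounding):

```python
def ms_per_token(tokens_per_second):
    # Per-token latency is the reciprocal of throughput.
    return 1000.0 / tokens_per_second

print(round(ms_per_token(27.9), 1))  # Apple M1 (NEON)  -> 35.8
print(round(ms_per_token(53.0), 1))  # i3-12100F (AVX2) -> 18.9
```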
+ ## 📜 Citations
+
+ ### Original model

```bibtex
@misc{salazar2026bitmamba2,

  doi = {10.5281/zenodo.18394665},
  url = {https://doi.org/10.5281/zenodo.18394665}
}
+ ```
+
+ ### ARM NEON port and CPU-native research
+
+ ```bibtex
+ @misc{rasatavohary2026ssm,
+   author       = {Rasatavohary, Gabriel Zo-Hasina},
+   title        = {State Space Models as {CPU}-Native Neural Network Architectures:
+                   Experimental Evidence from {ARM NEON} Inference with 1.58-bit
+                   Quantized {Mamba}},
+   year         = {2026},
+   howpublished = {engrXiv preprint (DOI pending)},
+   note         = {Aquantic Research. First ARM NEON port of BitMamba-2.
+                   Code: \url{https://github.com/rasata/bitmamba.cpp}},
+ }
+ ```
+
+ ## Training Datasets
+
+ - [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
+ - [HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
+ - [bigcode/the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup)
+
+ ## Links
+
+ - [Original paper (Zenodo)](https://doi.org/10.5281/zenodo.18394665) – Salazar, 2026
+ - [Original GitHub](https://github.com/Zhayr1/BitMamba-2) – Zhayr1
+ - [ARM NEON fork](https://github.com/rasata/bitmamba.cpp) – Aquantic Research
+ - [Research programme](https://github.com/rasata/zonova-research-gpu-to-cpu-transposition) – GPU-to-CPU/ARM transposition
+ - [Interactive Demo](https://huggingface.co/spaces/Zhayr1/Bitmamba-2-1B)