Upload folder using huggingface_hub

- README.md +32 -35
- checkpoints/checkpoint_12000.pth +3 -0
README.md
CHANGED
@@ -16,70 +16,67 @@ tags:

# Aetheris: Hybrid Mamba-MoE (294M)

> **Status:** 🟡 Experimental / Proof-of-Concept
> **Source Code:** [GitHub - Pomilon/Aetheris](https://github.com/Pomilon/Aetheris)

**Aetheris** is a "learning by doing" experiment where I attempted to smash together a **Mamba State Space Model** backbone with **Mixture-of-Experts (MoE)** layers.

I built this from scratch to see if I could combine Mamba's long-context efficiency with MoE's sparse capacity on consumer hardware. It is **not** a state-of-the-art foundation model; it's a fun architectural playground.

## 🧪 The "What If" Experiment

The idea was simple: *Can I interleave dense Mamba blocks with sparse MoE layers to make a model that is big on disk but fast at inference?*

The architecture alternates between:

1. **SSM Blocks (Odd Layers):** Dense Mamba blocks for handling memory and context.
2. **MoE Blocks (Even Layers):** Sparse layers that route tokens to only 1 of 4 experts (see the sketch below).
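
To make the interleaving concrete, here is a minimal PyTorch sketch of the idea. This is **not** the actual Aetheris code: the class names, dimensions, and the linear "mixer" standing in for the real Mamba block are invented for illustration; only the odd/even alternation and the top-1 routing over 4 experts mirror the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSSMBlock(nn.Module):
    """Stand-in for a dense Mamba block (odd layers): every weight is used for every token."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.Linear(d_model, d_model)  # placeholder for the real SSM sequence mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mixer(self.norm(x))  # pre-norm residual block


class SparseMoEBlock(nn.Module):
    """Sparse MoE block (even layers): each token is routed to exactly 1 of `num_experts` MLPs."""

    def __init__(self, d_model: int, num_experts: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.norm(x).reshape(-1, x.shape[-1])   # (batch * seq, d_model)
        gate = F.softmax(self.router(tokens), dim=-1)    # routing probabilities
        weight, expert_idx = gate.max(dim=-1)            # top-1: pick one expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            hit = expert_idx == i
            if hit.any():
                out[hit] = weight[hit].unsqueeze(-1) * expert(tokens[hit])
        return x + out.reshape_as(x)                     # residual connection


class HybridStack(nn.Module):
    """Alternate dense SSM blocks (odd layers) with sparse MoE blocks (even layers)."""

    def __init__(self, d_model: int = 512, n_layers: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseSSMBlock(d_model) if layer % 2 == 1 else SparseMoEBlock(d_model)
             for layer in range(1, n_layers + 1)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    stack = HybridStack()
    print(stack(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```

The real model also has embeddings, an LM head, and a proper selective-scan Mamba kernel; the sketch only shows how the two block types interleave and why the experts that aren't selected stay idle for a given token.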

### The Specs

Because of the hybrid design, ~43% of the model is "dormant" during inference.

| Metric | Count (Millions) | What it means |
| :--- | :--- | :--- |
| **Total Capacity** | **294.44M** | The size on disk. |
| **Active Params** | **167.03M** | The actual compute used per token. |
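
The "~43% dormant" figure above follows directly from the two rows of the table; here is the quick arithmetic (values copied from the table, nothing else assumed):

```python
total_m, active_m = 294.44, 167.03  # millions of parameters, from the table above
dormant = 1 - active_m / total_m
print(f"{dormant:.1%} of the parameters sit idle for any given token")  # ~43.3%
```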

## Training Log (Live)

I am currently training this on a single NVIDIA RTX 5000. It's still cooking!

* **Latest Checkpoint:** Step 11,000
* **Loss:** ~1.4167
* **Dataset:** Subset of SlimPajama-627B

> **⚠️ Disclaimer:** This model is currently babbling coherent English but isn't very smart yet. Don't expect GPT-4 (or even GPT-2) level reasoning. It's a proof-of-concept for the code, not the weights! :D

## How to Run (The "Scuffed" Way)

Since this uses a custom architecture, `AutoModel.from_pretrained` won't work out of the box. You need the code from my repo.

Right now, the easiest way to run it is using the CLI tool in the repo:

```bash
# 1. Clone the repo
git clone https://github.com/Pomilon/Aetheris.git
cd Aetheris

# 2. Install requirements
pip install -r requirements.txt

# 3. Run generation (download the model to a local folder first, and rename the
#    checkpoint inside it to checkpoint_current.pth)
python -m aetheris.cli.main generate --prompt "The quick brown fox" --checkpoint_dir path/to/checkpoints_folder
```

*(I'll add a cleaner inference script later, but this works for now!)*
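
In the meantime, one possible way to fetch the checkpoint from the Hub and lay it out the way the CLI expects is sketched below. The `repo_id` is a placeholder you have to replace with this model's actual Hub id, and the `checkpoint_current.pth` name comes from the comment in the generation step above.

```python
# Sketch only: download the checkpoint with huggingface_hub and rename it for the CLI.
from pathlib import Path
import shutil

from huggingface_hub import hf_hub_download

repo_id = "<hf-username>/<this-model-repo>"  # placeholder -- set this to the model's Hub repo id
src = hf_hub_download(repo_id=repo_id, filename="checkpoints/checkpoint_12000.pth")

ckpt_dir = Path("checkpoints")
ckpt_dir.mkdir(exist_ok=True)
shutil.copy(src, ckpt_dir / "checkpoint_current.pth")  # the filename the CLI looks for
print(f"Checkpoint ready in {ckpt_dir.resolve()}")
```

Then point the CLI at it: `python -m aetheris.cli.main generate --prompt "The quick brown fox" --checkpoint_dir checkpoints`.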

## Acknowledgements

This project stands on the shoulders of giants. It is an implementation study based on:

* **Mamba:** Gu & Dao (2023)
* **Mixture of Experts:** Shazeer et al. (2017)

## License

MIT

checkpoints/checkpoint_12000.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:86793c523a0c2be249ec16e6ae9c0f932bdb7783e6b184d7431e94c2ccbc8de7
size 3533562641