Upload folder using huggingface_hub

- README.md +32 -35
- checkpoints/checkpoint_12000.pth +3 -0
README.md
CHANGED
@@ -16,70 +16,67 @@ tags:

# Aetheris: Hybrid Mamba-MoE (294M)

> **Status:** 🟡 Experimental / Proof-of-Concept
> **Source Code:** [GitHub - Pomilon/Aetheris](https://github.com/Pomilon/Aetheris)

**Aetheris** is a "learning by doing" experiment where I attempted to smash together a **Mamba State Space Model** backbone with **Mixture-of-Experts (MoE)** layers.

I built this from scratch to see if I could combine Mamba's long-context efficiency with MoE's sparse capacity on consumer hardware. It is **not** a state-of-the-art foundation model; it's a fun architectural playground.

## 🧪 The "What If" Experiment

The idea was simple: *Can I interleave dense Mamba blocks with sparse MoE layers to make a model that is big on disk but fast at inference?*

The architecture alternates between:

1. **SSM Blocks (Odd Layers):** Dense Mamba blocks for handling memory and context.
2. **MoE Blocks (Even Layers):** Sparse layers that route tokens to only 1 of 4 experts (see the sketch below).
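
To make the interleaving concrete, here is a minimal PyTorch sketch of the idea. This is **not** the actual Aetheris code: the class names, dimensions, and the linear "mixer" standing in for the real Mamba block are invented for illustration; only the odd/even alternation and the top-1 routing over 4 experts mirror the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSSMBlock(nn.Module):
    """Stand-in for a dense Mamba block (odd layers): every weight is used for every token."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.Linear(d_model, d_model)  # placeholder for the real SSM sequence mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mixer(self.norm(x))  # pre-norm residual block


class SparseMoEBlock(nn.Module):
    """Sparse MoE block (even layers): each token is routed to exactly 1 of `num_experts` MLPs."""

    def __init__(self, d_model: int, num_experts: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.norm(x).reshape(-1, x.shape[-1])   # (batch * seq, d_model)
        gate = F.softmax(self.router(tokens), dim=-1)    # routing probabilities
        weight, expert_idx = gate.max(dim=-1)            # top-1: pick one expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            hit = expert_idx == i
            if hit.any():
                out[hit] = weight[hit].unsqueeze(-1) * expert(tokens[hit])
        return x + out.reshape_as(x)                     # residual connection


class HybridStack(nn.Module):
    """Alternate dense SSM blocks (odd layers) with sparse MoE blocks (even layers)."""

    def __init__(self, d_model: int = 512, n_layers: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            [DenseSSMBlock(d_model) if layer % 2 == 1 else SparseMoEBlock(d_model)
             for layer in range(1, n_layers + 1)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    stack = HybridStack()
    print(stack(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```

The real model also has embeddings, an LM head, and a proper selective-scan Mamba kernel; the sketch only shows how the two block types interleave and why the experts that aren't selected stay idle for a given token.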

### The Specs

Because of the hybrid design, ~43% of the model is "dormant" during inference.

| Metric | Count (Millions) | What it means |
| :--- | :--- | :--- |
| **Total Capacity** | **294.44M** | The size on disk. |
| **Active Params** | **167.03M** | The actual compute used per token. |
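
The "~43% dormant" figure above follows directly from the two rows of the table; here is the quick arithmetic (values copied from the table, nothing else assumed):

```python
total_m, active_m = 294.44, 167.03  # millions of parameters, from the table above
dormant = 1 - active_m / total_m
print(f"{dormant:.1%} of the parameters sit idle for any given token")  # ~43.3%
```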

## Training Log (Live)

I am currently training this on a single NVIDIA RTX 5000. It's still cooking!

* **Latest Checkpoint:** Step 11,000
* **Loss:** ~1.4167
* **Dataset:** Subset of SlimPajama-627B

> **⚠️ Disclaimer:** This model is currently babbling coherent English but isn't very smart yet. Don't expect GPT-4 (or even GPT-2) level reasoning. It's a proof-of-concept for the code, not the weights! :D

## How to Run (The "Scuffed" Way)

Since this uses a custom architecture, `AutoModel.from_pretrained` won't work out of the box. You need the code from my repo.

Right now, the easiest way to run it is using the CLI tool in the repo:

```bash
# 1. Clone the repo
git clone https://github.com/Pomilon/Aetheris.git
cd Aetheris

# 2. Install requirements
pip install -r requirements.txt

# 3. Run generation (download the model to a local folder first, and rename the
#    checkpoint inside it to checkpoint_current.pth)
python -m aetheris.cli.main generate --prompt "The quick brown fox" --checkpoint_dir path/to/checkpoints_folder
```

*(I'll add a cleaner inference script later, but this works for now!)*
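
In the meantime, one possible way to fetch the checkpoint from the Hub and lay it out the way the CLI expects is sketched below. The `repo_id` is a placeholder you have to replace with this model's actual Hub id, and the `checkpoint_current.pth` name comes from the comment in the generation step above.

```python
# Sketch only: download the checkpoint with huggingface_hub and rename it for the CLI.
from pathlib import Path
import shutil

from huggingface_hub import hf_hub_download

repo_id = "<hf-username>/<this-model-repo>"  # placeholder -- set this to the model's Hub repo id
src = hf_hub_download(repo_id=repo_id, filename="checkpoints/checkpoint_12000.pth")

ckpt_dir = Path("checkpoints")
ckpt_dir.mkdir(exist_ok=True)
shutil.copy(src, ckpt_dir / "checkpoint_current.pth")  # the filename the CLI looks for
print(f"Checkpoint ready in {ckpt_dir.resolve()}")
```

Then point the CLI at it: `python -m aetheris.cli.main generate --prompt "The quick brown fox" --checkpoint_dir checkpoints`.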

## Acknowledgements

This project stands on the shoulders of giants. It is an implementation study based on:

* **Mamba:** Gu & Dao (2023)
* **Mixture of Experts:** Shazeer et al. (2017)

## License

MIT

checkpoints/checkpoint_12000.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:86793c523a0c2be249ec16e6ae9c0f932bdb7783e6b184d7431e94c2ccbc8de7
size 3533562641