Pomilon committed
Commit 1df0e33 · 0 Parent(s)

Deploy Aetheris to HF Space

.gitattributes ADDED
@@ -0,0 +1 @@
+ checkpoints/checkpoint_current.pth filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,28 @@
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ env/
+ venv/
+ .env
+ .venv
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ checkpoints/
+ *.log
+ .DS_Store
+ legacy/
Dockerfile ADDED
@@ -0,0 +1,38 @@
+ FROM python:3.10-slim
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1 \
+     PYTHONDONTWRITEBYTECODE=1
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     git \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set working directory
+ WORKDIR /app
+
+ # Create a user first to handle permissions correctly from the start
+ RUN useradd -m -u 1000 user
+
+ # Switch to user
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+
+ # Set up application directory with correct permissions
+ WORKDIR $HOME/app
+
+ # Copy requirements and install
+ COPY --chown=user requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
+
+ # Copy application code
+ COPY --chown=user . .
+
+ # Expose port
+ EXPOSE 7860
+
+ # Command to run the application
+ CMD ["python3", "-m", "aetheris.cli.main", "serve", "--host", "0.0.0.0", "--port", "7860"]
Dockerfile-nvidia ADDED
@@ -0,0 +1,41 @@
+ # Use NVIDIA CUDA base image for GPU support
+ FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1 \
+     PYTHONDONTWRITEBYTECODE=1 \
+     DEBIAN_FRONTEND=noninteractive
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     python3-pip \
+     python3-dev \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set working directory
+ WORKDIR /app
+
+ # Install Python dependencies
+ COPY requirements.txt .
+ RUN pip3 install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY . .
+
+ # Expose port (7860 is the default for Hugging Face Spaces)
+ EXPOSE 7860
+
+ # Create a non-root user (good practice; Hugging Face Spaces typically runs as UID 1000)
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+
+ WORKDIR $HOME/app
+ COPY --chown=user . $HOME/app
+
+ # Command to run the application (the CLI `serve` command)
+ CMD ["python3", "-m", "aetheris.cli.main", "serve", "--host", "0.0.0.0", "--port", "7860"]
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Pomilon
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,146 @@
+ # Aetheris: Hybrid Mamba-MoE Experiment
+
+ <p align="center">
+   <img src="https://img.shields.io/badge/Status-Experimental-yellow.svg" alt="Status">
+   <img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License">
+   <img src="https://img.shields.io/badge/Python-3.10+-blue.svg" alt="Python">
+   <img src="https://img.shields.io/badge/PyTorch-2.0+-orange.svg" alt="PyTorch">
+   <img src="https://img.shields.io/badge/API-FastAPI-009688.svg" alt="FastAPI">
+ </p>
+
+
+ **Aetheris** is a hobbyist research project and experimental implementation exploring the intersection of **State Space Models (Mamba)** and **Mixture of Experts (MoE)**.
+
+ The goal of this project was to learn by doing: attempting to combine the linear-time inference of Mamba with the sparse scaling capacity of MoE from scratch in PyTorch. It is designed as a playground for understanding these modern architectures, not as a published academic paper or production-ready foundation model.
+
+ ## 🧪 The Experiment
+
+ Current LLM architectures are evolving rapidly. I built Aetheris to investigate a specific question:
+ > *Can we successfully interleave Mamba blocks (for long context) with sparse MoE layers (for capacity) to train an efficient model on consumer hardware?*
+
+ This project implements a hybrid architecture that attempts to:
+ 1. **Replace Attention:** Use Mamba (SSM) blocks to achieve $O(N)$ sequence scaling.
+ 2. **Scale Parameters Sparsely:** Use MoE layers to increase model size without exploding the computational cost per token.
+ 3. **Run Locally:** Optimize the implementation for single-GPU training (gradient checkpointing, efficient routing).
+
+ ## 🏗️ Architecture Implementation
+
+ Aetheris alternates between custom implementations of two core modules:
+
+ * **SSMBlock (The Backbone):** Implements the selective scan mechanism described in the [Mamba paper](https://arxiv.org/abs/2312.00752). This handles the sequence mixing and "memory" of the model.
+ * **SparseMoELayer (The Scaling):** A router-based layer that dispatches tokens to Top-K experts (Feed-Forward Networks). This allows the model to "specialize" parts of its parameters for different types of tokens.
+
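For orientation, here is a condensed sketch of how the two modules are interleaved. It mirrors the layer-stacking loop in `aetheris/model.py` from this commit (even layers are SSM blocks, odd layers are MoE layers), with the gradient checkpointing and loss bookkeeping stripped out; treat it as an illustration, not the full model.

```python
import torch.nn as nn
from aetheris.modules import SSMBlock, SparseMoELayer

class HybridSketch(nn.Module):
    """Toy illustration of the Mamba/MoE interleaving (not the full HybridMambaMoE)."""
    def __init__(self, config):
        super().__init__()
        # Even layers mix the sequence (SSM); odd layers add sparse capacity (MoE).
        self.layers = nn.ModuleList(
            [SSMBlock(config) if i % 2 == 0 else SparseMoELayer(config)
             for i in range(config.n_layer)]
        )

    def forward(self, x):
        aux_total = 0.0
        for layer in self.layers:
            if isinstance(layer, SparseMoELayer):
                x, aux = layer(x)        # MoE layers also return a load-balancing loss
                aux_total = aux_total + aux
            else:
                x = layer(x)             # SSM layers are plain residual blocks
        return x, aux_total
```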
+ ## 🚀 Quick Start
+
+ This code is provided for educational purposes and for others who want to experiment with hybrid architectures.
+
+ ### Installation
+
+ **Option 1: Local Python Environment**
+
+ ```bash
+ git clone https://github.com/Pomilon/Aetheris.git
+ cd Aetheris
+ pip install -r requirements.txt
+ ```
+
+ **Option 2: Docker**
+
+ We provide Dockerfiles for both CPU (slim) and GPU (NVIDIA) environments.
+
+ ```bash
+ # CPU Version
+ docker build -t aetheris-cpu -f Dockerfile .
+ docker run -p 7860:7860 aetheris-cpu
+
+ # GPU Version (Requires NVIDIA Container Toolkit)
+ docker build -t aetheris-gpu -f Dockerfile-nvidia .
+ docker run --gpus all -p 7860:7860 aetheris-gpu
+ ```
+
+ ### Usage (CLI)
+
+ Aetheris includes a CLI to train the model, run inference, or serve it over an API.
+
+ **1. Training (From Scratch)**
+
+ ```bash
+ # Trains a small model defined in configs/default.yaml
+ python -m aetheris.cli.main train --config configs/default.yaml
+ ```
+
+ **2. Generation (CLI)**
+
+ ```bash
+ python -m aetheris.cli.main generate --prompt "The quick brown fox" --checkpoint_dir checkpoints
+ ```
+
+ **3. API Server (OpenAI-Compatible)**
+
+ Start a local API server that mimics OpenAI's chat completions endpoint.
+
+ ```bash
+ python -m aetheris.cli.main serve --host 0.0.0.0 --port 8000
+ ```
+
+ You can then interact with it using standard tools:
+
+ ```bash
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "aetheris-hybrid",
+     "messages": [{"role": "user", "content": "Hello!"}],
+     "stream": true
+   }'
+ ```
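Because the endpoint follows the OpenAI wire format, you can also point the official `openai` Python client at it. The package is not in `requirements.txt`, so install it separately; this is a sketch of what that could look like, assuming the server's SSE stream round-trips cleanly through the client:

```python
from openai import OpenAI  # pip install "openai>=1.0"

# The api_key is unused by the local server but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="aetheris-hybrid",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a partial assistant message in choices[0].delta
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```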
+
+ ### Development & Testing
+
+ To run the test suite:
+
+ ```bash
+ pytest tests/
+ ```
+
+ ## ⚙️ Configuration
+
+ You can tweak the hyperparameters in `configs/`. I've included a "Debug" config that is small enough to train on a laptop CPU for testing the code flow (an illustrative set of debug-sized values is sketched after the table).
+
+ | Config File | Description |
+ | :--- | :--- |
+ | `configs/default.yaml` | Standard experimental setup (requires GPU). |
+ | `configs/debug.yaml` | Tiny model (2 layers) for code debugging. |
+
+ ## 📚 Acknowledgements & References
116
+
117
+ This project is an implementation study and relies heavily on the brilliant theoretical work of others. It is not an original invention of the Mamba or MoE concepts.
118
+
119
+ * **Mamba Architecture:** Gu, A., & Dao, T. (2023). *Mamba: Linear-Time Sequence Modeling with Selective State Spaces*. [arXiv:2312.00752](https://arxiv.org/abs/2312.00752)
120
+ * **Mixture of Experts:** Shazeer, N., et al. (2017). *Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer*. [arXiv:1701.06538](https://arxiv.org/abs/1701.06538)
121
+ * **Inspiration:** Jamba (AI21 Labs) and OpenMoE.
122
+
123
+ ## 🧠 Model Weights & Checkpoints
124
+
125
+ All pre-trained checkpoints are hosted on the [Hugging Face Hub](https://huggingface.co/Pomilon).
126
+
127
+ | Model Artifact | Step | Description | Download |
128
+ | :--- | :--- | :--- | :--- |
129
+ | **Aetheris-Base** | 10k | Early convergence checkpoint (Loss ~3.66). Good for analyzing router behavior. | [🤗 Hugging Face](https://huggingface.co/Pomilon/Aetheris-MoE-300M-A125M-base) |
130
+ | **Aetheris-Chat** | -- | *Coming Soon (Post-SFT)* | -- |
131
+
132
+ > **⚠️ Important:** Aetheris uses a custom Hybrid Mamba-MoE architecture. You **cannot** load it directly with `transformers.AutoModel`. You must use the interface provided in this repository.
133
+
134
+ ### 🐍 How to Load
135
+
136
+ ```python
137
+ python -m aetheris.cli.main generate --prompt "The quick brown fox" --checkpoint_dir path/to/checkpoints_folder # rename the checkpoint inside to checkpoint_current.pth
138
+ ```
139
+ > **Note:** will add better inference later down the line, for now used this scuffed version. :D
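If you would rather call the checkpoint from Python than shell out to the CLI, the commit also ships a small `InferenceEngine` wrapper in `aetheris/inference.py`. A minimal sketch, assuming the config matches the checkpoint and `checkpoints/checkpoint_current.pth` exists:

```python
from aetheris.inference import InferenceEngine

# Paths are placeholders; point them at your own config and checkpoint folder.
engine = InferenceEngine(
    config_path="configs/default.yaml",
    checkpoint_dir="checkpoints",
    checkpoint_name="checkpoint_current.pth",
)
print(engine.generate_full("The quick brown fox", max_new_tokens=50, temperature=0.8))
```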
+
+ > **Note:** These weights are from an experimental run. While they demonstrate the architectural capabilities, do not expect GPT-5 or even Google Bard levels of coherence. :D
+ > This project was made for learning and fun!
+
+ ## License
+
+ MIT
aetheris/__init__.py ADDED
@@ -0,0 +1,2 @@
+ from .model import HybridMambaMoE
+ from .config import AetherisConfig
aetheris/api/schemas.py ADDED
@@ -0,0 +1,92 @@
+ from typing import List, Optional, Union, Dict, Any
+ from pydantic import BaseModel, Field
+ import time
+
+ class ChatMessage(BaseModel):
+     role: str
+     content: str
+
+ class ChatCompletionRequest(BaseModel):
+     model: str
+     messages: List[ChatMessage]
+     temperature: Optional[float] = 1.0
+     top_p: Optional[float] = 1.0
+     n: Optional[int] = 1
+     stream: Optional[bool] = False
+     stop: Optional[Union[str, List[str]]] = None
+     max_tokens: Optional[int] = None
+     presence_penalty: Optional[float] = 0.0
+     frequency_penalty: Optional[float] = 0.0
+     logit_bias: Optional[Dict[str, float]] = None
+     user: Optional[str] = None
+
+ class ChatCompletionChoice(BaseModel):
+     index: int
+     message: ChatMessage
+     finish_reason: Optional[str] = None
+
+ class ChatCompletionResponse(BaseModel):
+     id: str
+     object: str = "chat.completion"
+     created: int = Field(default_factory=lambda: int(time.time()))
+     model: str
+     choices: List[ChatCompletionChoice]
+     usage: Optional[Dict[str, int]] = None
+
+ class ChatCompletionChunkDelta(BaseModel):
+     role: Optional[str] = None
+     content: Optional[str] = None
+
+ class ChatCompletionChunkChoice(BaseModel):
+     index: int
+     delta: ChatCompletionChunkDelta
+     finish_reason: Optional[str] = None
+
+ class ChatCompletionChunk(BaseModel):
+     id: str
+     object: str = "chat.completion.chunk"
+     created: int = Field(default_factory=lambda: int(time.time()))
+     model: str
+     choices: List[ChatCompletionChunkChoice]
+
+ class CompletionRequest(BaseModel):
+     model: str
+     prompt: Union[str, List[str]]
+     suffix: Optional[str] = None
+     max_tokens: Optional[int] = 16
+     temperature: Optional[float] = 1.0
+     top_p: Optional[float] = 1.0
+     n: Optional[int] = 1
+     stream: Optional[bool] = False
+     logprobs: Optional[int] = None
+     echo: Optional[bool] = False
+     stop: Optional[Union[str, List[str]]] = None
+     presence_penalty: Optional[float] = 0.0
+     frequency_penalty: Optional[float] = 0.0
+     best_of: Optional[int] = 1
+     logit_bias: Optional[Dict[str, float]] = None
+     user: Optional[str] = None
+
+ class CompletionChoice(BaseModel):
+     text: str
+     index: int
+     logprobs: Optional[Any] = None
+     finish_reason: Optional[str] = None
+
+ class CompletionResponse(BaseModel):
+     id: str
+     object: str = "text_completion"
+     created: int = Field(default_factory=lambda: int(time.time()))
+     model: str
+     choices: List[CompletionChoice]
+     usage: Optional[Dict[str, int]] = None
+
+ class ModelCard(BaseModel):
+     id: str
+     object: str = "model"
+     created: int = Field(default_factory=lambda: int(time.time()))
+     owned_by: str = "aetheris"
+
+ class ModelList(BaseModel):
+     object: str = "list"
+     data: List[ModelCard]
aetheris/api/server.py ADDED
@@ -0,0 +1,162 @@
+ import time
+ import uuid
+ import json
+ import asyncio
+ from typing import AsyncGenerator
+ from fastapi import FastAPI, HTTPException, Request
+ from fastapi.middleware.cors import CORSMiddleware
+ from sse_starlette.sse import EventSourceResponse
+ from aetheris.api.schemas import (
+     ChatCompletionRequest, ChatCompletionResponse, ChatCompletionChunk,
+     ChatCompletionChoice, ChatMessage, ChatCompletionChunkChoice, ChatCompletionChunkDelta,
+     CompletionRequest, CompletionResponse, CompletionChoice,
+     ModelList, ModelCard
+ )
+ from aetheris.inference import InferenceEngine
+
+ app = FastAPI(title="Aetheris API", version="0.1.0")
+
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Global engine instance
+ engine: InferenceEngine = None
+
+ def get_engine():
+     global engine
+     if engine is None:
+         # Defaults, ideally loaded from config/env
+         engine = InferenceEngine()
+     return engine
+
+ @app.on_event("startup")
+ async def startup_event():
+     get_engine()
+
+ @app.get("/v1/models", response_model=ModelList)
+ async def list_models():
+     return ModelList(data=[ModelCard(id="aetheris-hybrid-mamba-moe")])
+
+ @app.post("/v1/chat/completions")
+ async def chat_completions(request: ChatCompletionRequest):
+     engine = get_engine()
+
+     # Simple prompt construction from messages
+     prompt = ""
+     for msg in request.messages:
+         prompt += f"{msg.role}: {msg.content}\n"
+     prompt += "assistant: "
+
+     request_id = f"chatcmpl-{uuid.uuid4()}"
+     created_time = int(time.time())
+
+     if request.stream:
+         async def event_generator():
+             yield json.dumps(ChatCompletionChunk(
+                 id=request_id,
+                 created=created_time,
+                 model=request.model,
+                 choices=[ChatCompletionChunkChoice(
+                     index=0,
+                     delta=ChatCompletionChunkDelta(role="assistant"),
+                     finish_reason=None
+                 )]
+             ).model_dump())
+
+             for token in engine.generate(
+                 prompt=prompt,
+                 max_new_tokens=request.max_tokens or 100,
+                 temperature=request.temperature,
+                 top_p=request.top_p,
+                 repetition_penalty=1.0 + request.frequency_penalty,  # Approximating
+                 stream=True
+             ):
+                 yield json.dumps(ChatCompletionChunk(
+                     id=request_id,
+                     created=created_time,
+                     model=request.model,
+                     choices=[ChatCompletionChunkChoice(
+                         index=0,
+                         delta=ChatCompletionChunkDelta(content=token),
+                         finish_reason=None
+                     )]
+                 ).model_dump())
+
+             yield json.dumps(ChatCompletionChunk(
+                 id=request_id,
+                 created=created_time,
+                 model=request.model,
+                 choices=[ChatCompletionChunkChoice(
+                     index=0,
+                     delta=ChatCompletionChunkDelta(),
+                     finish_reason="stop"
+                 )]
+             ).model_dump())
+
+             yield "[DONE]"
+
+         return EventSourceResponse(event_generator())
+
+     else:
+         generated_text = engine.generate_full(
+             prompt=prompt,
+             max_new_tokens=request.max_tokens or 100,
+             temperature=request.temperature,
+             top_p=request.top_p,
+             repetition_penalty=1.0 + request.frequency_penalty
+         )
+
+         return ChatCompletionResponse(
+             id=request_id,
+             created=created_time,
+             model=request.model,
+             choices=[ChatCompletionChoice(
+                 index=0,
+                 message=ChatMessage(role="assistant", content=generated_text),
+                 finish_reason="stop"
+             )],
+             usage={"prompt_tokens": len(prompt), "completion_tokens": len(generated_text), "total_tokens": len(prompt) + len(generated_text)}  # Approximated (character counts, not tokens)
+         )
+
+ @app.post("/v1/completions")
+ async def completions(request: CompletionRequest):
+     engine = get_engine()
+
+     prompt = request.prompt
+     if isinstance(prompt, list):
+         prompt = prompt[0]  # Handle single prompt for now
+
+     request_id = f"cmpl-{uuid.uuid4()}"
+     created_time = int(time.time())
+
+     if request.stream:
+         # Streaming for completions is not implemented yet; fall through to the
+         # non-streaming path (the logic would mirror chat completions above).
+         pass  # TODO: Implement streaming for completions
+
+     generated_text = engine.generate_full(
+         prompt=prompt,
+         max_new_tokens=request.max_tokens or 16,
+         temperature=request.temperature,
+         top_p=request.top_p,
+         repetition_penalty=1.0 + request.frequency_penalty
+     )
+
+     return CompletionResponse(
+         id=request_id,
+         created=created_time,
+         model=request.model,
+         choices=[CompletionChoice(
+             text=generated_text,
+             index=0,
+             logprobs=None,
+             finish_reason="length"  # or "stop"
+         )],
+         usage={"prompt_tokens": len(prompt), "completion_tokens": len(generated_text), "total_tokens": len(prompt) + len(generated_text)}
+     )
aetheris/cli/__init__.py ADDED
@@ -0,0 +1 @@
+
aetheris/cli/main.py ADDED
@@ -0,0 +1,287 @@
+ import argparse
+ import sys
+ import torch
+ import os
+ import torch.nn.functional as F
+ from aetheris.config import AetherisConfig
+ from aetheris.model import HybridMambaMoE
+ from aetheris.data import create_streaming_loader, get_tokenizer
+ from aetheris.utils import load_latest_checkpoint, calculate_model_stats
+ from aetheris.trainer import Trainer
+
+ def train_command(args):
+     print(f"\n{'='*70}")
+     print(f"Aetheris Training")
+     print(f"Config: {args.config}")
+
+     if args.hf_token:
+         print(f"Using Hugging Face token: {args.hf_token[:10]}...")
+         from huggingface_hub import login
+         login(token=args.hf_token)
+
+     device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+     if device.type == 'cuda':
+         torch.backends.cuda.matmul.allow_tf32 = True
+         torch.backends.cudnn.allow_tf32 = True
+         torch.cuda.empty_cache()
+
+     config = AetherisConfig.from_yaml(args.config)
+     tokenizer = get_tokenizer()
+
+     print(f"Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")
+     print(f"Model Size: d_model={config.d_model}, layers={config.n_layer}")
+     print(f"{'='*70}\n")
+
+     model = HybridMambaMoE(config).to(device)
+
+     # Apply weight initialization
+     print("Applying proper weight initialization...")
+     model.apply(model._init_weights)
+
+     # Calculate model stats
+     stats = calculate_model_stats(model)
+     print(f"Total Parameters: {stats['total_params']:,}")
+     print(f"Trainable Parameters: {stats['trainable_params']:,}")
+
+     # Use lower learning rate for stability
+     optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01,
+                                   betas=(0.9, 0.95), eps=1e-8, fused=False if device.type == 'cpu' else True)
+     scaler = torch.amp.GradScaler('cuda' if device.type == 'cuda' else 'cpu', init_scale=2**10)
+
+     start_step, current_stage = load_latest_checkpoint(model, optimizer, scaler, device, args.checkpoint_dir, args.checkpoint_name)
+
+     trainer = Trainer(model, optimizer, scaler, config, device, args.checkpoint_dir)
+
+     # --- STAGE 1: PRE-TRAINING ---
+     if current_stage == "Pre-Training" or start_step == 0:
+         pt_loader = create_streaming_loader("cerebras/SlimPajama-627B", "train",
+                                             tokenizer, config, args.batch_size, mode="pretrain",
+                                             hf_token=args.hf_token, start_step=start_step)
+
+         # Validation loader (no skipping needed, always from start of val set)
+         pt_val_loader = create_streaming_loader("cerebras/SlimPajama-627B", "validation",
+                                                 tokenizer, config, args.batch_size, mode="pretrain",
+                                                 hf_token=args.hf_token)
+
+         start_step = trainer.train_epoch(pt_loader, total_steps=args.pretrain_steps,
+                                          start_step=start_step, stage_name="Pre-Training",
+                                          val_loader=pt_val_loader)
+         current_stage = "SFT"
+         start_step = 0
+
+     # --- STAGE 2: SFT ---
+     print("\n=== STAGE 2: SFT ===")
+     for param_group in optimizer.param_groups:
+         param_group['lr'] = 5e-5
+
+     sft_loader = create_streaming_loader("OpenAssistant/oasst1", "train",
+                                          tokenizer, config, args.batch_size, mode="sft",
+                                          hf_token=args.hf_token, start_step=start_step)
+
+     sft_val_loader = create_streaming_loader("OpenAssistant/oasst1", "validation",
+                                              tokenizer, config, args.batch_size, mode="sft",
+                                              hf_token=args.hf_token)
+
+     trainer.train_epoch(sft_loader, total_steps=args.sft_steps,
+                         start_step=start_step, stage_name="SFT",
+                         val_loader=sft_val_loader)
+
+     print("\nTraining Complete!")
+
+ @torch.no_grad()
+ def generate_command(args):
+     device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+     config = AetherisConfig.from_yaml(args.config)
+     tokenizer = get_tokenizer()
+
+     model = HybridMambaMoE(config).to(device).to(config.torch_dtype)
+
+     load_latest_checkpoint(model, None, None, device, args.checkpoint_dir, args.checkpoint_name)
+     model.eval()
+
+     prompt = args.prompt
+     max_new_tokens = args.max_new_tokens
+     temperature = args.temperature
+     top_k = args.top_k
+     top_p = args.top_p
+     repetition_penalty = args.repetition_penalty
+
+     input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
+     generated_ids = input_ids.clone()
+     history_ids = set(input_ids[0].tolist())
+
+     print("-" * 50)
+     print(f"Prompt: {prompt}")
+     print("Generated Continuation:")
+
+     for _ in range(max_new_tokens):
+         # Check if we should use autocast (skip if model uses float32)
+         use_autocast = True
+         if config.torch_dtype == torch.float32:
+             use_autocast = False
+
+         if use_autocast:
+             with torch.amp.autocast('cuda' if device.type == 'cuda' else 'cpu', dtype=model.config.torch_dtype):
+                 outputs = model(generated_ids)
+                 logits = outputs['logits']
+                 next_token_logits = logits[:, -1, :]
+         else:
+             outputs = model(generated_ids)
+             logits = outputs['logits']
+             next_token_logits = logits[:, -1, :]
+
+         # Repetition penalty
+         for token_id in history_ids:
+             if token_id < next_token_logits.size(-1):
+                 logit = next_token_logits[0, token_id].item()
+                 if logit > 0:
+                     next_token_logits[0, token_id] = logit / repetition_penalty
+                 else:
+                     next_token_logits[0, token_id] = logit * repetition_penalty
+
+         # Temperature
+         if temperature > 0:
+             next_token_logits = next_token_logits / temperature
+
+         # Top-p / Top-k
+         if top_p < 1.0:
+             sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
+             cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+             sorted_indices_to_remove = cumulative_probs > top_p
+             sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+             sorted_indices_to_remove[..., 0] = False
+             indices_to_remove = sorted_indices[sorted_indices_to_remove]
+             next_token_logits.scatter_(1, indices_to_remove.unsqueeze(0), float('-inf'))
+         elif top_k > 0:
+             top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k)
+             next_token_logits = torch.full_like(next_token_logits, float('-inf'))
+             next_token_logits.scatter_(1, top_k_indices, top_k_logits)
+
+         # Sample
+         next_token_probs = F.softmax(next_token_logits, dim=-1)
+         next_token = torch.multinomial(next_token_probs, num_samples=1)
+         next_token_item = next_token.item()
+
+         if next_token_item == tokenizer.eos_token_id:
+             break
+
+         generated_ids = torch.cat([generated_ids, next_token], dim=-1)
+         history_ids.add(next_token_item)
+
+         new_token_text = tokenizer.decode(next_token.squeeze().tolist(), skip_special_tokens=True)
+         print(new_token_text, end="", flush=True)
+
+     print("\n" + "-" * 50)
+
+ def info_command(args):
+     config = AetherisConfig.from_yaml(args.config)
+     model = HybridMambaMoE(config)
+
+     total_params = 0
+     dense_params = 0   # Parameters active for EVERY token
+     expert_params = 0  # Parameters in all MoE Experts
+
+     for name, param in model.named_parameters():
+         numel = param.numel()
+         total_params += numel
+
+         if 'experts' in name:
+             expert_params += numel
+         else:
+             dense_params += numel
+
+     single_expert_size = expert_params / config.num_experts if config.num_experts > 0 else 0
+     active_per_token_params = dense_params + (single_expert_size * config.top_k)
+
+     def format_count(count):
+         return f"{count / 1_000_000:.2f}M"
+
+     print("=" * 50)
+     print("Hybrid Mamba-MoE Model Parameter Analysis")
+     print("=" * 50)
+     print(f"Total Model Layers (N_Layer): {config.n_layer}")
+     print(f"MoE Experts per Layer: {config.num_experts}")
+     print(f"Active Experts (Top-K): {config.top_k}")
+     print("-" * 50)
+     print(f"Total Parameters (Checkpoint Size): {format_count(total_params)}")
+     print(f"Dense (Always Active) Parameters: {format_count(dense_params)}")
+     print(f"Expert-Only Parameters: {format_count(expert_params)}")
+     print("-" * 50)
+     print(f"**Active Parameters (Per-Token Compute Load): {format_count(active_per_token_params)}**")
+     print("  (This is the 'Dense' parameters + the K active expert parameters)")
+     print("=" * 50)
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Aetheris CLI")
+     subparsers = parser.add_subparsers(dest="command", help="Available commands")
+
+     # Train Command
+     train_parser = subparsers.add_parser("train", help="Train the model")
+     train_parser.add_argument("--config", type=str, default="configs/default.yaml", help="Path to config file")
+     train_parser.add_argument("--checkpoint_dir", type=str, default="checkpoints", help="Directory to save checkpoints")
+     train_parser.add_argument("--hf_token", type=str, default=os.environ.get("HF_TOKEN"), help="HuggingFace Token")
+     train_parser.add_argument("--batch_size", type=int, default=2, help="Batch size")
+     train_parser.add_argument("--pretrain_steps", type=int, default=50000, help="Number of pretraining steps")
+     train_parser.add_argument("--sft_steps", type=int, default=1000, help="Number of SFT steps")
+     train_parser.add_argument("--checkpoint_name", type=str, default="checkpoint_current.pth", help="Checkpoint file name to load from")
+
+     # Generate Command
+     gen_parser = subparsers.add_parser("generate", help="Generate text")
+     gen_parser.add_argument("--config", type=str, default="configs/default.yaml", help="Path to config file")
+     gen_parser.add_argument("--checkpoint_dir", type=str, default="checkpoints", help="Directory with checkpoints")
+     gen_parser.add_argument("--checkpoint_name", type=str, default="checkpoint_current.pth", help="Checkpoint file name")
+     gen_parser.add_argument("--prompt", type=str, default="The quick brown fox", help="Prompt for generation")
+     gen_parser.add_argument("--max_new_tokens", type=int, default=100, help="Max new tokens to generate")
+     gen_parser.add_argument("--temperature", type=float, default=0.8, help="Sampling temperature")
+     gen_parser.add_argument("--top_k", type=int, default=0, help="Top-k sampling")
+     gen_parser.add_argument("--top_p", type=float, default=0.9, help="Top-p sampling")
+     gen_parser.add_argument("--repetition_penalty", type=float, default=3.0, help="Repetition penalty")
+
+     # Serve Command
+     serve_parser = subparsers.add_parser("serve", help="Start the API server")
+     serve_parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to bind")
+     serve_parser.add_argument("--port", type=int, default=8000, help="Port to bind")
+     serve_parser.add_argument("--config", type=str, default="configs/default.yaml", help="Path to config file")
+     serve_parser.add_argument("--checkpoint_dir", type=str, default="checkpoints", help="Directory with checkpoints")
+     serve_parser.add_argument("--checkpoint_name", type=str, default="checkpoint_current.pth", help="Checkpoint file name")
+
+     # Info Command
+     info_parser = subparsers.add_parser("info", help="Show model info")
+     info_parser.add_argument("--config", type=str, default="configs/default.yaml", help="Path to config file")
+
+     args = parser.parse_args()
+
+     if args.command == "train":
+         train_command(args)
+     elif args.command == "generate":
+         generate_command(args)
+     elif args.command == "serve":
+         import uvicorn
+         from aetheris.api.server import app
+         from aetheris.inference import InferenceEngine
+         import aetheris.api.server
+
+         # Initialize the global engine with the CLI arguments before the server
+         # starts, so the startup hook does not fall back to the default paths.
+         aetheris.api.server.engine = InferenceEngine(
+             config_path=args.config,
+             checkpoint_dir=args.checkpoint_dir,
+             checkpoint_name=args.checkpoint_name
+         )
+
+         uvicorn.run(app, host=args.host, port=args.port)
+
+     elif args.command == "info":
+         info_command(args)
+     else:
+         parser.print_help()
+
+ if __name__ == "__main__":
+     main()
aetheris/config.py ADDED
@@ -0,0 +1,58 @@
+ from dataclasses import dataclass, field
+ import yaml
+ import torch
+ from typing import Optional
+
+ @dataclass
+ class AetherisConfig:
+     # Model dimensions
+     vocab_size: int = 50257
+     d_model: int = 768
+     n_layer: int = 24
+     num_experts: int = 4
+     top_k: int = 1
+     d_ff: int = 2304  # d_model * 3
+
+     # SSM parameters
+     ssm_d_state: int = 16
+     ssm_expand: int = 2
+     d_inner: Optional[int] = None  # Will be d_model * ssm_expand if None
+
+     # Training parameters
+     load_balancing_coef: float = 1e-2
+     router_z_loss_coef: float = 1e-3
+     max_seq_len: int = 512
+     dtype: str = "float16"  # "float16", "float32", "bfloat16"
+
+     # Optimization settings
+     use_cpu_offload: bool = False
+     gradient_checkpointing: bool = True
+     checkpoint_ssm_layers: bool = True
+     use_flash_attention: bool = False
+
+     def __post_init__(self):
+         if self.d_inner is None:
+             self.d_inner = self.d_model * self.ssm_expand
+         if self.d_ff is None:
+             self.d_ff = self.d_model * 3
+
+     @property
+     def torch_dtype(self):
+         if self.dtype == "float16":
+             return torch.float16
+         elif self.dtype == "float32":
+             return torch.float32
+         elif self.dtype == "bfloat16":
+             return torch.bfloat16
+         else:
+             raise ValueError(f"Unsupported dtype: {self.dtype}")
+
+     @classmethod
+     def from_yaml(cls, path: str):
+         with open(path, 'r') as f:
+             config_dict = yaml.safe_load(f)
+         return cls(**config_dict)
+
+     def to_yaml(self, path: str):
+         with open(path, 'w') as f:
+             yaml.dump(self.__dict__, f)
aetheris/data.py ADDED
@@ -0,0 +1,105 @@
+ import torch
+ from torch.utils.data import DataLoader, IterableDataset
+ from transformers import AutoTokenizer
+ from datasets import load_dataset
+ import random
+ from typing import Dict, Iterator
+ import os
+
+ def get_tokenizer(model_name: str = "gpt2"):
+     tokenizer = AutoTokenizer.from_pretrained(model_name)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+     return tokenizer
+
+ class StreamingDataset(IterableDataset):
+     def __init__(self, dataset, tokenizer, max_seq_len, mode="pretrain", buffer_size=500, skip_samples=0):
+         self.dataset = dataset
+         self.tokenizer = tokenizer
+         self.max_seq_len = max_seq_len
+         self.mode = mode
+         self.buffer_size = buffer_size
+         self.skip_samples = skip_samples
+
+     def _prepare_sft_text(self, example):
+         if 'messages' in example:
+             text = ""
+             for msg in example['messages']:
+                 role = msg.get('role', '')
+                 content = msg.get('content', '')
+                 if role == 'assistant':
+                     text += f"Assistant: {content}{self.tokenizer.eos_token}"
+                 else:
+                     text += f"User: {content}\n"
+             return text
+         elif 'text' in example:
+             return example['text']
+         else:
+             return ""
+
+     def __iter__(self) -> Iterator[Dict[str, torch.Tensor]]:
+         iterator = iter(self.dataset)
+         buffer = []
+
+         # Skipping (for resume) is handled in the yield loops below
+
+         for example in iterator:
+             text = (example.get('text', '') if self.mode == "pretrain"
+                     else self._prepare_sft_text(example))
+
+             if len(text) < 10:
+                 continue
+
+             enc = self.tokenizer(text, truncation=True, max_length=self.max_seq_len,
+                                  return_tensors="pt")
+             input_ids = enc['input_ids'][0]
+
+             if len(input_ids) < 2:
+                 continue
+
+             pad_len = 0
+             if len(input_ids) < self.max_seq_len:
+                 pad_len = self.max_seq_len - len(input_ids)
+                 input_ids = torch.cat([
+                     input_ids,
+                     torch.full((pad_len,), self.tokenizer.pad_token_id, dtype=torch.long)
+                 ])
+
+             labels = input_ids.clone()
+             if pad_len > 0:
+                 labels[-pad_len:] = -100  # mask padding positions out of the loss
+
+             buffer.append((input_ids, labels))
+
+             if len(buffer) >= self.buffer_size:
+                 random.shuffle(buffer)
+                 for _ in range(self.buffer_size // 2):
+                     item = buffer.pop()
+                     if self.skip_samples > 0:
+                         self.skip_samples -= 1
+                         continue
+                     yield item
+
+         # Yield remaining
+         random.shuffle(buffer)
+         while buffer:
+             item = buffer.pop()
+             if self.skip_samples > 0:
+                 self.skip_samples -= 1
+                 continue
+             yield item
+
+ def create_streaming_loader(dataset_name, split, tokenizer, config, batch_size, mode="pretrain", hf_token=None, start_step=0):
+     raw_dataset = load_dataset(dataset_name, split=split, streaming=True,
+                                trust_remote_code=True, token=hf_token)
+
+     # Calculate samples to skip: start_step * batch_size
+     skip_samples = start_step * batch_size
+     if skip_samples > 0:
+         print(f"  [Loader] Resuming: Fast-forwarding dataset by {skip_samples} samples...")
+
+     stream_ds = StreamingDataset(raw_dataset, tokenizer, config.max_seq_len, mode=mode, skip_samples=skip_samples)
+
+     # Increase num_workers for better utilization
+     return DataLoader(stream_ds, batch_size=batch_size, pin_memory=True,
+                       num_workers=4, prefetch_factor=4)
aetheris/inference.py ADDED
@@ -0,0 +1,106 @@
+ import torch
+ import torch.nn.functional as F
+ from typing import Optional, List, Generator
+ from aetheris.config import AetherisConfig
+ from aetheris.model import HybridMambaMoE
+ from aetheris.data import get_tokenizer
+ from aetheris.utils import load_latest_checkpoint
+
+ class InferenceEngine:
+     def __init__(self, config_path: str = "configs/default.yaml", checkpoint_dir: str = "checkpoints", checkpoint_name: str = "checkpoint_current.pth", device: str = None):
+         self.device = torch.device(device if device else ('cuda' if torch.cuda.is_available() else 'cpu'))
+         self.config = AetherisConfig.from_yaml(config_path)
+         self.tokenizer = get_tokenizer()
+
+         self.model = HybridMambaMoE(self.config).to(self.device).to(self.config.torch_dtype)
+
+         # Load checkpoint
+         # Note: load_latest_checkpoint expects optimizer and scaler, but for inference we can pass None
+         load_latest_checkpoint(self.model, None, None, self.device, checkpoint_dir, checkpoint_name)
+         self.model.eval()
+
+     def generate(self,
+                  prompt: str,
+                  max_new_tokens: int = 100,
+                  temperature: float = 0.8,
+                  top_k: int = 0,
+                  top_p: float = 0.9,
+                  repetition_penalty: float = 1.0,
+                  stream: bool = False) -> Generator[str, None, None] | str:
+
+         input_ids = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)
+         generated_ids = input_ids.clone()
+         history_ids = set(input_ids[0].tolist())
+
+         def token_generator():
+             nonlocal generated_ids
+             for _ in range(max_new_tokens):
+                 # Check if we should use autocast (skip if model uses float32)
+                 use_autocast = True
+                 if self.config.torch_dtype == torch.float32:
+                     use_autocast = False
+
+                 if use_autocast:
+                     with torch.amp.autocast('cuda' if self.device.type == 'cuda' else 'cpu', dtype=self.model.config.torch_dtype):
+                         outputs = self.model(generated_ids)
+                         logits = outputs['logits']
+                         next_token_logits = logits[:, -1, :]
+                 else:
+                     outputs = self.model(generated_ids)
+                     logits = outputs['logits']
+                     next_token_logits = logits[:, -1, :]
+
+                 # Repetition penalty
+                 for token_id in history_ids:
+                     if token_id < next_token_logits.size(-1):
+                         logit = next_token_logits[0, token_id].item()
+                         if logit > 0:
+                             next_token_logits[0, token_id] = logit / repetition_penalty
+                         else:
+                             next_token_logits[0, token_id] = logit * repetition_penalty
+
+                 # Temperature
+                 if temperature > 0:
+                     next_token_logits = next_token_logits / temperature
+
+                 # Top-p / Top-k
+                 if top_p < 1.0:
+                     sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
+                     cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+                     sorted_indices_to_remove = cumulative_probs > top_p
+                     sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+                     sorted_indices_to_remove[..., 0] = False
+                     indices_to_remove = sorted_indices[sorted_indices_to_remove]
+                     next_token_logits.scatter_(1, indices_to_remove.unsqueeze(0), float('-inf'))
+                 elif top_k > 0:
+                     top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k)
+                     next_token_logits = torch.full_like(next_token_logits, float('-inf'))
+                     next_token_logits.scatter_(1, top_k_indices, top_k_logits)
+
+                 # Sample
+                 next_token_probs = F.softmax(next_token_logits, dim=-1)
+                 next_token = torch.multinomial(next_token_probs, num_samples=1)
+                 next_token_item = next_token.item()
+
+                 if next_token_item == self.tokenizer.eos_token_id:
+                     break
+
+                 generated_ids = torch.cat([generated_ids, next_token], dim=-1)
+                 history_ids.add(next_token_item)
+
+                 new_token_text = self.tokenizer.decode(next_token.squeeze().tolist(), skip_special_tokens=True)
+                 yield new_token_text
+
+         if stream:
+             return token_generator()
+         else:
+             return "".join(list(token_generator()))
+
+     def generate_full(self,
+                       prompt: str,
+                       max_new_tokens: int = 100,
+                       temperature: float = 0.8,
+                       top_k: int = 0,
+                       top_p: float = 0.9,
+                       repetition_penalty: float = 1.0) -> str:
+         return self.generate(prompt, max_new_tokens, temperature, top_k, top_p, repetition_penalty, stream=False)
aetheris/model.py ADDED
@@ -0,0 +1,86 @@
+ import torch
+ import torch.nn as nn
+ from typing import Dict, Any, List
+ from .config import AetherisConfig
+ from .modules import SSMBlock, SparseMoELayer
+
+ class HybridMambaMoE(nn.Module):
+     def __init__(self, config: AetherisConfig):
+         super().__init__()
+         self.config = config
+         self.embedding = nn.Embedding(config.vocab_size, config.d_model)
+
+         self.layers = nn.ModuleList()
+         for i in range(config.n_layer):
+             if i % 2 == 0:
+                 self.layers.append(SSMBlock(config))
+             else:
+                 self.layers.append(SparseMoELayer(config))
+
+         self.final_norm = nn.LayerNorm(config.d_model)
+         self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
+         self.lm_head.weight = self.embedding.weight  # Weight tying
+
+         self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # matches the -100 padding labels from the data loader
+         self.gradient_checkpointing = config.gradient_checkpointing
+
+         # Initialize embeddings with smaller scale
+         nn.init.normal_(self.embedding.weight, mean=0.0, std=0.02)
+
+     def _init_weights(self, module):
+         """Apply proper weight initialization"""
+         if isinstance(module, nn.Linear):
+             nn.init.xavier_uniform_(module.weight, gain=0.5)
+             if module.bias is not None:
+                 nn.init.zeros_(module.bias)
+         elif isinstance(module, nn.Embedding):
+             nn.init.normal_(module.weight, mean=0.0, std=0.02)
+         elif isinstance(module, nn.LayerNorm):
+             nn.init.ones_(module.weight)
+             nn.init.zeros_(module.bias)
+
+     def forward(self, input_ids: torch.Tensor, labels: torch.Tensor = None) -> Dict[str, Any]:
+         x = self.embedding(input_ids)
+         total_aux_loss = torch.tensor(0.0, device=x.device, dtype=x.dtype)
+
+         for i, layer in enumerate(self.layers):
+             if self.gradient_checkpointing and self.training:
+                 # Checkpoint ALL layers for maximum memory savings
+                 if isinstance(layer, SparseMoELayer):
+                     def moe_forward(module, inp):
+                         return module(inp)
+                     x, aux_loss = torch.utils.checkpoint.checkpoint(
+                         moe_forward, layer, x, use_reentrant=False
+                     )
+                     total_aux_loss = total_aux_loss + aux_loss
+                 else:
+                     x = torch.utils.checkpoint.checkpoint(
+                         layer, x, use_reentrant=False
+                     )
+             else:
+                 if isinstance(layer, SparseMoELayer):
+                     x, aux_loss = layer(x)
+                     total_aux_loss = total_aux_loss + aux_loss
+                 else:
+                     x = layer(x)
+
+         x = self.final_norm(x)
+         logits = self.lm_head(x)
+
+         if labels is not None:
+             shift_logits = logits[..., :-1, :].contiguous()
+             shift_labels = labels[..., 1:].contiguous()
+             ce_loss = self.loss_fn(shift_logits.view(-1, self.config.vocab_size),
+                                    shift_labels.view(-1))
+
+             # Scale down aux loss to prevent it from dominating
+             total_loss = ce_loss + 0.01 * total_aux_loss
+
+             return {
+                 "loss": total_loss,
+                 "ce_loss": ce_loss,
+                 "aux_loss": total_aux_loss,
+                 "logits": logits
+             }
+
+         return {"logits": logits}
aetheris/modules/__init__.py ADDED
@@ -0,0 +1,3 @@
+ from .expert import Expert
+ from .ssm import SSMBlock, selective_scan_native
+ from .moe import SparseMoELayer
aetheris/modules/expert.py ADDED
@@ -0,0 +1,35 @@
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class Expert(nn.Module):
+     """Memory-efficient Feed-Forward Network expert with proper initialization."""
+     def __init__(self, d_model: int, d_ff: int):
+         super().__init__()
+         self.w1 = nn.Linear(d_model, d_ff, bias=False)
+         self.w2 = nn.Linear(d_ff, d_model, bias=False)
+         self.act = nn.GELU()
+
+         # Proper initialization to prevent NaN
+         nn.init.xavier_uniform_(self.w1.weight, gain=0.5)
+         nn.init.xavier_uniform_(self.w2.weight, gain=0.5)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         orig_dtype = x.dtype
+         # Force float32 for internal computation to prevent overflow in half precision
+         x = x.to(torch.float32)
+
+         # Cast weights to float32 for calculation
+         # This is necessary because the module weights might be float16
+         w1_weight = self.w1.weight.to(torch.float32)
+         w2_weight = self.w2.weight.to(torch.float32)
+
+         h = F.linear(x, w1_weight)
+         h = self.act(h)
+         out = F.linear(h, w2_weight)
+
+         # Clamp to avoid Inf when casting back to float16
+         if orig_dtype == torch.float16:
+             out = torch.clamp(out, min=-65500.0, max=65500.0)
+
+         return out.to(orig_dtype)
aetheris/modules/moe.py ADDED
@@ -0,0 +1,83 @@
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from ..config import AetherisConfig
+ from .expert import Expert
+
+ class SparseMoELayer(nn.Module):
+     """Memory-optimized Sparse MoE with efficient routing."""
+     def __init__(self, config: AetherisConfig):
+         super().__init__()
+         self.d_model = config.d_model
+         self.num_experts = config.num_experts
+         self.top_k = config.top_k
+         self.load_balancing_coef = config.load_balancing_coef
+         self.z_loss_coef = config.router_z_loss_coef
+
+         self.gate = nn.Linear(config.d_model, config.num_experts, bias=False)
+         self.experts = nn.ModuleList([Expert(config.d_model, config.d_ff)
+                                       for _ in range(config.num_experts)])
+         self.norm = nn.LayerNorm(config.d_model)
+
+     def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
+         B, L, D = x.shape
+         x_norm = self.norm(x)
+         flat_x = x_norm.view(-1, D)
+
+         # Routing logits with stability
+         gate_logits = self.gate(flat_x)
+
+         # Clamp logits to prevent overflow
+         gate_logits = torch.clamp(gate_logits, min=-10.0, max=10.0)
+
+         # Z-Loss for stability
+         z_loss = torch.mean(torch.logsumexp(gate_logits, dim=-1)**2) * self.z_loss_coef
+
+         if self.training:
+             # Reduce noise for stability
+             gate_logits = gate_logits + torch.randn_like(gate_logits) * 1e-3
+
+         gate_probs = F.softmax(gate_logits, dim=-1)
+         gate_weights, expert_indices = torch.topk(gate_probs, self.top_k, dim=-1)
+
+         # Normalize weights for stability
+         gate_weights = gate_weights / (gate_weights.sum(dim=-1, keepdim=True) + 1e-8)
+
+         # Load balancing loss
+         # Use only the top-1 expert for load balancing calculation to keep it simple and consistent
+         expert_mask = F.one_hot(expert_indices[:, 0], num_classes=self.num_experts).float()
+         fraction_routed = expert_mask.mean(dim=0)
+         mean_prob = gate_probs.mean(dim=0)
+
+         aux_loss = (self.num_experts * torch.sum(fraction_routed * mean_prob)) * self.load_balancing_coef
+         total_aux_loss = aux_loss + z_loss
+
+         # Efficient dispatch with in-place operations
+         # Accumulate in float32 to prevent overflow during aggregation
+         final_output = torch.zeros_like(flat_x, dtype=torch.float32)
+
+         # Iterate over all k selected experts
+         for k_idx in range(self.top_k):
+             for i, expert in enumerate(self.experts):
+                 # Find tokens routed to expert 'i' at the k-th position
+                 mask = (expert_indices[:, k_idx] == i)
+                 if not mask.any():
+                     continue
+
+                 expert_input = flat_x[mask]
+                 expert_out = expert(expert_input)
+
+                 # Apply weights
+                 weights = gate_weights[mask, k_idx].unsqueeze(1)
+
+                 # Cast to float32 for accumulation
+                 expert_out = expert_out.to(torch.float32)
+                 weights = weights.to(torch.float32)
+
+                 # Accumulate output (add to existing results from other experts)
+                 final_output[mask] += expert_out * weights
+
+         # Cast back to original dtype
+         final_output = final_output.to(flat_x.dtype)
+
+         return x + final_output.view(B, L, D), total_aux_loss
aetheris/modules/ssm.py ADDED
@@ -0,0 +1,91 @@
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from ..config import AetherisConfig
+
+ def selective_scan_native(u: torch.Tensor, delta: torch.Tensor, A: torch.Tensor,
+                           B: torch.Tensor, C: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
+     """Memory-efficient scan with reduced intermediate tensors."""
+     B_size, L, D_inner = u.shape
+     D_state = A.shape[-1]
+
+     # Use in-place operations where possible
+     h = torch.zeros(B_size, D_inner, D_state, device=u.device, dtype=u.dtype)
+     ys = []
+
+     for l in range(L):
+         dt = delta[:, l, :].unsqueeze(-1)
+         dA = torch.exp(dt * A)
+
+         B_l = B[:, l, :].unsqueeze(1)
+         dB = dt * B_l
+
+         u_t = u[:, l, :].unsqueeze(-1)
+         h = dA * h + dB * u_t
+
+         C_l = C[:, l, :].unsqueeze(1)
+         y_t = torch.sum(h * C_l, dim=-1)
+         ys.append(y_t)
+
+     y = torch.stack(ys, dim=1)
+     return y + u * D
+
+ class SSMBlock(nn.Module):
+     """Memory-optimized State Space Model with stability improvements."""
+     def __init__(self, config: AetherisConfig):
+         super().__init__()
+         self.d_model = config.d_model
+         self.d_state = config.ssm_d_state
+         self.d_inner = config.d_inner
+
+         self.in_proj = nn.Linear(self.d_model, self.d_inner * 2, bias=False)
+         self.out_proj = nn.Linear(self.d_inner, self.d_model, bias=False)
+         self.conv_d = nn.Conv1d(self.d_inner, self.d_inner, kernel_size=3,
+                                 padding=2, groups=self.d_inner, bias=False)
+         self.gate_proj = nn.Linear(self.d_model, self.d_inner, bias=False)
+
+         self.B_proj = nn.Linear(self.d_inner, self.d_state, bias=False)
+         self.C_proj = nn.Linear(self.d_inner, self.d_state, bias=False)
+         self.delta_proj = nn.Linear(self.d_inner, self.d_inner, bias=False)
+
+         # Initialize A to be more stable (closer to -1)
+         self.A_log = nn.Parameter(torch.randn(self.d_inner, self.d_state) * 0.1 - 4.0)
+         self.D = nn.Parameter(torch.ones(self.d_inner) * 0.1)
+
+         self.act = nn.SiLU()
+         self.norm = nn.LayerNorm(config.d_model)
+
+         # Proper initialization
+         nn.init.xavier_uniform_(self.in_proj.weight, gain=0.5)
+         nn.init.xavier_uniform_(self.out_proj.weight, gain=0.5)
+         nn.init.xavier_uniform_(self.gate_proj.weight, gain=0.5)
+         nn.init.xavier_uniform_(self.B_proj.weight, gain=0.5)
+         nn.init.xavier_uniform_(self.C_proj.weight, gain=0.5)
+         nn.init.xavier_uniform_(self.delta_proj.weight, gain=0.5)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         B, L, D = x.shape
+         x_norm = self.norm(x)
+
+         xz = self.in_proj(x_norm)
+         x_in, z_gate = xz.chunk(2, dim=-1)
+         x_conv = self.conv_d(x_in.transpose(1, 2))
+         # Slice off the last 2 elements (the "future" leakage)
+         x_conv = x_conv[:, :, :-2].transpose(1, 2)
+         x_conv = self.act(x_conv)
+
+         # Add small epsilon to prevent numerical issues and clamp max value
+         delta = torch.clamp(F.softplus(self.delta_proj(x_conv)), max=5.0) + 1e-4
+         B_ssm = self.B_proj(x_conv)
+         C_ssm = self.C_proj(x_conv)
+
+         # Clamp A to prevent extreme values
+         A_fixed = -torch.exp(torch.clamp(self.A_log, min=-10.0, max=2.0))
+         A_batched = A_fixed.unsqueeze(0).expand(B, -1, -1)
+
+         y_ssm = selective_scan_native(x_conv, delta, A_batched, B_ssm, C_ssm, self.D)
+
+         y_gate = F.silu(self.gate_proj(x_norm)) * y_ssm
+         output = self.out_proj(y_gate)
+
+         return x + output
aetheris/trainer/__init__.py ADDED
@@ -0,0 +1 @@
+ from .trainer import Trainer
aetheris/trainer/trainer.py ADDED
@@ -0,0 +1,145 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch
2
+ import time
3
+ import os
4
+ from aetheris.utils import save_checkpoint, load_latest_checkpoint, calculate_model_stats
5
+
6
+ class Trainer:
7
+ def __init__(self, model, optimizer, scaler, config, device, checkpoint_dir, logger=None):
8
+ self.model = model
9
+ self.optimizer = optimizer
10
+ self.scaler = scaler
11
+ self.config = config
12
+ self.device = device
13
+ self.checkpoint_dir = checkpoint_dir
14
+ self.logger = logger
15
+
16
+ self.model.to(self.device)
17
+
18
+ def validate(self, val_loader, global_step):
19
+ self.model.eval()
20
+ total_loss = 0
21
+ total_items = 0
22
+ num_batches = 100 # Validate on 100 batches to save time
23
+
24
+ print(f"\n[Validation] Starting validation at step {global_step}...")
25
+
26
+ with torch.no_grad():
27
+ for i, batch in enumerate(val_loader):
28
+ if i >= num_batches:
29
+ break
30
+
31
+ input_ids, labels = batch
32
+ input_ids = input_ids.to(self.device, non_blocking=True)
33
+ labels = labels.to(self.device, non_blocking=True)
34
+
35
+ # Auto-cast context
36
+ if self.device.type == 'cuda':
37
+ autocast_dtype = torch.float16
38
+ else:
39
+ autocast_dtype = torch.bfloat16
40
+
41
+ use_autocast = True if self.config.torch_dtype != torch.float32 else False
42
+
43
+ if use_autocast:
44
+ with torch.amp.autocast('cuda' if self.device.type == 'cuda' else 'cpu', dtype=autocast_dtype):
45
+ output = self.model(input_ids, labels)
46
+ else:
47
+ output = self.model(input_ids, labels)
48
+
49
+ total_loss += output["loss"].item()
50
+ total_items += 1
51
+
52
+ avg_loss = total_loss / total_items if total_items > 0 else 0
53
+ perplexity = torch.exp(torch.tensor(avg_loss)).item()
54
+
55
+ print(f"[Validation] Step {global_step} | Loss: {avg_loss:.4f} | PPL: {perplexity:.4f}")
56
+ self.model.train()
57
+ return avg_loss
58
+
59
+ def train_epoch(self, train_loader, total_steps, start_step=0, stage_name="Training", val_loader=None, eval_every=500):
60
+ print(f"\n{'='*70}\nStarting {stage_name}: Target Steps={total_steps}\n{'='*70}")
61
+ self.model.train()
62
+ global_step = start_step
63
+ running_loss = 0
64
+
65
+ print("Initializing data iterator...")
66
+ train_iter = iter(train_loader)
67
+
68
+ print("Fetching first batch...")
69
+
70
+ while global_step < total_steps:
71
+ step_start = time.time()
72
+
73
+ # Removed periodic cache clearing for performance
74
+
75
+ self.optimizer.zero_grad(set_to_none=True)
76
+
77
+ try:
78
+ batch = next(train_iter)
79
+ if global_step == start_step:
80
+ print(f"✓ First batch loaded! Starting training loop...")
81
+ except StopIteration:
82
+ train_iter = iter(train_loader)
83
+ batch = next(train_iter)
84
+
85
+ input_ids, labels = batch
86
+ input_ids = input_ids.to(self.device, non_blocking=True)
87
+ labels = labels.to(self.device, non_blocking=True)
88
+
89
+ # Determine autocast dtype
90
+ if self.device.type == 'cuda':
91
+ autocast_dtype = torch.float16
92
+ else:
93
+ autocast_dtype = torch.bfloat16
94
+
95
+ # Check if we should use autocast (skip if model uses float32)
96
+ use_autocast = True
97
+ if self.config.torch_dtype == torch.float32:
98
+ use_autocast = False
99
+
100
+ if use_autocast:
101
+ with torch.amp.autocast('cuda' if self.device.type == 'cuda' else 'cpu', dtype=autocast_dtype):
102
+ output = self.model(input_ids, labels)
103
+ loss = output["loss"]
104
+ else:
105
+ output = self.model(input_ids, labels)
106
+ loss = output["loss"]
107
+
108
+ self.scaler.scale(loss).backward()
109
+ self.scaler.unscale_(self.optimizer)
110
+
111
+ # Gradient clipping
112
+ grad_norm = torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=0.5)
113
+
114
+ if torch.isnan(grad_norm) or torch.isinf(grad_norm):
115
+ print(f"WARNING: NaN/Inf gradient at step {global_step}, skipping update")
116
+ else:
117
+ self.scaler.step(self.optimizer)
118
+
119
+ self.scaler.update()
120
+
121
+ global_step += 1
122
+ running_loss += loss.item()
123
+
124
+ if global_step % 10 == 0:
125
+ avg_loss = running_loss / 10
126
+ t_diff = time.time() - step_start
127
+ if self.device.type == 'cuda':
128
+ mem = torch.cuda.memory_allocated() / 1e9
129
+ max_mem = torch.cuda.max_memory_allocated() / 1e9
130
+ mem_str = f"VRAM: {mem:.1f}GB (peak: {max_mem:.1f}GB)"
131
+ else:
132
+ mem_str = "CPU Mode"
133
+
134
+ tokens_per_sec = (input_ids.size(0) * input_ids.size(1)) / t_diff  # actual tokens in this batch
135
+ print(f" Step {global_step}/{total_steps} | Loss: {avg_loss:.4f} | "
136
+ f"{mem_str} | {tokens_per_sec:.0f} tok/s")
137
+ running_loss = 0
138
+
139
+ if global_step % 500 == 0:
140
+ save_checkpoint(self.model, self.optimizer, self.scaler, global_step, stage_name, self.checkpoint_dir)
141
+
142
+ if val_loader is not None and global_step % eval_every == 0 and global_step > start_step:
143
+ self.validate(val_loader, global_step)
144
+
145
+ return global_step
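The Trainer expects its collaborators to be built by the CLI. As a hedged usage sketch (the optimizer, learning rate, and checkpoint directory here are illustrative choices, not the repository's training script), it could be wired up like this:

import torch
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE
from aetheris.trainer import Trainer

config = AetherisConfig.from_yaml("configs/debug.yaml")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = HybridMambaMoE(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# GradScaler becomes a no-op when disabled, so the same loop runs on CPU
scaler = torch.cuda.amp.GradScaler(enabled=(device.type == "cuda"))

trainer = Trainer(model, optimizer, scaler, config, device, checkpoint_dir="checkpoints")
# train_loader / val_loader would come from aetheris.data.create_streaming_loader(...)
# global_step = trainer.train_epoch(train_loader, total_steps=1000,
#                                   stage_name="Pre-Training", val_loader=val_loader)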
aetheris/utils.py ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import torch
3
+ from typing import Tuple
4
+
5
+ def save_checkpoint(model, optimizer, scaler, step, stage, checkpoint_dir, checkpoint_name="checkpoint_current.pth"):
6
+ os.makedirs(checkpoint_dir, exist_ok=True)
7
+ path = os.path.join(checkpoint_dir, checkpoint_name)
8
+ torch.save({
9
+ 'step': step,
10
+ 'stage': stage,
11
+ 'model_state_dict': model.state_dict(),
12
+ 'optimizer_state_dict': optimizer.state_dict(),
13
+ 'scaler_state_dict': scaler.state_dict()
14
+ }, path)
15
+ print(f" [Checkpoint] Saved at step {step}")
16
+
17
+ def load_latest_checkpoint(model, optimizer, scaler, device, checkpoint_dir, checkpoint_name="checkpoint_current.pth") -> Tuple[int, str]:
18
+ path = os.path.join(checkpoint_dir, checkpoint_name)
19
+ if not os.path.exists(path):
20
+ return 0, "Pre-Training"
21
+
22
+ print(f" [Checkpoint] Loading from {path}...")
23
+ ckpt = torch.load(path, map_location=device)
24
+ model.load_state_dict(ckpt['model_state_dict'])
25
+ if optimizer:
26
+ optimizer.load_state_dict(ckpt['optimizer_state_dict'])
27
+ if scaler:
28
+ scaler.load_state_dict(ckpt['scaler_state_dict'])
29
+ return ckpt['step'], ckpt['stage']
30
+
31
+ def calculate_model_stats(model):
32
+ total_params = sum(p.numel() for p in model.parameters())
33
+ trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
34
+ return {
35
+ 'total_params': total_params,
36
+ 'trainable_params': trainable_params,
37
+ 'active_params': int(total_params * 0.6), # Approximation
38
+ 'sparsity_ratio': 0.6 # Approximation
39
+ }
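utils.py keeps checkpointing deliberately simple: a single file, overwritten on every save. A small self-contained sketch of the intended load path follows (passing None for optimizer and scaler restores weights only, matching how scripts/validate.py below calls it); the config path is illustrative.

import torch
from aetheris.config import AetherisConfig
from aetheris.model import HybridMambaMoE
from aetheris.utils import load_latest_checkpoint, calculate_model_stats

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AetherisConfig.from_yaml("configs/debug.yaml")
model = HybridMambaMoE(config).to(device)

# A missing checkpoint file yields (0, "Pre-Training") instead of raising,
# so fresh runs and resumed runs share one code path.
step, stage = load_latest_checkpoint(model, None, None, device, "checkpoints")
stats = calculate_model_stats(model)
print(f"Loaded stage '{stage}' at step {step}; {stats['total_params']:,} total parameters")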
configs/debug.yaml ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ vocab_size: 50257
2
+ d_model: 128
3
+ n_layer: 4
4
+ num_experts: 4
5
+ top_k: 1
6
+ d_ff: 384
7
+ ssm_d_state: 8
8
+ ssm_expand: 2
9
+ load_balancing_coef: 0.01
10
+ router_z_loss_coef: 0.001
11
+ max_seq_len: 128
12
+ dtype: "float32" # Use float32 for debugging on CPU
13
+ use_cpu_offload: false
14
+ gradient_checkpointing: false
15
+ checkpoint_ssm_layers: false
16
+ use_flash_attention: false
configs/default.yaml ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ vocab_size: 50257
2
+ d_model: 768
3
+ n_layer: 24
4
+ num_experts: 4
5
+ top_k: 1
6
+ d_ff: 2304
7
+ ssm_d_state: 16
8
+ ssm_expand: 2
9
+ load_balancing_coef: 0.01
10
+ router_z_loss_coef: 0.001
11
+ max_seq_len: 512
12
+ dtype: "float16"
13
+ use_cpu_offload: false
14
+ gradient_checkpointing: true
15
+ checkpoint_ssm_layers: true
16
+ use_flash_attention: false
configs/inference.yaml ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ vocab_size: 50257
2
+ d_model: 768
3
+ n_layer: 24
4
+ num_experts: 4
5
+ top_k: 1
6
+ d_ff: 2304
7
+ ssm_d_state: 16
8
+ ssm_expand: 2
9
+ load_balancing_coef: 0.0
10
+ router_z_loss_coef: 0.0
11
+ max_seq_len: 1024
12
+ dtype: "float16"
13
+ use_cpu_offload: true # Offload to CPU during inference to save VRAM
14
+ gradient_checkpointing: false
15
+ checkpoint_ssm_layers: false
16
+ use_flash_attention: true
configs/large.yaml ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ vocab_size: 50257
2
+ d_model: 1600
3
+ n_layer: 48
4
+ num_experts: 8
5
+ top_k: 2
6
+ d_ff: 4800
7
+ ssm_d_state: 64
8
+ ssm_expand: 2
9
+ load_balancing_coef: 0.01
10
+ router_z_loss_coef: 0.001
11
+ max_seq_len: 2048
12
+ dtype: "float16"
13
+ use_cpu_offload: false
14
+ gradient_checkpointing: true
15
+ checkpoint_ssm_layers: true
16
+ use_flash_attention: true
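The four configs share one schema and differ mainly in scale (d_model, n_layer, num_experts, d_ff, max_seq_len) and memory strategy (dtype, gradient_checkpointing, use_cpu_offload, use_flash_attention). Assuming AetherisConfig.from_yaml maps these keys one-to-one onto attributes, as its use in scripts/validate.py below suggests, selecting a preset is a one-liner:

from aetheris.config import AetherisConfig

# debug.yaml: tiny float32 model for CPU smoke tests; large.yaml: 48-layer, 8-expert model
config = AetherisConfig.from_yaml("configs/default.yaml")
print(config.d_model, config.n_layer, config.num_experts)  # expected: 768 24 4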
requirements.txt ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ torch>=2.0.0
2
+ transformers
3
+ datasets
4
+ huggingface_hub
5
+ pyyaml
6
+ zstandard
7
+ fastapi
8
+ uvicorn
9
+ pydantic
10
+ sse-starlette
11
+ pytest
12
+ httpx
scripts/generate.py ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import sys
2
+ from pathlib import Path
3
+ from aetheris.cli.main import main
4
+
5
+ if __name__ == "__main__":
6
+ # This wrapper forwards to the unified CLI, which parses sys.argv with argparse.
7
+ # The original script took flags such as --prompt directly,
8
+ # while the new CLI expects a subcommand (e.g., 'generate').
9
+
10
+ # Prepend 'generate' if it is not already the first argument.
11
+ if len(sys.argv) > 1 and sys.argv[1] != 'generate':
12
+ sys.argv.insert(1, 'generate')
13
+ elif len(sys.argv) == 1:
14
+ sys.argv.append('generate')
15
+
16
+ sys.exit(main())
scripts/info.py ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import sys
2
+ from pathlib import Path
3
+ from aetheris.cli.main import main
4
+
5
+ if __name__ == "__main__":
6
+ if len(sys.argv) > 1 and sys.argv[1] != 'info':
7
+ sys.argv.insert(1, 'info')
8
+ elif len(sys.argv) == 1:
9
+ sys.argv.append('info')
10
+
11
+ sys.exit(main())
scripts/train.py ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ import sys
2
+ from pathlib import Path
3
+ from aetheris.cli.main import main
4
+
5
+ if __name__ == "__main__":
6
+ sys.exit(main())
scripts/validate.py ADDED
@@ -0,0 +1,87 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import argparse
2
+ import os
3
+ import torch
4
+ import math
5
+ import time
6
+ import sys
7
+ from pathlib import Path
8
+
9
+ # Add project root to path
10
+ sys.path.append(str(Path(__file__).resolve().parent.parent))
11
+
12
+ from aetheris.config import AetherisConfig
13
+ from aetheris.model import HybridMambaMoE
14
+ from aetheris.data import create_streaming_loader, get_tokenizer
15
+ from aetheris.utils import load_latest_checkpoint
16
+
17
+ @torch.no_grad()
18
+ def evaluate_model(model, val_loader, device, max_batches=100):
19
+ print(f"\n{'='*50}\nStarting Validation (Max {max_batches} batches)\n{'='*50}")
20
+
21
+ model.eval()
22
+ total_loss = 0.0
23
+ num_batches = 0
24
+ start_time = time.time()
25
+
26
+ for batch in val_loader:
27
+ if num_batches >= max_batches:
28
+ break
29
+
30
+ input_ids, labels = batch
31
+ input_ids = input_ids.to(device, non_blocking=True)
32
+ labels = labels.to(device, non_blocking=True)
33
+
34
+ with torch.amp.autocast('cuda', dtype=torch.float16):
35
+ output = model(input_ids, labels)
36
+ loss = output["loss"]
37
+
38
+ total_loss += loss.item()
39
+ num_batches += 1
40
+
41
+ if num_batches % 20 == 0:
42
+ print(f"-> Processed {num_batches}/{max_batches} batches...")
43
+
44
+ end_time = time.time()
45
+
46
+ if num_batches == 0:
47
+ print("No validation batches were processed.")
48
+ return float('inf')
49
+
50
+ avg_loss = total_loss / num_batches
51
+ perplexity = math.exp(avg_loss)
52
+
53
+ print(f"\n--- Validation Results ---")
54
+ print(f"Total batches processed: {num_batches}")
55
+ print(f"Time taken: {end_time - start_time:.2f} seconds")
56
+ print(f"Average Loss: {avg_loss:.4f}")
57
+ print(f"Perplexity: {perplexity:.2f}")
58
+ print(f"--------------------------\n")
59
+
60
+ return avg_loss
61
+
62
+ def main():
63
+ parser = argparse.ArgumentParser(description="Validate Aetheris Model")
64
+ parser.add_argument("--config", type=str, default="configs/default.yaml", help="Path to config file")
65
+ parser.add_argument("--checkpoint_dir", type=str, default="checkpoints", help="Directory with checkpoints")
66
+ parser.add_argument("--checkpoint_name", type=str, default="checkpoint_current.pth", help="Checkpoint file name")
67
+ parser.add_argument("--batch_size", type=int, default=2, help="Batch size")
68
+ parser.add_argument("--hf_token", type=str, default=os.environ.get("HF_TOKEN"), help="HuggingFace Token")
69
+ parser.add_argument("--dataset", type=str, default="cerebras/SlimPajama-627B", help="Dataset to validate on")
70
+ parser.add_argument("--dataset_mode", type=str, default="pretrain", help="pretrain or sft")
71
+
72
+ args = parser.parse_args()
73
+
74
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
75
+ config = AetherisConfig.from_yaml(args.config)
76
+ tokenizer = get_tokenizer()
77
+
78
+ model = HybridMambaMoE(config).to(device).to(config.torch_dtype)
79
+
80
+ load_latest_checkpoint(model, None, None, device, args.checkpoint_dir, args.checkpoint_name)
81
+
82
+ val_loader = create_streaming_loader(args.dataset, "validation", tokenizer, config, args.batch_size, mode=args.dataset_mode, hf_token=args.hf_token)
83
+
84
+ evaluate_model(model, val_loader, device)
85
+
86
+ if __name__ == "__main__":
87
+ main()
tests/test_api.py ADDED
@@ -0,0 +1,88 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pytest
2
+ from fastapi.testclient import TestClient
3
+ from unittest.mock import MagicMock, patch
4
+ from aetheris.api.server import app, get_engine
5
+ import aetheris.api.server
6
+
7
+ # Mock the engine globally
8
+ @pytest.fixture
9
+ def mock_engine():
10
+ with patch("aetheris.api.server.engine") as mock_eng:
11
+ # Mock generate_full
12
+ mock_eng.generate_full.return_value = "This is a generated response."
13
+
14
+ # Mock generate (streaming)
15
+ def mock_stream(*args, **kwargs):
16
+ yield "This "
17
+ yield "is "
18
+ yield "streamed."
19
+ mock_eng.generate.side_effect = mock_stream
20
+
21
+ # Ensure the app serves responses from this mock rather than a real model
22
+ # by overriding the module-level engine attribute directly.
23
+ aetheris.api.server.engine = mock_eng
24
+ yield mock_eng
25
+
26
+ client = TestClient(app)
27
+
28
+ def test_list_models(mock_engine):
29
+ response = client.get("/v1/models")
30
+ assert response.status_code == 200
31
+ data = response.json()
32
+ assert data["object"] == "list"
33
+ assert len(data["data"]) > 0
34
+ assert data["data"][0]["id"] == "aetheris-hybrid-mamba-moe"
35
+
36
+ def test_chat_completions_non_stream(mock_engine):
37
+ payload = {
38
+ "model": "aetheris-hybrid-mamba-moe",
39
+ "messages": [{"role": "user", "content": "Hello"}],
40
+ "stream": False
41
+ }
42
+ response = client.post("/v1/chat/completions", json=payload)
43
+ assert response.status_code == 200
44
+ data = response.json()
45
+ assert data["object"] == "chat.completion"
46
+ assert len(data["choices"]) == 1
47
+ assert data["choices"][0]["message"]["content"] == "This is a generated response."
48
+
49
+ def test_chat_completions_stream(mock_engine):
50
+ payload = {
51
+ "model": "aetheris-hybrid-mamba-moe",
52
+ "messages": [{"role": "user", "content": "Hello"}],
53
+ "stream": True
54
+ }
55
+ response = client.post("/v1/chat/completions", json=payload)
56
+ assert response.status_code == 200
57
+ # SSE format checking
58
+ assert "text/event-stream" in response.headers["content-type"]
59
+
60
+ # We can iterate over the response lines to check content
61
+ content = ""
62
+ for line in response.iter_lines():
63
+ if line:
64
+ # TestClient iter_lines yields strings, not bytes, unless configured otherwise
65
+ # or depending on the version. If it's bytes, we decode. If it's str, we don't.
66
+ if isinstance(line, bytes):
67
+ line = line.decode("utf-8")
68
+
69
+ if line.startswith("data: ") and line != "data: [DONE]":
70
+ import json
71
+ chunk = json.loads(line[6:])
72
+ if chunk["choices"][0]["delta"].get("content"):
73
+ content += chunk["choices"][0]["delta"]["content"]
74
+
75
+ assert content == "This is streamed."
76
+
77
+ def test_completions(mock_engine):
78
+ payload = {
79
+ "model": "aetheris-hybrid-mamba-moe",
80
+ "prompt": "Once upon a time",
81
+ "max_tokens": 10
82
+ }
83
+ response = client.post("/v1/completions", json=payload)
84
+ assert response.status_code == 200
85
+ data = response.json()
86
+ assert data["object"] == "text_completion"
87
+ assert len(data["choices"]) == 1
88
+ assert data["choices"][0]["text"] == "This is a generated response."
tests/test_inference.py ADDED
@@ -0,0 +1,71 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pytest
2
+ from unittest.mock import MagicMock, patch
3
+ from aetheris.inference import InferenceEngine
4
+
5
+ @pytest.fixture
6
+ def mock_model():
7
+ with patch("aetheris.inference.HybridMambaMoE") as MockModel:
8
+ mock_instance = MockModel.return_value
9
+ # Mock model output
10
+ mock_instance.to.return_value = mock_instance
11
+ mock_instance.eval.return_value = None
12
+
13
+ # Mock forward pass
14
+ mock_output = MagicMock()
15
+ # Shape: (batch_size, seq_len, vocab_size)
16
+ mock_output.__getitem__.return_value = torch.randn(1, 1, 50257)
17
+ # Actually we need 'logits' key access
18
+ mock_instance.return_value = {'logits': torch.randn(1, 10, 50257)}
19
+
20
+ yield mock_instance
21
+
22
+ @pytest.fixture
23
+ def mock_tokenizer():
24
+ with patch("aetheris.inference.get_tokenizer") as mock_get_tokenizer:
25
+ mock_tok = MagicMock()
26
+ mock_tok.encode.return_value = torch.tensor([[1, 2, 3]])
27
+ mock_tok.decode.return_value = "token"
28
+ mock_tok.eos_token_id = 50256
29
+ mock_get_tokenizer.return_value = mock_tok
30
+ yield mock_tok
31
+
32
+ @pytest.fixture
33
+ def mock_utils():
34
+ with patch("aetheris.inference.load_latest_checkpoint") as mock_load:
35
+ yield mock_load
36
+
37
+ import torch
38
+
39
+ def test_inference_initialization(mock_model, mock_tokenizer, mock_utils):
40
+ engine = InferenceEngine(config_path="configs/default.yaml")
41
+ assert engine.model is not None
42
+ assert engine.tokenizer is not None
43
+ mock_utils.assert_called_once()
44
+
45
+ def test_generate_full(mock_model, mock_tokenizer, mock_utils):
46
+ engine = InferenceEngine()
47
+
48
+ # Mock model output for generation loop
49
+ # We need to ensure the model returns logits of correct shape
50
+ # The loop calls model(generated_ids)
51
+
52
+ # Let's mock the actual model call inside generate
53
+ engine.model.config.torch_dtype = torch.float32
54
+
55
+ # We need to return a dict with logits
56
+ # Shape: (batch, seq_len, vocab_size)
57
+ engine.model.side_effect = lambda x: {'logits': torch.randn(1, x.shape[1], 50257)}
58
+
59
+ output = engine.generate_full("test prompt", max_new_tokens=5)
60
+ assert isinstance(output, str)
61
+ assert len(output) > 0
62
+
63
+ def test_generate_stream(mock_model, mock_tokenizer, mock_utils):
64
+ engine = InferenceEngine()
65
+ engine.model.config.torch_dtype = torch.float32
66
+ engine.model.side_effect = lambda x: {'logits': torch.randn(1, x.shape[1], 50257)}
67
+
68
+ generator = engine.generate("test prompt", max_new_tokens=5, stream=True)
69
+ tokens = list(generator)
70
+ assert len(tokens) == 5
71
+ assert all(isinstance(t, str) for t in tokens)
tests/test_model.py ADDED
@@ -0,0 +1,55 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import unittest
2
+ import torch
3
+ import sys
4
+ from pathlib import Path
5
+
6
+ # Add project root to path
7
+ sys.path.append(str(Path(__file__).resolve().parent.parent))
8
+
9
+ from aetheris.config import AetherisConfig
10
+ from aetheris.model import HybridMambaMoE
11
+
12
+ class TestHybridMambaMoE(unittest.TestCase):
13
+ def setUp(self):
14
+ self.config = AetherisConfig(
15
+ vocab_size=100,
16
+ d_model=32,
17
+ n_layer=4,
18
+ num_experts=2,
19
+ top_k=1,
20
+ d_ff=64,
21
+ ssm_d_state=8,
22
+ ssm_expand=2,
23
+ max_seq_len=64
24
+ )
25
+ self.model = HybridMambaMoE(self.config)
26
+ self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
27
+ self.model.to(self.device)
28
+
29
+ def test_forward_pass(self):
30
+ batch_size = 2
31
+ seq_len = 16
32
+ input_ids = torch.randint(0, self.config.vocab_size, (batch_size, seq_len)).to(self.device)
33
+
34
+ output = self.model(input_ids)
35
+
36
+ self.assertIn('logits', output)
37
+ self.assertEqual(output['logits'].shape, (batch_size, seq_len, self.config.vocab_size))
38
+
39
+ def test_forward_pass_with_labels(self):
40
+ batch_size = 2
41
+ seq_len = 16
42
+ input_ids = torch.randint(0, self.config.vocab_size, (batch_size, seq_len)).to(self.device)
43
+ labels = input_ids.clone()
44
+
45
+ output = self.model(input_ids, labels=labels)
46
+
47
+ self.assertIn('loss', output)
48
+ self.assertIn('ce_loss', output)
49
+ self.assertIn('aux_loss', output)
50
+ self.assertIn('logits', output)
51
+
52
+ self.assertTrue(output['loss'] > 0)
53
+
54
+ if __name__ == '__main__':
55
+ unittest.main()
tests/test_overflow.py ADDED
@@ -0,0 +1,67 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import unittest
2
+ import torch
3
+ import sys
4
+ from pathlib import Path
5
+
6
+ # Add project root to path
7
+ sys.path.append(str(Path(__file__).resolve().parent.parent))
8
+
9
+ from aetheris.modules.expert import Expert
10
+ from aetheris.modules.moe import SparseMoELayer
11
+ from aetheris.config import AetherisConfig
12
+
13
+ class TestOverflow(unittest.TestCase):
14
+ def setUp(self):
15
+ self.config = AetherisConfig(
16
+ vocab_size=100,
17
+ d_model=128,
18
+ n_layer=2,
19
+ num_experts=2,
20
+ top_k=1,
21
+ d_ff=512, # Large enough to potentially cause issues
22
+ ssm_d_state=16,
23
+ ssm_expand=2,
24
+ max_seq_len=64
25
+ )
26
+ self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
27
+
28
+ def test_expert_overflow_protection(self):
29
+ """Test if Expert handles large inputs without producing NaNs in float16"""
30
+ expert = Expert(self.config.d_model, self.config.d_ff).to(self.device)
31
+ # Manually cast weights to float16 to simulate mixed precision training environment
32
+ expert.half()
33
+
34
+ # Create a large input in float16 that would normally cause overflow in intermediate layers
35
+ # The limit of float16 is ~65504.
36
+ # If w1 projects this up, it can easily exceed that.
37
+ large_input = torch.ones(1, self.config.d_model, dtype=torch.float16).to(self.device) * 100.0
38
+
39
+ # Force weights to be large to guarantee overflow if protection isn't working
40
+ with torch.no_grad():
41
+ expert.w1.weight.fill_(10.0)
42
+ expert.w2.weight.fill_(0.1)
43
+
44
+ # 100 * 10 = 1000. Sum over d_model(128) -> 128000.
45
+ # This summation happens in the matrix multiplication.
46
+ # If the matmul internal accumulation is float16, it effectively overflows.
47
+
48
+ output = expert(large_input)
49
+
50
+ self.assertFalse(torch.isnan(output).any(), "Output contains NaNs")
51
+ self.assertFalse(torch.isinf(output).any(), "Output contains Infs")
52
+
53
+ def test_moe_accumulation_stability(self):
54
+ """Test if MoE layer handles accumulation in float32"""
55
+ moe = SparseMoELayer(self.config).to(self.device)
56
+ moe.half()
57
+
58
+ x = torch.randn(2, 10, self.config.d_model, dtype=torch.float16).to(self.device)
59
+
60
+ # Pass through
61
+ output, loss = moe(x)
62
+
63
+ self.assertFalse(torch.isnan(output).any(), "MoE Output contains NaNs")
64
+ self.assertEqual(output.dtype, torch.float16)
65
+
66
+ if __name__ == '__main__':
67
+ unittest.main()