rcgalbo committed on
Commit 8b21693 · verified · 1 Parent(s): 022db6b

Upload Aetheris model (Stage 2 best, 722M params, loss=2.73)
README.md CHANGED
@@ -1,80 +1,65 @@
 ---
-language:
-- multilingual
-- en
-- es
-- hi
-- zh
-- ar
-- sw
-- tr
-- ja
-- id
-- te
 license: apache-2.0
+language:
+- en
+- es
+- fr
+- de
+- zh
+- ja
+- ko
+- ar
+- hi
+- tr
+- sw
+- id
+- pt
+- ru
 tags:
-- mamba
-- moe
-- ssm
-- multilingual
-- distillation
-- aya
-library_name: aetheris
+- multilingual
+- mamba
+- moe
+- distillation
+- aya
 pipeline_tag: text-generation
 ---
 
 # Aetheris — Hybrid Mamba-MoE Multilingual Model
 
-**Aetheris** is a ~800M parameter hybrid SSM/MoE language model distilled from
+**Aetheris** is a ~720M parameter hybrid SSM-MoE language model distilled from
 [CohereLabs/tiny-aya-global](https://huggingface.co/CohereLabs/tiny-aya-global) (3.35B).
-
-Built by [Wayy Research](https://github.com/Wayy-Research).
+It supports **67 languages** with 4.6x compression.
 
 ## Architecture
-
-- **Type**: Hybrid Mamba (SSM) + Mixture of Experts (MoE)
-- **Layers**: 24 (interleaved: even=SSM, odd=MoE)
+- **Type**: Hybrid Mamba-MoE (interleaved SSM + Sparse MoE layers)
+- **Layers**: 24 (12 SSM + 12 MoE)
 - **Hidden dim**: 1024
-- **Experts**: 4 per MoE layer, top-1 routing
-- **SSM state dim**: 16
-- **Vocab size**: 256,000 (shared with tiny-aya-global)
-- **Parameters**: ~800M
+- **Experts**: 4 (top-1 routing)
+- **Vocab**: 261,019 tokens (Aya tokenizer)
+- **Parameters**: 722M
 
 ## Training
-
-3-stage MambaInLlama distillation pipeline:
-
-| Stage | Method | Data | Steps |
-|-------|--------|------|-------|
-| 1 | CKA-guided Layer Alignment | ClimbMix | 10,000 |
-| 2 | KL Distillation (T=2.0, alpha=0.7) | ClimbMix | 20,000 |
-| 3 | Supervised Fine-Tuning | aya_collection | 5,000 |
-
-Key research findings applied:
-- SSM 10x LR boost (compensates 27x gradient imbalance)
-- SVD split for MoE expert initialization (CKA=0.097 diversity)
-- Per-language KL tracking for multilingual equity
-
-## Current Checkpoint
-
-- **Stage**: 2 (kl-distillation)
-- **Step**: 18000
-- **Loss**: 3.4199
-- **Updated**: 2026-03-13T01:45:14.154527+00:00
-
-## Languages
-
-Supports 70+ languages inherited from tiny-aya-global. Core evaluation
-languages: English, Spanish, Hindi, Chinese, Arabic, Swahili, Turkish,
-Japanese, Indonesian, Telugu.
-
-## Citation
-
-```bibtex
-@misc{aetheris2026,
-  title={Aetheris: Hybrid Mamba-MoE Multilingual Model via Knowledge Distillation},
-  author={Wayy Research},
-  year={2026},
-  url={https://huggingface.co/wayyresearch/aetheris}
-}
+- **Stage 1**: CKA-guided layer alignment (10K steps)
+- **Stage 2**: KL divergence distillation, T=2.0, alpha=0.7 (20K steps, best loss=2.73)
+- **Stage 3**: SFT fine-tuning (pending)
+- **Teacher**: CohereLabs/tiny-aya-global (3.35B)
+- **Data**: ClimbMix (NVIDIA)
+
+## Usage
+
+```python
+import torch, yaml, sys
+sys.path.insert(0, ".")
+from aetheris.config import AetherisConfig
+from aetheris.model import HybridMambaMoE
+
+config = AetherisConfig.from_yaml("config.yaml")
+model = HybridMambaMoE(config)
+sd = torch.load("pytorch_model.pt", map_location="cpu")
+model.load_state_dict(sd)
+model.eval()
 ```
+
+## Wayy Research
+*People for research, research for people.*
+Buffalo, NY — Est. 2024
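The Stage 2 objective listed above blends a temperature-scaled KL term (T=2.0) with a hard-label term (alpha=0.7). The commit does not include the actual loss code, so the following is a minimal pure-Python sketch of that standard formulation, assuming the common `alpha * T² * KL + (1 - alpha) * CE` blend; `distill_loss` and its argument names are illustrative, not the project's API.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a plain list of logits.
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(student_logits, teacher_logits, target_idx, T=2.0, alpha=0.7):
    """Soft-target KL (scaled by T^2, as in standard distillation)
    blended with hard-label cross-entropy on the true token."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    ce = -math.log(softmax(student_logits)[target_idx])
    return alpha * (T * T) * kl + (1 - alpha) * ce

loss = distill_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9], target_idx=0)
```

When student and teacher logits agree, the KL term vanishes and only the cross-entropy contribution remains.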
aetheris/__init__.py ADDED
@@ -0,0 +1,2 @@
+from .model import HybridMambaMoE
+from .config import AetherisConfig
aetheris/api/schemas.py ADDED
@@ -0,0 +1,92 @@
+from typing import List, Optional, Union, Dict, Any
+from pydantic import BaseModel, Field
+import time
+
+class ChatMessage(BaseModel):
+    role: str
+    content: str
+
+class ChatCompletionRequest(BaseModel):
+    model: str
+    messages: List[ChatMessage]
+    temperature: Optional[float] = 1.0
+    top_p: Optional[float] = 1.0
+    n: Optional[int] = 1
+    stream: Optional[bool] = False
+    stop: Optional[Union[str, List[str]]] = None
+    max_tokens: Optional[int] = None
+    presence_penalty: Optional[float] = 0.0
+    frequency_penalty: Optional[float] = 0.0
+    logit_bias: Optional[Dict[str, float]] = None
+    user: Optional[str] = None
+
+class ChatCompletionChoice(BaseModel):
+    index: int
+    message: ChatMessage
+    finish_reason: Optional[str] = None
+
+class ChatCompletionResponse(BaseModel):
+    id: str
+    object: str = "chat.completion"
+    created: int = Field(default_factory=lambda: int(time.time()))
+    model: str
+    choices: List[ChatCompletionChoice]
+    usage: Optional[Dict[str, int]] = None
+
+class ChatCompletionChunkDelta(BaseModel):
+    role: Optional[str] = None
+    content: Optional[str] = None
+
+class ChatCompletionChunkChoice(BaseModel):
+    index: int
+    delta: ChatCompletionChunkDelta
+    finish_reason: Optional[str] = None
+
+class ChatCompletionChunk(BaseModel):
+    id: str
+    object: str = "chat.completion.chunk"
+    created: int = Field(default_factory=lambda: int(time.time()))
+    model: str
+    choices: List[ChatCompletionChunkChoice]
+
+class CompletionRequest(BaseModel):
+    model: str
+    prompt: Union[str, List[str]]
+    suffix: Optional[str] = None
+    max_tokens: Optional[int] = 16
+    temperature: Optional[float] = 1.0
+    top_p: Optional[float] = 1.0
+    n: Optional[int] = 1
+    stream: Optional[bool] = False
+    logprobs: Optional[int] = None
+    echo: Optional[bool] = False
+    stop: Optional[Union[str, List[str]]] = None
+    presence_penalty: Optional[float] = 0.0
+    frequency_penalty: Optional[float] = 0.0
+    best_of: Optional[int] = 1
+    logit_bias: Optional[Dict[str, float]] = None
+    user: Optional[str] = None
+
+class CompletionChoice(BaseModel):
+    text: str
+    index: int
+    logprobs: Optional[Any] = None
+    finish_reason: Optional[str] = None
+
+class CompletionResponse(BaseModel):
+    id: str
+    object: str = "text_completion"
+    created: int = Field(default_factory=lambda: int(time.time()))
+    model: str
+    choices: List[CompletionChoice]
+    usage: Optional[Dict[str, int]] = None
+
+class ModelCard(BaseModel):
+    id: str
+    object: str = "model"
+    created: int = Field(default_factory=lambda: int(time.time()))
+    owned_by: str = "aetheris"
+
+class ModelList(BaseModel):
+    object: str = "list"
+    data: List[ModelCard]
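These schemas mirror the OpenAI chat-completions wire format, so a client request is plain JSON. A minimal sketch of a payload that would validate against `ChatCompletionRequest` above; the message contents are made up, and the model id is the one the server registers under `/v1/models`:

```python
import json

# Hypothetical client payload; field names follow the ChatCompletionRequest
# schema, and unspecified fields fall back to the schema defaults.
payload = {
    "model": "aetheris-hybrid-mamba-moe",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name one official language of Kenya."},
    ],
    "temperature": 0.7,
    "max_tokens": 64,
    "stream": False,
}
body = json.dumps(payload)
```

Setting `"stream": True` instead would switch the server to the SSE chunk path and yield `ChatCompletionChunk` objects.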
aetheris/api/server.py ADDED
@@ -0,0 +1,196 @@
+import time
+import uuid
+import json
+import asyncio
+from typing import AsyncGenerator
+from fastapi import FastAPI, HTTPException, Request
+from fastapi.middleware.cors import CORSMiddleware
+from sse_starlette.sse import EventSourceResponse
+from aetheris.api.schemas import (
+    ChatCompletionRequest, ChatCompletionResponse, ChatCompletionChunk,
+    ChatCompletionChoice, ChatMessage, ChatCompletionChunkChoice, ChatCompletionChunkDelta,
+    CompletionRequest, CompletionResponse, CompletionChoice,
+    ModelList, ModelCard
+)
+from aetheris.inference import InferenceEngine
+
+app = FastAPI(title="Aetheris API", version="0.1.0")
+
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+# Global engine instance
+engine: InferenceEngine = None
+
+def get_engine():
+    global engine
+    if engine is None:
+        # Defaults, ideally loaded from config/env
+        engine = InferenceEngine()
+    return engine
+
+@app.on_event("startup")
+async def startup_event():
+    get_engine()
+
+@app.get("/")
+async def root():
+    return {"status": "running", "message": "Aetheris API is active. Use /v1/chat/completions for inference."}
+
+@app.get("/v1/models", response_model=ModelList)
+async def list_models():
+    return ModelList(data=[ModelCard(id="aetheris-hybrid-mamba-moe")])
+
+@app.post("/v1/chat/completions")
+async def chat_completions(request: ChatCompletionRequest):
+    engine = get_engine()
+
+    # Simple prompt construction from messages
+    prompt = ""
+    for msg in request.messages:
+        prompt += f"{msg.role}: {msg.content}\n"
+    prompt += "assistant: "
+
+    request_id = f"chatcmpl-{uuid.uuid4()}"
+    created_time = int(time.time())
+
+    if request.stream:
+        async def event_generator():
+            yield json.dumps(ChatCompletionChunk(
+                id=request_id,
+                created=created_time,
+                model=request.model,
+                choices=[ChatCompletionChunkChoice(
+                    index=0,
+                    delta=ChatCompletionChunkDelta(role="assistant"),
+                    finish_reason=None
+                )]
+            ).model_dump())
+
+            # Offload synchronous generation to a thread to avoid blocking the event loop
+            queue = asyncio.Queue()
+            loop = asyncio.get_running_loop()
+            import threading
+            stop_event = threading.Event()
+
+            def producer():
+                try:
+                    # Run the synchronous generator
+                    for token in engine.generate(
+                        prompt=prompt,
+                        max_new_tokens=request.max_tokens or 100,
+                        temperature=request.temperature,
+                        top_p=request.top_p,
+                        repetition_penalty=1.0 + request.frequency_penalty,
+                        stream=True
+                    ):
+                        if stop_event.is_set():
+                            break
+                        # Schedule the put() coroutine on the main loop
+                        asyncio.run_coroutine_threadsafe(queue.put(token), loop)
+                except Exception as e:
+                    print(f"Generation error: {e}")
+                finally:
+                    # Signal done
+                    asyncio.run_coroutine_threadsafe(queue.put(None), loop)
+
+            thread = threading.Thread(target=producer, daemon=True)
+            thread.start()
+
+            try:
+                while True:
+                    token = await queue.get()
+                    if token is None:
+                        break
+
+                    yield json.dumps(ChatCompletionChunk(
+                        id=request_id,
+                        created=created_time,
+                        model=request.model,
+                        choices=[ChatCompletionChunkChoice(
+                            index=0,
+                            delta=ChatCompletionChunkDelta(content=token),
+                            finish_reason=None
+                        )]
+                    ).model_dump())
+
+                yield json.dumps(ChatCompletionChunk(
+                    id=request_id,
+                    created=created_time,
+                    model=request.model,
+                    choices=[ChatCompletionChunkChoice(
+                        index=0,
+                        delta=ChatCompletionChunkDelta(),
+                        finish_reason="stop"
+                    )]
+                ).model_dump())
+
+                yield "[DONE]"
+            finally:
+                stop_event.set()
+
+        return EventSourceResponse(event_generator())
+
+    else:
+        generated_text = engine.generate_full(
+            prompt=prompt,
+            max_new_tokens=request.max_tokens or 100,
+            temperature=request.temperature,
+            top_p=request.top_p,
+            repetition_penalty=1.0 + request.frequency_penalty
+        )
+
+        return ChatCompletionResponse(
+            id=request_id,
+            created=created_time,
+            model=request.model,
+            choices=[ChatCompletionChoice(
+                index=0,
+                message=ChatMessage(role="assistant", content=generated_text),
+                finish_reason="stop"
+            )],
+            usage={"prompt_tokens": len(prompt), "completion_tokens": len(generated_text), "total_tokens": len(prompt) + len(generated_text)}  # Approximated (character counts, not tokens)
+        )
+
+@app.post("/v1/completions")
+async def completions(request: CompletionRequest):
+    engine = get_engine()
+
+    prompt = request.prompt
+    if isinstance(prompt, list):
+        prompt = prompt[0]  # Handle single prompt for now
+
+    request_id = f"cmpl-{uuid.uuid4()}"
+    created_time = int(time.time())
+
+    if request.stream:
+        # Streaming for completions is not implemented yet; fall through to a
+        # non-streaming response. The logic would mirror the chat endpoint.
+        pass  # TODO: Implement streaming for completions
+
+    generated_text = engine.generate_full(
+        prompt=prompt,
+        max_new_tokens=request.max_tokens or 16,
+        temperature=request.temperature,
+        top_p=request.top_p,
+        repetition_penalty=1.0 + request.frequency_penalty
+    )
+
+    return CompletionResponse(
+        id=request_id,
+        created=created_time,
+        model=request.model,
+        choices=[CompletionChoice(
+            text=generated_text,
+            index=0,
+            logprobs=None,
+            finish_reason="length"  # or "stop"
+        )],
+        usage={"prompt_tokens": len(prompt), "completion_tokens": len(generated_text), "total_tokens": len(prompt) + len(generated_text)}
+    )
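The chat endpoint above flattens messages into a plain `role: content` transcript rather than applying a learned chat template, and its `usage` numbers are character lengths, not token counts. A stand-alone sketch of that flattening (hypothetical helper name, same logic as the handler):

```python
def build_prompt(messages):
    # Mirrors the server's simple template: one "role: content" line per
    # message, then an open "assistant: " turn for the model to complete.
    prompt = ""
    for msg in messages:
        prompt += f"{msg['role']}: {msg['content']}\n"
    return prompt + "assistant: "

p = build_prompt([{"role": "user", "content": "hi"}])
# p == "user: hi\nassistant: "
```

Because the template is purely positional, any role string round-trips unchanged; swapping in the tokenizer's real chat template would be the natural next step once Stage 3 SFT lands.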
aetheris/cli/__init__.py ADDED
@@ -0,0 +1 @@
+
aetheris/cli/main.py ADDED
@@ -0,0 +1,362 @@
+import argparse
+import sys
+import torch
+import os
+import math
+import torch.nn.functional as F
+from aetheris.config import AetherisConfig
+from aetheris.model import HybridMambaMoE
+from aetheris.data import create_streaming_loader, get_tokenizer
+from aetheris.utils import load_latest_checkpoint, calculate_model_stats
+from aetheris.trainer import Trainer
+
+def train_command(args):
+    print(f"\n{'='*70}")
+    print("Aetheris Training")
+    print(f"Config: {args.config}")
+
+    if args.hf_token:
+        print(f"Using Hugging Face token: {args.hf_token[:10]}...")
+        from huggingface_hub import login
+        login(token=args.hf_token)
+
+    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+    if device.type == 'cuda':
+        torch.backends.cuda.matmul.allow_tf32 = True
+        torch.backends.cudnn.allow_tf32 = True
+        torch.backends.cudnn.benchmark = True
+        torch.cuda.empty_cache()
+
+    config = AetherisConfig.from_yaml(args.config)
+
+    # Add special tokens if using VoxLex config (vocab_size > 50257)
+    add_special = config.vocab_size > 50257
+    tokenizer = get_tokenizer(add_special_tokens=add_special)
+
+    print(f"Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")
+    print(f"Model Size: d_model={config.d_model}, layers={config.n_layer}")
+    print(f"Vocab Size: {config.vocab_size} | Max Seq Len: {config.max_seq_len}")
+    print(f"{'='*70}\n")
+
+    model = HybridMambaMoE(config).to(device)
+
+    # Apply weight initialization BEFORE resize (resize copies old weights)
+    print("Applying proper weight initialization...")
+    model.apply(model._init_weights)
+
+    # Resize embeddings if tokenizer has special tokens (AFTER init)
+    if len(tokenizer) > model.config.vocab_size:
+        print(f"Resizing embeddings: {model.config.vocab_size} → {len(tokenizer)}")
+        model.resize_token_embeddings(len(tokenizer))
+    elif len(tokenizer) < model.config.vocab_size:
+        print(f"Keeping embeddings at {model.config.vocab_size} (config) with {len(tokenizer)} tokenizer tokens")
+        model.resize_token_embeddings(config.vocab_size)
+
+    # Calculate model stats
+    stats = calculate_model_stats(model)
+    print(f"Total Parameters: {stats['total_params']:,}")
+    print(f"Trainable Parameters: {stats['trainable_params']:,}")
+
+    # Use lower learning rate for stability
+    lr = args.lr if args.lr else 1e-4
+    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01,
+                                  betas=(0.9, 0.95), eps=1e-8)
+    # PyTorch 2.1 uses torch.cuda.amp.GradScaler; 2.3+ uses torch.amp.GradScaler
+    try:
+        scaler = torch.amp.GradScaler('cuda' if device.type == 'cuda' else 'cpu', init_scale=2**10)
+    except (TypeError, AttributeError):
+        scaler = torch.cuda.amp.GradScaler(init_scale=2**10)
+
+    if args.resume:
+        # Resume: load model + optimizer + scaler state
+        start_step, current_stage = load_latest_checkpoint(model, optimizer, scaler, device, args.checkpoint_dir, args.checkpoint_name)
+    else:
+        # Fine-tune: load model weights only, fresh optimizer
+        start_step, current_stage = load_latest_checkpoint(model, None, None, device, args.checkpoint_dir, args.checkpoint_name)
+        if start_step > 0:
+            print(f"  Loaded base weights (was at step {start_step}), resetting to step 0 for fine-tuning")
+            start_step = 0
+            current_stage = "Pre-Training"
+
+    if args.compile:
+        print("Compiling model with torch.compile()...")
+        model = torch.compile(model)
+
+    trainer = Trainer(model, optimizer, scaler, config, device, args.checkpoint_dir, grad_accum_steps=args.accumulate_grad_batches)
+
+    # Resolve dataset names
+    pretrain_dataset = args.pretrain_dataset or "cerebras/SlimPajama-627B"
+    sft_dataset = args.sft_dataset or "OpenAssistant/oasst1"
+
+    # --- STAGE 1: PRE-TRAINING ---
+    if current_stage == "Pre-Training" or start_step == 0:
+        print(f"\n=== STAGE 1: Pre-Training on {pretrain_dataset} ===")
+
+        # Build LR scheduler for pretraining (adjust for gradient accumulation)
+        warmup_steps = args.warmup_steps if args.warmup_steps else 1000
+        effective_steps = max(1, args.pretrain_steps // args.accumulate_grad_batches)
+        effective_warmup = max(1, warmup_steps // args.accumulate_grad_batches)
+        scheduler = _build_scheduler(optimizer, effective_steps, effective_warmup)
+        trainer.scheduler = scheduler
+
+        pt_loader = create_streaming_loader(pretrain_dataset, "train",
+                                            tokenizer, config, args.batch_size, mode="pretrain",
+                                            hf_token=args.hf_token, start_step=start_step)
+
+        pt_val_loader = create_streaming_loader(pretrain_dataset, "validation",
+                                                tokenizer, config, args.batch_size, mode="pretrain",
+                                                hf_token=args.hf_token)
+
+        start_step = trainer.train_epoch(pt_loader, total_steps=args.pretrain_steps,
+                                         start_step=start_step, stage_name="Pre-Training",
+                                         val_loader=pt_val_loader)
+        current_stage = "SFT"
+        start_step = 0
+
+    # --- STAGE 2: SFT ---
+    print(f"\n=== STAGE 2: SFT on {sft_dataset} ===")
+    sft_lr = args.sft_lr if args.sft_lr else 5e-5
+    for param_group in optimizer.param_groups:
+        param_group['lr'] = sft_lr
+
+    # Build LR scheduler for SFT (adjust for gradient accumulation)
+    sft_warmup = args.sft_warmup_steps if args.sft_warmup_steps else 200
+    effective_sft_steps = max(1, args.sft_steps // args.accumulate_grad_batches)
+    effective_sft_warmup = max(1, sft_warmup // args.accumulate_grad_batches)
+    scheduler = _build_scheduler(optimizer, effective_sft_steps, effective_sft_warmup)
+    trainer.scheduler = scheduler
+
+    sft_loader = create_streaming_loader(sft_dataset, "train",
+                                         tokenizer, config, args.batch_size, mode="sft",
+                                         hf_token=args.hf_token, start_step=start_step)
+
+    sft_val_loader = create_streaming_loader(sft_dataset, "validation",
+                                             tokenizer, config, args.batch_size, mode="sft",
+                                             hf_token=args.hf_token)
+
+    trainer.train_epoch(sft_loader, total_steps=args.sft_steps,
+                        start_step=start_step, stage_name="SFT",
+                        val_loader=sft_val_loader)
+
+    print("\nTraining Complete!")
+
+
+def _build_scheduler(optimizer, total_steps, warmup_steps):
+    """Cosine annealing with linear warmup. LR multiplier: 0→1 (warmup) → 0.1 (cosine)."""
+    def lr_lambda(current_step):
+        if current_step < warmup_steps:
+            return float(current_step) / float(max(1, warmup_steps))
+        progress = float(current_step - warmup_steps) / float(max(1, total_steps - warmup_steps))
+        return max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
+    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
+
+
+@torch.no_grad()
+def generate_command(args):
+    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+    config = AetherisConfig.from_yaml(args.config)
+
+    add_special = config.vocab_size > 50257
+    tokenizer = get_tokenizer(add_special_tokens=add_special)
+
+    model = HybridMambaMoE(config).to(device).to(config.torch_dtype)
+
+    # Resize if needed
+    if len(tokenizer) != config.vocab_size:
+        model.resize_token_embeddings(config.vocab_size)
+
+    load_latest_checkpoint(model, None, None, device, args.checkpoint_dir, args.checkpoint_name)
+    model.eval()
+
+    prompt = args.prompt
+    max_new_tokens = args.max_new_tokens
+    temperature = args.temperature
+    top_k = args.top_k
+    top_p = args.top_p
+    repetition_penalty = args.repetition_penalty
+
+    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
+    generated_ids = input_ids.clone()
+    history_ids = set(input_ids[0].tolist())
+
+    print("-" * 50)
+    print(f"Prompt: {prompt}")
+    print("Generated Continuation:")
+
+    for step in range(max_new_tokens):
+        # Autocast only for reduced-precision dtypes
+        use_autocast = config.torch_dtype != torch.float32
+
+        if use_autocast:
+            with torch.amp.autocast('cuda' if device.type == 'cuda' else 'cpu', dtype=model.config.torch_dtype):
+                outputs = model(generated_ids)
+                logits = outputs['logits']
+                next_token_logits = logits[:, -1, :]
+        else:
+            outputs = model(generated_ids)
+            logits = outputs['logits']
+            next_token_logits = logits[:, -1, :]
+
+        # Repetition penalty
+        for token_id in history_ids:
+            if token_id < next_token_logits.size(-1):
+                logit = next_token_logits[0, token_id].item()
+                if logit > 0:
+                    next_token_logits[0, token_id] = logit / repetition_penalty
+                else:
+                    next_token_logits[0, token_id] = logit * repetition_penalty
+
+        # Temperature
+        if temperature > 0:
+            next_token_logits = next_token_logits / temperature
+
+        # Top-p / Top-k
+        if top_p < 1.0:
+            sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
+            cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+            sorted_indices_to_remove = cumulative_probs > top_p
+            sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+            sorted_indices_to_remove[..., 0] = False
+            indices_to_remove = sorted_indices[sorted_indices_to_remove]
+            next_token_logits.scatter_(1, indices_to_remove.unsqueeze(0), float('-inf'))
+        elif top_k > 0:
+            top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k)
+            next_token_logits = torch.full_like(next_token_logits, float('-inf'))
+            next_token_logits.scatter_(1, top_k_indices, top_k_logits)
+
+        # Sample
+        next_token_probs = F.softmax(next_token_logits, dim=-1)
+        next_token = torch.multinomial(next_token_probs, num_samples=1)
+        next_token_item = next_token.item()
+
+        if next_token_item == tokenizer.eos_token_id:
+            break
+
+        generated_ids = torch.cat([generated_ids, next_token], dim=-1)
+        history_ids.add(next_token_item)
+
+        new_token_text = tokenizer.decode(next_token.squeeze().tolist(), skip_special_tokens=True)
+        print(new_token_text, end="", flush=True)
+
+    print("\n" + "-" * 50)
+
+def info_command(args):
+    config = AetherisConfig.from_yaml(args.config)
+    model = HybridMambaMoE(config)
+
+    total_params = 0
+    dense_params = 0
+    expert_params = 0
+
+    for name, param in model.named_parameters():
+        numel = param.numel()
+        total_params += numel
+
+        if 'experts' in name:
+            expert_params += numel
+        else:
+            dense_params += numel
+
+    single_expert_size = expert_params / config.num_experts if config.num_experts > 0 else 0
+    active_per_token_params = dense_params + (single_expert_size * config.top_k)
+
+    def format_count(count):
+        return f"{count / 1_000_000:.2f}M"
+
+    print("=" * 50)
+    print("Hybrid Mamba-MoE Model Parameter Analysis")
+    print("=" * 50)
+    print(f"Total Model Layers (N_Layer): {config.n_layer}")
+    print(f"MoE Experts per Layer: {config.num_experts}")
+    print(f"Active Experts (Top-K): {config.top_k}")
+    print("-" * 50)
+    print(f"Total Parameters (Checkpoint Size): {format_count(total_params)}")
+    print(f"Dense (Always Active) Parameters: {format_count(dense_params)}")
+    print(f"Expert-Only Parameters: {format_count(expert_params)}")
+    print("-" * 50)
+    print(f"**Active Parameters (Per-Token Compute Load): {format_count(active_per_token_params)}**")
+    print("  (This is the 'Dense' parameters + the K active expert parameters)")
+    print("=" * 50)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Aetheris CLI")
+    subparsers = parser.add_subparsers(dest="command", help="Available commands")
+
+    # Train Command
+    train_parser = subparsers.add_parser("train", help="Train the model")
+    train_parser.add_argument("--config", type=str, default="configs/default.yaml", help="Path to config file")
+    train_parser.add_argument("--checkpoint_dir", type=str, default="checkpoints", help="Directory to save checkpoints")
+    train_parser.add_argument("--hf_token", type=str, default=os.environ.get("HF_TOKEN"), help="HuggingFace Token")
+    train_parser.add_argument("--batch_size", type=int, default=2, help="Batch size")
+    train_parser.add_argument("--pretrain_steps", type=int, default=50000, help="Number of pretraining steps")
+    train_parser.add_argument("--sft_steps", type=int, default=1000, help="Number of SFT steps")
+    train_parser.add_argument("--checkpoint_name", type=str, default="checkpoint_current.pth", help="Checkpoint file name to load from")
+    train_parser.add_argument("--compile", action="store_true", help="Compile model with torch.compile for speed")
+    train_parser.add_argument("--accumulate_grad_batches", type=int, default=1, help="Gradient accumulation steps")
+    # Custom dataset args
+    train_parser.add_argument("--pretrain-dataset", type=str, default=None,
+                              help="Pretraining dataset: local JSONL path or HuggingFace dataset name")
+    train_parser.add_argument("--sft-dataset", type=str, default=None,
+                              help="SFT dataset: local JSONL path or HuggingFace dataset name")
+    # Learning rate args
+    train_parser.add_argument("--lr", type=float, default=None, help="Peak learning rate for pretraining (default: 1e-4)")
+    train_parser.add_argument("--sft-lr", type=float, default=None, help="Peak learning rate for SFT (default: 5e-5)")
+    train_parser.add_argument("--warmup-steps", type=int, default=None, help="Warmup steps for pretraining (default: 1000)")
+    train_parser.add_argument("--sft-warmup-steps", type=int, default=None, help="Warmup steps for SFT (default: 200)")
+    train_parser.add_argument("--resume", action="store_true", help="Resume from checkpoint step (default: start from 0)")
+
+    # Generate Command
+    gen_parser = subparsers.add_parser("generate", help="Generate text")
+    gen_parser.add_argument("--config", type=str, default="configs/default.yaml", help="Path to config file")
+    gen_parser.add_argument("--checkpoint_dir", type=str, default="checkpoints", help="Directory with checkpoints")
+    gen_parser.add_argument("--checkpoint_name", type=str, default="checkpoint_current.pth", help="Checkpoint file name")
+    gen_parser.add_argument("--prompt", type=str, default="The quick brown fox", help="Prompt for generation")
+    gen_parser.add_argument("--max_new_tokens", type=int, default=100, help="Max new tokens to generate")
+    gen_parser.add_argument("--temperature", type=float, default=0.8, help="Sampling temperature")
+    gen_parser.add_argument("--top_k", type=int, default=0, help="Top-k sampling")
+    gen_parser.add_argument("--top_p", type=float, default=0.9, help="Top-p sampling")
+    gen_parser.add_argument("--repetition_penalty", type=float, default=3.0, help="Repetition penalty")
+
+    # Serve Command
+    serve_parser = subparsers.add_parser("serve", help="Start the API server")
+    serve_parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to bind")
+    serve_parser.add_argument("--port", type=int, default=8000, help="Port to bind")
+    serve_parser.add_argument("--config", type=str, default="configs/default.yaml", help="Path to config file")
+    serve_parser.add_argument("--checkpoint_dir", type=str, default="checkpoints", help="Directory with checkpoints")
+    serve_parser.add_argument("--checkpoint_name", type=str, default="checkpoint_current.pth", help="Checkpoint file name")
+
+    # Info Command
+    info_parser = subparsers.add_parser("info", help="Show model info")
+    info_parser.add_argument("--config", type=str, default="configs/default.yaml", help="Path to config file")
+
+    args = parser.parse_args()
+
+    if args.command == "train":
+        train_command(args)
+    elif args.command == "generate":
+        generate_command(args)
+    elif args.command == "serve":
+        import uvicorn
+        from aetheris.api.server import app
+        from aetheris.inference import InferenceEngine
+        import aetheris.api.server
+
+        # Install an engine built from CLI args before the server's default
+        # get_engine() can construct one with hardcoded defaults
+        aetheris.api.server.engine = InferenceEngine(
+            config_path=args.config,
+            checkpoint_dir=args.checkpoint_dir,
+            checkpoint_name=args.checkpoint_name
+        )
+
+        uvicorn.run(app, host=args.host, port=args.port)
+
+    elif args.command == "info":
+        info_command(args)
+    else:
+        parser.print_help()
+
+if __name__ == "__main__":
+    main()
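The schedule built by `_build_scheduler` is easiest to sanity-check as a pure function. `lr_multiplier` below reproduces `lr_lambda` outside the optimizer so the linear warmup and the 0.1 cosine floor can be verified directly (the function name is illustrative):

```python
import math

def lr_multiplier(step, total_steps, warmup_steps):
    # Linear warmup 0 -> 1, then cosine decay floored at 0.1,
    # matching lr_lambda inside _build_scheduler.
    if step < warmup_steps:
        return float(step) / float(max(1, warmup_steps))
    progress = float(step - warmup_steps) / float(max(1, total_steps - warmup_steps))
    return max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
```

Halfway through warmup the multiplier is 0.5; at the end of warmup it reaches 1.0; by the final step the cosine term has decayed to 0 and the 0.1 floor takes over, so the LR never anneals below 10% of its peak.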
aetheris/config.py ADDED
@@ -0,0 +1,58 @@
+from dataclasses import dataclass, field
+import yaml
+import torch
+from typing import Optional
+
+@dataclass
+class AetherisConfig:
+    # Model dimensions
+    vocab_size: int = 50257
+    d_model: int = 768
+    n_layer: int = 24
+    num_experts: int = 4
+    top_k: int = 1
+    d_ff: int = 2304  # d_model * 3
+
+    # SSM parameters
+    ssm_d_state: int = 16
+    ssm_expand: int = 2
+    d_inner: Optional[int] = None  # Will be d_model * ssm_expand if None
+
+    # Training parameters
+    load_balancing_coef: float = 1e-2
+    router_z_loss_coef: float = 1e-3
+    max_seq_len: int = 512
+    dtype: str = "float16"  # "float16", "float32", "bfloat16"
+
+    # Optimization settings
+    use_cpu_offload: bool = False
+    gradient_checkpointing: bool = True
+    checkpoint_ssm_layers: bool = True
+    use_flash_attention: bool = False
+
+    def __post_init__(self):
+        if self.d_inner is None:
+            self.d_inner = self.d_model * self.ssm_expand
+        if self.d_ff is None:
+            self.d_ff = self.d_model * 3
+
+    @property
+    def torch_dtype(self):
+        if self.dtype == "float16":
+            return torch.float16
+        elif self.dtype == "float32":
+            return torch.float32
+        elif self.dtype == "bfloat16":
+            return torch.bfloat16
+        else:
+            raise ValueError(f"Unsupported dtype: {self.dtype}")
+
+    @classmethod
+    def from_yaml(cls, path: str):
+        with open(path, 'r') as f:
+            config_dict = yaml.safe_load(f)
+        return cls(**config_dict)
+
+    def to_yaml(self, path: str):
+        with open(path, 'w') as f:
+            yaml.dump(self.__dict__, f)
aetheris/data.py ADDED
@@ -0,0 +1,231 @@
+import torch
+from torch.utils.data import DataLoader, IterableDataset
+from transformers import AutoTokenizer
+from datasets import load_dataset
+import json
+import random
+from typing import Iterator, List
+import os
+
+VOXLEX_SPECIAL_TOKENS = [
+    "<tool_call>", "</tool_call>",
+    "<tool_result>", "</tool_result>",
+    "<legal_cite>", "</legal_cite>",
+]
+
+
+def get_tokenizer(model_name: str = "gpt2", add_special_tokens: bool = False):
+    tokenizer = AutoTokenizer.from_pretrained(model_name)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    if add_special_tokens:
+        num_added = tokenizer.add_special_tokens(
+            {"additional_special_tokens": VOXLEX_SPECIAL_TOKENS}
+        )
+        if num_added > 0:
+            print(f"  Added {num_added} special tokens → vocab_size={len(tokenizer)}")
+    return tokenizer
+
+
+class StreamingDataset(IterableDataset):
+    def __init__(self, dataset, tokenizer, max_seq_len, mode="pretrain", buffer_size=100, skip_samples=0):
+        self.dataset = dataset
+        self.tokenizer = tokenizer
+        self.max_seq_len = max_seq_len
+        self.mode = mode
+        self.buffer_size = buffer_size
+        self.skip_samples = skip_samples
+
+    def _find_assistant_spans(self, text: str) -> List[tuple]:
+        """Find character spans of assistant responses in SFT text."""
+        spans = []
+        search_from = 0
+        while True:
+            start = text.find("<|assistant|>", search_from)
+            if start == -1:
+                break
+            content_start = start + len("<|assistant|>")
+            # End at next role tag or end of text
+            end = len(text)
+            for tag in ["<|user|>", "<|system|>", "<|tool|>", "<|endoftext|>"]:
+                pos = text.find(tag, content_start)
+                if pos != -1:
+                    end = min(end, pos)
+            spans.append((content_start, end))
+            search_from = end
+        return spans
+
+    def _prepare_sft_example(self, example):
+        """Prepare SFT example with label masking — loss only on assistant tokens."""
+        if 'messages' in example:
+            # Build text with role tags
+            text = ""
+            for msg in example['messages']:
+                role = msg.get('role', '')
+                content = msg.get('content', '')
+                text += f"<|{role}|>{content}"
+            text += self.tokenizer.eos_token
+        elif 'text' in example:
+            text = example['text']
+        else:
+            return None
+
+        if len(text) < 10:
+            return None
+
+        # Pre-truncate to avoid slow tokenization of very long texts
+        max_chars = self.max_seq_len * 5
+        if len(text) > max_chars:
+            text = text[:max_chars]
+
+        enc = self.tokenizer(text, truncation=True, max_length=self.max_seq_len,
+                             return_tensors="pt")
+        input_ids = enc['input_ids'][0]
+
+        if len(input_ids) < 2:
+            return None
+
+        # Build labels: -100 for non-assistant tokens
+        labels = torch.full_like(input_ids, -100)
+        assistant_spans = self._find_assistant_spans(text)
+
+        for char_start, char_end in assistant_spans:
+            # Map character offsets to token positions
+            in_span = False
+            for tok_idx in range(len(input_ids)):
+                token_span = enc.token_to_chars(0, tok_idx)
+                if token_span is None:
+                    # Special token (e.g. <tool_call>) — include if neighbors are in span
+                    if in_span:
+                        labels[tok_idx] = input_ids[tok_idx]
+                    continue
+                tok_start, tok_end = token_span
+                # Token overlaps with assistant span
+                if tok_end > char_start and tok_start < char_end:
+                    labels[tok_idx] = input_ids[tok_idx]
+                    in_span = True
+                else:
+                    in_span = False
+
+        # Also train on eos token at the end
+        if input_ids[-1] == self.tokenizer.eos_token_id:
+            labels[-1] = input_ids[-1]
+
+        # Pad to max_seq_len
+        if len(input_ids) < self.max_seq_len:
+            pad_len = self.max_seq_len - len(input_ids)
+            input_ids = torch.cat([
+                input_ids,
+                torch.full((pad_len,), self.tokenizer.pad_token_id, dtype=torch.long)
+            ])
+            labels = torch.cat([
+                labels,
+                torch.full((pad_len,), -100, dtype=torch.long)
+            ])
+
+        return input_ids, labels
+
+    def _prepare_pretrain_example(self, example):
+        """Prepare pretraining example — loss on all non-pad tokens."""
+        text = example.get('text', '')
+        if len(text) < 10:
+            return None
+
+        # Pre-truncate text to avoid tokenizing 100K+ char documents
+        # GPT-2 averages ~4 chars per token; use 5x max_seq_len as safe limit
+        max_chars = self.max_seq_len * 5
+        if len(text) > max_chars:
+            text = text[:max_chars]
+
+        enc = self.tokenizer(text, truncation=True, max_length=self.max_seq_len,
+                             return_tensors="pt")
+        input_ids = enc['input_ids'][0]
+
+        if len(input_ids) < 2:
+            return None
+
+        labels = input_ids.clone()
+
+        if len(input_ids) < self.max_seq_len:
+            pad_len = self.max_seq_len - len(input_ids)
+            input_ids = torch.cat([
+                input_ids,
+                torch.full((pad_len,), self.tokenizer.pad_token_id, dtype=torch.long)
+            ])
+            labels = torch.cat([
+                labels,
+                torch.full((pad_len,), -100, dtype=torch.long)
+            ])
+
+        return input_ids, labels
+
+    def __iter__(self) -> Iterator[tuple]:
+        iterator = iter(self.dataset)
+        buffer = []
+
+        for example in iterator:
+            if self.mode == "pretrain":
+                result = self._prepare_pretrain_example(example)
+            else:
+                result = self._prepare_sft_example(example)
+
+            if result is None:
+                continue
+
+            buffer.append(result)
+
+            if len(buffer) >= self.buffer_size:
+                random.shuffle(buffer)
+                for _ in range(self.buffer_size // 2):
+                    item = buffer.pop()
+                    if self.skip_samples > 0:
+                        self.skip_samples -= 1
+                        continue
+                    yield item
+
+        # Yield remaining
+        random.shuffle(buffer)
+        while buffer:
+            item = buffer.pop()
+            if self.skip_samples > 0:
+                self.skip_samples -= 1
+                continue
+            yield item
+
+
+def _load_jsonl_dataset(path: str):
+    """Load a local JSONL file as a streaming iterable (no memory materialization)."""
+    from datasets import IterableDataset
+
+    def gen():
+        with open(path, 'r') as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    yield json.loads(line)
+
+    return IterableDataset.from_generator(gen)
+
+
+def create_streaming_loader(dataset_name, split, tokenizer, config, batch_size,
+                            mode="pretrain", hf_token=None, start_step=0):
+    # Support local JSONL files
+    if os.path.isfile(dataset_name) and dataset_name.endswith('.jsonl'):
+        print(f"  Loading local dataset: {dataset_name}")
+        raw_dataset = _load_jsonl_dataset(dataset_name)
+    else:
+        raw_dataset = load_dataset(dataset_name, split=split, streaming=True,
+                                   trust_remote_code=True, token=hf_token)
+
+    # Calculate samples to skip: start_step * batch_size
+    skip_samples = start_step * batch_size
+    if skip_samples > 0:
+        print(f"  [Loader] Resuming: Fast-forwarding dataset by {skip_samples} samples...")
+
+    stream_ds = StreamingDataset(raw_dataset, tokenizer, config.max_seq_len,
+                                 mode=mode, skip_samples=skip_samples)
+
+    # num_workers=0 avoids 4x data duplication with IterableDataset
+    # (each worker iterates the full dataset without sharding logic)
+    return DataLoader(stream_ds, batch_size=batch_size, pin_memory=True,
+                      num_workers=0)
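The character-to-token label masking above hinges on `_find_assistant_spans`. A dependency-free sketch of the same scan (as a standalone function, not the class method) shows exactly which substrings end up supervised; everything outside these spans is masked to -100:

```python
def find_assistant_spans(text):
    """Character spans of assistant turns, each ending at the next role tag."""
    spans, search_from = [], 0
    while True:
        start = text.find("<|assistant|>", search_from)
        if start == -1:
            break
        content_start = start + len("<|assistant|>")
        end = len(text)
        for tag in ["<|user|>", "<|system|>", "<|tool|>", "<|endoftext|>"]:
            pos = text.find(tag, content_start)
            if pos != -1:
                end = min(end, pos)
        spans.append((content_start, end))
        search_from = end
    return spans

chat = "<|user|>Hi<|assistant|>Hello!<|user|>Bye<|assistant|>Bye!"
spans = find_assistant_spans(chat)
print([chat[a:b] for a, b in spans])  # → ['Hello!', 'Bye!']
```

Only the two assistant replies survive as supervised spans; the user turns and role tags themselves carry no loss.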
aetheris/inference.py ADDED
@@ -0,0 +1,106 @@
+import torch
+import torch.nn.functional as F
+from typing import Optional, Generator
+from aetheris.config import AetherisConfig
+from aetheris.model import HybridMambaMoE
+from aetheris.data import get_tokenizer
+from aetheris.utils import load_latest_checkpoint
+
+class InferenceEngine:
+    def __init__(self, config_path: str = "configs/default.yaml", checkpoint_dir: str = "checkpoints", checkpoint_name: str = "checkpoint_current.pth", device: Optional[str] = None):
+        self.device = torch.device(device if device else ('cuda' if torch.cuda.is_available() else 'cpu'))
+        self.config = AetherisConfig.from_yaml(config_path)
+        self.tokenizer = get_tokenizer()
+
+        self.model = HybridMambaMoE(self.config).to(self.device).to(self.config.torch_dtype)
+
+        # Load checkpoint
+        # Note: load_latest_checkpoint expects optimizer and scaler, but for inference we can pass None
+        load_latest_checkpoint(self.model, None, None, self.device, checkpoint_dir, checkpoint_name)
+        self.model.eval()
+
+    def generate(self,
+                 prompt: str,
+                 max_new_tokens: int = 100,
+                 temperature: float = 0.8,
+                 top_k: int = 0,
+                 top_p: float = 0.9,
+                 repetition_penalty: float = 1.0,
+                 stream: bool = False) -> Generator[str, None, None] | str:
+
+        input_ids = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)
+        generated_ids = input_ids.clone()
+        history_ids = set(input_ids[0].tolist())
+
+        def token_generator():
+            nonlocal generated_ids
+            for _ in range(max_new_tokens):
+                # Skip autocast if the model runs in float32
+                use_autocast = self.config.torch_dtype != torch.float32
+
+                if use_autocast:
+                    with torch.amp.autocast('cuda' if self.device.type == 'cuda' else 'cpu', dtype=self.model.config.torch_dtype):
+                        outputs = self.model(generated_ids)
+                        logits = outputs['logits']
+                        next_token_logits = logits[:, -1, :]
+                else:
+                    outputs = self.model(generated_ids)
+                    logits = outputs['logits']
+                    next_token_logits = logits[:, -1, :]
+
+                # Repetition penalty
+                for token_id in history_ids:
+                    if token_id < next_token_logits.size(-1):
+                        logit = next_token_logits[0, token_id].item()
+                        if logit > 0:
+                            next_token_logits[0, token_id] = logit / repetition_penalty
+                        else:
+                            next_token_logits[0, token_id] = logit * repetition_penalty
+
+                # Temperature
+                if temperature > 0:
+                    next_token_logits = next_token_logits / temperature
+
+                # Top-p / Top-k
+                if top_p < 1.0:
+                    sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
+                    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+                    sorted_indices_to_remove = cumulative_probs > top_p
+                    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+                    sorted_indices_to_remove[..., 0] = False
+                    indices_to_remove = sorted_indices[sorted_indices_to_remove]
+                    next_token_logits.scatter_(1, indices_to_remove.unsqueeze(0), float('-inf'))
+                elif top_k > 0:
+                    top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k)
+                    next_token_logits = torch.full_like(next_token_logits, float('-inf'))
+                    next_token_logits.scatter_(1, top_k_indices, top_k_logits)
+
+                # Sample
+                next_token_probs = F.softmax(next_token_logits, dim=-1)
+                next_token = torch.multinomial(next_token_probs, num_samples=1)
+                next_token_item = next_token.item()
+
+                if next_token_item == self.tokenizer.eos_token_id:
+                    break
+
+                generated_ids = torch.cat([generated_ids, next_token], dim=-1)
+                history_ids.add(next_token_item)
+
+                new_token_text = self.tokenizer.decode(next_token.squeeze().tolist(), skip_special_tokens=True)
+                yield new_token_text
+
+        if stream:
+            return token_generator()
+        else:
+            return "".join(list(token_generator()))
+
+    def generate_full(self,
+                      prompt: str,
+                      max_new_tokens: int = 100,
+                      temperature: float = 0.8,
+                      top_k: int = 0,
+                      top_p: float = 0.9,
+                      repetition_penalty: float = 1.0) -> str:
+        return self.generate(prompt, max_new_tokens, temperature, top_k, top_p, repetition_penalty, stream=False)
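The top-p branch in `generate` keeps the smallest prefix of the sorted distribution whose cumulative mass exceeds `top_p`; the right-shift of `sorted_indices_to_remove` is what retains the first token that crosses the threshold. The same semantics in a pure-Python sketch, on plain lists rather than tensors:

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability exceeds top_p, then renormalize over the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)           # the token that crosses the threshold is kept
        cum += probs[i]
        if cum > top_p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
print(sorted(filtered))  # → [0, 1, 2]
```

With this distribution, tokens 0-2 carry cumulative mass 0.95 > 0.9, so token 3 is dropped and the rest are renormalized to sum to 1.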
aetheris/model.py ADDED
@@ -0,0 +1,104 @@
+import torch
+import torch.nn as nn
+import torch.utils.checkpoint
+from typing import Dict, Any
+from .config import AetherisConfig
+from .modules import SSMBlock, SparseMoELayer
+
+class HybridMambaMoE(nn.Module):
+    def __init__(self, config: AetherisConfig):
+        super().__init__()
+        self.config = config
+        self.embedding = nn.Embedding(config.vocab_size, config.d_model)
+
+        self.layers = nn.ModuleList()
+        for i in range(config.n_layer):
+            if i % 2 == 0:
+                self.layers.append(SSMBlock(config))
+            else:
+                self.layers.append(SparseMoELayer(config))
+
+        self.final_norm = nn.LayerNorm(config.d_model)
+        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
+        self.lm_head.weight = self.embedding.weight  # Weight tying
+
+        # Use -100 as ignore_index (PyTorch standard for label masking)
+        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
+        self.gradient_checkpointing = config.gradient_checkpointing
+
+        # Initialize embeddings with smaller scale
+        nn.init.normal_(self.embedding.weight, mean=0.0, std=0.02)
+
+    def resize_token_embeddings(self, new_vocab_size: int):
+        """Resize embedding and lm_head for new tokens. New embeddings initialized from mean of existing."""
+        old_vocab_size = self.embedding.num_embeddings
+        if new_vocab_size == old_vocab_size:
+            return
+        old_weight = self.embedding.weight.data
+        mean_embed = old_weight.mean(dim=0)
+        self.embedding = nn.Embedding(new_vocab_size, self.config.d_model)
+        self.embedding.weight.data[:old_vocab_size] = old_weight
+        self.embedding.weight.data[old_vocab_size:] = mean_embed.unsqueeze(0).expand(
+            new_vocab_size - old_vocab_size, -1
+        )
+        self.lm_head = nn.Linear(self.config.d_model, new_vocab_size, bias=False)
+        self.lm_head.weight = self.embedding.weight  # Re-tie weights
+        self.config.vocab_size = new_vocab_size
+
+    def _init_weights(self, module):
+        """Apply proper weight initialization"""
+        if isinstance(module, nn.Linear):
+            nn.init.xavier_uniform_(module.weight, gain=0.5)
+            if module.bias is not None:
+                nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            nn.init.normal_(module.weight, mean=0.0, std=0.02)
+        elif isinstance(module, nn.LayerNorm):
+            nn.init.ones_(module.weight)
+            nn.init.zeros_(module.bias)
+
+    def forward(self, input_ids: torch.Tensor, labels: torch.Tensor = None) -> Dict[str, Any]:
+        x = self.embedding(input_ids)
+        total_aux_loss = torch.tensor(0.0, device=x.device, dtype=x.dtype)
+
+        for i, layer in enumerate(self.layers):
+            if self.gradient_checkpointing and self.training:
+                # Checkpoint ALL layers for maximum memory savings
+                if isinstance(layer, SparseMoELayer):
+                    def moe_forward(module, inp):
+                        return module(inp)
+                    x, aux_loss = torch.utils.checkpoint.checkpoint(
+                        moe_forward, layer, x, use_reentrant=False
+                    )
+                    total_aux_loss = total_aux_loss + aux_loss
+                else:
+                    x = torch.utils.checkpoint.checkpoint(
+                        layer, x, use_reentrant=False
+                    )
+            else:
+                if isinstance(layer, SparseMoELayer):
+                    x, aux_loss = layer(x)
+                    total_aux_loss = total_aux_loss + aux_loss
+                else:
+                    x = layer(x)
+
+        x = self.final_norm(x)
+        logits = self.lm_head(x)
+
+        if labels is not None:
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            ce_loss = self.loss_fn(shift_logits.view(-1, self.config.vocab_size),
+                                   shift_labels.view(-1))
+
+            # Scale down aux loss to prevent it from dominating
+            total_loss = ce_loss + 0.01 * total_aux_loss
+
+            return {
+                "loss": total_loss,
+                "ce_loss": ce_loss,
+                "aux_loss": total_aux_loss,
+                "logits": logits
+            }
+
+        return {"logits": logits}
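The constructor interleaves block types by index parity (even = SSM, odd = MoE), so the 24-layer model from the config splits 12/12. A toy sketch of the resulting layer plan:

```python
def layer_plan(n_layer=24):
    """Even indices are SSM blocks, odd indices are sparse-MoE blocks."""
    return ["ssm" if i % 2 == 0 else "moe" for i in range(n_layer)]

plan = layer_plan()
print(plan[:4], plan.count("ssm"), plan.count("moe"))
# → ['ssm', 'moe', 'ssm', 'moe'] 12 12
```

Because the count is even, the stack always starts with an SSM block and ends with an MoE block.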
aetheris/modules/__init__.py ADDED
@@ -0,0 +1,3 @@
+from .expert import Expert
+from .ssm import SSMBlock, selective_scan_native
+from .moe import SparseMoELayer
aetheris/modules/expert.py ADDED
@@ -0,0 +1,35 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+class Expert(nn.Module):
+    """Memory-efficient Feed-Forward Network expert with proper initialization."""
+    def __init__(self, d_model: int, d_ff: int):
+        super().__init__()
+        self.w1 = nn.Linear(d_model, d_ff, bias=False)
+        self.w2 = nn.Linear(d_ff, d_model, bias=False)
+        self.act = nn.GELU()
+
+        # Proper initialization to prevent NaN
+        nn.init.xavier_uniform_(self.w1.weight, gain=0.5)
+        nn.init.xavier_uniform_(self.w2.weight, gain=0.5)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        orig_dtype = x.dtype
+        # Force float32 for internal computation to prevent overflow in half precision
+        x = x.to(torch.float32)
+
+        # Cast weights to float32 for calculation
+        # This is necessary because the module weights might be float16
+        w1_weight = self.w1.weight.to(torch.float32)
+        w2_weight = self.w2.weight.to(torch.float32)
+
+        h = F.linear(x, w1_weight)
+        h = self.act(h)
+        out = F.linear(h, w2_weight)
+
+        # Clamp to avoid Inf when casting back to float16
+        if orig_dtype == torch.float16:
+            out = torch.clamp(out, min=-65500.0, max=65500.0)
+
+        return out.to(orig_dtype)
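The ±65500 clamp exists because the largest finite float16 value is 65504; any fp32 value beyond that becomes `inf` on the downcast. A scalar sketch of the same guard:

```python
def clamp_for_fp16(x, bound=65500.0):
    """Clamp a float before casting to float16 so values beyond the fp16
    range (max finite value 65504) saturate at the bound instead of inf."""
    return max(-bound, min(bound, x))

print(clamp_for_fp16(1e6), clamp_for_fp16(-1e6), clamp_for_fp16(123.0))
# → 65500.0 -65500.0 123.0
```

Values already inside the representable range pass through unchanged; only the overflow cases saturate.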
aetheris/modules/moe.py ADDED
@@ -0,0 +1,83 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from ..config import AetherisConfig
+from .expert import Expert
+
+class SparseMoELayer(nn.Module):
+    """Memory-optimized Sparse MoE with efficient routing."""
+    def __init__(self, config: AetherisConfig):
+        super().__init__()
+        self.d_model = config.d_model
+        self.num_experts = config.num_experts
+        self.top_k = config.top_k
+        self.load_balancing_coef = config.load_balancing_coef
+        self.z_loss_coef = config.router_z_loss_coef
+
+        self.gate = nn.Linear(config.d_model, config.num_experts, bias=False)
+        self.experts = nn.ModuleList([Expert(config.d_model, config.d_ff)
+                                      for _ in range(config.num_experts)])
+        self.norm = nn.LayerNorm(config.d_model)
+
+    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
+        B, L, D = x.shape
+        x_norm = self.norm(x)
+        flat_x = x_norm.view(-1, D)
+
+        # Routing logits with stability
+        gate_logits = self.gate(flat_x)
+
+        # Clamp logits to prevent overflow
+        gate_logits = torch.clamp(gate_logits, min=-10.0, max=10.0)
+
+        # Z-loss for stability
+        z_loss = torch.mean(torch.logsumexp(gate_logits, dim=-1)**2) * self.z_loss_coef
+
+        if self.training:
+            # Low-magnitude router noise (kept small for stability)
+            gate_logits = gate_logits + torch.randn_like(gate_logits) * 1e-3
+
+        gate_probs = F.softmax(gate_logits, dim=-1)
+        gate_weights, expert_indices = torch.topk(gate_probs, self.top_k, dim=-1)
+
+        # Normalize weights for stability
+        gate_weights = gate_weights / (gate_weights.sum(dim=-1, keepdim=True) + 1e-8)
+
+        # Load balancing loss
+        # Use only the top-1 expert for load balancing calculation to keep it simple and consistent
+        expert_mask = F.one_hot(expert_indices[:, 0], num_classes=self.num_experts).float()
+        fraction_routed = expert_mask.mean(dim=0)
+        mean_prob = gate_probs.mean(dim=0)
+
+        aux_loss = (self.num_experts * torch.sum(fraction_routed * mean_prob)) * self.load_balancing_coef
+        total_aux_loss = aux_loss + z_loss
+
+        # Efficient dispatch with in-place operations
+        # Accumulate in float32 to prevent overflow during aggregation
+        final_output = torch.zeros_like(flat_x, dtype=torch.float32)
+
+        # Iterate over all k selected experts
+        for k_idx in range(self.top_k):
+            for i, expert in enumerate(self.experts):
+                # Find tokens routed to expert 'i' at the k-th position
+                mask = (expert_indices[:, k_idx] == i)
+                if not mask.any():
+                    continue
+
+                expert_input = flat_x[mask]
+                expert_out = expert(expert_input)
+
+                # Apply weights
+                weights = gate_weights[mask, k_idx].unsqueeze(1)
+
+                # Cast to float32 for accumulation
+                expert_out = expert_out.to(torch.float32)
+                weights = weights.to(torch.float32)
+
+                # Accumulate output (add to existing results from other experts)
+                final_output[mask] += expert_out * weights
+
+        # Cast back to original dtype
+        final_output = final_output.to(flat_x.dtype)
+
+        return x + final_output.view(B, L, D), total_aux_loss
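The load-balancing term above is the Switch-Transformer-style loss N · Σᵢ fᵢ·Pᵢ (fᵢ = fraction of tokens whose top-1 choice is expert i, Pᵢ = mean router probability for i), scaled by `load_balancing_coef`. A minimal sketch on plain lists shows that it bottoms out under uniform routing and grows when one expert dominates:

```python
def load_balancing_loss(fraction_routed, mean_prob, coef=1e-2):
    """Switch-style auxiliary loss: coef * N * sum_i f_i * P_i."""
    n = len(fraction_routed)
    return coef * n * sum(f * p for f, p in zip(fraction_routed, mean_prob))

# Perfectly balanced routing over 4 experts: 1e-2 * 4 * 4 * (1/4 * 1/4) = 0.01
balanced = load_balancing_loss([0.25] * 4, [0.25] * 4)
# All tokens collapsed onto expert 0: 1e-2 * 4 * (1.0 * 0.7) = 0.028
skewed = load_balancing_loss([1.0, 0.0, 0.0, 0.0], [0.7, 0.1, 0.1, 0.1])
print(balanced, skewed)  # → 0.01 0.028
```

Minimizing this term therefore pushes the router toward spreading tokens evenly across the four experts.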
aetheris/modules/ssm.py ADDED
@@ -0,0 +1,119 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from ..config import AetherisConfig
+
+# Try to import CUDA selective scan kernel
+try:
+    from mamba_ssm.ops.selective_scan_interface import selective_scan_fn
+    HAS_CUDA_SSM = True
+except ImportError:
+    HAS_CUDA_SSM = False
+
+@torch.jit.ignore
+def selective_scan_native(u: torch.Tensor, delta: torch.Tensor, A: torch.Tensor,
+                          B: torch.Tensor, C: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
+    """Fallback pure-Python scan (slow, O(L) sequential)."""
+    B_size, L, D_inner = u.shape
+    D_state = A.shape[-1]
+    original_dtype = u.dtype
+
+    h = torch.zeros(B_size, D_inner, D_state, device=u.device, dtype=torch.float32)
+    ys = []
+
+    u = u.float()
+    delta = delta.float()
+    A = A.float()
+    B = B.float()
+    C = C.float()
+    D = D.float()
+
+    for l in range(L):
+        dt = delta[:, l, :].unsqueeze(-1)
+        dA = torch.exp(dt * A)
+        B_l = B[:, l, :].unsqueeze(1)
+        dB = dt * B_l
+        u_t = u[:, l, :].unsqueeze(-1)
+        h = dA * h + dB * u_t
+        C_l = C[:, l, :].unsqueeze(1)
+        y_t = torch.sum(h * C_l, dim=-1)
+        ys.append(y_t)
+
+    y = torch.stack(ys, dim=1)
+    y = y + u * D
+    return y.to(dtype=original_dtype)
+
+class SSMBlock(nn.Module):
+    """State Space Model block with optional CUDA-accelerated selective scan."""
+    def __init__(self, config: AetherisConfig):
+        super().__init__()
+        self.d_model = config.d_model
+        self.d_state = config.ssm_d_state
+        self.d_inner = config.d_inner
+
+        self.in_proj = nn.Linear(self.d_model, self.d_inner * 2, bias=False)
+        self.out_proj = nn.Linear(self.d_inner, self.d_model, bias=False)
+        self.conv_d = nn.Conv1d(self.d_inner, self.d_inner, kernel_size=3,
+                                padding=2, groups=self.d_inner, bias=False)
+        self.gate_proj = nn.Linear(self.d_model, self.d_inner, bias=False)
+
+        self.B_proj = nn.Linear(self.d_inner, self.d_state, bias=False)
+        self.C_proj = nn.Linear(self.d_inner, self.d_state, bias=False)
+        self.delta_proj = nn.Linear(self.d_inner, self.d_inner, bias=False)
+
+        self.A_log = nn.Parameter(torch.randn(self.d_inner, self.d_state) * 0.1 - 4.0)
+        self.D = nn.Parameter(torch.ones(self.d_inner) * 0.1)
+
+        self.act = nn.SiLU()
+        self.norm = nn.LayerNorm(config.d_model)
+
+        nn.init.xavier_uniform_(self.in_proj.weight, gain=0.5)
+        nn.init.xavier_uniform_(self.out_proj.weight, gain=0.5)
+        nn.init.xavier_uniform_(self.gate_proj.weight, gain=0.5)
+        nn.init.xavier_uniform_(self.B_proj.weight, gain=0.5)
+        nn.init.xavier_uniform_(self.C_proj.weight, gain=0.5)
+        nn.init.xavier_uniform_(self.delta_proj.weight, gain=0.5)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        B, L, D = x.shape
+        x_norm = self.norm(x)
+
+        xz = self.in_proj(x_norm)
+        x_in, z_gate = xz.chunk(2, dim=-1)
+        x_conv = self.conv_d(x_in.transpose(1, 2))
+        x_conv = x_conv[:, :, :-2].transpose(1, 2)
+        x_conv = self.act(x_conv)
+
+        B_ssm = self.B_proj(x_conv)
+        C_ssm = self.C_proj(x_conv)
+
+        # A is (D_inner, D_state) — clamped and negated
+        A = -torch.exp(torch.clamp(self.A_log, min=-10.0, max=2.0))
+
+        if HAS_CUDA_SSM and x.is_cuda:
+            # CUDA kernel expects float32 — cast inputs and cast output back
+            original_dtype = x_conv.dtype
+            delta_raw = self.delta_proj(x_conv)
+            y_ssm = selective_scan_fn(
+                x_conv.transpose(1, 2).contiguous().float(),     # (B, D_inner, L)
+                delta_raw.transpose(1, 2).contiguous().float(),  # (B, D_inner, L)
+                A.contiguous().float(),                          # (D_inner, D_state)
+                B_ssm.transpose(1, 2).contiguous().float(),      # (B, D_state, L)
+                C_ssm.transpose(1, 2).contiguous().float(),      # (B, D_state, L)
+                self.D.float(),                                  # (D_inner,)
+                z=None,
+                delta_bias=None,
+                delta_softplus=True,
+                return_last_state=False,
+            )
+            y_ssm = y_ssm.to(dtype=original_dtype).transpose(1, 2)  # Back to (B, L, D_inner)
+        else:
+            # Fallback: pure Python sequential scan
+            delta = torch.clamp(F.softplus(self.delta_proj(x_conv)), max=5.0) + 1e-4
+            A_batched = A.unsqueeze(0).expand(B, -1, -1)
+            y_ssm = selective_scan_native(x_conv, delta, A_batched, B_ssm, C_ssm, self.D)
+
+        y_gate = F.silu(self.gate_proj(x_norm)) * y_ssm
+        output = self.out_proj(y_gate)
+
+        return x + output
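`selective_scan_native` implements the discretized recurrence h_t = exp(Δt·A)·h_{t-1} + Δt·B_t·u_t with readout y_t = C_t·h_t + D·u_t. A scalar sketch (d_inner = d_state = 1, plain floats instead of tensors) makes the sequential state dependence explicit; with constant inputs the state relaxes toward its fixed point:

```python
import math

def selective_scan_1d(u, delta, A, B, C, D):
    """Scalar version of the recurrence in selective_scan_native:
    h_t = exp(dt*A) * h_{t-1} + dt*B_t*u_t ;  y_t = C_t*h_t + D*u_t."""
    h, ys = 0.0, []
    for u_t, dt, b_t, c_t in zip(u, delta, B, C):
        h = math.exp(dt * A) * h + dt * b_t * u_t  # discretized state update
        ys.append(c_t * h + D * u_t)               # readout plus skip term
    return ys

y = selective_scan_1d(u=[1.0] * 3, delta=[0.1] * 3, A=-1.0,
                      B=[1.0] * 3, C=[1.0] * 3, D=0.0)
print(y)  # monotonically increasing toward the fixed point dt*B*u / (1 - exp(dt*A))
```

With A < 0 the factor exp(Δt·A) is below 1, so the state decays rather than explodes; that is why `A_log` is clamped and negated in the block above.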
aetheris/trainer/__init__.py ADDED
@@ -0,0 +1 @@
+from .trainer import Trainer
aetheris/trainer/trainer.py ADDED
@@ -0,0 +1,176 @@
+import torch
+import time
+import os
+from aetheris.utils import save_checkpoint, load_latest_checkpoint, calculate_model_stats
+from aetheris.data import get_tokenizer
+
+class Trainer:
+    def __init__(self, model, optimizer, scaler, config, device, checkpoint_dir, logger=None, grad_accum_steps=1):
+        self.model = model
+        self.optimizer = optimizer
+        self.scaler = scaler
+        self.config = config
+        self.device = device
+        self.checkpoint_dir = checkpoint_dir
+        self.logger = logger
+        self.grad_accum_steps = grad_accum_steps
+        self.scheduler = None  # Set by CLI before train_epoch()
+
+        self.model.to(self.device)
+
+    def validate(self, val_loader, global_step):
+        self.model.eval()
+        total_loss = 0
+        total_items = 0
+        num_batches = 100  # Validate on 100 batches to save time
+
+        print(f"\n[Validation] Starting validation at step {global_step}...")
+
+        with torch.no_grad():
+            for i, batch in enumerate(val_loader):
+                if i >= num_batches:
+                    break
+
+                input_ids, labels = batch
+                input_ids = input_ids.to(self.device, non_blocking=True)
+                labels = labels.to(self.device, non_blocking=True)
+
+                # bf16 autocast (assumes Ampere or newer), skipped when the model runs in float32
+                autocast_dtype = torch.bfloat16
+                use_autocast = self.config.torch_dtype != torch.float32
+
+                if use_autocast:
+                    with torch.amp.autocast('cuda', dtype=autocast_dtype):
+                        output = self.model(input_ids, labels)
+                else:
+                    output = self.model(input_ids, labels)
+
+                total_loss += output["loss"].item()
+                total_items += 1
+
+        avg_loss = total_loss / total_items if total_items > 0 else 0
+        perplexity = torch.exp(torch.tensor(avg_loss)).item()
+
+        print(f"[Validation] Step {global_step} | Loss: {avg_loss:.4f} | PPL: {perplexity:.4f}")
+        self.model.train()
+        return avg_loss
+
+    def train_epoch(self, train_loader, total_steps, start_step=0, stage_name="Training", val_loader=None, eval_every=500):
+        print(f"\n{'='*70}\nStarting {stage_name}: Target Steps={total_steps} (Accum={self.grad_accum_steps})\n{'='*70}", flush=True)
+        self.model.train()
+        global_step = start_step
+        running_loss = 0
+
+        print("Initializing data iterator...")
+        train_iter = iter(train_loader)
+
+        print("Fetching first batch...")
+
+        # Zero gradients initially
+        self.optimizer.zero_grad(set_to_none=True)
+
+        while global_step < total_steps:
+            step_start = time.time()
+
+            # Removed periodic cache clearing for performance
+
+            try:
+                batch = next(train_iter)
+                if global_step == start_step:
+                    print("✓ First batch loaded! Starting training loop...", flush=True)
+            except StopIteration:
+                train_iter = iter(train_loader)
+                batch = next(train_iter)
+
+            data_time = time.time() - step_start
+            input_ids, labels = batch
+            input_ids = input_ids.to(self.device, non_blocking=True)
+            labels = labels.to(self.device, non_blocking=True)
+
+            gpu_start = time.time()
+            # bf16 autocast (assumes Ampere or newer; bf16 avoids fp16 range overflow)
+            autocast_dtype = torch.bfloat16
+
+            # Skip autocast entirely if the model runs in float32
+            use_autocast = self.config.torch_dtype != torch.float32
+
+            if use_autocast:
+                with torch.amp.autocast('cuda', dtype=autocast_dtype):
+                    output = self.model(input_ids, labels)
+                    # Scale loss for accumulation
+                    loss = output["loss"] / self.grad_accum_steps
+            else:
+                output = self.model(input_ids, labels)
+                loss = output["loss"] / self.grad_accum_steps
+
+            # NaN loss detection — skip batch entirely to prevent corruption
+            if torch.isnan(loss) or torch.isinf(loss):
+                nan_count = getattr(self, '_nan_count', 0) + 1
+                self._nan_count = nan_count
+                print(f"WARNING: NaN/Inf loss at step {global_step} (count={nan_count}), skipping batch", flush=True)
+                self.optimizer.zero_grad(set_to_none=True)
+                global_step += 1
+                continue
+
+            loss.backward()
+            if self.device.type == 'cuda':
+                torch.cuda.synchronize()
+            gpu_time = time.time() - gpu_start
+
+            # Gradient accumulation step
+            if (global_step + 1) % self.grad_accum_steps == 0:
+                # Gradient clipping
+                grad_norm = torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=0.5)
+
+                if torch.isnan(grad_norm) or torch.isinf(grad_norm):
+                    print(f"WARNING: NaN/Inf gradient at step {global_step}, skipping update", flush=True)
+                    self.optimizer.zero_grad(set_to_none=True)
+                else:
+                    self.optimizer.step()
+                    self.optimizer.zero_grad(set_to_none=True)
+
+                # Step LR scheduler once per optimizer update
+                if self.scheduler is not None:
+                    self.scheduler.step()
+
+            global_step += 1
+            running_loss += loss.item() * self.grad_accum_steps  # Un-scale for reporting
+
+            # Per-step progress file for monitoring (cheap I/O)
+            if global_step <= 20 or global_step % 100 == 0:
+                total_elapsed = time.time() - step_start
+                with open("/workspace/training_progress.log", "a") as pf:
+                    pf.write(f"step={global_step} loss={loss.item() * self.grad_accum_steps:.4f} total={total_elapsed:.1f}s data={data_time:.1f}s gpu={gpu_time:.1f}s\n")
+
+            if global_step % 10 == 0:
+                avg_loss = running_loss / 10
+                t_diff = time.time() - step_start
+                if self.device.type == 'cuda':
+                    mem = torch.cuda.memory_allocated() / 1e9
+                    max_mem = torch.cuda.max_memory_allocated() / 1e9
+                    mem_str = f"VRAM: {mem:.1f}GB (peak: {max_mem:.1f}GB)"
+                else:
+                    mem_str = "CPU Mode"
+
+                tokens_per_sec = (self.config.max_seq_len * input_ids.size(0)) / t_diff
+                current_lr = self.optimizer.param_groups[0]['lr']
+                msg = (f"  Step {global_step}/{total_steps} | Loss: {avg_loss:.4f} | "
+                       f"LR: {current_lr:.2e} | {mem_str} | {tokens_per_sec:.0f} tok/s")
+                print(msg, flush=True)
+ # Write progress to file (bypasses stdout buffering)
164
+ with open("/workspace/training_progress.log", "a") as pf:
165
+ pf.write(msg + "\n")
166
+ running_loss = 0
167
+
168
+ if global_step % 500 == 0:
169
+ save_checkpoint(self.model, self.optimizer, self.scaler, global_step, stage_name, self.checkpoint_dir)
170
+ with open("/workspace/training_progress.log", "a") as pf:
171
+ pf.write(f" [Checkpoint saved at step {global_step}]\n")
172
+
173
+ if val_loader is not None and global_step % eval_every == 0 and global_step > start_step:
174
+ self.validate(val_loader, global_step)
175
+
176
+ return global_step
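
The loss scaling inside `train_epoch` (`loss = output["loss"] / self.grad_accum_steps`) is what makes gradient accumulation equivalent to a larger batch. A minimal self-contained sketch with hypothetical tensors (not the Aetheris model) shows that accumulating scaled micro-batch gradients reproduces the full-batch mean gradient:

```python
import torch

# Minimal sketch: dividing each micro-batch loss by grad_accum_steps before
# backward() makes the accumulated gradient equal the gradient of the mean
# loss over the whole effective batch.
torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
data = torch.randn(8, 4)
grad_accum_steps = 4

# Accumulated path: 4 micro-batches of 2 samples each
for chunk in data.split(2):
    loss = (chunk @ w).pow(2).mean() / grad_accum_steps
    loss.backward()  # gradients accumulate into w.grad
accum_grad = w.grad.clone()

# Reference path: one full batch of 8 samples
w.grad = None
(data @ w).pow(2).mean().backward()

assert torch.allclose(accum_grad, w.grad, atol=1e-6)
```

This is also why the logging code multiplies `loss.item()` back by `grad_accum_steps`: the scaled loss is only meaningful for backprop, not for reporting.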
aetheris/utils.py ADDED
@@ -0,0 +1,55 @@
+ import os
+ import torch
+ from typing import Tuple
+
+ def save_checkpoint(model, optimizer, scaler, step, stage, checkpoint_dir, checkpoint_name="checkpoint_current.pth"):
+     os.makedirs(checkpoint_dir, exist_ok=True)
+     path = os.path.join(checkpoint_dir, checkpoint_name)
+     torch.save({
+         'step': step,
+         'stage': stage,
+         'model_state_dict': model.state_dict(),
+         'optimizer_state_dict': optimizer.state_dict(),
+         'scaler_state_dict': scaler.state_dict()
+     }, path)
+     print(f" [Checkpoint] Saved at step {step}")
+
+ def load_latest_checkpoint(model, optimizer, scaler, device, checkpoint_dir, checkpoint_name="checkpoint_current.pth") -> Tuple[int, str]:
+     path = os.path.join(checkpoint_dir, checkpoint_name)
+     if not os.path.exists(path):
+         return 0, "Pre-Training"
+
+     print(f" [Checkpoint] Loading from {path}...")
+     ckpt = torch.load(path, map_location=device)
+     state = ckpt['model_state_dict']
+
+     # Handle vocab size mismatch (base checkpoint may have fewer tokens than model)
+     model_vocab = model.config.vocab_size
+     for key in ("embedding.weight", "lm_head.weight"):
+         if key in state and state[key].shape[0] < model_vocab:
+             old = state[key]
+             pad_size = model_vocab - old.shape[0]
+             mean_vec = old.mean(dim=0)
+             state[key] = torch.cat([old, mean_vec.unsqueeze(0).expand(pad_size, -1)])
+             print(f" [Checkpoint] Padded {key}: {old.shape[0]} → {model_vocab}")
+
+     model.load_state_dict(state, strict=False)
+
+     if optimizer and 'optimizer_state_dict' in ckpt:
+         try:
+             optimizer.load_state_dict(ckpt['optimizer_state_dict'])
+         except (ValueError, KeyError):
+             print(" [Checkpoint] Optimizer state incompatible (vocab resize), using fresh optimizer")
+     if scaler and 'scaler_state_dict' in ckpt:
+         scaler.load_state_dict(ckpt['scaler_state_dict'])
+     return ckpt['step'], ckpt['stage']
+
+ def calculate_model_stats(model):
+     total_params = sum(p.numel() for p in model.parameters())
+     trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+     return {
+         'total_params': total_params,
+         'trainable_params': trainable_params,
+         'active_params': int(total_params * 0.6),  # Approximation
+         'sparsity_ratio': 0.6  # Approximation
+     }
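
The vocab-mismatch handling in `load_latest_checkpoint` pads new embedding rows with the mean of the existing rows, which keeps the new tokens near the center of the learned embedding distribution rather than at random positions. A small-tensor sketch of that exact operation:

```python
import torch

# Sketch of the mean-vector padding used in load_latest_checkpoint:
# grow a (2, 2) embedding table to a vocab of 4 rows, where each new
# row is the mean of the existing rows.
old = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # checkpoint vocab = 2
model_vocab = 4
pad_size = model_vocab - old.shape[0]
mean_vec = old.mean(dim=0)                      # tensor([2., 3.])
padded = torch.cat([old, mean_vec.unsqueeze(0).expand(pad_size, -1)])

assert padded.shape == (4, 2)
assert torch.equal(padded[2], torch.tensor([2.0, 3.0]))
assert torch.equal(padded[3], torch.tensor([2.0, 3.0]))
```

Note that because the padded rows all start identical, they only differentiate through subsequent fine-tuning; the `strict=False` load then tolerates any remaining key mismatches.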
config.yaml ADDED
@@ -0,0 +1,17 @@
+ checkpoint_ssm_layers: true
+ d_ff: 3072
+ d_inner: 2048
+ d_model: 1024
+ dtype: float16
+ gradient_checkpointing: true
+ load_balancing_coef: 0.01
+ max_seq_len: 2048
+ n_layer: 24
+ num_experts: 4
+ router_z_loss_coef: 0.001
+ ssm_d_state: 16
+ ssm_expand: 2
+ top_k: 1
+ use_cpu_offload: false
+ use_flash_attention: false
+ vocab_size: 261019
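
The `num_experts: 4` / `top_k: 1` pair in this config is what makes the MoE layers sparse: each token routes to one of four expert FFNs, so only a quarter of the expert parameters are active per token. A back-of-envelope sketch (assuming a standard two-matrix FFN per expert, ignoring biases and router weights, which the config does not specify):

```python
# Sketch: what top_k=1 routing over num_experts=4 implies for the
# per-layer expert parameter count, using d_model and d_ff from config.yaml.
d_model, d_ff = 1024, 3072
num_experts, top_k = 4, 1

per_expert = 2 * d_model * d_ff           # up-projection + down-projection
total_expert_params = num_experts * per_expert
active_expert_params = top_k * per_expert  # only the routed expert runs

assert per_expert == 6_291_456
assert total_expert_params == 25_165_824
assert active_expert_params == total_expert_params // 4
```

This is consistent with the hard-coded `sparsity_ratio: 0.6` approximation in `calculate_model_stats`, since the dense SSM layers and the shared embedding keep the overall active fraction well above 1/4.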
pytorch_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9133520b370ce0ebab902748e74c0e60898f0ffe2c2f0d54f66f9412f40e9921
+ size 2886684406
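
The `pytorch_model.pt` entry above is a Git LFS pointer file, not the weights themselves; the three `key value` lines fully describe the blob. A sketch of parsing one (the pointer text is copied from this commit):

```python
# Sketch: a git-lfs pointer file is plain text with "key value" lines.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:9133520b370ce0ebab902748e74c0e60898f0ffe2c2f0d54f66f9412f40e9921
size 2886684406
"""
fields = dict(line.split(" ", 1) for line in pointer.strip().splitlines())
algo, digest = fields["oid"].split(":", 1)

assert fields["version"].endswith("/spec/v1")
assert algo == "sha256" and len(digest) == 64
# ~2.89 GB, consistent with roughly 722M fp32 parameters (4 bytes each)
assert int(fields["size"]) == 2_886_684_406
```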