Upload folder using huggingface_hub

#40
source/model/__init__.py ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ model — LLM architecture package.
3
+
4
+ Public API:
5
+ LLM : top-level decoder-only transformer/hybrid language model
6
+ LMConfig : configuration dataclass
7
+ Mamba2Block: Mamba-2 SSD block (used internally by LLM in hybrid mode)
8
+ """
9
+
10
+ from .config import LMConfig
11
+ from .mamba_block import Mamba2Block
12
+ from .transformer import LLM
13
+
14
+ __all__ = [
15
+ "LLM",
16
+ "LMConfig",
17
+ "Mamba2Block",
18
+ ]
source/model/attention.py ADDED
@@ -0,0 +1,263 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Multi-Head (and Grouped-Query) Attention with optional FlashAttention-2 backend.
3
+ """
4
+
5
+ from __future__ import annotations
6
+
7
+ import math
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.nn.functional as F
12
+
13
+ from .config import LMConfig
14
+
15
+ # ---------------------------------------------------------------------------
16
+ # Optional FlashAttention import
17
+ # ---------------------------------------------------------------------------
18
+ try:
19
+ from flash_attn import flash_attn_func # type: ignore[import]
20
+ HAS_FLASH_ATTN = True
21
+ except ImportError:
22
+ HAS_FLASH_ATTN = False
23
+
24
+ # ---------------------------------------------------------------------------
25
+ # Optional TransformerEngine import (FP8 support)
26
+ # ---------------------------------------------------------------------------
27
+ try:
28
+ import transformer_engine.pytorch as te # type: ignore[import]
29
+ HAS_TE = True
30
+ except ImportError:
31
+ te = None # type: ignore[assignment]
32
+ HAS_TE = False
33
+
34
+
35
+ # ---------------------------------------------------------------------------
36
+ # Rotary embedding helper
37
+ # ---------------------------------------------------------------------------
38
+
39
def apply_rotary_emb(
    x: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
) -> torch.Tensor:
    """Rotate a query or key tensor with rotary positional embeddings.

    Args:
        x:   (B, T, H, D_head)
        cos: (T, D_head // 2) — as produced by RotaryEmbedding.forward
        sin: (T, D_head // 2) — as produced by RotaryEmbedding.forward

    Returns:
        A tensor with the same shape and dtype as *x*, rotated.
    """
    half = x.shape[-1] // 2
    first = x[..., :half]   # (B, T, H, D//2)
    second = x[..., half:]  # (B, T, H, D//2)

    # Reshape cos/sin from (T, D//2) to (1, T, 1, D//2) so they broadcast
    # over the batch and head dimensions.
    cos_b = cos[None, :, None, :]
    sin_b = sin[None, :, None, :]

    out = torch.cat(
        (first * cos_b - second * sin_b, first * sin_b + second * cos_b),
        dim=-1,
    )
    # cos/sin are float32, so the products may have been upcast — restore
    # the caller's dtype.
    return out.to(x.dtype)
69
+
70
+
71
+
72
+ # ---------------------------------------------------------------------------
73
+ # Multi-Head Attention
74
+ # ---------------------------------------------------------------------------
75
+
76
class MultiHeadAttention(nn.Module):
    """Multi-head (or grouped-query) causal self-attention.

    Supports:
        - Standard MHA: n_kv_heads == n_heads
        - GQA / MQA: n_kv_heads < n_heads (must evenly divide n_heads)

    Attention backend:
        - FlashAttention-2 when available, on CUDA, and config.use_flash_attn
        - Vanilla scaled dot-product otherwise (causal mask via upper-triangular)
    """

    def __init__(self, config: LMConfig) -> None:
        super().__init__()

        self.n_heads = config.n_heads
        self.n_kv_heads = config.n_kv_heads  # resolved in LMConfig.__post_init__
        self.head_dim = config.d_model // config.n_heads
        self.d_model = config.d_model
        self.dropout = config.dropout
        self.use_flash = config.use_flash_attn

        # Number of query heads sharing each KV head.
        self.n_rep = self.n_heads // self.n_kv_heads

        # Projections ----------------------------------------------------
        # Select Linear implementation: te.Linear (FP8) or nn.Linear (BF16).
        _Linear = te.Linear if (config.use_fp8 and HAS_TE) else nn.Linear

        # Fused QKV projection: single GEMM (d_model → q_dim + 2 * kv_dim).
        # For GQA 24:8 with head_dim=128: 3072 + 1024 + 1024 = 5120.
        self._q_dim = self.n_heads * self.head_dim
        self._kv_dim = self.n_kv_heads * self.head_dim
        self.qkv_proj = _Linear(
            config.d_model,
            self._q_dim + 2 * self._kv_dim,
            bias=config.bias,
        )
        self.out_proj = _Linear(
            config.d_model,
            config.d_model,
            bias=config.bias,
        )

    # ------------------------------------------------------------------
    # KV-head expansion for GQA
    # ------------------------------------------------------------------

    @staticmethod
    def _repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
        """Expand KV heads to match the number of query heads.

        Args:
            x: (B, T, n_kv_heads, head_dim)
            n_rep: repetition factor

        Returns:
            (B, T, n_kv_heads * n_rep, head_dim)
        """
        if n_rep == 1:
            return x
        # repeat_interleave keeps the n_rep copies of each KV head adjacent,
        # matching the query-head grouping convention.
        return x.repeat_interleave(n_rep, dim=2)

    # ------------------------------------------------------------------
    # Forward
    # ------------------------------------------------------------------

    def forward(
        self,
        x: torch.Tensor,
        cos: torch.Tensor,
        sin: torch.Tensor,
    ) -> torch.Tensor:
        """
        Args:
            x: (B, T, C)
            cos: (T, head_dim // 2) — from RotaryEmbedding
            sin: (T, head_dim // 2) — from RotaryEmbedding

        Returns:
            (B, T, C)
        """
        B, T, _ = x.shape

        # --- Fused QKV projection (single GEMM) --------------------------------
        qkv = self.qkv_proj(x)  # (B, T, q_dim + 2*kv_dim)
        q, k, v = qkv.split([self._q_dim, self._kv_dim, self._kv_dim], dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim)
        k = k.view(B, T, self.n_kv_heads, self.head_dim)
        v = v.view(B, T, self.n_kv_heads, self.head_dim)

        # FlashAttention-2 and rotary embedding require bf16/fp16.
        # te.Linear with MXFP8 may emit FP8-format output tensors; cast if needed.
        if q.dtype not in (torch.float16, torch.bfloat16):
            q = q.to(torch.bfloat16)
            k = k.to(torch.bfloat16)
            v = v.to(torch.bfloat16)

        # --- Rotary embeddings -----------------------------------------------
        q = apply_rotary_emb(q, cos, sin)
        k = apply_rotary_emb(k, cos, sin)

        # --- Attention -------------------------------------------------------
        if self.use_flash and HAS_FLASH_ATTN and x.is_cuda:
            attn_out = self._flash_attention(q, k, v, B, T)
        else:
            attn_out = self._standard_attention(q, k, v, B, T)

        # --- Output projection -----------------------------------------------
        # attn_out: (B, T, C)
        return self.out_proj(attn_out)

    # ------------------------------------------------------------------
    # FlashAttention-2 path
    # ------------------------------------------------------------------

    def _flash_attention(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        B: int,
        T: int,
    ) -> torch.Tensor:
        """Run FlashAttention-2.

        flash_attn_func expects inputs in (B, T, H, D) layout and returns
        (B, T, H, D). FlashAttention-2 natively supports GQA via head count
        mismatch (q has n_heads, k/v have n_kv_heads) — no KV expansion needed.
        """
        # Attention dropout is only active during training.
        dropout_p = self.dropout if self.training else 0.0

        # flash_attn_func: (B, T, H, D) → (B, T, H, D)
        # GQA is handled natively: q=(B,T,n_heads,D), k/v=(B,T,n_kv_heads,D)
        out = flash_attn_func(q, k, v, dropout_p=dropout_p, causal=True)

        # Reshape (B, T, n_heads, head_dim) → (B, T, C)
        return out.reshape(B, T, self.n_heads * self.head_dim)

    # ------------------------------------------------------------------
    # Standard (fallback) attention path
    # ------------------------------------------------------------------

    def _standard_attention(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        B: int,
        T: int,
    ) -> torch.Tensor:
        """Vanilla scaled dot-product causal attention.

        Softmax is computed in float32 for numerical stability.
        """
        # Expand KV heads for GQA
        k = self._repeat_kv(k, self.n_rep)  # (B, T, n_heads, head_dim)
        v = self._repeat_kv(v, self.n_rep)  # (B, T, n_heads, head_dim)

        # (B, T, H, D) → (B, H, T, D)
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        scale = math.sqrt(self.head_dim)

        # Scaled dot-product: (B, H, T, T)
        scores = torch.matmul(q, k.transpose(-2, -1)) / scale

        # Causal mask: fill upper triangle (excluding diagonal) with -inf
        causal_mask = torch.triu(
            torch.ones(T, T, device=q.device, dtype=torch.bool), diagonal=1
        )
        scores = scores.masked_fill(causal_mask, float("-inf"))

        # Softmax in fp32, then cast back
        attn_weights = F.softmax(scores.float(), dim=-1).to(q.dtype)

        if self.training and self.dropout > 0.0:
            attn_weights = F.dropout(attn_weights, p=self.dropout)

        # Weighted sum: (B, H, T, D)
        out = torch.matmul(attn_weights, v)

        # (B, H, T, D) → (B, T, H, D) → (B, T, C)
        out = out.transpose(1, 2).contiguous().reshape(B, T, self.d_model)
        return out
source/model/config.py ADDED
@@ -0,0 +1,186 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ LMConfig: configuration dataclass for the LLM model architecture.
3
+ """
4
+
5
+ from __future__ import annotations
6
+
7
+ import math
8
+ from dataclasses import dataclass, field
9
+ from pathlib import Path
10
+ from typing import Optional
11
+
12
+ import json
13
+
14
+ import yaml
15
+
16
+
17
+ def _round_to_multiple(n: int, multiple: int) -> int:
18
+ """Round n up to the nearest multiple of `multiple`."""
19
+ return math.ceil(n / multiple) * multiple
20
+
21
+
22
@dataclass
class LMConfig:
    """Architecture configuration for the LLM model."""

    # Vocabulary
    vocab_size: int = 32000

    # Model dimensions
    d_model: int = 768
    n_layers: int = 12
    n_heads: int = 12

    # Grouped-query attention: None → standard MHA (n_kv_heads == n_heads)
    n_kv_heads: Optional[int] = None

    # Feed-forward hidden dimension: None → auto-computed
    d_ffn: Optional[int] = None

    # Sequence length
    max_seq_len: int = 2048

    # RoPE base frequency
    rope_theta: float = 10000.0

    # Regularisation
    dropout: float = 0.0
    bias: bool = False

    # Attention backend
    use_flash_attn: bool = True

    # FP8 quantization
    use_fp8: bool = False

    # Hybrid Mamba-Transformer settings
    use_hybrid: bool = False
    hybrid_pattern: str = ""  # e.g. "M M A M M M M A M M M M M M M M M M A M" for 40-layer Nemotron-H style
    # Mamba-2 SSM parameters
    mamba_d_state: int = 128
    mamba_head_dim: int = 64
    mamba_expand: int = 2
    mamba_conv_kernel: int = 4
    mamba_n_groups: int = 1
    mamba_chunk_size: int = 256

    def __post_init__(self) -> None:
        # A None n_kv_heads means full multi-head attention.
        if self.n_kv_heads is None:
            self.n_kv_heads = self.n_heads

        # GQA requires query heads to partition evenly over KV heads.
        if self.n_heads % self.n_kv_heads != 0:
            raise ValueError(
                f"n_heads ({self.n_heads}) must be divisible by "
                f"n_kv_heads ({self.n_kv_heads})"
            )

        # LLaMA-style FFN sizing: round(8/3 * d_model), rounded up to a
        # multiple of 256.
        if self.d_ffn is None:
            self.d_ffn = _round_to_multiple(int(8 / 3 * self.d_model), 256)

        # A hybrid model is meaningless without a per-layer pattern.
        if self.use_hybrid and not self.hybrid_pattern.strip():
            raise ValueError(
                "use_hybrid=True requires a non-empty hybrid_pattern "
                "(space-separated 'M'/'A' per layer)"
            )

        # TransformerEngine FP8 kernels need 16-aligned GEMM dimensions.
        if self.use_fp8:
            if self.d_model % 16 != 0:
                raise ValueError(f"FP8: d_model ({self.d_model}) must be divisible by 16")
            if self.d_ffn % 16 != 0:
                raise ValueError(f"FP8: d_ffn ({self.d_ffn}) must be divisible by 16")

    # ------------------------------------------------------------------
    # Properties
    # ------------------------------------------------------------------

    @property
    def num_params(self) -> int:
        """Approximate parameter count using the 12 * L * d^2 rule."""
        return 12 * self.n_layers * self.d_model ** 2

    @property
    def head_dim(self) -> int:
        """Dimensionality of each attention head."""
        return self.d_model // self.n_heads

    # ------------------------------------------------------------------
    # Serialisation helpers
    # ------------------------------------------------------------------

    def to_dict(self) -> dict:
        """Return a plain-Python-dict representation of the config."""
        return dict(
            vocab_size=self.vocab_size,
            d_model=self.d_model,
            n_layers=self.n_layers,
            n_heads=self.n_heads,
            n_kv_heads=self.n_kv_heads,
            d_ffn=self.d_ffn,
            max_seq_len=self.max_seq_len,
            rope_theta=self.rope_theta,
            dropout=self.dropout,
            bias=self.bias,
            use_flash_attn=self.use_flash_attn,
            use_fp8=self.use_fp8,
            use_hybrid=self.use_hybrid,
            hybrid_pattern=self.hybrid_pattern,
            mamba_d_state=self.mamba_d_state,
            mamba_head_dim=self.mamba_head_dim,
            mamba_expand=self.mamba_expand,
            mamba_conv_kernel=self.mamba_conv_kernel,
            mamba_n_groups=self.mamba_n_groups,
            mamba_chunk_size=self.mamba_chunk_size,
        )

    def to_yaml(self, path: str | Path) -> None:
        """Serialise config to a YAML file."""
        target = Path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        with open(target, "w", encoding="utf-8") as f:
            yaml.safe_dump(self.to_dict(), f, default_flow_style=False, sort_keys=False)

    @classmethod
    def from_dict(cls, d: dict) -> "LMConfig":
        """Construct a LMConfig from a plain dict (e.g. loaded from YAML)."""
        return cls(**d)

    @classmethod
    def from_yaml(cls, path: str | Path) -> "LMConfig":
        """Load config from a YAML file."""
        with open(path, "r", encoding="utf-8") as f:
            data = yaml.safe_load(f)
        # A shared multi-section config may nest the model settings under
        # a 'model' key — unwrap it when present.
        if "model" in data and isinstance(data["model"], dict):
            data = data["model"]
        return cls.from_dict(data)

    @classmethod
    def from_hf_config(cls, path: str | Path) -> "LMConfig":
        """Load config from a HuggingFace-format config.json (LlamaForCausalLM)."""
        with open(Path(path), "r", encoding="utf-8") as f:
            hf = json.load(f)

        # Newer HF configs nest rope_theta under "rope_parameters"; older
        # ones expose it at the top level. Default: 10000.0.
        rope_params = hf.get("rope_parameters")
        if isinstance(rope_params, dict):
            rope_theta = float(rope_params.get("rope_theta", 10000.0))
        elif "rope_theta" in hf:
            rope_theta = float(hf["rope_theta"])
        else:
            rope_theta = 10000.0

        return cls(
            vocab_size=hf["vocab_size"],
            d_model=hf["hidden_size"],
            n_layers=hf["num_hidden_layers"],
            n_heads=hf["num_attention_heads"],
            n_kv_heads=hf.get("num_key_value_heads", hf["num_attention_heads"]),
            d_ffn=hf["intermediate_size"],
            max_seq_len=hf.get("max_position_embeddings", 4096),
            rope_theta=rope_theta,
            dropout=hf.get("attention_dropout", 0.0),
            bias=hf.get("attention_bias", False),
        )
source/model/layers.py ADDED
@@ -0,0 +1,127 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Reusable building-block layers: RMSNorm, RotaryEmbedding, SwiGLU.
3
+ """
4
+
5
+ from __future__ import annotations
6
+
7
+ import torch
8
+ import torch.nn as nn
9
+ import torch.nn.functional as F
10
+
11
+
12
+ # ---------------------------------------------------------------------------
13
+ # Optional TransformerEngine import (FP8 support)
14
+ # ---------------------------------------------------------------------------
15
+ try:
16
+ import transformer_engine.pytorch as te # type: ignore[import]
17
+ HAS_TE = True
18
+ except ImportError:
19
+ te = None # type: ignore[assignment]
20
+ HAS_TE = False
21
+
22
+
23
+ # ---------------------------------------------------------------------------
24
+ # RMS Layer Normalisation
25
+ # ---------------------------------------------------------------------------
26
+
27
class RMSNorm(nn.Module):
    """Root-Mean-Square layer normalisation (Zhang & Sennrich, 2019).

    The statistics are computed in float32 for numerical stability; the
    normalised tensor is cast back to the input dtype before the learned
    per-channel gain is applied.
    """

    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Upcast, scale by the inverse RMS over the last dim, restore dtype.
        xf = x.float()
        inv_rms = torch.rsqrt(xf.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (xf * inv_rms).to(x.dtype) * self.weight
47
+
48
+
49
+ # ---------------------------------------------------------------------------
50
+ # Rotary Positional Embedding
51
+ # ---------------------------------------------------------------------------
52
+
53
class RotaryEmbedding(nn.Module):
    """Precomputed rotary positional embeddings (Su et al., RoFormer 2021).

    The cos/sin tables (max_seq_len × dim // 2) are registered as
    non-persistent buffers so that ``.to()`` / ``.cuda()`` carry them along
    with the module.
    """

    def __init__(self, dim: int, max_seq_len: int, theta: float = 10000.0) -> None:
        super().__init__()
        self.dim = dim
        self.max_seq_len = max_seq_len
        self.theta = theta

        cos_table, sin_table = self._build_tables(dim, max_seq_len, theta)
        self.register_buffer("_cos_cached", cos_table, persistent=False)
        self.register_buffer("_sin_cached", sin_table, persistent=False)

    @staticmethod
    def _build_tables(
        dim: int, max_seq_len: int, theta: float
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Return (cos, sin) tables, each of shape (max_seq_len, dim // 2)."""
        half = dim // 2
        # Per-channel inverse frequencies: theta^(-i/half) for i in [0, half).
        inv_freq = 1.0 / (
            theta ** (torch.arange(0, half, dtype=torch.float32) / half)
        )
        positions = torch.arange(max_seq_len, dtype=torch.float32)
        # Angle matrix via broadcasting: (T, 1) * (1, D//2) → (T, D//2).
        angles = positions[:, None] * inv_freq[None, :]
        return angles.cos(), angles.sin()

    def forward(self, seq_len: int, device: torch.device) -> tuple[torch.Tensor, torch.Tensor]:
        """Return (cos, sin) slices of shape (seq_len, dim // 2) on *device*.

        Sequences longer than the precomputed table trigger an on-the-fly
        rebuild (rare, but a graceful fallback).
        """
        if seq_len <= self.max_seq_len:
            return (
                self._cos_cached[:seq_len].to(device),
                self._sin_cached[:seq_len].to(device),
            )
        cos_t, sin_t = self._build_tables(self.dim, seq_len, self.theta)
        return cos_t.to(device), sin_t.to(device)
103
+
104
+
105
+ # ---------------------------------------------------------------------------
106
+ # SwiGLU Feed-Forward Network
107
+ # ---------------------------------------------------------------------------
108
+
109
class SwiGLU(nn.Module):
    """SwiGLU feed-forward block (Shazeer, 2020).

    Computes ``down_proj(SiLU(gate_proj(x)) * up_proj(x))``. Gate and up are
    separate linear layers so the gating path can learn an independent
    representation of the input.
    """

    def __init__(self, d_model: int, d_ffn: int, bias: bool = False) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ffn, bias=bias)
        self.up_proj = nn.Linear(d_model, d_ffn, bias=bias)
        self.down_proj = nn.Linear(d_ffn, d_model, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated elementwise product, projected back to d_model.
        gate = F.silu(self.gate_proj(x))
        return self.down_proj(gate * self.up_proj(x))
source/model/mamba_block.py ADDED
@@ -0,0 +1,280 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Mamba-2 block based on the Structured State Space Duality (SSD) formulation.
3
+
4
+ Reference: "Transformers are SSMs: Generalized Models and Efficient Algorithms
5
+ Through Structured State Space Duality" (Dao & Gu, 2024).
6
+
7
+ This implements a pure-PyTorch sequential scan for correctness and generality.
8
+ A chunked SSD kernel can be swapped in later for speed.
9
+ """
10
+
11
+ from __future__ import annotations
12
+
13
+ import math
14
+
15
+ import torch
16
+ import torch.nn as nn
17
+ import torch.nn.functional as F
18
+
19
+ from .layers import RMSNorm
20
+
21
+
22
+ # ---------------------------------------------------------------------------
23
+ # Selective Scan (sequential, numerically stable in float32)
24
+ # ---------------------------------------------------------------------------
25
+
26
def selective_scan(
    x: torch.Tensor,
    dt: torch.Tensor,
    A_log: torch.Tensor,
    B: torch.Tensor,
    C: torch.Tensor,
    D: torch.Tensor,
    n_groups: int,
) -> torch.Tensor:
    """Evaluate the diagonal SSM recurrence sequentially over time.

    Args:
        x: (B, L, n_heads, head_dim) — input after conv + activation.
        dt: (B, L, n_heads) — discretisation time-steps (after softplus).
        A_log: (n_heads,) — log(-A), learnable diagonal decay.
        B: (B, L, n_groups, d_state) — input-to-state projection per step.
        C: (B, L, n_groups, d_state) — state-to-output projection per step.
        D: (n_heads,) — skip/residual connection per head.
        n_groups: number of B/C groups (heads within a group share B/C).

    Returns:
        y: (B, L, n_heads, head_dim) — SSM output.
    """
    batch, seq_len, n_heads, head_dim = x.shape
    d_state = B.shape[-1]

    # Map each head to the B/C group it belongs to (consecutive heads share
    # one group), and expand B/C to per-head tensors once, outside the loop.
    head_to_group = torch.arange(n_heads, device=x.device) // (n_heads // n_groups)
    B_heads = B[:, :, head_to_group, :]  # (B, L, n_heads, d_state)
    C_heads = C[:, :, head_to_group, :]  # (B, L, n_heads, d_state)

    # Per-step decay exp(A * dt) with A = -exp(A_log): (B, L, n_heads).
    decay = torch.exp(-A_log.exp().view(1, 1, -1) * dt)

    # Discretised input term dt * x: (B, L, n_heads, head_dim).
    scaled_x = dt.unsqueeze(-1) * x

    y = torch.zeros_like(x)
    # Running state, accumulated in float32 for numerical stability.
    state = x.new_zeros(batch, n_heads, head_dim, d_state, dtype=torch.float32)

    for t in range(seq_len):
        # Decay the state, then add the rank-1 contribution (dt*x) ⊗ B.
        state = state * decay[:, t].float().unsqueeze(-1).unsqueeze(-1)
        state = state + (
            scaled_x[:, t].float().unsqueeze(-1)
            * B_heads[:, t].float().unsqueeze(-2)
        )

        # Read out through C: contract the state dimension.
        step_out = torch.einsum("bnhd,bnd->bnh", state, C_heads[:, t].float())
        y[:, t] = step_out.to(x.dtype)

    # Per-head skip connection D * x.
    return y + D.view(1, 1, n_heads, 1) * x
100
+
101
+
102
+ # ---------------------------------------------------------------------------
103
+ # Mamba-2 Block
104
+ # ---------------------------------------------------------------------------
105
+
106
class Mamba2Block(nn.Module):
    """Mamba-2 (SSD) block with a pre-norm residual connection.

    Pipeline:
        1. RMSNorm (pre-norm)
        2. Fused input projection → (z, x, B, C, dt)
        3. Causal depth-wise Conv1d on x
        4. SiLU activation on x
        5. Selective scan (SSM recurrence)
        6. Gated output: y * SiLU(z)
        7. Output projection + residual

    Args:
        d_model: Model hidden dimension.
        d_state: SSM state dimension N (default 128).
        head_dim: Per-head dimension for SSD (default 64).
        expand: Expansion factor for the inner dimension (default 2).
        conv_kernel: Causal 1D convolution kernel size (default 4).
        n_groups: Number of groups for the B/C projections (default 1).
        chunk_size: Chunk size for the SSD algorithm — reserved for future
            use (default 256).
    """

    def __init__(
        self,
        d_model: int,
        d_state: int = 128,
        head_dim: int = 64,
        expand: int = 2,
        conv_kernel: int = 4,
        n_groups: int = 1,
        chunk_size: int = 256,
    ) -> None:
        super().__init__()

        self.d_model = d_model
        self.d_state = d_state
        self.head_dim = head_dim
        self.expand = expand
        self.n_groups = n_groups
        self.chunk_size = chunk_size

        # Derived dimensions.
        self.d_inner = expand * d_model
        self.n_heads = self.d_inner // head_dim
        assert self.d_inner % head_dim == 0, (
            f"d_inner ({self.d_inner}) must be divisible by head_dim ({head_dim})"
        )
        assert self.n_heads % n_groups == 0, (
            f"n_heads ({self.n_heads}) must be divisible by n_groups ({n_groups})"
        )

        # Pre-norm.
        self.norm = RMSNorm(d_model)

        # Fused input projection producing z, x, B, C and dt in one GEMM.
        self.d_proj = (
            2 * self.d_inner          # z (gate) + x (conv/SSM input)
            + 2 * n_groups * d_state  # B and C
            + self.n_heads            # dt, one scalar per head
        )
        self.in_proj = nn.Linear(d_model, self.d_proj, bias=False)

        # Depth-wise causal convolution over x; the (kernel-1) left padding
        # produces trailing extra positions that forward() trims away.
        self.conv1d = nn.Conv1d(
            in_channels=self.d_inner,
            out_channels=self.d_inner,
            kernel_size=conv_kernel,
            groups=self.d_inner,
            padding=conv_kernel - 1,
        )

        # SSM parameters.
        # A_log stores log(-A), the diagonal decay — init from log(U(1, 16)).
        self.A_log = nn.Parameter(torch.log(torch.rand(self.n_heads) * 15.0 + 1.0))

        # Per-head skip connection — init to ones.
        self.D = nn.Parameter(torch.ones(self.n_heads))

        # Bias added to dt before softplus — init from log(U(0.001, 0.1)).
        self.dt_bias = nn.Parameter(torch.log(torch.rand(self.n_heads) * 0.099 + 0.001))

        # Output projection.
        self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)

    # ------------------------------------------------------------------
    # Helpers
    # ------------------------------------------------------------------

    def _split_projection(
        self, proj: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
        """Split the fused input projection into (z, x, B, C, dt).

        Args:
            proj: (B, L, d_proj)

        Returns:
            z: (B, L, d_inner)
            x: (B, L, d_inner)
            B: (B, L, n_groups, d_state)
            C: (B, L, n_groups, d_state)
            dt: (B, L, n_heads)
        """
        batch, seq_len, _ = proj.shape
        bc_dim = self.n_groups * self.d_state

        z, x, B_flat, C_flat, dt = torch.split(
            proj,
            [self.d_inner, self.d_inner, bc_dim, bc_dim, self.n_heads],
            dim=-1,
        )
        B = B_flat.reshape(batch, seq_len, self.n_groups, self.d_state)
        C = C_flat.reshape(batch, seq_len, self.n_groups, self.d_state)
        return z, x, B, C, dt

    # ------------------------------------------------------------------
    # Forward
    # ------------------------------------------------------------------

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (B, L, d_model) — input hidden states.

        Returns:
            (B, L, d_model) — output with the residual connection applied.
        """
        shortcut = x
        hidden = self.norm(x)

        # Fused input projection, then split into the five components.
        z, x_inner, B, C, dt_raw = self._split_projection(self.in_proj(hidden))

        # Causal depth-wise convolution (Conv1d wants (B, C, L)); trim the
        # trailing (kernel-1) padded positions to keep the output length L.
        seq_len = x_inner.shape[1]
        conv_out = self.conv1d(x_inner.transpose(1, 2))[:, :, :seq_len]
        conv_out = F.silu(conv_out.transpose(1, 2))  # (B, L, d_inner)

        # Discretisation steps: softplus keeps dt positive.
        dt = F.softplus(dt_raw + self.dt_bias)  # (B, L, n_heads)

        # Multi-head view for the scan.
        batch = conv_out.shape[0]
        heads_in = conv_out.reshape(batch, seq_len, self.n_heads, self.head_dim)

        # Selective scan (SSM recurrence): (B, L, n_heads, head_dim).
        scan_out = selective_scan(
            heads_in, dt, self.A_log, B, C, self.D,
            n_groups=self.n_groups,
        )

        # Flatten heads, gate with SiLU(z), project back, add residual.
        flat = scan_out.reshape(batch, seq_len, self.d_inner)
        return shortcut + self.out_proj(flat * F.silu(z))
source/model/transformer.py ADDED
@@ -0,0 +1,370 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Full transformer: TransformerBlock and top-level LLM model.
3
+ Supports pure Transformer and hybrid Mamba-2 + Transformer architectures.
4
+ """
5
+
6
+ from __future__ import annotations
7
+
8
+ from pathlib import Path
9
+ from typing import Optional
10
+
11
+ import torch
12
+ import torch.nn as nn
13
+ import torch.nn.functional as F
14
+
15
+ from .config import LMConfig
16
+ from .layers import RMSNorm, RotaryEmbedding, SwiGLU
17
+ from .attention import MultiHeadAttention
18
+ from .mamba_block import Mamba2Block
19
+
20
+ # ---------------------------------------------------------------------------
21
+ # Optional TransformerEngine import (FP8 support)
22
+ # ---------------------------------------------------------------------------
23
+ try:
24
+ import transformer_engine.pytorch as te # type: ignore[import]
25
+ HAS_TE = True
26
+ except ImportError:
27
+ te = None # type: ignore[assignment]
28
+ HAS_TE = False
29
+
30
+
31
+ # ---------------------------------------------------------------------------
32
+ # HuggingFace ↔ Custom weight conversion helpers
33
+ # ---------------------------------------------------------------------------
34
+
35
+ def _load_hf_state_dict(path: Path) -> dict[str, torch.Tensor]:
36
+ """Load weights from HF safetensors (or pytorch_model.bin fallback)."""
37
+ safetensors_path = path / "model.safetensors"
38
+ if safetensors_path.exists():
39
+ from safetensors.torch import load_file
40
+ return load_file(str(safetensors_path), device="cpu")
41
+ bin_path = path / "pytorch_model.bin"
42
+ if bin_path.exists():
43
+ return torch.load(bin_path, map_location="cpu", weights_only=True)
44
+ raise FileNotFoundError(f"No model.safetensors or pytorch_model.bin in {path}")
45
+
46
+
47
+ def _convert_hf_to_custom(hf_sd: dict[str, torch.Tensor], config: LMConfig) -> dict[str, torch.Tensor]:
48
+ """Convert HuggingFace LlamaForCausalLM state dict to our custom format.
49
+
50
+ Key mapping:
51
+ HF: model.embed_tokens.weight → embedding.weight
52
+ HF: model.layers.{i}.self_attn.q/k/v_proj.weight → layers.{i}.attn.qkv_proj.weight (fused)
53
+ HF: model.layers.{i}.self_attn.o_proj.weight → layers.{i}.attn.out_proj.weight
54
+ HF: model.layers.{i}.input_layernorm.weight → layers.{i}.attn_norm.weight
55
+ HF: model.layers.{i}.mlp.gate_proj.weight → layers.{i}.ffn.gate_proj.weight
56
+ HF: model.layers.{i}.mlp.up_proj.weight → layers.{i}.ffn.up_proj.weight
57
+ HF: model.layers.{i}.mlp.down_proj.weight → layers.{i}.ffn.down_proj.weight
58
+ HF: model.layers.{i}.post_attention_layernorm.weight → layers.{i}.ffn_norm.weight
59
+ HF: model.norm.weight → norm.weight
60
+ HF: lm_head.weight → lm_head.weight
61
+ """
62
+ sd: dict[str, torch.Tensor] = {}
63
+
64
+ sd["embedding.weight"] = hf_sd["model.embed_tokens.weight"]
65
+ sd["norm.weight"] = hf_sd["model.norm.weight"]
66
+ sd["lm_head.weight"] = hf_sd["lm_head.weight"]
67
+
68
+ for i in range(config.n_layers):
69
+ pfx = f"model.layers.{i}"
70
+ out = f"layers.{i}"
71
+
72
+ # Fuse Q, K, V into single qkv_proj
73
+ q = hf_sd[f"{pfx}.self_attn.q_proj.weight"]
74
+ k = hf_sd[f"{pfx}.self_attn.k_proj.weight"]
75
+ v = hf_sd[f"{pfx}.self_attn.v_proj.weight"]
76
+ sd[f"{out}.attn.qkv_proj.weight"] = torch.cat([q, k, v], dim=0)
77
+
78
+ sd[f"{out}.attn.out_proj.weight"] = hf_sd[f"{pfx}.self_attn.o_proj.weight"]
79
+ sd[f"{out}.attn_norm.weight"] = hf_sd[f"{pfx}.input_layernorm.weight"]
80
+
81
+ sd[f"{out}.ffn.gate_proj.weight"] = hf_sd[f"{pfx}.mlp.gate_proj.weight"]
82
+ sd[f"{out}.ffn.up_proj.weight"] = hf_sd[f"{pfx}.mlp.up_proj.weight"]
83
+ sd[f"{out}.ffn.down_proj.weight"] = hf_sd[f"{pfx}.mlp.down_proj.weight"]
84
+ sd[f"{out}.ffn_norm.weight"] = hf_sd[f"{pfx}.post_attention_layernorm.weight"]
85
+
86
+ return sd
87
+
88
+
89
+ # ---------------------------------------------------------------------------
90
+ # Transformer Block
91
+ # ---------------------------------------------------------------------------
92
+
93
class TransformerBlock(nn.Module):
    """Pre-norm transformer decoder block: attention then FFN, each residual.

        x = x + Attention(RMSNorm(x))
        x = x + FFN(RMSNorm(x))

    When FP8 is requested and TransformerEngine is installed, the FFN norm is
    fused into ``te.LayerNormMLP`` and no separate ``ffn_norm`` module exists.
    """

    def __init__(self, config: LMConfig) -> None:
        super().__init__()
        self.attn_norm = RMSNorm(config.d_model)
        self.attn = MultiHeadAttention(config)
        self._use_fp8 = config.use_fp8 and HAS_TE

        if not self._use_fp8:
            self.ffn_norm = RMSNorm(config.d_model)
            self.ffn = SwiGLU(config.d_model, config.d_ffn, bias=config.bias)
        else:
            # te.LayerNormMLP fuses RMSNorm + gate/up/down projections into one
            # kernel and normalises internally, so ffn_norm is not needed.
            self.ffn_norm = None
            self.ffn = te.LayerNormMLP(
                hidden_size=config.d_model,
                ffn_hidden_size=config.d_ffn,
                bias=config.bias,
                activation="swiglu",
                normalization="RMSNorm",
            )

    def forward(
        self,
        x: torch.Tensor,
        cos: torch.Tensor,
        sin: torch.Tensor,
    ) -> torch.Tensor:
        """Apply the block.

        Args:
            x: (B, T, C) hidden states.
            cos: (T, head_dim // 2) rotary cosines.
            sin: (T, head_dim // 2) rotary sines.

        Returns:
            (B, T, C) transformed hidden states.
        """
        # Pre-norm attention with residual.
        x = x + self.attn(self.attn_norm(x), cos, sin)
        # The fused TE MLP normalises internally; otherwise norm explicitly.
        ffn_in = x if self._use_fp8 else self.ffn_norm(x)
        return x + self.ffn(ffn_in)
145
+
146
+
147
+ # ---------------------------------------------------------------------------
148
+ # Full Language Model
149
+ # ---------------------------------------------------------------------------
150
+
151
class LLM(nn.Module):
    """Decoder-only transformer language model.

    Features:
        - Learned token embeddings with weight tying to the LM head
        - Rotary positional embeddings (no learned position embeddings)
        - Stack of pre-norm TransformerBlocks (optionally interleaved with
          Mamba-2 blocks in hybrid mode)
        - Final RMSNorm before the LM head
        - Optional cross-entropy loss computation (for training)
    """

    def __init__(self, config: LMConfig) -> None:
        super().__init__()
        self.config = config

        # --- Embedding -------------------------------------------------------
        self.embedding = nn.Embedding(config.vocab_size, config.d_model)

        # --- Layers (pure Transformer or hybrid Mamba-Transformer) -----------
        if config.use_hybrid and config.hybrid_pattern:
            # hybrid_pattern is a whitespace-separated string, one entry per
            # layer: 'M' = Mamba-2 block, 'A' = attention block.
            pattern = config.hybrid_pattern.strip().split()
            if len(pattern) != config.n_layers:
                raise ValueError(
                    f"hybrid_pattern has {len(pattern)} entries but "
                    f"n_layers={config.n_layers}"
                )
            layers: list[nn.Module] = []
            # Track which layers are Mamba vs Attention for forward dispatch
            self._layer_types: list[str] = pattern
            for layer_type in pattern:
                if layer_type == "M":
                    layers.append(Mamba2Block(
                        d_model=config.d_model,
                        d_state=config.mamba_d_state,
                        head_dim=config.mamba_head_dim,
                        expand=config.mamba_expand,
                        conv_kernel=config.mamba_conv_kernel,
                        n_groups=config.mamba_n_groups,
                        chunk_size=config.mamba_chunk_size,
                    ))
                elif layer_type == "A":
                    layers.append(TransformerBlock(config))
                else:
                    raise ValueError(
                        f"Unknown layer type '{layer_type}' in hybrid_pattern. "
                        f"Use 'M' (Mamba) or 'A' (Attention)."
                    )
            self.layers = nn.ModuleList(layers)
        else:
            self._layer_types = ["A"] * config.n_layers
            self.layers = nn.ModuleList(
                [TransformerBlock(config) for _ in range(config.n_layers)]
            )

        # --- Final normalisation and LM head ---------------------------------
        self.norm = RMSNorm(config.d_model)
        # NOTE: lm_head stays a plain nn.Linear — required for embedding weight
        # tying and for TE FP8 compatibility.
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)

        # Weight tying: share embedding and LM-head weight matrices.
        # Tying before _init_weights is safe: both modules reference the same
        # tensor, so the later normal_ init simply runs on that shared tensor.
        self.lm_head.weight = self.embedding.weight

        # --- Rotary embeddings -----------------------------------------------
        self.rope = RotaryEmbedding(
            dim=config.head_dim,
            max_seq_len=config.max_seq_len,
            theta=config.rope_theta,
        )

        # --- Initialise weights ----------------------------------------------
        self.apply(self._init_weights)

    # ------------------------------------------------------------------
    # Weight initialisation
    # ------------------------------------------------------------------

    @staticmethod
    def _init_weights(module: nn.Module) -> None:
        """Apply standard initialisation:
        - Linear / Embedding weights: N(0, 0.02)
        - Bias parameters: zeros
        - te.Linear / te.LayerNormMLP: skipped (TE manages its own init)
        - Mamba2Block: skipped (manages its own init)
        """
        # TE modules handle their own weight initialisation.
        if HAS_TE and isinstance(module, (te.Linear, te.LayerNormMLP)):
            return
        # Mamba2Block handles its own parameter init (A_log, D, dt_bias, etc.)
        if isinstance(module, Mamba2Block):
            return
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                nn.init.zeros_(module.bias)

    # ------------------------------------------------------------------
    # Forward pass
    # ------------------------------------------------------------------

    def forward(
        self,
        input_ids: torch.Tensor,
        targets: Optional[torch.Tensor] = None,
    ) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
        """
        Args:
            input_ids: (B, T) long tensor of token indices
            targets: (B, T) long tensor of target token indices, or None.
                Use -1 (ignore_index) to mask positions.
                NOTE(review): -1 differs from HF's conventional -100 —
                callers preparing labels must use -1 here.

        Returns:
            logits: (B, T, vocab_size)
            loss: scalar cross-entropy loss, or None if targets is None
        """
        B, T = input_ids.shape
        device = input_ids.device

        # Token embeddings: (B, T, C)
        x = self.embedding(input_ids)

        # Rotary cos/sin for this sequence length: (T, head_dim // 2)
        # Only needed for Attention layers, but precomputed once for all.
        cos, sin = self.rope(T, device)

        # Run through blocks — Mamba blocks take no positional inputs, so
        # cos/sin are only passed to Attention ('A') layers.
        for layer, ltype in zip(self.layers, self._layer_types):
            if ltype == "M":
                x = layer(x)
            else:
                x = layer(x, cos, sin)

        # Final normalisation
        x = self.norm(x)

        # LM head: (B, T, vocab_size)
        logits = self.lm_head(x)

        # Compute loss if targets are provided
        loss: Optional[torch.Tensor] = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1),
                ignore_index=-1,
            )

        return logits, loss

    # ------------------------------------------------------------------
    # Properties
    # ------------------------------------------------------------------

    @property
    def num_params(self) -> int:
        """Number of trainable parameters (tied weights counted once per tensor)."""
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def get_input_embeddings(self) -> nn.Embedding:
        """HuggingFace-compatible accessor for the token embedding layer."""
        return self.embedding

    # ------------------------------------------------------------------
    # Constructors
    # ------------------------------------------------------------------

    @classmethod
    def from_config(cls, config: LMConfig) -> "LLM":
        """Construct an LLM from an LMConfig instance."""
        return cls(config)

    @classmethod
    def from_pretrained(cls, path: str | Path) -> "LLM":
        """Load model from a checkpoint directory.

        Supports two formats (auto-detected):
            1. Custom: config.yaml + model.pt
            2. HuggingFace: config.json + model.safetensors (LlamaForCausalLM)

        Raises:
            FileNotFoundError: if neither config.yaml nor config.json exists.
        """
        path = Path(path)

        # --- Custom format ---
        if (path / "config.yaml").exists():
            config = LMConfig.from_yaml(path / "config.yaml")
            model = cls(config)
            # weights_only=True refuses arbitrary pickled objects.
            state_dict = torch.load(
                path / "model.pt",
                map_location="cpu",
                weights_only=True,
            )
            model.load_state_dict(state_dict)
            return model

        # --- HuggingFace format ---
        if (path / "config.json").exists():
            config = LMConfig.from_hf_config(path / "config.json")
            model = cls(config)
            hf_sd = _load_hf_state_dict(path)
            our_sd = _convert_hf_to_custom(hf_sd, config)
            # NOTE(review): this model ties lm_head.weight to embedding.weight,
            # so loading both keys writes the same tensor twice. If the HF
            # checkpoint is untied (distinct embed/lm_head weights) the later
            # assignment wins — confirm checkpoints used here are tied.
            model.load_state_dict(our_sd)
            return model

        raise FileNotFoundError(
            f"No config.yaml or config.json found in {path}"
        )

    # ------------------------------------------------------------------
    # Persistence
    # ------------------------------------------------------------------

    def save_pretrained(self, path: str | Path) -> None:
        """Save config and model weights to a directory.

        Creates:
            <path>/config.yaml
            <path>/model.pt
        """
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        self.config.to_yaml(path / "config.yaml")
        torch.save(self.state_dict(), path / "model.pt")