data-archetype committed 4 days ago

Commit

1b703d5

0 Parent(s):

Upload DINAC-AE export package

Files changed (28)

.gitattributes +35 -0
README.md +135 -0
common/norms.py +236 -0
common/rope.py +536 -0
config.json +26 -0
dinac_ae/__init__.py +12 -0
dinac_ae/adaln.py +75 -0
dinac_ae/config.py +75 -0
dinac_ae/decoder.py +163 -0
dinac_ae/encoder.py +215 -0
dinac_ae/fcdm_block.py +103 -0
dinac_ae/model.py +333 -0
dinac_ae/norms.py +39 -0
dinac_ae/samplers.py +258 -0
dinac_ae/straight_through_encoder.py +57 -0
dinac_ae/time_embed.py +83 -0
dinac_ae/vp_diffusion.py +152 -0
dit/attention_blocks.py +240 -0
dit/axial_rope2d.py +1728 -0
dit/blocks.py +259 -0
dit/body_config.py +33 -0
dit/mlp.py +117 -0
dit/mlp_types.py +51 -0
dit/position_encoding.py +23 -0
dit/repa_projection.py +226 -0
dit/xattn_blocks.py +177 -0
model.safetensors +3 -0
technical_report_dinac_ae.md +390 -0

.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,135 @@

+---
+license: apache-2.0
+tags:
+  - diffusion
+  - autoencoder
+  - image-reconstruction
+  - latent-space
+  - dino
+  - pytorch
+---
+# data-archetype/dinac_ae
+**DINAC-AE** is a **DIN**O-**A**ligned **C**lass-token **A**uto**E**ncoder.
+It follows the [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae)
+family: patch-16 spatial latents, a VP diffusion decoder, and DINO-aligned
+representations.
+Relative to SemDisDiffAE, DINAC-AE changes the encoder from FCDM blocks to a
+6-block ViT/DiT-style transformer encoder and uses DINOv3 ViT-B/16 alignment.
+The latent-to-DINO alignment head is extended to predict the DINO class token
+as well as patch tokens. `predict_class(latents)` exposes that class-token
+feature directly from latents.
+## 2k PSNR Benchmark
+| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | P5 (dB) | P95 (dB) |
+|---|---:|---:|---:|---:|---:|
+| dinac_ae | `35.19` | `4.53` | `35.06` | `28.02` | `42.43` |
+| FLUX.2 VAE | `36.28` | `4.53` | `36.07` | `28.89` | `43.63` |
+Evaluated on `2000` validation images.
+DINAC-AE targets a compromise between high reconstruction quality, a learnable
+latent space with KL-like variance expansion, DINOv3 alignment, and robustness
+to local token errors.
+[Results viewer](https://huggingface.co/spaces/data-archetype/dinac_ae-results)
+shows the 39-image reconstruction set with DINAC-AE and FLUX.2 VAE
+reconstructions, RGB differences, and latent PCA.
+Rechecking the released export on that 39-image set gives `35.15 dB` mean PSNR
+(`25.73 dB` min, `45.99 dB` max).
+[Full technical report](https://huggingface.co/data-archetype/dinac_ae/blob/main/technical_report_dinac_ae.md)
+## Encode Throughput
+Measured on an `NVIDIA GeForce RTX 5090` in `bfloat16`, averaging repeated
+batches per resolution.
+| Resolution | Batch Size | dinac_ae encode (ms/batch) | FLUX.2 encode (ms/batch) | dinac_ae peak VRAM (MiB) | FLUX.2 peak VRAM (MiB) | Speedup vs FLUX.2 | Peak VRAM Reduction vs FLUX.2 |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| `256x256` | `128` | `50` | `383` | `1,637` | `12,511` | `7.62x` | `86.9%` |
+| `512x512` | `32` | `53` | `354` | `1,639` | `12,511` | `6.72x` | `86.9%` |
+The transformer encoder is slightly slower and larger than the full_capacitor
+FCDM encoder, but remains much faster and much smaller than the FLUX.2 VAE
+encoder.
+## Latent Interface
+- `encode()` returns DINAC-AE's own whitened latent space.
+- `decode()` expects that same whitened latent space and dewhitens internally.
+- `predict_class()` expects the same whitened latent space, dewhitens
+  internally, and predicts a DINOv3 ViT-B/16 class-token feature.
+- `whiten()` and `dewhiten()` are exposed for explicit control.
+- `encode_posterior()` returns the raw exported posterior before whitening.
+- `DinacAEInferenceConfig.num_steps` counts decoder evaluations directly:
+  `num_steps=1` means one NFE.
+The export ships weights in `float32`. The recommended and default runtime path
+is `bfloat16` AMP for the main encoder, decoder, and class-token path, with
+`float32` retained for sensitive operations such as whitening/dewhitening,
+normalization math, RoPE frequency construction, and VP diffusion schedule
+helpers.
+## Usage
+```python
+import torch
+from dinac_ae import DinacAE, DinacAEInferenceConfig
+device = "cuda"
+model = DinacAE.from_pretrained(
+    "data-archetype/dinac_ae",
+    device=device,
+    dtype=torch.bfloat16,
+)
+image = ...  # [1, 3, H, W] in [-1, 1], H and W divisible by 16
+with torch.inference_mode():
+    latents = model.encode(image.to(device=device, dtype=torch.bfloat16))
+    class_token = model.predict_class(latents)
+    recon = model.decode(
+        latents,
+        height=int(image.shape[-2]),
+        width=int(image.shape[-1]),
+        inference_config=DinacAEInferenceConfig(num_steps=1),
+    )
+```
+## Details
+- DINAC-AE uses a `6`-block ViT/DiT-style transformer encoder and an `8`-block
+  FCDM decoder.
+- Patch size is `16`, model width is `896`, and latent width is `128`.
+- The DINO alignment head predicts spatial patch tokens and is extended with a
+  class-token output in DINOv3 ViT-B/16 feature space.
+- The class-token output is used to improve semantic organization of the latent
+  space and to support FD-loss / Representation Frechet Distance objectives
+  directly in latent space.
+- `predict_class(latents)` reaches mean cosine similarity `0.757458` against
+  the frozen DINOv3 ViT-B/16 teacher class token on the same `2000` images.
+- DINO alignment is applied directly to clean latent tokens. Robustness to
+  local token errors is handled by random-token logSNR offset regularization.
+- Results viewer: https://huggingface.co/spaces/data-archetype/dinac_ae-results
+- Related: [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae),
+  [full_capacitor](https://huggingface.co/data-archetype/full_capacitor),
+  [capacitor_decoder](https://huggingface.co/data-archetype/capacitor_decoder)
+## Citation
+```bibtex
+@misc{dinac_ae,
+  title   = {DINAC-AE: a DINO-aligned class-token diffusion autoencoder},
+  author  = {data-archetype},
+  email   = {data-archetype@proton.me},
+  year    = {2026},
+  month   = may,
+  url     = {https://huggingface.co/data-archetype/dinac_ae},
+}
+```
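
Note on the Encode Throughput table in README.md: the derived columns can be rechecked from the raw ones. The sketch below is an illustration and not part of the commit; the dictionary keys and the `derived` helper are hypothetical names. The millisecond values are the rounded figures from the table, so the recomputed speedups (about 7.66x and 6.68x) differ slightly from the published 7.62x and 6.72x, which were presumably computed from unrounded timings. The peak VRAM reduction reproduces as 86.9% for both rows.

```python
# Raw values copied from the README's Encode Throughput table.
# Timings are the rounded ms/batch figures shown there.
TABLE = {
    "256x256": {"batch": 128, "dinac_ms": 50, "flux_ms": 383,
                "dinac_mib": 1637, "flux_mib": 12511},
    "512x512": {"batch": 32, "dinac_ms": 53, "flux_ms": 354,
                "dinac_mib": 1639, "flux_mib": 12511},
}


def derived(row: dict) -> tuple[float, float, float]:
    """Return (speedup, peak VRAM reduction, dinac_ae images per second)."""
    # Speedup: how many times longer FLUX.2 takes per batch.
    speedup = row["flux_ms"] / row["dinac_ms"]
    # VRAM reduction: fraction of FLUX.2 peak memory that is saved.
    vram_reduction = 1.0 - row["dinac_mib"] / row["flux_mib"]
    # Throughput: images in a batch divided by seconds per batch.
    images_per_s = row["batch"] / (row["dinac_ms"] / 1000.0)
    return speedup, vram_reduction, images_per_s


for name, row in TABLE.items():
    speedup, reduction, ips = derived(row)
    print(f"{name}: {speedup:.2f}x, {reduction * 100:.1f}% less VRAM, {ips:.0f} img/s")
```

Because the batch size shrinks from 128 to 32 as resolution grows from 256x256 to 512x512, the per-batch pixel count is the same in both rows, which is consistent with the near-identical ms/batch and peak VRAM figures.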

common/norms.py ADDED

	@@ -0,0 +1,236 @@

+from __future__ import annotations
+from collections.abc import Sequence
+import torch
+from torch import Tensor, nn
+from torch.nn import functional as F
+__all__ = [
+    "ChannelWiseRMSNorm",
+    "GlobalRMSNorm",
+    "GroupNormF32",
+    "LayerNorm",
+    "LayerNorm2d",
+    "RMSNorm",
+    "global_rms_norm",
+    "row_norm",
+]
+_HALF_PRECISION_DTYPES: tuple[torch.dtype, ...] = (torch.float16, torch.bfloat16)
+def _cast_to_float32(x: Tensor) -> tuple[Tensor, torch.dtype]:
+    """Return tensor cast to fp32 for compute along with the original dtype."""
+    dtype = x.dtype
+    if dtype in _HALF_PRECISION_DTYPES:
+        return x.float(), dtype
+    return x, dtype
+def _restore_dtype(x: Tensor, dtype: torch.dtype) -> Tensor:
+    return x if x.dtype == dtype else x.to(dtype)
+class RMSNorm(nn.Module):
+    """Thin wrapper around ``torch.nn.RMSNorm`` that preserves our API.
+    - Keeps an ``_eps`` attribute used by tests.
+    - Maps ``affine`` -> ``elementwise_affine``.
+    - Delegates all compute to the native implementation.
+    Notes on precision
+    - PyTorch ≥ 2.8 computes RMSNorm reductions in ``opmath`` dtype
+      (float32 for float16/bfloat16) internally, then restores the input dtype.
+    """
+    def __init__(self, dim: int, eps: float = 1e-6, affine: bool = True) -> None:
+        super().__init__()
+        self._eps: float = float(eps)
+        self._impl: nn.RMSNorm = nn.RMSNorm(
+            dim, eps=self._eps, elementwise_affine=affine
+        )
+        self._dim: int = int(dim)
+    @property
+    def weight(self) -> Tensor | None:  # expose for tests/compat
+        return self._impl.weight
+    def forward(self, x: Tensor) -> Tensor:  # type: ignore[override]
+        """Apply RMSNorm while avoiding dtype-mismatch warnings under AMP.
+        When inputs are bfloat16/float16 under autocast and the stored affine
+        weight is float32 (common when model weights remain FP32), PyTorch emits
+        a warning about mismatched dtypes and disables the fused path.
+        We pass a view of the weight cast to the input dtype into the functional
+        RMSNorm to enable the fused implementation without changing the
+        parameter storage dtype (which remains FP32 for stability).
+        """
+        # Prefer functional to control the weight dtype for the kernel
+        w: Tensor | None = self._impl.weight
+        w_cast = w.to(dtype=x.dtype) if w is not None else None
+        # Bias is not present in RMSNorm; functional takes (input, shape, weight, eps)
+        return F.rms_norm(x, (self._dim,), w_cast, self._eps)
+class LayerNorm(nn.LayerNorm):
+    """Thin wrapper over ``torch.nn.LayerNorm`` with an ``_eps`` attribute.
+    Notes on precision
+    - Native LayerNorm kernels accumulate statistics in ``opmath`` dtype
+      (float32 for float16/bfloat16) before casting results back.
+    """
+    def __init__(
+        self,
+        normalized_shape: int | Sequence[int],
+        eps: float = 1e-6,
+        elementwise_affine: bool = True,
+    ) -> None:
+        shape: int | list[int]
+        match normalized_shape:
+            case int() as dim:
+                shape = dim
+            case _:
+                shape = [int(v) for v in normalized_shape]
+        super().__init__(shape, eps=eps, elementwise_affine=elementwise_affine)
+        self._eps: float = float(eps)
+    def forward(self, x: Tensor) -> Tensor:  # type: ignore[override]
+        # Delegate to native LayerNorm
+        return super().forward(x)
+# Prefer numerically stable GroupNormF32 below.
+class GroupNormF32(nn.GroupNorm):
+    """Thin wrapper over ``torch.nn.GroupNorm`` with an ``_eps`` attribute.
+    Notes on precision
+    - Native GroupNorm uses ``opmath`` accumulation (float32 for
+      float16/bfloat16) for statistics and fused scale/bias math; results
+      are cast back to the input dtype.
+    - Despite the class name, this wrapper does not force a cast; it
+      delegates to the native implementation.
+    """
+    def __init__(
+        self,
+        num_groups: int,
+        num_channels: int,
+        eps: float = 1e-6,
+        affine: bool = True,
+    ) -> None:
+        super().__init__(num_groups, num_channels, eps=eps, affine=affine)
+        self._eps: float = float(eps)
+class ChannelWiseRMSNorm(nn.Module):
+    """Channel-wise RMSNorm for NCHW tensors (fast NCHW path).
+    - Normalizes across channels per spatial position without reshaping, using
+      a float32 reduction for numerical stability and keeping elementwise ops
+      in input dtype for throughput.
+    - Supports optional per-channel affine ``weight`` and ``bias``.
+    """
+    def __init__(self, channels: int, eps: float = 1e-6, affine: bool = True) -> None:
+        super().__init__()
+        self.channels: int = int(channels)
+        self._eps: float = float(eps)
+        self.affine: bool = bool(affine)
+        if self.affine:
+            self.weight = nn.Parameter(torch.ones(self.channels))
+            self.bias = nn.Parameter(torch.zeros(self.channels))
+        else:
+            self.register_parameter("weight", None)
+            self.register_parameter("bias", None)
+    def forward(self, x: Tensor) -> Tensor:  # type: ignore[override]
+        if x.dim() < 2:
+            return x
+        C = x.size(1)
+        if self.channels != C:
+            raise ValueError(f"ChannelWiseRMSNorm expected C={self.channels}, got {C}")
+        # Keep only the reductions in fp32; scale/apply in the input dtype.
+        ms = torch.mean(torch.square(x), dim=1, keepdim=True, dtype=torch.float32)
+        inv_rms = torch.rsqrt(ms + self._eps)  # float32
+        y = x * inv_rms.to(dtype=x.dtype)
+        if self.affine and self.weight is not None:
+            shape = (1, -1) + (1,) * (x.dim() - 2)
+            y = y * self.weight.view(shape).to(dtype=x.dtype)
+            if self.bias is not None:
+                y = y + self.bias.view(shape).to(dtype=x.dtype)
+        return y
+def global_rms_norm(x: Tensor, eps: float = 1e-6) -> Tensor:
+    """Project each sample to unit RMS across all non-batch dimensions.
+    This is equivalent to RMSNorm with ``normalized_shape=x.shape[1:]`` and no
+    affine parameters. Delegating to the native functional keeps the fast fused
+    CUDA path and the same opmath accumulation behavior as ``torch.nn.RMSNorm``.
+    """
+    if x.dim() < 2:
+        return x
+    normalized_shape = tuple(int(dim) for dim in x.shape[1:])
+    return F.rms_norm(x, normalized_shape, None, eps)
+class GlobalRMSNorm(nn.Module):
+    """RMSNorm across all dims except batch — sphere projection for NCHW tensors.
+    Unlike :class:`ChannelWiseRMSNorm` (which normalizes per spatial position
+    over channels), this normalizes the *entire* feature volume jointly,
+    projecting each sample onto a hypersphere.  No learnable parameters.
+    """
+    def __init__(self, eps: float = 1e-6) -> None:
+        super().__init__()
+        self._eps: float = float(eps)
+    def forward(self, x: Tensor) -> Tensor:  # type: ignore[override]
+        return global_rms_norm(x, eps=self._eps)
+class LayerNorm2d(nn.LayerNorm):
+    """Channel-wise LayerNorm using native ``F.layer_norm`` on a reshaped view.
+    - Normalizes over channels only for each spatial location (B, h, w).
+    - Weight and bias follow the base class semantics (shape [C]).
+    Notes on precision
+    - ``F.layer_norm`` calls the native LayerNorm kernel which accumulates in
+      ``opmath`` dtype (float32 for float16/bfloat16), then casts back.
+    """
+    def forward(self, x: Tensor) -> Tensor:  # type: ignore[override]
+        if x.dim() < 3:
+            return super().forward(x)
+        B, C = x.shape[:2]
+        spatial = x.shape[2:]
+        x_view = x.permute(0, *range(2, x.dim()), 1).contiguous().view(-1, C)
+        y = F.layer_norm(x_view, (C,), self.weight, self.bias, self.eps)
+        y = y.view(B, *spatial, C).permute(0, x.dim() - 1, *range(1, x.dim() - 1))
+        return y.contiguous()
+def row_norm(W: Tensor, eps: float = 1e-6) -> Tensor:
+    """Row-normalise weight matrices along the last dimension.
+    Precision and performance
+    - Accumulates the squared sum in float32 without materializing a full fp32
+      copy of ``W`` via ``sum(..., dtype=torch.float32)``.
+    - Uses ``rsqrt`` and clamps the inverse norm via ``clamp_max(1/eps)`` to
+      match ``clamp_min(eps)`` on the denominator.
+    - Scales in the input dtype for throughput; callers relying on exact
+      float32 scaling should cast explicitly.
+    """
+    # Sum of squares in fp32 for stability
+    ss = torch.sum(torch.square(W), dim=-1, keepdim=True, dtype=torch.float32)
+    inv = torch.rsqrt(ss).clamp_max(1.0 / float(eps))  # float32
+    return W * inv.to(dtype=W.dtype)
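
Note on `row_norm` in common/norms.py: its docstring states that capping the inverse norm with `clamp_max(1/eps)` matches flooring the norm with `clamp_min(eps)`. The sketch below restates that identity for a single row in plain Python scalars rather than tensors. It is an illustration only, not code from the commit, and both helper names are hypothetical. The zero-row case is handled explicitly because Python raises on division by zero, whereas `torch.rsqrt(0)` yields `inf` before the clamp.

```python
import math


def inv_norm_clamp_max(row: list[float], eps: float = 1e-6) -> float:
    """Inverse L2 norm in the row_norm style: rsqrt, then cap at 1/eps."""
    ss = sum(v * v for v in row)  # sum of squares
    # Mirror rsqrt(0) == inf so the cap below decides the result.
    inv = math.inf if ss == 0.0 else 1.0 / math.sqrt(ss)
    return min(inv, 1.0 / eps)


def inv_norm_clamp_min(row: list[float], eps: float = 1e-6) -> float:
    """Reference form: floor the norm at eps, then invert."""
    norm = math.sqrt(sum(v * v for v in row))
    return 1.0 / max(norm, eps)


# A regular row, an all-zero row, and a row whose norm is below eps.
for row in ([3.0, 4.0], [0.0, 0.0], [1e-9, 0.0]):
    a = inv_norm_clamp_max(row)
    b = inv_norm_clamp_min(row)
    assert math.isclose(a, b, rel_tol=1e-12)

# Scaling a regular row by the inverse norm gives unit length.
scale = inv_norm_clamp_max([3.0, 4.0])
unit = [v * scale for v in [3.0, 4.0]]
assert math.isclose(math.hypot(*unit), 1.0)
```

The two forms agree because `1/x` is monotonically decreasing for positive `x`, so flooring the norm at `eps` is the same as capping its inverse at `1/eps`. The capped-inverse form lets the code use a single `rsqrt` without a separate square root and division.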

common/rope.py ADDED

	@@ -0,0 +1,536 @@

+from __future__ import annotations
+import math
+from collections.abc import Callable
+import torch
+from torch import nn
+class Rope1D(nn.Module):
+    """
+    Rotary Position Embedding (RoPE) 1D.
+    Based on the reference LLaMA implementation (Hugging Face
+    `modeling_llama.py`), adapted to this codebase without behavior changes.
+    - dim: per-head dimension
+    - max_position_embeddings: length used to precompute cached cos/sin (not required
+      by forward)
+    - base: RoPE base theta
+    Forward expects:
+      - x: (B, H, T, D)
+      - position_ids: (B, T) integer positions
+    Returns:
+      - cos, sin: (B, T, D)
+    """
+    inv_freq: torch.Tensor
+    _cos_cached: torch.Tensor
+    _sin_cached: torch.Tensor
+    def __init__(
+        self,
+        dim: int,
+        max_position_embeddings: int = 2048,
+        base: float = 10000.0,
+        device: torch.device | None = None,
+        scaling_factor: float = 1.0,
+    ) -> None:
+        super().__init__()
+        if dim % 2 != 0:
+            raise AssertionError("head_dim must be even for RoPE")
+        self.scaling_factor: float = float(scaling_factor)
+        self.dim: int = int(dim)
+        self.max_position_embeddings: int = int(max_position_embeddings)
+        self.base: float = float(base)
+        inv_freq = self._build_inv_freq(device=device)
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        # Cached cos/sin (not used in application, but kept for parity with reference)
+        self.max_seq_len_cached: int = self.max_position_embeddings
+        cos_cached, sin_cached = self._build_cached_trig(device=device)
+        self.register_buffer("_cos_cached", cos_cached, persistent=False)
+        self.register_buffer("_sin_cached", sin_cached, persistent=False)
+    def _build_inv_freq(self, *, device: torch.device | None) -> torch.Tensor:
+        """Return the RoPE inverse-frequency vector in float32."""
+        return 1.0 / (
+            self.base
+            ** (
+                torch.arange(0, self.dim, 2, device=device, dtype=torch.float32)
+                / float(self.dim)
+            )
+        )
+    def _build_cached_trig(
+        self, *, device: torch.device | None
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        """Return cached RoPE trig tensors in float32."""
+        inv_freq = self._build_inv_freq(device=device)
+        t = torch.arange(
+            self.max_seq_len_cached,
+            device=device,
+            dtype=torch.float32,
+        )
+        t = t / self.scaling_factor
+        freqs = torch.outer(t, inv_freq)
+        emb = torch.cat((freqs, freqs), dim=-1)
+        return emb.cos(), emb.sin()
+    def _apply(
+        self,
+        fn: Callable[[torch.Tensor], torch.Tensor],
+        recurse: bool = True,
+    ) -> Rope1D:
+        """Apply module moves/casts while preserving fp32 RoPE buffers."""
+        out = super()._apply(fn, recurse=recurse)
+        with torch.no_grad():
+            device = self.inv_freq.device
+            self.inv_freq.data = self._build_inv_freq(device=device)
+            cos_cached, sin_cached = self._build_cached_trig(device=device)
+            self._cos_cached.data = cos_cached
+            self._sin_cached.data = sin_cached
+        return out
+    @torch.no_grad()
+    def forward(
+        self, x: torch.Tensor, position_ids: torch.Tensor
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        inv_freq_tensor = self._build_inv_freq(device=x.device)
+        inv_freq_expanded = (
+            inv_freq_tensor[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+        )
+        position_ids_expanded = position_ids[:, None, :].float() / self.scaling_factor
+        device_type = x.device.type
+        device_type = (
+            device_type
+            if isinstance(device_type, str) and device_type != "mps"
+            else "cpu"
+        )
+        with torch.autocast(device_type=device_type, enabled=False):
+            freqs = (
+                inv_freq_expanded.float() @ position_ids_expanded.float()
+            ).transpose(1, 2)
+            emb = torch.cat((freqs, freqs), dim=-1)
+            cos = emb.cos()
+            sin = emb.sin()
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+def rotate_half(x: torch.Tensor) -> torch.Tensor:
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+def rotate_half_adjacent(x: torch.Tensor) -> torch.Tensor:
+    """Rotate consecutive pairs in the last dimension.
+    This matches the common EVA-02 / SpeedrunDiT RoPE convention where the last
+    dimension is interpreted as pairs ``(x0, x1), (x2, x3), ...``.
+    """
+    if x.shape[-1] % 2 != 0:
+        raise ValueError("rotate_half_adjacent requires an even last dimension")
+    x_pairs = x.reshape(*x.shape[:-1], x.shape[-1] // 2, 2)
+    x1 = x_pairs[..., 0]
+    x2 = x_pairs[..., 1]
+    return torch.stack((-x2, x1), dim=-1).reshape_as(x)
+def apply_rotary_pos_emb(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    cos: torch.Tensor,
+    sin: torch.Tensor,
+    *,
+    unsqueeze_dim: int = 1,
+) -> tuple[torch.Tensor, torch.Tensor]:
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+class LearnableRoPE2D(nn.Module):
+    r"""
+    Learnable mixed 2D RoPE with axial RoPE2D-compatible initialization.
+    - Learnable frequency banks for X and Y.
+    - Frequencies can be shared across groups of attention heads (see
+      ``rope_param_dim``).
+    - Angle per pair: theta = x * fx[g, i] + y * fy[g, i]
+    - Initialization matches the axial RoPE2D parameterization used by DiTTrunk
+      for ``ROPE_2D_AXIAL_FREQ_AWARE`` (AxialRoPE2DConfig(base=100, dim_layout=HALF_SPLIT)):
+        - Angle multiplier ``2π``.
+        - Period base ``100`` (DINOv3-style), applied per-axis.
+      Each head group starts identically (deterministic init) so the learnable
+      variant is functionally identical to axial RoPE2D at step 0.
+    - Rotation is implemented with real-valued sin/cos to avoid complex tensors
+      (torch.compile/inductor cannot codegen complex dtypes).
+    Shapes:
+    - Expects q,k of shape (B, H, T, D) with D % 4 == 0.
+    - Positions xy: (T, 2) or (B, T, 2), any real dtype (cast to float32).
+    - Parameter `freqs`: (2, G, D//2) in float32; index 0 = x, 1 = y.
+    Head grouping / parameter budget
+    -------------------------------
+    ``rope_param_dim`` controls the total number of learned RoPE frequency
+    parameters (scalars) for this module.
+    Let:
+      - ``head_dim = D`` (per-head width)
+      - ``num_heads = H``
+      - ``rope_param_dim = P``
+    Then the module uses:
+      - ``num_groups = G = P // D``
+      - ``heads_per_group = H // G``
+    This is fail-fast: ``P`` must be divisible by ``D`` and ``H`` must be
+    divisible by ``G``. When ``rope_param_dim`` is None (default), the module
+    uses the classic per-head parameterization with ``P = H * D``.
+    """
+    def __init__(
+        self,
+        head_dim: int,
+        *,
+        num_heads: int,
+        rope_param_dim: int | None = None,
+        rope_base: float = 100.0,
+        angle_multiplier: float = 2.0 * float(math.pi),
+        learnable: bool = True,
+        persist_buffers: bool = True,
+    ) -> None:
+        super().__init__()
+        if head_dim % 4 != 0:
+            raise AssertionError("head_dim must be divisible by 4 for mixed 2D RoPE")
+        self.head_dim: int = int(head_dim)
+        # Avoid naming collisions with nn.Module.half() (dtype casting helper).
+        self.half_dim: int = self.head_dim // 2
+        self.num_heads: int = int(num_heads)
+        effective_param_dim = (
+            int(rope_param_dim)
+            if rope_param_dim is not None
+            else self.num_heads * self.head_dim
+        )
+        if effective_param_dim <= 0:
+            raise ValueError("rope_param_dim must be positive for LearnableRoPE2D")
+        self.rope_param_dim: int = int(effective_param_dim)
+        self._learnable: bool = bool(learnable)
+        theta = float(rope_base)
+        mult = float(angle_multiplier)
+        if not math.isfinite(theta) or theta <= 0.0:
+            raise ValueError("rope_base must be finite and > 0 for LearnableRoPE2D")
+        if not math.isfinite(mult) or mult <= 0.0:
+            raise ValueError(
+                "angle_multiplier must be finite and > 0 for LearnableRoPE2D"
+            )
+        if self.rope_param_dim % self.head_dim != 0:
+            raise ValueError(
+                "rope_param_dim must be divisible by head_dim for LearnableRoPE2D "
+                f"(got rope_param_dim={self.rope_param_dim}, head_dim={self.head_dim})"
+            )
+        self.num_groups: int = self.rope_param_dim // self.head_dim
+        if self.num_groups <= 0:
+            raise RuntimeError("num_groups must be positive for LearnableRoPE2D")
+        if self.num_heads % self.num_groups != 0:
+            raise ValueError(
+                "num_heads must be divisible by (rope_param_dim / head_dim) for LearnableRoPE2D "
+                f"(got num_heads={self.num_heads}, num_groups={self.num_groups}, "
+                f"rope_param_dim={self.rope_param_dim}, head_dim={self.head_dim})"
+            )
+        self.heads_per_group: int = self.num_heads // self.num_groups
+        if self.heads_per_group <= 0:
+            raise RuntimeError("heads_per_group must be positive for LearnableRoPE2D")
+        # Axial-compatible deterministic init:
+        # - periods match AxialRoPE2DConfig(base=100, dim_layout=HALF_SPLIT)
+        # - angle = 2π * coord / period
+        qtr = self.head_dim // 4
+        exponents = (
+            2.0
+            * torch.arange(int(qtr), dtype=torch.float32)
+            / float(self.head_dim // 2)
+        )
+        periods = torch.tensor(theta, dtype=torch.float32) ** exponents  # [qtr]
+        axis_freqs = (mult / periods).to(dtype=torch.float32)  # [qtr]
+        zeros = torch.zeros_like(axis_freqs)
+        # Match AxialRoPE2D(HALF_SPLIT) flatten order: [y-axis, x-axis].
+        # Our xy columns are (x, y), so:
+        # - x contributes to the second quarter (x-axis part)
+        # - y contributes to the first quarter (y-axis part)
+        fx_half = torch.cat((zeros, axis_freqs), dim=0)  # [half_dim]
+        fy_half = torch.cat((axis_freqs, zeros), dim=0)  # [half_dim]
+        freqs_x = fx_half.expand(int(self.num_groups), -1).clone()
+        freqs_y = fy_half.expand(int(self.num_groups), -1).clone()
+        freqs = torch.stack([freqs_x, freqs_y], dim=0)  # (2, G, half)
+        if self._learnable:
+            self.freqs = nn.Parameter(freqs, requires_grad=True)
+        else:
+            self.register_buffer("freqs", freqs, persistent=persist_buffers)
+    def _apply(
+        self,
+        fn: Callable[[torch.Tensor], torch.Tensor],
+        recurse: bool = True,
+    ) -> LearnableRoPE2D:
+        """Apply module moves/casts while preserving fp32 frequency tensors."""
+        out = super()._apply(fn, recurse=recurse)
+        with torch.no_grad():
+            self.freqs.data = self.freqs.data.to(dtype=torch.float32)
+        return out
+    def _apply_rotary_from_trig(
+        self,
+        x: torch.Tensor,
+        *,
+        sin: torch.Tensor,
+        cos: torch.Tensor,
+    ) -> torch.Tensor:
+        """Rotate Q/K using precomputed grouped sin/cos buffers (HALF_SPLIT layout).
+        This matches AxialRoPE2DConfig(dim_layout=HALF_SPLIT) rotation and keeps
+        the learnable variant identical at initialization when combined with
+        axial-compatible frequency init.
+        Args:
+            x: Tensor shaped ``(B, H, T, D)``.
+            sin: Sin tensor shaped ``(G, T, D//2)`` or ``(B, G, T, D//2)``.
+            cos: Cos tensor shaped ``(G, T, D//2)`` or ``(B, G, T, D//2)``.
+        Returns:
+            Tensor with the same shape/dtype/device as ``x``.
+        """
+        if x.dim() != 4:
+            raise ValueError("x must be shaped (B, H, T, D)")
+        B, H, T, D = x.shape
+        if self.num_heads != int(H):
+            raise ValueError("num_heads mismatch for LearnableRoPE2D")
+        if self.head_dim != int(D):
+            raise ValueError("head_dim mismatch for LearnableRoPE2D")
+        if sin.dim() == 3 and cos.dim() == 3:
+            sin = sin.unsqueeze(0)
+            cos = cos.unsqueeze(0)
+        if sin.dim() != 4 or cos.dim() != 4:
+            raise RuntimeError("Unexpected sin/cos rank for LearnableRoPE2D")
+        if int(D) % 2 != 0:
+            raise RuntimeError("LearnableRoPE2D requires even head_dim for HALF_SPLIT")
+        half = int(D) // 2
+        if int(sin.shape[-1]) != half or int(cos.shape[-1]) != half:
+            raise RuntimeError(
+                "LearnableRoPE2D expected sin/cos last dim == head_dim//2 "
+                f"(got sin={tuple(sin.shape)}, cos={tuple(cos.shape)}, head_dim={int(D)})"
+            )
+        sin = sin[:, :, None, :, :]  # [B, G, 1, T, half]
+        cos = cos[:, :, None, :, :]  # [B, G, 1, T, half]
+        grouped = x.reshape(
+            int(B),
+            int(self.num_groups),
+            int(self.heads_per_group),
+            int(T),
+            int(D),
+        )
+        x1 = grouped[..., :half]
+        x2 = grouped[..., half:]
+        out1 = x1 * cos - x2 * sin
+        out2 = x2 * cos + x1 * sin
+        out = torch.cat((out1, out2), dim=-1).reshape(int(B), int(H), int(T), int(D))
+        return out.to(dtype=x.dtype)
+    def _compute_mixed_cis(self, xy: torch.Tensor) -> torch.Tensor:
+        # Returns complex unit phasors exp(i * angle) with shape (G, T, half) or (B, G, T, half)
+        if xy.dim() == 2:
+            # (T, 2) -> (G, T, half)
+            t_x = xy[:, 0].to(dtype=torch.float32)
+            t_y = xy[:, 1].to(dtype=torch.float32)
+            with torch.autocast(device_type=t_x.device.type, enabled=False):
+                # Memory notes:
+                # - Avoid materializing both fx and fy; accumulate in-place into angles.
+                # - Avoid torch.ones_like(angles) (full-size allocation); a scalar
+                #   magnitude broadcasts in torch.polar.
+                angles = t_x.unsqueeze(-1).unsqueeze(-1) * self.freqs[0].unsqueeze(
+                    0
+                )  # (T, G, half)
+                angles.add_(
+                    t_y.unsqueeze(-1).unsqueeze(-1) * self.freqs[1].unsqueeze(0)
+                )
+                angles = angles.permute(1, 0, 2)  # (G, T, half)
+                cis = torch.polar(
+                    torch.ones((), device=angles.device, dtype=angles.dtype), angles
+                )
+            return cis
+        elif xy.dim() == 3:
+            # (B, T, 2) -> (B, G, T, half)
+            t_x = xy[..., 0].to(dtype=torch.float32)
+            t_y = xy[..., 1].to(dtype=torch.float32)
+            with torch.autocast(device_type=t_x.device.type, enabled=False):
+                angles = t_x.unsqueeze(-1).unsqueeze(-1) * self.freqs[0].unsqueeze(
+                    0
+                ).unsqueeze(0)
+                angles.add_(
+                    t_y.unsqueeze(-1).unsqueeze(-1)
+                    * self.freqs[1].unsqueeze(0).unsqueeze(0)
+                )
+                angles = angles.permute(0, 2, 1, 3)  # (B, G, T, half)
+                cis = torch.polar(
+                    torch.ones((), device=angles.device, dtype=angles.dtype), angles
+                )
+            return cis
+        else:
+            raise ValueError("xy must have shape (T,2) or (B,T,2)")
+    def _compute_mixed_angles(self, xy: torch.Tensor) -> torch.Tensor:
+        """Return mixed RoPE2D angles without applying cis/polar.
+        Args:
+            xy: XY positions shaped ``(T, 2)`` or ``(B, T, 2)``.
+        Returns:
+            Float tensor of angles shaped ``(G, T, half)`` or ``(B, G, T, half)``.
+        """
+        if xy.dim() == 2:
+            t_x = xy[:, 0].to(dtype=torch.float32)
+            t_y = xy[:, 1].to(dtype=torch.float32)
+            with torch.autocast(device_type=t_x.device.type, enabled=False):
+                angles = t_x.unsqueeze(-1).unsqueeze(-1) * self.freqs[0].unsqueeze(0)
+                angles.add_(
+                    t_y.unsqueeze(-1).unsqueeze(-1) * self.freqs[1].unsqueeze(0)
+                )
+                return angles.permute(1, 0, 2)
+        if xy.dim() == 3:
+            t_x = xy[..., 0].to(dtype=torch.float32)
+            t_y = xy[..., 1].to(dtype=torch.float32)
+            with torch.autocast(device_type=t_x.device.type, enabled=False):
+                angles = t_x.unsqueeze(-1).unsqueeze(-1) * self.freqs[0].unsqueeze(
+                    0
+                ).unsqueeze(0)
+                angles.add_(
+                    t_y.unsqueeze(-1).unsqueeze(-1)
+                    * self.freqs[1].unsqueeze(0).unsqueeze(0)
+                )
+                return angles.permute(0, 2, 1, 3)
+        raise ValueError("xy must have shape (T,2) or (B,T,2)")
+    def _cos_sin_half_from_xy(
+        self,
+        xy: torch.Tensor,
+        *,
+        device: torch.device | None = None,
+        out_dtype: torch.dtype | None = None,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        # Helper used in tests to build real-valued cos/sin tensors.
+        if device is not None:
+            xy = xy.to(device=device)
+        cis = self._compute_mixed_cis(xy)
+        # Convert complex cis to cos/sin (real/imag) with matching shapes
+        if cis.is_complex():
+            cos_h = cis.real
+            sin_h = cis.imag
+        else:
+            # Should not happen; torch.polar returns complex64/128
+            raise RuntimeError("Expected complex cis tensor from polar")
+        if out_dtype is not None:
+            cos_h = cos_h.to(dtype=out_dtype)
+            sin_h = sin_h.to(dtype=out_dtype)
+        return cos_h, sin_h
+    def _cos_sin_from_xy(
+        self,
+        xy: torch.Tensor,
+        *,
+        device: torch.device | None = None,
+        out_dtype: torch.dtype | None = None,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        cos_h, sin_h = self._cos_sin_half_from_xy(
+            xy, device=device, out_dtype=out_dtype
+        )
+        emb_cos = torch.cat((cos_h, cos_h), dim=-1)
+        emb_sin = torch.cat((sin_h, sin_h), dim=-1)
+        return emb_cos, emb_sin
+    def rotate_qk(
+        self,
+        q: torch.Tensor,
+        k: torch.Tensor,
+        xy: torch.Tensor,
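+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        """Rotate Q/K using mixed 2D RoPE angles computed from ``xy``.
+        Args:
+            q: Query tensor shaped ``(B, H, T, D)``.
+            k: Key tensor shaped ``(B, H, T, D)``.
+            xy: XY coordinates shaped ``(T, 2)`` or ``(B, T, 2)``.
+        Returns:
+            Rotated ``(q, k)`` tensors with the input shapes and dtypes.
+        """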
+        if q.dim() != 4 or k.dim() != 4:
+            raise ValueError("q,k must be shaped (B,H,T,D)")
+        _, H, _, D = q.shape
+        if self.num_heads != H:
+            raise ValueError("num_heads mismatch for LearnableRoPE2D")
+        if self.head_dim != D:
+            raise ValueError("head_dim mismatch for LearnableRoPE2D")
+        if D % 4 != 0:
+            raise ValueError("head_dim must be divisible by 4 for mixed 2D RoPE")
+        # Use real-valued sin/cos rotation to keep torch.compile/inductor on the
+        # fast path (inductor cannot codegen complex tensors).
+        angles = self._compute_mixed_angles(xy.to(device=q.device))
+        sin = torch.sin(angles)
+        cos = torch.cos(angles)
+        q_out = self._apply_rotary_from_trig(q, sin=sin, cos=cos)
+        k_out = self._apply_rotary_from_trig(k, sin=sin, cos=cos)
+        return q_out, k_out
+    def rotate_qk_with_dilation(
+        self,
+        q: torch.Tensor,
+        k: torch.Tensor,
+        *,
+        xy: torch.Tensor,
+        scales: torch.Tensor,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        """Rotate Q/K using mixed 2D RoPE with per-sample isotropic dilation.
+        This implements dilation by scaling the RoPE angle, i.e.
+        ``theta_dilated = scale * theta_base`` where ``theta_base`` comes from the
+        undilated XY coordinates.
+        Args:
+            q: Query tensor shaped ``(B, H, T, D)``.
+            k: Key tensor shaped ``(B, H, T, D)``.
+            xy: Base XY coordinates shaped ``(T, 2)`` or ``(B, T, 2)``.
+            scales: Per-sample dilation scales shaped ``(B,)``.
+        Returns:
+            Rotated ``(q, k)`` tensors, each shaped ``(B, H, T, D)``.
+        Raises:
+            ValueError: If shapes are inconsistent or scales are not 1D.
+        """
+        if q.dim() != 4 or k.dim() != 4:
+            raise ValueError("q,k must be shaped (B,H,T,D)")
+        B, H, T, D = q.shape
+        if self.num_heads != H:
+            raise ValueError("num_heads mismatch for LearnableRoPE2D")
+        if self.head_dim != D:
+            raise ValueError("head_dim mismatch for LearnableRoPE2D")
+        if scales.dim() != 1 or scales.shape[0] != B:
+            raise ValueError("scales must have shape (B,) matching q batch size")
+        if xy.dim() == 2 and xy.shape[0] != T:
+            raise ValueError("xy length must match q sequence length")
+        if xy.dim() == 3 and (xy.shape[0] != B or xy.shape[1] != T):
+            raise ValueError("xy must have shape (B,T,2) matching q batch/sequence")
+        if xy.shape[-1] != 2:
+            raise ValueError("xy must have last dimension 2")
+        angles = self._compute_mixed_angles(xy.to(device=q.device))
+        angles = angles * scales.to(device=q.device, dtype=torch.float32).view(
+            B, 1, 1, 1
+        )
+        sin = torch.sin(angles)
+        cos = torch.cos(angles)
+        q_out = self._apply_rotary_from_trig(q, sin=sin, cos=cos)
+        k_out = self._apply_rotary_from_trig(k, sin=sin, cos=cos)
+        return q_out, k_out
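
The half-split rotation and the angle scaling used for dilation can be illustrated without PyTorch. The following is a minimal sketch for one head and one token; the helper names (`mixed_angles`, `rotate_half_split`, `score`) and the toy frequencies are assumptions made for illustration and are not part of the exported module, where the frequencies are learned parameters.

```python
import math


def mixed_angles(x, y, freqs_x, freqs_y, scale=1.0):
    """theta_i = scale * (x * fx_i + y * fy_i), one angle per rotated pair."""
    return [scale * (x * fx + y * fy) for fx, fy in zip(freqs_x, freqs_y)]


def rotate_half_split(vec, angles):
    """Rotate pairs (vec[i], vec[i + half]) by angles[i] (half-split layout)."""
    half = len(vec) // 2
    assert len(angles) == half
    x1, x2 = vec[:half], vec[half:]
    out1 = [a * math.cos(t) - b * math.sin(t) for a, b, t in zip(x1, x2, angles)]
    out2 = [b * math.cos(t) + a * math.sin(t) for a, b, t in zip(x1, x2, angles)]
    return out1 + out2


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy per-axis frequencies (hypothetical values, chosen only for the demo).
FREQS_X = [1.0, 0.25]
FREQS_Y = [0.5, 0.125]

q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.6, 0.9]


def score(q_xy, k_xy, scale=1.0):
    """Attention logit between q at q_xy and k at k_xy after rotation."""
    q_rot = rotate_half_split(q, mixed_angles(*q_xy, FREQS_X, FREQS_Y, scale))
    k_rot = rotate_half_split(k, mixed_angles(*k_xy, FREQS_X, FREQS_Y, scale))
    return dot(q_rot, k_rot)


base = score((2.0, 3.0), (5.0, 1.0))
shifted = score((12.0, 10.0), (15.0, 8.0))  # both positions moved by (+10, +7)
dilated = score((2.0, 3.0), (5.0, 1.0), scale=2.0)
doubled = score((4.0, 6.0), (10.0, 4.0 - 2.0))  # same as doubling the coordinates
```

In this sketch the logit depends only on the XY offset between query and key, and scaling the angles by `scale` behaves like scaling the coordinates by the same factor, which is how the docstring above describes isotropic dilation.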

config.json ADDED Viewed

	@@ -0,0 +1,26 @@

+{
+  "in_channels": 3,
+  "patch_size": 16,
+  "model_dim": 896,
+  "encoder_depth": 6,
+  "decoder_depth": 8,
+  "decoder_start_blocks": 2,
+  "decoder_end_blocks": 2,
+  "bottleneck_dim": 128,
+  "mlp_ratio": 4.0,
+  "encoder_mlp_type": "gelu",
+  "depthwise_kernel_size": 7,
+  "adaln_low_rank_rank": 128,
+  "bottleneck_posterior_kind": "diagonal_gaussian",
+  "bottleneck_norm_mode": "disabled",
+  "logsnr_min": -10.0,
+  "logsnr_max": 10.0,
+  "pixel_noise_std": 0.558,
+  "latent_running_stats_eps": 0.0001,
+  "class_head_feature_dim": 768,
+  "class_head_model_dim": 768,
+  "class_head_head_dim": 64,
+  "class_head_mlp_ratio": 4.0,
+  "class_head_mlp_type": "gelu",
+  "class_head_register_token_count": 4
+}
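
A few quantities follow directly from this config. The sketch below derives them with the standard library only; `derived_shapes` is a hypothetical helper, the 256-pixel input size is an assumed example, and the head dimension of 64 mirrors `_ENCODER_HEAD_DIM` in `dinac_ae/encoder.py`. It also assumes the image side is a multiple of `patch_size`.

```python
import json

# Subset of the exported config.json that the arithmetic below needs.
CONFIG_JSON = """
{
  "in_channels": 3,
  "patch_size": 16,
  "model_dim": 896,
  "decoder_depth": 8,
  "decoder_start_blocks": 2,
  "decoder_end_blocks": 2,
  "bottleneck_dim": 128
}
"""

ENCODER_HEAD_DIM = 64  # mirrors _ENCODER_HEAD_DIM in dinac_ae/encoder.py


def derived_shapes(cfg, image_size):
    """Return quantities implied by the config for a square input image."""
    patch = cfg["patch_size"]
    if image_size % patch != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch
    middle = (
        cfg["decoder_depth"]
        - cfg["decoder_start_blocks"]
        - cfg["decoder_end_blocks"]
    )
    return {
        "latent_shape": (cfg["bottleneck_dim"], grid, grid),
        "pixels_per_patch": cfg["in_channels"] * patch * patch,
        "encoder_heads": cfg["model_dim"] // ENCODER_HEAD_DIM,
        "decoder_middle_blocks": middle,
    }


cfg = json.loads(CONFIG_JSON)
info = derived_shapes(cfg, image_size=256)  # 256 is an assumed example size
```

Each 16x16 RGB patch carries 768 input values and maps to one 128-channel latent position, so the latent holds one sixth as many numbers as the pixels it summarizes.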

dinac_ae/__init__.py ADDED Viewed

	@@ -0,0 +1,12 @@

+"""DINAC-AE: DINO-aligned class-token autoencoder export."""
+from .config import DinacAEConfig, DinacAEInferenceConfig
+from .encoder import EncoderPosterior
+from .model import DinacAE
+__all__ = [
+    "DinacAE",
+    "DinacAEConfig",
+    "DinacAEInferenceConfig",
+    "EncoderPosterior",
+]

dinac_ae/adaln.py ADDED Viewed

	@@ -0,0 +1,75 @@

+"""Scale+Gate AdaLN (2-way) for FCDM decoder blocks."""
+from __future__ import annotations
+from torch import Tensor, nn
+class AdaLNScaleGateZeroProjector(nn.Module):
+    """Packed 2-way AdaLN projection (SiLU -> Linear), zero-initialized.
+    Outputs [B, 2*d_model] packed as (scale, gate).
+    """
+    def __init__(self, d_model: int, d_cond: int) -> None:
+        super().__init__()
+        self.d_model: int = int(d_model)
+        self.d_cond: int = int(d_cond)
+        self.act: nn.SiLU = nn.SiLU()
+        self.proj: nn.Linear = nn.Linear(self.d_cond, 2 * self.d_model)
+        nn.init.zeros_(self.proj.weight)
+        nn.init.zeros_(self.proj.bias)
+    def project_activated(self, act_cond: Tensor) -> Tensor:
+        """Return packed modulation for a pre-activated conditioning vector."""
+        if act_cond.dim() != 2:
+            raise ValueError(
+                "AdaLNScaleGateZeroProjector expects act_cond with shape [B, d_cond]"
+            )
+        if act_cond.shape[1] != self.d_cond:
+            raise ValueError(
+                f"act_cond width {int(act_cond.shape[1])} does not match d_cond={self.d_cond}"
+            )
+        return self.proj(act_cond)
+    def forward(self, cond: Tensor) -> Tensor:
+        """Return packed modulation [B, 2*d_model]."""
+        if cond.dim() != 2:
+            raise ValueError(
+                "AdaLNScaleGateZeroProjector expects cond with shape [B, d_cond]"
+            )
+        if cond.shape[1] != self.d_cond:
+            raise ValueError(
+                f"cond width {int(cond.shape[1])} does not match d_cond={self.d_cond}"
+            )
+        return self.project_activated(self.act(cond))
+class AdaLNScaleGateZeroLowRankDelta(nn.Module):
+    """Low-rank delta for 2-way AdaLN: down(d_cond -> rank) -> up(rank -> 2*d_model).
+    Zero-initialized up projection preserves zero-output semantics at init.
+    """
+    def __init__(self, *, d_model: int, d_cond: int, rank: int) -> None:
+        super().__init__()
+        self.d_model: int = int(d_model)
+        self.d_cond: int = int(d_cond)
+        self.rank: int = int(rank)
+        self.down: nn.Linear = nn.Linear(self.d_cond, self.rank, bias=False)
+        self.up: nn.Linear = nn.Linear(self.rank, 2 * self.d_model, bias=False)
+        nn.init.normal_(self.down.weight, mean=0.0, std=0.02)
+        nn.init.zeros_(self.up.weight)
+    def forward(self, act_cond: Tensor) -> Tensor:
+        """Return packed delta modulation [B, 2*d_model]."""
+        if act_cond.dim() != 2:
+            raise ValueError(
+                "AdaLNScaleGateZeroLowRankDelta expects act_cond with shape [B, d_cond]"
+            )
+        if act_cond.shape[1] != self.d_cond:
+            raise ValueError(
+                f"act_cond width {int(act_cond.shape[1])} does not match d_cond={self.d_cond}"
+            )
+        return self.up(self.down(act_cond))
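
To give a rough sense of why a shared base projector plus low-rank deltas is cheaper than one full projection per block, the sketch below counts parameters for the two layouts. `linear_params` and `adaln_param_counts` are hypothetical helpers; the sizes come from `config.json` (`model_dim` 896, `adaln_low_rank_rank` 128, `decoder_depth` 8), with `d_cond` equal to `model_dim` as in the decoder.

```python
def linear_params(in_features, out_features, bias):
    """Parameter count of a dense layer."""
    return in_features * out_features + (out_features if bias else 0)


def adaln_param_counts(d_model, d_cond, rank, depth):
    """Compare shared-base-plus-deltas against independent full projections."""
    # Shared projector: Linear(d_cond -> 2 * d_model) with bias.
    base = linear_params(d_cond, 2 * d_model, bias=True)
    # Per-block delta: down(d_cond -> rank) and up(rank -> 2 * d_model), no bias.
    delta = linear_params(d_cond, rank, bias=False) + linear_params(
        rank, 2 * d_model, bias=False
    )
    shared_plus_deltas = base + depth * delta
    independent_full = depth * base
    return base, delta, shared_plus_deltas, independent_full


base, delta, shared, independent = adaln_param_counts(
    d_model=896, d_cond=896, rank=128, depth=8
)
```

With these sizes the shared layout needs roughly a third of the modulation parameters of eight independent full projections. Because `up` is zero-initialized, every delta starts at zero and each block initially sees only the shared base output.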

dinac_ae/config.py ADDED Viewed

	@@ -0,0 +1,75 @@

+"""Frozen model architecture and user-tunable inference configuration."""
+from __future__ import annotations
+import json
+from dataclasses import asdict, dataclass
+from pathlib import Path
+@dataclass(frozen=True)
+class DinacAEConfig:
+    """Frozen architecture config stored alongside exported weights."""
+    in_channels: int = 3
+    patch_size: int = 16
+    model_dim: int = 896
+    encoder_depth: int = 6
+    decoder_depth: int = 8
+    decoder_start_blocks: int = 2
+    decoder_end_blocks: int = 2
+    bottleneck_dim: int = 128
+    mlp_ratio: float = 4.0
+    encoder_mlp_type: str = "gelu"
+    depthwise_kernel_size: int = 7
+    adaln_low_rank_rank: int = 128
+    bottleneck_posterior_kind: str = "diagonal_gaussian"
+    bottleneck_norm_mode: str = "disabled"
+    logsnr_min: float = -10.0
+    logsnr_max: float = 10.0
+    pixel_noise_std: float = 0.558
+    latent_running_stats_eps: float = 1e-4
+    class_head_feature_dim: int = 768
+    class_head_model_dim: int = 768
+    class_head_head_dim: int = 64
+    class_head_mlp_ratio: float = 4.0
+    class_head_mlp_type: str = "gelu"
+    class_head_register_token_count: int = 4
+    @property
+    def latent_channels(self) -> int:
+        """Return the exported latent channel width."""
+        return int(self.bottleneck_dim)
+    @property
+    def effective_patch_size(self) -> int:
+        """Return the image-to-latent stride."""
+        return int(self.patch_size)
+    def save(self, path: str | Path) -> None:
+        """Save config as JSON."""
+        output_path = Path(path)
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+        output_path.write_text(json.dumps(asdict(self), indent=2) + "\n")
+    @classmethod
+    def load(cls, path: str | Path) -> DinacAEConfig:
+        """Load config from JSON."""
+        data = json.loads(Path(path).read_text())
+        return cls(**data)
+@dataclass
+class DinacAEInferenceConfig:
+    """User-tunable VP diffusion decode settings."""
+    num_steps: int = 1
+    sampler: str = "ddim"
+    schedule: str = "linear"
+    pdg: bool = False
+    pdg_strength: float = 2.0
+    seed: int | None = None
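
The save/load pattern of the frozen config can be exercised in isolation. This is a minimal sketch using a hypothetical `TinyConfig` stand-in with two of the `DinacAEConfig` fields; it does not import the exported package.

```python
import json
import tempfile
from dataclasses import FrozenInstanceError, asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class TinyConfig:
    """Hypothetical stand-in with two of the DinacAEConfig fields."""

    patch_size: int = 16
    bottleneck_dim: int = 128

    def save(self, path):
        out = Path(path)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(json.dumps(asdict(self), indent=2) + "\n")

    @classmethod
    def load(cls, path):
        return cls(**json.loads(Path(path).read_text()))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "config.json"
    original = TinyConfig(patch_size=8)
    original.save(target)  # creates the missing parent directory
    restored = TinyConfig.load(target)

try:
    restored.patch_size = 32  # frozen dataclasses reject attribute assignment
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

One consequence of `cls(**data)` is that a JSON file containing a key the dataclass does not declare fails with `TypeError` at load time, so the config file and the class definition have to stay in step.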

dinac_ae/decoder.py ADDED Viewed

	@@ -0,0 +1,163 @@

+"""Decoder matching the exported FCDM decoder stack."""
+from __future__ import annotations
+import torch
+from torch import Tensor, nn
+from .adaln import AdaLNScaleGateZeroLowRankDelta, AdaLNScaleGateZeroProjector
+from .fcdm_block import FCDMBlock
+from .straight_through_encoder import Patchify
+from .time_embed import SinusoidalTimeEmbeddingMLP
+class Decoder(nn.Module):
+    """VP diffusion decoder conditioned on encoder latents and timestep.
+    Architecture (skip-concat, 2+4+2 default):
+        Patchify x_t -> Fuse with upsampled z
+        -> Start blocks (2) -> Middle blocks (4) -> Skip fuse -> End blocks (2)
+        -> Conv1x1 -> PixelShuffle
+    Path-Drop Guidance (PDG) at inference:
+    - Replace middle block output with ``path_drop_mask_feature`` to create
+      an unconditional prediction, then extrapolate.
+    """
+    def __init__(
+        self,
+        in_channels: int,
+        patch_size: int,
+        model_dim: int,
+        depth: int,
+        start_block_count: int,
+        end_block_count: int,
+        bottleneck_dim: int,
+        mlp_ratio: float,
+        depthwise_kernel_size: int,
+        adaln_low_rank_rank: int,
+    ) -> None:
+        super().__init__()
+        self.patch_size = int(patch_size)
+        self.model_dim = int(model_dim)
+        self.patchify = Patchify(
+            in_channels,
+            patch_size,
+            model_dim,
+        )
+        self.latent_up = nn.Conv2d(bottleneck_dim, model_dim, kernel_size=1, bias=True)
+        self.fuse_in = nn.Conv2d(2 * model_dim, model_dim, kernel_size=1, bias=True)
+        # Time embedding
+        self.time_embed = SinusoidalTimeEmbeddingMLP(model_dim)
+        # 2-way AdaLN: shared base projector + per-block low-rank deltas
+        self.adaln_base = AdaLNScaleGateZeroProjector(
+            d_model=model_dim, d_cond=model_dim
+        )
+        self.adaln_deltas = nn.ModuleList(
+            [
+                AdaLNScaleGateZeroLowRankDelta(
+                    d_model=model_dim, d_cond=model_dim, rank=adaln_low_rank_rank
+                )
+                for _ in range(depth)
+            ]
+        )
+        # Block layout: start + middle + end
+        middle_count = depth - start_block_count - end_block_count
+        self._middle_start_idx = start_block_count
+        self._end_start_idx = start_block_count + middle_count
+        def _make_blocks(count: int) -> nn.ModuleList:
+            return nn.ModuleList(
+                [
+                    FCDMBlock(
+                        model_dim,
+                        mlp_ratio,
+                        depthwise_kernel_size=depthwise_kernel_size,
+                        use_external_adaln=True,
+                    )
+                    for _ in range(count)
+                ]
+            )
+        self.start_blocks = _make_blocks(start_block_count)
+        self.middle_blocks = _make_blocks(middle_count)
+        self.fuse_skip = nn.Conv2d(2 * model_dim, model_dim, kernel_size=1, bias=True)
+        self.end_blocks = _make_blocks(end_block_count)
+        self.path_drop_mask_feature = nn.Parameter(torch.zeros((1, model_dim, 1, 1)))
+        self.out_proj = nn.Conv2d(
+            model_dim, in_channels * (patch_size**2), kernel_size=1, bias=True
+        )
+        self.unpatchify = nn.PixelShuffle(patch_size)
+    def _adaln_m_for_layer(self, cond: Tensor, layer_idx: int) -> Tensor:
+        """Compute packed AdaLN modulation = shared_base + per-layer delta."""
+        act = self.adaln_base.act(cond)
+        base_m = self.adaln_base.project_activated(act)
+        delta_m = self.adaln_deltas[layer_idx](act)
+        return base_m + delta_m
+    def _run_blocks(
+        self, blocks: nn.ModuleList, x: Tensor, cond: Tensor, start_index: int
+    ) -> Tensor:
+        """Run a group of decoder blocks with per-block AdaLN modulation."""
+        for local_idx, block in enumerate(blocks):
+            adaln_m = self._adaln_m_for_layer(cond, layer_idx=start_index + local_idx)
+            x = block(x, adaln_m=adaln_m)
+        return x
+    def forward(
+        self,
+        x_t: Tensor,
+        t: Tensor,
+        latents: Tensor,
+        *,
+        drop_middle_blocks: bool = False,
+    ) -> Tensor:
+        """Single decoder forward pass.
+        Args:
+            x_t: Noised image [B, C, H, W].
+            t: Timestep [B] in [0, 1].
+            latents: Encoder latents [B, bottleneck_dim, h, w].
+            drop_middle_blocks: Replace middle block output with mask feature (PDG).
+        Returns:
+            x0 prediction [B, C, H, W].
+        """
+        x_feat = self.patchify(x_t)
+        z_up = self.latent_up(latents)
+        fused = torch.cat([x_feat, z_up], dim=1)
+        fused = self.fuse_in(fused)
+        cond = self.time_embed(t.to(torch.float32).to(device=x_t.device))
+        start_out = self._run_blocks(self.start_blocks, fused, cond, start_index=0)
+        if drop_middle_blocks:
+            middle_out = self.path_drop_mask_feature.to(
+                device=x_t.device, dtype=x_t.dtype
+            ).expand_as(start_out)
+        else:
+            middle_out = self._run_blocks(
+                self.middle_blocks,
+                start_out,
+                cond,
+                start_index=self._middle_start_idx,
+            )
+        skip_fused = torch.cat([start_out, middle_out], dim=1)
+        skip_fused = self.fuse_skip(skip_fused)
+        end_out = self._run_blocks(
+            self.end_blocks, skip_fused, cond, start_index=self._end_start_idx
+        )
+        patches = self.out_proj(end_out)
+        return self.unpatchify(patches)
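
The start/middle/end split determines which global layer index, and therefore which entry of `adaln_deltas`, each block uses. The sketch below reproduces that bookkeeping; `block_index_ranges` is a hypothetical helper, and the negative-count check is an addition of the sketch that the exported `Decoder` does not perform.

```python
def block_index_ranges(depth, start_block_count, end_block_count):
    """Return the global layer indices used by each decoder block group."""
    middle_count = depth - start_block_count - end_block_count
    if middle_count < 0:
        # Extra validation added for the sketch only.
        raise ValueError("start + end blocks exceed decoder depth")
    middle_start = start_block_count
    end_start = start_block_count + middle_count
    return {
        "start": list(range(0, middle_start)),
        "middle": list(range(middle_start, end_start)),
        "end": list(range(end_start, depth)),
    }


layout = block_index_ranges(depth=8, start_block_count=2, end_block_count=2)
```

When `drop_middle_blocks=True`, the blocks and AdaLN deltas listed under `"middle"` are skipped entirely for that forward pass, while the start and end groups still run with their own indices.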

dinac_ae/encoder.py ADDED Viewed

	@@ -0,0 +1,215 @@

+"""Encoder matching the exported mixed DitBlock/FCDM architecture."""
+from __future__ import annotations
+from dataclasses import dataclass
+import torch
+import torch.nn.functional as F
+from torch import Tensor, nn
+from dit.axial_rope2d import (
+    AxialRoPE2D,
+    AxialRoPE2DConfig,
+    AxialRoPE2DCoordMode,
+    AxialRoPE2DDimLayout,
+    AxialRoPE2DNormalizeCoords,
+)
+from dit.blocks import DitBlock
+from dit.body_config import DiTConditioning
+from dit.mlp_types import MLPType
+from dit.position_encoding import DiTPositionEncoding
+from .straight_through_encoder import Patchify
+_ENCODER_HEAD_DIM = 64
+def _resolve_encoder_mlp_type(name: str) -> MLPType:
+    """Return the encoder DiT MLP enum for the serialized config value."""
+    match str(name):
+        case "gelu":
+            return MLPType.GELU
+        case "silu":
+            return MLPType.SILU
+        case "relu":
+            return MLPType.RELU
+        case _ as unreachable:
+            raise ValueError(
+                "Unsupported encoder_mlp_type for DinacAE export: " f"{unreachable!r}"
+            )
+@dataclass(frozen=True)
+class EncoderPosterior:
+    """VP-parameterized diagonal Gaussian posterior."""
+    mean: Tensor
+    logsnr: Tensor
+    @property
+    def alpha(self) -> Tensor:
+        """Return the VP signal coefficient."""
+        logsnr_fp32 = self.logsnr.to(torch.float32)
+        return torch.exp(0.5 * F.logsigmoid(logsnr_fp32))
+    @property
+    def sigma(self) -> Tensor:
+        """Return the VP noise coefficient."""
+        logsnr_fp32 = self.logsnr.to(torch.float32)
+        return torch.exp(0.5 * F.logsigmoid(-logsnr_fp32))
+    def mode(self) -> Tensor:
+        """Return the posterior mode in token space."""
+        return (self.alpha * self.mean.to(torch.float32)).to(dtype=self.mean.dtype)
+    def sample(self, *, generator: torch.Generator | None = None) -> Tensor:
+        """Sample from the posterior."""
+        mean_fp32 = self.mean.to(torch.float32)
+        eps = torch.randn(
+            mean_fp32.shape,
+            device=mean_fp32.device,
+            dtype=torch.float32,
+            generator=generator,
+        )
+        return (self.alpha * mean_fp32 + self.sigma * eps).to(dtype=self.mean.dtype)
+class Encoder(nn.Module):
+    """Residual-patchify plus DitBlock encoder."""
+    def __init__(
+        self,
+        *,
+        in_channels: int,
+        patch_size: int,
+        model_dim: int,
+        depth: int,
+        bottleneck_dim: int,
+        mlp_ratio: float,
+        mlp_type: str,
+        bottleneck_posterior_kind: str,
+        bottleneck_norm_mode: str,
+    ) -> None:
+        super().__init__()
+        if int(model_dim) % int(_ENCODER_HEAD_DIM) != 0:
+            raise ValueError("model_dim must be divisible by encoder head dim")
+        self.bottleneck_dim: int = int(bottleneck_dim)
+        self.bottleneck_posterior_kind: str = str(bottleneck_posterior_kind)
+        self.bottleneck_norm_mode: str = str(bottleneck_norm_mode)
+        if self.bottleneck_norm_mode != "disabled":
+            raise ValueError("DINAC-AE export requires disabled bottleneck norm")
+        self.patchify = Patchify(
+            in_channels,
+            patch_size,
+            model_dim,
+        )
+        self.blocks = nn.ModuleList(
+            [
+                DitBlock(
+                    d_model=int(model_dim),
+                    n_heads=int(model_dim) // int(_ENCODER_HEAD_DIM),
+                    mlp_ratio=float(mlp_ratio),
+                    mlp_type=_resolve_encoder_mlp_type(mlp_type),
+                    block_index=int(index),
+                    use_norms=True,
+                    position_encoding=DiTPositionEncoding.ROPE_2D_AXIAL_UNNORMALIZED,
+                    conditioning=DiTConditioning.UNCOND,
+                )
+                for index in range(int(depth))
+            ]
+        )
+        self.rope = AxialRoPE2D(
+            head_dim=int(_ENCODER_HEAD_DIM),
+            cfg=AxialRoPE2DConfig(
+                base=10_000.0,
+                min_period=None,
+                max_period=None,
+                coord_mode=AxialRoPE2DCoordMode.PATCH_INDICES,
+                normalize_coords=AxialRoPE2DNormalizeCoords.MAX,
+                dim_layout=AxialRoPE2DDimLayout.PAIR_INTERLEAVED,
+                angle_multiplier=1.0,
+                coord_offset=0.0,
+                frequency_aware=None,
+                beta_warp=None,
+                alpha_warp=None,
+            ),
+        )
+        match self.bottleneck_posterior_kind:
+            case "deterministic":
+                output_channels = int(bottleneck_dim)
+            case "diagonal_gaussian":
+                output_channels = 2 * int(bottleneck_dim)
+            case _ as unreachable:
+                raise RuntimeError(
+                    f"Unsupported bottleneck_posterior_kind: {unreachable}"
+                )
+        self.to_bottleneck = nn.Conv2d(
+            int(model_dim),
+            output_channels,
+            kernel_size=1,
+            bias=True,
+        )
+    def _encode_projection(self, images: Tensor) -> Tensor:
+        """Encode images to the raw bottleneck projection."""
+        z = self.patchify(images)
+        batch, channels, height, width = z.shape
+        cond = torch.zeros(
+            (int(batch), int(channels)),
+            device=z.device,
+            dtype=z.dtype,
+        )
+        rope_sincos = self.rope(H=int(height), W=int(width), scales=None)
+        y = z
+        for block in self.blocks:
+            y = block(
+                y,
+                hw=(int(height), int(width)),
+                cond_vec=cond,
+                adaln_m=None,
+                rope_sincos=rope_sincos,
+                generator=None,
+            )
+        return self.to_bottleneck(y)
+    def _apply_bottleneck_norm(self, z: Tensor) -> Tensor:
+        """Return the unnormalized bottleneck mean."""
+        return z
+    def encode_posterior(self, images: Tensor) -> EncoderPosterior:
+        """Encode images and return the posterior."""
+        if self.bottleneck_posterior_kind != "diagonal_gaussian":
+            raise RuntimeError(
+                "encode_posterior requires bottleneck_posterior_kind=diagonal_gaussian"
+            )
+        projection = self._encode_projection(images)
+        mean, logsnr = projection.chunk(2, dim=1)
+        mean = self._apply_bottleneck_norm(mean)
+        return EncoderPosterior(mean=mean, logsnr=logsnr)
+    def forward(self, images: Tensor) -> Tensor:
+        """Encode images to latent tokens."""
+        projection = self._encode_projection(images)
+        match self.bottleneck_posterior_kind:
+            case "diagonal_gaussian":
+                mean, logsnr = projection.chunk(2, dim=1)
+                mean = self._apply_bottleneck_norm(mean)
+                logsnr_fp32 = logsnr.to(torch.float32)
+                alpha = torch.exp(0.5 * F.logsigmoid(logsnr_fp32))
+                return (alpha * mean.to(torch.float32)).to(dtype=mean.dtype)
+            case "deterministic":
+                return self._apply_bottleneck_norm(projection)
+            case _ as unreachable:
+                raise RuntimeError(
+                    f"Unsupported bottleneck_posterior_kind: {unreachable}"
+                )
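
The `alpha` and `sigma` properties of `EncoderPosterior` are the square roots of `sigmoid(logsnr)` and `sigmoid(-logsnr)`. A scalar sketch with the standard library shows the variance-preserving identity; `log_sigmoid` and `vp_alpha_sigma` are hypothetical helpers written for this illustration, and the sample log-SNR values are arbitrary.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) for a Python float."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def vp_alpha_sigma(logsnr):
    """Return (alpha, sigma) with alpha^2 = sigmoid(logsnr), sigma^2 = sigmoid(-logsnr)."""
    alpha = math.exp(0.5 * log_sigmoid(logsnr))
    sigma = math.exp(0.5 * log_sigmoid(-logsnr))
    return alpha, sigma


coeffs = {logsnr: vp_alpha_sigma(logsnr) for logsnr in (-10.0, 0.0, 10.0)}
```

Since `sigmoid(x) + sigmoid(-x) = 1`, the two coefficients always satisfy `alpha**2 + sigma**2 == 1`, and their squared ratio recovers `exp(logsnr)`. A large positive log-SNR therefore means the posterior sample is dominated by the mean, and a large negative one means it is dominated by noise.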

dinac_ae/fcdm_block.py ADDED Viewed

	@@ -0,0 +1,103 @@

+"""FCDM block: ConvNeXt-style conv block with GRN and scale+gate AdaLN."""
+from __future__ import annotations
+import torch
+import torch.nn.functional as F
+from torch import Tensor, nn
+from .norms import ChannelWiseRMSNorm
+class GRN(nn.Module):
+    """Global Response Normalization for NCHW tensors."""
+    def __init__(self, channels: int, *, eps: float = 1e-6) -> None:
+        super().__init__()
+        self.eps: float = float(eps)
+        c = int(channels)
+        self.gamma = nn.Parameter(torch.zeros((1, c, 1, 1), dtype=torch.float32))
+        self.beta = nn.Parameter(torch.zeros((1, c, 1, 1), dtype=torch.float32))
+    def forward(self, x: Tensor) -> Tensor:
+        g = torch.linalg.vector_norm(x, ord=2, dim=(2, 3), keepdim=True)
+        g_fp32 = g.to(dtype=torch.float32)
+        n = (g_fp32 / (g_fp32.mean(dim=1, keepdim=True) + self.eps)).to(dtype=x.dtype)
+        gamma = self.gamma.to(device=x.device, dtype=x.dtype)
+        beta = self.beta.to(device=x.device, dtype=x.dtype)
+        return gamma * (x * n) + beta + x
+class FCDMBlock(nn.Module):
+    """ConvNeXt-style block with scale+gate AdaLN and GRN.
+    Two modes:
+    - Unconditioned (encoder): uses learned layer-scale for near-identity init.
+    - External AdaLN (decoder): receives packed [B, 2*C] modulation (scale, gate).
+      The gate is applied raw (no tanh).
+    """
+    def __init__(
+        self,
+        channels: int,
+        mlp_ratio: float,
+        *,
+        depthwise_kernel_size: int = 7,
+        use_external_adaln: bool = False,
+        norm_eps: float = 1e-6,
+        layer_scale_init: float = 1e-3,
+    ) -> None:
+        super().__init__()
+        self.channels: int = int(channels)
+        self.mlp_ratio: float = float(mlp_ratio)
+        self.dwconv = nn.Conv2d(
+            channels,
+            channels,
+            kernel_size=depthwise_kernel_size,
+            padding=depthwise_kernel_size // 2,
+            stride=1,
+            groups=channels,
+            bias=True,
+        )
+        self.norm = ChannelWiseRMSNorm(channels, eps=float(norm_eps), affine=False)
+        hidden = max(int(float(channels) * float(mlp_ratio)), 1)
+        self.pwconv1 = nn.Conv2d(channels, hidden, kernel_size=1, bias=True)
+        self.grn = GRN(hidden, eps=1e-6)
+        self.pwconv2 = nn.Conv2d(hidden, channels, kernel_size=1, bias=True)
+        if not use_external_adaln:
+            self.layer_scale = nn.Parameter(
+                torch.full((channels,), float(layer_scale_init))
+            )
+        else:
+            self.register_parameter("layer_scale", None)
+    def forward(self, x: Tensor, *, adaln_m: Tensor | None = None) -> Tensor:
+        b, c, _, _ = x.shape
+        if adaln_m is not None:
+            m = adaln_m.to(device=x.device, dtype=x.dtype)
+            scale, gate = m.chunk(2, dim=-1)
+        else:
+            scale = gate = None
+        h = self.dwconv(x)
+        h = self.norm(h)
+        if scale is not None:
+            h = h * (1.0 + scale.view(b, c, 1, 1))
+        h = self.pwconv1(h)
+        h = F.gelu(h)
+        h = self.grn(h)
+        h = self.pwconv2(h)
+        if gate is not None:
+            gate_view = gate.view(b, c, 1, 1)
+        else:
+            gate_view = self.layer_scale.view(1, c, 1, 1).to(  # type: ignore[union-attr]
+                device=h.device, dtype=h.dtype
+            )
+        return x + gate_view * h
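
The GRN formula above can be checked on a tiny example. This sketch re-implements it for a single sample stored as nested Python lists (channels, then flattened spatial positions); the function name `grn` and the toy values are assumptions for illustration, and no mixed-precision handling is modeled.

```python
import math


def grn(x, gamma, beta, eps=1e-6):
    """x: list of C channels, each a flat list of H*W activations."""
    # Per-channel L2 norm over the spatial positions.
    norms = [math.sqrt(sum(v * v for v in channel)) for channel in x]
    # Divide by the mean norm across channels.
    mean_norm = sum(norms) / len(norms)
    out = []
    for channel, g, gam, bet in zip(x, norms, gamma, beta):
        n = g / (mean_norm + eps)
        out.append([gam * (v * n) + bet + v for v in channel])
    return out


x = [[1.0, -2.0, 0.5, 0.0], [3.0, 3.0, -1.0, 2.0], [0.1, 0.2, 0.0, -0.1]]
identity = grn(x, gamma=[0.0] * 3, beta=[0.0] * 3)
boosted = grn(x, gamma=[1.0] * 3, beta=[0.0] * 3)
```

With `gamma` and `beta` at their zero initialization the layer is an exact identity, which matches the residual `+ x` term in `GRN.forward`. With a nonzero `gamma`, channels whose norm is above the cross-channel mean are amplified more than channels below it.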

dinac_ae/model.py ADDED Viewed

	@@ -0,0 +1,333 @@

+"""Standalone mixed DitBlock/FCDM diffusion autoencoder export."""
+from __future__ import annotations
+from pathlib import Path
+import torch
+from torch import Tensor, nn
+from dit.mlp_types import MLPType
+from dit.repa_projection import DinoTokenAlignmentHead
+from .config import DinacAEConfig, DinacAEInferenceConfig
+from .decoder import Decoder
+from .encoder import Encoder, EncoderPosterior
+from .samplers import run_ddim, run_dpmpp_2m
+from .vp_diffusion import get_schedule, make_initial_state, sample_noise
+def _resolve_model_dir(
+    path_or_repo_id: str | Path,
+    *,
+    revision: str | None,
+    cache_dir: str | Path | None,
+) -> Path:
+    """Resolve a local path or Hugging Face repo ID to a directory."""
+    local = Path(path_or_repo_id)
+    if local.is_dir():
+        return local
+    repo_id = str(path_or_repo_id)
+    try:
+        from huggingface_hub import snapshot_download
+    except ImportError as exc:
+        raise ImportError(
+            f"'{repo_id}' is not an existing local directory. Install "
+            "huggingface_hub to load from the Hub."
+        ) from exc
+    cache_dir_str = str(cache_dir) if cache_dir is not None else None
+    return Path(
+        snapshot_download(
+            repo_id,
+            revision=revision,
+            cache_dir=cache_dir_str,
+        )
+    )
+def _resolve_class_head_mlp_type(name: str) -> MLPType:
+    """Return the token-head MLP enum for the serialized config value."""
+    match str(name):
+        case "gelu":
+            return MLPType.GELU
+        case "silu":
+            return MLPType.SILU
+        case "relu":
+            return MLPType.RELU
+        case _ as unreachable:
+            raise ValueError(
+                "Unsupported class_head_mlp_type for DinacAE export: "
+                f"{unreachable!r}"
+            )
+class DinacAE(nn.Module):
+    """Exported DINAC-AE wrapper with encode/decode/predict_class APIs."""
+    def __init__(self, config: DinacAEConfig) -> None:
+        super().__init__()
+        self.config = config
+        self.register_buffer(
+            "latent_norm_running_mean",
+            torch.zeros((config.latent_channels,), dtype=torch.float32),
+        )
+        self.register_buffer(
+            "latent_norm_running_var",
+            torch.ones((config.latent_channels,), dtype=torch.float32),
+        )
+        self.encoder = Encoder(
+            in_channels=int(config.in_channels),
+            patch_size=int(config.patch_size),
+            model_dim=int(config.model_dim),
+            depth=int(config.encoder_depth),
+            bottleneck_dim=int(config.bottleneck_dim),
+            mlp_ratio=float(config.mlp_ratio),
+            mlp_type=str(config.encoder_mlp_type),
+            bottleneck_posterior_kind=str(config.bottleneck_posterior_kind),
+            bottleneck_norm_mode=str(config.bottleneck_norm_mode),
+        )
+        self.decoder = Decoder(
+            in_channels=int(config.in_channels),
+            patch_size=int(config.patch_size),
+            model_dim=int(config.model_dim),
+            depth=int(config.decoder_depth),
+            start_block_count=int(config.decoder_start_blocks),
+            end_block_count=int(config.decoder_end_blocks),
+            bottleneck_dim=int(config.bottleneck_dim),
+            mlp_ratio=float(config.mlp_ratio),
+            depthwise_kernel_size=int(config.depthwise_kernel_size),
+            adaln_low_rank_rank=int(config.adaln_low_rank_rank),
+        )
+        self.dino_token_alignment_head = DinoTokenAlignmentHead(
+            in_channels=int(config.bottleneck_dim),
+            feature_dim=int(config.class_head_feature_dim),
+            model_dim=int(config.class_head_model_dim),
+            head_dim=int(config.class_head_head_dim),
+            mlp_ratio=float(config.class_head_mlp_ratio),
+            mlp_activation=_resolve_class_head_mlp_type(config.class_head_mlp_type),
+            block_index=10_001,
+            register_token_count=int(config.class_head_register_token_count),
+        )
+    def _restore_float32_norm_buffers(self) -> None:
+        """Keep latent running stats in float32 after device/dtype moves."""
+        self.latent_norm_running_mean = self.latent_norm_running_mean.to(
+            dtype=torch.float32
+        )
+        self.latent_norm_running_var = self.latent_norm_running_var.to(
+            dtype=torch.float32
+        )
+    def to(self, *args: object, **kwargs: object) -> DinacAE:
+        """Move the model while preserving float32 latent stats buffers."""
+        moved = super().to(*args, **kwargs)
+        if not isinstance(moved, DinacAE):
+            raise RuntimeError(
+                f"Expected DinacAE after nn.Module.to(), got {type(moved).__name__}"
+            )
+        moved._restore_float32_norm_buffers()
+        return moved
+    @classmethod
+    def from_pretrained(
+        cls,
+        path_or_repo_id: str | Path,
+        *,
+        dtype: torch.dtype = torch.bfloat16,
+        device: str | torch.device = "cpu",
+        revision: str | None = None,
+        cache_dir: str | Path | None = None,
+    ) -> DinacAE:
+        """Load a pretrained export from a local directory or the Hub."""
+        model_dir = _resolve_model_dir(
+            path_or_repo_id,
+            revision=revision,
+            cache_dir=cache_dir,
+        )
+        config = DinacAEConfig.load(model_dir / "config.json")
+        model = cls(config)
+        safetensors_path = model_dir / "model.safetensors"
+        if safetensors_path.exists():
+            try:
+                from safetensors.torch import load_file
+            except ImportError as exc:
+                raise ImportError(
+                    "safetensors is required to load model.safetensors"
+                ) from exc
+            state_dict = load_file(str(safetensors_path), device="cpu")
+        else:
+            raise FileNotFoundError(
+                f"No model weights found in {model_dir}. Expected model.safetensors."
+            )
+        model.load_state_dict(state_dict, strict=True)
+        model = model.to(dtype=dtype, device=torch.device(device))
+        model.eval()
+        return model
+    def _latent_norm_stats(self) -> tuple[Tensor, Tensor]:
+        """Return ``(mean, std)`` tensors for latent whitening."""
+        mean = self.latent_norm_running_mean.view(1, -1, 1, 1)
+        var = self.latent_norm_running_var.view(1, -1, 1, 1)
+        std = torch.sqrt(
+            var.to(torch.float32) + float(self.config.latent_running_stats_eps)
+        )
+        return mean.to(torch.float32), std
+    def _require_image_size_divisible(self, height: int, width: int) -> None:
+        """Require image dimensions compatible with the exported patch size."""
+        patch = int(self.config.effective_patch_size)
+        if int(height) % patch != 0 or int(width) % patch != 0:
+            raise ValueError(
+                f"Image height={height} and width={width} must be divisible by "
+                f"effective_patch_size={patch}"
+            )
+    def whiten(self, latents: Tensor) -> Tensor:
+        """Whiten raw latents using exported running stats."""
+        z = latents.to(torch.float32)
+        mean, std = self._latent_norm_stats()
+        return (z - mean.to(device=z.device)) / std.to(device=z.device)
+    def dewhiten(self, latents: Tensor) -> Tensor:
+        """Undo latent whitening back to the raw decoder scale."""
+        z = latents.to(torch.float32)
+        mean, std = self._latent_norm_stats()
+        return z * std.to(device=z.device) + mean.to(device=z.device)
+    def encode(self, images: Tensor) -> Tensor:
+        """Encode images to the exported whitened latent space."""
+        self._require_image_size_divisible(
+            height=int(images.shape[2]),
+            width=int(images.shape[3]),
+        )
+        model_dtype = next(self.parameters()).dtype
+        latents = self.encoder(images.to(dtype=model_dtype))
+        return self.whiten(latents).to(dtype=model_dtype)
+    def encode_posterior(self, images: Tensor) -> EncoderPosterior:
+        """Encode images and return the raw posterior."""
+        self._require_image_size_divisible(
+            height=int(images.shape[2]),
+            width=int(images.shape[3]),
+        )
+        model_dtype = next(self.parameters()).dtype
+        return self.encoder.encode_posterior(images.to(dtype=model_dtype))
+    def predict_class(self, latents: Tensor) -> Tensor:
+        """Predict the exported DINO class token from whitened latents."""
+        dewhitened = self.dewhiten(latents)
+        t_zero = torch.zeros(
+            (int(latents.shape[0]),),
+            device=latents.device,
+            dtype=torch.float32,
+        )
+        head_dtype = self.dino_token_alignment_head.in_proj.weight.dtype
+        device_type = "cuda" if latents.device.type == "cuda" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):
+            out = self.dino_token_alignment_head(
+                dewhitened.to(device=latents.device, dtype=head_dtype),
+                t=t_zero,
+            )
+        return out.class_token.to(torch.float32)
+    def decode(
+        self,
+        latents: Tensor,
+        height: int,
+        width: int,
+        *,
+        inference_config: DinacAEInferenceConfig | None = None,
+    ) -> Tensor:
+        """Decode exported whitened latents to images via VP diffusion."""
+        cfg = (
+            inference_config
+            if inference_config is not None
+            else DinacAEInferenceConfig()
+        )
+        self._require_image_size_divisible(height=int(height), width=int(width))
+        batch = int(latents.shape[0])
+        device = latents.device
+        model_dtype = next(self.parameters()).dtype
+        decoder_latents = self.dewhiten(latents).to(device=device, dtype=model_dtype)
+        noise = sample_noise(
+            (batch, int(self.config.in_channels), int(height), int(width)),
+            noise_std=float(self.config.pixel_noise_std),
+            seed=cfg.seed,
+            device=torch.device("cpu"),
+            dtype=torch.float32,
+        )
+        schedule = get_schedule(cfg.schedule, cfg.num_steps).to(device=device)
+        initial_state = make_initial_state(
+            noise=noise.to(device=device),
+            t_start=schedule[0:1],
+            logsnr_min=float(self.config.logsnr_min),
+            logsnr_max=float(self.config.logsnr_max),
+        )
+        device_type = "cuda" if device.type == "cuda" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):
+            def _forward_fn(
+                x_t: Tensor,
+                t: Tensor,
+                latents_in: Tensor,
+                *,
+                drop_middle_blocks: bool = False,
+                mask_latent_tokens: bool = False,
+            ) -> Tensor:
+                _ = mask_latent_tokens
+                return self.decoder(
+                    x_t.to(dtype=model_dtype),
+                    t,
+                    latents_in.to(dtype=model_dtype),
+                    drop_middle_blocks=bool(drop_middle_blocks),
+                )
+            match cfg.sampler:
+                case "ddim":
+                    sampler_fn = run_ddim
+                case "dpmpp_2m":
+                    sampler_fn = run_dpmpp_2m
+                case _ as unreachable:
+                    raise ValueError(f"Unsupported sampler: {unreachable!r}")
+            pdg_mode = "path_drop" if bool(cfg.pdg) else "disabled"
+            return sampler_fn(
+                forward_fn=_forward_fn,
+                initial_state=initial_state,
+                schedule=schedule,
+                latents=decoder_latents,
+                logsnr_min=float(self.config.logsnr_min),
+                logsnr_max=float(self.config.logsnr_max),
+                pdg_mode=pdg_mode,
+                pdg_strength=float(cfg.pdg_strength),
+                device=device,
+            )
+    def reconstruct(
+        self,
+        images: Tensor,
+        *,
+        inference_config: DinacAEInferenceConfig | None = None,
+    ) -> Tensor:
+        """Encode then decode one image batch."""
+        latents = self.encode(images)
+        _batch, _channels, height, width = images.shape
+        return self.decode(
+            latents,
+            height=int(height),
+            width=int(width),
+            inference_config=inference_config,
+        )
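
The `whiten` and `dewhiten` methods above are per-channel affine maps built from the running mean, the running variance, and an epsilon. The following is a minimal pure-Python sketch of that round trip. The statistics and the `EPS` value are illustrative assumptions, not values read from the exported checkpoint.

```python
import math

# Hypothetical per-channel running statistics (illustrative values only).
RUNNING_MEAN = [0.25, -1.0]
RUNNING_VAR = [4.0, 0.09]
EPS = 1e-4  # stand-in for config.latent_running_stats_eps


def channel_std(var: float) -> float:
    """std = sqrt(var + eps), mirroring _latent_norm_stats."""
    return math.sqrt(var + EPS)


def whiten(latent: list[float]) -> list[float]:
    """(z - mean) / std, applied independently per channel."""
    return [
        (z - m) / channel_std(v)
        for z, m, v in zip(latent, RUNNING_MEAN, RUNNING_VAR)
    ]


def dewhiten(latent: list[float]) -> list[float]:
    """z * std + mean, the inverse of whiten."""
    return [
        z * channel_std(v) + m
        for z, m, v in zip(latent, RUNNING_MEAN, RUNNING_VAR)
    ]


raw = [1.5, -0.7]
restored = dewhiten(whiten(raw))
```

Because the two maps are exact inverses, `decode` and `predict_class` can accept whitened latents and recover the raw decoder scale internally.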

dinac_ae/norms.py ADDED

	@@ -0,0 +1,39 @@

+"""Channel-wise RMSNorm for NCHW tensors."""
+from __future__ import annotations
+import torch
+from torch import Tensor, nn
+class ChannelWiseRMSNorm(nn.Module):
+    """Channel-wise RMSNorm with float32 reduction for numerical stability.
+    Normalizes across channels per spatial position. Supports optional
+    per-channel affine weight and bias.
+    """
+    def __init__(self, channels: int, eps: float = 1e-6, affine: bool = True) -> None:
+        super().__init__()
+        self.channels: int = int(channels)
+        self._eps: float = float(eps)
+        if affine:
+            self.weight = nn.Parameter(torch.ones(self.channels))
+            self.bias = nn.Parameter(torch.zeros(self.channels))
+        else:
+            self.register_parameter("weight", None)
+            self.register_parameter("bias", None)
+    def forward(self, x: Tensor) -> Tensor:
+        if x.dim() < 2:
+            return x
+        # Float32 accumulation for stability
+        ms = torch.mean(torch.square(x), dim=1, keepdim=True, dtype=torch.float32)
+        inv_rms = torch.rsqrt(ms + self._eps)
+        y = x * inv_rms.to(dtype=x.dtype)
+        if self.weight is not None:
+            shape = (1, -1) + (1,) * (x.dim() - 2)
+            y = y * self.weight.view(shape).to(dtype=x.dtype)
+            if self.bias is not None:
+                y = y + self.bias.view(shape).to(dtype=x.dtype)
+        return y
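
To make the normalization axis concrete, here is a small pure-Python sketch of what `ChannelWiseRMSNorm` computes at a single spatial position, without the optional affine parameters. The helper name `channel_rms_norm` and the sample values are hypothetical.

```python
import math


def channel_rms_norm(channels: list[float], eps: float = 1e-6) -> list[float]:
    """Scale the channel vector at one pixel by 1 / sqrt(mean(x^2) + eps)."""
    mean_square = sum(c * c for c in channels) / len(channels)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [c * inv_rms for c in channels]


# One spatial position with four channels (illustrative values).
pixel = [3.0, -4.0, 0.0, 12.0]
normed = channel_rms_norm(pixel)

# After normalization the mean square across channels is close to 1.
mean_square_after = sum(c * c for c in normed) / len(normed)
```

Each spatial position is normalized independently, so the result does not depend on the image size or on neighbouring pixels.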

dinac_ae/samplers.py ADDED

	@@ -0,0 +1,258 @@

+"""DDIM and DPM++2M samplers for VP diffusion with path-drop PDG support."""
+from __future__ import annotations
+from typing import Protocol
+import torch
+from torch import Tensor
+from .vp_diffusion import (
+    alpha_sigma_from_logsnr,
+    broadcast_time_like,
+    shifted_cosine_interpolated_logsnr_from_t,
+)
+class DecoderForwardFn(Protocol):
+    """Callable that predicts x0 from (x_t, t, latents) with path-drop PDG flag."""
+    def __call__(
+        self,
+        x_t: Tensor,
+        t: Tensor,
+        latents: Tensor,
+        *,
+        drop_middle_blocks: bool = False,
+        mask_latent_tokens: bool = False,
+    ) -> Tensor: ...
+def _reconstruct_eps_from_x0(
+    *, x_t: Tensor, x0_hat: Tensor, alpha: Tensor, sigma: Tensor
+) -> Tensor:
+    """Reconstruct eps_hat from (x_t, x0_hat) under VP parameterization.
+    eps_hat = (x_t - alpha * x0_hat) / sigma. All float32.
+    """
+    alpha_view = broadcast_time_like(alpha, x_t).to(dtype=torch.float32)
+    sigma_view = broadcast_time_like(sigma, x_t).to(dtype=torch.float32)
+    x_t_f32 = x_t.to(torch.float32)
+    x0_f32 = x0_hat.to(torch.float32)
+    return (x_t_f32 - alpha_view * x0_f32) / sigma_view
+def _ddim_step(
+    *,
+    x0_hat: Tensor,
+    eps_hat: Tensor,
+    alpha_next: Tensor,
+    sigma_next: Tensor,
+    ref: Tensor,
+) -> Tensor:
+    """DDIM step: x_next = alpha_next * x0_hat + sigma_next * eps_hat."""
+    a = broadcast_time_like(alpha_next, ref).to(dtype=torch.float32)
+    s = broadcast_time_like(sigma_next, ref).to(dtype=torch.float32)
+    return a * x0_hat + s * eps_hat
+def _predict_with_pdg(
+    forward_fn: DecoderForwardFn,
+    state: Tensor,
+    t_vec: Tensor,
+    latents: Tensor,
+    *,
+    pdg_mode: str,
+    pdg_strength: float,
+) -> Tensor:
+    """Run decoder forward with optional PDG guidance.
+    Args:
+        forward_fn: Decoder forward function.
+        state: Current noised state [B, C, H, W].
+        t_vec: Timestep vector [B].
+        latents: Encoder latents.
+        pdg_mode: "disabled" or "path_drop".
+        pdg_strength: CFG-like guidance scale; 1.0 returns the full-path prediction.
+    Returns:
+        x0_hat prediction in float32.
+    """
+    match pdg_mode:
+        case "path_drop":
+            x0_uncond = forward_fn(state, t_vec, latents, drop_middle_blocks=True).to(
+                torch.float32
+            )
+            x0_cond = forward_fn(state, t_vec, latents, drop_middle_blocks=False).to(
+                torch.float32
+            )
+            return x0_uncond + pdg_strength * (x0_cond - x0_uncond)
+        case "disabled":
+            return forward_fn(state, t_vec, latents, drop_middle_blocks=False).to(
+                torch.float32
+            )
+        case _ as unreachable:
+            raise ValueError(f"Unsupported PDG mode: {unreachable!r}")
+def run_ddim(
+    *,
+    forward_fn: DecoderForwardFn,
+    initial_state: Tensor,
+    schedule: Tensor,
+    latents: Tensor,
+    logsnr_min: float,
+    logsnr_max: float,
+    log_change_high: float = 0.0,
+    log_change_low: float = 0.0,
+    pdg_mode: str = "disabled",
+    pdg_strength: float = 1.5,
+    device: torch.device | None = None,
+) -> Tensor:
+    """Run DDIM sampling loop with path-drop PDG support.
+    Args:
+        forward_fn: Decoder forward function (x_t, t, latents) -> x0_hat.
+        initial_state: Starting noised state [B, C, H, W] in float32.
+        schedule: Descending t-schedule [num_steps] in [0, 1].
+        latents: Encoder latents [B, bottleneck_dim, h, w].
+        logsnr_min, logsnr_max: VP schedule endpoints.
+        log_change_high, log_change_low: Shifted-cosine schedule parameters.
+        pdg_mode: "disabled" or "path_drop".
+        pdg_strength: CFG-like guidance scale; 1.0 returns the full-path prediction.
+        device: Target device.
+    Returns:
+        Denoised samples [B, C, H, W] in float32.
+    """
+    run_device = device or initial_state.device
+    batch_size = int(initial_state.shape[0])
+    state = initial_state.to(device=run_device, dtype=torch.float32)
+    # Precompute logSNR, alpha, sigma for all schedule points
+    lmb = shifted_cosine_interpolated_logsnr_from_t(
+        schedule.to(device=run_device),
+        logsnr_min=logsnr_min,
+        logsnr_max=logsnr_max,
+        log_change_high=log_change_high,
+        log_change_low=log_change_low,
+    )
+    alpha_sched, sigma_sched = alpha_sigma_from_logsnr(lmb)
+    for i in range(int(schedule.numel()) - 1):
+        t_i = schedule[i]
+        a_t = alpha_sched[i].expand(batch_size)
+        s_t = sigma_sched[i].expand(batch_size)
+        a_next = alpha_sched[i + 1].expand(batch_size)
+        s_next = sigma_sched[i + 1].expand(batch_size)
+        # Model prediction with optional PDG
+        t_vec = t_i.expand(batch_size).to(device=run_device, dtype=torch.float32)
+        x0_hat = _predict_with_pdg(
+            forward_fn,
+            state,
+            t_vec,
+            latents,
+            pdg_mode=pdg_mode,
+            pdg_strength=pdg_strength,
+        )
+        eps_hat = _reconstruct_eps_from_x0(
+            x_t=state, x0_hat=x0_hat, alpha=a_t, sigma=s_t
+        )
+        state = _ddim_step(
+            x0_hat=x0_hat,
+            eps_hat=eps_hat,
+            alpha_next=a_next,
+            sigma_next=s_next,
+            ref=state,
+        )
+    return state
+def run_dpmpp_2m(
+    *,
+    forward_fn: DecoderForwardFn,
+    initial_state: Tensor,
+    schedule: Tensor,
+    latents: Tensor,
+    logsnr_min: float,
+    logsnr_max: float,
+    log_change_high: float = 0.0,
+    log_change_low: float = 0.0,
+    pdg_mode: str = "disabled",
+    pdg_strength: float = 1.5,
+    device: torch.device | None = None,
+) -> Tensor:
+    """Run DPM++2M sampling loop with path-drop PDG support.
+    Multi-step solver using an exponential-integrator formulation in half-lambda
+    space, where half-lambda = logSNR / 2 = log(alpha / sigma).
+    """
+    run_device = device or initial_state.device
+    batch_size = int(initial_state.shape[0])
+    state = initial_state.to(device=run_device, dtype=torch.float32)
+    # Precompute logSNR, alpha, sigma, half-lambda for all schedule points
+    lmb = shifted_cosine_interpolated_logsnr_from_t(
+        schedule.to(device=run_device),
+        logsnr_min=logsnr_min,
+        logsnr_max=logsnr_max,
+        log_change_high=log_change_high,
+        log_change_low=log_change_low,
+    )
+    alpha_sched, sigma_sched = alpha_sigma_from_logsnr(lmb)
+    half_lambda = 0.5 * lmb.to(torch.float32)
+    x0_prev: Tensor | None = None
+    for i in range(int(schedule.numel()) - 1):
+        t_i = schedule[i]
+        s_t = sigma_sched[i].expand(batch_size)
+        a_next = alpha_sched[i + 1].expand(batch_size)
+        s_next = sigma_sched[i + 1].expand(batch_size)
+        # Model prediction with optional PDG
+        t_vec = t_i.expand(batch_size).to(device=run_device, dtype=torch.float32)
+        x0_hat = _predict_with_pdg(
+            forward_fn,
+            state,
+            t_vec,
+            latents,
+            pdg_mode=pdg_mode,
+            pdg_strength=pdg_strength,
+        )
+        lam_t = half_lambda[i].expand(batch_size)
+        lam_next = half_lambda[i + 1].expand(batch_size)
+        h = (lam_next - lam_t).to(torch.float32)
+        phi_1 = torch.expm1(-h)
+        sigma_ratio = (s_next / s_t).to(torch.float32)
+        if i == 0 or x0_prev is None:
+            # First-order step
+            state = (
+                sigma_ratio.view(-1, *([1] * (state.dim() - 1))) * state
+                - broadcast_time_like(a_next, state).to(torch.float32)
+                * broadcast_time_like(phi_1, state).to(torch.float32)
+                * x0_hat
+            )
+        else:
+            # Second-order step
+            lam_prev = half_lambda[i - 1].expand(batch_size)
+            h_0 = (lam_t - lam_prev).to(torch.float32)
+            r0 = h_0 / h
+            d1_0 = (x0_hat - x0_prev) / broadcast_time_like(r0, x0_hat)
+            common = broadcast_time_like(a_next, state).to(
+                torch.float32
+            ) * broadcast_time_like(phi_1, state).to(torch.float32)
+            state = (
+                sigma_ratio.view(-1, *([1] * (state.dim() - 1))) * state
+                - common * x0_hat
+                - 0.5 * common * d1_0
+            )
+        x0_prev = x0_hat
+    return state
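
The update rules above can be checked on scalars. The sketch below is a simplified illustration using plain floats rather than the module's tensor code; the function names and the logSNR values are assumptions chosen for the example. It shows three properties that follow from the formulas: a DDIM step is exact when `x0_hat` equals the true `x0`, the first-order DPM++ step coincides with the DDIM step, and a guidance strength of 1.0 returns the conditional prediction.

```python
import math


def alpha_sigma(logsnr: float) -> tuple[float, float]:
    """VP coefficients: alpha^2 = sigmoid(logsnr), sigma^2 = sigmoid(-logsnr)."""
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-logsnr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(logsnr)))
    return alpha, sigma


def ddim_step(x_t: float, x0_hat: float, logsnr_t: float, logsnr_next: float) -> float:
    """x_next = alpha_next * x0_hat + sigma_next * eps_hat."""
    a_t, s_t = alpha_sigma(logsnr_t)
    a_n, s_n = alpha_sigma(logsnr_next)
    eps_hat = (x_t - a_t * x0_hat) / s_t
    return a_n * x0_hat + s_n * eps_hat


def dpmpp_first_order_step(
    x_t: float, x0_hat: float, logsnr_t: float, logsnr_next: float
) -> float:
    """(sigma_next / sigma_t) * x_t - alpha_next * expm1(-h) * x0_hat."""
    _, s_t = alpha_sigma(logsnr_t)
    a_n, s_n = alpha_sigma(logsnr_next)
    h = 0.5 * (logsnr_next - logsnr_t)  # step in half-lambda space
    return (s_n / s_t) * x_t - a_n * math.expm1(-h) * x0_hat


def guided_x0(x0_cond: float, x0_uncond: float, strength: float) -> float:
    """Path-drop guidance: extrapolate from the dropped-path prediction."""
    return x0_uncond + strength * (x0_cond - x0_uncond)


# Illustrative clean sample, noise draw, and logSNR levels.
x0, eps = 0.8, -0.3
logsnr_t, logsnr_next = -2.0, 1.0
a_t, s_t = alpha_sigma(logsnr_t)
a_n, s_n = alpha_sigma(logsnr_next)
x_t = a_t * x0 + s_t * eps

x_ddim = ddim_step(x_t, x0, logsnr_t, logsnr_next)
x_dpm = dpmpp_first_order_step(x_t, x0, logsnr_t, logsnr_next)
x_exact = a_n * x0 + s_n * eps
```

The second-order branch of `run_dpmpp_2m` adds a correction term built from the previous `x0_hat`, so the two samplers only differ from the second step onward.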

dinac_ae/straight_through_encoder.py ADDED

	@@ -0,0 +1,57 @@

+"""Patch embedding used by the exported DINAC-AE model."""
+from __future__ import annotations
+from typing import Final
+from torch import Tensor, nn
+__all__ = ["Patchify", "StraightThroughEncoder"]
+class StraightThroughEncoder(nn.Module):
+    """Project non-overlapping image patches with a stride-patch convolution."""
+    def __init__(
+        self,
+        in_channels: int,
+        patch: int,
+        out_channels: int,
+    ) -> None:
+        super().__init__()
+        if in_channels <= 0:
+            raise ValueError("in_channels must be positive")
+        if patch <= 0:
+            raise ValueError("patch must be positive")
+        if out_channels <= 0:
+            raise ValueError("out_channels must be positive")
+        self.in_channels: Final[int] = int(in_channels)
+        self.patch: Final[int] = int(patch)
+        self._output_channels: Final[int] = int(out_channels)
+        self.proj = nn.Conv2d(
+            self.in_channels,
+            self._output_channels,
+            kernel_size=self.patch,
+            stride=self.patch,
+            bias=True,
+        )
+    def forward(self, x: Tensor) -> Tensor:  # type: ignore[override]
+        """Return the patchified token grid."""
+        return self.proj(x)
+    @property
+    def output_channels(self) -> int:
+        """Return the output channel width produced by the encoder."""
+        return int(self._output_channels)
+    @property
+    def latent_channels(self) -> int:
+        """Alias for ``output_channels`` to match encoder interface shape."""
+        return int(self._output_channels)
+Patchify = StraightThroughEncoder
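
A convolution whose kernel size equals its stride visits each patch exactly once, so it acts as one linear projection per non-overlapping patch. The sketch below illustrates this for a single input channel and a single output channel in pure Python. The helper `patchify_project` and the averaging kernel are hypothetical and only stand in for the learned `nn.Conv2d` weights.

```python
def patchify_project(
    image: list[list[float]],
    patch: int,
    weight: list[list[float]],
    bias: float = 0.0,
) -> list[list[float]]:
    """Project each non-overlapping patch x patch tile to one scalar token."""
    height, width = len(image), len(image[0])
    if height % patch != 0 or width % patch != 0:
        raise ValueError("image size must be divisible by patch")
    grid = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            acc = bias
            # Cross-correlation over one tile, as a stride-patch conv does.
            for dy in range(patch):
                for dx in range(patch):
                    acc += weight[dy][dx] * image[top + dy][left + dx]
            row.append(acc)
        grid.append(row)
    return grid


image = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
mean_kernel = [[0.25, 0.25], [0.25, 0.25]]  # averages each 2x2 tile
tokens = patchify_project(image, 2, mean_kernel)
```

A 4x4 input with `patch=2` yields a 2x2 token grid, which matches the `H / patch` by `W / patch` layout that the rest of the model consumes.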

dinac_ae/time_embed.py ADDED

	@@ -0,0 +1,83 @@

+"""Sinusoidal timestep embedding with MLP projection."""
+from __future__ import annotations
+import math
+import torch
+from torch import Tensor, nn
+def _log_spaced_frequencies(
+    half: int, max_period: float, *, device: torch.device | None = None
+) -> Tensor:
+    """Log-spaced frequencies for sinusoidal embedding."""
+    return torch.exp(
+        -math.log(max_period)
+        * torch.arange(half, device=device, dtype=torch.float32)
+        / max(float(half - 1), 1.0)
+    )
+def sinusoidal_time_embedding(
+    t: Tensor,
+    dim: int,
+    *,
+    max_period: float = 10000.0,
+    scale: float | None = None,
+    freqs: Tensor | None = None,
+) -> Tensor:
+    """Sinusoidal timestep embedding (DDPM/DiT-style). Always float32."""
+    t32 = t.to(torch.float32)
+    if scale is not None:
+        t32 = t32 * float(scale)
+    half = dim // 2
+    if freqs is not None:
+        freqs = freqs.to(device=t32.device, dtype=torch.float32)
+    else:
+        freqs = _log_spaced_frequencies(half, max_period, device=t32.device)
+    angles = t32[:, None] * freqs[None, :]
+    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
+class SinusoidalTimeEmbeddingMLP(nn.Module):
+    """Sinusoidal time embedding followed by Linear -> SiLU -> Linear."""
+    def __init__(
+        self,
+        dim: int,
+        *,
+        freq_dim: int = 256,
+        hidden_mult: float = 1.0,
+        time_scale: float = 1000.0,
+        max_period: float = 10000.0,
+    ) -> None:
+        super().__init__()
+        self.dim = int(dim)
+        self.freq_dim = int(freq_dim)
+        self.time_scale = float(time_scale)
+        self.max_period = float(max_period)
+        hidden_dim = max(int(round(int(dim) * float(hidden_mult))), 1)
+        freqs = _log_spaced_frequencies(self.freq_dim // 2, self.max_period)
+        self.register_buffer("freqs", freqs, persistent=True)
+        self.proj_in = nn.Linear(self.freq_dim, hidden_dim)
+        self.act = nn.SiLU()
+        self.proj_out = nn.Linear(hidden_dim, self.dim)
+    def forward(self, t: Tensor) -> Tensor:
+        freqs: Tensor = self.freqs  # type: ignore[assignment]
+        emb_freq = sinusoidal_time_embedding(
+            t.to(torch.float32),
+            self.freq_dim,
+            max_period=self.max_period,
+            scale=self.time_scale,
+            freqs=freqs,
+        )
+        dtype_in = self.proj_in.weight.dtype
+        hidden = self.proj_in(emb_freq.to(dtype_in))
+        hidden = self.act(hidden)
+        if hidden.dtype != self.proj_out.weight.dtype:
+            hidden = hidden.to(self.proj_out.weight.dtype)
+        return self.proj_out(hidden)
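
The frequency layout used by `_log_spaced_frequencies` can be inspected with a scalar re-implementation. The sketch below is a pure-Python illustration for one timestep. The function names mirror the module, but the list-based code and the chosen `dim` are assumptions made for the example.

```python
import math


def log_spaced_frequencies(half: int, max_period: float = 10000.0) -> list[float]:
    """exp(-log(max_period) * k / max(half - 1, 1)) for k in [0, half)."""
    denom = max(float(half - 1), 1.0)
    return [math.exp(-math.log(max_period) * k / denom) for k in range(half)]


def sinusoidal_embedding(
    t: float, dim: int, *, max_period: float = 10000.0, scale: float = 1000.0
) -> list[float]:
    """Concatenate sin and cos of the scaled timestep at each frequency."""
    freqs = log_spaced_frequencies(dim // 2, max_period)
    angles = [t * scale * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


freqs = log_spaced_frequencies(4)
emb = sinusoidal_embedding(0.5, 8)
```

Dividing by `half - 1` means the lowest frequency is exactly `1 / max_period`, and the first `dim // 2` entries are sines while the remaining ones are the matching cosines.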

dinac_ae/vp_diffusion.py ADDED

	@@ -0,0 +1,152 @@

+"""VP diffusion math: logSNR schedules, alpha/sigma computation, noise construction."""
+from __future__ import annotations
+import math
+import torch
+import torch.nn.functional as F
+from torch import Tensor
+def alpha_sigma_from_logsnr(lmb: Tensor) -> tuple[Tensor, Tensor]:
+    """Compute (alpha, sigma) from logSNR in float32.
+    VP constraint: alpha^2 + sigma^2 = 1.
+    """
+    lmb32 = lmb.to(dtype=torch.float32)
+    alpha = torch.exp(0.5 * F.logsigmoid(lmb32))
+    sigma = torch.exp(0.5 * F.logsigmoid(-lmb32))
+    return alpha, sigma
+def broadcast_time_like(coeff: Tensor, x: Tensor) -> Tensor:
+    """Broadcast [B] coefficient to match x for per-sample scaling."""
+    view_shape = (int(x.shape[0]),) + (1,) * (x.dim() - 1)
+    return coeff.view(view_shape)
+def _cosine_interpolated_params(
+    logsnr_min: float, logsnr_max: float
+) -> tuple[float, float]:
+    """Compute (a, b) for cosine-interpolated logSNR schedule.
+    logsnr(t) = -2 * log(tan(a*t + b))
+    logsnr(0) = logsnr_max, logsnr(1) = logsnr_min
+    """
+    b = math.atan(math.exp(-0.5 * logsnr_max))
+    a = math.atan(math.exp(-0.5 * logsnr_min)) - b
+    return a, b
+def cosine_interpolated_logsnr_from_t(
+    t: Tensor, *, logsnr_min: float, logsnr_max: float
+) -> Tensor:
+    """Map t in [0,1] to logSNR via cosine-interpolated schedule. Always float32."""
+    a, b = _cosine_interpolated_params(logsnr_min, logsnr_max)
+    t32 = t.to(dtype=torch.float32)
+    a_t = torch.tensor(a, device=t32.device, dtype=torch.float32)
+    b_t = torch.tensor(b, device=t32.device, dtype=torch.float32)
+    u = a_t * t32 + b_t
+    return -2.0 * torch.log(torch.tan(u))
+def shifted_cosine_interpolated_logsnr_from_t(
+    t: Tensor,
+    *,
+    logsnr_min: float,
+    logsnr_max: float,
+    log_change_high: float = 0.0,
+    log_change_low: float = 0.0,
+) -> Tensor:
+    """SiD2 "shifted cosine" schedule: logSNR with resolution-dependent shifts.
+    lambda(t) = (1-t) * (base(t) + log_change_high) + t * (base(t) + log_change_low)
+    """
+    base = cosine_interpolated_logsnr_from_t(
+        t, logsnr_min=logsnr_min, logsnr_max=logsnr_max
+    )
+    t32 = t.to(dtype=torch.float32)
+    high = base + float(log_change_high)
+    low = base + float(log_change_low)
+    return (1.0 - t32) * high + t32 * low
+def get_schedule(schedule_type: str, num_steps: int) -> Tensor:
+    """Generate a descending t-schedule in [0, 1] for VP diffusion sampling.
+    ``num_steps`` is the number of function evaluations (NFE = decoder forward
+    passes).  Internally the schedule has ``num_steps + 1`` time points
+    (including both endpoints).
+    Args:
+        schedule_type: "linear" or "cosine".
+        num_steps: Number of decoder forward passes (NFE), >= 1.
+    Returns:
+        Descending 1D tensor with ``num_steps + 1`` elements from ~1.0 to ~0.0.
+    """
+    if int(num_steps) < 1:
+        raise ValueError("num_steps must be at least 1")
+    n = int(num_steps) + 1
+    match schedule_type:
+        case "linear":
+            base = torch.linspace(0.0, 1.0, n)
+        case "cosine":
+            i = torch.arange(n, dtype=torch.float32)
+            base = 0.5 * (1.0 - torch.cos(math.pi * (i / (n - 1))))
+        case _ as unreachable:
+            raise ValueError(
+                f"Unsupported schedule type: {unreachable!r}. "
+                "Use 'linear' or 'cosine'."
+            )
+    # Descending: high t (noisy) -> low t (clean)
+    return torch.flip(base, dims=[0])
+def make_initial_state(
+    *,
+    noise: Tensor,
+    t_start: Tensor,
+    logsnr_min: float,
+    logsnr_max: float,
+    log_change_high: float = 0.0,
+    log_change_low: float = 0.0,
+) -> Tensor:
+    """Construct VP initial state x_t0 = sigma_start * noise (since x0=0).
+    All math in float32.
+    """
+    batch = int(noise.shape[0])
+    lmb_start = shifted_cosine_interpolated_logsnr_from_t(
+        t_start.expand(batch).to(dtype=torch.float32),
+        logsnr_min=logsnr_min,
+        logsnr_max=logsnr_max,
+        log_change_high=log_change_high,
+        log_change_low=log_change_low,
+    )
+    _alpha_start, sigma_start = alpha_sigma_from_logsnr(lmb_start)
+    sigma_view = broadcast_time_like(sigma_start, noise)
+    return sigma_view * noise.to(dtype=torch.float32)
+def sample_noise(
+    shape: tuple[int, ...],
+    *,
+    noise_std: float = 1.0,
+    seed: int | None = None,
+    device: torch.device | None = None,
+    dtype: torch.dtype = torch.float32,
+) -> Tensor:
+    """Sample Gaussian noise; seeded draws use a CPU generator for reproducibility."""
+    if seed is None:
+        noise = torch.randn(
+            shape, device=device or torch.device("cpu"), dtype=torch.float32
+        )
+    else:
+        gen = torch.Generator(device="cpu")
+        gen.manual_seed(int(seed))
+        noise = torch.randn(shape, generator=gen, device="cpu", dtype=torch.float32)
+    noise = noise.mul(float(noise_std))
+    target_device = device if device is not None else torch.device("cpu")
+    return noise.to(device=target_device, dtype=dtype)
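
The cosine-interpolated schedule is defined so that `logsnr(0) = logsnr_max` and `logsnr(1) = logsnr_min`. A scalar sketch makes this easy to verify, along with the VP constraint on `(alpha, sigma)`. It is written with plain floats, and the endpoint values `-10.0` and `10.0` are illustrative assumptions rather than the exported configuration.

```python
import math


def cosine_params(logsnr_min: float, logsnr_max: float) -> tuple[float, float]:
    """(a, b) such that logsnr(t) = -2 * log(tan(a * t + b))."""
    b = math.atan(math.exp(-0.5 * logsnr_max))
    a = math.atan(math.exp(-0.5 * logsnr_min)) - b
    return a, b


def logsnr_from_t(t: float, logsnr_min: float, logsnr_max: float) -> float:
    """Map t in [0, 1] to logSNR; t=0 is clean, t=1 is noisy."""
    a, b = cosine_params(logsnr_min, logsnr_max)
    return -2.0 * math.log(math.tan(a * t + b))


def alpha_sigma(logsnr: float) -> tuple[float, float]:
    """alpha = sqrt(sigmoid(logsnr)), sigma = sqrt(sigmoid(-logsnr))."""
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-logsnr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(logsnr)))
    return alpha, sigma


LOGSNR_MIN, LOGSNR_MAX = -10.0, 10.0  # illustrative endpoints

# Descending t values, as a sampler would visit them (noisy -> clean).
ts = [1.0, 0.75, 0.5, 0.25, 0.0]
logsnrs = [logsnr_from_t(t, LOGSNR_MIN, LOGSNR_MAX) for t in ts]
alpha_mid, sigma_mid = alpha_sigma(logsnr_from_t(0.5, LOGSNR_MIN, LOGSNR_MAX))
```

Along a descending `t` schedule the logSNR increases monotonically, so `sigma` shrinks and `alpha` grows as sampling approaches the clean image.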

dit/attention_blocks.py ADDED

	@@ -0,0 +1,240 @@

+"""Dense SDPA attention blocks used by the DINAC-AE export."""
+from __future__ import annotations
+from collections.abc import Callable
+import torch
+import torch.nn.functional as F
+from torch import Tensor, nn
+from common.norms import RMSNorm
+from common.rope import rotate_half, rotate_half_adjacent
+from dit.position_encoding import DiTPositionEncoding
+def _axial_rope_rotate_fn(
+    position_encoding: DiTPositionEncoding,
+) -> Callable[[Tensor], Tensor]:
+    """Return the head-dimension rotation matching the configured RoPE layout."""
+    match position_encoding:
+        case (
+            DiTPositionEncoding.ROPE_2D_AXIAL_DILATED
+            | DiTPositionEncoding.ROPE_2D_AXIAL_NORMALIZED
+            | DiTPositionEncoding.ROPE_2D_AXIAL_FREQ_AWARE
+            | DiTPositionEncoding.ROPE_1D
+        ):
+            return rotate_half
+        case (
+            DiTPositionEncoding.ROPE_2D_AXIAL_UNNORMALIZED
+            | DiTPositionEncoding.ROPE_2D_AXIAL_UNNORMALIZED_DILATED
+            | DiTPositionEncoding.ROPE_2D_AXIAL_BETA_WARP
+            | DiTPositionEncoding.ROPE_2D_AXIAL_ALPHA_WARP
+            | DiTPositionEncoding.ROPE_3D_ZIMAGE
+        ):
+            return rotate_half_adjacent
+        case _ as unreachable:
+            raise ValueError(f"Unsupported RoPE position encoding: {unreachable}")
+class DitSelfAttentionCore(nn.Module):
+    """Dense self-attention core with optional axial RoPE on Q/K."""
+    d_model: int
+    n_heads: int
+    head_dim: int
+    position_encoding: DiTPositionEncoding
+    qkv: nn.Linear
+    proj_out: nn.Linear
+    q_norm: RMSNorm
+    k_norm: RMSNorm
+    def __init__(
+        self,
+        d_model: int,
+        n_heads: int,
+        *,
+        position_encoding: DiTPositionEncoding,
+    ) -> None:
+        super().__init__()
+        if d_model % n_heads != 0:
+            raise ValueError("d_model must be divisible by n_heads")
+        self.d_model = int(d_model)
+        self.n_heads = int(n_heads)
+        self.head_dim = int(self.d_model // self.n_heads)
+        self.position_encoding = position_encoding
+        self.qkv = nn.Linear(self.d_model, 3 * self.d_model, bias=False)
+        self.proj_out = nn.Linear(self.d_model, self.d_model, bias=False)
+        self.q_norm = RMSNorm(self.head_dim)
+        self.k_norm = RMSNorm(self.head_dim)
+    def reset_parameters(self) -> None:
+        """Reset projections to their initialization."""
+        nn.init.xavier_uniform_(self.qkv.weight)
+        nn.init.xavier_uniform_(self.proj_out.weight)
+    def forward(
+        self, tokens: Tensor, *, rope_sincos: tuple[Tensor, Tensor] | None
+    ) -> Tensor:
+        """Apply dense self-attention to ``[B, N, D]`` tokens."""
+        batch, sequence_length, _width = tokens.shape
+        qkv = self.qkv(tokens)
+        q, k, v = qkv.chunk(3, dim=-1)
+        q = q.view(batch, sequence_length, self.n_heads, self.head_dim).transpose(1, 2)
+        k = k.view(batch, sequence_length, self.n_heads, self.head_dim).transpose(1, 2)
+        v = v.view(batch, sequence_length, self.n_heads, self.head_dim).transpose(1, 2)
+        q = self.q_norm(q.contiguous())
+        k = self.k_norm(k.contiguous())
+        q, k = self._apply_axial_rope_dense(q, k, rope_sincos=rope_sincos)
+        attn = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)
+        attn = (
+            attn.transpose(1, 2).contiguous().view(batch, sequence_length, self.d_model)
+        )
+        return self.proj_out(attn)
+    def _apply_axial_rope_dense(
+        self,
+        q: Tensor,
+        k: Tensor,
+        *,
+        rope_sincos: tuple[Tensor, Tensor] | None,
+    ) -> tuple[Tensor, Tensor]:
+        """Apply axial RoPE to dense Q/K tensors."""
+        if rope_sincos is None:
+            return q, k
+        sin, cos = rope_sincos
+        rope_len = int(sin.shape[-2])
+        rope_dtype = sin.dtype
+        q_dtype = q.dtype
+        k_dtype = k.dtype
+        q_rope = q.to(dtype=rope_dtype)
+        k_rope = k.to(dtype=rope_dtype)
+        match sin.dim():
+            case 2:
+                sin_b = sin.view(1, 1, rope_len, self.head_dim)
+                cos_b = cos.view(1, 1, rope_len, self.head_dim)
+            case 3:
+                sin_b = sin.view(int(q.shape[0]), 1, rope_len, self.head_dim)
+                cos_b = cos.view(int(q.shape[0]), 1, rope_len, self.head_dim)
+            case _ as unreachable:
+                raise ValueError(f"Unsupported RoPE tensor rank: {int(unreachable)}")
+        rotate = _axial_rope_rotate_fn(self.position_encoding)
+        q_span = q_rope[:, :, :rope_len, :]
+        k_span = k_rope[:, :, :rope_len, :]
+        q_head = (q_span * cos_b) + (rotate(q_span) * sin_b)
+        k_head = (k_span * cos_b) + (rotate(k_span) * sin_b)
+        q_rope = torch.cat([q_head, q_rope[:, :, rope_len:, :]], dim=2)
+        k_rope = torch.cat([k_head, k_rope[:, :, rope_len:, :]], dim=2)
+        return q_rope.to(dtype=q_dtype), k_rope.to(dtype=k_dtype)
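The update `x * cos + rotate(x) * sin` used in `_apply_axial_rope_dense` is a 2D rotation applied to each dimension pair. A small sketch on a single pair, with hypothetical helper names and an arbitrary frequency `omega`, illustrates why this makes attention scores depend only on the position offset between query and key.

```python
import math


def rope_rotate_pair(vec: tuple[float, float], angle: float) -> tuple[float, float]:
    """Apply x * cos + rotate(x) * sin to one pair, with rotate((a, b)) = (-b, a)."""
    a, b = vec
    c, s = math.cos(angle), math.sin(angle)
    return (a * c - b * s, b * c + a * s)


def dot(u: tuple[float, float], v: tuple[float, float]) -> float:
    return u[0] * v[0] + u[1] * v[1]


omega = 0.3  # illustrative angular frequency of one band
q, k = (0.8, -0.2), (0.5, 1.1)

# Both position pairs below have the same offset (3), so the scores agree.
score_a = dot(rope_rotate_pair(q, 5 * omega), rope_rotate_pair(k, 2 * omega))
score_b = dot(rope_rotate_pair(q, 13 * omega), rope_rotate_pair(k, 10 * omega))

# With zero offset the rotation cancels and the plain dot product is recovered.
score_same_pos = dot(rope_rotate_pair(q, 7 * omega), rope_rotate_pair(k, 7 * omega))
```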
+class CrossAttentionCore(nn.Module):
+    """Dense cross-attention core used by the class-token readout."""
+    query_dim: int
+    context_dim: int
+    context_extra_dim: int
+    key_extra_dim: int
+    n_heads: int
+    head_dim: int
+    attn_dim: int
+    context_in_dim: int
+    attn_dropout: float
+    kv_proj: nn.Linear
+    k_extra_proj: nn.Linear | None
+    out_proj: nn.Linear
+    q_norm_heads: RMSNorm
+    k_norm_heads: RMSNorm
+    def __init__(
+        self,
+        *,
+        query_dim: int,
+        context_dim: int,
+        n_heads: int,
+        head_dim: int,
+        context_extra_dim: int = 0,
+        key_extra_dim: int = 0,
+        attn_dropout: float = 0.0,
+    ) -> None:
+        super().__init__()
+        self.query_dim = int(query_dim)
+        self.context_dim = int(context_dim)
+        self.context_extra_dim = int(context_extra_dim)
+        self.key_extra_dim = int(key_extra_dim)
+        self.n_heads = int(n_heads)
+        self.head_dim = int(head_dim)
+        self.attn_dim = int(self.n_heads * self.head_dim)
+        self.context_in_dim = int(self.context_dim + self.context_extra_dim)
+        self.attn_dropout = float(attn_dropout)
+        self.kv_proj = nn.Linear(self.context_in_dim, 2 * self.attn_dim, bias=False)
+        if self.key_extra_dim == 0:
+            self.k_extra_proj = None
+        else:
+            self.k_extra_proj = nn.Linear(self.key_extra_dim, self.attn_dim, bias=False)
+        self.out_proj = nn.Linear(self.attn_dim, self.query_dim, bias=False)
+        self.q_norm_heads = RMSNorm(self.head_dim)
+        self.k_norm_heads = RMSNorm(self.head_dim)
+    def reset_parameters(self) -> None:
+        """Reset projections to their initialization."""
+        nn.init.xavier_uniform_(self.kv_proj.weight)
+        if self.k_extra_proj is not None:
+            nn.init.xavier_uniform_(self.k_extra_proj.weight)
+        nn.init.xavier_uniform_(self.out_proj.weight)
+    def _split_heads(self, x: Tensor) -> Tensor:
+        batch, sequence_length, _width = x.shape
+        return x.view(batch, sequence_length, self.n_heads, self.head_dim).transpose(
+            1, 2
+        )
+    def _merge_heads(self, x: Tensor) -> Tensor:
+        batch, _heads, sequence_length, _head_dim = x.shape
+        return (
+            x.transpose(1, 2).contiguous().view(batch, sequence_length, self.attn_dim)
+        )
+    def forward(
+        self,
+        q_tokens: Tensor,
+        kv_tokens: Tensor,
+        *,
+        training: bool,
+        key_extra: Tensor | None = None,
+        key_padding_mask: Tensor | None = None,
+    ) -> Tensor:
+        """Apply dense cross-attention to query and context tokens."""
+        kv = self.kv_proj(kv_tokens)
+        k, v = kv.chunk(2, dim=-1)
+        if self.k_extra_proj is not None and key_extra is not None:
+            k = k + self.k_extra_proj(key_extra)
+        q = self.q_norm_heads(self._split_heads(q_tokens).contiguous())
+        k = self.k_norm_heads(self._split_heads(k).contiguous())
+        v = self._split_heads(v).contiguous()
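+        if key_padding_mask is None:
+            attn_mask = None
+        else:
+            # Build an additive mask of shape [B, 1, 1, N_kv]. As written, keys
+            # whose ``key_padding_mask`` entry is ``True`` are kept (additive 0)
+            # and keys whose entry is ``False`` receive ``-inf``.
+            attn_mask = (~key_padding_mask).to(dtype=q.dtype)
+            attn_mask = attn_mask.view(
+                key_padding_mask.shape[0], 1, 1, key_padding_mask.shape[1]
+            )
+            attn_mask = attn_mask.masked_fill(attn_mask > 0, float("-inf"))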
+        attn = F.scaled_dot_product_attention(
+            q,
+            k,
+            v,
+            attn_mask=attn_mask,
+            dropout_p=self.attn_dropout if training else 0.0,
+            is_causal=False,
+        )
+        return self.out_proj(self._merge_heads(attn))
+__all__ = ["CrossAttentionCore", "DitSelfAttentionCore"]

dit/axial_rope2d.py ADDED

	@@ -0,0 +1,1728 @@

+from __future__ import annotations
+import math
+from dataclasses import dataclass, replace
+from enum import Enum
+from typing import Final, cast
+import torch
+from torch import Tensor, nn
+__all__ = [
+    "AxialRoPE2D",
+    "AxialRoPE2DAlphaWarpConfig",
+    "AxialRoPE2DBetaWarpConfig",
+    "AxialRoPE2DConfig",
+    "AxialRoPE2DCoordMode",
+    "AxialRoPE2DDimLayout",
+    "AxialRoPE2DDyPE",
+    "AxialRoPE2DDyPEConfig",
+    "AxialRoPE2DFrequencyAwareConfig",
+    "AxialRoPE2DNormalizeCoords",
+    "DyPERoPEMethod",
+    "build_axial_rope2d_dype",
+    "build_axial_rope2d_inference_warp_with_strength",
+    "build_axial_rope2d_with_lumina_frequency_warp",
+    "lumina_frequency_aware_periods_for_axis",
+    "set_axial_rope2d_dype_noise_time",
+]
+class AxialRoPE2DNormalizeCoords(Enum):
+    """Coordinate normalization strategy for axial 2D RoPE (DINOv3-style)."""
+    MIN = "min"
+    MAX = "max"
+    SEPARATE = "separate"
+class AxialRoPE2DCoordMode(Enum):
+    """Coordinate grid mode for axial 2D RoPE.
+    - ``DINOV3_NORMALIZED``: DINOv3-style normalized patch-centre coordinates in
+      ``[-1, 1]`` (after normalization).
+    - ``PATCH_INDICES``: Standard unnormalized patch-grid coordinates in patch
+      units (e.g., ``x in [0, W-1]``, ``y in [0, H-1]``).
+    """
+    DINOV3_NORMALIZED = "dinov3_normalized"
+    PATCH_INDICES = "patch_indices"
+class AxialRoPE2DDimLayout(Enum):
+    """Layout of angles along the head-dimension.
+    The layout must match the rotation convention used when applying RoPE to Q/K.
+    - ``HALF_SPLIT``: LLaMA-style layout compatible with ``common.rope.rotate_half``
+      (splits last dim into two halves).
+    - ``PAIR_INTERLEAVED``: EVA-02 / SpeedrunDiT-style layout compatible with an
+      adjacent-pair rotate_half (pairs consecutive dims).
+    TODO(refactor): Standardize on ``PAIR_INTERLEAVED`` throughout DiT to reduce
+    complexity and avoid layout mismatches, then delete ``HALF_SPLIT`` and any
+    related branching once the migration is complete.
+    """
+    HALF_SPLIT = "half_split"
+    PAIR_INTERLEAVED = "pair_interleaved"
+class DyPERoPEMethod(Enum):
+    """Dynamic position extrapolation method applied to inference RoPE."""
+    VISION_YARN = "vision_yarn"
+    DY_YARN = "dy_yarn"
+    DY_NTK = "dy_ntk"
+@dataclass(frozen=True)
+class AxialRoPE2DDyPEConfig:
+    """Inference-only DyPE controls for axial RoPE.
+    Args:
+        method: Dynamic extrapolation rule to apply.
+        ref_h_tokens: Training/reference token height.
+        ref_w_tokens: Training/reference token width.
+        lambda_s: Dynamic extrapolation magnitude.
+        lambda_t: Dynamic extrapolation noise-time exponent.
+        yarn_beta_0: YaRN first-ramp high rotation threshold.
+        yarn_beta_1: YaRN first-ramp low rotation threshold.
+        yarn_gamma_0: YaRN base-blend high rotation threshold.
+        yarn_gamma_1: YaRN base-blend low rotation threshold.
+        yarn_attention_scale: Apply YaRN's static attention magnitude correction.
+    """
+    method: DyPERoPEMethod
+    ref_h_tokens: int
+    ref_w_tokens: int
+    lambda_s: float = 2.0
+    lambda_t: float = 2.0
+    yarn_beta_0: float = 1.25
+    yarn_beta_1: float = 0.75
+    yarn_gamma_0: float = 16.0
+    yarn_gamma_1: float = 2.0
+    yarn_attention_scale: bool = True
+    def __post_init__(self) -> None:
+        if not isinstance(self.method, DyPERoPEMethod):
+            raise TypeError("method must be a DyPERoPEMethod")
+        if int(self.ref_h_tokens) <= 0 or int(self.ref_w_tokens) <= 0:
+            raise ValueError("ref_h_tokens and ref_w_tokens must be positive")
+        for name, value in (
+            ("lambda_s", self.lambda_s),
+            ("lambda_t", self.lambda_t),
+            ("yarn_beta_0", self.yarn_beta_0),
+            ("yarn_beta_1", self.yarn_beta_1),
+            ("yarn_gamma_0", self.yarn_gamma_0),
+            ("yarn_gamma_1", self.yarn_gamma_1),
+        ):
+            v = float(value)
+            if not math.isfinite(v) or v <= 0.0:
+                raise ValueError(f"{name} must be finite and > 0")
+        if not isinstance(self.yarn_attention_scale, bool):
+            raise TypeError("yarn_attention_scale must be a bool")
+@dataclass(frozen=True)
+class AxialRoPE2DFrequencyAwareConfig:
+    """Lumina/Next-DiT-style frequency-aware RoPE warping for one token grid.
+    This config implements a per-axis, per-band frequency warp that depends on
+    the input axis length ``L`` relative to a reference length ``L_ref``:
+    - Define the axis scale ``s = L / L_ref``.
+    - RoPE is parameterized by *periods* (wavelengths in tokens) ``period[d]``.
+      In this module's axial parameterization (with patch-index coordinates),
+      the angle for coordinate ``p`` and band ``d`` is:
+          angle(p, d) = 2π * p / period[d]
+      so the wavelength of band ``d`` is exactly ``period[d]`` tokens.
+    - Pick a *boundary wavelength* ``L_boundary`` (in tokens), expressed as a
+      trainable multiplier around the reference length:
+          L_boundary = L_ref * exp(boundary_log_multiplier)
+      The scalar ``boundary_log_multiplier`` is shared across H/W axes (and
+      initialized by this config).
+    - Define a (possibly fractional) boundary band index ``d*`` as the band
+      whose wavelength equals ``L_boundary``:
+          period(d*) = L_boundary
+      In practice we compute ``d*`` by linear interpolation in log-period space
+      (periods are geometric for both supported period parametrizations).
+    - The Lumina/Next-DiT implicit exponent ramp is then:
+          alpha[d] = clamp(d / d*, 0, 1)
+      where:
+        - high-frequency bands (small d) have alpha≈0 (extrapolation-like),
+        - low-frequency bands (large d) have alpha→1 (interpolation-like),
+        - alpha is capped at 1 to ensure we never compress a band more than
+          plain position interpolation would.
+    - Finally, warp the periods per axis:
+          period'[d] = period[d] * s ** alpha[d]
+      Equivalently, angular frequencies warp as:
+          omega'[d] = omega[d] / s ** alpha[d]
+    Notes
+    -----
+    - This warp is only meaningful for patch-index coordinates
+      (``AxialRoPE2DCoordMode.PATCH_INDICES``). Mixing it with normalized
+      coordinates would create an implicit "gauge switch"; we fail fast.
+    - The boundary multiplier is trainable by construction (it is stored as an
+      nn.Parameter inside AxialRoPE2D when this config is present).
+    """
+    ref_h_tokens: int
+    ref_w_tokens: int
+    boundary_log_multiplier_init: float
+    def __post_init__(self) -> None:
+        if int(self.ref_h_tokens) <= 0 or int(self.ref_w_tokens) <= 0:
+            raise ValueError("ref_h_tokens and ref_w_tokens must be positive")
+        init = float(self.boundary_log_multiplier_init)
+        if not math.isfinite(init):
+            raise ValueError("boundary_log_multiplier_init must be finite")
+@dataclass(frozen=True)
+class AxialRoPE2DBetaWarpConfig:
+    """Trainable beta-curve warping for axial 2D RoPE periods (per token grid).
+    This config defines a per-axis period warp that depends on the runtime axis
+    length ``L`` relative to a reference length ``L_ref``:
+        s = L / L_ref
+        period'[d] = period[d] * s ** beta[d]
+    where the per-band exponent curve beta(d) is parameterized by three
+    trainable u-space scalars (shared across H/W axes):
+        beta_hi   = beta_max * tanh(beta_hi_u)    (high-frequency endpoint, d=0)
+        beta_lo   = beta_max * tanh(beta_lo_u)    (low-frequency endpoint, d=qtr-1)
+        beta_bend = beta_max * tanh(beta_bend_u)  (mid-band bump amplitude)
+    and the per-band curve is:
+        t = d / (qtr - 1)   in [0, 1]
+        beta(t) = lerp(beta_hi, beta_lo, t) + beta_bend * 4*t*(1-t)
+    Interpretation
+    --------------
+    - ``beta(d) == 0``: identity / "extrapolation-like" (no warping; periods do not
+      change with axis length).
+    - ``beta(d) == 1``: position-interpolation-like for that band
+      (``period'[d] = period[d] * s`` so ``omega'[d] = omega[d] / s``).
+    This parameterization provides strong and smooth control over the effective
+    scaling of each frequency band, including allowing beta<0 (increasing
+    frequencies when s>1), which can be important for unnormalized RoPE bases
+    (e.g. base=10_000) where some very low-frequency bands barely rotate on
+    practical token grids.
+    Notes:
+      - This warp requires patch-index coordinates (coord_mode=PATCH_INDICES).
+      - The u parameters are stored as nn.Parameter inside AxialRoPE2D when this
+        config is present.
+    """
+    ref_h_tokens: int
+    ref_w_tokens: int
+    beta_max: float
+    beta_hi_u_init: float
+    beta_lo_u_init: float
+    beta_bend_u_init: float
+    def __post_init__(self) -> None:
+        if int(self.ref_h_tokens) <= 0 or int(self.ref_w_tokens) <= 0:
+            raise ValueError("ref_h_tokens and ref_w_tokens must be positive")
+        bmax = float(self.beta_max)
+        if not math.isfinite(bmax) or bmax <= 0.0:
+            raise ValueError("beta_max must be finite and > 0")
+        for name, value in (
+            ("beta_hi_u_init", self.beta_hi_u_init),
+            ("beta_lo_u_init", self.beta_lo_u_init),
+            ("beta_bend_u_init", self.beta_bend_u_init),
+        ):
+            v = float(value)
+            if not math.isfinite(v):
+                raise ValueError(f"{name} must be finite")
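The beta curve described in the `AxialRoPE2DBetaWarpConfig` docstring can be evaluated by hand. The following is a minimal sketch in plain Python, not the module's implementation: the function names are hypothetical, the periods are illustrative, and `qtr >= 2` is assumed so that `t = d / (qtr - 1)` is defined.

```python
import math


def beta_curve(
    qtr: int, beta_max: float, hi_u: float, lo_u: float, bend_u: float
) -> list[float]:
    """Evaluate beta(t) = lerp(beta_hi, beta_lo, t) + beta_bend * 4 t (1 - t)."""
    beta_hi = beta_max * math.tanh(hi_u)
    beta_lo = beta_max * math.tanh(lo_u)
    beta_bend = beta_max * math.tanh(bend_u)
    betas = []
    for d in range(qtr):
        t = d / (qtr - 1)  # assumes qtr >= 2
        betas.append((1.0 - t) * beta_hi + t * beta_lo + beta_bend * 4.0 * t * (1.0 - t))
    return betas


def warp_periods(
    periods: list[float], axis_len: int, ref_axis_len: int, betas: list[float]
) -> list[float]:
    """Apply period'[d] = period[d] * s ** beta[d] with s = L / L_ref."""
    s = axis_len / ref_axis_len
    return [p * s**b for p, b in zip(periods, betas)]


# beta_hi = 0 (identity at the highest frequency), beta_lo = 1 (interpolation-like
# at the lowest frequency), no mid-band bump.
betas = beta_curve(qtr=5, beta_max=2.0, hi_u=0.0, lo_u=math.atanh(0.5), bend_u=0.0)
warped = warp_periods([2.0, 4.0, 8.0, 16.0, 32.0], axis_len=32, ref_axis_len=16, betas=betas)
```

With these settings the first band is left unchanged and the last band's period doubles along with the axis length, matching the two interpretations listed in the docstring.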
+@dataclass(frozen=True)
+class AxialRoPE2DAlphaWarpConfig:
+    """Per-band power-law warping of axial 2D RoPE frequencies (shared across axes).
+    This config warps RoPE frequencies per band using a learned exponent vector
+    ``alpha[d]`` shared across H/W axes:
+        f'[d] = f[d] * s ** alpha[d]        where s = L / L_ref
+    Since this module parameterizes angles via periods ``period[d]`` with
+    ``f[d] ∝ 1 / period[d]``, the equivalent period warp implemented in AxialRoPE2D is:
+        period'[d] = period[d] / s ** alpha[d]
+    Notes:
+      - This warp requires patch-index coordinates (coord_mode=PATCH_INDICES).
+      - ``alpha`` is stored as an unconstrained nn.Parameter vector of length Q
+        (bands per axis), initialized to ``alpha_init`` for all bands.
+    """
+    ref_h_tokens: int
+    ref_w_tokens: int
+    alpha_init: float
+    def __post_init__(self) -> None:
+        if int(self.ref_h_tokens) <= 0 or int(self.ref_w_tokens) <= 0:
+            raise ValueError("ref_h_tokens and ref_w_tokens must be positive")
+        init = float(self.alpha_init)
+        if not math.isfinite(init):
+            raise ValueError("alpha_init must be finite")
+@dataclass(frozen=True)
+class AxialRoPE2DConfig:
+    """Configuration for axial 2D RoPE sin/cos generation.
+    This module supports two coordinate conventions via ``coord_mode``:
+    - ``DINOV3_NORMALIZED``: DINOv3-style normalized patch-centre coordinates in
+      ``[-1, 1]`` (after normalization).
+    - ``PATCH_INDICES``: Standard unnormalized patch-grid coordinates in patch
+      units (e.g., ``x in [0, W-1]``).
+    Period parametrization
+    ----------------------
+    The periods parametrization matches DINOv3:
+    - Provide either `base` (and leave `min_period/max_period` unset), or
+    - Provide both `min_period` and `max_period` (and set `base=None`).
+    """
+    base: float | None = 100.0
+    min_period: float | None = None
+    max_period: float | None = None
+    coord_mode: AxialRoPE2DCoordMode = AxialRoPE2DCoordMode.DINOV3_NORMALIZED
+    normalize_coords: AxialRoPE2DNormalizeCoords = AxialRoPE2DNormalizeCoords.MAX
+    dim_layout: AxialRoPE2DDimLayout = AxialRoPE2DDimLayout.HALF_SPLIT
+    angle_multiplier: float = 2.0 * float(math.pi)
+    coord_offset: float = 0.5
+    frequency_aware: AxialRoPE2DFrequencyAwareConfig | None = None
+    beta_warp: AxialRoPE2DBetaWarpConfig | None = None
+    alpha_warp: AxialRoPE2DAlphaWarpConfig | None = None
+    def __post_init__(self) -> None:
+        both_periods = self.min_period is not None and self.max_period is not None
+        if (self.base is None and not both_periods) or (
+            self.base is not None and both_periods
+        ):
+            raise ValueError(
+                "AxialRoPE2DConfig requires either base!=None, or both min_period and max_period."
+            )
+        if self.base is not None and float(self.base) <= 0.0:
+            raise ValueError("AxialRoPE2DConfig.base must be positive when provided")
+        if self.min_period is not None and float(self.min_period) <= 0.0:
+            raise ValueError(
+                "AxialRoPE2DConfig.min_period must be positive when provided"
+            )
+        if self.max_period is not None and float(self.max_period) <= 0.0:
+            raise ValueError(
+                "AxialRoPE2DConfig.max_period must be positive when provided"
+            )
+        if self.min_period is not None and self.max_period is not None:
+            if float(self.max_period) <= float(self.min_period):
+                raise ValueError("AxialRoPE2DConfig.max_period must be > min_period")
+        if not isinstance(self.coord_mode, AxialRoPE2DCoordMode):
+            raise TypeError(
+                "AxialRoPE2DConfig.coord_mode must be an AxialRoPE2DCoordMode"
+            )
+        if not isinstance(self.normalize_coords, AxialRoPE2DNormalizeCoords):
+            raise TypeError(
+                "AxialRoPE2DConfig.normalize_coords must be an AxialRoPE2DNormalizeCoords"
+            )
+        if not isinstance(self.dim_layout, AxialRoPE2DDimLayout):
+            raise TypeError(
+                "AxialRoPE2DConfig.dim_layout must be an AxialRoPE2DDimLayout"
+            )
+        mult = float(self.angle_multiplier)
+        if not math.isfinite(mult) or mult <= 0.0:
+            raise ValueError(
+                "AxialRoPE2DConfig.angle_multiplier must be finite and > 0"
+            )
+        off = float(self.coord_offset)
+        if not math.isfinite(off):
+            raise ValueError("AxialRoPE2DConfig.coord_offset must be finite")
+        if self.frequency_aware is not None and not isinstance(
+            self.frequency_aware, AxialRoPE2DFrequencyAwareConfig
+        ):
+            raise TypeError(
+                "AxialRoPE2DConfig.frequency_aware must be an AxialRoPE2DFrequencyAwareConfig"
+            )
+        if self.beta_warp is not None and not isinstance(
+            self.beta_warp, AxialRoPE2DBetaWarpConfig
+        ):
+            raise TypeError(
+                "AxialRoPE2DConfig.beta_warp must be an AxialRoPE2DBetaWarpConfig"
+            )
+        if self.alpha_warp is not None and not isinstance(
+            self.alpha_warp, AxialRoPE2DAlphaWarpConfig
+        ):
+            raise TypeError(
+                "AxialRoPE2DConfig.alpha_warp must be an AxialRoPE2DAlphaWarpConfig"
+            )
+        warp_count = (
+            int(self.frequency_aware is not None)
+            + int(self.beta_warp is not None)
+            + int(self.alpha_warp is not None)
+        )
+        if warp_count > 1:
+            raise ValueError(
+                "AxialRoPE2DConfig requires at most one of frequency_aware, beta_warp, or alpha_warp"
+            )
+        if self.frequency_aware is not None and (
+            self.coord_mode is not AxialRoPE2DCoordMode.PATCH_INDICES
+        ):
+            raise ValueError(
+                "AxialRoPE2D frequency-aware warping requires coord_mode=PATCH_INDICES"
+            )
+        if self.beta_warp is not None and (
+            self.coord_mode is not AxialRoPE2DCoordMode.PATCH_INDICES
+        ):
+            raise ValueError("AxialRoPE2D beta warp requires coord_mode=PATCH_INDICES")
+        if self.alpha_warp is not None and (
+            self.coord_mode is not AxialRoPE2DCoordMode.PATCH_INDICES
+        ):
+            raise ValueError("AxialRoPE2D alpha warp requires coord_mode=PATCH_INDICES")
+_AXIAL_COORDS_CACHE: dict[
+    tuple[
+        int, int, torch.device, AxialRoPE2DCoordMode, AxialRoPE2DNormalizeCoords, float
+    ],
+    Tensor,
+] = {}
+def _get_dinov3_normalized_coords(
+    H: int,
+    W: int,
+    *,
+    device: torch.device,
+    normalize: AxialRoPE2DNormalizeCoords,
+    offset: float,
+) -> Tensor:
+    """Return DINOv3-style flattened coords in [-1, 1] with shape [HW, 2]."""
+    if H <= 0 or W <= 0:
+        raise ValueError("H and W must be positive for axial RoPE coords")
+    key = (
+        int(H),
+        int(W),
+        device,
+        AxialRoPE2DCoordMode.DINOV3_NORMALIZED,
+        normalize,
+        float(offset),
+    )
+    cached = _AXIAL_COORDS_CACHE.get(key)
+    if cached is not None:
+        return cached
+    start = float(offset)
+    end_h = start + float(int(H))
+    end_w = start + float(int(W))
+    match normalize:
+        case AxialRoPE2DNormalizeCoords.MAX:
+            denom = float(max(int(H), int(W)))
+            coords_h = (
+                torch.arange(start, end_h, device=device, dtype=torch.float32) / denom
+            )
+            coords_w = (
+                torch.arange(start, end_w, device=device, dtype=torch.float32) / denom
+            )
+        case AxialRoPE2DNormalizeCoords.MIN:
+            denom = float(min(int(H), int(W)))
+            coords_h = (
+                torch.arange(start, end_h, device=device, dtype=torch.float32) / denom
+            )
+            coords_w = (
+                torch.arange(start, end_w, device=device, dtype=torch.float32) / denom
+            )
+        case AxialRoPE2DNormalizeCoords.SEPARATE:
+            coords_h = torch.arange(
+                start, end_h, device=device, dtype=torch.float32
+            ) / float(int(H))
+            coords_w = torch.arange(
+                start, end_w, device=device, dtype=torch.float32
+            ) / float(int(W))
+        case _ as unreachable:  # pragma: no cover - defensive
+            raise RuntimeError(f"Unsupported normalize_coords: {unreachable}")
+    coords = torch.stack(torch.meshgrid(coords_h, coords_w, indexing="ij"), dim=-1)
+    coords = coords.flatten(0, 1)
+    coords = 2.0 * coords - 1.0
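+    # Bypass the cache while compiling: torch.compile cannot trace
+    # `torch.is_inference_mode_enabled()` and should not record Python-side
+    # cache mutations in the graph.
+    if torch.compiler.is_compiling():
+        return coords
+    # Do not cache tensors created under inference mode; a later
+    # autograd-enabled call would otherwise receive an inference tensor.
+    if torch.is_inference_mode_enabled():
+        return coords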
+    _AXIAL_COORDS_CACHE[key] = coords
+    return coords
+def _get_patch_index_coords(
+    H: int,
+    W: int,
+    *,
+    device: torch.device,
+    offset: float,
+) -> Tensor:
+    """Return unnormalized patch-grid coords with shape [HW, 2] and (y, x) columns."""
+    if H <= 0 or W <= 0:
+        raise ValueError("H and W must be positive for axial RoPE coords")
+    key = (
+        int(H),
+        int(W),
+        device,
+        AxialRoPE2DCoordMode.PATCH_INDICES,
+        AxialRoPE2DNormalizeCoords.MAX,
+        float(offset),
+    )
+    cached = _AXIAL_COORDS_CACHE.get(key)
+    if cached is not None:
+        return cached
+    start = float(offset)
+    end_h = start + float(int(H))
+    end_w = start + float(int(W))
+    coords_h = torch.arange(start, end_h, device=device, dtype=torch.float32)
+    coords_w = torch.arange(start, end_w, device=device, dtype=torch.float32)
+    coords = torch.stack(torch.meshgrid(coords_h, coords_w, indexing="ij"), dim=-1)
+    coords = coords.flatten(0, 1)
+    # Same cache-bypass rules as in `_get_dinov3_normalized_coords`.
+    if torch.compiler.is_compiling():
+        return coords
+    if torch.is_inference_mode_enabled():
+        return coords
+    _AXIAL_COORDS_CACHE[key] = coords
+    return coords
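To make the two coordinate conventions concrete, here is a small sketch that reproduces the arithmetic of the two helpers above on plain Python lists for a 2x4 token grid. The function names are hypothetical; the default offset of 0.5 and the `MAX` normalization are taken from `AxialRoPE2DConfig`.

```python
def normalized_axis_coords(length: int, denom: float, offset: float = 0.5) -> list[float]:
    """Patch centres divided by denom, then mapped to [-1, 1] via 2c - 1."""
    return [2.0 * ((i + offset) / denom) - 1.0 for i in range(length)]


def patch_index_axis_coords(length: int, offset: float = 0.5) -> list[float]:
    """Unnormalized patch coordinates in patch units."""
    return [i + offset for i in range(length)]


H, W = 2, 4
denom = float(max(H, W))  # MAX normalization: both axes share the larger length

ys = normalized_axis_coords(H, denom)
xs = normalized_axis_coords(W, denom)

# Row-major flattening with (y, x) columns, as with meshgrid(indexing="ij").
normalized_grid = [(y, x) for y in ys for x in xs]
index_grid = [(y, x) for y in patch_index_axis_coords(H) for x in patch_index_axis_coords(W)]
```

Under `MAX` normalization the shorter axis only covers part of `[-1, 1]`, which preserves the aspect ratio of the grid. `SEPARATE` would instead stretch each axis to the full range.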
+def _lumina_boundary_band_index(
+    *,
+    periods: Tensor,
+    boundary_wavelength: Tensor,
+) -> Tensor:
+    """Return the fractional boundary band index d* for a given boundary wavelength.
+    This implements the Lumina/Next-DiT definition:
+        period(d*) = boundary_wavelength
+    We compute d* by linear interpolation in log-period space. For the supported
+    period parameterizations, periods are geometric and log(period) is linear in
+    band index.
+    Args:
+        periods: 1D float tensor of length Q containing monotonically increasing
+            periods in tokens.
+        boundary_wavelength: Scalar positive float tensor giving the desired
+            boundary wavelength in tokens.
+    Returns:
+        Scalar float32 tensor giving the (possibly fractional) boundary index d*.
+    Raises:
+        ValueError: If periods are invalid or the boundary is outside valid range
+            for a well-defined positive d*.
+    """
+    if periods.dim() != 1:
+        raise ValueError("periods must be 1D for boundary band index")
+    if int(periods.numel()) < 2:
+        raise ValueError("periods must have length >= 2 for boundary band index")
+    if boundary_wavelength.dim() != 0:
+        raise ValueError("boundary_wavelength must be a scalar tensor")
+    if not torch.isfinite(boundary_wavelength).item():
+        raise ValueError("boundary_wavelength must be finite")
+    if float(boundary_wavelength.item()) <= 0.0:
+        raise ValueError("boundary_wavelength must be > 0")
+    periods_f = periods.to(dtype=torch.float32)
+    if not torch.isfinite(periods_f).all().item():
+        raise ValueError("periods must be finite for boundary band index")
+    if float(periods_f[0].item()) <= 0.0:
+        raise ValueError("periods must be positive for boundary band index")
+    if not (periods_f[1:] > periods_f[:-1]).all().item():
+        raise ValueError("periods must be strictly increasing for boundary band index")
+    log_p0 = torch.log(periods_f[0])
+    log_p1 = torch.log(periods_f[-1])
+    denom = log_p1 - log_p0
+    if float(denom.item()) <= 0.0:
+        raise ValueError("Invalid periods range for boundary band index")
+    log_boundary = torch.log(boundary_wavelength.to(dtype=torch.float32))
+    q = int(periods_f.numel())
+    d_star = (float(q - 1) * (log_boundary - log_p0)) / denom
+    if not torch.isfinite(d_star).item():
+        raise ValueError("Computed non-finite boundary band index d*")
+    if float(d_star.item()) <= 0.0:
+        raise ValueError(
+            "Boundary wavelength implies d* <= 0; increase the boundary wavelength "
+            "(or its multiplier) so that it exceeds the period of the first band."
+        )
+    return d_star
+def _lumina_alpha_ramp(
+    *,
+    qtr: int,
+    d_star: Tensor,
+    device: torch.device,
+) -> Tensor:
+    """Return alpha[d] = clamp(d / d*, 0, 1) for d in [0, qtr).
+    Args:
+        qtr: Number of RoPE bands per axis (Q).
+        d_star: Scalar positive float tensor boundary index d*.
+        device: Device for the returned alpha tensor.
+    Returns:
+        Float32 tensor of shape [Q] with values in [0, 1].
+    """
+    if int(qtr) <= 0:
+        raise ValueError("qtr must be positive for alpha ramp")
+    if d_star.dim() != 0:
+        raise ValueError("d_star must be a scalar tensor for alpha ramp")
+    if float(d_star.item()) <= 0.0:
+        raise ValueError("d_star must be > 0 for alpha ramp")
+    d = torch.arange(int(qtr), device=device, dtype=torch.float32)
+    alpha = d / d_star.to(device=device, dtype=torch.float32)
+    return torch.clamp(alpha, min=0.0, max=1.0)
+def lumina_frequency_aware_periods_for_axis(
+    *,
+    periods: Tensor,
+    axis_len: int,
+    ref_axis_len: int,
+    boundary_log_multiplier: Tensor,
+    angle_multiplier: float,
+) -> Tensor:
+    """Return Lumina/Next-DiT frequency-aware warped periods for one axis.
+    Implements:
+      s = axis_len / ref_axis_len
+      L_boundary = ref_axis_len * exp(boundary_log_multiplier)
+      d* = boundary band index where period(d*) = L_boundary
+      alpha[d] = clamp(d / d*, 0, 1)
+      period'[d] = period[d] * s**alpha[d]
+    Notes on ``angle_multiplier``
+    -----------------------------
+    This module parameterizes angles as:
+        angle(p, d) = angle_multiplier * p / period[d]
+    The *wavelength* (period in tokens) is the delta in ``p`` that increases
+    the angle by ``2π``:
+        wavelength[d] = 2π * period[d] / angle_multiplier
+    Lumina/Next-DiT define the boundary by matching *wavelength* to the
+    reference axis length. We therefore convert the boundary wavelength
+    ``L_boundary`` into a boundary period via:
+        period_boundary = (angle_multiplier / 2π) * L_boundary
+    When ``angle_multiplier == 2π`` (the DINOv3-style parameterization), this
+    reduces to ``period_boundary == L_boundary``.
+    Args:
+        periods: Base periods ``[Q]``. These equal wavelengths in tokens only
+            when ``angle_multiplier == 2π``.
+        axis_len: Input axis length ``L`` in tokens.
+        ref_axis_len: Reference axis length ``L_ref`` in tokens.
+        boundary_log_multiplier: Scalar tensor; shared trainable log-multiplier.
+        angle_multiplier: RoPE angle multiplier used when converting periods to
+            physical wavelengths in tokens.
+    Returns:
+        Warped periods ``[Q]`` as float32.
+    Raises:
+        ValueError: If inputs are malformed or imply an invalid boundary index.
+    """
+    if int(axis_len) <= 0:
+        raise ValueError("axis_len must be positive for frequency-aware periods")
+    if int(ref_axis_len) <= 0:
+        raise ValueError("ref_axis_len must be positive for frequency-aware periods")
+    if boundary_log_multiplier.dim() != 0:
+        raise ValueError("boundary_log_multiplier must be a scalar tensor")
+    if not torch.isfinite(boundary_log_multiplier).item():
+        raise ValueError("boundary_log_multiplier must be finite")
+    mult = float(angle_multiplier)
+    if not math.isfinite(mult) or mult <= 0.0:
+        raise ValueError("angle_multiplier must be finite and > 0")
+    device = periods.device
+    qtr = int(periods.numel())
+    s = float(int(axis_len)) / float(int(ref_axis_len))
+    if not math.isfinite(s) or s <= 0.0:
+        raise ValueError("axis_len/ref_axis_len must be finite and > 0")
+    boundary_wavelength = float(int(ref_axis_len)) * torch.exp(
+        boundary_log_multiplier.to(device=device, dtype=torch.float32)
+    )
+    boundary_period = (mult / (2.0 * float(math.pi))) * boundary_wavelength
+    d_star = _lumina_boundary_band_index(
+        periods=periods, boundary_wavelength=boundary_period
+    )
+    alpha = _lumina_alpha_ramp(qtr=qtr, d_star=d_star, device=device)
+    scale = torch.pow(torch.tensor(s, device=device, dtype=torch.float32), alpha)
+    return periods.to(device=device, dtype=torch.float32) * scale
+def build_axial_rope2d_with_lumina_frequency_warp(
+    base: AxialRoPE2D,
+    *,
+    ref_h_tokens: int,
+    ref_w_tokens: int,
+    boundary_log_multiplier: float | None,
+    boundary_band_multiplier: float | None,
+) -> AxialRoPE2D:
+    """Return an AxialRoPE2D module that applies Lumina-style frequency warping.
+    This helper is intended for inference-time experimentation on checkpoints
+    that were trained without frequency-aware warping (e.g.
+    ``position_encoding=ROPE_2D_AXIAL_UNNORMALIZED``). It constructs a new
+    AxialRoPE2D instance that:
+      - Keeps the base RoPE periods and layout identical to ``base``.
+      - Applies Lumina/Next-DiT per-axis warping based on the runtime token
+        lengths ``H`` and ``W`` relative to reference lengths.
+      - Uses a fixed (non-trainable) scalar boundary multiplier for inference.
+    Args:
+        base: Existing AxialRoPE2D instance from a loaded model.
+        ref_h_tokens: Reference H token length (L_ref,h).
+        ref_w_tokens: Reference W token length (L_ref,w).
+        boundary_log_multiplier: Optional log multiplier applied to reference
+            lengths to define the boundary wavelength. Use 0.0 for "boundary at
+            L_ref". Mutually exclusive with boundary_band_multiplier.
+        boundary_band_multiplier: Optional multiplier that directly selects the
+            boundary band index d* relative to the lowest-frequency band index
+            (qtr-1). Concretely, with qtr bands per axis:
+                d* = boundary_band_multiplier * (qtr - 1)
+            This lets you move the transition point in frequency space:
+            - smaller values => more bands become PI-like (more interpolation)
+            - larger values => fewer bands become PI-like (more extrapolation)
+            When provided, we compute the implied boundary wavelength and store
+            ``log(boundary_wavelength / ref_h_tokens)`` as the module's
+            boundary_log_multiplier. Requires ``ref_h_tokens == ref_w_tokens``.
+    Returns:
+        New AxialRoPE2D instance on the same device as ``base``.
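+    Raises:
+        TypeError: If base is not an AxialRoPE2D.
+        ValueError: If base does not use PATCH_INDICES coordinates, if not
+            exactly one of the two boundary arguments is given, or if the
+            boundary arguments or base periods are invalid.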
+    """
+    if not isinstance(base, AxialRoPE2D):
+        raise TypeError("base must be an AxialRoPE2D")
+    if base.cfg.coord_mode is not AxialRoPE2DCoordMode.PATCH_INDICES:
+        raise ValueError(
+            "Lumina frequency-aware warping requires coord_mode=PATCH_INDICES"
+        )
+    if (boundary_log_multiplier is None) == (boundary_band_multiplier is None):
+        raise ValueError(
+            "Provide exactly one of boundary_log_multiplier or boundary_band_multiplier"
+        )
+    if int(ref_h_tokens) <= 0 or int(ref_w_tokens) <= 0:
+        raise ValueError("ref_h_tokens and ref_w_tokens must be positive")
+    resolved_log_multiplier: float
+    if boundary_band_multiplier is not None:
+        if int(ref_h_tokens) != int(ref_w_tokens):
+            raise ValueError(
+                "boundary_band_multiplier requires ref_h_tokens == ref_w_tokens when using a shared scalar boundary"
+            )
+        mult = float(boundary_band_multiplier)
+        if not math.isfinite(mult) or mult <= 0.0:
+            raise ValueError("boundary_band_multiplier must be finite and > 0")
+        qtr = int(base.periods.numel())
+        if qtr < 2:
+            raise ValueError(
+                "AxialRoPE2D periods length must be >= 2 for boundary band selection"
+            )
+        # Solve for the boundary wavelength implied by choosing d* directly.
+        #
+        # We use geometric interpolation in log-period space:
+        #   log(period(d*)) = log(period0) + (d*/(qtr-1)) * (log(period_max) - log(period0))
+        # with:
+        #   d* = boundary_band_multiplier * (qtr-1)
+        #
+        # This allows d* outside the trained band range (multiplier > 1), which
+        # corresponds to pushing the transition beyond the lowest-frequency band.
+        with torch.no_grad():
+            periods_f = base.periods.to(dtype=torch.float32, device=torch.device("cpu"))
+            if not (periods_f[1:] > periods_f[:-1]).all().item():
+                raise ValueError(
+                    "base.periods must be strictly increasing for boundary band selection"
+                )
+            log_p0 = float(torch.log(periods_f[0]).item())
+            log_p1 = float(torch.log(periods_f[-1]).item())
+        d_star = mult * float(qtr - 1)
+        log_boundary_period = log_p0 + (d_star / float(qtr - 1)) * (log_p1 - log_p0)
+        boundary_period = math.exp(log_boundary_period)
+        angle_mult = float(base.cfg.angle_multiplier)
+        if not math.isfinite(angle_mult) or angle_mult <= 0.0:
+            raise ValueError("base.cfg.angle_multiplier must be finite and > 0")
+        boundary_wavelength = (2.0 * float(math.pi) / angle_mult) * boundary_period
+        resolved_log_multiplier = math.log(
+            boundary_wavelength / float(int(ref_h_tokens))
+        )
+    else:
+        if boundary_log_multiplier is None:  # pragma: no cover - validated above
+            raise RuntimeError("boundary_log_multiplier missing despite validation")
+        resolved_log_multiplier = float(boundary_log_multiplier)
+    freq_cfg = AxialRoPE2DFrequencyAwareConfig(
+        ref_h_tokens=int(ref_h_tokens),
+        ref_w_tokens=int(ref_w_tokens),
+        boundary_log_multiplier_init=resolved_log_multiplier,
+    )
+    cfg = replace(base.cfg, frequency_aware=freq_cfg, beta_warp=None, alpha_warp=None)
+    device = base.periods.device
+    warped = AxialRoPE2D(head_dim=int(base.head_dim), cfg=cfg).to(device=device)
+    with torch.no_grad():
+        warped.periods.copy_(base.periods.to(device=device, dtype=torch.float32))
+        if warped.boundary_log_multiplier is None:  # pragma: no cover - defensive
+            raise RuntimeError("Expected boundary_log_multiplier to be initialized")
+        warped.boundary_log_multiplier.copy_(
+            torch.tensor(resolved_log_multiplier, device=device, dtype=torch.float32)
+        )
+        warped.boundary_log_multiplier.requires_grad_(False)
+    return warped
+def build_axial_rope2d_inference_warp_with_strength(
+    base: AxialRoPE2D,
+    *,
+    ref_h_tokens: int,
+    ref_w_tokens: int,
+    beta_hi_u: float,
+    beta_lo_u: float,
+    beta_bend_u: float,
+    beta_max: float,
+) -> AxialRoPE2D:
+    """Build an inference-only RoPE warp parameterized by a 3-knob beta(t) curve.
+    This helper is meant for notebook experimentation on checkpoints trained
+    with patch-index axial RoPE (e.g. ``position_encoding=ROPE_2D_AXIAL_UNNORMALIZED``).
+    We warp per-axis RoPE periods (wavelengths, in tokens) as:
+        period'[d] = period[d] * s ** beta[d]      where s = L / L_ref
+    with a smooth exponent curve beta(d) over bands. Unlike a strict
+    interpolation-only exponent (0..1), beta is allowed to be negative or > 1,
+    which is important for unnormalized RoPE (e.g. base=10_000) where some very
+    low-frequency bands are effectively "dead" on practical token grids unless
+    their frequencies can be increased (beta < 0).
+    Knobs (bounded via u-space)
+    ---------------------------
+    We use three unconstrained parameters (u-space) which map to bounded beta
+    values via tanh:
+        beta_hi   = beta_max * tanh(beta_hi_u)    (high-frequency endpoint, d=0)
+        beta_lo   = beta_max * tanh(beta_lo_u)    (low-frequency endpoint, d=qtr-1)
+        beta_bend = beta_max * tanh(beta_bend_u)  ("bump" amplitude in the middle)
+    Then define the per-band curve over t in [0,1] (high -> low frequency):
+        t = d / (qtr - 1)
+        beta(t) = lerp(beta_hi, beta_lo, t) + beta_bend * 4*t*(1-t)
+    The bump term is 0 at the endpoints and peaks at 1 at t=0.5.
+    Notes
+    -----
+    - This wrapper is inference-only: it is not saved in checkpoints.
+    - It requires patch-index coordinates (no normalized "gauge").
+    - It preserves the base module's periods and layout exactly.
+    Args:
+        base: Existing AxialRoPE2D instance from a loaded model.
+        ref_h_tokens: Reference H token length (L_ref,h).
+        ref_w_tokens: Reference W token length (L_ref,w).
+        beta_hi_u: Unconstrained u for beta_hi.
+        beta_lo_u: Unconstrained u for beta_lo.
+        beta_bend_u: Unconstrained u for beta_bend (mid-band bump).
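+        beta_max: Bound (> 0) on the magnitude of each of beta_hi, beta_lo and
+            beta_bend. Because the bump adds to the lerp, |beta(t)| can approach
+            2 * beta_max mid-band.
+    Returns:
+        An AxialRoPE2D instance whose forward applies the inference-only warp.
+    Raises:
+        TypeError: If base is not an AxialRoPE2D.
+        ValueError: If base does not use PATCH_INDICES coordinates, or if any
+            numeric argument is non-finite or out of range.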
+    """
+    if not isinstance(base, AxialRoPE2D):
+        raise TypeError("base must be an AxialRoPE2D")
+    if base.cfg.coord_mode is not AxialRoPE2DCoordMode.PATCH_INDICES:
+        raise ValueError(
+            "Inference freq-warp requires base.cfg.coord_mode=PATCH_INDICES"
+        )
+    if int(ref_h_tokens) <= 0 or int(ref_w_tokens) <= 0:
+        raise ValueError("ref_h_tokens and ref_w_tokens must be positive")
+    hi_u = float(beta_hi_u)
+    lo_u = float(beta_lo_u)
+    bend_u = float(beta_bend_u)
+    if not math.isfinite(hi_u):
+        raise ValueError("beta_hi_u must be finite")
+    if not math.isfinite(lo_u):
+        raise ValueError("beta_lo_u must be finite")
+    if not math.isfinite(bend_u):
+        raise ValueError("beta_bend_u must be finite")
+    bmax = float(beta_max)
+    if not math.isfinite(bmax) or bmax <= 0.0:
+        raise ValueError("beta_max must be finite and > 0")
+    class _AxialRoPE2DInferenceWarp(AxialRoPE2D):
+        """Inference-only axial RoPE variant with beta-curve knobs."""
+        def __init__(self, *, device: torch.device) -> None:
+            super().__init__(head_dim=int(base.head_dim), cfg=base.cfg)
+            self.ref_h_tokens: Final[int] = int(ref_h_tokens)
+            self.ref_w_tokens: Final[int] = int(ref_w_tokens)
+            # Store the knobs as non-persistent buffers so the notebook can
+            # mutate them by replacing the module. The parent constructor
+            # already registers ``beta_*_u`` as parameters (``None`` placeholders
+            # when ``cfg.beta_warp`` is unset), and ``register_buffer`` rejects
+            # names that already exist, so drop those entries first.
+            for name, value in (
+                ("beta_hi_u", hi_u),
+                ("beta_lo_u", lo_u),
+                ("beta_bend_u", bend_u),
+                ("beta_max", bmax),
+            ):
+                if name in self._parameters:
+                    delattr(self, name)
+                self.register_buffer(
+                    name,
+                    torch.tensor(float(value), dtype=torch.float32),
+                    persistent=False,
+                )
+            self.to(device=device)
+            with torch.no_grad():
+                self.periods.copy_(
+                    base.periods.detach().to(device=device, dtype=torch.float32)
+                )
+        def forward(
+            self,
+            *,
+            H: int,
+            W: int,
+            scales: Tensor | None,
+        ) -> tuple[Tensor, Tensor]:
+            if scales is not None:
+                raise ValueError("Inference freq-warp does not support dilation scales")
+            if int(H) <= 0 or int(W) <= 0:
+                raise ValueError("H and W must be positive for axial RoPE")
+            device = self.periods.device
+            offset = float(self.cfg.coord_offset)
+            coords = _get_patch_index_coords(
+                int(H), int(W), device=device, offset=offset
+            )
+            if coords.dim() != 2 or coords.shape[1] != 2:
+                raise RuntimeError("Axial RoPE coords must have shape [HW, 2]")
+            qtr = int(self.periods.numel())
+            if qtr <= 0:
+                raise RuntimeError("Axial RoPE periods length must be positive")
+            beta_max_t = cast("Tensor", self.beta_max).to(
+                device=device, dtype=torch.float32
+            )
+            beta_hi = beta_max_t * torch.tanh(
+                cast("Tensor", self.beta_hi_u).to(device=device, dtype=torch.float32)
+            )
+            beta_lo = beta_max_t * torch.tanh(
+                cast("Tensor", self.beta_lo_u).to(device=device, dtype=torch.float32)
+            )
+            beta_bend = beta_max_t * torch.tanh(
+                cast("Tensor", self.beta_bend_u).to(device=device, dtype=torch.float32)
+            )
+            if qtr == 1:
+                beta = beta_hi[None]
+            else:
+                t = torch.arange(int(qtr), device=device, dtype=torch.float32) / float(
+                    qtr - 1
+                )
+                bump = 4.0 * t * (1.0 - t)
+                beta = (1.0 - t) * beta_hi + t * beta_lo + beta_bend * bump
+            s_h = float(int(H)) / float(int(self.ref_h_tokens))
+            s_w = float(int(W)) / float(int(self.ref_w_tokens))
+            if (
+                not math.isfinite(s_h)
+                or s_h <= 0.0
+                or not math.isfinite(s_w)
+                or s_w <= 0.0
+            ):
+                raise ValueError(
+                    "H/ref_h_tokens and W/ref_w_tokens must be finite and > 0"
+                )
+            periods_h = self.periods * torch.pow(
+                torch.tensor(s_h, device=device, dtype=torch.float32), beta
+            )
+            periods_w = self.periods * torch.pow(
+                torch.tensor(s_w, device=device, dtype=torch.float32), beta
+            )
+            axis_periods = torch.stack([periods_h, periods_w], dim=0)  # [2, Q]
+            angles = (
+                float(self.cfg.angle_multiplier)
+                * coords[:, :, None].to(dtype=torch.float32)
+                / axis_periods[None, :, :].to(dtype=torch.float32)
+            )
+            match self.cfg.dim_layout:
+                case AxialRoPE2DDimLayout.HALF_SPLIT:
+                    angles = angles.flatten(1, 2).repeat(1, 2)
+                case AxialRoPE2DDimLayout.PAIR_INTERLEAVED:
+                    angles = angles.repeat_interleave(2, dim=-1).flatten(1, 2)
+                case _ as unreachable:  # pragma: no cover - defensive
+                    raise RuntimeError(f"Unsupported dim_layout: {unreachable}")
+            if angles.shape != (int(H) * int(W), int(self.head_dim)):
+                raise RuntimeError(
+                    "Unexpected angles shape in inference freq-warp: "
+                    f"{tuple(angles.shape)} for H={int(H)} W={int(W)}"
+                )
+            return torch.sin(angles), torch.cos(angles)
+    return _AxialRoPE2DInferenceWarp(device=base.periods.device)
+class AxialRoPE2D(nn.Module):
+    """DINOv3-style axial 2D RoPE sin/cos generator.
+    The base periods are fixed by ``AxialRoPE2DConfig``. Optionally, this module
+    can include learnable scalar parameters when using:
+      - ``frequency_aware`` (boundary_log_multiplier), or
+      - ``beta_warp`` (beta_hi_u/beta_lo_u/beta_bend_u), or
+      - ``alpha_warp`` (alpha per-band exponents).
+    """
+    periods: Tensor
+    def __init__(self, *, head_dim: int, cfg: AxialRoPE2DConfig) -> None:
+        super().__init__()
+        if int(head_dim) <= 0:
+            raise ValueError("head_dim must be positive for AxialRoPE2D")
+        if int(head_dim) % 4 != 0:
+            raise ValueError(
+                "AxialRoPE2D requires head_dim % 4 == 0 (DINOv3 constraint); "
+                f"got head_dim={int(head_dim)}"
+            )
+        if not isinstance(cfg, AxialRoPE2DConfig):
+            raise TypeError("cfg must be an AxialRoPE2DConfig for AxialRoPE2D")
+        self.head_dim: Final[int] = int(head_dim)
+        self.cfg: Final[AxialRoPE2DConfig] = cfg
+        self._d_head: Final[int] = self.head_dim
+        self.register_buffer(
+            "periods",
+            torch.empty(self._d_head // 4, dtype=torch.float32),
+            persistent=True,
+        )
+        if cfg.frequency_aware is None:
+            self.register_parameter("boundary_log_multiplier", None)
+        else:
+            init = float(cfg.frequency_aware.boundary_log_multiplier_init)
+            self.boundary_log_multiplier = nn.Parameter(
+                torch.tensor(init, dtype=torch.float32),
+                requires_grad=True,
+            )
+        if cfg.beta_warp is None:
+            self.register_parameter("beta_hi_u", None)
+            self.register_parameter("beta_lo_u", None)
+            self.register_parameter("beta_bend_u", None)
+        else:
+            beta = cfg.beta_warp
+            self.beta_hi_u = nn.Parameter(
+                torch.tensor(float(beta.beta_hi_u_init), dtype=torch.float32),
+                requires_grad=True,
+            )
+            self.beta_lo_u = nn.Parameter(
+                torch.tensor(float(beta.beta_lo_u_init), dtype=torch.float32),
+                requires_grad=True,
+            )
+            self.beta_bend_u = nn.Parameter(
+                torch.tensor(float(beta.beta_bend_u_init), dtype=torch.float32),
+                requires_grad=True,
+            )
+        if cfg.alpha_warp is None:
+            self.register_parameter("alpha", None)
+        else:
+            qtr = int(self._d_head) // 4
+            if qtr <= 0:  # pragma: no cover - defensive
+                raise RuntimeError("AxialRoPE2D periods length must be positive")
+            init = float(cfg.alpha_warp.alpha_init)
+            if not math.isfinite(init):
+                raise RuntimeError("alpha_init must be finite for alpha-warp RoPE")
+            self.alpha = nn.Parameter(
+                torch.full((int(qtr),), init, dtype=torch.float32),
+                requires_grad=True,
+            )
+        self._init_periods()
+    def _apply(self, fn):  # type: ignore[override]
+        out = super()._apply(fn)
+        with torch.no_grad():
+            self.periods.data = self.periods.data.to(dtype=torch.float32)
+            if self.boundary_log_multiplier is not None:
+                self.boundary_log_multiplier.data = (
+                    self.boundary_log_multiplier.data.to(dtype=torch.float32)
+                )
+            if self.beta_hi_u is not None:
+                self.beta_hi_u.data = self.beta_hi_u.data.to(dtype=torch.float32)
+            if self.beta_lo_u is not None:
+                self.beta_lo_u.data = self.beta_lo_u.data.to(dtype=torch.float32)
+            if self.beta_bend_u is not None:
+                self.beta_bend_u.data = self.beta_bend_u.data.to(dtype=torch.float32)
+            if self.alpha is not None:
+                self.alpha.data = self.alpha.data.to(dtype=torch.float32)
+        return out
+    def _init_periods(self) -> None:
+        """Initialize per-dimension periods using DINOv3 formulas."""
+        device: torch.device = self.periods.device
+        dtype: torch.dtype = self.periods.dtype
+        d_head = int(self._d_head)
+        qtr = d_head // 4
+        if qtr <= 0:
+            raise RuntimeError("AxialRoPE2D periods length must be positive")
+        if self.cfg.base is not None:
+            base = float(self.cfg.base)
+            exponents = (
+                2.0
+                * torch.arange(int(qtr), device=device, dtype=dtype)
+                / float(d_head // 2)
+            )
+            periods = torch.tensor(base, device=device, dtype=dtype) ** exponents
+        else:
+            if self.cfg.min_period is None or self.cfg.max_period is None:
+                raise RuntimeError(
+                    "AxialRoPE2DConfig must provide min_period and max_period when base is None"
+                )
+            min_p = float(self.cfg.min_period)
+            max_p = float(self.cfg.max_period)
+            base = max_p / min_p
+            exponents = torch.linspace(0.0, 1.0, int(qtr), device=device, dtype=dtype)
+            periods = torch.tensor(base, device=device, dtype=dtype) ** exponents
+            periods = periods / torch.tensor(base, device=device, dtype=dtype)
+            periods = periods * torch.tensor(max_p, device=device, dtype=dtype)
+        self.periods.data = periods
+    def forward(
+        self,
+        *,
+        H: int,
+        W: int,
+        scales: Tensor | None,
+    ) -> tuple[Tensor, Tensor]:
+        """Return (sin, cos) buffers for axial 2D RoPE.
+        Args:
+            H: Patch-grid height.
+            W: Patch-grid width.
+            scales: Optional per-batch dilation scale (scalar tensor) that
+                multiplies every angle. Not supported together with the
+                frequency-aware, beta-warp, or alpha-warp modes.
+        Returns:
+            ``(sin, cos)``, each shaped ``[HW, head_dim]`` and shared across the
+            batch whether or not ``scales`` is given.
+        """
+        if int(H) <= 0 or int(W) <= 0:
+            raise ValueError("H and W must be positive for AxialRoPE2D forward")
+        device = self.periods.device
+        offset = float(self.cfg.coord_offset)
+        coords: Tensor
+        match self.cfg.coord_mode:
+            case AxialRoPE2DCoordMode.DINOV3_NORMALIZED:
+                coords = _get_dinov3_normalized_coords(
+                    int(H),
+                    int(W),
+                    device=device,
+                    normalize=self.cfg.normalize_coords,
+                    offset=offset,
+                )
+            case AxialRoPE2DCoordMode.PATCH_INDICES:
+                coords = _get_patch_index_coords(
+                    int(H), int(W), device=device, offset=offset
+                )
+            case _ as unreachable:  # pragma: no cover - defensive
+                raise RuntimeError(f"Unsupported coord_mode: {unreachable}")
+        if coords.dim() != 2 or coords.shape[1] != 2:
+            raise RuntimeError("AxialRoPE2D coords must have shape [HW, 2]")
+        if self.cfg.frequency_aware is not None:
+            if scales is not None:
+                raise ValueError(
+                    "frequency-aware axial RoPE does not support dilation scales"
+                )
+            if self.boundary_log_multiplier is None:
+                raise RuntimeError(
+                    "boundary_log_multiplier parameter missing for frequency-aware RoPE"
+                )
+            ref_h = int(self.cfg.frequency_aware.ref_h_tokens)
+            ref_w = int(self.cfg.frequency_aware.ref_w_tokens)
+            periods_h = lumina_frequency_aware_periods_for_axis(
+                periods=self.periods,
+                axis_len=int(H),
+                ref_axis_len=ref_h,
+                boundary_log_multiplier=self.boundary_log_multiplier,
+                angle_multiplier=float(self.cfg.angle_multiplier),
+            )
+            periods_w = lumina_frequency_aware_periods_for_axis(
+                periods=self.periods,
+                axis_len=int(W),
+                ref_axis_len=ref_w,
+                boundary_log_multiplier=self.boundary_log_multiplier,
+                angle_multiplier=float(self.cfg.angle_multiplier),
+            )
+            axis_periods = torch.stack([periods_h, periods_w], dim=0)  # [2, Q]
+        elif self.cfg.beta_warp is not None:
+            if scales is not None:
+                raise ValueError(
+                    "beta-warp axial RoPE does not support dilation scales"
+                )
+            if (
+                self.beta_hi_u is None
+                or self.beta_lo_u is None
+                or self.beta_bend_u is None
+            ):
+                raise RuntimeError("beta warp parameters missing for beta-warp RoPE")
+            beta_cfg = self.cfg.beta_warp
+            ref_h = int(beta_cfg.ref_h_tokens)
+            ref_w = int(beta_cfg.ref_w_tokens)
+            qtr = int(self.periods.numel())
+            if qtr <= 0:  # pragma: no cover - defensive (checked elsewhere)
+                raise RuntimeError("AxialRoPE2D periods length must be positive")
+            beta_max = float(beta_cfg.beta_max)
+            if not math.isfinite(beta_max) or beta_max <= 0.0:
+                raise RuntimeError("beta_max must be finite and > 0")
+            beta_max_t = torch.tensor(beta_max, device=device, dtype=torch.float32)
+            beta_hi = beta_max_t * torch.tanh(self.beta_hi_u.to(dtype=torch.float32))
+            beta_lo = beta_max_t * torch.tanh(self.beta_lo_u.to(dtype=torch.float32))
+            beta_bend = beta_max_t * torch.tanh(
+                self.beta_bend_u.to(dtype=torch.float32)
+            )
+            if qtr == 1:
+                beta = beta_hi[None]
+            else:
+                t = torch.arange(int(qtr), device=device, dtype=torch.float32) / float(
+                    qtr - 1
+                )
+                bump = 4.0 * t * (1.0 - t)
+                beta = (1.0 - t) * beta_hi + t * beta_lo + beta_bend * bump
+            s_h = float(int(H)) / float(ref_h)
+            s_w = float(int(W)) / float(ref_w)
+            if (
+                not math.isfinite(s_h)
+                or s_h <= 0.0
+                or not math.isfinite(s_w)
+                or s_w <= 0.0
+            ):
+                raise RuntimeError(
+                    "Computed invalid axis scale factors for beta-warp RoPE"
+                )
+            periods_h = self.periods.to(dtype=torch.float32) * torch.pow(
+                torch.tensor(s_h, device=device, dtype=torch.float32), beta
+            )
+            periods_w = self.periods.to(dtype=torch.float32) * torch.pow(
+                torch.tensor(s_w, device=device, dtype=torch.float32), beta
+            )
+            axis_periods = torch.stack([periods_h, periods_w], dim=0)  # [2, Q]
+        elif self.cfg.alpha_warp is not None:
+            if scales is not None:
+                raise ValueError(
+                    "alpha-warp axial RoPE does not support dilation scales"
+                )
+            if self.alpha is None:
+                raise RuntimeError("alpha parameter missing for alpha-warp RoPE")
+            alpha_cfg = self.cfg.alpha_warp
+            ref_h = int(alpha_cfg.ref_h_tokens)
+            ref_w = int(alpha_cfg.ref_w_tokens)
+            qtr = int(self.periods.numel())
+            if int(self.alpha.numel()) != qtr:
+                raise RuntimeError(
+                    "alpha length must match RoPE periods length for alpha-warp RoPE"
+                )
+            s_h = float(int(H)) / float(ref_h)
+            s_w = float(int(W)) / float(ref_w)
+            if (
+                not math.isfinite(s_h)
+                or s_h <= 0.0
+                or not math.isfinite(s_w)
+                or s_w <= 0.0
+            ):
+                raise RuntimeError(
+                    "Computed invalid axis scale factors for alpha-warp RoPE"
+                )
+            alpha = self.alpha.to(device=device, dtype=torch.float32)
+            scale_h = torch.pow(
+                torch.tensor(s_h, device=device, dtype=torch.float32), alpha
+            )
+            scale_w = torch.pow(
+                torch.tensor(s_w, device=device, dtype=torch.float32), alpha
+            )
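+            # Alpha-warp divides the periods (period' = period * s ** -alpha),
+            # the opposite sign convention to beta-warp's period * s ** beta.
+            periods_h = self.periods.to(dtype=torch.float32) / scale_h
+            periods_w = self.periods.to(dtype=torch.float32) / scale_w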
+            axis_periods = torch.stack([periods_h, periods_w], dim=0)  # [2, Q]
+        else:
+            axis_periods = self.periods[None, :].expand(2, -1).to(dtype=torch.float32)
+        # Angles: angle_multiplier * coords / periods, flattened and tiled.
+        angles = (
+            float(self.cfg.angle_multiplier)
+            * coords[:, :, None].to(dtype=torch.float32)
+            / axis_periods[None, :, :].to(dtype=torch.float32)
+        )
+        match self.cfg.dim_layout:
+            case AxialRoPE2DDimLayout.HALF_SPLIT:
+                angles = angles.flatten(1, 2).repeat(1, 2)
+            case AxialRoPE2DDimLayout.PAIR_INTERLEAVED:
+                angles = angles.repeat_interleave(2, dim=-1).flatten(1, 2)
+            case _ as unreachable:  # pragma: no cover - defensive
+                raise RuntimeError(f"Unsupported dim_layout: {unreachable}")
+        if angles.shape != (int(H) * int(W), int(self._d_head)):
+            raise RuntimeError(
+                "Unexpected angles shape in AxialRoPE2D: "
+                f"{tuple(angles.shape)} for H={int(H)} W={int(W)}"
+            )
+        if scales is not None:
+            if scales.dim() != 0:
+                raise ValueError(
+                    "AxialRoPE2D scales must be a scalar tensor for per-batch dilation; "
+                    "per-sample dilation is not supported"
+                )
+            angles = angles * scales.to(device=device, dtype=torch.float32)
+        cos = torch.cos(angles)
+        sin = torch.sin(angles)
+        return sin, cos
+def _dy_ntk_periods_for_axis(
+    *,
+    periods: Tensor,
+    axis_len: int,
+    ref_axis_len: int,
+    noise_time: Tensor,
+    lambda_s: float,
+    lambda_t: float,
+) -> Tensor:
+    """Return Dy-NTK periods for one spatial axis.
+    Raises:
+        ValueError: If token lengths or scheduler values are invalid.
+    """
+    if int(axis_len) <= 0 or int(ref_axis_len) <= 0:
+        raise ValueError("axis_len and ref_axis_len must be positive for Dy-NTK")
+    qtr = int(periods.numel())
+    if qtr <= 0:
+        raise ValueError("periods must be non-empty for Dy-NTK")
+    scale = float(int(axis_len)) / float(int(ref_axis_len))
+    if not math.isfinite(scale) or scale <= 0.0:
+        raise ValueError("Dy-NTK axis scale must be finite and > 0")
+    return _dy_ntk_periods_for_scale(
+        periods=periods,
+        scale=scale,
+        noise_time=noise_time,
+        lambda_s=float(lambda_s),
+        lambda_t=float(lambda_t),
+    )
+def _dy_ntk_periods_for_scale(
+    *,
+    periods: Tensor,
+    scale: float,
+    noise_time: Tensor,
+    lambda_s: float,
+    lambda_t: float,
+) -> Tensor:
+    """Return Dy-NTK periods for a precomputed axis scale."""
+    axis_scale = float(scale)
+    if not math.isfinite(axis_scale) or axis_scale <= 0.0:
+        raise ValueError("Dy-NTK scale must be finite and > 0")
+    qtr = int(periods.numel())
+    if qtr <= 0:
+        raise ValueError("periods must be non-empty for Dy-NTK")
+    # Grids at or below the reference size keep the base periods unchanged.
+    if axis_scale <= 1.0:
+        return periods.to(dtype=torch.float32)
+    if qtr == 1:
+        exponent = torch.zeros((1,), device=periods.device, dtype=torch.float32)
+    else:
+        exponent = torch.arange(qtr, device=periods.device, dtype=torch.float32) / (
+            float(qtr - 1)
+        )
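+    # period'[d] = period[d] * scale ** (kappa * d / (qtr - 1)), where
+    # kappa = lambda_s * noise_time ** lambda_t.
+    kappa = float(lambda_s) * torch.pow(
+        noise_time.to(device=periods.device, dtype=torch.float32),
+        float(lambda_t),
+    )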
+    return periods.to(dtype=torch.float32) * torch.pow(
+        torch.tensor(axis_scale, device=periods.device, dtype=torch.float32),
+        kappa * exponent,
+    )
+def _dype_dynamic_exponent(
+    *, noise_time: float, lambda_s: float, lambda_t: float
+) -> float:
+    """Return Comfy/DyPE-style dynamic magnitude for normalized noise time."""
+    noise = float(noise_time)
+    if not math.isfinite(noise):
+        raise ValueError("DyPE noise_time must be finite")
+    noise = max(0.0, min(1.0, noise))
+    scale = float(lambda_s)
+    exponent = float(lambda_t)
+    if not math.isfinite(scale) or scale <= 0.0:
+        raise ValueError("DyPE lambda_s must be finite and > 0")
+    if not math.isfinite(exponent) or exponent <= 0.0:
+        raise ValueError("DyPE lambda_t must be finite and > 0")
+    return scale * (noise**exponent)
+def _dype_correction_factor(
+    *,
+    periods: Tensor,
+    rotations: float,
+    ref_axis_len: int,
+    angle_multiplier: float,
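+) -> float:
+    """Return the fractional band index for a rotation-count threshold.
+    The index marks the band whose wavelength completes ``rotations`` full
+    turns across ``ref_axis_len`` tokens, interpolated in log-period space.
+    """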
+    if int(ref_axis_len) <= 0:
+        raise ValueError("ref_axis_len must be positive for DyPE correction")
+    rot = float(rotations)
+    if not math.isfinite(rot) or rot <= 0.0:
+        raise ValueError("rotations must be finite and > 0")
+    mult = float(angle_multiplier)
+    if not math.isfinite(mult) or mult <= 0.0:
+        raise ValueError("angle_multiplier must be finite and > 0")
+    if int(periods.numel()) < 2:
+        return 0.0
+    periods_cpu = periods.detach().to(device=torch.device("cpu"), dtype=torch.float32)
+    p0 = float(periods_cpu[0].item())
+    p1 = float(periods_cpu[-1].item())
+    if p0 <= 0.0 or p1 <= p0:
+        raise ValueError("periods must be positive and strictly increasing for DyPE")
+    boundary_wavelength = float(int(ref_axis_len)) / rot
+    boundary_period = (mult / (2.0 * float(math.pi))) * boundary_wavelength
+    log_p0 = math.log(p0)
+    log_p1 = math.log(p1)
+    return float(periods.numel() - 1) * (
+        (math.log(boundary_period) - log_p0) / (log_p1 - log_p0)
+    )
+def _dype_ramp_mask(
+    *,
+    periods: Tensor,
+    threshold_high_rotations: float,
+    threshold_low_rotations: float,
+    ref_axis_len: int,
+    angle_multiplier: float,
+) -> Tensor:
+    """Return YaRN's high-to-low band mask for one dynamic threshold pair."""
+    qtr = int(periods.numel())
+    if qtr <= 0:
+        raise ValueError("periods must be non-empty for DyPE ramp mask")
+    device = periods.device
+    if qtr == 1:
+        return torch.ones((1,), device=device, dtype=torch.float32)
+    low = math.floor(
+        _dype_correction_factor(
+            periods=periods,
+            rotations=float(threshold_high_rotations),
+            ref_axis_len=int(ref_axis_len),
+            angle_multiplier=float(angle_multiplier),
+        )
+    )
+    high = math.ceil(
+        _dype_correction_factor(
+            periods=periods,
+            rotations=float(threshold_low_rotations),
+            ref_axis_len=int(ref_axis_len),
+            angle_multiplier=float(angle_multiplier),
+        )
+    )
+    low = max(0, min(qtr - 1, int(low)))
+    high = max(0, min(qtr, int(high)))
+    if low == high:
+        high = min(qtr, low + 1)
+    band = torch.arange(qtr, device=device, dtype=torch.float32)
+    ramp = (band - float(low)) / float(high - low)
+    return 1.0 - torch.clamp(ramp, min=0.0, max=1.0)
+def _dy_yarn_periods_for_axis(
+    *,
+    periods: Tensor,
+    linear_scale: float,
+    ntk_scale: float,
+    ref_axis_len: int,
+    noise_time: float,
+    lambda_s: float,
+    cfg: AxialRoPE2DDyPEConfig,
+    angle_multiplier: float,
+) -> Tensor:
+    """Return Dy-YaRN periods for one spatial axis."""
+    if int(ref_axis_len) <= 0:
+        raise ValueError("ref_axis_len must be positive for Dy-YaRN")
+    linear_s = float(linear_scale)
+    ntk_s = float(ntk_scale)
+    if (
+        not math.isfinite(linear_s)
+        or linear_s <= 0.0
+        or not math.isfinite(ntk_s)
+        or ntk_s <= 0.0
+    ):
+        raise ValueError("Dy-YaRN axis scales must be finite and > 0")
+    periods_f = periods.to(dtype=torch.float32)
+    if max(linear_s, ntk_s) <= 1.0:
+        return periods_f
+    kappa = _dype_dynamic_exponent(
+        noise_time=float(noise_time),
+        lambda_s=float(lambda_s),
+        lambda_t=float(cfg.lambda_t),
+    )
+    if kappa <= 1e-6:
+        return periods_f
+    freq_base = float(angle_multiplier) / periods_f
+    freq_linear = float(angle_multiplier) / (periods_f * max(1.0, linear_s))
+    periods_ntk = _dy_ntk_periods_for_scale(
+        periods=periods_f,
+        scale=max(1.0, ntk_s),
+        noise_time=torch.ones((), device=periods.device, dtype=torch.float32),
+        lambda_s=1.0,
+        lambda_t=1.0,
+    )
+    freq_ntk = float(angle_multiplier) / periods_ntk
+    beta_mask = _dype_ramp_mask(
+        periods=periods_f,
+        threshold_high_rotations=float(cfg.yarn_beta_0) ** kappa,
+        threshold_low_rotations=float(cfg.yarn_beta_1) ** kappa,
+        ref_axis_len=int(ref_axis_len),
+        angle_multiplier=float(angle_multiplier),
+    )
+    freq = freq_linear * (1.0 - beta_mask) + freq_ntk * beta_mask
+    gamma_mask = _dype_ramp_mask(
+        periods=periods_f,
+        threshold_high_rotations=float(cfg.yarn_gamma_0) ** kappa,
+        threshold_low_rotations=float(cfg.yarn_gamma_1) ** kappa,
+        ref_axis_len=int(ref_axis_len),
+        angle_multiplier=float(angle_multiplier),
+    )
+    freq = freq * (1.0 - gamma_mask) + freq_base * gamma_mask
+    return float(angle_multiplier) / freq
+class AxialRoPE2DDyPE(AxialRoPE2D):
+    """Inference-only axial RoPE wrapper using dynamic position extrapolation."""
+    dype_cfg: AxialRoPE2DDyPEConfig
+    dype_noise_time: Tensor
+    dype_noise_time_values: list[float]
+    def __init__(self, *, base: AxialRoPE2D, cfg: AxialRoPE2DDyPEConfig) -> None:
+        if not isinstance(base, AxialRoPE2D):
+            raise TypeError("base must be an AxialRoPE2D")
+        if not isinstance(cfg, AxialRoPE2DDyPEConfig):
+            raise TypeError("cfg must be an AxialRoPE2DDyPEConfig")
+        if base.cfg.coord_mode is not AxialRoPE2DCoordMode.PATCH_INDICES:
+            raise ValueError("DyPE requires patch-index axial RoPE coordinates")
+        super().__init__(head_dim=int(base.head_dim), cfg=base.cfg)
+        self.dype_cfg = cfg  # ty: ignore[unresolved-attribute]
+        self.register_buffer(
+            "dype_noise_time",
+            torch.tensor(1.0, dtype=torch.float32),
+            persistent=False,
+        )
+        self.dype_noise_time_values: list[float] = [1.0]
+        with torch.no_grad():
+            self.periods.copy_(base.periods.detach().to(dtype=torch.float32))
+    def set_dype_noise_time(self, noise_time: float) -> None:
+        """Set the current normalized diffusion noise time in ``[0, 1]``."""
+        t = float(noise_time)
+        if not math.isfinite(t) or t < 0.0 or t > 1.0:
+            raise ValueError("DyPE noise_time must be finite and within [0, 1]")
+        self.dype_noise_time.fill_(t)
+        self.dype_noise_time_values[0] = t
+    def _dype_axis_periods(
+        self,
+        *,
+        axis_len: int,
+        ref_axis_len: int,
+        global_scale: float,
+        lambda_s: float,
+    ) -> Tensor:
+        """Return method-specific periods for one spatial axis."""
+        cfg = self.dype_cfg
+        axis_scale = float(int(axis_len)) / float(int(ref_axis_len))
+        shared_scale = float(global_scale)
+        if (
+            not math.isfinite(axis_scale)
+            or axis_scale <= 0.0
+            or not math.isfinite(shared_scale)
+            or shared_scale <= 0.0
+        ):
+            raise ValueError("DyPE axis and global scales must be finite and > 0")
+        match cfg.method:
+            case DyPERoPEMethod.DY_NTK:
+                return _dy_ntk_periods_for_scale(
+                    periods=self.periods,
+                    scale=shared_scale,
+                    noise_time=self.dype_noise_time,
+                    lambda_s=float(lambda_s),
+                    lambda_t=float(cfg.lambda_t),
+                )
+            case DyPERoPEMethod.VISION_YARN:
+                return _dy_yarn_periods_for_axis(
+                    periods=self.periods,
+                    linear_scale=axis_scale,
+                    ntk_scale=shared_scale,
+                    ref_axis_len=int(ref_axis_len),
+                    noise_time=float(self.dype_noise_time_values[0]),
+                    lambda_s=float(lambda_s),
+                    cfg=cfg,
+                    angle_multiplier=float(self.cfg.angle_multiplier),
+                )
+            case DyPERoPEMethod.DY_YARN:
+                return _dy_yarn_periods_for_axis(
+                    periods=self.periods,
+                    linear_scale=shared_scale,
+                    ntk_scale=shared_scale,
+                    ref_axis_len=int(ref_axis_len),
+                    noise_time=float(self.dype_noise_time_values[0]),
+                    lambda_s=float(lambda_s),
+                    cfg=cfg,
+                    angle_multiplier=float(self.cfg.angle_multiplier),
+                )
+            case _ as unreachable:
+                raise RuntimeError(f"Unsupported DyPE method: {unreachable}")
+    def forward(
+        self,
+        *,
+        H: int,
+        W: int,
+        scales: Tensor | None,
+    ) -> tuple[Tensor, Tensor]:
+        """Return timestep-aware DyPE sin/cos buffers."""
+        if scales is not None:
+            raise ValueError("DyPE axial RoPE does not support dilation scales")
+        if int(H) <= 0 or int(W) <= 0:
+            raise ValueError("H and W must be positive for DyPE axial RoPE")
+        device = self.periods.device
+        coords = _get_patch_index_coords(
+            int(H), int(W), device=device, offset=float(self.cfg.coord_offset)
+        )
+        scale_h = float(int(H)) / float(int(self.dype_cfg.ref_h_tokens))
+        scale_w = float(int(W)) / float(int(self.dype_cfg.ref_w_tokens))
+        global_scale = max(scale_h, scale_w)
+        periods_h = self._dype_axis_periods(
+            axis_len=int(H),
+            ref_axis_len=int(self.dype_cfg.ref_h_tokens),
+            global_scale=global_scale,
+            lambda_s=float(self.dype_cfg.lambda_s),
+        )
+        periods_w = self._dype_axis_periods(
+            axis_len=int(W),
+            ref_axis_len=int(self.dype_cfg.ref_w_tokens),
+            global_scale=global_scale,
+            lambda_s=float(self.dype_cfg.lambda_s),
+        )
+        axis_periods = torch.stack([periods_h, periods_w], dim=0)
+        angles = (
+            float(self.cfg.angle_multiplier)
+            * coords[:, :, None].to(dtype=torch.float32)
+            / axis_periods[None, :, :].to(dtype=torch.float32)
+        )
+        match self.cfg.dim_layout:
+            case AxialRoPE2DDimLayout.HALF_SPLIT:
+                angles = angles.flatten(1, 2).repeat(1, 2)
+            case AxialRoPE2DDimLayout.PAIR_INTERLEAVED:
+                angles = angles.repeat_interleave(2, dim=-1).flatten(1, 2)
+            case _ as unreachable:
+                raise RuntimeError(f"Unsupported dim_layout: {unreachable}")
+        expected_shape = (int(H) * int(W), int(self.head_dim))
+        if angles.shape != expected_shape:
+            raise RuntimeError(
+                "Unexpected angles shape in DyPE axial RoPE: "
+                f"{tuple(angles.shape)} for expected {expected_shape}"
+            )
+        sin = torch.sin(angles)
+        cos = torch.cos(angles)
+        if (
+            self.dype_cfg.method in (DyPERoPEMethod.VISION_YARN, DyPERoPEMethod.DY_YARN)
+            and bool(self.dype_cfg.yarn_attention_scale)
+            and global_scale > 1.0
+        ):
+            match self.dype_cfg.method:
+                case DyPERoPEMethod.VISION_YARN:
+                    mscale_start = 0.1 * math.log(global_scale) + 1.0
+                    kappa = _dype_dynamic_exponent(
+                        noise_time=float(self.dype_noise_time_values[0]),
+                        lambda_s=1.0,
+                        lambda_t=float(self.dype_cfg.lambda_t),
+                    )
+                    mscale = 1.0 + (mscale_start - 1.0) * kappa
+                case DyPERoPEMethod.DY_YARN:
+                    mscale = 1.0 + 0.1 * math.log(global_scale) / math.sqrt(
+                        global_scale
+                    )
+                case _ as unreachable:  # pragma: no cover - guarded above
+                    raise RuntimeError(
+                        f"Unsupported YaRN attention scale: {unreachable}"
+                    )
+            if mscale > 1.0:
+                sin = sin * float(mscale)
+                cos = cos * float(mscale)
+        return sin, cos
+def build_axial_rope2d_dype(
+    *, base: AxialRoPE2D, cfg: AxialRoPE2DDyPEConfig
+) -> AxialRoPE2DDyPE:
+    """Build an inference-only DyPE wrapper for an existing axial RoPE."""
+    return AxialRoPE2DDyPE(base=base, cfg=cfg).to(device=base.periods.device)
+def set_axial_rope2d_dype_noise_time(module: nn.Module, *, noise_time: float) -> bool:
+    """Set DyPE noise time on all axial DyPE modules inside ``module``."""
+    updated = False
+    for child in module.modules():
+        match child:
+            case AxialRoPE2DDyPE() as dype:
+                dype.set_dype_noise_time(float(noise_time))
+                updated = True
+            case _:
+                pass
+    return updated
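
The Dy-NTK rule in `_dy_ntk_periods_for_scale` can be checked by hand without PyTorch. The sketch below is a pure-Python mirror of the same formula, `period_i * scale ** (kappa * i / (n - 1))` with `kappa = lambda_s * noise_time ** lambda_t`. The function name `dy_ntk_periods` and the sample numbers are illustrative assumptions, not part of the export.

```python
import math


def dy_ntk_periods(periods, scale, noise_time, lambda_s, lambda_t):
    """Pure-Python mirror of the Dy-NTK period stretch (illustrative only)."""
    if not math.isfinite(scale) or scale <= 0.0:
        raise ValueError("scale must be finite and > 0")
    if not periods:
        raise ValueError("periods must be non-empty")
    # At or below the reference resolution nothing is stretched.
    if scale <= 1.0:
        return [float(p) for p in periods]
    n = len(periods)
    kappa = lambda_s * noise_time**lambda_t
    stretched = []
    for i, period in enumerate(periods):
        # Band 0 gets exponent 0; the last band gets the full exponent kappa.
        frac = 0.0 if n == 1 else i / (n - 1)
        stretched.append(float(period) * scale ** (kappa * frac))
    return stretched


# Hypothetical three-band example at twice the reference resolution.
base = [1.0, 10.0, 100.0]
full_noise = dy_ntk_periods(base, scale=2.0, noise_time=1.0, lambda_s=2.0, lambda_t=2.0)
no_noise = dy_ntk_periods(base, scale=2.0, noise_time=0.0, lambda_s=2.0, lambda_t=2.0)
print(full_noise)  # [1.0, 20.0, 400.0]
print(no_noise)    # [1.0, 10.0, 100.0]
```

In this example the lowest band keeps its period, the highest band is stretched by `scale ** kappa`, and at `noise_time = 0` the base periods come back unchanged.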

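The band-selection logic in `_dype_correction_factor` and `_dype_ramp_mask` reduces to a log-space interpolation followed by a clamped linear ramp. The following is a minimal pure-Python sketch of those two steps; the names `correction_index` and `ramp_mask`, the geometric period list, and the hand-picked `low`/`high` band indices are assumptions chosen to keep the arithmetic easy to follow.

```python
import math


def correction_index(periods, rotations, ref_axis_len, angle_multiplier):
    """Fractional band index whose wavelength fits `rotations` turns in ref_axis_len."""
    if len(periods) < 2:
        return 0.0
    p0, p1 = periods[0], periods[-1]
    boundary_wavelength = ref_axis_len / rotations
    boundary_period = (angle_multiplier / (2.0 * math.pi)) * boundary_wavelength
    # Linear interpolation between the first and last band in log-period space.
    return (len(periods) - 1) * (
        (math.log(boundary_period) - math.log(p0)) / (math.log(p1) - math.log(p0))
    )


def ramp_mask(n_bands, low, high):
    """Mask that is 1 up to band `low`, falls linearly, and is 0 from band `high`."""
    low = max(0, min(n_bands - 1, low))
    high = max(0, min(n_bands, high))
    if low == high:
        high = min(n_bands, low + 1)  # avoid a zero-width ramp
    return [
        1.0 - min(1.0, max(0.0, (band - low) / (high - low)))
        for band in range(n_bands)
    ]


# Hypothetical geometric periods; angle_multiplier = 2*pi makes period == wavelength.
periods = [1.0, 10.0, 100.0, 1000.0, 10000.0]
idx = correction_index(periods, rotations=1.0, ref_axis_len=100, angle_multiplier=2.0 * math.pi)
mask = ramp_mask(len(periods), low=1, high=3)
print(round(idx, 6))  # 2.0
print(mask)           # [1.0, 1.0, 0.5, 0.0, 0.0]
```

In the exported code, `low` is the floor of the index for the high-rotation threshold and `high` is the ceiling of the index for the low-rotation threshold, so the ramp always spans at least one band.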
dit/blocks.py ADDED

	@@ -0,0 +1,259 @@

+"""Dense unconditional DiT blocks used by the DINAC-AE export."""
+from __future__ import annotations
+import torch
+from torch import Tensor, nn
+from common.norms import RMSNorm
+from common.rope import Rope1D
+from dit.attention_blocks import DitSelfAttentionCore
+from dit.body_config import DiTConditioning
+from dit.mlp import build_dit_mlp, reset_module_parameters
+from dit.mlp_types import MLPType
+from dit.position_encoding import DiTPositionEncoding
+def _flatten_tokens(
+    x: Tensor, hw: tuple[int, int] | None
+) -> tuple[Tensor, tuple[int, int], bool]:
+    """Return ``(B, H*W, C)`` tokens, their ``(H, W)`` grid, and whether ``x`` was NCHW."""
+    if x.dim() == 4:
+        batch, channels, height, width = x.shape
+        tokens = x.permute(0, 2, 3, 1).reshape(batch, height * width, channels)
+        return tokens, (int(height), int(width)), True
+    return x, hw if hw is not None else (int(x.shape[1]), 1), False
+def _restore_spatial(tokens: Tensor, hw: tuple[int, int]) -> Tensor:
+    """Restore dense ``(B, H*W, C)`` tokens to NCHW features."""
+    batch, _sequence_length, channels = tokens.shape
+    height, width = hw
+    return tokens.transpose(1, 2).reshape(batch, channels, height, width)
+class TransformerBlock(nn.Module):
+    """Dense pre-norm transformer block kept for import compatibility."""
+    d_model: int
+    n_heads: int
+    attn_norm: RMSNorm | None
+    mlp_norm: RMSNorm | None
+    self_attn: DitSelfAttentionCore
+    rope_1d: Rope1D | None
+    mlp: nn.Module
+    def __init__(
+        self,
+        *,
+        d_model: int,
+        n_heads: int,
+        mlp_ratio: float,
+        mlp_type: MLPType,
+        activation_config: object | None = None,
+        block_index: int = 0,
+        use_norms: bool = True,
+        position_encoding: DiTPositionEncoding = DiTPositionEncoding.NONE,
+        rope_theta: float | None = None,
+        rope_max_position_embeddings: int | None = None,
+    ) -> None:
+        super().__init__()
+        self.d_model = int(d_model)
+        self.n_heads = int(n_heads)
+        self.attn_norm = RMSNorm(self.d_model) if bool(use_norms) else None
+        self.mlp_norm = RMSNorm(self.d_model) if bool(use_norms) else None
+        self.self_attn = DitSelfAttentionCore(
+            d_model=self.d_model,
+            n_heads=self.n_heads,
+            position_encoding=position_encoding,
+        )
+        self.rope_1d = self._build_rope_1d(
+            position_encoding=position_encoding,
+            rope_theta=rope_theta,
+            rope_max_position_embeddings=rope_max_position_embeddings,
+        )
+        self.mlp = build_dit_mlp(
+            mlp_type=mlp_type,
+            in_features=self.d_model,
+            hidden_budget=int(round(float(mlp_ratio) * self.d_model)),
+            activation_config=activation_config,
+            block_index=int(block_index),
+            bias_up=False,
+            bias_down=False,
+        )
+    def reset_parameters(self) -> None:
+        """Reset attention and MLP parameters."""
+        self.self_attn.reset_parameters()
+        reset_module_parameters(self.mlp)
+    def _build_rope_1d(
+        self,
+        *,
+        position_encoding: DiTPositionEncoding,
+        rope_theta: float | None,
+        rope_max_position_embeddings: int | None,
+    ) -> Rope1D | None:
+        """Build 1D RoPE for sequence-only transformer blocks."""
+        match position_encoding:
+            case DiTPositionEncoding.NONE:
+                return None
+            case DiTPositionEncoding.ROPE_1D:
+                if rope_theta is None or rope_max_position_embeddings is None:
+                    raise ValueError("ROPE_1D requires theta and max positions")
+                return Rope1D(
+                    dim=int(self.d_model // self.n_heads),
+                    max_position_embeddings=int(rope_max_position_embeddings),
+                    base=float(rope_theta),
+                )
+            case _ as unreachable:
+                raise ValueError(f"Unsupported TransformerBlock RoPE: {unreachable}")
+    def forward(self, tokens: Tensor, *, generator: torch.Generator | None) -> Tensor:  # type: ignore[override]
+        """Apply dense self-attention and MLP to token sequences."""
+        _ = generator  # Accepted for API compatibility; unused here.
+        attn_in = self.attn_norm(tokens) if self.attn_norm is not None else tokens
+        rope_sincos = self._build_rope_sincos(attn_in)
+        x = tokens + self.self_attn(attn_in, rope_sincos=rope_sincos)
+        mlp_in = self.mlp_norm(x) if self.mlp_norm is not None else x
+        return x + self.mlp(mlp_in)
+    def _build_rope_sincos(self, tokens: Tensor) -> tuple[Tensor, Tensor] | None:
+        """Return dense 1D RoPE sin/cos buffers."""
+        rope = self.rope_1d
+        if rope is None:
+            return None
+        batch = int(tokens.shape[0])
+        seqlen = int(tokens.shape[1])
+        position_ids = torch.arange(
+            seqlen,
+            device=tokens.device,
+            dtype=torch.int64,
+        ).unsqueeze(0)
+        position_ids = position_ids.expand(batch, seqlen)
+        dummy = tokens.new_empty(batch, self.n_heads, seqlen, rope.dim)
+        cos, sin = rope(dummy, position_ids)
+        return sin, cos
+class DitBlock(nn.Module):
+    """Dense unconditional DiT self-attention block."""
+    d: int
+    h: int
+    dh: int
+    hidden_budget: int
+    position_encoding: DiTPositionEncoding
+    conditioning: DiTConditioning
+    adaln: object | None
+    gate_attn: nn.Parameter | None
+    gate_mlp: nn.Parameter | None
+    use_norms: bool
+    attn_norm1: RMSNorm
+    attn_norm2: RMSNorm
+    mlp_norm1: RMSNorm
+    mlp_norm2: RMSNorm
+    attn_core: DitSelfAttentionCore
+    qkv: nn.Linear
+    proj_out: nn.Linear
+    mlp: nn.Module
+    def __init__(
+        self,
+        d_model: int,
+        n_heads: int,
+        mlp_ratio: float,
+        *,
+        adaln: object | None = None,
+        mlp_type: MLPType = MLPType.GELU,
+        activation_config: object | None = None,
+        block_index: int = 0,
+        use_norms: bool = True,
+        position_encoding: DiTPositionEncoding = DiTPositionEncoding.NONE,
+        conditioning: DiTConditioning = DiTConditioning.UNCOND,
+    ) -> None:
+        super().__init__()
+        if conditioning is not DiTConditioning.UNCOND or adaln is not None:
+            raise ValueError("DINAC-AE export only supports unconditional DitBlock")
+        self.d = int(d_model)
+        self.h = int(n_heads)
+        self.dh = int(self.d // self.h)
+        # Note: this truncates toward zero, whereas TransformerBlock rounds to nearest.
+        self.hidden_budget = int(float(mlp_ratio) * self.d)
+        self.position_encoding = position_encoding
+        self.conditioning = conditioning
+        self.adaln = None
+        self.gate_attn = None
+        self.gate_mlp = None
+        self.use_norms = bool(use_norms)
+        self.attn_norm1 = RMSNorm(self.d)
+        self.attn_norm2 = RMSNorm(self.d)
+        self.mlp_norm1 = RMSNorm(self.d)
+        self.mlp_norm2 = RMSNorm(self.d)
+        self.attn_core = DitSelfAttentionCore(
+            d_model=self.d,
+            n_heads=self.h,
+            position_encoding=position_encoding,
+        )
+        self.qkv = self.attn_core.qkv
+        self.proj_out = self.attn_core.proj_out
+        self.mlp = build_dit_mlp(
+            mlp_type=mlp_type,
+            in_features=self.d,
+            hidden_budget=self.hidden_budget,
+            activation_config=activation_config,
+            block_index=int(block_index),
+            bias_up=False,
+            bias_down=False,
+        )
+        self.reset_parameters()
+    def reset_parameters(self) -> None:
+        """Reset attention and MLP parameters."""
+        self.attn_core.reset_parameters()
+        reset_module_parameters(self.mlp)
+    def compile_for_training(self, *, fullgraph: bool, dynamic: bool) -> None:
+        """No-op hook kept for API compatibility."""
+        _ = fullgraph, dynamic
+    def compile_for_eval(self, *, fullgraph: bool, dynamic: bool) -> None:
+        """No-op hook kept for API compatibility."""
+        _ = fullgraph, dynamic
+    def forward(
+        self,
+        x: Tensor,
+        hw: tuple[int, int],
+        cond_vec: Tensor,
+        adaln_m: Tensor | None = None,
+        *,
+        rope_sincos: tuple[Tensor, Tensor] | None = None,
+        generator: torch.Generator | None = None,
+    ) -> Tensor:
+        """Apply the dense unconditional block to spatial features or tokens."""
+        # Unused by the unconditional block; kept for API compatibility.
+        _ = cond_vec, adaln_m, generator
+        tokens, hw_tokens, was_spatial = _flatten_tokens(x, hw)
+        attn_in = self.attn_norm1(tokens) if self.use_norms else tokens
+        y = self.attn_core(attn_in, rope_sincos=rope_sincos)
+        attn_out = self.attn_norm2(y) if self.use_norms else y
+        tokens = tokens + attn_out
+        mlp_in = self.mlp_norm1(tokens) if self.use_norms else tokens
+        mlp_out = self.mlp(mlp_in)
+        mlp_out = self.mlp_norm2(mlp_out) if self.use_norms else mlp_out
+        tokens = tokens + mlp_out
+        if was_spatial:
+            return _restore_spatial(tokens, hw_tokens)
+        return tokens
+__all__ = ["DitBlock", "TransformerBlock"]
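
The token layout used by `_flatten_tokens` and `_restore_spatial` is row-major over the spatial grid: the feature at `(h, w)` becomes token `h * W + w`. The sketch below shows that index mapping on nested Python lists for a single image. It omits the batch dimension, and the helper names `flatten_tokens` and `restore_spatial` are illustrative stand-ins for the tensor versions.

```python
def flatten_tokens(x):
    """Map x[c][h][w] to tokens[h * W + w][c] for one image (no batch dim)."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    tokens = [
        [x[c][h][w] for c in range(channels)]
        for h in range(height)
        for w in range(width)
    ]
    return tokens, (height, width)


def restore_spatial(tokens, hw):
    """Inverse mapping: tokens[h * W + w][c] back to x[c][h][w]."""
    height, width = hw
    channels = len(tokens[0])
    return [
        [[tokens[h * width + w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


# Hypothetical 2-channel, 2x3 feature map.
x = [
    [[0, 1, 2], [3, 4, 5]],
    [[10, 11, 12], [13, 14, 15]],
]
tokens, hw = flatten_tokens(x)
print(tokens)  # [[0, 10], [1, 11], [2, 12], [3, 13], [4, 14], [5, 15]]
print(restore_spatial(tokens, hw) == x)  # True
```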

dit/body_config.py ADDED

	@@ -0,0 +1,33 @@

+"""Small DiT configuration enums required by the DINAC-AE export."""
+from __future__ import annotations
+from dataclasses import dataclass
+from enum import Enum, auto
+class DiTConditioning(Enum):
+    """Conditioning strategy for exported DiT blocks."""
+    ADALN = auto()
+    GATED_UNCOND = auto()
+    UNCOND = auto()
+class AdaLNSharingMode(Enum):
+    """AdaLN sharing modes retained for auxiliary config imports."""
+    PER_BLOCK = auto()
+    SHARED_BASE_LOW_RANK_DELTA = auto()
+@dataclass
+class DiTBodyConfig:
+    """Minimal body config placeholder for unused auxiliary heads."""
+    depth: int = 1
+    d_model: int = 768
+    n_heads: int = 12
+__all__ = ["AdaLNSharingMode", "DiTBodyConfig", "DiTConditioning"]

dit/mlp.py ADDED

	@@ -0,0 +1,117 @@

+"""Small MLP factory for DINAC-AE DiT blocks."""
+from __future__ import annotations
+from collections.abc import Callable
+from typing import Protocol, cast
+import torch.nn.functional as F
+from torch import Tensor, nn
+from dit.mlp_types import MLPType
+class Resettable(Protocol):
+    """Typing protocol for modules with ``reset_parameters``."""
+    def reset_parameters(self) -> None:
+        """Reset module parameters."""
+def reset_module_parameters(module: nn.Module) -> None:
+    """Reset a module that exposes ``reset_parameters``."""
+    cast(Resettable, module).reset_parameters()
+class SimpleActivationMLP(nn.Module):
+    """Feedforward MLP: ``down(activation(up(x)))``."""
+    in_features: int
+    hidden_features: int
+    activation: Callable[[Tensor], Tensor]
+    activation_name: str
+    up: nn.Linear
+    down: nn.Linear
+    def __init__(
+        self,
+        in_features: int,
+        hidden_features: int,
+        *,
+        activation: Callable[[Tensor], Tensor],
+        activation_name: str,
+        bias_up: bool,
+        bias_down: bool,
+    ) -> None:
+        super().__init__()
+        self.in_features = int(in_features)
+        self.hidden_features = int(hidden_features)
+        self.activation = activation
+        self.activation_name = str(activation_name)
+        self.up = nn.Linear(self.in_features, self.hidden_features, bias=bias_up)
+        self.down = nn.Linear(self.hidden_features, self.in_features, bias=bias_down)
+        self.reset_parameters()
+    def reset_parameters(self) -> None:
+        """Reset linear projections."""
+        nn.init.xavier_uniform_(self.up.weight)
+        if self.up.bias is not None:
+            nn.init.zeros_(self.up.bias)
+        nn.init.xavier_uniform_(self.down.weight)
+        if self.down.bias is not None:
+            nn.init.zeros_(self.down.bias)
+    def forward(self, x: Tensor) -> Tensor:  # type: ignore[override]
+        """Apply the MLP."""
+        return self.down(self.activation(self.up(x)))
+def build_dit_mlp(
+    *,
+    mlp_type: MLPType,
+    in_features: int,
+    hidden_budget: int,
+    activation_config: object | None = None,
+    block_index: int = 0,
+    bias_up: bool = False,
+    bias_down: bool = False,
+) -> nn.Module:
+    """Build the exported MLP variant."""
+    _ = activation_config, block_index
+    match mlp_type:
+        case MLPType.GELU:
+            return SimpleActivationMLP(
+                in_features=int(in_features),
+                hidden_features=int(hidden_budget),
+                activation=F.gelu,
+                activation_name="gelu",
+                bias_up=bool(bias_up),
+                bias_down=bool(bias_down),
+            )
+        case MLPType.SILU:
+            return SimpleActivationMLP(
+                in_features=int(in_features),
+                hidden_features=int(hidden_budget),
+                activation=F.silu,
+                activation_name="silu",
+                bias_up=bool(bias_up),
+                bias_down=bool(bias_down),
+            )
+        case MLPType.RELU:
+            return SimpleActivationMLP(
+                in_features=int(in_features),
+                hidden_features=int(hidden_budget),
+                activation=F.relu,
+                activation_name="relu",
+                bias_up=bool(bias_up),
+                bias_down=bool(bias_down),
+            )
+        case _ as unreachable:
+            raise ValueError(f"Unsupported exported MLP type: {unreachable}")
+__all__ = ["SimpleActivationMLP", "build_dit_mlp", "reset_module_parameters"]
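
`SimpleActivationMLP` computes `down(activation(up(x)))` with both biases disabled in the exported blocks. The arithmetic can be traced with a small pure-Python sketch. The weights and input below are made-up values, and `matvec`, `simple_mlp`, and `gelu` are hypothetical helpers. The `gelu` shown is the exact erf-based form, which is what `F.gelu` computes by default.

```python
import math


def gelu(x):
    """Exact (erf-based) GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


def matvec(weight, vec):
    """Bias-free linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def simple_mlp(x, up_weight, down_weight, activation):
    """down(activation(up(x))) for a single feature vector."""
    hidden = [activation(h) for h in matvec(up_weight, x)]
    return matvec(down_weight, hidden)


# Hypothetical weights: in_features=2, hidden_features=3.
up_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
down_weight = [[1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
out = simple_mlp([1.0, -2.0], up_weight, down_weight, relu)
print(out)  # [1.0, 0.0]
```

Swapping `relu` for `gelu` in the call changes only the elementwise step, which mirrors how `build_dit_mlp` selects `F.gelu`, `F.silu`, or `F.relu` for the same module.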

dit/mlp_types.py ADDED

	@@ -0,0 +1,51 @@

+from __future__ import annotations
+from enum import Enum
+class MLPType(Enum):
+    """MLP implementation variants for DiT blocks.
+    - SWI: baseline SwiGLU MLP
+    - SWINE: Sigmoid-gated sine GLU (σ·sin) with trig promoted to float32
+    - SWINER: Sigmoid-gated FINER-style chirp (σ·sin(ω₀·((1+|x|)·x)))
+    - SPWIDER: sqrt-gated sine GLU (√|a|·sin(ω₀·b))
+    - RELU: Plain ReLU-activated feedforward
+    - RELU2: ReLU-squared activation (ReLU(x)^2) feedforward
+    - SILU: Plain SiLU-activated feedforward
+    - GELU: Plain GELU-activated feedforward
+    - SIREN: Pure sine-activated MLP
+    - SPIDER: Sine with sqrt magnitude (sin(ω₀·x)·√|x|)
+    - SINC: Sinc-activated MLP with log-spaced per-channel scales
+    - FINER: FINER activation MLP with a fixed global scale (non-learnable)
+    - RBF: Low-rank per-patch RBF with Gaussian kernel
+    - RBF_ODD: RBF with odd-Gaussian kernel (z·exp(-z^2))
+    - RBF_SHARP: RBF with sharpness exponent alpha (exp(-(s·|x-b|)^alpha))
+    - RBF_SIREN: RBF using sine basis sin(ω₀·(s·(x-b)))
+    - RBF_FINER: RBF using FINER (chirp) basis sin(ω₀·((1+|z|)·z)), z=s·(x-b)
+    - RBF_DAMPED_SINE: RBF using damped sine sin(ω₀·z)·exp(-|z|), z=s·(x-b)
+    - RBF_SINC: RBF using sinc basis sinc(z)=sin(z)/z, z=s·(x-b)
+    """
+    SWI = "swi"
+    SWINE = "swine"
+    SWINER = "swiner"
+    SPWIDER = "spwider"
+    RELU = "relu"
+    RELU2 = "relu2"
+    SILU = "silu"
+    GELU = "gelu"
+    SIREN = "siren"
+    SPIDER = "spider"
+    SINC = "sinc"
+    FINER = "finer"
+    RBF = "rbf"
+    RBF_ODD = "rbf_odd"
+    RBF_SHARP = "rbf_sharp"
+    RBF_SIREN = "rbf_siren"
+    RBF_FINER = "rbf_finer"
+    RBF_DAMPED_SINE = "rbf_damped_sine"
+    RBF_SINC = "rbf_sinc"
+__all__ = ["MLPType"]
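
`MLPType` lists many variants, but `build_dit_mlp` in this export only accepts `GELU`, `SILU`, and `RELU`. When a type arrives as a config string, it can be resolved by enum value and checked against that subset before any module is built. The sketch below uses a stand-in subset of the enum so it runs on its own. `EXPORTED_MLP_TYPES` and `parse_exported_mlp_type` are hypothetical names, not part of the package.

```python
from enum import Enum


class MLPType(Enum):
    """Stand-in subset of the exported enum, for illustration only."""

    SWI = "swi"
    RELU = "relu"
    SILU = "silu"
    GELU = "gelu"


# The three variants that build_dit_mlp handles in this export.
EXPORTED_MLP_TYPES = frozenset({MLPType.GELU, MLPType.SILU, MLPType.RELU})


def parse_exported_mlp_type(name: str) -> MLPType:
    """Resolve a config string to an MLPType and reject unsupported variants."""
    mlp_type = MLPType(name.strip().lower())  # ValueError on unknown strings
    if mlp_type not in EXPORTED_MLP_TYPES:
        raise ValueError(f"Unsupported exported MLP type: {mlp_type}")
    return mlp_type


print(parse_exported_mlp_type("gelu"))  # MLPType.GELU
```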

dit/position_encoding.py ADDED

	@@ -0,0 +1,23 @@

+"""Position encoding options used by exported dense DiT blocks."""
+from __future__ import annotations
+from enum import Enum
+class DiTPositionEncoding(Enum):
+    """Position encoding strategy used inside exported DiT blocks."""
+    ROPE_2D_AXIAL_DILATED = "rope_2d_axial_dilated"
+    ROPE_2D_AXIAL_UNNORMALIZED_DILATED = "rope_2d_axial_unnormalized_dilated"
+    ROPE_2D_AXIAL_NORMALIZED = "rope_2d_axial_normalized"
+    ROPE_2D_AXIAL_UNNORMALIZED = "rope_2d_axial_unnormalized"
+    ROPE_2D_AXIAL_FREQ_AWARE = "rope_2d_axial_freq_aware"
+    ROPE_2D_AXIAL_BETA_WARP = "rope_2d_axial_beta_warp"
+    ROPE_2D_AXIAL_ALPHA_WARP = "rope_2d_axial_alpha_warp"
+    ROPE_3D_ZIMAGE = "rope_3d_zimage"
+    ROPE_1D = "rope_1d"
+    NONE = "none"
+__all__ = ["DiTPositionEncoding"]

dit/repa_projection.py ADDED

	@@ -0,0 +1,226 @@

+"""DINO token/class alignment head used by the DINAC-AE export."""
+from __future__ import annotations
+from dataclasses import dataclass
+from typing import TYPE_CHECKING
+import torch
+from torch import Tensor, nn
+from common.norms import RMSNorm
+from dit.axial_rope2d import (
+    AxialRoPE2D,
+    AxialRoPE2DConfig,
+    AxialRoPE2DCoordMode,
+    AxialRoPE2DDimLayout,
+    AxialRoPE2DNormalizeCoords,
+)
+from dit.blocks import DitBlock
+from dit.body_config import DiTConditioning
+from dit.position_encoding import DiTPositionEncoding
+from dit.xattn_blocks import CrossAttentionBlock, CrossAttentionConfig
+if TYPE_CHECKING:
+    from dit.mlp_types import MLPType
+@dataclass(frozen=True)
+class DinoTokenAlignmentOutput:
+    """Predicted DINO class token and spatial patch tokens."""
+    class_token: Tensor
+    spatial_tokens: Tensor
+def _prepend_identity_rope_prefix(
+    *,
+    rope_sincos: tuple[Tensor, Tensor],
+    prefix_token_count: int,
+    device: torch.device,
+) -> tuple[Tensor, Tensor]:
+    """Prepend no-op RoPE entries for class/register prefix tokens."""
+    sin, cos = rope_sincos
+    prefix_shape = (int(prefix_token_count), int(sin.shape[-1]))
+    prefix_sin = torch.zeros(prefix_shape, device=device, dtype=sin.dtype)
+    prefix_cos = torch.ones(prefix_shape, device=device, dtype=cos.dtype)
+    match sin.dim():
+        case 2:
+            return (
+                torch.cat([prefix_sin, sin.to(device=device)], dim=0),
+                torch.cat([prefix_cos, cos.to(device=device)], dim=0),
+            )
+        case 3:
+            batch = int(sin.shape[0])
+            return (
+                torch.cat(
+                    [
+                        prefix_sin.unsqueeze(0).expand(batch, -1, -1),
+                        sin.to(device=device),
+                    ],
+                    dim=1,
+                ),
+                torch.cat(
+                    [
+                        prefix_cos.unsqueeze(0).expand(batch, -1, -1),
+                        cos.to(device=device),
+                    ],
+                    dim=1,
+                ),
+            )
+        case _ as unreachable:
+            raise ValueError(f"Unsupported RoPE tensor rank: {int(unreachable)}")
+class DinoTokenAlignmentHead(nn.Module):
+    """Predict DINOv3 spatial tokens and a class token from latent grids."""
+    in_channels: int
+    feature_dim: int
+    model_dim: int
+    register_token_count: int
+    in_proj: nn.Conv2d
+    initial_class_token: nn.Parameter
+    register_tokens: nn.Parameter
+    block: DitBlock
+    spatial_output_norm: RMSNorm
+    class_readout: CrossAttentionBlock
+    class_output_norm: RMSNorm
+    _axial_rope2d: AxialRoPE2D
+    def __init__(
+        self,
+        *,
+        in_channels: int,
+        feature_dim: int,
+        model_dim: int,
+        head_dim: int,
+        mlp_ratio: float,
+        mlp_activation: MLPType,
+        block_index: int,
+        register_token_count: int,
+    ) -> None:
+        super().__init__()
+        if int(feature_dim) != int(model_dim):
+            raise ValueError("DINAC-AE class head requires feature_dim == model_dim")
+        if int(register_token_count) != 4:
+            raise ValueError("DINAC-AE class head requires four register tokens")
+        self.register_token_count = int(register_token_count)
+        self.in_channels = int(in_channels)
+        self.feature_dim = int(feature_dim)
+        self.model_dim = int(model_dim)
+        self.in_proj = nn.Conv2d(
+            self.in_channels,
+            self.model_dim,
+            kernel_size=1,
+            padding=0,
+            stride=1,
+            bias=True,
+        )
+        self.initial_class_token = nn.Parameter(torch.empty((1, self.model_dim)))
+        self.register_tokens = nn.Parameter(
+            torch.empty((self.register_token_count, self.model_dim))
+        )
+        nn.init.normal_(self.initial_class_token, mean=0.0, std=0.02)
+        nn.init.normal_(self.register_tokens, mean=0.0, std=0.02)
+        conditioning = DiTConditioning.UNCOND
+        self.block = DitBlock(
+            d_model=self.model_dim,
+            n_heads=int(self.model_dim // int(head_dim)),
+            mlp_ratio=float(mlp_ratio),
+            mlp_type=mlp_activation,
+            block_index=int(block_index),
+            use_norms=True,
+            position_encoding=DiTPositionEncoding.ROPE_2D_AXIAL_UNNORMALIZED,
+            conditioning=conditioning,
+        )
+        self.spatial_output_norm = RMSNorm(self.model_dim, affine=False)
+        self.class_readout = CrossAttentionBlock(
+            query_dim=self.model_dim,
+            context_dim=self.model_dim,
+            cfg=CrossAttentionConfig(
+                n_heads=int(self.model_dim // int(head_dim)),
+                head_dim=int(head_dim),
+                query_extra_dim=0,
+                context_extra_dim=0,
+                mlp_ratio=float(mlp_ratio),
+                attn_dropout=0.0,
+                mlp_type=mlp_activation,
+                activation_config=None,
+                use_norms=True,
+                block_index=int(block_index) + 1,
+                use_attn_residual=True,
+            ),
+        )
+        self.class_output_norm = RMSNorm(self.model_dim, affine=False)
+        self._axial_rope2d = AxialRoPE2D(
+            head_dim=int(head_dim),
+            cfg=AxialRoPE2DConfig(
+                base=10_000.0,
+                min_period=None,
+                max_period=None,
+                coord_mode=AxialRoPE2DCoordMode.PATCH_INDICES,
+                normalize_coords=AxialRoPE2DNormalizeCoords.MAX,
+                dim_layout=AxialRoPE2DDimLayout.PAIR_INTERLEAVED,
+                angle_multiplier=1.0,
+                coord_offset=0.0,
+                frequency_aware=None,
+                beta_warp=None,
+                alpha_warp=None,
+            ),
+        )
+    def compile_for_training(self, *, fullgraph: bool, dynamic: bool) -> None:
+        """No-op hook kept for source API compatibility."""
+        _ = fullgraph, dynamic
+    def compile_for_eval(self, *, fullgraph: bool, dynamic: bool) -> None:
+        """No-op hook kept for source API compatibility."""
+        _ = fullgraph, dynamic
+    def forward(self, latents: Tensor, *, t: Tensor) -> DinoTokenAlignmentOutput:
+        """Return predicted class and spatial DINO tokens."""
+        y = self.in_proj(latents)
+        batch, _channels, height, width = y.shape
+        spatial_tokens = y.flatten(2).transpose(1, 2)
+        class_token = self.initial_class_token.to(device=y.device, dtype=y.dtype)
+        class_token = class_token.unsqueeze(0).expand(int(batch), -1, -1)
+        register_tokens = self.register_tokens.to(device=y.device, dtype=y.dtype)
+        register_tokens = register_tokens.unsqueeze(0).expand(int(batch), -1, -1)
+        tokens = torch.cat([class_token, register_tokens, spatial_tokens], dim=1)
+        rope_sincos = _prepend_identity_rope_prefix(
+            rope_sincos=self._axial_rope2d(H=int(height), W=int(width), scales=None),
+            prefix_token_count=int(1 + self.register_token_count),
+            device=y.device,
+        )
+        _ = t
+        cond = torch.zeros(
+            (int(batch), self.model_dim),
+            device=y.device,
+            dtype=y.dtype,
+        )
+        tokens = self.block(
+            tokens,
+            hw=(int(height), int(width)),
+            cond_vec=cond,
+            adaln_m=None,
+            rope_sincos=rope_sincos,
+            generator=None,
+        )
+        class_query = tokens[:, :1, :]
+        context = tokens[:, 1:, :]
+        class_output = self.class_readout(class_query, context)[:, 0, :]
+        class_output = self.class_output_norm(class_output)
+        prefix_token_count = int(1 + self.register_token_count)
+        predicted_spatial = self.spatial_output_norm(tokens[:, prefix_token_count:, :])
+        return DinoTokenAlignmentOutput(
+            class_token=class_output,
+            spatial_tokens=predicted_spatial,
+        )
+__all__ = ["DinoTokenAlignmentHead", "DinoTokenAlignmentOutput"]

dit/xattn_blocks.py ADDED

	@@ -0,0 +1,177 @@

+"""Dense cross-attention block used by the DINAC-AE class-token head."""
+from __future__ import annotations
+from dataclasses import dataclass
+from torch import Tensor, nn
+from common.norms import RMSNorm
+from dit.attention_blocks import CrossAttentionCore
+from dit.mlp import build_dit_mlp, reset_module_parameters
+from dit.mlp_types import MLPType
+@dataclass
+class CrossAttentionConfig:
+    """Configuration for the exported dense cross-attention block."""
+    n_heads: int = 16
+    head_dim: int | None = None
+    query_extra_dim: int = 0
+    context_extra_dim: int = 0
+    key_extra_dim: int = 0
+    mlp_ratio: float = 2.0
+    attn_dropout: float = 0.0
+    mlp_type: MLPType = MLPType.GELU
+    activation_config: object | None = None
+    use_norms: bool = True
+    block_index: int = 0
+    use_attn_residual: bool = True
+class CrossAttentionBlock(nn.Module):
+    """Dense pre-norm cross-attention plus residual MLP."""
+    query_dim: int
+    context_dim: int
+    query_extra_dim: int
+    context_extra_dim: int
+    key_extra_dim: int
+    n_heads: int
+    head_dim: int
+    attn_dim: int
+    use_norms: bool
+    attn_dropout: float
+    use_attn_residual: bool
+    query_norm: RMSNorm | None
+    context_norm: RMSNorm | None
+    mlp_norm: RMSNorm | None
+    q_proj: nn.Linear
+    attn_core: CrossAttentionCore
+    kv_proj: nn.Linear
+    out_proj: nn.Linear
+    mlp: nn.Module
+    def __init__(
+        self,
+        *,
+        query_dim: int,
+        context_dim: int,
+        cfg: CrossAttentionConfig,
+    ) -> None:
+        super().__init__()
+        n_heads = int(cfg.n_heads)
+        if cfg.head_dim is None:
+            if query_dim % n_heads != 0:
+                raise ValueError("query_dim must be divisible by n_heads")
+            head_dim = query_dim // n_heads
+        else:
+            head_dim = int(cfg.head_dim)
+        self.query_dim = int(query_dim)
+        self.context_dim = int(context_dim)
+        self.query_extra_dim = int(cfg.query_extra_dim)
+        self.context_extra_dim = int(cfg.context_extra_dim)
+        self.key_extra_dim = int(cfg.key_extra_dim)
+        self.n_heads = n_heads
+        self.head_dim = int(head_dim)
+        self.attn_dim = int(self.n_heads * self.head_dim)
+        self.use_norms = bool(cfg.use_norms)
+        self.attn_dropout = float(cfg.attn_dropout)
+        if not cfg.use_attn_residual:
+            raise ValueError("DINAC-AE export requires attention residuals")
+        self.use_attn_residual = True
+        self.query_norm = RMSNorm(self.query_dim) if self.use_norms else None
+        self.context_norm = RMSNorm(self.context_dim) if self.use_norms else None
+        self.mlp_norm = RMSNorm(query_dim) if self.use_norms else None
+        self.q_proj = nn.Linear(
+            self.query_dim + self.query_extra_dim, self.attn_dim, bias=False
+        )
+        self.attn_core = CrossAttentionCore(
+            query_dim=query_dim,
+            context_dim=context_dim,
+            context_extra_dim=self.context_extra_dim,
+            key_extra_dim=self.key_extra_dim,
+            n_heads=self.n_heads,
+            head_dim=self.head_dim,
+            attn_dropout=self.attn_dropout,
+        )
+        self.kv_proj = self.attn_core.kv_proj
+        self.out_proj = self.attn_core.out_proj
+        hidden = int(round(cfg.mlp_ratio * query_dim))
+        self.mlp = build_dit_mlp(
+            mlp_type=cfg.mlp_type,
+            in_features=query_dim,
+            hidden_budget=hidden,
+            activation_config=cfg.activation_config,
+            block_index=int(cfg.block_index),
+            bias_up=False,
+            bias_down=False,
+        )
+        self.reset_parameters()
+    def reset_parameters(self) -> None:
+        """Reset projections and MLP parameters."""
+        nn.init.xavier_uniform_(self.q_proj.weight)
+        self.attn_core.reset_parameters()
+        reset_module_parameters(self.mlp)
+    def forward(
+        self,
+        query: Tensor,
+        context: Tensor,
+        *,
+        query_extra: Tensor | None = None,
+        context_extra: Tensor | None = None,
+        key_extra: Tensor | None = None,
+        key_padding_mask: Tensor | None = None,
+    ) -> Tensor:  # type: ignore[override]
+        """Run dense cross-attention followed by the residual MLP."""
+        query_tokens = self.query_norm(query) if self.query_norm is not None else query
+        if query_extra is not None:
+            q_in = query_tokens.new_empty(
+                *query_tokens.shape[:-1],
+                int(query_tokens.shape[-1]) + int(query_extra.shape[-1]),
+            )
+            q_in[..., : int(query_tokens.shape[-1])] = query_tokens
+            q_in[..., int(query_tokens.shape[-1]) :] = query_extra
+        else:
+            q_in = query_tokens
+        context_tokens = (
+            self.context_norm(context) if self.context_norm is not None else context
+        )
+        if context_extra is not None:
+            kv_tokens = context_tokens.new_empty(
+                *context_tokens.shape[:-1],
+                int(context_tokens.shape[-1]) + int(context_extra.shape[-1]),
+            )
+            kv_tokens[..., : int(context_tokens.shape[-1])] = context_tokens
+            kv_tokens[..., int(context_tokens.shape[-1]) :] = context_extra
+        else:
+            kv_tokens = context_tokens
+        q_attn_tokens = self.q_proj(q_in)
+        attn_out = self.attn_core(
+            q_attn_tokens,
+            kv_tokens,
+            training=self.training,
+            key_extra=key_extra,
+            key_padding_mask=key_padding_mask,
+        )
+        tokens = query + attn_out
+        mlp_in = self.mlp_norm(tokens) if self.mlp_norm is not None else tokens
+        return tokens + self.mlp(mlp_in)
+    def compile_for_training(self, *, fullgraph: bool, dynamic: bool) -> None:
+        """No-op hook kept for the token-alignment head API."""
+        _ = fullgraph, dynamic
+    def compile_for_eval(self, *, fullgraph: bool, dynamic: bool) -> None:
+        """No-op hook kept for the token-alignment head API."""
+        _ = fullgraph, dynamic
+__all__ = ["CrossAttentionBlock", "CrossAttentionConfig"]

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b610db6eb9ba995f14ddfbe3ca683044b3b7b4ebe2409fec9465c545a5ec88f7
+size 633374472

technical_report_dinac_ae.md ADDED

	@@ -0,0 +1,390 @@

+# DINAC-AE Technical Report
+`dinac_ae` is a DINO-aligned class-token autoencoder in the SemDisDiffAE
+family: patch-16 spatial latents, a variance-preserving (VP) diffusion decoder,
+and semantic alignment to frozen vision features.
+Relative to [SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae),
+DINAC-AE replaces the FCDM encoder with a 6-block ViT/DiT-style transformer
+encoder, uses DINOv3 ViT-B/16 features, and extends the latent-to-DINO
+alignment head with a class-token output. The decoder remains in the same FCDM
+VP-diffusion family described in the SemDisDiffAE and capacitor reports.
+Related reports:
+- SemDisDiffAE report: https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md
+- full_capacitor report: https://huggingface.co/data-archetype/full_capacitor/blob/main/technical_report_full_capacitor.md
+- capacitor_decoder report: https://huggingface.co/data-archetype/capacitor_decoder/blob/main/technical_report_capacitor_decoder.md
+## 1. Motivation
+We trained both FCDM-encoder and transformer-encoder variants. The transformer
+encoder latents were easier for downstream DiT models to learn from, so this
+release keeps the FCDM diffusion decoder and changes the encoder.
+The second change is the class-token output. The DINO alignment head predicts
+patch tokens as before, and is extended with a class-token prediction path.
+`predict_class(latents)` exposes that feature directly from latents, which
+enables Representation Frechet Distance / FD-loss style objectives without
+decoding to RGB. Empirically, training with the class-token path also makes the
+latents more semantically aligned.
+## 2. Architecture Summary
+| Component | SemDisDiffAE | dinac_ae |
+| --- | ---: | ---: |
+| Patch size | 16 | 16 |
+| Latent channels | 128 | 128 |
+| Encoder block family | FCDM | DiT transformer |
+| Encoder width | 896 | 896 |
+| Encoder depth | 4 | 6 |
+| Decoder block family | FCDM | FCDM |
+| Decoder width | 896 | 896 |
+| Decoder depth | 8 | 8 |
+| Decoder skip layout | start/middle/end skip concat | start/middle/end skip concat |
+| DINO alignment | DINOv3 ViT-S/16 patch tokens | DINOv3 ViT-B/16 patch tokens + class token |
+Parameter counts in the released checkpoint:
+| Module | Parameters |
+| --- | ---: |
+| Encoder | 64,939,521 |
+| Decoder | 68,133,505 |
+| DINO alignment head | 14,264,320 |
+| Total | 147,337,346 |
+## 3. Encoder
+The encoder is a single-scale transformer patch encoder. As in SemDisDiffAE,
+all blocks operate at the final latent grid resolution.
+### 3.1 Patch Embedding
+The image is first converted to a patch grid with a stride-16 patch projection:
+```text
+image [B, 3, H, W]
+  -> stride-16 patch projection (3 x 16 x 16 -> 896)   [B, 896, h, w]
+```
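As a rough illustration of the shapes involved, the sketch below computes the latent grid for a few of the training resolutions. The helper name `latent_grid` is hypothetical, and it assumes `H` and `W` are multiples of the patch size; padding or cropping behaviour is not described in this report.

```python
PATCH_SIZE = 16        # stride-16 patch projection (from the report)
LATENT_CHANNELS = 128  # latent channels (from the report)


def latent_grid(height: int, width: int) -> tuple[int, int]:
    """Return (h, w) of the patch grid for an image of size (height, width).

    Hypothetical helper: assumes both sides are multiples of PATCH_SIZE.
    """
    if height % PATCH_SIZE or width % PATCH_SIZE:
        raise ValueError("height and width must be multiples of 16 in this sketch")
    return height // PATCH_SIZE, width // PATCH_SIZE


for side in (256, 512, 1024):
    h, w = latent_grid(side, side)
    # Latent shape per image and the number of encoder tokens (h * w).
    print(side, (LATENT_CHANNELS, h, w), h * w)
```

The token count grows quadratically with the image side, which is why the higher-resolution buckets dominate attention cost in the transformer encoder.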
+### 3.2 DiT Encoder Block
+The patch grid is processed by `6` unconditional `DitBlock` layers at width
+`896`. Each block has `14` attention heads with head dimension `64`, an MLP
+ratio of `4.0`, and GELU activations. The encoder uses our shared
+AdaLN-capable `DitBlock` implementation with conditioning disabled. As a result
+it keeps that block's RMSNorm sandwich structure: RMSNorm before and after the
+attention branch, RMSNorm before and after the MLP branch, plus per-head RMSNorm
+on Q and K:
+```text
+x
+  -> RMSNorm
+  -> biasless QKV projection
+  -> per-head RMSNorm on Q and K
+  -> axial 2D RoPE on Q and K
+  -> scaled dot-product attention
+  -> biasless output projection
+  -> RMSNorm
+  -> residual add
+  -> RMSNorm
+  -> biasless Linear(896 -> 3584)
+  -> GELU
+  -> biasless Linear(3584 -> 896)
+  -> RMSNorm
+  -> residual add
+```
+### 3.3 Posterior Projection
+After the six transformer blocks, a pointwise projection maps from `896` to
+`256` channels and splits into mean and logSNR:
+```text
+features [B, 896, h, w]
+  -> pointwise projection (896 -> 256)
+  -> split: mean [B, 128, h, w], logSNR [B, 128, h, w]
+  -> posterior mode: sqrt(sigmoid(logSNR)) * mean
+```
+The posterior is a VP-parameterized diagonal Gaussian. The mean branch is used
+directly in the posterior mode computation.
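A minimal scalar sketch of this parameterization follows. The mode formula comes from the diagram above; the sampling rule `alpha * mean + sigma * eps` with `sigma^2 = sigmoid(-logSNR)` is the standard VP form and is an assumption about how the released encoder samples, not something this report states. The function names are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def posterior_mode(mean: float, log_snr: float) -> float:
    """Posterior mode from the report: sqrt(sigmoid(logSNR)) * mean."""
    return math.sqrt(sigmoid(log_snr)) * mean


def posterior_sample(mean: float, log_snr: float, rng: random.Random) -> float:
    """Assumed VP sample: alpha * mean + sigma * eps, with alpha^2 + sigma^2 = 1."""
    alpha = math.sqrt(sigmoid(log_snr))
    sigma = math.sqrt(sigmoid(-log_snr))
    return alpha * mean + sigma * rng.gauss(0.0, 1.0)


rng = random.Random(0)
for log_snr in (-4.0, 0.0, 8.0):
    # alpha^2 + sigma^2 stays at 1 for every logSNR (the VP property).
    total = sigmoid(log_snr) + sigmoid(-log_snr)
    print(log_snr, round(total, 6), posterior_mode(1.5, log_snr))
    _ = posterior_sample(1.5, log_snr, rng)
```

At high logSNR the mode approaches the raw mean; at very low logSNR it shrinks toward zero.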
+The transformer encoder uses 2D axial RoPE with unnormalized patch-index
+coordinates:
+- Position encoding: axial 2D RoPE.
+- Coordinate mode: patch indices.
+- Coordinate normalization: max-coordinate normalization.
+- Pair layout: interleaved pairs.
+- RoPE base: 10000.
+This RoPE choice is shared by the encoder blocks and DINO alignment head.
+## 4. Decoder
+The decoder is the same FCDM decoder family used by full_capacitor and
+capacitor_decoder:
+- 8 FCDM decoder blocks.
+- Width 896.
+- 16x16 image patches, giving a stride-16 latent grid.
+- 128 latent channels.
+- Start/middle/end skip-concat architecture with 2 start blocks and 2 end
+  blocks.
+- Depthwise convolution kernel size 7.
+- GELU MLP activations and SiLU convolution activations.
+Each decoder FCDM block uses the same single residual path as SemDisDiffAE:
+```text
+x
+  -> depthwise Conv 7x7
+  -> RMSNorm
+  -> scale modulation from timestep AdaLN
+  -> pointwise projection
+  -> GELU
+  -> GRN
+  -> pointwise projection
+  -> gate modulation from timestep AdaLN
+  -> residual add
+```
+Timestep conditioning uses the low-rank AdaLN scheme from the capacitor
+decoder: a shared base projection plus per-layer low-rank deltas, split into
+`scale` and `gate`. The decoder topology is:
+```text
+noisy image x_t
+  -> stride-16 image patch projection
+  -> concatenate projected latents
+  -> 2 start FCDM blocks
+  -> 4 middle FCDM blocks
+  -> concatenate start and middle activations
+  -> 2 end FCDM blocks
+  -> patch-output projection + PixelShuffle(16)
+  -> x0 prediction
+```
+DINAC-AE keeps this decoder path; the architectural changes are in the encoder
+and alignment head.
+## 5. DINOv3 Alignment
+The alignment target is DINOv3 ViT-B/16 trained on LVD1689M:
+- Teacher: `dino_v3_vit_base_patch16_lvd1689m`.
+- Feature type: `dino_v3_vit_base_patch16_tokens`.
+- Target feature dimension: 768.
+- The loss supervises both the DINO class token and DINO spatial patch tokens.
+The alignment head first maps unwhitened DINAC-AE latents into DINO token
+space:
+```text
+latents [B, 128, h, w]
+  -> pointwise projection (128 -> 768)
+  -> flatten spatial tokens                              [B, h*w, 768]
+  -> prepend learned class token + 4 learned register tokens
+     (sequence order: class, registers, spatial)         [B, 1 + 4 + h*w, 768]
+  -> 1 unconditional RoPE `DitBlock` over all tokens
+  -> RMSNorm(spatial tokens)
+  -> spatial negative-cosine target
+```
+The token-alignment block uses the same unconditional RoPE `DitBlock` form as
+the encoder, but at DINO width: `768` channels, `12` heads, head dimension `64`,
+MLP ratio `4.0`, and GELU MLP. The learned prefix tokens receive identity RoPE
+rotation; spatial tokens receive the same axial 2D RoPE configuration as the
+encoder.
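The identity rotation for prefix tokens can be checked with a small pure-Python sketch. It assumes the interleaved-pair layout listed for the encoder, where consecutive elements `(x[2i], x[2i+1])` form one rotated pair. The helper `rotate_pairs` is illustrative only and is not the exported `AxialRoPE2D` code; the angles used for the spatial token are made up.

```python
import math


def rotate_pairs(x: list[float], sin: list[float], cos: list[float]) -> list[float]:
    """Rotate interleaved pairs (x[2i], x[2i+1]) by per-pair sin/cos values."""
    if len(x) != 2 * len(sin) or len(sin) != len(cos):
        raise ValueError("expected one (sin, cos) entry per pair of features")
    out: list[float] = []
    for i, (s, c) in enumerate(zip(sin, cos)):
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


token = [0.5, -1.0, 2.0, 0.25]

# Prefix tokens (class/register): sin = 0 and cos = 1 leave the vector unchanged.
prefix = rotate_pairs(token, sin=[0.0, 0.0], cos=[1.0, 1.0])

# A spatial token is rotated by position-dependent angles (values made up here).
angles = [0.3, 1.1]
spatial = rotate_pairs(
    token,
    sin=[math.sin(t) for t in angles],
    cos=[math.cos(t) for t in angles],
)

print(prefix == token)
print(spatial)
```

Because a rotation preserves vector norm, the spatial token changes direction within each pair but keeps its magnitude, while prefix tokens carry no positional signal at all.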
+The class-token path uses an additional residual cross-attention block:
+```text
+updated class token [B, 1, 768]
+updated register + spatial tokens [B, 4 + h*w, 768]
+  -> class query cross-attends to register + spatial tokens
+  -> residual add
+  -> RMSNorm
+  -> biasless Linear(768 -> 3072)
+  -> GELU
+  -> biasless Linear(3072 -> 768)
+  -> residual add
+  -> RMSNorm(class token)
+  -> class negative-cosine target
+```
+The class readout uses the same DINO width and head geometry as the
+token-alignment block, but replaces self-attention with dense cross-attention.
+The DINO alignment loss is negative cosine on RMS-normalized features:
+```text
+class_negative_cosine_loss + spatial_negative_cosine_loss
+```
+The DINO alignment weight is `0.01`. Alignment is applied directly to clean
+latent tokens. Robustness to local token errors is handled separately by
+[random-token logSNR offset regularization](#8-logsnr-offset-regularization).
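A minimal sketch of the combined objective on plain Python lists follows. It assumes the loss is the negated cosine similarity, averaged over spatial tokens, and that the RMS normalization is non-affine. The reduction over tokens and the function names are assumptions, not taken from the training code.

```python
import math


def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Non-affine RMS normalization of one feature vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def negative_cosine(pred: list[float], target: list[float]) -> float:
    """Negated cosine similarity between RMS-normalized vectors."""
    p, t = rms_norm(pred), rms_norm(target)
    dot = sum(a * b for a, b in zip(p, t))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
    return -dot / norm


def dino_alignment_loss(pred_class, target_class, pred_spatial, target_spatial) -> float:
    """class_negative_cosine_loss + spatial_negative_cosine_loss for one image."""
    class_loss = negative_cosine(pred_class, target_class)
    spatial_loss = sum(
        negative_cosine(p, t) for p, t in zip(pred_spatial, target_spatial)
    ) / len(pred_spatial)
    return class_loss + spatial_loss


loss = dino_alignment_loss(
    pred_class=[1.0, 2.0, 3.0],
    target_class=[2.0, 4.0, 6.0],  # same direction as the prediction
    pred_spatial=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    target_spatial=[[1.0, 0.0, 0.0], [0.0, -1.0, 0.0]],  # one match, one flip
)
print(loss)
```

In training this sum would then be scaled by the alignment weight of `0.01` before being added to the reconstruction loss.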
+## 6. Class-Token Output
+`predict_class(latents)` is part of the public API. It expects the same
+whitened latent convention returned by `encode(...)`, applies dewhitening, and
+runs the DINO alignment head. The returned tensor has shape `[B, 768]` and
+lives in the DINOv3 ViT-B/16 class-token feature space.
+This output has two intended uses:
+- It adds a global semantic pressure during autoencoder training, complementing
+  the spatial patch-token alignment loss.
+- It provides a latent-space feature endpoint for FD-loss / Representation
+  Frechet Distance objectives, avoiding the cost and gradient path of decoding
+  latents back to RGB before computing representation statistics.
+The class-token output is trained by negative cosine alignment to the frozen
+DINOv3 class token.
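To illustrate the second use, the sketch below computes a Frechet-style distance between two sets of class-token features using per-dimension (diagonal) Gaussian statistics. This is a deliberate simplification for illustration: Frechet distances are usually computed with full covariance matrices, and nothing here reproduces the FD-Loss implementation. The small feature lists stand in for `predict_class(latents)` outputs and teacher class tokens.

```python
import math


def mean_and_var(features: list[list[float]]) -> tuple[list[float], list[float]]:
    """Per-dimension mean and population variance of a feature set."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var


def diagonal_frechet_distance(a: list[list[float]], b: list[list[float]]) -> float:
    """Squared Frechet distance between diagonal Gaussians fitted to a and b.

    d^2 = ||mu_a - mu_b||^2 + sum_d (sqrt(var_a[d]) - sqrt(var_b[d]))^2
    """
    mu_a, var_a = mean_and_var(a)
    mu_b, var_b = mean_and_var(b)
    mean_term = sum((x - y) ** 2 for x, y in zip(mu_a, mu_b))
    std_term = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(var_a, var_b))
    return mean_term + std_term


generated = [[0.0, 1.0], [2.0, 3.0]]   # stand-in for predicted class tokens
reference = [[1.0, 1.0], [3.0, 3.0]]   # stand-in for teacher class tokens
print(diagonal_frechet_distance(generated, reference))
```

Since the features come straight from latents, a loss of this kind can be evaluated on DiT outputs without running the diffusion decoder.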
+## 7. Training Losses
+The checkpoint was trained with these active loss terms:
+| Loss term | Weight / value |
+| --- | ---: |
+| Main VP diffusion reconstruction loss | 1.0 |
+| DINO class + spatial token alignment | 0.01 |
+| Latent posterior variance-expansion (VE) loss | 0.00003 |
+| Latent log-variance scale penalty | 0.0003 |
+The main decoder objective is VP diffusion `x_pred` with the SID2 x-prediction
+variant. The timestep sampler is uniform, with logSNR range `[-10, 10]` and
+training logSNR shift `-1.0`.
+The latent posterior regularization follows the same KL-like variance expansion
+idea described in the SemDisDiffAE report:
+https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md#32-variance-expansion-loss
+Latent running statistics use momentum `0.0005` and epsilon `0.0001`.
+`encode(...)` returns whitened latents; `decode(...)` and
+`predict_class(...)` dewhiten those latents before applying the decoder or
+class output path.
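A per-channel sketch of this convention follows. It assumes whitening is an affine normalization by running mean and variance, and that the running statistics are updated as an exponential moving average with the stated momentum. Both the EMA form and the class name `RunningWhitener` are assumptions; the exported model code is the authority on the exact computation.

```python
import math

MOMENTUM = 0.0005  # from the report
EPS = 0.0001       # from the report


class RunningWhitener:
    """Hypothetical per-channel running statistics with whiten/dewhiten."""

    def __init__(self) -> None:
        self.mean = 0.0
        self.var = 1.0

    def update(self, batch_mean: float, batch_var: float) -> None:
        # Assumed EMA update: new = (1 - m) * old + m * batch statistic.
        self.mean += MOMENTUM * (batch_mean - self.mean)
        self.var += MOMENTUM * (batch_var - self.var)

    def whiten(self, z: float) -> float:
        return (z - self.mean) / math.sqrt(self.var + EPS)

    def dewhiten(self, z_white: float) -> float:
        return z_white * math.sqrt(self.var + EPS) + self.mean


stats = RunningWhitener()
for _ in range(20_000):
    stats.update(batch_mean=0.3, batch_var=4.0)

z = 1.7
z_white = stats.whiten(z)         # the convention encode(...) returns in this sketch
z_back = stats.dewhiten(z_white)  # what decode/predict_class undo before use
print(round(stats.mean, 4), round(stats.var, 4), abs(z_back - z) < 1e-9)
```

With a momentum this small, the statistics need on the order of thousands of updates to converge, so they behave as slow-moving dataset-level estimates.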
+## 8. LogSNR Offset Regularization
+During training, 10% of spatial latent tokens across the batch/grid are
+selected at random and their posterior logSNR is shifted by `-2.0` before
+sampling.
+This injects non-smooth token-level latent errors during training. The decoder
+and the DINO alignment head both see these perturbed latents, encouraging
+robustness to downstream DiT prediction mistakes.
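A small sketch of the perturbation on per-token logSNR values follows. It assumes each token is selected independently with probability 0.1 and that the offset is added to every channel of a selected token; the report does not spell out either detail, and `offset_random_tokens` is a hypothetical name.

```python
import random

TOKEN_FRACTION = 0.10  # from the report
LOGSNR_OFFSET = -2.0   # from the report


def offset_random_tokens(
    log_snr: list[list[float]], rng: random.Random
) -> tuple[list[list[float]], list[bool]]:
    """Shift the logSNR of randomly selected tokens; return values and mask.

    log_snr holds one row per spatial token and one column per latent channel.
    """
    mask = [rng.random() < TOKEN_FRACTION for _ in log_snr]
    shifted = [
        [v + LOGSNR_OFFSET for v in row] if hit else list(row)
        for row, hit in zip(log_snr, mask)
    ]
    return shifted, mask


rng = random.Random(0)
tokens = [[3.0, 1.0, -0.5] for _ in range(10_000)]
shifted, mask = offset_random_tokens(tokens, rng)
print(sum(mask) / len(mask))
```

Lowering logSNR by `2.0` increases the noise share of the selected tokens before sampling, so their sampled latents deviate more from the posterior mean than their neighbours do.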
+## 9. Training Recipe
+The model was trained on 12M images using a single NVIDIA RTX PRO 6000
+Blackwell 96GB GPU. Training used two stages:
+| Stage | Resolution schedule | Batch size | Approx steps |
+| --- | --- | ---: | ---: |
+| Stage 1 | 90% 256-scale AR buckets, 10% 384-scale AR buckets | 128 | 150k |
+| Stage 2 | equal mix of 256/384/512/768/1024 buckets | 64 | 200k |
+The mixed-resolution second stage is especially important for the transformer
+encoder. In practice, and even with 2D RoPE, transformer blocks tolerate only
+limited resolution extrapolation unless they see the higher-resolution patch
+grids during training.
+Optimizer/training parameters:
+| Setting | Stage 1 | Stage 2 |
+| --- | ---: | ---: |
+| Optimizer | AdamW | AdamW |
+| Learning rate | 1e-4 | 5e-5 |
+| Betas | (0.9, 0.99) | (0.9, sqrt(0.98) ~= 0.98995) |
+| Weight decay | 0.0 | 0.0 |
+| EMA decay | 0.9995 | 0.9995 |
+| Warmup | 2,000 steps | 10,000 steps |
+| Precision | AMP BF16, TF32 matmul | AMP BF16, TF32 matmul |
+| Gradient clipping | 1.0 | 1.0 |
+| Optimizer state dtype | BF16 | BF16 |
+Resolution and accumulation settings:
+| Resolution | Stage 1 mix | Stage 1 grad accumulation | Stage 2 mix | Stage 2 grad accumulation |
+| ---: | ---: | ---: | ---: | ---: |
+| 256 | 90% | 1 | 20% | 1 |
+| 384 | 10% | 1 | 20% | 1 |
+| 512 | 0% | 0 | 20% | 1 |
+| 768 | 0% | 0 | 20% | 2 |
+| 1024 | 0% | 0 | 20% | 2 |
+## 10. Reconstruction Quality
+Reconstruction quality on `2000` validation images:
+| Model | Mean PSNR | Std | Median | Min | p5 | p95 | Max |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| dinac_ae | 35.19 | 4.53 | 35.06 | 22.44 | 28.02 | 42.43 | 47.31 |
+| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 22.73 | 28.89 | 43.63 | 47.38 |
+The 39-image reconstruction viewer includes originals, DINAC-AE
+reconstructions, FLUX.2 VAE references, RGB deltas, and latent PCA:
+https://huggingface.co/spaces/data-archetype/dinac_ae-results
+Rechecking the released export on that viewer set gives `35.15 dB` mean PSNR
+(`25.73` min, `45.99` max).
+## 11. Class-Token Alignment Results
+The `predict_class(...)` path is evaluated against the frozen DINOv3 ViT-B/16
+teacher class token on the same `2000` images used for reconstruction PSNR.
+| Metric | Cosine similarity |
+| --- | ---: |
+| Mean | 0.757458 |
+| Std | 0.076265 |
+| Median | 0.765958 |
+| Min | 0.394647 |
+| p5 | 0.623156 |
+| p10 | 0.656243 |
+| p25 | 0.711337 |
+| p75 | 0.813098 |
+| p90 | 0.849525 |
+| p95 | 0.865219 |
+| Max | 0.932722 |
+## 12. Encoder Throughput
+Encoder timing was measured with the released package on an NVIDIA GeForce RTX
+5090. Decoder timing is unchanged from the capacitor decoder/full_capacitor
+release because DINAC-AE uses the same decoder architecture.
+| Resolution | Batch | FLUX.2 encode ms/batch | full_capacitor ms/batch | dinac_ae ms/batch | Speedup vs FLUX.2 | dinac_ae vs full_capacitor |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: |
+| 256x256 | 128 | 383.41 | 42.56 | 50.32 | 7.62x | 1.18x slower |
+| 512x512 | 32 | 353.58 | 44.97 | 52.65 | 6.72x | 1.17x slower |
+Peak allocated encoder memory:
+| Resolution | Batch | FLUX.2 MiB | full_capacitor MiB | dinac_ae MiB | Reduction vs FLUX.2 |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| 256x256 | 128 | 12,511.0 | 1,008.2 | 1,637.3 | 86.9% |
+| 512x512 | 32 | 12,511.0 | 1,005.6 | 1,638.5 | 86.9% |
+The transformer encoder is slightly slower and uses more memory than the
+full_capacitor FCDM encoder, but it remains much faster and much smaller than
+the FLUX.2 VAE encoder.
+## References
+- Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre,
+  Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi,
+  Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt,
+  Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana,
+  Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,
+  Julien Mairal, Hervé Jégou, Patrick Labatut, and Piotr Bojanowski.
+  DINOv3. arXiv:2508.10104, 2025.
+  https://arxiv.org/abs/2508.10104
+- Jiawei Yang, Zhengyang Geng, Xuan Ju, Yonglong Tian, and Yue Wang.
+  Representation Frechet Loss for Visual Generation. arXiv:2604.28190, 2026.
+  https://arxiv.org/abs/2604.28190
+- FD-Loss implementation: https://github.com/Jiawei-Yang/FD-Loss