zzy1123 committed on
Commit 71153bb · verified · 1 Parent(s): c9236e9

Upload DiffusionLlamaLM

README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
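Since the snippet above is still marked [More Information Needed], here is a minimal loading sketch (the repository id is a placeholder): `trust_remote_code=True` is required because the architecture is wired up via `auto_map` in `config.json`. Note that `modeling_diff_llama.py` imports `xformers` unconditionally and needs the `rotary_emb` (and, for `FusedRMSNorm`, `dropout_layer_norm`) compiled extensions at runtime.

```python
# Hedged sketch -- "<user>/<repo>" is a placeholder for the actual Hub repository id.
from transformers import AutoConfig, AutoModelForCausalLM

repo_id = "<user>/<repo>"

# auto_map in config.json resolves these to the configuration_diff_llama /
# modeling_diff_llama modules shipped alongside the weights.
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).eval()
```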
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "architectures": [
+     "DiffusionLlamaLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_diff_llama.DiffusionLlamaConfig",
+     "AutoModel": "modeling_diff_llama.DiffusionLlamaLM",
+     "AutoModelForCausalLM": "modeling_diff_llama.DiffusionLlamaLM"
+   },
+   "bias": false,
+   "block_size": 2048,
+   "condense_ratio": 1,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "mask_token_id": 32000,
+   "mlp_class": "LLaMAMLP",
+   "model_type": "diff_llama",
+   "n_embd": 1024,
+   "n_head": 16,
+   "n_layer": 20,
+   "n_query_groups": 16,
+   "name": "Diff_LLaMA_336M",
+   "norm_class": "FusedRMSNorm",
+   "norm_eps": 1e-05,
+   "pad_token_id": 0,
+   "padded_vocab_size": 32000,
+   "padding_multiple": 64,
+   "parallel_residual": false,
+   "rotary_percentage": 1.0,
+   "shared_attention_norm": false,
+   "transformers_version": "4.57.3",
+   "vocab_size": 32000
+ }
configuration_diff_llama.py ADDED
@@ -0,0 +1,77 @@
+ from typing import Literal, Optional
+
+ from transformers import PretrainedConfig
+
+
+ class DiffusionLlamaConfig(PretrainedConfig):
+     model_type = "diff_llama"
+
+     def __init__(
+         self,
+         block_size: int = 4096,
+         vocab_size: int = 50254,
+         padding_multiple: int = 512,
+         padded_vocab_size: Optional[int] = None,
+         n_layer: int = 16,
+         n_head: int = 32,
+         n_embd: int = 4096,
+         rotary_percentage: float = 0.25,
+         parallel_residual: bool = True,
+         bias: bool = True,
+         n_query_groups: Optional[int] = None,
+         shared_attention_norm: bool = False,
+         norm_class: Literal["LayerNorm", "RMSNorm", "FusedRMSNorm"] = "LayerNorm",
+         norm_eps: float = 1e-5,
+         mlp_class: Literal["GptNeoxMLP", "LLaMAMLP"] = "GptNeoxMLP",
+         intermediate_size: Optional[int] = None,
+         condense_ratio: int = 1,
+         initializer_range: float = 0.02,
+         **kwargs,
+     ):
+         self.block_size = block_size
+         self.vocab_size = vocab_size
+         self.padding_multiple = padding_multiple
+
+         # Logic from the original Config.__post_init__
+         # 1. Pad the vocabulary size up to a multiple of `padding_multiple`
+         if padded_vocab_size is None:
+             self.padded_vocab_size = self._find_multiple(vocab_size, padding_multiple)
+         else:
+             self.padded_vocab_size = padded_vocab_size
+
+         self.n_layer = n_layer
+         self.n_head = n_head
+         self.n_embd = n_embd
+         self.rotary_percentage = rotary_percentage
+         self.parallel_residual = parallel_residual
+         self.bias = bias
+
+         # 2. Default the number of query groups to plain multi-head attention
+         if n_query_groups is not None:
+             self.n_query_groups = n_query_groups
+         else:
+             self.n_query_groups = n_head
+
+         self.shared_attention_norm = shared_attention_norm
+         self.norm_class = norm_class
+         self.norm_eps = norm_eps
+         self.mlp_class = mlp_class
+
+         # 3. Default the MLP width to 4x the embedding size
+         # (LLaMA-style configs usually specify intermediate_size explicitly)
+         if intermediate_size is None:
+             self.intermediate_size = 4 * n_embd
+         else:
+             self.intermediate_size = intermediate_size
+
+         self.condense_ratio = condense_ratio
+         self.initializer_range = initializer_range
+
+         super().__init__(**kwargs)
+
+     @property
+     def head_size(self) -> int:
+         return self.n_embd // self.n_head
+
+     @staticmethod
+     def _find_multiple(n: int, k: int) -> int:
+         assert k > 0
+         if n % k == 0:
+             return n
+         return n + k - (n % k)
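As a cross-check of the derived fields, a small sketch under the exact settings in `config.json` above (assuming the file is importable from a flat local directory; the asserts reflect this checkpoint, not the class defaults):

```python
# Sketch: derived values for the Diff_LLaMA_336M settings in config.json.
from configuration_diff_llama import DiffusionLlamaConfig  # assumes a flat local layout

cfg = DiffusionLlamaConfig(
    block_size=2048, vocab_size=32000, padding_multiple=64,
    n_layer=20, n_head=16, n_embd=1024, n_query_groups=16,
    intermediate_size=4096, rotary_percentage=1.0,
    norm_class="FusedRMSNorm", mlp_class="LLaMAMLP",
    parallel_residual=False, bias=False,
)
assert cfg.head_size == 64               # n_embd // n_head = 1024 // 16
assert cfg.padded_vocab_size == 32000    # 32000 is already a multiple of 64
assert cfg.n_query_groups == cfg.n_head  # 16 groups -> standard multi-head attention
```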
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:671794256bef4dff670845aca8d38e5fa382931f8f96d40028b887ee01a116f8
+ size 1604509704
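The pointer's 1,604,509,704-byte size is consistent with the config: a back-of-the-envelope sketch (layer shapes read off `modeling_diff_llama.py` below) gives ~401M float32 parameters, i.e. ~1.604 GB of tensor data, leaving ~16 KB for the safetensors header.

```python
# Back-of-the-envelope parameter count for Diff_LLaMA_336M (all float32, bias=False).
n_embd, n_layer, inter, vocab = 1024, 20, 4096, 32000

wte     = (vocab + 1) * n_embd                   # embedding has padded_vocab_size + 1 rows (mask token)
lm_head = vocab * n_embd                         # untied output projection
attn    = n_embd * 3 * n_embd + n_embd * n_embd  # fused QKV (n_query_groups == n_head) + out proj
mlp     = 3 * n_embd * inter                     # SwiGLU: two input projections + one output
norms   = 2 * n_embd                             # norm_1 and norm_2 weights per block

total = wte + lm_head + n_layer * (attn + mlp + norms) + n_embd  # + final ln_f
print(total, total * 4)  # 401123328 params -> 1604493312 bytes (+ ~16 KB header)
```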
modeling_diff_llama.py ADDED
@@ -0,0 +1,463 @@
+ import math
+ from typing import Optional, Tuple, Union
+
+ import torch
+ import torch.nn as nn
+ from torch.nn import init
+ from transformers import PreTrainedModel
+ from transformers.modeling_outputs import CausalLMOutputWithPast
+ from einops import rearrange
+ from xformers.ops import SwiGLU
+
+ from .configuration_diff_llama import DiffusionLlamaConfig
+
+ # ===========================================================================
+ # IMPORTS & CHECKS
+ # ===========================================================================
+
+ try:
+     from lightning_utilities.core.imports import RequirementCache
+     FlashAttention2Available = RequirementCache("flash-attn>=2.0.0.post1")
+ except ImportError:
+     # Fallback if lightning_utilities is missing
+     FlashAttention2Available = False
+
+ # Import compiled extensions if available
+ try:
+     import rotary_emb
+ except ImportError:
+     rotary_emb = None
+
+ try:
+     import dropout_layer_norm
+ except ImportError:
+     dropout_layer_norm = None
+
+
+ # ===========================================================================
+ # PART 1: ROTARY EMBEDDING (Autograd Function for Training)
+ # ===========================================================================
+
+ class ApplyRotaryEmb(torch.autograd.Function):
+     @staticmethod
+     @torch.compiler.disable
+     def forward(ctx, x, cos, sin, interleaved=False, inplace=False):
+         """
+         Full forward pass from fused_rotary_embedding.py
+         """
+         batch, seqlen, nheads, headdim = x.shape
+         rotary_seqlen, rotary_dim = cos.shape
+         rotary_dim *= 2
+         assert rotary_dim <= headdim
+         assert seqlen <= rotary_seqlen
+
+         x_ro = x[..., :rotary_dim]
+         x1, x2 = x_ro.chunk(2, dim=-1) if not interleaved else (x_ro[..., ::2], x_ro[..., 1::2])
+         out = torch.empty_like(x) if not inplace else x
+         out_ro = out[..., :rotary_dim]
+
+         if inplace:
+             o1, o2 = x1, x2
+         else:
+             o1, o2 = (
+                 out_ro.chunk(2, dim=-1)
+                 if not interleaved
+                 else (out_ro[..., ::2], out_ro[..., 1::2])
+             )
+
+         if rotary_emb is None:
+             # Fallback or error if extension is missing but this code path is hit
+             raise ImportError("rotary_emb extension not found. Please install it to use fused rotary embeddings.")
+
+         rotary_emb.apply_rotary(
+             x1, x2,
+             rearrange(cos[:seqlen], "s d -> s 1 d"),
+             rearrange(sin[:seqlen], "s d -> s 1 d"),
+             o1, o2,
+             False,
+         )
+
+         if not inplace and rotary_dim < headdim:
+             out[..., rotary_dim:].copy_(x[..., rotary_dim:])
+
+         ctx.save_for_backward(cos, sin)
+         ctx.interleaved = interleaved
+         ctx.inplace = inplace
+         return out if not inplace else x
+
+     @staticmethod
+     def backward(ctx, do):
+         """
+         Full backward pass from fused_rotary_embedding.py to support training
+         """
+         cos, sin = ctx.saved_tensors
+         _, seqlen, _, headdim = do.shape
+         rotary_dim = cos.shape[-1] * 2
+         inplace = ctx.inplace
+         do_ro = do[..., :rotary_dim]
+
+         do1, do2 = (
+             do_ro.chunk(2, dim=-1) if not ctx.interleaved else (do_ro[..., ::2], do_ro[..., 1::2])
+         )
+
+         dx = torch.empty_like(do) if not inplace else do
+         if inplace:
+             dx1, dx2 = do1, do2
+         else:
+             dx_ro = dx[..., :rotary_dim]
+             dx1, dx2 = (
+                 dx_ro.chunk(2, dim=-1)
+                 if not ctx.interleaved
+                 else (dx_ro[..., ::2], dx_ro[..., 1::2])
+             )
+
+         rotary_emb.apply_rotary(
+             do1, do2,
+             rearrange(cos[:seqlen], "s d -> s 1 d"),
+             rearrange(sin[:seqlen], "s d -> s 1 d"),
+             dx1, dx2,
+             True,
+         )
+
+         if not inplace and rotary_dim < headdim:
+             dx[..., rotary_dim:].copy_(do[..., rotary_dim:])
+
+         return dx, None, None, None, None
+
+ apply_rotary_emb_func = ApplyRotaryEmb.apply
+
+ def build_rope_cache(
+     seq_len: int, n_elem: int, dtype: torch.dtype, device: torch.device, base: int = 10000, condense_ratio: int = 1
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+     theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, device=device) / n_elem))
+     seq_idx = torch.arange(seq_len, device=device) / condense_ratio
+     idx_theta = torch.outer(seq_idx, theta)
+     cos, sin = torch.cos(idx_theta), torch.sin(idx_theta)
+
+     # bfloat16 is handled first; half precision covers fp16 and int8-quantized runs
+     if dtype == torch.bfloat16:
+         return cos.bfloat16(), sin.bfloat16()
+     if dtype in (torch.float16, torch.int8):
+         return cos.half(), sin.half()
+     return cos, sin
+
+
+ # ===========================================================================
+ # PART 2: NORMALIZATION (Fused RMS Norm)
+ # ===========================================================================
+
+ def maybe_align(x, alignment_in_bytes=16):
+     return x if x.data_ptr() % alignment_in_bytes == 0 else x.clone()
+
+ class DropoutAddLayerNormFn(torch.autograd.Function):
+     @staticmethod
+     @torch.compiler.disable
+     def forward(ctx, x0, residual, gamma, beta, rowscale, colscale, dropout_p, epsilon, residual_in_fp32=False, prenorm=False, is_rms_norm=False, return_dmask=False):
+         if dropout_layer_norm is None:
+             raise ImportError("dropout_layer_norm extension not found. Cannot use FusedRMSNorm.")
+
+         x0 = maybe_align(x0.contiguous(), 16)
+         residual = maybe_align(residual.contiguous(), 16) if residual is not None else None
+         gamma = maybe_align(gamma.contiguous(), 16)
+
+         zmat, xmat, dmask, mu, rsigma = dropout_layer_norm.dropout_add_ln_fwd(
+             x0.view((-1, gamma.numel())),
+             residual.view((-1, gamma.numel())) if residual is not None else None,
+             gamma,
+             None, None, None, None, None,  # unused args
+             dropout_p,
+             epsilon,
+             1.0, 0, None,
+             residual_in_fp32,
+             is_rms_norm,
+         )
+
+         # When dropout_p is 0.0 the C++ kernel returns xmat as None (an optimization),
+         # so fall back to the input x0 for the backward pass.
+         if xmat is None:
+             xmat = x0
+
+         ctx.save_for_backward(xmat.view(x0.shape), x0, dmask, gamma, mu, rsigma)
+         ctx.dropout_p = dropout_p
+         ctx.is_rms_norm = is_rms_norm
+         ctx.has_residual = residual is not None
+
+         return zmat.view(x0.shape)
+
+     @staticmethod
+     def backward(ctx, dz, *args):
+         # Full backward implementation for training
+         dz = maybe_align(dz.contiguous(), 16)
+         x, x0, dmask, gamma, mu, rsigma = ctx.saved_tensors
+
+         dx0mat, dresidualmat, dgamma, dbeta, *rest = dropout_layer_norm.dropout_add_ln_bwd(
+             dz.view((-1, gamma.numel())),  # the kernel expects 2D inputs [batch*seq, hidden]
+             None,  # dx
+             x.view((-1, gamma.numel())),
+             x0.view((-1, gamma.numel())) if x0 is not None else None,
+             dmask, mu, rsigma, gamma,
+             None, None, None, None,  # scales
+             ctx.dropout_p,
+             1.0, 0,
+             ctx.has_residual,
+             ctx.is_rms_norm,
+         )
+
+         # Reshape the flattened gradients back to the original input shape
+         dx0 = dx0mat.view(x.shape)
+         dresidual = dresidualmat.view(x.shape) if dresidualmat is not None else None
+
+         # One gradient (or None) per forward input
+         return (dx0, dresidual, dgamma, None, None, None, None, None, None, None, None, None)
+
+
+ def rms_norm(x, weight, epsilon):
+     return DropoutAddLayerNormFn.apply(x, None, weight, None, None, None, 0.0, epsilon, False, False, True)
+
+ class FusedRMSNorm(torch.nn.Module):
+     def __init__(self, size: int, dim: int = -1, eps: float = 1e-5):
+         super().__init__()
+         self.eps = eps
+         self.weight = torch.nn.Parameter(torch.ones(size))
+         self.dim = dim
+
+     def reset_parameters(self):
+         init.ones_(self.weight)
+
+     def forward(self, x):
+         return rms_norm(x, self.weight, self.eps)
+
+
+ class RMSNorm(torch.nn.Module):
+     def __init__(self, size: int, dim: int = -1, eps: float = 1e-5) -> None:
+         super().__init__()
+         self.weight = torch.nn.Parameter(torch.ones(size))
+         self.eps = eps
+         self.dim = dim
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         norm_x = torch.mean(x * x, dim=self.dim, keepdim=True)
+         x_normed = x * torch.rsqrt(norm_x + self.eps)
+         return self.weight * x_normed
+
+
+ # ===========================================================================
+ # PART 3: BLOCKS & LAYERS
+ # ===========================================================================
+
+ class GptNeoxMLP(nn.Module):
+     def __init__(self, config: DiffusionLlamaConfig) -> None:
+         super().__init__()
+         self.fc = nn.Linear(config.n_embd, config.intermediate_size, bias=config.bias)
+         self.proj = nn.Linear(config.intermediate_size, config.n_embd, bias=config.bias)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         x = self.fc(x)
+         x = torch.nn.functional.gelu(x)
+         return self.proj(x)
+
+
+ class LLaMAMLP(nn.Module):
+     def __init__(self, config: DiffusionLlamaConfig) -> None:
+         super().__init__()
+         self.swiglu = SwiGLU(config.n_embd, config.intermediate_size, bias=False, _pack_weights=False)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         return self.swiglu(x)
+
+
+ class SelfAttention(nn.Module):
+     def __init__(self, config: DiffusionLlamaConfig) -> None:
+         super().__init__()
+         shape = (config.n_head + 2 * config.n_query_groups) * config.head_size
+         self.attn = nn.Linear(config.n_embd, shape, bias=config.bias)
+         self.proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
+         self.config = config
+
+     def forward(self, x: torch.Tensor, rope: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
+         B, T, C = x.size()
+         qkv = self.attn(x)
+
+         q_per_kv = self.config.n_head // self.config.n_query_groups
+         total_qkv = q_per_kv + 2
+         qkv = qkv.view(B, T, self.config.n_query_groups, total_qkv, self.config.head_size)
+
+         q, k, v = qkv.split((q_per_kv, 1, 1), dim=-2)
+         q = q.reshape(B, T, -1, self.config.head_size)
+         k = k.reshape(B, T, -1, self.config.head_size)
+         v = v.reshape(B, T, -1, self.config.head_size)
+
+         cos, sin = rope
+
+         # Apply Rotary
+         q = apply_rotary_emb_func(q, cos, sin, False, True)
+         k = apply_rotary_emb_func(k, cos, sin, False, True)
+
+         y = self.scaled_dot_product_attention(q, k, v)
+         y = y.reshape(B, T, C)
+         y = self.proj(y)
+         return y
+
+     def scaled_dot_product_attention(self, q, k, v):
+         scale = 1.0 / math.sqrt(self.config.head_size)
+
+         # Use Flash Attention 2 if available and on CUDA
+         if FlashAttention2Available and q.device.type == "cuda" and q.dtype in (torch.float16, torch.bfloat16):
+             from flash_attn import flash_attn_func
+             return flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=scale, causal=False)
+
+         # Fallback to SDPA
+         q = q.transpose(1, 2)
+         k = k.transpose(1, 2)
+         v = v.transpose(1, 2)
+
+         # Handle GQA/MQA broadcast
+         if q.size() != k.size():
+             k = k.repeat_interleave(q.shape[1] // k.shape[1], dim=1)
+             v = v.repeat_interleave(q.shape[1] // v.shape[1], dim=1)
+
+         y = torch.nn.functional.scaled_dot_product_attention(
+             q, k, v, attn_mask=None, dropout_p=0.0, scale=scale, is_causal=False
+         )
+         return y.transpose(1, 2)
+
+ class Block(nn.Module):
+     def __init__(self, config: DiffusionLlamaConfig) -> None:
+         super().__init__()
+         # Determine classes dynamically based on config strings
+         if config.norm_class == "RMSNorm":
+             norm_cls = RMSNorm
+         elif config.norm_class == "FusedRMSNorm":
+             norm_cls = FusedRMSNorm
+         else:
+             norm_cls = getattr(torch.nn, config.norm_class)
+
+         mlp_cls = LLaMAMLP if config.mlp_class == "LLaMAMLP" else GptNeoxMLP
+
+         self.norm_1 = norm_cls(config.n_embd, eps=config.norm_eps)
+         self.attn = SelfAttention(config)
+
+         if not config.shared_attention_norm:
+             self.norm_2 = norm_cls(config.n_embd, eps=config.norm_eps)
+
+         self.mlp = mlp_cls(config)
+         self.config = config
+
+     def forward(self, x: torch.Tensor, rope: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
+         n_1 = self.norm_1(x)
+         h = self.attn(n_1, rope)
+
+         if self.config.parallel_residual:
+             n_2 = n_1 if self.config.shared_attention_norm else self.norm_2(x)
+             x = x + h + self.mlp(n_2)
+         else:
+             if self.config.shared_attention_norm:
+                 raise NotImplementedError("Shared attention norm not supported with non-parallel residual")
+             x = x + h
+             x = x + self.mlp(self.norm_2(x))
+         return x
+
+
+ # ===========================================================================
+ # PART 4: MAIN MODEL CLASSES
+ # ===========================================================================
+
+ class TransEncoder(nn.Module):
+     def __init__(self, config: DiffusionLlamaConfig) -> None:
+         super().__init__()
+         assert config.padded_vocab_size is not None
+         self.config = config
+
+         if config.norm_class == "RMSNorm":
+             norm_cls = RMSNorm
+         elif config.norm_class == "FusedRMSNorm":
+             norm_cls = FusedRMSNorm
+         else:
+             norm_cls = getattr(torch.nn, config.norm_class)
+
+         self.lm_head = nn.Linear(config.n_embd, config.padded_vocab_size, bias=False)
+         self.transformer = nn.ModuleDict(
+             dict(
+                 # +1 embedding row so mask_token_id == padded_vocab_size gets its own vector
+                 wte=nn.Embedding(config.padded_vocab_size + 1, config.n_embd),
+                 h=nn.ModuleList(Block(config) for _ in range(config.n_layer)),
+                 ln_f=norm_cls(config.n_embd, eps=config.norm_eps),
+             )
+         )
+         self.rope_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None
+
+     def forward(self, idx: torch.Tensor) -> torch.Tensor:
+         B, T = idx.size()
+
+         # Build Rope cache if needed (cached in bfloat16 regardless of model dtype)
+         if self.rope_cache is None:
+             self.rope_cache = build_rope_cache(
+                 seq_len=self.config.block_size,
+                 n_elem=int(self.config.rotary_percentage * self.config.head_size),
+                 dtype=torch.bfloat16,
+                 device=idx.device,
+                 condense_ratio=self.config.condense_ratio,
+             )
+
+         # Retrieve and slice cache
+         cos, sin = self.rope_cache
+         cos = cos[:T]
+         sin = sin[:T]
+
+         x = self.transformer.wte(idx)
+         for block in self.transformer.h:
+             x = block(x, (cos, sin))
+
+         x = self.transformer.ln_f(x)
+         return self.lm_head(x)
+
+
+ class DiffusionLlamaLM(PreTrainedModel):
+     config_class = DiffusionLlamaConfig
+     base_model_prefix = "model"
+
+     def __init__(self, config: DiffusionLlamaConfig):
+         super().__init__(config)
+         self.model = TransEncoder(config)
+
+         # Initialize weights via the HF post-init hook (calls _init_weights below)
+         self.post_init()
+
+     def _init_weights(self, module: nn.Module) -> None:
+         """
+         Initialization logic for training.
+         Adapted from the original TransEncoder._init_weights.
+         """
+         n_layer = self.config.n_layer
+
+         if isinstance(module, nn.Embedding):
+             torch.nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(2.0 / 5 / self.config.n_embd))
+         elif isinstance(module, nn.Linear):
+             torch.nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(2.0 / 5 / self.config.n_embd))
+             if module.bias is not None:
+                 torch.nn.init.zeros_(module.bias)
+
+         # Special depth-scaled initialization for SwiGLU / output projections.
+         # In HF _init_weights, `module` is the current submodule; match on type.
+         if isinstance(module, LLaMAMLP):
+             for name, p in module.named_parameters():
+                 if "proj.weight" in name:
+                     nn.init.normal_(p, mean=0.0, std=1 / math.sqrt(self.config.n_embd) / n_layer)
+
+         if isinstance(module, SwiGLU):
+             for name, p in module.named_parameters():
+                 if "w3.weight" in name:
+                     nn.init.normal_(p, mean=0.0, std=1 / math.sqrt(self.config.n_embd) / n_layer)
+
+         if isinstance(module, SelfAttention):
+             for name, p in module.named_parameters():
+                 if "proj.weight" in name:
+                     nn.init.normal_(p, mean=0.0, std=1 / math.sqrt(self.config.n_embd) / n_layer)
+
+     def forward(self, input_ids: torch.Tensor, labels: Optional[torch.Tensor] = None, return_dict: Optional[bool] = None, **kwargs) -> Union[Tuple, CausalLMOutputWithPast]:
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         logits = self.model(input_ids)
+
+         loss = None
+         if labels is not None:
+             # Shift so that tokens < n predict n
+             shift_logits = logits[..., :-1, :].contiguous()
+             shift_labels = labels[..., 1:].contiguous()
+             loss_fct = nn.CrossEntropyLoss()
+             loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
+
+         if not return_dict:
+             return ((loss,) + (logits,)) if loss is not None else (logits,)
+
+         return CausalLMOutputWithPast(loss=loss, logits=logits)
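Two implementation details stand out when driving this model: the embedding table has `padded_vocab_size + 1` rows so the mask token (`mask_token_id = 32000` in `config.json`) gets its own vector, and attention runs with `causal=False` / `is_causal=False`, i.e. bidirectionally. A minimal smoke-test sketch, assuming the checkpoint is already loaded as `model` (e.g. via the snippet in the README section above); the 15% masking rate is arbitrary:

```python
# Sketch: feed a partially masked sequence through the bidirectional denoiser.
import torch

MASK_ID = 32000  # == padded_vocab_size; the extra embedding row added in TransEncoder

input_ids = torch.randint(0, 32000, (1, 128))  # stand-in token ids
mask = torch.rand(input_ids.shape) < 0.15      # mask ~15% of positions (arbitrary rate)
noisy_ids = input_ids.masked_fill(mask, MASK_ID)

with torch.no_grad():
    out = model(noisy_ids)                     # CausalLMOutputWithPast
print(out.logits.shape)                        # torch.Size([1, 128, 32000])
```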