Epsilonoid committed · verified · Commit db56acd · 1 parent: 1d8542d
README.md ADDED
# Diffusion Engram IME Demo

[English Version](#introduction)

本项目探索了一种基于扩散语言模型的输入法实现思路。它基于 [LLaDA](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) 的实现,并融合了 [Engram](https://github.com/deepseek-ai/Engram) 模块,期盼利用语言模型强大的上下文理解能力来提升长句输入的准确性与连贯性。

模型在 [Chinese Fineweb Edu Dataset V2.1](https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1) 上进行了训练,采用[虎码输入方案](https://www.tiger-code.com/)作为编码标准。

## 使用方式

本项目提供了简易的交互脚本 `example.py`,用于演示核心功能。

### 输入格式
在提示符下输入字符串。
- **汉字**: 请输入其虎码编码的前两位。
- **标点/大写字母/特殊符号**: 直接输入原文即可。符号可以输入半角版本。
- **小写字母**: 输入字母+空格。
- **混合输入**: 支持编码与原文混合输入。

部分实例见 `example.py` 末尾注释。

## 局限性

⚠️ 本项目可能包含 ⚠️:
- 没有认真处理和选择的训练语料
- 拍脑袋想出来的模型架构和超参数
- 低效的模型实现和推理代码

## 杂谈

其实早在 ChatGPT 还没有横空出世、我还没有了解过 transformer 的时候,我就思考过能不能用深度学习模型做出更加强大的输入法。那时候我试着学习形码(最后并没有坚持下来),也常常幻想其他提升输入效率的手段。我曾想输入法也许可以更熟悉语言应当有的语法和语义、能以更高的概率组出合理的句子。当然它的用户界面可能与现在的输入法大相径庭、以输入句子甚至段落为核心,从而能利用起来上下文的信息。(当然做这事的人可能不少。)不过那时没有 vibe coding 帮忙,我的行动力不足以让我把这些想法变成现实。

随着 LLM 的发展,我愈发觉得输入效率很大程度上限制了人机交互的效率。了解到 diffusion language model 的思路后,我觉得这非常适合输入法的场景:模型需要根据完整的上下文来预测每个字(而在自回归模型上做 contrastive decoding 只能感知到上文),并且可以从易到难地逐步推断原文,甚至可以无缝地处理部分词由用户手动选择的情况。之所以选择形码,首要原因是形码不会有多音字的问题,数据处理简单一些,且每个编码上的字数分布更加均匀。不过最后训练下来,效果不太理想。

看到 Engram 的时候,我立刻开始重新尝试这个项目。Engram 所做的对 n-gram 查表,几乎就是完美地承担了“词库”的职责,Engram 模块应当可以大幅度减轻模型主干记忆词库的压力。训练下来,结果确实也比之前好不少。

这个项目离实际可用的输入法还有很大差距:最显然的当然是其没有一个合适的用户界面,要有合适的方式让用户修改候选结果、且能适应零点几秒的延迟;模型的训练数据在类型上很窄,尤其缺乏口语化或文学化的内容;模型的推理几乎没有优化过;模型具体应该做多大、超参数如何选择也没有认真实验过;等等。除此之外,怎么让模型利用已经上屏的部分作为上下文,以及能否针对性地再改造 Engram(和其他各个模块)使其更适合输入法场景,都是潜在的改进方向。

---

## Introduction

This project explores an implementation approach for an Input Method Editor (IME) based on diffusion language models. It builds on the implementation of [LLaDA](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) and incorporates the [Engram](https://github.com/deepseek-ai/Engram) module, aiming to leverage the strong contextual understanding of language models to improve the accuracy and coherence of long-sentence input.

The model was trained on the [Chinese Fineweb Edu Dataset V2.1](https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1) and uses the [Tiger Code (Huma) input scheme](https://www.tiger-code.com/) as its encoding standard.

## Usage

This project provides a simple interactive script, `example.py`, to demonstrate the core functionality.

### Input Format
Enter a string at the prompt.
- **Chinese characters**: enter the first two letters of the character's Tiger Code encoding.
- **Punctuation/uppercase letters/special symbols**: enter them literally. Symbols may be typed in their half-width forms.
- **Lowercase letters**: enter the letter followed by a space.
- **Mixed input**: encodings and literal text can be mixed freely.

See the comments at the end of `example.py` for examples.
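The segmentation rules above can be sketched as a small pure-Python function. This is a simplified illustration mirroring `process_string_into_pairs` in `example.py`, not a separate API:

```python
def segment(s: str) -> list[str]:
    """Split an input string into per-character input units.

    - Two consecutive lowercase letters form one Tiger Code unit (one Chinese character).
    - A lowercase letter followed by a space is a literal lowercase letter.
    - Everything else (punctuation, uppercase, non-ASCII) passes through unchanged.
    """
    out, i = [], 0
    while i < len(s):
        c = s[i]
        if "a" <= c <= "z":
            if i + 1 < len(s) and "a" <= s[i + 1] <= "z":
                out.append(c + s[i + 1])  # two-letter Tiger Code unit
                i += 2
            elif i + 1 < len(s) and s[i + 1] == " ":
                out.append(c)  # literal lowercase letter; the space is consumed
                i += 2
            else:
                out.append(c)  # lone trailing lowercase letter
                i += 1
        else:
            out.append(c)  # punctuation / uppercase / full-width characters
            i += 1
    return out
```

For example, `segment("nhkz。")` yields `["nh", "kz", "。"]`: two Tiger Code units followed by a literal full-width period.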

## Limitations

⚠️ This project may contain ⚠️:
- Training corpora that were not carefully processed or selected
- A model architecture and hyperparameters conceived on a whim
- An inefficient model implementation and inference code

## Ramblings

Long before ChatGPT took the world by storm, and before I knew anything about transformers, I wondered whether deep learning models could power a more capable IME. At the time I tried to learn a shape-based input method (I didn't stick with it) and often daydreamed about other ways to improve input efficiency. I imagined an IME that was more familiar with the syntax and semantics a language should have, and could therefore compose reasonable sentences with higher probability. Its user interface might look very different from today's IMEs, centered on entering whole sentences or even paragraphs so that contextual information could be exploited. (No doubt plenty of people have had the same idea.) But without "vibe coding" to help back then, I lacked the drive to turn these ideas into reality.

As LLMs developed, I increasingly felt that input efficiency is a major bottleneck in human-computer interaction. When I learned about diffusion language models, the approach struck me as a natural fit for the IME setting: the model predicts each character from the complete context (whereas contrastive decoding on an autoregressive model only sees the preceding text), it can recover the original text progressively from easy positions to hard ones, and it can even seamlessly handle the case where the user has already selected some words by hand. The main reason for choosing a shape-based code is that it avoids the polyphone problem, which simplifies data processing, and characters are distributed more evenly across codes. Still, the first round of training did not produce satisfying results.
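The easy-to-difficult decoding described above is implemented in `example.py`; its core loop can be reduced to the following sketch, where `predict` is a toy stand-in for the model's per-position distribution (not the real model call):

```python
def iterative_unmask(masks, predict, threshold=0.9):
    """Greedy confidence-threshold decoding over a masked sequence.

    `masks` is a list of None (still masked) or str (already decided).
    `predict(seq)` returns, for every position, a (probability, token) pair
    for the most likely token given the full partially-decoded context.
    """
    seq = list(masks)
    while any(tok is None for tok in seq):
        preds = predict(seq)
        progressed = False
        best = (0.0, None, None)  # (probability, index, token)
        for i, tok in enumerate(seq):
            if tok is None:
                p, t = preds[i]
                best = max(best, (p, i, t))
                if p > threshold:  # confident: commit this position now
                    seq[i] = t
                    progressed = True
        if not progressed:  # nothing above threshold: commit the single most certain one
            seq[best[1]] = best[2]
    return seq
```

Each round commits every position the model is confident about; when no position clears the threshold, only the single most certain one is committed, so hard positions are decided last, with the most context available.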

When I saw Engram, I immediately picked this project back up. Engram's n-gram lookup takes on, almost perfectly, the role of a "lexicon", so the Engram module should greatly reduce the pressure on the model backbone to memorize one. After training, the results were indeed much better than before.
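The "lexicon" role rests on hashed n-gram tables: recent tokens are hashed into a fixed-size embedding table whose rows are learned during training. The sketch below shows only this core idea; the function name, key format, and table size are illustrative, not Engram's actual implementation:

```python
import zlib


def ngram_bucket(tokens, n, table_size, seed=0):
    """Hash the last n token ids into a bucket of a fixed-size embedding table.

    Collisions are allowed by design: the table is far smaller than the number
    of distinct n-grams, and colliding n-grams share one learned row.
    """
    key = ",".join(str(t) for t in tokens[-n:]) + f"|{seed}"
    return zlib.crc32(key.encode()) % table_size


# Each position can fetch one learned row per n-gram order (bigram, trigram, ...)
# so "dictionary" knowledge lives in the tables rather than the backbone weights.
```

Because the bucket depends only on the last `n` tokens, the lookup is a cheap, deterministic memory access, which is what makes it behave like a word-list consultation.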

This project is still far from a practically usable IME. The most obvious gap is the lack of a suitable user interface, one that lets users edit candidate results and tolerates a few tenths of a second of latency; the training data is narrow in genre, especially lacking colloquial and literary content; inference is essentially unoptimized; no serious experiments have been run on model size or hyperparameter choices; and so on. Beyond that, letting the model use already-committed text as context, and adapting Engram (and the other modules) specifically for the IME setting, are potential directions for improvement.
config.json ADDED
{
  "activation_type": "silu",
  "alibi": false,
  "alibi_bias_max": 8.0,
  "architectures": [
    "LLaDAModelLM"
  ],
  "attention_dropout": 0.0,
  "attention_layer_norm": false,
  "attention_layer_norm_with_affine": true,
  "auto_map": {
    "AutoConfig": "configuration_llada_engram.LLaDAConfig",
    "AutoModelForCausalLM": "modeling_llada_engram.LLaDAModelLM",
    "AutoModel": "modeling_llada_engram.LLaDAModelLM"
  },
  "bias_for_layer_norm": false,
  "block_group_size": 1,
  "block_type": "llama",
  "d_model": 768,
  "embedding_dropout": 0.0,
  "embedding_size": 7424,
  "eos_token_id": 6624,
  "flash_attention": false,
  "include_bias": false,
  "include_qkv_bias": false,
  "init_cutoff_factor": null,
  "init_device": "meta",
  "init_fn": "mitchell",
  "init_std": 0.02,
  "input_emb_norm": false,
  "layer_norm_type": "rms",
  "layer_norm_with_affine": true,
  "mask_token_id": 7186,
  "max_sequence_length": 128,
  "mlp_hidden_size": 1536,
  "model_type": "llada",
  "multi_query_attention": null,
  "n_heads": 12,
  "n_kv_heads": 12,
  "n_layers": 14,
  "pad_token_id": 6624,
  "precision": "amp_bf16",
  "residual_dropout": 0.0,
  "rms_norm_eps": 1e-05,
  "rope": true,
  "rope_full_precision": true,
  "rope_theta": 500000.0,
  "scale_logits": false,
  "transformers_version": "4.46.3",
  "use_cache": false,
  "vocab_size": 7397,
  "weight_tying": false,
  "engram_config": {
    "tokenizer_name_or_path": "./tokenizer.json",
    "engram_vocab_size": [
      51200,
      51200,
      51200
    ],
    "max_ngram_size": 4,
    "n_embed_per_ngram": 256,
    "n_head_per_ngram": 4,
    "layer_ids": [
      1,
      7
    ],
    "pad_id": 6629,
    "seed": 42,
    "kernel_size": 7
  }
}
configuration_llada_engram.py ADDED
"""
LLaDA configuration
"""
from transformers import AutoConfig, PretrainedConfig

from enum import Enum
from os import PathLike
from dataclasses import dataclass, field
from typing import (
    Any,
    Dict,
    Iterable,
    List,
    Optional,
    Tuple,
    Type,
    TypeVar,
    Union,
    cast,
)


__all__ = [
    "ActivationType",
    "ActivationCheckpointingStrategy",
    "BlockType",
    "LayerNormType",
    "InitFnType",
    "ModelConfig",
]

PathOrStr = Union[str, PathLike]


class StrEnum(str, Enum):
    """
    This is equivalent to Python's :class:`enum.StrEnum` since version 3.11.
    We include this here for compatibility with older versions of Python.
    """

    def __str__(self) -> str:
        return self.value

    def __repr__(self) -> str:
        return f"'{str(self)}'"


class LayerNormType(StrEnum):
    default = "default"
    """
    The default LayerNorm implementation, equivalent to PyTorch's built-in version.
    """

    low_precision = "low_precision"
    """
    A low-precision version of the default LayerNorm.
    """

    rms = "rms"
    """
    An RMSNorm implementation. When using ``torch.compile`` this is
    probably the fastest implementation.
    """

    gemma_rms = "gemma_rms"
    """
    An RMSNorm implementation from Gemma. When using ``torch.compile`` this is
    probably the fastest implementation.
    """

    amd_compatible = "amd_compatible"
    """
    LayerNorm implemented manually to work around an issue with ROCm.
    """


class ActivationType(StrEnum):
    gelu = "gelu"
    relu = "relu"
    silu = "silu"
    swiglu = "swiglu"


class BlockType(StrEnum):
    sequential = "sequential"
    parallel = "parallel"

    llama = "llama"
    """
    A block similar to the sequential block with slightly different
    implementations of operations like attention to imitate the behavior of Llama.
    """


class InitFnType(StrEnum):
    mitchell = "mitchell"
    """
    The strategy suggested to us by Mitchell Wortsman from UW.
    This uses a truncated normal distribution with an adaptive standard deviation that depends
    on the size of the weights as well as the depth of the layer.
    """

    normal = "normal"
    """
    All weights are initialized from the same normal distribution.
    """

    kaiming_normal = "kaiming_normal"
    """
    All weights are initialized with the Kaiming method from a normal distribution.
    Note this currently won't work with FSDP.
    """

    fan_in = "fan_in"
    """
    "Fan-in variance scaling", i.e. normal with a standard deviation of ``1/sqrt(d_in)`` where ``d_in``
    is the input dimensionality of the kernel.
    """

    full_megatron = "full_megatron"
    """
    This is what metaseq calls "full megatron init". It is the init used for Llama 2.
    """


@dataclass
class EngramConfig:
    tokenizer_name_or_path: str = "deepseek-ai/DeepSeek-V3"
    engram_vocab_size: List[int] = field(default_factory=lambda: [129280 * 5, 129280 * 5])
    max_ngram_size: int = 3
    n_embed_per_ngram: int = 512
    n_head_per_ngram: int = 8
    layer_ids: List[int] = field(default_factory=lambda: [1, 15])
    pad_id: int = 2
    seed: int = 0
    kernel_size: int = 7


@dataclass
class ModelConfig:
    """
    LLaDA (model) configuration.
    """

    # Note that the defaults for these attributes are equivalent to the base GPT2 model.

    d_model: int = 768
    """
    The hidden size of the model.
    """

    n_heads: int = 12
    """
    The number of self-attention heads.
    """

    n_kv_heads: Optional[int] = None
    """
    The number of heads to use for keys and values. Defaults to `n_heads`.
    Set this to ``None`` or ``n_heads`` for normal multi-head attention.
    Set this to 1 for multi-query attention.
    Set it to some in-between value for Llama2-style grouped query attention.
    """

    n_layers: int = 12
    """
    The number of layers/blocks.
    """

    mlp_ratio: int = 4
    """
    The ratio of the inner MLP dimensionality to ``d_model``.
    This is only used when ``mlp_hidden_size`` is not set.
    """

    mlp_hidden_size: Optional[int] = None
    """
    Set the exact hidden size for the MLP. Otherwise the inner MLP hidden size will be set to `mlp_ratio * d_model`.
    """

    activation_type: ActivationType = ActivationType.swiglu
    """
    The activation function to use within the MLP layers.
    """

    block_type: BlockType = BlockType.sequential
    """
    The transformer block implementation.
    """

    block_group_size: int = 1
    """
    The number of blocks to group together into a single parent block.
    This has no effect on the number of parameters in the model and is only used to wrap groups
    of blocks together with a single FSDP wrapper during training.
    """

    alibi: bool = False
    """
    If ``True``, use ALiBi embeddings. Mutually exclusive with ``rope``.
    """

    alibi_bias_max: float = 8.0
    """
    Maximum absolute value of ALiBi bias.
    """

    rope: bool = False
    """
    Use rotary positional embeddings (RoPE). Mutually exclusive with ``alibi``.
    """

    rope_full_precision: bool = True
    """
    If ``True``, apply RoPE embeddings at full precision regardless of the input type. Otherwise,
    apply RoPE at the precision of the input.
    """

    flash_attention: bool = False
    """
    If ``True``, use ``FlashAttention``.
    """

    attention_dropout: float = 0.1
    """
    The dropout probability within the attention modules.
    """

    multi_query_attention: Optional[bool] = None
    """
    Use the Multi-Query formulation of attention used in PaLM. This reduces the number of parameters
    and is more efficient during inference.
    """

    attention_layer_norm: bool = False
    """
    Apply layer norm to the keys and queries within the attention mechanism.
    This can help stabilize training.
    """

    residual_dropout: float = 0.1
    """
    The dropout probability for the MLP and attention output within each block.
    """

    embedding_dropout: float = 0.1
    """
    The dropout probability for embeddings.
    """

    input_emb_norm: bool = False
    """
    An input hidden_states norm implementation from Gemma.
    """

    layer_norm_type: LayerNormType = LayerNormType.default
    """
    The layernorm implementation to use.
    """

    layer_norm_with_affine: bool = True
    """
    Whether to include bias and weight parameters for the layer norms.
    This only affects layer norms that are immediately followed by a linear layer in the forward pass,
    so everything except QK-norms. To turn off affines for QK norms as well, set :attr:`attention_layer_norm_with_affine`
    to ``False``.
    """

    rms_norm_eps: float = 1e-05
    """
    The RMS layernorm eps param.
    """

    attention_layer_norm_with_affine: bool = True
    """
    Toggle affine transform for the QK norms.
    """

    max_sequence_length: int = 1024
    """
    The maximum input sequence length supported by the model.
    """

    rope_theta: float = 10000.0
    """
    The RoPE base param.
    """

    include_qkv_bias: Optional[bool] = False
    """
    Whether or not to include bias parameters in QKV linear layers.
    """

    include_bias: bool = False
    """
    Whether or not to include bias parameters in linear layers.
    In PaLM, they got rid of all bias terms because they found that large
    models tend to have near 0 bias terms anyway.
    """

    bias_for_layer_norm: Optional[bool] = None
    """
    Whether or not to include bias parameters in layer norm.
    This is separate from the include_bias parameter, because of a ROCm crash when biases are disabled in
    layer norm.
    When this is None (the default), it inherits the setting from include_bias.
    """

    scale_logits: bool = False
    """
    If ``True``, scale the output logits by ``1 / sqrt(d_model)``.
    """

    vocab_size: int = 50257
    """
    Vocabulary size of the model.
    """

    embedding_size: Optional[int] = 50304
    """
    The number of embeddings, i.e. the number of tokens. If set to ``None`` it will default
    to ``vocab_size``. If ``vocab_size`` is not a multiple of 128, setting this to the
    next multiple of 128 that's greater than ``vocab_size`` can improve throughput
    substantially.
    """

    weight_tying: bool = True
    """
    Whether to tie output linear weights to the input embedding.
    """

    eos_token_id: int = 50256
    """
    The ID of the end-of-sentence special token.
    """

    pad_token_id: int = 50256
    """
    The ID of the token to use for padding. Defaults to the ID of the EOS token.
    """

    mask_token_id: Optional[int] = 50256
    """
    The ID of the mask token. Defaults to the ID of the EOS token.
    """

    init_device: Optional[str] = None
    """
    The torch device to use when initializing the model parameters, e.g. "cpu", "cuda:0", "meta".
    """

    init_fn: InitFnType = InitFnType.normal
    """
    The weight initialization strategy.
    """

    init_std: float = 0.02
    """
    The standard deviation to use when initializing weights with a "fixed distribution" ``init_fn``, such
    as "normal".
    """

    init_cutoff_factor: Optional[float] = None
    """
    A positive factor used to scale the cutoff values when initializing weights with a "fixed distribution" ``init_fn``, such
    as "normal". Setting this to None means values are not cut off.
    """

    precision: Optional[str] = None
    """
    Precision used to train/evaluate with. You shouldn't set this directly.
    See :data:`TrainConfig.precision` instead.
    """

    engram_config: Optional[EngramConfig] = None

    @property
    def effective_n_kv_heads(self) -> int:
        if self.n_kv_heads is None:
            if self.multi_query_attention is True:
                return 1
            else:
                return self.n_heads
        else:
            if self.multi_query_attention is None:
                return self.n_kv_heads
            if self.multi_query_attention:
                n_kv_heads_should_be = 1
            else:
                n_kv_heads_should_be = self.n_heads
            if self.n_kv_heads == n_kv_heads_should_be:
                return n_kv_heads_should_be
            else:
                raise Exception(
                    "You can't set `multi_query_attention` and `n_kv_heads` at the same time."
                )


class ActivationCheckpointingStrategy(StrEnum):
    whole_layer = "whole_layer"
    """
    Checkpoint every transformer layer.
    """

    one_in_two = "one_in_two"
    """
    Checkpoint one in two transformer layers.
    """

    one_in_three = "one_in_three"
    """
    Checkpoint one in three transformer layers.
    """

    one_in_four = "one_in_four"
    """
    Checkpoint one in four transformer layers.
    """

    two_in_three = "two_in_three"
    """
    Checkpoint two out of every three transformer layers.
    """

    three_in_four = "three_in_four"
    """
    Checkpoint three out of every four transformer layers.
    """

    four_in_five = "four_in_five"
    """
    Checkpoint four out of every five transformer layers.
    """

    nine_in_ten = "nine_in_ten"
    """
    Checkpoint nine out of every ten transformer layers.
    """

    fine_grained = "fine_grained"
    """
    Focus checkpointing where recomputation is cheap and the memory savings are largest.
    """


class LLaDAConfig(PretrainedConfig):
    model_type = "llada"
    keys_to_ignore_at_inference = ["past_key_values"]  # TODO: confirm

    def __init__(self, use_cache: bool = False, **kwargs):
        model_config = ModelConfig()
        all_kwargs = model_config.__dict__
        all_kwargs.update(kwargs)
        all_kwargs.update({"use_cache": use_cache})
        all_kwargs.update(
            {"architectures": all_kwargs.get("architectures", ["LLaDAModelLM"])}
        )
        super().__init__(**all_kwargs)

    @property
    def num_attention_heads(self):
        return self.n_heads

    @property
    def num_hidden_layers(self):
        return self.n_layers

    @property
    def hidden_size(self):
        return self.d_model


# Register the config class so that it is available for transformer pipelines, auto-loading, etc.
AutoConfig.register("llada", LLaDAConfig)
example.py ADDED
from tokenizers import Tokenizer
import torch


def process_string_into_pairs(input_str: str) -> list[str]:
    result = []
    i = 0
    n = len(input_str)

    while i < n:
        char = input_str[i]

        # Is the current character a lowercase letter?
        if "a" <= char <= "z":
            # The next character is also lowercase: the two form a code pair
            if i + 1 < n and "a" <= input_str[i + 1] <= "z":
                result.append(char + input_str[i + 1])
                i += 2  # skip both characters
            # The next character is a space: lone lowercase letter + space
            elif i + 1 < n and input_str[i + 1] == " ":
                result.append(char)
                i += 2  # skip the letter and the following space
            # Lone lowercase letter followed by another character, or at the end
            else:
                result.append(char)
                i += 1
        # The current character is not a lowercase letter
        else:
            result.append(char)
            i += 1

    return result


def get_mask_from_string(input_str: str, tokenizer) -> torch.Tensor:
    pairs = process_string_into_pairs(input_str)
    masks = [
        f"<|mask_{pair}|>" if all(ord(i) < 128 for i in pair) else pair
        for pair in pairs
    ]
    mask_tensor = torch.tensor(
        [tokenizer.token_to_id(mask) for mask in masks], dtype=torch.long
    )
    return mask_tensor


def inference(model, input_str: str, tokenizer, device, threshold=0.9):
    model.eval()

    # Initialize NgramHashMapping
    engram_cfg = model.config.engram_config
    hash_mapping = None
    if engram_cfg is not None:
        from modeling_llada_engram import ModelConfig, NgramHashMapping
        from dataclasses import fields

        # Prepare a ModelConfig for NgramHashMapping, keeping only keys that
        # are actual ModelConfig fields
        backbone_config_dict = model.config.to_dict()
        backbone_config = ModelConfig(
            **{
                k: v
                for k, v in backbone_config_dict.items()
                if k in {f.name for f in fields(ModelConfig)}
            }
        )

        hash_mapping = NgramHashMapping(
            engram_vocab_size=engram_cfg.get("engram_vocab_size", [129280 * 5, 129280 * 5]),
            max_ngram_size=engram_cfg.get("max_ngram_size", 3),
            n_embed_per_ngram=engram_cfg.get("n_embed_per_ngram", 512),
            n_head_per_ngram=engram_cfg.get("n_head_per_ngram", 8),
            layer_ids=engram_cfg.get("layer_ids", [1, 15]),
            pad_id=engram_cfg.get("pad_id", 2),
            seed=engram_cfg.get("seed", 0),
            config=backbone_config,
        )

    with torch.no_grad():
        mask_tensor = get_mask_from_string(input_str, tokenizer).unsqueeze(0).to(device)
        is_masked = mask_tensor >= tokenizer.token_to_id("<|mask|>")
        rounds = 0
        while is_masked.any():
            rounds += 1

            output = model(input_ids=mask_tensor)[0]
            # Logits to probabilities
            output = torch.softmax(output, dim=-1)
            unmasked_any = False
            prob_info = []

            most_certain_token = (0, 0, 0)  # (probability, index, token_id)
            # Check each token that is still masked
            for i in range(mask_tensor.shape[1]):
                if is_masked[0, i]:
                    # Get the token with the highest probability
                    predicted_token = output[0, i].argmax().item()
                    prob_info.append(
                        f"{output[0, i, predicted_token].item():.2f} {tokenizer.id_to_token(predicted_token)}"
                    )
                    most_certain_token = max(
                        most_certain_token,
                        (output[0, i, predicted_token].item(), i, predicted_token),
                    )
                    # If the probability is above the threshold, replace the mask
                    if output[0, i, predicted_token].item() > threshold:
                        mask_tensor[0, i] = predicted_token
                        is_masked[0, i] = False
                        unmasked_any = True
                else:
                    prob_info.append("")
            if not unmasked_any:
                # Nothing cleared the threshold: unmask the single most certain token
                mask_tensor[0, most_certain_token[1]] = most_certain_token[2]
                is_masked[0, most_certain_token[1]] = False

            masked_str = "".join(
                (
                    tokenizer.id_to_token(mask_tensor[0, i].item())
                    if not is_masked[0, i]
                    else tokenizer.id_to_token(mask_tensor[0, i].item())[7:-2]
                )
                for i in range(mask_tensor.shape[1])
            )
            print(masked_str)


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = Tokenizer.from_file("tokenizer.json")

    # Load from the local directory with AutoModel.
    # Note: transformers must be installed and trust_remote_code=True is required.
    try:
        from transformers import AutoModelForCausalLM

        model = AutoModelForCausalLM.from_pretrained(".", trust_remote_code=True).to(device)
    except Exception as e:
        print(f"Failed to load with AutoModel: {e}")
        raise

    # bfloat16 on GPU, float32 on CPU
    model = model.to(torch.bfloat16) if device.type == "cuda" else model.float()
    print("Loaded model. Parameters:", sum(p.numel() for p in model.parameters()))

    threshold = 0.9

    while True:
        input_str = input("Enter a string to process: ")
        inference(model, input_str, tokenizer, device, threshold=threshold)
        print("")  # blank line between runs

    # Input example: nhkzotdgjvdmleunkmiekz。
    # Output: 黄河是中华民族的母亲河。

    # Input example: mdflswsyelfl,eyxxmdswsyelfl,raxxmdelfl,otfixdzhfnjrugfoirmbisunswsyelfl。zhldxxdgun“mdfl”uvelflqhnvxtmdunkmpbofvjcjnnmdunsoirpbucheel。
    # Output: 大型语言模型,也称大语言模型,简称大模型,是一种基于人工神经网络的语言模型。其名称中的“大型”指模型具有庞大的参数量以及巨大的训练数据规模。

    # Input example: hgzz(Go o g l e )otfiwjpmrnxjuchkaf,hdidjifngmrnsdoovsoggn.
    # Output:
    # 谷歌(Google)是一家跨国科技公司,总部位于美国加州山景城.
    # 谷歌(Google)是一家跨国科技公司,总部位于美国加州山景城。
    # 谷歌(Google)是一家跨国科技公司,总部位于美国加州山景城。
    # 谷歌(Google)是一家跨国科技公司,总部位于美国加州山景城。
    # 谷歌(Google)是一家跨国科技公司,总部位于美国加州山景城。

    # Input example: jxvuygvbotghtusvwtvbdt。auwvvbotcbghwhtkshdl?
    # Output:
    # 天对地,雨对风。大陆对长空。山lj对ke树,赤日对ljeb。雷隐隐,雾蒙蒙。日下对天中。风高秋月白,雨tq晚霞红。
    # 天对地,雨对风。大陆对长空。山lj对杂树,赤日对苍eb。雷隐隐,雾蒙蒙。日下对天中。风高秋月白,雨雷晚霞红。
    # 天对地,雨对风。大陆对长空。山lj对杂树,赤日对苍穹。雷隐隐,雾蒙蒙。日下对天中。风高秋月白,雨雷晚霞红。
    # 天对地,雨对风。大陆对长空。山苍对杂树,赤日对苍穹。雷隐隐,雾蒙蒙。日下对天中。风高秋月白,雨雷晚霞红。
    # (Expected Output: 天对地,雨对风。大陆对长空。山花对海树,赤日对苍穹。雷隐隐,雾蒙蒙。日下对天中。风高秋月白,雨霁晚霞红。)
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a0114ad14a6671ade8155e31d930bb1c19779dab6af574d139d11678d7152270
size 700907000
modeling_llada_engram.py ADDED
1
+ from __future__ import annotations
2
+
3
+ import logging
4
+ import math
5
+ import sys
6
+ from abc import abstractmethod
7
+ from collections import defaultdict
8
+ from functools import partial
9
+ from typing import (
10
+ Callable,
11
+ Dict,
12
+ Iterable,
13
+ List,
14
+ NamedTuple,
15
+ Optional,
16
+ Sequence,
17
+ Set,
18
+ Tuple,
19
+ cast,
20
+ )
21
+ from dataclasses import fields
22
+ from typing import Union
23
+
24
+ import torch
25
+ import torch.backends.cuda
26
+ import torch.nn as nn
27
+ import torch.nn.functional as F
28
+ import torch.nn.utils.rnn as rnn_utils
29
+ from torch import einsum
30
+ from transformers import PreTrainedModel
31
+ from transformers.modeling_outputs import CausalLMOutputWithPast
32
+ from transformers.models.auto import AutoModel
33
+ from transformers.models.auto.tokenization_auto import AutoTokenizer
34
+ from transformers.cache_utils import Cache
35
+ from sympy import isprime
36
+ import numpy as np
37
+
38
+ from configuration_llada_engram import (
39
+ EngramConfig,
40
+ LLaDAConfig,
41
+ StrEnum,
42
+ InitFnType,
43
+ ActivationType,
44
+ BlockType,
45
+ LayerNormType,
46
+ ModelConfig,
47
+ ActivationCheckpointingStrategy,
48
+ )
49
+
50
+ if sys.version_info >= (3, 9):
51
+ from collections.abc import MutableMapping
52
+ elif sys.version_info >= (3, 8):
53
+ from typing import MutableMapping
54
+ else:
55
+ raise SystemExit("This module requires Python 3.8 or higher")
56
+
57
+ __all__ = [
58
+ "LayerNormBase",
59
+ "LayerNorm",
60
+ "RMSLayerNorm",
61
+ "GemmaRMSLayerNorm",
62
+ "RotaryEmbedding",
63
+ "Activation",
64
+ "GELU",
65
+ "ReLU",
66
+ "SwiGLU",
67
+ "LLaDABlock",
68
+ "LLaDASequentialBlock",
69
+ "LLaDAModel",
70
+ "LLaDAOutput",
71
+ "LLaDAGenerateOutput",
72
+ ]
73
+
74
+
75
+ log = logging.getLogger(__name__)
76
+
77
+
78
+ class ModuleType(StrEnum):
79
+ in_module = "in"
80
+ out_module = "out"
81
+ emb = "emb"
82
+ final_out = "final_out"
83
+
84
+
85
+ def init_weights(
86
+ config: ModelConfig,
87
+ module: Union[nn.Linear, nn.Embedding],
88
+ d: Optional[int] = None,
89
+ layer_id: Optional[int] = None,
90
+ std_factor: float = 1.0,
91
+ type_of_module: Optional[ModuleType] = None,
92
+ ) -> None:
93
+ """
94
+ Initialize weights of a linear or embedding module.
95
+ :param config: The model config.
96
+ :param module: The linear or embedding submodule to initialize.
97
+ :param d: The effective input dimensionality of the weights. This could be smaller than the actual dimensions
98
+ for fused layers.
99
+ :param layer_id: When set, the standard deviation for the "mitchell" method will be adjusted by
100
+ ``1 / sqrt(2 * (layer_id + 1))``.
101
+ """
102
+ d = d if d is not None else config.d_model
103
+ if config.init_fn == InitFnType.normal:
104
+ std = config.init_std * std_factor
105
+ if config.init_cutoff_factor is not None:
106
+ cutoff_value = config.init_cutoff_factor * std
107
+ nn.init.trunc_normal_(module.weight, mean=0.0, std=std, a=-cutoff_value, b=cutoff_value)
108
+ else:
109
+ nn.init.normal_(module.weight, mean=0.0, std=std)
110
+ elif config.init_fn == InitFnType.mitchell:
111
+ std = std_factor / math.sqrt(d)
112
+ if layer_id is not None:
113
+ std = std / math.sqrt(2 * (layer_id + 1))
114
+ nn.init.trunc_normal_(module.weight, mean=0.0, std=std, a=-3 * std, b=3 * std)
115
+ elif config.init_fn == InitFnType.kaiming_normal:
116
+ nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
117
+ elif config.init_fn == InitFnType.fan_in:
118
+ std = std_factor / math.sqrt(d)
119
+ nn.init.normal_(module.weight, mean=0.0, std=std)
120
+ elif config.init_fn == InitFnType.full_megatron:
121
+ if type_of_module is None:
122
+ raise RuntimeError(f"When using the {InitFnType.full_megatron} init, every module must have a type.")
123
+
124
+ cutoff_factor = config.init_cutoff_factor
125
+ if cutoff_factor is None:
126
+ cutoff_factor = 3
127
+
128
+ if type_of_module == ModuleType.in_module:
129
+ # for att_proj (same as QKV), ff_proj
130
+ std = config.init_std
131
+ elif type_of_module == ModuleType.out_module:
132
+ # for attn_out, ff_out
133
+ std = config.init_std / math.sqrt(2.0 * config.n_layers)
134
+ elif type_of_module == ModuleType.emb:
135
+ # positional embeddings (wpe)
136
+ # token embeddings (wte)
137
+ std = config.init_std
138
+ elif type_of_module == ModuleType.final_out:
139
+ # final output (ff_out)
140
+ std = config.d_model**-0.5
141
+ else:
142
+ raise RuntimeError(f"Unknown module type '{type_of_module}'")
143
+ nn.init.trunc_normal_(
144
+ module.weight,
145
+ mean=0.0,
146
+ std=std,
147
+ a=-cutoff_factor * std,
148
+ b=cutoff_factor * std,
149
+ )
150
+ else:
151
+ raise NotImplementedError(config.init_fn)
152
+
153
+ if isinstance(module, nn.Linear):
154
+ if module.bias is not None:
155
+ nn.init.zeros_(module.bias)
156
+
157
+ if config.init_fn == InitFnType.normal and getattr(module, "_is_residual", False):
158
+ with torch.no_grad():
159
+ module.weight.div_(math.sqrt(2 * config.n_layers))
160
+
161
+
162
+ def ensure_finite_(x: torch.Tensor, check_neg_inf: bool = True, check_pos_inf: bool = False):
163
+ """
164
+ Modify ``x`` in place to replace ``float("-inf")`` with the minimum value of the dtype when ``check_neg_inf``
165
+ is ``True`` and to replace ``float("inf")`` with the maximum value of the dtype when ``check_pos_inf`` is ``True``.
166
+ """
167
+ if check_neg_inf:
168
+ x.masked_fill_(x == float("-inf"), torch.finfo(x.dtype).min)
169
+ if check_pos_inf:
170
+ x.masked_fill_(x == float("inf"), torch.finfo(x.dtype).max)
171
+
172
+
173
+ def activation_checkpoint_function(cfg: ModelConfig):
174
+ preserve_rng_state = (
175
+ (cfg.attention_dropout == 0.0) and (cfg.embedding_dropout == 0.0) and (cfg.residual_dropout == 0.0)
176
+ )
177
+ from torch.utils.checkpoint import checkpoint
178
+
179
+ return partial(
180
+ checkpoint,
181
+ preserve_rng_state=preserve_rng_state,
182
+ use_reentrant=False,
183
+ )
184
+
185
+
186
+ class BufferCache(dict, MutableMapping[str, torch.Tensor]):
187
+ """
188
+ Cache for attention biases and other things that would normally be stored as buffers.
189
+ We avoid using buffers because we've run into various issues doing so with FSDP.
190
+ In general it appears the way FSDP handles buffers is not well-defined.
191
+ It doesn't shard them but apparently it does synchronize them across processes, which we want to avoid
192
+ since (A) it isn't necessary, and (B) we sometimes have `-inf` in these biases which might get turned into
193
+ NaNs when they're synchronized due to casting or some other issue.
194
+ """
195
+
196
+
197
+ def _non_meta_init_device(config: ModelConfig) -> torch.device:
198
+ if config.init_device is not None and config.init_device != "meta":
199
+ return torch.device(config.init_device)
200
+ else:
201
+ return torch.device("cuda" if torch.cuda.is_available() else "cpu")
202
+
203
+
204
+ class Dropout(nn.Dropout):
205
+ def forward(self, input: torch.Tensor) -> torch.Tensor:
206
+ if self.p == 0.0:
207
+ return input
208
+ else:
209
+ return F.dropout(input, self.p, self.training, self.inplace)
210
+
211
+
212
+ class LayerNormBase(nn.Module):
213
+ def __init__(
214
+ self,
215
+ config: ModelConfig,
216
+ *,
217
+ size: Optional[int] = None,
218
+ elementwise_affine: Optional[bool] = True,
219
+ eps: float = 1e-05,
220
+ ):
221
+ super().__init__()
222
+ self.config = config
223
+ self.eps = eps
224
+ self.normalized_shape = (size or config.d_model,)
225
+ if elementwise_affine or (elementwise_affine is None and self.config.layer_norm_with_affine):
226
+ self.weight = nn.Parameter(torch.ones(self.normalized_shape, device=config.init_device))
227
+ use_bias = self.config.bias_for_layer_norm
228
+ if use_bias is None:
229
+ use_bias = self.config.include_bias
230
+ if use_bias:
231
+ self.bias = nn.Parameter(torch.zeros(self.normalized_shape, device=config.init_device))
232
+ else:
233
+ self.register_parameter("bias", None)
234
+ else:
235
+ self.register_parameter("bias", None)
236
+ self.register_parameter("weight", None)
237
+
238
+ @abstractmethod
239
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
240
+ raise NotImplementedError
241
+
242
+ @classmethod
243
+ def build(cls, config: ModelConfig, size: Optional[int] = None, **kwargs) -> LayerNormBase:
244
+ if config.layer_norm_type == LayerNormType.default:
245
+ return LayerNorm(config, size=size, low_precision=False, **kwargs)
246
+ elif config.layer_norm_type == LayerNormType.low_precision:
247
+ return LayerNorm(config, size=size, low_precision=True, **kwargs)
248
+ elif config.layer_norm_type == LayerNormType.rms:
249
+ return RMSLayerNorm(config, size=size, **kwargs)
250
+ elif config.layer_norm_type == LayerNormType.gemma_rms:
251
+ return GemmaRMSLayerNorm(config, size=size, **kwargs)
252
+ else:
253
+ raise NotImplementedError(f"Unknown LayerNorm type: '{config.layer_norm_type}'")
254
+
255
+ def _cast_if_autocast_enabled(self, tensor: torch.Tensor, dtype: Optional[torch.dtype] = None) -> torch.Tensor:
256
+ # NOTE: `is_autocast_enabled()` only checks for CUDA autocast, so we use the separate function
257
+ # `is_autocast_cpu_enabled()` for CPU autocast.
258
+ # See https://github.com/pytorch/pytorch/issues/110966.
259
+ if tensor.device.type == "cuda" and torch.is_autocast_enabled():
260
+ return tensor.to(dtype=dtype if dtype is not None else torch.get_autocast_gpu_dtype())
261
+ elif tensor.device.type == "cpu" and torch.is_autocast_cpu_enabled():
262
+ return tensor.to(dtype=dtype if dtype is not None else torch.get_autocast_cpu_dtype())
263
+ else:
264
+ return tensor
265
+
266
+ def reset_parameters(self):
267
+ if self.weight is not None:
268
+ torch.nn.init.ones_(self.weight) # type: ignore
269
+ if self.bias is not None:
270
+ torch.nn.init.zeros_(self.bias) # type: ignore
271
+
272
+
273
+ class LayerNorm(LayerNormBase):
274
+ """
275
+ The default :class:`LayerNorm` implementation which can optionally run in low precision.
276
+ """
277
+
278
+ def __init__(
279
+ self,
280
+ config: ModelConfig,
281
+ size: Optional[int] = None,
282
+ low_precision: bool = False,
283
+ elementwise_affine: Optional[bool] = None,
284
+ eps: float = 1e-05,
285
+ ):
286
+ super().__init__(config, size=size, elementwise_affine=elementwise_affine, eps=eps)
287
+ self.low_precision = low_precision
288
+
289
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
290
+ if self.low_precision:
291
+ module_device = x.device
292
+ downcast_x = self._cast_if_autocast_enabled(x)
293
+ downcast_weight = (
294
+ self._cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
295
+ )
296
+ downcast_bias = self._cast_if_autocast_enabled(self.bias) if self.bias is not None else self.bias
297
+ with torch.autocast(enabled=False, device_type=module_device.type):
298
+ return F.layer_norm(
299
+ downcast_x, self.normalized_shape, weight=downcast_weight, bias=downcast_bias, eps=self.eps
300
+ )
301
+ else:
302
+ return F.layer_norm(x, self.normalized_shape, weight=self.weight, bias=self.bias, eps=self.eps)
303
+
304
+
305
+ class RMSLayerNorm(LayerNormBase):
306
+ """
307
+ RMS layer norm, a simplified :class:`LayerNorm` implementation
308
+ """
309
+
310
+ def __init__(
311
+ self,
312
+ config: ModelConfig,
313
+ size: Optional[int] = None,
314
+ elementwise_affine: Optional[bool] = None,
315
+ eps: float = 1e-5,
316
+ ):
317
+ super().__init__(config, size=size, elementwise_affine=elementwise_affine, eps=config.rms_norm_eps)
318
+
319
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
320
+ with torch.autocast(enabled=False, device_type=x.device.type):
321
+ og_dtype = x.dtype
322
+ x = x.to(torch.float32)
323
+ variance = x.pow(2).mean(-1, keepdim=True)
324
+ x = x * torch.rsqrt(variance + self.eps)
325
+ x = x.to(og_dtype)
326
+
327
+ if self.weight is not None:
328
+ if self.bias is not None:
329
+ return self.weight * x + self.bias
330
+ else:
331
+ return self.weight * x
332
+ else:
333
+ return x
334
+
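The `forward` above reduces to a simple formula; a minimal numpy sketch of the normalization in `RMSLayerNorm` (without the learned affine parameters):

```python
import numpy as np

# y = x / sqrt(mean(x^2) + eps), the core of RMSLayerNorm.forward above.
eps = 1e-5
x = np.array([3.0, 4.0])
rms = np.sqrt(np.mean(x ** 2) + eps)   # sqrt(12.5 + eps) ~= 3.5355
y = x / rms
print(np.round(y, 4).tolist())  # [0.8485, 1.1314]
```

The real module additionally upcasts to float32 before the reduction and applies `self.weight` (and optionally `self.bias`) afterwards.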
335
+
336
+ class GemmaRMSLayerNorm(LayerNormBase):
337
+ """
338
+ Gemma RMS layer norm, a simplified :class:`LayerNorm` implementation
339
+ """
340
+
341
+ def __init__(
342
+ self,
343
+ config: ModelConfig,
344
+ size: Optional[int] = None,
345
+ elementwise_affine: Optional[bool] = None,
346
+ eps: float = 1e-5,
347
+ ):
348
+ super().__init__(config, size=size, elementwise_affine=elementwise_affine, eps=config.rms_norm_eps)
349
+
350
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
351
+ with torch.autocast(enabled=False, device_type=x.device.type):
352
+ og_dtype = x.dtype
353
+ x = x.to(torch.float32)
354
+ variance = x.pow(2).mean(-1, keepdim=True)
355
+ x = x * torch.rsqrt(variance + self.eps)
356
+ x = x.to(og_dtype)
357
+
358
+ if self.weight is not None:
359
+ if self.bias is not None:
360
+ return x * (1 + self.weight) + self.bias
361
+ else:
362
+ return x * (1 + self.weight)
363
+ else:
364
+ return x
365
+
366
+
367
+ class RotaryEmbedding(nn.Module):
368
+ """
369
+ [Rotary positional embeddings (RoPE)](https://arxiv.org/abs/2104.09864).
370
+ """
371
+
372
+ def __init__(self, config: ModelConfig, cache: BufferCache):
373
+ super().__init__()
374
+ self.config = config
375
+ self.__cache = cache
376
+ # Warm up cache.
377
+ self.rope_theta = config.rope_theta
378
+ self.get_rotary_embedding(config.max_sequence_length, _non_meta_init_device(config))
379
+
380
+ def get_rotary_embedding(self, seq_len: int, device: torch.device) -> Tuple[torch.Tensor, torch.Tensor]:
381
+ if (
382
+ (pos_sin := self.__cache.get("rope_pos_sin")) is not None
383
+ and (pos_cos := self.__cache.get("rope_pos_cos")) is not None
384
+ and pos_sin.shape[-2] >= seq_len
385
+ and pos_cos.shape[-2] >= seq_len
386
+ ):
387
+ if pos_sin.device != device:
388
+ pos_sin = pos_sin.to(device)
389
+ self.__cache["rope_pos_sin"] = pos_sin
390
+ if pos_cos.device != device:
391
+ pos_cos = pos_cos.to(device)
392
+ self.__cache["rope_pos_cos"] = pos_cos
393
+ return pos_sin[:, :, :seq_len, :], pos_cos[:, :, :seq_len, :]
394
+
395
+ with torch.autocast(device.type, enabled=False):
396
+ dim = self.config.d_model // self.config.n_heads
397
+ inv_freq = 1.0 / (self.rope_theta ** (torch.arange(0, dim, 2, device=device, dtype=torch.float) / dim))
398
+ seq = torch.arange(seq_len, device=device, dtype=torch.float)
399
+ freqs = einsum("i , j -> i j", seq, inv_freq)
400
+ positions = torch.cat((freqs, freqs), dim=-1)
401
+ pos_sin, pos_cos = positions.sin()[None, None, :, :], positions.cos()[None, None, :, :]
402
+ self.__cache["rope_pos_sin"] = pos_sin
403
+ self.__cache["rope_pos_cos"] = pos_cos
404
+ return pos_sin, pos_cos
405
+
406
+ def rotate_half(self, x: torch.Tensor) -> torch.Tensor:
407
+ B, nh, T, hs = x.size()
408
+ x = x.view(B, nh, T, 2, hs // 2)
409
+ x1, x2 = x.unbind(dim=-2)
410
+ return torch.cat((-x2, x1), dim=-1)
411
+
412
+ def apply_rotary_pos_emb(self, pos_sin: torch.Tensor, pos_cos: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
413
+ return ((t * pos_cos) + (self.rotate_half(t) * pos_sin)).to(t.dtype)
414
+
415
+ def forward(self, q: torch.Tensor, k: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
416
+ if self.config.rope_full_precision:
417
+ q_, k_ = q.float(), k.float()
418
+ else:
419
+ q_, k_ = q, k
420
+
421
+ with torch.autocast(q.device.type, enabled=False):
422
+ query_len, key_len = q_.shape[-2], k_.shape[-2] # could be different if layer_past not None
423
+ pos_sin, pos_cos = self.get_rotary_embedding(key_len, q_.device)
424
+ pos_sin = pos_sin.type_as(q_)
425
+ pos_cos = pos_cos.type_as(q_)
426
+ q_ = self.apply_rotary_pos_emb(
427
+ pos_sin[:, :, key_len - query_len : key_len, :],
428
+ pos_cos[:, :, key_len - query_len : key_len, :],
429
+ q_,
430
+ )
431
+ k_ = self.apply_rotary_pos_emb(pos_sin, pos_cos, k_)
432
+ return q_.type_as(q), k_.type_as(k)
433
+
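The `rotate_half` helper above is the core rotation RoPE builds on; a tiny numpy sketch on a single head vector:

```python
import numpy as np

# rotate_half splits the head dimension into two halves (x1, x2) and
# emits (-x2, x1), i.e. a 90-degree rotation in each (x1_i, x2_i) plane.
x = np.array([1.0, 2.0, 3.0, 4.0])   # one head vector, hs = 4
x1, x2 = x[:2], x[2:]
rotated = np.concatenate([-x2, x1])
print(rotated.tolist())  # [-3.0, -4.0, 1.0, 2.0]
```

`apply_rotary_pos_emb` then combines `t * cos + rotate_half(t) * sin` with position-dependent sin/cos tables.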
434
+
435
+ class Activation(nn.Module):
436
+ def __init__(self, config: ModelConfig):
437
+ super().__init__()
438
+ self.config = config
439
+
440
+ @abstractmethod
441
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
442
+ raise NotImplementedError
443
+
444
+ @property
445
+ @abstractmethod
446
+ def output_multiplier(self) -> float:
447
+ raise NotImplementedError
448
+
449
+ @classmethod
450
+ def build(cls, config: ModelConfig) -> Activation:
451
+ if config.activation_type == ActivationType.gelu:
452
+ return cast(Activation, GELU(approximate="none"))
453
+ elif config.activation_type == ActivationType.relu:
454
+ return cast(Activation, ReLU(inplace=False))
455
+ elif config.activation_type == ActivationType.silu:
456
+ return cast(Activation, SiLU(inplace=False))
457
+ elif config.activation_type == ActivationType.swiglu:
458
+ return SwiGLU(config)
459
+ else:
460
+ raise NotImplementedError(f"Unknown activation: '{config.activation_type}'")
461
+
462
+
463
+ class GELU(nn.GELU):
464
+ @property
465
+ def output_multiplier(self) -> float:
466
+ return 1.0
467
+
468
+
469
+ class ReLU(nn.ReLU):
470
+ @property
471
+ def output_multiplier(self) -> float:
472
+ return 1.0
473
+
474
+ class SiLU(nn.SiLU):
475
+ @property
476
+ def output_multiplier(self) -> float:
477
+ return 1.0
478
+
479
+ class SwiGLU(Activation):
480
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
481
+ x, gate = x.chunk(2, dim=-1)
482
+ return F.silu(gate) * x
483
+
484
+ @property
485
+ def output_multiplier(self) -> float:
486
+ return 0.5
487
+
488
+
489
+ def causal_attention_bias(seq_len: int, device: torch.device) -> torch.FloatTensor:
490
+ att_bias = torch.triu(
491
+ torch.ones(seq_len, seq_len, device=device, dtype=torch.float),
492
+ diagonal=1,
493
+ )
494
+ att_bias.masked_fill_(att_bias == 1, torch.finfo(att_bias.dtype).min)
495
+ return att_bias.view(1, 1, seq_len, seq_len) # type: ignore
496
+
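The mask built by `causal_attention_bias` above looks like this; a small numpy sketch (using `-inf` where the real code uses `torch.finfo(dtype).min`):

```python
import numpy as np

# Zeros on and below the diagonal, a large negative value strictly above it,
# so each position can only attend to itself and earlier positions.
seq_len = 3
bias = np.triu(np.ones((seq_len, seq_len)), k=1)
bias[bias == 1] = -np.inf
print(bias[0].tolist())  # [0.0, -inf, -inf]
```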
497
+
498
+ def get_causal_attention_bias(cache: BufferCache, seq_len: int, device: torch.device) -> torch.Tensor:
499
+ if (causal_bias := cache.get("causal_attention_bias")) is not None and causal_bias.shape[-1] >= seq_len:
500
+ if causal_bias.device != device:
501
+ causal_bias = causal_bias.to(device)
502
+ cache["causal_attention_bias"] = causal_bias
503
+ return causal_bias
504
+ with torch.autocast(device.type, enabled=False):
505
+ causal_bias = causal_attention_bias(seq_len, device)
506
+ cache["causal_attention_bias"] = causal_bias
507
+ return causal_bias
508
+
509
+
510
+ def alibi_attention_bias(seq_len: int, config: ModelConfig, device: torch.device) -> torch.FloatTensor:
511
+ alibi_bias = torch.arange(1 - seq_len, 1, dtype=torch.float, device=device).view(1, 1, 1, seq_len)
512
+
513
+ # shape: (1, 1, seq_len, seq_len)
514
+ alibi_bias = alibi_bias - torch.arange(1 - seq_len, 1, dtype=torch.float, device=device).view(1, 1, seq_len, 1)
515
+ alibi_bias.abs_().mul_(-1)
516
+
517
+ # shape: (n_heads,)
518
+ m = torch.arange(1, config.n_heads + 1, dtype=torch.float, device=device)
519
+ m.mul_(config.alibi_bias_max / config.n_heads)
520
+
521
+ # shape: (1, n_heads, seq_len, seq_len)
522
+ return alibi_bias * (1.0 / (2 ** m.view(1, config.n_heads, 1, 1))) # type: ignore
523
+
524
+ class ShortConv(nn.Module):
525
+ def __init__(
526
+ self,
527
+ hidden_size: int,
528
+ kernel_size: int = 7,  # default kernel size of 7
529
+ dilation: int = 1,
530
+ norm_eps: float = 1e-5,
531
+ hc_mult: int = 1,
532
+ activation: bool = True,
533
+ ):
534
+ super().__init__()
535
+ self.activation = activation
536
+ self.kernel_size = kernel_size
537
+ self.dilation = dilation
538
+
539
+ # Same-padding for an odd kernel (e.g. 7): (K-1)/2
540
+ # K=7 -> padding=3
541
+ self.padding = (kernel_size - 1) // 2 * dilation
542
+
543
+ # A standard depthwise convolution; PyTorch's built-in padding suffices here
544
+ # because same padding for an odd kernel is symmetric, so no manual F.pad is needed
545
+ self.conv = nn.Conv1d(
546
+ in_channels=hidden_size,
547
+ out_channels=hidden_size,
548
+ kernel_size=kernel_size,
549
+ groups=hidden_size,
550
+ bias=False,
551
+ padding=self.padding,  # symmetric same padding
552
+ dilation=dilation,
553
+ )
554
+
555
+ self.norm = nn.RMSNorm(hidden_size, eps=norm_eps)
556
+
557
+ if self.activation:
558
+ self.act_fn = nn.SiLU()
559
+
560
+ self.reset_parameters()
561
+
562
+ def reset_parameters(self):
563
+ nn.init.zeros_(self.conv.weight)
564
+
565
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
566
+ # x: [B, L, D]
567
+ x_norm = self.norm(x)
568
+
569
+ # [B, L, D] -> [B, D, L]
570
+ x_bct = x_norm.transpose(1, 2)
571
+
572
+ # depthwise convolution (padding handled by nn.Conv1d)
573
+ y_bct = self.conv(x_bct)
574
+
575
+ if self.activation:
576
+ y_bct = self.act_fn(y_bct)
577
+
578
+ # [B, D, L] -> [B, L, D]
579
+ y = y_bct.transpose(1, 2).contiguous()
580
+
581
+ return y
582
+
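The padding arithmetic in `ShortConv` above can be checked with the standard Conv1d output-length formula; a small sketch:

```python
# With an odd kernel K and dilation d, padding = (K - 1) // 2 * d keeps the
# sequence length unchanged (the "same padding" ShortConv relies on).
def same_padding(kernel_size: int, dilation: int) -> int:
    return (kernel_size - 1) // 2 * dilation

def conv1d_out_len(L: int, kernel_size: int, dilation: int) -> int:
    # nn.Conv1d output length with stride 1: L + 2p - d*(K-1)
    p = same_padding(kernel_size, dilation)
    return L + 2 * p - dilation * (kernel_size - 1)

print(conv1d_out_len(16, 7, 1), conv1d_out_len(16, 7, 4))  # 16 16
```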
583
+ def find_next_prime(start, seen_primes):
584
+ candidate = start + 1
585
+ while True:
586
+ if isprime(candidate) and candidate not in seen_primes:
587
+ return candidate
588
+ candidate += 1
589
+
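The prime search above gives every hash head a distinct prime modulus at least as large as its configured vocab size, so the heads decorrelate their collisions. A self-contained sketch of the same selection (with a naive `is_prime` in place of `sympy.isprime`, and an illustrative vocab size of 1000):

```python
# Pick three distinct primes >= 1000, one per hash head, chaining the search
# start exactly like calculate_vocab_size_across_layers does.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(start: int, seen: set) -> int:
    candidate = start + 1
    while not (is_prime(candidate) and candidate not in seen):
        candidate += 1
    return candidate

seen, primes = set(), []
start = 1000 - 1  # per-head vocab size of 1000 (illustrative)
for _ in range(3):
    p = next_prime(start, seen)
    seen.add(p)
    primes.append(p)
    start = p
print(primes)  # [1009, 1013, 1019]
```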
590
+ class NgramHashMapping:
591
+ def __init__(
592
+ self,
593
+ engram_vocab_size,
594
+ max_ngram_size,
595
+ n_embed_per_ngram,
596
+ n_head_per_ngram,
597
+ layer_ids,
598
+ pad_id,
599
+ seed,
600
+ config: ModelConfig,
601
+ ):
602
+ self.vocab_size_per_ngram = engram_vocab_size
603
+ self.max_ngram_size = max_ngram_size
604
+ self.n_embed_per_ngram = n_embed_per_ngram
605
+ self.n_head_per_ngram = n_head_per_ngram
606
+ self.pad_id = pad_id
607
+ self.layer_ids = layer_ids
608
+
609
+ self.tokenizer_vocab_size = config.vocab_size
610
+
611
+ max_long = np.iinfo(np.int64).max
612
+ M_max = int(max_long // self.tokenizer_vocab_size)
613
+ half_bound = max(1, M_max // 2)
614
+ PRIME_1 = 10007
615
+
616
+ self.layer_multipliers = {}
617
+
618
+ for layer_id in self.layer_ids:
619
+ base_seed = int(seed + PRIME_1 * int(layer_id))
620
+ g = np.random.default_rng(base_seed)
621
+ r = g.integers(
622
+ low=0,
623
+ high=half_bound,
624
+ size=(self.max_ngram_size,),
625
+ dtype=np.int64
626
+ )
627
+ multipliers = r * 2 + 1
628
+ self.layer_multipliers[layer_id] = multipliers
629
+
630
+ self.vocab_size_across_layers = self.calculate_vocab_size_across_layers()
631
+
632
+ def calculate_vocab_size_across_layers(self):
633
+ seen_primes = set()
634
+ vocab_size_across_layers = {}
635
+
636
+ for layer_id in self.layer_ids:
637
+ all_ngram_vocab_sizes = []
638
+ for ngram in range(2, self.max_ngram_size + 1):
639
+ current_ngram_heads_sizes = []
640
+
641
+ vocab_size = self.vocab_size_per_ngram[ngram - 2]
642
+ num_head = self.n_head_per_ngram
643
+ current_prime_search_start = vocab_size - 1
644
+
645
+ for _ in range(num_head):
646
+ found_prime = find_next_prime(
647
+ current_prime_search_start,
648
+ seen_primes
649
+ )
650
+ seen_primes.add(found_prime)
651
+ current_ngram_heads_sizes.append(found_prime)
652
+ current_prime_search_start = found_prime
653
+
654
+ all_ngram_vocab_sizes.append(current_ngram_heads_sizes)
655
+ vocab_size_across_layers[layer_id] = all_ngram_vocab_sizes
656
+
657
+ return vocab_size_across_layers
658
+
659
+ def _get_ngram_hashes(
660
+ self,
661
+ input_ids: np.ndarray,
662
+ layer_id: int,
663
+ ) -> np.ndarray:
664
+ x = np.asarray(input_ids, dtype=np.int64)
665
+ B, T = x.shape
666
+
667
+ multipliers = self.layer_multipliers[layer_id]
668
+
669
+ def shift_k(k: int) -> np.ndarray:
670
+ if k == 0: return x
671
+ shifted = np.pad(x, ((0, 0), (k, 0)),
672
+ mode='constant', constant_values=self.pad_id)[:, :T]
673
+ return shifted
674
+
675
+ base_shifts = [shift_k(k) for k in range(self.max_ngram_size)]
676
+
677
+ all_hashes = []
678
+
679
+ for n in range(2, self.max_ngram_size + 1):
680
+ n_gram_index = n - 2
681
+ tokens = base_shifts[:n]
682
+ mix = (tokens[0] * multipliers[0])
683
+ for k in range(1, n):
684
+ mix = np.bitwise_xor(mix, tokens[k] * multipliers[k])
685
+ num_heads_for_this_ngram = self.n_head_per_ngram
686
+ head_vocab_sizes = self.vocab_size_across_layers[layer_id][n_gram_index]
687
+
688
+ for j in range(num_heads_for_this_ngram):
689
+ mod = int(head_vocab_sizes[j])
690
+ head_hash = mix % mod
691
+ all_hashes.append(head_hash.astype(np.int64, copy=False))
692
+
693
+ return np.stack(all_hashes, axis=2)
694
+
695
+ def hash(self, input_ids):
696
+ hash_ids_for_all_layers = {}
697
+ for layer_id in self.layer_ids:
698
+ hash_ids_for_all_layers[layer_id] = self._get_ngram_hashes(input_ids, layer_id=layer_id)
699
+ return hash_ids_for_all_layers
700
+
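The hashing scheme in `_get_ngram_hashes` above boils down to: shift the sequence, multiply each offset's tokens by a per-offset odd multiplier, XOR the products, then reduce modulo a per-head prime. A minimal numpy sketch for a single 2-gram head (all constants illustrative):

```python
import numpy as np

pad_id = 0
x = np.array([[5, 7, 11, 13]], dtype=np.int64)   # (B=1, T=4)
multipliers = np.array([3, 5], dtype=np.int64)   # odd multipliers for a 2-gram
prime = 101                                      # this head's modulus

# shift right by one, padding on the left (the k=1 shift)
shifted = np.pad(x, ((0, 0), (1, 0)), constant_values=pad_id)[:, :4]
mix = np.bitwise_xor(x * multipliers[0], shifted * multipliers[1])
hashes = mix % prime
print(hashes.tolist())  # [[15, 12, 2, 16]]
```

Odd multipliers keep the low bits of the products well mixed, and distinct prime moduli per head make the heads collide on different token pairs.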
701
+ class TorchNgramHashMapping:
702
+ """
703
+ Torch implementation of the n-gram hashing, run on the GPU.
704
+ The multipliers and per-head prime moduli come from an existing NgramHashMapping,
705
+ so the hash values and head ordering match the numpy version exactly.
706
+ Output: dict[layer_id] -> (B, T, num_hash_heads) [long]
707
+ """
708
+ def __init__(self, np_mapping: NgramHashMapping, device: torch.device):
709
+ self.layer_ids = list(np_mapping.layer_ids)
710
+ self.max_ngram_size = int(np_mapping.max_ngram_size)
711
+ self.n_head_per_ngram = int(np_mapping.n_head_per_ngram)
712
+ self.pad_id = int(np_mapping.pad_id)
713
+
714
+ # per-layer multipliers: (max_ngram_size,)
715
+ self._multipliers = {
716
+ lid: torch.tensor(np_mapping.layer_multipliers[lid], dtype=torch.long, device=device)
717
+ for lid in self.layer_ids
718
+ }
719
+
720
+ # per-layer moduli: a list where mods[n-2] has shape (n_head_per_ngram,)
721
+ self._mods = {}
722
+ for lid in self.layer_ids:
723
+ mods_per_n = []
724
+ for n in range(2, self.max_ngram_size + 1):
725
+ head_mods = np_mapping.vocab_size_across_layers[lid][n - 2]
726
+ mods_per_n.append(torch.tensor(head_mods, dtype=torch.long, device=device))
727
+ self._mods[lid] = mods_per_n
728
+
729
+ self.num_hash_heads = (self.max_ngram_size - 1) * self.n_head_per_ngram
730
+
731
+ def hash(self, input_ids: torch.Tensor) -> Dict[int, torch.Tensor]:
732
+ """
733
+ input_ids: (B, T) long tensor on target device
734
+ return: {layer_id: (B, T, num_hash_heads) long}
735
+ """
736
+ x = input_ids.to(torch.long)
737
+ B, T = x.shape
738
+
739
+ # shift right by k positions (pad on the left): shifts[k] has shape (B, T)
740
+ shifts = [x]
741
+ for k in range(1, self.max_ngram_size):
742
+ shifts.append(F.pad(x, (k, 0), value=self.pad_id)[:, :T])
743
+
744
+ out: Dict[int, torch.Tensor] = {}
745
+ for lid in self.layer_ids:
746
+ multipliers = self._multipliers[lid]
747
+ heads_per_layer = []
748
+
749
+ for n in range(2, self.max_ngram_size + 1):
750
+ mix = shifts[0] * multipliers[0]
751
+ for k in range(1, n):
752
+ mix = torch.bitwise_xor(mix, shifts[k] * multipliers[k])
753
+
754
+ mods = self._mods[lid][n - 2] # (H,)
755
+ # (B, T, 1) % (1, 1, H) -> (B, T, H)
756
+ head_hash = mix.unsqueeze(-1) % mods.view(1, 1, -1)
757
+ heads_per_layer.append(head_hash)
758
+
759
+ out[lid] = torch.cat(heads_per_layer, dim=-1)
760
+
761
+ return out
762
+
763
+ class MultiHeadEmbedding(nn.Module):
764
+ def __init__(self, list_of_N: List[int], D: int):
765
+ super().__init__()
766
+ self.num_heads = len(list_of_N)
767
+ self.embedding_dim = D
768
+
769
+ offsets = [0]
770
+ for n in list_of_N[:-1]:
771
+ offsets.append(offsets[-1] + n)
772
+
773
+ self.register_buffer("offsets", torch.tensor(offsets, dtype=torch.long))
774
+
775
+ total_N = sum(list_of_N)
776
+ self.embedding = nn.Embedding(num_embeddings=total_N, embedding_dim=D)
777
+
778
+ def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
779
+ shifted_input_ids = input_ids + self.offsets
780
+ output = self.embedding(shifted_input_ids)
781
+
782
+ return output
783
+
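The offset trick in `MultiHeadEmbedding` above fuses one embedding table per head into a single table; a numpy sketch with illustrative sizes:

```python
import numpy as np

# Heads with vocab sizes [3, 5] share one fused table of 8 rows; each head's
# local ids are shifted by the cumulative offsets [0, 3] before one lookup.
list_of_N = [3, 5]
offsets = np.array([0, 3])
table = np.arange(8 * 2).reshape(8, 2)   # fused table, embedding dim 2

ids = np.array([[2, 4]])                 # (T=1, heads=2); head-local ids
fused_ids = ids + offsets                # head 1's id 4 maps to row 3 + 4 = 7
out = table[fused_ids]                   # (1, 2, 2)
print(fused_ids.tolist())  # [[2, 7]]
```

A single fused `nn.Embedding` keeps all head tables in one parameter, so one gather serves every hash head at once.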
784
+ class Engram(nn.Module):
785
+ def __init__(self, layer_id: int, config: ModelConfig):
786
+ super().__init__()
787
+ self.layer_id = layer_id
788
+ self.engram_cfg = config.engram_config
789
+ self.backbone_config = config
790
+ engram_cfg = self.engram_cfg
791
+ backbone_config = self.backbone_config
792
+ self.hash_mapping = NgramHashMapping(
793
+ engram_vocab_size = engram_cfg.engram_vocab_size,
794
+ max_ngram_size = engram_cfg.max_ngram_size,
795
+ n_embed_per_ngram = engram_cfg.n_embed_per_ngram,
796
+ n_head_per_ngram = engram_cfg.n_head_per_ngram,
797
+ layer_ids = engram_cfg.layer_ids,
798
+ pad_id = engram_cfg.pad_id,
799
+ seed = engram_cfg.seed,
800
+ config = backbone_config,
801
+ )
802
+ self.multi_head_embedding = MultiHeadEmbedding(
803
+ list_of_N = [x for y in self.hash_mapping.vocab_size_across_layers[self.layer_id] for x in y],
804
+ D = engram_cfg.n_embed_per_ngram // engram_cfg.n_head_per_ngram,
805
+ )
806
+
807
+ # ShortConv with dilation set to the maximum n-gram size
808
+ self.short_conv = ShortConv(
809
+ hidden_size = backbone_config.d_model,
810
+ kernel_size = engram_cfg.kernel_size,
811
+ dilation = engram_cfg.max_ngram_size,
812
+ hc_mult = 1  # fixed to 1
813
+ )
814
+
815
+ engram_hidden_size = (engram_cfg.max_ngram_size-1) * engram_cfg.n_embed_per_ngram
816
+
817
+ # Single projection layers instead of a ModuleList
818
+ self.value_proj = nn.Linear(engram_hidden_size, backbone_config.d_model)
819
+
820
+ # a single key projection
821
+ self.key_proj = nn.Linear(engram_hidden_size, backbone_config.d_model)
822
+
823
+ # one RMSNorm each for key and query
824
+ self.norm_key = nn.RMSNorm(backbone_config.d_model)
825
+ self.norm_query = nn.RMSNorm(backbone_config.d_model)
826
+
827
+ self.reset_parameters()
828
+ # Cache for the Torch hash mapping (built lazily per device)
829
+ self._torch_hash_mapping: Optional[TorchNgramHashMapping] = None
830
+ self._torch_hash_device: Optional[torch.device] = None
831
+
832
+ def reset_parameters(self):
833
+ init_weights(
834
+ self.backbone_config,
835
+ self.multi_head_embedding.embedding,
836
+ type_of_module=ModuleType.emb,
837
+ )
838
+ init_weights(
839
+ self.backbone_config,
840
+ self.value_proj,
841
+ layer_id=self.layer_id,
842
+ type_of_module=ModuleType.in_module,
843
+ )
844
+ init_weights(
845
+ self.backbone_config,
846
+ self.key_proj,
847
+ layer_id=self.layer_id,
848
+ type_of_module=ModuleType.in_module,
849
+ )
850
+ self.short_conv.reset_parameters()
851
+
852
+ def forward(self, hidden_states, input_ids, engram_hash=None):
853
+ """
854
+ hidden_states: [B, L, D]
855
+ input_ids: [B, L]
856
+ engram_hash: [B, L, NumHeads] (Optional)
857
+ """
858
+ # 1. Hash lookup
859
+ if engram_hash is None:
860
+ # Prefer the GPU hash mapping to avoid CPU<->GPU round-trips
861
+ cur_dev = hidden_states.device
862
+ if self._torch_hash_mapping is None or self._torch_hash_device != cur_dev:
863
+ self._torch_hash_mapping = TorchNgramHashMapping(self.hash_mapping, device=cur_dev)
864
+ self._torch_hash_device = cur_dev
865
+ hash_input_ids = self._torch_hash_mapping.hash(input_ids)[self.layer_id]
866
+ else:
867
+ hash_input_ids = engram_hash
868
+ embeddings = self.multi_head_embedding(hash_input_ids).flatten(start_dim=-2)
869
+
870
+ # 2. Compute the gate (no per-head loop needed)
871
+ # key branch
872
+ key = self.key_proj(embeddings)
873
+ normed_key = self.norm_key(key)
874
+
875
+ # Query 部分 (直接使用 hidden_states)
876
+ query = hidden_states
877
+ normed_query = self.norm_query(query)
878
+
879
+ # Gate 计算
880
+ # [B, L, D] * [B, L, D] -> sum(dim=-1) -> [B, L]
881
+ gate = (normed_key * normed_query).sum(dim=-1) / math.sqrt(self.backbone_config.d_model)
882
+ gate = gate.abs().clamp_min(1e-6).sqrt() * gate.sign()
883
+ gate = gate.sigmoid().unsqueeze(-1) # [B, L, 1]
884
+
885
+ # 3. 融合 Value
886
+ value = gate * self.value_proj(embeddings) # [B, L, 1] * [B, L, D] -> [B, L, D]
887
+
888
+ # 4. Short Conv
889
+ output = value + self.short_conv(value)
890
+
891
+ return output
892
+
893
+
894
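The gating in `Engram.forward` compresses the scaled key–query dot product with a signed square root before the sigmoid, softening very large scores while preserving their sign. A torch-free scalar sketch of that computation (hypothetical helper name, assuming the semantics above; the real module operates on batched tensors):

```python
import math

def engram_gate(key, query, d_model):
    """Scalar sketch of the Engram gate: scaled dot product,
    signed square root, then sigmoid (assumed semantics)."""
    score = sum(k * q for k, q in zip(key, query)) / math.sqrt(d_model)
    # Signed sqrt compresses large magnitudes while keeping the sign.
    score = math.copysign(math.sqrt(max(abs(score), 1e-6)), score)
    return 1.0 / (1.0 + math.exp(-score))
```

A positive key–query alignment pushes the gate above 0.5, a negative one below, so the n-gram value is blended in only where it agrees with the backbone's hidden state.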
+class LLaDABlock(nn.Module):
+    """
+    A base class for transformer block implementations.
+    """
+
+    def __init__(self, layer_id: int, config: ModelConfig, cache: BufferCache):
+        super().__init__()
+        self.layer_id = layer_id
+        self.config = config
+        self.hidden_size = (
+            config.mlp_hidden_size if config.mlp_hidden_size is not None else config.mlp_ratio * config.d_model
+        )
+        self.__cache = cache
+        assert config.d_model % config.n_heads == 0
+
+        self.engram = None
+        if config.engram_config is not None and layer_id in config.engram_config.layer_ids:
+            self.engram = Engram(layer_id, config)
+
+        self._activation_checkpoint_fn = None
+
+        # Dropout.
+        self.dropout = Dropout(config.residual_dropout)
+
+        # Layer norms.
+        self.k_norm: Optional[LayerNormBase] = None
+        self.q_norm: Optional[LayerNormBase] = None
+        if config.attention_layer_norm:
+            self.k_norm = LayerNormBase.build(
+                config,
+                size=(config.d_model // config.n_heads) * config.effective_n_kv_heads,
+                elementwise_affine=config.attention_layer_norm_with_affine,
+            )
+            self.q_norm = LayerNormBase.build(config, elementwise_affine=config.attention_layer_norm_with_affine)
+
+        # Activation function.
+        self.act = Activation.build(config)
+        assert (self.act.output_multiplier * self.hidden_size) % 1 == 0
+
+        # Attention output projection.
+        self.attn_out = nn.Linear(
+            config.d_model, config.d_model, bias=config.include_bias, device=config.init_device
+        )
+
+        # Feed-forward output projection.
+        self.ff_out = nn.Linear(
+            int(self.act.output_multiplier * self.hidden_size),
+            config.d_model,
+            bias=config.include_bias,
+            device=config.init_device,
+        )
+        self.ff_out._is_residual = True  # type: ignore
+
+        # Rotary embeddings.
+        if self.config.rope:
+            self.rotary_emb = RotaryEmbedding(config, self.__cache)
+
+        self.flash_attn_func = None
+        if config.flash_attention:
+            try:
+                from flash_attn import flash_attn_func  # type: ignore
+
+                self.flash_attn_func = flash_attn_func
+            except ModuleNotFoundError:
+                pass
+
+    def reset_parameters(self):
+        if self.engram is not None:
+            self.engram.reset_parameters()
+        if self.k_norm is not None:
+            self.k_norm.reset_parameters()
+        if self.q_norm is not None:
+            self.q_norm.reset_parameters()
+        init_weights(
+            self.config,
+            self.attn_out,
+            d=self.config.d_model,
+            layer_id=self.layer_id,
+            type_of_module=ModuleType.out_module,
+        )
+        init_weights(
+            self.config,
+            self.ff_out,
+            d=self.ff_out.in_features,
+            layer_id=self.layer_id,
+            type_of_module=ModuleType.out_module,
+        )
+
+    def set_activation_checkpointing(self, strategy: Optional[ActivationCheckpointingStrategy]):
+        if strategy == ActivationCheckpointingStrategy.fine_grained:
+            self._activation_checkpoint_fn = activation_checkpoint_function(self.config)
+        else:
+            self._activation_checkpoint_fn = None
+
+    @classmethod
+    def _cast_attn_bias(cls, bias: torch.Tensor, input_dtype: torch.dtype) -> torch.Tensor:
+        target_dtype = input_dtype
+        # NOTE: `is_autocast_enabled()` only checks for CUDA autocast, so we use the separate function
+        # `is_autocast_cpu_enabled()` for CPU autocast.
+        # See https://github.com/pytorch/pytorch/issues/110966.
+        if bias.device.type == "cuda" and torch.is_autocast_enabled():
+            target_dtype = torch.get_autocast_gpu_dtype()
+        elif bias.device.type == "cpu" and torch.is_autocast_cpu_enabled():
+            target_dtype = torch.get_autocast_cpu_dtype()
+        if bias.dtype != target_dtype:
+            bias = bias.to(target_dtype)
+            ensure_finite_(bias, check_neg_inf=True, check_pos_inf=False)
+        return bias
+
+    def _scaled_dot_product_attention(
+        self,
+        q: torch.Tensor,
+        k: torch.Tensor,
+        v: torch.Tensor,
+        attn_mask: Optional[torch.Tensor] = None,
+        dropout_p: float = 0.0,
+        is_causal: bool = False,
+    ) -> torch.Tensor:
+        """
+        Computes scaled dot product attention on query, key and value tensors, using an optional
+        attention mask if passed, and applying dropout if a probability greater than 0.0 is specified.
+        """
+        if self.flash_attn_func is not None and attn_mask is None:
+            r = self.flash_attn_func(
+                q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), dropout_p=dropout_p, causal=False
+            )
+            return r.transpose(1, 2)
+        else:
+            # torch's sdpa doesn't support GQA, so we're doing this
+            assert k.size(1) == v.size(1)
+            num_kv_heads = k.size(1)
+            num_q_heads = q.size(1)
+            if num_q_heads != num_kv_heads:
+                assert num_q_heads % num_kv_heads == 0
+                k = k.repeat_interleave(num_q_heads // num_kv_heads, dim=1, output_size=num_q_heads)
+                v = v.repeat_interleave(num_q_heads // num_kv_heads, dim=1, output_size=num_q_heads)
+
+            # Note: for MDM, causal masking is disabled and no attn_mask is passed.
+            return F.scaled_dot_product_attention(
+                q,
+                k,
+                v,
+                attn_mask=None,
+                dropout_p=dropout_p,
+                is_causal=False,
+            )
+
+    def attention(
+        self,
+        q: torch.Tensor,
+        k: torch.Tensor,
+        v: torch.Tensor,
+        attention_bias: Optional[torch.Tensor] = None,
+        layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        use_cache: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
+        B, T, C = q.size()  # batch size, sequence length, d_model
+        dtype = k.dtype
+
+        # Optionally apply layer norm to keys and queries.
+        if self.q_norm is not None and self.k_norm is not None:
+            q = self.q_norm(q).to(dtype=dtype)
+            k = self.k_norm(k).to(dtype=dtype)
+
+        # Move head forward to be next to the batch dim.
+        # shape: (B, nh, T, hs)
+        q = q.view(B, T, self.config.n_heads, C // self.config.n_heads).transpose(1, 2)
+        # shape: (B, n_kv_h, T, hs)
+        k = k.view(B, T, self.config.effective_n_kv_heads, C // self.config.n_heads).transpose(1, 2)
+        # shape: (B, n_kv_h, T, hs)
+        v = v.view(B, T, self.config.effective_n_kv_heads, C // self.config.n_heads).transpose(1, 2)
+
+        if layer_past is not None:
+            past_key, past_value = layer_past
+            k = torch.cat((past_key, k), dim=-2)
+            v = torch.cat((past_value, v), dim=-2)
+
+        present = (k, v) if use_cache else None
+        query_len, key_len = q.shape[-2], k.shape[-2]  # could be different if layer_past is not None
+
+        if self.config.rope:
+            # Apply rotary embeddings.
+            q, k = self.rotary_emb(q, k)
+
+        if attention_bias is not None:
+            # Resize and cast attention bias.
+            # The current dtype of the attention bias might not match the dtype that the SDP attn function will
+            # run in if AMP is enabled, and this can be a problem if some tokens are masked out due to padding
+            # as down-casting the attention bias to the autocast precision will result in -infs, which will
+            # cause the SDP attn function to produce NaNs.
+            attention_bias = self._cast_attn_bias(
+                attention_bias[:, :, key_len - query_len : key_len, :key_len], dtype
+            )
+
+        # Get the attention scores.
+        # shape: (B, nh, T, hs)
+        att = self._scaled_dot_product_attention(
+            q,
+            k,
+            v,
+            attn_mask=None,
+            dropout_p=0.0 if not self.training else self.config.attention_dropout,
+            is_causal=False,
+        )
+
+        # Re-assemble all head outputs side-by-side.
+        att = att.transpose(1, 2).contiguous().view(B, T, C)
+
+        # Apply output projection.
+        return self.attn_out(att), present
+
+    @abstractmethod
+    def forward(
+        self,
+        x: torch.Tensor,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_bias: Optional[torch.FloatTensor] = None,
+        layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        use_cache: bool = False,
+        engram_hash: Optional[torch.Tensor] = None,
+    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
+        raise NotImplementedError
+
+    @classmethod
+    def build(cls, layer_id: int, config: ModelConfig, cache: BufferCache) -> LLaDABlock:
+        if config.block_type == BlockType.sequential:
+            return LLaDASequentialBlock(layer_id, config, cache)
+        elif config.block_type == BlockType.llama:
+            return LLaDALlamaBlock(layer_id, config, cache)
+        else:
+            raise NotImplementedError(f"Unknown block type: '{config.block_type}'")
+
+
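The `repeat_interleave` fallback in `_scaled_dot_product_attention` expands each KV head so that a contiguous block of query heads shares it. A small torch-free sketch of the resulting head mapping (hypothetical helper, not part of the source):

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """With repeat_interleave on dim=1, each block of
    (num_q_heads // num_kv_heads) consecutive query heads
    ends up aligned with a single repeated KV head."""
    assert num_q_heads % num_kv_heads == 0
    return q_head // (num_q_heads // num_kv_heads)
```

So with 8 query heads and 2 KV heads, query heads 0-3 read KV head 0 and query heads 4-7 read KV head 1, which is what the materialized repeat produces.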
+class LLaDASequentialBlock(LLaDABlock):
+    """
+    This is a typical transformer block where the output is computed as ``MLP(LN(x + Attention(LN(x))))``
+    (plus another skip connection).
+    """
+
+    def __init__(self, layer_id: int, config: ModelConfig, cache: BufferCache):
+        super().__init__(layer_id, config, cache)
+        # Layer norms.
+        self.attn_norm = LayerNorm.build(config)
+        self.ff_norm = LayerNorm.build(config)
+        # Attention input projection. Projects x -> (q, k, v)
+        head_dim = config.d_model // config.n_heads
+        self.fused_dims = (
+            config.d_model,
+            config.effective_n_kv_heads * head_dim,
+            config.effective_n_kv_heads * head_dim,
+        )
+        self.att_proj = nn.Linear(
+            config.d_model, sum(self.fused_dims), bias=config.include_bias | config.include_qkv_bias, device=config.init_device
+        )
+        # Feed-forward input projection.
+        self.ff_proj = nn.Linear(
+            config.d_model, self.hidden_size, bias=config.include_bias, device=config.init_device
+        )
+
+    def reset_parameters(self):
+        super().reset_parameters()
+        self.attn_norm.reset_parameters()
+        self.ff_norm.reset_parameters()
+        # NOTE: the standard deviation for these weights does not depend on the layer.
+        init_weights(
+            self.config, self.att_proj, d=self.config.d_model, layer_id=None, type_of_module=ModuleType.in_module
+        )
+        init_weights(
+            self.config, self.ff_proj, d=self.config.d_model, layer_id=None, type_of_module=ModuleType.in_module
+        )
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_bias: Optional[torch.Tensor] = None,
+        layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        use_cache: bool = False,
+        engram_hash: Optional[torch.Tensor] = None,
+    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
+        if self.engram is not None:
+            assert input_ids is not None
+            x = x + self.engram(x, input_ids, engram_hash=engram_hash)
+
+        # Get query, key, value projections.
+        # shape:
+        #  - for regular attn q, k, v: (batch_size, seq_len, d_model)
+        #  - for multi-query attn q: (batch_size, seq_len, d_model)
+        #                      k, v: (batch_size, seq_len, d_model // n_heads)
+        #  - for group query attn q: (batch_size, seq_len, d_model)
+        #                      k, v: (batch_size, seq_len, d_model // n_kv_heads)
+        if self._activation_checkpoint_fn is not None:
+            q, k, v = self.att_proj(self._activation_checkpoint_fn(self.attn_norm, x)).split(
+                self.fused_dims, dim=-1
+            )
+        else:
+            q, k, v = self.att_proj(self.attn_norm(x)).split(self.fused_dims, dim=-1)
+
+        # Get attention scores.
+        if self._activation_checkpoint_fn is not None:
+            att, cache = self._activation_checkpoint_fn(  # type: ignore
+                self.attention, q, k, v, attention_bias, layer_past=layer_past, use_cache=use_cache
+            )
+        else:
+            att, cache = self.attention(q, k, v, attention_bias, layer_past=layer_past, use_cache=use_cache)
+
+        # Add the attention output (residual connection).
+        # shape: (B, T, C)
+        x = x + self.dropout(att)
+
+        # Add feed-forward projection.
+        # shape: (batch_size, seq_len, d_model)
+        og_x = x
+        if self._activation_checkpoint_fn is not None:
+            x = self._activation_checkpoint_fn(self.ff_norm, x)  # type: ignore
+        else:
+            x = self.ff_norm(x)
+        x = self.ff_proj(x)
+        if self._activation_checkpoint_fn is not None:
+            x = self._activation_checkpoint_fn(self.act, x)  # type: ignore
+        else:
+            x = self.act(x)
+        x = self.ff_out(x)
+        x = self.dropout(x)
+        x = og_x + x
+
+        return x, cache
+
+
+class LLaDALlamaBlock(LLaDABlock):
+    """
+    This is a transformer block where the output is computed as ``MLP(LN(x + Attention(LN(x))))``
+    (plus another skip connection). This block is similar to `LLaDASequentialBlock`
+    but some operations have slightly different implementations to imitate the
+    behavior of Llama.
+    """
+
+    def __init__(self, layer_id: int, config: ModelConfig, cache: BufferCache):
+        super().__init__(layer_id, config, cache)
+        # Layer norms.
+        self.attn_norm = LayerNorm.build(config)
+        self.ff_norm = LayerNorm.build(config)
+        self.__cache = cache
+
+        # Attention input projection. Projects x -> (q, k, v)
+        head_dim = config.d_model // config.n_heads
+        q_proj_out_dim = config.d_model
+        k_proj_out_dim = config.effective_n_kv_heads * head_dim
+        v_proj_out_dim = config.effective_n_kv_heads * head_dim
+        self.q_proj = nn.Linear(
+            config.d_model, q_proj_out_dim, bias=config.include_bias | config.include_qkv_bias, device=config.init_device
+        )
+        self.k_proj = nn.Linear(
+            config.d_model, k_proj_out_dim, bias=config.include_bias | config.include_qkv_bias, device=config.init_device
+        )
+        self.v_proj = nn.Linear(
+            config.d_model, v_proj_out_dim, bias=config.include_bias | config.include_qkv_bias, device=config.init_device
+        )
+
+        # Feed-forward input projection.
+        self.ff_proj = nn.Linear(
+            config.d_model, self.hidden_size, bias=config.include_bias, device=config.init_device
+        )
+        # Newly added: up projection for the gated (SwiGLU-style) MLP.
+        self.up_proj = nn.Linear(
+            config.d_model, self.hidden_size, bias=config.include_bias, device=config.init_device
+        )
+
+    def reset_parameters(self):
+        super().reset_parameters()
+        self.attn_norm.reset_parameters()
+        self.ff_norm.reset_parameters()
+        # NOTE: the standard deviation for these weights does not depend on the layer.
+        init_weights(self.config, self.q_proj, d=self.config.d_model, layer_id=None)
+        init_weights(self.config, self.k_proj, d=self.config.d_model, layer_id=None)
+        init_weights(self.config, self.v_proj, d=self.config.d_model, layer_id=None)
+        init_weights(self.config, self.ff_proj, d=self.config.d_model, layer_id=None)
+        init_weights(self.config, self.up_proj, d=self.config.d_model, layer_id=None)  # newly added
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_bias: Optional[torch.Tensor] = None,
+        layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        use_cache: bool = False,
+        engram_hash: Optional[torch.Tensor] = None,
+    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
+        if self.engram is not None:
+            assert input_ids is not None
+            x = x + self.engram(x, input_ids, engram_hash=engram_hash)
+
+        # Get query, key, value projections.
+        # shape:
+        #  - for regular attn q, k, v: (batch_size, seq_len, d_model)
+        #  - for multi-query attn q: (batch_size, seq_len, d_model)
+        #                      k, v: (batch_size, seq_len, d_model // n_heads)
+        #  - for group query attn q: (batch_size, seq_len, d_model)
+        #                      k, v: (batch_size, seq_len, d_model // n_kv_heads)
+        x_normed = self.attn_norm(x)
+        q = self.q_proj(x_normed)
+        k = self.k_proj(x_normed)
+        v = self.v_proj(x_normed)
+
+        # Get attention scores.
+        if self._activation_checkpoint_fn is not None:
+            att, cache = self._activation_checkpoint_fn(  # type: ignore
+                self.attention, q, k, v, attention_bias, layer_past=layer_past, use_cache=use_cache
+            )
+        else:
+            att, cache = self.attention(q, k, v, attention_bias, layer_past=layer_past, use_cache=use_cache)
+
+        # Add the attention output (residual connection).
+        # shape: (B, T, C)
+        x = x + self.dropout(att)
+
+        # Add feed-forward projection.
+        # shape: (batch_size, seq_len, d_model)
+        og_x = x
+        if self._activation_checkpoint_fn is not None:
+            x = self._activation_checkpoint_fn(self.ff_norm, x)  # type: ignore
+        else:
+            x = self.ff_norm(x)
+        x, x_up = self.ff_proj(x), self.up_proj(x)  # newly added
+        if self._activation_checkpoint_fn is not None:
+            x = self._activation_checkpoint_fn(self.act, x)  # type: ignore
+        else:
+            x = self.act(x)
+        x = x * x_up  # newly added
+        x = self.ff_out(x)
+        x = self.dropout(x)
+        x = og_x + x
+
+        return x, cache
+
+
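The `ff_proj`/`up_proj` pair in `LLaDALlamaBlock` forms a gated (SwiGLU-style) MLP: the activated gate branch is multiplied elementwise by the up branch before `ff_out`. A scalar sketch with plain floats standing in for the linear layers (hypothetical names, assuming a SiLU activation):

```python
import math

def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gated_mlp(x: float, w_gate: float, w_up: float, w_down: float) -> float:
    """Scalar sketch of the gated feed-forward above:
    ff_proj -> act, multiplied elementwise by up_proj, then ff_out.
    Weights here are plain floats standing in for linear layers."""
    return w_down * (silu(w_gate * x) * (w_up * x))
```

The multiplicative `x_up` path lets the network gate the nonlinear branch per dimension, which is the key difference from the plain MLP in `LLaDASequentialBlock`.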
+class LLaDAOutput(NamedTuple):
+    logits: torch.FloatTensor
+    """
+    A tensor of shape `(batch_size, seq_len, vocab_size)` representing the log probabilities
+    for the next token *before* normalization via (log) softmax.
+    """
+
+    attn_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]]
+    """
+    Attention keys and values from each block.
+    """
+
+    hidden_states: Optional[Tuple[torch.Tensor]]
+    """
+    Hidden states from each block.
+    """
+
+
+class LLaDAGenerateOutput(NamedTuple):
+    token_ids: torch.LongTensor
+    """
+    The generated token IDs, a tensor of shape `(batch_size, beam_size, max_steps)`.
+    These do *not* include the original input IDs.
+    """
+
+    scores: torch.FloatTensor
+    """
+    The scores of the generated sequences, a tensor of shape `(batch_size, beam_size)`.
+    """
+
+
+class LLaDABlockGroup(nn.ModuleList):
+    def __init__(self, config: ModelConfig, layer_offset: int, modules: Optional[Iterable[nn.Module]] = None):
+        super().__init__(modules)
+        self.config = config
+        self.layer_offset = layer_offset
+        self.activation_checkpointing_strategy: Optional[ActivationCheckpointingStrategy] = None
+        self._activation_checkpoint_fn = activation_checkpoint_function(self.config)
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_bias: Optional[torch.FloatTensor] = None,
+        layers_past: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
+        use_cache: bool = False,
+        engram_hashes: Optional[Dict[int, torch.Tensor]] = None,
+    ) -> Tuple[torch.Tensor, Optional[List[Tuple[torch.Tensor, torch.Tensor]]]]:
+        attn_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = [] if use_cache else None
+        for block_idx, block in enumerate(self):
+            layer_past = None if layers_past is None else layers_past[block_idx]
+            block_idx += self.layer_offset
+            if (
+                (self.activation_checkpointing_strategy == ActivationCheckpointingStrategy.whole_layer)
+                or (
+                    self.activation_checkpointing_strategy == ActivationCheckpointingStrategy.one_in_two
+                    and block_idx % 2 == 0
+                )
+                or (
+                    self.activation_checkpointing_strategy == ActivationCheckpointingStrategy.one_in_three
+                    and block_idx % 3 == 0
+                )
+                or (
+                    self.activation_checkpointing_strategy == ActivationCheckpointingStrategy.one_in_four
+                    and block_idx % 4 == 0
+                )
+            ):
+                # shape: (batch_size, seq_len, d_model)
+                x, cache = self._activation_checkpoint_fn(  # type: ignore
+                    block, x, input_ids=input_ids, attention_bias=attention_bias, layer_past=layer_past, use_cache=use_cache,
+                    engram_hash=None if engram_hashes is None else engram_hashes.get(block_idx)
+                )
+            else:
+                # shape: (batch_size, seq_len, d_model)
+                x, cache = block(x, input_ids=input_ids, attention_bias=attention_bias, layer_past=layer_past, use_cache=use_cache,
+                                 engram_hash=None if engram_hashes is None else engram_hashes.get(block_idx))
+            if attn_key_values is not None:
+                assert cache is not None
+                attn_key_values.append(cache)
+        return x, attn_key_values
+
+    def reset_parameters(self):
+        for block in self:
+            block.reset_parameters()
+
+    def set_activation_checkpointing(self, strategy: Optional[ActivationCheckpointingStrategy]):
+        self.activation_checkpointing_strategy = strategy
+        for block in self:
+            block.set_activation_checkpointing(strategy)
+
+
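The `one_in_k` branches in `LLaDABlockGroup.forward` checkpoint every k-th global layer index (after adding `layer_offset`). A minimal sketch of that selection rule (hypothetical helper, not from the source):

```python
def checkpointed_layers(n_layers: int, every: int) -> list:
    """Global layer indices that get activation checkpointing under a
    'one_in_k' strategy, mirroring the `block_idx % k == 0` test above.
    `every=1` corresponds to the whole_layer strategy."""
    return [i for i in range(n_layers) if i % every == 0]
```

This trades recomputation for memory: only the selected layers discard their activations during the forward pass.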
+class LLaDAModel(nn.Module):
+    def __init__(self, config: ModelConfig, init_params: bool = True):
+        super().__init__()
+        self.config = config
+        self.__cache = BufferCache()
+
+        # Validate config.
+        if self.config.alibi and self.config.flash_attention:
+            raise Exception("ALiBi is currently not supported with FlashAttention")
+
+        if self.config.alibi and self.config.rope:
+            raise Exception("ALiBi and RoPE are mutually exclusive")
+
+        if self.config.embedding_size is not None and self.config.embedding_size != self.config.vocab_size:
+            if self.config.embedding_size < self.config.vocab_size:
+                raise Exception("embedding size should be at least as big as vocab size")
+            elif self.config.embedding_size % 128 != 0:
+                import warnings
+
+                warnings.warn(
+                    "Embedding size is not a multiple of 128! This could hurt throughput performance.", UserWarning
+                )
+
+        self.activation_checkpointing_strategy: Optional[ActivationCheckpointingStrategy] = None
+        self._activation_checkpoint_fn: Callable = activation_checkpoint_function(self.config)
+
+        if not (
+            0 < self.config.block_group_size <= self.config.n_layers
+            and self.config.n_layers % self.config.block_group_size == 0
+        ):
+            raise Exception("n layers must be divisible by block group size")
+
+        torch.backends.cuda.enable_flash_sdp(True)
+        torch.backends.cuda.enable_mem_efficient_sdp(False)  # this is super slow so make sure torch won't use it
+
+        self.transformer = nn.ModuleDict(
+            dict(
+                wte=nn.Embedding(
+                    config.embedding_size or config.vocab_size, config.d_model, device=config.init_device
+                ),
+                emb_drop=Dropout(config.embedding_dropout),
+                ln_f=LayerNorm.build(config),
+            )
+        )
+
+        blocks = [LLaDABlock.build(i, config, self.__cache) for i in range(config.n_layers)]
+        if self.config.block_group_size > 1:
+            block_groups = [
+                LLaDABlockGroup(config, i, blocks[i : i + config.block_group_size])
+                for i in range(0, config.n_layers, config.block_group_size)
+            ]
+            self.transformer.update({"block_groups": nn.ModuleList(block_groups)})
+        else:
+            self.transformer.update({"blocks": nn.ModuleList(blocks)})
+
+        if not (self.config.alibi or self.config.rope):
+            self.transformer.update(
+                {"wpe": nn.Embedding(config.max_sequence_length, config.d_model, device=config.init_device)}
+            )
+        if not config.weight_tying:
+            self.transformer.update(
+                {
+                    "ff_out": nn.Linear(
+                        config.d_model,
+                        config.embedding_size or config.vocab_size,
+                        bias=config.include_bias,
+                        device=config.init_device,
+                    )
+                }
+            )
+        # When `init_device="meta"` FSDP will call `reset_parameters()` to initialize weights.
+        if init_params and self.config.init_device != "meta":
+            self.reset_parameters()
+        self.__num_fwd_flops: Optional[int] = None
+
+        # Warm up cache.
+        if self.config.alibi:
+            get_causal_attention_bias(self.__cache, config.max_sequence_length, _non_meta_init_device(config))
+            self.get_alibi_attention_bias(config.max_sequence_length, _non_meta_init_device(config))
+
+    def set_activation_checkpointing(self, strategy: Optional[ActivationCheckpointingStrategy]):
+        self.activation_checkpointing_strategy = strategy
+        if self.config.block_group_size != 1:
+            for block_group in self.transformer.block_groups:
+                block_group.set_activation_checkpointing(strategy)
+        else:
+            for block in self.transformer.blocks:
+                block.set_activation_checkpointing(strategy)
+
+    @property
+    def device(self) -> torch.device:
+        device: torch.device = self.transformer.wte.weight.device  # type: ignore
+        if device.type == "meta":
+            return _non_meta_init_device(self.config)
+        else:
+            return device
+
+    def reset_parameters(self):
+        log.info("Initializing model parameters...")
+        # Top-level embeddings / linear layers.
+        init_weights(
+            self.config,
+            self.transformer.wte,  # type: ignore
+            std_factor=(0.5 * math.sqrt(self.config.d_model)) if self.config.scale_logits else 1.0,
+            type_of_module=ModuleType.emb,
+        )
+        if hasattr(self.transformer, "wpe"):
+            init_weights(self.config, self.transformer.wpe, type_of_module=ModuleType.emb)  # type: ignore
+
+        # Top-level layer norm.
+        self.transformer.ln_f.reset_parameters()  # type: ignore
+
+        # Output weights.
+        if hasattr(self.transformer, "ff_out"):
+            init_weights(self.config, self.transformer.ff_out, type_of_module=ModuleType.final_out)  # type: ignore
+
+        # Let the blocks handle themselves.
+        if self.config.block_group_size == 1:
+            for block in self.transformer.blocks:
+                block.reset_parameters()
+        else:
+            for block_group in self.transformer.block_groups:
+                block_group.reset_parameters()
+
+    def get_alibi_attention_bias(self, seq_len: int, device: torch.device) -> torch.Tensor:
+        if (alibi_bias := self.__cache.get("alibi_attention_bias")) is not None and alibi_bias.shape[
+            -1
+        ] >= seq_len:
+            if alibi_bias.device != device:
+                alibi_bias = alibi_bias.to(device)
+                self.__cache["alibi_attention_bias"] = alibi_bias
+            return alibi_bias
+        with torch.autocast(device.type, enabled=False):
+            alibi_bias = alibi_attention_bias(seq_len, self.config, device)
+        self.__cache["alibi_attention_bias"] = alibi_bias
+        return alibi_bias
+
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        input_embeddings: Optional[torch.FloatTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        attention_bias: Optional[torch.Tensor] = None,
+        past_key_values: Optional[Sequence[Tuple[torch.Tensor, torch.Tensor]]] = None,
+        use_cache: bool = False,
+        last_logits_only: bool = False,
+        output_hidden_states: Optional[bool] = None,
+        engram_hashes: Optional[Dict[int, torch.Tensor]] = None,
+    ) -> LLaDAOutput:
+        """
+        :param input_ids: A tensor of shape `(batch_size, seq_len)`.
+        :param input_embeddings: A tensor of shape `(batch_size, seq_len, d_model)` with input
+            embeddings. When provided, it is treated as the output of the input embedding layer.
+        :param attention_mask: A tensor of shape `(batch_size, seq_len)` that indicates
+            which input IDs are masked. A `1` value in the mask means that
+            the corresponding input ID should *not* be ignored. A `0` means
+            that the corresponding input ID is masked.
+            This has the same meaning as the `attention_mask` in HuggingFace's `transformers`
+            library.
+        :param attention_bias: A tensor of shape `(batch_size, 1, seq_len, seq_len)`,
+            `(1, 1, seq_len, seq_len)`, or `(seq_len, seq_len)`. This is used
+            to introduce causal or other biases.
+            If the tensor is a bool or byte tensor, a `True` or `1` at `attention_bias[:, :, i, j]`
+            indicates that the i-th element in the sequence is allowed to attend to the j-th
+            element in the sequence.
+            If the tensor is a float tensor, it will just be added to the attention
+            scores before the softmax.
+            The default is causal, which corresponds to a lower-diagonal byte matrix of ones.
+        :param past_key_values: Pre-computed keys and values for each attention block.
+            Can be used to speed up sequential decoding. The `input_ids` which have
+            their past given to this model should not be passed as `input_ids` as they have already been computed.
+        :param use_cache: If `True`, return key and value tensors for each block.
+        :param last_logits_only: If `True`, only compute the logits for the last token of each sequence.
+            This can speed up decoding when you only care about the next token.
+        """
+        # Basic MDM model config checks.
+        assert not self.config.alibi, "ALiBi length extrapolation is not supported for MDM."
+        assert self.config.rope, "RoPE must be used in the Llama encoder for MDM."
+        assert (past_key_values is None and not use_cache), "The KV cache is not supported for MDM."
+
+        output_hidden_states = output_hidden_states if output_hidden_states is not None else False
+
+        if past_key_values:
+            assert len(past_key_values) == self.config.n_layers
+
+        batch_size, seq_len = input_ids.size() if input_embeddings is None else input_embeddings.size()[:2]
+        if past_key_values is None:
+            past_length = 0
+        else:
+            past_length = past_key_values[0][0].size(-2)
+
+        # Get embeddings of input.
+        # shape: (batch_size, seq_len, d_model)
+        x = self.transformer.wte(input_ids) if input_embeddings is None else input_embeddings  # type: ignore
+
+        if self.config.input_emb_norm:
+            x = x * (self.config.d_model**0.5)
+
+        if not (self.config.alibi or self.config.rope):
+            # Get positional embeddings.
+            # shape: (1, seq_len)
+            pos = torch.arange(past_length, past_length + seq_len, dtype=torch.long, device=x.device).unsqueeze(0)
+            # shape: (1, seq_len, d_model)
+            pos_emb = self.transformer.wpe(pos)  # type: ignore
+            x = pos_emb + x
+
+        # Add input + positional embeddings and apply dropout.
+        # shape: (batch_size, seq_len, d_model)
+        x = self.transformer.emb_drop(x)  # type: ignore
+
+        # Transform the attention mask into what the blocks expect.
+        if attention_mask is not None and 0.0 in attention_mask:
+            # shape: (batch_size, 1, 1, seq_len)
+            attention_mask = attention_mask.to(dtype=torch.float).view(batch_size, -1)[:, None, None, :]
+            attention_mask = (1.0 - attention_mask) * torch.finfo(attention_mask.dtype).min
+        else:
+            attention_mask = None
+
+        # Merge attention mask with attention bias.
+        if (
+            attention_bias is not None
+            or attention_mask is not None
+            or self.config.alibi
+            # NOTE (epwalsh): we need to initialize the attn bias in order for attn to work properly
+            # with key+value cache. Otherwise `F.scaled_dot_product_attention()` doesn't seem to compute
+            # scores correctly.
+            or past_key_values is not None
+        ):
+            if attention_bias is None and self.config.alibi:
+                attention_bias = get_causal_attention_bias(
+                    self.__cache, past_length + seq_len, x.device
+                ) + self.get_alibi_attention_bias(past_length + seq_len, x.device)
+            elif attention_bias is None:
+                attention_bias = get_causal_attention_bias(self.__cache, past_length + seq_len, x.device)
+            elif attention_bias.dtype in (torch.int8, torch.bool):
+                attention_bias = attention_bias.to(dtype=torch.float)
+                attention_bias.masked_fill_(attention_bias == 0.0, torch.finfo(attention_bias.dtype).min)
+
+            # Transform to the right shape and data type.
+            mask_len = seq_len
+            if attention_mask is not None:
+                mask_len = attention_mask.shape[-1]
+            elif past_key_values is not None:
+                mask_len = past_key_values[0][0].shape[-2] + seq_len
+            attention_bias = attention_bias[:, :, :mask_len, :mask_len].to(dtype=torch.float)
+
+            # Add in the masking bias.
+            if attention_mask is not None:
+                attention_bias = attention_bias + attention_mask
+                # Might get -infs after adding attention mask, since dtype.min + dtype.min = -inf.
+                # `F.scaled_dot_product_attention()` doesn't handle -inf like you'd expect, instead
+                # it can produce NaNs.
+                ensure_finite_(attention_bias, check_neg_inf=True, check_pos_inf=False)
+
+        attn_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = [] if use_cache else None
+
+        # Decoder layers.
+        all_hidden_states = []
+
+        # Apply blocks one-by-one.
+        if self.config.block_group_size == 1:
+            for block_idx, block in enumerate(self.transformer.blocks):
+                if output_hidden_states:
+                    # Add hidden states.
+                    all_hidden_states.append(x)
+
+                layer_past = None if past_key_values is None else past_key_values[block_idx]
+                if (
+                    (self.activation_checkpointing_strategy == ActivationCheckpointingStrategy.whole_layer)
1690
+ or (
1691
+ self.activation_checkpointing_strategy == ActivationCheckpointingStrategy.one_in_two
1692
+ and block_idx % 2 == 0
1693
+ )
1694
+ or (
1695
+ self.activation_checkpointing_strategy == ActivationCheckpointingStrategy.one_in_three
1696
+ and block_idx % 3 == 0
1697
+ )
1698
+ or (
1699
+ self.activation_checkpointing_strategy == ActivationCheckpointingStrategy.one_in_four
1700
+ and block_idx % 4 == 0
1701
+ )
1702
+ ):
1703
+ # shape: (batch_size, seq_len, d_model)
1704
+ x, cache = self._activation_checkpoint_fn(
1705
+ block, x, input_ids=input_ids, attention_bias=attention_bias, layer_past=layer_past, use_cache=use_cache,
1706
+ engram_hash=None if engram_hashes is None else engram_hashes.get(block_idx)
1707
+ )
1708
+ else:
1709
+ # shape: (batch_size, seq_len, d_model)
1710
+ x, cache = block(x, input_ids=input_ids, attention_bias=attention_bias, layer_past=layer_past, use_cache=use_cache,
1711
+ engram_hash=None if engram_hashes is None else engram_hashes.get(block_idx))
1712
+ if attn_key_values is not None:
1713
+ assert cache is not None
1714
+ attn_key_values.append(cache)
1715
+ else:
1716
+ for group_idx, block_group in enumerate(self.transformer.block_groups):
1717
+ if output_hidden_states:
1718
+ # add hidden states
1719
+ all_hidden_states.append(x)
1720
+
1721
+ layers_past = (
1722
+ None
1723
+ if past_key_values is None
1724
+ else past_key_values[
1725
+ group_idx * self.config.block_group_size : (group_idx + 1) * self.config.block_group_size
1726
+ ]
1727
+ )
1728
+ x, cache = block_group(
1729
+ x, input_ids=input_ids, attention_bias=attention_bias, layers_past=layers_past, use_cache=use_cache,
1730
+ engram_hashes=engram_hashes
1731
+ )
1732
+ if attn_key_values is not None:
1733
+ assert cache is not None
1734
+ attn_key_values.extend(cache)
1735
+
1736
+ if last_logits_only:
1737
+ # shape: (batch_size, 1, d_model)
1738
+ x = x[:, -1, :].unsqueeze(1)
1739
+
1740
+ # Apply final layer norm.
1741
+ # shape: (batch_size, seq_len or 1, d_model)
1742
+ x = self.transformer.ln_f(x) # type: ignore
1743
+ if output_hidden_states:
1744
+ # add final hidden state post-final-layernorm, following HuggingFace's convention
1745
+ all_hidden_states.append(x)
1746
+
1747
+ # Get logits.
1748
+ # shape: (batch_size, seq_len or 1, vocab_size)
1749
+ if self.config.weight_tying:
1750
+ logits = F.linear(x, self.transformer.wte.weight, None) # type: ignore
1751
+ else:
1752
+ logits = self.transformer.ff_out(x) # type: ignore
1753
+ if self.config.scale_logits:
1754
+ logits.mul_(1 / math.sqrt(self.config.d_model))
1755
+
1756
+ return LLaDAOutput(logits=logits, attn_key_values=attn_key_values, hidden_states=tuple(all_hidden_states) if output_hidden_states else None) # type: ignore[arg-type]
1757
+
1758
+
1759
+def create_model_config_from_pretrained_config(config: LLaDAConfig):
+    """
+    Utility function: build a `ModelConfig` from a pretrained `LLaDAConfig`,
+    re-hydrating a nested `engram_config` dict into an `EngramConfig`.
+    """
+
+    kwargs = {}
+    for field in fields(ModelConfig):
+        val = getattr(config, field.name, None)
+        if field.name == "engram_config" and isinstance(val, dict):
+            val = EngramConfig(**val)
+        kwargs[field.name] = val
+
+    model_config = ModelConfig(**kwargs)
+    return model_config
+
+
+class LLaDAModelLM(PreTrainedModel):
+    """
+    Extremely barebones HF model wrapper.
+    """
+
+    config_class = LLaDAConfig
+    base_model_prefix = "model"
+    _no_split_modules = ["LLaDABlock", "LLaDASequentialBlock", "LLaDALlamaBlock"]
+
+    def __init__(self, config: LLaDAConfig, model: Optional[LLaDAModel] = None, init_params: bool = False):
+        super().__init__(config)
+
+        if not model:
+            model_config = create_model_config_from_pretrained_config(config)
+            # Initialize model (always on CPU to start with so we don't run out of GPU memory).
+            model_config.init_device = "cpu"
+            self.model = LLaDAModel(model_config, init_params=init_params)
+        else:
+            self.model = model
+
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        attention_bias: Optional[torch.Tensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        cache_position: Optional[Cache] = None,  # This is a hack mitigation of an issue in transformers `4.39.x`
+        engram_hashes: Optional[Dict[int, torch.Tensor]] = None,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        if use_cache is None:
+            use_cache = self.config.use_cache
+
+        if output_attentions:
+            raise ValueError("output_attentions is not yet supported in LLaDA")
+
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model.forward(
+            input_ids=input_ids,
+            input_embeddings=inputs_embeds,
+            attention_mask=attention_mask,
+            attention_bias=attention_bias,
+            past_key_values=past_key_values,
+            use_cache=use_cache,
+            output_hidden_states=output_hidden_states,
+            engram_hashes=engram_hashes,
+        )
+
+        logits = outputs.logits
+        hidden_states = outputs.hidden_states
+
+        loss = None
+        if labels is not None:
+            import warnings
+
+            warnings.warn("Note that for LLaDA, you cannot calculate the loss here.", UserWarning)
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+
+        return CausalLMOutputWithPast(
+            logits=logits,
+            past_key_values=outputs.attn_key_values,
+            hidden_states=hidden_states,
+        )
+
+    def can_generate(self) -> bool:
+        return True
+
+    def prepare_inputs_for_generation(
+        self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple]] = None, **kwargs
+    ):
+        if past_key_values:
+            # This is because we want the model to only process the last generated token.
+            input_ids = input_ids[:, -1:]
+        model_inputs = {"input_ids": input_ids, "past_key_values": past_key_values}
+
+        model_inputs.update(kwargs)
+        model_inputs["use_cache"] = kwargs.pop("use_cache", self.config.use_cache)
+        return model_inputs
+
+    # TODO: these are required to make the implementation complete.
+    # def resize_position_embeddings(self, new_num_position_embeddings: int):
+    #     pass
+    #
+    # def get_position_embeddings(self) -> Union[nn.Embedding, Tuple[nn.Embedding]]:
+    #     pass
+    #
+    # def _reorder_cache(self, past_key_values, beam_idx):
+    #     pass
+
+    def get_input_embeddings(self) -> torch.nn.Module:
+        return self.model.transformer.wte
+
+    def set_input_embeddings(self, value: torch.nn.Module):
+        self.model.transformer.wte = value
+
+    def get_output_embeddings(self):
+        if self.config.weight_tying:
+            return self.model.transformer.wte
+        else:
+            return self.model.transformer.ff_out
+
+    def set_output_embeddings(self, value: torch.nn.Module):
+        if self.config.weight_tying:
+            self.model.transformer.wte = value
+        else:
+            self.model.transformer.ff_out = value
+
+    def tie_weights(self):
+        if self.config.weight_tying:
+            self.model.transformer.ff_out = self.model.transformer.wte
+
+
+# Register the model so that it is available for transformer pipelines, auto-loading, etc.
+AutoModel.register(LLaDAConfig, LLaDAModelLM)
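
The forward pass above folds a 1/0 keep/pad attention mask into the additive attention bias by mapping kept positions to a zero bias and padded positions to the dtype's minimum value, which drives their post-softmax attention weights to (effectively) zero. A minimal torch-free sketch of that `(1.0 - mask) * finfo.min` conversion, with float32's minimum hard-coded for illustration:

```python
# Additive attention-mask conversion, as in the forward pass above:
# kept positions (mask == 1.0) contribute ~zero bias, padded positions
# (mask == 0.0) a huge negative bias that softmax turns into ~0 weight.

FLOAT32_MIN = -3.4028234663852886e38  # value of torch.finfo(torch.float32).min


def mask_to_bias(attention_mask):
    """Map a list of 1.0/0.0 keep/pad flags to an additive bias."""
    return [(1.0 - m) * FLOAT32_MIN for m in attention_mask]


bias = mask_to_bias([1.0, 1.0, 0.0])
# first two entries are zero; the last is a very large negative number
```

Note that adding two such biases together can underflow to `-inf`, which is exactly why the model calls `ensure_finite_` after summing the mask into the bias.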
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,8 @@
+{
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "tokenizer_file": "tokenizer.json",
+  "bos_token": null,
+  "eos_token": null,
+  "pad_token": 6629,
+  "unk_token": 6630
+}
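
The block loop in the model code above checkpoints activations on a fixed cadence: `whole_layer` recomputes every block, `one_in_two` every 2nd block, `one_in_three` every 3rd, and so on, always starting from block 0 (`block_idx % step == 0`). A small sketch of that selection rule in plain Python (the string strategy names stand in for the `ActivationCheckpointingStrategy` enum members):

```python
# Activation-checkpointing cadence, mirroring the block-loop conditions
# above: a block is checkpointed when its index is a multiple of the
# strategy's step size.

CADENCE = {"whole_layer": 1, "one_in_two": 2, "one_in_three": 3, "one_in_four": 4}


def should_checkpoint(strategy, block_idx):
    """Return True if this block's activations should be recomputed in backward."""
    step = CADENCE.get(strategy)
    return step is not None and block_idx % step == 0


checkpointed = [i for i in range(8) if should_checkpoint("one_in_three", i)]
# -> blocks 0, 3, 6
```

Trading recompute for memory this way lets the cadence be tuned per run; `None`/unknown strategies checkpoint nothing.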