Upload smj-diffusion checkpoint (step 12000)

Files changed (13)

.gitattributes +1 -0
README.md +130 -0
added_tokens.json +25 -0
chat_template.jinja +54 -0
config.json +34 -0
inference.py +692 -0
merges.txt +0 -0
modeling_diffusion_qwen3.py +515 -0
pytorch_model.bin +3 -0
special_tokens_map.json +38 -0
tokenizer.json +3 -0
tokenizer_config.json +216 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,130 @@

+# smj-diffusion
+A discrete diffusion language model for code generation, based on the CoDA (Coding LM via Diffusion Adaptation) architecture.
+> ⚠️ **Note:** This is an intermediate checkpoint (step 12,000) from an interrupted training run. The model may not be fully trained.
+## Model Details
+| Property | Value |
+|----------|-------|
+| **Architecture** | DiffusionQwen3 (Bidirectional Transformer) |
+| **Base Model** | Qwen-based architecture |
+| **Hidden Size** | 1536 |
+| **Layers** | 28 |
+| **Attention Heads** | 12 |
+| **KV Heads** | 2 (GQA) |
+| **Intermediate Size** | 8960 |
+| **Max Position Embeddings** | 32,768 |
+| **Vocab Size** | 151,666 |
+| **Training Checkpoint** | 12,000 steps |
+## How Diffusion LMs Work
+Unlike autoregressive models that generate tokens left-to-right, this model uses **discrete diffusion**:
+1. Start with all `<|mask|>` tokens in the generation region
+2. Iteratively unmask tokens based on model confidence
+3. Higher-confidence predictions are revealed first
+4. Process repeats until all tokens are generated
+This enables **bidirectional context** during generation, potentially improving coherence for code.
+## Usage
+### Installation
+```bash
+pip install torch transformers
+```
+### Inference
+```python
+import torch
+from transformers import AutoTokenizer
+# Load tokenizer
+tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/smj-diffusion", trust_remote_code=True)
+# Load model (see inference.py for full diffusion generation logic)
+# The model uses custom DiffusionQwen3Model class
+```
+For full inference with diffusion sampling, use the included `inference.py` script:
+```bash
+# Single prompt
+python inference.py --checkpoint /path/to/model --prompt "def fibonacci(n):"
+# Interactive chat
+python inference.py --checkpoint /path/to/model --mode chat
+# With custom parameters
+python inference.py --checkpoint /path/to/model \
+    --prompt "Write a function to sort a list" \
+    --steps 128 \
+    --temperature 0.0 \
+    --max-tokens 256 \
+    --alg entropy
+```
+### Generation Parameters
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `steps` | 128 | Number of diffusion denoising steps |
+| `temperature` | 0.0 | Sampling temperature (0 = greedy) |
+| `top_p` | None | Nucleus sampling threshold |
+| `top_k` | None | Top-k sampling |
+| `alg` | entropy | Sampling algorithm: `origin`, `entropy`, `maskgit_plus`, `topk_margin` |
+| `alg_temp` | 0.1 | Algorithm-specific confidence temperature |
+## Model Architecture
+The model is a bidirectional transformer (non-causal attention) trained with discrete diffusion objectives:
+```
+DiffusionQwen3Model(
+  (model): Qwen2Model with bidirectional attention
+  (lm_head): Linear(1536, 151666)
+)
+```
+### Training Objective
+- **Forward process:** Randomly mask tokens with probability `σ ~ U[ε, 1]`
+- **Reverse process:** Predict original tokens from masked input
+- **Loss weighting:** `1/σ` (ELBO-derived)
+## Files
+- `pytorch_model.bin` - Model weights
+- `config.json` - Model configuration
+- `tokenizer.json`, `vocab.json`, `merges.txt` - Tokenizer files
+- `inference.py` - Standalone inference script
+- `modeling_diffusion_qwen3.py` - Model class definition
+## Limitations
+- This is a **checkpoint from interrupted training** - not a fully trained model
+- Performance may be limited compared to fully trained models
+- Primarily designed for code generation tasks
+## Citation
+Based on CoDA by Salesforce AI Research:
+```bibtex
+@article{coda2025,
+  title={CoDA: Coding LM via Diffusion Adaptation},
+  author={Salesforce AI Research},
+  journal={arXiv preprint arXiv:2510.03270},
+  year={2025}
+}
+```
+## License
+Please refer to the base Qwen model license for usage terms.
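
The training objective listed in the README (mask with probability `σ ~ U[ε, 1]`, weight the loss by `1/σ`) can be illustrated with a minimal, standard-library-only sketch. This is not the training code from this commit: `forward_mask` and `weighted_loss` are hypothetical helper names, the per-token negative log-likelihoods are supplied by the caller instead of a model, and the normalisation by sequence length is an assumption taken from the note in the `modeling_diffusion_qwen3.py` docstring.

```python
import random

MASK = "<|mask|>"  # mask token string from added_tokens.json


def forward_mask(tokens, eps=1e-3, rng=random):
    """Forward process: draw sigma ~ U[eps, 1], then mask each token with probability sigma."""
    sigma = rng.uniform(eps, 1.0)
    is_masked = [rng.random() < sigma for _ in tokens]
    noisy = [MASK if masked else tok for tok, masked in zip(tokens, is_masked)]
    return noisy, is_masked, sigma


def weighted_loss(per_token_nll, is_masked, sigma):
    """Sum the NLL over masked positions, weight by 1/sigma, divide by sequence length.

    Dividing by the sequence length (not the number of masked tokens) is an
    assumption based on the modeling file's docstring.
    """
    masked_nll = sum(nll for nll, masked in zip(per_token_nll, is_masked) if masked)
    return masked_nll / sigma / len(per_token_nll)


if __name__ == "__main__":
    rng = random.Random(0)
    noisy, is_masked, sigma = forward_mask("def add ( a , b ) :".split(), rng=rng)
    print(f"sigma={sigma:.3f}", noisy)
```

A small `σ` masks few tokens, so each masked token carries a large `1/σ` weight; a `σ` near 1 masks almost everything and weights each token close to 1.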

added_tokens.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "</tool_call>": 151658,
+  "<tool_call>": 151657,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|mask|>": 151665,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}

chat_template.jinja ADDED

	@@ -0,0 +1,54 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}

config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "architectures": [
+    "DiffusionQwen3Model"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "block_masking_probability": 0.01,
+  "bos_token_id": null,
+  "dtype": "bfloat16",
+  "eos_token_id": 151645,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 1536,
+  "intermediate_size": 8960,
+  "mask_block_sizes": [
+    2,
+    4,
+    8
+  ],
+  "mask_token_id": 151665,
+  "max_position_embeddings": 32768,
+  "model_type": "diffusion_qwen3",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 28,
+  "num_key_value_heads": 2,
+  "pad_token_id": 151643,
+  "prefix_probability": 0.01,
+  "rms_norm_eps": 1e-06,
+  "rope_theta": 1000000.0,
+  "sampling_eps": 0.001,
+  "transformers_version": "4.57.3",
+  "truncate_probability": 0.01,
+  "vocab_size": 151666
+}
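
A few relationships between the fields in this `config.json` can be checked with a short script. The sketch below copies a subset of the values from the diff above; `derive` and `check_token_ids` are hypothetical helper names, and the derived widths assume the usual grouped-query attention layout (query width = heads × head_dim, key/value width = KV heads × head_dim), which the commit does not state explicitly.

```python
# Subset of the values shown in config.json above.
config = {
    "hidden_size": 1536,
    "head_dim": 128,
    "num_attention_heads": 12,
    "num_key_value_heads": 2,
    "vocab_size": 151666,
    "mask_token_id": 151665,
    "pad_token_id": 151643,
    "eos_token_id": 151645,
}


def derive(cfg):
    """Derive attention widths, assuming a standard grouped-query attention layout."""
    return {
        "q_width": cfg["num_attention_heads"] * cfg["head_dim"],
        "kv_width": cfg["num_key_value_heads"] * cfg["head_dim"],
        "queries_per_kv_head": cfg["num_attention_heads"] // cfg["num_key_value_heads"],
    }


def check_token_ids(cfg):
    """Return the names of special-token IDs that fall outside the vocabulary."""
    names = ("mask_token_id", "pad_token_id", "eos_token_id")
    return [name for name in names if not 0 <= cfg[name] < cfg["vocab_size"]]


if __name__ == "__main__":
    print(derive(config))
    print("out-of-range ids:", check_token_ids(config))
```

With these values the mask token occupies the last slot of the vocabulary (`mask_token_id == vocab_size - 1`), which matches `<|mask|>` having the highest ID in `added_tokens.json`.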

inference.py ADDED

	@@ -0,0 +1,692 @@

+#!/usr/bin/env python3
+"""
+Inference script for DiffusionQwen3 model checkpoint.
+Usage:
+    # Interactive chat mode
+    python inference.py --checkpoint ./outputs/pretrain/checkpoint-1000 --mode chat
+    # Single prompt completion
+    python inference.py --checkpoint ./outputs/pretrain/checkpoint-1000 --prompt "def fibonacci(n):"
+    # With custom generation parameters
+    python inference.py --checkpoint ./outputs/pretrain/checkpoint-1000 \
+        --prompt "Write a hello world in Python" \
+        --steps 128 --temperature 0.0 --max-tokens 256
+"""
+import argparse
+import sys
+import os
+from typing import Optional, Tuple, List
+import torch
+import torch.nn.functional as F
+import torch.distributions as dists
+from transformers import AutoTokenizer, PreTrainedModel, PretrainedConfig
+# ============================================================================
+# Diffusion Sampling Utilities (adapted from CoDALanguageModel/generation_utils.py)
+# ============================================================================
+def top_p_logits(logits: torch.Tensor, top_p: float) -> torch.Tensor:
+    """Apply nucleus (top-p) filtering to logits."""
+    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
+    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+    sorted_indices_to_remove = cumulative_probs > top_p
+    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+    sorted_indices_to_remove[..., 0] = 0
+    mask = torch.zeros_like(logits, dtype=torch.bool)
+    mask = mask.scatter_(-1, sorted_indices, sorted_indices_to_remove)
+    logits = logits.masked_fill(mask, torch.finfo(logits.dtype).min)
+    return logits
+def top_k_logits(logits: torch.Tensor, top_k: int) -> torch.Tensor:
+    """Apply top-k filtering to logits."""
+    top_k = min(top_k, logits.size(-1))
+    indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
+    logits = logits.masked_fill(indices_to_remove, torch.finfo(logits.dtype).min)
+    return logits
+def sample_tokens(
+    logits: torch.Tensor,
+    temperature: float = 0.0,
+    top_p: Optional[float] = None,
+    top_k: Optional[int] = None,
+    neg_entropy: bool = False,
+) -> Tuple[torch.Tensor, torch.Tensor]:
+    """
+    Sample tokens from logits with optional temperature, top-p, and top-k.
+    Returns:
+        confidence: Confidence scores for sampled tokens
+        x0: Sampled token IDs
+    """
+    if temperature > 0:
+        logits = logits / temperature
+    if top_p is not None and top_p < 1.0:
+        logits = top_p_logits(logits, top_p)
+    if top_k is not None:
+        logits = top_k_logits(logits, top_k)
+    probs = torch.softmax(logits, dim=-1)
+    if temperature > 0:
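+        try:
+            x0 = dists.Categorical(probs=probs).sample()
+            confidence = torch.gather(probs, -1, x0.unsqueeze(-1)).squeeze(-1)
+        except (RuntimeError, ValueError):
+            # Fall back to greedy decoding if sampling fails (e.g. invalid probabilities)
+            confidence, x0 = probs.max(dim=-1)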
+    else:
+        confidence, x0 = probs.max(dim=-1)
+    if neg_entropy:
+        # Use negative entropy as confidence (for entropy-based sampling)
+        epsilon = 1e-10
+        log_probs = torch.log(probs + epsilon)
+        confidence = torch.sum(probs * log_probs, dim=-1)
+    return confidence, x0
+# ============================================================================
+# Diffusion Generation
+# ============================================================================
+@torch.no_grad()
+def diffusion_generate(
+    model: PreTrainedModel,
+    input_ids: torch.LongTensor,
+    mask_token_id: int,
+    max_new_tokens: int = 128,
+    steps: int = 128,
+    temperature: float = 0.0,
+    top_p: Optional[float] = None,
+    top_k: Optional[int] = None,
+    alg: str = "entropy",
+    alg_temp: Optional[float] = 0.1,
+    eps: float = 1e-3,
+    verbose: bool = False,
+) -> torch.LongTensor:
+    """
+    Generate text using discrete diffusion.
+    Args:
+        model: The diffusion language model
+        input_ids: Input token IDs (prompt) [batch_size, prompt_len]
+        mask_token_id: Token ID for mask token
+        max_new_tokens: Maximum number of new tokens to generate
+        steps: Number of diffusion steps
+        temperature: Sampling temperature (0 = greedy)
+        top_p: Nucleus sampling threshold
+        top_k: Top-k sampling threshold
+        alg: Sampling algorithm ("origin", "entropy", "maskgit_plus", "topk_margin")
+        alg_temp: Algorithm-specific temperature for confidence weighting
+        eps: Final (smallest) timestep; the schedule runs linearly from 1 down to eps
+        verbose: Print progress during generation
+    Returns:
+        Generated token sequence [batch_size, prompt_len + max_new_tokens]
+    """
+    device = input_ids.device
+    batch_size = input_ids.shape[0]
+    prompt_len = input_ids.shape[1]
+    total_len = prompt_len + max_new_tokens
+    # Initialize sequence: prompt + mask tokens for generation
+    x = F.pad(input_ids, (0, max_new_tokens), value=mask_token_id)
+    # Create timesteps from 1 to eps
+    timesteps = torch.linspace(1, eps, steps + 1, device=device)
+    for i in range(steps):
+        mask_index = (x == mask_token_id)
+        if not mask_index.any():
+            if verbose:
+                print(f"Step {i}: No more masked tokens, stopping early")
+            break
+        # Forward pass
+        outputs = model(x, return_logits_only=True)
+        if hasattr(outputs, 'logits'):
+            logits = outputs.logits
+        elif isinstance(outputs, tuple):
+            logits = outputs[0]
+        else:
+            logits = outputs
+        # Shift logits right by one so position i uses the logits produced at position i-1
+        # (position 0 keeps its own logits)
+        logits = torch.cat([logits[:, :1], logits[:, :-1]], dim=1)
+        # Get logits only for masked positions
+        mask_logits = logits[mask_index]
+        t = timesteps[i]
+        s = timesteps[i + 1]
+        if alg == "origin":
+            # Original diffusion: random unmasking with probability 1 - s/t
+            p_transfer = 1 - s / t if i < steps - 1 else 1
+            x0 = torch.zeros_like(x[mask_index], device=device, dtype=torch.long) + mask_token_id
+            transfer_index = torch.rand(*x0.shape, device=device) < p_transfer
+            _, x0[transfer_index] = sample_tokens(
+                mask_logits[transfer_index],
+                temperature=temperature,
+                top_p=top_p,
+                top_k=top_k
+            )
+            x[mask_index] = x0.clone()
+        else:
+            # Confidence-based unmasking algorithms
+            if alg == "maskgit_plus":
+                confidence, x0 = sample_tokens(
+                    mask_logits, temperature=temperature, top_p=top_p, top_k=top_k
+                )
+            elif alg == "topk_margin":
+                # Margin confidence: difference between top-2 probabilities
+                probs = F.softmax(mask_logits / (temperature if temperature > 0 else 1), dim=-1)
+                sorted_probs, _ = torch.sort(probs, dim=-1, descending=True)
+                confidence = sorted_probs[:, 0] - sorted_probs[:, 1]
+                _, x0 = sample_tokens(
+                    mask_logits, temperature=temperature, top_p=top_p, top_k=top_k
+                )
+            elif alg == "entropy":
+                confidence, x0 = sample_tokens(
+                    mask_logits, temperature=temperature, top_p=top_p, top_k=top_k,
+                    neg_entropy=True
+                )
+            else:
+                raise ValueError(f"Unknown algorithm: {alg}")
+            # Determine how many tokens to unmask
+            num_mask_token = mask_index.sum() / batch_size
+            num_transfer = int(num_mask_token * (1 - s / t)) if i < steps - 1 else int(num_mask_token)
+            if num_transfer > 0:
+                # Create full confidence tensor
+                full_confidence = torch.full_like(x, -torch.inf, dtype=logits.dtype)
+                full_confidence[mask_index] = confidence
+                # Select top-k most confident positions to unmask
+                if alg_temp is None or alg_temp == 0:
+                    _, transfer_index = torch.topk(full_confidence, num_transfer)
+                else:
+                    # Stochastic selection with temperature
+                    conf_probs = F.softmax(full_confidence / alg_temp, dim=-1)
+                    transfer_index = torch.multinomial(conf_probs, num_samples=num_transfer)
+                # Create candidate tensor with predicted tokens
+                x_candidate = torch.zeros_like(x, dtype=torch.long) + mask_token_id
+                x_candidate[mask_index] = x0.clone()
+                # Update only selected positions
+                row_indices = torch.arange(batch_size, device=device).unsqueeze(1).expand_as(transfer_index)
+                x[row_indices, transfer_index] = x_candidate[row_indices, transfer_index]
+        if verbose and (i + 1) % max(1, steps // 10) == 0:
+            remaining_masks = (x == mask_token_id).sum().item()
+            print(f"Step {i+1}/{steps}: {remaining_masks} masked tokens remaining")
+    return x
+# ============================================================================
+# Model Loading
+# ============================================================================
+def load_model_and_tokenizer(
+    checkpoint_path: str,
+    device: str = "auto",
+    torch_dtype: str = "bfloat16",
+) -> Tuple[PreTrainedModel, AutoTokenizer, dict]:
+    """
+    Load the diffusion model and tokenizer from checkpoint.
+    Args:
+        checkpoint_path: Path to the checkpoint directory
+        device: Device to load model on ("auto", "cuda", "cpu")
+        torch_dtype: Data type for model weights
+    Returns:
+        model: Loaded model
+        tokenizer: Loaded tokenizer
+        config: Model configuration dict
+    """
+    import json
+    from transformers import Qwen2ForCausalLM, Qwen2Config
+    # Determine device
+    if device == "auto":
+        device = "cuda" if torch.cuda.is_available() else "cpu"
+    # Get dtype
+    dtype_map = {
+        "float32": torch.float32,
+        "float16": torch.float16,
+        "bfloat16": torch.bfloat16,
+    }
+    dtype = dtype_map.get(torch_dtype, torch.bfloat16)
+    if device == "cpu" and dtype == torch.bfloat16:
+        print("Warning: bfloat16 on CPU may be slow, using float32")
+        dtype = torch.float32
+    print(f"Loading model from {checkpoint_path}...")
+    print(f"  Device: {device}, Dtype: {dtype}")
+    # Load config
+    config_path = os.path.join(checkpoint_path, "config.json")
+    with open(config_path, "r") as f:
+        config_dict = json.load(f)
+    # Import the model class from modeling_diffusion_qwen3.py, which ships next to this script
+    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+    from modeling_diffusion_qwen3 import DiffusionQwen3Model, DiffusionQwen3Config
+    # Create diffusion config
+    diff_config = DiffusionQwen3Config(**config_dict)
+    # Create a Qwen2Config to initialize the base model architecture
+    qwen_config = Qwen2Config(
+        vocab_size=diff_config.vocab_size,
+        hidden_size=diff_config.hidden_size,
+        intermediate_size=diff_config.intermediate_size,
+        num_hidden_layers=diff_config.num_hidden_layers,
+        num_attention_heads=diff_config.num_attention_heads,
+        num_key_value_heads=diff_config.num_key_value_heads,
+        max_position_embeddings=diff_config.max_position_embeddings,
+        rms_norm_eps=diff_config.rms_norm_eps,
+        rope_theta=diff_config.rope_theta,
+        hidden_act=diff_config.hidden_act,
+        attention_dropout=diff_config.attention_dropout,
+        use_sliding_window=False,
+        pad_token_id=diff_config.pad_token_id,
+        bos_token_id=diff_config.bos_token_id,
+        eos_token_id=diff_config.eos_token_id,
+    )
+    # Create DiffusionQwen3Model with proper architecture
+    model = DiffusionQwen3Model(diff_config)
+    # Initialize the base Qwen2 model architecture
+    print("  Initializing model architecture...")
+    base_model = Qwen2ForCausalLM(qwen_config)
+    model._init_from_qwen(base_model)
+    del base_model  # Free memory
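+    # Load state dict
+    weights_path = os.path.join(checkpoint_path, "pytorch_model.bin")
+    if os.path.exists(weights_path):
+        print(f"  Loading weights from {weights_path}...")
+        state_dict = torch.load(weights_path, map_location="cpu", weights_only=True)
+    else:
+        # Fall back to model.safetensors, which torch.load cannot read
+        from safetensors.torch import load_file
+        weights_path = os.path.join(checkpoint_path, "model.safetensors")
+        print(f"  Loading weights from {weights_path}...")
+        state_dict = load_file(weights_path, device="cpu")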
+    # Handle potential key mismatches
+    missing, unexpected = model.load_state_dict(state_dict, strict=False)
+    if missing:
+        print(f"  Warning: Missing keys ({len(missing)}): {missing[:3]}{'...' if len(missing) > 3 else ''}")
+    if unexpected:
+        print(f"  Warning: Unexpected keys ({len(unexpected)}): {unexpected[:3]}{'...' if len(unexpected) > 3 else ''}")
+    # Move to device and set eval mode
+    model = model.to(device=device, dtype=dtype)
+    model.eval()
+    # Disable causal attention for bidirectional
+    model._disable_causal_masking()
+    # Load tokenizer
+    print("  Loading tokenizer...")
+    tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, trust_remote_code=True)
+    # Ensure mask token is set
+    if tokenizer.mask_token_id is None:
+        tokenizer.mask_token_id = config_dict.get("mask_token_id", 151665)
+    print(f"  Model loaded successfully!")
+    print(f"    Vocab size: {diff_config.vocab_size}")
+    print(f"    Hidden size: {diff_config.hidden_size}")
+    print(f"    Num layers: {diff_config.num_hidden_layers}")
+    print(f"    Mask token ID: {diff_config.mask_token_id}")
+    return model, tokenizer, config_dict
+# ============================================================================
+# Generation Wrapper
+# ============================================================================
+def generate(
+    model: PreTrainedModel,
+    tokenizer: AutoTokenizer,
+    prompt: str,
+    max_new_tokens: int = 128,
+    steps: int = 128,
+    temperature: float = 0.0,
+    top_p: Optional[float] = None,
+    top_k: Optional[int] = None,
+    alg: str = "entropy",
+    alg_temp: float = 0.1,
+    verbose: bool = False,
+) -> str:
+    """
+    Generate text from a prompt.
+    Args:
+        model: The diffusion language model
+        tokenizer: The tokenizer
+        prompt: Input prompt text
+        max_new_tokens: Maximum tokens to generate
+        steps: Diffusion steps
+        temperature: Sampling temperature
+        top_p: Nucleus sampling threshold
+        top_k: Top-k sampling threshold
+        alg: Sampling algorithm
+        alg_temp: Algorithm temperature
+        verbose: Print progress
+    Returns:
+        Generated text (prompt + completion)
+    """
+    device = next(model.parameters()).device
+    # Tokenize prompt
+    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
+    # Get mask token ID
+    mask_token_id = getattr(model.config, "mask_token_id", tokenizer.mask_token_id)
+    if mask_token_id is None:
+        mask_token_id = 151665  # Default from config
+    # Generate
+    output_ids = diffusion_generate(
+        model=model,
+        input_ids=input_ids,
+        mask_token_id=mask_token_id,
+        max_new_tokens=max_new_tokens,
+        steps=steps,
+        temperature=temperature,
+        top_p=top_p,
+        top_k=top_k,
+        alg=alg,
+        alg_temp=alg_temp,
+        verbose=verbose,
+    )
+    # Filter out mask and pad tokens
+    output_ids = output_ids[0]  # Remove batch dimension
+    pad_token_id = tokenizer.pad_token_id or 151643
+    output_ids = output_ids[output_ids != mask_token_id]
+    output_ids = output_ids[output_ids != pad_token_id]
+    # Decode
+    generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)
+    return generated_text
+def chat_generate(
+    model: PreTrainedModel,
+    tokenizer: AutoTokenizer,
+    messages: List[dict],
+    max_new_tokens: int = 256,
+    steps: int = 128,
+    temperature: float = 0.0,
+    top_p: Optional[float] = None,
+    top_k: Optional[int] = None,
+    alg: str = "entropy",
+    alg_temp: float = 0.1,
+    verbose: bool = False,
+) -> str:
+    """
+    Generate chat response from conversation history.
+    Args:
+        model: The diffusion language model
+        tokenizer: The tokenizer
+        messages: List of message dicts with 'role' and 'content'
+        Other args: Same as generate()
+    Returns:
+        Assistant response text
+    """
+    device = next(model.parameters()).device
+    # Apply chat template
+    prompt = tokenizer.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True,
+    )
+    # Tokenize
+    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
+    prompt_len = input_ids.shape[1]
+    # Get mask token ID
+    mask_token_id = getattr(model.config, "mask_token_id", tokenizer.mask_token_id)
+    if mask_token_id is None:
+        mask_token_id = 151665
+    # Generate
+    output_ids = diffusion_generate(
+        model=model,
+        input_ids=input_ids,
+        mask_token_id=mask_token_id,
+        max_new_tokens=max_new_tokens,
+        steps=steps,
+        temperature=temperature,
+        top_p=top_p,
+        top_k=top_k,
+        alg=alg,
+        alg_temp=alg_temp,
+        verbose=verbose,
+    )
+    # Get only the generated tokens (after prompt)
+    generated_ids = output_ids[0, prompt_len:]
+    # Filter out mask and pad tokens
+    pad_token_id = tokenizer.pad_token_id or 151643
+    generated_ids = generated_ids[generated_ids != mask_token_id]
+    generated_ids = generated_ids[generated_ids != pad_token_id]
+    # Decode
+    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
+    return response
+# ============================================================================
+# Interactive Chat
+# ============================================================================
+def interactive_chat(
+    model: PreTrainedModel,
+    tokenizer: AutoTokenizer,
+    system_prompt: str = "You are a helpful assistant.",
+    **gen_kwargs,
+):
+    """Run interactive chat session."""
+    print("\n" + "=" * 60)
+    print("Interactive Chat Mode")
+    print("=" * 60)
+    print("Commands:")
+    print("  /exit or /quit  - Exit the chat")
+    print("  /reset          - Reset conversation history")
+    print("  /system <text>  - Set new system prompt")
+    print("=" * 60 + "\n")
+    messages = [{"role": "system", "content": system_prompt}]
+    while True:
+        try:
+            user_input = input("\033[92mYou: \033[0m").strip()
+        except (EOFError, KeyboardInterrupt):
+            print("\nGoodbye!")
+            break
+        if not user_input:
+            continue
+        # Handle commands
+        if user_input.lower() in ["/exit", "/quit"]:
+            print("Goodbye!")
+            break
+        if user_input.lower() == "/reset":
+            messages = [{"role": "system", "content": system_prompt}]
+            print("\033[90mConversation reset.\033[0m")
+            continue
+        if user_input.lower().startswith("/system "):
+            system_prompt = user_input[8:].strip()
+            messages = [{"role": "system", "content": system_prompt}]
+            print("\033[90mSystem prompt updated.\033[0m")
+            continue
+        # Add user message
+        messages.append({"role": "user", "content": user_input})
+        # Generate response
+        print("\033[94mAssistant: \033[0m", end="", flush=True)
+        try:
+            response = chat_generate(
+                model=model,
+                tokenizer=tokenizer,
+                messages=messages,
+                **gen_kwargs,
+            )
+            print(response)
+            messages.append({"role": "assistant", "content": response})
+        except Exception as e:
+            print(f"\033[91mError: {e}\033[0m")
+            messages.pop()  # Remove failed user message
+# ============================================================================
+# Main
+# ============================================================================
+def main():
+    parser = argparse.ArgumentParser(
+        description="Run inference with DiffusionQwen3 model",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    # Model arguments
+    parser.add_argument(
+        "--checkpoint", "-c",
+        type=str,
+        default="./outputs/pretrain/checkpoint-1000",
+        help="Path to model checkpoint directory",
+    )
+    parser.add_argument(
+        "--device",
+        type=str,
+        default="auto",
+        choices=["auto", "cuda", "cpu"],
+        help="Device to run on",
+    )
+    parser.add_argument(
+        "--dtype",
+        type=str,
+        default="bfloat16",
+        choices=["float32", "float16", "bfloat16"],
+        help="Model data type",
+    )
+    # Generation mode
+    parser.add_argument(
+        "--mode", "-m",
+        type=str,
+        default="prompt",
+        choices=["prompt", "chat"],
+        help="Generation mode: 'prompt' for single completion, 'chat' for interactive",
+    )
+    parser.add_argument(
+        "--prompt", "-p",
+        type=str,
+        default=None,
+        help="Input prompt for single completion mode",
+    )
+    parser.add_argument(
+        "--system",
+        type=str,
+        default="You are a helpful assistant.",
+        help="System prompt for chat mode",
+    )
+    # Generation parameters
+    parser.add_argument("--max-tokens", type=int, default=256, help="Max tokens to generate")
+    parser.add_argument("--steps", type=int, default=128, help="Diffusion steps")
+    parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature")
+    parser.add_argument("--top-p", type=float, default=None, help="Nucleus sampling threshold")
+    parser.add_argument("--top-k", type=int, default=None, help="Top-k sampling")
+    parser.add_argument(
+        "--alg",
+        type=str,
+        default="entropy",
+        choices=["origin", "entropy", "maskgit_plus", "topk_margin"],
+        help="Diffusion sampling algorithm",
+    )
+    parser.add_argument("--alg-temp", type=float, default=0.1, help="Algorithm temperature")
+    parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
+    args = parser.parse_args()
+    # Load model
+    model, tokenizer, config = load_model_and_tokenizer(
+        args.checkpoint,
+        device=args.device,
+        torch_dtype=args.dtype,
+    )
+    # Generation kwargs
+    gen_kwargs = {
+        "max_new_tokens": args.max_tokens,
+        "steps": args.steps,
+        "temperature": args.temperature,
+        "top_p": args.top_p,
+        "top_k": args.top_k,
+        "alg": args.alg,
+        "alg_temp": args.alg_temp,
+        "verbose": args.verbose,
+    }
+    if args.mode == "chat":
+        interactive_chat(model, tokenizer, system_prompt=args.system, **gen_kwargs)
+    else:
+        # Single prompt mode
+        if args.prompt is None:
+            # Default demo prompts
+            prompts = [
+                "def fibonacci(n):",
+                "Write a Python function to check if a number is prime:",
+                "# Calculate the factorial of a number\ndef factorial(n):",
+            ]
+            print("\nNo prompt provided. Running demo with sample prompts...\n")
+            for prompt in prompts:
+                print("=" * 60)
+                print(f"Prompt: {prompt}")
+                print("-" * 60)
+                result = generate(model, tokenizer, prompt, **gen_kwargs)
+                print(f"Generated:\n{result}")
+                print("=" * 60 + "\n")
+        else:
+            result = generate(model, tokenizer, args.prompt, **gen_kwargs)
+            print("\n" + "=" * 60)
+            print("Generated:")
+            print("=" * 60)
+            print(result)
+            print("=" * 60)
+if __name__ == "__main__":
+    main()
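
Review note on `--alg entropy` and `--steps`: the body of `generate()` is not part of this excerpt, so the following is only a minimal sketch of what one confidence-ordered unmasking step could look like, assuming the `entropy` algorithm ranks masked positions by the entropy of their predicted distribution and reveals the most certain ones first. `MASK`, `entropy`, and `unmask_step` are hypothetical names used for illustration, and plain Python lists stand in for tensors.

```python
import math

MASK = -1  # hypothetical sentinel standing in for the <|mask|> token ID


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def unmask_step(tokens, probs_per_pos, num_to_reveal):
    """Reveal the `num_to_reveal` masked positions with the lowest entropy.

    tokens: list of token IDs, MASK where a position is still hidden.
    probs_per_pos: one probability vector per position.
    """
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    masked.sort(key=lambda i: entropy(probs_per_pos[i]))  # most confident first
    out = list(tokens)
    for i in masked[:num_to_reveal]:
        p = probs_per_pos[i]
        out[i] = max(range(len(p)), key=p.__getitem__)  # greedy pick (temperature 0)
    return out


tokens = [5, MASK, MASK]
probs = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.05, 0.9, 0.05]]
print(unmask_step(tokens, probs, num_to_reveal=1))  # [5, -1, 1]
```

With `--steps 128` and `--max-tokens 256`, a schedule of this kind would reveal roughly two positions per step; the actual schedule in `inference.py` may differ.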

merges.txt ADDED Viewed

The diff for this file is too large to render. See raw diff

modeling_diffusion_qwen3.py ADDED Viewed

	@@ -0,0 +1,515 @@

+"""
+DiffusionQwen3 Model - Converts Qwen3-1.7B AR to Bidirectional Diffusion LLM
+This module provides:
+1. DiffusionQwen3Config - Configuration for diffusion-adapted Qwen3
+2. DiffusionQwen3Model - The main model class with diffusion training/inference
+Based on CoDA (Coding LM via Diffusion Adaptation) by Salesforce AI Research
+https://arxiv.org/abs/2510.03270
+CRITICAL: Loss normalization matches CoDA official implementation exactly:
+  loss = (dsigma[:, None] * loss).sum() / (batch_size * seq_len)
+NOT dividing by num_masked (which causes gradient explosion)
+"""
+import math
+from dataclasses import dataclass
+from typing import Optional, Tuple, Union, List, Dict, Any
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import PreTrainedModel, PretrainedConfig
+from transformers import Qwen2ForCausalLM, Qwen2Config, AutoTokenizer
+from transformers.modeling_outputs import CausalLMOutputWithPast
+class DiffusionQwen3Config(PretrainedConfig):
+    """Configuration for Diffusion-adapted Qwen3 model."""
+    model_type = "diffusion_qwen3"
+    def __init__(
+        self,
+        # Base Qwen3 config
+        vocab_size: int = 151936,
+        hidden_size: int = 2048,
+        intermediate_size: int = 6144,
+        num_hidden_layers: int = 28,
+        num_attention_heads: int = 16,
+        num_key_value_heads: int = 8,
+        head_dim: int = 128,
+        max_position_embeddings: int = 40960,
+        rms_norm_eps: float = 1e-6,
+        rope_theta: float = 1000000.0,
+        hidden_act: str = "silu",
+        attention_dropout: float = 0.0,
+        attention_bias: bool = False,
+        tie_word_embeddings: bool = True,
+        # Diffusion-specific config
+        mask_token_id: int = 151669,
+        pad_token_id: int = 151643,
+        bos_token_id: int = 151643,
+        eos_token_id: int = 151645,
+        # Diffusion training parameters
+        sampling_eps: float = 0.001,  # CoDA default: creates 1/t in [1, 1000]
+        mask_block_sizes: Optional[List[int]] = None,
+        block_masking_probability: float = 0.01,
+        prefix_probability: float = 0.01,
+        truncate_probability: float = 0.01,
+        **kwargs
+    ):
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs
+        )
+        # Base model config
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.max_position_embeddings = max_position_embeddings
+        self.rms_norm_eps = rms_norm_eps
+        self.rope_theta = rope_theta
+        self.hidden_act = hidden_act
+        self.attention_dropout = attention_dropout
+        self.attention_bias = attention_bias
+        # Diffusion config
+        self.mask_token_id = mask_token_id
+        self.sampling_eps = sampling_eps
+        self.mask_block_sizes = mask_block_sizes or [2, 4, 8]
+        self.block_masking_probability = block_masking_probability
+        self.prefix_probability = prefix_probability
+        self.truncate_probability = truncate_probability
+class DiffusionQwen3Model(PreTrainedModel):
+    """
+    Qwen3 model adapted for discrete diffusion language modeling.
+    Key modifications from standard Qwen3:
+    1. Bidirectional attention (is_causal=False)
+    2. Masked diffusion training objective
+    3. Loss weighted by 1/t (inverse noise level)
+    4. Support for progressive masking (S1/S2/S3)
+    CRITICAL: Loss normalization follows CoDA exactly (line 524 of modeling.py):
+      loss = (dsigma[:, None] * loss).sum() / (batch_size * seq_len)
+    """
+    config_class = DiffusionQwen3Config
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["Qwen2DecoderLayer"]
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+    def __init__(self, config: DiffusionQwen3Config):
+        super().__init__(config)
+        self.config = config
+        # Placeholders for the backbone, which is loaded through transformers' Qwen2 classes
+        # (transformers also ships separate Qwen3 classes, which add QK-norm).
+        # They are populated by _init_from_qwen(), called from from_pretrained_qwen().
+        self.model = None
+        self.lm_head = None
+        self.embed_tokens = None
+        # Diffusion parameters
+        self.mask_token_id = config.mask_token_id
+        self.sampling_eps = config.sampling_eps
+        # Loss function
+        self.loss_fn = nn.CrossEntropyLoss(reduction='none')
+    def _init_from_qwen(self, qwen_model: Qwen2ForCausalLM):
+        """Initialize from a pretrained Qwen model."""
+        # Extract the base model and lm_head
+        self.model = qwen_model.model
+        self.lm_head = qwen_model.lm_head
+        self.embed_tokens = self.model.embed_tokens
+        # Disable causal masking in all attention layers
+        self._disable_causal_masking()
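+    def _disable_causal_masking(self):
+        """
+        Disable causal attention masks for bidirectional attention.
+        Note: layers without an `is_causal` attribute are skipped silently, and depending
+        on the transformers version and attention backend a causal mask may also be built
+        at the model level, so it is worth verifying bidirectionality after conversion.
+        """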
+        for layer in self.model.layers:
+            if hasattr(layer.self_attn, 'is_causal'):
+                layer.self_attn.is_causal = False
+    def get_input_embeddings(self):
+        return self.embed_tokens
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+        self.model.embed_tokens = value
+    def get_output_embeddings(self):
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+    def get_embeds(self, input_ids: torch.LongTensor) -> torch.Tensor:
+        """Get token embeddings."""
+        return self.embed_tokens(input_ids)
+    def transition(
+        self,
+        x_0: torch.LongTensor,
+        sigma: torch.Tensor,
+        maskable_mask: torch.BoolTensor,
+        mask_block_size: int = 1,
+    ) -> torch.LongTensor:
+        """
+        Apply noise transition: mask tokens with probability sigma.
+        Args:
+            x_0: Original token IDs [batch_size, seq_len]
+            sigma: Noise level per sample [batch_size, 1] or [batch_size]
+            maskable_mask: Boolean mask of which positions can be masked [batch_size, seq_len]
+            mask_block_size: Size of contiguous blocks to mask (1 for individual tokens)
+        Returns:
+            x_t: Noisy token IDs with some tokens replaced by mask_token_id
+        """
+        if sigma.dim() == 1:
+            sigma = sigma.unsqueeze(-1)
+        if mask_block_size == 1:
+            # Standard per-token masking
+            move_indices = (torch.rand_like(x_0, dtype=torch.float) < sigma) & maskable_mask
+            x_t = torch.where(move_indices, self.mask_token_id, x_0)
+        else:
+            # Block masking
+            x_t = self._block_masking(x_0, sigma, maskable_mask, mask_block_size)
+        return x_t
+    def _block_masking(
+        self,
+        x_0: torch.LongTensor,
+        sigma: torch.Tensor,
+        maskable_mask: torch.BoolTensor,
+        mask_block_size: int,
+    ) -> torch.LongTensor:
+        """Apply block masking for contiguous spans."""
+        batch_size, seq_len = x_0.shape
+        if seq_len < mask_block_size:
+            return x_0
+        # Calculate number of possible block positions
+        num_windows = seq_len - mask_block_size + 1
+        # Create all possible block positions
+        window_starts = torch.arange(num_windows, device=x_0.device)
+        block_offsets = torch.arange(mask_block_size, device=x_0.device)
+        all_positions = window_starts.unsqueeze(1) + block_offsets.unsqueeze(0)
+        # Check which blocks are fully maskable
+        maskable_blocks = maskable_mask.unsqueeze(1).expand(-1, num_windows, -1)
+        maskable_blocks = maskable_blocks.gather(2, all_positions.unsqueeze(0).expand(batch_size, -1, -1))
+        fully_maskable = maskable_blocks.all(dim=2)
+        # Scale sigma for block masking (CoDA line 569)
+        effective_sigma = 1 - (1 - sigma) ** (1 / mask_block_size)
+        # Determine which blocks to mask
+        should_mask = (torch.rand(batch_size, num_windows, device=x_0.device) < effective_sigma) & fully_maskable
+        # Create final mask
+        position_indices = torch.arange(seq_len, device=x_0.device).unsqueeze(0).unsqueeze(0)
+        all_positions_expanded = all_positions.unsqueeze(0)
+        should_mask_expanded = should_mask.unsqueeze(2)
+        position_matches = (position_indices == all_positions_expanded.unsqueeze(3)).any(dim=2)
+        should_mask_positions = should_mask_expanded & position_matches
+        final_mask = should_mask_positions.any(dim=1)
+        return torch.where(final_mask, self.mask_token_id, x_0)
+    def forward(
+        self,
+        input_ids: torch.LongTensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        src_mask: Optional[torch.BoolTensor] = None,
+        training_mode: str = "pretrain",
+        masking_schedule: Optional[Dict[str, Any]] = None,
+        epoch: Optional[int] = None,
+        return_logits_only: bool = False,
+        **kwargs,
+    ) -> Union[Tuple[torch.Tensor, Optional[torch.Tensor]], CausalLMOutputWithPast]:
+        """
+        Forward pass with diffusion training.
+        Args:
+            input_ids: Input token IDs [batch_size, seq_len]
+            attention_mask: Attention mask [batch_size, seq_len]
+            labels: Unused; targets are taken from input_ids
+            src_mask: Source mask for SFT (True = prompt, False = response)
+            training_mode: "pretrain", "midtrain", or "sft" (not read here; SFT is inferred from src_mask)
+            masking_schedule: Optional override for masking probabilities
+            epoch: Current epoch for progressive masking (not read here)
+            return_logits_only: If True, skip diffusion training logic (used by trainer)
+        Returns:
+            Inference or return_logits_only: CausalLMOutputWithPast with logits
+                [batch_size, seq_len, vocab_size] and loss=None
+            Training: tuple (logits, loss), where loss is the diffusion loss
+        """
+        if not self.training or return_logits_only:
+            # Inference mode OR trainer is handling diffusion logic
+            hidden_states = self.model(
+                input_ids=input_ids,
+                attention_mask=attention_mask,
+            ).last_hidden_state
+            logits = self.lm_head(hidden_states)
+            return CausalLMOutputWithPast(logits=logits, loss=None)
+        # Training mode
+        batch_size, seq_len = input_ids.shape
+        # Get masking configuration
+        if masking_schedule is not None:
+            prefix_prob = masking_schedule.get("prefix_probability", 0)
+            truncate_prob = masking_schedule.get("truncate_probability", 0)
+            block_prob = masking_schedule.get("block_masking_probability", 0)
+            mask_block_sizes = masking_schedule.get("mask_block_sizes", self.config.mask_block_sizes)
+        else:
+            prefix_prob = self.config.prefix_probability
+            truncate_prob = self.config.truncate_probability
+            block_prob = self.config.block_masking_probability
+            mask_block_sizes = self.config.mask_block_sizes
+        # Create maskable_mask based on training mode
+        if src_mask is not None:
+            # SFT mode: only mask response tokens
+            maskable_mask = ~src_mask
+        else:
+            # Pre-training/mid-training: all tokens maskable
+            maskable_mask = torch.ones_like(input_ids, dtype=torch.bool)
+            # Apply S1: Unmaskable prefix
+            if prefix_prob > 0:
+                maskable_mask = self._apply_prefix_masking(
+                    input_ids, maskable_mask, prefix_prob
+                )
+            # Apply S2: Truncated suffix
+            if truncate_prob > 0:
+                input_ids, maskable_mask = self._apply_truncate_masking(
+                    input_ids, maskable_mask, truncate_prob
+                )
+        # Sample timesteps and compute sigma
+        # CoDA line 475: sigma = (1 - sampling_eps) * rand + sampling_eps
+        sampling_eps = self.config.sampling_eps
+        t = (1 - sampling_eps) * torch.rand(batch_size, device=input_ids.device) + sampling_eps
+        sigma = t
+        # CoDA line 476: dsigma = 1 / sigma (for loss weighting)
+        dsigma = torch.reciprocal(t)
+        # Select block masking size
+        if block_prob > 0 and mask_block_sizes and torch.rand(1).item() < block_prob:
+            mask_block_size = mask_block_sizes[torch.randint(len(mask_block_sizes), (1,)).item()]
+        else:
+            mask_block_size = 1
+        # Apply noise transition
+        noisy_input_ids = self.transition(
+            input_ids, sigma, maskable_mask, mask_block_size
+        )
+        # Track which positions are masked (for loss computation)
+        loss_mask = (noisy_input_ids == self.mask_token_id)
+        # Forward pass through model
+        hidden_states = self.model(
+            input_ids=noisy_input_ids,
+            attention_mask=attention_mask,
+        ).last_hidden_state
+        logits = self.lm_head(hidden_states)
+        logits = logits.float()
+        # =================================================================
+        # LOSS COMPUTATION - MATCHES CODA EXACTLY (modeling.py lines 509-524)
+        # =================================================================
+        # Shift for next-token prediction
+        # logits: [batch, seq_len-1, vocab_size]
+        # labels: [batch, seq_len-1]
+        shift_logits = logits[..., :-1, :].contiguous()
+        shift_labels = input_ids[..., 1:].contiguous()
+        shift_loss_mask = loss_mask[..., 1:].contiguous()
+        # Cross-entropy loss per token
+        loss = self.loss_fn(
+            shift_logits.view(-1, self.config.vocab_size),
+            shift_labels.view(-1)
+        ).view(batch_size, -1)
+        # Zero out loss for non-masked positions
+        loss = loss.masked_fill(~shift_loss_mask, 0)
+        # =================================================================
+        # CRITICAL: CoDA normalization (line 524)
+        # Divide by (batch_size * seq_len), NOT by num_masked!
+        # This gives stable gradients regardless of mask ratio
+        # =================================================================
+        # loss = (dsigma[:, None] * loss).sum() / (batch_size * seq_len)
+        loss = (dsigma.unsqueeze(-1) * loss).sum() / (batch_size * seq_len)
+        return logits, loss
+    def _apply_prefix_masking(
+        self,
+        input_ids: torch.LongTensor,
+        maskable_mask: torch.BoolTensor,
+        prefix_prob: float,
+    ) -> torch.BoolTensor:
+        """Apply S1: Random unmaskable prefix."""
+        batch_size, seq_len = input_ids.shape
+        # Randomly decide which samples get prefix
+        apply_prefix = torch.rand(batch_size, device=input_ids.device) < prefix_prob
+        # Generate random prefix lengths
+        prefix_lengths = torch.randint(1, seq_len, (batch_size,), device=input_ids.device)
+        # Create position indices
+        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
+        # Create prefix mask
+        prefix_mask = positions < prefix_lengths.unsqueeze(1)
+        # Apply: set maskable_mask to False for prefix positions
+        maskable_mask = maskable_mask & ~(apply_prefix.unsqueeze(1) & prefix_mask)
+        return maskable_mask
+    def _apply_truncate_masking(
+        self,
+        input_ids: torch.LongTensor,
+        maskable_mask: torch.BoolTensor,
+        truncate_prob: float,
+    ) -> Tuple[torch.LongTensor, torch.BoolTensor]:
+        """Apply S2: Random truncated suffix."""
+        batch_size, seq_len = input_ids.shape
+        # Randomly decide which samples get truncated
+        apply_truncate = torch.rand(batch_size, device=input_ids.device) < truncate_prob
+        # Generate random truncation positions
+        truncate_positions = torch.randint(1, seq_len, (batch_size,), device=input_ids.device)
+        # Create position indices
+        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
+        # Create truncate mask
+        truncate_mask = positions >= truncate_positions.unsqueeze(1)
+        # Apply: replace with pad token and update maskable_mask
+        input_ids = torch.where(
+            apply_truncate.unsqueeze(1) & truncate_mask,
+            self.config.pad_token_id,
+            input_ids
+        )
+        maskable_mask = maskable_mask & (input_ids != self.config.pad_token_id)
+        return input_ids, maskable_mask
+    @classmethod
+    def from_pretrained_qwen(
+        cls,
+        pretrained_model_name_or_path: str = "Qwen/Qwen3-1.7B",
+        config: Optional[DiffusionQwen3Config] = None,
+        **kwargs
+    ) -> "DiffusionQwen3Model":
+        """
+        Load from a pretrained Qwen3 model and convert to diffusion.
+        Args:
+            pretrained_model_name_or_path: HuggingFace model name or path
+            config: Optional DiffusionQwen3Config override
+            **kwargs: Additional arguments for from_pretrained
+        Returns:
+            DiffusionQwen3Model ready for diffusion training
+        """
+        # Load the base Qwen model
+        print(f"Loading base model from {pretrained_model_name_or_path}...")
+        qwen_model = Qwen2ForCausalLM.from_pretrained(
+            pretrained_model_name_or_path,
+            torch_dtype=kwargs.pop("torch_dtype", torch.bfloat16),
+            attn_implementation=kwargs.pop("attn_implementation", "flash_attention_2"),
+            **kwargs
+        )
+        # Create diffusion config if not provided
+        if config is None:
+            qwen_config = qwen_model.config
+            config = DiffusionQwen3Config(
+                vocab_size=qwen_config.vocab_size,
+                hidden_size=qwen_config.hidden_size,
+                intermediate_size=qwen_config.intermediate_size,
+                num_hidden_layers=qwen_config.num_hidden_layers,
+                num_attention_heads=qwen_config.num_attention_heads,
+                num_key_value_heads=qwen_config.num_key_value_heads,
+                max_position_embeddings=qwen_config.max_position_embeddings,
+                rms_norm_eps=qwen_config.rms_norm_eps,
+                rope_theta=qwen_config.rope_theta,
+            )
+        # Create diffusion model and initialize from Qwen
+        model = cls(config)
+        model._init_from_qwen(qwen_model)
+        print("Converted to DiffusionQwen3Model with bidirectional attention")
+        print(f"  - Mask token ID: {config.mask_token_id}")
+        print(f"  - Vocab size: {config.vocab_size}")
+        print(f"  - Hidden size: {config.hidden_size}")
+        print(f"  - Num layers: {config.num_hidden_layers}")
+        return model
+def prepare_tokenizer(tokenizer_name: str = "Qwen/Qwen3-1.7B") -> AutoTokenizer:
+    """
+    Prepare tokenizer with mask token for diffusion training.
+    Args:
+        tokenizer_name: HuggingFace tokenizer name
+    Returns:
+        Tokenizer with mask token added
+    """
+    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
+    # Check if mask token already exists
+    if tokenizer.mask_token is None:
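+        # Add mask token. CoDA uses ID 151669, but the ID assigned here depends on the base
+        # tokenizer (this repo's tokenizer_config.json maps <|mask|> to 151665), so keep
+        # config.mask_token_id in sync with tokenizer.mask_token_id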
+        tokenizer.add_tokens("<|mask|>", special_tokens=True)
+        tokenizer.add_special_tokens(
+            {"mask_token": "<|mask|>"},
+            replace_additional_special_tokens=False
+        )
+        print(f"Added mask token: {tokenizer.mask_token} (ID: {tokenizer.mask_token_id})")
+    else:
+        print(f"Mask token already exists: {tokenizer.mask_token} (ID: {tokenizer.mask_token_id})")
+    return tokenizer
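
Review note on two numerical details in `modeling_diffusion_qwen3.py`: the per-window probability used by `_block_masking` and the loss normalization in `forward`. The sketch below restates both with plain Python lists so the arithmetic can be checked by hand. The names `block_window_prob` and `weighted_loss` are hypothetical, and the sketch ignores the one-position shift that `forward` applies to logits and labels.

```python
def block_window_prob(sigma: float, block_size: int) -> float:
    """Per-window masking probability q = 1 - (1 - sigma) ** (1 / block_size).

    An interior token is covered by `block_size` overlapping windows, so it stays
    unmasked with probability (1 - q) ** block_size = 1 - sigma.
    """
    return 1.0 - (1.0 - sigma) ** (1.0 / block_size)


def weighted_loss(token_loss, is_masked, t):
    """1/t-weighted masked cross-entropy, normalized by batch_size * seq_len.

    token_loss: per-token cross-entropy values, nested lists [batch][seq_len].
    is_masked:  same shape, True where the input token was replaced by the mask.
    t:          noise level per sample, each in (0, 1].
    """
    batch_size, seq_len = len(token_loss), len(token_loss[0])
    total = 0.0
    for row_loss, row_mask, t_i in zip(token_loss, is_masked, t):
        # Only masked positions contribute; each sample is weighted by 1 / t.
        total += sum(l for l, m in zip(row_loss, row_mask) if m) / t_i
    # Denominator is the full token count, not the number of masked tokens.
    return total / (batch_size * seq_len)


q = block_window_prob(sigma=0.5, block_size=4)
print(round(1.0 - (1.0 - q) ** 4, 6))  # 0.5

loss = weighted_loss(
    token_loss=[[2.0, 4.0], [1.0, 3.0]],
    is_masked=[[True, False], [True, True]],
    t=[0.5, 1.0],
)
print(loss)  # 2.0
```

In the second example the weighted sum is 2.0 / 0.5 + (1.0 + 3.0) / 1.0 = 8.0, divided by 2 * 2 tokens. Dividing by the three masked tokens instead would give a value that varies with the mask ratio, which is the behavior the code comments warn against.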

pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:47e6306d5cb44f8ea9da0ab55d9f13b581cf8306205bb4c9cb71039ce923c4c3
+size 3086713515

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,38 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<|mask|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a59820ad3f728fff77cf7e4188532fc45e5f80cd0299cde28046bd2b51c64bdf
+size 11422081

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,216 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<|mask|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "mask_token": "<|mask|>",
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
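
Review note on the mask token ID: `tokenizer_config.json` above maps `<|mask|>` to 151665, while `DiffusionQwen3Config` defaults `mask_token_id` to 151669, and `config.json` is not shown in this excerpt. A small check along the following lines could confirm that the two agree. `find_token_id` is a hypothetical helper, and the embedded JSON is a trimmed excerpt of the structure above, not the full file.

```python
import json


def find_token_id(tokenizer_config: dict, content: str):
    """Return the integer ID whose added-token entry has the given content, or None."""
    for token_id, entry in tokenizer_config.get("added_tokens_decoder", {}).items():
        if entry.get("content") == content:
            return int(token_id)
    return None


# Trimmed excerpt mirroring tokenizer_config.json (assumption: same key layout).
excerpt = json.loads(
    '{"added_tokens_decoder": {'
    '"151664": {"content": "<|file_sep|>"}, '
    '"151665": {"content": "<|mask|>"}}, '
    '"mask_token": "<|mask|>"}'
)

mask_id = find_token_id(excerpt, excerpt["mask_token"])
print(mask_id)  # 151665
```

Comparing this value against `mask_token_id` in the shipped `config.json` before running inference would catch a mismatch early, since a wrong ID means the model never sees the token it was trained to denoise.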

vocab.json ADDED Viewed

The diff for this file is too large to render. See raw diff