McClain committed
Commit ee76406 · verified · 1 parent: b2e3372

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,74 +1,42 @@
 ---
-language:
-- en
-license: apache-2.0
 library_name: transformers
+license: apache-2.0
 tags:
-- biology
-- genomics
-- dna
-- plasmid
-- synthetic-biology
-- causal-lm
-- protein-engineering
-datasets:
-- custom
+- biology
+- genomics
+- plasmid
+- dna
+- causal-lm
+- synthetic-biology
+language:
+- en
 pipeline_tag: text-generation
-model-index:
-- name: PlasmidLM
-  results:
-  - task:
-      type: text-generation
-      name: Plasmid DNA Generation
-    metrics:
-    - name: Eval Loss
-      type: loss
-      value: 0.093
-    - name: Token Accuracy
-      type: accuracy
-      value: 0.961
 ---
 
 # PlasmidLM
 
-A 17M-parameter transformer language model for conditional generation of synthetic plasmid DNA sequences.
+A 17.7M parameter autoregressive language model for **plasmid DNA sequence generation**, trained on ~108K plasmid sequences from Addgene.
 
-## Model Description
+## Model Details
 
-PlasmidLM generates plasmid DNA sequences conditioned on functional component specifications. Given a prompt specifying desired elements (antibiotic resistance genes, origins of replication, promoters, reporters, etc.), it autoregressively generates a complete DNA sequence containing those elements.
-
-**Architecture**: LLaMA-style transformer decoder with RoPE, RMSNorm, and GELU activations.
-
-| Parameter | Value |
-|-----------|-------|
-| Parameters | 17M |
+| Property | Value |
+|---|---|
+| Parameters | 17.7M |
+| Architecture | Transformer decoder (dense MLP), LLaMA-style |
 | Hidden size | 384 |
 | Layers | 10 |
 | Attention heads | 8 |
-| Context length | 16,384 tokens |
-| Vocabulary | 120 tokens |
-
-The vocabulary consists of 5 DNA bases (A, T, C, G, N), control tokens (BOS, EOS, SEP, PAD, UNK), and ~100 categorical tokens representing functional plasmid components (e.g., `<AMR_KANAMYCIN>`, `<ORI_COLE1>`, `<PROM_T7>`).
+| Intermediate size | 1,536 |
+| Max sequence length | 16,384 tokens |
+| Tokenizer | Character-level (single DNA bases) |
+| Vocab size | 120 |
 
-## Training
-
-Pretrained with causal language modeling on ~108K plasmid sequences derived from the [Addgene](https://www.addgene.org/) repository, annotated with functional components via [pLannotate](https://github.com/barricklab/pLannotate).
+### Training
 
+- **Data**: ~108K plasmid sequences from Addgene, annotated with functional components via pLannotate
 - **Steps**: 15,000
-- **Epochs**: ~2.3
 - **Eval loss**: 0.093
 - **Token accuracy**: 96.1%
-- **Optimizer**: AdamW
-- **Precision**: bf16
-
-## Intended Use
-
-This is a **base pretrained model**. It has learned the statistical patterns of plasmid DNA sequences and their relationship to categorical component tokens. It can be used for:
-
-- **Direct generation**: Prompt with component tokens to generate plasmid sequences
-- **Fine-tuning**: Post-train with reinforcement learning (GRPO/PPO) to improve motif placement accuracy
-- **Embeddings**: Use hidden states as learned representations of plasmid sequences
-- **Research**: Study the learned structure of synthetic DNA
 
 ## Usage
 
@@ -78,14 +46,15 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
 tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidLM", trust_remote_code=True)
 
-# Generate a plasmid with kanamycin resistance and ColE1 origin
+# Condition on antibiotic resistance + origin of replication
 prompt = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>"
 inputs = tokenizer(prompt, return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.8, do_sample=True)
-sequence = tokenizer.decode(outputs[0], skip_special_tokens=False)
-print(sequence)
+outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.8, do_sample=True, top_p=0.95)
+print(tokenizer.decode(outputs[0].tolist()))
 ```
 
+The model generates plasmid DNA sequences conditioned on functional annotations (antibiotic resistance markers, origins of replication, promoters, reporters, etc.) provided as special tokens in the prompt.
+
 ## Input Format
 
 ```
@@ -94,31 +63,37 @@ print(sequence)
 
 The model generates DNA bases (A/T/C/G) after the `<SEP>` token until it produces `<EOS>` or hits the maximum length.
 
-## Component Categories
+## Special Tokens
 
-| Category | Examples | Count |
-|----------|----------|-------|
-| Antibiotic Resistance (AMR) | Kanamycin, Ampicillin, Chloramphenicol, ... | 11 |
-| Origin of Replication (ORI) | ColE1, F1, P15A, pSC101, SV40, ... | 7 |
-| Promoter (PROM) | CMV, T7, U6, EF1a, CAG, ... | 11 |
-| Reporter | EGFP, mCherry, YFP, NanoLuc, ... | 6 |
-| Vector Type (VEC) | Lentiviral, CRISPR, Bacterial, AAV, ... | 10 |
-| Other | Tags, elements, species, backbones | ~55 |
+| Token | Purpose |
+|---|---|
+| `<BOS>` | Beginning of sequence |
+| `<EOS>` | End of sequence |
+| `<SEP>` | Separator between prompt annotations and DNA sequence |
+| `<PAD>` | Padding |
+| `<AMR_*>` | Antibiotic resistance markers (e.g., `<AMR_KANAMYCIN>`, `<AMR_AMPICILLIN>`) |
+| `<ORI_*>` | Origins of replication (e.g., `<ORI_COLE1>`, `<ORI_P15A>`) |
+| `<PROM_*>` | Promoters (e.g., `<PROM_CMV>`, `<PROM_T7>`) |
+| `<REP_*>` | Reporters (e.g., `<REP_EGFP>`, `<REP_MCHERRY>`) |
+
+## Related Models
+
+- [McClain/PlasmidLM-kmer6](https://huggingface.co/McClain/PlasmidLM-kmer6) — kmer6 tokenizer, 19.3M params, dense
+- [McClain/PlasmidLM-kmer6-MoE](https://huggingface.co/McClain/PlasmidLM-kmer6-MoE) — kmer6 tokenizer, 78.3M total params, Mixture-of-Experts
 
 ## Limitations
 
-- This is a **pretrained base model** -- it learns sequence statistics but has not been optimized for motif placement accuracy. Post-training with RL significantly improves functional element fidelity.
-- Generated sequences are **not experimentally validated**. Always verify computationally (e.g., with pLannotate) and experimentally before synthesis.
-- The model was trained on Addgene plasmids, which are biased toward commonly deposited vectors (mammalian expression, bacterial cloning, CRISPR).
-- Maximum context of 16K tokens (~16 kbp), which covers most but not all plasmids.
+- This is a **pretrained base model** -- generated sequences are not optimized for functional element placement. Post-training with RL improves fidelity.
+- Generated sequences are **not experimentally validated**. Always verify computationally and experimentally before synthesis.
+- Trained on Addgene plasmids, which are biased toward commonly deposited vectors.
+- Maximum context of 16K tokens (~16 kbp).
 
 ## Citation
 
 ```bibtex
 @misc{thiel2026plasmidlm,
-  title={PlasmidLM: Language Models for Conditional Plasmid DNA Generation},
+  title={PlasmidLM: Language Models for Plasmid DNA Generation},
   author={Thiel, McClain},
-  year={2026},
-  url={https://huggingface.co/McClain/PlasmidLM}
+  year={2026}
 }
 ```
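The `decode` call in the updated usage snippet keeps special tokens, so the output string interleaves annotation tokens and bases. A minimal post-processing sketch in plain Python (the decoded string below is hypothetical, not real model output):

```python
# Hypothetical decoded output: annotation tokens, <SEP>, DNA payload, <EOS>
decoded = "<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>ATGCGTAACGT<EOS>"

def extract_dna(decoded: str) -> str:
    """Return only the base payload between <SEP> and <EOS> (if present)."""
    seq = decoded.split("<SEP>", 1)[1] if "<SEP>" in decoded else decoded
    seq = seq.split("<EOS>", 1)[0]
    # Keep only the five bases the vocabulary defines
    return "".join(c for c in seq if c in "ATCGN")

print(extract_dna(decoded))  # → ATGCGTAACGT
```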
config.json CHANGED
@@ -25,5 +25,8 @@
     "tokenization_plasmid_lm.PlasmidLMTokenizer",
     null
     ]
-  }
+  },
+  "use_moe": false,
+  "tie_word_embeddings": true,
+  "use_cache": false
 }
configuration_plasmid_lm.py CHANGED
@@ -18,6 +18,16 @@ class PlasmidLMConfig(PretrainedConfig):
         max_position_embeddings: int = 16384,
         rope_theta: float = 10000.0,
         tie_word_embeddings: bool = True,
+        # MoE
+        use_moe: bool = False,
+        num_experts: int = 6,
+        num_experts_per_tok: int = 2,
+        moe_intermediate_size: int | None = None,
+        aux_loss_coef: float = 0.01,
+        # Tokenizer metadata (informational, saved in checkpoint)
+        tokenizer_type: str = "char",
+        kmer_k: int | None = None,
+        kmer_stride: int | None = None,
         **kwargs,
     ):
         self.hidden_size = hidden_size
@@ -28,6 +38,16 @@ class PlasmidLMConfig(PretrainedConfig):
         self.rms_norm_eps = rms_norm_eps
         self.max_position_embeddings = max_position_embeddings
         self.rope_theta = rope_theta
+        # MoE
+        self.use_moe = use_moe
+        self.num_experts = num_experts
+        self.num_experts_per_tok = num_experts_per_tok
+        self.moe_intermediate_size = moe_intermediate_size or intermediate_size
+        self.aux_loss_coef = aux_loss_coef
+        # Tokenizer metadata
+        self.tokenizer_type = tokenizer_type
+        self.kmer_k = kmer_k
+        self.kmer_stride = kmer_stride
         super().__init__(
             vocab_size=vocab_size,
             tie_word_embeddings=tie_word_embeddings,
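The new `use_moe` / `num_experts` fields account for most of the size gap to the 78.3M MoE sibling model. A rough, hypothetical back-of-envelope sketch (assumes bias-free up/down projections at the published 384/1536 sizes and an MoE block in every layer; exact totals also include embeddings and attention):

```python
# Hypothetical parameter arithmetic for the MLP blocks only
hidden, intermediate, layers = 384, 1536, 10
num_experts = 6

dense_mlp = 2 * hidden * intermediate          # up_proj + down_proj, no bias
router = hidden * num_experts                  # bias-free routing Linear
moe_block = num_experts * dense_mlp + router   # all experts live in every layer

print(f"dense MLPs: {dense_mlp * layers:,}")   # roughly 11.8M of the 17.7M dense model
print(f"MoE blocks: {moe_block * layers:,}")   # roughly 70.8M, the bulk of the MoE variant
```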
modeling_plasmid_lm.py CHANGED
@@ -14,6 +14,7 @@ from transformers.generation import GenerationMixin
 from transformers.modeling_outputs import CausalLMOutputWithPast
 
 from .configuration_plasmid_lm import PlasmidLMConfig
+from .moe import PlasmidLMSparseMoE
 
 
 def _rope_freqs(dim: int, max_len: int, base: float) -> Tuple[torch.Tensor, torch.Tensor]:
@@ -48,6 +49,7 @@ class PlasmidLMAttention(nn.Module):
         rope_sin: torch.Tensor,
         past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
         position_offset: int = 0,
+        attention_mask: Optional[torch.Tensor] = None,
     ) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
         B, S, _ = hidden_states.shape
         q = self.q_proj(hidden_states).view(B, S, self.num_heads, self.head_dim).transpose(1, 2)
@@ -63,8 +65,11 @@
             v = torch.cat([past_key_value[1], v], dim=2)
         new_kv = (k, v)
 
-        use_causal = past_key_value is None
-        attn = F.scaled_dot_product_attention(q, k, v, is_causal=use_causal)
+        if attention_mask is not None:
+            attn = F.scaled_dot_product_attention(q, k, v, attn_mask=attention_mask)
+        else:
+            use_causal = past_key_value is None
+            attn = F.scaled_dot_product_attention(q, k, v, is_causal=use_causal)
         out = attn.transpose(1, 2).reshape(B, S, -1)
         return self.o_proj(out), new_kv
 
@@ -86,7 +91,11 @@ class PlasmidLMDecoderLayer(nn.Module):
         self.input_layernorm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
         self.self_attn = PlasmidLMAttention(config)
         self.post_attention_layernorm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
-        self.mlp = PlasmidLMMLP(config)
+        self.use_moe = config.use_moe
+        if self.use_moe:
+            self.moe = PlasmidLMSparseMoE(config)
+        else:
+            self.mlp = PlasmidLMMLP(config)
 
     def forward(
         self,
@@ -95,15 +104,21 @@
         rope_sin: torch.Tensor,
         past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
         position_offset: int = 0,
-    ) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
+        attention_mask: Optional[torch.Tensor] = None,
+    ) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor], torch.Tensor]:
         residual = hidden_states
         hidden_states = self.input_layernorm(hidden_states)
-        attn_out, new_kv = self.self_attn(hidden_states, rope_cos, rope_sin, past_key_value, position_offset)
+        attn_out, new_kv = self.self_attn(hidden_states, rope_cos, rope_sin, past_key_value, position_offset, attention_mask)
         hidden_states = residual + attn_out
 
         residual = hidden_states
-        hidden_states = residual + self.mlp(self.post_attention_layernorm(hidden_states))
-        return hidden_states, new_kv
+        if self.use_moe:
+            moe_out, aux_loss = self.moe(self.post_attention_layernorm(hidden_states))
+            hidden_states = residual + moe_out
+        else:
+            hidden_states = residual + self.mlp(self.post_attention_layernorm(hidden_states))
+            aux_loss = torch.tensor(0.0, device=hidden_states.device)
+        return hidden_states, new_kv, aux_loss
 
 
 class PlasmidLMPreTrainedModel(PreTrainedModel):
@@ -111,17 +126,6 @@ class PlasmidLMPreTrainedModel(PreTrainedModel):
     base_model_prefix = "model"
     supports_gradient_checkpointing = True
 
-    def _init_weights(self, module):
-        if isinstance(module, PlasmidLMModel):
-            # Recompute RoPE buffers — they are non-persistent so not saved in
-            # safetensors. from_pretrained's fast-init path zeros them out.
-            head_dim = self.config.hidden_size // self.config.num_attention_heads
-            cos, sin = _rope_freqs(
-                head_dim, self.config.max_position_embeddings, self.config.rope_theta
-            )
-            module.rope_cos = cos
-            module.rope_sin = sin
-
     def _set_gradient_checkpointing(self, module, value=False):
         if isinstance(module, PlasmidLMModel):
             module.gradient_checkpointing = value
@@ -136,48 +140,114 @@ class PlasmidLMModel(PlasmidLMPreTrainedModel):
         self.layers = nn.ModuleList([PlasmidLMDecoderLayer(config) for _ in range(config.num_hidden_layers)])
         self.norm = nn.RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
 
-        head_dim = config.hidden_size // config.num_attention_heads
-        cos, sin = _rope_freqs(head_dim, config.max_position_embeddings, config.rope_theta)
-        self.register_buffer("rope_cos", cos, persistent=False)
-        self.register_buffer("rope_sin", sin, persistent=False)
+        # Lazy RoPE: computed on first forward call to ensure correct device
+        # placement after from_pretrained (which uses meta device tensors).
+        self.register_buffer("rope_cos", None, persistent=False)
+        self.register_buffer("rope_sin", None, persistent=False)
 
         self.gradient_checkpointing = False
         self.post_init()
 
+    def _init_rope(self, device: torch.device) -> None:
+        """Compute and cache RoPE cos/sin on the given device."""
+        head_dim = self.config.hidden_size // self.config.num_attention_heads
+        cos, sin = _rope_freqs(head_dim, self.config.max_position_embeddings, self.config.rope_theta)
+        self.register_buffer("rope_cos", cos.to(device), persistent=False)
+        self.register_buffer("rope_sin", sin.to(device), persistent=False)
+
     def get_input_embeddings(self):
         return self.embed_tokens
 
     def set_input_embeddings(self, value):
         self.embed_tokens = value
 
+    def _build_4d_attention_mask(
+        self,
+        attention_mask: Optional[torch.Tensor],
+        seq_len: int,
+        past_seq_len: int,
+        device: torch.device,
+        dtype: torch.dtype,
+    ) -> Optional[torch.Tensor]:
+        """Build a 4D causal+padding mask for SDPA.
+
+        Returns (B, 1, S, S+past) float mask with 0 for attend and -inf for ignore,
+        or None if no masking is needed (no padding, no past KV).
+        """
+        if attention_mask is None and past_seq_len == 0:
+            # No padding and no cache — SDPA's is_causal=True handles this
+            return None
+
+        total_len = past_seq_len + seq_len
+        # Causal mask: each query position can attend to itself and all prior positions
+        causal = torch.triu(
+            torch.full((seq_len, total_len), float("-inf"), device=device, dtype=dtype),
+            diagonal=past_seq_len + 1,
+        )  # (S, S+past)
+        mask_4d = causal.unsqueeze(0).unsqueeze(0)  # (1, 1, S, S+past)
+
+        if attention_mask is not None:
+            # attention_mask is (B, total_len) with 1=attend, 0=ignore
+            # Use a large finite negative instead of -inf for padding mask.
+            # With left-padding, the first padding positions can only attend
+            # to other padding positions (causal blocks future). If we use
+            # -inf, ALL keys are blocked → softmax([-inf,...]) = NaN.
+            # Using min_dtype keeps at least the self-attention score finite,
+            # so softmax produces a valid (though meaningless) output.
+            min_dtype = torch.finfo(dtype).min
+            pad_mask = torch.where(
                attention_mask[:, None, None, :].bool(),
+                torch.zeros(1, device=device, dtype=dtype),
+                torch.tensor(min_dtype, device=device, dtype=dtype),
+            )  # (B, 1, 1, total_len)
+            mask_4d = mask_4d + pad_mask
+
+        return mask_4d
+
     def forward(
         self,
         input_ids: torch.Tensor,
         past_key_values: Optional[list] = None,
         position_offset: int = 0,
+        attention_mask: Optional[torch.Tensor] = None,
         **kwargs,
-    ) -> Tuple[torch.Tensor, list]:
+    ) -> Tuple[torch.Tensor, list, torch.Tensor]:
+        # Lazy RoPE init: compute on first forward for correct device placement
+        if self.rope_cos is None:
+            self._init_rope(input_ids.device)
+
         hidden_states = self.embed_tokens(input_ids)
+
+        past_seq_len = past_key_values[0][0].shape[2] if past_key_values else 0
+        mask_4d = self._build_4d_attention_mask(
+            attention_mask, input_ids.shape[1], past_seq_len,
+            input_ids.device, hidden_states.dtype,
+        )
+
         new_kv_caches = []
+        total_aux_loss = torch.tensor(0.0, device=input_ids.device)
         for i, layer in enumerate(self.layers):
             past_kv = past_key_values[i] if past_key_values else None
             if self.gradient_checkpointing and self.training:
                 # Gradient checkpointing recomputes activations on backward — no past_kv during training
                 def make_ckpt_fn(l):
                     def fn(h, cos, sin):
-                        out, kv = l(h, cos, sin, None, 0)
-                        return out, kv[0], kv[1]
+                        out, kv, aux = l(h, cos, sin, None, 0)
+                        return out, kv[0], kv[1], aux
                     return fn
-                hidden_states, k, v = torch.utils.checkpoint.checkpoint(
+                hidden_states, k, v, layer_aux = torch.utils.checkpoint.checkpoint(
                     make_ckpt_fn(layer), hidden_states, self.rope_cos, self.rope_sin,
                     use_reentrant=False,
                 )
                 new_kv = (k, v)
             else:
-                hidden_states, new_kv = layer(hidden_states, self.rope_cos, self.rope_sin, past_kv, position_offset)
+                hidden_states, new_kv, layer_aux = layer(
+                    hidden_states, self.rope_cos, self.rope_sin, past_kv, position_offset, mask_4d
+                )
             new_kv_caches.append(new_kv)
+            total_aux_loss = total_aux_loss + layer_aux
         hidden_states = self.norm(hidden_states)
-        return hidden_states, new_kv_caches
+        return hidden_states, new_kv_caches, total_aux_loss
 
 
 class PlasmidLMForCausalLM(PlasmidLMPreTrainedModel, GenerationMixin):
@@ -220,6 +290,7 @@ class PlasmidLMForCausalLM(PlasmidLMPreTrainedModel, GenerationMixin):
         return {
             "input_ids": input_ids,
             "past_key_values": past_key_values,
+            "attention_mask": attention_mask,
             "use_cache": True,
         }
 
@@ -257,7 +328,9 @@
         if kv_list is not None:
             position_offset = kv_list[0][0].shape[2]
 
-        hidden_states, new_kv_list = self.model(input_ids, kv_list, position_offset)
+        hidden_states, new_kv_list, aux_loss = self.model(
+            input_ids, kv_list, position_offset, attention_mask=attention_mask
+        )
         logits = self.lm_head(hidden_states)
 
         loss = None
@@ -269,6 +342,7 @@
                 shift_labels.view(-1),
                 ignore_index=-100,
             )
+            loss = loss + self.config.aux_loss_coef * aux_loss
 
         new_cache = None
         if use_cache:
@@ -289,8 +363,8 @@
         top_k: int = 50,
     ) -> torch.Tensor:
         """Simple autoregressive generation with KV cache."""
-        # Prefill
-        hidden_states, kv_caches = self.model(input_ids)
+        # Prefill (aux_loss ignored during generation)
+        hidden_states, kv_caches, _ = self.model(input_ids)
         logits = self.lm_head(hidden_states[:, -1:, :]).squeeze(1)
         cur_len = input_ids.shape[1]
 
@@ -305,7 +379,7 @@
             next_token = torch.multinomial(probs, 1)
             input_ids = torch.cat([input_ids, next_token], dim=1)
 
-            hidden_states, kv_caches = self.model(next_token, kv_caches, cur_len)
+            hidden_states, kv_caches, _ = self.model(next_token, kv_caches, cur_len)
             logits = self.lm_head(hidden_states).squeeze(1)
             cur_len += 1

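The NaN pitfall that `_build_4d_attention_mask` guards against (a fully masked, left-padded row) can be demonstrated without torch; a plain-Python softmax sketch (the `-1e9` constant stands in for `torch.finfo(dtype).min`):

```python
import math

def softmax(row):
    """Max-subtracted softmax, as real attention kernels compute it."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

NEG = -1e9  # large finite negative, standing in for torch.finfo(dtype).min

# All keys masked with -inf: -inf - (-inf) is NaN, so the whole row is NaN
nan_row = softmax([float("-inf")] * 3)
# All keys masked with a finite minimum: valid (if meaningless) uniform weights
ok_row = softmax([NEG] * 3)

print(any(math.isnan(x) for x in nan_row))  # → True
print(ok_row[0])                            # → 0.3333333333333333
```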
moe.py ADDED
@@ -0,0 +1,99 @@
+"""Mixture of Experts (Mixtral-style) for PlasmidLM."""
+
+from __future__ import annotations
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+from .configuration_plasmid_lm import PlasmidLMConfig
+
+
+class PlasmidLMExpertMLP(nn.Module):
+    """Single expert MLP — same architecture as PlasmidLMMLP."""
+
+    def __init__(self, hidden_size: int, intermediate_size: int):
+        super().__init__()
+        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
+        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
+        self.act = nn.GELU()
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.down_proj(self.act(self.up_proj(x)))
+
+
+class PlasmidLMSparseMoE(nn.Module):
+    """Sparse Mixture of Experts with top-k routing and load balancing loss.
+
+    Implements Mixtral-style routing: softmax over all experts, then select
+    top-k. The output is a weighted sum of the selected experts' outputs.
+
+    Load balancing auxiliary loss: L_aux = N * sum(f_i * P_i) where
+    f_i = fraction of tokens routed to expert i, P_i = mean routing
+    probability for expert i.
+    """
+
+    def __init__(self, config: PlasmidLMConfig):
+        super().__init__()
+        self.num_experts = config.num_experts
+        self.top_k = config.num_experts_per_tok
+        intermediate = config.moe_intermediate_size
+
+        self.router = nn.Linear(config.hidden_size, self.num_experts, bias=False)
+        self.experts = nn.ModuleList(
+            [PlasmidLMExpertMLP(config.hidden_size, intermediate) for _ in range(self.num_experts)]
+        )
+
+    def forward(self, hidden_states: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
+        """
+        Args:
+            hidden_states: (batch, seq_len, hidden_size)
+
+        Returns:
+            output: (batch, seq_len, hidden_size)
+            aux_loss: scalar load balancing loss
+        """
+        batch, seq_len, hidden = hidden_states.shape
+        flat = hidden_states.view(-1, hidden)  # (B*S, H)
+        num_tokens = flat.shape[0]
+
+        # Router: compute probabilities over experts
+        router_logits = self.router(flat)  # (B*S, num_experts)
+        router_probs = F.softmax(router_logits, dim=-1)
+
+        # Top-k selection
+        top_weights, top_indices = torch.topk(router_probs, self.top_k, dim=-1)  # (B*S, top_k)
+        # Normalize selected weights to sum to 1
+        top_weights = top_weights / top_weights.sum(dim=-1, keepdim=True)
+
+        # Dispatch: loop over experts, gather tokens, compute, scatter back
+        output = torch.zeros_like(flat)
+        for i, expert in enumerate(self.experts):
+            # Mask for tokens where expert i is in the top-k
+            mask = (top_indices == i).any(dim=-1)  # (B*S,)
+            if not mask.any():
+                continue
+            expert_input = flat[mask]  # (n_tokens, H)
+            expert_output = expert(expert_input)  # (n_tokens, H)
+            # Weight for this expert for selected tokens
+            # Find which top-k slot(s) matched and get corresponding weight
+            match_positions = (top_indices[mask] == i)  # (n_tokens, top_k)
+            weights = (top_weights[mask] * match_positions.float()).sum(dim=-1, keepdim=True)  # (n_tokens, 1)
+            output[mask] += weights * expert_output
+
+        output = output.view(batch, seq_len, hidden)
+
+        # Load balancing auxiliary loss
+        # f_i: fraction of tokens dispatched to expert i
+        # P_i: mean routing probability assigned to expert i
+        with torch.no_grad():
+            # Count tokens per expert (based on top-k assignments)
+            expert_counts = torch.zeros(self.num_experts, device=flat.device)
+            for k in range(self.top_k):
+                expert_counts.scatter_add_(0, top_indices[:, k], torch.ones(num_tokens, device=flat.device))
+            f = expert_counts / (num_tokens * self.top_k)  # fraction
+
+        P = router_probs.mean(dim=0)  # (num_experts,)
+        aux_loss = self.num_experts * (f * P).sum()
+
+        return output, aux_loss
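The load-balancing formula in the docstring (`L_aux = N * sum(f_i * P_i)`) can be sanity-checked without torch. A pure-Python sketch with toy routing probabilities (not the module above): a uniform router scores exactly 1.0, and routing collapse onto one expert scores higher.

```python
def route(probs_per_token, num_experts, top_k):
    """Top-k routing stats plus the Mixtral-style load-balancing loss."""
    counts = [0] * num_experts
    mean_prob = [0.0] * num_experts  # P_i: mean routing probability per expert
    n = len(probs_per_token)
    for probs in probs_per_token:
        for i, p in enumerate(probs):
            mean_prob[i] += p / n
        # Indices of the top-k experts for this token (stable sort breaks ties)
        top = sorted(range(num_experts), key=lambda i: -probs[i])[:top_k]
        for i in top:
            counts[i] += 1
    # f_i: fraction of (token, slot) assignments sent to expert i
    f = [c / (n * top_k) for c in counts]
    aux = num_experts * sum(fi * pi for fi, pi in zip(f, mean_prob))
    return aux

aux_uniform = route([[0.25, 0.25, 0.25, 0.25]] * 8, num_experts=4, top_k=2)
print(round(aux_uniform, 6))  # → 1.0  (uniform P_i makes the loss exactly 1)

aux_collapsed = route([[0.97, 0.01, 0.01, 0.01]] * 8, num_experts=4, top_k=2)
print(aux_collapsed > 1.0)  # → True  (collapse onto expert 0 is penalized)
```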
tokenization_plasmid_lm.py CHANGED
@@ -62,6 +62,22 @@ class PlasmidLMTokenizer(PreTrainedTokenizer):
     def vocab_size(self) -> int:
         return len(self._vocab)
 
+    @property
+    def pad_token_id(self) -> int:
+        return self._vocab.get("<PAD>", 0)
+
+    @property
+    def bos_token_id(self) -> int:
+        return self._vocab.get("<BOS>", 1)
+
+    @property
+    def eos_token_id(self) -> int:
+        return self._vocab.get("<EOS>", 2)
+
+    @property
+    def sep_token_id(self) -> int:
+        return self._vocab.get("<SEP>", 3)
+
     def get_vocab(self) -> dict:
         return dict(self._vocab)
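The tokenizer is character-level over bases plus multi-character `<...>` special tokens. A standalone sketch of how such a prompt splits into tokens and ids (hypothetical mini-vocab, not the shipped `PlasmidLMTokenizer`; the real vocab has ~120 entries):

```python
import re

# Hypothetical mini-vocab; ids 0-3 mirror the property defaults above
vocab = {"<PAD>": 0, "<BOS>": 1, "<EOS>": 2, "<SEP>": 3,
         "<AMR_KANAMYCIN>": 4, "<ORI_COLE1>": 5,
         "A": 6, "T": 7, "C": 8, "G": 9, "N": 10}

def tokenize(text: str) -> list[str]:
    """Split into special tokens (<...>) and single characters."""
    return re.findall(r"<[^>]+>|.", text)

toks = tokenize("<BOS><AMR_KANAMYCIN><ORI_COLE1><SEP>ATCG")
ids = [vocab[t] for t in toks]
print(toks)  # → ['<BOS>', '<AMR_KANAMYCIN>', '<ORI_COLE1>', '<SEP>', 'A', 'T', 'C', 'G']
print(ids)   # → [1, 4, 5, 3, 6, 7, 8, 9]
```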