mineself2016 committed
Commit e3faf24 · verified · 1 Parent(s): 0bcc4ad

Add preprocess docs, examples, tokenizer mapping assets

README.md ADDED
@@ -0,0 +1,90 @@
1
+ ---
2
+ library_name: transformers
3
+ tags:
4
+ - genomics
5
+ - single-cell
6
+ - mamba
7
+ - biology
8
+ pipeline_tag: feature-extraction
9
+ ---
10
+
11
+ # GeneMamba2-24l-512d
12
+
13
+ This repository contains a GeneMamba checkpoint plus full usage assets:
14
+ - model weights (`model.safetensors`)
15
+ - custom modeling/config files for `trust_remote_code=True`
16
+ - preprocessing example from `h5ad` to `input_ids`
17
+ - tokenizer assets and id mapping files
18
+
19
+ ## 1) Input format (very important)
20
+
21
+ GeneMamba's input is a list of **ranked gene token IDs** per cell:
22
+ 1. Start from a single cell's expression vector
23
+ 2. Keep genes with expression > 0
24
+ 3. Sort the remaining genes by expression, descending
25
+ 4. Convert each gene ID (Ensembl, e.g. `ENSG00000000003`) to its token ID
26
+ 5. Use the resulting list as `input_ids`
27
+
28
+ Each sample is one list of integers:
29
+
30
+ ```python
31
+ {"input_ids": [145, 2088, 531, 91, ...]}
32
+ ```
33
+
34
+ For batched input, the shape is typically `(batch_size, seq_len)` after padding/truncation; a minimal padding sketch follows below.
35
+
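+ A minimal padding sketch (illustrative only, not part of the released preprocessing code; `pad_id = 1` and `max_len = 2048` follow the `[PAD]` id and `max_position_embeddings` documented in this repo):
+
+ ```python
+ import torch
+
+ def pad_batch(list_of_input_ids, pad_id=1, max_len=2048):
+     # Truncate each ranked token list, then right-pad to a common length.
+     seqs = [ids[:max_len] for ids in list_of_input_ids]
+     width = max(len(s) for s in seqs)
+     input_ids = torch.full((len(seqs), width), pad_id, dtype=torch.long)
+     attention_mask = torch.zeros((len(seqs), width), dtype=torch.long)
+     for i, s in enumerate(seqs):
+         input_ids[i, :len(s)] = torch.tensor(s, dtype=torch.long)
+         attention_mask[i, :len(s)] = 1
+     return input_ids, attention_mask
+ ```
+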
36
+ ## 2) Where the tokenizer and ID mappings come from
37
+
38
+ - Main tokenizer used for model inference: `tokenizer.json`
39
+ - Original full tokenizer table: `tokenizer_assets/gene_tokenizer.json`
40
+ - Gene symbol -> token id mapping: `tokenizer_assets/symbol2id.pkl`
41
+ - Token id -> gene symbol mapping: `tokenizer_assets/id2symbol.pkl`
42
+
43
+ Special tokens:
44
+ - `[UNK]` = 0
45
+ - `[PAD]` = 1
46
+
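+ A short sketch of reading these assets (illustrative; it assumes `tokenizer.json` keeps its vocabulary under `model.vocab`, which is how the preprocessing script in this repo reads it):
+
+ ```python
+ import json
+ import pickle
+
+ with open("tokenizer.json") as f:
+     vocab = json.load(f)["model"]["vocab"]         # gene token -> token ID
+
+ with open("tokenizer_assets/symbol2id.pkl", "rb") as f:
+     symbol2id = pickle.load(f)                      # gene symbol -> token ID
+ with open("tokenizer_assets/id2symbol.pkl", "rb") as f:
+     id2symbol = pickle.load(f)                      # token ID -> gene symbol
+
+ print(vocab.get("[UNK]"), vocab.get("[PAD]"))       # expected: 0 1
+ ```
+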
47
+ ## 3) Preprocess your data
48
+
49
+ See script:
50
+ - `examples/0_preprocess_to_input_ids.py`
51
+
52
+ Example:
53
+
54
+ ```bash
55
+ python examples/0_preprocess_to_input_ids.py \
56
+ --h5ad /path/to/your_data.h5ad \
57
+ --tokenizer_json tokenizer.json \
58
+ --output_arrow ./my_data/sorted_gene_token_ids.arrow
59
+ ```
60
+
61
+ The output Arrow file has a single column, `input_ids`; the snippet below shows one way to read it back.
62
+
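+ Reading it back (a sketch; the preprocessing script writes the file with `pyarrow.ipc.new_stream`, so it is opened as an Arrow IPC stream):
+
+ ```python
+ import pyarrow as pa
+
+ with pa.OSFile("./my_data/sorted_gene_token_ids.arrow", "rb") as source:
+     table = pa.ipc.open_stream(source).read_all()
+
+ input_ids = table.column("input_ids").to_pylist()  # one ranked token-ID list per cell
+ print(len(input_ids), input_ids[0][:10])
+ ```
+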
63
+ ## 4) Load model and extract embedding
64
+
65
+ ```python
66
+ from transformers import AutoModel, AutoTokenizer
67
+
68
+ model = AutoModel.from_pretrained(
69
+ "mineself2016/GeneMamba2-24l-512d",
70
+ trust_remote_code=True
71
+ )
72
+
73
+ tokenizer = AutoTokenizer.from_pretrained(
74
+ "mineself2016/GeneMamba2-24l-512d",
75
+ trust_remote_code=True
76
+ )
77
+ ```
78
+
79
+ For a more complete example (a minimal inline sketch also follows below), see:
80
+ - `examples/1_extract_embeddings.py`
81
+
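+ A minimal inline inference sketch (illustrative; the token IDs below are arbitrary and of equal length, so no padding or attention mask is needed):
+
+ ```python
+ import torch
+ from transformers import AutoModel
+
+ model = AutoModel.from_pretrained(
+     "mineself2016/GeneMamba2-24l-512d", trust_remote_code=True
+ )
+ model.eval()
+
+ # Two toy cells, each a list of ranked gene token IDs.
+ batch = torch.tensor([
+     [145, 2088, 531, 91],
+     [77, 12, 9083, 4],
+ ])
+
+ with torch.no_grad():
+     out = model(batch)
+
+ cell_embeddings = out.pooled_embedding  # shape: (2, 512)
+ print(cell_embeddings.shape)
+ ```
+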
82
+ ## 5) Source of preprocessing logic
83
+
84
+ The preprocessing/tokenization pipeline is aligned with assets from:
85
+ - `/project/zhiwei/cq5/PythonWorkSpace/gene_mamba`
86
+
87
+ Key references used:
88
+ - tokenizer: `gene_tokenizer.json`
89
+ - mappings: `symbol2id.pkl`, `id2symbol.pkl`
90
+ - dataset build logic (Arrow + `input_ids`): `utils.py` (`build_dataset`)
config.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "model_type": "genemamba",
3
+ "architectures": [
4
+ "GeneMambaModel"
5
+ ],
6
+ "vocab_size": 25426,
7
+ "max_position_embeddings": 2048,
8
+ "hidden_size": 512,
9
+ "num_hidden_layers": 24,
10
+ "intermediate_size": 2048,
11
+ "hidden_dropout_prob": 0.1,
12
+ "initializer_range": 0.02,
13
+ "mamba_mode": "gate",
14
+ "embedding_pooling": "mean",
15
+ "num_labels": 2,
16
+ "pad_token_id": 1,
17
+ "eos_token_id": 2,
18
+ "bos_token_id": 0,
19
+ "use_cache": true,
20
+ "torch_dtype": "float32",
21
+ "transformers_version": "4.40.2"
22
+ }
configuration_genemamba.py ADDED
@@ -0,0 +1,97 @@
1
+ """
2
+ Configuration for GeneMamba model.
3
+ Defines all hyperparameters and settings for the GeneMamba architecture.
4
+ """
5
+
6
+ from transformers import PretrainedConfig
7
+ from typing import Optional
8
+
9
+
10
+ class GeneMambaConfig(PretrainedConfig):
11
+ """
12
+ Configuration class for GeneMamba model.
13
+
14
+ This class stores the configuration of a GeneMamba model, inheriting from PretrainedConfig.
15
+ It can be used to instantiate models from pretrained checkpoints or customize model initialization.
16
+
17
+ Args:
18
+ vocab_size (int, optional, defaults to 25426):
19
+ Vocabulary size of the model. Number of gene tokens (Ensembl Gene IDs).
20
+
21
+ hidden_size (int, optional, defaults to 512):
22
+ Dimensionality of the hidden/embedding layers (d_model in Mamba).
23
+
24
+ num_hidden_layers (int, optional, defaults to 24):
25
+ Number of Mamba layers (mamba_layer).
26
+
27
+ intermediate_size (int, optional, defaults to 2048):
28
+ Dimensionality of intermediate representations in MLP.
29
+
30
+ max_position_embeddings (int, optional, defaults to 2048):
31
+ Maximum sequence length (seq_len).
32
+
33
+ hidden_dropout_prob (float, optional, defaults to 0.1):
34
+ Dropout probability for hidden states.
35
+
36
+ initializer_range (float, optional, defaults to 0.02):
37
+ Standard deviation of truncated normal initializer.
38
+
39
+ mamba_mode (str, optional, defaults to "gate"):
40
+ Aggregation mode for bidirectional Mamba layers.
41
+ Options: "mean", "sum", "concat", "gate".
42
+
43
+ embedding_pooling (str, optional, defaults to "mean"):
44
+ Method for pooling to get cell embedding.
45
+ Options: "CLS", "mean", "weighted".
46
+
47
+ num_labels (int, optional, defaults to 2):
48
+ Number of labels for sequence classification tasks.
49
+
50
+ pad_token_id (int, optional, defaults to 1):
51
+ Token ID for padding.
52
+
53
+ bos_token_id (int, optional, defaults to None):
54
+ Token ID for beginning of sequence.
55
+
56
+ eos_token_id (int, optional, defaults to None):
57
+ Token ID for end of sequence.
58
+ """
59
+
60
+ model_type = "genemamba"
61
+ attribute_map = {
62
+ "hidden_size": "hidden_size",
63
+ "num_hidden_layers": "num_hidden_layers",
64
+ }
65
+
66
+ def __init__(
67
+ self,
68
+ vocab_size: int = 25426,
69
+ hidden_size: int = 512,
70
+ num_hidden_layers: int = 24,
71
+ intermediate_size: int = 2048,
72
+ max_position_embeddings: int = 2048,
73
+ hidden_dropout_prob: float = 0.1,
74
+ initializer_range: float = 0.02,
75
+ mamba_mode: str = "gate",
76
+ embedding_pooling: str = "mean",
77
+ num_labels: int = 2,
78
+ pad_token_id: int = 1,
79
+ bos_token_id: Optional[int] = None,
80
+ eos_token_id: Optional[int] = None,
81
+ **kwargs
82
+ ):
83
+ super().__init__(pad_token_id=pad_token_id, **kwargs)
84
+
85
+ self.vocab_size = vocab_size
86
+ self.hidden_size = hidden_size
87
+ self.num_hidden_layers = num_hidden_layers
88
+ self.intermediate_size = intermediate_size
89
+ self.max_position_embeddings = max_position_embeddings
90
+ self.hidden_dropout_prob = hidden_dropout_prob
91
+ self.initializer_range = initializer_range
92
+ self.mamba_mode = mamba_mode
93
+ self.embedding_pooling = embedding_pooling
94
+ self.num_labels = num_labels
95
+ self.pad_token_id = pad_token_id
96
+ self.bos_token_id = bos_token_id
97
+ self.eos_token_id = eos_token_id
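+
+
+ if __name__ == "__main__":
+     # Quick sanity check (sketch): build a config matching the released
+     # GeneMamba2-24l-512d checkpoint; the values mirror config.json in this repo.
+     cfg = GeneMambaConfig(vocab_size=25426, hidden_size=512, num_hidden_layers=24)
+     print(cfg)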
examples/0_preprocess_to_input_ids.py ADDED
@@ -0,0 +1,75 @@
1
+ import argparse
2
+ import json
3
+ from pathlib import Path
4
+
5
+ import numpy as np
6
+ import pandas as pd
7
+ import scanpy as sc
8
+ import pyarrow as pa
9
+
10
+
11
+ def load_vocab(tokenizer_json_path: str):
12
+ with open(tokenizer_json_path, "r") as f:
13
+ tokenizer = json.load(f)
14
+ vocab = tokenizer["model"]["vocab"]
15
+ pad_id = vocab.get("[PAD]", 1)
16
+ unk_id = vocab.get("[UNK]", 0)
17
+ return vocab, pad_id, unk_id
18
+
19
+
20
+ def ranked_gene_ids_for_cell(expr_values, gene_names, vocab):
21
+ nonzero_idx = np.where(expr_values > 0)[0]
22
+ if len(nonzero_idx) == 0:
23
+ return []
24
+
25
+ genes = np.array(gene_names)[nonzero_idx]
26
+ values = expr_values[nonzero_idx]
27
+
28
+ order = np.argsort(-values)
29
+ ranked_genes = genes[order]
30
+
31
+ token_ids = [vocab[g] for g in ranked_genes if g in vocab]
32
+ return token_ids
33
+
34
+
35
+ def main():
36
+ parser = argparse.ArgumentParser(description="Convert h5ad to GeneMamba input_ids (Arrow)")
37
+ parser.add_argument("--h5ad", required=True, help="Input h5ad file")
38
+ parser.add_argument("--tokenizer_json", required=True, help="Path to tokenizer.json or gene_tokenizer.json")
39
+ parser.add_argument("--output_arrow", required=True, help="Output arrow file path")
40
+ parser.add_argument("--max_cells", type=int, default=None, help="Optional: process first N cells only")
41
+ args = parser.parse_args()
42
+
43
+ adata = sc.read_h5ad(args.h5ad)
44
+ vocab, _, _ = load_vocab(args.tokenizer_json)
45
+
46
+ gene_names = list(adata.var_names)
47
+ n_cells = adata.n_obs if args.max_cells is None else min(args.max_cells, adata.n_obs)
48
+
49
+ rows = []
50
+ X = adata.X
51
+
52
+ for i in range(n_cells):
53
+ row = X[i]
54
+ if hasattr(row, "toarray"):
55
+ expr = row.toarray().ravel()
56
+ else:
57
+ expr = np.asarray(row).ravel()
58
+
59
+ token_ids = ranked_gene_ids_for_cell(expr, gene_names, vocab)
60
+ rows.append(token_ids)
61
+
62
+ df = pd.DataFrame({"input_ids": rows})
63
+ table = pa.Table.from_pandas(df)
64
+
65
+ output_path = Path(args.output_arrow)
66
+ output_path.parent.mkdir(parents=True, exist_ok=True)
67
+ with pa.OSFile(str(output_path), "wb") as sink:
68
+ with pa.ipc.new_stream(sink, table.schema) as writer:
69
+ writer.write_table(table)
70
+
71
+ print(f"Saved {len(rows)} cells to {output_path}")
72
+
73
+
74
+ if __name__ == "__main__":
75
+ main()
examples/1_extract_embeddings.py ADDED
@@ -0,0 +1,150 @@
1
+ """
2
+ Phase 1: Extract Cell Embeddings
3
+ Demonstrates how to load GeneMamba and extract cell embeddings for downstream analysis.
4
+
5
+ Usage:
6
+ python examples/1_extract_embeddings.py
7
+ """
8
+
9
+ import torch
10
+ import numpy as np
11
+ from transformers import AutoTokenizer, AutoModel
12
+
13
+
14
+ def main():
15
+ print("=" * 80)
16
+ print("GeneMamba Phase 1: Extract Cell Embeddings")
17
+ print("=" * 80)
18
+
19
+ # ============================================================
20
+ # Step 1: Load pretrained model and tokenizer
21
+ # ============================================================
22
+ print("\n[Step 1] Loading model and tokenizer...")
23
+
24
+ # For this example, we first try a local model path.
26
+ # To load from the Hugging Face Hub instead, use: "mineself2016/GeneMamba2-24l-512d"
27
+ model_name = "GeneMamba-24l-512d" # Change to the Hub path above to load remotely
27
+
28
+ try:
29
+ tokenizer = AutoTokenizer.from_pretrained(
30
+ model_name,
31
+ trust_remote_code=True,
32
+ local_files_only=True # Try local first
33
+ )
34
+ model = AutoModel.from_pretrained(
35
+ model_name,
36
+ trust_remote_code=True,
37
+ local_files_only=True
38
+ )
39
+ except Exception as e:
40
+ print(f"Note: Could not load from '{model_name}': {e}")
41
+ print("Using mock data for demonstration...")
42
+
43
+ # For demonstration without actual checkpoint
44
+ from configuration_genemamba import GeneMambaConfig
45
+ from modeling_genemamba import GeneMambaModel
46
+
47
+ config = GeneMambaConfig(
48
+ vocab_size=25426,
49
+ hidden_size=512,
50
+ num_hidden_layers=24,
51
+ embedding_pooling="mean",
52
+ )
53
+ model = GeneMambaModel(config)
54
+ tokenizer = None
55
+
56
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
57
+ model = model.to(device)
58
+ model.eval()
59
+
60
+ print(f"✓ Model loaded on device: {device}")
61
+ print(f"✓ Model config: hidden_size={model.config.hidden_size}, "
62
+ f"num_layers={model.config.num_hidden_layers}")
63
+
64
+ # ============================================================
65
+ # Step 2: Prepare simulated single-cell data
66
+ # ============================================================
67
+ print("\n[Step 2] Preparing sample data...")
68
+
69
+ batch_size = 8
70
+ seq_len = 2048
71
+ vocab_size = 25426
72
+
73
+ # Simulate ranked gene sequences
74
+ # In practice, this would come from your scRNA-seq data
75
+ # Genes should be ranked by expression (highest first)
76
+ input_ids = torch.randint(2, vocab_size, (batch_size, seq_len)).to(device)
77
+
78
+ print(f"✓ Created sample input:")
79
+ print(f" - Batch size: {batch_size}")
80
+ print(f" - Sequence length: {seq_len}")
81
+ print(f" - Input shape: {input_ids.shape}")
82
+
83
+ # ============================================================
84
+ # Step 3: Inference - Extract embeddings
85
+ # ============================================================
86
+ print("\n[Step 3] Extracting cell embeddings...")
87
+
88
+ with torch.no_grad():
89
+ outputs = model(input_ids, output_hidden_states=False)
90
+
91
+ # Get the pooled embedding (cell representation)
92
+ cell_embeddings = outputs.pooled_embedding
93
+
94
+ print(f"✓ Extraction complete!")
95
+ print(f" - Cell embeddings shape: {cell_embeddings.shape}")
96
+ print(f" - Pooling method used: {outputs.embedding_pooling}")
97
+ print(f" - Embedding type: {cell_embeddings.dtype}")
98
+
99
+ # ============================================================
100
+ # Step 4: Example downstream analyses
101
+ # ============================================================
102
+ print("\n[Step 4] Example downstream uses...")
103
+
104
+ # Example 1: Clustering (KMeans)
105
+ from sklearn.cluster import KMeans
106
+ n_clusters = 3
107
+ kmeans = KMeans(n_clusters=n_clusters, n_init=10)
108
+ clusters = kmeans.fit_predict(cell_embeddings.cpu().numpy())
109
+ print(f"✓ Clustering: Assigned {len(np.unique(clusters))} clusters")
110
+
111
+ # Example 2: Dimensionality reduction (PCA)
112
+ from sklearn.decomposition import PCA
113
+ pca = PCA(n_components=2)
114
+ embedding_2d = pca.fit_transform(cell_embeddings.cpu().numpy())
115
+ print(f"✓ PCA reduction: {cell_embeddings.shape} → {embedding_2d.shape}")
116
+
117
+ # Example 3: Similarity search
118
+ # Find the most similar cell to the first cell
119
+ similarities = torch.nn.functional.cosine_similarity(
120
+ cell_embeddings[0:1],
121
+ cell_embeddings
122
+ )
123
+ most_similar_idx = torch.argmax(similarities).item()
124
+ print(f"✓ Similarity search: Most similar cell to cell 0 is cell {most_similar_idx} "
125
+ f"(similarity: {similarities[most_similar_idx]:.4f})")
126
+
127
+ # Step 5: Embedding statistics
128
+ print("\n[Step 5] Embedding statistics:")
129
+ print(f" - Mean: {cell_embeddings.mean(dim=0).norm():.4f}")
130
+ print(f" - Std: {cell_embeddings.std(dim=0).mean():.4f}")
131
+ print(f" - Min: {cell_embeddings.min():.4f}")
132
+ print(f" - Max: {cell_embeddings.max():.4f}")
133
+
134
+ # ============================================================
135
+ # Step 6: Save embeddings (optional)
136
+ # ============================================================
137
+ print("\n[Step 6] Saving embeddings...")
138
+
139
+ np.save("cell_embeddings.npy", cell_embeddings.cpu().numpy())
140
+ print("✓ Embeddings saved to 'cell_embeddings.npy'")
141
+
142
+ print("\n" + "=" * 80)
143
+ print("Phase 1 Complete!")
144
+ print("=" * 80)
145
+
146
+ return model, cell_embeddings
147
+
148
+
149
+ if __name__ == "__main__":
150
+ model, embeddings = main()
examples/__pycache__/0_preprocess_to_input_ids.cpython-39.pyc ADDED
Binary file (2.64 kB).
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ccb1fcb0ee4b3ea2013099b9b187455e160d3b66b76c606715231b70b13c2784
3
+ size 262998656
modeling_genemamba.py ADDED
@@ -0,0 +1,395 @@
1
+ """
2
+ PyTorch implementation of GeneMamba model for Hugging Face Transformers.
3
+ Includes backbone model and task-specific heads for various downstream tasks.
4
+ """
5
+
6
+ import math
7
+ import logging
8
+ from typing import Optional, Tuple, Union
9
+
10
+ import torch
11
+ import torch.nn as nn
12
+ import torch.nn.functional as F
13
+ from torch.nn.init import normal_, constant_
14
+
15
+ from transformers import PreTrainedModel, PretrainedConfig
16
+ from transformers.modeling_outputs import SequenceClassifierOutput, ModelOutput
18
+
19
+ from mamba_ssm import Mamba
20
+ from mamba_ssm.ops.triton.layer_norm import RMSNorm
21
+
22
+ from .configuration_genemamba import GeneMambaConfig
23
+ from .modeling_outputs import GeneMambaModelOutput, GeneMambaSequenceClassifierOutput, GeneMambaMaskedLMOutput
24
+
25
+ logger = logging.getLogger(__name__)
26
+
27
+
28
+ # ===========================
29
+ # Core Architecture Components
30
+ # ===========================
31
+
32
+ class EncoderLayer(nn.Module):
33
+ """
34
+ Single Mamba encoder layer with residual connection.
35
+ Applies a Mamba2 or Mamba layer followed by addition with input.
36
+
37
+ Args:
38
+ hidden_size (int): Dimension of hidden states.
39
+ """
40
+
41
+ def __init__(self, hidden_size: int):
42
+ super(EncoderLayer, self).__init__()
43
+ self.mamba = Mamba(d_model=hidden_size, d_state=64, d_conv=4, expand=2)
44
+
45
+ def forward(self, X: torch.Tensor) -> torch.Tensor:
46
+ """
47
+ Args:
48
+ X (torch.Tensor): Input tensor of shape (batch_size, seq_len, hidden_size).
49
+
50
+ Returns:
51
+ torch.Tensor: Output after Mamba layer and residual connection.
52
+ """
53
+ output = self.mamba(X) + X
54
+ return output
55
+
56
+
57
+ class MambaMixer(nn.Module):
58
+ """
59
+ Stack of Mamba encoder layers with bidirectional processing and aggregation.
60
+ Processes sequences in both forward and reverse directions, then aggregates.
61
+
62
+ Args:
63
+ mode (str): Aggregation mode. Options: "mean", "sum", "concat", "gate".
64
+ hidden_size (int): Dimension of hidden states.
65
+ num_hidden_layers (int): Number of Mamba layers.
66
+ """
67
+
68
+ def __init__(
69
+ self,
70
+ mode: str = "gate",
71
+ hidden_size: int = 512,
72
+ num_hidden_layers: int = 24
73
+ ):
74
+ super(MambaMixer, self).__init__()
75
+ self.mode = mode
76
+ self.hidden_size = hidden_size
77
+
78
+ # Create Mamba layers
79
+ self.layers = nn.ModuleList(
80
+ [EncoderLayer(hidden_size) for _ in range(num_hidden_layers)]
81
+ )
82
+
83
+ # Aggregation modules for certain modes
84
+ if mode in ["concat", "gate"]:
85
+ self.aggr = nn.Linear(hidden_size * 2, hidden_size)
86
+
87
+ def flip_sequence(self, X: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
88
+ """
89
+ Reverse a sequence based on actual length (ignoring padding).
90
+
91
+ Args:
92
+ X (torch.Tensor): Input tensor of shape (batch_size, seq_len, hidden_size).
93
+ mask (torch.Tensor, optional): Padding mask of shape (batch_size, seq_len).
94
+
95
+ Returns:
96
+ torch.Tensor: Reversed tensor.
97
+ """
98
+ batch_size, seq_length, embedding_dim = X.size()
99
+
100
+ if mask is None:
101
+ # Simple flip
102
+ return X.flip([1])
103
+
104
+ # Flip based on actual sequence length (marked by mask)
105
+ lengths = (~mask).sum(dim=1)
106
+ pos_tensor = torch.arange(seq_length, device=X.device).unsqueeze(0).expand(batch_size, -1)
107
+ flip_mask = pos_tensor < lengths.unsqueeze(1)
108
+ reversed_positions = torch.where(
109
+ flip_mask,
110
+ lengths.unsqueeze(1) - 1 - pos_tensor,
111
+ pos_tensor
112
+ )
113
+
114
+ X_reverse = torch.gather(X, 1, reversed_positions.unsqueeze(-1).expand(-1, -1, embedding_dim))
115
+ return X_reverse
116
+
117
+ def forward(
118
+ self,
119
+ X: torch.Tensor,
120
+ padding_mask: Optional[torch.Tensor] = None
121
+ ) -> torch.Tensor:
122
+ """
123
+ Process sequence through bidirectional Mamba layers.
124
+
125
+ Args:
126
+ X (torch.Tensor): Input tensor of shape (batch_size, seq_len, hidden_size).
127
+ padding_mask (torch.Tensor, optional): Padding mask.
128
+
129
+ Returns:
130
+ torch.Tensor: Output after processing all layers and aggregation.
131
+ """
132
+
133
+ for layer in self.layers:
134
+ # Flip sequence for reverse processing
135
+ X_flip = self.flip_sequence(X, padding_mask)
136
+
137
+ # Forward and reverse passes
138
+ X_f = layer(X)
139
+ X_b = layer(X_flip)
140
+
141
+ # Flip back the reverse output
142
+ X_b = self.flip_sequence(X_b, padding_mask)
143
+
144
+ # Aggregate forward and reverse
145
+ if self.mode == "mean":
146
+ X = (X_f + X_b) / 2
147
+ elif self.mode == "sum":
148
+ X = X_f + X_b
149
+ elif self.mode == "concat":
150
+ X = torch.cat([X_f, X_b], dim=-1)
151
+ X = self.aggr(X)
152
+ elif self.mode == "gate":
153
+ z = torch.sigmoid(self.aggr(torch.cat([X_f, X_b], dim=-1)))
154
+ X = z * X_f + (1 - z) * X_b
155
+ else:
156
+ raise ValueError(f"Invalid aggregation mode: {self.mode}")
157
+
158
+ return X
159
+
160
+
161
+ # ===========================
162
+ # Base Model Classes
163
+ # ===========================
164
+
165
+ class GeneMambaPreTrainedModel(PreTrainedModel):
166
+ """
167
+ Base class for all GeneMamba models.
168
+ Handles weight initialization and provides standard model interfaces.
169
+ """
170
+
171
+ config_class = GeneMambaConfig
172
+ base_model_prefix = "genemamba"
173
+ supports_gradient_checkpointing = True
174
+
175
+ def _init_weights(self, module):
176
+ """Initialize module weights."""
177
+ if isinstance(module, nn.Linear):
178
+ normal_(module.weight, std=self.config.initializer_range)
179
+ if module.bias is not None:
180
+ constant_(module.bias, 0.0)
181
+ elif isinstance(module, nn.Embedding):
182
+ normal_(module.weight, std=self.config.initializer_range)
183
+ if module.padding_idx is not None:
184
+ module.weight.data[module.padding_idx].zero_()
185
+ elif isinstance(module, nn.LayerNorm):
186
+ constant_(module.bias, 0.0)
187
+ constant_(module.weight, 1.0)
188
+
189
+
190
+ class GeneMambaModel(GeneMambaPreTrainedModel):
191
+ """
192
+ GeneMamba backbone model - outputs cell embeddings and hidden states.
193
+ This is the core model used by task-specific heads.
194
+
195
+ Args:
196
+ config (GeneMambaConfig): Model configuration class.
197
+ """
198
+
199
+ def __init__(self, config: GeneMambaConfig):
200
+ super().__init__(config)
201
+ self.config = config
202
+
203
+ # Embedding layer
204
+ self.embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
205
+
206
+ # Mamba layers with bidirectional aggregation
207
+ self.mamba_mixer = MambaMixer(
208
+ mode=config.mamba_mode,
209
+ hidden_size=config.hidden_size,
210
+ num_hidden_layers=config.num_hidden_layers
211
+ )
212
+
213
+ # Final layer normalization
214
+ self.norm = RMSNorm(config.hidden_size)
215
+
216
+ self.apply(self._init_weights)
217
+
218
+ def get_input_embeddings(self) -> nn.Embedding:
219
+ """Return embedding layer."""
220
+ return self.embeddings
221
+
222
+ def set_input_embeddings(self, value: nn.Embedding):
223
+ """Set embedding layer."""
224
+ self.embeddings = value
225
+
226
+ def forward(
227
+ self,
228
+ input_ids: torch.Tensor,
229
+ attention_mask: Optional[torch.Tensor] = None,
230
+ output_hidden_states: bool = False,
231
+ ) -> GeneMambaModelOutput:
232
+ """
233
+ Args:
234
+ input_ids (torch.Tensor): Token indices of shape (batch_size, seq_len).
235
+ attention_mask (torch.Tensor, optional): Attention mask of shape (batch_size, seq_len).
236
+ output_hidden_states (bool): Whether to output hidden states from all layers.
237
+
238
+ Returns:
239
+ GeneMambaModelOutput: Contains last_hidden_state, pooled_embedding, etc.
240
+ """
241
+ # Get embeddings
242
+ hidden_states = self.embeddings(input_ids)
243
+
244
+ # Pass through Mamba layers
245
+ hidden_states = self.mamba_mixer(hidden_states, attention_mask)
246
+
247
+ # Apply final normalization
248
+ hidden_states = self.norm(hidden_states)
249
+
250
+ # Compute pooled embedding (cell representation)
251
+ if self.config.embedding_pooling == "CLS":
252
+ # Use first token (CLS)
253
+ pooled_embedding = hidden_states[:, 0, :]
254
+ elif self.config.embedding_pooling == "mean":
255
+ # Mean pooling over sequence
256
+ if attention_mask is not None:
257
+ mask = attention_mask.unsqueeze(-1).expand(hidden_states.shape).float()
258
+ pooled_embedding = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
259
+ else:
260
+ pooled_embedding = hidden_states.mean(dim=1)
261
+ else:
262
+ raise ValueError(f"Unsupported embedding_pooling: {self.config.embedding_pooling}")
263
+
264
+ return GeneMambaModelOutput(
265
+ last_hidden_state=hidden_states,
266
+ pooled_embedding=pooled_embedding,
267
+ hidden_states=hidden_states if output_hidden_states else None,
268
+ embedding_pooling=self.config.embedding_pooling,
269
+ )
270
+
271
+
272
+ # ===========================
273
+ # Task-Specific Models
274
+ # ===========================
275
+
277
+ class GeneMambaForMaskedLM(GeneMambaPreTrainedModel):
278
+ """
279
+ GeneMamba model for masked language modeling (MLM).
280
+ Suitable for pretraining and domain adaptation.
281
+
282
+ Args:
283
+ config (GeneMambaConfig): Model configuration class.
284
+ """
285
+
286
+ def __init__(self, config: GeneMambaConfig):
287
+ super().__init__(config)
288
+ self.genemamba = GeneMambaModel(config)
289
+
290
+ # Language modeling head
291
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
292
+
293
+ self.apply(self._init_weights)
294
+
295
+ def forward(
296
+ self,
297
+ input_ids: torch.Tensor,
298
+ attention_mask: Optional[torch.Tensor] = None,
299
+ labels: Optional[torch.Tensor] = None,
300
+ output_hidden_states: bool = False,
301
+ ) -> GeneMambaMaskedLMOutput:
302
+ """
303
+ Args:
304
+ input_ids (torch.Tensor): Token indices of shape (batch_size, seq_len).
305
+ attention_mask (torch.Tensor, optional): Attention mask.
306
+ labels (torch.Tensor, optional): Target token ids for MLM loss.
307
+ output_hidden_states (bool): Whether to output hidden states.
308
+
309
+ Returns:
310
+ GeneMambaMaskedLMOutput: Contains logits and optional loss.
311
+ """
312
+ outputs = self.genemamba(
313
+ input_ids=input_ids,
314
+ attention_mask=attention_mask,
315
+ output_hidden_states=output_hidden_states,
316
+ )
317
+
318
+ logits = self.lm_head(outputs.last_hidden_state)
319
+
320
+ loss = None
321
+ if labels is not None:
322
+ loss_fct = nn.CrossEntropyLoss()
323
+ loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1))
324
+
325
+ return GeneMambaMaskedLMOutput(
326
+ loss=loss,
327
+ logits=logits,
328
+ hidden_states=outputs.hidden_states if output_hidden_states else None,
329
+ )
330
+
331
+
333
+ class GeneMambaForSequenceClassification(GeneMambaPreTrainedModel):
334
+ """
335
+ GeneMamba model for sequence classification tasks.
336
+ Ideal for cell type annotation, tissue classification, etc.
337
+
338
+ Args:
339
+ config (GeneMambaConfig): Model configuration class.
340
+ """
341
+
342
+ def __init__(self, config: GeneMambaConfig):
343
+ super().__init__(config)
344
+ self.num_labels = config.num_labels
345
+ self.config = config
346
+
347
+ self.genemamba = GeneMambaModel(config)
348
+
349
+ # Classification head
350
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
351
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
352
+
353
+ self.apply(self._init_weights)
354
+
355
+ def forward(
356
+ self,
357
+ input_ids: torch.Tensor,
358
+ attention_mask: Optional[torch.Tensor] = None,
359
+ labels: Optional[torch.Tensor] = None,
360
+ output_hidden_states: bool = False,
361
+ ) -> GeneMambaSequenceClassifierOutput:
362
+ """
363
+ Args:
364
+ input_ids (torch.Tensor): Token indices of shape (batch_size, seq_len).
365
+ attention_mask (torch.Tensor, optional): Attention mask.
366
+ labels (torch.Tensor, optional): Class labels for classification loss.
367
+ output_hidden_states (bool): Whether to output hidden states.
368
+
369
+ Returns:
370
+ GeneMambaSequenceClassifierOutput: Contains logits, optional loss, and embedding.
371
+ """
372
+ outputs = self.genemamba(
373
+ input_ids=input_ids,
374
+ attention_mask=attention_mask,
375
+ output_hidden_states=output_hidden_states,
376
+ )
377
+
378
+ pooled_embedding = outputs.pooled_embedding
379
+ logits = self.classifier(self.dropout(pooled_embedding))
380
+
381
+ loss = None
382
+ if labels is not None:
383
+ loss_fct = nn.CrossEntropyLoss()
384
+ loss = loss_fct(logits, labels)
385
+
386
+ return GeneMambaSequenceClassifierOutput(
387
+ loss=loss,
388
+ logits=logits,
389
+ hidden_states=outputs.hidden_states if output_hidden_states else None,
390
+ pooled_embedding=pooled_embedding,
391
+ )
392
+
393
+
+ # Register the classes so that AutoModel / AutoModelForMaskedLM /
+ # AutoModelForSequenceClassification resolve to them when loaded with
+ # trust_remote_code=True (via PreTrainedModel.register_for_auto_class).
+ GeneMambaModel.register_for_auto_class("AutoModel")
+ GeneMambaForMaskedLM.register_for_auto_class("AutoModelForMaskedLM")
+ GeneMambaForSequenceClassification.register_for_auto_class("AutoModelForSequenceClassification")
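+
+
+ # Usage sketch for the classification head (illustrative; "path/to/finetuned-checkpoint"
+ # is a placeholder rather than a released artifact, and num_labels must match your labels):
+ #
+ #   model = GeneMambaForSequenceClassification.from_pretrained(
+ #       "path/to/finetuned-checkpoint", trust_remote_code=True
+ #   )
+ #   out = model(input_ids=batch_ids, labels=batch_labels)
+ #   loss, logits = out.loss, out.logits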
modeling_outputs.py ADDED
@@ -0,0 +1,81 @@
1
+ """
2
+ Custom ModelOutput classes for GeneMamba.
3
+ Defines the output structure for different GeneMamba tasks.
4
+ """
5
+
6
+ from dataclasses import dataclass
7
+ from typing import Optional, Tuple
8
+ import torch
9
+ from transformers.utils import ModelOutput
10
+
11
+
12
+ @dataclass
13
+ class GeneMambaModelOutput(ModelOutput):
14
+ """
15
+ Base output class for GeneMamba models.
16
+
17
+ Attributes:
18
+ last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)):
19
+ Sequence of hidden-states at the output of the last layer of the model.
20
+
21
+ hidden_states (tuple(torch.FloatTensor), optional):
22
+ Hidden-states of the model at the output of each layer plus the initial embedding outputs.
23
+
24
+ pooled_embedding (torch.FloatTensor of shape (batch_size, hidden_size)):
25
+ Cell/sequence-level embedding (pooled representation) used for downstream tasks.
26
+ This is the recommended embedding to use for classification, clustering, etc.
27
+
28
+ embedding_pooling (str):
29
+ The pooling method used to generate pooled_embedding.
30
+ """
31
+
32
+ last_hidden_state: torch.FloatTensor = None
33
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
34
+ pooled_embedding: torch.FloatTensor = None
35
+ embedding_pooling: str = "mean"
36
+
37
+
38
+ @dataclass
39
+ class GeneMambaSequenceClassifierOutput(ModelOutput):
40
+ """
41
+ Output class for GeneMamba sequence classification models.
42
+
43
+ Attributes:
44
+ loss (torch.FloatTensor of shape (), optional):
45
+ Classification loss (if labels were provided).
46
+
47
+ logits (torch.FloatTensor of shape (batch_size, num_labels)):
48
+ Classification scores (before softmax).
49
+
50
+ hidden_states (tuple(torch.FloatTensor), optional):
51
+ Hidden-states of the model at the output of each layer.
52
+
53
+ pooled_embedding (torch.FloatTensor of shape (batch_size, hidden_size), optional):
54
+ Cell embedding before classification head.
55
+ """
56
+
57
+ loss: Optional[torch.FloatTensor] = None
58
+ logits: torch.FloatTensor = None
59
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
60
+ pooled_embedding: Optional[torch.FloatTensor] = None
61
+
62
+
63
+ @dataclass
64
+ class GeneMambaMaskedLMOutput(ModelOutput):
65
+ """
66
+ Output class for GeneMamba masked language modeling.
67
+
68
+ Attributes:
69
+ loss (torch.FloatTensor of shape (), optional):
70
+ MLM loss (if labels were provided).
71
+
72
+ logits (torch.FloatTensor of shape (batch_size, sequence_length, vocab_size)):
73
+ Prediction scores of the language modeling head.
74
+
75
+ hidden_states (tuple(torch.FloatTensor), optional):
76
+ Hidden-states of the model at the output of each layer.
77
+ """
78
+
79
+ loss: Optional[torch.FloatTensor] = None
80
+ logits: torch.FloatTensor = None
81
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
special_tokens_map.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "pad_token": "[PAD]",
3
+ "unk_token": "[UNK]"
4
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_assets/gene_tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_assets/id2symbol.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3d5090ff562a77a03b19c37f6a010d639b8d64b1624db2e9a7c3291f9d389293
3
+ size 634589
tokenizer_assets/symbol2id.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ecc7193b1f549e513903ba37410788632252a2dda4d07876a1d91730d8697dbe
3
+ size 526232
tokenizer_config.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "added_tokens_decoder": {},
3
+ "clean_up_tokenization_spaces": true,
4
+ "model_max_length": 1000000000000000019884624838656,
5
+ "pad_token": "[PAD]",
6
+ "tokenizer_class": "PreTrainedTokenizerFast",
7
+ "unk_token": "[UNK]"
8
+ }