Upload folder using huggingface_hub

Files changed (8)

README.md +165 -0
__pycache__/modeling_virtual_cell_distil.cpython-312.pyc +0 -0
config.json +16 -0
gene_names.txt +0 -0
model.safetensors +3 -0
modeling_virtual_cell_distil.py +183 -0
requirements.txt +10 -0
train.py +145 -0

README.md ADDED

	@@ -0,0 +1,165 @@

+# Virtual Cell — Distilled Bulk Encoder
+A bulk RNA-seq encoder distilled from
+[ConvergeBio/virtual-cell-patient](https://huggingface.co/ConvergeBio/virtual-cell-patient).
+It maps bulk gene expression directly into the same 512-dimensional patient embedding space,
+making single-cell-trained representations accessible when only bulk data is available.
+## Model architecture
+```
+input  [batch, 18301 genes]
+  → MLP encoder (Linear → BN → PReLU)² → Linear   → [batch, 512]
+```
+Training objective: cosine distillation loss, with teacher embeddings produced by
+`virtual-cell-patient` on matched single-cell RNA-seq data from the same patients.
+## Relationship to virtual-cell-patient
+| | [virtual-cell-patient](https://huggingface.co/ConvergeBio/virtual-cell-patient) | virtual-cell-distil-bulk |
+|---|---|---|
+| Input | `[batch, n_cells, 18301]` single-cell matrix | `[batch, 18301]` bulk expression vector |
+| Output | `[batch, 512]` patient embedding + class logits | `[batch, 512]` patient embedding |
+| Requires single-cell data | Yes | No |
+Both models use the same 18,301-gene vocabulary (`gene_names.txt`) and produce embeddings
+in the same 512-dimensional space.
+## Installation
+```bash
+pip install -r requirements.txt
+```
+`wandb` is optional and only needed when training with `--wandb_project`.
+## Quick start
+### Inference — extract embeddings
+```python
+import torch
+from transformers import AutoModel
+model = AutoModel.from_pretrained(
+    "ConvergeBio/virtual-cell-distil-bulk",
+    trust_remote_code=True,
+).eval()
+x = torch.randn(4, 18_301)   # [batch, num_genes]
+with torch.no_grad():
+    out = model(input_ids=x)
+print(out["embeddings"].shape)   # [4, 512]
+```
+> **Note:** the model uses BatchNorm — always call `.eval()` for inference.
+### Inference on real data
+```python
+from datasets import load_dataset
+import torch
+from transformers import AutoModel
+ds = load_dataset("ConvergeBio/virtual-cell-distil-bulk-example", token="...", split="validation")
+model = AutoModel.from_pretrained(
+    "ConvergeBio/virtual-cell-distil-bulk",
+    trust_remote_code=True,
+).eval()
+sample = torch.tensor(ds[0]["bulk_expression"]).unsqueeze(0)  # [1, 18301]
+with torch.no_grad():
+    out = model(input_ids=sample)
+print(out["embeddings"].shape)   # [1, 512]
+```
+> **Note:** `ConvergeBio/virtual-cell-distil-bulk-example` is a minimal sample dataset
+> intended only to verify the data format and run a quick end-to-end check.
+> Metrics computed on this dataset are not meaningful and should not be read as model performance.
+## Fine-tuning for classification
+The pretrained encoder can be fine-tuned on any bulk RNA-seq classification task.
+A dropout + linear head is added on top; the encoder weights are initialised from the distilled
+checkpoint and optionally frozen.
+```python
+from transformers import AutoModelForSequenceClassification
+model = AutoModelForSequenceClassification.from_pretrained(
+    "ConvergeBio/virtual-cell-distil-bulk",
+    num_labels=2,
+    ignore_mismatched_sizes=True,   # classification head is randomly initialised
+    trust_remote_code=True,
+)
+```
+**Binary classification (e.g. disease vs. healthy) with frozen encoder:**
+```bash
+python train.py \
+  --dataset_path <your_dataset> \
+  --num_classes 2 \
+  --freeze_encoder \
+  --output_dir ./my_binary_model
+```
+**Multi-class fine-tuning:**
+```bash
+python train.py \
+  --dataset_path <your_dataset> \
+  --num_classes <N> \
+  --output_dir ./my_finetuned_model \
+  --num_train_epochs 15 \
+  --learning_rate 1e-4
+```
+## Preparing your data
+`train.py` expects a HuggingFace dataset with `train` (and optionally `validation`) splits.
+Each row represents one patient sample:
+| Column | Shape | Type | Description |
+|---|---|---|---|
+| `bulk_expression` | [18301] | float32 | Log-normalised bulk gene expression, aligned to `gene_names.txt` |
+| `labels` | scalar | int | Class index |
+Input expression should be library-size normalised (target sum 10,000) and log1p
+transformed. The gene axis must be aligned to the 18,301 genes in `gene_names.txt` —
+missing genes are zero-filled, extra genes are dropped.
+For a guide on building this dataset from raw count matrices, see the
+[example dataset](https://huggingface.co/datasets/ConvergeBio/virtual-cell-distil-bulk-example).
+## Repository contents
+| File | Description |
+|---|---|
+| `modeling_virtual_cell_distil.py` | Full model implementation |
+| `config.json` | Architecture config |
+| `gene_names.txt` | Ordered list of 18,301 HGNC gene symbols |
+| `train.py` | Classification fine-tuning script |
+| `requirements.txt` | Python dependencies |
+| `model.safetensors` | Pretrained encoder weights |
+## Citation
+If you use this model, please cite:
+```bibtex
+@article{convergecell2026,
+  author    = {ConvergeBio},
+  title     = {ConvergeCELL: An end-to-end platform from patient transcriptomics to therapeutic hypotheses},
+  year      = {2026},
+  note      = {Preprint available on bioRxiv},
+}
+```
+## License
+[TBD]
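The preprocessing described under "Preparing your data" (library-size normalisation to a target sum of 10,000, log1p, then alignment to `gene_names.txt`) could be sketched roughly as follows. This is a minimal NumPy sketch, not the authors' pipeline: the function name `prepare_bulk`, its arguments, and the toy gene symbols are hypothetical, and normalising before dropping extra genes is an assumption, since the README does not state the order. In practice `vocab` would be the 18,301 symbols read from `gene_names.txt`.

```python
import numpy as np

def prepare_bulk(counts, genes, vocab, target_sum=10_000.0):
    """Normalise raw counts and align them to a fixed gene vocabulary.

    counts: [n_samples, n_input_genes] raw count matrix
    genes:  gene symbols for the columns of `counts`
    vocab:  ordered target gene symbols (e.g. the lines of gene_names.txt)
    """
    counts = np.asarray(counts, dtype=np.float64)
    # Library-size normalisation: scale each sample to `target_sum` total counts.
    lib = counts.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0  # avoid division by zero for empty samples
    x = np.log1p(counts / lib * target_sum)

    # Align columns to the vocabulary: zero-fill missing genes, drop extras.
    col = {g: i for i, g in enumerate(genes)}
    out = np.zeros((x.shape[0], len(vocab)), dtype=np.float32)
    for j, g in enumerate(vocab):
        if g in col:
            out[:, j] = x[:, col[g]]
    return out

# Toy example with a 3-gene vocabulary (hypothetical symbols).
vocab = ["A", "B", "C"]
aligned = prepare_bulk([[10, 30, 60]], genes=["C", "X", "A"], vocab=vocab)
print(aligned.shape)  # (1, 3)
```

Here gene `B` is absent from the input and is zero-filled, while `X` is not in the vocabulary and is dropped. Each row of the result could then be stored as the `bulk_expression` column.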

__pycache__/modeling_virtual_cell_distil.cpython-312.pyc ADDED

Binary file (9.05 kB).

config.json ADDED

	@@ -0,0 +1,16 @@

+{
+  "model_type": "virtual_cell_distil",
+  "n_genes": 18301,
+  "output_dim": 512,
+  "hidden_dim": [512, 512],
+  "dropout": 0.2044838332376416,
+  "residual": false,
+  "activation": "prelu",
+  "num_labels": 2,
+  "classifier_dropout": 0.1,
+  "auto_map": {
+    "AutoConfig": "modeling_virtual_cell_distil.VirtualCellDistilConfig",
+    "AutoModel": "modeling_virtual_cell_distil.VirtualCellDistilModel",
+    "AutoModelForSequenceClassification": "modeling_virtual_cell_distil.VirtualCellDistilForSequenceClassification"
+  }
+}

gene_names.txt ADDED

The diff for this file is too large to render.

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b41cdc6ccc9caded37f4fb68e9d511aaeea77ef0e2f685eabc53edc7cdd060b8
+size 39601856
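As a rough sanity check on the pointer file above, the parameter count implied by `config.json` can be compared with the reported size. The sketch below is pure arithmetic under two assumptions not stated in the source: that the weights are stored as float32, and that the checkpoint holds only the encoder (no classification head). The helper name `mlp_param_count` is hypothetical.

```python
def mlp_param_count(n_genes, hidden_dim, output_dim):
    """Learnable parameters of (Linear -> BatchNorm1d -> PReLU) blocks plus a final Linear."""
    total, prev = 0, n_genes
    for h in hidden_dim:
        total += prev * h + h   # Linear weight + bias
        total += 2 * h          # BatchNorm1d weight + bias
        total += 1              # nn.PReLU() default: one shared slope
        prev = h
    total += prev * output_dim + output_dim  # final projection
    return total

# Values taken from config.json.
n_params = mlp_param_count(n_genes=18301, hidden_dim=[512, 512], output_dim=512)
approx_bytes = n_params * 4  # assuming float32 storage
print(n_params, approx_bytes)
```

With the values from `config.json` this gives about 9.9 million parameters, or roughly 39.6 MB in float32, which is in line with the 39,601,856-byte size in the pointer. The small remainder would be consistent with BatchNorm running statistics plus the safetensors header.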

modeling_virtual_cell_distil.py ADDED

	@@ -0,0 +1,183 @@

+"""
+Virtual Cell — Distilled Bulk Encoder — HuggingFace release.
+Encodes bulk RNA-seq gene expression into the same 512-d patient embedding
+space as ConvergeBio/virtual-cell-patient, without requiring single-cell data.
+Trained by cosine distillation against patient model embeddings.
+Two classes are provided:
+  VirtualCellDistilModel
+    Pure encoder. Returns 512-d embeddings for each sample.
+    Use this for clustering, visualisation, or as a frozen backbone.
+  VirtualCellDistilForSequenceClassification
+    Adds a dropout + linear classification head on top of the encoder.
+    Load the pretrained encoder weights and fine-tune on your labels.
+Usage — inference:
+    from transformers import AutoModel
+    model = AutoModel.from_pretrained(
+        "ConvergeBio/virtual-cell-distil-bulk", trust_remote_code=True
+    ).eval()
+    out = model(input_ids=x)          # out["embeddings"]: [batch, 512]
+Usage — classification fine-tuning:
+    from transformers import AutoModelForSequenceClassification
+    model = AutoModelForSequenceClassification.from_pretrained(
+        "ConvergeBio/virtual-cell-distil-bulk",
+        num_labels=2,
+        ignore_mismatched_sizes=True,  # head is randomly initialised
+        trust_remote_code=True,
+    )
+    out = model(input_ids=x, labels=y)
+    # out["loss"], out["logits"], out["embeddings"]
+Note: the model contains BatchNorm layers — always call .eval() for inference.
+"""
+from typing import List, Optional
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import PretrainedConfig, PreTrainedModel
+def _get_activation(activation: str) -> nn.Module:
+    if activation == "prelu":
+        return nn.PReLU()
+    elif activation == "relu":
+        return nn.ReLU()
+    elif activation == "gelu":
+        return nn.GELU()
+    elif activation == "tanh":
+        return nn.Tanh()
+    raise ValueError(f"Unsupported activation: {activation!r}")
+class MLP(nn.Module):
+    def __init__(
+        self,
+        input_dim: int,
+        output_dim: int = 512,
+        hidden_dim: Optional[List[int]] = None,
+        dropout: float = 0.0,
+        residual: bool = False,
+        activation: str = "prelu",
+    ):
+        super().__init__()
+        if hidden_dim is None:
+            hidden_dim = [512, 512]
+        self.latent_dim = output_dim
+        self.residual = residual
+        self.network = nn.ModuleList()
+        if residual:
+            assert len(set(hidden_dim)) == 1, "Residual connections require all hidden dims to be equal"
+        for i in range(len(hidden_dim)):
+            if i == 0:
+                self.network.append(nn.Sequential(
+                    nn.Linear(input_dim, hidden_dim[i]),
+                    nn.BatchNorm1d(hidden_dim[i]),
+                    _get_activation(activation),
+                ))
+            else:
+                self.network.append(nn.Sequential(
+                    nn.Dropout(p=dropout),
+                    nn.Linear(hidden_dim[i - 1], hidden_dim[i]),
+                    nn.BatchNorm1d(hidden_dim[i]),
+                    _get_activation(activation),
+                ))
+        self.network.append(nn.Linear(hidden_dim[-1], output_dim))
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        for i, layer in enumerate(self.network):
+            if self.residual and (0 < i < len(self.network) - 1):
+                x = layer(x) + x
+            else:
+                x = layer(x)
+        return x
+class VirtualCellDistilConfig(PretrainedConfig):
+    model_type = "virtual_cell_distil"
+    def __init__(
+        self,
+        n_genes: int = 18301,
+        output_dim: int = 512,
+        hidden_dim: Optional[List[int]] = None,
+        dropout: float = 0.0,
+        residual: bool = False,
+        activation: str = "prelu",
+        num_labels: int = 2,
+        classifier_dropout: float = 0.1,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.n_genes = n_genes
+        self.output_dim = output_dim
+        self.hidden_dim = hidden_dim if hidden_dim is not None else [512, 512]
+        self.dropout = dropout
+        self.residual = residual
+        self.activation = activation
+        self.num_labels = num_labels
+        self.classifier_dropout = classifier_dropout
+class VirtualCellDistilModel(PreTrainedModel):
+    """Pure encoder — returns 512-d patient embeddings from bulk expression."""
+    config_class = VirtualCellDistilConfig
+    def __init__(self, config: VirtualCellDistilConfig):
+        super().__init__(config)
+        self.encoder = MLP(
+            input_dim=config.n_genes,
+            output_dim=config.output_dim,
+            hidden_dim=config.hidden_dim,
+            dropout=config.dropout,
+            residual=config.residual,
+            activation=config.activation,
+        )
+    def forward(self, input_ids: torch.Tensor, **kwargs) -> dict:
+        return {"embeddings": self.encoder(input_ids)}
+class VirtualCellDistilForSequenceClassification(PreTrainedModel):
+    """
+    Encoder + linear classification head.
+    The encoder is initialised from pretrained distilled weights.
+    The classification head is randomly initialised and trained on your labels.
+    Use ignore_mismatched_sizes=True when loading from the pretrained checkpoint.
+    """
+    config_class = VirtualCellDistilConfig
+    def __init__(self, config: VirtualCellDistilConfig):
+        super().__init__(config)
+        self.encoder = MLP(
+            input_dim=config.n_genes,
+            output_dim=config.output_dim,
+            hidden_dim=config.hidden_dim,
+            dropout=config.dropout,
+            residual=config.residual,
+            activation=config.activation,
+        )
+        self.dropout = nn.Dropout(config.classifier_dropout)
+        self.classifier = nn.Linear(config.output_dim, config.num_labels)
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        labels: Optional[torch.Tensor] = None,
+        **kwargs,
+    ) -> dict:
+        embeddings = self.encoder(input_ids)
+        logits = self.classifier(self.dropout(embeddings))
+        loss = None
+        if labels is not None:
+            loss = F.cross_entropy(logits, labels)
+        return {"loss": loss, "logits": logits, "embeddings": embeddings}
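The module docstring says the encoder was trained by cosine distillation against teacher embeddings. The distillation code is not part of this commit, so the following is only a sketch of what such a loss commonly looks like, assuming the objective is one minus the mean cosine similarity between student and teacher embeddings. The function name `cosine_distillation_loss` and the random stand-in tensors are hypothetical.

```python
import torch
import torch.nn.functional as F

def cosine_distillation_loss(student, teacher):
    """1 - mean cosine similarity between student and (detached) teacher embeddings."""
    # The teacher is treated as a fixed target, so no gradient flows into it.
    cos = F.cosine_similarity(student, teacher.detach(), dim=-1)
    return (1.0 - cos).mean()

torch.manual_seed(0)
teacher = torch.randn(4, 512)                       # stand-in for teacher embeddings
student = torch.randn(4, 512, requires_grad=True)   # stand-in for encoder output

loss = cosine_distillation_loss(student, teacher)
loss.backward()                                     # gradients reach the student only

# Cosine similarity ignores scale, so a rescaled copy of the teacher gives ~0 loss.
aligned_loss = cosine_distillation_loss(2.0 * teacher, teacher)
print(float(loss), float(aligned_loss))
```

A loss of this form only constrains the direction of the embedding, not its norm, which may matter if downstream code relies on embedding magnitudes.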

requirements.txt ADDED

	@@ -0,0 +1,10 @@

+torch>=2.0
+transformers>=4.41,<5.0  # train.py passes eval_strategy, which older releases do not accept
+accelerate>=0.26
+datasets>=2.19
+scikit-learn>=1.3
+numpy>=1.24
+safetensors>=0.4
+# optional: only needed with --wandb_project
+# wandb

train.py ADDED

	@@ -0,0 +1,145 @@

+import argparse
+import os
+import sys
+from dataclasses import dataclass
+from typing import Dict, List, Optional
+import numpy as np
+import torch
+from datasets import DatasetDict, load_dataset
+from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
+from transformers import EarlyStoppingCallback, Trainer, TrainingArguments
+from transformers.trainer_utils import EvalPrediction
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+from modeling_virtual_cell_distil import (
+    VirtualCellDistilConfig,
+    VirtualCellDistilForSequenceClassification,
+)
+@dataclass
+class BulkCollator:
+    def __call__(self, features: List[Dict]) -> Dict[str, torch.Tensor]:
+        return {
+            "input_ids": torch.stack([
+                torch.tensor(f["bulk_expression"], dtype=torch.float32) for f in features
+            ]),
+            "labels": torch.tensor([f["labels"] for f in features], dtype=torch.long),
+        }
+def compute_metrics(eval_pred: EvalPrediction) -> Dict[str, float]:
+    logits = eval_pred.predictions
+    if isinstance(logits, tuple):
+        logits = logits[0]
+    labels = eval_pred.label_ids
+    preds  = np.argmax(logits, axis=1)
+    return {
+        "accuracy":  accuracy_score(labels, preds),
+        "f1_macro":  f1_score(labels, preds, average="macro",  zero_division=0),
+        "precision": precision_score(labels, preds, average="macro", zero_division=0),
+        "recall":    recall_score(labels, preds, average="macro",    zero_division=0),
+    }
+def parse_args():
+    p = argparse.ArgumentParser()
+    p.add_argument("--dataset_path",     required=True,
+                   help="HF dataset ID or local path with train (and optionally validation) splits")
+    p.add_argument("--model_name_or_path", default="ConvergeBio/virtual-cell-distil-bulk")
+    p.add_argument("--hf_token",           default=None)
+    p.add_argument("--output_dir",         default="./vc_distil_output")
+    p.add_argument("--num_classes",        type=int,   default=None)
+    p.add_argument("--freeze_encoder",     action="store_true",
+                   help="Freeze the pretrained encoder and train the classification head only")
+    p.add_argument("--num_train_epochs",   type=int,   default=15)
+    p.add_argument("--per_device_train_batch_size", type=int,   default=32)
+    p.add_argument("--per_device_eval_batch_size",  type=int,   default=32)
+    p.add_argument("--learning_rate",      type=float, default=1e-4)
+    p.add_argument("--weight_decay",       type=float, default=0.05)
+    p.add_argument("--warmup_ratio",       type=float, default=0.1)
+    p.add_argument("--lr_scheduler_type",             default="cosine")
+    p.add_argument("--patience",           type=int,   default=5)
+    p.add_argument("--num_workers",        type=int,   default=4)
+    p.add_argument("--prefetch_factor",    type=int,   default=2)
+    p.add_argument("--wandb_project",      default=None)
+    p.add_argument("--run_name",           default=None)
+    return p.parse_args()
+def main():
+    args = parse_args()
+    if os.path.isdir(args.dataset_path):
+        ds = DatasetDict.load_from_disk(args.dataset_path)
+    else:
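+        # token=None falls back to a locally stored login if one exists,
+        # so public datasets also load without authentication.
+        ds = load_dataset(args.dataset_path,
+                          num_proc=args.num_workers or None,
+                          token=args.hf_token)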
+    train_ds = ds["train"]
+    val_ds: Optional[object] = ds.get("validation")
+    hf_kwargs = {"trust_remote_code": True}
+    if args.hf_token:
+        hf_kwargs["token"] = args.hf_token
+    config = VirtualCellDistilConfig.from_pretrained(args.model_name_or_path, **hf_kwargs)
+    if args.num_classes is not None:
+        config.num_labels = args.num_classes
+        config.id2label   = {str(i): str(i) for i in range(args.num_classes)}
+        config.label2id   = {str(i): i       for i in range(args.num_classes)}
+    model = VirtualCellDistilForSequenceClassification.from_pretrained(
+        args.model_name_or_path,
+        config=config,
+        ignore_mismatched_sizes=True,
+        **hf_kwargs,
+    )
+    if args.freeze_encoder:
+        for param in model.encoder.parameters():
+            param.requires_grad = False
+    if args.wandb_project:
+        os.environ["WANDB_PROJECT"] = args.wandb_project
+    has_val = val_ds is not None
+    training_args = TrainingArguments(
+        output_dir=args.output_dir,
+        num_train_epochs=args.num_train_epochs,
+        per_device_train_batch_size=args.per_device_train_batch_size,
+        per_device_eval_batch_size=args.per_device_eval_batch_size,
+        learning_rate=args.learning_rate,
+        weight_decay=args.weight_decay,
+        warmup_ratio=args.warmup_ratio,
+        lr_scheduler_type=args.lr_scheduler_type,
+        eval_strategy="epoch" if has_val else "no",
+        save_strategy="epoch",
+        load_best_model_at_end=has_val,
+        metric_for_best_model="eval_loss" if has_val else None,
+        greater_is_better=False,
+        report_to="wandb" if args.wandb_project else "none",
+        run_name=args.run_name,
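+        dataloader_num_workers=args.num_workers,
+        # --prefetch_factor only applies when worker processes are used
+        dataloader_prefetch_factor=args.prefetch_factor if args.num_workers > 0 else None,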
+        remove_unused_columns=False,
+    )
+    callbacks = [EarlyStoppingCallback(early_stopping_patience=args.patience)] if has_val else []
+    trainer = Trainer(
+        model=model,
+        args=training_args,
+        train_dataset=train_ds,
+        eval_dataset=val_ds,
+        data_collator=BulkCollator(),
+        compute_metrics=compute_metrics if has_val else None,
+        callbacks=callbacks,
+    )
+    trainer.train()
+    trainer.save_model(args.output_dir)
+if __name__ == "__main__":
+    main()
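`--freeze_encoder` sets `requires_grad = False` on the encoder parameters. In PyTorch this stops gradient updates, but with standard `BatchNorm1d` behaviour it does not by itself stop the running statistics from updating while the module is in training mode, which is the mode `Trainer` uses during training. The toy module below is a stand-in with arbitrary sizes, not the released model. It illustrates the effect and shows that putting the frozen part in eval mode keeps the statistics fixed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in for one encoder block; sizes are arbitrary.
block = nn.Sequential(nn.Linear(8, 4), nn.BatchNorm1d(4), nn.PReLU())
for p in block.parameters():
    p.requires_grad = False          # same freezing approach as train.py

bn = block[1]
x = torch.randn(16, 8) + 3.0

block.train()
before = bn.running_mean.clone()
block(x)                             # forward pass in training mode
stats_changed_in_train = not torch.equal(before, bn.running_mean)

block.eval()                         # eval mode uses the stored statistics as-is
before = bn.running_mean.clone()
block(x)
stats_changed_in_eval = not torch.equal(before, bn.running_mean)

print(stats_changed_in_train, stats_changed_in_eval)
```

If fully fixed encoder behaviour is wanted during head-only training, a script would need to keep the encoder in eval mode across `model.train()` calls, for example by overriding `train` on the model. Whether that is desirable for this model is a judgment call the source does not address.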