krkawzq committed · Commit c7cd685 · verified · Parent(s): afb48ec

Update README.md

Files changed (1): README.md (+106 −81)

# CellFM-800M

## Model Description

CellFM is a large-scale foundation model pre-trained on transcriptomics of 100 million human cells using a retention-based architecture (MAE Autobin).

- **Model Size**: 800M parameters
- **Pre-training Data**: 100M human cells
- **Architecture**: Retention-based Transformer (MAE Autobin)
- **Vocabulary**: 24,072 genes
- **Pre-training Task**: Masked Autoencoding (MAE)

## Model Details

- **Source**: [biomed-AI/CellFM](https://github.com/biomed-AI/CellFM)
- **Original Framework**: MindSpore
- **Converted to**: PyTorch (PerturbLab format)
- **License**: See original repository for details

## Architecture Specifications

- **Hidden Dimension**: 1536
- **Number of Layers**: 40
- **Number of Attention Heads**: 48
- **Dropout**: 0.1
- **Max Sequence Length**: 2048 genes

## Usage

### Load Model

```python
from perturblab.model.cellfm import CellFMModel

# Load pretrained model (automatically downloads if needed)
model = CellFMModel.from_pretrained('cellfm-800m')

# Or use short name
model = CellFMModel.from_pretrained('800m')

# Or from local path
model = CellFMModel.from_pretrained('./weights/cellfm-800m')
```

### Generate Cell Embeddings

```python
import scanpy as sc

# Load your data
adata = sc.read_h5ad('your_data.h5ad')

# Preprocess
adata = CellFMModel.prepare_data(adata)

# Get embeddings (use smaller batch size for 800M model)
embeddings = model.predict_embeddings(
    adata,
    batch_size=8,  # smaller batch size for larger model
    return_cls_token=True,
)

# Access cell embeddings
cell_embeddings = embeddings['cell_embeddings']  # shape: (n_cells, 1536)
```

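The embedding matrix can feed directly into standard similarity tooling. A minimal numpy sketch of cosine-similarity cell retrieval, using a random stand-in for the model's `(n_cells, 1536)` output:

```python
import numpy as np

# Random stand-in for the model's output; real embeddings come from
# embeddings['cell_embeddings'] and have shape (n_cells, 1536)
rng = np.random.default_rng(0)
cell_embeddings = rng.standard_normal((100, 1536)).astype(np.float32)

# L2-normalize rows so dot products become cosine similarities
unit = cell_embeddings / np.linalg.norm(cell_embeddings, axis=1, keepdims=True)

# Retrieve the 5 cells most similar to cell 0 (cell 0 itself scores highest)
sims = unit @ unit[0]
top5 = np.argsort(-sims)[1:6]
```

The same normalized embeddings also work as input to neighbor-graph clustering or UMAP in your tool of choice.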
### Fine-tune for Classification

```python
from perturblab.model.cellfm import CellFMModel, CellFMConfig

# Initialize model with classification head
config = CellFMConfig(
    model_name='800M',
    n_genes=24072,
    enc_dims=1536,
    enc_nlayers=40,
    enc_num_heads=48,
    num_cls=10,  # number of cell types
)
model = CellFMModel(config, for_finetuning=True)

# Load pretrained weights
model.load_weights('./weights/cellfm-800m/model.pt')

# Get dataloaders
train_loader = model.get_dataloader(train_data, batch_size=4)['train']
val_loader = model.get_dataloader(val_data, batch_size=4)['train']

# Train
model.train_model(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    num_epochs=10,
    learning_rate=1e-4,
)
```

### Perturbation Prediction

```python
from perturblab.model.cellfm import CellFMPerturbationModel
from perturblab.data import PerturbationData

# Load perturbation data
data = PerturbationData.from_anndata(adata)
data.split_data(train=0.7, val=0.15, test=0.15)

# Initialize model
model = CellFMPerturbationModel.from_pretrained('cellfm-800m')

# Initialize perturbation head from dataset
model.init_perturbation_head_from_dataset(data)

# Train (use smaller batch size)
model.train_model(data, epochs=20, batch_size=4)

# Predict
predictions = model.predict_perturbation(data, split='test')

# Evaluate
metrics = model.evaluate(data, split='test')
print(f"Pearson correlation: {metrics['pearson']:.4f}")
```

## Performance Notes

- **Memory Requirements**: ~3-4GB GPU memory for inference (batch_size=8)
- **Recommended Batch Size**: 4-8 for training, 8-16 for inference
- **Inference Speed**: ~2-3x slower than 80M model
- **Loading Time**: ~5-10 seconds

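The memory figures follow from simple arithmetic on the parameter count; a quick back-of-envelope check (weights only — activation memory comes on top and grows with batch size):

```python
# Checkpoint size estimated from the parameter count
n_params = 800e6                   # 800M parameters
gb_fp32 = n_params * 4 / 1024**3   # 4 bytes/param in float32 -> ~3.0 GB
gb_fp16 = n_params * 2 / 1024**3   # 2 bytes/param in float16 -> ~1.5 GB
print(round(gb_fp32, 2), round(gb_fp16, 2))
```

The float32 estimate (~3.0 GB) matches the size of `model.pt`, and loading it plus a few hundred MB of activations accounts for the 3-4GB inference footprint.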
## Model Architecture

- **Encoder**: Retention-based Transformer (MAE Autobin)
  - Auto-discretization embedding layer
  - 40 retention layers with 48 attention heads each
  - Hidden dimension: 1536
  - Layer normalization and residual connections
- **Pre-training**: Masked Autoencoding (MAE)
  - Masks 50% of genes
  - Reconstructs masked gene expression
- **Output**: Gene-level embeddings + CLS token (1536-dimensional)

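The masking objective can be sketched in a few lines of numpy — a toy illustration of the MAE scheme described above, not the model's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 2048  # one cell's gene sequence, up to the max sequence length
expression = rng.poisson(2.0, size=n_genes).astype(np.float32)

# Hide 50% of gene positions from the encoder, as in MAE pre-training
mask = rng.random(n_genes) < 0.5
masked_input = expression.copy()
masked_input[mask] = 0.0

# The reconstruction loss compares predictions against the hidden values only
target = expression[mask]
```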
## Comparison with 80M Model

| Feature | 80M | 800M |
|---------|-----|------|
| Parameters | 80M | 800M |
| Hidden Dim | 1536 | 1536 |
| Layers | 2 | 40 |
| Heads | 48 | 48 |
| Genes | 27,855 | 24,072 |
| Memory (Inference) | ~1-2GB | ~3-4GB |
| Speed | Faster | Slower |
| Performance | Good | Better |

The 800M model provides significantly better representation quality due to its deeper architecture (40 layers vs 2 layers), at the cost of increased computational requirements.

## Files

- `config.json`: Model configuration
- `model.pt`: Model weights (PyTorch state dict, ~3.0GB)
- `README.md`: This file
- `.gitattributes`: Git LFS configuration

## Citation

If you use CellFM in your research, please cite:

```bibtex
@article{cellfm2024,
}
```

## References

- Original Repository: https://github.com/biomed-AI/CellFM
- PyTorch Version: https://github.com/biomed-AI/CellFM-torch
- Paper: [Link to paper when available]

## Notes

- This model was converted from the original MindSpore checkpoint
- The gene vocabulary (24,072 genes) differs from the 80M model's (27,855 genes)
- For best results, ensure your data preprocessing matches the model's expected input format
- Use `CellFMModel.prepare_data()` to automatically preprocess your data
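`prepare_data()` handles preprocessing for you; for intuition, here is a generic numpy sketch of the usual single-cell normalization steps (an assumption about typical pipelines, not the function's actual internals):

```python
import numpy as np

# Toy cells x genes count matrix standing in for adata.X
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(4, 6)).astype(np.float64)

# Scale each cell to a fixed total count, then log1p-transform --
# the standard normalize/log step most single-cell pipelines apply
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / np.maximum(totals, 1.0) * 1e4
logged = np.log1p(normalized)
```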