scbasset / README.md
ZhiyuanChen's picture
Upload folder using huggingface_hub
7144824 verified
---
language: dna
tags:
- Biology
- DNA
license: agpl-3.0
library_name: multimolecule
---
# scBasset
Sequence-based convolutional neural network for modeling single-cell ATAC-seq chromatin accessibility from DNA sequence.
## Disclaimer
This is an UNOFFICIAL implementation of [scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks](https://doi.org/10.1038/s41592-022-01562-8) by Han Yuan et al.
The OFFICIAL repository of scBasset is at [calico/scBasset](https://github.com/calico/scBasset).
> [!TIP]
> The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation.
**The team releasing scBasset did not write this model card for this model so this model card has been written by the MultiMolecule team.**
## Model Details
scBasset is a convolutional neural network (CNN) that predicts per-cell chromatin accessibility of a DNA peak sequence. The model consumes a fixed-length 1344 bp one-hot encoded DNA sequence and applies a pre-activation convolution stem, a reducing convolution tower, a pointwise convolution, and a dense bottleneck before a final cell-embedding layer that produces one accessibility logit per single cell.
scBasset uses a pre-activation block layout: each convolution block applies the activation (the sigmoid approximation of GELU, `sigmoid(1.702 * x) * x`) _before_ the convolution, then batch normalization and max pooling. The dense bottleneck flattens the convolution output in Keras channels-last (length-major) order; this ordering is load-bearing and is reconciled in the MultiMolecule implementation.
> [!IMPORTANT]
> The final cell-embedding (dense) layer of scBasset is **dataset-specific**: it has one row per single cell in the training atlas, so there is no dataset-independent foundation checkpoint. The shipped weights correspond to the **Buenrostro2018 hematopoiesis** tutorial dataset distributed by the scBasset authors, which has **2034 single cells** (so `num_labels = 2034`). A different scBasset dataset would have a different number of cells and a differently sized cell-embedding layer.
The cell-embedding layer is exposed through the shared [`SequencePredictionHead`][multimolecule.SequencePredictionHead]; the per-cell accessibility task is modeled as a binary problem (`problem_type="binary"`).
### Model Specification
- Architecture: 1 stem convolution + 6 tower convolutions + 1 pointwise convolution + dense bottleneck
- Stem: 288 filters, kernel 17, max-pool 3
- Tower: 6 blocks, filters 288/323/363/407/456/512, kernel 5, max-pool 2
- Pointwise: 256 filters, kernel 1
- Bottleneck (hidden size): 32
- Input length: 1344 bp
- Number of labels: 2034 single cells (Buenrostro2018 hematopoiesis, **dataset-specific**)
| Num Conv Layers | Hidden Size | Num Cells | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens |
| --------------- | ----------- | --------- | ------------------ | --------- | -------- | -------------- |
| 8 | 32 | 2034 | 4.59 | 0.95 | 0.47 | 1344 |
### Links
- **Code**: [multimolecule.scbasset](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/scbasset)
- **Weights**: [multimolecule/scbasset](https://huggingface.co/multimolecule/scbasset)
- **Data**: [scBasset Buenrostro2018 tutorial data](https://storage.googleapis.com/scbasset_tutorial_data/buen_ad_sc.h5ad)
- **Paper**: [scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks](https://doi.org/10.1038/s41592-022-01562-8)
- **Developed by**: Han Yuan, David R. Kelley
- **Original Repository**: [calico/scBasset](https://github.com/calico/scBasset)
## Usage
The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip:
```bash
pip install multimolecule
```
### Direct Use
#### Single-Cell Chromatin Accessibility Prediction
You can use this model directly to predict per-cell chromatin accessibility of a DNA peak sequence:
```python
>>> from multimolecule import DnaTokenizer, ScBassetForSequencePrediction
>>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/scbasset")
>>> model = ScBassetForSequencePrediction.from_pretrained("multimolecule/scbasset")
>>> input = tokenizer("ACGT" * 336, return_tensors="pt")
>>> output = model(**input)
>>> output.logits.shape
torch.Size([1, 2034])
```
Each of the 2034 logits is a per-cell accessibility score for the Buenrostro2018 hematopoiesis atlas.
## Training Details
scBasset was trained to predict the per-cell chromatin accessibility of DNA peak sequences across a single-cell ATAC-seq atlas.
### Training Data
The shipped converted checkpoint corresponds to the **Buenrostro2018 hematopoiesis** scBasset tutorial model (`buen_model_sc.h5`) distributed by the scBasset authors, trained on the Buenrostro et al. 2018 single-cell ATAC-seq hematopoiesis dataset (2034 single cells). Each 1344 bp peak is associated with a per-cell binary accessibility vector.
### Training Procedure
#### Pre-training
The model was trained to minimize a per-cell binary cross-entropy loss, comparing its predicted per-cell accessibility probabilities (sigmoid of the cell-embedding logits) against the observed single-cell ATAC-seq accessibility labels.
- Optimizer: Adam
- Loss: Per-cell binary cross-entropy
- Regularization: Batch normalization and dropout
- Activation: Sigmoid approximation of GELU (`sigmoid(1.702 * x) * x`)
## Citation
```bibtex
@article{yuan2022scbasset,
author = {Yuan, Han and Kelley, David R.},
title = {scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks},
journal = {Nature Methods},
volume = 19,
number = 9,
pages = {1088--1096},
year = 2022,
publisher = {Nature Publishing Group},
doi = {10.1038/s41592-022-01562-8}
}
```
> [!NOTE]
> The artifacts distributed in this repository are part of the MultiMolecule project.
> If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows:
```bibtex
@software{chen_2024_12638419,
author = {Chen, Zhiyuan and Zhu, Sophia Y.},
title = {MultiMolecule},
doi = {10.5281/zenodo.12638419},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.12638419},
year = 2024,
month = may,
day = 4
}
```
## Contact
Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card.
Please contact the authors of the [scBasset paper](https://doi.org/10.1038/s41592-022-01562-8) for questions or comments on the paper/model.
## License
This model implementation is licensed under the [GNU Affero General Public License](license.md).
For additional terms and clarifications, please refer to our [License FAQ](license-faq.md).
```spdx
SPDX-License-Identifier: AGPL-3.0-or-later
```