--- language: dna tags: - Biology - DNA license: agpl-3.0 library_name: multimolecule --- # scBasset Sequence-based convolutional neural network for modeling single-cell ATAC-seq chromatin accessibility from DNA sequence. ## Disclaimer This is an UNOFFICIAL implementation of [scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks](https://doi.org/10.1038/s41592-022-01562-8) by Han Yuan et al. The OFFICIAL repository of scBasset is at [calico/scBasset](https://github.com/calico/scBasset). > [!TIP] > The MultiMolecule team has confirmed that the provided model and checkpoints are producing the same intermediate representations as the original implementation. **The team releasing scBasset did not write this model card for this model so this model card has been written by the MultiMolecule team.** ## Model Details scBasset is a convolutional neural network (CNN) that predicts per-cell chromatin accessibility of a DNA peak sequence. The model consumes a fixed-length 1344 bp one-hot encoded DNA sequence and applies a pre-activation convolution stem, a reducing convolution tower, a pointwise convolution, and a dense bottleneck before a final cell-embedding layer that produces one accessibility logit per single cell. scBasset uses a pre-activation block layout: each convolution block applies the activation (the sigmoid approximation of GELU, `sigmoid(1.702 * x) * x`) _before_ the convolution, then batch normalization and max pooling. The dense bottleneck flattens the convolution output in Keras channels-last (length-major) order; this ordering is load-bearing and is reconciled in the MultiMolecule implementation. > [!IMPORTANT] > The final cell-embedding (dense) layer of scBasset is **dataset-specific**: it has one row per single cell in the training atlas, so there is no dataset-independent foundation checkpoint. The shipped weights correspond to the **Buenrostro2018 hematopoiesis** tutorial dataset distributed by the scBasset authors, which has **2034 single cells** (so `num_labels = 2034`). A different scBasset dataset would have a different number of cells and a differently sized cell-embedding layer. The cell-embedding layer is exposed through the shared [`SequencePredictionHead`][multimolecule.SequencePredictionHead]; the per-cell accessibility task is modeled as a binary problem (`problem_type="binary"`). ### Model Specification - Architecture: 1 stem convolution + 6 tower convolutions + 1 pointwise convolution + dense bottleneck - Stem: 288 filters, kernel 17, max-pool 3 - Tower: 6 blocks, filters 288/323/363/407/456/512, kernel 5, max-pool 2 - Pointwise: 256 filters, kernel 1 - Bottleneck (hidden size): 32 - Input length: 1344 bp - Number of labels: 2034 single cells (Buenrostro2018 hematopoiesis, **dataset-specific**) | Num Conv Layers | Hidden Size | Num Cells | Num Parameters (M) | FLOPs (G) | MACs (G) | Max Num Tokens | | --------------- | ----------- | --------- | ------------------ | --------- | -------- | -------------- | | 8 | 32 | 2034 | 4.59 | 0.95 | 0.47 | 1344 | ### Links - **Code**: [multimolecule.scbasset](https://github.com/DLS5-Omics/multimolecule/tree/master/multimolecule/models/scbasset) - **Weights**: [multimolecule/scbasset](https://huggingface.co/multimolecule/scbasset) - **Data**: [scBasset Buenrostro2018 tutorial data](https://storage.googleapis.com/scbasset_tutorial_data/buen_ad_sc.h5ad) - **Paper**: [scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks](https://doi.org/10.1038/s41592-022-01562-8) - **Developed by**: Han Yuan, David R. Kelley - **Original Repository**: [calico/scBasset](https://github.com/calico/scBasset) ## Usage The model file depends on the [`multimolecule`](https://multimolecule.danling.org) library. You can install it using pip: ```bash pip install multimolecule ``` ### Direct Use #### Single-Cell Chromatin Accessibility Prediction You can use this model directly to predict per-cell chromatin accessibility of a DNA peak sequence: ```python >>> from multimolecule import DnaTokenizer, ScBassetForSequencePrediction >>> tokenizer = DnaTokenizer.from_pretrained("multimolecule/scbasset") >>> model = ScBassetForSequencePrediction.from_pretrained("multimolecule/scbasset") >>> input = tokenizer("ACGT" * 336, return_tensors="pt") >>> output = model(**input) >>> output.logits.shape torch.Size([1, 2034]) ``` Each of the 2034 logits is a per-cell accessibility score for the Buenrostro2018 hematopoiesis atlas. ## Training Details scBasset was trained to predict the per-cell chromatin accessibility of DNA peak sequences across a single-cell ATAC-seq atlas. ### Training Data The shipped converted checkpoint corresponds to the **Buenrostro2018 hematopoiesis** scBasset tutorial model (`buen_model_sc.h5`) distributed by the scBasset authors, trained on the Buenrostro et al. 2018 single-cell ATAC-seq hematopoiesis dataset (2034 single cells). Each 1344 bp peak is associated with a per-cell binary accessibility vector. ### Training Procedure #### Pre-training The model was trained to minimize a per-cell binary cross-entropy loss, comparing its predicted per-cell accessibility probabilities (sigmoid of the cell-embedding logits) against the observed single-cell ATAC-seq accessibility labels. - Optimizer: Adam - Loss: Per-cell binary cross-entropy - Regularization: Batch normalization and dropout - Activation: Sigmoid approximation of GELU (`sigmoid(1.702 * x) * x`) ## Citation ```bibtex @article{yuan2022scbasset, author = {Yuan, Han and Kelley, David R.}, title = {scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks}, journal = {Nature Methods}, volume = 19, number = 9, pages = {1088--1096}, year = 2022, publisher = {Nature Publishing Group}, doi = {10.1038/s41592-022-01562-8} } ``` > [!NOTE] > The artifacts distributed in this repository are part of the MultiMolecule project. > If you use MultiMolecule in your research, you must cite the MultiMolecule project as follows: ```bibtex @software{chen_2024_12638419, author = {Chen, Zhiyuan and Zhu, Sophia Y.}, title = {MultiMolecule}, doi = {10.5281/zenodo.12638419}, publisher = {Zenodo}, url = {https://doi.org/10.5281/zenodo.12638419}, year = 2024, month = may, day = 4 } ``` ## Contact Please use GitHub issues of [MultiMolecule](https://github.com/DLS5-Omics/multimolecule/issues) for any questions or comments on the model card. Please contact the authors of the [scBasset paper](https://doi.org/10.1038/s41592-022-01562-8) for questions or comments on the paper/model. ## License This model implementation is licensed under the [GNU Affero General Public License](license.md). For additional terms and clarifications, please refer to our [License FAQ](license-faq.md). ```spdx SPDX-License-Identifier: AGPL-3.0-or-later ```