---
license: cc-by-nc-nd-4.0
language:
- en
base_model: EleutherAI/pythia-1b
library_name: transformers
tags:
- biology
- scRNAseq
---

# Overview
This is the C2S-Scale-1B pretrained model, based on the Pythia-1B architecture developed by EleutherAI and fine-tuned using the Cell2Sentence (C2S) framework on a wide array of single-cell RNA sequencing (scRNA-seq) datasets from CellxGene and the Human Cell Atlas. Cell2Sentence adapts large language models (LLMs) to single-cell biology by converting scRNA-seq profiles into "cell sentences": sequences of gene names ordered by expression level. The model has been trained on a broad range of single-cell and multi-cell tasks, making it a versatile tool for single-cell and multi-cell analyses.

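The ordering step behind a cell sentence can be sketched as follows. This is a minimal illustration only: the function name, the nonzero-expression filter, and the tie-breaking behavior are assumptions, not the released C2S preprocessing code.

```python
import numpy as np

def to_cell_sentence(expression, gene_names, max_genes=2048):
    """Rank genes by descending expression and keep the expressed (nonzero) ones."""
    order = np.argsort(expression)[::-1]
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    return " ".join(ranked[:max_genes])

expr = np.array([0.0, 5.2, 1.3, 3.7])
genes = ["GENE_A", "CD3D", "ACTB", "MS4A1"]
print(to_cell_sentence(expr, genes))  # CD3D MS4A1 ACTB
```

The resulting space-separated gene list is what the language model sees as a "sentence" for a single cell.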
# Training Data
This model was trained on over 57 million human and mouse cells drawn from more than 800 single-cell RNA sequencing datasets in CellxGene and the Human Cell Atlas. This corpus covers a broad range of cell types and conditions across multiple tissues in both human and mouse.

The model was trained with a variable number of genes per cell sentence and a maximum context length of 8192 tokens. The context length of the default Pythia model was extended using rotary positional embeddings prior to C2S training.
- Cells: Each multi-cell training sample contained between 5 and 20 cells, with the same number of genes for every cell in the sample.
- Genes: Single-cell samples contained between 100 and 2048 genes per cell sentence; multi-cell samples contained between 100 and 400 genes per cell.

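The sample-layout constraints above can be expressed as a small sketch (the function name and random draw are illustrative; they only encode the stated bounds, not the actual sampling procedure used in training):

```python
import random

def sample_multicell_shape(rng=None):
    """Draw a multi-cell sample layout matching the stated constraints:
    5-20 cells per sample and 100-400 genes per cell, with the same
    gene count shared by every cell in the sample."""
    rng = rng or random.Random(0)
    n_cells = rng.randint(5, 20)
    genes_per_cell = rng.randint(100, 400)  # shared across all cells in the sample
    return [genes_per_cell] * n_cells

shape = sample_multicell_shape()
assert 5 <= len(shape) <= 20
assert all(100 <= g <= 400 for g in shape)
assert len(set(shape)) == 1  # same gene count for every cell
```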
# Tasks
This model is designed for the following tasks:

## Single-Cell Tasks
- Unconditional single-cell generation: Generate single cell sentences unconditionally.
- Cell type prediction: Predict the cell type of a given single cell.
- Cell type-conditioned generation: Generate a single cell sentence conditioned on a specific cell type.

## Multi-Cell Tasks
- Unconditional multi-cell generation: Generate multiple cell sentences unconditionally.
- Tissue prediction: Predict the tissue of origin for a group of cells.
- Cell type prediction: Predict the cell type of each cell in a group of cells.
- Tissue-conditioned multi-cell generation: Generate multiple cell sentences conditioned on a specific tissue.
- Cell type-conditioned multi-cell generation: Generate multiple cell sentences conditioned on the cell type of each individual cell.
- Multi-cells to abstract: Generate a research paper abstract from the provided multi-cell sentences.
- Abstract to multi-cells: Generate multiple cell sentences from a given research paper abstract.

## Gene Set Tasks
- Gene set name to genes: Generate an alphabetical list of genes given a gene set name.
- Genes to gene set name: Generate the name of a gene set given an alphabetical list of its genes.

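Since the card declares `library_name: transformers`, the model can presumably be loaded with the standard causal-LM classes. A minimal sketch: the `repo_id` argument is a placeholder for this model's Hugging Face repo, and the prompt wording for each task is an assumption (see the Cell2Sentence GitHub repo below for the prompt formats actually used in training).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_text(repo_id: str, prompt: str, max_new_tokens: int = 256) -> str:
    """Load the C2S model as a standard causal LM and sample a completion,
    e.g. a generated cell sentence or a predicted cell type label."""
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sampling for generation tasks; greedy may suit prediction tasks
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```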
# Cell2Sentence Links
- GitHub: https://github.com/vandijklab/cell2sentence
- Paper: https://www.biorxiv.org/content/10.1101/2023.09.11.557287v3

# Pythia Links
- Paper: https://arxiv.org/pdf/2304.01373
- Hugging Face: https://huggingface.co/EleutherAI/pythia-1b