Christina Theodoris committed
Commit: c34ead6
1 Parent(s): d468697
Add further explanation regarding input file format for transcriptome tokenizer
examples/tokenizing_scRNAseq_data.ipynb
CHANGED
|
@@ -17,7 +17,7 @@
|
|
| 17 |
"source": [
|
| 18 |
"#### Input data is a directory with .loom files containing raw counts from single cell RNAseq data, including all genes detected in the transcriptome without feature selection. \n",
|
| 19 |
"\n",
|
| 20 |
-
"#### Genes should be labeled with Ensembl IDs (row attribute \"ensembl_id\"), which provide a unique identifer for conversion to tokens.\n",
|
| 21 |
"\n",
|
| 22 |
"#### No cell metadata is required, but custom cell attributes may be passed onto the tokenized dataset by providing a dictionary of custom attributes to be added, which is formatted as loom_col_attr_name : desired_dataset_col_attr_name. For example, if the original .loom dataset has column attributes \"cell_type\" and \"organ_major\" and one would like to retain these attributes as labels in the tokenized dataset with the new names \"cell_type\" and \"organ\", respectively, the following custom attribute dictionary should be provided: {\"cell_type\": \"cell_type\", \"organ_major\": \"organ\"}. \n",
|
| 23 |
"\n",
|
|
|
|
| 17 |
"source": [
|
| 18 |
"#### Input data is a directory with .loom files containing raw counts from single cell RNAseq data, including all genes detected in the transcriptome without feature selection. \n",
|
| 19 |
"\n",
|
| 20 |
+
"#### Genes should be labeled with Ensembl IDs (row attribute \"ensembl_id\"), which provide a unique identifer for conversion to tokens. Cells should be labeled with the total read count in the cell (column attribute \"n_counts\") to be used for normalization.\n",
|
| 21 |
"\n",
|
| 22 |
"#### No cell metadata is required, but custom cell attributes may be passed onto the tokenized dataset by providing a dictionary of custom attributes to be added, which is formatted as loom_col_attr_name : desired_dataset_col_attr_name. For example, if the original .loom dataset has column attributes \"cell_type\" and \"organ_major\" and one would like to retain these attributes as labels in the tokenized dataset with the new names \"cell_type\" and \"organ\", respectively, the following custom attribute dictionary should be provided: {\"cell_type\": \"cell_type\", \"organ_major\": \"organ\"}. \n",
|
| 23 |
"\n",
|
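The custom attribute dictionary described in the notebook cell above maps each .loom column attribute name to the column name wanted in the tokenized dataset. The sketch below only illustrates that mapping on plain Python data; `rename_cell_attributes` and the example values are hypothetical and are not part of the Geneformer API or its internal implementation.

```python
def rename_cell_attributes(loom_col_attrs, custom_attr_name_dict):
    """Return only the mapped attributes, stored under their new names.

    Hypothetical helper for illustration; keys of custom_attr_name_dict are
    .loom column attribute names, values are the desired dataset column names.
    """
    return {
        new_name: loom_col_attrs[loom_name]
        for loom_name, new_name in custom_attr_name_dict.items()
    }


# Made-up column attributes for two cells.
loom_col_attrs = {
    "cell_type": ["T cell", "hepatocyte"],
    "organ_major": ["blood", "liver"],
    "n_counts": [1200, 3400],
}

# Same dictionary as in the notebook text: keep "cell_type", rename "organ_major".
custom_attr_name_dict = {"cell_type": "cell_type", "organ_major": "organ"}

dataset_cols = rename_cell_attributes(loom_col_attrs, custom_attr_name_dict)
print(dataset_cols)
```

In this illustration, attributes that are not listed in the dictionary (here "n_counts") are simply not copied over as labels.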
geneformer/tokenizer.py
CHANGED
|
@@ -1,6 +1,13 @@
|
|
| 1 |
"""
|
| 2 |
Geneformer tokenizer.
|
| 3 |
|
| 4 |
Usage:
|
| 5 |
from geneformer import TranscriptomeTokenizer
|
| 6 |
tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ_major"}, nproc=4)
|
|
|
|
| 1 |
"""
|
| 2 |
Geneformer tokenizer.
|
| 3 |
|
| 4 |
+
Input data:
|
| 5 |
+
Required format: raw counts scRNAseq data without feature selection as .loom file
|
| 6 |
+
Required row (gene) attribute: "ensembl_id"; Ensembl ID for each gene
|
| 7 |
+
Required col (cell) attribute: "n_counts"; total read counts in that cell
|
| 8 |
+
Optional col (cell) attribute: "filter_pass"; binary indicator of whether cell should be tokenized based on user-defined filtering criteria
|
| 9 |
+
Optional col (cell) attributes: any other cell metadata can be passed on to the tokenized dataset as a custom attribute dictionary as shown below
|
| 10 |
+
|
| 11 |
Usage:
|
| 12 |
from geneformer import TranscriptomeTokenizer
|
| 13 |
tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ_major"}, nproc=4)
|
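The input requirements listed in the docstring above can be sketched as follows: a minimal example, assuming a genes x cells raw count matrix held in NumPy, of how the "ensembl_id", "n_counts", and optional "filter_pass" attributes might be assembled before writing a .loom file. The helper `build_loom_attrs` and the `min_counts` threshold are hypothetical and stand in for whatever user-defined filtering criteria apply.

```python
import numpy as np


def build_loom_attrs(counts, ensembl_ids, min_counts=None):
    """Build row/column attribute dicts for a genes x cells raw count matrix.

    Hypothetical helper for illustration; not part of Geneformer.
    """
    counts = np.asarray(counts)
    if counts.shape[0] != len(ensembl_ids):
        raise ValueError("need exactly one Ensembl ID per gene (matrix row)")

    # Required row (gene) attribute.
    row_attrs = {"ensembl_id": np.array(ensembl_ids)}

    # Required col (cell) attribute: total read count in each cell.
    n_counts = counts.sum(axis=0)
    col_attrs = {"n_counts": n_counts}

    # Optional col (cell) attribute: binary indicator from a user-defined filter.
    # The threshold here is an assumed example criterion.
    if min_counts is not None:
        col_attrs["filter_pass"] = (n_counts >= min_counts).astype(int)

    return row_attrs, col_attrs


# Two genes (rows) by three cells (columns) of made-up raw counts.
counts = np.array([[5, 0, 2],
                   [1, 3, 0]])
row_attrs, col_attrs = build_loom_attrs(
    counts, ["ENSG00000141510", "ENSG00000012048"], min_counts=4
)
print(col_attrs["n_counts"], col_attrs["filter_pass"])

# If loompy is installed, the file could then be written with, for example:
#   import loompy
#   loompy.create("example.loom", counts, row_attrs, col_attrs)
```

Computing "n_counts" from the full raw matrix, before any gene filtering, keeps it consistent with the docstring's description of total read counts per cell.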