Add further explanation regarding input file format for transcriptome tokenizer

Files changed (2)

examples/tokenizing_scRNAseq_data.ipynb CHANGED

@@ -17,7 +17,7 @@
    "source": [
     "#### Input data is a directory with .loom files containing raw counts from single cell RNAseq data, including all genes detected in the transcriptome without feature selection. \n",
     "\n",
-    "#### Genes should be labeled with Ensembl IDs (row attribute \"ensembl_id\"), which provide a unique identifier for conversion to tokens.\n",
     "\n",
     "#### No cell metadata is required, but custom cell attributes may be passed onto the tokenized dataset by providing a dictionary of custom attributes to be added, which is formatted as loom_col_attr_name : desired_dataset_col_attr_name. For example, if the original .loom dataset has column attributes \"cell_type\" and \"organ_major\" and one would like to retain these attributes as labels in the tokenized dataset with the new names \"cell_type\" and \"organ\", respectively, the following custom attribute dictionary should be provided: {\"cell_type\": \"cell_type\", \"organ_major\": \"organ\"}. \n",
     "\n",

    "source": [
     "#### Input data is a directory with .loom files containing raw counts from single cell RNAseq data, including all genes detected in the transcriptome without feature selection. \n",
     "\n",
+    "#### Genes should be labeled with Ensembl IDs (row attribute \"ensembl_id\"), which provide a unique identifier for conversion to tokens. Cells should be labeled with the total read count in the cell (column attribute \"n_counts\") to be used for normalization.\n",
     "\n",
     "#### No cell metadata is required, but custom cell attributes may be passed onto the tokenized dataset by providing a dictionary of custom attributes to be added, which is formatted as loom_col_attr_name : desired_dataset_col_attr_name. For example, if the original .loom dataset has column attributes \"cell_type\" and \"organ_major\" and one would like to retain these attributes as labels in the tokenized dataset with the new names \"cell_type\" and \"organ\", respectively, the following custom attribute dictionary should be provided: {\"cell_type\": \"cell_type\", \"organ_major\": \"organ\"}. \n",
     "\n",

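The custom attribute dictionary described in the notebook text maps a .loom column attribute name to the column name wanted in the tokenized dataset. The sketch below only illustrates that mapping on plain Python dictionaries. The helper `rename_cell_attrs` and the sample values are hypothetical and are not part of Geneformer; the tokenizer's own handling may differ.

```python
def rename_cell_attrs(loom_col_attrs, custom_attr_name_dict):
    """Return the mapped column attributes under their new names.

    Hypothetical helper for illustration only (not Geneformer code).
    loom_col_attrs: dict of loom column attribute name -> per-cell values
    custom_attr_name_dict: dict of loom name -> desired dataset column name
    """
    missing = [name for name in custom_attr_name_dict if name not in loom_col_attrs]
    if missing:
        raise KeyError(f"column attributes not found in .loom file: {missing}")
    return {
        new_name: list(loom_col_attrs[old_name])
        for old_name, new_name in custom_attr_name_dict.items()
    }


# Made-up cell metadata for two cells.
loom_col_attrs = {
    "cell_type": ["T cell", "hepatocyte"],
    "organ_major": ["blood", "liver"],
    "n_counts": [1200, 3400],
}

# Same dictionary as in the notebook text: keep "cell_type", rename "organ_major".
custom_attr_name_dict = {"cell_type": "cell_type", "organ_major": "organ"}

dataset_cols = rename_cell_attrs(loom_col_attrs, custom_attr_name_dict)
print(sorted(dataset_cols))
```

In this sketch only the attributes named in the dictionary are carried over, so "organ_major" appears as "organ" and unmapped attributes are left out.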
geneformer/tokenizer.py CHANGED

@@ -1,6 +1,13 @@
 """
 Geneformer tokenizer.
 Usage:
   from geneformer import TranscriptomeTokenizer
   tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ_major"}, nproc=4)

 """
 Geneformer tokenizer.
+Input data:
+Required format: raw counts scRNAseq data without feature selection as a .loom file
+Required row (gene) attribute: "ensembl_id"; Ensembl ID for each gene
+Required col (cell) attribute: "n_counts"; total read counts in that cell
+Optional col (cell) attribute: "filter_pass"; binary indicator of whether cell should be tokenized based on user-defined filtering criteria
+Optional col (cell) attributes: any other cell metadata can be passed on to the tokenized dataset as a custom attribute dictionary as shown below
 Usage:
   from geneformer import TranscriptomeTokenizer
   tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ_major"}, nproc=4)
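The required attributes listed in the docstring can be assembled before writing the .loom file. The following is a minimal sketch, assuming a genes x cells raw count matrix held as a list of rows. The helper `build_loom_attrs` and the toy counts are hypothetical and not part of Geneformer. It derives "n_counts" as the per-cell column sum and optionally attaches "filter_pass" as a 0/1 indicator.

```python
def build_loom_attrs(counts, ensembl_ids, filter_pass=None):
    """Build row/column attribute dicts matching the documented input format.

    Hypothetical helper for illustration only (not Geneformer code).
    counts: genes x cells raw counts, as a list of equally long rows
    ensembl_ids: one Ensembl ID per gene (row)
    filter_pass: optional per-cell booleans from user-defined filtering
    """
    if len(counts) != len(ensembl_ids):
        raise ValueError("need exactly one Ensembl ID per gene (row)")
    n_cells = len(counts[0]) if counts else 0
    if any(len(row) != n_cells for row in counts):
        raise ValueError("all rows must have one value per cell")

    # "n_counts": total counts in each cell, i.e. the sum down each column.
    n_counts = [sum(row[j] for row in counts) for j in range(n_cells)]

    row_attrs = {"ensembl_id": list(ensembl_ids)}
    col_attrs = {"n_counts": n_counts}

    if filter_pass is not None:
        if len(filter_pass) != n_cells:
            raise ValueError("filter_pass needs one entry per cell")
        # Stored as a binary indicator (1 = tokenize this cell).
        col_attrs["filter_pass"] = [1 if keep else 0 for keep in filter_pass]
    return row_attrs, col_attrs


# Toy data: 3 genes x 3 cells; the counts are made up for illustration.
counts = [
    [3, 0, 1],
    [0, 2, 4],
    [5, 1, 0],
]
ensembl_ids = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"]

row_attrs, col_attrs = build_loom_attrs(counts, ensembl_ids, filter_pass=[True, False, True])
print(col_attrs["n_counts"])  # [8, 3, 5]
```

With loompy installed, dictionaries like these (converted to NumPy arrays, along with the count matrix) are the kind of arguments that `loompy.create` takes as `row_attrs` and `col_attrs`; that step is omitted here to keep the sketch dependency-free.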