Parameter tuning for tokenizer

#132
by Theodore94 - opened

Hello, first of all thank you for your wonderful work. I have a lot of confusion about the tokenizing process.
I converted the barcodes TSV, features TSV, and matrix (MTX format) files into a loom file, but after tokenizing, the resulting Arrow dataset file is only 464 bytes.
I then downloaded a loom file from GEO and found that after tokenizing it is also only a few hundred bytes. My own data is currently in FASTQ format, so do you have any good suggestions for the input data?
Also, regarding {"cell_type": "cell_type", "organ_major": "organ_major"}: since it is optional, why does the code not work when I remove this parameter? Looking forward to your reply!

Thank you for your interest in Geneformer! Please pull the current version. The default is no custom cell attribute dictionary so there should not be a problem with removing this. Regarding the input format, the input file should be a .loom file in the format as described in detail in the transcriptome tokenizing example in the repository. If you believe you are providing a file in this format and you are still having issues, please respond with a sample of the column and row attributes and their values and a sample of the value matrix so I can confirm if there is anything unexpected about the format.
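
As a rough sketch of the expected layout (the attribute names "ensembl_id" and "n_counts" come from the tokenizing example mentioned above; the sample values and sizes here are made up for illustration), the loom file should look something like this before tokenizing:

```python
import numpy as np

# Sketch of the attribute layout TranscriptomeTokenizer expects in a .loom file.
# Row (gene) attributes must include "ensembl_id"; column (cell) attributes
# must include "n_counts". Values below are illustrative only.
row_attrs = {"ensembl_id": np.array(["ENSG00000186092", "ENSG00000284733"])}
col_attrs = {"n_counts": np.array([5021, 4876, 6130])}
matrix = np.random.poisson(1.0, size=(2, 3))  # genes x cells, raw integer counts

# Sanity checks one might run before tokenizing:
assert "ensembl_id" in row_attrs and "n_counts" in col_attrs
assert matrix.shape == (len(row_attrs["ensembl_id"]), len(col_attrs["n_counts"]))
assert np.issubdtype(matrix.dtype, np.integer)  # raw counts, not normalized
print("format looks OK")
```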

ctheodoris changed discussion status to closed

git clone https://huggingface.co/ctheodoris/Geneformer

cd Geneformer

pip install .

I installed it using these commands; what should I do to update to the latest version?

Yes, from inside the Geneformer directory, the command would be:
git pull

Thank you very much for your patient guidance. Here are the contents of my loom file and the code I used. Please check why the Arrow file is only 464 bytes.
Gene count: 36,601

Cell count: 13,770

All row property names:

['ensembl_id', 'feature_types', 'var_names']

All column property names:

['n_counts', 'obs_names', 'sample_type']

Gene names: ['MIR1302-2HG' 'FAM138A' 'OR4F5' ... 'AC007325.1' 'AC007325.4' 'AC007325.2']

Cell names: ['AAACCCAAGCTGCCTG-1' 'AAACCCACACGCGTCA-1' 'AAACCCAGTACAATAG-1' ... 'TTTGTTGTCGTTGTTT-1' 'TTTGTTGTCTGTCAGA-1' 'TTTGTTGTCTGTCGTC-1']

Gene expression matrix:

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

Total read count per cell: [20354.8665 14138. ... 19214. 17543. 6870.]

tk = TranscriptomeTokenizer({"obs_names": "obs_names", "sample_type": "sample_type"}, nproc=4)
tk.tokenize_data("/home/data/t110343/test/data_test.loom", "/home/data/t110343/test", "mydata")

What do your Ensembl IDs look like?

Also, I wanted to emphasize that the input should be raw counts. I mention this because your counts and n_counts values imply the counts are not integers and may not be raw counts.
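
One quick way to check this point yourself (a minimal sketch, not from the repository; the helper name and toy matrices are made up) is to test whether every value in the matrix is a non-negative integer, even if it is stored as a float:

```python
import numpy as np

# A normalized matrix (non-integer values) vs. a raw count matrix.
normalized = np.array([[0.0, 2.3], [1.7, 0.0]])
raw = np.array([[0.0, 2.0], [3.0, 0.0]])

def looks_like_raw_counts(m):
    # Raw counts are non-negative and integral (even when stored as floats).
    return bool(np.all(m >= 0) and np.all(np.mod(m, 1) == 0))

print(looks_like_raw_counts(normalized))  # False
print(looks_like_raw_counts(raw))         # True
```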

Thank you very much for your help. Now that I have converted the Seurat object to a loom file, I find that there is no Ensembl ID. I wonder how the loom file can meet the following two requirements:
Required row (gene) attribute: "ensembl_id"; Ensembl ID for each gene
Required col (cell) attribute: "n_counts"; total read counts in that cell
This question has puzzled me for several days, and I am looking forward to your answer.

Thank you for your question. As indicated in the example for tokenizing transcriptomes, you can use Ensembl Biomart to convert other types of gene annotations to Ensembl IDs (https://useast.ensembl.org/info/data/biomart/index.html).
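
As an illustration (not part of the repository), one way to apply a Biomart export is to build a symbol-to-Ensembl lookup table from the downloaded TSV and use it to populate the "ensembl_id" row attribute. The column headers below ("Gene stable ID", "Gene name") and the in-memory TSV stand in for a real Biomart export file:

```python
import csv
import io

# Stand-in for a downloaded Biomart export (tab-separated); in practice you
# would open the exported file instead of this io.StringIO buffer.
biomart_tsv = io.StringIO(
    "Gene stable ID\tGene name\n"
    "ENSG00000284662\tOR4F16\n"
    "ENSG00000186092\tOR4F5\n"
)

# Build a gene-symbol -> Ensembl ID lookup table.
symbol_to_ensembl = {}
for row in csv.DictReader(biomart_tsv, delimiter="\t"):
    symbol_to_ensembl[row["Gene name"]] = row["Gene stable ID"]

# Map the loom file's gene symbols; symbols without a match get "" and
# can be filtered out before tokenizing.
gene_symbols = ["OR4F5", "FAM138A", "OR4F16"]
ensembl_ids = [symbol_to_ensembl.get(sym, "") for sym in gene_symbols]
print(ensembl_ids)  # ['ENSG00000186092', '', 'ENSG00000284662']
```

The resulting list could then be written back to the loom file as the "ensembl_id" row attribute (e.g. with loompy, data.ra["ensembl_id"] = ensembl_ids).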

The n_counts are just a sum of the total number of read counts in that cell. Again, please note that the input should be the raw count matrix.

You could do something like this:

import loompy as lp
import numpy as np

with lp.connect(input_file) as data:
    total_count = []

    # scan along axis=1 (cells) in chunks that fit in memory
    for (ix, selection, view) in data.scan(axis=1):
        # sum raw counts over all genes for each cell in this chunk
        total_count_view = np.sum(view[:, :], axis=0)
        total_count += total_count_view.tolist()

    # store the per-cell totals as the "n_counts" column attribute
    data.ca["n_counts"] = total_count
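
For a count matrix small enough to hold in memory, the same per-cell totals are just a sum over the gene axis (loom files store genes as rows and cells as columns). A minimal numpy sketch with a toy matrix:

```python
import numpy as np

# Toy genes-x-cells raw count matrix (3 genes, 4 cells).
counts = np.array([
    [0, 2, 1, 0],
    [3, 0, 0, 5],
    [1, 1, 0, 0],
])

# n_counts: total reads in each cell = sum over the gene axis.
n_counts = counts.sum(axis=0)
print(n_counts)  # [4 3 1 5]
```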
