# GeneMamba2-24l-512d
This repository contains a GeneMamba checkpoint plus full usage assets:

- model weights (`model.safetensors`)
- custom modeling/config files for `trust_remote_code=True`
- preprocessing example from h5ad to `input_ids`
- tokenizer assets and id mapping files
## 1) Input format (very important)
GeneMamba input is a list of ranked gene token IDs per cell:

- Start from one cell's expression vector
- Keep genes with expression > 0
- Sort genes by expression, descending
- Convert each gene ID (Ensembl, e.g. `ENSG00000000003`) to a token ID
- Use the resulting list as `input_ids`
Each sample is one list of integers:

```json
{"input_ids": [145, 2088, 531, 91, ...]}
```
For batch input, the shape is typically `(batch_size, seq_len)` after padding/truncation.
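The ranking steps above can be sketched in plain Python. Everything here is a toy stand-in: `cell_to_input_ids` is a hypothetical helper, the gene and token IDs are made up, and the dict-style mapping stands in for whatever the real tokenizer assets provide; `[UNK]` = 0 follows the special-token table in the next section.

```python
# Hypothetical sketch: one cell's expression vector -> ranked token IDs.
UNK_ID = 0  # [UNK] token id, per this repo's special-token table

def cell_to_input_ids(expr, gene2id):
    """expr: dict of gene ID -> expression value for one cell.
    gene2id: dict of gene ID -> token ID."""
    # keep genes with expression > 0, sorted by expression descending
    ranked = sorted(
        (g for g, v in expr.items() if v > 0),
        key=lambda g: expr[g],
        reverse=True,
    )
    # map each gene to its token ID; unknown genes fall back to [UNK]
    return [gene2id.get(g, UNK_ID) for g in ranked]

# toy example with made-up gene IDs and token IDs
expr = {"ENSG_A": 5.0, "ENSG_B": 0.0, "ENSG_C": 9.2}
gene2id = {"ENSG_A": 145, "ENSG_C": 531}
print(cell_to_input_ids(expr, gene2id))  # [531, 145]
```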
## 2) Where the tokenizer and id mappings come from

- Main tokenizer used for model inference: `tokenizer.json`
- Original full tokenizer table: `tokenizer_assets/gene_tokenizer.json`
- Gene symbol -> token id mapping: `tokenizer_assets/symbol2id.pkl`
- Token id -> gene symbol mapping: `tokenizer_assets/id2symbol.pkl`
Special tokens: `[UNK]` = 0, `[PAD]` = 1
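Padding a batch to a uniform `(batch_size, seq_len)` shape can be sketched with the `[PAD]` id above; `pad_batch` is a hypothetical helper, not a function shipped with this repo.

```python
# Hypothetical sketch: right-pad (and truncate) ranked ID lists so a
# batch has uniform shape (batch_size, seq_len).
PAD_ID = 1  # [PAD] token id, per this repo's special-token table

def pad_batch(batch, max_len=None):
    """batch: list of input_ids lists; pads to max_len (default: longest)."""
    if max_len is None:
        max_len = max(len(ids) for ids in batch)
    return [
        ids[:max_len] + [PAD_ID] * (max_len - len(ids[:max_len]))
        for ids in batch
    ]

print(pad_batch([[145, 2088, 531], [91]]))
# [[145, 2088, 531], [91, 1, 1]]
```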
## 3) Preprocess your data

See the script `examples/00_preprocess_to_input_ids.py`. Example:

```bash
python examples/00_preprocess_to_input_ids.py \
  --h5ad /path/to/your_data.h5ad \
  --tokenizer_json tokenizer.json \
  --output_arrow ./my_data/sorted_gene_token_ids.arrow
```
The output Arrow file has a single column: `input_ids`.
## 4) Load the model and extract embeddings

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "mineself2016/GeneMamba2-24l-512d",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mineself2016/GeneMamba2-24l-512d",
    trust_remote_code=True,
)
```
For a more complete example, see `examples/01_extract_embeddings.py`.
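One common way to turn per-token hidden states into a single cell embedding is a mean over non-`[PAD]` positions. The sketch below is a dependency-free stand-in: in practice the hidden states would come from the model's forward pass (see `examples/01_extract_embeddings.py` for the real pipeline), and `masked_mean_pool` is a hypothetical helper, not part of this repo.

```python
# Hypothetical sketch: mean-pool per-token vectors, skipping [PAD] positions.
PAD_ID = 1  # [PAD] token id, per this repo's special-token table

def masked_mean_pool(hidden_states, input_ids):
    """hidden_states: list of per-token vectors (lists of floats);
    input_ids: the matching token IDs. Returns one pooled vector."""
    kept = [h for h, t in zip(hidden_states, input_ids) if t != PAD_ID]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

# toy 2-dim hidden states for a 3-token cell whose last token is [PAD]
hs = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(masked_mean_pool(hs, [145, 531, 1]))  # [2.0, 3.0]
```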
## 6) Downstream task examples (added)

See `examples/downstream/README.md`.
Included downstream tasks:

- cell type annotation fine-tuning
- zero-shot embedding + logistic regression
- batch integration proxy evaluation
- original legacy downstream scripts from `gene_mamba/analysis/cell_type_annotation`
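The zero-shot route above (freeze the embeddings, fit a light classifier on top) can be illustrated with a toy nearest-centroid classifier; this stands in for the logistic regression used in `examples/downstream` so the sketch stays dependency-free, and both functions here are hypothetical.

```python
# Hypothetical sketch: nearest-centroid classification over frozen
# cell embeddings (a stand-in for the logistic-regression downstream task).

def fit_centroids(embeddings, labels):
    """Compute one mean embedding (centroid) per cell-type label."""
    sums, counts = {}, {}
    for e, y in zip(embeddings, labels):
        counts[y] = counts.get(y, 0) + 1
        acc = sums.setdefault(y, [0.0] * len(e))
        for d, v in enumerate(e):
            acc[d] += v
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, e):
    """Assign the label whose centroid is closest in squared distance."""
    def dist2(y):
        return sum((a - b) ** 2 for a, b in zip(e, centroids[y]))
    return min(centroids, key=dist2)

# toy 2-dim embeddings for two made-up cell types
cents = fit_centroids([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]], ["T", "T", "B"])
print(predict(cents, [4.0, 4.5]))  # B
```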
## 7) Source of preprocessing logic

The preprocessing/tokenization pipeline is aligned with assets from `/project/zhiwei/cq5/PythonWorkSpace/gene_mamba`.

Key references used:

- tokenizer: `gene_tokenizer.json`
- mappings: `symbol2id.pkl`, `id2symbol.pkl`
- dataset build logic (Arrow + `input_ids`): `utils.py` (`build_dataset`)