# GeneMamba2-24l-512d
This repository contains a GeneMamba checkpoint plus full usage assets:

- model weights (`model.safetensors`)
- custom modeling/config files for `trust_remote_code=True`
- preprocessing example from h5ad to `input_ids`
- tokenizer assets and id mapping files
## 1) Input format (very important)
GeneMamba input is a list of ranked gene token IDs per cell:

- Start from one cell's expression vector
- Keep genes with expression > 0
- Sort genes by expression, descending
- Convert each gene ID (Ensembl, e.g. `ENSG00000000003`) to a token ID
- Use the resulting list as `input_ids`
Each sample is one list of integers:

```json
{"input_ids": [145, 2088, 531, 91, ...]}
```
For batch input, the shape is typically `(batch_size, seq_len)` after padding/truncation.
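The ranking steps above can be sketched in plain Python. Everything here is a toy stand-in: `cell_to_input_ids` is a hypothetical helper, the gene and token IDs are made up, and the dict-style mapping stands in for whatever the real tokenizer assets provide; `[UNK]` = 0 follows the special-token table in the next section.

```python
# Hypothetical sketch: one cell's expression vector -> ranked token IDs.
UNK_ID = 0  # [UNK] token id, per this repo's special-token table

def cell_to_input_ids(expr, gene2id):
    """expr: dict of gene ID -> expression value for one cell.
    gene2id: dict of gene ID -> token ID."""
    # keep genes with expression > 0, sorted by expression descending
    ranked = sorted(
        (g for g, v in expr.items() if v > 0),
        key=lambda g: expr[g],
        reverse=True,
    )
    # map each gene to its token ID; unknown genes fall back to [UNK]
    return [gene2id.get(g, UNK_ID) for g in ranked]

# toy example with made-up gene IDs and token IDs
expr = {"ENSG_A": 5.0, "ENSG_B": 0.0, "ENSG_C": 9.2}
gene2id = {"ENSG_A": 145, "ENSG_C": 531}
print(cell_to_input_ids(expr, gene2id))  # [531, 145]
```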
## 2) Where the tokenizer and id mappings come from

- Main tokenizer used for model inference: `tokenizer.json`
- Original full tokenizer table: `tokenizer_assets/gene_tokenizer.json`
- Gene symbol -> token id mapping: `tokenizer_assets/symbol2id.pkl`
- Token id -> gene symbol mapping: `tokenizer_assets/id2symbol.pkl`
Special tokens: `[UNK]` = 0, `[PAD]` = 1
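Padding a batch to a uniform `(batch_size, seq_len)` shape can be sketched with the `[PAD]` id above; `pad_batch` is a hypothetical helper, not a function shipped with this repo.

```python
# Hypothetical sketch: right-pad (and truncate) ranked ID lists so a
# batch has uniform shape (batch_size, seq_len).
PAD_ID = 1  # [PAD] token id, per this repo's special-token table

def pad_batch(batch, max_len=None):
    """batch: list of input_ids lists; pads to max_len (default: longest)."""
    if max_len is None:
        max_len = max(len(ids) for ids in batch)
    return [
        ids[:max_len] + [PAD_ID] * (max_len - len(ids[:max_len]))
        for ids in batch
    ]

print(pad_batch([[145, 2088, 531], [91]]))
# [[145, 2088, 531], [91, 1, 1]]
```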
## 3) Preprocess your data

See the script `examples/00_preprocess_to_input_ids.py`. Example:

```bash
python examples/00_preprocess_to_input_ids.py \
  --h5ad /path/to/your_data.h5ad \
  --tokenizer_json tokenizer.json \
  --output_arrow ./my_data/sorted_gene_token_ids.arrow
```
The output Arrow file has a single column: `input_ids`.
## 4) Load the model and extract embeddings

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "mineself2016/GeneMamba2-24l-512d",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mineself2016/GeneMamba2-24l-512d",
    trust_remote_code=True,
)
```
For a more complete example, see `examples/01_extract_embeddings.py`.
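One common way to turn per-token hidden states into a single cell embedding is a mean over non-`[PAD]` positions. The sketch below is a dependency-free stand-in: in practice the hidden states would come from the model's forward pass (see `examples/01_extract_embeddings.py` for the real pipeline), and `masked_mean_pool` is a hypothetical helper, not part of this repo.

```python
# Hypothetical sketch: mean-pool per-token vectors, skipping [PAD] positions.
PAD_ID = 1  # [PAD] token id, per this repo's special-token table

def masked_mean_pool(hidden_states, input_ids):
    """hidden_states: list of per-token vectors (lists of floats);
    input_ids: the matching token IDs. Returns one pooled vector."""
    kept = [h for h, t in zip(hidden_states, input_ids) if t != PAD_ID]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

# toy 2-dim hidden states for a 3-token cell whose last token is [PAD]
hs = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(masked_mean_pool(hs, [145, 531, 1]))  # [2.0, 3.0]
```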
## 6) Downstream task examples (added)

See `examples/downstream/README.md`.
Included downstream tasks:

- cell type annotation fine-tuning
- zero-shot embedding + logistic regression
- batch integration proxy evaluation
- original legacy downstream scripts from `gene_mamba/analysis/cell_type_annotation`
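The zero-shot route above (freeze the embeddings, fit a light classifier on top) can be illustrated with a toy nearest-centroid classifier; this stands in for the logistic regression used in `examples/downstream` so the sketch stays dependency-free, and both functions here are hypothetical.

```python
# Hypothetical sketch: nearest-centroid classification over frozen
# cell embeddings (a stand-in for the logistic-regression downstream task).

def fit_centroids(embeddings, labels):
    """Compute one mean embedding (centroid) per cell-type label."""
    sums, counts = {}, {}
    for e, y in zip(embeddings, labels):
        counts[y] = counts.get(y, 0) + 1
        acc = sums.setdefault(y, [0.0] * len(e))
        for d, v in enumerate(e):
            acc[d] += v
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, e):
    """Assign the label whose centroid is closest in squared distance."""
    def dist2(y):
        return sum((a - b) ** 2 for a, b in zip(e, centroids[y]))
    return min(centroids, key=dist2)

# toy 2-dim embeddings for two made-up cell types
cents = fit_centroids([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]], ["T", "T", "B"])
print(predict(cents, [4.0, 4.5]))  # B
```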
## 7) Source of preprocessing logic

The preprocessing/tokenization pipeline is aligned with assets from `/project/zhiwei/cq5/PythonWorkSpace/gene_mamba`.

Key references used:

- tokenizer: `gene_tokenizer.json`
- mappings: `symbol2id.pkl`, `id2symbol.pkl`
- dataset build logic (Arrow + `input_ids`): `utils.py` (`build_dataset`)