GeneMamba2-24l-512d

This repository contains a GeneMamba checkpoint along with the full set of usage assets:

  • model weights (model.safetensors)
  • custom modeling/config files for trust_remote_code=True
  • preprocessing example from h5ad to input_ids
  • tokenizer assets and id mapping files

1) Input format (very important)

GeneMamba input is a list of ranked gene token IDs per cell:

  1. Start from one cell's expression vector
  2. Keep genes with expression > 0
  3. Sort the kept genes by expression, descending
  4. Convert each gene ID (Ensembl, e.g. ENSG00000000003) to its token ID
  5. Use the resulting list as input_ids

Each sample is one list of integers:

{"input_ids": [145, 2088, 531, 91, ...]}

For batch input, shape is typically (batch_size, seq_len) after padding/truncation.
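The ranking steps above can be sketched in a few lines of NumPy. The expression values and token IDs here are illustrative stand-ins, not real entries from the tokenizer:

```python
import numpy as np

# Hypothetical example: one cell's expression vector and the token ID
# of each gene (real IDs come from the tokenizer assets in this repo).
expr = np.array([0.0, 5.2, 1.1, 0.0, 3.3])            # expression per gene
gene_token_ids = np.array([145, 2088, 531, 91, 774])  # token ID per gene

# Keep expressed genes, then sort them by expression, descending.
expressed = np.nonzero(expr > 0)[0]
order = expressed[np.argsort(-expr[expressed])]
input_ids = gene_token_ids[order].tolist()
# Highest-expressed gene comes first: [2088, 774, 531]
```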

2) Where tokenizer and id mapping come from

  • Main tokenizer used for model inference: tokenizer.json
  • Original full tokenizer table: tokenizer_assets/gene_tokenizer.json
  • Gene symbol -> token id mapping: tokenizer_assets/symbol2id.pkl
  • Token id -> gene symbol mapping: tokenizer_assets/id2symbol.pkl

Special tokens:

  • [UNK] = 0
  • [PAD] = 1
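A minimal sketch of symbol-to-ID lookup with the [UNK] fallback. The inline dictionary is an illustrative stand-in; in practice you would load tokenizer_assets/symbol2id.pkl as shown in the comment:

```python
import pickle

# Real usage: load the mapping shipped with this repo.
# with open("tokenizer_assets/symbol2id.pkl", "rb") as f:
#     symbol2id = pickle.load(f)

# Illustrative stand-in mapping (real IDs come from symbol2id.pkl):
symbol2id = {"TSPAN6": 145, "DPM1": 2088}

UNK_ID, PAD_ID = 0, 1

def to_token_ids(symbols):
    # Genes missing from the vocabulary fall back to [UNK] = 0.
    return [symbol2id.get(s, UNK_ID) for s in symbols]

to_token_ids(["TSPAN6", "NOT_A_GENE"])  # -> [145, 0]
```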

3) Preprocess your data

See script:

  • examples/00_preprocess_to_input_ids.py

Example:

python examples/00_preprocess_to_input_ids.py \
  --h5ad /path/to/your_data.h5ad \
  --tokenizer_json tokenizer.json \
  --output_arrow ./my_data/sorted_gene_token_ids.arrow

The output Arrow file has a single column: input_ids.
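Because cells keep different numbers of expressed genes, batching requires padding the lists to a common length with [PAD] = 1. A minimal collate sketch in pure Python (the token IDs are illustrative):

```python
PAD_ID = 1  # [PAD] token ID from this repo's tokenizer

def pad_batch(batch, max_len=None):
    # Pad (and truncate, if max_len is given) a batch of ranked token-ID
    # lists to a rectangular (batch_size, seq_len) layout.
    max_len = max_len or max(len(ids) for ids in batch)
    return [ids[:max_len] + [PAD_ID] * (max_len - len(ids[:max_len]))
            for ids in batch]

pad_batch([[145, 2088, 531], [91, 774]])
# -> [[145, 2088, 531], [91, 774, 1]]
```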

4) Load model and extract embedding

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "mineself2016/GeneMamba2-24l-512d",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "mineself2016/GeneMamba2-24l-512d",
    trust_remote_code=True
)

More complete example:

  • examples/01_extract_embeddings.py
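One common way to turn the model's per-token output into a single cell embedding is a mask-aware mean pool over the last hidden state. The sketch below uses a random stand-in tensor instead of a real model call (shapes assumed: batch x seq_len x hidden, with hidden = 512 for this checkpoint); see examples/01_extract_embeddings.py for the actual extraction code:

```python
import torch

# Real usage: hidden = model(input_ids=batch).last_hidden_state
hidden = torch.randn(2, 8, 512)  # stand-in for (batch, seq_len, hidden)
mask = torch.ones(2, 8)          # 1 = real token, 0 = [PAD]

# Masked mean pool: average only over non-padding positions.
emb = (hidden * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
emb.shape  # one 512-dim embedding per cell
```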

5) Downstream task examples

See:

  • examples/downstream/README.md

Included downstream tasks:

  • cell type annotation fine-tuning
  • zero-shot embedding + logistic regression
  • batch integration proxy evaluation
  • original legacy downstream scripts from gene_mamba/analysis/cell_type_annotation
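The zero-shot task above can be sketched as a linear probe on frozen cell embeddings. The embeddings and labels here are random stand-ins; real ones would come from the model and your annotation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins: 100 cells x 512-dim embeddings, 3 hypothetical cell types.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 512))
y = rng.integers(0, 3, size=100)

# Fit a logistic-regression probe on the frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, y)
acc = clf.score(X, y)  # probe accuracy, between 0 and 1
```

In a real evaluation you would hold out a test split rather than scoring on the training cells.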

6) Source of preprocessing logic

The preprocessing/tokenization pipeline is aligned with assets from:

  • /project/zhiwei/cq5/PythonWorkSpace/gene_mamba

Key references used:

  • tokenizer: gene_tokenizer.json
  • mappings: symbol2id.pkl, id2symbol.pkl
  • dataset build logic (Arrow + input_ids): utils.py (build_dataset)