metadata
license: apache-2.0
tags:
  - sentence-transformers
  - modchembert
  - cheminformatics
  - smiles
  - molecular-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:19381001
  - loss:Matryoshka2dLoss
  - loss:MatryoshkaLoss
  - loss:TanimotoSentLoss
base_model: Derify/ModChemBERT-IR-BASE
widget:
  - source_sentence: COC(=O)c1sc(-c2ccc(C)cc2)c2c1NC(=O)C2(c1ccccc1)c1ccccc1
    sentences:
      - COC(=O)c1sc(Nc2ccc(Br)cn2)c2c1NC(=O)C2(c1ccccc1)c1ccccc1
      - CC[NH+]1CCOC(C(NN)c2ccccc2Br)C1
      - CC([NH2+]C(C)c1ccccc1)C(=O)P(C)C(C)(C)C
  - source_sentence: O=C(C=Cc1ccccc1)CC(=O)c1ccccc1O
    sentences:
      - COCCN(NCc1c(C)n(C(C)=O)c2ccc(OC)cc12)c1nccs1
      - CCN(CCC(N)=O)C(=O)c1ccc(=O)[nH]n1
      - N=CCC(=Cc1ccccc1)C(=O)COc1ccccc1O
  - source_sentence: COc1cccc(-c2sc3ccccc3c2C#N)c1
    sentences:
      - COCC(C)(C)c1cnnn1CCCI
      - N#Cc1c(-c2cccc(CN)c2)sc2ccccc12
      - COc1ccccc1NC(=O)c1cc(NCc2ccco2)cc[nH+]1
  - source_sentence: Nc1nc(-c2ccccc2)c2nc(N)c(N)nc2n1
    sentences:
      - >-
        CC(C)CC1NC(=O)C(Cc2ccccc2)NC(=O)c2ccc(cc2)CN(C(=O)CC2CCOCC2)CCCCNC(=O)C(C)NC1=O
      - O=Nc1cccc(OCCC(F)F)c1
      - CCCCNCc1nc(N)nc2nc(N)c(N)nc12
  - source_sentence: OCCCc1cc(F)cc(F)c1
    sentences:
      - CCC(C)C(=O)C1(C(NN)C(C)C)CCCC1
      - Cc1[nH]c2c(C(N)=O)ccc(C(=O)N3CCCCC3)c2c1C
      - Fc1cc(F)cc(-n2cc[o+]n2)c1
datasets:
  - Derify/pubchem_10m_genmol_similarity
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - spearman
co2_eq_emissions:
  emissions: 4039.5232961852894
  energy_consumed: 19.679154905865374
  source: codecarbon
  training_type: fine-tuning
  on_cloud: false
  cpu_model: AMD Ryzen 7 3700X 8-Core Processor
  ram_total_size: 62.69887161254883
  hours_used: 74.966
  hardware_used: 2 x NVIDIA GeForce RTX 3090
model-index:
  - name: 'ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer'
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: pubchem 10m genmol similarity (validation)
          type: pubchem_10m_genmol_similarity_validation
        metrics:
          - type: spearman
            value: 0.9881056976837288
            name: Spearman
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: pubchem 10m genmol similarity (test)
          type: pubchem_10m_genmol_similarity_test
        metrics:
          - type: spearman
            value: 0.988127555600757
            name: Spearman
new_version: Derify/ChemMRL

ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer

This is a Chem-MRL (sentence-transformers) model fine-tuned from Derify/ModChemBERT-IR-BASE on the pubchem_10m_genmol_similarity dataset. It maps SMILES strings to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'ModChemBertModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Chem-MRL)

First install the Chem-MRL library:

pip install -U "chem-mrl>=0.7.3"

Then you can load this model and run inference.

from chem_mrl import ChemMRL

# Download from the 🤗 Hub
model = ChemMRL(
    "Derify/ChemMRL",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "bfloat16"},
)
# Run inference
sentences = [
    'OCCCc1cc(F)cc(F)c1',
    'Fc1cc(F)cc(-n2cc[o+]n2)c1',
    'CCC(C)C(=O)C1(C(NN)C(C)C)CCCC1',
]
embeddings = model.backbone.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.backbone.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4184, 0.0166],
#         [0.4158, 1.0000, 0.0136],
#         [0.0167, 0.0137, 1.0000]])

Direct Usage (Sentence Transformers)


First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer(
    "Derify/ChemMRL",
    # SentenceTransformer doesn't support tanimoto similarity natively so we set a different similarity function here
    similarity_fn_name="cosine",
    trust_remote_code=True,
    model_kwargs={"torch_dtype": "bfloat16"},
)
# Run inference
sentences = [
    'OCCCc1cc(F)cc(F)c1',
    'Fc1cc(F)cc(-n2cc[o+]n2)c1',
    'CCC(C)C(=O)C1(C(NN)C(C)C)CCCC1',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5887, 0.0327],
#         [0.5887, 1.0000, 0.0269],
#         [0.0327, 0.0269, 1.0000]])
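
The two usage snippets report different scores for the same molecules because they use different similarity functions. The sketch below is a minimal, dependency-free illustration that assumes the Chem-MRL score is the continuous Tanimoto coefficient, T(a, b) = a·b / (‖a‖² + ‖b‖² − a·b); the card does not spell out the formula, so treat this as an assumption. The helper names are hypothetical and are not part of either library.

```python
def tanimoto(a, b):
    """Continuous Tanimoto coefficient between two real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    sq_norm_a = sum(x * x for x in a)
    sq_norm_b = sum(y * y for y in b)
    return dot / (sq_norm_a + sq_norm_b - dot)


def tanimoto_from_cosine(cos_sim):
    """Shortcut that holds only when both vectors have unit length.

    With ||a|| = ||b|| = 1 the dot product equals the cosine similarity,
    so T = c / (1 + 1 - c) = c / (2 - c).
    """
    return cos_sim / (2.0 - cos_sim)


# Cosine scores reported in the Sentence Transformers example above
for cos_sim in (0.5887, 0.0327, 0.0269):
    print(f"cosine={cos_sim:.4f} -> tanimoto~{tanimoto_from_cosine(cos_sim):.4f}")
```

Under this assumption the cosine scores map to roughly 0.417, 0.017, and 0.014, which is close to the Chem-MRL output shown earlier. The remaining gap may come from bfloat16 rounding, since the embeddings are then only approximately unit length.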

Evaluation

Metrics

Semantic Similarity

  • Dataset: pubchem_10m_genmol_similarity
  • Evaluated with chem_mrl.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator with these parameters:
    {
        "precision": "float32"
    }
    
| Split      | Metric   | Value   |
|------------|----------|---------|
| validation | spearman | 0.98811 |
| test       | spearman | 0.98813 |
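
Spearman correlation measures how well the predicted similarities order the pairs in the same way as the reference labels, regardless of the absolute values. The following is a dependency-free sketch of that metric, not the evaluator's own code; the function names and the toy scores are illustrative assumptions.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, reference):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(reference))


# Hypothetical reference labels and model scores for four pairs
reference = [0.86, 0.80, 0.55, 0.10]
predicted = [0.91, 0.70, 0.75, 0.05]
print(spearman(predicted, reference))
```

In the toy data two neighbouring pairs are swapped in rank, which is enough to pull the score below 1 even though the predicted values are numerically close to the labels.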

Training Details

Training Dataset

pubchem_10m_genmol_similarity

  • Dataset: pubchem_10m_genmol_similarity at 9aec8fd
  • Size: 19,381,001 training samples
  • Columns: smiles_a, smiles_b, and label
  • Approximate statistics based on the first 1000 samples:
    |         | smiles_a | smiles_b | label |
    |---------|----------|----------|-------|
    | type    | string   | string   | float |
    | details | min: 17 tokens, mean: 42.36 tokens, max: 122 tokens | min: 11 tokens, mean: 40.93 tokens, max: 122 tokens | min: 0.02, mean: 0.56, max: 1.0 |
  • Samples:
    | smiles_a | smiles_b | label |
    |----------|----------|-------|
    | COc1ccc(NC(=O)C2CCNH+CC2)cc1NC(=O)C1CCCCC1 | Cc1cc(C(=O)Nc2ccc(F)c(F)c2)ccc1NC(=O)C(C)[NH+]1CCC(C(=O)Nc2cccc(NC(=O)C3CCCCC3)c2)CC1 | 0.8495575189590454 |
    | OCCN1CCNH+CC1 | OCCN1CCNH+CC1 | 0.6615384817123413 |
    | CC1CN(C(=O)C2CCNH+CC2)CC(C)O1 | CC1CN(C(=O)C2CCNH+CC2)CC(C)O1 | 0.7123287916183472 |
  • Loss: Matryoshka2dLoss with these parameters:
    {
        "loss": "TanimotoSentLoss",
        "n_layers_per_step": 11,
        "last_layer_weight": 1.0,
        "prior_layers_weight": 1.5,
        "kl_div_weight": 0.5,
        "kl_temperature": 0.3,
        "matryoshka_dims": [
            1024,
            512,
            256,
            128,
            64,
            32,
            16,
            8
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": 4
    }
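
The `matryoshka_dims` list names the embedding prefix lengths the loss was trained on, which suggests that embeddings can be shortened to any of those sizes at inference time. The sketch below assumes that truncating to a prefix and re-normalizing to unit length is the intended use, which is common practice for Matryoshka models but is not stated in this card. The function name is hypothetical.

```python
import math

# Prefix sizes taken from the "matryoshka_dims" entry of the loss config
MATRYOSHKA_DIMS = (1024, 512, 256, 128, 64, 32, 16, 8)


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if dim > len(vector):
        raise ValueError("dim exceeds the embedding length")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


# Toy 1024-dimensional vector standing in for a real embedding
full = [float(i % 7 + 1) for i in range(1024)]
small = truncate_embedding(full, 64)
print(len(small))
```

Sentence Transformers exposes a `truncate_dim` argument on `SentenceTransformer` for the same purpose; truncated vectors may still need re-normalizing before a similarity function that expects unit-length inputs is applied.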
    

Evaluation Dataset

pubchem_10m_genmol_similarity

  • Dataset: pubchem_10m_genmol_similarity at 9aec8fd
  • Size: 1,080,394 evaluation samples
  • Columns: smiles_a, smiles_b, and label
  • Approximate statistics based on the first 1000 samples:
    |         | smiles_a | smiles_b | label |
    |---------|----------|----------|-------|
    | type    | string   | string   | float |
    | details | min: 16 tokens, mean: 42.05 tokens, max: 101 tokens | min: 11 tokens, mean: 40.23 tokens, max: 104 tokens | min: 0.0, mean: 0.57, max: 1.0 |
  • Samples:
    | smiles_a | smiles_b | label |
    |----------|----------|-------|
    | N#CCCN(Cc1cnc(N)cn1)C1CC1 | N#CCCN(Cc1cnc(N)cn1)C1CCCC1 | 0.8600000143051147 |
    | N#CCCN(Cc1cnc(N)cn1)C1CC1 | N#CCCN(Cc1cnc(N)cn1)C1CCOCC1 | 0.7962962985038757 |
    | N#CCCN(Cc1cnc(N)cn1)C1CC1 | N#CCCN(Cc1cnc(N)cn1)CC(F)F | 0.5517241358757019 |
  • Loss: Matryoshka2dLoss with these parameters:
    {
        "loss": "TanimotoSentLoss",
        "n_layers_per_step": 11,
        "last_layer_weight": 1.0,
        "prior_layers_weight": 1.5,
        "kl_div_weight": 0.5,
        "kl_temperature": 0.3,
        "matryoshka_dims": [
            1024,
            512,
            256,
            128,
            64,
            32,
            16,
            8
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": 4
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 192
  • per_device_eval_batch_size: 512
  • learning_rate: 8e-06
  • weight_decay: 1e-05
  • max_grad_norm: None
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_kwargs: {'num_decay_steps': 100943, 'warmup_type': 'linear', 'decay_type': '1-sqrt'}
  • warmup_steps: 100943
  • data_seed: 42
  • bf16: True
  • bf16_full_eval: True
  • tf32: True
  • optim: stable_adamw
  • optim_args: decouple_lr=True,max_lr=8.0e-6
  • dataloader_pin_memory: False
  • eval_on_start: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 192
  • per_device_eval_batch_size: 512
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 8e-06
  • weight_decay: 1e-05
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: None
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_kwargs: {'num_decay_steps': 100943, 'warmup_type': 'linear', 'decay_type': '1-sqrt'}
  • warmup_ratio: 0.0
  • warmup_steps: 100943
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: 42
  • jit_mode_eval: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: True
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: stable_adamw
  • optim_args: decouple_lr=True,max_lr=8.0e-6
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: False
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: True
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

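| Epoch  | Step   | Training Loss | pubchem 10m genmol similarity loss | pubchem_10m_genmol_similarity_spearman |
|--------|--------|---------------|------------------------------------|----------------------------------------|
| 0      | 0      | -             | 85.7997                            | 0.7261                                 |
| 0.0000 | 1      | 69.0605       | -                                  | -                                      |
| 0.2477 | 25000  | 47.1696       | -                                  | -                                      |
| 0.2500 | 25235  | -             | 56.9634                            | 0.8997                                 |
| 0.4978 | 50250  | 45.6212       | -                                  | -                                      |
| 0.5000 | 50470  | -             | 55.4366                            | 0.9599                                 |
| 0.7479 | 75500  | 45.1404       | -                                  | -                                      |
| 0.7500 | 75705  | -             | 54.5667                            | 0.9755                                 |
| 0.9981 | 100750 | 44.5023       | -                                  | -                                      |
| 1.0000 | 100940 | -             | 54.1244                            | 0.9810                                 |
| 1.2482 | 126000 | 43.7545       | -                                  | -                                      |
| 1.2500 | 126175 | -             | 53.6974                            | 0.9838                                 |
| 1.4984 | 151250 | 43.7865       | -                                  | -                                      |
| 1.5000 | 151410 | -             | 53.4775                            | 0.9855                                 |
| 1.7485 | 176500 | 43.3512       | -                                  | -                                      |
| 1.7499 | 176645 | -             | 53.3775                            | 0.9866                                 |
| 1.9987 | 201750 | 43.5808       | -                                  | -                                      |
| 1.9999 | 201880 | -             | 53.3119                            | 0.9874                                 |
| 2.2488 | 227000 | 43.281        | -                                  | -                                      |
| 2.2499 | 227115 | -             | 53.1854                            | 0.9879                                 |
| 2.4989 | 252250 | 43.3097       | -                                  | -                                      |
| 2.4999 | 252350 | -             | 53.1972                            | 0.9880                                 |
| 2.7491 | 277500 | 43.2376       | -                                  | -                                      |
| 2.7499 | 277585 | -             | 53.1833                            | 0.9881                                 |
| 2.9992 | 302750 | 43.2006       | -                                  | -                                      |
| 2.9999 | 302820 | -             | 53.1241                            | 0.9881                                 |
| 3.0000 | 302829 | -             | -                                  | 0.98811                                |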

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Energy Consumed: 19.679 kWh
  • Carbon Emitted: 4.040 kg of CO2
  • Hours Used: 74.966 hours
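
Two derived quantities follow directly from the figures above: the average power draw over the run and the implied carbon intensity of the electricity. The short calculation below uses only the three reported numbers; the variable names are illustrative, and the interpretation assumes the energy figure covers everything CodeCarbon tracked on the machine.

```python
energy_kwh = 19.679    # Energy Consumed
emissions_kg = 4.040   # Carbon Emitted
hours = 74.966         # Hours Used

# Average power in watts: energy divided by duration
avg_power_w = energy_kwh / hours * 1000.0

# Implied carbon intensity in grams of CO2 per kWh
intensity_g_per_kwh = emissions_kg / energy_kwh * 1000.0

print(f"average power: {avg_power_w:.1f} W")
print(f"carbon intensity: {intensity_g_per_kwh:.1f} g CO2/kWh")
```

This works out to an average draw of roughly 260 W and an intensity of roughly 205 g of CO2 per kWh.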

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: AMD Ryzen 7 3700X 8-Core Processor
  • RAM Size: 62.70 GB

Framework Versions

  • Python: 3.13.7
  • Sentence Transformers: 5.1.1
  • Transformers: 4.57.1
  • PyTorch: 2.8.0+cu128
  • Accelerate: 1.10.1
  • Datasets: 3.6.0
  • Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Matryoshka2dLoss

@misc{li20242d,
    title={2D Matryoshka Sentence Embeddings},
    author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li},
    year={2024},
    eprint={2402.14776},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}

TanimotoSentLoss

@online{cortes-2025-tanimotosentloss,
    title={TanimotoSentLoss: Tanimoto Loss for SMILES Embeddings},
    author={Emmanuel Cortes},
    year={2025},
    month={Jan},
    url={https://github.com/emapco/chem-mrl},
}

Model Card Authors

@eacortes

Model Card Contact

Manny Cortes (manny@derifyai.com)