eacortes committed · Commit 5d62f32 · verified · 1 Parent(s): 73468e8

Upload README.md

Files changed (1): README.md +6 -6
README.md CHANGED
@@ -52,7 +52,7 @@ model-index:
 
 This [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) is finetuned from [Derify/ModChemBERT-IR-BASE](https://huggingface.co/Derify/ModChemBERT-IR-BASE) using hard-negative triplets derived from [Derify/pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity). Positive SMILES pairs are first filtered by quality and similarity constraints, then reduced to one strongest positive target per anchor molecule to create a high-signal training set for reranking. The model computes relevance scores for pairs of SMILES strings, enabling SMILES reranking and molecular semantic search.
 
-For this variant, the positive selection objective is pure similarity ranking: each anchor keeps the highest-similarity candidate after filtering, rather than using a QED+similarity composite score. The quality stage uses strict inequality filtering (`QED > 0.85`, `similarity > 0.5`, with similarity also bounded below 1.0).
+For this variant, the positive selection objective is pure similarity ranking where each anchor keeps the highest-similarity candidate after filtering, rather than using a QED+similarity composite score. The quality stage uses strict inequality filtering (`QED > 0.85`, `similarity > 0.5`, with similarity also bounded below 1.0), and then keeps the top-scoring pair per anchor molecule.
 
 Hard negatives are mined with [Sentence Transformers](https://www.sbert.net/) using [Derify/ChemMRL-beta](https://huggingface.co/Derify/ChemMRL-beta) as the teacher model and a TopK-PercPos-style margin setting based on [NV-Retriever](https://arxiv.org/abs/2407.15831), with `relative_margin=0.05` and `max_negative_score_threshold = pos_score * percentage_margin`. Training uses triplet-format samples with 5 mined negatives per anchor-positive pair and optimizes a multiple-negatives ranking objective, while reranking evaluation uses n-tuple samples with 30 mined negatives per query.
 
@@ -60,12 +60,11 @@ Hard negatives are mined with [Sentence Transformers](https://www.sbert.net/) us
 
 ### Model Description
 - **Model Type:** Cross Encoder
-<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
+- **Base model:** [Derify/ModChemBERT-IR-BASE](https://huggingface.co/Derify/ModChemBERT-IR-BASE) <!-- at revision 1d8fd449edb3eadeaa5ebdd1c891e3ce95aebc3d -->
 - **Maximum Sequence Length:** 512 tokens
 - **Number of Output Labels:** 1 label
 - **Training Dataset:**
     - [Derify/pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) Mined Hard Negatives
-<!-- - **Language:** Unknown -->
 - **License:** apache-2.0
 
 ### Model Sources
@@ -253,11 +252,12 @@ You can finetune this model on your own dataset.
 - `optim`: stable_adamw
 - `optim_args`: decouple_lr=True,max_lr=3e-05
 - `dataloader_persistent_workers`: True
-- `resume_from_checkpoint`: True
+- `resume_from_checkpoint`: False
 - `gradient_checkpointing`: True
 - `torch_compile`: True
 - `torch_compile_backend`: inductor
 - `torch_compile_mode`: max-autotune
+- `eval_on_start`: True
 - `batch_sampler`: no_duplicates
 
 #### All Hyperparameters
@@ -344,7 +344,7 @@ You can finetune this model on your own dataset.
 - `skip_memory_metrics`: True
 - `use_legacy_prediction_loop`: False
 - `push_to_hub`: False
-- `resume_from_checkpoint`: True
+- `resume_from_checkpoint`: False
 - `hub_model_id`: None
 - `hub_strategy`: every_save
 - `hub_private_repo`: None
@@ -372,7 +372,7 @@ You can finetune this model on your own dataset.
 - `neftune_noise_alpha`: None
 - `optim_target_modules`: None
 - `batch_eval_metrics`: False
-- `eval_on_start`: False
+- `eval_on_start`: True
 - `use_liger_kernel`: False
 - `liger_kernel_config`: None
 - `eval_use_gather_object`: False
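
The positive-selection step the card describes (strict `QED > 0.85`, `0.5 < similarity < 1.0`, then one highest-similarity candidate per anchor) can be sketched in plain Python. The `(anchor, candidate, qed, similarity)` record layout and the function name below are illustrative assumptions, not the repository's actual preprocessing code; only the thresholds come from the card:

```python
# Sketch of the quality filter and pure-similarity per-anchor selection.
# Record layout is assumed; thresholds (QED > 0.85, 0.5 < similarity < 1.0)
# are the strict inequalities stated in the model card.

def select_positives(pairs):
    """Apply the strict quality filters, then keep the single
    highest-similarity candidate per anchor (no QED+similarity composite)."""
    filtered = [
        p for p in pairs
        if p["qed"] > 0.85 and 0.5 < p["similarity"] < 1.0
    ]
    best = {}
    for p in filtered:
        cur = best.get(p["anchor"])
        if cur is None or p["similarity"] > cur["similarity"]:
            best[p["anchor"]] = p
    return list(best.values())

pairs = [
    {"anchor": "CCO", "candidate": "CCN", "qed": 0.90, "similarity": 0.72},
    {"anchor": "CCO", "candidate": "CCC", "qed": 0.91, "similarity": 0.64},
    # Excluded: similarity must be strictly below 1.0.
    {"anchor": "CCO", "candidate": "CCO", "qed": 0.95, "similarity": 1.00},
    # Excluded: QED must be strictly above 0.85.
    {"anchor": "c1ccccc1", "candidate": "Cc1ccccc1", "qed": 0.60, "similarity": 0.80},
]
print(select_positives(pairs))  # one pair survives: CCO -> CCN
```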
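
The TopK-PercPos-style margin used during mining can also be sketched in plain Python. Reading `percentage_margin` as `1 - relative_margin` (so that with `relative_margin=0.05` a candidate negative is rejected when its teacher score reaches `0.95 * pos_score`) is an assumption consistent with NV-Retriever's description; this is an illustration, not Sentence Transformers' actual mining code:

```python
# Illustrative TopK-PercPos-style filter: candidates scoring too close to
# (or above) the positive are treated as likely false negatives and dropped.
# percentage_margin = 1 - relative_margin is an assumed reading of the card.

def filter_negatives(pos_score, candidate_scores, relative_margin=0.05,
                     num_negatives=5):
    max_negative_score_threshold = pos_score * (1 - relative_margin)
    kept = [s for s in candidate_scores if s < max_negative_score_threshold]
    # Prefer the hardest surviving negatives (highest teacher score first).
    return sorted(kept, reverse=True)[:num_negatives]

pos_score = 0.90          # teacher similarity of the anchor-positive pair
candidates = [0.92, 0.88, 0.86, 0.70, 0.65, 0.40, 0.30]
# Threshold is 0.855, so 0.92, 0.88, and 0.86 are rejected as false negatives.
print(filter_negatives(pos_score, candidates))
```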
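
The multiple-negatives ranking objective mentioned for training can be written out for a single anchor-positive group: the positive's score is pushed above the mined negatives' scores via a softmax cross-entropy with the positive as the target class. This standalone sketch operates on raw scores and is a simplification of the actual Sentence Transformers loss, which batches many such groups:

```python
import math

# Simplified multiple-negatives ranking loss for one query:
# softmax cross-entropy where index 0 (the positive) is the target.

def mnr_loss(pos_score, neg_scores):
    scores = [pos_score] + list(neg_scores)
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[0] / sum(exps))

# With 5 mined negatives per anchor-positive pair, as in training:
loss_easy = mnr_loss(5.0, [0.0, -1.0, -2.0, 0.5, -0.5])   # positive well separated
loss_hard = mnr_loss(0.5, [0.4, 0.3, 0.45, 0.2, 0.1])     # negatives nearly as high
print(loss_easy, loss_hard)  # the hard group yields the larger loss
```

Minimizing this quantity is what drives the reranker to score the true positive above all mined hard negatives for each anchor.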