This [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) is finetuned from [Derify/ModChemBERT-IR-BASE](https://huggingface.co/Derify/ModChemBERT-IR-BASE) using hard-negative triplets derived from [Derify/pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity). Positive SMILES pairs are first filtered by quality and similarity constraints, then reduced to one strongest positive target per anchor molecule to create a high-signal training set for reranking. The model computes relevance scores for pairs of SMILES strings, enabling SMILES reranking and molecular semantic search.

For this variant, the positive selection objective is pure similarity ranking: each anchor keeps the highest-similarity candidate after filtering, rather than using a QED+similarity composite score. The quality stage uses strict inequality filtering (`QED > 0.85`, `similarity > 0.5`, with similarity also bounded below 1.0), then keeps the top-scoring pair per anchor molecule.
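
As a rough sketch, the positive-selection step above can be expressed as a filter-then-argmax over candidate pairs. Field names and example values below are illustrative, not taken from the dataset:

```python
def select_positives(pairs):
    """Keep one strongest positive per anchor.

    `pairs` is an iterable of dicts with illustrative keys
    'anchor', 'positive', 'qed', 'similarity'.
    """
    best = {}
    for p in pairs:
        # Strict quality/similarity filters; similarity is bounded below 1.0
        # so exact self-matches are dropped.
        if not (p["qed"] > 0.85 and 0.5 < p["similarity"] < 1.0):
            continue
        # Pure similarity ranking: keep the highest-similarity candidate per anchor.
        cur = best.get(p["anchor"])
        if cur is None or p["similarity"] > cur["similarity"]:
            best[p["anchor"]] = p
    return best

# Illustrative values only (not real QED/similarity scores):
pairs = [
    {"anchor": "CCO", "positive": "CCN", "qed": 0.90, "similarity": 0.70},
    {"anchor": "CCO", "positive": "CCC", "qed": 0.90, "similarity": 0.80},
    {"anchor": "CCO", "positive": "CCO", "qed": 0.90, "similarity": 1.00},  # self-match, dropped
    {"anchor": "c1ccccc1", "positive": "Cc1ccccc1", "qed": 0.50, "similarity": 0.90},  # fails QED
]
selected = select_positives(pairs)
```

Only the ethanol anchor survives here, paired with its strongest valid candidate.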

Hard negatives are mined with [Sentence Transformers](https://www.sbert.net/) using [Derify/ChemMRL-beta](https://huggingface.co/Derify/ChemMRL-beta) as the teacher model and a TopK-PercPos-style margin setting based on [NV-Retriever](https://arxiv.org/abs/2407.15831): with `relative_margin=0.05`, mined negatives must score below `max_negative_score_threshold = pos_score * (1 - relative_margin)`, i.e. below 95% of the positive's teacher score. Training uses triplet-format samples with 5 mined negatives per anchor-positive pair and optimizes a multiple-negatives ranking objective, while reranking evaluation uses n-tuple samples with 30 mined negatives per query.
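
The margin rule amounts to a simple cap on teacher scores. A minimal sketch (function names are illustrative, not the mining API):

```python
def max_negative_score(pos_score, relative_margin=0.05):
    # TopK-PercPos-style cap: a mined negative may score at most
    # (1 - relative_margin) times the anchor-positive teacher score.
    return pos_score * (1.0 - relative_margin)

def is_valid_negative(neg_score, pos_score, relative_margin=0.05):
    # Candidates scoring at or above the cap are treated as likely
    # false negatives and rejected.
    return neg_score < max_negative_score(pos_score, relative_margin)
```

For example, a positive scoring 0.80 caps negatives below 0.76, so near-duplicates of the positive are filtered out of the negative pool.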

### Model Description
- **Model Type:** Cross Encoder
- **Base model:** [Derify/ModChemBERT-IR-BASE](https://huggingface.co/Derify/ModChemBERT-IR-BASE) <!-- at revision 1d8fd449edb3eadeaa5ebdd1c891e3ce95aebc3d -->
- **Maximum Sequence Length:** 512 tokens
- **Number of Output Labels:** 1 label
- **Training Dataset:**
    - [Derify/pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) Mined Hard Negatives
- **License:** apache-2.0

### Model Sources

- `optim`: stable_adamw
- `optim_args`: decouple_lr=True,max_lr=3e-05
- `dataloader_persistent_workers`: True
- `resume_from_checkpoint`: False
- `gradient_checkpointing`: True
- `torch_compile`: True
- `torch_compile_backend`: inductor
- `torch_compile_mode`: max-autotune
- `eval_on_start`: True
- `batch_sampler`: no_duplicates
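
Assuming the Sentence Transformers v4 cross-encoder trainer, the non-default settings above would map onto a `CrossEncoderTrainingArguments` roughly as follows; this is a sketch, not the exact training script, and `output_dir` is a placeholder:

```python
from sentence_transformers.cross_encoder import CrossEncoderTrainingArguments

# Sketch only: values mirror the non-default list above;
# everything else stays at its default.
args = CrossEncoderTrainingArguments(
    output_dir="outputs",  # placeholder
    optim="stable_adamw",
    optim_args="decouple_lr=True,max_lr=3e-05",
    dataloader_persistent_workers=True,
    gradient_checkpointing=True,
    torch_compile=True,
    torch_compile_backend="inductor",
    torch_compile_mode="max-autotune",
    eval_on_start=True,
    batch_sampler="no_duplicates",  # BatchSamplers.NO_DUPLICATES
)
```

The `no_duplicates` batch sampler matters for the multiple-negatives ranking objective, since duplicate molecules in a batch would act as false in-batch negatives.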

#### All Hyperparameters

- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: False
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None

- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: True
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False