This is a sentence-transformers model finetuned from nomic-ai/modernbert-embed-base. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
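The Pooling and Normalize modules above can be illustrated with a minimal pure-Python sketch. It assumes mean pooling averages only the token vectors whose attention-mask entry is 1, and that Normalize rescales the result to unit L2 length. The helper names `mean_pool` and `l2_normalize` and the 3-dimensional toy vectors are hypothetical and stand in for the model's 768-dimensional token embeddings.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vec):
    """Rescale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy input: two real tokens and one padding token (mask 0).
tokens = [[1.0, 2.0, 2.0], [3.0, 0.0, 2.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)    # the padding row does not contribute
embedding = l2_normalize(pooled)    # unit-length sentence embedding
print(pooled, embedding)
```

Because the final vector has unit length, a dot product between two such embeddings equals their cosine similarity.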
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
'What is the primary purpose of the FITS format?',
'FITS format\n\nAll of the ODF/SDF component files, with the exception of the summary\nfiles, reconstructed orbit file, and raw attitude file, are FITS files\nand conform to the standard. A description of the FITS format can be\nfound in , which is accessible also at the URL\nThe calibration files and the bulk of the PPS products also conform to\nthe FITS standard. Wherever possible and desirable the calibration files\nand the PPS products follow the conventions of the OGIP\n(http://heasarc.gsfc.nasa.gov/docs/heasarc/ofwg/ofwg_intro.html) (Office\nof Guest Investigator Programs) FITS working group. The HEASARC FITS\nWorking Group activities are described at the following URL:\nFor FITS files where OGIP FITS standards are not applicable or\navailable, new standards closely following the OGIP approach are used.\n\nThe FITS format is primarily designed to store scientific data sets\nconsisting of multidimensional arrays (1-D spectra, 2-D images or 3-D\ndata cubes) and 2-dimensional tables containing rows and columns of\ndata. A FITS data file is composed of a sequence of Header + Data Units\n(HDUs).\n\nThe general structure of a FITS file is as follows:\n\n- a primary header;\n\n- a primary data array of zero length;\n\n- zero or more extensions\n\nEach extension consists of an extension header and a data section.\nExtensions are named and can appear in any order. Only the following\nFITS extensions are used:\n\n- ASCII table: XTENSION=TABLE\n\n- binary table: XTENSION=BINTABLE\n\n- image: XTENSION=IMAGE\n\nThe header consists of keyword=value statements, which describe the\norganisation of the data in the HDU and the format of the contents. It\nmay also provide additional information, for example, about instrument\nstatus or the history of the data. The following block contains the\ndata, which are structured as specified in the header. The data section\nof the HDU may contain a digital image, a table or a multidimensional\nmatrix that is not an image. An HDU need not contain data.\n',
'ASCII\n\nASCII files are used to present script and some tabular information. In\nparticular, each ODF/SDF contains a single summary file, with a summary\nof the information relating to the observation or slew (see\nSect.\xa0[dfhb:par:odf]).\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
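As a hedged illustration of how the similarity scores might be used for semantic search, the sketch below ranks passages against a query by dot product. It assumes the embeddings are already unit-length (as the Normalize module suggests), so the dot product equals cosine similarity. The function name `rank_by_similarity` and the 2-dimensional toy vectors are hypothetical; real embeddings from this model would have 768 dimensions.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_by_similarity(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted from most to least similar."""
    scores = [dot(query_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

# Toy unit-length vectors standing in for encoded query and passages.
query = [1.0, 0.0]
passages = [[0.6, 0.8], [0.0, 1.0], [0.8, 0.6]]

ranking = rank_by_similarity(query, passages)
for index, score in ranking:
    print(index, round(score, 3))
```

In the inference example above, the first sentence plays the role of the query and the two passages are the candidates, so the first row of the similarity matrix would supply the scores to rank.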
Columns: anchor and positive

| | anchor | positive |
|---|---|---|
| type | string | string |
| details | | |
| anchor | positive |
|---|---|
| What is the purpose of the document described in the preface? | Preface |
| What version of the document is described in the preface? | Preface |
| What is the main change in version 4.3 of the document? | Preface |
CachedMultipleNegativesRankingLoss with these parameters:
{
"scale": 1.0,
"similarity_fct": "get_similarity"
}
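To make the loss parameters more concrete, here is a minimal sketch of an in-batch-negatives ranking loss, the objective that multiple-negatives ranking losses are generally understood to optimize: each anchor should score its own positive higher than every other positive in the batch. It is written under assumptions, not taken from the library: similarity is taken to be a dot product of unit-length vectors, the function name `mnr_loss` is hypothetical, and the gradient caching of the Cached variant (Gao et al., 2021) is not modeled, since it changes memory use rather than the loss value.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mnr_loss(anchors, positives, scale=1.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [scale * dot(anchors[i], positives[j]) for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / n

# Toy batch of two perfectly matched, mutually orthogonal pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]

loss_scale_1 = mnr_loss(anchors, positives, scale=1.0)
loss_scale_20 = mnr_loss(anchors, positives, scale=20.0)
print(loss_scale_1, loss_scale_20)
```

Under these assumptions, a scale of 1.0 keeps the logits within [-1, 1], so the loss cannot approach zero even for perfectly separated pairs; a larger scale sharpens the softmax and lowers that floor.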
Columns: anchor and positive

| | anchor | positive |
|---|---|---|
| type | string | string |
| details | | |
| anchor | positive |
|---|---|
| In pn imaging mode event lists, what is the type of the OFFSETX column? | - In pn event lists this extension contains the CCD columns to which |
| What are the three binary table extensions created per source used for? | - This product lists bright sources detected by EPIC which fall in the |
| What is the purpose of the analysis steps outlined in the document? | Structure of the document |
CachedMultipleNegativesRankingLoss with these parameters:
{
"scale": 1.0,
"similarity_fct": "get_similarity"
}
Non-default hyperparameters:

- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 4
- num_train_epochs: 2
- lr_scheduler_type: constant
- warmup_ratio: 0.1
- bf16: True
- batch_sampler: no_duplicates

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 4
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 2
- max_steps: -1
- lr_scheduler_type: constant
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional

| Epoch | Step | Training Loss | Validation Loss |
|---|---|---|---|
| 0.0441 | 10 | 2.3929 | - |
| 0.0881 | 20 | 2.2876 | - |
| 0.1322 | 30 | 2.2502 | - |
| 0.1762 | 40 | 2.2265 | - |
| 0.2203 | 50 | 2.176 | 0.9569 |
| 0.2643 | 60 | 2.1931 | - |
| 0.3084 | 70 | 2.1666 | - |
| 0.3524 | 80 | 2.1637 | - |
| 0.3965 | 90 | 2.1684 | - |
| 0.4405 | 100 | 2.1373 | 0.9265 |
| 0.4846 | 110 | 2.135 | - |
| 0.5286 | 120 | 2.1159 | - |
| 0.5727 | 130 | 2.113 | - |
| 0.6167 | 140 | 2.098 | - |
| 0.6608 | 150 | 2.0931 | 0.9054 |
| 0.7048 | 160 | 2.0954 | - |
| 0.7489 | 170 | 2.0882 | - |
| 0.7930 | 180 | 2.0926 | - |
| 0.8370 | 190 | 2.1139 | - |
| 0.8811 | 200 | 2.1151 | 0.8745 |
| 0.9251 | 210 | 2.1033 | - |
| 0.9692 | 220 | 2.1014 | - |
| 1.0132 | 230 | 2.0139 | - |
| 1.0573 | 240 | 2.0408 | - |
| 1.1013 | 250 | 2.0257 | 0.9039 |
| 1.1454 | 260 | 2.0401 | - |
| 1.1894 | 270 | 2.0189 | - |
| 1.2335 | 280 | 2.0521 | - |
| 1.2775 | 290 | 2.055 | - |
| 1.3216 | 300 | 2.0407 | 0.9321 |
| 1.3656 | 310 | 2.0252 | - |
| 1.4097 | 320 | 2.0126 | - |
| 1.4537 | 330 | 2.0431 | - |
| 1.4978 | 340 | 2.0293 | - |
| 1.5419 | 350 | 2.042 | 0.9105 |
| 1.5859 | 360 | 2.0557 | - |
| 1.6300 | 370 | 2.0481 | - |
| 1.6740 | 380 | 2.0169 | - |
| 1.7181 | 390 | 2.0402 | - |
| 1.7621 | 400 | 2.0376 | 0.8873 |
| 1.8062 | 410 | 2.045 | - |
| 1.8502 | 420 | 1.9934 | - |
| 1.8943 | 430 | 2.0335 | - |
| 1.9383 | 440 | 2.0278 | - |
| 1.9824 | 450 | 2.0313 | 0.8658 |
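The Epoch and Step columns in the log above allow a rough back-of-the-envelope estimate of the number of optimizer steps per epoch and, from that, an upper bound on the training set size. The sketch below is only an estimate under stated assumptions: a single device, `gradient_accumulation_steps` of 1 and a train batch size of 16 (both from the hyperparameter list), and a sampler that does not discard many samples. The variable and function names are hypothetical.

```python
# (epoch, step) pairs copied from three rows of the training log.
log_rows = [(0.0441, 10), (0.9692, 220), (1.9824, 450)]

def estimate_steps_per_epoch(rows):
    """Average step/epoch over the rows and round to the nearest whole step."""
    ratios = [step / epoch for epoch, step in rows]
    return round(sum(ratios) / len(ratios))

steps_per_epoch = estimate_steps_per_epoch(log_rows)

# Assumption: one device, no gradient accumulation, batch size 16.
train_batch_size = 16
max_training_pairs = steps_per_epoch * train_batch_size

print(steps_per_epoch, max_training_pairs)
```

The figure for training pairs is an upper bound rather than an exact count, because the final batch of an epoch may be smaller than 16 and the `no_duplicates` batch sampler may affect how samples are grouped.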
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{gao2021scaling,
title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
year={2021},
eprint={2101.06983},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Base model
answerdotai/ModernBERT-base