ModernBERT Embed Base Legal Fine-tuned

This is a sentence-transformers model fine-tuned from nomic-ai/modernbert-embed-base on the legal-rag-positives-synthetic dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
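
The three modules above can be read as a pipeline: the Transformer produces one vector per token, the Pooling module averages those vectors (pooling_mode_mean_tokens is the only mode set to True), and Normalize rescales the result to unit length. The following is a minimal pure-Python sketch of the last two steps. It uses made-up 4-dimensional token vectors instead of the model's 768 dimensions, and the function names are illustrative, not part of the library.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where the mask is 1."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy input: three tokens, the last one is padding and is ignored.
tokens = [[1.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 4.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)      # average of the two real tokens
embedding = l2_normalize(pooled)      # unit-length sentence embedding
print(pooled, embedding)
```

Because the output is unit length, the dot product of two such embeddings equals their cosine similarity.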

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("aaa961/modernbert-embed-base-legal-MRL_reverse_dataset")
# Run inference
sentences = [
    'confidentiality agreement/order, that remain following those discussions.  This is a \nfinal report and notice of exceptions shall be filed within three days of the date of \nthis report, pursuant to Court of Chancery Rule 144(d)(2), given the expedited and \nsummary nature of Section 220 proceedings.  \n \n \n \n \n \n \n \nRespectfully, \n \n \n \n \n \n \n \n \n/s/ Patricia W. Griffin',
    'According to which court rule must the notice of exceptions be filed?',
    'decides whether to submit proposals on future procurements, and excluding mentor-protégé JVs \nfrom proposing on a solicitation due to Section 125.9(b)(3)(i) unnecessarily prevents protégés from \naccessing opportunities to grow as a business.  SHS MJAR at 22–23; VCH MJAR at 22–23.   \nSuch a critique, however, merely highlights Plaintiffs’ disagreement with the SBA’s',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5258, 0.0577],
#         [0.5258, 1.0000, 0.0745],
#         [0.0577, 0.0745, 1.0000]])
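
Since the model was trained with MatryoshkaLoss at 768, 512, 256, 128, and 64 dimensions (see Training Details), embeddings can be shortened by keeping only the leading components and re-normalizing. The sketch below illustrates that idea on toy 8-dimensional vectors in pure Python; the vectors and helper names are hypothetical and stand in for real model output.

```python
import math

def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 8-dimensional "embeddings" standing in for 768-dimensional ones.
query = [0.9, 0.1, 0.3, 0.0, 0.2, 0.1, 0.0, 0.1]
passage = [0.8, 0.2, 0.4, 0.1, 0.0, 0.3, 0.1, 0.0]

# Compare the pair at full size and at two truncated sizes.
for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    p = truncate_and_normalize(passage, dim)
    print(dim, round(cosine(q, p), 4))
```

With the library itself, a similar effect should be available by passing truncate_dim (for example 256) when constructing SentenceTransformer, which truncates the output of encode. Smaller sizes trade retrieval quality for storage and speed, as the Evaluation tables indicate.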

Evaluation

Metrics

Information Retrieval (ir_dim_768)

Metric Value
cosine_accuracy@1 0.5997
cosine_accuracy@3 0.7543
cosine_accuracy@5 0.8099
cosine_accuracy@10 0.8841
cosine_precision@1 0.5997
cosine_precision@3 0.2514
cosine_precision@5 0.162
cosine_precision@10 0.0884
cosine_recall@1 0.5997
cosine_recall@3 0.7543
cosine_recall@5 0.8099
cosine_recall@10 0.8841
cosine_ndcg@10 0.7363
cosine_mrr@10 0.6897
cosine_map@100 0.694

Information Retrieval (ir_dim_512)

Metric Value
cosine_accuracy@1 0.5873
cosine_accuracy@3 0.7527
cosine_accuracy@5 0.8022
cosine_accuracy@10 0.8655
cosine_precision@1 0.5873
cosine_precision@3 0.2509
cosine_precision@5 0.1604
cosine_precision@10 0.0866
cosine_recall@1 0.5873
cosine_recall@3 0.7527
cosine_recall@5 0.8022
cosine_recall@10 0.8655
cosine_ndcg@10 0.7222
cosine_mrr@10 0.6767
cosine_map@100 0.6817

Information Retrieval (ir_dim_256)

Metric Value
cosine_accuracy@1 0.5734
cosine_accuracy@3 0.7357
cosine_accuracy@5 0.7821
cosine_accuracy@10 0.8485
cosine_precision@1 0.5734
cosine_precision@3 0.2452
cosine_precision@5 0.1564
cosine_precision@10 0.0849
cosine_recall@1 0.5734
cosine_recall@3 0.7357
cosine_recall@5 0.7821
cosine_recall@10 0.8485
cosine_ndcg@10 0.7088
cosine_mrr@10 0.6643
cosine_map@100 0.6696

Information Retrieval (ir_dim_128)

Metric Value
cosine_accuracy@1 0.51
cosine_accuracy@3 0.6615
cosine_accuracy@5 0.7326
cosine_accuracy@10 0.8176
cosine_precision@1 0.51
cosine_precision@3 0.2205
cosine_precision@5 0.1465
cosine_precision@10 0.0818
cosine_recall@1 0.51
cosine_recall@3 0.6615
cosine_recall@5 0.7326
cosine_recall@10 0.8176
cosine_ndcg@10 0.6554
cosine_mrr@10 0.6045
cosine_map@100 0.6104

Information Retrieval (ir_dim_64)

Metric Value
cosine_accuracy@1 0.3849
cosine_accuracy@3 0.5487
cosine_accuracy@5 0.6151
cosine_accuracy@10 0.7187
cosine_precision@1 0.3849
cosine_precision@3 0.1829
cosine_precision@5 0.123
cosine_precision@10 0.0719
cosine_recall@1 0.3849
cosine_recall@3 0.5487
cosine_recall@5 0.6151
cosine_recall@10 0.7187
cosine_ndcg@10 0.5417
cosine_mrr@10 0.4864
cosine_map@100 0.495
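
In every table above, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k (for example, 0.7543 / 3 ≈ 0.2514). That pattern is what these metrics reduce to when each query has exactly one relevant passage; this is an inference from the numbers, not something the card states. Under that assumption, the metrics can be sketched from the rank of the relevant passage alone. The rank list below is hypothetical.

```python
import math

def ir_metrics(ranks, k):
    """Metrics at cutoff k, given the 1-based rank of the single relevant
    passage for each query (None if it was not retrieved at all)."""
    n = len(ranks)
    hits = [r is not None and r <= k for r in ranks]
    accuracy = sum(hits) / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,   # at most one hit among k results
        "recall": accuracy,          # one relevant passage per query
        "mrr": sum(1.0 / r for r, h in zip(ranks, hits) if h) / n,
        # With one relevant passage the ideal DCG is 1,
        # so each query contributes 1 / log2(rank + 1).
        "ndcg": sum(1.0 / math.log2(r + 1) for r, h in zip(ranks, hits) if h) / n,
    }

# Hypothetical ranks for four queries.
example_ranks = [1, 3, 12, None]
print(ir_metrics(example_ranks, k=10))
```

The sketch omits MAP@100, which under the same single-relevant-passage assumption would be the mean reciprocal rank with a cutoff of 100.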

Training Details

Training Dataset

legal-rag-positives-synthetic

  • Dataset: legal-rag-positives-synthetic at f11534a
  • Size: 11,644 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor (type: string): min 7 tokens, mean 57.45 tokens, max 160 tokens
    • positive (type: string): min 8 tokens, mean 57.77 tokens, max 157 tokens
  • Samples:
    • anchor: What kinds of issues are mentioned in connection with wrongdoing?
      positive: mismanagement, waste and wrongdoing – and that it has demonstrated more than a
      credible basis from which the Court can infer possible mismanagement. It claims
      DR’s management failed to follow corporate governance mechanics and made
      critical business decisions without consulting with the Board or stockholders;
      failed to act with due diligence related to undertaking an ICO and discontinuing
    • anchor: Project, 504 F.2d at 248 n.15).
      More, the requirement of “substantial” authority suggests that the entity should be at the
      “center of gravity in the exercise of administrative power.” Id. at 882 (quoting Lombardo v.
      Handler, 397 F. Supp. 792, 796 (D.D.C. 1975), aff’d, 546 F.2d 1043 (D.C. Cir. 1976)). On this
      positive: What page reference is given for the Lombardo v. Handler case in the aforementioned citation?
    • anchor: Where can more detailed information regarding redactions be found?
      positive: parties specifically with respect to the FOIA request at issue in Count Eighteen of No. 11-444. This is likely
      because the CIA has previously instituted a categorical policy of indicating the basis for redactions at a document
      level, rather than a redaction level, as discussed above. See supra Part III.C.2. In light of the Court’s holding that
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
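
The configuration above applies MultipleNegativesRankingLoss at each of the five truncation sizes and adds the results with equal weights; n_dims_per_step of -1 is taken here to mean that all listed sizes are used at every step. The following pure-Python sketch shows the general idea on a toy batch. The scale factor of 20, the toy vectors, and the function names are assumptions for illustration; the library's tensor-based implementation may differ in detail.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities: positives[i] is the
    target for anchors[i]; all other positives in the batch act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss on embeddings truncated to each size."""
    return sum(
        w * in_batch_ranking_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d, w in zip(dims, weights)
    )

# Toy 4-dimensional batch of two (anchor, positive) pairs.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]
print(matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1]))
```

Summing over truncation sizes pushes the most useful information into the leading components, which is why the shorter embeddings remain usable for retrieval.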
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 32
  • num_train_epochs: 4
  • learning_rate: 2e-05
  • lr_scheduler_type: cosine
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • gradient_accumulation_steps: 16
  • bf16: True
  • tf32: True
  • eval_strategy: epoch
  • per_device_eval_batch_size: 16
  • load_best_model_at_end: True
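
These settings determine the number of optimizer steps and the shape of the learning-rate curve. The sketch below works through the arithmetic under two assumptions that the card does not state: training ran on a single device, and the warmup_steps value of 0.1 is interpreted as a fraction of the total steps. The resulting 23 steps per epoch and 92 total steps agree with the evaluation rows in the Training Logs (steps 23, 46, 69, 92).

```python
import math

TRAIN_SAMPLES = 11_644   # from the Training Dataset section
PER_DEVICE_BATCH = 32
GRAD_ACCUM = 16
EPOCHS = 4
NUM_DEVICES = 1          # assumption: not stated in the card

batches_per_epoch = math.ceil(TRAIN_SAMPLES / (PER_DEVICE_BATCH * NUM_DEVICES))
steps_per_epoch = math.ceil(batches_per_epoch / GRAD_ACCUM)
total_steps = steps_per_epoch * EPOCHS
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES

def cosine_lr(step, total, warmup_fraction=0.1, peak_lr=2e-5):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup = math.ceil(total * warmup_fraction)
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(batches_per_epoch, steps_per_epoch, total_steps, effective_batch)
print([cosine_lr(s, total_steps) for s in (0, 10, 46, 92)])
```

The effective batch of 512 pairs per optimizer step matters for the ranking loss only indirectly: in-batch negatives come from each forward pass of 32 pairs, while gradient accumulation averages gradients across 16 such passes.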

All Hyperparameters

  • per_device_train_batch_size: 32
  • num_train_epochs: 4
  • max_steps: -1
  • learning_rate: 2e-05
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: None
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 16
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: True
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: epoch
  • per_device_eval_batch_size: 16
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss ir_dim_768_cosine_ndcg@10 ir_dim_512_cosine_ndcg@10 ir_dim_256_cosine_ndcg@10 ir_dim_128_cosine_ndcg@10 ir_dim_64_cosine_ndcg@10
-1 -1 - 0.5028 0.4902 0.4678 0.4258 0.3230
0.4396 10 7.8375 - - - - -
0.8791 20 4.0320 - - - - -
1.0 23 - 0.6992 0.6838 0.6627 0.6036 0.4931
1.3077 30 2.7947 - - - - -
1.7473 40 2.3759 - - - - -
2.0 46 - 0.7252 0.7094 0.6994 0.6427 0.5302
2.1758 50 2.1671 - - - - -
2.6154 60 1.8120 - - - - -
3.0 69 - 0.7344 0.7203 0.7077 0.6533 0.5394
3.0440 70 1.8638 - - - - -
3.4835 80 1.5476 - - - - -
3.9231 90 1.7850 - - - - -
4.0 92 - 0.7363 0.7222 0.7088 0.6554 0.5417
  • The bold row denotes the saved checkpoint.
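
To read the log at a glance, the first evaluation row (epoch -1, before training) can be compared with the last one (step 92). The short sketch below copies the NDCG@10 values from those two rows and reports, for each embedding size, the absolute gain from fine-tuning and the share of the 768-dimensional score that the truncated embedding retains. The variable names are illustrative.

```python
DIMS = [768, 512, 256, 128, 64]
# NDCG@10 before training (row "-1") and at the final evaluation (step 92).
before = [0.5028, 0.4902, 0.4678, 0.4258, 0.3230]
after = [0.7363, 0.7222, 0.7088, 0.6554, 0.5417]

for dim, b, a in zip(DIMS, before, after):
    gain = a - b                 # absolute improvement from fine-tuning
    retained = a / after[0]      # share of the final 768-dim score kept
    print(f"dim={dim:>3}  gain={gain:+.4f}  retained={retained:.1%}")
```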

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.3.0
  • Transformers: 5.3.0
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.13.0
  • Datasets: 4.8.2
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}