ModernBERT DAPT Embed DAPT Math

This is a sentence-transformers model fine-tuned from Master-thesis-NAP/ModernBert-DAPT-math. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: Master-thesis-NAP/ModernBert-DAPT-math
Maximum Sequence Length: 8192 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Language: en
License: apache-2.0

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
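
The Pooling module (with `pooling_mode_mean_tokens` enabled) and the Normalize module above amount to a masked mean over token vectors followed by L2 normalization. The following is a minimal pure-Python sketch of that computation on toy 4-dimensional vectors (the model itself produces 768 dimensions). The helper names `mean_pool` and `l2_normalize` are illustrative and not part of the library.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [value / count for value in total]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / max(norm, eps) for value in vector]


# Toy input: three token vectors of dimension 4; the last token is padding.
tokens = [[1.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 4.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

sentence_embedding = l2_normalize(mean_pool(tokens, mask))
print(sentence_embedding)
```

The padding vector does not influence the result, and the output has unit length, which is why cosine similarity and dot product coincide for this model's embeddings.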

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Master-thesis-NAP/ModernBERT-DAPT-Embed-DAPT-Math")
# Run inference
sentences = [
    "Does Werner-Young's inequality imply that the convolution of two $L^p$ spaces is always $L^r$ for $1 < r < \\infty$?",
    "[Werner-Young's inequality]\\label{Young op-op}\nSuppose $S\\in \\cS^p$ and $T\\in \\cS^q$ with $1+r^{-1}=p^{-1}+q^{-1}$.\nThen $S\\star T\\in L^r(\\R^{2d})$ and\n\\begin{align*}\n    \\|S\\star T\\|_{L^{r}}\\leq \\|S\\|_{\\cS^p}\\|T\\|_{\\cS^q}.\n\\end{align*}",
    '$\\cE^{(0)}_{p,\\alpha}$ satisfies the second Beurling-Deny criterion.  If $1 < p_- \\leq p_+ < \\infty$, it is reflexive and satisfies the $\\Delta_2$-condition.  \n %',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
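
Since the embeddings are normalized, the similarity scores can be used directly to rank candidate passages for a query. As a rough sketch of that semantic-search step, the hypothetical helper `rank_by_similarity` below orders candidate vectors by their dot product with a query vector. The toy 2-dimensional unit vectors are stand-ins for `model.encode(...)` output, not real embeddings.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, dot(query_vec, doc)) for i, doc in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy unit-length vectors standing in for encoded query and passages.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_by_similarity(query, docs, top_k=2)
print(ranking)
```

With the real model, the same scores are available as one row of `model.similarity(query_embedding, doc_embeddings)`.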

Evaluation

Metrics

Information Retrieval

Dataset: TESTING
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.568
cosine_accuracy@3	0.6324
cosine_accuracy@5	0.6586
cosine_accuracy@10	0.6938
cosine_precision@1	0.568
cosine_precision@3	0.3649
cosine_precision@5	0.2774
cosine_precision@10	0.1819
cosine_recall@1	0.0265
cosine_recall@3	0.0487
cosine_recall@5	0.0599
cosine_recall@10	0.0752
cosine_ndcg@10	0.2532
cosine_mrr@10	0.607
cosine_map@100	0.0742
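
The gap between accuracy@k and recall@k in the table suggests that queries in this test set have many relevant documents each: a single hit in the top k is enough for accuracy, while recall divides by the full number of relevant documents. The sketch below shows how these per-query quantities are commonly defined. The function `metrics_at_k` and the toy document IDs are hypothetical, and the evaluator's reported values are averages over all queries.

```python
def metrics_at_k(ranked_ids, relevant_ids, k):
    """Per-query accuracy, precision, recall and reciprocal rank at cutoff k."""
    hits = [doc_id in relevant_ids for doc_id in ranked_ids[:k]]
    num_hits = sum(hits)

    # Reciprocal rank of the first relevant document within the top k.
    reciprocal_rank = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "accuracy": 1.0 if num_hits > 0 else 0.0,
        "precision": num_hits / k,
        "recall": num_hits / len(relevant_ids) if relevant_ids else 0.0,
        "mrr": reciprocal_rank,
    }


# Toy query: 10 relevant documents, 2 of which appear in the top 5.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d5", "d6", "d8", "d10", "d11", "d12", "d13", "d14"}

at_5 = metrics_at_k(ranked, relevant, k=5)
at_1 = metrics_at_k(ranked, relevant, k=1)
print(at_5)
print(at_1)
```

In the toy query, accuracy@5 is 1.0 even though recall@5 is only 0.2, the same qualitative pattern as in the reported metrics.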

Training Details

Training Dataset

Unnamed Dataset

Size: 79,876 training samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 1000 samples:

	anchor	positive	negative
type	string	string	string
details	min: 9 tokens, mean: 38.48 tokens, max: 142 tokens	min: 5 tokens, mean: 210.43 tokens, max: 924 tokens	min: 14 tokens, mean: 91.02 tokens, max: 481 tokens

Samples:

anchor	positive	negative
`What is the limit of the proportion of 1's in the sequence $a_n$ as $n$ approaches infinity, given that $0 \leq 3g_n -2n \leq 4$?`	`Let $g_n$ be the number of $1$'s in the sequence $a_1 a_2 \cdots a_n$. Then \begin{equation} 0 \leq 3g_n -2n \leq 4 \label{star} \end{equation} for all $n$, and hence $\lim_{n \rightarrow \infty} g_n/n = 2/3$. \label{thm1}`	`\label{thm:bounds_initial} Let $\seqq{s}$ be a sequence of rank $r$ for which the roots of the characteristic polynomial are all different. Then, for any positive integer $M$, the rank of $\seq{s^M}$ is at most \begin{align} \rank s^M \leq \binom{M+r-1}{M}. \end{align}`
`Does the statement of \textbf{ThmConjAreTrue} imply that the maximum genus of a locally Cohen-Macaulay curve in $\mathbb{P}^3_{\mathbb{C}}$ of degree $d$ that does not lie on a surface of degree $s-1$ is always equal to $g(d,s)$?`	`\label{ThmConjAreTrue} Conjectures \ref{Conj1} and \ref{Conj2} are true. As a consequence, if either $d=s \geq 1$ or $d \geq 2s+1 \geq 3$, the maximum genus of a locally Cohen-Macaulay curve in $\mathbb{P}^3_{\mathbb{C}}$ of degree $d$ that does not lie on a surface of degree $s-1$ is equal to $g(d,s)$.`	`[{\cite[Corollary 2.2.2 with $p=3$]{BSY}}] Let $S$ be a non-trivial Severi-Brauer surface over a perfect field $\textbf{k}$. Then $S$ does not contain points of degree $d$, where $d$ is not divisible by $3$. On the other hand $S$ contains a point of degree $3$.`
`\emph{Is the statement \emph{If $X$ is a compact Hausdorff space, then $X$ is normal}, proven in the first isomorphism theorem for topological groups, or is it a well-known result in topology?}`	`} \newcommand{\ep}{`	`\label{prop:coherence} If $X$ is a qcqs scheme, then $RX$ is coherent in the sense that the set of quasi-compact open subsets of $RX$ is closed under finite intersections and forms a basis for the topology of $RX$.`

Loss: TripletLoss with these parameters:

{
    "distance_metric": "TripletDistanceMetric.COSINE",
    "triplet_margin": 0.1
}
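
With the cosine distance metric, the distance between two embeddings is 1 minus their cosine similarity, and a triplet contributes to the loss only when the positive is not closer to the anchor than the negative by at least the margin. The following single-triplet sketch illustrates this with the margin of 0.1 from the parameters above. The function names are illustrative, and the library averages this quantity over a batch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin), d = 1 - cos."""
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Easy triplet: the positive coincides with the anchor, the negative is orthogonal.
easy = cosine_triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])

# Hard triplet: the roles are swapped, so the margin is violated maximally here.
hard = cosine_triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])

print(easy, hard)
```

A triplet that already satisfies the margin contributes zero loss and therefore no gradient.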

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: epoch
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
gradient_accumulation_steps: 8
learning_rate: 2e-05
num_train_epochs: 4
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: True
tf32: True
load_best_model_at_end: True
optim: adamw_torch_fused
batch_sampler: no_duplicates
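
These settings interact: gradient accumulation multiplies the per-device batch size, which in turn fixes the number of optimizer steps per epoch. The arithmetic below is a back-of-the-envelope sketch using the 79,876 training samples reported in this card. The single-device setup is an assumption (the card does not state the device count), and the `no_duplicates` batch sampler or the trainer's rounding may shift the exact step counts slightly.

```python
import math

train_samples = 79_876        # from the training dataset section
per_device_batch = 16         # per_device_train_batch_size
grad_accum = 8                # gradient_accumulation_steps
num_devices = 1               # assumption: not stated in the card
warmup_ratio = 0.1
logged_total_steps = 2496     # last step that appears in the training log

# Number of samples that contribute to one optimizer update: 16 * 8 * 1.
effective_batch = per_device_batch * grad_accum * num_devices

# Optimizer updates needed to see every sample once (about 624.03).
steps_per_epoch = train_samples / effective_batch

# Warmup length implied by the ratio, rounded up to a whole step.
warmup_steps = math.ceil(logged_total_steps * warmup_ratio)

print(effective_batch, round(steps_per_epoch, 2), warmup_steps)
```

The estimate of roughly 624 updates per epoch agrees with the training log, where the first epoch-level evaluation is recorded at step 625.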

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: epoch
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 8
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 4
max_steps: -1
lr_scheduler_type: cosine
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: True
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: True
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
tp_size: 0
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
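
The combination of `learning_rate: 2e-05`, `lr_scheduler_type: cosine`, and `warmup_ratio: 0.1` describes a learning rate that rises linearly during warmup and then decays along a half cosine. The function below is a hedged sketch of that shape, not the trainer's own code. The defaults of 250 warmup steps and 2496 total steps are assumptions derived from the warmup ratio and the last logged step, and `cosine_lr` is an illustrative name.

```python
import math


def cosine_lr(step, base_lr=2e-5, warmup_steps=250, total_steps=2496):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half cosine from base_lr down to 0.
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 125, 250, 1373, 2496):
    print(step, cosine_lr(step))
```

Under these assumptions the rate peaks at the end of warmup and reaches zero at the final step.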

Training Logs

Epoch	Step	Training Loss	TESTING_cosine_ndcg@10
0.0160	10	1.1162	-
0.0320	20	1.0465	-
0.0481	30	0.9663	-
0.0641	40	0.8758	-
0.0801	50	0.8215	-
0.0961	60	0.7492	-
0.1122	70	0.6356	-
0.1282	80	0.3573	-
0.1442	90	0.166	-
0.1602	100	0.0797	-
0.1762	110	0.046	-
0.1923	120	0.0419	-
0.2083	130	0.025	-
0.2243	140	0.0233	-
0.2403	150	0.0205	-
0.2564	160	0.0142	-
0.2724	170	0.017	-
0.2884	180	0.0157	-
0.3044	190	0.0104	-
0.3204	200	0.0126	-
0.3365	210	0.019	-
0.3525	220	0.0153	-
0.3685	230	0.0171	-
0.3845	240	0.0124	-
0.4006	250	0.01	-
0.4166	260	0.0071	-
0.4326	270	0.0125	-
0.4486	280	0.0096	-
0.4647	290	0.0092	-
0.4807	300	0.0067	-
0.4967	310	0.0069	-
0.5127	320	0.0054	-
0.5287	330	0.0107	-
0.5448	340	0.0115	-
0.5608	350	0.0083	-
0.5768	360	0.0175	-
0.5928	370	0.0162	-
0.6089	380	0.0094	-
0.6249	390	0.0124	-
0.6409	400	0.0078	-
0.6569	410	0.014	-
0.6729	420	0.0117	-
0.6890	430	0.0097	-
0.7050	440	0.0094	-
0.7210	450	0.0077	-
0.7370	460	0.0103	-
0.7531	470	0.0099	-
0.7691	480	0.0123	-
0.7851	490	0.0103	-
0.8011	500	0.0098	-
0.8171	510	0.0059	-
0.8332	520	0.0031	-
0.8492	530	0.0075	-
0.8652	540	0.0101	-
0.8812	550	0.0099	-
0.8973	560	0.0098	-
0.9133	570	0.0072	-
0.9293	580	0.0057	-
0.9453	590	0.0074	-
0.9613	600	0.0038	-
0.9774	610	0.0127	-
0.9934	620	0.0098	-
1.0	625	-	0.2532
1.0080	630	0.0064	-
1.0240	640	0.0066	-
1.0401	650	0.0056	-
1.0561	660	0.0031	-
1.0721	670	0.0023	-
1.0881	680	0.0032	-
1.1041	690	0.0021	-
1.1202	700	0.0011	-
1.1362	710	0.006	-
1.1522	720	0.0045	-
1.1682	730	0.0041	-
1.1843	740	0.0026	-
1.2003	750	0.0019	-
1.2163	760	0.0058	-
1.2323	770	0.0054	-
1.2483	780	0.0066	-
1.2644	790	0.0033	-
1.2804	800	0.004	-
1.2964	810	0.0028	-
1.3124	820	0.0027	-
1.3285	830	0.0017	-
1.3445	840	0.0009	-
1.3605	850	0.0048	-
1.3765	860	0.0037	-
1.3925	870	0.0045	-
1.4086	880	0.0043	-
1.4246	890	0.0046	-
1.4406	900	0.0023	-
1.4566	910	0.0031	-
1.4727	920	0.0027	-
1.4887	930	0.0022	-
1.5047	940	0.0042	-
1.5207	950	0.0026	-
1.5368	960	0.0049	-
1.5528	970	0.0024	-
1.5688	980	0.0019	-
1.5848	990	0.0038	-
1.6008	1000	0.0036	-
1.6169	1010	0.0023	-
1.6329	1020	0.0021	-
1.6489	1030	0.0011	-
1.6649	1040	0.0025	-
1.6810	1050	0.0026	-
1.6970	1060	0.0034	-
1.7130	1070	0.0024	-
1.7290	1080	0.0038	-
1.7450	1090	0.002	-
1.7611	1100	0.0046	-
1.7771	1110	0.0003	-
1.7931	1120	0.0062	-
1.8091	1130	0.0057	-
1.8252	1140	0.0012	-
1.8412	1150	0.0021	-
1.8572	1160	0.0038	-
1.8732	1170	0.0024	-
1.8892	1180	0.0026	-
1.9053	1190	0.0034	-
1.9213	1200	0.0064	-
1.9373	1210	0.0041	-
1.9533	1220	0.0032	-
1.9694	1230	0.0028	-
1.9854	1240	0.0009	-
2.0	1250	0.0042	0.2488
2.0160	1260	0.0005	-
2.0320	1270	0.0018	-
2.0481	1280	0.0009	-
2.0641	1290	0.001	-
2.0801	1300	0.0024	-
2.0961	1310	0.0011	-
2.1122	1320	0.0008	-
2.1282	1330	0.0001	-
2.1442	1340	0.0006	-
2.1602	1350	0.0005	-
2.1762	1360	0.0003	-
2.1923	1370	0.0	-
2.2083	1380	0.0	-
2.2243	1390	0.0001	-
2.2403	1400	0.0001	-
2.2564	1410	0.0027	-
2.2724	1420	0.0005	-
2.2884	1430	0.0007	-
2.3044	1440	0.0001	-
2.3204	1450	0.0002	-
2.3365	1460	0.001	-
2.3525	1470	0.0003	-
2.3685	1480	0.001	-
2.3845	1490	0.0	-
2.4006	1500	0.0006	-
2.4166	1510	0.0007	-
2.4326	1520	0.0007	-
2.4486	1530	0.0004	-
2.4647	1540	0.0007	-
2.4807	1550	0.0012	-
2.4967	1560	0.0015	-
2.5127	1570	0.0014	-
2.5287	1580	0.0005	-
2.5448	1590	0.0005	-
2.5608	1600	0.0014	-
2.5768	1610	0.0016	-
2.5928	1620	0.0	-
2.6089	1630	0.0002	-
2.6249	1640	0.0006	-
2.6409	1650	0.0002	-
2.6569	1660	0.0003	-
2.6729	1670	0.0007	-
2.6890	1680	0.0005	-
2.7050	1690	0.0007	-
2.7210	1700	0.0	-
2.7370	1710	0.0008	-
2.7531	1720	0.0019	-
2.7691	1730	0.0017	-
2.7851	1740	0.0002	-
2.8011	1750	0.0002	-
2.8171	1760	0.0002	-
2.8332	1770	0.0014	-
2.8492	1780	0.0005	-
2.8652	1790	0.0021	-
2.8812	1800	0.002	-
2.8973	1810	0.0021	-
2.9133	1820	0.0007	-
2.9293	1830	0.0	-
2.9453	1840	0.0011	-
2.9613	1850	0.0006	-
2.9774	1860	0.0008	-
2.9934	1870	0.0001	-
3.0	1875	-	0.2516
3.0080	1880	0.0033	-
3.0240	1890	0.0	-
3.0401	1900	0.0	-
3.0561	1910	0.0009	-
3.0721	1920	0.0001	-
3.0881	1930	0.001	-
3.1041	1940	0.0001	-
3.1202	1950	0.0001	-
3.1362	1960	0.0	-
3.1522	1970	0.0003	-
3.1682	1980	0.0001	-
3.1843	1990	0.0005	-
3.2003	2000	0.0	-
3.2163	2010	0.0	-
3.2323	2020	0.0	-
3.2483	2030	0.0	-
3.2644	2040	0.0	-
3.2804	2050	0.0	-
3.2964	2060	0.0001	-
3.3124	2070	0.0001	-
3.3285	2080	0.0	-
3.3445	2090	0.0001	-
3.3605	2100	0.0	-
3.3765	2110	0.0005	-
3.3925	2120	0.0001	-
3.4086	2130	0.0	-
3.4246	2140	0.0	-
3.4406	2150	0.0004	-
3.4566	2160	0.0005	-
3.4727	2170	0.0	-
3.4887	2180	0.0006	-
3.5047	2190	0.0002	-
3.5207	2200	0.0007	-
3.5368	2210	0.0	-
3.5528	2220	0.0	-
3.5688	2230	0.0008	-
3.5848	2240	0.0001	-
3.6008	2250	0.0013	-
3.6169	2260	0.0004	-
3.6329	2270	0.0006	-
3.6489	2280	0.0001	-
3.6649	2290	0.0	-
3.6810	2300	0.0011	-
3.6970	2310	0.0005	-
3.7130	2320	0.0	-
3.7290	2330	0.0	-
3.7450	2340	0.0006	-
3.7611	2350	0.0	-
3.7771	2360	0.0002	-
3.7931	2370	0.0006	-
3.8091	2380	0.0002	-
3.8252	2390	0.0004	-
3.8412	2400	0.0	-
3.8572	2410	0.0007	-
3.8732	2420	0.0006	-
3.8892	2430	0.0002	-
3.9053	2440	0.0009	-
3.9213	2450	0.0009	-
3.9373	2460	0.0	-
3.9533	2470	0.0001	-
3.9694	2480	0.0012	-
3.9854	2490	0.0003	-
3.9950	2496	-	0.2524
-1	-1	-	0.2532

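The bold row denotes the saved checkpoint. The final row (-1) reports the same TESTING_cosine_ndcg@10 (0.2532) as the epoch 1.0 evaluation at step 625.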
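
With `load_best_model_at_end: True`, the trainer keeps the checkpoint that scored best on its selection metric. The card does not state which metric was used, so the snippet below assumes, for illustration only, that the choice is made on TESTING_cosine_ndcg@10 and picks the best of the four epoch-level evaluations in the log.

```python
# (step, TESTING_cosine_ndcg@10) pairs copied from the evaluation rows of the log.
eval_log = [(625, 0.2532), (1250, 0.2488), (1875, 0.2516), (2496, 0.2524)]

# Pick the evaluation with the highest score; ties resolve to the earliest step.
best_step, best_score = max(eval_log, key=lambda row: row[1])

print(f"best step: {best_step}, ndcg@10: {best_score}")
```

The scores of later epochs stay within about 0.005 of the first one, while the training loss keeps falling toward zero, so additional epochs did not improve retrieval quality on this test set.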

Framework Versions

Python: 3.11.12
Sentence Transformers: 4.1.0
Transformers: 4.51.3
PyTorch: 2.6.0+cu124
Accelerate: 1.6.0
Datasets: 2.14.4
Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

TripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Safetensors

Model size: 0.1B params
Tensor type: F32
