--- language: - en license: apache-2.0 tags: - sentence-transformers - sentence-similarity - feature-extraction - generated_from_trainer - dataset_size:79876 - loss:TripletLoss base_model: Master-thesis-NAP/ModernBert-DAPT-math widget: - source_sentence: What is the error estimate for the difference between the exact solution and the local oscillation decomposition (LOD) solution in terms of the $L_0$ norm? sentences: - '\label{RL1} The system \eqref{R3} has the following positive fixed points if $0 <\alpha\leq1$ and $b>d$ $$E^*=\left(\dfrac{d}{b}, \dfrac{(b-d) r}{b^2}\right)$$' - "\\label{theo1d}\nWith the assumptions and setting is this section, the finite\ \ difference solution computed using the improved harmonic average method applied\ \ to \\eqn{eq1d} or \\eqn{eq1dB} has second order convergence in the infinity\ \ norm, that is,\n\\eqm\n \\|\\mathbf{E} \\|_{\\infty}\\le C h^2,\n\\enm\nassuming\ \ that the true solution of \\eqn{eq1d} is piecewise $C^4$ excluding the interface\ \ $\\alf$, that is, \n$u(x) \\in C^4(0,\\alf) \\cup C^4(\\alf,1)$. \n%where $C$\ \ is a generic error constant." - "\\label{Corollary}\n Let Assumptions~\\ref{assum_1} and~\\ref{assump2} be\ \ satisfied. Let $u$ be the solution of~\\eqref{WeakForm} and let $u_{H,k}$ be\ \ the LOD solution of~\\eqref{local_probelm }. Then we have \n \\begin{equation}\\\ label{L2Estimate}\n \\|u-I_Hu_{H,k}\\|_0\\lesssim \\|u-I_Hu\\|_0+\\|u-u_{H,k}\\\ |_0 +H|u-u_{H,k}|_1.\n \\end{equation}\n %\\[\\|u-I_Hu_{H,k}\\|_0\\lesssim\ \ H |u|_1 +|u-u_{H,k}|_1.\\]" - source_sentence: What is the expected value of the number of individuals in a Markov branching process with non-homogeneous Poisson immigration (MBPNPI) at time $t=0$, given that the immigration rate is $\lambda$? sentences: - '\label{lemma-sampling} Fix an integer~$n\geq 1$. Consider the initial configuration with one active particle on each site of~$V_n$ and let the system evolve, with particles being killed when they jump out of~$V_n$, until no active particle remains in~$V_n$. Then the distribution of the resulting stable configuration is exactly the stationary distribution of the driven-dissipative Markov chain on~$V_n$. In particular, the number of sleeping particles remaining in~$V_n$ is distributed as~$S_n$.' - "The process $Y(t)$, $t\\geq 0,$ is called Markov branching process with\r\nnon-homogeneous\ \ Poisson immigration (MBPNPI)." - "For any $\\lambda \\in(0,1)$ and $s \\in\\mathbb N$,\n \\begin{equation*}\n\\\ sum_{k=s}^{\\infty}\\binom {k}{s}\n(1-\\lambda)^{k-s}= \\lambda^{-s-1}.\n\\\ end{equation*}" - source_sentence: Does the theorem imply that the rate of convergence of the sequence $T_{m,j}(E)$ to $T_{m+k_n,j+k_n}(E)$ is exponential in the distance between $m$ and $j$, and that this rate is bounded by a constant $C$ times an exponential decay factor involving the parameter $\gamma$? sentences: - "\\label{lem1}\n\t\tFor all $m,j\\in\\Z$,  we have\n\t\t\\begin{equation*}\n\t\ \t|| T_{m,j} (E)-T_{m+k_n,j+k_n}(E)||\\leq C e^{-\\gamma k_n} e^{(\\mathcal\ \ L(E)+\\varepsilon) |m-j|}. 
\n\t\t\\end{equation*}" - "[Divergence Theorem or Gauss-Green Theorem for Surfaces in $\\R^3$]\n\t\\label{thm:surface_int}\n\ \t Let $\\Sigma \\subset \\Omega\\subseteq\\R^3$ be a bounded smooth surface.\n\ \t Further, $\\bb a:\\Sigma\\to\\R^3$ is a continuously differentiable\ \ vector field that is either defined on the\n\t\t\t\t\tboundary $\\partial\\\ Sigma$ or has a bounded continuous extension to this boundary.\n\t Like\ \ in \\eqref{eq:decomp} it may be decomposed into tangential and normal components\n\ \t\t\t\t\tas follows $\\bb a = \\bb a^\\shortparallel + a_\\nu\\bs\\nu_\\Sigma$.\ \ By $\\dd l$ we denote the line element on \n\t\t\t\t\tthe curve $\\partial \\\ Sigma$. We assume that the curve is continuous and consists of finitely many\n\ \t\t\t\t\tsmooth pieces.\n\t Then the following divergence formula for\ \ surface integrals holds\n\t %\n\t \\begin{align}\n\t \ \ %\n\t \\int\\limits_\\Sigma \\left[\\nabla_\\Sigma\\cdot\\bb a^\\\ shortparallel\\right](\\x)\\;\\dd S\n\t\t\t\t\t\t\t= \\int\\limits_{\\partial\\\ Sigma} \\left[\\bb a\\cdot\\bs\\nu_{\\partial\\Sigma}\\right](\\x)\\,\\dd l .\n\ \t \\label{eq:surface_div}\n\t %\n\t \\end{align}\n\ \t\t\t\t\t%\n\t\t\t\t\tFrom this we obtain the formula\n\t\t\t\t\t%\n\t \ \ \\begin{align}\n\t %\n\t \\int\\limits_\\Sigma \\left[\\\ nabla_\\Sigma\\cdot\\bb a\\right](\\x)\\;\\dd S\n\t\t\t\t\t\t\t= \\int\\limits_{\\\ partial\\Sigma} \\left[\\bb a\\cdot\\bs\\nu_{\\partial\\Sigma}\\right](\\x)\\\ ,\\dd l \n\t\t\t\t\t\t\t-\\int\\limits_\\Sigma\\left[ 2\\kappa_Ma_\\nu\\right](\\\ x)\\;\\dd S.\n\t \\label{eq:surface_div_2}\n\t %\n\t \ \ \\end{align}\n\t %" - '\label{theo:helper3} Assume that $\{\PP_N\}_{N\ge 1}$ is a sequence of probability measures that is HT-appropriate in the sense of \cref{def:appropriate} and satisfies the LLN in the sense of \cref{def:LLN}. Let $(\kappa_n)_{n\ge 1}$ and $(m_n)_{n\ge 1}$ be the sequences that arise from these definitions. Moreover, assume that there exists a constant $C>0$ such that $|\kappa_n|\leq C^n$, for all $n \geq 1$. Then $(m_n)_{n\ge 1}$ is the sequence of moments of a unique probability measure on $\R$.' - source_sentence: What is the error estimate for the eigenfunction approximation in terms of the weak eigenvalue and the norm of the difference between the exact and approximate eigenfunctions? sentences: - "Consider dynamics \\eqref{avg} and define the corresponding average dynamics\ \ as $\\label{T-avg}\n\\mathring{\\chi} = \\epsilon h_{av}(\\chi)$, with the average\ \ function defined as\n\\begin{equation*} \nh_{av}(\\chi):=\\lim_{T \\to \\infty}\ \ \\frac{1}{T}\\int_{t}^{t+T} h(\\mu, \\chi, 0) d \\mu, \\ T>0,\n\\end{equation*}\n\ both \\eqref{avg} and \\eqref{T-avg} twice differentiable and bounded in every\ \ compact set of the $\\chi$-domain $\\mathcal{D} \\subset \\mathbb{R}^{3}$. \n\ %\nLet $\\chi(\\tau,\\epsilon)$ and $\\chi_{av}(\\epsilon\\tau)$ denote the solutions\ \ of \\eqref{avg} and \\eqref{T-avg}, respectively. If $\\chi_{av}(\\epsilon\\\ tau)\\in \\mathcal{D}$ for all $\\tau\\in[0,\\zeta/\\epsilon]$, $\\zeta\\geq 0$,\ \ and $\\chi(0,\\epsilon) - \\chi_{av}(0)=\\mathcal{O}(\\nu(\\epsilon))$, then\ \ there exists an $\\epsilon^{*}>0$ such that for all $0<\\epsilon<\\epsilon^{*}$,\ \ $\\chi(\\tau,\\epsilon)$ is well defined and\n$$\n\\chi(\\tau,\\epsilon) - \\\ chi_{av}(\\epsilon\\tau) = \\mathcal{O}(\\nu(\\epsilon)) \\ \\textnormal{on} \\\ \ \\tau \\in [0, \\zeta/\\epsilon],\n$$\nfor some function $\\nu\\in \\mathcal{K}$." 
- "(\\cite{DangWangXieZhou})\\label{Theorem_Error_Estimate_k}\nLet us define the\ \ spectral projection $F_{k,h}^{(\\ell)}: V\\mapsto {\\rm span}\\{u_{1,h}^{(\\\ ell)}, \\cdots, u_{k,h}^{(\\ell)}\\}$ for any integer $\\ell \\geq 1$ as follows:\n\ \\begin{eqnarray*}\na(F_{k,h}^{(\\ell)}w, u_{i,h}^{(\\ell)}) = a(w, u_{i,h}^{(\\\ ell)}), \\ \\ \\ i=1, \\cdots, k\\ \\ {\\rm for}\\ w\\in V.\n\\end{eqnarray*}\n\ Then the exact eigenfunctions $\\bar u_{1,h},\\cdots, \\bar u_{k,h}$ of (\\ref{Weak_Eigenvalue_Discrete})\ \ and the eigenfunction approximations $u_{1,h}^{(\\ell+1)}$, $\\cdots$, $u_{k,h}^{(\\\ ell+1)}$ from Algorithm \\ref{Algorithm_k} with the integer $\\ell > 1$ have the\ \ following error estimate:\n\\begin{eqnarray*}\\label{Error_Estimate_Inverse}\n\ \ \\left\\|\\bar u_{i,h} - F_{k,h}^{(\\ell+1)}\\bar u_{i,h} \\right\\|_a \\leq\n\ \ \\bar\\lambda_{i,h} \\sqrt{1+\\frac{\\eta_a^2(V_H)}{\\bar\\lambda_{1,h}\\big(\\\ delta_{k,i,h}^{(\\ell+1)}\\big)^2}}\n\\left(1+\\frac{\\bar\\mu_{1,h}}{\\delta_{k,i,h}^{(\\\ ell)}}\\right)\\eta_a^2(V_H)\\left\\|\\bar u_{i,h} - F_{k,h}^{(\\ell)}\\bar u_{i,h}\ \ \\right\\|_a,\n\\end{eqnarray*}\nwhere $\\delta_{k,i,h}^{(\\ell)} $ is defined\ \ as follows:\n\\begin{eqnarray*}\n\\delta_{k,i,h}^{(\\ell)} = \\min_{j\\not\\\ in \\{1, \\cdots, k\\}}\\left|\\frac{1}{\\lambda_{j,h}^{(\\ell)}}-\\frac{1}{\\\ bar\\lambda_{i,h}}\\right|,\\ \\ \\ i=1, \\cdots, k.\n\\end{eqnarray*}\nFurthermore,\ \ the following $\\left\\|\\cdot\\right\\|_b$-norm error estimate holds:\n\\begin{eqnarray*}\n\ \\left\\|\\bar u_{i,h} -F_{k,h}^{(\\ell+1)}\\bar u_{i,h} \\right\\|_b\\leq \n\\\ left(1+\\frac{\\bar\\mu_{1,h}}{\\delta_{k,i,h}^{(\\ell+1)}}\\right)\\eta_a(V_H)\ \ \\left\\|\\bar u_{i,h} -F_{k,h}^{(\\ell+1)}\\bar u_{i,h}\\right\\|_a.\n\\end{eqnarray*}" - "\\big[{\\bf Condition $SD1(h)$}\\big]\\label{DefnSD1(h)}\n\nIn \\cite{MDL} an\ \ approximation order $O(h^s)$, as $h\\to 0$, is proved, where $h$ is the sampling\ \ distance. The achievable order $s$ is of course limited by the smoothness order\ \ of the boundaries of $Graph(F)$. Then, the order $s$ depends upon the degree\ \ of the polynomials used to approximate the boundary near the neighborhood of\ \ points of topology change and upon the degree of splines used at regular regions.\ \ \n\nFor example, let us view Step C of the approximation algorithm described\ \ in Section 5.2 of \\cite{MDL}. \nIt is assumed that the boundary curves are\ \ $C^{2k}$ smooth, and it is implicitly assumed that $h$ is small enough so that\ \ there are $2k$ sample points close to the point of topology change, for computing\ \ the polynomial $p_{2k-1}$ therein.\nThis condition is related to the more general\ \ condition $SD(h)$ and it can serve as a practical way of checking it for the\ \ case $d=1$. That is, near a point of topology change, we check whether there\ \ are enough sample points for applying the approximation algorithm in \\cite{MDL}.\ \ We denote this condition as the $SD1(h)$ condition." - source_sentence: Does Werner-Young's inequality imply that the convolution of two $L^p$ spaces is always $L^r$ for $1 < r < \infty$? sentences: - "$\\cE^{(0)}_{p,\\alpha}$ satisfies the second Beurling-Deny criterion. If $1\ \ < p_- \\leq p_+ < \\infty$, it is reflexive and satisfies the $\\Delta_2$-condition.\ \ \n %" - "A \\emph{bond system} is a tuple $(B,C,s,t,1,\\cdot)$, where $B$ is a set of\ \ \\emph{bonds}, $C$ is a set of \\emph{content} relations, and $s,t:C\\to B$\ \ are \\emph{source} and \\emph{target} functions. 
For $c\\in C$ with $s(c)=x$\ \ and $t(c)=y$, we write $x\\xrightarrow{c}y$ or $c:x\\to y$, indicating that\ \ $x$ \\emph{contains} $y$. Each bond $x\\in B$ has an \\emph{identity} containment\ \ $1_x:x\\to x$, meaning every bond trivially contains itself. For $c:x\\to y$\ \ and $c':y\\to z$, their composition is $cc':x\\to z$. These data must satisfy:\n\ \ \\begin{enumerate}\n \\item Identity laws: For each $c:x\\to y$, $1_x\ \ c= c=c1_y$\n \\item Associativity: For $c:x\\to y$, $c':y\\to z$, $c'':z\\\ to w$, $c(c'c'')=(cc')c''$\n \\item Anti-symmetry: For $c:x\\to y$ and\ \ $c':y\\to x$, $x=y$\n \\item Left cancellation: For $c,c':x\\to y$ and\ \ $c'':y\\to z$, if $cc''=c'c''$, then $c=c'$\n \\end{enumerate}" - "[Werner-Young's inequality]\\label{Young op-op}\nSuppose $S\\in \\cS^p$ and $T\\\ in \\cS^q$ with $1+r^{-1}=p^{-1}+q^{-1}$.\nThen $S\\star T\\in L^r(\\R^{2d})$\ \ and\n\\begin{align*}\n \\|S\\star T\\|_{L^{r}}\\leq \\|S\\|_{\\cS^p}\\|T\\\ |_{\\cS^q}.\n\\end{align*}" pipeline_tag: sentence-similarity library_name: sentence-transformers metrics: - cosine_accuracy@1 - cosine_accuracy@3 - cosine_accuracy@5 - cosine_accuracy@10 - cosine_precision@1 - cosine_precision@3 - cosine_precision@5 - cosine_precision@10 - cosine_recall@1 - cosine_recall@3 - cosine_recall@5 - cosine_recall@10 - cosine_ndcg@10 - cosine_mrr@10 - cosine_map@100 model-index: - name: ModernBERT DAPT Embed DAPT Math results: - task: type: information-retrieval name: Information Retrieval dataset: name: TESTING type: TESTING metrics: - type: cosine_accuracy@1 value: 0.5679510844485464 name: Cosine Accuracy@1 - type: cosine_accuracy@3 value: 0.6324411628980157 name: Cosine Accuracy@3 - type: cosine_accuracy@5 value: 0.6586294416243654 name: Cosine Accuracy@5 - type: cosine_accuracy@10 value: 0.6938163359483156 name: Cosine Accuracy@10 - type: cosine_precision@1 value: 0.5679510844485464 name: Cosine Precision@1 - type: cosine_precision@3 value: 0.36494385479157054 name: Cosine Precision@3 - type: cosine_precision@5 value: 0.27741116751269035 name: Cosine Precision@5 - type: cosine_precision@10 value: 0.18192201199815417 name: Cosine Precision@10 - type: cosine_recall@1 value: 0.026541702012005317 name: Cosine Recall@1 - type: cosine_recall@3 value: 0.048742014322369596 name: Cosine Recall@3 - type: cosine_recall@5 value: 0.0598887341486898 name: Cosine Recall@5 - type: cosine_recall@10 value: 0.07516536747041261 name: Cosine Recall@10 - type: cosine_ndcg@10 value: 0.25320633940615317 name: Cosine Ndcg@10 - type: cosine_mrr@10 value: 0.6070309695944213 name: Cosine Mrr@10 - type: cosine_map@100 value: 0.07416668442975916 name: Cosine Map@100 --- # ModernBERT DAPT Embed DAPT Math This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Master-thesis-NAP/ModernBert-DAPT-math](https://huggingface.co/Master-thesis-NAP/ModernBert-DAPT-math). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. 
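As a quick illustration of the semantic-search use case, the sketch below ranks a toy corpus of theorem statements against a natural-language question. The corpus strings are illustrative stand-ins, not samples from the actual training data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Master-thesis-NAP/ModernBERT-DAPT-Embed-DAPT-Math")

# Toy corpus of LaTeX theorem statements (illustrative stand-ins only).
corpus = [
    "Let $g_n$ be the number of $1$'s in $a_1 \\cdots a_n$. Then $\\lim_{n\\to\\infty} g_n/n = 2/3$.",
    "Suppose $S\\in \\mathcal{S}^p$ and $T\\in \\mathcal{S}^q$ with $1+r^{-1}=p^{-1}+q^{-1}$. Then $S\\star T\\in L^r$.",
]
query = "What is the limiting proportion of 1's in the sequence?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# The model ends in a Normalize() module (see the architecture below), so
# cosine similarity and dot-product scores give the same ranking here.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']][:60]}")
```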
## Model Details ### Model Description - **Model Type:** Sentence Transformer - **Base model:** [Master-thesis-NAP/ModernBert-DAPT-math](https://huggingface.co/Master-thesis-NAP/ModernBert-DAPT-math) - **Maximum Sequence Length:** 8192 tokens - **Output Dimensionality:** 768 dimensions - **Similarity Function:** Cosine Similarity - **Language:** en - **License:** apache-2.0 ### Model Sources - **Documentation:** [Sentence Transformers Documentation](https://sbert.net) - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers) ### Full Model Architecture ``` SentenceTransformer( (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True}) (2): Normalize() ) ``` ## Usage ### Direct Usage (Sentence Transformers) First install the Sentence Transformers library: ```bash pip install -U sentence-transformers ``` Then you can load this model and run inference. ```python from sentence_transformers import SentenceTransformer # Download from the 🤗 Hub model = SentenceTransformer("Master-thesis-NAP/ModernBERT-DAPT-Embed-DAPT-Math") # Run inference sentences = [ "Does Werner-Young's inequality imply that the convolution of two $L^p$ spaces is always $L^r$ for $1 < r < \\infty$?", "[Werner-Young's inequality]\\label{Young op-op}\nSuppose $S\\in \\cS^p$ and $T\\in \\cS^q$ with $1+r^{-1}=p^{-1}+q^{-1}$.\nThen $S\\star T\\in L^r(\\R^{2d})$ and\n\\begin{align*}\n \\|S\\star T\\|_{L^{r}}\\leq \\|S\\|_{\\cS^p}\\|T\\|_{\\cS^q}.\n\\end{align*}", '$\\cE^{(0)}_{p,\\alpha}$ satisfies the second Beurling-Deny criterion. If $1 < p_- \\leq p_+ < \\infty$, it is reflexive and satisfies the $\\Delta_2$-condition. 
\n %', ] embeddings = model.encode(sentences) print(embeddings.shape) # [3, 768] # Get the similarity scores for the embeddings similarities = model.similarity(embeddings, embeddings) print(similarities.shape) # [3, 3] ``` ## Evaluation ### Metrics #### Information Retrieval * Dataset: `TESTING` * Evaluated with [InformationRetrievalEvaluator](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator) | Metric | Value | |:--------------------|:-----------| | cosine_accuracy@1 | 0.568 | | cosine_accuracy@3 | 0.6324 | | cosine_accuracy@5 | 0.6586 | | cosine_accuracy@10 | 0.6938 | | cosine_precision@1 | 0.568 | | cosine_precision@3 | 0.3649 | | cosine_precision@5 | 0.2774 | | cosine_precision@10 | 0.1819 | | cosine_recall@1 | 0.0265 | | cosine_recall@3 | 0.0487 | | cosine_recall@5 | 0.0599 | | cosine_recall@10 | 0.0752 | | **cosine_ndcg@10** | **0.2532** | | cosine_mrr@10 | 0.607 | | cosine_map@100 | 0.0742 | ## Training Details ### Training Dataset #### Unnamed Dataset * Size: 79,876 training samples * Columns: anchor, positive, and negative * Approximate statistics based on the first 1000 samples: | | anchor | positive | negative | |:--------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------| | type | string | string | string | | details | | | | * Samples: | anchor | positive | negative | |:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | What is the limit of the proportion of 1's in the sequence $a_n$ as $n$ approaches infinity, given that $0 \leq 3g_n -2n \leq 4$? | Let $g_n$ be the number of $1$'s in the sequence $a_1 a_2 \cdots a_n$.
Then<br>\begin{equation}<br>0 \leq 3g_n -2n \leq 4<br>\label{star}<br>\end{equation}<br>for all $n$, and hence<br>$\lim_{n \rightarrow \infty} g_n/n = 2/3$.<br>\label{thm1} | \label{thm:bounds_initial}<br>Let $\seqq{s}$ be a sequence of rank $r$ for which the roots of the characteristic polynomial are all different. Then, for any positive integer $M$, the rank of $\seq{s^M}$ is at most<br>\begin{align*}<br>\rank s^M \leq \binom{M+r-1}{M}.<br>\end{align*} |
| Does the statement of \textbf{ThmConjAreTrue} imply that the maximum genus of a locally Cohen-Macaulay curve in $\mathbb{P}^3_{\mathbb{C}}$ of degree $d$ that does not lie on a surface of degree $s-1$ is always equal to $g(d,s)$? | \label{ThmConjAreTrue}<br>Conjectures \ref{Conj1} and \ref{Conj2} are true.<br>As a consequence,<br>if either $d=s \geq 1$ or $d \geq 2s+1 \geq 3$,<br>the maximum genus of a locally Cohen-Macaulay curve in $\mathbb{P}^3_{\mathbb{C}}$ of degree $d$ that does not lie on a surface of degree $s-1$ is equal to $g(d,s)$. | [{\cite[Corollary 2.2.2 with $p=3$]{BSY}}]<br>Let $S$ be a non-trivial Severi-Brauer surface over a perfect field $\textbf{k}$. Then $S$ does not contain points of degree $d$, where $d$ is not divisible by $3$. On the other hand $S$ contains a point of degree $3$. |
| \emph{Is the statement \emph{If $X$ is a compact Hausdorff space, then $X$ is normal}, proven in the first isomorphism theorem for topological groups, or is it a well-known result in topology?} | }<br>\newcommand{\ep}{ | \label{prop:coherence}<br>If $X$ is a qcqs scheme, then $RX$ is coherent in the sense that the set of quasi-compact open subsets of $RX$ is closed under finite intersections and forms a basis for the topology of $RX$. |

* Loss: [TripletLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss) with these parameters:
  ```json
  {
      "distance_metric": "TripletDistanceMetric.COSINE",
      "triplet_margin": 0.1
  }
  ```

### Training Hyperparameters
#### Non-Default Hyperparameters

- `eval_strategy`: epoch
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `gradient_accumulation_steps`: 8
- `learning_rate`: 2e-05
- `num_train_epochs`: 4
- `lr_scheduler_type`: cosine
- `warmup_ratio`: 0.1
- `bf16`: True
- `tf32`: True
- `load_best_model_at_end`: True
- `optim`: adamw_torch_fused
- `batch_sampler`: no_duplicates

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: epoch
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 8
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 4
- `max_steps`: -1
- `lr_scheduler_type`: cosine
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: True
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: True
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `tp_size`: 0
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional

</details>
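For reference, the sketch below shows how an equivalent fine-tuning run could be assembled from the loss configuration and the non-default hyperparameters listed above. The one-triplet dataset is a placeholder for the actual 79,876-sample training set, and the evaluation-related arguments are omitted for brevity:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import TripletDistanceMetric, TripletLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("Master-thesis-NAP/ModernBert-DAPT-math")

# One-triplet placeholder; the real run used 79,876 (anchor, positive, negative) rows.
train_dataset = Dataset.from_dict({
    "anchor": ["What is the limiting proportion of 1's in the sequence?"],
    "positive": ["Let $g_n$ be the number of $1$'s in $a_1 \\cdots a_n$. Then $\\lim g_n/n = 2/3$."],
    "negative": ["Let $S$ be a non-trivial Severi-Brauer surface over a perfect field."],
})

# Cosine-distance triplet loss with margin 0.1, matching the loss parameters above.
loss = TripletLoss(model, distance_metric=TripletDistanceMetric.COSINE, triplet_margin=0.1)

args = SentenceTransformerTrainingArguments(
    output_dir="ModernBERT-DAPT-Embed-DAPT-Math",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    tf32=True,
    optim="adamw_torch_fused",
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    # eval_strategy="epoch" and load_best_model_at_end=True additionally
    # require an eval dataset or evaluator, omitted here for brevity.
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```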
### Training Logs
<details><summary>Click to expand</summary>

| Epoch | Step | Training Loss | TESTING_cosine_ndcg@10 |
|:-------:|:-------:|:-------------:|:----------------------:|
| 0.0160 | 10 | 1.1162 | - |
| 0.0320 | 20 | 1.0465 | - |
| 0.0481 | 30 | 0.9663 | - |
| 0.0641 | 40 | 0.8758 | - |
| 0.0801 | 50 | 0.8215 | - |
| 0.0961 | 60 | 0.7492 | - |
| 0.1122 | 70 | 0.6356 | - |
| 0.1282 | 80 | 0.3573 | - |
| 0.1442 | 90 | 0.166 | - |
| 0.1602 | 100 | 0.0797 | - |
| 0.1762 | 110 | 0.046 | - |
| 0.1923 | 120 | 0.0419 | - |
| 0.2083 | 130 | 0.025 | - |
| 0.2243 | 140 | 0.0233 | - |
| 0.2403 | 150 | 0.0205 | - |
| 0.2564 | 160 | 0.0142 | - |
| 0.2724 | 170 | 0.017 | - |
| 0.2884 | 180 | 0.0157 | - |
| 0.3044 | 190 | 0.0104 | - |
| 0.3204 | 200 | 0.0126 | - |
| 0.3365 | 210 | 0.019 | - |
| 0.3525 | 220 | 0.0153 | - |
| 0.3685 | 230 | 0.0171 | - |
| 0.3845 | 240 | 0.0124 | - |
| 0.4006 | 250 | 0.01 | - |
| 0.4166 | 260 | 0.0071 | - |
| 0.4326 | 270 | 0.0125 | - |
| 0.4486 | 280 | 0.0096 | - |
| 0.4647 | 290 | 0.0092 | - |
| 0.4807 | 300 | 0.0067 | - |
| 0.4967 | 310 | 0.0069 | - |
| 0.5127 | 320 | 0.0054 | - |
| 0.5287 | 330 | 0.0107 | - |
| 0.5448 | 340 | 0.0115 | - |
| 0.5608 | 350 | 0.0083 | - |
| 0.5768 | 360 | 0.0175 | - |
| 0.5928 | 370 | 0.0162 | - |
| 0.6089 | 380 | 0.0094 | - |
| 0.6249 | 390 | 0.0124 | - |
| 0.6409 | 400 | 0.0078 | - |
| 0.6569 | 410 | 0.014 | - |
| 0.6729 | 420 | 0.0117 | - |
| 0.6890 | 430 | 0.0097 | - |
| 0.7050 | 440 | 0.0094 | - |
| 0.7210 | 450 | 0.0077 | - |
| 0.7370 | 460 | 0.0103 | - |
| 0.7531 | 470 | 0.0099 | - |
| 0.7691 | 480 | 0.0123 | - |
| 0.7851 | 490 | 0.0103 | - |
| 0.8011 | 500 | 0.0098 | - |
| 0.8171 | 510 | 0.0059 | - |
| 0.8332 | 520 | 0.0031 | - |
| 0.8492 | 530 | 0.0075 | - |
| 0.8652 | 540 | 0.0101 | - |
| 0.8812 | 550 | 0.0099 | - |
| 0.8973 | 560 | 0.0098 | - |
| 0.9133 | 570 | 0.0072 | - |
| 0.9293 | 580 | 0.0057 | - |
| 0.9453 | 590 | 0.0074 | - |
| 0.9613 | 600 | 0.0038 | - |
| 0.9774 | 610 | 0.0127 | - |
| 0.9934 | 620 | 0.0098 | - |
| **1.0** | **625** | **-** | **0.2532** |
| 1.0080 | 630 | 0.0064 | - |
| 1.0240 | 640 | 0.0066 | - |
| 1.0401 | 650 | 0.0056 | - |
| 1.0561 | 660 | 0.0031 | - |
| 1.0721 | 670 | 0.0023 | - |
| 1.0881 | 680 | 0.0032 | - |
| 1.1041 | 690 | 0.0021 | - |
| 1.1202 | 700 | 0.0011 | - |
| 1.1362 | 710 | 0.006 | - |
| 1.1522 | 720 | 0.0045 | - |
| 1.1682 | 730 | 0.0041 | - |
| 1.1843 | 740 | 0.0026 | - |
| 1.2003 | 750 | 0.0019 | - |
| 1.2163 | 760 | 0.0058 | - |
| 1.2323 | 770 | 0.0054 | - |
| 1.2483 | 780 | 0.0066 | - |
| 1.2644 | 790 | 0.0033 | - |
| 1.2804 | 800 | 0.004 | - |
| 1.2964 | 810 | 0.0028 | - |
| 1.3124 | 820 | 0.0027 | - |
| 1.3285 | 830 | 0.0017 | - |
| 1.3445 | 840 | 0.0009 | - |
| 1.3605 | 850 | 0.0048 | - |
| 1.3765 | 860 | 0.0037 | - |
| 1.3925 | 870 | 0.0045 | - |
| 1.4086 | 880 | 0.0043 | - |
| 1.4246 | 890 | 0.0046 | - |
| 1.4406 | 900 | 0.0023 | - |
| 1.4566 | 910 | 0.0031 | - |
| 1.4727 | 920 | 0.0027 | - |
| 1.4887 | 930 | 0.0022 | - |
| 1.5047 | 940 | 0.0042 | - |
| 1.5207 | 950 | 0.0026 | - |
| 1.5368 | 960 | 0.0049 | - |
| 1.5528 | 970 | 0.0024 | - |
| 1.5688 | 980 | 0.0019 | - |
| 1.5848 | 990 | 0.0038 | - |
| 1.6008 | 1000 | 0.0036 | - |
| 1.6169 | 1010 | 0.0023 | - |
| 1.6329 | 1020 | 0.0021 | - |
| 1.6489 | 1030 | 0.0011 | - |
| 1.6649 | 1040 | 0.0025 | - |
| 1.6810 | 1050 | 0.0026 | - |
| 1.6970 | 1060 | 0.0034 | - |
| 1.7130 | 1070 | 0.0024 | - |
| 1.7290 | 1080 | 0.0038 | - |
| 1.7450 | 1090 | 0.002 | - |
| 1.7611 | 1100 | 0.0046 | - |
| 1.7771 | 1110 | 0.0003 | - |
| 1.7931 | 1120 | 0.0062 | - |
| 1.8091 | 1130 | 0.0057 | - |
| 1.8252 | 1140 | 0.0012 | - |
| 1.8412 | 1150 | 0.0021 | - |
| 1.8572 | 1160 | 0.0038 | - |
| 1.8732 | 1170 | 0.0024 | - |
| 1.8892 | 1180 | 0.0026 | - |
| 1.9053 | 1190 | 0.0034 | - |
| 1.9213 | 1200 | 0.0064 | - |
| 1.9373 | 1210 | 0.0041 | - |
| 1.9533 | 1220 | 0.0032 | - |
| 1.9694 | 1230 | 0.0028 | - |
| 1.9854 | 1240 | 0.0009 | - |
| 2.0 | 1250 | 0.0042 | 0.2488 |
| 2.0160 | 1260 | 0.0005 | - |
| 2.0320 | 1270 | 0.0018 | - |
| 2.0481 | 1280 | 0.0009 | - |
| 2.0641 | 1290 | 0.001 | - |
| 2.0801 | 1300 | 0.0024 | - |
| 2.0961 | 1310 | 0.0011 | - |
| 2.1122 | 1320 | 0.0008 | - |
| 2.1282 | 1330 | 0.0001 | - |
| 2.1442 | 1340 | 0.0006 | - |
| 2.1602 | 1350 | 0.0005 | - |
| 2.1762 | 1360 | 0.0003 | - |
| 2.1923 | 1370 | 0.0 | - |
| 2.2083 | 1380 | 0.0 | - |
| 2.2243 | 1390 | 0.0001 | - |
| 2.2403 | 1400 | 0.0001 | - |
| 2.2564 | 1410 | 0.0027 | - |
| 2.2724 | 1420 | 0.0005 | - |
| 2.2884 | 1430 | 0.0007 | - |
| 2.3044 | 1440 | 0.0001 | - |
| 2.3204 | 1450 | 0.0002 | - |
| 2.3365 | 1460 | 0.001 | - |
| 2.3525 | 1470 | 0.0003 | - |
| 2.3685 | 1480 | 0.001 | - |
| 2.3845 | 1490 | 0.0 | - |
| 2.4006 | 1500 | 0.0006 | - |
| 2.4166 | 1510 | 0.0007 | - |
| 2.4326 | 1520 | 0.0007 | - |
| 2.4486 | 1530 | 0.0004 | - |
| 2.4647 | 1540 | 0.0007 | - |
| 2.4807 | 1550 | 0.0012 | - |
| 2.4967 | 1560 | 0.0015 | - |
| 2.5127 | 1570 | 0.0014 | - |
| 2.5287 | 1580 | 0.0005 | - |
| 2.5448 | 1590 | 0.0005 | - |
| 2.5608 | 1600 | 0.0014 | - |
| 2.5768 | 1610 | 0.0016 | - |
| 2.5928 | 1620 | 0.0 | - |
| 2.6089 | 1630 | 0.0002 | - |
| 2.6249 | 1640 | 0.0006 | - |
| 2.6409 | 1650 | 0.0002 | - |
| 2.6569 | 1660 | 0.0003 | - |
| 2.6729 | 1670 | 0.0007 | - |
| 2.6890 | 1680 | 0.0005 | - |
| 2.7050 | 1690 | 0.0007 | - |
| 2.7210 | 1700 | 0.0 | - |
| 2.7370 | 1710 | 0.0008 | - |
| 2.7531 | 1720 | 0.0019 | - |
| 2.7691 | 1730 | 0.0017 | - |
| 2.7851 | 1740 | 0.0002 | - |
| 2.8011 | 1750 | 0.0002 | - |
| 2.8171 | 1760 | 0.0002 | - |
| 2.8332 | 1770 | 0.0014 | - |
| 2.8492 | 1780 | 0.0005 | - |
| 2.8652 | 1790 | 0.0021 | - |
| 2.8812 | 1800 | 0.002 | - |
| 2.8973 | 1810 | 0.0021 | - |
| 2.9133 | 1820 | 0.0007 | - |
| 2.9293 | 1830 | 0.0 | - |
| 2.9453 | 1840 | 0.0011 | - |
| 2.9613 | 1850 | 0.0006 | - |
| 2.9774 | 1860 | 0.0008 | - |
| 2.9934 | 1870 | 0.0001 | - |
| 3.0 | 1875 | - | 0.2516 |
| 3.0080 | 1880 | 0.0033 | - |
| 3.0240 | 1890 | 0.0 | - |
| 3.0401 | 1900 | 0.0 | - |
| 3.0561 | 1910 | 0.0009 | - |
| 3.0721 | 1920 | 0.0001 | - |
| 3.0881 | 1930 | 0.001 | - |
| 3.1041 | 1940 | 0.0001 | - |
| 3.1202 | 1950 | 0.0001 | - |
| 3.1362 | 1960 | 0.0 | - |
| 3.1522 | 1970 | 0.0003 | - |
| 3.1682 | 1980 | 0.0001 | - |
| 3.1843 | 1990 | 0.0005 | - |
| 3.2003 | 2000 | 0.0 | - |
| 3.2163 | 2010 | 0.0 | - |
| 3.2323 | 2020 | 0.0 | - |
| 3.2483 | 2030 | 0.0 | - |
| 3.2644 | 2040 | 0.0 | - |
| 3.2804 | 2050 | 0.0 | - |
| 3.2964 | 2060 | 0.0001 | - |
| 3.3124 | 2070 | 0.0001 | - |
| 3.3285 | 2080 | 0.0 | - |
| 3.3445 | 2090 | 0.0001 | - |
| 3.3605 | 2100 | 0.0 | - |
| 3.3765 | 2110 | 0.0005 | - |
| 3.3925 | 2120 | 0.0001 | - |
| 3.4086 | 2130 | 0.0 | - |
| 3.4246 | 2140 | 0.0 | - |
| 3.4406 | 2150 | 0.0004 | - |
| 3.4566 | 2160 | 0.0005 | - |
| 3.4727 | 2170 | 0.0 | - |
| 3.4887 | 2180 | 0.0006 | - |
| 3.5047 | 2190 | 0.0002 | - |
| 3.5207 | 2200 | 0.0007 | - |
| 3.5368 | 2210 | 0.0 | - |
| 3.5528 | 2220 | 0.0 | - |
| 3.5688 | 2230 | 0.0008 | - |
| 3.5848 | 2240 | 0.0001 | - |
| 3.6008 | 2250 | 0.0013 | - |
| 3.6169 | 2260 | 0.0004 | - |
| 3.6329 | 2270 | 0.0006 | - |
| 3.6489 | 2280 | 0.0001 | - |
| 3.6649 | 2290 | 0.0 | - |
| 3.6810 | 2300 | 0.0011 | - |
| 3.6970 | 2310 | 0.0005 | - |
| 3.7130 | 2320 | 0.0 | - |
| 3.7290 | 2330 | 0.0 | - |
| 3.7450 | 2340 | 0.0006 | - |
| 3.7611 | 2350 | 0.0 | - |
| 3.7771 | 2360 | 0.0002 | - |
| 3.7931 | 2370 | 0.0006 | - |
| 3.8091 | 2380 | 0.0002 | - |
| 3.8252 | 2390 | 0.0004 | - |
| 3.8412 | 2400 | 0.0 | - |
| 3.8572 | 2410 | 0.0007 | - |
| 3.8732 | 2420 | 0.0006 | - |
| 3.8892 | 2430 | 0.0002 | - |
| 3.9053 | 2440 | 0.0009 | - |
| 3.9213 | 2450 | 0.0009 | - |
| 3.9373 | 2460 | 0.0 | - |
| 3.9533 | 2470 | 0.0001 | - |
| 3.9694 | 2480 | 0.0012 | - |
| 3.9854 | 2490 | 0.0003 | - |
| 3.9950 | 2496 | - | 0.2524 |
| -1 | -1 | - | 0.2532 |

* The bold row denotes the saved checkpoint.

</details>
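The `TESTING_cosine_ndcg@10` column above is produced by the `InformationRetrievalEvaluator` configured in the Evaluation section, run after each epoch. The sketch below shows how a comparable evaluation could be set up; the queries, corpus, and relevance judgments are illustrative placeholders for the held-out TESTING split:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("Master-thesis-NAP/ModernBERT-DAPT-Embed-DAPT-Math")

# Illustrative placeholders; the reported numbers used the held-out TESTING set.
queries = {"q1": "What is the limiting proportion of 1's in the sequence?"}
corpus = {
    "d1": "Let $g_n$ be the number of $1$'s in $a_1 \\cdots a_n$. Then $\\lim g_n/n = 2/3$.",
    "d2": "Let $S$ be a non-trivial Severi-Brauer surface over a perfect field.",
}
relevant_docs = {"q1": {"d1"}}  # which corpus ids answer each query

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="TESTING",
)
results = evaluator(model)
print(results["TESTING_cosine_ndcg@10"])
```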
### Framework Versions - Python: 3.11.12 - Sentence Transformers: 4.1.0 - Transformers: 4.51.3 - PyTorch: 2.6.0+cu124 - Accelerate: 1.6.0 - Datasets: 2.14.4 - Tokenizers: 0.21.1 ## Citation ### BibTeX #### Sentence Transformers ```bibtex @inproceedings{reimers-2019-sentence-bert, title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", author = "Reimers, Nils and Gurevych, Iryna", booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", month = "11", year = "2019", publisher = "Association for Computational Linguistics", url = "https://arxiv.org/abs/1908.10084", } ``` #### TripletLoss ```bibtex @misc{hermans2017defense, title={In Defense of the Triplet Loss for Person Re-Identification}, author={Alexander Hermans and Lucas Beyer and Bastian Leibe}, year={2017}, eprint={1703.07737}, archivePrefix={arXiv}, primaryClass={cs.CV} } ```