metadata
tags:
  - machine-learning
  - data-science
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:134200
  - loss:MultipleNegativesRankingLoss
widget:
  - source_sentence: >-
      Currently, there seems to be breaking transformer news nearly every week
      with no sign of slowing.
    sentences:
      - >-
        Another active research aspect is concerned with improving the
        understanding of transformer-based models to further advance them.
      - >-
        For convenience, we choose to use a vector of unit length (its norm is
        1) and obtain this by dividing w by its norm, w ∥w∥ .
      - Measure the error (log Z Plog Z Q ) in bits.
  - source_sentence: >-
      Using the first three years of data, develop an appropriate ARIMA model
      and a procedure for these data.
    sentences:
      - >-
        Recall that the agent makes a decision at times determined by external
        events (or by other parts of the robot's control system).
      - >-
        A related fact is that in approximation in value space with multistep
        lookahead, J μ is the result of a step of Newton's method that starts at
        the function obtained by applying multiple value iterations to J.
      - c. Make one-step-ahead forecasts of the last 12 months.
  - source_sentence: >-
      The x-u-J notation is standard in deterministic optimal control textbooks
      (e.g., the classical books [AtF66] and [BrH75], noted earlier, as well as
      the more recent books by Stengel [Ste94], Kirk [Kir04], and Liberzon
      [Lib11]).
    sentences:
      - >-
        Sometimes the alternative notation p(j | i, u) is used for the
        transition probabilities. In the artificial intelligence literature, the
        focus is primarily on finitestate MDPs, particularly discounted and
        stochastic shortest path infinite horizon problems.
      - >-
        Research on artificial neural networks is as old as the digital
        computer.
      - >-
        Then the summations u P (u)f (u) should be written x P (x)f (u(x)). This
        means that P (u) is a finite sum of delta functions. This restriction
        guarantees that the mean and variance of u do exist, which is not
        necessarily the case for general P (u).
  - source_sentence: But how small can the error probability be?
    sentences:
      - >-
        The results described above allow us to offer the following service: if
        he tells us the properties of his channel, the desired rate R and the
        desired error probability p B , we can, after working out the relevant
        functions C, E r (R), and E sp (R), advise him that there exists a
        solution to his problem using a particular blocklength N ; indeed that
        almost any randomly chosen code with that blocklength should do the job.
      - >-
        If the problem space is complex, a few base-classifiers may be cascaded
        increasing the complexity at each stage.
      - >-
        Based on the fact that all these major implementations use
        normalization, it is clearly important, but why not just used standard
        batch normalization? Unfortunately, batch normalization is too memory
        intensive at our resolution. We have to come up with something that
        allows us to work with a few examples-that fit into our GPU memory with
        the two network graphs-but still works well.
  - source_sentence: >-
      If the weights are independent and the prior is taken as Gaussian, N (0,
      1/2λ)


      the MAP estimate minimizes the augmented error function


      where E is the usual classification or regression error (negative log
      likelihood).
    sentences:
      - >-
        This approach of removing unnecessary parameters is known as ridge
        regression in statistics.
      - >-
        If reselecting the test data doesn't help, you have other generalization
        problems.
      - >-
        These architectures are used across supervised, unsupervised, and
        reinforcement learning.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - pearson_cosine
  - spearman_cosine
model-index:
  - name: SentenceTransformer
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: val
          type: val
        metrics:
          - type: pearson_cosine
            value: null
            name: Pearson Cosine
          - type: spearman_cosine
            value: null
            name: Spearman Cosine
license: apache-2.0
datasets:
  - DigitalAsocial/ds-tb-17-g-sns-aml
language:
  - en
base_model:
  - sentence-transformers/all-MiniLM-L6-v2

SentenceTransformer

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base Model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
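
Because the final Normalize() module L2-normalizes every embedding, cosine similarity reduces to a plain dot product. A small sketch illustrating this (model id taken from the citation at the end of this card; the sentences are arbitrary):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("DigitalAsocial/all-MiniLM-L6-v2-ds-rag-s")

# Every embedding leaves the Normalize() module with unit length ...
emb = model.encode(["Research on artificial neural networks is as old as the digital computer."])
print(np.linalg.norm(emb, axis=1))  # approximately [1.0]

# ... so the dot product of two embeddings equals their cosine similarity.
a, b = model.encode(["gradient descent", "stochastic optimization"])
print(float(a @ b))  # same value model.similarity(...) would report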

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("DigitalAsocial/all-MiniLM-L6-v2-ds-rag-s")
# Run inference
sentences = [
    'If the weights are independent and the prior is taken as Gaussian, N (0, 1/2λ)\n\nthe MAP estimate minimizes the augmented error function\n\nwhere E is the usual classification or regression error (negative log likelihood).',
    'This approach of removing unnecessary parameters is known as ridge regression in statistics.',
    'These architectures are used across supervised, unsupervised, and reinforcement learning.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4685, 0.1523],
#         [0.4685, 1.0000, 0.1570],
#         [0.1523, 0.1570, 1.0000]])
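
The description above also mentions semantic search. A minimal sketch using sentence_transformers.util.semantic_search (the corpus and query below are illustrative placeholders):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("DigitalAsocial/all-MiniLM-L6-v2-ds-rag-s")

# Illustrative corpus; in practice these would be your own document chunks.
corpus = [
    "Batch normalization is too memory intensive at high resolutions.",
    "Ridge regression shrinks unnecessary parameters toward zero.",
    "The agent makes a decision at times determined by external events.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("Why not use standard batch normalization?", convert_to_tensor=True)

# For each query, returns the top-k corpus entries ranked by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(round(hit["score"], 3), corpus[hit["corpus_id"]])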

Evaluation

Metrics

Semantic Similarity

Metric Value
pearson_cosine nan
spearman_cosine nan
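
The Pearson and Spearman cosine scores were not recorded for this run (reported as nan). For reference, a minimal sketch of how these metrics are typically produced with the library's EmbeddingSimilarityEvaluator, assuming a small validation set with graded similarity scores (the pairs and scores below are made up):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("DigitalAsocial/all-MiniLM-L6-v2-ds-rag-s")

# Hypothetical validation triples: two sentence lists plus gold similarity scores in [0, 1].
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["Measure the error in bits.", "Make one-step-ahead forecasts.", "But how small can the error probability be?"],
    sentences2=["Report the error in bits.", "Develop an appropriate ARIMA model.", "Research on neural networks is old."],
    scores=[0.9, 0.6, 0.1],
    name="val",
)

# On recent sentence-transformers versions this returns a dict that includes
# val_pearson_cosine and val_spearman_cosine.
print(evaluator(model))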

Training Details

Training Dataset

Training Data

The model was fine-tuned on sentence pairs drawn from 17 reference books in Data Science and Machine Learning, listed below. All source books were preprocessed using GROBID, an open-source tool for extracting and structuring text from PDF documents: the raw PDFs were converted into structured text, segmented into sentences, and cleaned before being used for training. This ensured consistent formatting and reliable sentence boundaries across the dataset. A sketch of this preprocessing step follows the book list.

  1. Aßenmacher, Matthias. Multimodal Deep Learning. Self-published, 2023.
  2. Bertsekas, Dimitri P. A Course in Reinforcement Learning. Arizona State University.
  3. Boykis, Vicki. What are Embeddings. Self-published, 2023.
  4. Bruce, Peter, and Andrew Bruce. Practical Statistics for Data Scientists: 50 Essential Concepts. O’Reilly Media, 2017.
  5. Daumé III, Hal. A Course in Machine Learning. Self-published.
  6. Deisenroth, Marc Peter, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge University Press, 2020.
  7. Kunin, Daniel, Jingru Guo, Tyler Devlin, and Daniel Xiang. Seeing Theory. Self-published.
  8. Gutmann, Michael U. Pen & Paper: Exercises in Machine Learning. Self-published.
  9. Jung, Alexander. Machine Learning: The Basics. Springer, 2022.
  10. Langr, Jakub, and Vladimir Bok. GANs in Action: Deep Learning with Generative Adversarial Networks. Manning Publications, 2019.
  11. MacKay, David J.C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
  12. Montgomery, Douglas C., Cheryl L. Jennings, and Murat Kulahci. Introduction to Time Series Analysis and Forecasting. 2nd Edition, Wiley, 2015.
  13. Nilsson, Nils J. Introduction to Machine Learning: An Early Draft of a Proposed Textbook. Stanford University, 1996.
  14. Prince, Simon J.D. Understanding Deep Learning. Draft Edition, 2024.
  15. Shashua, Amnon. Introduction to Machine Learning. The Hebrew University of Jerusalem, 2008.
  16. Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd Edition, MIT Press, 2018.
  17. Alpaydin, Ethem. Introduction to Machine Learning. 3rd Edition, MIT Press, 2014.

⚠️ Note: Due to copyright restrictions, the full text of these books is not included in this repository. Only the fine-tuned model weights are shared.
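
As noted above, sentence extraction relied on GROBID. A minimal sketch of that kind of post-processing step, assuming GROBID's TEI XML output and a simple regex-based sentence splitter (file name and thresholds are hypothetical; the actual preprocessing scripts are not released):

import re
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_to_sentences(tei_path: str) -> list[str]:
    """Extract body paragraphs from a GROBID TEI file and split them into cleaned sentences."""
    root = ET.parse(tei_path).getroot()
    body = root.find(".//tei:body", TEI_NS)
    sentences = []
    for p in body.iterfind(".//tei:p", TEI_NS):
        text = re.sub(r"\s+", " ", "".join(p.itertext())).strip()    # collapse whitespace
        for sent in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text):       # naive sentence splitter
            if len(sent.split()) >= 5:                               # drop fragments and headings
                sentences.append(sent)
    return sentences

sentences = tei_to_sentences("mackay_itila.grobid.tei.xml")  # hypothetical file name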

Unnamed Dataset

  • Size: 134,200 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    - sentence_0: string; min 8 tokens, mean 39.42 tokens, max 256 tokens
    - sentence_1: string; min 7 tokens, mean 41.22 tokens, max 256 tokens
  • Samples (three example pairs):
    - sentence_0: This equation is somewhat similar to what you have seen before (as a high-level simplification of equation 5.1), with some important differences.
      sentence_1: The critic is trying to estimate the earth mover's distance, and looks for the maximum difference between the real (first term) and the generated (second term) distribution under different (valid) parametrizations of the f_w function.
    - sentence_0: [2, p.173] Sketch the mutual information for this channel as a function of the input distribution p. Pick a convenient two-dimensional representation of p.
      sentence_1: The optimization routine must therefore take account of the possibility that, as we go up hill on I(X; Y), we may run into the inequality constraints p_i ≥ 0.
    - sentence_0: I(X; Y) is a convex function of the channel parameters.
      sentence_1: • Derive the AdaBoost algorithm. • Understand the relationship between boosting decision stumps and linear classification.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 6
  • fp16: True
  • multi_dataset_batch_sampler: round_robin
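
Putting the loss and the non-default hyperparameters above together, here is a minimal sketch of what the run could look like with the SentenceTransformerTrainer API. This is a reconstruction, not the released training script, and it assumes the Hub dataset DigitalAsocial/ds-tb-17-g-sns-aml exposes a train split with sentence_0/sentence_1 columns:

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Base model and dataset are the ones named elsewhere on this card.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
train_dataset = load_dataset("DigitalAsocial/ds-tb-17-g-sns-aml", split="train")

# In-batch negatives: each sentence_0 is pulled toward its sentence_1 and pushed away
# from the other 15 sentence_1 entries in the batch (scale=20.0, cosine similarity).
loss = MultipleNegativesRankingLoss(model, scale=20.0)

args = SentenceTransformerTrainingArguments(
    output_dir="all-MiniLM-L6-v2-ds-rag-s",
    num_train_epochs=6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True,
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()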

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 6
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

📈 Loss Curve (logged at fractional epochs during training)

Epoch Loss Grad Norm
0.06 1.5921 19.76
0.30 1.2517 13.44
0.60 1.0856 13.96
0.95 0.9268 13.22
1.25 0.7569 14.07
1.61 0.6757 11.21
1.97 0.6409 12.76
2.32 0.5111 20.53
2.74 0.5059 14.88
3.10 0.3880 9.56
3.46 0.3792 9.78
3.87 0.3750 inf
4.11 0.3345 11.03
4.47 0.3271 13.21
4.83 0.3064 11.37
5.07 0.2752 19.23
5.36 0.2740 15.45
5.72 0.2773 6.55
5.96 0.2710 19.79
6.00 0.5610 (final reported training loss)

Framework Versions

  • Python: 3.11.7
  • Sentence Transformers: 5.1.1
  • Transformers: 4.57.3
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.12.0
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

If you use this model, please cite:

@misc{aghakhani2025synergsticrag,
  author       = {Danial Aghakhani Zadeh},
  title        = {Fine-tuned all-MiniLM-L6-v2 for Data Science RAG},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/DigitalAsocial/all-MiniLM-L6-v2-ds-rag-s}}
}