---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:100000
  - loss:MultipleNegativesRankingLoss
base_model: YujinPang/docemb_M3_1
widget:
  - source_sentence: |-
      Course structure
      Mechatronics students take courses in various fields:
    sentences:
      - >-
        Robotics is one of the newest emerging subfield of mechatronics. It is
        the study of robots that how they are manufactured and operated. Since
        2000, this branch of mechatronics is attracting a number of aspirants.
        Robotics is interrelated with automation because here also not much
        human intervention is required. A large number of factories especially
        in automobile factories, robots are founds in assembly lines where they
        perform the job of drilling, installation and fitting. Programming
        skills are necessary for specialization in robotics. Knowledge of
        programming language —ROBOTC is important for functioning robots. An
        industrial robot is a prime example of a mechatronics system; it
        includes aspects of electronics, mechanics, and computing to do its
        day-to-day jobs.
      - >-
        Melting and boiling points 

        Melting and boiling points, typically expressed in degrees Celsius at a
        pressure of one atmosphere, are commonly used in characterizing the
        various elements. While known for most elements, either or both of these
        measurements is still undetermined for some of the radioactive elements
        available in only tiny quantities. Since helium remains a liquid even at
        absolute zero at atmospheric pressure, it has only a boiling point, and
        not a melting point, in conventional presentations.
      - >-
        Capsicum chili peppers are commonly used to add pungency in cuisines
        worldwide. The range of pepper heat reflected by a Scoville score is
        from 500 or less (sweet peppers) to over 2.6 million (Pepper X) (table
        below; Scoville scales for individual chili peppers are in the
        respective linked article). Some peppers such as the Guntur chilli and
        Rocoto are excluded from the list due to their very wide SHU range.
        Others such as Dragon's Breath and Chocolate 7-pot have not been
        officially verified.
  - source_sentence: >-
      In contrast to the South Pole neutrino telescopes AMANDA and IceCube,
      ANTARES uses water instead of ice as its Cherenkov medium. As light in
      water is less scattered than in ice this results in a better resolving
      power. On the other hand, water contains more sources of background light
      than ice (radioactive isotopes potassium-40 in the sea salt and
      bioluminescent organisms), leading to a higher energy thresholds for
      ANTARES with respect to IceCube and making more sophisticated
      background-suppression methods necessary.
    sentences:
      - >-
        Deployment and connection of the detector are performed in cooperation
        with the French oceanographic institute, IFREMER, currently using the
        ROV Victor, and for some past operations the submarine Nautile.
      - >-
        To distinguish the other types of multithreading from SMT, the term
        "temporal multithreading" is used to denote when instructions from only
        one thread can be issued at a time.
      - >-
        The two most important classes of divergences are the f-divergences and
        Bregman divergences; however, other types of divergence functions are
        also encountered in the literature. The only divergence that is both an
        f-divergence and a Bregman divergence is the Kullback–Leibler
        divergence; the squared Euclidean divergence is a Bregman divergence
        (corresponding to the function ) but not an f-divergence.
  - source_sentence: >-
      The term "hyperbolic geometry" was introduced by Felix Klein in 1871.
      Klein followed an initiative of Arthur Cayley to use the transformations
      of projective geometry to produce isometries. The idea used a conic
      section or quadric to define a region, and used cross ratio to define a
      metric. The projective transformations that leave the conic section or
      quadric stable are the isometries. "Klein showed that if the Cayley
      absolute is a real curve then the part of the projective plane in its
      interior is isometric to the hyperbolic plane..."
    sentences:
      - >-
        The mathematics is not difficult but is intertwined so the following is
        only a brief sketch. Starting with a non-symmetric tensor , the
        Lagrangian density is split into
      - >-
        Because Euclidean, hyperbolic and elliptic geometry are all consistent,
        the question arises: which is the real geometry of space, and if it is
        hyperbolic or elliptic, what is its curvature?
      - >-
        Wind farm waste is less toxic than other garbage. Wind turbine blades
        represent only a fraction of overall waste in the US, according to the
        Wind-industry trade association, American Wind Energy Association.
  - source_sentence: >-
      The StyleGAN-2-ADA paper points out a further point on data augmentation:
      it must be invertible. Continue with the example of generating ImageNet
      pictures. If the data augmentation is "randomly rotate the picture by 0,
      90, 180, 270 degrees with equal probability", then there is no way for the
      generator to know which is the true orientation: Consider two generators ,
      such that for any latent , the generated image  is a 90-degree rotation of
      . They would have exactly the same expected loss, and so neither is
      preferred over the other.
    sentences:
      - >-
        The key method to distinguish between these different models involves
        study of the particles' interactions ("coupling") and exact decay
        processes ("branching ratios"), which can be measured and tested
        experimentally in particle collisions. In the Type-I 2HDM model one
        Higgs doublet couples to up and down quarks, while the second doublet
        does not couple to quarks. This model has two interesting limits, in
        which the lightest Higgs couples to just fermions ("gauge-phobic") or
        just gauge bosons ("fermiophobic"), but not both. In the Type-II 2HDM
        model, one Higgs doublet only couples to up-type quarks, the other only
        couples to down-type quarks. The heavily researched Minimal
        Supersymmetric Standard Model (MSSM) includes a Type-II 2HDM Higgs
        sector, so it could be disproven by evidence of a Type-I 2HDM Higgs.
      - >-
        Model variants 

        Several different model variants of the S4 are sold, with most variants
        varying mainly in handling regional network types and bands. To prevent
        grey market reselling, models of the S4 manufactured after July 2013
        implement a regional lockout system in certain regions, requiring that
        the first SIM card used on a European and North American model be from a
        carrier in that region. Samsung stated that the lock would be removed
        once a local SIM card is used. SIM format for all variants is Micro-SIM,
        which can have one or two depending on model.
      - >-
        Another inspiration for GANs was noise-contrastive estimation, which
        uses the same loss function as GANs and which Goodfellow studied during
        his PhD in 2010–2014.
  - source_sentence: >-
      The final step for the BoW model is to convert vector-represented patches
      to "codewords" (analogous to words in text documents), which also produces
      a "codebook" (analogy to a word dictionary). A codeword can be considered
      as a representative of several similar patches. One simple method is
      performing k-means clustering over all the vectors. Codewords are then
      defined as the centers of the learned clusters. The number of the clusters
      is the codebook size (analogous to the size of the word dictionary).
    sentences:
      - "Pathria retired from the University of Waterloo in August 1998 and, soon thereafter, moved to the west coast of the US and became an adjunct professor of physics at the University of California at San Diego – a position he continued to hold till 2010. In 2009, Pathria's newest publishers (Elsevier/Academic) prevailed upon him to produce a third edition of this book. He now sought the help of Paul Beale, of the University of Colorado at Boulder, whose co-authorship resulted in another brand new edition in March 2011. Ten years later, in 2021, Pathria and Beale produced a fourth edition of this book."
      - >-
        C++

        In the 1970s, software engineers needed language support to break large
        projects down into modules. One obvious feature was to decompose large
        projects physically into separate files. A less obvious feature was to
        decompose large projects logically into abstract datatypes. At the time,
        languages supported concrete (scalar) datatypes like integer numbers,
        floating-point numbers, and strings of characters. Abstract datatypes
        are structures of concrete datatypes, with a new name assigned. For
        example, a list of integers could be called integer_list.
      - |-
        External links
         Bag of Visual Words in a Nutshell a short tutorial by Bethea Davida. A demo for two bag-of-words classifiers by L. Fei-Fei, R. Fergus, and A. Torralba. Caltech Large Scale Image Search Toolbox: a Matlab/C++ toolbox implementing Inverted File search for Bag of Words model. It also contains implementations for fast approximate nearest neighbor search using randomized k-d tree, locality-sensitive hashing, and hierarchical k-means. DBoW2 library: a library that implements a fast bag of words in C++ with support for OpenCV.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

SentenceTransformer based on YujinPang/docemb_M3_1

This is a sentence-transformers model finetuned from YujinPang/docemb_M3_1. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: YujinPang/docemb_M3_1
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
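The Pooling and Normalize modules above can be mirrored by hand: mean pooling averages the token embeddings while ignoring padding positions, and Normalize rescales each sentence vector to unit L2 norm. A minimal NumPy sketch on dummy token embeddings (random stand-ins, not real model output):

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    # Mean pooling: average token embeddings, masking out padding positions.
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    pooled = summed / counts
    # Normalize: unit L2 norm per sentence, so dot product = cosine similarity.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Dummy batch: 2 sentences, 4 token positions, 384-dim embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 4, 384))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])
out = mean_pool_and_normalize(emb, mask)
print(out.shape)                     # (2, 384)
print(np.linalg.norm(out, axis=1))   # both norms are 1.0
```

This is why the output dimensionality is 384 regardless of input length, and why cosine similarity reduces to a dot product on this model's embeddings.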

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("YujinPang/docemb_M3_1_9")
# Run inference
sentences = [
    'The final step for the BoW model is to convert vector-represented patches to "codewords" (analogous to words in text documents), which also produces a "codebook" (analogy to a word dictionary). A codeword can be considered as a representative of several similar patches. One simple method is performing k-means clustering over all the vectors. Codewords are then defined as the centers of the learned clusters. The number of the clusters is the codebook size (analogous to the size of the word dictionary).',
    'External links\n Bag of Visual Words in a Nutshell a short tutorial by Bethea Davida. A demo for two bag-of-words classifiers by L. Fei-Fei, R. Fergus, and A. Torralba. Caltech Large Scale Image Search Toolbox: a Matlab/C++ toolbox implementing Inverted File search for Bag of Words model. It also contains implementations for fast approximate nearest neighbor search using randomized k-d tree, locality-sensitive hashing, and hierarchical k-means. DBoW2 library: a library that implements a fast bag of words in C++ with support for OpenCV.',
    'C++\nIn the 1970s, software engineers needed language support to break large projects down into modules. One obvious feature was to decompose large projects physically into separate files. A less obvious feature was to decompose large projects logically into abstract datatypes. At the time, languages supported concrete (scalar) datatypes like integer numbers, floating-point numbers, and strings of characters. Abstract datatypes are structures of concrete datatypes, with a new name assigned. For example, a list of integers could be called integer_list.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
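Because the Normalize module makes every embedding unit-length, a plain matrix product already yields cosine similarities, which is all a basic semantic-search loop needs. A minimal sketch with NumPy, using random unit vectors as stand-ins for `model.encode` output:

```python
import numpy as np

# Dummy stand-ins for model.encode(...) output: 5 corpus embeddings and one
# query embedding, each normalized to unit length like this model's output.
rng = np.random.default_rng(42)
corpus = rng.normal(size=(5, 384))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = corpus[2] + 0.01 * rng.normal(size=384)  # query close to corpus item 2
query /= np.linalg.norm(query)

# With unit-norm vectors, a matrix product gives cosine similarity scores.
scores = corpus @ query
best = int(np.argmax(scores))
print(best)  # 2
```

In practice you would replace the random vectors with `model.encode(corpus_sentences)` and `model.encode(query_sentence)`; the ranking step is unchanged.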

Training Details

Training Dataset

Unnamed Dataset

  • Size: 100,000 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    • sentence_0: string; min 10 tokens, mean 96.25 tokens, max 256 tokens
    • sentence_1: string; min 11 tokens, mean 93.51 tokens, max 256 tokens
  • Samples:
    • sentence_0: The character has been portrayed by Silas Carson in Episodes I-III, and voiced by Tom Kenny in The Clone Wars.
      sentence_1: The character has been voiced by Dee Bradley Baker in The Clone Wars and The Bad Batch.
    • sentence_0: Abdomen: The muscles of the abdominal wall are subdivided into a superficial and a deep group.
      sentence_1: The muscles of the hip are divided into a dorsal and a ventral group.
    • sentence_0: Resonant frequency: When placed in a magnetic field, NMR active nuclei (such as 1H or 13C) absorb electromagnetic radiation at a frequency characteristic of the isotope. The resonant frequency, energy of the radiation absorbed, and the intensity of the signal are proportional to the strength of the magnetic field. For example, in a 21 Tesla magnetic field, hydrogen nuclei (commonly referred to as protons) resonate at 900 MHz. It is common to refer to a 21 T magnet as a 900 MHz magnet since hydrogen is the most common nucleus detected. However, different nuclei will resonate at different frequencies at this field strength in proportion to their nuclear magnetic moments.
      sentence_1: Spectral interpretation: NMR signals are ordinarily characterized by three variables: chemical shift, spin-spin coupling, and relaxation time.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
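The loss can be sketched directly: with `scale: 20.0` and `similarity_fct: cos_sim`, each anchor (`sentence_0`) is scored against every positive (`sentence_1`) in the batch, and cross-entropy pushes its own positive to rank first while the other in-batch positives act as negatives. A minimal NumPy sketch on random vectors (stand-ins, not real embeddings):

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """In-batch-negatives loss: for anchor i, positives[i] is the positive
    and every positives[j], j != i, serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                   # (batch, batch) cosine scores
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    # Row-wise log-softmax; the correct class for row i is column i.
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 384))
loss_random = multiple_negatives_ranking_loss(a, rng.normal(size=(8, 384)))
loss_matched = multiple_negatives_ranking_loss(a, a)  # perfectly aligned pairs
print(loss_matched < loss_random)  # True
```

The large batch size used here (256) matters for this loss: every extra example in the batch contributes 255 negatives per anchor for free.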
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 256
  • num_train_epochs: 1
  • multi_dataset_batch_sampler: round_robin
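A sketch of how a run with these non-default hyperparameters could be set up, assuming the current `sentence_transformers` trainer API; the dataset rows below are placeholders for the 100,000 `(sentence_0, sentence_1)` pairs, not the actual training data:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import MultiDatasetBatchSamplers

# Placeholder rows standing in for the real (sentence_0, sentence_1) pairs.
train_dataset = Dataset.from_dict({
    "sentence_0": ["..."],
    "sentence_1": ["..."],
})

model = SentenceTransformer("YujinPang/docemb_M3_1")
args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    num_train_epochs=1,
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```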

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 256
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Framework Versions

  • Python: 3.10.11
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.3
  • PyTorch: 2.7.0+cu126
  • Accelerate: 1.7.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}