SentenceTransformer based on marroyo777/bge-99GPT-v1

This is a sentence-transformers model finetuned from marroyo777/bge-99GPT-v1. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: marroyo777/bge-99GPT-v1
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
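The three modules form a pipeline: the Transformer encodes lowercased text (truncated to 512 tokens), the Pooling module takes the [CLS] token embedding (pooling_mode_cls_token=True), and Normalize L2-normalizes the result. A rough sketch of the equivalent computation with plain transformers, assuming the checkpoint ID from this card:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("marroyo777/bge-99GPT-v1")
model = AutoModel.from_pretrained("marroyo777/bge-99GPT-v1")

texts = ["semantic search", "vector databases"]
# (0) Transformer: tokenize (lowercased, truncated to 512 tokens) and encode
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # [batch, seq, 384]
# (1) Pooling: pooling_mode_cls_token=True -> keep only the [CLS] embedding
cls_embeddings = token_embeddings[:, 0]                  # [batch, 384]
# (2) Normalize: L2-normalize so dot product equals cosine similarity
embeddings = F.normalize(cls_embeddings, p=2, dim=1)
```

Because of the final Normalize step, dot products between embeddings are already cosine similarities.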

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("marroyo777/bge-99GPT-v1")
# Run inference
sentences = [
    'How does gamification enhance the learning experience in data science according to the blog?',
    "Title: Unlocking Potential: The Power of Gamification in Employee Data Science Learning\nPublished: April, 2024\nAuthor(s): Fern Zhang\nClaps: 5\nComments: 0\nWord Count: 1661\nURL: https://medium.com/99p-labs/unlocking-potential-the-power-of-gamification-in-employee-data-science-learning-5f88e97c74aa\n\nThe blog article discusses the use of gamification in employee data science learning. It highlights the challenges in data science training and the team's initiative to revolutionize it using gamification strategies. The team adopted a multifaceted approach to understand the diverse backgrounds and prior knowledge of their target learners to design effective instruction. The article also discusses the gamification strategies for manager and practitioner training, as well as the user testing feedback and future plans for employee training in data science. Overall, the article emphasizes the importance of data science training and the use of gamification to make it an engaging and impactful learning experience.",
    'Title: CMU Capstone Project\u200a—\u200aVisualization Framework Of Telematics Data\nPublished: April, 2024\nAuthor(s): Yiheng Zhang, Yixue Yin, Rui Huang\nClaps: 1\nComments: 0\nWord Count: 2520\nURL: https://medium.com/99p-labs/cmu-capstone-project-visualization-framework-of-telematics-data-abb74fcbb975\n\nThe blog article discusses the development of an application to display telematic trajectory data in various formats on a web browser. The project involved brainstorming, user interviews, experimentation, and necessary pivots to define the trajectory of the development process. The team also focused on enhancing the foundational dashboard, building up a plugin system, fixing problems, and building new features. The final sprint involved finalizing and enhancing the user interface of the visualization framework. The article also outlines future works for the project.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Evaluation

Metrics

Triplet

Metric               Value
cosine_accuracy      0.9887
dot_accuracy         0.0113
manhattan_accuracy   0.9887
euclidean_accuracy   0.9887
max_accuracy         0.9887

Triplet

Metric               Value
cosine_accuracy      0.9915
dot_accuracy         0.0085
manhattan_accuracy   0.9915
euclidean_accuracy   0.9915
max_accuracy         0.9915

Training Details

Training Dataset

Unnamed Dataset

  • Size: 1,416 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
             anchor           positive          negative
    type     string           string            string
    min      8 tokens         125 tokens        125 tokens
    mean     17.71 tokens     190.68 tokens     190.0 tokens
    max      36 tokens        331 tokens        331 tokens
  • Samples:
    Sample 1
    anchor: What guidance does the article provide for creating a co-design protocol?
    positive:
      Title: Interactive Co-Design Sessions for Customer Research — Part 2: Co-Design Protocol
      Published: November, 2020
      Author(s): Langley Vogt
      Claps: 0
      Comments: 0
      Word Count: 497
      URL: https://medium.com/99p-labs/interactive-co-design-sessions-for-customer-research-part-2-co-design-protocol-2c60291e88c9

      The article discusses the process of creating an interactive co-design protocol for customer research. It emphasizes the importance of creating a thorough protocol and interactive board simultaneously, and provides guidance on creating a preliminary protocol and laying out the rest of the protocol in a table format. The article also mentions that Part 3 will share co-design learnings and takeaways.
    negative:
      Title: What is Software-defined Mobility?
      Published: March, 2023
      Author(s): Rajeev Chhajer and Ryan Lingo
      Claps: 56
      Comments: 0
      Word Count: 742
      URL: https://medium.com/99p-labs/what-is-software-defined-mobility/

      The article discusses the concept of Software-defined Mobility and its impact on the automotive industry. It emphasizes the importance of incorporating intelligence into the mobility ecosystem through software to create a more integrated, sustainable, and emotional mobility experience. The authors believe that participation and cooperation are key to success in this new mobility paradigm, and they aim to leverage cutting-edge technologies and innovative approaches to address the challenges facing the automotive industry.

    Sample 2
    anchor: What was the goal of the MHCI 99P Labs Capstone Team's project?
    positive:
      Title: Interactions, Car Data, and Play Dynamics…Oh My!—2021 MHCI Capstone Part 8
      Published: January, 2022
      Author(s): MHCI 99P Labs Capstone Team
      Claps: 0
      Comments: 0
      Word Count: 1061
      URL: https://medium.com/99p-labs/interactions-car-data-and-play-dynamics-oh-my-2021-mhci-capstone-part-8-b3ac8dd1ceef

      The MHCI 99P Labs Capstone Team shares their experiences and learnings from Sprint 2 of their project. They explored various interactions in the car, including shared motion and collaboration, button-based games, and co-creation with data input from the car. The team aimed to foster connections between families through play and successfully learned how these new interactions could achieve this goal. The marble game was the most successful, while the other two prototypes had mixed success. The team plans to take their learnings forward in the next sprint.
    negative:
      Title: Introducing the 99P Labs Blog Chatbot
      Published: February, 2024
      Author(s): Martin Arroyo
      Claps: 4
      Comments: 1
      Word Count: 3208
      URL: https://medium.com/99p-labs/99gpt-building-a-chatbot-fdde8b689df4

      The 99P Labs blog has introduced a chatbot called 99GPT, designed to answer questions about blog content. The chatbot aims to provide a more engaging and interactive way for readers to explore insights from the blog archive. The article discusses the technical considerations, challenges, and lessons learned in building 99GPT, including the ingestion phase, model selection, and developing a querying strategy. The blog also highlights the importance of frameworks like Langchain and LlamaIndex in bridging the gap between raw data and AI-driven interactive applications. The article concludes with the deployment of the chatbot on the Streamlit community cloud.

    Sample 3
    anchor: What are the ideal data quality outputs mentioned in the article?
    positive:
      Title: Weighing the Value of Data Quality Checks
      Published: July, 2022
      Author(s): Ryan Lingo
      Claps: 36
      Comments: 0
      Word Count: 2572
      URL: https://medium.com/99p-labs/weighing-the-value-of-data-quality-checks-4a5d0da1f3ff

      The article discusses the exploration of implementing data quality checks into a data platform, the goals, limits, and expectations, and the small experiments conducted to validate thinking. It also covers the flexibility and customization of data quality, potential actions to take when finding inadequate data quality, ideal data quality output, metrics to report, and where in the pipeline data quality checks best fit. The article also explores general deployment options and closing thoughts on the exploration of data quality ideas and architecture.
    negative:
      Title: Sprint 2: Robot You Can Drive My Car
      Published: May, 2022
      Author(s): MHCI x 99P Labs Capstone Team
      Claps: 0
      Comments: 0
      Word Count: 648
      URL: https://medium.com/99p-labs/sprint-2-robot-you-can-drive-my-car-e4d988826555

      The blog article discusses the progress of the MHCI x 99P Labs Capstone Team in their project, focusing on the preliminary research and brainstorming they have conducted. The team has updated their research plan and is preparing to conduct informal interviews and observations in various related fields. They also plan to explore pretotyping in their next sprint to understand what form of attendants is most helpful to human passengers.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
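MultipleNegativesRankingLoss treats every other anchor's positive in the batch as a negative: scaled cosine similarities become logits for a cross-entropy over the batch. A minimal sketch of the computation with the scale of 20.0 above (explicit negatives, when present, are appended as extra logit columns):

```python
import torch
import torch.nn.functional as F

def mnrl(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
         scale: float = 20.0) -> torch.Tensor:
    # Cosine similarity between every anchor and every in-batch positive
    a = F.normalize(anchor_emb, dim=1)
    p = F.normalize(positive_emb, dim=1)
    logits = a @ p.T * scale           # [batch, batch]
    # Positive i is the correct "class" for anchor i; all others are negatives
    labels = torch.arange(len(a))
    return F.cross_entropy(logits, labels)

loss = mnrl(torch.randn(4, 384), torch.randn(4, 384))
```

Larger batches therefore mean more in-batch negatives per anchor, which is why this loss is commonly paired with the no_duplicates batch sampler used below.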
    

Evaluation Dataset

Unnamed Dataset

  • Size: 354 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 354 samples:
             anchor           positive          negative
    type     string           string            string
    min      7 tokens         125 tokens        125 tokens
    mean     17.68 tokens     187.96 tokens     189.88 tokens
    max      32 tokens        331 tokens        331 tokens
  • Samples:
    Sample 1
    anchor: What challenges did the 99P capstone team face in their project?
    positive:
      Title: Decoding Travel Times: Exploring Telematics Data Dynamics
      Published: May, 2024
      Author(s): Qamar Mohamoud
      Claps: 3
      Comments: 1
      Word Count: 1880
      URL: https://medium.com/99p-labs/decoding-travel-times-exploring-telematics-data-dynamics

      The blog article discusses the challenges faced by the 99P capstone team of the MTDA program at The Ohio State University in building a model to compare real-life trip times to ideal times projected by the Google Distance Matrix. The team explored telematics data dynamics and the impact of geography, time of day, and local weather on trip times. The article also highlights the team's approach to feature creation, weather analysis, zone identification, data filtering, and modeling. Despite their efforts, the predictive models tested did not exceed 60% accuracy, leading to several key conclusions. The team advises caution in replicating their analysis and suggests addressing data bias, exploring alternative data sources, and considering route information for more accurate analyses in the future.
    negative:
      Title: Sprint 5: Optimizing HRI Research with Smart Guide — A Co-Design Journey
      Published: May, 2024
      Author(s): Honda Research Institute MHCI @ CMU
      Claps: 2
      Comments: 0
      Word Count: 970
      URL: https://medium.com/99p-labs/sprint-5-optimizing-hri-research-with-smart-guide-a-co-design-journey-fa5d64a56a3d

      The blog article discusses the Smart Guide as an AI research companion for HRI researchers, aimed at enhancing the efficiency of human-AI teaming (HAIT) research. The article details the goals and testing process for the Smart Guide, as well as the insights gained from co-creation sessions with CMU researchers. The article also outlines the prototype and the key takeaways from the research process.

    Sample 2
    anchor: What challenges did the author face during the internship?
    positive:
      Title: Harnessing Sensors and Software
      Published: August, 2023
      Author(s): Edward Lui
      Claps: 0
      Comments: 0
      Word Count: 1133
      URL: https://medium.com/99p-labs/harnessing-sensors-and-software

      The blog article discusses the author's two-month internship at 99P, focusing on sensors and their integration with the Robot Operating System (ROS). The author worked on the SOMEthings project, exploring technologies such as the Intel Realsense D435i Depth Camera, HC-SR04 Ultrasonic Sensor, and DW1000 UWB Module. The challenges faced and accomplishments achieved during the internship are highlighted, providing valuable insights and hands-on experience. The article concludes with an invitation for collaboration and engagement with 99P Labs.
    negative:
      Title: Sprint 6: Designing a Mobile Mentor
      Published: October, 2023
      Author(s): Alana Levene
      Claps: 1
      Comments: 0
      Word Count: 1015
      URL: https://medium.com/99p-labs/sprint-6-designing-a-mobile-mentor

      The 99P Labs x CMU MHCI Capstone Team has transitioned from research to design, focusing on creating a Mobile Mentor for Gen Z to facilitate on-the-go learning. The team has identified key insights from their research and has begun the prototyping process using a low-fidelity cardboard model. They are actively involving participants in the design process and are considering various influencing factors on their product. The team plans to transition to a design sprint timeline and is excited to continue developing this innovative product.

    Sample 3
    anchor: What are the goals of the SOMEThings project?
    positive:
      Title: Introducing the SOMEThings Project
      Published: July, 2023
      Author(s): Ryan Lingo
      Claps: 15
      Comments: 0
      Word Count: 2794
      URL: https://medium.com/99p-labs/introducing-the-somethings-project-f5eb8b0cf572

      The blog introduces the SOMEThings project, which is an initiative to build a miniature smart city for testing and experimenting with real-world challenges in the mobility ecosystem and IoT. The project aims to revolutionize the mobility sector, enhance efficiency and accessibility of mobility through IoT integration, and foster a culture of continuous learning and improvement. The blog also discusses the development of the SOMEThings Lab, the car, and the track for the project. The project is expected to have a substantial impact on the future of mobility and society at large.
    negative:
      Title: An Overview of Machine Learning — Part 2: All About Regression
      Published: January, 2023
      Author(s): Luka Brkljacic
      Claps: 2
      Comments: 0
      Word Count: 4550
      URL: https://medium.com/99p-labs/an-overview-of-machine-learning-part-2-all-about-regression-2f991281932e

      The blog article provides an in-depth overview of regression in machine learning. It covers linear regression, calculating R, limitations of R, multiple regression, adjusted R, and logistic regression. The article also includes practical Python examples for linear regression and multiple regression. The author also mentions that the next post will cover decision trees.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step 99GPT-Finetuning-Embedding-test-01_max_accuracy
1.0 89 0.9915

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.1.1
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Model size: 33.4M params (Safetensors, F32)
