SentenceTransformer based on sentence-transformers/all-distilroberta-v1

This is a sentence-transformers model finetuned from sentence-transformers/all-distilroberta-v1 on the ai-job-embedding-finetuning dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
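
The Pooling and Normalize modules above can be illustrated with a minimal pure-Python sketch. It assumes mean pooling over non-padded tokens followed by L2 normalization, as the configuration suggests (`pooling_mode_mean_tokens: True`, then `Normalize()`). The helper names `mean_pool` and `l2_normalize` and the toy 2-dimensional vectors are hypothetical and are not part of the library.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against an all-padding input
    return [sum(vec[i] for vec in kept) / count for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy input: 3 token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)          # padding vector is ignored
sentence_embedding = l2_normalize(pooled)  # unit-length sentence vector
```

Because the output is unit length, the dot product of two such embeddings equals their cosine similarity.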

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("AmberJin4526/distilroberta-ai-jobembeddings")
# Run inference
sentences = [
    '"Data Scientist Transformers BERT genomics distributed computing"',
    "experience with Transformers\nNeed to be 8+ year's of work experience. \nWe need a Data Scientist with demonstrated expertise in training and evaluating transformers such as BERT and its derivatives.\nRequired: Proficiency with Python, pyTorch, Linux, Docker, Kubernetes, Jupyter. Expertise in Deep Learning, Transformers, Natural Language Processing, Large Language Models\nPreferred: Experience with genomics data, molecular genetics. Distributed computing tools like Ray, Dask, Spark",
    'requirements to pull required data to measure the current state of these assets, set up usage metrics for internal and external stakeholders.Table Metadata to improve documentation coverage for tables, including table descriptions, column definitions, and data lineage.Implement a centralized metadata management system to maintain and access asset documentation.Ensure that all existing and new data assets are properly documented according to established standards.Pipeline Clean-up and ConsolidationConsolidate and streamline pipelines by eliminating redundancies and unnecessary elements according to the set of provided rules.Clean up and restructure data tables, ensuring consistent naming conventions, data types, and schema definitions.Retire or archive obsolete dashboards and workflows.Implement monitoring and alerting mechanisms for critical workflows to ensure timely issue detection and resolution.Set up a foundation for scalable Data Model for the Stock Business - Implement and build performant data models to solve common analytics use-Knowledge Transfer and DocumentationThoroughly document the work performed, including methodologies, decisions, and any scripts or tools developed.Provide comprehensive knowledge transfer to the data team, ensuring a smooth transition and the ability to maintain the optimized data environment.\nSkills: Proven experience in data engineering and data asset management.Proficiency in SQL, Python, and other relevant data processing languages and tools.Expertise in data modeling, ETL processes, and workflow orchestration (e.g., Airflow, Databricks).Strong analytical and problem-solving skills.Excellent communication and documentation abilities.Familiarity with cloud data platforms (e.g., Azure, AWS, GCP) is a plus.\nPride Global offers eligible employee’s comprehensive healthcare coverage (medical, dental, and vision plans), supplemental coverage (accident insurance, critical illness insurance and hospital indemnity), \n401(k)-retirement savings, life & disability insurance, an employee assistance program, legal support, auto, home insurance, pet insurance and employee discounts with preferred vendors.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
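
For semantic search, the similarity scores can be turned into a ranking of job descriptions for a given query. The sketch below is illustrative only: it uses toy unit-length vectors in place of `model.encode(...)` outputs, and the helper name `rank_documents` is hypothetical. It relies on the fact that this model ends with a `Normalize()` module, so a plain dot product equals cosine similarity.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_documents(query_embedding, document_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [dot(query_embedding, d) for d in document_embeddings]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy unit-length vectors standing in for real model embeddings.
query = [1.0, 0.0]
documents = [[0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
ranking = rank_documents(query, documents)
best_index, best_score = ranking[0]  # most similar document and its score
```

With real embeddings, the first element of `sentences` (the query) would be ranked against the remaining entries in the same way.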

Evaluation

Metrics

Triplet

| Metric          | ai-job-validation | ai-job-test |
|-----------------|-------------------|-------------|
| cosine_accuracy | 1.0               | 0.9903      |
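
As a rough illustration of what `cosine_accuracy` measures on triplets, the sketch below counts a triplet as correct when the query is more similar to the positive job description than to the negative one. This is a simplified reading of the metric, not the library's evaluator code; the function names and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_cosine_accuracy(triplets):
    """Fraction of (query, positive, negative) triplets ranked correctly."""
    correct = sum(1 for q, p, n in triplets if cosine(q, p) > cosine(q, n))
    return correct / len(triplets)

# Two toy triplets: the first is ranked correctly, the second is not.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.1]),
]
accuracy = triplet_cosine_accuracy(triplets)
```

Under this reading, a score of 1.0 means every validation query was closer to its positive than to its negative.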

Training Details

Training Dataset

ai-job-embedding-finetuning

  • Dataset: ai-job-embedding-finetuning at daf4809
  • Size: 813 training samples
  • Columns: query, job_description_pos, and job_description_neg
  • Approximate statistics based on the first 813 samples:
    |         | query | job_description_pos | job_description_neg |
    |---------|-------|---------------------|---------------------|
    | type    | string | string | string |
    | details | min: 9 tokens, mean: 15.84 tokens, max: 32 tokens | min: 7 tokens, mean: 350.09 tokens, max: 512 tokens | min: 7 tokens, mean: 351.48 tokens, max: 512 tokens |
  • Samples:
    Sample 1
    query: AWS data pipeline design, ETL process implementation, data governance and security.
    job_description_pos: requirements and design data solutions that meet their needs, including understanding data models/schemas and implementing ETL (Extract, Transform, and Load) processes to transform raw data into a usable format in the destinationResponsible for monitoring and optimizing the performance of data pipelines, troubleshooting any issues that arise, and ensuring data quality and integrity.

    Qualifications

    Proficient in programming languages such as Python and SQL for database querying and manipulation. Strong understanding of AWS services related to data engineering, such as Amazon S3, Amazon Redshift, Amazon Aurora Postgres, AWS Glue, AWS Lambda, AWS Step Function, AWS Lake Formation, Amazon Data Zone, Amazon Kinesis, MSK, and Amazon EMR. Knowledge of database design principles and experience with database management systems. Experience with data storage technologies like relational databases (e.g., SQL Server, PostgreSQL) and distributed storage systems (e.g., PySpark). Understanding of ...
    job_description_neg: Qualifications:

    7+ years of experience in data science or analytics roles, with a focus on analytics and machine learning.Expertise in programming languages such as Python, R, or SQL for data extraction, cleaning, and analysis.Expertise in working with machine data / time series data Excellent communication skills to effectively convey complex technical concepts to non-technical stakeholders.Strong analytical and problem-solving skills to derive insights from large datasets.Bachelor's degree in data science, computer science, statistics, or a related field (master’s or PhD preferred)

    Key Competencies:

    Expertise in statistics, supervised and unsupervised machine learning techniques and their appropriate uses; ability to apply common modeling best practices to build models using high-volume, asynchronous time series dataStrategic Thinking- Ability to develop and implement a strategic framework on how to deploy Artificial Intelligence within HRCustomer focus- The need to design soluti...

    Sample 2
    query: "Marketing campaign analytics, A/B testing, web analytics tools"
    job_description_pos: skills to provide best-in-class analytics to the business

    Required Qualifications, Capabilities, And Skills

    Bachelor’s and Master’s degree in a quantitative discipline (Data Science/Analytics, Mathematics, Statistics, Physics, Engineering, Economics, Finance or related fields)3+ years of experience in applying statistical methods to real world problems3+ years of experience with SQL and at least one of the following analytical tools: SAS, Python, R Experience with visualization techniques for data analysis and presentationExperience with web analytics tools (Google Analytics, Adobe/Omniture Insight/Visual Sciences, Webtrends, CoreMetrics, etc.)Superior written, oral communication and presentation skills with experience communicating concisely and effectively with all levels of management and partners

    Preferred Qualifications, Capabilities, And Skills

    Tableau and Python preferredIntellectually curious and eager to become subject matter expert in their focus areaA strategic thinker w...
    job_description_neg: Experience in the biotech industry is advantageous. Requirements: Ø Expertise in deep learning techniques, with a focus on Generative AI and Large Language Models (LLMs).Ø Proficiency in Python programming and familiarity with libraries such as TensorFlow, PyTorch, or Keras.Ø Knowledge of cloud computing platforms, particularly AWS.Ø Strong analytical and problem-solving skills.Ø Excellent communication and collaboration abilities.Ø Experience in the biotech industry is a plus. Educational Qualifications: PhD in Computer Science or Machine Learning.

    Sample 3
    query: "Senior Cloud Data Engineer, Databricks, Delta Lake, Data Warehousing"
    job_description_pos: Experience of Delta Lake, DWH, Data Integration, Cloud, Design and Data Modelling.• Proficient in developing programs in Python and SQL• Experience with Data warehouse Dimensional data modeling.• Working with event based/streaming technologies to ingest and process data.• Working with structured, semi structured and unstructured data.• Optimize Databricks jobs for performance and scalability to handle big data workloads. • Monitor and troubleshoot Databricks jobs, identify and resolve issues or bottlenecks. • Implement best practices for data management, security, and governance within the Databricks environment. Experience designing and developing Enterprise Data Warehouse solutions.• Proficient writing SQL queries and programming including stored procedures and reverse engineering existing process.• Perform code reviews to ensure fit to requirements, optimal execution patterns and adherence to established standards.
    Qualifications:
    • 5+ years Python coding experience.• 5+ years - SQL...
    job_description_neg: experience in using, manipulating, and extracting insights from healthcare data with a particular focus on using machine learning with claims data. The applicant will be driven by curiosity, collaborating with a cross-functional team of Product Managers, Software Engineers, and Data Analysts.

    Responsibilities

    Apply data science, machine learning, and healthcare domain expertise to advance and oversee Lucina’s pregnancy identification and risk-scoring algorithms.Analyze healthcare data to study patterns of care and patient conditions which correlate to specific outcomes.Collaborate on clinical committee research and development work.Complete ad hoc analyses and reports from internal or external customers prioritized by management throughout the year.

    Qualifications

    Degree or practical experience in Applied Math, Statistics, Engineering, Information Management with 3 or more years of data analytics experience, Masters degree a plus.Experience manipulating and analyzing healthcare dat...
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
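
To make the loss parameters concrete, here is a minimal pure-Python sketch of a multiple-negatives ranking objective with `scale` 20.0 and cosine similarity. It treats every other candidate in the batch as a negative for a given query and applies cross-entropy over the scaled similarities. This is an illustrative approximation under those assumptions, not the library's implementation; the function name `mnr_loss` and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, negatives=(), scale=20.0):
    """Mean cross-entropy where anchor i should rank positive i first.

    Candidates are all positives in the batch plus any explicit negatives.
    """
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])  # positives match
swapped_loss = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # positives swapped
```

The loss is near zero when each query is closest to its own positive and grows to roughly the scale value when the pairing is reversed, which is why the `no_duplicates` batch sampler matters: a duplicated positive in the batch would act as a false negative.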
    

Evaluation Dataset

ai-job-embedding-finetuning

  • Dataset: ai-job-embedding-finetuning at daf4809
  • Size: 101 evaluation samples
  • Columns: query, job_description_pos, and job_description_neg
  • Approximate statistics based on the first 101 samples:
    |         | query | job_description_pos | job_description_neg |
    |---------|-------|---------------------|---------------------|
    | type    | string | string | string |
    | details | min: 10 tokens, mean: 16.27 tokens, max: 47 tokens | min: 10 tokens, mean: 344.34 tokens, max: 512 tokens | min: 14 tokens, mean: 334.05 tokens, max: 512 tokens |
  • Samples:
    Sample 1
    query: Data Engineer, AWS Big Data Services, Oracle EBS, NoSQL Data Sources
    job_description_pos: experience. Excellent knowledge of database concepts - Defining schemas, relational table structures, SQL querying Proficient with AWS Big data services (Glue, Athena, Redshift, Lake formation, Lambda) Proficient in writing Python code for data pipelines, AWS CDK and data processing logic A standout candidate has working experience with Oracle EBS and Agile PLM data

    Preferred Skills

    Experience working with NoSQL data sources at scale (In Terabytes) - Understanding of shards, partitions etc. Understanding of Financial reporting in Oracle EBSWill be exposed to Data Lake, Glue, Lambda and Infrastructure as code. If have that experience is a plus

    Benefits

    Company-sponsored Health, Dental, and Vision insurance plans.

    EQUAL OPPORTUNITY STATEMENT

    Advantis Global is

    #AGIT
    job_description_neg: requirements to support data-driven solutions/decisions.complex data insights in a clear and effective manner to stakeholders across the organization, which includes non-technical audience.informed and stay current on all the latest data science techniques and technologies.for exploring and implementing innovative solutions to improve data analysis, modeling capabilities, and business outcomes.use case design and build teams by providing guidance/ feedback as they develop data science models and algorithms to solve operational challenges. The incumbent must bring these skills/qualifications:Master’s or PhD in Computer Science, Statistics, Applied Mathematics.If degree is in non-related field, must have at least 5 – 7 years’ experience in data science or a similar role.Must be proficient in at least one analytical programming language relevant for data science, such as Python. R will be acceptable. Machine learning libraries & frameworks are a must. Must be familiar with data processing...

    Sample 2
    query: Big Data Engineer, Spark, AWS/GCP, Hadoop
    job_description_pos: Skills • Expertise and hands-on experience on Spark, and Hadoop echo system components – Must Have • Good and hand-on experience* of any of the Cloud (AWS/GCP) – Must Have • Good knowledge of HiveQL & SparkQL – Must Have Good knowledge of Shell script & Java/Scala/python – Good to Have • Good knowledge of SQL – Good to Have • Good knowledge of migration projects on Hadoop – Good to Have • Good Knowledge of one of the Workflow engines like Oozie, Autosys – Good to Have Good knowledge of Agile Development– Good to Have • Passionate about exploring new technologies – Good to Have • Automation approach – Good to Have
    Thanks & RegardsShahrukh KhanEmail: shahrukh@zentekinfosoft.com
    job_description_neg: Requirements: We're looking for a candidate with exceptional proficiency in Google Sheets. This expertise should include manipulating, analyzing, and managing data within Google Sheets. The candidate should be outstanding at extracting business logic from existing reports and implementing it into new ones. Although a basic understanding of SQL for tasks related to data validation and metrics calculations is beneficial, the primary skill we are seeking is proficiency in Google Sheets. This role will involve working across various cross-functional teams, so strong communication skills are essential. The position requires a meticulous eye for detail, a commitment to delivering high-quality results, and above all, exceptional competency in Google Sheets

    Google sheet knowledge is preferred.Strong Excel experience without Google will be considered.Data Validation and formulas to extract data are a mustBasic SQL knowledge is required.Strong communications skills are requiredInterview process...

    Sample 3
    query: Data Scientist, time series surveillance, production operations, data pipeline design
    job_description_pos: Experience in Production Operations or Well Engineering Strong scripting/programming skills (Python preferable)

    Desired:

    Strong time series surveillance background (eg. OSI PI, PI AF, Seeq) Strong scripting/programming skills (Python preferable) Strong communication and collaboration skills Working knowledge of machine learning application (eg. scikit-learn) Working knowledge of SQL and process historians Delivers positive results through realistic planning to accomplish goals Must be able to handle multiple concurrent tasks with an ability to prioritize and manage tasks effectively

    Apex Systems is

    Apex Systems is a world-class IT services company that serves thousands of clients across the globe. When you join Apex, you become part of a team that values innovation, collaboration, and continuous learning. We offer quality career resources, training, certifications, development opportunities, and a comprehensive benefits package. Our commitment to excellence is reflected in man...
    job_description_neg: SKILLS:1. Work experience in a Human Services agency ideally related to human services programs including Electronic Benefits Transfer (EBT) including SNAP and TANF benefits.2. Experience with Quick Base platform and SQL.
    3. Strong proficiency in data science tools such as R or Python. Experience with data visualization tools such as Tableau or Power BI
    4. Ability to transform issuance and notices files.

    Responsibilities
    1. Data analysis and modelling, including Designing and developing machine learning and predictive models and algorithms.
    Performing exploratory data analysis to identify patterns and trends.Developing and maintaining database and data systems to support business needs.Interpreting and communicating data analysis results to stakeholders.Collaborating with other teams to develop and implement data-driven solutions.2. Data management and governance, including Ensuring compliance with data privacy regulations and company data governance policies.
    Developing and impleme...
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 2e-05
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • batch_sampler: no_duplicates
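
As a worked example of how these settings interact, the sketch below derives the number of optimizer steps from the 813 training samples and batch size 16, then evaluates a linear warmup and linear decay schedule. It assumes a single device, one optimizer step per batch, and that warmup steps are rounded up from `warmup_ratio` times the total steps; the function name `linear_warmup_lr` is hypothetical and the real trainer may differ in detail.

```python
import math

def linear_warmup_lr(step, total_steps, base_lr=2e-05, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding rule
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

train_samples = 813
batch_size = 16
num_epochs = 1

steps_per_epoch = math.ceil(train_samples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * num_epochs

peak_lr = linear_warmup_lr(6, total_steps)             # end of warmup
final_lr = linear_warmup_lr(total_steps, total_steps)  # end of training
```

Under these assumptions the run has 51 steps, which agrees with the final step reported in the Training Logs, and the learning rate peaks at 2e-05 after 6 warmup steps.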

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

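| Epoch | Step | ai-job-validation_cosine_accuracy | ai-job-test_cosine_accuracy |
|-------|------|-----------------------------------|-----------------------------|
| 0     | 0    | 0.9307                            | -                           |
| 1.0   | 51   | 1.0                               | 0.9903                      |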

Framework Versions

  • Python: 3.10.19
  • Sentence Transformers: 3.3.1
  • Transformers: 4.48.0
  • PyTorch: 2.10.0
  • Accelerate: 1.12.0
  • Datasets: 4.5.0
  • Tokenizers: 0.21.4

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 82.1M params (Safetensors, tensor type F32)