SentenceTransformer based on sentence-transformers/all-distilroberta-v1

This is a sentence-transformers model finetuned from sentence-transformers/all-distilroberta-v1 on the ai-job-embedding-finetuning dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: sentence-transformers/all-distilroberta-v1
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- ai-job-embedding-finetuning

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
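
The Pooling and Normalize modules listed above amount to a masked mean over token embeddings followed by L2 normalization. The following is a minimal pure-Python sketch of that idea, assuming toy 3-dimensional token vectors instead of the model's 768 dimensions. The names `mean_pool` and `l2_normalize` are illustrative and are not part of the library.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against an all-padding input
    return [sum(vec[i] for vec in kept) / count for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy input (made up): three real tokens and one padding position.
tokens = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0], [2.0, 0.0, 4.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
```

Because the final embedding has unit length, the cosine similarity between two embeddings equals their dot product.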

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("AmberJin4526/distilroberta-ai-jobembeddings")
# Run inference
sentences = [
    '"Data Scientist Transformers BERT genomics distributed computing"',
    "experience with Transformers\nNeed to be 8+ year's of work experience. \nWe need a Data Scientist with demonstrated expertise in training and evaluating transformers such as BERT and its derivatives.\nRequired: Proficiency with Python, pyTorch, Linux, Docker, Kubernetes, Jupyter. Expertise in Deep Learning, Transformers, Natural Language Processing, Large Language Models\nPreferred: Experience with genomics data, molecular genetics. Distributed computing tools like Ray, Dask, Spark",
    'requirements to pull required data to measure the current state of these assets, set up usage metrics for internal and external stakeholders.Table Metadata to improve documentation coverage for tables, including table descriptions, column definitions, and data lineage.Implement a centralized metadata management system to maintain and access asset documentation.Ensure that all existing and new data assets are properly documented according to established standards.Pipeline Clean-up and ConsolidationConsolidate and streamline pipelines by eliminating redundancies and unnecessary elements according to the set of provided rules.Clean up and restructure data tables, ensuring consistent naming conventions, data types, and schema definitions.Retire or archive obsolete dashboards and workflows.Implement monitoring and alerting mechanisms for critical workflows to ensure timely issue detection and resolution.Set up a foundation for scalable Data Model for the Stock Business - Implement and build performant data models to solve common analytics use-Knowledge Transfer and DocumentationThoroughly document the work performed, including methodologies, decisions, and any scripts or tools developed.Provide comprehensive knowledge transfer to the data team, ensuring a smooth transition and the ability to maintain the optimized data environment.\nSkills: Proven experience in data engineering and data asset management.Proficiency in SQL, Python, and other relevant data processing languages and tools.Expertise in data modeling, ETL processes, and workflow orchestration (e.g., Airflow, Databricks).Strong analytical and problem-solving skills.Excellent communication and documentation abilities.Familiarity with cloud data platforms (e.g., Azure, AWS, GCP) is a plus.\nPride Global offers eligible employee’s comprehensive healthcare coverage (medical, dental, and vision plans), supplemental coverage (accident insurance, critical illness insurance and hospital indemnity), 401(k)-retirement savings, life & disability insurance, an employee assistance program, legal support, auto, home insurance, pet insurance and employee discounts with preferred vendors.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
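
A common next step is to rank job descriptions against a query by similarity score. The sketch below assumes the embeddings are already available as plain lists of floats (for example via `model.encode(...).tolist()`). `rank_by_cosine` is a hypothetical helper, not a library function, and the toy 2-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_cosine(query_vec, doc_vecs):
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors (made up) in place of real query and job-description embeddings.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

ranking = rank_by_cosine(query, docs)
```

With real embeddings, `query` would be the encoded search string and `docs` the encoded job descriptions.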

Evaluation

Metrics

Triplet

Datasets: ai-job-validation and ai-job-test
Evaluated with TripletEvaluator

Metric	ai-job-validation	ai-job-test
cosine_accuracy	1.0	0.9903
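
The `cosine_accuracy` metric is understood here as the fraction of (query, positive, negative) triplets in which the query is closer, by cosine similarity, to the positive than to the negative. The sketch below illustrates that computation under this assumption. `triplet_cosine_accuracy` is an illustrative name and the toy vectors are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def triplet_cosine_accuracy(anchors, positives, negatives):
    """Share of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for a, p, n in zip(anchors, positives, negatives)
        if cosine(a, p) > cosine(a, n)
    )
    return correct / len(anchors)

# Two toy triplets: the first is ranked correctly, the second is not.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [1.0, 0.2]]
negatives = [[0.0, 1.0], [0.1, 1.0]]

accuracy = triplet_cosine_accuracy(anchors, positives, negatives)
```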

Training Details

Training Dataset

ai-job-embedding-finetuning

Dataset: ai-job-embedding-finetuning at daf4809
Size: 813 training samples
Columns: query, job_description_pos, and job_description_neg

Approximate statistics based on the first 813 samples:

	query	job_description_pos	job_description_neg
type	string	string	string
details	min: 9 tokens mean: 15.84 tokens max: 32 tokens	min: 7 tokens mean: 350.09 tokens max: 512 tokens	min: 7 tokens mean: 351.48 tokens max: 512 tokens

Samples:

query	job_description_pos	job_description_neg
`AWS data pipeline design, ETL process implementation, data governance and security.`	requirements and design data solutions that meet their needs, including understanding data models/schemas and implementing ETL (Extract, Transform, and Load) processes to transform raw data into a usable format in the destinationResponsible for monitoring and optimizing the performance of data pipelines, troubleshooting any issues that arise, and ensuring data quality and integrity. Qualifications Proficient in programming languages such as Python and SQL for database querying and manipulation. Strong understanding of AWS services related to data engineering, such as Amazon S3, Amazon Redshift, Amazon Aurora Postgres, AWS Glue, AWS Lambda, AWS Step Function, AWS Lake Formation, Amazon Data Zone, Amazon Kinesis, MSK, and Amazon EMR. Knowledge of database design principles and experience with database management systems. Experience with data storage technologies like relational databases (e.g., SQL Server, PostgreSQL) and distributed storage systems (e.g., PySpark). Understanding of ...	Qualifications: 7+ years of experience in data science or analytics roles, with a focus on analytics and machine learning.Expertise in programming languages such as Python, R, or SQL for data extraction, cleaning, and analysis.Expertise in working with machine data / time series data Excellent communication skills to effectively convey complex technical concepts to non-technical stakeholders.Strong analytical and problem-solving skills to derive insights from large datasets.Bachelor's degree in data science, computer science, statistics, or a related field (master’s or PhD preferred) Key Competencies: Expertise in statistics, supervised and unsupervised machine learning techniques and their appropriate uses; ability to apply common modeling best practices to build models using high-volume, asynchronous time series dataStrategic Thinking- Ability to develop and implement a strategic framework on how to deploy Artificial Intelligence within HRCustomer focus- The need to design soluti...
`"Marketing campaign analytics, A/B testing, web analytics tools"`	skills to provide best-in-class analytics to the business Required Qualifications, Capabilities, And Skills Bachelor’s and Master’s degree in a quantitative discipline (Data Science/Analytics, Mathematics, Statistics, Physics, Engineering, Economics, Finance or related fields)3+ years of experience in applying statistical methods to real world problems3+ years of experience with SQL and at least one of the following analytical tools: SAS, Python, R Experience with visualization techniques for data analysis and presentationExperience with web analytics tools (Google Analytics, Adobe/Omniture Insight/Visual Sciences, Webtrends, CoreMetrics, etc.)Superior written, oral communication and presentation skills with experience communicating concisely and effectively with all levels of management and partners Preferred Qualifications, Capabilities, And Skills Tableau and Python preferredIntellectually curious and eager to become subject matter expert in their focus areaA strategic thinker w...	Experience in the biotech industry is advantageous. Requirements: Ø Expertise in deep learning techniques, with a focus on Generative AI and Large Language Models (LLMs).Ø Proficiency in Python programming and familiarity with libraries such as TensorFlow, PyTorch, or Keras.Ø Knowledge of cloud computing platforms, particularly AWS.Ø Strong analytical and problem-solving skills.Ø Excellent communication and collaboration abilities.Ø Experience in the biotech industry is a plus. Educational Qualifications: PhD in Computer Science or Machine Learning.
`"Senior Cloud Data Engineer, Databricks, Delta Lake, Data Warehousing"`	Experience of Delta Lake, DWH, Data Integration, Cloud, Design and Data Modelling.• Proficient in developing programs in Python and SQL• Experience with Data warehouse Dimensional data modeling.• Working with event based/streaming technologies to ingest and process data.• Working with structured, semi structured and unstructured data.• Optimize Databricks jobs for performance and scalability to handle big data workloads. • Monitor and troubleshoot Databricks jobs, identify and resolve issues or bottlenecks. • Implement best practices for data management, security, and governance within the Databricks environment. Experience designing and developing Enterprise Data Warehouse solutions.• Proficient writing SQL queries and programming including stored procedures and reverse engineering existing process.• Perform code reviews to ensure fit to requirements, optimal execution patterns and adherence to established standards. Qualifications: • 5+ years Python coding experience.• 5+ years - SQL...	experience in using, manipulating, and extracting insights from healthcare data with a particular focus on using machine learning with claims data. The applicant will be driven by curiosity, collaborating with a cross-functional team of Product Managers, Software Engineers, and Data Analysts. Responsibilities Apply data science, machine learning, and healthcare domain expertise to advance and oversee Lucina’s pregnancy identification and risk-scoring algorithms.Analyze healthcare data to study patterns of care and patient conditions which correlate to specific outcomes.Collaborate on clinical committee research and development work.Complete ad hoc analyses and reports from internal or external customers prioritized by management throughout the year. Qualifications Degree or practical experience in Applied Math, Statistics, Engineering, Information Management with 3 or more years of data analytics experience, Masters degree a plus.Experience manipulating and analyzing healthcare dat...

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
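
With these parameters, the loss can be read as a cross-entropy over cosine similarities multiplied by `scale` (20.0): each query should score its own positive above every other candidate in the batch. The pure-Python sketch below is a simplified illustration, not the library implementation. It assumes the candidate pool is all positives in the batch followed by all hard negatives, and the name `mnr_loss` is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    candidates = positives + negatives  # assumed pool: positives first, then negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(value - top) for value in logits))
        total += log_z - logits[i]  # negative log-softmax of the true positive
    return total / len(anchors)

# Toy batch (made up): each anchor points almost exactly at its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.1], [0.1, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

aligned_loss = mnr_loss(anchors, positives, negatives)
# Swapping the positives pairs each anchor with the wrong description.
swapped_loss = mnr_loss(anchors, list(reversed(positives)), negatives)
```

The aligned batch yields a loss near zero, while the swapped batch is penalized heavily, which is the behavior the loss is meant to encourage.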

Evaluation Dataset

ai-job-embedding-finetuning

Dataset: ai-job-embedding-finetuning at daf4809
Size: 101 evaluation samples
Columns: query, job_description_pos, and job_description_neg

Approximate statistics based on the first 101 samples:

	query	job_description_pos	job_description_neg
type	string	string	string
details	min: 10 tokens mean: 16.27 tokens max: 47 tokens	min: 10 tokens mean: 344.34 tokens max: 512 tokens	min: 14 tokens mean: 334.05 tokens max: 512 tokens

Samples:

query	job_description_pos	job_description_neg
`Data Engineer, AWS Big Data Services, Oracle EBS, NoSQL Data Sources`	experience. Excellent knowledge of database concepts - Defining schemas, relational table structures, SQL querying Proficient with AWS Big data services (Glue, Athena, Redshift, Lake formation, Lambda) Proficient in writing Python code for data pipelines, AWS CDK and data processing logic A standout candidate has working experience with Oracle EBS and Agile PLM data Preferred Skills Experience working with NoSQL data sources at scale (In Terabytes) - Understanding of shards, partitions etc. Understanding of Financial reporting in Oracle EBSWill be exposed to Data Lake, Glue, Lambda and Infrastructure as code. If have that experience is a plus Benefits Company-sponsored Health, Dental, and Vision insurance plans. EQUAL OPPORTUNITY STATEMENT Advantis Global is #AGIT	requirements to support data-driven solutions/decisions.complex data insights in a clear and effective manner to stakeholders across the organization, which includes non-technical audience.informed and stay current on all the latest data science techniques and technologies.for exploring and implementing innovative solutions to improve data analysis, modeling capabilities, and business outcomes.use case design and build teams by providing guidance/ feedback as they develop data science models and algorithms to solve operational challenges. The incumbent must bring these skills/qualifications:Master’s or PhD in Computer Science, Statistics, Applied Mathematics.If degree is in non-related field, must have at least 5 – 7 years’ experience in data science or a similar role.Must be proficient in at least one analytical programming language relevant for data science, such as Python. R will be acceptable. Machine learning libraries & frameworks are a must. Must be familiar with data processing...
`Big Data Engineer, Spark, AWS/GCP, Hadoop`	Skills • Expertise and hands-on experience on Spark, and Hadoop echo system components – Must Have • Good and hand-on experience* of any of the Cloud (AWS/GCP) – Must Have • Good knowledge of HiveQL & SparkQL – Must Have Good knowledge of Shell script & Java/Scala/python – Good to Have • Good knowledge of SQL – Good to Have • Good knowledge of migration projects on Hadoop – Good to Have • Good Knowledge of one of the Workflow engines like Oozie, Autosys – Good to Have Good knowledge of Agile Development– Good to Have • Passionate about exploring new technologies – Good to Have • Automation approach – Good to Have Thanks & RegardsShahrukh KhanEmail: shahrukh@zentekinfosoft.com	Requirements: We're looking for a candidate with exceptional proficiency in Google Sheets. This expertise should include manipulating, analyzing, and managing data within Google Sheets. The candidate should be outstanding at extracting business logic from existing reports and implementing it into new ones. Although a basic understanding of SQL for tasks related to data validation and metrics calculations is beneficial, the primary skill we are seeking is proficiency in Google Sheets. This role will involve working across various cross-functional teams, so strong communication skills are essential. The position requires a meticulous eye for detail, a commitment to delivering high-quality results, and above all, exceptional competency in Google Sheets Google sheet knowledge is preferred.Strong Excel experience without Google will be considered.Data Validation and formulas to extract data are a mustBasic SQL knowledge is required.Strong communications skills are requiredInterview process...
`Data Scientist, time series surveillance, production operations, data pipeline design`	Experience in Production Operations or Well Engineering Strong scripting/programming skills (Python preferable) Desired: Strong time series surveillance background (eg. OSI PI, PI AF, Seeq) Strong scripting/programming skills (Python preferable) Strong communication and collaboration skills Working knowledge of machine learning application (eg. scikit-learn) Working knowledge of SQL and process historians Delivers positive results through realistic planning to accomplish goals Must be able to handle multiple concurrent tasks with an ability to prioritize and manage tasks effectively Apex Systems is Apex Systems is a world-class IT services company that serves thousands of clients across the globe. When you join Apex, you become part of a team that values innovation, collaboration, and continuous learning. We offer quality career resources, training, certifications, development opportunities, and a comprehensive benefits package. Our commitment to excellence is reflected in man...	SKILLS:1. Work experience in a Human Services agency ideally related to human services programs including Electronic Benefits Transfer (EBT) including SNAP and TANF benefits.2. Experience with Quick Base platform and SQL. 3. Strong proficiency in data science tools such as R or Python. Experience with data visualization tools such as Tableau or Power BI 4. Ability to transform issuance and notices files. Responsibilities 1. Data analysis and modelling, including Designing and developing machine learning and predictive models and algorithms. Performing exploratory data analysis to identify patterns and trends.Developing and maintaining database and data systems to support business needs.Interpreting and communicating data analysis results to stakeholders.Collaborating with other teams to develop and implement data-driven solutions.2. Data management and governance, including Ensuring compliance with data privacy regulations and company data governance policies. Developing and impleme...

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
learning_rate: 2e-05
num_train_epochs: 1
warmup_ratio: 0.1
batch_sampler: no_duplicates
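
These settings determine the length of the run. With 813 training samples and a per-device batch size of 16, one epoch comes to about 51 optimizer steps, which agrees with the final step in the training logs. The sketch below works through that arithmetic together with a linear warmup and decay schedule. It assumes a single device, no gradient accumulation, and warmup steps equal to the ceiling of `warmup_ratio` times the total steps. `linear_lr` is an illustrative helper, not a library call.

```python
import math

train_samples = 813
batch_size = 16
epochs = 1
learning_rate = 2e-05
warmup_ratio = 0.1

# A final partial batch still counts as a step (dataloader_drop_last is False).
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding rule

def linear_lr(step):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return learning_rate * remaining / (total_steps - warmup_steps)

schedule = [linear_lr(step) for step in range(total_steps + 1)]
```

Under these assumptions the learning rate peaks at 2e-05 after 6 warmup steps and reaches zero at step 51.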

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	ai-job-validation_cosine_accuracy	ai-job-test_cosine_accuracy
0	0	0.9307	-
1.0	51	1.0	0.9903
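
The validation accuracies can be translated back into triplet counts. The sketch below assumes the validation evaluator scored all 101 evaluation samples, which the card does not state explicitly. It searches for the number of correctly ranked triplets consistent with each rounded accuracy, and `consistent_counts` is an illustrative helper.

```python
def consistent_counts(accuracy, total, digits=4):
    """Counts k such that k / total rounds to the reported accuracy."""
    return [k for k in range(total + 1) if round(k / total, digits) == accuracy]

eval_samples = 101  # assumed evaluator sample count (size of the evaluation split)

before = consistent_counts(0.9307, eval_samples)  # accuracy at step 0
after = consistent_counts(1.0, eval_samples)      # accuracy at step 51
```

Under that assumption, 94 of 101 triplets were ranked correctly before fine-tuning and all 101 after one epoch.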

Framework Versions

Python: 3.10.19
Sentence Transformers: 3.3.1
Transformers: 4.48.0
PyTorch: 2.10.0
Accelerate: 1.12.0
Datasets: 4.5.0
Tokenizers: 0.21.4

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Safetensors

Model size: 82.1M params
Tensor type: F32
