multilingual-e5-large / README.md

kangbeom

Initial model commit

ea27c58 verified 6 months ago


tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:7200
  - loss:MultipleNegativesRankingLoss
base_model: intfloat/multilingual-e5-large
widget:
  - source_sentence: 'query: 어디서 입력 신호의 SNR이 엄청 낮게 나타나?'
    sentences:
      - >-
        passage: 클럭잡음 모델을 생성할 때 랜덤데이터를 이용하기 때문에 여러번 시뮬레이션을수행하여 실제 모델에 최대한 근접한
        데이터를 얻어야 한다.
      - >-
        passage: 둘 이상의 다른 네트워크 접속 인터페이스를 갖는 다중모드 단말에서 MIH 는 미디어에 독립적으로 이기종 망간
        핸드오버를 지원하기 위해 하위 물리 계층의 정보를 이용한다.
      - >-
        passage: 돌발 잡음은 일반적으로 에너지가 매우 크기 때문에 돌발 잡음이 존재하는 구간에서 입력 신호의 SNR이 매우 낮게
        나타난다.
  - source_sentence: 'query: ADPSS는 DPSS를 이용한 채널 보간과 예측에 무엇을 적용한 기법이니?'
    sentences:
      - >-
        passage: CORVIS의 구성 및 진행 사항기능설명구현 여부카투닝비디오 영상에서 추출한 프레임단위의 이미지들을 흑백만화에서
        사용하는 형태로 변환하는 기법이며, Image Abstraction, Stroke Generation, Stylization,
        Texturing의 기법이 적용.구현스타일 폰트인물 또는 사물의 의성어와 의태어를 표현하거나 극중의 분위기를 나타내기 위한
        기법.구현말풍선인물의 대사를 표현하기 위해 사용되며, 다양한종류의 말풍선 형태를 지원.구현스피드 라인인물이나 사물의 속도감을
        표현하기 위한 기법.구현배경효과인물의 기분이나 극의 분위기를 표현하기 위한 기법이다. 집중선, 수평선, 그라데이션을 이용한 다양한
        종류의 배경효과를 지원.미구현아이콘특정 개체를 돋보이게 하거나 인물의 심리 상태를 과장되게 표현하는 기법.미구현
      - >-
        passage: 실제 전송망과 동기망에서 측정된 원더생성성분은(그림 2)에 나타내었으며, 각 그림의 X축은 MTIE와 TDEV를
        계산할 때 사용되는 관측시간(\(\mathrm{sec}\))이고, Y축은 각 관측시간 별로 계산하여 얻어진 (3)식의 MTIE와
        (4)식의 TDEV값을 나타내고 있다.
      - 'passage: ADPSS는 DPSS를 이용한 채널 보간과 예측에 스무딩을 적용한 기법으로 과정은 다음과 같다.'
  - source_sentence: 'query: 디지털 프린터의 취약성 중 복사에서는 데이터 유출을 무엇을 통해 발생할 수 있어?'
    sentences:
      - >-
        passage: 계속 변화하는 노드 관계를 위한 이웃 노드정보 리스트 구조의 예 노드-X의 이웃정보 리스트 (level /
        relation )NumberID \BrNum[1][2][3][4][5]\( \mathrm { P } _ {\text {
        parent } } \)\( \mathrm { P } _ {\text { sibling } } \)\( \mathrm { P }
        _ {\text { child } }
        \)1A1/P1/P1.0002B2/S2/P0.40.603C2/S3/S01.004E2/S4/C0.20.40.45F3/C3/S00.20.86G3/C4/C001.0노드-X의
        레벨23\( \mathrm { N } _ {\mathrm { p } } \)1.6\( \mathrm { N } _ {\mathrm
        { s } } \)2.2\( \mathrm { N } _ {\mathrm { c } } \)2.2
      - >-
        passage: 따라서 인식된 차량에 대한 추적(tracking) 등을 통하여 인식된 차량에 대한 연산을 줄여주는 방법이
        필요하다.
      - >-
        passage: CIAC-2304에서 보고된 취약성분류취약성팩스메시지 인증이 불가하여 공격자가 중간에서 데이터를 위변조 할 수
        있음가입자의 전화번호나 서비스 제공자의 ID를 위변조할 수 있음팩스 기기에 대한 인증이 되지 않을 경우, 전화번호를 스푸핑할 수
        있음팩스 전송 시, 암호화하지 않는 경우 도청을 통해 중요 데이터가 유출될 수 있음잘못된 팩스 설정이나 사용자의 부주의로 인해
        시스템이 취약해질 수 있음하드웨어 자원의 한계로 인해 저장된 데이터가 삭제될 수 있음복사네트워크를 통해 저장 데이터 유출될 수
        있음인쇄할 데이터를 위변조하여 인쇄될 수 있음
  - source_sentence: >-
      query: Number of clusters \( 37 \)의 결과값이 19인 것과 상관있는 Cluster location은
      뭐에요?
    sentences:
      - >-
        passage: 또한 서비스 세션을 위해 선택되는 디바이스는 사용자의 위치나 업무, 서비스가 요청되는 시기에 따라 수시로 변하게
        된다.
      - 'passage: 이렇게 대리 서명용 개인 키와 공개 키를 생성함으로써 안전하게 대리 서명 권한을 위임 받게 된다.'
      - >-
        passage: 〈표 \( 2 \)〉 필요한 서브키 스트링 비교 Cluster locationRequired \( \mathrm
        { SS } \)Number of clusters\( 7 \)\( 19 \)\( 37 \)Group A\( 1 O L P_
        {\text { INTRA } } + 3 O L P_ {\text { INTER } } \)\( 6 \)\( 6 \)\( 6
        \)Group B\( 2 O L P_ {\text { INTRA } } + 4 O L P_ {\text { INTER } }
        \)\( 0 \)\( 6 \)\( 12 \)Group C\( 3 O L P_ { I N T R A } + 6 O L P_ { I
        N T E R } \)\( 1 \)\( 7 \)\( 19 \)
  - source_sentence: 'query: 주문의 설명은 무엇인가?'
    sentences:
      - >-
        passage: 〈표 2〉 천연비누 쇼핑몰 시스템의 단어 사전 일부 단어영문명약어명동의어설명주문ORDERORD상품의 생산이나
        서비스 의 제공을 요구번호NUMBERNo차례를 나타내거나 식별하기 위해 붙이는 숫자일자DATEDT날짜, 일날짜...
      - >-
        passage: \(\mathrm{a1} \) 정점처럼 상위 정점이 없이 시작하여 \(\mathrm{an} \)에 이르는 정점들은
        멀티태스킹의 가능성을 나타내는 정보를 표시할 때 같은 이름인 m1으로 표시한다.
      - >-
        passage: 〈표 \(2 \)〉 Binary NAF Method 스칼라곱 연산과 RENAF Method 멱 승 연산 종류연산
        결과 \( ( \mathrm { X } =2) \)BinaryNAF \((7X) \)\( (100-1)_ {\text { NAF
        } } \cdot 2=2 * 2 \rightarrow 2 * 4 \rightarrow 2 * 3-2=14 \)RENAF \(
        \left (X ^ { 7 } \right ) \)\( 2 ^ { (100)-11_ {\text { RENAF } } } =2 *
        2 \rightarrow 4 * 4 \rightarrow 16 * 16 / 2=128 \)
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: SentenceTransformer based on intfloat/multilingual-e5-large
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: lora evaluation
          type: lora-evaluation
        metrics:
          - type: cosine_accuracy@1
            value: 0.467
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.699
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.786
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.8755
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.467
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.233
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.15719999999999998
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.08755000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.467
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.699
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.786
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.8755
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.6679734745637986
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.6018291666666665
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.6075979460358361
            name: Cosine Map@100

SentenceTransformer based on intfloat/multilingual-e5-large

This is a sentence-transformers model fine-tuned from intfloat/multilingual-e5-large. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: intfloat/multilingual-e5-large
Maximum Sequence Length: 512 tokens
Output Dimensionality: 1024 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'PeftModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
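
The Pooling and Normalize stages listed above can be illustrated with a small pure-Python sketch. It assumes mean pooling over non-padded tokens followed by L2 normalization, which is what `pooling_mode_mean_tokens: True` and `Normalize()` indicate; the function names and the toy 2-dimensional vectors are illustrative only and are not part of the library.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [t / max(count, 1) for t in total]

def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# Toy input: three token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
```

Because the output is unit-length, the cosine similarity between two embeddings reduces to a plain dot product.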

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("kangbeom/multilingual-e5-large")
# Run inference
sentences = [
    'query: 주문의 설명은 무엇인가?',
    'passage: 〈표 2〉 천연비누 쇼핑몰 시스템의 단어 사전 일부 단어영문명약어명동의어설명주문ORDERORD상품의 생산이나 서비스 의 제공을 요구번호NUMBERNo차례를 나타내거나 식별하기 위해 붙이는 숫자일자DATEDT날짜, 일날짜...',
    'passage: 〈표  $2$ 〉 Binary NAF Method 스칼라곱 연산과 RENAF Method 멱 승 연산 종류연산 결과  $( \\mathrm { X } =2)$ BinaryNAF  $(7 X)$ \\( (100-1)_ {\\text { NAF } } \\cdot 2=2 * 2 \\rightarrow 2 * 4 \\rightarrow 2 * 3-2=14 \\)RENAF  $\\left (X ^ { 7 } \\right )$ \\( 2 ^ { (100)-11_ {\\text { RENAF } } } =2 * 2 \\rightarrow 4 * 4 \\rightarrow 16 * 16 / 2=128 \\)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5568, 0.1304],
#         [0.5568, 1.0000, 0.1720],
#         [0.1304, 0.1720, 1.0000]])
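
The examples in this card prefix every input with `query: ` or `passage: `, following the E5 convention of the base model. As a hedged sketch of how the similarity output might be used for retrieval, the helpers below add the prefixes and rank passages by score. The helper names are hypothetical, and the scores are copied from the first row of the matrix printed above (with the self-similarity entry dropped).

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "

def as_query(text):
    """Add the query prefix unless it is already present."""
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text

def as_passage(text):
    """Add the passage prefix unless it is already present."""
    return text if text.startswith(PASSAGE_PREFIX) else PASSAGE_PREFIX + text

def rank_passages(scores):
    """Return passage indices ordered from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Similarity of the query to the two passages, taken from the output above.
query_to_passages = [0.5568, 0.1304]
ranking = rank_passages(query_to_passages)

prefixed_query = as_query("주문의 설명은 무엇인가?")
```

Here the first passage (the word-dictionary table that defines "주문") ranks above the unrelated NAF passage.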

Evaluation

Metrics

Information Retrieval

Dataset: lora-evaluation
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.467
cosine_accuracy@3	0.699
cosine_accuracy@5	0.786
cosine_accuracy@10	0.8755
cosine_precision@1	0.467
cosine_precision@3	0.233
cosine_precision@5	0.1572
cosine_precision@10	0.0876
cosine_recall@1	0.467
cosine_recall@3	0.699
cosine_recall@5	0.786
cosine_recall@10	0.8755
cosine_ndcg@10	0.668
cosine_mrr@10	0.6018
cosine_map@100	0.6076
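
The table is consistent with each query having exactly one relevant passage: accuracy@k equals recall@k, and precision@k equals accuracy@k divided by k. This is an inference from the numbers, not something the card states. The sketch below shows how these metrics relate under that assumption; the function name and the toy ranks are illustrative, and this is not the `InformationRetrievalEvaluator` implementation.

```python
def retrieval_metrics_at_k(ranks, k):
    """Compute metrics when every query has a single relevant passage.

    ranks: 1-based rank of the relevant passage per query,
           or None if it was not retrieved at all.
    """
    n = len(ranks)
    hit_ranks = [r for r in ranks if r is not None and r <= k]
    hits = len(hit_ranks)
    return {
        "accuracy": hits / n,          # at least one relevant hit in top k
        "precision": hits / (n * k),   # relevant hits per retrieved slot
        "recall": hits / n,            # same as accuracy with one relevant item
        "mrr": sum(1.0 / r for r in hit_ranks) / n,
    }

# Toy example: four queries, relevant passage found at ranks 1, 2, 4, and never.
toy_ranks = [1, 2, 4, None]
at_3 = retrieval_metrics_at_k(toy_ranks, 3)

# Check the relationship against the values reported in the table.
reported_accuracy = {3: 0.699, 5: 0.786, 10: 0.8755}
implied_precision = {k: acc / k for k, acc in reported_accuracy.items()}
```

The implied precision values (0.233, 0.1572, 0.08755) match the reported cosine_precision@3, @5, and @10.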

Training Details

Training Dataset

Unnamed Dataset

Size: 7,200 training samples
Columns: sentence_0 and sentence_1
Approximate statistics based on the first 1000 samples:
	sentence_0	sentence_1
type	string	string
details	min: 11 tokens, mean: 28.15 tokens, max: 105 tokens	min: 10 tokens, mean: 127.24 tokens, max: 512 tokens

Samples:

sentence_0	sentence_1
`query: 실험에서 그림 12(d)는 참값에 아주 근접했나요?`	`passage: 그림 12(d)는 본 논문에서 제안하는 알고리즘을 이용한 디블러링 결과영상을 이용하여 3차원 형상 복원을 수행한 결과이다.`
`query: 표에서, 15mm의 메쉬분할 수행시간은 얼마인가?`	`passage: 〈표 5〉 임계값에 따른 메쉬정보 및 수행시간(데이터 A) 임계값 ( ( \mathrm { mm } ) )분할전(15 )(10 )(5 )총메쉬수(8,200 )(163,124 )(297,207 )( 1,185,145 )평균에지길이( 20.43 )( 6.52 )( 4.85 )( 2.45 )수행시간 (sec)메쉬분할( 0.113 )( 0.181 )( 0.785 )거리기반 대응( 0.137 )( 2.966 )( 5.453 )( 21.228 )`
`query: 기업 시스템 인증 및 정보자산 보호관리에 주로 사용되는 표준은 무엇일까?`	`passage: 각 표준 특성 및 취약점 표준특성단일 표준으로 적용 시 취약점ISO 20022금융기관 상호 운영을 위한 표준 모듈간 상호보안 부족 클라이언트의 행동에 관한 보안기능 부족ISO 27001기업 시스템 인증 및 정보자산 보호관리 모듈간 상호보안 부족 기술적인 보안 부족Common CriteriaIT 제품의 개발, 평가, 조달 지침현재 인터넷 뱅킹 관련 보호프로파일 없음웹 환경 구축 및 운영을 위한 보안 관리 지침웹 환경 안전을 위한 기술 특화된 위협에 대한 대응 부족 기능 요구사항 부족`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "gather_across_devices": false
}
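
As a rough illustration of what these parameters mean, the sketch below computes an in-batch-negatives ranking loss in pure Python: each query is scored against every passage in the batch with cosine similarity, the scores are multiplied by `scale`, and cross-entropy is taken with the query's own passage as the target. It is a simplified sketch under those assumptions, not the library's implementation, and the toy embeddings are made up.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(query_embs, passage_embs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i
    and all other passages in the batch act as negatives."""
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cos_sim(query, p) for p in passage_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own passage
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the other passage

loss_aligned = mnr_loss(queries, aligned)
loss_swapped = mnr_loss(queries, swapped)
```

With a per-device batch size of 4, each query would see three in-batch negatives per step under this scheme.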

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 4
per_device_eval_batch_size: 4
num_train_epochs: 2
multi_dataset_batch_sampler: round_robin

All Hyperparameters


overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 4
per_device_eval_batch_size: 4
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 2
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
router_mapping: {}
learning_rate_mapping: {}
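
With `lr_scheduler_type: linear`, `learning_rate: 5e-05`, and no warmup, the learning rate decays in a straight line from its starting value to zero over training. The sketch below shows that shape. The function is illustrative rather than the Transformers scheduler itself, and the total of 3600 steps is an assumption derived from 7,200 samples, a batch size of 4, a single device, and 2 epochs.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 3600  # assumed: 7200 samples / batch size 4 * 2 epochs

lr_start = linear_lr(0, TOTAL_STEPS)
lr_midpoint = linear_lr(1800, TOTAL_STEPS)
lr_end = linear_lr(TOTAL_STEPS, TOTAL_STEPS)
```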

Training Logs

Epoch	Step	Training Loss	lora-evaluation_cosine_ndcg@10
0.2778	500	0.2637	0.6548
0.5556	1000	0.0663	0.6597
0.8333	1500	0.0647	0.6609
1.0	1800	-	0.6658
1.1111	2000	0.0432	0.6680
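
The Epoch column follows from the dataset size and batch size. As a quick check, assuming a single device and no gradient accumulation (consistent with `gradient_accumulation_steps: 1`), 7,200 samples at a batch size of 4 give 1800 steps per epoch. The helper names below are illustrative.

```python
import math

def steps_per_epoch(num_samples, batch_size, num_devices=1, grad_accum=1):
    """Optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / (batch_size * num_devices * grad_accum))

def epoch_at(step, per_epoch):
    """Fractional epoch reached at a given step, rounded as in the log."""
    return round(step / per_epoch, 4)

per_epoch = steps_per_epoch(7200, 4)
logged_epochs = [epoch_at(s, per_epoch) for s in (500, 1000, 1500, 1800, 2000)]
total_steps = per_epoch * 2  # num_train_epochs: 2
```

Under this arithmetic two full epochs would be 3600 steps, whereas the log shown here ends at step 2000.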

Framework Versions

Python: 3.12.11
Sentence Transformers: 5.1.0
Transformers: 4.55.4
PyTorch: 2.8.0+cu126
Accelerate: 1.10.1
Datasets: 4.0.0
Tokenizers: 0.21.4

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}