---
license: mit
language:
- ko
base_model:
- klue/bert-base
pipeline_tag: feature-extraction
tags:
- medical
---
# SapBERT-Ko-EN

## 1. Intro

**SapBERT-Ko-EN** is a **SapBERT** (Self-Alignment Pretraining for BERT) model built on a Korean base model.
It aligns Korean and English medical terms using KOSTOM, a Korean-English medical terminology dictionary.

References: [SapBERT](https://aclanthology.org/2021.naacl-main.334.pdf), [Original Code](https://github.com/cambridgeltl/sapbert)
|
| | ## 2. SapBERT-KO-EN |
| | **SapBERT**λ μλ§μ μλ£ λμμ΄λ₯Ό λμΌν μλ―Έλ‘ μ²λ¦¬νκΈ° μν μ¬μ νμ΅ λ°©λ²λ‘ μ
λλ€. |
| | **SapBERT-KO-EN**λ **νΒ·μ νΌμ©μ²΄μ μλ£ κΈ°λ‘**μ μ²λ¦¬νκΈ° μν΄ νΒ·μ μλ£ μ©μ΄λ₯Ό μ λ ¬νμ΅λλ€. |
| |
|
| | β» μμΈν μ€λͺ
λ° νμ΅ μ½λ: [Github](https://github.com/millet04/SapBERT-KO-EN) |
| |
|
| | ## 3. Training |
| |
|
| |
|
| | λͺ¨λΈ νμ΅μ νμ©ν λ² μ΄μ€ λͺ¨λΈ λ° νμ΄νΌ νλΌλ―Έν°λ λ€μκ³Ό κ°μ΅λλ€. |
| |
|
| | - Model : klue/bert-base |
| | - Epochs : 1 |
| | - Batch Size : 64 |
| | - Max Length : 64 |
| | - Dropout : 0.1 |
| | - Pooler : 'cls' |
| | - Eval Step : 100 |
| | - Threshold : 0.8 |
| | - Scale Positive Sample : 1 |
| | - Scale Negative Sample : 60 |
| |
|
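
The Threshold and Scale Positive/Negative values above are presumably the parameters of the multi-similarity loss, the metric-learning objective used by the original SapBERT code. A minimal sketch of that loss, omitting the hard-pair mining step of the original implementation (`sims` and `labels` are assumed inputs, not names from this repository):

```python
import torch

def multi_similarity_loss(sims, labels, scale_pos=1.0, scale_neg=60.0, thresh=0.8):
    """Multi-similarity loss over a batch of term embeddings.

    sims:   (B, B) pairwise cosine similarities between term embeddings
    labels: (B,) concept IDs; rows sharing an ID are positives (synonyms)
    """
    total = 0.0
    for i in range(sims.size(0)):
        pos_mask = labels.eq(labels[i])
        pos_mask[i] = False  # drop the anchor's self-similarity
        neg_mask = labels.ne(labels[i])
        # Positives are pulled above the threshold, negatives pushed below it.
        pos_term = torch.log1p(
            torch.exp(-scale_pos * (sims[i][pos_mask] - thresh)).sum()) / scale_pos
        neg_term = torch.log1p(
            torch.exp(scale_neg * (sims[i][neg_mask] - thresh)).sum()) / scale_neg
        total = total + pos_term + neg_term
    return total / sims.size(0)
```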

SapBERT-KO-EN can be adapted to a specific downstream task by further **fine-tuning**, as sketched below.
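
A minimal fine-tuning sketch for a hypothetical binary classification task (the head, terms, and labels below are illustrative placeholders, not part of the released model):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('snumin44/sap-bert-ko-en')
tokenizer = AutoTokenizer.from_pretrained('snumin44/sap-bert-ko-en')

# Hypothetical task head: 2-way classification on the [CLS] pooler output.
head = torch.nn.Linear(model.config.hidden_size, 2)
optimizer = torch.optim.AdamW(
    list(model.parameters()) + list(head.parameters()), lr=2e-5
)

# Illustrative mini-batch (terms and labels are placeholders).
batch = tokenizer(['간경화', 'brain tumor'], padding=True, return_tensors='pt')
labels = torch.tensor([0, 1])

logits = head(model(**batch).pooler_output)
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```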

※ English terms are mostly tokenized at the character (alphabet) level.
※ Terms referring to the same disease receive relatively high similarity scores.
```python
import numpy as np
from transformers import AutoModel, AutoTokenizer

model_path = 'snumin44/sap-bert-ko-en'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

query = '간경화'  # liver cirrhosis

targets = [
    'liver cirrhosis',
    '간경변',   # liver cirrhosis (synonym)
    'liver cancer',
    '간암',     # liver cancer
    'brain tumor',
    '뇌종양'    # brain tumor
]

# Embed the query with the [CLS] pooler output.
query_feature = tokenizer(query, return_tensors='pt')
query_outputs = model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Score each target term against the query.
for idx, target in enumerate(targets):
    target_feature = tokenizer(target, return_tensors='pt')
    target_outputs = model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```
```
Similarity between query and target 0: 0.7145
Similarity between query and target 1: 0.7186
Similarity between query and target 2: 0.6183
Similarity between query and target 3: 0.6972
Similarity between query and target 4: 0.3929
Similarity between query and target 5: 0.4260
```
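
The character-level handling of English terms mentioned above can be checked by inspecting the tokenizer directly (a quick illustration; the exact subword split depends on the klue/bert-base vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('snumin44/sap-bert-ko-en')

# English medical terms mostly decompose into short, near character-level
# subword pieces under the Korean-centric vocabulary.
print(tokenizer.tokenize('liver cirrhosis'))
```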

## Citing
```
@inproceedings{liu2021self,
    title={Self-Alignment Pretraining for Biomedical Entity Representations},
    author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
    booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    pages={4228--4238},
    month={jun},
    year={2021}
}
```