Model Description

This repository hosts 'Llama-3-deidentifier', a 'Llama-3-8B-Instruct' model fine-tuned for the task of de-identifying personal information in multi-turn conversations. It replaces sensitive data such as names, contact details, addresses, and account numbers with consistent placeholder tags, improving privacy and easing compliance with data-protection regulations.
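
For example (illustrative, not verbatim model output), the sentence '제 이름은 κΉ€μ§€μˆ˜μž…λ‹ˆλ‹€' ("My name is Kim Jisoo") becomes '제 이름은 [PERSON1]μž…λ‹ˆλ‹€', and the same placeholder is reused wherever that name recurs; the full tag scheme ([PERSON*], [CONTACT*], [EMAIL*], [ADDRESS*], [ACCOUNT*]) is spelled out in the system prompt under "How to use".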

Training Data

The model was fine-tuned on the 'irene93/deidentification-chat-ko' dataset, a collection of Korean chat messages annotated for removing personally identifiable information (PII). This makes it well suited for training models to detect and mask sensitive information in text. More details can be found on the dataset's Hugging Face repository.
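
The dataset can be loaded with the Hugging Face datasets library. Below is a minimal inspection sketch; the split and column names are not documented in this card, so the code simply peeks at the first record of whatever split exists:

from datasets import load_dataset

# Load the Korean de-identification chat dataset from the Hub.
ds = load_dataset("irene93/deidentification-chat-ko")
split = next(iter(ds))  # first available split name, e.g. "train"
print(ds)               # splits and sizes
print(ds[split][0])     # field names and one example record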

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model (F16 weights); device_map="auto" is optional and needs the accelerate package.
model = AutoModelForCausalLM.from_pretrained("irene93/Llama-3-deidentifier", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("irene93/Llama-3-deidentifier")

# Sample Korean customer-service chat containing PII: a name, an order number, an address, a phone number, and a bank account.
test_converse = '''고객: μ•ˆλ…•ν•˜μ„Έμš”, 제 이름은 κΉ€μ§€μˆ˜μž…λ‹ˆλ‹€. μ–Όλ§ˆ 전에 μ£Όλ¬Έν•œ μ œν’ˆμ΄ 아직 λ„μ°©ν•˜μ§€ μ•Šμ•˜μŠ΅λ‹ˆλ‹€. 제 μ£Όλ¬Έλ²ˆν˜ΈλŠ” SH12345이고, 저희 μ§‘ μ£Όμ†ŒλŠ” μ„œμšΈμ‹œ 강남ꡬ ν…Œν—€λž€λ‘œ 123-45μž…λ‹ˆλ‹€. 확인 λΆ€νƒλ“œλ¦½λ‹ˆλ‹€.

상담원: μ•ˆλ…•ν•˜μ„Έμš”, κΉ€μ§€μˆ˜ κ³ κ°λ‹˜. 저희 μ‡Όν•‘λͺ°μ„ μ΄μš©ν•΄ μ£Όμ…”μ„œ κ°μ‚¬ν•©λ‹ˆλ‹€. κ³ κ°λ‹˜μ˜ μ£Όλ¬Έ 상황을 ν™•μΈν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. μž μ‹œλ§Œ κΈ°λ‹€λ € μ£Όμ‹œκ² μ–΄μš”?

상담원: 확인 κ²°κ³Ό, κ³ κ°λ‹˜μ˜ 주문은 νƒλ°°μ‚¬μ˜ λ¬Όλ₯˜ μ§€μ—°μœΌλ‘œ 인해 배솑이 μ§€μ—°λ˜κ³  μžˆμŠ΅λ‹ˆλ‹€. ν†΅μƒμ μœΌλ‘œ 2~3일 λ‚΄μ—λŠ” 배솑될 μ˜ˆμ •μž…λ‹ˆλ‹€. λΆˆνŽΈμ„ λ“œλ € μ£„μ†‘ν•©λ‹ˆλ‹€.

고객: μ§€μ—° μ‚¬μœ λ₯Ό μ•Œλ €μ£Όμ…”μ„œ κ°μ‚¬ν•©λ‹ˆλ‹€. 배솑 μ˜ˆμ •μΌμ„ μ•Œ 수 μžˆμ„κΉŒμš”?

상담원: λ„€, ν˜„μž¬λ‘œμ„œλŠ” λͺ©μš”μΌκΉŒμ§€λŠ” 도착할 κ²ƒμœΌλ‘œ μ˜ˆμƒλ©λ‹ˆλ‹€. λ„μ°©ν•˜λ©΄ κ³ κ°λ‹˜κ»˜ μ—°λ½λ“œλ¦¬κ² μŠ΅λ‹ˆλ‹€. 연락 λ°›μœΌμ‹€ μ „ν™”λ²ˆν˜Έκ°€ 010-1234-5678이 λ§žμœΌμ‹ κ°€μš”?

고객: λ„€, λ§žμŠ΅λ‹ˆλ‹€. κ·Έ μ „ν™”λ²ˆν˜Έλ‘œ μ•Œλ €μ£Όμ„Έμš”.

상담원: ν™•μΈν–ˆμŠ΅λ‹ˆλ‹€. λ§Œμ•½ 배솑 κ΄€λ ¨ μΆ”κ°€ λ¬Έμ˜μ‚¬ν•­μ΄ μžˆμœΌμ‹œλ©΄ μ–Έμ œλ“ μ§€ μ—°λ½μ£Όμ„Έμš”. ν˜Ήμ‹œ ν™˜λΆˆμ΄ ν•„μš”ν•˜μ‹œλ‹€λ©΄ μ—°κ²°λœ κ³„μ’Œλ‘œ ν™˜λΆˆ μ²˜λ¦¬ν•΄ λ“œλ¦΄ 수 μžˆμŠ΅λ‹ˆλ‹€. κ³„μ’Œλ²ˆν˜ΈλŠ” ν•˜λ‚˜μ€ν–‰ 123-456-789012둜 ν™•μΈλ©λ‹ˆλ‹€.

고객: λ„€, κ°μ‚¬ν•©λ‹ˆλ‹€. 그럼 κΈ°λ‹€λ¦¬κ² μŠ΅λ‹ˆλ‹€.

상담원: κ°μ‚¬ν•©λ‹ˆλ‹€, κΉ€μ§€μˆ˜ κ³ κ°λ‹˜. λΆˆνŽΈμ„ λ“œλ € λ‹€μ‹œ ν•œλ²ˆ μ‚¬κ³Όλ“œλ¦½λ‹ˆλ‹€. 쒋은 ν•˜λ£¨ λ³΄λ‚΄μ„Έμš”!'''

# The system prompt (Korean, kept verbatim to match the fine-tuning setup) instructs the model to
# replace names with [PERSON1], [PERSON2], ... in order of appearance, and contacts, emails, addresses,
# and account numbers with [CONTACT*], [EMAIL*], [ADDRESS*], [ACCOUNT*], reusing the same tag whenever
# the same value recurs, and to write the result as plain running text.
messages = [
    {"role": "system", "content": "당신은 κ°œμΈμ •λ³΄λ₯Ό κ°μΆ°μ£ΌλŠ” λ‘œλ΄‡μž…λ‹ˆλ‹€.\n\n## μ§€μ‹œ 사항 ##\n1.μ£Όμ–΄μ§„ λŒ€ν™”μ—μ„œ μ‚¬λžŒμ΄λ¦„μ„ [PERSON1], [PERSON2] λ“±μœΌλ‘œ λ“±μž₯ μˆœμ„œμ— 따라 λŒ€μ²΄ν•˜κ³ , λ™μΌν•œ 이름이 반볡될 경우 같은 λŒ€μΉ˜μ–΄λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.\n2.μ—°λ½μ²˜, 이메일, μ£Όμ†Œ , κ³„μ’Œλ²ˆν˜Έλ„ 각각 [CONTACT1], [CONTACT2] λ“±, [EMAIL1],[EMAIL2] λ“±, [ADDRESS1],[ADDRESS2]λ“± , [ACCOUNT1], [ACCOUNT2] λ“± 으둜 λŒ€μΉ˜ν•˜κ³  λ™μΌν•œ 정보가 λ°˜λ³΅λ˜λŠ” κ²½μš°μ—λŠ” 같은 λŒ€μΉ˜μ–΄λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.\n3.λŒ€μΉ˜μ–΄λ₯Ό μž‘μ„±ν• λ•Œ 글머리 κΈ°ν˜Έλ‚˜, λ‚˜μ—΄μ‹ 방식을 쓰지말고 ν‰λ¬ΈμœΌλ‘œ μ΄μ–΄μ„œ μ“°μ‹­μ‹œμ˜€ \n4.μœ„ κ·œμΉ™μ€ λŒ€ν™” 전체에 걸쳐 μΌκ΄€λ˜κ²Œ μ μš©ν•©λ‹ˆλ‹€. \n당신이 κ°œμΈμ •λ³΄λ₯Ό 감좜 λŒ€ν™”λ‚΄μ—­μž…λ‹ˆλ‹€."},
    {"role": "user", "content": f"μž…λ ₯: {test_converse}"}
]

# Apply the Llama-3 chat template and move the prompt tokens to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop generation at either the generic EOS token or Llama-3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Greedy decoding; 2048 new tokens leaves room for the full de-identified transcript.
outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=False,
)

# Decode only the newly generated tokens (everything after the prompt).
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
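
If the model follows the system prompt, the printed response is the input conversation with every piece of PII swapped for a consistent placeholder. An illustrative excerpt (paraphrased, not verbatim model output):

고객: μ•ˆλ…•ν•˜μ„Έμš”, 제 이름은 [PERSON1]μž…λ‹ˆλ‹€. ... 저희 μ§‘ μ£Όμ†ŒλŠ” [ADDRESS1]μž…λ‹ˆλ‹€. ...
상담원: ... 연락 λ°›μœΌμ‹€ μ „ν™”λ²ˆν˜Έκ°€ [CONTACT1]이 λ§žμœΌμ‹ κ°€μš”? ... κ³„μ’Œλ²ˆν˜ΈλŠ” [ACCOUNT1]둜 ν™•μΈλ©λ‹ˆλ‹€.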

Model Performance Metrics

The performance of 'Llama-3-deidentifier' on de-identifying personal information in chat data is quantified with the following metrics:

Metric       Value
---------    ------
Precision    0.8441
Recall       0.8199
F1 Score     0.8318

These metrics were computed on the test split of the 'irene93/deidentification-chat-ko' dataset.
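
The card does not state the exact evaluation protocol. The sketch below is one plausible reading, under the assumption that a prediction is scored against a gold de-identified reference by comparing the multisets of placeholder tags each contains; the function prf and the tag pattern are hypothetical, not the authors' code:

import re
from collections import Counter

TAG = re.compile(r"\[(?:PERSON|CONTACT|EMAIL|ADDRESS|ACCOUNT)\d+\]")

def prf(pred_text, gold_text):
    # Count placeholder tags in the predicted and gold de-identified texts.
    pred = Counter(TAG.findall(pred_text))
    gold = Counter(TAG.findall(gold_text))
    tp = sum(min(pred[t], gold[t]) for t in pred)  # tag occurrences present in both
    precision = tp / max(sum(pred.values()), 1)
    recall = tp / max(sum(gold.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1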
