---
language:
- ko
license: apache-2.0
library_name: transformers
tags:
- diffusion
- text-generation
- korean
- deberta
- masked-language-model
- experimental
base_model: kakaobank/kf-deberta-base
pipeline_tag: fill-mask
---
# kf-deberta-gen
**Generative Diffusion BERT** - a Korean diffusion-based generative language model
- [GitHub repository](https://github.com/hong-seongmin/GenerativeDiffusionBERT)
- [Hugging Face Space](https://huggingface.co/spaces/solonsophy/kf-deberta-gen)
---
## Model Description
This model is an **experimental (PoC) generative model**: it takes [kakaobank/kf-deberta-base](https://huggingface.co/kakaobank/kf-deberta-base) as its base and applies chat fine-tuning with a **discrete diffusion** objective.
> **Warning**: This model is at the proof-of-concept stage. Generation quality is unstable, and problems such as repetitive output may occur.
### Key Features
| Item | Details |
|------|------|
| Base model | kakaobank/kf-deberta-base (DeBERTa-v2) |
| Training method | Masked Diffusion Language Model (MDLM) |
| Noise schedule | Cosine |
| Generation method | Iterative denoising (confidence-based) |
### Differences from Standard MLM
```
Standard MLM: fixed 15% masking → fill-in-the-blank only
Diffusion:    continuous 0-100% masking → can generate the full sequence
```
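
The contrast above can be made concrete with a small sketch of the forward (noising) step. This is an illustration under stated assumptions, not the training code: the model card only says the schedule is cosine and the masking ratio spans 0% to 100%, so the exact form `1 - cos(pi * t / 2)` and the helper names `cosine_mask_ratio` and `mask_answer` are hypothetical. A diffusion time `t` is mapped to a masking ratio, and only answer tokens are masked so the question stays visible as conditioning.

```python
import math
import random


def cosine_mask_ratio(t: float) -> float:
    """Map diffusion time t in [0, 1] to a masking ratio in [0, 1].

    Assumed form (not confirmed by the model card): 1 - cos(pi * t / 2),
    so t=0 masks nothing and t=1 masks (essentially) everything.
    """
    return 1.0 - math.cos(math.pi * t / 2.0)


def mask_answer(tokens, answer_start, t, mask_token="[MASK]", rng=random):
    """Independently mask each answer token with probability cosine_mask_ratio(t).

    Tokens before answer_start (the question) are never masked.
    """
    ratio = cosine_mask_ratio(t)
    noisy = list(tokens)
    for i in range(answer_start, len(noisy)):
        if rng.random() < ratio:
            noisy[i] = mask_token
    return noisy


if __name__ == "__main__":
    example = ["[CLS]", "q1", "q2", "[SEP]", "a1", "a2", "a3", "a4"]
    for t in (0.0, 0.5, 1.0):
        print(t, round(cosine_mask_ratio(t), 3), mask_answer(example, 4, t))
```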
---
## Usage
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("solonsophy/kf-deberta-gen")
model = AutoModelForMaskedLM.from_pretrained("solonsophy/kf-deberta-gen")
```
### Diffusion Generation (Iterative Denoising)
```python
import torch
import torch.nn.functional as F


def generate_diffusion(model, tokenizer, question, num_steps=15, max_answer_len=80):
    model.eval()
    device = next(model.parameters()).device
    MASK_ID = tokenizer.mask_token_id
    CLS_ID = tokenizer.cls_token_id
    SEP_ID = tokenizer.sep_token_id

    # Tokenize the question (truncated to 100 tokens)
    q_tokens = tokenizer.encode(question, add_special_tokens=False)[:100]

    # Initial sequence: [CLS] Q [SEP] [MASK]*N, capped at 256 tokens
    input_ids = [CLS_ID] + q_tokens + [SEP_ID] + [MASK_ID] * max_answer_len
    input_ids = torch.tensor([input_ids[:256]], device=device)
    answer_start = len(q_tokens) + 2

    # Iterative denoising
    for step in range(num_steps):
        with torch.no_grad():
            logits = model(input_ids).logits

        # Positions in the answer region that are still masked
        mask_pos = (input_ids[0, answer_start:] == MASK_ID).nonzero().squeeze(-1) + answer_start
        if len(mask_pos) == 0:
            break

        # Confidence-based unmasking: sample one token per masked position
        mask_logits = logits[0, mask_pos] / 0.8  # temperature
        probs = F.softmax(mask_logits, dim=-1)
        tokens = torch.multinomial(probs, 1).squeeze(-1)
        conf = probs.gather(1, tokens.unsqueeze(-1)).squeeze(-1)

        # Commit the k most confident samples; the rest stay masked
        k = max(1, len(mask_pos) // (num_steps - step))
        top_idx = conf.topk(k).indices
        input_ids[0, mask_pos[top_idx]] = tokens[top_idx]

    # Extract the answer, dropping leftover [MASK] and padding tokens
    answer = input_ids[0, answer_start:]
    answer = answer[(answer != MASK_ID) & (answer != tokenizer.pad_token_id)]
    return tokenizer.decode(answer, skip_special_tokens=True)


# Usage example (model and tokenizer loaded as in "Basic Usage")
answer = generate_diffusion(model, tokenizer, "인공지능이란 무엇인가요?")  # "What is artificial intelligence?"
print(answer)
```
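
To see how the rule `k = max(1, len(mask_pos) // (num_steps - step))` spreads the work across steps, the following sketch replays only the counting logic, with no model involved. The helper name `unmask_schedule` is hypothetical. It assumes exactly `k` masks are removed per step, which matches the loop above as long as no sampled token happens to be the `[MASK]` token itself. With the defaults (80 masked positions, 15 steps), every position is filled by the final step.

```python
def unmask_schedule(num_masks: int, num_steps: int) -> list[int]:
    """Return how many positions are unmasked at each denoising step."""
    remaining = num_masks
    counts = []
    for step in range(num_steps):
        if remaining == 0:
            break  # mirrors the early exit when no [MASK] is left
        # Split the remaining masks evenly over the remaining steps,
        # but always commit at least one token.
        k = max(1, remaining // (num_steps - step))
        counts.append(k)
        remaining -= k
    return counts


if __name__ == "__main__":
    schedule = unmask_schedule(80, 15)
    print(schedule)       # per-step counts
    print(sum(schedule))  # total positions filled
```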
---
## Training Details
### Chat Fine-tuning Settings
| Parameter | Value |
|----------|-----|
| Epochs | 3 |
| Batch Size | 64 |
| Learning Rate | 5e-5 |
| Max Length | 256 |
| Q Max Length | 100 |
| A Max Length | 153 |
| Noise Schedule | Cosine |
| Masking Ratio | 0% ~ 100% |
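
The length limits in the table add up as 100 + 153 + 3 = 256, which suggests three special tokens per example. The sketch below shows one layout consistent with that budget. It is an assumption, not the documented preprocessing: the trailing `[SEP]`, the `[PAD]` filling, and the helper name `build_example` are all hypothetical, and token strings stand in for token IDs.

```python
MAX_LEN = 256  # "Max Length" from the table
Q_MAX = 100    # "Q Max Length"
A_MAX = 153    # "A Max Length"


def build_example(q_tokens, a_tokens, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Assemble [CLS] Q [SEP] A [SEP] and pad to MAX_LEN (assumed layout).

    Returns the sequence and the index where the answer begins.
    """
    q = list(q_tokens[:Q_MAX])  # truncate the question
    a = list(a_tokens[:A_MAX])  # truncate the answer
    seq = [cls] + q + [sep] + a + [sep]
    answer_start = len(q) + 2   # skip [CLS], the question, and the first [SEP]
    seq += [pad] * (MAX_LEN - len(seq))
    return seq, answer_start


if __name__ == "__main__":
    seq, start = build_example(["q"] * 300, ["a"] * 300)
    print(len(seq), start)
```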
---
## Training Data
| Dataset | License |
|----------|----------|
| OpenLab-NLP/tiny-singleturn-chat-ko | MIT |
| davidkim205/kollm-converations | Apache-2.0 |
| heegyu/hh-rlhf-ko | MIT |
| nlpai-lab/kullm-v2 | Apache-2.0 |
| heegyu/OIG-small-chip2-ko | Apache-2.0 |
| AIdenU/orca_dpo_data_ko | Apache-2.0 |
---
## License
This model is released under the **Apache-2.0 license**.
- Base model (kakaobank/kf-deberta-base): MIT
---
## Citation
```bibtex
@misc{kf-deberta-gen,
author = {Hong Seongmin},
title = {Generative Diffusion BERT: Korean Discrete Diffusion Language Model},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/solonsophy/kf-deberta-gen}
}
```