---
language:
- ko
license: apache-2.0
library_name: transformers
tags:
- diffusion
- text-generation
- korean
- deberta
- masked-language-model
- experimental
base_model: kakaobank/kf-deberta-base
pipeline_tag: fill-mask
---

# 🌀 kf-deberta-gen

**Generative Diffusion BERT** - Korean diffusion-based generative language model

[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/hong-seongmin/GenerativeDiffusionBERT)
[![Space](https://img.shields.io/badge/🤗%20Space-Demo-green)](https://huggingface.co/spaces/solonsophy/kf-deberta-gen)

---

## Model Description

This is an **experimental (PoC) generative model**: [kakaobank/kf-deberta-base](https://huggingface.co/kakaobank/kf-deberta-base) chat fine-tuned with a **Discrete Diffusion** objective.

> ⚠️ **Note**: This model is at the proof-of-concept stage. Generation quality is unstable, and issues such as repetitive output may occur.

### Key Features

| Item | Details |
|------|------|
| Base model | kakaobank/kf-deberta-base (DeBERTa-v2) |
| Training method | Masked Diffusion Language Model (MDLM) |
| Noise Schedule | Cosine |
| Generation method | Iterative Denoising (Confidence-based) |

### Differences from Conventional MLM

```
Conventional MLM: fixed 15% masking → fill-in-the-blank only
Diffusion:        continuous 0~100% masking → full-sequence generation possible
```

---

## Usage

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("solonsophy/kf-deberta-gen")
model = AutoModelForMaskedLM.from_pretrained("solonsophy/kf-deberta-gen")
```

### Diffusion Generation (Iterative Denoising)

```python
import torch
import torch.nn.functional as F

def generate_diffusion(model, tokenizer, question, num_steps=15, max_answer_len=80):
    model.eval()
    device = next(model.parameters()).device

    MASK_ID = tokenizer.mask_token_id
    CLS_ID = tokenizer.cls_token_id
    SEP_ID = tokenizer.sep_token_id

    # Tokenize the question (truncated to the training-time Q Max Length of 100)
    q_tokens = tokenizer.encode(question, add_special_tokens=False)[:100]

    # Initial sequence: [CLS] Q [SEP] [MASK]*N, truncated to Max Length 256
    input_ids = [CLS_ID] + q_tokens + [SEP_ID] + [MASK_ID] * max_answer_len
    input_ids = torch.tensor([input_ids[:256]], device=device)
    answer_start = len(q_tokens) + 2  # skip [CLS] and [SEP]

    # Iterative denoising
    for step in range(num_steps):
        with torch.no_grad():
            logits = model(input_ids).logits

        # Positions in the answer region that are still masked
        mask_pos = (input_ids[0, answer_start:] == MASK_ID).nonzero().squeeze(-1) + answer_start
        if len(mask_pos) == 0:
            break

        # Confidence-based unmasking: sample a token per masked position,
        # then keep only the most confident samples this step
        mask_logits = logits[0, mask_pos] / 0.8  # temperature
        probs = F.softmax(mask_logits, dim=-1)
        tokens = torch.multinomial(probs, 1).squeeze(-1)
        conf = probs.gather(1, tokens.unsqueeze(-1)).squeeze(-1)

        # Spread the remaining masks evenly over the remaining steps
        k = max(1, len(mask_pos) // (num_steps - step))
        top_idx = conf.topk(k).indices
        input_ids[0, mask_pos[top_idx]] = tokens[top_idx]

    # Extract the answer, dropping any leftover mask or padding tokens
    answer = input_ids[0, answer_start:]
    answer = answer[(answer != MASK_ID) & (answer != tokenizer.pad_token_id)]
    return tokenizer.decode(answer, skip_special_tokens=True)

# Example usage
answer = generate_diffusion(model, tokenizer, "인공지능이란 무엇인가요?")
print(answer)
```

---

## Training Information

### Chat Fine-tuning Settings

| Parameter | Value |
|----------|-----|
| Epochs | 3 |
| Batch Size | 64 |
| Learning Rate | 5e-5 |
| Max Length | 256 |
| Q Max Length | 100 |
| A Max Length | 153 |
| Noise Schedule | Cosine |
| Masking Ratio | 0% ~ 100% |

---

## Training Data

| Dataset | License |
|----------|----------|
| OpenLab-NLP/tiny-singleturn-chat-ko | MIT |
| davidkim205/kollm-converations | Apache-2.0 |
| heegyu/hh-rlhf-ko | MIT |
| nlpai-lab/kullm-v2 | Apache-2.0 |
| heegyu/OIG-small-chip2-ko | Apache-2.0 |
| AIdenU/orca_dpo_data_ko | Apache-2.0 |

---

## License

This model is distributed under the **Apache-2.0 license**.

Base model (kakaobank/kf-deberta-base): MIT

---

## Citation

```bibtex
@misc{kf-deberta-gen,
  author = {Hong Seongmin},
  title = {Generative Diffusion BERT: Korean Discrete Diffusion Language Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/solonsophy/kf-deberta-gen}
}
```
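
---

## Appendix: Cosine Masking Sketch

The training settings list a Cosine noise schedule with a masking ratio ranging from 0% to 100%, but this card does not give the exact formula or the training code. The sketch below illustrates one plausible reading, using only the Python standard library. Everything in it is an assumption for illustration: the schedule form `1 - cos(pi * t / 2)`, the helper names `cosine_mask_ratio` and `mask_answer`, the choice to mask only answer tokens, and the placeholder token ids.

```python
import math
import random

def cosine_mask_ratio(t: float) -> float:
    """Map a diffusion time t in [0, 1] to a masking ratio in [0, 1].

    Assumed form: 1 - cos(pi * t / 2). The model card only says "Cosine".
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return 1.0 - math.cos(math.pi * t / 2.0)

def mask_answer(token_ids, answer_start, t, mask_id, rng=random):
    """Mask each answer token independently with probability cosine_mask_ratio(t).

    Tokens before answer_start (the [CLS] Q [SEP] prefix) are left untouched,
    mirroring the layout used by generate_diffusion.
    """
    ratio = cosine_mask_ratio(t)
    noised = list(token_ids)
    for i in range(answer_start, len(noised)):
        if rng.random() < ratio:
            noised[i] = mask_id
    return noised

# Hypothetical token ids for illustration only; mask_id=4 is a placeholder,
# not the real mask id of the kf-deberta tokenizer.
example = [2, 11, 12, 3, 21, 22, 23, 24]
print(mask_answer(example, answer_start=4, t=0.5, mask_id=4, rng=random.Random(0)))
```

Under this reading, `t = 0` leaves the answer intact and `t = 1` masks it entirely, which matches the fully masked starting point of `generate_diffusion`. Drawing `t` at random per training example is what would expose the model to every masking ratio between those extremes, in contrast to the fixed 15% used by conventional MLM.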