solon committed on
Commit
ee87962
·
0 Parent(s):

Update model to diffusion-chat-final2 and remove Pre-training references

.gitattributes ADDED
@@ -0,0 +1,2 @@
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,162 @@
+ ---
+ language:
+ - ko
+ license: apache-2.0
+ library_name: transformers
+ tags:
+ - diffusion
+ - text-generation
+ - korean
+ - deberta
+ - masked-language-model
+ - experimental
+ base_model: kakaobank/kf-deberta-base
+ pipeline_tag: fill-mask
+ ---
+
+ # 🌀 kf-deberta-gen
+
+ **Generative Diffusion BERT** - a Korean diffusion-based generative language model
+
+ [![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/hong-seongmin/GenerativeDiffusionBERT)
+
+ ---
+
+ ## Model Description
+
+ This model is an **experimental (proof-of-concept) generative model** built on [kakaobank/kf-deberta-base](https://huggingface.co/kakaobank/kf-deberta-base) and chat fine-tuned with a **discrete diffusion** objective.
+
+ > ⚠️ **Caution**: This model is at the proof-of-concept stage. Generation quality is unstable, and problems such as repetitive output may occur.
+
+ ### Key Features
+
+ | Item | Details |
+ |------|------|
+ | Base model | kakaobank/kf-deberta-base (DeBERTa-v2) |
+ | Training method | Masked Diffusion Language Model (MDLM) |
+ | Noise schedule | Cosine |
+ | Generation | Iterative denoising (confidence-based) |
+
+ ### Difference from Standard MLM
+
+ ```
+ Standard MLM: fixed 15% masking → fill-in-the-blank only
+ Diffusion: continuous 0~100% masking → full-sequence generation
+ ```
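+
+ To make the contrast concrete, below is a minimal sketch of diffusion-style corruption. It is illustrative only: `cosine_mask_ratio` is one common form of a cosine schedule (the training settings name the schedule but do not spell it out), and `corrupt` is a hypothetical helper, not code from the repository.
+
+ ```python
+ import math
+ import random
+ import torch
+
+ def cosine_mask_ratio(t):
+     # One common cosine schedule: the mask ratio rises smoothly
+     # from 0 at t=0 to 1 at t=1 (an assumed form, see lead-in).
+     return 1.0 - math.cos(t * math.pi / 2)
+
+ def corrupt(input_ids, mask_token_id, t):
+     # Unlike MLM's fixed ~15%, the masked fraction depends on t.
+     ratio = cosine_mask_ratio(t)
+     noise = torch.rand(input_ids.shape) < ratio
+     corrupted = input_ids.clone()
+     corrupted[noise] = mask_token_id
+     return corrupted, noise
+
+ # Training samples t ~ U(0, 1) per example, so the model sees everything
+ # from nearly clean input to a fully masked sequence.
+ t = random.random()
+ ```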
+
+ ---
+
+ ## Usage
+
+ ### Basic Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ tokenizer = AutoTokenizer.from_pretrained("solonsophy/kf-deberta-gen")
+ model = AutoModelForMaskedLM.from_pretrained("solonsophy/kf-deberta-gen")
+ ```
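+
+ Because the checkpoint is a standard `DebertaV2ForMaskedLM`, it can also be used as a plain fill-mask model (single-token infilling), independent of the diffusion loop below; the sentence here is just an illustrative example:
+
+ ```python
+ from transformers import pipeline
+
+ fill = pipeline("fill-mask", model="solonsophy/kf-deberta-gen")
+ print(fill("한국의 수도는 [MASK]입니다."))  # "The capital of Korea is [MASK]."
+ ```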
+
+ ### Diffusion Generation (Iterative Denoising)
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def generate_diffusion(model, tokenizer, question, num_steps=15, max_answer_len=80):
+     model.eval()
+     device = next(model.parameters()).device
+
+     MASK_ID = tokenizer.mask_token_id
+     CLS_ID = tokenizer.cls_token_id
+     SEP_ID = tokenizer.sep_token_id
+
+     # Tokenize the question
+     q_tokens = tokenizer.encode(question, add_special_tokens=False)[:100]
+
+     # Initial sequence: [CLS] Q [SEP] [MASK]*N
+     input_ids = [CLS_ID] + q_tokens + [SEP_ID] + [MASK_ID] * max_answer_len
+     input_ids = torch.tensor([input_ids[:256]], device=device)
+     answer_start = len(q_tokens) + 2
+
+     # Iterative denoising
+     for step in range(num_steps):
+         with torch.no_grad():
+             logits = model(input_ids).logits
+
+         mask_pos = (input_ids[0, answer_start:] == MASK_ID).nonzero().squeeze(-1) + answer_start
+         if len(mask_pos) == 0:
+             break
+
+         # Confidence-based unmasking: sample a token for every masked
+         # position, then commit only the most confident ones this step
+         mask_logits = logits[0, mask_pos] / 0.8  # temperature
+         probs = F.softmax(mask_logits, dim=-1)
+         tokens = torch.multinomial(probs, 1).squeeze(-1)
+         conf = probs.gather(1, tokens.unsqueeze(-1)).squeeze(-1)
+
+         k = max(1, len(mask_pos) // (num_steps - step))
+         top_idx = conf.topk(k).indices
+         input_ids[0, mask_pos[top_idx]] = tokens[top_idx]
+
+     # Extract the answer
+     answer = input_ids[0, answer_start:]
+     answer = answer[(answer != MASK_ID) & (answer != tokenizer.pad_token_id)]
+     return tokenizer.decode(answer, skip_special_tokens=True)
+
+ # Example usage: "What is artificial intelligence?"
+ answer = generate_diffusion(model, tokenizer, "인공지능이란 무엇인가요?")
+ print(answer)
+ ```
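+
+ Note the schedule implied by `k = max(1, len(mask_pos) // (num_steps - step))`: it spreads the remaining masked positions roughly evenly across the remaining steps, so the final step commits everything still masked. Raising `num_steps` (or lowering the 0.8 temperature) trades speed for more gradual and typically more stable denoising.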
+
+ ---
+
+ ## Training Details
+
+ ### Chat Fine-tuning Settings
+
+ | Parameter | Value |
+ |----------|-----|
+ | Epochs | 3 |
+ | Batch Size | 64 |
+ | Learning Rate | 5e-5 |
+ | Max Length | 256 |
+ | Q Max Length | 100 |
+ | A Max Length | 153 |
+ | Noise Schedule | Cosine |
+ | Masking Ratio | 0% ~ 100% |
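+
+ As a rough sketch of what these settings imply for one training step (an assumption inferred from the table and the generation scheme above, not the repository's actual code): the question tokens presumably stay clean while a scheduled fraction of the answer tokens is masked, and cross-entropy is computed on the masked positions only. `diffusion_step_loss` and `answer_mask` are hypothetical names:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def diffusion_step_loss(model, input_ids, answer_mask, mask_token_id, ratio):
+     # answer_mask: True on answer-token positions (question stays clean)
+     noise = (torch.rand(input_ids.shape, device=input_ids.device) < ratio) & answer_mask
+     corrupted = input_ids.clone()
+     corrupted[noise] = mask_token_id
+     logits = model(corrupted).logits
+     labels = input_ids.clone()
+     labels[~noise] = -100  # cross_entropy ignores -100 positions by default
+     return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
+ ```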
+
+ ---
+
+ ## Training Data
+
+ | Dataset | License |
+ |----------|----------|
+ | OpenLab-NLP/tiny-singleturn-chat-ko | MIT |
+ | davidkim205/kollm-converations | Apache-2.0 |
+ | heegyu/hh-rlhf-ko | MIT |
+ | nlpai-lab/kullm-v2 | Apache-2.0 |
+ | heegyu/OIG-small-chip2-ko | Apache-2.0 |
+ | AIdenU/orca_dpo_data_ko | Apache-2.0 |
+
+ ---
+
+ ## License
+
+ This model is released under the **Apache-2.0 license**.
+
+ Base model (kakaobank/kf-deberta-base): MIT
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @misc{kf-deberta-gen,
+   author = {Hong Seongmin},
+   title = {Generative Diffusion BERT: Korean Discrete Diffusion Language Model},
+   year = {2025},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/solonsophy/kf-deberta-gen}
+ }
+ ```
config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "architectures": [
+     "DebertaV2ForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "conv_act": "gelu",
+   "conv_kernel_size": 0,
+   "dtype": "float32",
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-07,
+   "legacy": true,
+   "max_position_embeddings": 512,
+   "max_relative_positions": -1,
+   "model_type": "deberta-v2",
+   "norm_rel_ebd": "layer_norm",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_dropout": 0,
+   "pooler_hidden_act": "gelu",
+   "pooler_hidden_size": 768,
+   "pos_att_type": [
+     "p2c",
+     "c2p"
+   ],
+   "position_biased_input": false,
+   "position_buckets": 256,
+   "relative_attention": true,
+   "share_att_key": true,
+   "transformers_version": "4.57.3",
+   "type_vocab_size": 0,
+   "vocab_size": 130000
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f7c934f89d01f35a8abfebcabf6e475e383900d2ee4bd78af3626216273d0e2c
+ size 744076256
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,60 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "[CLS]",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "eos_token": "[SEP]",
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2b0d7e63ab22877a5b10a58befa25f884cedbde588732fe0055f5073b77ed16f
+ size 5777
vocab.txt ADDED
The diff for this file is too large to render. See raw diff