- korean
- Proposition
- Atomic_fact
---

# Overview

This model is designed for the **abstractive proposition segmentation task** in Korean, as described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). The model segments text into atomic and self-contained units (atomic facts).

# Training Details

- Base Model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- PEFT: LoRA
- Dataset: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
  - The dataset was split into training, validation, and test sets for fine-tuning.
  - The dataset was translated into Korean.
  - More details about the dataset can be found here.
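
The exact fine-tuning script is not included in this card. As a rough illustration of the PEFT setup named above, a LoRA configuration might look like the following; the rank, alpha, dropout, and target modules here are assumptions for the sketch, not the values actually used for this model:

```python
from peft import LoraConfig

# Illustrative values only: the adapter hyperparameters for this model
# are not published in the card.
lora_config = LoraConfig(
    r=16,                                 # assumed LoRA rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```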

# Usage

## Data Preprocessing

```python
from konlpy.tag import Kkma

sent_start_token = "<sent>"
sent_end_token = "</sent>"
instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

kkma = Kkma()

def get_input(text, tokenizer):
    # Split the passage into sentences and wrap each one in sentence markers.
    sentences = kkma.sentences(text)
    prompt = instruction + "Passage: " + sent_start_token + f"{sent_end_token}{sent_start_token}".join(sentences) + sent_end_token + "\nPropositions:\n"
    messages = [{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True)
    return input_text

def get_output(text):
    # Parse the generation into one list of propositions per sentence.
    results = []
    group = []

    lines = text.strip().split("\n")
    for line in lines:
        if line.strip() == sent_start_token:
            continue
        elif line.strip() == sent_end_token:
            results.append(group)
            group = []
        else:
            # Stop at the first line that is not a "- proposition" bullet.
            if not line.strip().startswith("-"):
                break
            line = line.strip()[1:].strip()
            group.append(line)

    return results
```
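
To illustrate the generation format `get_output` expects, here is a standalone example; the bullet texts are invented placeholders, and the parser is repeated so the snippet runs on its own:

```python
sent_start_token = "<sent>"
sent_end_token = "</sent>"

def get_output(text):
    # Same parser as above, repeated so this snippet is self-contained.
    results, group = [], []
    for line in text.strip().split("\n"):
        if line.strip() == sent_start_token:
            continue
        elif line.strip() == sent_end_token:
            results.append(group)
            group = []
        else:
            if not line.strip().startswith("-"):
                break
            group.append(line.strip()[1:].strip())
    return results

generation = (
    "<sent>\n- fact one\n- fact two\n</sent>\n"
    "<sent>\n- fact three\n</sent>"
)
print(get_output(generation))  # [['fact one', 'fact two'], ['fact three']]
```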

## Loading Model and Tokenizer

```python
import torch
import peft
from transformers import AutoModelForCausalLM, AutoTokenizer

LORA_PATH = "seonjeongh/Korean-Propositionalizer"

# Resolve the base model from the adapter's config.
lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(lora_config.base_model_name_or_path,
                                                 torch_dtype=torch.float16,
                                                 device_map="auto")
# Attach the LoRA adapter, then merge it into the base weights for inference.
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(LORA_PATH)
```

## Inference Example

```python
device = "cuda"

text = "옥스포드는 화요일 맨체스터 유나이티드와의 경기에서 3-2로 패한 경기에서 21세 이하 팀으로 득점했다. 그 골은 16세 선수의 1군 데뷔 주장을 강화한 것이다. 센터백은 이번 시즌 웨스트햄 1군과 함께 훈련했다. 웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
inputs = tokenizer([get_input(text, tokenizer)], return_tensors='pt').to(device)
output = model.generate(**inputs, max_new_tokens=512, pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id, use_cache=True)
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# batch_decode returns a list; parse the first (and only) generation.
results = get_output(response[0])
print(results)
```

<details>
<summary>Example output</summary>

```json
[
  [
    "옥스포드는 21세 이하 팀으로 득점했다.",
    "옥스포드는 맨체스터 유나이티드와의 경기에서 득점했다.",
    "옥스포드는 화요일 맨체스터 유나이티드와의 경기에서 득점했다.",
    "옥스포드는 맨체스터 유나이티드와의 경기에서 3-2로 패했다."
  ],
  [
    "그 골은 옥스포드의 주장을 강화한 것이다.",
    "옥스포드는 16세 선수이다.",
    "옥스포드는 1군 데뷔를 주장한 것이다."
  ],
  [
    "옥스포드는 센터백이다.",
    "옥스포드는 웨스트햄 1군과 함께 훈련했다.",
    "옥스포드는 이번 시즌 웨스트햄 1군과 함께 훈련했다."
  ],
  [
    "웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
  ]
]
```

</details>

## Inputs and Outputs

- Input: text (a Korean passage).
- Output: a list of propositions for all the sentences in the passage, with the propositions for each sentence grouped separately.

## Evaluation Results

- Metric: the reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
- Models:
  - Dynamic 10-shot models: for each test example, the 10 most similar examples were selected from the training set using BM25.
  - Translate-test models: [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) combined with EN→KO and KO→EN translation using GPT-4o or GPT-4o-mini.
  - Translate-train models: sLLMs fine-tuned with LoRA on the Korean RoSE dataset.
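
The dynamic few-shot selection described above can be sketched with a small self-contained BM25 ranker. This is an illustration, not the evaluation code: whitespace tokenization stands in for a Korean morphological tokenizer (e.g. Kkma), and `k1`/`b` are the common Okapi defaults.

```python
import math
from collections import Counter

def bm25_top_k(query, corpus, k=10, k1=1.5, b=0.75):
    """Rank whitespace-tokenized documents against a query with Okapi BM25
    and return the indices of the k highest-scoring documents."""
    docs = [doc.split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency of each term across the corpus.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    # Indices of the k best-scoring documents, highest first.
    return sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
```

In the evaluation setup, `query` would be a test passage and `corpus` the training passages, with the top-10 hits formatted as in-context examples.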

**Reference-less metric**

| Model | Precision | Recall | F1 |
|--------------------------------------------|:---------:|:------:|:-----:|
| Gold | 97.46 | 96.28 | 95.88 |
| Dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 98.86 | 93.99 | 95.58 |
| Dynamic 10-shot (GPT-4o) | 97.61 | 97.00 | 96.87 |
| Dynamic 10-shot (GPT-4o-mini) | 98.51 | 97.12 | 97.17 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o translation) | 97.38 | 96.93 | 96.52 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini translation) | 97.24 | 96.26 | 95.73 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 94.66 | 92.81 | 92.08 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)** | 97.41 | 96.02 | 95.93 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | - | - | - |

**Reference-based metric**

| Model | Precision | Recall | F1 |
|--------------------------------------------|:---------:|:------:|:-----:|
| Gold | 100 | 100 | 100 |
| Dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 48.49 | 40.27 | 42.99 |
| Dynamic 10-shot (GPT-4o) | 49.16 | 44.72 | 46.05 |
| Dynamic 10-shot (GPT-4o-mini) | 49.30 | 39.25 | 42.88 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o translation) | 57.02 | 47.52 | 51.10 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini translation) | 57.19 | 47.68 | 51.26 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 42.62 | 38.37 | 39.64 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)** | 50.82 | 45.89 | 47.44 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | - | - | - |