---
base_model:
- yanolja/EEVE-Korean-Instruct-10.8B-v1.0
datasets:
- Salesforce/rose
language:
- ko
license: apache-2.0
tags:
- korean
- Proposition
- Atomic_fact
---
# Overview
This model is designed for the **abstractive proposition segmentation task** in **Korean**, as described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). The model segments text into atomic and self-contained units (atomic facts).
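
As a rough illustration of the task (an invented English example, not drawn from the training data), a sentence such as "John, who joined the club in 2015, scored the winning goal on Saturday." would be segmented into propositions like:
```
- John joined the club in 2015.
- John scored the winning goal on Saturday.
```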
# Training Details
- **Base Model**: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- **Fine-tuning Method**: LoRA (a configuration sketch follows this list)
- **Dataset**: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
- **Translation**: The dataset was translated into Korean using GPT-4o.
  - GPT-4o was prompted to translate the propositions using the vocabulary of the corresponding passage, keeping the propositions lexically consistent with the text.
- **Data Split**: The dataset was randomly split into training, validation, and test sets (1900:100:500) for fine-tuning.
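
The exact LoRA hyperparameters are not listed in this card. As a minimal, hedged sketch of how such a LoRA fine-tune can be configured with `peft` (the rank, alpha, dropout, and target modules below are illustrative assumptions, not the values used for this model):
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

BASE = "yanolja/EEVE-Korean-Instruct-10.8B-v1.0"

# Illustrative LoRA settings; the rank/alpha/target modules actually used
# for this model are not documented in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16, device_map="auto")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Training then proceeds as standard causal-LM fine-tuning on the
# prompt/target pairs built from the translated RoSE examples.
```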
# Usage
## Data Preprocessing
```python
from konlpy.tag import Kkma

sent_start_token = "<sent>"
sent_end_token = "</sent>"
# Note: the instruction wording refers to <s>/</s>, while the markers actually
# inserted around sentences are the <sent>/</sent> tokens defined above.
instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

kkma = Kkma()  # Korean sentence splitter

def get_input(text, tokenizer):
    # Split the passage into sentences, wrap each in <sent>...</sent>,
    # and render the result as a chat-formatted prompt.
    sentences = kkma.sentences(text)
    prompt = (instruction + "Passage: " + sent_start_token
              + f"{sent_end_token}{sent_start_token}".join(sentences)
              + sent_end_token + "\nPropositions:\n")
    messages = [{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True)
    return input_text

def get_output(text):
    # Parse the model response into one list of propositions per sentence.
    results = []
    group = []
    if text.startswith("Propositions:"):
        lines = text[len("Propositions:"):].strip().split("\n")
    else:
        lines = text.strip().split("\n")
    for line in lines:
        if line.strip() == sent_start_token:
            continue
        elif line.strip() == sent_end_token:
            # End of the current sentence's propositions
            results.append(group)
            group = []
        else:
            # Propositions are expected as "- ..." bullet lines; stop at anything else
            if not line.strip().startswith("-"):
                break
            line = line.strip()[1:].strip()
            group.append(line)
    return results
```
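For clarity, here is what `get_output` does with a hypothetical, hand-written response string in the format the parser expects (this is not actual model output):
```python
# Hypothetical response: each sentence's propositions appear between
# <sent> and </sent> markers as "- ..." bullet lines.
sample_response = (
    "Propositions:\n"
    "<sent>\n"
    "- The match was played on Tuesday.\n"
    "- The team lost 3-2.\n"
    "</sent>\n"
    "<sent>\n"
    "- The goal strengthened the player's case for a debut.\n"
    "</sent>"
)

print(get_output(sample_response))
# [['The match was played on Tuesday.', 'The team lost 3-2.'],
#  ["The goal strengthened the player's case for a debut."]]
```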
## Loading Model and Tokenizer
```python
import peft, torch
from transformers import AutoModelForCausalLM, AutoTokenizer
LORA_PATH = "seonjeongh/Korean-Propositionalizer"
lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(lora_config.base_model_name_or_path,
torch_dtype=torch.float16,
device_map="auto")
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(lora_config.base_model_name_or_path)
```
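Merging the adapter happens at load time; if you want to skip that step on later runs, you can optionally save the merged weights once (the directory name below is just an example) and reload them as a plain model:
```python
# Optional: persist the merged model so later runs can skip the adapter merge.
merged_dir = "korean-propositionalizer-merged"  # example path
model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)

# Later:
# model = AutoModelForCausalLM.from_pretrained(merged_dir, torch_dtype=torch.float16, device_map="auto")
# tokenizer = AutoTokenizer.from_pretrained(merged_dir)
```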
## Inference Example
```python
device = "cuda"
text = "์ฅ์คํฌ๋๋ ํ์์ผ ๋งจ์ฒด์คํฐ ์ ๋์ดํฐ๋์์ ๊ฒฝ๊ธฐ์์ 3-2๋ก ํจํ ๊ฒฝ๊ธฐ์์ 21์ธ ์ดํ ํ์ผ๋ก ๋์ ํ๋ค. ๊ทธ ๊ณจ์ 16์ธ ์ ์์ 1๊ตฐ ๋ฐ๋ท ์ฃผ์ฅ์ ๊ฐํํ ๊ฒ์ด๋ค. ์ผํฐ๋ฐฑ์ ์ด๋ฒ ์์ฆ ์จ์คํธํ 1๊ตฐ๊ณผ ํจ๊ป ํ๋ จํ๋ค. ์จ์คํธํ ์ ๋์ดํฐ๋์ ์ต์ ๋ด์ค๋ ์ฌ๊ธฐ๋ฅผ ํด๋ฆญํ์ธ์."
inputs = tokenizer([get_input(text, tokenizer)], return_tensors='pt').to(device)
output = model.generate(**inputs,
                        max_new_tokens=512,
                        pad_token_id=tokenizer.pad_token_id,
                        eos_token_id=tokenizer.eos_token_id,
                        use_cache=True)
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
results = get_output(response)
print(results)
```
<details>
<summary>Example output</summary>

```json
[
  [
    "์ฅ์คํฌ๋๋ 21์ธ ์ดํ ํ์ผ๋ก ๋์ ํ๋ค.",
    "์ฅ์คํฌ๋๋ ๋งจ์ฒด์คํฐ ์ ๋์ดํฐ๋์์ ๊ฒฝ๊ธฐ์์ 3-2๋ก ํจํ๋ค.",
    "์ฅ์คํฌ๋๋ ํ์์ผ ๊ฒฝ๊ธฐ๋ฅผ ํ๋ค."
  ],
  [
    "๊ทธ ๊ณจ์ 16์ธ ์ ์์ ์ฃผ์ฅ์ ๊ฐํํ ๊ฒ์ด๋ค.",
    "๊ทธ ๊ณจ์ 16์ธ ์ ์์ 1 ๊ตฐ ๋ฐ๋ท ์ฃผ์ฅ์ ๊ฐํํ ๊ฒ์ด๋ค."
  ],
  [
    "์ผํฐ ๋ฐฑ์ ์จ์คํธ ํ 1 ๊ตฐ๊ณผ ํจ๊ป ํ๋ จํ๋ค.",
    "์ผํฐ ๋ฐฑ์ ์ด๋ฒ ์์ฆ ์จ์คํธ ํ 1 ๊ตฐ๊ณผ ํจ๊ป ํ๋ จํ๋ค."
  ],
  [
    "์จ์คํธํ ์ ๋์ดํฐ๋์ ์ต์ ๋ด์ค๋ ์ฌ๊ธฐ๋ฅผ ํด๋ฆญํ์ธ์."
  ]
]
```

</details>
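
To process several passages, one simple approach is to loop over them, re-using the helpers above (`texts` below is a hypothetical list of Korean passages; this is a sketch, not a tuned batching setup):
```python
texts = [text]  # hypothetical list of Korean passages to propositionalize
all_results = []
for t in texts:
    batch = tokenizer([get_input(t, tokenizer)], return_tensors="pt").to(device)
    out = model.generate(**batch,
                         max_new_tokens=512,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.eos_token_id,
                         use_cache=True)
    decoded = tokenizer.batch_decode(out[:, batch["input_ids"].shape[1]:],
                                     skip_special_tokens=True)[0]
    all_results.append(get_output(decoded))
```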
## Inputs and Outputs
- **Input**: A Korean text passage.
- **Output**: A list of proposition lists, one group per sentence in the passage; the propositions for each sentence are grouped separately (see the short sketch below).
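
A minimal sketch for iterating over the grouped output, re-using `kkma`, `text`, and `results` from the inference example (it assumes the model produced exactly one group per sentence):
```python
sentences = kkma.sentences(text)
# zip silently truncates if the model returned fewer/more groups than sentences.
for sentence, propositions in zip(sentences, results):
    print(sentence)
    for proposition in propositions:
        print("  -", proposition)
```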
## Evaluation Results
- **Metric**: Reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
- **Models**:
  - Dynamic 10-shot models: for each test example, the 10 most similar examples were retrieved from the training set with BM25 and used as in-context demonstrations (a retrieval sketch follows this list).
  - Translate-test models: the [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) model, with KO→EN and EN→KO translation performed by GPT-4o or GPT-4o-mini.
  - Translate-train models: small LLMs (sLLMs) LoRA-fine-tuned on the Korean-translated RoSE dataset.
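
For reference, a minimal sketch of the kind of BM25 retrieval used to pick the 10 in-context examples, using the `rank_bm25` package and plain whitespace tokenization as assumptions (the actual retrieval setup may differ; the passages below are toy placeholders):
```python
from rank_bm25 import BM25Okapi

# Toy stand-ins for passages from the training split.
train_passages = [
    "example training passage one",
    "example training passage two",
]
bm25 = BM25Okapi([p.split() for p in train_passages])

test_passage = "example test passage"
top_examples = bm25.get_top_n(test_passage.split(), train_passages, n=10)
# top_examples would then be formatted as the 10 in-context demonstrations in the prompt.
```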

**Reference-less metric**

| Model | Precision | Recall | F1 |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold | 97.46 | 96.28 | 95.88 |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 98.86 | 93.99 | 95.58 |
| dynamic 10-shot GPT-4o | 97.61 | 97.00 | 96.87 |
| dynamic 10-shot GPT-4o-mini | 98.51 | 97.12 | 97.17 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 97.38 | 96.93 | 96.52 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 97.24 | 96.26 | 95.73 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 94.66 | 92.81 | 92.08 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 93.80 | 93.29 | 92.80 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)** | 97.41 | 96.02 | 95.93 |

**Reference-based metric**

| Model | Precision | Recall | F1 |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold | 100 | 100 | 100 |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 48.49 | 40.27 | 42.99 |
| dynamic 10-shot GPT-4o | 49.16 | 44.72 | 46.05 |
| dynamic 10-shot GPT-4o-mini | 49.30 | 39.25 | 42.88 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 57.02 | 47.52 | 51.10 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 57.19 | 47.68 | 51.26 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 42.62 | 38.37 | 39.64 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 46.82 | 43.08 | 44.02 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)**      | 50.82     | 45.89  | 47.44 |