---
base_model:
- yanolja/EEVE-Korean-Instruct-10.8B-v1.0
datasets:
- Salesforce/rose
language:
- ko
license: apache-2.0
tags:
- korean
- Proposition
- Atomic_fact
---

# Overview

This model is designed for the **abstractive proposition segmentation task** in **Korean**, as described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). The model segments text into atomic and self-contained units (atomic facts).

# Training Details

- **Base Model**: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- **Fine-tuning Method**: LoRA
- **Dataset**: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
- **Translation**: The dataset was translated into Korean using GPT-4o.
  - GPT-4o was prompted to translate each proposition reusing the vocabulary of the corresponding text.
- **Data Split**: The dataset was randomly split into training, validation, and test sets (1900:100:500) for fine-tuning.

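
A random split like the one described above can be made reproducible with a seeded shuffle. The following is a minimal sketch, not the author's script: it assumes that 1900:100:500 denotes example counts, and the function name `split_dataset`, the seed, and the placeholder records are illustrative assumptions.

```python
import random

def split_dataset(examples, sizes=(1900, 100, 500), seed=42):
    """Randomly split `examples` into train/validation/test lists.

    `seed=42` is an arbitrary, assumed value; the source does not state one.
    """
    if len(examples) != sum(sizes):
        raise ValueError(f"expected {sum(sizes)} examples, got {len(examples)}")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_train, n_val, _ = sizes
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Placeholder records standing in for the translated RoSE examples.
examples = [{"id": i} for i in range(2500)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))  # 1900 100 500
```
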
# Usage

## Data Preprocessing

```python
from konlpy.tag import Kkma

sent_start_token = "<sent>"
sent_end_token = "</sent>"
# Kept verbatim. Note that the wording mentions <s> and </s>, whereas the
# passage is actually delimited with the <sent> and </sent> tokens above.
instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

kkma = Kkma()

def get_input(text, tokenizer):
    # Split the passage into sentences and wrap each one in <sent>...</sent>.
    sentences = kkma.sentences(text)
    prompt = instruction + "Passage: " + sent_start_token + f"{sent_end_token}{sent_start_token}".join(sentences) + sent_end_token + "\nPropositions:\n"
    messages = [{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True)
    return input_text

def get_output(text):
    results = []
    group = []

    # Drop an optional leading "Propositions:" header before splitting into lines.
    if text.startswith("Propositions:"):
        lines = text[len("Propositions:"):].strip().split("\n")
    else:
        lines = text.strip().split("\n")

    for line in lines:
        if line.strip() == sent_start_token:
            continue
        elif line.strip() == sent_end_token:
            # A closing marker ends the group for the current sentence.
            results.append(group)
            group = []
        else:
            # Stop at the first line that is not a "- proposition" bullet.
            if not line.strip().startswith("-"):
                break
            # Strip whitespace first so an indented bullet still loses its dash.
            line = line.strip()[1:].strip()
            group.append(line)

    return results
```

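
To make the text format concrete, the sketch below rebuilds the passage markup for sentences that are already split (so Kkma is not needed) and parses a hand-written response in the layout that `get_output` expects. The helper names `mark_sentences` and `parse_groups` and the sample response are illustrative assumptions, not actual model output.

```python
SENT_START, SENT_END = "<sent>", "</sent>"

def mark_sentences(sentences):
    """Wrap each pre-split sentence in <sent>...</sent>, as get_input does."""
    return "".join(f"{SENT_START}{s}{SENT_END}" for s in sentences)

def parse_groups(response):
    """Group '- proposition' lines by their enclosing <sent>...</sent> block."""
    groups, current = [], []
    for raw in response.strip().split("\n"):
        line = raw.strip()
        if line == SENT_START:
            continue
        if line == SENT_END:
            groups.append(current)
            current = []
        elif line.startswith("-"):
            current.append(line[1:].strip())
        else:
            break  # stop at the first line that does not fit the format
    return groups

passage = mark_sentences(["A는 B다.", "C는 D다."])
# Hand-written response in the assumed layout: one block per sentence.
response = "<sent>\n- A는 B다.\n</sent>\n<sent>\n- C는 D다.\n</sent>"
print(passage)
print(parse_groups(response))
```

Each marker sits on its own line in the response, and every proposition is a `-` bullet between them, which is why the parser can work line by line.
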
## Loading Model and Tokenizer

```python
import peft
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LORA_PATH = "seonjeongh/Korean-Propositionalizer"

# Read the adapter config to find the base model the adapter was trained on.
lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(lora_config.base_model_name_or_path,
                                                  torch_dtype=torch.float16,
                                                  device_map="auto")
# Attach the LoRA adapter, then merge its weights into the base model.
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(lora_config.base_model_name_or_path)
```

## Inference Example

```python
device = "cuda"

text = "옥스포드는 화요일 맨체스터 유나이티드와의 경기에서 3-2로 패한 경기에서 21세 이하 팀으로 득점했다. 그 골은 16세 선수의 1군 데뷔 주장을 강화할 것이다. 센터백은 이번 시즌 웨스트햄 1군과 함께 훈련했다. 웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
inputs = tokenizer([get_input(text, tokenizer)], return_tensors='pt').to(device)
output = model.generate(**inputs,
                        max_new_tokens=512,
                        pad_token_id=tokenizer.pad_token_id,
                        eos_token_id=tokenizer.eos_token_id,
                        use_cache=True)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
results = get_output(response)
print(results)
```

<details>

<summary>Example output</summary>

```json
[
  [
    "옥스포드는 21세 이하 팀으로 득점했다.",
    "옥스포드는 맨체스터 유나이티드와의 경기에서 3-2로 패했다.",
    "옥스포드는 화요일 경기를 했다."
  ],
  [
    "그 골은 16세 선수의 주장을 강화할 것이다.",
    "그 골은 16세 선수의 1 군 데뷔 주장을 강화할 것이다."
  ],
  [
    "센터 백은 웨스트 햄 1 군과 함께 훈련했다.",
    "센터 백은 이번 시즌 웨스트 햄 1 군과 함께 훈련했다."
  ],
  [
    "웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
  ]
]
```

</details>

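
Downstream code often wants one flat list rather than per-sentence groups. A minimal sketch, assuming `results` has the nested-list shape of the example output; `flatten_propositions` is a hypothetical helper, not part of this model card's code.

```python
def flatten_propositions(results):
    """Turn [[prop, ...], ...] into [(sentence_index, prop), ...]."""
    return [(i, prop) for i, group in enumerate(results) for prop in group]

# Toy stand-in for the nested list returned by get_output.
results = [["p1", "p2"], [], ["p3"]]
flat = flatten_propositions(results)
print(flat)  # [(0, 'p1'), (0, 'p2'), (2, 'p3')]
```

Keeping the sentence index makes it possible to trace each proposition back to the sentence it came from.
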
## Inputs and Outputs

- **Input**: A Korean text passage.
- **Output**: A list of propositions for every sentence in the passage, with the propositions for each sentence grouped separately.

## Evaluation Results

- **Metric**: Reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
- **Models**:
  - Dynamic 10-shot models: For each test example, the 10 most similar examples were selected from the training set using BM25.
  - Translate-test models: The [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) model combined with KO->EN and EN->KO translation by GPT-4o or GPT-4o-mini.
  - Translate-train models: sLLMs fine-tuned with LoRA on the Korean RoSE dataset.

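
The dynamic 10-shot selection described above can be sketched with a small Okapi BM25 scorer. This is an illustration under stated assumptions, not the evaluation code: the function name `bm25_top_k`, whitespace tokenization, and the `k1`/`b` values are assumed (Korean text would likely need a morphological tokenizer), and the source does not say which text field served as the query.

```python
import math
from collections import Counter

def bm25_top_k(query, corpus, k=10, k1=1.5, b=0.75):
    """Return indices of the k corpus documents most similar to `query`.

    Okapi BM25 over whitespace tokens; k1 and b are common defaults, assumed here.
    """
    docs = [doc.split() for doc in corpus]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(query.split())
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    # Highest score first; ties keep their original corpus order.
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy English corpus standing in for the training passages.
train_texts = ["the cat sat on the mat", "dogs chase cats", "stock prices fell sharply"]
print(bm25_top_k("stock prices rose", train_texts, k=1))  # [2]
```

The returned indices would then pick the training examples to place in the few-shot prompt for that test example.
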
**Reference-less metric**

| Model                                                               | Precision | Recall |  F1   |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold                                                                |   97.46   | 96.28  | 95.88 |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct)                         |   98.86   | 93.99  | 95.58 |
| dynamic 10-shot GPT-4o                                              |   97.61   | 97.00  | 96.87 |
| dynamic 10-shot GPT-4o-mini                                         |   98.51   | 97.12  | 97.17 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation)        |   97.38   | 96.93  | 96.52 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation)   |   97.24   | 96.26  | 95.73 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct)                          |   94.66   | 92.81  | 92.08 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct)              |   93.80   | 93.29  | 92.80 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)**       |   97.41   | 96.02  | 95.93 |

**Reference-based metric**

| Model                                                               | Precision | Recall |  F1   |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold                                                                |    100    |  100   |  100  |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct)                         |   48.49   | 40.27  | 42.99 |
| dynamic 10-shot GPT-4o                                              |   49.16   | 44.72  | 46.05 |
| dynamic 10-shot GPT-4o-mini                                         |   49.30   | 39.25  | 42.88 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation)        |   57.02   | 47.52  | 51.10 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation)   |   57.19   | 47.68  | 51.26 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct)                          |   42.62   | 38.37  | 39.64 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct)              |   46.82   | 43.08  | 44.02 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)**       |   50.82   | 45.89  | 47.44 |
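
In several rows of the reference-less table, F1 is lower than both precision and recall (for example, Gold: 97.46, 96.28, 95.88), which cannot happen when F1 is the harmonic mean of the two reported numbers. One plausible explanation, which the source does not confirm, is that each metric is computed per example and then averaged. The toy numbers below are purely illustrative and are not taken from the evaluation:

```python
def f1(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy per-example (precision, recall) pairs; not taken from the evaluation.
per_example = [(1.0, 0.5), (0.5, 1.0)]

mean_p = sum(p for p, _ in per_example) / len(per_example)
mean_r = sum(r for _, r in per_example) / len(per_example)
macro_f1 = sum(f1(p, r) for p, r in per_example) / len(per_example)

print(round(mean_p, 3), round(mean_r, 3), round(macro_f1, 3))  # 0.75 0.75 0.667
```
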