---
base_model:
- yanolja/EEVE-Korean-Instruct-10.8B-v1.0
datasets:
- Salesforce/rose
language:
- ko
license: apache-2.0
tags:
- korean
- Proposition
- Atomic_fact
---
# Overview
This model is designed for the **abstractive proposition segmentation task** in **Korean**, as described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). The model segments text into atomic and self-contained units (atomic facts).
# Training Details
- **Base Model**: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- **Fine-tuning Method**: LoRA
- **Dataset**: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
- **Translation**: The dataset was translated into Korean using GPT-4o.
  - GPT-4o was prompted to translate propositions using the vocabulary in the text.
- **Data Split**: The dataset was randomly split into training, validation, and test sets (1900:100:500) for fine-tuning.
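
A random 1900:100:500 split like the one described above could be drawn as in the sketch below. The seed, the helper name `split_dataset`, and the integer placeholders standing in for the translated RoSE examples are assumptions for illustration; the source does not state how the split was actually produced.

```python
import random


def split_dataset(examples, sizes=(1900, 100, 500), seed=42):
    """Shuffle examples and cut them into train/validation/test partitions."""
    if len(examples) != sum(sizes):
        raise ValueError(f"expected {sum(sizes)} examples, got {len(examples)}")
    shuffled = list(examples)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded RNG makes the split reproducible
    n_train, n_valid, _ = sizes
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Placeholder IDs stand in for the 2,500 translated examples (an assumption).
train, valid, test = split_dataset(range(2500))
print(len(train), len(valid), len(test))  # 1900 100 500
```

Fixing the seed keeps the three partitions stable across runs, which matters when the test set is reused to compare several models.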
# Usage
## Data Preprocessing
```
from konlpy.tag import Kkma

# Markers that wrap each sentence in the prompt and each group in the model output.
sent_start_token = "<sent>"
sent_end_token = "</sent>"
# The instruction is kept verbatim; it mentions <s> and </s>, while the passage
# itself is wrapped with sent_start_token and sent_end_token.
instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

kkma = Kkma()  # Korean morphological analyzer, used here for sentence splitting


def get_input(text, tokenizer):
    """Build the chat-formatted prompt for a raw Korean passage."""
    sentences = kkma.sentences(text)
    passage = sent_start_token + f"{sent_end_token}{sent_start_token}".join(sentences) + sent_end_token
    prompt = instruction + "Passage: " + passage + "\nPropositions:\n"
    messages = [{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True)
    return input_text


def get_output(text):
    """Parse the generated text into one list of propositions per sentence."""
    results = []
    group = []
    if text.startswith("Propositions:"):
        lines = text[len("Propositions:"):].strip().split("\n")
    else:
        lines = text.strip().split("\n")
    for line in lines:
        line = line.strip()
        if line == sent_start_token:
            continue
        elif line == sent_end_token:
            # End of one sentence: store its propositions and start a new group.
            results.append(group)
            group = []
        else:
            # Stop at the first line that is neither a marker nor a "-" bullet.
            if not line.startswith("-"):
                break
            group.append(line[1:].strip())
    return results
```
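To make the text format concrete, the sketch below builds the passage string from sentences that are already split (so it runs without KoNLPy) and parses a hand-written response. The response string is a made-up illustration of the layout that `get_output` expects, with one `<sent>` ... `</sent>` group per sentence and one `-` bullet per proposition; it is not real model output. `wrap_sentences` and `parse_groups` are hypothetical helper names, and unlike `get_output` the parser here skips unexpected lines instead of stopping at them.

```python
SENT_START, SENT_END = "<sent>", "</sent>"


def wrap_sentences(sentences):
    """Join pre-split sentences as <sent>s1</sent><sent>s2</sent>..."""
    return "".join(f"{SENT_START}{s}{SENT_END}" for s in sentences)


def parse_groups(response):
    """Collect '-' bullets into one list per <sent>...</sent> group."""
    groups, current = [], []
    for raw in response.strip().splitlines():
        line = raw.strip()
        if line == SENT_START:
            current = []                      # a new sentence group begins
        elif line == SENT_END:
            groups.append(current)            # the group is complete
        elif line.startswith("-"):
            current.append(line[1:].strip())  # drop the bullet marker
    return groups


passage = wrap_sentences(["A dog barked loudly.", "It ran away."])
print(passage)
# <sent>A dog barked loudly.</sent><sent>It ran away.</sent>

# Hand-written stand-in for a model response (assumption, not real output).
mock_response = """<sent>
- A dog barked.
- The dog barked loudly.
</sent>
<sent>
- It ran away.
</sent>"""
print(parse_groups(mock_response))
# [['A dog barked.', 'The dog barked loudly.'], ['It ran away.']]
```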
## Loading Model and Tokenizer
```
import peft
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LORA_PATH = "seonjeongh/Korean-Propositionalizer"
# The adapter config records which base model the LoRA weights were trained on.
lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(
    lora_config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto")
# Attach the LoRA adapter, then merge it into the base weights for plain inference.
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(lora_config.base_model_name_or_path)
```
## Inference Example
```
device = "cuda"
text = "옥스포드는 화요일 맨체스터 유나이티드와의 경기에서 3-2로 패한 경기에서 21세 이하 팀으로 득점했다. 그 골은 16세 선수의 1군 데뷔 주장을 강화할 것이다. 센터백은 이번 시즌 웨스트햄 1군과 함께 훈련했다. 웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
inputs = tokenizer([get_input(text, tokenizer)], return_tensors="pt").to(device)
output = model.generate(
    **inputs,
    max_new_tokens=512,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
results = get_output(response)
print(results)
```
<details>
<summary>Example output</summary>
```json
[
  [
    "옥스포드는 21세 이하 팀으로 득점했다.",
    "옥스포드는 맨체스터 유나이티드와의 경기에서 3-2로 패했다.",
    "옥스포드는 화요일 경기를 했다."
  ],
  [
    "그 골은 16세 선수의 주장을 강화할 것이다.",
    "그 골은 16세 선수의 1 군 데뷔 주장을 강화할 것이다."
  ],
  [
    "센터 백은 웨스트 햄 1 군과 함께 훈련했다.",
    "센터 백은 이번 시즌 웨스트 햄 1 군과 함께 훈련했다."
  ],
  [
    "웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
  ]
]
```
</details>
## Inputs and Outputs
- **Input**: A Korean text passage.
- **Output**: A list with one group per sentence of the passage; each group holds the propositions extracted from that sentence.
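
Because the output keeps one group per sentence, downstream code can either pair each group with its source sentence or flatten everything into a single list. The sketch below uses hand-written placeholder data rather than model output, and it assumes the model returned exactly one group per input sentence, which is worth checking in practice.

```python
from itertools import chain

# Placeholder data in the shape returned by get_output (not real model output).
sentences = ["Sentence one.", "Sentence two."]
results = [["Fact 1a.", "Fact 1b."], ["Fact 2a."]]

# Guard against truncated generations that drop trailing groups.
if len(results) != len(sentences):
    raise ValueError("number of proposition groups does not match number of sentences")

by_sentence = dict(zip(sentences, results))  # sentence -> its propositions
flat = list(chain.from_iterable(results))    # all propositions in passage order

print(by_sentence["Sentence one."])  # ['Fact 1a.', 'Fact 1b.']
print(flat)                          # ['Fact 1a.', 'Fact 1b.', 'Fact 2a.']
```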
## Evaluation Results
- **Metric**: Reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
- **Models**:
  - Dynamic 10-shot models: for each test example, the 10 most similar examples were selected from the training set using BM25 and used as in-context demonstrations.
  - Translate-test models: the [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) model combined with KO->EN and EN->KO translation by GPT-4o or GPT-4o-mini.
  - Translate-train models: small LLMs (sLLMs) fine-tuned with LoRA on the Korean RoSE dataset.
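
The BM25 retrieval step behind the dynamic 10-shot setup can be sketched with a small, dependency-free Okapi BM25 scorer. This is an illustration under stated assumptions rather than the evaluation code: the function name `bm25_top_k`, the parameters `k1=1.5` and `b=0.75`, the smoothed IDF variant, and whitespace tokenization of English toy sentences are all choices made here. For Korean text a morphological analyzer would likely be a better tokenizer, and the source does not say which fields were compared.

```python
import math
from collections import Counter


def bm25_top_k(query_tokens, corpus_tokens, k=10, k1=1.5, b=0.75):
    """Return indices of the k corpus documents with the highest BM25 score."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" keeps it positive for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    scores = [score(doc) for doc in corpus_tokens]
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy training pool (assumption); real use would index the training inputs.
train_texts = [
    "the team won the match on tuesday",
    "the company reported higher profits",
    "the striker scored in the match",
]
corpus = [t.split() for t in train_texts]
query = "who scored in the tuesday match".split()
print(bm25_top_k(query, corpus, k=2))  # [2, 0]
```

With `k=10` and the 1,900 training examples as the corpus, the returned indices would pick the demonstrations placed in the prompt for each test example.
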
**Reference-less metric**
| Model | Precision | Recall | F1 |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold | 97.46 | 96.28 | 95.88 |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 98.86 | 93.99 | 95.58 |
| dynamic 10-shot GPT-4o | 97.61 | 97.00 | 96.87 |
| dynamic 10-shot GPT-4o-mini | 98.51 | 97.12 | 97.17 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 97.38 | 96.93 | 96.52 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 97.24 | 96.26 | 95.73 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 94.66 | 92.81 | 92.08 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 93.80 | 93.29 | 92.80 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)** | 97.41 | 96.02 | 95.93 |
**Reference-based metric**
| Model | Precision | Recall | F1 |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold | 100 | 100 | 100 |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 48.49 | 40.27 | 42.99 |
| dynamic 10-shot GPT-4o | 49.16 | 44.72 | 46.05 |
| dynamic 10-shot GPT-4o-mini | 49.30 | 39.25 | 42.88 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 57.02 | 47.52 | 51.10 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 57.19 | 47.68 | 51.26 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 42.62 | 38.37 | 39.64 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 46.82 | 43.08 | 44.02 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)** | 50.82 | 45.89 | 47.44 |