---
base_model:
- yanolja/EEVE-Korean-Instruct-10.8B-v1.0
datasets:
- Salesforce/rose
language:
- ko
license: apache-2.0
tags:
- korean
- Proposition
- Atomic_fact
---

# Overview
This model is designed for the **abstractive proposition segmentation task** in **Korean**, as described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). The model segments text into atomic and self-contained units (atomic facts).

# Training Details
- **Base Model**: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- **Fine-tuning Method**: LoRA
- **Dataset**: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
  - **Translation**: The dataset was translated into Korean using GPT-4o.
    - GPT-4o was prompted to translate propositions using the vocabulary in the text.
  - **Data Split**: The dataset was randomly split into training, validation, and test sets (1900:100:500) for fine-tuning.
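
The random 1900:100:500 split described above could be reproduced along the following lines. This is a minimal sketch, not the original training script: the `split_dataset` helper, the fixed seed, and the placeholder records are assumptions made for illustration.

```python
import random

def split_dataset(examples, sizes=(1900, 100, 500), seed=42):
    """Shuffle the examples and cut them into train/validation/test subsets."""
    if sum(sizes) > len(examples):
        raise ValueError("not enough examples for the requested split sizes")
    shuffled = list(examples)
    # The seed is an assumption; the model card does not state one.
    random.Random(seed).shuffle(shuffled)
    n_train, n_val, n_test = sizes
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Placeholder records standing in for the translated RoSE examples.
examples = [{"id": i} for i in range(2500)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))  # 1900 100 500
```

Fixing the seed keeps the three subsets stable across runs, which matters when the test set is reused to compare several models.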

# Usage
## Data Preprocessing
```python
from konlpy.tag import Kkma

sent_start_token = "<sent>"
sent_end_token = "</sent>"
instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

kkma = Kkma()

def get_input(text, tokenizer):
  sentences = kkma.sentences(text)
  prompt = instruction + "Passage: " + sent_start_token + f"{sent_end_token}{sent_start_token}".join(sentences) + sent_end_token + "\nPropositions:\n"
  messages = [{"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": prompt}]
  input_text = tokenizer.apply_chat_template(
                      messages,
                      tokenize=False,
                      add_generation_prompt=True)
  return input_text

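def get_output(text):
  # Parse the raw generation into one list of propositions per sentence.
  results = []
  group = []

  # Drop a leading "Propositions:" header if the model echoes it.
  if text.startswith("Propositions:"):
      lines = text[len("Propositions:"):].strip().split("\n")
  else:
      lines = text.strip().split("\n")

  for line in lines:
    line = line.strip()
    if line == sent_start_token:
      # A new sentence group begins; nothing to record yet.
      continue
    elif line == sent_end_token:
      # The group is complete: store it and start a fresh one.
      results.append(group)
      group = []
    else:
      # Each proposition is expected on its own "- ..." line; stop at anything else.
      if not line.startswith("-"):
        break
      group.append(line[1:].strip())

  # Propositions of a sentence without a closing </sent> (e.g. truncated output) are discarded.
  return results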
```
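
To make the prompt layout concrete, the sketch below reproduces only the passage-wrapping step of `get_input`, using hard-coded sentences in place of the Kkma sentence splitter. The `build_passage` helper and the two sample sentences are illustrative assumptions; the instruction text and the chat template are omitted.

```python
SENT_START, SENT_END = "<sent>", "</sent>"

def build_passage(sentences):
    """Wrap every sentence in <sent>...</sent> and concatenate without separators."""
    return "".join(f"{SENT_START}{s}{SENT_END}" for s in sentences)

# Hypothetical sentences; in the model card they come from kkma.sentences(text).
sentences = ["첫 번째 문장이다.", "두 번째 문장이다."]
prompt_tail = "Passage: " + build_passage(sentences) + "\nPropositions:\n"
print(prompt_tail)
```

For a non-empty sentence list this yields the same string as the `join`-based expression in `get_input`.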

## Loading Model and Tokenizer
```python
import peft, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LORA_PATH = "seonjeongh/Korean-Propositionalizer"

lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(lora_config.base_model_name_or_path,
                                                  torch_dtype=torch.float16,
                                                  device_map="auto")
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(lora_config.base_model_name_or_path)
```

## Inference Example
```python
device = "cuda"

text = "옥스포드는 화요일 맨체스터 유나이티드와의 경기에서 3-2로 패한 경기에서 21세 이하 팀으로 득점했다. 그 골은 16세 선수의 1군 데뷔 주장을 강화할 것이다. 센터백은 이번 시즌 웨스트햄 1군과 함께 훈련했다. 웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
inputs = tokenizer([get_input(text, tokenizer)], return_tensors='pt').to(device)
output = model.generate(**inputs, max_new_tokens=512, pad_token_id = tokenizer.pad_token_id, eos_token_id = tokenizer.eos_token_id, use_cache=True)
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
results = get_output(response)
print(results)
```
<details>

<summary>Example output</summary>

```json
[
   [
    "옥스포드는 21세 이하 팀으로 득점했다.",
    "옥스포드는 맨체스터 유나이티드와의 경기에서 3-2로 패했다.",
    "옥스포드는 화요일 경기를 했다."
   ],
   [
    "그 골은 16세 선수의 주장을 강화할 것이다.",
    "그 골은 16세 선수의 1 군 데뷔 주장을 강화할 것이다."
   ],
   [
    "센터 백은 웨스트 햄 1 군과 함께 훈련했다.",
    "센터 백은 이번 시즌 웨스트 햄 1 군과 함께 훈련했다."
   ],
   [
    "웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
   ]
]
```
</details>

## Inputs and Outputs
- **Input**: A Korean text passage.
- **Output**: List of propositions for all the sentences in the text passage. The propositions for each sentence are grouped separately.
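
Because the propositions are grouped per sentence, downstream code often needs to know which sentence each proposition came from. The sketch below flattens the nested list into `(sentence_index, proposition)` pairs; the `flatten_propositions` helper is a hypothetical convenience, not part of the model's API.

```python
def flatten_propositions(results):
    """Turn [[prop, ...], ...] into [(sentence_index, prop), ...]."""
    return [(i, prop) for i, group in enumerate(results) for prop in group]

# A shortened version of the grouped output shown in the inference example.
results = [
    ["옥스포드는 21세 이하 팀으로 득점했다.", "옥스포드는 화요일 경기를 했다."],
    ["웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."],
]
pairs = flatten_propositions(results)
for sentence_index, proposition in pairs:
    print(sentence_index, proposition)
```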

## Evaluation Results
- **Metric**: Reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
- **Models**:
  - Dynamic 10-shot models: For each test example, the most similar 10 examples were selected from the training set using BM25.
  - Translate-test models: [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) model + EN->KO, KO->EN translation using GPT-4o or GPT-4o-mini.
  - Translate-train models: LoRA fine-tuned sLLMs using the Korean RoSE dataset.
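
The dynamic 10-shot selection can be illustrated with a small, dependency-free Okapi BM25 ranker. This is only a sketch of the general technique: the whitespace tokenization, the `k1` and `b` values, the `bm25_top_k` name, and the toy English corpus are all assumptions, and the actual evaluation may have used a different tokenizer or BM25 implementation.

```python
import math
from collections import Counter

def bm25_top_k(query, corpus, k=10, k1=1.5, b=0.75):
    """Return the indices of the k corpus texts with the highest BM25 score."""
    if not corpus:
        return []
    docs = [doc.split() for doc in corpus]  # whitespace tokens (assumption)
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(query.split())
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy training passages; the real pool would be the 1900 training examples.
train_texts = [
    "the cat sat on the mat",
    "stocks fell sharply on tuesday",
    "the cat chased a mouse",
]
print(bm25_top_k("a cat on a mat", train_texts, k=2))  # [0, 2]
```

The returned indices would then be used to pull the matching training examples into the few-shot prompt for that test passage.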

**Reference-less metric**
| Model                                                               | Precision | Recall |   F1  |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold                                                                |   97.46   |  96.28 | 95.88 |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct)                         |   98.86   |  93.99 | 95.58 |
| dynamic 10-shot GPT-4o                                              |   97.61   |  97.00 | 96.87 |
| dynamic 10-shot GPT-4o-mini                                         |   98.51   |  97.12 | 97.17 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation)        |   97.38   |  96.93 | 96.52 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation)   |   97.24   |  96.26 | 95.73 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct)                          |   94.66   |  92.81 | 92.08 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct)              |   93.80   |  93.29 | 92.80 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)**       |   97.41   |  96.02 | 95.93 |

**Reference-based metric**
| Model                                                               | Precision | Recall |   F1  |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold                                                                |   100     |  100   | 100   |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct)                         |   48.49   |  40.27 | 42.99 |
| dynamic 10-shot GPT-4o                                              |   49.16   |  44.72 | 46.05 |
| dynamic 10-shot GPT-4o-mini                                         |   49.30   |  39.25 | 42.88 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation)        |   57.02   |  47.52 | 51.10 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation)   |   57.19   |  47.68 | 51.26 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct)                          |   42.62   |  38.37 | 39.64 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct)              |   46.82   |  43.08 | 44.02 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)**       |   50.82   |  45.89 | 47.44 |