tags:
- korean
- Proposition
- Atomic_fact
---
# Overview
This model is designed for the **abstractive proposition segmentation task** in Korean, as described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). The model segments text into atomic, self-contained units (atomic facts).
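
As a quick illustration of what "atomic and self-contained" means here, consider a hand-written English example (invented for this card, not taken from the paper or the dataset):

```python
sentence = "Marie Curie, born in Warsaw, won two Nobel Prizes."

# Each proposition states exactly one fact and can be understood on its own,
# without referring back to the original sentence.
propositions = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
]
print(propositions)
```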

# Training Details
- Base Model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- PEFT: LoRA
- Dataset: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
  - The dataset was split into training, validation, and test sets for fine-tuning.
  - The dataset was translated into Korean.
  - More details about the dataset can be found here.
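
The train/validation/test split above can be reproduced along these lines; the 80/10/10 ratio and the seed below are illustrative assumptions, not the values actually used:

```python
import random

def split_dataset(examples, seed=42, train_frac=0.8, val_frac=0.1):
    """Shuffle a list of examples and cut it into train/validation/test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # → 80 10 10
```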

# Usage
## Data Preprocessing
```python
from konlpy.tag import Kkma

sent_start_token = "<sent>"
sent_end_token = "</sent>"
# The instruction text mentions <s>/</s> while the markers actually inserted are
# <sent>/</sent>; the string is kept verbatim as used with the model.
instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

kkma = Kkma()

def get_input(text, tokenizer):
    # Split the passage into sentences and wrap each one in <sent>...</sent>.
    sentences = kkma.sentences(text)
    prompt = (
        instruction
        + "Passage: "
        + sent_start_token
        + f"{sent_end_token}{sent_start_token}".join(sentences)
        + sent_end_token
        + "\nPropositions:\n"
    )
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

def get_output(text):
    # Parse the generated text back into one list of propositions per sentence.
    results = []
    group = []
    for line in text.strip().split("\n"):
        stripped = line.strip()
        if stripped == sent_start_token:
            continue
        elif stripped == sent_end_token:
            results.append(group)
            group = []
        else:
            if not stripped.startswith("-"):
                break
            group.append(stripped[1:].strip())
    return results
```
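
The parser can be exercised without running the model. A self-contained sketch on a hand-written response (the English facts are invented for illustration):

```python
sent_start_token = "<sent>"
sent_end_token = "</sent>"

def get_output(text):
    # Same parsing logic as above: one list of propositions per sentence.
    results, group = [], []
    for line in text.strip().split("\n"):
        stripped = line.strip()
        if stripped == sent_start_token:
            continue
        elif stripped == sent_end_token:
            results.append(group)
            group = []
        else:
            if not stripped.startswith("-"):
                break
            group.append(stripped[1:].strip())
    return results

sample = (
    "<sent>\n"
    "- Oxford scored for the under-21 side.\n"
    "- Oxford scored against Manchester United.\n"
    "</sent>\n"
    "<sent>\n"
    "- The goal strengthens his claim.\n"
    "</sent>"
)
print(get_output(sample))
# → [['Oxford scored for the under-21 side.', 'Oxford scored against Manchester United.'], ['The goal strengthens his claim.']]
```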

## Loading Model and Tokenizer
```python
import torch
import peft
from transformers import AutoModelForCausalLM, AutoTokenizer

LORA_PATH = "seonjeongh/Korean-Propositionalizer"

lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(
    lora_config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Load the LoRA adapter and merge it into the base model.
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(LORA_PATH)
```

## Inference Example
```python
device = "cuda"

text = "μ˜₯μŠ€ν¬λ“œλŠ” ν™”μš”μΌ λ§¨μ²΄μŠ€ν„° μœ λ‚˜μ΄ν‹°λ“œμ™€μ˜ κ²½κΈ°μ—μ„œ 3-2둜 νŒ¨ν•œ κ²½κΈ°μ—μ„œ 21μ„Έ μ΄ν•˜ νŒ€μœΌλ‘œ λ“μ ν–ˆλ‹€. κ·Έ 골은 16μ„Έ μ„ μˆ˜μ˜ 1κ΅° 데뷔 μ£Όμž₯을 κ°•ν™”ν•  것이닀. 센터백은 이번 μ‹œμ¦Œ μ›¨μŠ€νŠΈν–„ 1κ΅°κ³Ό ν•¨κ»˜ ν›ˆλ ¨ν–ˆλ‹€. μ›¨μŠ€νŠΈν–„ μœ λ‚˜μ΄ν‹°λ“œμ˜ μ΅œμ‹  λ‰΄μŠ€λŠ” μ—¬κΈ°λ₯Ό ν΄λ¦­ν•˜μ„Έμš”."
inputs = tokenizer([get_input(text, tokenizer)], return_tensors="pt").to(device)
output = model.generate(
    **inputs,
    max_new_tokens=512,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
# Decode only the newly generated tokens; batch_decode returns one string per input.
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
results = get_output(response[0])
print(results)
```
94
+ <details>
95
+
96
+ <summary>Example output</summary>
97
+
98
+ ```json
99
+ [
100
+ [
101
+ "μ˜₯μŠ€ν¬λ“œλŠ” 21μ„Έ μ΄ν•˜ νŒ€μœΌλ‘œ λ“μ ν–ˆλ‹€.",
102
+ "μ˜₯μŠ€ν¬λ“œλŠ” λ§¨μ²΄μŠ€ν„° μœ λ‚˜μ΄ν‹°λ“œμ™€μ˜ κ²½κΈ°μ—μ„œ λ“μ ν–ˆλ‹€.",
103
+ "μ˜₯μŠ€ν¬λ“œλŠ” ν™”μš”μΌ λ§¨μ²΄μŠ€ν„° μœ λ‚˜μ΄ν‹°λ“œμ™€μ˜ κ²½κΈ°μ—μ„œ λ“μ ν–ˆλ‹€.",
104
+ "μ˜₯μŠ€ν¬λ“œλŠ” λ§¨μ²΄μŠ€ν„° μœ λ‚˜μ΄ν‹°λ“œμ™€μ˜ κ²½κΈ°μ—μ„œ 3-2둜 νŒ¨ν–ˆλ‹€."
105
+ ],
106
+ [
107
+ "κ·Έ 골은 μ˜₯μŠ€ν¬λ“œμ˜ μ£Όμž₯을 κ°•ν™”ν•  것이닀.",
108
+ "μ˜₯μŠ€ν¬λ“œλŠ” 16μ„Έ μ„ μˆ˜μ΄λ‹€.",
109
+ "μ˜₯μŠ€ν¬λ“œλŠ” 1κ΅° 데뷔λ₯Ό μ£Όμž₯ν•  것이닀."
110
+ ],
111
+ [
112
+ "μ˜₯μŠ€ν¬λ“œλŠ” 센터백이닀.",
113
+ "μ˜₯μŠ€ν¬λ“œλŠ” μ›¨μŠ€νŠΈν–„ 1κ΅°κ³Ό ν•¨κ»˜ ν›ˆλ ¨ν–ˆλ‹€.",
114
+ "μ˜₯μŠ€ν¬λ“œλŠ” 이번 μ‹œμ¦Œ μ›¨μŠ€νŠΈν–„ 1κ΅°κ³Ό ν•¨κ»˜ ν›ˆλ ¨ν–ˆλ‹€."
115
+ ],
116
+ [
117
+ "μ›¨μŠ€νŠΈν–„ μœ λ‚˜μ΄ν‹°λ“œμ˜ μ΅œμ‹  λ‰΄μŠ€λŠ” μ—¬κΈ°λ₯Ό ν΄λ¦­ν•˜μ„Έμš”."
118
+ ]
119
+ ]
120
+ ```
121
+ </details>

## Inputs and Outputs
- Input: A Korean text passage.
- Output: A list of propositions for every sentence in the passage; the propositions for each sentence are grouped separately.
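
When only a flat list of atomic facts is needed, the per-sentence groups can be collapsed with one comprehension; a minimal sketch with placeholder data (`results` would normally come from `get_output`):

```python
# Placeholder parsed output: one inner list per source sentence.
results = [
    ["fact A1", "fact A2"],
    ["fact B1"],
]

# Flatten the per-sentence groups into a single list of atomic facts.
all_facts = [prop for group in results for prop in group]
print(all_facts)  # → ['fact A1', 'fact A2', 'fact B1']
```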

## Evaluation Results
- Metrics: The reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
- Models:
  - Dynamic 10-shot models: For each test example, the 10 most similar examples were selected from the training set using BM25.
  - Translate-test models: The [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) model with EN->KO and KO->EN translation using GPT-4o or GPT-4o-mini.
  - Translate-train models: sLLMs fine-tuned with LoRA on the Korean RoSE dataset.

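The retrieval step of the dynamic 10-shot baselines can be sketched with a minimal standard-library BM25; the actual experiments may well have used an off-the-shelf implementation, and the tokenization and parameters here are illustrative:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    df = Counter()  # document frequency of each term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)  # term frequency within this document
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve_top_k(query_tokens, corpus_tokens, k=10):
    """Indices of the k highest-scoring documents for the query."""
    scores = bm25_scores(query_tokens, corpus_tokens)
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

# Toy corpus: document 0 shares both query terms, so it should rank first.
corpus = [
    ["seoul", "is", "the", "capital"],
    ["tokyo", "is", "the", "capital"],
    ["seoul", "tower", "is", "a", "landmark"],
]
print(retrieve_top_k(["seoul", "capital"], corpus, k=2))  # → [0, 1]
```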
**Reference-less metric**
| Model | Precision | Recall | F1 |
|--------------------------------------------|:---------:|:------:|:-----:|
| Gold | 97.46 | 96.28 | 95.88 |
| Dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 98.86 | 93.99 | 95.58 |
| Dynamic 10-shot (GPT-4o) | 97.61 | 97.00 | 96.87 |
| Dynamic 10-shot (GPT-4o-mini) | 98.51 | 97.12 | 97.17 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 97.38 | 96.93 | 96.52 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 97.24 | 96.26 | 95.73 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 94.66 | 92.81 | 92.08 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)** | 97.41 | 96.02 | 95.93 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | - | - | - |

**Reference-based metric**
| Model | Precision | Recall | F1 |
|--------------------------------------------|:---------:|:------:|:-----:|
| Gold | 100 | 100 | 100 |
| Dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 48.49 | 40.27 | 42.99 |
| Dynamic 10-shot (GPT-4o) | 49.16 | 44.72 | 46.05 |
| Dynamic 10-shot (GPT-4o-mini) | 49.30 | 39.25 | 42.88 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 57.02 | 47.52 | 51.10 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 57.19 | 47.68 | 51.26 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 42.62 | 38.37 | 39.64 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)** | 50.82 | 45.89 | 47.44 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | - | - | - |