tags:
- korean
- Proposition
- Atomic_fact
---
# Overview
This model is designed for the **abstractive proposition segmentation task** in Korean, as described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). The model segments text into atomic, self-contained units (atomic facts).
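
As a quick illustration of what "atomic and self-contained" means here, consider a hand-written English example (invented for this card, not taken from the paper or the dataset):

```python
sentence = "Marie Curie, born in Warsaw, won two Nobel Prizes."

# Each proposition states exactly one fact and can be understood on its own,
# without referring back to the original sentence.
propositions = [
    "Marie Curie was born in Warsaw.",
    "Marie Curie won two Nobel Prizes.",
]
print(propositions)
```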

# Training Details
- Base Model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- PEFT: LoRA
- Dataset: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
  - The dataset was split into training, validation, and test sets for fine-tuning.
  - The dataset was translated into Korean.
  - More details about the dataset can be found here.
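
The train/validation/test split above can be reproduced along these lines; the 80/10/10 ratio and the seed below are illustrative assumptions, not the values actually used:

```python
import random

def split_dataset(examples, seed=42, train_frac=0.8, val_frac=0.1):
    """Shuffle a list of examples and cut it into train/validation/test."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # → 80 10 10
```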

# Usage
## Data Preprocessing
```python
from konlpy.tag import Kkma

sent_start_token = "<sent>"
sent_end_token = "</sent>"
# The instruction text mentions <s>/</s> while the markers actually inserted are
# <sent>/</sent>; the string is kept verbatim as used with the model.
instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

kkma = Kkma()

def get_input(text, tokenizer):
    # Split the passage into sentences and wrap each one in <sent>...</sent>.
    sentences = kkma.sentences(text)
    prompt = (
        instruction
        + "Passage: "
        + sent_start_token
        + f"{sent_end_token}{sent_start_token}".join(sentences)
        + sent_end_token
        + "\nPropositions:\n"
    )
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

def get_output(text):
    # Parse the generated text back into one list of propositions per sentence.
    results = []
    group = []
    for line in text.strip().split("\n"):
        stripped = line.strip()
        if stripped == sent_start_token:
            continue
        elif stripped == sent_end_token:
            results.append(group)
            group = []
        else:
            if not stripped.startswith("-"):
                break
            group.append(stripped[1:].strip())
    return results
```
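
The parser can be exercised without running the model. A self-contained sketch on a hand-written response (the English facts are invented for illustration):

```python
sent_start_token = "<sent>"
sent_end_token = "</sent>"

def get_output(text):
    # Same parsing logic as above: one list of propositions per sentence.
    results, group = [], []
    for line in text.strip().split("\n"):
        stripped = line.strip()
        if stripped == sent_start_token:
            continue
        elif stripped == sent_end_token:
            results.append(group)
            group = []
        else:
            if not stripped.startswith("-"):
                break
            group.append(stripped[1:].strip())
    return results

sample = (
    "<sent>\n"
    "- Oxford scored for the under-21 side.\n"
    "- Oxford scored against Manchester United.\n"
    "</sent>\n"
    "<sent>\n"
    "- The goal strengthens his claim.\n"
    "</sent>"
)
print(get_output(sample))
# → [['Oxford scored for the under-21 side.', 'Oxford scored against Manchester United.'], ['The goal strengthens his claim.']]
```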

## Loading Model and Tokenizer
```python
import torch
import peft
from transformers import AutoModelForCausalLM, AutoTokenizer

LORA_PATH = "seonjeongh/Korean-Propositionalizer"

lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(
    lora_config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
# Load the LoRA adapter and merge it into the base model.
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(LORA_PATH)
```

## Inference Example
```python
device = "cuda"

text = "μ˜₯μŠ€ν¬λ“œλŠ” ν™”μš”μΌ λ§¨μ²΄μŠ€ν„° μœ λ‚˜μ΄ν‹°λ“œμ™€μ˜ κ²½κΈ°μ—μ„œ 3-2둜 νŒ¨ν•œ κ²½κΈ°μ—μ„œ 21μ„Έ μ΄ν•˜ νŒ€μœΌλ‘œ λ“μ ν–ˆλ‹€. κ·Έ 골은 16μ„Έ μ„ μˆ˜μ˜ 1κ΅° 데뷔 μ£Όμž₯을 κ°•ν™”ν•  것이닀. 센터백은 이번 μ‹œμ¦Œ μ›¨μŠ€νŠΈν–„ 1κ΅°κ³Ό ν•¨κ»˜ ν›ˆλ ¨ν–ˆλ‹€. μ›¨μŠ€νŠΈν–„ μœ λ‚˜μ΄ν‹°λ“œμ˜ μ΅œμ‹  λ‰΄μŠ€λŠ” μ—¬κΈ°λ₯Ό ν΄λ¦­ν•˜μ„Έμš”."
inputs = tokenizer([get_input(text, tokenizer)], return_tensors="pt").to(device)
output = model.generate(
    **inputs,
    max_new_tokens=512,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
)
# Decode only the newly generated tokens; batch_decode returns one string per input.
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
results = get_output(response[0])
print(results)
```
94
+ <details>
95
+
96
+ <summary>Example output</summary>
97
+
98
+ ```json
99
+ [
100
+ [
101
+ "μ˜₯μŠ€ν¬λ“œλŠ” 21μ„Έ μ΄ν•˜ νŒ€μœΌλ‘œ λ“μ ν–ˆλ‹€.",
102
+ "μ˜₯μŠ€ν¬λ“œλŠ” λ§¨μ²΄μŠ€ν„° μœ λ‚˜μ΄ν‹°λ“œμ™€μ˜ κ²½κΈ°μ—μ„œ λ“μ ν–ˆλ‹€.",
103
+ "μ˜₯μŠ€ν¬λ“œλŠ” ν™”μš”μΌ λ§¨μ²΄μŠ€ν„° μœ λ‚˜μ΄ν‹°λ“œμ™€μ˜ κ²½κΈ°μ—μ„œ λ“μ ν–ˆλ‹€.",
104
+ "μ˜₯μŠ€ν¬λ“œλŠ” λ§¨μ²΄μŠ€ν„° μœ λ‚˜μ΄ν‹°λ“œμ™€μ˜ κ²½κΈ°μ—μ„œ 3-2둜 νŒ¨ν–ˆλ‹€."
105
+ ],
106
+ [
107
+ "κ·Έ 골은 μ˜₯μŠ€ν¬λ“œμ˜ μ£Όμž₯을 κ°•ν™”ν•  것이닀.",
108
+ "μ˜₯μŠ€ν¬λ“œλŠ” 16μ„Έ μ„ μˆ˜μ΄λ‹€.",
109
+ "μ˜₯μŠ€ν¬λ“œλŠ” 1κ΅° 데뷔λ₯Ό μ£Όμž₯ν•  것이닀."
110
+ ],
111
+ [
112
+ "μ˜₯μŠ€ν¬λ“œλŠ” 센터백이닀.",
113
+ "μ˜₯μŠ€ν¬λ“œλŠ” μ›¨μŠ€νŠΈν–„ 1κ΅°κ³Ό ν•¨κ»˜ ν›ˆλ ¨ν–ˆλ‹€.",
114
+ "μ˜₯μŠ€ν¬λ“œλŠ” 이번 μ‹œμ¦Œ μ›¨μŠ€νŠΈν–„ 1κ΅°κ³Ό ν•¨κ»˜ ν›ˆλ ¨ν–ˆλ‹€."
115
+ ],
116
+ [
117
+ "μ›¨μŠ€νŠΈν–„ μœ λ‚˜μ΄ν‹°λ“œμ˜ μ΅œμ‹  λ‰΄μŠ€λŠ” μ—¬κΈ°λ₯Ό ν΄λ¦­ν•˜μ„Έμš”."
118
+ ]
119
+ ]
120
+ ```
121
+ </details>

## Inputs and Outputs
- Input: A Korean text passage.
- Output: A list of propositions for every sentence in the passage; the propositions for each sentence are grouped separately.
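
When only a flat list of atomic facts is needed, the per-sentence groups can be collapsed with one comprehension; a minimal sketch with placeholder data (`results` would normally come from `get_output`):

```python
# Placeholder parsed output: one inner list per source sentence.
results = [
    ["fact A1", "fact A2"],
    ["fact B1"],
]

# Flatten the per-sentence groups into a single list of atomic facts.
all_facts = [prop for group in results for prop in group]
print(all_facts)  # → ['fact A1', 'fact A2', 'fact B1']
```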

## Evaluation Results
- Metrics: The reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
- Models:
  - Dynamic 10-shot models: For each test example, the 10 most similar examples were selected from the training set using BM25.
  - Translate-test models: The [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) model with EN->KO and KO->EN translation using GPT-4o or GPT-4o-mini.
  - Translate-train models: sLLMs fine-tuned with LoRA on the Korean RoSE dataset.

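The retrieval step of the dynamic 10-shot baselines can be sketched with a minimal standard-library BM25; the actual experiments may well have used an off-the-shelf implementation, and the tokenization and parameters here are illustrative:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    df = Counter()  # document frequency of each term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)  # term frequency within this document
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve_top_k(query_tokens, corpus_tokens, k=10):
    """Indices of the k highest-scoring documents for the query."""
    scores = bm25_scores(query_tokens, corpus_tokens)
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

# Toy corpus: document 0 shares both query terms, so it should rank first.
corpus = [
    ["seoul", "is", "the", "capital"],
    ["tokyo", "is", "the", "capital"],
    ["seoul", "tower", "is", "a", "landmark"],
]
print(retrieve_top_k(["seoul", "capital"], corpus, k=2))  # → [0, 1]
```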
**Reference-less metric**
| Model | Precision | Recall | F1 |
|--------------------------------------------|:---------:|:------:|:-----:|
| Gold | 97.46 | 96.28 | 95.88 |
| Dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 98.86 | 93.99 | 95.58 |
| Dynamic 10-shot (GPT-4o) | 97.61 | 97.00 | 96.87 |
| Dynamic 10-shot (GPT-4o-mini) | 98.51 | 97.12 | 97.17 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 97.38 | 96.93 | 96.52 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 97.24 | 96.26 | 95.73 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 94.66 | 92.81 | 92.08 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)** | 97.41 | 96.02 | 95.93 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | - | - | - |

**Reference-based metric**
| Model | Precision | Recall | F1 |
|--------------------------------------------|:---------:|:------:|:-----:|
| Gold | 100 | 100 | 100 |
| Dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 48.49 | 40.27 | 42.99 |
| Dynamic 10-shot (GPT-4o) | 49.16 | 44.72 | 46.05 |
| Dynamic 10-shot (GPT-4o-mini) | 49.30 | 39.25 | 42.88 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 57.02 | 47.52 | 51.10 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 57.19 | 47.68 | 51.26 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 42.62 | 38.37 | 39.64 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)** | 50.82 | 45.89 | 47.44 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | - | - | - |