---
base_model:
- yanolja/EEVE-Korean-Instruct-10.8B-v1.0
datasets:
- Salesforce/rose
language:
- ko
license: apache-2.0
tags:
- korean
- Proposition
- Atomic_fact
---
# Overview
This model is designed for the **abstractive proposition segmentation task** in **Korean**, as described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). The model segments text into atomic and self-contained units (atomic facts).
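
As a rough illustration of the task (an invented English example, not drawn from the training data), a sentence such as "John, who joined the club in 2015, scored the winning goal on Saturday." would be segmented into propositions like:
```
- John joined the club in 2015.
- John scored the winning goal on Saturday.
```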
# Training Details
- **Base Model**: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- **Fine-tuning Method**: LoRA (a configuration sketch follows this list)
- **Dataset**: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
- **Translation**: The dataset was translated into Korean using GPT-4o.
  - GPT-4o was prompted to translate the propositions using the vocabulary of the corresponding passage, keeping the propositions lexically consistent with the text.
- **Data Split**: The dataset was randomly split into training, validation, and test sets (1900:100:500) for fine-tuning.
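
The exact LoRA hyperparameters are not listed in this card. As a minimal, hedged sketch of how such a LoRA fine-tune can be configured with `peft` (the rank, alpha, dropout, and target modules below are illustrative assumptions, not the values used for this model):
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

BASE = "yanolja/EEVE-Korean-Instruct-10.8B-v1.0"

# Illustrative LoRA settings; the rank/alpha/target modules actually used
# for this model are not documented in this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16, device_map="auto")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Training then proceeds as standard causal-LM fine-tuning on the
# prompt/target pairs built from the translated RoSE examples.
```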
# Usage
## Data Preprocessing
```python
from konlpy.tag import Kkma

sent_start_token = "<sent>"
sent_end_token = "</sent>"
# Note: the instruction wording refers to <s>/</s>, while the markers actually
# inserted around sentences are the <sent>/</sent> tokens defined above.
instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

kkma = Kkma()  # Korean sentence splitter

def get_input(text, tokenizer):
    # Split the passage into sentences, wrap each in <sent>...</sent>,
    # and render the result as a chat-formatted prompt.
    sentences = kkma.sentences(text)
    prompt = (instruction + "Passage: " + sent_start_token
              + f"{sent_end_token}{sent_start_token}".join(sentences)
              + sent_end_token + "\nPropositions:\n")
    messages = [{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True)
    return input_text

def get_output(text):
    # Parse the model response into one list of propositions per sentence.
    results = []
    group = []
    if text.startswith("Propositions:"):
        lines = text[len("Propositions:"):].strip().split("\n")
    else:
        lines = text.strip().split("\n")
    for line in lines:
        if line.strip() == sent_start_token:
            continue
        elif line.strip() == sent_end_token:
            # End of the current sentence's propositions
            results.append(group)
            group = []
        else:
            # Propositions are expected as "- ..." bullet lines; stop at anything else
            if not line.strip().startswith("-"):
                break
            line = line.strip()[1:].strip()
            group.append(line)
    return results
```
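For clarity, here is what `get_output` does with a hypothetical, hand-written response string in the format the parser expects (this is not actual model output):
```python
# Hypothetical response: each sentence's propositions appear between
# <sent> and </sent> markers as "- ..." bullet lines.
sample_response = (
    "Propositions:\n"
    "<sent>\n"
    "- The match was played on Tuesday.\n"
    "- The team lost 3-2.\n"
    "</sent>\n"
    "<sent>\n"
    "- The goal strengthened the player's case for a debut.\n"
    "</sent>"
)

print(get_output(sample_response))
# [['The match was played on Tuesday.', 'The team lost 3-2.'],
#  ["The goal strengthened the player's case for a debut."]]
```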
## Loading Model and Tokenizer
```python
import peft, torch
from transformers import AutoModelForCausalLM, AutoTokenizer
LORA_PATH = "seonjeongh/Korean-Propositionalizer"
lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(lora_config.base_model_name_or_path,
torch_dtype=torch.float16,
device_map="auto")
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(lora_config.base_model_name_or_path)
```
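Merging the adapter happens at load time; if you want to skip that step on later runs, you can optionally save the merged weights once (the directory name below is just an example) and reload them as a plain model:
```python
# Optional: persist the merged model so later runs can skip the adapter merge.
merged_dir = "korean-propositionalizer-merged"  # example path
model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)

# Later:
# model = AutoModelForCausalLM.from_pretrained(merged_dir, torch_dtype=torch.float16, device_map="auto")
# tokenizer = AutoTokenizer.from_pretrained(merged_dir)
```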
## Inference Example
```python
device = "cuda"
text = "์ฅ์คํฌ๋๋ ํ์์ผ ๋งจ์ฒด์คํฐ ์ ๋์ดํฐ๋์์ ๊ฒฝ๊ธฐ์์ 3-2๋ก ํจํ ๊ฒฝ๊ธฐ์์ 21์ธ ์ดํ ํ์ผ๋ก ๋์ ํ๋ค. ๊ทธ ๊ณจ์ 16์ธ ์ ์์ 1๊ตฐ ๋ฐ๋ท ์ฃผ์ฅ์ ๊ฐํํ ๊ฒ์ด๋ค. ์ผํฐ๋ฐฑ์ ์ด๋ฒ ์์ฆ ์จ์คํธํ 1๊ตฐ๊ณผ ํจ๊ป ํ๋ จํ๋ค. ์จ์คํธํ ์ ๋์ดํฐ๋์ ์ต์ ๋ด์ค๋ ์ฌ๊ธฐ๋ฅผ ํด๋ฆญํ์ธ์."
inputs = tokenizer([get_input(text, tokenizer)], return_tensors='pt').to(device)
output = model.generate(**inputs,
                        max_new_tokens=512,
                        pad_token_id=tokenizer.pad_token_id,
                        eos_token_id=tokenizer.eos_token_id,
                        use_cache=True)
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
results = get_output(response)
print(results)
```
<details>
<summary>Example output</summary>

```json
[
  [
    "์ฅ์คํฌ๋๋ 21์ธ ์ดํ ํ์ผ๋ก ๋์ ํ๋ค.",
    "์ฅ์คํฌ๋๋ ๋งจ์ฒด์คํฐ ์ ๋์ดํฐ๋์์ ๊ฒฝ๊ธฐ์์ 3-2๋ก ํจํ๋ค.",
    "์ฅ์คํฌ๋๋ ํ์์ผ ๊ฒฝ๊ธฐ๋ฅผ ํ๋ค."
  ],
  [
    "๊ทธ ๊ณจ์ 16์ธ ์ ์์ ์ฃผ์ฅ์ ๊ฐํํ ๊ฒ์ด๋ค.",
    "๊ทธ ๊ณจ์ 16์ธ ์ ์์ 1 ๊ตฐ ๋ฐ๋ท ์ฃผ์ฅ์ ๊ฐํํ ๊ฒ์ด๋ค."
  ],
  [
    "์ผํฐ ๋ฐฑ์ ์จ์คํธ ํ 1 ๊ตฐ๊ณผ ํจ๊ป ํ๋ จํ๋ค.",
    "์ผํฐ ๋ฐฑ์ ์ด๋ฒ ์์ฆ ์จ์คํธ ํ 1 ๊ตฐ๊ณผ ํจ๊ป ํ๋ จํ๋ค."
  ],
  [
    "์จ์คํธํ ์ ๋์ดํฐ๋์ ์ต์ ๋ด์ค๋ ์ฌ๊ธฐ๋ฅผ ํด๋ฆญํ์ธ์."
  ]
]
```

</details>
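
To process several passages, one simple approach is to loop over them, re-using the helpers above (`texts` below is a hypothetical list of Korean passages; this is a sketch, not a tuned batching setup):
```python
texts = [text]  # hypothetical list of Korean passages to propositionalize
all_results = []
for t in texts:
    batch = tokenizer([get_input(t, tokenizer)], return_tensors="pt").to(device)
    out = model.generate(**batch,
                         max_new_tokens=512,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.eos_token_id,
                         use_cache=True)
    decoded = tokenizer.batch_decode(out[:, batch["input_ids"].shape[1]:],
                                     skip_special_tokens=True)[0]
    all_results.append(get_output(decoded))
```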
## Inputs and Outputs
- **Input**: A Korean text passage.
- **Output**: A list of proposition lists, one group per sentence in the passage; the propositions for each sentence are grouped separately (see the short sketch below).
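
A minimal sketch for iterating over the grouped output, re-using `kkma`, `text`, and `results` from the inference example (it assumes the model produced exactly one group per sentence):
```python
sentences = kkma.sentences(text)
# zip silently truncates if the model returned fewer/more groups than sentences.
for sentence, propositions in zip(sentences, results):
    print(sentence)
    for proposition in propositions:
        print("  -", proposition)
```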
## Evaluation Results
- **Metric**: Reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
- **Models**:
  - Dynamic 10-shot models: for each test example, the 10 most similar examples were retrieved from the training set with BM25 and used as in-context demonstrations (a retrieval sketch follows this list).
  - Translate-test models: the [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) model, with KO→EN and EN→KO translation performed by GPT-4o or GPT-4o-mini.
  - Translate-train models: small LLMs (sLLMs) LoRA-fine-tuned on the Korean-translated RoSE dataset.
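
For reference, a minimal sketch of the kind of BM25 retrieval used to pick the 10 in-context examples, using the `rank_bm25` package and plain whitespace tokenization as assumptions (the actual retrieval setup may differ; the passages below are toy placeholders):
```python
from rank_bm25 import BM25Okapi

# Toy stand-ins for passages from the training split.
train_passages = [
    "example training passage one",
    "example training passage two",
]
bm25 = BM25Okapi([p.split() for p in train_passages])

test_passage = "example test passage"
top_examples = bm25.get_top_n(test_passage.split(), train_passages, n=10)
# top_examples would then be formatted as the 10 in-context demonstrations in the prompt.
```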

**Reference-less metric**

| Model | Precision | Recall | F1 |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold | 97.46 | 96.28 | 95.88 |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 98.86 | 93.99 | 95.58 |
| dynamic 10-shot GPT-4o | 97.61 | 97.00 | 96.87 |
| dynamic 10-shot GPT-4o-mini | 98.51 | 97.12 | 97.17 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 97.38 | 96.93 | 96.52 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 97.24 | 96.26 | 95.73 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 94.66 | 92.81 | 92.08 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 93.80 | 93.29 | 92.80 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)** | 97.41 | 96.02 | 95.93 |

**Reference-based metric**

| Model | Precision | Recall | F1 |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold | 100 | 100 | 100 |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 48.49 | 40.27 | 42.99 |
| dynamic 10-shot GPT-4o | 49.16 | 44.72 | 46.05 |
| dynamic 10-shot GPT-4o-mini | 49.30 | 39.25 | 42.88 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 57.02 | 47.52 | 51.10 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 57.19 | 47.68 | 51.26 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 42.62 | 38.37 | 39.64 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 46.82 | 43.08 | 44.02 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)**      | 50.82     | 45.89  | 47.44 |