	---
	base_model:
	- yanolja/EEVE-Korean-Instruct-10.8B-v1.0
	datasets:
	- Salesforce/rose
	language:
	- ko
	license: apache-2.0
	tags:
	- korean
	- Proposition
	- Atomic_fact
	---

	# Overview
	This model is designed for the abstractive proposition segmentation task in Korean, as described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). The model segments text into atomic and self-contained units (atomic facts).

	# Training Details
	- Base Model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
	- Fine-tuning Method: LoRA
	- Dataset: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
	- Translation: The dataset was translated into Korean using GPT-4o.
	  - GPT-4o was prompted to translate propositions using the vocabulary in the text.
	- Data Split: The dataset was randomly split into training, validation, and test sets (1900:100:500) for fine-tuning.
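The random split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_dataset` and the seed are assumptions, since the card states only the split sizes.

```python
import random


def split_dataset(examples, n_train=1900, n_valid=100, n_test=500, seed=0):
    """Shuffle a list of examples and cut it into train/validation/test parts.

    The seed is a hypothetical choice; the model card does not state one.
    """
    if len(examples) != n_train + n_valid + n_test:
        raise ValueError("split sizes must add up to the dataset size")
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Stand-in data: integers in place of the 2,500 translated examples.
train, valid, test = split_dataset(list(range(2500)))
print(len(train), len(valid), len(test))  # 1900 100 500
```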

	# Usage
	## Data Preprocessing
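	```python
	from konlpy.tag import Kkma

	# Markers that wrap each sentence of the passage in the prompt and in the model output.
	sent_start_token = "<sent>"
	sent_end_token = "</sent>"
	# Note: the instruction text names <s>/</s>, while the passage is wrapped in
	# <sent>/</sent>. Both are kept exactly as published in this model card.
	instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

	# Kkma is used only to split the passage into sentences.
	kkma = Kkma()


	def get_input(text, tokenizer):
	    """Build the chat-formatted prompt for one passage."""
	    sentences = kkma.sentences(text)
	    prompt = (
	        instruction
	        + "Passage: "
	        + sent_start_token
	        + f"{sent_end_token}{sent_start_token}".join(sentences)
	        + sent_end_token
	        + "\nPropositions:\n"
	    )
	    messages = [
	        {"role": "system", "content": "You are a helpful assistant."},
	        {"role": "user", "content": prompt},
	    ]
	    input_text = tokenizer.apply_chat_template(
	        messages,
	        tokenize=False,
	        add_generation_prompt=True,
	    )
	    return input_text


	def get_output(text):
	    """Parse the model response into one list of propositions per sentence."""
	    results = []
	    group = []

	    # Drop the leading "Propositions:" header if the model repeats it.
	    if text.startswith("Propositions:"):
	        lines = text[len("Propositions:"):].strip().split("\n")
	    else:
	        lines = text.strip().split("\n")

	    for line in lines:
	        line = line.strip()
	        if line == sent_start_token:
	            continue
	        elif line == sent_end_token:
	            # End of a sentence block: store its propositions and start a new group.
	            results.append(group)
	            group = []
	        else:
	            # Stop at the first line that is neither a marker nor a "-" bullet.
	            if not line.startswith("-"):
	                break
	            group.append(line[1:].strip())

	    return results
	```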
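To make the expected response layout concrete, the sketch below parses a hand-written sample. The layout (a `<sent>` line, one `- ` bullet per proposition, then a `</sent>` line) is inferred from the parsing logic in `get_output`; the sample strings and the name `parse_propositions` are illustrative assumptions, not actual model output.

```python
SENT_START, SENT_END = "<sent>", "</sent>"


def parse_propositions(response):
    """Group '- proposition' lines by the <sent> ... </sent> block they appear in."""
    groups, current = [], []
    for raw in response.strip().split("\n"):
        line = raw.strip()
        if line == SENT_START:
            current = []
        elif line == SENT_END:
            groups.append(current)
        elif line.startswith("-"):
            current.append(line[1:].strip())
        else:
            break  # stop at the first line that is neither a marker nor a bullet
    return groups


# Hand-written sample in the assumed response layout.
sample = "\n".join([
    "<sent>",
    "- 센터백은 웨스트햄 1군과 함께 훈련했다.",
    "- 센터백은 이번 시즌 웨스트햄 1군과 함께 훈련했다.",
    "</sent>",
    "<sent>",
    "- 웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요.",
    "</sent>",
])
groups = parse_propositions(sample)
print(len(groups))  # 2
```

A response that is cut off before a closing `</sent>` loses its last, unfinished group under this logic, so `max_new_tokens` may need to be raised for long passages.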

	## Loading Model and Tokenizer
	```python
	import peft
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	LORA_PATH = "seonjeongh/Korean-Propositionalizer"

	# Read the adapter config to find the base model it was trained on.
	lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
	base_model = AutoModelForCausalLM.from_pretrained(
	    lora_config.base_model_name_or_path,
	    torch_dtype=torch.float16,
	    device_map="auto",
	)
	# Attach the LoRA adapter, then merge its weights into the base model.
	model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
	model = model.merge_and_unload(progressbar=True)
	tokenizer = AutoTokenizer.from_pretrained(lora_config.base_model_name_or_path)
	```

	## Inference Example
	```python
	device = "cuda"

	text = "옥스포드는 화요일 맨체스터 유나이티드와의 경기에서 3-2로 패한 경기에서 21세 이하 팀으로 득점했다. 그 골은 16세 선수의 1군 데뷔 주장을 강화할 것이다. 센터백은 이번 시즌 웨스트햄 1군과 함께 훈련했다. 웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
	inputs = tokenizer([get_input(text, tokenizer)], return_tensors="pt").to(device)
	output = model.generate(
	    **inputs,
	    max_new_tokens=512,
	    pad_token_id=tokenizer.pad_token_id,
	    eos_token_id=tokenizer.eos_token_id,
	    use_cache=True,
	)
	# Decode only the newly generated tokens, skipping the prompt.
	response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
	results = get_output(response)
	print(results)
	```
	<details>

	<summary>Example output</summary>

	```json
	[
	  [
	    "옥스포드는 21세 이하 팀으로 득점했다.",
	    "옥스포드는 맨체스터 유나이티드와의 경기에서 3-2로 패했다.",
	    "옥스포드는 화요일 경기를 했다."
	  ],
	  [
	    "그 골은 16세 선수의 주장을 강화할 것이다.",
	    "그 골은 16세 선수의 1 군 데뷔 주장을 강화할 것이다."
	  ],
	  [
	    "센터 백은 웨스트 햄 1 군과 함께 훈련했다.",
	    "센터 백은 이번 시즌 웨스트 햄 1 군과 함께 훈련했다."
	  ],
	  [
	    "웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
	  ]
	]
	```
	</details>

	## Inputs and Outputs
	- Input: Text.
	- Output: List of propositions for all the sentences in the text passage. The propositions for each sentence are grouped separately.
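Because the output is grouped per sentence, downstream code often needs it flattened. The sketch below pairs each proposition with its source sentence; `flatten_propositions` is a hypothetical helper, and it assumes the model returned exactly one group per input sentence, which a truncated generation would violate.

```python
def flatten_propositions(sentences, grouped):
    """Pair every proposition with the index and text of its source sentence."""
    if len(sentences) != len(grouped):
        raise ValueError("expected one proposition group per sentence")
    rows = []
    for index, (sentence, props) in enumerate(zip(sentences, grouped)):
        for prop in props:
            rows.append(
                {"sentence_index": index, "sentence": sentence, "proposition": prop}
            )
    return rows


# Two sentences and their grouped propositions, taken from the example in this card.
sentences = [
    "그 골은 16세 선수의 1군 데뷔 주장을 강화할 것이다.",
    "웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요.",
]
grouped = [
    [
        "그 골은 16세 선수의 주장을 강화할 것이다.",
        "그 골은 16세 선수의 1 군 데뷔 주장을 강화할 것이다.",
    ],
    ["웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."],
]
rows = flatten_propositions(sentences, grouped)
print(len(rows))  # 3
```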

	## Evaluation Results
	- Metric: Reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
	- Models:
	  - Dynamic 10-shot models: For each test example, the 10 most similar examples were selected from the training set using BM25.
	  - Translate-test models: the [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) model combined with EN->KO and KO->EN translation by GPT-4o or GPT-4o-mini.
	  - Translate-train models: small LLMs (sLLMs) fine-tuned with LoRA on the Korean RoSE dataset.
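The dynamic few-shot selection can be sketched with a small, dependency-free Okapi BM25 scorer. This illustrates the general technique rather than the setup used for the reported numbers: whitespace tokenization, the `k1` and `b` values, and the name `bm25_top_k` are all assumptions, and the card does not say which text field was indexed.

```python
import math
from collections import Counter


def bm25_top_k(query, corpus, k=10, k1=1.5, b=0.75, tokenize=str.split):
    """Return indices of the k corpus documents scoring highest under Okapi BM25."""
    docs = [tokenize(doc) for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))

    def idf(term):
        df = doc_freq.get(term, 0)
        return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)

    query_terms = tokenize(query)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf(term) * tf[term] * (k1 + 1) / norm
        scores.append(score)
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy training pool; the real pool would be the 1,900 training passages.
train_passages = [
    "옥스포드는 맨체스터 유나이티드와의 경기에서 득점했다",
    "센터백은 웨스트햄 1군과 함께 훈련했다",
    "날씨가 맑고 바람이 분다",
]
top = bm25_top_k("웨스트햄 센터백은 훈련했다", train_passages, k=2)
print(top[0])  # 1
```

For Korean, a morphological tokenizer would likely retrieve better matches than whitespace splitting, since particles attach to nouns; it can be passed in through the `tokenize` argument.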

	Reference-less metric

	| Model | Precision | Recall | F1 |
	|---------------------------------------------------------------------|:---------:|:------:|:-----:|
	| Gold | 97.46 | 96.28 | 95.88 |
	| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 98.86 | 93.99 | 95.58 |
	| dynamic 10-shot GPT-4o | 97.61 | 97.00 | 96.87 |
	| dynamic 10-shot GPT-4o-mini | 98.51 | 97.12 | 97.17 |
	| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 97.38 | 96.93 | 96.52 |
	| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 97.24 | 96.26 | 95.73 |
	| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 94.66 | 92.81 | 92.08 |
	| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 93.80 | 93.29 | 92.80 |
	| Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0) | 97.41 | 96.02 | 95.93 |

	Reference-based metric

	| Model | Precision | Recall | F1 |
	|---------------------------------------------------------------------|:---------:|:------:|:-----:|
	| Gold | 100 | 100 | 100 |
	| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 48.49 | 40.27 | 42.99 |
	| dynamic 10-shot GPT-4o | 49.16 | 44.72 | 46.05 |
	| dynamic 10-shot GPT-4o-mini | 49.30 | 39.25 | 42.88 |
	| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation) | 57.02 | 47.52 | 51.10 |
	| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation) | 57.19 | 47.68 | 51.26 |
	| Translate-Train (Qwen/Qwen2.5-7B-Instruct) | 42.62 | 38.37 | 39.64 |
	| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 46.82 | 43.08 | 44.02 |
	| Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0) | 50.82 | 45.89 | 47.44 |
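
To show how the precision, recall, and F1 columns relate, here is a rough per-example sketch of a reference-based score. It is not the evaluation code behind the tables. It assumes that a predicted proposition counts as correct when it is equivalent to some gold proposition, and the reverse for recall. The paper decides equivalence with an NLI model, whereas the default here is exact string match, used only as a stand-in. The name `proposition_prf` is hypothetical.

```python
def proposition_prf(predicted, gold, equivalent=lambda a, b: a == b):
    """Precision, recall, and F1 between two proposition lists for one example.

    `equivalent` is a pluggable predicate; exact match is only a placeholder
    for the entailment-based check assumed to be used in the paper.
    """
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    matched_pred = sum(any(equivalent(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(equivalent(g, p) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted)
    recall = matched_gold / len(gold)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two of three predictions match two of four gold propositions.
predicted = ["A", "B", "C"]
gold = ["A", "B", "D", "E"]
precision, recall, f1 = proposition_prf(predicted, gold)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

If scores are computed per example and then averaged over the test set, the averaged F1 need not equal the harmonic mean of the averaged precision and recall. That may be why some rows report an F1 below both of the other two columns.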