---
base_model:
- yanolja/EEVE-Korean-Instruct-10.8B-v1.0
datasets:
- Salesforce/rose
language:
- ko
license: apache-2.0
tags:
- korean
- Proposition
- Atomic_fact
---

# Overview

This model performs **abstractive proposition segmentation** for **Korean**, the task described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). It segments text into atomic, self-contained units (atomic facts).

# Training Details

- **Base Model**: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- **Fine-tuning Method**: LoRA
- **Dataset**: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
- **Translation**: The dataset was translated into Korean using GPT-4o.
  - GPT-4o was prompted to translate the propositions using the vocabulary that appears in the text.
- **Data Split**: The dataset was randomly split into training, validation, and test sets (1900:100:500) for fine-tuning.

# Usage

## Data Preprocessing

```python
from konlpy.tag import Kkma

# Sentence start/end markers. They are empty here and must be set to the
# marker strings the model was trained with.
sent_start_token = ""
sent_end_token = ""
instruction = "I will provide a passage split into sentences by and markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

kkma = Kkma()


def get_input(text, tokenizer):
    # Split the passage into sentences and wrap each sentence in the markers.
    sentences = kkma.sentences(text)
    prompt = instruction + "Passage: " + sent_start_token + f"{sent_end_token}{sent_start_token}".join(sentences) + sent_end_token + "\nPropositions:\n"
    messages = [{"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True)
    return input_text


def get_output(text):
    # Parse the generated text into one list of propositions per sentence.
    results = []
    group = []

    if text.startswith("Propositions:"):
        lines = text[len("Propositions:"):].strip().split("\n")
    else:
        lines = text.strip().split("\n")

    for line in lines:
        if line.strip() == sent_start_token:
            continue
        elif line.strip() == sent_end_token:
            # End of a sentence: store its propositions and start a new group.
            results.append(group)
            group = []
        else:
            # Stop at the first line that is not a "- proposition" bullet.
            if not line.strip().startswith("-"):
                break
            # Drop the leading "-" even if the line is indented.
            line = line.strip()[1:].strip()
            group.append(line)
    return results
```

## Loading Model and Tokenizer

```python
import peft
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LORA_PATH = "seonjeongh/Korean-Propositionalizer"

# Load the base model named in the adapter config, then merge the LoRA weights into it.
lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(lora_config.base_model_name_or_path,
                                                  torch_dtype=torch.float16,
                                                  device_map="auto")
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(lora_config.base_model_name_or_path)
```

## Inference Example

```python
device = "cuda"
text = "옥스포드는 화요일 맨체스터 유나이티드와의 경기에서 3-2로 패한 경기에서 21세 이하 팀으로 득점했다. 그 골은 16세 선수의 1군 데뷔 주장을 강화할 것이다. 센터백은 이번 시즌 웨스트햄 1군과 함께 훈련했다. 웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."

inputs = tokenizer([get_input(text, tokenizer)], return_tensors='pt').to(device)
output = model.generate(**inputs,
                        max_new_tokens=512,
                        pad_token_id=tokenizer.pad_token_id,
                        eos_token_id=tokenizer.eos_token_id,
                        use_cache=True)
# Decode only the newly generated tokens, not the prompt.
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
results = get_output(response)
print(results)
```
Example output

```json
[
  [
    "옥스포드는 21세 이하 팀으로 득점했다.",
    "옥스포드는 맨체스터 유나이티드와의 경기에서 3-2로 패했다.",
    "옥스포드는 화요일 경기를 했다."
  ],
  [
    "그 골은 16세 선수의 주장을 강화할 것이다.",
    "그 골은 16세 선수의 1 군 데뷔 주장을 강화할 것이다."
  ],
  [
    "센터 백은 웨스트 햄 1 군과 함께 훈련했다.",
    "센터 백은 이번 시즌 웨스트 햄 1 군과 함께 훈련했다."
  ],
  [
    "웨스트햄 유나이티드의 최신 뉴스는 여기를 클릭하세요."
  ]
]
```
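Downstream code often needs a flat list of propositions rather than per-sentence groups. The following is a minimal sketch of one way to do that, keeping the index of the source sentence with each proposition and dropping exact duplicates. The helper name `flatten_propositions` is hypothetical and not part of this model card's code, and the sample data is a shortened copy of the example output.

```python
def flatten_propositions(groups):
    """Turn per-sentence groups into (sentence_index, proposition) pairs.

    Exact duplicate propositions are dropped; the first occurrence is kept.
    """
    seen = set()
    pairs = []
    for sentence_index, group in enumerate(groups):
        for proposition in group:
            if proposition in seen:
                continue
            seen.add(proposition)
            pairs.append((sentence_index, proposition))
    return pairs


# Shortened copy of the example output: one inner list per input sentence.
results = [
    ["옥스포드는 21세 이하 팀으로 득점했다.", "옥스포드는 화요일 경기를 했다."],
    ["그 골은 16세 선수의 주장을 강화할 것이다."],
]

for sentence_index, proposition in flatten_propositions(results):
    print(sentence_index, proposition)
```

Keeping the sentence index makes it possible to trace each proposition back to the sentence it was derived from, for example when checking factual consistency sentence by sentence.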
## Inputs and Outputs

- **Input**: Text.
- **Output**: List of propositions for all the sentences in the text passage. The propositions for each sentence are grouped separately.

## Evaluation Results

- **Metric**: Reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
- **Models**:
  - Dynamic 10-shot models: For each test example, the 10 most similar examples were selected from the training set using BM25.
  - Translate-test models: [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) model + EN->KO, KO->EN translation using GPT-4o or GPT-4o-mini.
  - Translate-train models: LoRA fine-tuned sLLMs using the Korean RoSE dataset.

**Reference-less metric**

| Model                                                               | Precision | Recall |  F1   |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold                                                                |   97.46   | 96.28  | 95.88 |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct)                         |   98.86   | 93.99  | 95.58 |
| dynamic 10-shot GPT-4o                                              |   97.61   | 97.00  | 96.87 |
| dynamic 10-shot GPT-4o-mini                                         |   98.51   | 97.12  | 97.17 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation)        |   97.38   | 96.93  | 96.52 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation)   |   97.24   | 96.26  | 95.73 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct)                          |   94.66   | 92.81  | 92.08 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct)              |   93.80   | 93.29  | 92.80 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)**       |   97.41   | 96.02  | 95.93 |

**Reference-based metric**

| Model                                                               | Precision | Recall |  F1   |
|---------------------------------------------------------------------|:---------:|:------:|:-----:|
| Gold                                                                |    100    |  100   |  100  |
| dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct)                         |   48.49   | 40.27  | 42.99 |
| dynamic 10-shot GPT-4o                                              |   49.16   | 44.72  | 46.05 |
| dynamic 10-shot GPT-4o-mini                                         |   49.30   | 39.25  | 42.88 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o Translation)        |   57.02   | 47.52  | 51.10 |
| Translate-Test (google/gemma-7b-aps-it & GPT-4o-mini Translation)   |   57.19   | 47.68  | 51.26 |
| Translate-Train (Qwen/Qwen2.5-7B-Instruct)                          |   42.62   | 38.37  | 39.64 |
| Translate-Train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct)              |   46.82   | 43.08  | 44.02 |
| **Translate-Train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0)**       |   50.82   | 45.89  | 47.44 |
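
The dynamic 10-shot baselines choose their demonstrations with BM25. The sketch below illustrates that retrieval step in plain Python. It is not the code used to produce the numbers above: the function name `bm25_top_k`, the whitespace tokenization, the parameters `k1=1.5` and `b=0.75`, and the non-negative IDF variant are all assumptions, since the model card does not specify them. Whitespace tokenization in particular is a simplification for Korean, where a morphological analyzer would likely work better.

```python
import math
from collections import Counter


def bm25_top_k(query, documents, k=10, k1=1.5, b=0.75):
    """Return the indices of the k documents with the highest BM25 score.

    Assumptions (not from the model card): whitespace tokenization,
    k1=1.5, b=0.75, and a non-negative IDF of the form
    log(1 + (N - df + 0.5) / (df + 0.5)).
    """
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / max(n_docs, 1)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    def idf(term):
        df = doc_freq.get(term, 0)
        return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

    # Each distinct query term contributes once to the score.
    query_terms = set(query.split())

    scores = []
    for index, doc in enumerate(tokenized):
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Length normalization: longer documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf(term) * tf * (k1 + 1) / norm
        scores.append((score, index))

    # Highest score first; ties are broken by original document order.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scores[:k]]


# Toy "training set"; in the described setup these would be training passages.
train_texts = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock prices fell sharply on monday",
]
print(bm25_top_k("cat on a mat", train_texts, k=2))
```

In the setup described under **Models**, the query would be a test passage, `documents` the training passages, and the returned indices would pick the 10 training examples placed in the prompt as demonstrations.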