---

# Overview

This model is designed for the **abstractive proposition segmentation task** in **Korean**, as described in the paper [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf). The model segments text into atomic, self-contained units (atomic facts).

# Training Details

- **Base Model**: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- **Fine-tuning Method**: LoRA
- **Dataset**: [RoSE](https://huggingface.co/datasets/Salesforce/rose)
- **Translation**: The dataset was translated into Korean using GPT-4o.
  - GPT-4o was prompted to translate propositions using the vocabulary in the text.
- **Data Split**: The dataset was randomly split into training, validation, and test sets (1900:100:500) for fine-tuning.
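
The 1900:100:500 random split above can be sketched as follows. This is an illustrative sketch, not the authors' actual code: `examples` stands in for the 2,500 translated RoSE examples, and the seed is a placeholder.

```python
import random


def split_dataset(examples, sizes=(1900, 100, 500), seed=42):
    """Randomly partition `examples` into train/validation/test sets of the given sizes."""
    assert sum(sizes) == len(examples), "sizes must cover the whole dataset"
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    train = shuffled[: sizes[0]]
    valid = shuffled[sizes[0] : sizes[0] + sizes[1]]
    test = shuffled[sizes[0] + sizes[1] :]
    return train, valid, test
```

With 2,500 examples this yields exactly 1,900 training, 100 validation, and 500 test examples.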

# Usage

## Data Preprocessing

</details>

## Inputs and Outputs

- **Input**: Text.
- **Output**: A list of propositions for all the sentences in the text passage. The propositions for each sentence are grouped separately.
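
To illustrate the grouped output shape (the passage and propositions below are invented, not real model output), the result can be represented as one inner list per sentence:

```python
# Hypothetical output for a two-sentence passage: one inner list per sentence.
propositions = [
    ["The model segments text.", "The segments are atomic."],  # sentence 1
    ["Each proposition is self-contained."],                   # sentence 2
]


def flatten(grouped):
    """Flatten the per-sentence groups into a single list of propositions."""
    return [p for group in grouped for p in group]
```

The nested structure preserves which propositions came from which sentence; `flatten` discards the grouping when only the flat list of atomic facts is needed.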

## Evaluation Results

- **Metric**: The reference-less and reference-based metrics proposed in [Scalable and Domain-General Abstractive Proposition Segmentation](https://aclanthology.org/2024.findings-emnlp.517.pdf).
- **Models**:
  - Dynamic 10-shot models: for each test example, the 10 most similar examples were selected from the training set using BM25.
  - Translate-test models: the [google/gemma-7b-aps-it](https://huggingface.co/google/gemma-7b-aps-it) model with EN->KO and KO->EN translation by GPT-4o or GPT-4o-mini.
  - Translate-train models: LoRA fine-tuned sLLMs trained on the Korean RoSE dataset.
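
The dynamic 10-shot selection can be sketched with a small, self-contained Okapi BM25 scorer. This is a generic sketch: whitespace tokenization is used for brevity (Korean text would need a proper tokenizer), and the exact BM25 variant and parameters used in the evaluation may differ.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` with Okapi BM25
    (using the common log(1 + (N - df + 0.5) / (df + 0.5)) IDF variant)."""
    docs = [doc.split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query.split():
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores


def top_k_examples(query, train_examples, k=10):
    """Return the k training examples most similar to `query` under BM25."""
    scores = bm25_scores(query, train_examples)
    ranked = sorted(range(len(train_examples)), key=lambda i: -scores[i])
    return [train_examples[i] for i in ranked[:k]]
```

In the dynamic 10-shot setup, `top_k_examples(test_text, training_texts, k=10)` would pick the in-context demonstrations for each test example.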