---
language:
- en
- ko
- ja
- zh
license: apache-2.0
tags:
- text2text-generation
- fact-decomposition
- propositionizer
- onnx
- mt5
- multilingual
library_name: transformers
pipeline_tag: text2text-generation
base_model: google/mt5-small
datasets:
- cnn_dailymail
- EdinburghNLP/xsum
- csebuetnlp/xlsum
- klue
---
# Propositionizer-mT5-Small v2 (Multilingual)
A multilingual atomic fact decomposition model that converts unstructured text into a list of self-contained atomic propositions.
## Overview
| Property | Value |
|----------|-------|
| Base Model | [google/mt5-small](https://huggingface.co/google/mt5-small) (300M params) |
| Training Method | Claude → mT5-small distillation |
| Languages | English, Korean, Japanese, Chinese |
| Training Data | v1: ~9,700 multilingual examples; v2: ~5,900 additional Korean examples |
| Format | ONNX (int8 quantized) |
| License | Apache 2.0 |
Based on the [Dense X Retrieval](https://arxiv.org/abs/2312.06648) (Propositionizer) approach, extended to multilingual input.
## Usage
### Transformers.js (Browser / Node.js)
```javascript
import { pipeline } from '@huggingface/transformers';
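// Load the text2text-generation pipeline for this model.
const decomposer = await pipeline(
  'text2text-generation',
  'liliplanet/propositionizer-mt5-small'
);

// The input follows the Title / Section / Content format; Section is left empty here.
const result = await decomposer(
  'Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.',
  { max_new_tokens: 256, repetition_penalty: 2.0 }
);

// The generated text is parsed as a JSON array of propositions.
console.log(JSON.parse(result[0].generated_text));
// ["The deadline is Friday.", "The rate was reduced."]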
```
### Python (Transformers)
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small")

# Korean input: "Assistant Manager Kim lowers the hourly rate, and the deadline is Friday."
input_text = "Title: 회의. Section: . Content: 김 대리가 시급을 낮추고 마감은 금요일이다."
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(
    **inputs, max_new_tokens=256, repetition_penalty=2.0,
    no_repeat_ngram_size=3, num_beams=4,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
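The Transformers.js example parses the generated text as a JSON array. A Python counterpart might look like the sketch below. It assumes the decoded output has the same JSON-array shape and falls back to a single proposition when parsing fails; `parse_propositions` is a hypothetical helper, not something shipped with this model.

```python
import json


def parse_propositions(generated_text: str) -> list[str]:
    """Turn decoded model output into a list of proposition strings."""
    try:
        parsed = json.loads(generated_text)
    except json.JSONDecodeError:
        parsed = None

    if isinstance(parsed, list):
        # Keep non-empty items only, normalised to stripped strings.
        return [str(item).strip() for item in parsed if str(item).strip()]

    # Fallback (assumption): treat unparseable output as one proposition.
    text = generated_text.strip()
    return [text] if text else []
```

The fallback keeps downstream code working when a small model emits malformed JSON, at the cost of returning an undecomposed string in that case.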
## Input Format
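Inputs follow the Propositionizer format. A field that does not apply can be left empty, as the Section field is in the usage examples: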
```
Title: {title}. Section: {section}. Content: {content}
```
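A small helper can assemble this string consistently. The sketch below is illustrative only: `build_input` is a hypothetical name, and collapsing whitespace in the content is an assumption rather than a documented requirement of the model.

```python
def build_input(content: str, title: str = "", section: str = "") -> str:
    """Format text as 'Title: {title}. Section: {section}. Content: {content}'."""
    # Assumption: collapse newlines and repeated spaces so the prompt stays on one line.
    content = " ".join(content.split())
    return f"Title: {title.strip()}. Section: {section.strip()}. Content: {content}"


prompt = build_input(
    "The deadline is Friday and the rate was reduced.",
    title="Meeting",
)
print(prompt)
```

With an empty section this reproduces the `Section: .` form used in the usage examples.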
## Training
### v1
- **Source texts**: CNN/DailyMail, XSum, Wikipedia (EN/KO/JA/ZH), KLUE, XLSum
- **Labeling**: Claude Haiku 4.5 atomic fact decomposition
- **Data**: EN 4,879 / KO 2,860 / JA 983 / ZH 975 (~9,700 total)
- **Training**: 5 epochs, Adafactor, lr=1e-3, batch_size=16
### v2 (current)
- **Focus**: Korean quality improvement
- **Additional data**: ~5,900 Korean complex sentences (XLSum KO, KLUE NLI/RE/STS, KorQuAD, NSMC)
- **Improved prompt**: language drift prevention, proper noun preservation
- **Training**: continued training from v1, 3 epochs, lr=5e-4
- **Generation config**: repetition_penalty=2.0, no_repeat_ngram_size=3
### v2 improvements over v1
- Korean language drift (output switching from Korean to English) resolved
- Repetition loops eliminated
- Proper noun preservation improved (e.g., CEO, Q2)
## Comparison with Original Propositionizer
| | Original (Flan-T5-Large) | This Model (mT5-Small) |
|---|---|---|
| Parameters | 780M | 300M |
| Languages | English only | EN, KO, JA, ZH |
| Teacher | GPT-4 | Claude |
| Training Data | English only | Multilingual |
## Known Limitations
- Small model (300M) has limited capacity for complex decompositions
- May hallucinate facts not present in the source text, especially with uncommon proper nouns
- Best suited to short-to-medium paragraphs (under 500 characters)
- Complex Korean sentences with many IT terms may produce errors
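
Given the length guidance above, longer documents could be split into smaller pieces before decomposition. The following is a minimal sketch, assuming a rough punctuation-based sentence splitter and treating 500 characters as a soft budget; `split_into_chunks` is a hypothetical helper and not part of this model's repository.

```python
import re

# Rough heuristic: sentence-ending punctuation (Latin and CJK) plus trailing whitespace.
_SENTENCE_END = re.compile(r"[.!?。!?]+\s*")


def split_into_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack sentence-like pieces into chunks of about max_chars characters."""
    # Slice the original text at sentence ends so no characters are altered.
    pieces, start = [], 0
    for match in _SENTENCE_END.finditer(text):
        pieces.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])

    chunks, current = [], ""
    for piece in pieces:
        # Start a new chunk when adding this piece would exceed the budget.
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence, so individual chunks can exceed the budget in that case.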
## Citation
```bibtex
@article{chen2023densex,
title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
author={Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong},
journal={arXiv preprint arXiv:2312.06648},
year={2023}
}
```
## Part of MemRosetta
This model is a component of the [MemRosetta](https://github.com/obst2580/memrosetta) project for multilingual memory and knowledge extraction.