---
language:
- en
- ko
- ja
- zh
license: apache-2.0
tags:
- text2text-generation
- fact-decomposition
- propositionizer
- onnx
- mt5
- multilingual
library_name: transformers
pipeline_tag: text2text-generation
base_model: google/mt5-small
datasets:
- cnn_dailymail
- EdinburghNLP/xsum
- csebuetnlp/xlsum
- klue
---

# Propositionizer-mT5-Small v2 (Multilingual)

A multilingual atomic fact decomposition model that converts unstructured text into a list of self-contained atomic propositions.

## Overview

| Property | Value |
|----------|-------|
| Base Model | [google/mt5-small](https://huggingface.co/google/mt5-small) (300M params) |
| Training Method | Claude → mT5-small distillation |
| Languages | English, Korean, Japanese, Chinese |
| Training Data | v1: ~9,700 multilingual examples; v2: ~5,900 additional Korean examples |
| Format | ONNX (int8 quantized) |
| License | Apache 2.0 |

Based on the Propositionizer approach from [Dense X Retrieval](https://arxiv.org/abs/2312.06648), extended to multiple languages.

## Usage

### Transformers.js (Browser / Node.js)

```javascript
import { pipeline } from '@huggingface/transformers';

const decomposer = await pipeline(
  'text2text-generation',
  'liliplanet/propositionizer-mt5-small'
);

const result = await decomposer(
  'Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.',
  { max_new_tokens: 256, repetition_penalty: 2.0 }
);
console.log(JSON.parse(result[0].generated_text));
// ["The deadline is Friday.", "The rate was reduced."]
```

### Python (Transformers)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small")

input_text = "Title: 회의. Section: . Content: 김 대리가 시급을 낮추고 마감은 금요일이다."
inputs = tokenizer(input_text, return_tensors="pt")
# Generation settings match the v2 generation config, plus beam search.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=2.0,
    no_repeat_ngram_size=3,
    num_beams=4,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
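
The Transformers.js example parses the generated text as a JSON array. A rough Python equivalent is sketched below. It assumes the model emits a JSON list of strings, which a small model will not always do, so it falls back to returning the raw text as a single proposition. The helper name `parse_propositions` is hypothetical and not part of this repository.

```python
import json


def parse_propositions(generated_text: str) -> list[str]:
    """Parse decoded model output into a list of proposition strings.

    Assumption: the output is a JSON array of strings. If it is not
    valid JSON (or not a list), the raw text is returned as one item.
    """
    try:
        parsed = json.loads(generated_text)
    except json.JSONDecodeError:
        parsed = None

    if isinstance(parsed, list):
        # Keep only non-empty items, normalized to stripped strings.
        return [str(p).strip() for p in parsed if str(p).strip()]

    # Fallback: treat the whole output as a single proposition.
    text = generated_text.strip()
    return [text] if text else []
```

Pass it the string returned by `tokenizer.decode(...)` in place of printing that string directly.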

## Input Format

Format inputs following the Propositionizer convention. Leave `{section}` empty when there is none, as in the usage examples above:

```
Title: {title}. Section: {section}. Content: {content}
```
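
As a minimal sketch, the template above can be filled in with a small helper. The function name `build_input` and its defaults are illustrative assumptions, not part of the model's API. An empty title or section is left blank, mirroring the `Section: .` pattern in the usage examples.

```python
def build_input(content: str, title: str = "", section: str = "") -> str:
    """Build a model input string in the Propositionizer format.

    Hypothetical helper: empty title/section fields are left blank,
    matching the usage examples in this card.
    """
    return f"Title: {title}. Section: {section}. Content: {content.strip()}"


print(build_input("The deadline is Friday and the rate was reduced.", title="Meeting"))
```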

## Training

### v1
- **Source texts**: CNN/DailyMail, XSum, Wikipedia (EN/KO/JA/ZH), KLUE, XLSum
- **Labeling**: Claude Haiku 4.5 atomic fact decomposition
- **Data**: EN 4,879 / KO 2,860 / JA 983 / ZH 975 (~9,700 total)
- **Training**: 5 epochs, Adafactor, lr=1e-3, batch_size=16

### v2 (current)
- **Focus**: Korean quality improvement
- **Additional data**: ~5,900 Korean complex sentences (XLSum KO, KLUE NLI/RE/STS, KorQuAD, NSMC)
- **Improved prompt**: language drift prevention, proper noun preservation
- **Training**: continued training from v1, 3 epochs, lr=5e-4
- **Generation config**: repetition_penalty=2.0, no_repeat_ngram_size=3

### v2 improvements over v1
- Korean language drift (Korean output switching to English) resolved
- Repetition loops eliminated
- Proper noun preservation improved (e.g., CEO, Q2)

## Comparison with Original Propositionizer

| | Original (Flan-T5-Large) | This Model (mT5-Small) |
|---|---|---|
| Parameters | 780M | 300M |
| Languages | English only | EN, KO, JA, ZH |
| Teacher | GPT-4 | Claude |
| Training Data | English only | Multilingual |

## Known Limitations

- The small model (300M parameters) has limited capacity for complex decompositions
- May hallucinate facts not present in the source text, especially with uncommon proper nouns
- Best suited to short-to-medium paragraphs (under 500 characters)
- Complex Korean sentences containing many IT terms may produce errors
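
One way to work within the 500-character guideline is to split longer text into chunks before decomposition. The sketch below is an assumption about how a caller might do this, not something shipped with the model: it splits naively on Latin and CJK sentence-ending punctuation, so abbreviations and decimals may be cut in the wrong place. The name `split_into_chunks` is hypothetical.

```python
import re

MAX_CHARS = 500  # guideline from the limitations list; adjust as needed


def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily group sentences into chunks of at most max_chars characters.

    Naive heuristic: a sentence ends at . ! ? or their CJK equivalents.
    A single sentence longer than max_chars is kept whole, so such a
    chunk can exceed the limit.
    """
    sentences = re.findall(r"[^.!?。！？]+[.!?。！？]*\s*", text)
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Each chunk can then be placed in the `Content:` field of a separate input and decomposed independently.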

## Citation

```bibtex
@article{chen2023densex,
  title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
  author={Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong},
  journal={arXiv preprint arXiv:2312.06648},
  year={2023}
}
```

## Part of MemRosetta

This model is a component of the [MemRosetta](https://github.com/obst2580/memrosetta) project for multilingual memory and knowledge extraction.