---
language:
- en
- ko
- ja
- zh
license: apache-2.0
tags:
- text2text-generation
- fact-decomposition
- propositionizer
- onnx
- mt5
- multilingual
library_name: transformers
pipeline_tag: text2text-generation
base_model: google/mt5-small
datasets:
- cnn_dailymail
- EdinburghNLP/xsum
- csebuetnlp/xlsum
- klue
---

# Propositionizer-mT5-Small v2 (Multilingual)

A multilingual atomic fact decomposition model that converts unstructured text into a list of self-contained atomic propositions.

## Overview

| Property | Value |
|----------|-------|
| Base Model | [google/mt5-small](https://huggingface.co/google/mt5-small) (300M params) |
| Training Method | Claude → mT5-small distillation |
| Languages | English, Korean, Japanese, Chinese |
| Training Data | v1: ~9,700 multilingual examples; v2: ~5,900 additional Korean examples |
| Format | ONNX (int8 quantized) |
| License | Apache 2.0 |

Based on the [Dense X Retrieval](https://arxiv.org/abs/2312.06648) (Propositionizer) approach, extended to multilingual input.

## Usage

### Transformers.js (Browser / Node.js)

```javascript
import { pipeline } from '@huggingface/transformers';

const decomposer = await pipeline(
  'text2text-generation',
  'liliplanet/propositionizer-mt5-small'
);

const result = await decomposer(
  'Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.',
  { max_new_tokens: 256, repetition_penalty: 2.0 }
);

// The model emits a JSON array of propositions as a string.
console.log(JSON.parse(result[0].generated_text));
// ["The deadline is Friday.", "The rate was reduced."]
```

### Python (Transformers)

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small")

# Korean input: "Assistant Manager Kim lowers the hourly rate, and the deadline is Friday."
input_text = "Title: 회의. Section: . Content: 김 대리가 시급을 낮추고 마감은 금요일이다."
inputs = tokenizer(input_text, return_tensors="pt")

# Generation settings match the v2 generation config listed under Training.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=2.0,
    no_repeat_ngram_size=3,
    num_beams=4,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Input Format

Follow the Propositionizer format. Leave a field empty when it does not apply, as the examples above do for `Section`:

```
Title: {title}. Section: {section}. Content: {content}
```

## Training

### v1

- **Source texts**: CNN/DailyMail, XSum, Wikipedia (EN/KO/JA/ZH), KLUE, XLSum
- **Labeling**: Claude Haiku 4.5 atomic fact decomposition
- **Data**: EN 4,879 / KO 2,860 / JA 983 / ZH 975 (~9,700 total)
- **Training**: 5 epochs, Adafactor, lr=1e-3, batch_size=16

### v2 (current)

- **Focus**: Korean quality improvement
- **Additional data**: ~5,900 Korean complex sentences (XLSum KO, KLUE NLI/RE/STS, KorQuAD, NSMC)
- **Improved prompt**: language drift prevention, proper noun preservation
- **Training**: continued training from v1, 3 epochs, lr=5e-4
- **Generation config**: repetition_penalty=2.0, no_repeat_ngram_size=3

### v2 improvements over v1

- Korean language drift (output switching from Korean to English) resolved
- Repetition loops eliminated
- Proper noun preservation improved (e.g., CEO, Q2)

## Comparison with Original Propositionizer

| | Original (Flan-T5-Large) | This Model (mT5-Small) |
|---|---|---|
| Parameters | 780M | 300M |
| Languages | English only | EN, KO, JA, ZH |
| Teacher | GPT-4 | Claude |
| Training Data | English only | Multilingual |

## Known Limitations

- The small model (300M parameters) has limited capacity for complex decompositions
- May hallucinate facts not present in the source text, especially with uncommon proper nouns
- Best suited for short-to-medium length paragraphs (under 500 characters)
- Complex Korean sentences with many IT terms may produce errors

## Citation

```bibtex
@article{chen2023densex,
  title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
  author={Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong},
  journal={arXiv preprint arXiv:2312.06648},
  year={2023}
}
```

## Part of MemRosetta

This model is a component of the [MemRosetta](https://github.com/obst2580/memrosetta) project for multilingual memory and knowledge extraction.
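
## Helper Sketch: Building Inputs and Parsing Outputs

The input format and the JSON-list output shown in the usage examples can be wrapped in two small helpers. This is a minimal sketch, not part of the model or of any library: `build_prompt` and `parse_propositions` are hypothetical names, and the fallback for output that is not valid JSON is an assumption about how a caller might want to handle malformed generations.

```python
import json
from typing import List


def build_prompt(title: str, section: str, content: str) -> str:
    """Assemble the 'Title: ... Section: ... Content: ...' input string.

    Empty fields are left empty, which yields e.g. 'Section: .'.
    """
    return f"Title: {title}. Section: {section}. Content: {content}"


def parse_propositions(generated_text: str) -> List[str]:
    """Parse model output that is expected to be a JSON list of strings.

    Assumed fallback policy: if the text is not a JSON list, return the raw
    text as a single proposition (or an empty list if it is blank).
    """
    raw = generated_text.strip()
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return [raw] if raw else []
    if not isinstance(parsed, list):
        return [raw]
    # Drop blank entries and normalize surrounding whitespace.
    return [str(item).strip() for item in parsed if str(item).strip()]


prompt = build_prompt("Meeting", "", "The deadline is Friday and the rate was reduced.")
print(prompt)
print(parse_propositions('["The deadline is Friday.", "The rate was reduced."]'))
```

The string returned by `build_prompt` can be passed to the tokenizer or pipeline in place of a hand-written input, and `parse_propositions` can be applied to the decoded generation. Given the hallucination risk noted under Known Limitations, callers may still want to check each proposition against the source text.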