| language: | |
| - en | |
| - ko | |
| - ja | |
| - zh | |
| license: apache-2.0 | |
| tags: | |
| - text2text-generation | |
| - fact-decomposition | |
| - propositionizer | |
| - onnx | |
| - mt5 | |
| - multilingual | |
| library_name: transformers | |
| pipeline_tag: text2text-generation | |
| base_model: google/mt5-small | |
| datasets: | |
| - cnn_dailymail | |
| - EdinburghNLP/xsum | |
| - csebuetnlp/xlsum | |
| - klue | |
| # Propositionizer-mT5-Small v2 (Multilingual) | |
| A multilingual atomic fact decomposition model that converts unstructured text into a list of self-contained atomic propositions. | |
| ## Overview | |
| | Property | Value | | |
| |----------|-------| | |
| | Base Model | [google/mt5-small](https://huggingface.co/google/mt5-small) (300M params) | | |
| | Training Method | Claude → mT5-small distillation | | |
| | Languages | English, Korean, Japanese, Chinese | | |
| | Training Data | v1: ~9,700 multilingual examples; v2: ~5,900 additional Korean examples | | |
| | Format | ONNX (int8 quantized) | | |
| | License | Apache 2.0 | | |
| Based on the [Dense X Retrieval](https://arxiv.org/abs/2312.06648) (Propositionizer) approach, extended to multiple languages. | |
| ## Usage | |
| ### Transformers.js (Browser / Node.js) | |
| ```javascript | |
| import { pipeline } from '@huggingface/transformers'; | |
| const decomposer = await pipeline( | |
| 'text2text-generation', | |
| 'liliplanet/propositionizer-mt5-small' | |
| ); | |
| const result = await decomposer( | |
| 'Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.', | |
| { max_new_tokens: 256, repetition_penalty: 2.0 } | |
| ); | |
| console.log(JSON.parse(result[0].generated_text)); | |
| // ["The deadline is Friday.", "The rate was reduced."] | |
| ``` | |
| ### Python (Transformers) | |
| ```python | |
| from transformers import AutoTokenizer, AutoModelForSeq2SeqLM | |
| tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small") | |
| model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small") | |
| # Korean input, roughly: "Assistant Manager Kim lowers the rate, and the deadline is Friday." | |
| input_text = "Title: 회의. Section: . Content: 김 대리가 요금을 낮추고 마감은 금요일이다." | |
| inputs = tokenizer(input_text, return_tensors="pt") | |
| outputs = model.generate(**inputs, max_new_tokens=256, repetition_penalty=2.0, no_repeat_ngram_size=3, num_beams=4) | |
| print(tokenizer.decode(outputs[0], skip_special_tokens=True)) | |
| ``` | |
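The Transformers.js example above parses the generated text as a JSON array, while the Python example prints the raw string. A minimal sketch of the equivalent parsing step in Python follows. The helper name `parse_propositions` and the fallback behavior for non-JSON output are assumptions for illustration, not part of the model's API.

```python
import json
from typing import List


def parse_propositions(generated_text: str) -> List[str]:
    """Turn decoded model output into a list of proposition strings.

    Assumes the model emits a JSON array of strings, as the Transformers.js
    example suggests. The fallback below is a hypothetical choice: output
    that is not a JSON array is returned as a single proposition.
    """
    try:
        parsed = json.loads(generated_text)
    except json.JSONDecodeError:
        parsed = None

    if isinstance(parsed, list):
        # Keep non-empty items only, normalized to stripped strings.
        return [str(item).strip() for item in parsed if str(item).strip()]

    # Fallback: treat the whole output as one proposition (or none if empty).
    stripped = generated_text.strip()
    return [stripped] if stripped else []
```

With the Python example above, this could be called as `parse_propositions(tokenizer.decode(outputs[0], skip_special_tokens=True))`.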
| ## Input Format | |
| Follow the Propositionizer format: | |
| ``` | |
| Title: {title}. Section: {section}. Content: {content} | |
| ``` | |
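As a small illustration of the template above, the input string can be assembled with a helper like the one below. The function name `build_input` and the empty-string defaults are assumptions; the usage examples in this card leave the section empty, which yields `Section: .` in the output.

```python
def build_input(content: str, title: str = "", section: str = "") -> str:
    """Format text following the 'Title / Section / Content' template.

    Empty title or section values are left blank, matching the usage
    examples in this card (e.g. 'Section: .').
    """
    return f"Title: {title}. Section: {section}. Content: {content}"


example = build_input(
    "The deadline is Friday and the rate was reduced.",
    title="Meeting",
)
print(example)
```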
| ## Training | |
| ### v1 | |
| - **Source texts**: CNN/DailyMail, XSum, Wikipedia (EN/KO/JA/ZH), KLUE, XLSum | |
| - **Labeling**: Claude Haiku 4.5 atomic fact decomposition | |
| - **Data**: EN 4,879 / KO 2,860 / JA 983 / ZH 975 (~9,700 total) | |
| - **Training**: 5 epochs, Adafactor, lr=1e-3, batch_size=16 | |
| ### v2 (current) | |
| - **Focus**: Korean quality improvement | |
| - **Additional data**: ~5,900 Korean complex sentences (XLSum KO, KLUE NLI/RE/STS, KorQuAD, NSMC) | |
| - **Improved prompt**: language drift prevention, proper noun preservation | |
| - **Training**: continued training from v1, 3 epochs, lr=5e-4 | |
| - **Generation config**: repetition_penalty=2.0, no_repeat_ngram_size=3 | |
| ### v2 improvements over v1 | |
| - Korean language drift (output switching from Korean to English) resolved | |
| - Repetition loops eliminated | |
| - Proper noun preservation improved (e.g., CEO, Q2) | |
| ## Comparison with Original Propositionizer | |
| | | Original (Flan-T5-Large) | This Model (mT5-Small) | | |
| |---|---|---| | |
| | Parameters | 780M | 300M | | |
| | Languages | English only | EN, KO, JA, ZH | | |
| | Teacher | GPT-4 | Claude | | |
| | Training Data | English only | Multilingual | | |
| ## Known Limitations | |
| - Small model (300M) has limited capacity for complex decompositions | |
| - May hallucinate facts not present in the source text, especially with uncommon proper nouns | |
| - Best suited for short-to-medium length paragraphs (under 500 characters) | |
| - Korean complex sentences with many IT terms may produce errors | |
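Given the length limitation above, one possible workaround is to split longer text into chunks of at most 500 characters at sentence boundaries and decompose each chunk separately. The sketch below is only an illustration: the function name `split_into_chunks`, the punctuation-based sentence splitting, and the greedy packing are all assumptions, not something the model card prescribes.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    Sentences are split after '.', '!' or '?' followed by whitespace, or
    directly after CJK full stops. Sentences are rejoined with a single
    space, and a single sentence longer than max_chars is kept whole.
    """
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [p.strip() for p in parts if p and p.strip()]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be placed in the `Content` field of the input format and decomposed independently. Note that splitting may separate a pronoun from its antecedent, which can make the resulting propositions less self-contained.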
| ## Citation | |
| ```bibtex | |
| @article{chen2023densex, | |
| title={Dense X Retrieval: What Retrieval Granularity Should We Use?}, | |
| author={Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong}, | |
| journal={arXiv preprint arXiv:2312.06648}, | |
| year={2023} | |
| } | |
| ``` | |
| ## Part of MemRosetta | |
| This model is a component of the [MemRosetta](https://github.com/obst2580/memrosetta) project for multilingual memory and knowledge extraction. | |