metadata
language:
- en
- ko
- ja
- zh
license: apache-2.0
tags:
- text2text-generation
- fact-decomposition
- propositionizer
- onnx
- mt5
- multilingual
library_name: transformers
pipeline_tag: text2text-generation
base_model: google/mt5-small
datasets:
- cnn_dailymail
- EdinburghNLP/xsum
- csebuetnlp/xlsum
- klue
Propositionizer-mT5-Small v2 (Multilingual)
A multilingual atomic fact decomposition model that converts unstructured text into a list of self-contained atomic propositions.
Overview
| Property | Value |
|---|---|
| Base Model | google/mt5-small (300M params) |
| Training Method | Claude → mT5-small distillation |
| Languages | English, Korean, Japanese, Chinese |
| Training Data | v1: ~9,700 + v2: ~5,900 Korean examples |
| Format | ONNX (int8 quantized) |
| License | Apache 2.0 |
Based on the Dense X Retrieval (Propositionizer) approach, extended to multilingual input.
Usage
Transformers.js (Browser / Node.js)
import { pipeline } from '@huggingface/transformers';
const decomposer = await pipeline(
'text2text-generation',
'liliplanet/propositionizer-mt5-small'
);
const result = await decomposer(
'Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.',
{ max_new_tokens: 256, repetition_penalty: 2.0 }
);
console.log(JSON.parse(result[0].generated_text));
// ["The deadline is Friday.", "The rate was reduced."]
Python (Transformers)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small")
# Korean input: "Assistant Manager Kim lowered the rate, and the deadline is Friday."
input_text = "Title: 회의. Section: . Content: 김 대리가 요금을 낮추고 마감은 금요일이다."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, repetition_penalty=2.0, no_repeat_ngram_size=3, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
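The Transformers.js example passes the generated text to `JSON.parse`, which suggests the model emits a JSON array of strings. Assuming that holds for the Python path as well, the decoded output can be turned into a list with a small helper. This is a minimal sketch; `parse_propositions` is a hypothetical name, and the fallback to a single-item list is a design choice of this example, not documented model behavior.

```python
import json


def parse_propositions(generated_text: str) -> list[str]:
    """Parse decoded model output into a list of proposition strings.

    Assumption: the output is a JSON array of strings. If it is not valid
    JSON (or not a list), the raw text is returned as a single item.
    """
    text = generated_text.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return [text] if text else []
    if isinstance(parsed, list):
        # Drop empty entries and normalize surrounding whitespace.
        return [str(p).strip() for p in parsed if str(p).strip()]
    return [text] if text else []
```

In the snippet above, this would be called as `parse_propositions(tokenizer.decode(outputs[0], skip_special_tokens=True))`.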
Input Format
Follow the Propositionizer format:
Title: {title}. Section: {section}. Content: {content}
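The template can be filled in with a one-line helper. The sketch below is illustrative only: `format_input` is a hypothetical name, and leaving `title` or `section` empty mirrors the usage examples, where the section is blank.

```python
def format_input(content: str, title: str = "", section: str = "") -> str:
    """Build the 'Title: ... Section: ... Content: ...' input string."""
    return f"Title: {title}. Section: {section}. Content: {content.strip()}"


prompt = format_input(
    "The deadline is Friday and the rate was reduced.",
    title="Meeting",
)
print(prompt)
```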
Training
v1
- Source texts: CNN/DailyMail, XSum, Wikipedia (EN/KO/JA/ZH), KLUE, XLSum
- Labeling: Claude Haiku 4.5 atomic fact decomposition
- Data: EN 4,879 / KO 2,860 / JA 983 / ZH 975 (~9,700 total)
- Training: 5 epochs, Adafactor, lr=1e-3, batch_size=16
v2 (current)
- Focus: Korean quality improvement
- Additional data: ~5,900 Korean complex sentences (XLSum KO, KLUE NLI/RE/STS, KorQuAD, NSMC)
- Improved prompt: language drift prevention, proper noun preservation
- Training: continued training from v1, 3 epochs, lr=5e-4
- Generation config: repetition_penalty=2.0, no_repeat_ngram_size=3
v2 improvements over v1
- Korean language drift (Korean→English switching) resolved
- Repetition loops eliminated
- Proper noun preservation improved (e.g., CEO, Q2)
Comparison with Original Propositionizer
| | Original (Flan-T5-Large) | This Model (mT5-Small) |
|---|---|---|
| Parameters | 780M | 300M |
| Languages | English only | EN, KO, JA, ZH |
| Teacher | GPT-4 | Claude |
| Training Data | English only | Multilingual |
Known Limitations
- The model is small (300M parameters), so its capacity for complex decompositions is limited
- It may hallucinate facts not present in the source text, especially around uncommon proper nouns
- It works best on short-to-medium paragraphs (under 500 characters)
- Complex Korean sentences with many IT terms may produce errors
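Two of these limitations can be partly handled on the caller's side. The sketch below is an assumption-laden illustration, not part of the model: `chunk_text` greedily packs sentences into chunks of at most 500 characters before decomposition, and `unsupported_terms` is a crude check that flags numbers and capitalized words in a proposition that never appear in the source. Both names are hypothetical, the sentence splitter is a simple regex, and the term check only makes sense for Latin-script text.

```python
import re

MAX_CHARS = 500  # taken from the limitation above; a soft guideline


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    Sentences are re-joined with a space, which is slightly off for
    Chinese and Japanese text.
    """
    # Split after ., !, ? followed by whitespace, or right after CJK stops.
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [s.strip() for s in parts if s and s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def unsupported_terms(proposition: str, source: str) -> set[str]:
    """Return numbers and capitalized words absent from the source text.

    Heuristic only: uses case-insensitive substring matching, so it can
    miss hallucinations and raise false alarms.
    """
    terms = set(re.findall(r"\b(?:\d[\d,.]*|[A-Z][\w-]+)\b", proposition))
    lowered = source.lower()
    return {t for t in terms if t.lower() not in lowered}
```

Each chunk would then be formatted and passed to the model separately, and any proposition with a non-empty `unsupported_terms` result could be dropped or sent for review.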
Citation
@article{chen2023densex,
title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
author={Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong},
journal={arXiv preprint arXiv:2312.06648},
year={2023}
}
Part of MemRosetta
This model is a component of the MemRosetta project for multilingual memory and knowledge extraction.