
Libraries

How to use liliplanet/propositionizer-mt5-small with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

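# mT5 is an encoder-decoder model, so use the text2text-generation task
# (this matches the pipeline_tag declared in the model card)
pipe = pipeline("text2text-generation", model="liliplanet/propositionizer-mt5-small")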

# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small")

Local Apps

vLLM

How to use liliplanet/propositionizer-mt5-small with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "liliplanet/propositionizer-mt5-small"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "liliplanet/propositionizer-mt5-small",
		"prompt": "Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/liliplanet/propositionizer-mt5-small

SGLang

How to use liliplanet/propositionizer-mt5-small with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "liliplanet/propositionizer-mt5-small" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "liliplanet/propositionizer-mt5-small",
		"prompt": "Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "liliplanet/propositionizer-mt5-small" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "liliplanet/propositionizer-mt5-small",
		"prompt": "Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.",
		"max_tokens": 512,
		"temperature": 0.5
	}'


jihun13

Upload README.md with huggingface_hub

7f92d65 verified about 2 months ago


	---
	language:
	- en
	- ko
	- ja
	- zh
	license: apache-2.0
	tags:
	- text2text-generation
	- fact-decomposition
	- propositionizer
	- onnx
	- mt5
	- multilingual
	library_name: transformers
	pipeline_tag: text2text-generation
	base_model: google/mt5-small
	datasets:
	- cnn_dailymail
	- EdinburghNLP/xsum
	- csebuetnlp/xlsum
	- klue
	---

	# Propositionizer-mT5-Small v2 (Multilingual)

	A multilingual atomic fact decomposition model that converts unstructured text into a list of self-contained atomic propositions.

	## Overview

	| Property | Value |
	|----------|-------|
	| Base Model | [google/mt5-small](https://huggingface.co/google/mt5-small) (300M params) |
	| Training Method | Claude → mT5-small distillation |
	| Languages | English, Korean, Japanese, Chinese |
	| Training Data | v1: ~9,700 multilingual examples; v2: ~5,900 additional Korean examples |
	| Format | ONNX (int8 quantized) |
	| License | Apache 2.0 |

	Based on the [Dense X Retrieval](https://arxiv.org/abs/2312.06648) (Propositionizer) approach, extended to multiple languages.

	## Usage

	### Transformers.js (Browser / Node.js)

	```javascript
	import { pipeline } from '@huggingface/transformers';

	const decomposer = await pipeline(
	  'text2text-generation',
	  'liliplanet/propositionizer-mt5-small'
	);

	const result = await decomposer(
	  'Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.',
	  { max_new_tokens: 256, repetition_penalty: 2.0 }
	);
	console.log(JSON.parse(result[0].generated_text));
	// ["The deadline is Friday.", "The rate was reduced."]
	```

	### Python (Transformers)

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small")
	model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small")

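	# Korean example input in the Propositionizer format
	# (roughly: "Assistant Manager Kim lowers the hourly wage and the deadline is Friday.")
	input_text = "Title: 회의. Section: . Content: 김 대리가 시급을 낮추고 마감은 금요일이다."
	inputs = tokenizer(input_text, return_tensors="pt")
	outputs = model.generate(
	    **inputs,
	    max_new_tokens=256,
	    repetition_penalty=2.0,  # same value as the v2 generation config
	    no_repeat_ngram_size=3,
	    num_beams=4,
	)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))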
	```
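
The Transformers.js example above passes the generated text through `JSON.parse`, which suggests the model emits a JSON array of strings. A Python equivalent might look like the sketch below. The helper name `parse_propositions` is hypothetical, and the fallback to the raw text is an assumption about how malformed output could be handled, not documented behavior of the model.

```python
import json


def parse_propositions(generated_text: str) -> list[str]:
    """Parse decoded model output into a list of proposition strings.

    Assumes the output is a JSON array of strings; if it is not valid JSON
    (or not a list), the raw text is returned as a single item instead.
    """
    try:
        parsed = json.loads(generated_text)
    except json.JSONDecodeError:
        parsed = None

    if isinstance(parsed, list):
        # Drop empty entries and surrounding whitespace.
        return [str(item).strip() for item in parsed if str(item).strip()]

    text = generated_text.strip()
    return [text] if text else []


print(parse_propositions('["The deadline is Friday.", "The rate was reduced."]'))
```

In practice the argument would be the string returned by `tokenizer.decode(...)` in the example above.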

	## Input Format

	Follow the Propositionizer format:

	```
	Title: {title}. Section: {section}. Content: {content}
	```
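
The template above can be filled in with a small helper. This is a minimal sketch: the function name `build_input` is hypothetical, and defaulting `section` to an empty string mirrors the `Section: .` form used in the usage examples.

```python
def build_input(title: str, content: str, section: str = "") -> str:
    """Format text with the 'Title: ... Section: ... Content: ...' template.

    `build_input` is an illustrative helper, not part of any library.
    """
    return f"Title: {title}. Section: {section}. Content: {content}"


print(build_input("Meeting", "The deadline is Friday and the rate was reduced."))
```

With an empty section, this reproduces the input string used in the Transformers.js example.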

	## Training

	### v1
	- Source texts: CNN/DailyMail, XSum, Wikipedia (EN/KO/JA/ZH), KLUE, XLSum
	- Labeling: Claude Haiku 4.5 atomic fact decomposition
	- Data: EN 4,879 / KO 2,860 / JA 983 / ZH 975 (~9,700 total)
	- Training: 5 epochs, Adafactor, lr=1e-3, batch_size=16
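
As a quick check of the v1 figures, the per-language counts can be summed and converted to shares. The snippet only restates the numbers listed above; the variable names are illustrative.

```python
# Per-language example counts for v1, copied from the list above.
v1_counts = {"EN": 4879, "KO": 2860, "JA": 983, "ZH": 975}

total = sum(v1_counts.values())
print(f"total: {total}")  # 9697, i.e. the ~9,700 quoted above

# Share of each language in the v1 training set, in percent.
shares = {lang: round(100 * n / total, 1) for lang, n in v1_counts.items()}
for lang, share in shares.items():
    print(f"{lang}: {share}%")
```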

	### v2 (current)
	- Focus: Korean quality improvement
	- Additional data: ~5,900 Korean complex sentences (XLSum KO, KLUE NLI/RE/STS, KorQuAD, NSMC)
	- Improved prompt: language drift prevention, proper noun preservation
	- Training: continued training from v1, 3 epochs, lr=5e-4
	- Generation config: repetition_penalty=2.0, no_repeat_ngram_size=3

	### v2 improvements over v1
	- Korean language drift (output switching from Korean to English) resolved
	- Repetition loops eliminated
	- Proper noun preservation improved (e.g., CEO, Q2)

	## Comparison with Original Propositionizer

	| | Original (Flan-T5-Large) | This Model (mT5-Small) |
	|---|---|---|
	| Parameters | 780M | 300M |
	| Languages | English only | EN, KO, JA, ZH |
	| Teacher | GPT-4 | Claude |
	| Training Data | English only | Multilingual |

	## Known Limitations

	- Small model (300M) has limited capacity for complex decompositions
	- May hallucinate facts not present in the source text, especially with uncommon proper nouns
	- Best suited for short-to-medium length paragraphs (< 500 chars)
	- Korean complex sentences with many IT terms may produce errors
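
Given the length guidance above, longer passages could be split into smaller pieces before decomposition. The following is only a sketch. `chunk_text` is a hypothetical helper, the 500-character default comes from the limitation listed above, and the regex sentence splitter is a naive assumption that will not handle every abbreviation or punctuation style.

```python
import re

# Split after Latin sentence punctuation followed by whitespace, or directly
# after CJK sentence punctuation (which is usually not followed by a space).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])\s*")


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than being
    cut mid-sentence. Sentences within a chunk are joined with one space.
    """
    sentences = [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))
```

Each chunk would then be wrapped in the `Title: ... Section: ... Content: ...` template and passed to the model separately.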

	## Citation

	```bibtex
	@article{chen2023densex,
	title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
	author={Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong},
	journal={arXiv preprint arXiv:2312.06648},
	year={2023}
	}
	```

	## Part of MemRosetta

	This model is a component of the [MemRosetta](https://github.com/obst2580/memrosetta) project for multilingual memory and knowledge extraction.