language:
  - en
  - ko
  - ja
  - zh
license: apache-2.0
tags:
  - text2text-generation
  - fact-decomposition
  - propositionizer
  - onnx
  - mt5
  - multilingual
library_name: transformers
pipeline_tag: text2text-generation
base_model: google/mt5-small
datasets:
  - cnn_dailymail
  - EdinburghNLP/xsum
  - csebuetnlp/xlsum
  - klue

Propositionizer-mT5-Small v2 (Multilingual)

A multilingual atomic fact decomposition model that converts unstructured text into a list of self-contained atomic propositions.

Overview

Property	Value
Base Model	google/mt5-small (300M params)
Training Method	Claude → mT5-small distillation
Languages	English, Korean, Japanese, Chinese
Training Data	v1: ~9,700 multilingual examples; v2: ~5,900 additional Korean examples
Format	ONNX (int8 quantized)
License	Apache 2.0

Based on the Dense X Retrieval (Propositionizer) approach, extended to multilingual input.

Usage

Transformers.js (Browser / Node.js)

import { pipeline } from '@huggingface/transformers';

const decomposer = await pipeline(
  'text2text-generation',
  'liliplanet/propositionizer-mt5-small'
);

const result = await decomposer(
  'Title: Meeting. Section: . Content: The deadline is Friday and the rate was reduced.',
  { max_new_tokens: 256, repetition_penalty: 2.0 }
);
console.log(JSON.parse(result[0].generated_text));
// ["The deadline is Friday.", "The rate was reduced."]

Python (Transformers)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("liliplanet/propositionizer-mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("liliplanet/propositionizer-mt5-small")

# Korean example input. Roughly: "Title: Meeting. Content: Assistant Manager Kim
# lowers the hourly wage and the deadline is Friday."
input_text = "Title: 회의. Section: . Content: 김 대리가 시급을 낮추고 마감은 금요일이다."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=2.0,
    no_repeat_ngram_size=3,
    num_beams=4,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
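
The Transformers.js example parses the generated text as a JSON array, so the decoded Python output can presumably be handled the same way. The following is a minimal sketch under that assumption. The helper name `parse_propositions` is hypothetical, and the line-by-line fallback for output that is not valid JSON is an illustrative choice, not documented model behavior.

```python
import json


def parse_propositions(generated_text: str) -> list[str]:
    """Turn decoded model output into a list of proposition strings.

    Assumes the model emits a JSON array of strings, as in the
    Transformers.js example. The fallback below is an assumption.
    """
    text = generated_text.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fallback (assumption): treat each non-empty line as one proposition.
        return [line.strip() for line in text.splitlines() if line.strip()]
    if isinstance(parsed, list):
        # Keep non-empty items, coerced to strings.
        return [str(item).strip() for item in parsed if str(item).strip()]
    # Valid JSON that is not a list: return it as a single proposition.
    return [str(parsed).strip()]
```

In practice the argument would be the string returned by `tokenizer.decode(outputs[0], skip_special_tokens=True)`.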

Input Format

Follow the Propositionizer format:

Title: {title}. Section: {section}. Content: {content}
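
The template above can be filled in with a small helper. This is a sketch only: `build_input` is a hypothetical name, and collapsing runs of whitespace in the content is an assumption rather than a documented requirement. Empty `title` or `section` values are left blank, matching the usage examples where `Section:` is empty.

```python
def build_input(content: str, title: str = "", section: str = "") -> str:
    """Format text into the 'Title / Section / Content' input template."""
    # Assumption: collapse newlines and repeated spaces so the input is one line.
    content = " ".join(content.split())
    return f"Title: {title}. Section: {section}. Content: {content}"


# Reproduces the input string used in the Transformers.js example.
print(build_input("The deadline is Friday and the rate was reduced.", title="Meeting"))
```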

Training

v1

Source texts: CNN/DailyMail, XSum, Wikipedia (EN/KO/JA/ZH), KLUE, XLSum
Labeling: Claude Haiku 4.5 atomic fact decomposition
Data: EN 4,879 / KO 2,860 / JA 983 / ZH 975 (~9,700 total)
Training: 5 epochs, Adafactor, lr=1e-3, batch_size=16

v2 (current)

Focus: Korean quality improvement
Additional data: ~5,900 Korean complex sentences (XLSum KO, KLUE NLI/RE/STS, KorQuAD, NSMC)
Improved prompt: language drift prevention, proper noun preservation
Training: continued training from v1, 3 epochs, lr=5e-4
Generation config: repetition_penalty=2.0, no_repeat_ngram_size=3

v2 improvements over v1

Korean language drift (output switching from Korean to English) resolved
Repetition loops eliminated
Proper noun preservation improved (e.g., CEO, Q2)

Comparison with Original Propositionizer

	Original (Flan-T5-Large)	This Model (mT5-Small)
Parameters	780M	300M
Languages	English only	EN, KO, JA, ZH
Teacher	GPT-4	Claude
Training Data	English only	Multilingual

Known Limitations

The small model (300M parameters) has limited capacity for complex decompositions
May hallucinate facts not present in the source text, especially with uncommon proper nouns
Best suited for short-to-medium length paragraphs (under 500 characters)
Complex Korean sentences with many IT terms may produce errors
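
Given the length limitation above, one possible workaround is to split longer text into chunks of fewer than 500 characters before running the model on each chunk. The sketch below packs whole sentences greedily. The helper name `chunk_text`, the sentence-boundary regex, and joining sentences with a single space are all assumptions made for illustration; they are not part of the model or its training setup.

```python
import re

MAX_CHARS = 500  # from the "under 500 characters" guidance above

# Assumption: sentences end at ASCII punctuation followed by whitespace,
# or directly after CJK full-width punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks shorter than max_chars."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s and s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) < max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars is kept whole.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be wrapped in the `Title: ... Section: ... Content: ...` template and decomposed separately. Splitting can separate a pronoun from its referent, so propositions from later chunks may be less self-contained.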

Citation

@article{chen2023densex,
  title={Dense X Retrieval: What Retrieval Granularity Should We Use?},
  author={Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong},
  journal={arXiv preprint arXiv:2312.06648},
  year={2023}
}

Part of MemRosetta

This model is a component of the MemRosetta project for multilingual memory and knowledge extraction.