Txukun — Capitalization & Punctuation Restoration (ONNX)

ONNX export of HiTZ/cap-punct-eu for browser and edge inference with Transformers.js. Weights are dynamically quantized to int8 for faster downloads (77 MB vs 297 MB fp32).

Model

MarianMT model, trained on 9.78M Basque sentences, that restores capitalization and punctuation in lowercase, unpunctuated text, typically the output of automatic speech recognition (ASR) systems.

Original model: HiTZ/cap-punct-eu by HiTZ Zentroa (UPV/EHU)

Architecture

Type: MarianMT (6 encoder + 6 decoder layers)
d_model: 512
Tokenizer: SentencePiece (Unigram, 32k vocab), shipped as custom tokenizer.json
Format: Int8 dynamically quantized ONNX (Q/DQ nodes, compatible with ORT Web WASM)
Input: Lowercase, punctuationless Basque text
Output: Properly capitalized and punctuated Basque text
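
For testing, the expected input format can be approximated from ordinary text. The helper below is a minimal, standard-library sketch: the name `to_asr_style` is hypothetical, and treating every Unicode punctuation character as a separator is an assumption, since the exact normalization used to build the training data is not stated here.

```python
import unicodedata


def to_asr_style(text: str) -> str:
    """Approximate raw ASR output: lowercase text with punctuation removed."""
    # Assumption: any character in a Unicode "P*" category counts as punctuation.
    # It is replaced by a space so that words joined by a hyphen do not merge.
    spaced = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Lowercase and collapse runs of whitespace into single spaces.
    return " ".join(spaced.lower().split())


print(to_asr_style("Kaixo, zer moduz zaude?"))
# → "kaixo zer moduz zaude"
```

Feeding the result back through the model and comparing against the original text is one simple way to spot-check restoration quality.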

Files

File	Size	Description
`encoder_model_quantized.onnx`	34 MB	Encoder (IR 8, int8 quantized)
`decoder_model_merged_quantized.onnx`	41 MB	Decoder with KV-cache (IR 8, int8 quantized)
`encoder_model.onnx`	136 MB	Encoder (fp32, for reference / non-WASM use)
`decoder_model_merged.onnx`	160 MB	Decoder with KV-cache (fp32, for reference / non-WASM use)
`tokenizer.json`	2.1 MB	Custom Unigram + Metaspace pre-tokenizer (for Transformers.js)
`source.spm`	842 KB	SentencePiece model (for Python / HF MarianTokenizer)
`vocab.json`	2.1 MB	Vocab mapping (for Python / HF MarianTokenizer)
`config.json`	979 B	Model configuration
`tokenizer_config.json`	864 B	Tokenizer metadata
`generation_config.json`	288 B	Generation defaults
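
If only the int8 files are needed from Python, a selective download might look like the sketch below. It assumes the `huggingface_hub` package is installed; `fetch_quantized` and `QUANTIZED_FILES` are illustrative names, and the file list simply mirrors the table above.

```python
# Files needed for int8 inference, taken from the table above.
QUANTIZED_FILES = [
    "encoder_model_quantized.onnx",
    "decoder_model_merged_quantized.onnx",
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "source.spm",
    "vocab.json",
]


def fetch_quantized(repo_id: str = "itzune/txukun-cap-punct-eu") -> str:
    """Download only the int8 model and its config/tokenizer files.

    Returns the local directory that holds the downloaded snapshot.
    """
    # Imported lazily so the file list can be inspected without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, allow_patterns=QUANTIZED_FILES)
```

Calling `fetch_quantized()` skips the two fp32 `.onnx` files, which account for most of the repository size.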

Usage with Transformers.js (browser)

import { pipeline } from '@huggingface/transformers';

const corrector = await pipeline(
  'translation',
  'itzune/txukun-cap-punct-eu',
  { device: 'wasm', dtype: 'q8' }
);

const result = await corrector('kaixo zer moduz zaude');
console.log(result[0].translation_text);
// → "Kaixo, zer moduz zaude?"

Usage with Python (ONNX Runtime via optimum)

Install dependencies:

pip install "optimum[onnxruntime]" sentencepiece

Basic inference:

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

model_id = "itzune/txukun-cap-punct-eu"

# Load int8 quantized ONNX model
model = ORTModelForSeq2SeqLM.from_pretrained(
    model_id,
    encoder_file_name="encoder_model_quantized.onnx",
    decoder_file_name="decoder_model_merged_quantized.onnx",
    decoder_with_past_file_name="decoder_model_merged_quantized.onnx",
    provider="CPUExecutionProvider",
    use_cache=True,
)

# Tokenizer: loaded from the original HiTZ repo (this repo also ships source.spm and vocab.json)
tokenizer = AutoTokenizer.from_pretrained("HiTZ/cap-punct-eu")

# Create pipeline
corrector = pipeline("translation", model=model, tokenizer=tokenizer, max_length=512)

# Correct text
result = corrector("euskal herrian euskaraz bizi nahi dugu")
print(result[0]["translation_text"])
# → "Euskal Herrian euskaraz bizi nahi dugu."

For a complete CLI tool using this model, see txukun-cli.
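
ASR transcripts are often much longer than a single sentence, so it may help to split them before calling the pipeline. The following is a minimal sketch, not part of the original tooling: `chunk_words` is a hypothetical helper, and the default of 40 words per chunk is an arbitrary assumption that should be tuned against the `max_length` used above.

```python
def chunk_words(text: str, max_words: int = 40) -> list[str]:
    """Split unpunctuated text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in strides of max_words.
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


print(chunk_words("kaixo zer moduz zaude gaur", max_words=2))
# → ['kaixo zer', 'moduz zaude', 'gaur']
```

With the `corrector` pipeline from the example above, each chunk can be processed in turn, for instance `[corrector(c)[0]["translation_text"] for c in chunk_words(transcript)]`. Fixed-size chunks can cut a sentence in half, so punctuation near chunk boundaries may be less reliable.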

Quantization details

Dynamically quantized with onnxruntime.quantization.quantize_dynamic, using weight_type=QuantType.QInt8 and extra_options={"EnableSubgraph": True}. The EnableSubgraph option makes the quantizer traverse into the If-node subgraphs of the merged decoder, so MatMul operations in both branches are quantized. Results:

Encoder: 136 MB → 34 MB (75% reduction)
Decoder: 160 MB → 41 MB (74% reduction)
Total: 297 MB → 77 MB (74% reduction)
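
The quantization step and the reported reductions can be sketched as follows. This is a reconstruction under assumptions rather than the exact export script: `quantize_to_int8` and `reduction_percent` are hypothetical names, and the input and output paths are up to the caller. Only `quantize_dynamic`, `QuantType.QInt8`, and the `EnableSubgraph` option come from the description above.

```python
from pathlib import Path


def quantize_to_int8(src: Path, dst: Path) -> None:
    """Dynamically quantize one ONNX file to int8 weights."""
    # Imported lazily so the arithmetic below runs without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=str(src),
        model_output=str(dst),
        weight_type=QuantType.QInt8,
        # Also quantize inside the If-node subgraphs of the merged decoder.
        extra_options={"EnableSubgraph": True},
    )


def reduction_percent(before_mb: float, after_mb: float) -> int:
    """Size reduction as a whole-number percentage."""
    return round(100 * (1 - after_mb / before_mb))


print(reduction_percent(136, 34))  # encoder → 75
print(reduction_percent(160, 41))  # decoder → 74
print(reduction_percent(297, 77))  # total   → 74
```

The same function would be called once per file, for example on `encoder_model.onnx` and `decoder_model_merged.onnx`.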

Part of Txukun

This model powers Txukun, a browser-based Basque text cleaning tool. Visit itzune.eus/txukun to try it.

License

Apache 2.0 (same as original HiTZ/cap-punct-eu)
