Txukun: Capitalization & Punctuation Restoration (ONNX)

ONNX export of HiTZ/cap-punct-eu for browser and edge inference with Transformers.js. Weights are dynamically quantized to int8 for faster downloads (77 MB vs 297 MB fp32).

Model

A MarianMT model trained on 9.78M Basque sentences that restores capitalization and punctuation in lowercase, unpunctuated text, typically the output of automatic speech recognition (ASR) systems.

Original model: HiTZ/cap-punct-eu by HiTZ Zentroa (UPV/EHU)

Architecture

  • Type: MarianMT (6 encoder + 6 decoder layers)
  • d_model: 512
  • Tokenizer: SentencePiece (Unigram, 32k vocab), shipped as a custom tokenizer.json
  • Format: Int8 dynamically quantized ONNX (Q/DQ nodes, compatible with ORT Web WASM)
  • Input: Lowercase, punctuationless Basque text
  • Output: Properly capitalized and punctuated Basque text
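
ASR output usually arrives in the expected input form already, but text from other sources may need normalizing first. The following is a minimal sketch of such a step; the helper name `strip_case_and_punct` is hypothetical, and replacing every Unicode punctuation character (including hyphens and apostrophes) with a space is an assumption about what "punctuationless" means here, not a documented preprocessing rule of the model.

```python
import unicodedata


def strip_case_and_punct(text: str) -> str:
    """Lowercase text and drop punctuation (hypothetical helper).

    Assumption: every character in a Unicode punctuation category ("P*")
    is replaced by a space, then runs of whitespace are collapsed.
    """
    chars = []
    for ch in text:
        if unicodedata.category(ch).startswith("P"):
            chars.append(" ")
        else:
            chars.append(ch.lower())
    return " ".join("".join(chars).split())


if __name__ == "__main__":
    print(strip_case_and_punct("Kaixo, zer moduz zaude?"))
    # prints: kaixo zer moduz zaude
```

The same helper can be used to build evaluation pairs: strip a correctly written sentence, run the model on the result, and compare the output with the original.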

Files

File | Size | Description
encoder_model_quantized.onnx | 34 MB | Encoder (IR 8, int8 quantized)
decoder_model_merged_quantized.onnx | 41 MB | Decoder with KV-cache (IR 8, int8 quantized)
encoder_model.onnx | 136 MB | Encoder (fp32, for reference / non-WASM use)
decoder_model_merged.onnx | 160 MB | Decoder with KV-cache (fp32, for reference / non-WASM use)
tokenizer.json | 2.1 MB | Custom Unigram + Metaspace pre-tokenizer (for Transformers.js)
source.spm | 842 KB | SentencePiece model (for Python / HF MarianTokenizer)
vocab.json | 2.1 MB | Vocab mapping (for Python / HF MarianTokenizer)
config.json | 979 B | Model configuration
tokenizer_config.json | 864 B | Tokenizer metadata
generation_config.json | 288 B | Generation defaults
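
Because the repository ships both int8 and fp32 weights, a Python user may want to fetch only one variant. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function with `allow_patterns`; the names `files_for` and `download`, and the grouping of files into "int8" and "fp32" variants, are illustrative and taken only from the file list above.

```python
REPO_ID = "itzune/txukun-cap-punct-eu"

# Small files needed regardless of which weights are used.
SHARED_FILES = [
    "config.json",
    "generation_config.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "source.spm",
    "vocab.json",
]

# Weight files per variant, as listed in the Files section.
MODEL_FILES = {
    "int8": ["encoder_model_quantized.onnx", "decoder_model_merged_quantized.onnx"],
    "fp32": ["encoder_model.onnx", "decoder_model_merged.onnx"],
}


def files_for(variant: str) -> list:
    """Return the file names to fetch for one variant (hypothetical helper)."""
    if variant not in MODEL_FILES:
        raise ValueError(f"unknown variant: {variant!r}")
    return MODEL_FILES[variant] + SHARED_FILES


def download(variant: str = "int8") -> str:
    """Download only the selected files and return the local snapshot path."""
    # Imported lazily so files_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, allow_patterns=files_for(variant))
```

Calling `download("int8")` skips the two fp32 files, which account for most of the repository size.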

Usage with Transformers.js (browser)

import { pipeline } from '@huggingface/transformers';

const corrector = await pipeline(
  'translation',
  'itzune/txukun-cap-punct-eu',
  { device: 'wasm', dtype: 'q8' }
);

const result = await corrector('kaixo zer moduz zaude');
console.log(result[0].translation_text);
// → "Kaixo, zer moduz zaude?"

Usage with Python (ONNX Runtime via optimum)

Install dependencies:

pip install "optimum[onnxruntime]" sentencepiece

Basic inference:

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

model_id = "itzune/txukun-cap-punct-eu"

# Load int8 quantized ONNX model
model = ORTModelForSeq2SeqLM.from_pretrained(
    model_id,
    encoder_file_name="encoder_model_quantized.onnx",
    decoder_file_name="decoder_model_merged_quantized.onnx",
    decoder_with_past_file_name="decoder_model_merged_quantized.onnx",
    provider="CPUExecutionProvider",
    use_cache=True,
)

# Tokenizer: loaded here from the original HiTZ repo (this repo's tokenizer files are an alternative)
tokenizer = AutoTokenizer.from_pretrained("HiTZ/cap-punct-eu")

# Create pipeline
corrector = pipeline("translation", model=model, tokenizer=tokenizer, max_length=512)

# Correct text
result = corrector("euskal herrian euskaraz bizi nahi dugu")
print(result[0]["translation_text"])
# → "Euskal Herrian euskaraz bizi nahi dugu."
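
The pipeline above is created with max_length=512, so a long ASR transcript may need to be split before correction. Since the input has no punctuation to split on, one simple option is to cut on word counts. This is a sketch under stated assumptions: the helpers `chunk_words` and `correct_long` are hypothetical, the default of 100 words per chunk is an arbitrary choice rather than a tested limit, and punctuation near chunk boundaries may be less reliable because each chunk is corrected without its neighbors.

```python
def chunk_words(text: str, max_words: int = 100) -> list:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def correct_long(corrector, text: str, max_words: int = 100) -> str:
    """Run a translation-style pipeline chunk by chunk and join the results.

    `corrector` is any callable that returns [{"translation_text": ...}],
    such as the pipeline object created in the example above.
    """
    pieces = []
    for chunk in chunk_words(text, max_words):
        pieces.append(corrector(chunk)[0]["translation_text"])
    return " ".join(pieces)
```

With the pipeline from the previous example, usage would look like `correct_long(corrector, transcript)`.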

For a complete CLI tool using this model, see txukun-cli.

Quantization details

Dynamically quantized with onnxruntime.quantization.quantize_dynamic, using weight_type=QuantType.QInt8 and extra_options={"EnableSubgraph": True}. The EnableSubgraph option makes the quantizer descend into the If-node subgraphs of the merged decoder, so MatMul operations in both branches are quantized. Results:

  • Encoder: 136 MB → 34 MB (75% reduction)
  • Decoder: 160 MB → 41 MB (74% reduction)
  • Total: 297 MB → 77 MB (74% reduction)
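
The quantization call and the reduction figures can be sketched as follows. This is an illustration rather than the exact script used to produce the files: `quantize_to_int8` and `reduction_percent` are hypothetical helper names, the file paths are supplied by the caller, and only the `weight_type` and `extra_options` arguments come from the description above.

```python
from pathlib import Path


def reduction_percent(before_mb: float, after_mb: float) -> float:
    """Return the size reduction as a percentage of the original size."""
    return (before_mb - after_mb) / before_mb * 100


def quantize_to_int8(src: str, dst: str) -> Path:
    """Dynamically quantize one ONNX file to int8 (hypothetical helper)."""
    # Imported lazily so reduction_percent() works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=src,
        model_output=dst,
        weight_type=QuantType.QInt8,
        # Also quantize MatMuls inside the If-node subgraphs of the merged decoder.
        extra_options={"EnableSubgraph": True},
    )
    return Path(dst)


if __name__ == "__main__":
    sizes = [("Encoder", 136, 34), ("Decoder", 160, 41), ("Total", 297, 77)]
    for name, before, after in sizes:
        print(f"{name}: {reduction_percent(before, after):.0f}% reduction")
```

Run as a script, the loop reproduces the percentages listed above from the stated file sizes.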

Part of Txukun

This model powers Txukun, a browser-based Basque text cleaning tool. Visit itzune.eus/txukun to try it.

License

Apache 2.0 (same as original HiTZ/cap-punct-eu)
