Txukun: Capitalization & Punctuation Restoration (ONNX)
ONNX export of HiTZ/cap-punct-eu for browser and edge inference with Transformers.js. Weights are dynamically quantized to int8 for faster downloads (77 MB vs 297 MB fp32).
Model
A MarianMT model, trained on 9.78M Basque sentences, that restores capitalization and punctuation in lowercase, punctuationless text, typically the output of automatic speech recognition (ASR) systems.
Original model: HiTZ/cap-punct-eu by HiTZ Zentroa (UPV/EHU)
Architecture
- Type: MarianMT (6 encoder + 6 decoder layers)
- d_model: 512
- Tokenizer: SentencePiece (Unigram, 32k vocab), shipped as custom tokenizer.json
- Format: Int8 dynamically quantized ONNX (Q/DQ nodes, compatible with ORT Web WASM)
- Input: Lowercase, punctuationless Basque text
- Output: Properly capitalized and punctuated Basque text
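Because the model expects lowercase, punctuation-free input, text from other sources may need to be normalized first. The following is a minimal sketch and not part of the original model: `normalize_asr_text` is a hypothetical helper, and replacing every Unicode punctuation character with a space is an assumption that may not match how the training data was prepared.

```python
import unicodedata


def normalize_asr_text(text: str) -> str:
    """Lowercase text and replace punctuation with spaces (hypothetical helper)."""
    chars = []
    for ch in text.lower():
        # Unicode categories starting with "P" cover punctuation marks.
        if unicodedata.category(ch).startswith("P"):
            chars.append(" ")
        else:
            chars.append(ch)
    # Collapse the runs of whitespace left behind by removed punctuation.
    return " ".join("".join(chars).split())


print(normalize_asr_text("Kaixo, zer moduz zaude?"))
# kaixo zer moduz zaude
```

Replacing punctuation with a space, rather than deleting it, keeps hyphenated words from being fused into a single token; whether that is the better choice for Basque compounds is a judgment call.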
Files
| File | Size | Description |
|---|---|---|
| encoder_model_quantized.onnx | 34 MB | Encoder (IR 8, int8 quantized) |
| decoder_model_merged_quantized.onnx | 41 MB | Decoder with KV-cache (IR 8, int8 quantized) |
| encoder_model.onnx | 136 MB | Encoder (fp32, for reference / non-WASM use) |
| decoder_model_merged.onnx | 160 MB | Decoder with KV-cache (fp32, for reference / non-WASM use) |
| tokenizer.json | 2.1 MB | Custom Unigram + Metaspace pre-tokenizer (for Transformers.js) |
| source.spm | 842 KB | SentencePiece model (for Python / HF MarianTokenizer) |
| vocab.json | 2.1 MB | Vocab mapping (for Python / HF MarianTokenizer) |
| config.json | 979 B | Model configuration |
| tokenizer_config.json | 864 B | Tokenizer metadata |
| generation_config.json | 288 B | Generation defaults |
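Since the fp32 files are only kept for reference, a local download can skip them. The sketch below shows which of the listed files a set of glob patterns would select. The pattern list and the helper names `select_files` and `download_quantized` are assumptions for illustration; the download function relies on `snapshot_download` and its `allow_patterns` argument from `huggingface_hub`, which is imported lazily so the selection logic runs without it.

```python
from fnmatch import fnmatch

# File names as listed in the table above.
REPO_FILES = [
    "encoder_model_quantized.onnx",
    "decoder_model_merged_quantized.onnx",
    "encoder_model.onnx",
    "decoder_model_merged.onnx",
    "tokenizer.json",
    "source.spm",
    "vocab.json",
    "config.json",
    "tokenizer_config.json",
    "generation_config.json",
]

# Assumed selection: quantized ONNX graphs plus tokenizer and config files.
ALLOW_PATTERNS = ["*_quantized.onnx", "*.json", "source.spm"]


def select_files(files, patterns):
    """Return the files that match at least one glob pattern."""
    return [f for f in files if any(fnmatch(f, p) for p in patterns)]


def download_quantized(repo_id="itzune/txukun-cap-punct-eu"):
    """Download only the selected files (requires `pip install huggingface_hub`)."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(repo_id=repo_id, allow_patterns=ALLOW_PATTERNS)


selected = select_files(REPO_FILES, ALLOW_PATTERNS)
print(len(selected), "of", len(REPO_FILES), "files selected")
```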
Usage with Transformers.js (browser)
import { pipeline } from '@huggingface/transformers';
const corrector = await pipeline(
  'translation',
  'itzune/txukun-cap-punct-eu',
  { device: 'wasm', dtype: 'q8' }
);
const result = await corrector('kaixo zer moduz zaude');
console.log(result[0].translation_text);
// → "Kaixo, zer moduz zaude?"
Usage with Python (ONNX Runtime via optimum)
Install dependencies:
pip install "optimum[onnxruntime]" sentencepiece
Basic inference:
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline
model_id = "itzune/txukun-cap-punct-eu"
# Load int8 quantized ONNX model
model = ORTModelForSeq2SeqLM.from_pretrained(
    model_id,
    encoder_file_name="encoder_model_quantized.onnx",
    decoder_file_name="decoder_model_merged_quantized.onnx",
    decoder_with_past_file_name="decoder_model_merged_quantized.onnx",
    provider="CPUExecutionProvider",
    use_cache=True,
)
# Tokenizer: loaded here from the original HiTZ repo (this repo also ships source.spm and vocab.json)
tokenizer = AutoTokenizer.from_pretrained("HiTZ/cap-punct-eu")
# Create pipeline
corrector = pipeline("translation", model=model, tokenizer=tokenizer, max_length=512)
# Correct text
result = corrector("euskal herrian euskaraz bizi nahi dugu")
print(result[0]["translation_text"])
# → "Euskal Herrian euskaraz bizi nahi dugu."
For a complete CLI tool using this model, see txukun-cli.
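The Python pipeline above is created with `max_length=512`, so very long transcripts may be worth splitting before correction. A minimal sketch, assuming fixed-size word windows are acceptable: `chunk_words` and its 40-word default are hypothetical choices, not something this model card specifies. Windows cut at arbitrary points, so punctuation near chunk boundaries may be less reliable.

```python
def chunk_words(text: str, max_words: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = chunk_words("euskal herrian euskaraz bizi nahi dugu", max_words=4)
print(chunks)
# ['euskal herrian euskaraz bizi', 'nahi dugu']
```

Each chunk could then be passed to `corrector` and the corrected outputs joined with spaces.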
Quantization details
Dynamically quantized with onnxruntime.quantization.quantize_dynamic, using weight_type=QuantType.QInt8 and extra_options={"EnableSubgraph": True}. The EnableSubgraph option makes the quantizer traverse the If-node subgraphs of the merged decoder, so MatMul operations in both branches are quantized. Results:
- Encoder: 136 MB → 34 MB (75% reduction)
- Decoder: 160 MB → 41 MB (74% reduction)
- Total: 297 MB → 77 MB (74% reduction)
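The quantization step and the reported reductions could be reproduced roughly as follows. This is a sketch under assumptions: the file paths passed to `quantize_model` are up to the caller, `quantize_model` and `reduction_percent` are hypothetical helper names, and the actual export script may have used additional options. `onnxruntime` is imported inside the function, so the size arithmetic runs without it.

```python
def quantize_model(fp32_path: str, int8_path: str) -> None:
    """Dynamically quantize an ONNX model to int8 (requires onnxruntime)."""
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        weight_type=QuantType.QInt8,
        # Also quantize inside the If-node subgraphs of the merged decoder.
        extra_options={"EnableSubgraph": True},
    )


def reduction_percent(before_mb: float, after_mb: float) -> int:
    """Size reduction as a whole-number percentage."""
    return round(100 * (1 - after_mb / before_mb))


# Sizes in MB as reported in the list above.
SIZES = [("Encoder", 136, 34), ("Decoder", 160, 41), ("Total", 297, 77)]
for name, before, after in SIZES:
    print(f"{name}: {reduction_percent(before, after)}% reduction")
```

A call such as `quantize_model("encoder_model.onnx", "encoder_model_quantized.onnx")` would apply the same settings to the encoder.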
Part of Txukun
This model powers Txukun, a browser-based Basque text cleaning tool. Visit itzune.eus/txukun to try it.
License
Apache 2.0 (same as original HiTZ/cap-punct-eu)