URBERT - Hugging Face Model Card

This repository provides the URBERT backbone as a Hugging Face transformers model. The uploaded checkpoint is a BERT encoder trained in the URBERT pipeline with character-level uroman tokenization.

Model Details

Model type: BERT backbone (AutoModel)
Base architecture: bert-base-uncased config family
Tokenizer: custom character-level tokenizer packaged to load with AutoTokenizer
Training objectives in the full pipeline:
- Text MLM (masked language modeling)
- Audio distillation (in multitask training)
This HF export contains the backbone encoder only; the training heads for these objectives are not included.

Quick Start

import torch
from transformers import AutoModel, AutoTokenizer

REPO_ID = "Sanghyang00/urbert-256"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Downloads on first use and reads from the local cache afterwards
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModel.from_pretrained(REPO_ID).to(device).eval()

text = "hello urbert"
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)
    last_hidden = outputs.last_hidden_state

print("input_ids:", inputs["input_ids"].tolist())
print("input shape:", tuple(inputs["input_ids"].shape))
print("last_hidden shape:", tuple(last_hidden.shape))

Notes on Tokenization

The tokenizer is character-level and is packaged to load through AutoTokenizer.
Special-token behavior follows HF conventions:
- Example: the string "[MASK]" is encoded as a single special token by the HF tokenizer, not character by character.
If you compare against a legacy local tokenizer implementation, special-token string handling may differ even when the encoding of normal text matches.
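
To make that difference concrete, the sketch below contrasts naive per-character splitting with splitting that keeps special-token strings intact. It is a toy illustration in plain Python, not the URBERT tokenizer: the `SPECIAL_TOKENS` list and both helper functions are hypothetical, and the real special tokens should be read from the loaded tokenizer.

```python
import re

# Hypothetical special-token strings, for illustration only; inspect
# tokenizer.all_special_tokens on the loaded tokenizer for the real set.
SPECIAL_TOKENS = ["[MASK]", "[PAD]", "[CLS]", "[SEP]", "[UNK]"]

# Try each special token as a unit first, then fall back to one character.
_PATTERN = re.compile(
    "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + "|.", re.DOTALL
)


def split_naive(text):
    """Split every character separately, including those inside "[MASK]"."""
    return list(text)


def split_with_specials(text):
    """Split into characters but keep each special-token string as one piece."""
    return _PATTERN.findall(text)


sample = "ab[MASK]c"
print(split_naive(sample))          # brackets and letters become separate pieces
print(split_with_specials(sample))  # "[MASK]" stays a single piece
```

A mismatch of this kind only shows up on inputs that contain special-token strings, which is why two tokenizers can agree on ordinary text and still disagree here.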

Intended Use

Feature extraction from URBERT backbone hidden states
Initialization for downstream tasks that use uroman character-level representations
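
For feature extraction, one common way to turn per-character hidden states into a single vector per input is masked mean pooling. The sketch below is an assumption about how a user might pool the outputs, not a method described by the URBERT authors; `mean_pool` is a hypothetical helper, and dummy tensors stand in for model outputs so the example runs offline.

```python
import torch


def mean_pool(last_hidden, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    last_hidden:    (batch, seq_len, hidden) float tensor
    attention_mask: (batch, seq_len) tensor, 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                    # avoid division by zero
    return summed / counts


# Dummy tensors stand in for model outputs; the third position is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
print(tuple(pooled.shape))
```

With the real model, the call would be `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`, assuming the tokenizer returns an `attention_mask` as HF tokenizers usually do.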

Limitations

This export is the backbone encoder, not the full multitask training head.
Domain and language coverage are constrained by the training data used in URBERT experiments.
Additional task-specific fine-tuning may be required for production use.
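
As a rough illustration of what task-specific fine-tuning could start from, the sketch below attaches a per-token linear head to an encoder. `TokenClassifier` and `DummyEncoder` are hypothetical names introduced here; the dummy encoder only mimics the `last_hidden_state` output field so the example runs without downloading the checkpoint, and the sizes are arbitrary.

```python
from types import SimpleNamespace

import torch
from torch import nn


class TokenClassifier(nn.Module):
    """Hypothetical wrapper: backbone encoder plus a per-token linear head."""

    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # One logit vector per input character: (batch, seq_len, num_labels)
        return self.head(out.last_hidden_state)


class DummyEncoder(nn.Module):
    """Offline stand-in that mimics the `last_hidden_state` output field."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))


# Arbitrary illustrative sizes; not the URBERT vocabulary or hidden size.
encoder = DummyEncoder(vocab_size=32, hidden_size=16)
clf = TokenClassifier(encoder, hidden_size=16, num_labels=3)
logits = clf(torch.randint(0, 32, (2, 5)))
print(tuple(logits.shape))
```

To use the actual backbone, pass the model loaded with `AutoModel.from_pretrained(REPO_ID)` as `encoder` and read `hidden_size` from `model.config.hidden_size` rather than hard-coding it.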

Training Reference

For training code, data processing, and experiment setup, please refer to:

https://github.com/sanghyang00/ur-bert

Citation

If you use this model in your research or applications, please cite:

@article{lee2026urbert,
  title   = {UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction},
  author  = {Lee, Sangmin and Ahn, Eekgyun and Choi, Woongjib and Kang, Hong-Goo},
  journal = {arXiv preprint arXiv:2606.11681},
  year    = {2026}
}

Checkpoint format: Safetensors, 86.1M params, tensor type F32
