text2tobi — `libri+peoples+sbc`

ToBI prosodic annotation from text alone. No audio required at inference time.

Given a stream of lowercased words, this model predicts:

Intonation unit boundaries — where a prosodic phrase ends
Intonation direction at each boundary — rising (H%), falling (L%), or level (!H%)
Break index strength at each boundary — intermediate (3) or full (4)

This is the libri+peoples+sbc checkpoint, the best-performing configuration from the Text2ToBI experiments. It was trained on LibriTTS + People's Speech + SBCSAE with a boundary loss weight (BLW) of 2.0, no POS injection, and punctuation stripped.

Usage

The recommended way to use this model is via the text2tobi CLI:

pip install torch transformers huggingface_hub
git clone https://github.com/your-handle/text2tobi
cd text2tobi
python -m text2tobi download
python -m text2tobi "the students filed into the lecture hall"

Example output (default table format):

word        boundary    intonation    break_index
the         -           -             -
students    -           -             -
filed       -           -             -
into        -           -             -
the         -           -             -
lecture     -           -             -
hall        B           L%            4

Pass --raw for inline annotations or --ssml for SSML XML output.
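
If you capture the default table output from a script, it can be turned into per-word records with a few lines of Python. The sketch below is illustrative only: `parse_table` is a hypothetical helper, not part of the text2tobi CLI, and it assumes the columns are whitespace-separated exactly as in the example above and that `-` means "no value".

```python
def parse_table(text):
    """Parse the default table output into a list of per-word dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        row = {header[0]: fields[0]}  # keep the word itself untouched
        for key, value in zip(header[1:], fields[1:]):
            # "-" marks a word with no boundary; normalise it to None.
            row[key] = None if value == "-" else value
        rows.append(row)
    return rows


example = """\
word        boundary    intonation    break_index
the         -           -             -
students    -           -             -
filed       -           -             -
into        -           -             -
the         -           -             -
lecture     -           -             -
hall        B           L%            4
"""

records = parse_table(example)
boundary_words = [r["word"] for r in records if r["boundary"] == "B"]
print(boundary_words)  # ['hall']
```

Values are kept as strings, so `break_index` comes back as `"4"` rather than an integer.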

Loading directly

If you want to load the model without the CLI, include model.py from this repo in your working directory:

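# Requires model.py from this repo on the import path.
from model import ProsodyBoundaryModel
from transformers import AutoTokenizer

# "your-handle/text2tobi" is a placeholder; use the actual Hub repo ID.
model = ProsodyBoundaryModel.from_pretrained("your-handle/text2tobi")
tokenizer = AutoTokenizer.from_pretrained("your-handle/text2tobi")
model.eval()  # inference mode: disables dropout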

The model returns a dict of logits keyed boundary_logits, intonation_logits, and break_idx_logits.
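
A rough sketch of how those three logit sets could be decoded into table rows follows. It rests on several assumptions that are not confirmed by this card: that the logits have already been reduced to one row per word, that index 1 of `boundary_logits` means "boundary", and that the label orders in `INTONATION_LABELS` and `BREAK_LABELS` match the checkpoint. Plain nested lists stand in for tensors here; with PyTorch tensors you would call `.tolist()` first.

```python
# Hypothetical label orders: check the checkpoint's config before relying on them.
INTONATION_LABELS = ["L%", "H%", "!H%"]
BREAK_LABELS = ["3", "4"]


def argmax(values):
    """Index of the largest value in a flat list."""
    return max(range(len(values)), key=values.__getitem__)


def decode(words, boundary_logits, intonation_logits, break_idx_logits):
    """Turn per-word logits into (word, boundary, intonation, break_index) rows."""
    rows = []
    for i, word in enumerate(words):
        # Assumed layout: boundary_logits[i] == [no_boundary, boundary].
        if argmax(boundary_logits[i]) == 1:
            intonation = INTONATION_LABELS[argmax(intonation_logits[i])]
            break_index = BREAK_LABELS[argmax(break_idx_logits[i])]
            rows.append((word, "B", intonation, break_index))
        else:
            # Intonation and break index are only reported at boundaries.
            rows.append((word, "-", "-", "-"))
    return rows


# Toy logits for two words, invented for illustration.
rows = decode(
    ["lecture", "hall"],
    boundary_logits=[[2.0, -1.0], [-0.5, 1.5]],
    intonation_logits=[[0.1, 0.2, 0.0], [1.2, 0.3, -0.4]],
    break_idx_logits=[[0.0, 0.1], [-0.3, 0.9]],
)
print(rows)  # [('lecture', '-', '-', '-'), ('hall', 'B', 'L%', '4')]
```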

Performance

Evaluated on SBC001–005 (held-out test set, never seen during training). This is the only configuration directly comparable to the GPT-Neo text-only baseline from Roll et al. (2023).

Model	Boundary F1	Intonation F1	Break Index F1
text2tobi `libri+peoples+sbc` BLW=2.0	0.8352	0.5765	0.6018†
GPT-Neo 1.2B (Roll et al., 2023)	0.770	—	—
Random (distribution-matched)	0.257	—	—

†Break index F1 is evaluated on BU Radio News Corpus gold .brk annotations (not the SBC test set, which has no break index labels). Treat as experimental.

On boundary detection, text2tobi surpasses the GPT-Neo baseline by 6.5 F1 points (0.8352 vs 0.770) while being approximately 18× smaller (~66M vs ~1.2B parameters). It does so without access to punctuation or capitalization: input is lowercased words only.
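
The two headline numbers can be reproduced from figures reported on this card. The snippet is only an arithmetic check; the parameter counts are the approximate values quoted here (66.4M for this checkpoint, 1.2B for GPT-Neo).

```python
boundary_f1_text2tobi = 0.8352
boundary_f1_gpt_neo = 0.770
params_text2tobi = 66.4e6  # model size reported for this checkpoint
params_gpt_neo = 1.2e9     # approximate size quoted for the baseline

# Difference in boundary F1, expressed in percentage points.
gain_points = (boundary_f1_text2tobi - boundary_f1_gpt_neo) * 100
# How many times larger the baseline is.
size_ratio = params_gpt_neo / params_text2tobi

print(f"{gain_points:.1f} F1 points, {size_ratio:.1f}x smaller")
# 6.5 F1 points, 18.1x smaller
```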

Training data

Corpus	Annotation	Role
LibriTTS	Silver (PSST + Wav2ToBI consensus)	Boundary + intonation
People's Speech	Silver (PSST + Wav2ToBI consensus)	Boundary + intonation
SBCSAE	Gold (Du Bois transcripts)	Boundary + intonation
BU Radio News	Gold (`.brk` files)	Break index evaluation only

Silver-standard boundary and intonation labels were generated by cross-validating PSST (NathanRoll/psst-medium-en) against Wav2ToBI (ReginaZ/Wav2ToBI-PB-Fuzzy). Positions where the two systems disagreed were masked from training. 87.3% of utterance-final words received Wav2ToBI corroboration within ±1 word.
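
The consensus step can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline used to build the training data: both systems are assumed to emit word-aligned labels, disagreements are replaced with `-100` (the default `ignore_index` of PyTorch's cross-entropy loss), and the ±1-word check is treated as a separate corroboration test. The function names are hypothetical.

```python
IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy loss


def consensus_labels(psst, wav2tobi):
    """Keep a label where both systems agree; mask the position otherwise."""
    if len(psst) != len(wav2tobi):
        raise ValueError("label sequences must be word-aligned")
    return [a if a == b else IGNORE_INDEX for a, b in zip(psst, wav2tobi)]


def corroborated(index, other_boundaries, tolerance=1):
    """True if the other system marks a boundary within +/- tolerance words."""
    lo = max(0, index - tolerance)
    hi = min(len(other_boundaries), index + tolerance + 1)
    return any(other_boundaries[lo:hi])


# Toy boundary labels (1 = boundary), invented for illustration.
psst_labels = [0, 0, 1, 0, 0, 1]
wav2tobi_labels = [0, 0, 1, 0, 1, 1]

masked = consensus_labels(psst_labels, wav2tobi_labels)
print(masked)  # [0, 0, 1, 0, -100, 1]

# The final word is a PSST boundary; Wav2ToBI agrees within one word.
print(corroborated(5, wav2tobi_labels))  # True
```

Masked positions contribute nothing to the loss, so the model is never trained on a label that only one system supports.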

SBCSAE data is included under explicit written permission from corpus director John W. Du Bois (June 2026) for unrestricted public distribution of derived model weights.

Known limitations

Intonation labels apply to boundary words only. Non-boundary intonation is not modeled.
Register coverage is read speech (LibriTTS, People's Speech) and conversational speech (SBCSAE). Generalization to telephony, noisy environments, or non-native speakers has not been tested.
Chunking fallback: when no sentence boundary is detected in unpunctuated input, the inference pipeline splits the text at a word boundary around the 100-token mark. The split is not linguistically motivated and may affect predictions near split points.
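
To make the chunking fallback concrete, here is one way such a split could work. It is a sketch, not the pipeline's actual code: `chunk_words` and its `count_tokens` parameter are hypothetical, and the default of one token per word stands in for a real subword tokenizer.

```python
def chunk_words(words, max_tokens=100, count_tokens=lambda word: 1):
    """Greedily group words into chunks of at most max_tokens tokens.

    Splits only ever fall between words, never inside one.
    """
    chunks, current, used = [], [], 0
    for word in words:
        n = count_tokens(word)
        # Start a new chunk if adding this word would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(current)
    return chunks


words = [f"w{i}" for i in range(250)]
chunks = chunk_words(words)
print([len(c) for c in chunks])  # [100, 100, 50]
```

Words on either side of a split lose their neighbouring context, which is why predictions near split points may be less reliable.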

License

Apache 2.0. See LICENSE.

Model details

Model size: 66.4M params
Tensor type: F32 (Safetensors)
Base model: distilbert/distilbert-base-uncased