text2tobi: libri+peoples+sbc

ToBI prosodic annotation from text alone. No audio required at inference time.

Given a stream of lowercased words, this model predicts:

  • Intonation unit boundaries: where a prosodic phrase ends
  • Intonation direction at each boundary: rising (H%), falling (L%), or level (!H%)
  • Break index strength at each boundary: intermediate (3) or full (4)

This is the libri+peoples+sbc checkpoint: the best-performing configuration from the Text2ToBI experiments, trained on LibriTTS + People's Speech + SBCSAE with boundary loss weight 2.0, no POS injection, punctuation stripped.


Usage

The recommended way to use this model is via the text2tobi CLI:

pip install torch transformers huggingface_hub
git clone https://github.com/your-handle/text2tobi
cd text2tobi
python -m text2tobi download
python -m text2tobi "the students filed into the lecture hall"

Example output (default table format):

word        boundary    intonation    break_index
the         -           -             -
students    -           -             -
filed       -           -             -
into        -           -             -
the         -           -             -
lecture     -           -             -
hall        B           L%            4

Pass --raw for inline annotations or --ssml for SSML XML output.
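
For downstream processing, the default table output can be turned into per-word records. The following is a minimal sketch, assuming the table is whitespace-separated with a single header row and uses "-" for empty cells, as in the example output above. parse_table is a hypothetical helper and is not part of the text2tobi CLI.

```python
def parse_table(text):
    """Parse the default table output into a list of dicts (one per word).

    Assumes whitespace-separated columns, one header row, and "-" for empty
    label cells. Hypothetical helper, not shipped with text2tobi.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        cells = line.split()
        if len(cells) != len(header):
            raise ValueError(f"unexpected row: {line!r}")
        row = dict(zip(header, cells))
        # Keep the word itself untouched; map "-" to None in the label columns.
        for key in header[1:]:
            if row[key] == "-":
                row[key] = None
        rows.append(row)
    return rows


example = """\
word        boundary    intonation    break_index
the         -           -             -
lecture     -           -             -
hall        B           L%            4
"""

rows = parse_table(example)
boundaries = [r["word"] for r in rows if r["boundary"] is not None]
print(boundaries)
```

Label values are kept as strings (for example "4" for the break index) so that the records mirror the CLI output exactly.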

Loading directly

If you want to load the model without the CLI, include model.py from this repo in your working directory:

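# model.py from this repo must be importable, e.g. placed in the working directory
from model import ProsodyBoundaryModel
from transformers import AutoTokenizer

model = ProsodyBoundaryModel.from_pretrained("your-handle/text2tobi")
tokenizer = AutoTokenizer.from_pretrained("your-handle/text2tobi")
model.eval()  # switch to inference mode (disables dropout)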

The model returns a dict of logits keyed boundary_logits, intonation_logits, and break_idx_logits.
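
Turning those logits into labels is a matter of taking an argmax per word. The sketch below is illustrative only and makes several assumptions that are not confirmed by this card: the logits have already been reduced to one row per word and converted to nested lists (for example with tensor.tolist()), class 1 of boundary_logits means "boundary", and the label orders in INTONATION_LABELS and BREAK_LABELS are hypothetical. Check model.py and the repo's configuration for the real index mapping.

```python
# Hypothetical label inventories: the real index order must be taken from the
# repository's configuration, not from this sketch.
INTONATION_LABELS = ["L%", "H%", "!H%"]
BREAK_LABELS = ["3", "4"]


def argmax(scores):
    """Index of the largest value in a flat list of scores."""
    return max(range(len(scores)), key=scores.__getitem__)


def decode(words, outputs):
    """Turn per-word logits into (word, boundary, intonation, break_index) rows.

    `outputs` maps the three documented keys to nested lists shaped
    [num_words][num_classes]; that shape is an assumption.
    """
    rows = []
    for i, word in enumerate(words):
        # Assumed convention: class 1 of the boundary head means "boundary".
        if argmax(outputs["boundary_logits"][i]) == 1:
            tone = INTONATION_LABELS[argmax(outputs["intonation_logits"][i])]
            brk = BREAK_LABELS[argmax(outputs["break_idx_logits"][i])]
            rows.append((word, "B", tone, brk))
        else:
            # Intonation and break index are only reported at boundary words.
            rows.append((word, "-", "-", "-"))
    return rows


# Toy logits for two words, made up for illustration.
toy_words = ["lecture", "hall"]
toy_outputs = {
    "boundary_logits": [[2.0, -1.0], [-0.5, 1.5]],
    "intonation_logits": [[0.0, 0.0, 0.0], [3.0, 0.1, -2.0]],
    "break_idx_logits": [[0.0, 0.0], [-1.0, 2.0]],
}
decoded = decode(toy_words, toy_outputs)
```

Reporting intonation and break index only where a boundary is predicted matches the table format shown in the Usage section.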


Performance

Evaluated on SBC001–005 (held-out test set, never seen during training). This is the only configuration directly comparable to the GPT-Neo text-only baseline from Roll et al. (2023).

Model                                  Boundary F1    Intonation F1    Break Index F1
text2tobi libri+peoples+sbc BLW=2.0    0.8352         0.5765           0.6018†
GPT-Neo 1.2B (Roll et al., 2023)       0.770          -                -
Random (distribution-matched)          0.257          -                -

†Break index F1 is evaluated on BU Radio News Corpus gold .brk annotations (not the SBC test set, which has no break index labels). Treat as experimental.

text2tobi surpasses the GPT-Neo baseline by 6.5 Boundary F1 points (0.8352 vs 0.770) while being approximately 18× smaller (~66M vs ~1.2B parameters), and it does so without access to punctuation or capitalization: input is lowercased words only.
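
The two headline figures follow directly from the numbers reported in this section. As a quick check, using the rounded parameter counts quoted above:

```python
boundary_f1 = 0.8352   # text2tobi Boundary F1, from the table
baseline_f1 = 0.770    # GPT-Neo 1.2B Boundary F1, from the table
params_small = 66e6    # text2tobi, approximate count as quoted
params_large = 1.2e9   # GPT-Neo, approximate count as quoted

# F1 is on a 0-1 scale, so multiply by 100 to express the gap in points.
gain_points = (boundary_f1 - baseline_f1) * 100
size_ratio = params_large / params_small

print(f"Boundary F1 gain: {gain_points:.1f} points")
print(f"Parameter ratio: {size_ratio:.1f}x")
```

This gives a gain of about 6.5 points and a parameter ratio of about 18.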


Training data

Corpus             Annotation                            Role
LibriTTS           Silver (PSST + Wav2ToBI consensus)    Boundary + intonation
People's Speech    Silver (PSST + Wav2ToBI consensus)    Boundary + intonation
SBCSAE             Gold (Du Bois transcripts)            Boundary + intonation
BU Radio News      Gold (.brk files)                     Break index evaluation only

Silver-standard boundary and intonation labels were generated by cross-validating PSST (NathanRoll/psst-medium-en) against Wav2ToBI (ReginaZ/Wav2ToBI-PB-Fuzzy). Positions where the two systems disagreed were masked from training. 87.3% of utterance-final words received Wav2ToBI corroboration within ±1 word.
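
The consensus step can be pictured as follows. This is a sketch under stated assumptions, not the actual labeling pipeline: both systems are assumed to emit one word-aligned 0/1 boundary flag per word, disagreement is assumed to mean an exact per-position mismatch, and masked positions are marked with -100 (the default ignore_index of PyTorch's CrossEntropyLoss; whether training used that value is not stated here). The names consensus_labels and corroborated are hypothetical.

```python
IGNORE = -100  # assumed mask value; matches PyTorch's default ignore_index


def consensus_labels(labels_a, labels_b, ignore=IGNORE):
    """Keep a label where both systems agree; otherwise mask the position."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must be word-aligned")
    return [a if a == b else ignore for a, b in zip(labels_a, labels_b)]


def corroborated(index, other_boundaries, tolerance=1):
    """True if the other system marks a boundary within +/- tolerance words."""
    lo = max(0, index - tolerance)
    hi = min(len(other_boundaries), index + tolerance + 1)
    return any(other_boundaries[lo:hi])


# Toy boundary flags for a six-word utterance, made up for illustration.
psst_flags = [0, 0, 1, 0, 0, 1]
wav2tobi_flags = [0, 0, 1, 0, 1, 0]

silver = consensus_labels(psst_flags, wav2tobi_flags)
final_word_ok = corroborated(len(psst_flags) - 1, wav2tobi_flags)
```

In the toy example the two systems agree on the first four words, so the last two positions are masked, while the utterance-final word still counts as corroborated because Wav2ToBI places a boundary one word earlier.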

SBCSAE data is included under explicit written permission from corpus director John W. Du Bois (June 2026) for unrestricted public distribution of derived model weights.


Known limitations

  • Intonation labels apply to boundary words only. Non-boundary intonation is not modeled.
  • Register coverage is read speech (LibriTTS, People's Speech) and conversational speech (SBCSAE). Generalization to telephony, noisy environments, or non-native speakers has not been tested.
  • Chunking fallback: for unpunctuated input, the inference pipeline splits at a 100-token word boundary when no sentence boundary is detected. This is not linguistically motivated and may affect predictions near split points.
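
To make the chunking fallback concrete, here is a minimal sketch of a greedy split that never cuts inside a word. It is an illustration, not the code used by the inference pipeline: chunk_words and its count_tokens argument are hypothetical, and sentence-boundary detection is left out entirely.

```python
def chunk_words(words, count_tokens, max_tokens=100):
    """Greedily group words into chunks of at most max_tokens tokens.

    Splits happen only between words. `count_tokens` is a caller-supplied
    function returning the number of model tokens for one word (hypothetical
    interface). A single word longer than max_tokens becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for word in words:
        n = count_tokens(word)
        # Close the current chunk if this word would push it over the limit.
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(current)
    return chunks


# Toy input: 250 words, pretending every word is exactly one token.
toy_words = [f"w{i}" for i in range(250)]
toy_chunks = chunk_words(toy_words, count_tokens=lambda w: 1)
chunk_sizes = [len(c) for c in toy_chunks]
```

With a Hugging Face tokenizer, count_tokens could be something like lambda w: len(tokenizer.tokenize(w)). Words adjacent to a split lose context from the neighboring chunk, which is why predictions near split points may be affected.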

License

Apache 2.0. See LICENSE.
