Upload README.md with huggingface_hub

34a7d0b verified 2 days ago

3.59 kB

	---
	license: apache-2.0
	language:
	- de
	tags:
	- truecasing
	- text-processing
	- german
	- nlp
	- lstm
	- crf
	pipeline_tag: token-classification
	---

	# Truecaser Models

	Truecasing models for restoring proper capitalization in lowercase ASR output. Used by [CrispASR](https://github.com/CrispStrobe/CrispASR) via `--truecase-model`.

	## Available Models

	\| File \| Type \| Language \| Size \| F1 \| License \| Flag \|
	\|------\|------\|----------\|------\|-----\|---------\|------\|
	\| `truecaser-lstm-de.bin` \| BiLSTM char-level \| German \| 3.2 MB \| 97.9% \| Apache-2.0 \| `lstm` or `lstm-de` \|
	\| `truecaser-lstm-en.bin` \| BiLSTM char-level \| English \| 3.2 MB \| 93.0% \| Apache-2.0 \| `lstm-en` \|
	\| `truecaser-lstm-es.bin` \| BiLSTM char-level \| Spanish \| 3.2 MB \| — \| Apache-2.0 \| `lstm-es` \|
	\| `truecaser-lstm-ru.bin` \| BiLSTM char-level \| Russian \| 4.1 MB \| — \| Apache-2.0 \| `lstm-ru` \|
	\| `truecaser-crf-de.bin` \| CRF + context \| German \| 8.5 MB \| ~95% \| MIT \| `crf` \|
	\| `truecaser-de.bin` \| Statistical freq \| German \| 1.7 MB \| ~93% \| MIT \| `auto` \|

	## BiLSTM Truecaser (recommended)

	Converted from [mayhewsw/pytorch-truecaser](https://github.com/mayhewsw/pytorch-truecaser) (Apache-2.0).

	- Architecture: Embedding(202, 50) → BiLSTM(50→150, 2 layers) → Linear(300, 2)
	- Labels: L (lowercase), U (uppercase) — per character
	- Training: WMT monolingual text (de: 2.6M tokens, 97.86% F1; en: Wikipedia, 93.01% F1; es: WMT; ru: LORELEI)
	- Original paper: Mayhew et al., "NER and POS When Nothing is Capitalized" (2019)
	- Source: [mayhewsw/pytorch-truecaser v1.0](https://github.com/mayhewsw/pytorch-truecaser/releases/tag/v1.0) — `wmt-truecaser-model-de.tar.gz`

	### Example

	```
	Input: die schnelle braune katze springt über den faulen hund
	Output: Die schnelle braune Katze springt über den faulen Hund
	```

	Correctly handles:
	- Adjective vs noun: "braune" (lowercase) vs "Katze" (capitalize)
	- Formal pronouns: "Ihnen" (capitalize)
	- Compound words and proper nouns

	## CRF Truecaser

	Trained on 245K sentences of WMT News Crawl German using [python-crfsuite](https://github.com/scrapinghub/python-crfsuite).

	- Features: word identity, 3-char suffix, noun suffixes, previous/next word, article context
	- Decode: Viterbi over linear-chain CRF (3 labels: lc, u1, uc)
	- Training data: WMT News Crawl 2023 German (8.5 MB model, MIT license)

	## Statistical Truecaser

	Simple word-frequency lookup trained on WMT News Crawl 2023 German.

	- Entries: 71,142 unique words
	- Size: 1.7 MB
	- Approach: for each word, pick the casing variant (lowercase/capitalize/uppercase) seen most often
	- Training data: WMT News Crawl 2023 German (278K sentences), MIT license

	## Usage with CrispASR

	```bash
	# BiLSTM (recommended)
	crispasr --backend wav2vec2-de -m model.gguf --truecase-model lstm -f audio.wav

	# CRF
	crispasr --backend wav2vec2-de -m model.gguf --truecase-model crf -f audio.wav

	# Statistical
	crispasr --backend wav2vec2-de -m model.gguf --truecase-model auto -f audio.wav

	# Combined with punctuation restoration
	crispasr --backend moonshine -m model.gguf --punc-model punctuate-all --truecase-model lstm -f audio.wav
	```

	## Conversion

	```bash
	# BiLSTM: download from mayhewsw, convert to binary
	wget https://github.com/mayhewsw/pytorch-truecaser/releases/download/v1.0/wmt-truecaser-model-de.tar.gz
	tar xzf wmt-truecaser-model-de.tar.gz
	python models/convert-lstm-truecaser-to-bin.py --input wmt-truecaser-de/ --output truecaser-lstm-de.bin

	# CRF: train from Wikipedia
	python models/train-truecaser-crf.py --output truecaser-crf-de.bin
	```