Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,63 @@
---
license: mit
language:
- de
tags:
- truecasing
- text-processing
- german
- nlp
- statistical
pipeline_tag: token-classification
---

# German Statistical Truecaser

Statistical truecasing model for German, trained on 2M lines of German Wikipedia (CC-BY-SA source, model released under MIT).

Restores proper capitalization of German nouns, proper names, and acronyms in lowercase ASR output.

## How It Works

For each word (lowercased), the model stores frequency counts of three casing variants:
- **lc**: all lowercase (e.g. "die")
- **u1**: first letter capitalized (e.g. "Katze")
- **uc**: all uppercase (e.g. "NATO")

At inference, the variant with the highest count is applied. Sentence-initial words are always capitalized.

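The lookup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model's actual implementation: the in-memory layout (`Counts`), the function names, the toy counts, and the choice to leave unknown words lowercase are all hypothetical, and the input is assumed to be one already-split sentence.

```python
from typing import Dict, List

# Hypothetical layout: lowercased word -> count per casing variant.
Counts = Dict[str, Dict[str, int]]

# Toy counts invented for illustration; they are not taken from the model file.
TOY_COUNTS: Counts = {
    "die": {"lc": 90, "u1": 10, "uc": 0},
    "katze": {"lc": 0, "u1": 50, "uc": 0},
    "nato": {"lc": 0, "u1": 1, "uc": 40},
}


def apply_variant(word: str, variant: str) -> str:
    """Render a lowercased word in one of the three casing variants."""
    if variant == "u1":
        return word[:1].upper() + word[1:]
    if variant == "uc":
        return word.upper()
    return word  # "lc"


def truecase(tokens: List[str], counts: Counts) -> List[str]:
    """Truecase the tokens of a single sentence."""
    out = []
    for i, token in enumerate(tokens):
        key = token.lower()
        entry = counts.get(key)
        # Assumption: words missing from the table stay lowercase.
        # On a tie, max() keeps the first variant in dict order.
        variant = max(entry, key=entry.get) if entry else "lc"
        cased = apply_variant(key, variant)
        if i == 0:
            # Sentence-initial words are always capitalized.
            cased = cased[:1].upper() + cased[1:]
        out.append(cased)
    return out


if __name__ == "__main__":
    print(" ".join(truecase("die katze und die nato".split(), TOY_COUNTS)))
```

Because each word is decided independently from its own counts, the sketch also makes the lack of context awareness visible: the same lowercase key always maps to the same casing, except at the start of a sentence.
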
## Stats

- **Entries**: 452,835 unique words
- **Training data**: German Wikipedia (2M lines, mid-sentence words only)
- **File size**: 11 MB
- **Inference**: one hash table lookup per word, no neural network

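As a rough illustration of how such a table could be built from "mid-sentence words only", the following Python sketch counts casing variants while skipping the first word of every sentence. Everything here is an assumption made for the example: the function names, the regex tokenizer, and the crude sentence split on `.`, `!` and `?` are hypothetical and are not taken from the actual training script.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional

# Runs of Unicode letters (no digits or underscores); a simplistic tokenizer.
WORD = re.compile(r"[^\W\d_]+")


def casing_variant(token: str) -> Optional[str]:
    """Classify a surface form as lc, u1 or uc; None for mixed case."""
    if token.islower():
        return "lc"
    if len(token) > 1 and token.isupper():
        return "uc"
    if token[0].isupper() and (len(token) == 1 or token[1:].islower()):
        return "u1"
    return None  # e.g. "iPhone" or "mRNA"


def count_variants(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Count casing variants of mid-sentence words, keyed by lowercase form."""
    counts = defaultdict(lambda: {"lc": 0, "u1": 0, "uc": 0})
    for line in lines:
        # Crude sentence split; abbreviations such as "z.B." will confuse it.
        for sentence in re.split(r"[.!?]+", line):
            tokens = WORD.findall(sentence)
            # Skip the sentence-initial word: its capital letter says nothing
            # about the word's usual casing.
            for token in tokens[1:]:
                variant = casing_variant(token)
                if variant is not None:
                    counts[token.lower()][variant] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = ["Die Katze sieht die NATO. Der Hund mag die Katze."]
    for word, entry in sorted(count_variants(sample).items()):
        print(word, entry)
```

Dropping sentence-initial words keeps words like "die" from accumulating misleading capitalized counts, which is presumably why the training data is restricted to mid-sentence positions.
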
## Usage with CrispASR

```bash
# Auto-download German truecaser
crispasr --backend moonshine -m model.gguf --truecase-model auto -f audio.wav

# Combined with punctuation restoration
crispasr --backend wav2vec2-de -m model.gguf \
  --punc-model punctuate-all --truecase-model auto -f audio.wav
```

## Example

| Stage | Output |
|-------|--------|
| Raw ASR | `die schnelle braune katze springt über den faulen hund` |
| + punctuation | `die schnelle braune katze springt über den faulen hund.` |
| + truecasing | `Die schnelle Braune Katze springt über den faulen Hund.` |

## Limitations

- Statistical only, with no context awareness: words such as the adjective "braune" and the surname "Braun" remain ambiguous
- German-specific (separate models needed for other languages)
- Does not handle mixed-case words like "mRNA" or "iPhone", which fit none of the three stored variants

## License

MIT