Commit e5bfca5 (verified), committed by cstr
1 Parent(s): 48b268a

Upload README.md with huggingface_hub

  1. README.md +63 -0
README.md ADDED
@@ -0,0 +1,63 @@
---
license: mit
language:
- de
tags:
- truecasing
- text-processing
- german
- nlp
- statistical
pipeline_tag: token-classification
---

# German Statistical Truecaser

Statistical truecasing model for German, trained on 2M lines of German Wikipedia (CC-BY-SA source, model released under MIT).

Restores proper capitalization of German nouns, proper names, and acronyms in lowercase ASR output.

## How It Works

For each word, keyed by its lowercased form, the model stores frequency counts for three casing variants:
- **lc**: all lowercase (e.g. "die")
- **u1**: first letter capitalized (e.g. "Katze")
- **uc**: all uppercase (e.g. "NATO")

At inference, each word is rewritten in the variant with the highest count. Sentence-initial words are always capitalized.

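The lookup described above can be sketched in a few lines of Python. This is only an illustration, not the model's actual implementation: the names `COUNTS`, `apply_variant`, and `truecase` are hypothetical, the counts are made up, and the handling of ties (lowercase wins), unknown words (left lowercase), and trailing punctuation are assumptions that the description above does not specify.

```python
# Toy counts table: lowercased word -> count per casing variant.
# The numbers are invented for illustration; the real model file is not shown here.
COUNTS = {
    "die": {"lc": 9500, "u1": 120, "uc": 0},
    "katze": {"lc": 2, "u1": 310, "uc": 0},
    "nato": {"lc": 0, "u1": 3, "uc": 870},
    "springt": {"lc": 140, "u1": 1, "uc": 0},
}


def apply_variant(word, variant):
    """Render a lowercase word in one of the three casing variants."""
    if variant == "u1":
        return word[:1].upper() + word[1:]
    if variant == "uc":
        return word.upper()
    return word


def truecase(text):
    """Apply the most frequent casing variant to each whitespace-separated token."""
    out = []
    sentence_start = True
    for token in text.split():
        key = token.lower()
        core = key.rstrip(".,!?;:")  # assumption: punctuation is only trailing
        tail = key[len(core):]
        counts = COUNTS.get(core)
        if counts is None:
            cased = core  # assumption: unknown words stay lowercase
        else:
            # max() returns the first maximum, so ties resolve to "lc" (assumption)
            variant = max(("lc", "u1", "uc"), key=lambda v: counts[v])
            cased = apply_variant(core, variant)
        if sentence_start and cased[:1].islower():
            cased = cased[:1].upper() + cased[1:]  # sentence-initial rule
        out.append(cased + tail)
        sentence_start = tail.endswith((".", "!", "?"))
    return " ".join(out)
```

With these toy counts, `truecase("die katze springt. die nato")` returns `"Die Katze springt. Die NATO"`: "die" is capitalized only because it starts a sentence, while "Katze" and "NATO" get their casing from the counts.
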
29
+ ## Stats
30
+
31
+ - **Entries**: 452,835 unique words
32
+ - **Training data**: German Wikipedia (2M lines, mid-sentence words only)
33
+ - **File size**: 11 MB
34
+ - **Inference**: instant (hash table lookup, no neural network)
35
+
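The training script is not part of this README. As a rough sketch of how such counts could be collected from mid-sentence words, assuming whitespace tokenization, a simple punctuation strip, and that the first token of each line and each token after `.`, `!`, or `?` is sentence-initial (the helper names `classify` and `count_variants` are hypothetical):

```python
from collections import defaultdict

PUNCT = ".,;:!?\"'()"  # assumption: punctuation to strip from token edges


def classify(token):
    """Return 'lc', 'u1', 'uc', or None for tokens outside the three variants."""
    if not token.isalpha():
        return None
    if token.islower():
        return "lc"
    if len(token) > 1 and token.isupper():
        return "uc"
    if token[0].isupper() and (len(token) == 1 or token[1:].islower()):
        return "u1"
    return None  # mixed case such as "mRNA" or "iPhone"


def count_variants(lines):
    """Count casing variants of mid-sentence words, keyed by lowercased word."""
    counts = defaultdict(lambda: {"lc": 0, "u1": 0, "uc": 0})
    for line in lines:
        sentence_start = True  # assumption: each line starts a new sentence
        for raw in line.split():
            word = raw.strip(PUNCT)
            if not sentence_start:
                variant = classify(word)
                if variant is not None:
                    counts[word.lower()][variant] += 1
            sentence_start = raw.endswith((".", "!", "?"))
    return counts
```

Skipping sentence-initial tokens matters because they are capitalized regardless of word class, which would otherwise inflate the **u1** counts of words such as "die".
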
## Usage with CrispASR

```bash
# Auto-download German truecaser
crispasr --backend moonshine -m model.gguf --truecase-model auto -f audio.wav

# Combined with punctuation restoration
crispasr --backend wav2vec2-de -m model.gguf \
  --punc-model punctuate-all --truecase-model auto -f audio.wav
```

## Example

| Stage | Output |
|-------|--------|
| Raw ASR | `die schnelle braune katze springt über den faulen hund` |
| + punctuation | `die schnelle braune katze springt über den faulen hund.` |
| + truecasing | `Die schnelle Braune Katze springt über den faulen Hund.` |

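## Limitations

- Statistical only, with no context awareness: a word form always receives the same casing, so ambiguous cases such as the adjective "braune" vs. the surname "Braun" cannot be resolved (the example above renders "braune" as "Braune")
- German-specific (separate models needed for other languages)
- Does not handle mixed-case words like "mRNA" or "iPhone", which fit none of the three stored variants
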
## License

MIT