ilyyeees committed
Commit cb5926c · 1 Parent(s): c35eb56

upgrade to v2 - v1 in v1-legacy branch

Files changed (2):
  1. README.md +57 -10
  2. training_args.bin +3 -0
README.md CHANGED
@@ -1,21 +1,68 @@
  ---
- language: en
+ license: mit
+ language:
+ - en
  tags:
- - byt5
  - leetspeak
- - decoder
  - text2text-generation
- license: apache-2.0
+ - byt5
+ - decoder
+ - translation
+ datasets:
+ - wikitext
+ - samsum
  pipeline_tag: translation
  ---

- # ByT5 Leetspeak Decoder
-
- This is a fine-tuned **ByT5** model trained to decode "Leetspeak" (e.g., `h3110 w0r1d`) back into standard English (`hello world`).
-
- **Model Accuracy:** ~98% on general sentence structures.
-
-
- ### Performance:
- - **BLEU:** 94.8
- - **CER:** 0.7%
+ # ByT5 Leetspeak Decoder V2
+
+ **Translates leetspeak, internet slang, and gaming abbreviations back to clean English.**
+
+ Built on `google/byt5-base`. V2 trained on real Reddit comments for improved slang handling.
+
+ ## Performance
+
+ | Metric | V1 | V2 |
+ |--------|-----|-----|
+ | Accuracy | 71% | **85%** |
+ | Training Data | WikiText (synthetic) | Reddit (real) |
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ model = AutoModelForSeq2SeqLM.from_pretrained("ilyyeees/byt5-leetspeak-decoder-v2")
+ tokenizer = AutoTokenizer.from_pretrained("ilyyeees/byt5-leetspeak-decoder-v2")
+
+ def translate(text):
+     inputs = tokenizer(text, return_tensors="pt")
+     outputs = model.generate(**inputs, max_length=256)
+     return tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ # Examples
+ print(translate("idk wh4t 2 d0 tbh"))  # I don't know what to do to be honest.
+ print(translate("c u l8r m8"))  # See you later mate.
+ print(translate("brb in 10"))  # be right back in 10
+ print(translate("g2g l8r m8"))  # got to go later mate
+ print(translate("1 h4v3 2 c4ts"))  # I have 2 cats
+ ```
+
+ ## What It Handles
+
+ - **Leetspeak**: `h3ll0 w0rld` → `hello world`
+ - **Slang**: `tbh`, `idk`, `rn`, `ngl`, `afk`
+ - **Gaming**: `gg wp`, `brb`, `g2g`, `1v1`
+ - **Numbers**: Preserves real numbers (`2 cats` stays `2 cats`)
+ - **Context**: `2 late` → `too late` vs `2 cats` → `2 cats`
+
+ ## Training
+
+ - **Base**: `google/byt5-base` (580M params)
+ - **V1**: WikiText + SAMSum + synthetic corruption
+ - **V2**: Real Reddit comments (5k) + Qwen 2.5 32B translations + continued training
+
+ ## Links
+
+ - [GitHub](https://github.com/ilyyeees/leet-speak-decoder)
+ - [V1 Model](https://huggingface.co/ilyyeees/byt5-leetspeak-decoder)
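
Note: the `translate()` helper in the new Usage section runs one string per call. For checking many inputs at once, a minimal batched variant is sketched below. It is not part of the commit; it assumes the repo id and loading code shown in the Usage section and uses only standard `transformers` calls.

```python
# Batched variant of the card's translate() helper (a sketch, not part of
# this commit; repo id taken from the Usage section above).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

repo = "ilyyeees/byt5-leetspeak-decoder-v2"
model = AutoModelForSeq2SeqLM.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

def translate_batch(texts, max_length=256):
    # Pad the byte-level inputs to a common length so they stack into one tensor.
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(translate_batch(["gg wp", "ngl th4t w4s cl0se", "idk rn"]))
```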
 
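Note: the Training bullets name "synthetic corruption" for V1's WikiText data, but the corruption code itself is not in this commit. Purely as an illustration of the idea, a character-substitution pass could look like the sketch below; the substitution table and rate are assumptions, not the actual V1 pipeline.

```python
# Hypothetical illustration of the "synthetic corruption" step from the
# Training section. The table and probability are assumptions; the real
# V1 corruption code is not part of this commit.
import random

LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def corrupt(text, p=0.5):
    # Swap each mapped character for its leet form with probability p,
    # yielding (corrupted, clean) training pairs from clean sentences.
    return "".join(
        LEET_MAP[c] if c in LEET_MAP and random.random() < p else c
        for c in text.lower()
    )

random.seed(0)
print(corrupt("the quick brown fox"))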
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4049fdf80259c02acb852ddc1d1ed5a1cc2ab28b1001affe0b8c846ffdcd111d
+ size 5969
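
Note: `training_args.bin` is the pickled `TrainingArguments` object that the `transformers` `Trainer` writes next to its checkpoints; only its 3-line Git LFS pointer appears in the diff above. To inspect the run's hyperparameters, a minimal sketch, assuming the file stays at the repo root under the v2 repo id used in the card:

```python
# Load the pickled TrainingArguments from this commit's training_args.bin.
# Sketch only: the repo id is an assumption taken from the card's Usage section.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="ilyyeees/byt5-leetspeak-decoder-v2", filename="training_args.bin"
)
# weights_only=False is needed on recent torch, since this is a pickled
# TrainingArguments object rather than a plain tensor state dict.
args = torch.load(path, weights_only=False)
print(args.learning_rate, args.num_train_epochs, args.per_device_train_batch_size)
```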