---
license: mit
language:
- en
tags:
- leetspeak
- text2text-generation
- byt5
- decoder
- translation
- normalization
datasets:
- wikitext
- eli5
metrics:
- bleu
- cer
pipeline_tag: translation
model-index:
- name: ByT5 Leetspeak Decoder V3
  results:
  - task:
      type: translation
      name: Leetspeak Decoding
    metrics:
    - type: accuracy
      name: Mixed-Number Accuracy
      value: 100.0
    - type: accuracy
      name: Basic Leet Accuracy
      value: 100.0
---

# ByT5 Leetspeak Decoder V3 (Production)

**The definitive byte-level translator for leetspeak, internet slang, and visual character obfuscation.**

Built on `google/byt5-base`, **V3** represents a major architectural shift from previous versions. It uses **Curriculum Learning** and **Adversarial Filtering** to resolve the contextual ambiguity between leetspeak numbers (e.g., "2" meaning "to") and actual quantities (e.g., "2 cats").

## Key Improvements in V3

| Feature | V2 (Legacy) | V3 (Current) |
| :--- | :--- | :--- |
| **Mixed-Number Context** | Struggled (~74%) | **100.0% Accuracy** |
| **Basic Leet Decoding** | 85% | **100.0% Accuracy** |
| **Visual Obfuscation** | Moderate | **High** (handles `\|<1\|\|`, `\|-\|`, etc.) |
| **Output Style** | Casual/Slang-heavy | **Formal/Standard English** |
| **Final Eval Loss** | 0.84 | **0.3812** |

### The "Number Problem" Solved

V3 is the first model in this series to perfectly distinguish between numbers used as letters and numbers used as quantities within the same sentence.
* **Input:** `1t5 2 l8 4 2 people`
* **V2 Output:** *It's to late for to people.* (Fail)
* **V3 Output:** *It is too late for 2 people.* (Pass)

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ilyyeees/byt5-leetspeak-decoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def decode_leet(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test cases
print(decode_leet("1t5 2 l8 4 th4t"))
# Output: It is too late for that.

print(decode_leet("1 g0t 100 p01nt5 0n 1t"))
# Output: I got 100 points on it. (Preserves the '100' but decodes the rest.)

print(decode_leet("idk wh4t 2 d0 tbh"))
# Output: I don't know what to do to be honest. (Expands abbreviations.)
```

## Training Methodology

V3 was trained on 2x NVIDIA RTX 5090s using a custom **Reverse-Corruption Pipeline**:

1. **Clean Base:** High-quality English from WikiText and ELI5 to ground the model in correct grammar.
2. **LLM Adversarial Corruption:** We used Qwen 2.5 72B to generate "Hard Negatives": specific leetspeak patterns that previous model versions failed to decode.
3. **Curriculum Learning:** The model was trained in phases of increasing difficulty, starting with simple character swaps and ending with complex visual noise and mixed-number ambiguity.

## Limitations & Bias

* **Formalization Bias:** Because V3 was trained on high-quality datasets (Wiki/ELI5), it has a bias toward formal English. It may expand casual slang into formal prose (e.g., converting `ngl` to *not gonna lie* or `idk` to *I don't know*). It generally avoids outputting slang words like *gonna* or *wanna* unless strongly prompted.
* **Short Inputs:** Extremely short, ambiguous inputs (1-2 characters) may be interpreted as standard English rather than leetspeak due to the conservative decoding threshold.
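To make the Reverse-Corruption idea concrete, here is a minimal sketch of how clean text can be corrupted into leetspeak to produce (corrupted, clean) training pairs. The substitution table, probabilities, and function names here are illustrative assumptions, not the exact pipeline used to train V3 (which additionally relies on LLM-generated hard negatives and a difficulty curriculum).

```python
import random

# Assumed, simplified substitution table; the real pipeline covers far
# more patterns (visual obfuscation, multi-character glyphs, etc.).
LEET_MAP = {
    "a": ["4", "@"],
    "e": ["3"],
    "i": ["1", "!"],
    "o": ["0"],
    "s": ["5", "$"],
    "t": ["7"],
}

def corrupt(text, p=0.6, seed=None):
    """Replace characters with leet equivalents with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        subs = LEET_MAP.get(ch.lower())
        if subs and rng.random() < p:
            out.append(rng.choice(subs))
        else:
            out.append(ch)
    return "".join(out)

# A training pair is then (corrupted input, clean target):
clean = "it is too late for that"
pair = (corrupt(clean, seed=0), clean)
```

The model learns the inverse mapping, so only the clean side of each pair needs to be high quality, which is why grounding the base corpus in WikiText/ELI5 matters.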
## Links

* **GitHub Repository:** ilyyeees/leet-speak-decoder
* **V2 Model (Legacy):** byt5-leetspeak-decoder-v2