Upload 5 files
Browse files

- .gitignore +1 -0
- README.md +116 -0
- generate_permutations_colab.py +221 -0
- generate_typos.py +157 -0
- words.txt +0 -0
.gitignore
ADDED

misspellings_permutations.txt
README.md
ADDED
# Misspelling Generator

A misspelling generator for the 466k English words from the [english-words](https://github.com/dwyl/english-words) dataset, read from `words.txt`. For demonstration, we only permute words of at most 7 letters, which produces:

```
Words processed : 125,414
Lines written : 173,110,626
Output file : misspellings_permutations.txt
File size : 2.53 GB
```

Depending on your storage, you can choose a smaller or larger length limit; just configure `MAX_WORD_LEN` in the script.

### Option 1 – use `generate_typos.py` (run locally)

Generates **realistic typo variants** using 4 strategies:

| Strategy | Example (`hello`) | Variants |
|---|---|---|
| Adjacent swap | `hlelo`, `helol` | n−1 per word |
| Char deletion | `hllo`, `helo`, `hell` | n per word |
| Char duplication | `hhello`, `heello` | n per word |
| Keyboard proximity | `gello`, `jello`, `hwllo` | varies |

- Processes only **pure-alpha words** with length ≥ 3
- Produces roughly **10–50 typos per word** → ~5M–20M lines total
- Output: `data/misspellings.txt` in `misspelling=correction` format

**To run:**
```
python generate_typos.py
```

---
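As a quick sanity check of the per-strategy counts, here is a minimal sketch of the adjacent-swap strategy (the same idea as `generate_adjacent_swaps` in `generate_typos.py`); note that repeated letters can yield fewer than n−1 distinct variants:

```python
def adjacent_swaps(word):
    """Swap each neighboring character pair once; at most n-1 variants."""
    variants = []
    for i in range(len(word) - 1):
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        swapped = ''.join(chars)
        if swapped != word:  # swapping two identical letters is a no-op
            variants.append(swapped)
    return variants

print(adjacent_swaps("hello"))  # ['ehllo', 'hlelo', 'helol'] -- only 3, since 'll' swaps to itself
```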
### Option 2 – use `generate_permutations_colab.py` (Google Colab)

Generates **all letter permutations** of each word. Key config at the top of the file:

```python
MAX_WORD_LEN = 7  # ← CRITICAL control knob
```

---
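For a single short word, the permutation idea can be tried directly in a few lines (a sketch of what `generate_unique_permutations` in the script does):

```python
from itertools import permutations

def unique_misspellings(word):
    """All distinct orderings of the word's letters, minus the correct spelling."""
    perms = set(''.join(p) for p in permutations(word.lower()))
    perms.discard(word.lower())
    return perms

print(len(unique_misspellings("stop")))  # 4! - 1 = 23
print(len(unique_misspellings("noon")))  # 4!/(2! * 2!) - 1 = 5 (repeated letters collapse)
```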
## Google Colab Education

### What Is Google Colab?

Google Colab gives you a **free Linux VM** with Python pre-installed. You get:

| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| **Disk** | ~78 GB (temporary) | ~225 GB (temporary) |
| **RAM** | ~12 GB | ~25–50 GB |
| **GPU** | T4 (limited) | A100/V100 |
| **Runtime limit** | ~12 hours, then VM resets | ~24 hours |
| **Google Drive** | 15 GB (persistent) | 15 GB (same) |

> [!IMPORTANT]
> Colab disk is **ephemeral**: when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.

### Step-by-Step: Running Option 2 on Colab

**Step 1 – Open Colab**
Go to [colab.research.google.com](https://colab.research.google.com) → **New Notebook**

**Step 2 – Upload `words.txt`**
```python
# Cell 1
from google.colab import files
uploaded = files.upload()  # select words.txt from your PC
```

**Step 3 – (Optional) Mount Google Drive for persistent storage**
```python
# Cell 2
from google.colab import drive
drive.mount('/content/drive')

# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
```

**Step 4 – Paste & run the script**
Copy the entire contents of `generate_permutations_colab.py` into a new cell. Adjust `MAX_WORD_LEN` as needed, then run.

**Step 5 – Download the result**
```python
# If saved to VM disk:
files.download('misspellings_permutations.txt')

# If saved to Google Drive: just access it from drive.google.com
```

### Scale Reference

> [!CAUTION]
> Full permutations grow at **n! (factorial)** rate. Here's what to expect:

| `MAX_WORD_LEN` | Max perms/word | Est. total output |
|---|---|---|
| 5 | 120 | ~200 MB |
| 6 | 720 | ~1–2 GB |
| **7** | **5,040** | **~5–15 GB** – recommended start |
| 8 | 40,320 | ~50–150 GB |
| 9 | 362,880 | ~500 GB – 1 TB |
| 10 | 3,628,800 | ~5–50 TB – impossible |

> [!TIP]
> **Start with `MAX_WORD_LEN = 6` or `7`**, check the output size, then decide if you want to go higher. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.
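The factorial growth above is easy to verify; the script's size estimator divides n! by the factorial of each repeated-letter count, which is why real output is somewhat smaller than the worst case:

```python
import math
from collections import Counter

def unique_perm_count(word):
    """Distinct permutations of the word, excluding the word itself."""
    n = math.factorial(len(word))
    for count in Counter(word.lower()).values():
        n //= math.factorial(count)  # correct for repeated letters
    return n - 1

print(unique_perm_count("abcdefg"))  # 7! - 1 = 5039 (all letters distinct)
print(unique_perm_count("seven"))    # 5!/2! - 1 = 59 ('e' appears twice)
```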
### Pro Tips for Colab

- **Keep the browser tab open**: Colab disconnects if idle too long
- **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
- **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
- **CPU-only is fine** for this script: permutation generation is CPU-bound, not GPU
generate_permutations_colab.py
ADDED
```python
"""
=============================================================================
FULL PERMUTATION MISSPELLINGS GENERATOR (Google Colab Edition)
=============================================================================

Purpose:
    Generate ALL possible letter permutations of each word from words.txt
    and write them as misspelling=correction pairs.

WARNING – READ BEFORE RUNNING
    This is computationally EXTREME. A single 10-letter word has 3,628,800
    permutations. A 12-letter word has 479,001,600. For 466k words, the full
    output could be PETABYTES. You WILL need to limit word length.

=============================================================================
HOW TO USE ON GOOGLE COLAB
=============================================================================

1. Open Google Colab → https://colab.research.google.com
2. Create a new notebook (Python 3)

3. Upload your words.txt:
   ─────────────────────────────────────
   # CELL 1: Upload words.txt
   from google.colab import files
   uploaded = files.upload()  # click "Choose Files" → select words.txt
   ─────────────────────────────────────

4. Copy-paste this ENTIRE script into a new cell and run it.

5. Download the result:
   ─────────────────────────────────────
   # CELL 3: Download the output
   files.download('misspellings_permutations.txt')
   ─────────────────────────────────────

=============================================================================
OR: Use Google Drive for large files
=============================================================================

    # Mount Google Drive (you get 15 GB free)
    from google.colab import drive
    drive.mount('/content/drive')

    # Then set OUTPUT_PATH below to:
    OUTPUT_PATH = '/content/drive/MyDrive/misspellings_permutations.txt'

=============================================================================
CONFIGURATION – Adjust these before running!
=============================================================================
"""

import os
import sys
import time
import math
from itertools import permutations

# ── CONFIGURATION ────────────────────────────────────────────────────────────

WORDS_PATH = 'words.txt'                       # path to your words.txt
OUTPUT_PATH = 'misspellings_permutations.txt'  # output file path

MIN_WORD_LEN = 3   # skip words shorter than this
MAX_WORD_LEN = 7   # CRITICAL: max word length to permute
                   #   7  → max 5,040 perms/word (manageable)
                   #   8  → max 40,320 perms/word (large)
                   #   9  → max 362,880 perms/word (very large)
                   #   10 → max 3,628,800 perms/word (EXTREME)
                   # Increase at your own risk!

ONLY_ALPHA = True  # only process pure-alphabetical words
BATCH_LOG = 5000   # print progress every N words

# ── ESTIMATION TABLE ─────────────────────────────────────────────────────────
# Here's roughly how big the output gets at each MAX_WORD_LEN setting,
# assuming ~200k qualifying words at each length bracket:
#
#   MAX_WORD_LEN │ Perms per word (worst) │ Rough output size
#   ─────────────┼────────────────────────┼──────────────────
#        5       │          120           │  ~200 MB
#        6       │          720           │  ~1-2 GB
#        7       │        5,040           │  ~5-15 GB
#        8       │       40,320           │  ~50-150 GB
#        9       │      362,880           │  ~500 GB - 1 TB
#       10       │    3,628,800           │  ~5-50 TB → won't fit anywhere
#
# Google Colab free tier gives you:
#   • ~78 GB disk on the VM (temporary, lost on disconnect)
#   • 15 GB Google Drive (persistent)
#   • Colab Pro: 225 GB disk, longer runtimes
#
# RECOMMENDATION: Start with MAX_WORD_LEN = 6 or 7, see the size,
# then increase if you have space.
# ─────────────────────────────────────────────────────────────────────────────


def estimate_output(words):
    """Estimate total permutations and file size before generating."""
    total_perms = 0
    for w in words:
        n = len(w)
        # Account for duplicate letters: n! / (c1! * c2! * ...)
        freq = {}
        for ch in w.lower():
            freq[ch] = freq.get(ch, 0) + 1
        unique_perms = math.factorial(n)
        for count in freq.values():
            unique_perms //= math.factorial(count)
        total_perms += unique_perms - 1  # subtract the original word

    # Estimate ~15 bytes per line (avg) → "typo=word\n"
    avg_bytes_per_line = 15
    est_bytes = total_perms * avg_bytes_per_line
    est_gb = est_bytes / (1024 ** 3)

    return total_perms, est_gb


def generate_unique_permutations(word):
    """
    Generate all unique permutations of a word's letters,
    excluding the original word itself.

    Uses set() to deduplicate (handles repeated letters efficiently).
    """
    lower = word.lower()
    perms = set(''.join(p) for p in permutations(lower))
    perms.discard(lower)  # remove the correctly-spelled word
    return perms


def is_pure_alpha(word):
    return word.isalpha()


def main():
    if not os.path.exists(WORDS_PATH):
        print(f"ERROR: '{WORDS_PATH}' not found!")
        print("Make sure you uploaded words.txt or set WORDS_PATH correctly.")
        sys.exit(1)

    # ── Read words ──────────────────────────────────────────────
    print(f"Reading words from: {WORDS_PATH}")
    with open(WORDS_PATH, 'r', encoding='utf-8', errors='replace') as f:
        raw_words = [line.strip() for line in f if line.strip()]

    print(f"Total raw entries: {len(raw_words):,}")

    # Filter
    words = []
    for w in raw_words:
        if ONLY_ALPHA and not is_pure_alpha(w):
            continue
        if len(w) < MIN_WORD_LEN or len(w) > MAX_WORD_LEN:
            continue
        words.append(w)

    print(f"Filtered to {len(words):,} words (alpha-only, len {MIN_WORD_LEN}-{MAX_WORD_LEN})")

    if len(words) == 0:
        print("No words matched the filter. Adjust MIN/MAX_WORD_LEN.")
        sys.exit(1)

    # ── Estimate ────────────────────────────────────────────────
    print("\nEstimating output size (this may take a moment)...")
    total_perms, est_gb = estimate_output(words)
    print(f"  Estimated permutations : {total_perms:,}")
    print(f"  Estimated file size    : {est_gb:.2f} GB")

    # Safety check
    if est_gb > 70:
        print(f"\n  WARNING: Estimated output ({est_gb:.1f} GB) exceeds Colab disk (~78 GB).")
        print("  Reduce MAX_WORD_LEN or the script will crash when disk fills up.")
        print("  Aborting. Set MAX_WORD_LEN lower and re-run.")
        sys.exit(1)

    print(f"\nProceeding with generation → {OUTPUT_PATH}")
    print("=" * 60)

    # ── Generate ────────────────────────────────────────────────
    start = time.time()
    total_written = 0

    with open(OUTPUT_PATH, 'w', encoding='utf-8') as out:
        out.write("# Auto-generated FULL PERMUTATION misspellings\n")
        out.write(f"# Config: word length {MIN_WORD_LEN}-{MAX_WORD_LEN}\n")
        out.write("# Format: misspelling=correction\n\n")

        for idx, word in enumerate(words):
            perms = generate_unique_permutations(word)

            for typo in sorted(perms):
                out.write(f"{typo}={word}\n")
                total_written += 1

            # Progress
            if (idx + 1) % BATCH_LOG == 0:
                elapsed = time.time() - start
                pct = (idx + 1) / len(words) * 100
                rate = (idx + 1) / elapsed if elapsed > 0 else 0
                cur_size = os.path.getsize(OUTPUT_PATH) / (1024 ** 3)
                print(f"  [{pct:5.1f}%] {idx+1:>7,}/{len(words):,} words |"
                      f" {total_written:>12,} lines | {cur_size:.2f} GB |"
                      f" {rate:.0f} words/sec")

    elapsed = time.time() - start
    final_size = os.path.getsize(OUTPUT_PATH) / (1024 ** 3)

    print()
    print("=" * 60)
    print(f"  DONE in {elapsed:.1f}s ({elapsed/60:.1f} min)")
    print(f"  Words processed : {len(words):,}")
    print(f"  Lines written   : {total_written:,}")
    print(f"  Output file     : {OUTPUT_PATH}")
    print(f"  File size       : {final_size:.2f} GB")
    print("=" * 60)


if __name__ == '__main__':
    main()
```
generate_typos.py
ADDED
```python
"""
Generate realistic typo-based misspellings from words.txt → misspellings.txt

Typo strategies:
1. Adjacent letter swaps        ("hello" → "hlelo", "helol")
2. Single character deletion    ("hello" → "hllo", "helo")
3. Single character duplication ("hello" → "hhello", "heello")
4. Nearby keyboard key sub      ("hello" → "gello", "jello")

Output format: misspelling=correction (one per line)
"""

import sys
import os
import time

# QWERTY keyboard proximity map
KEYBOARD_NEIGHBORS = {
    'q': 'wa', 'w': 'qeas', 'e': 'wrds', 'r': 'etfs', 't': 'rygs',
    'y': 'tuhs', 'u': 'yijs', 'i': 'uoks', 'o': 'ipls', 'p': 'o',
    'a': 'qwsz', 's': 'awedxz', 'd': 'serfcx', 'f': 'drtgvc',
    'g': 'ftyhbv', 'h': 'gyujnb', 'j': 'huikmn', 'k': 'jiolm',
    'l': 'kop', 'z': 'asx', 'x': 'zsdc', 'c': 'xdfv', 'v': 'cfgb',
    'b': 'vghn', 'n': 'bhjm', 'm': 'njk',
}


def generate_adjacent_swaps(word):
    """Swap each pair of adjacent characters."""
    typos = []
    for i in range(len(word) - 1):
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        typo = ''.join(chars)
        if typo != word:
            typos.append(typo)
    return typos


def generate_deletions(word):
    """Delete one character at a time."""
    typos = []
    for i in range(len(word)):
        typo = word[:i] + word[i + 1:]
        if len(typo) >= 2:  # keep at least 2 chars
            typos.append(typo)
    return typos


def generate_duplications(word):
    """Duplicate one character at a time."""
    typos = []
    for i in range(len(word)):
        typo = word[:i] + word[i] + word[i:]
        if typo != word:
            typos.append(typo)
    return typos


def generate_nearby_key_subs(word):
    """Replace one character with a nearby keyboard key."""
    typos = []
    lower = word.lower()
    for i in range(len(word)):
        ch = lower[i]
        if ch in KEYBOARD_NEIGHBORS:
            for neighbor in KEYBOARD_NEIGHBORS[ch]:
                typo = lower[:i] + neighbor + lower[i + 1:]
                if typo != lower:
                    typos.append(typo)
    return typos


def generate_all_typos(word):
    """Generate all realistic typo variants for a word."""
    typos = set()
    typos.update(generate_adjacent_swaps(word))
    typos.update(generate_deletions(word))
    typos.update(generate_duplications(word))
    typos.update(generate_nearby_key_subs(word))
    typos.discard(word)  # never map a word to itself
    typos.discard(word.lower())
    return typos


def is_pure_alpha(word):
    """Only process words that are purely alphabetical (a-z)."""
    return word.isalpha()


def main():
    base_dir = os.path.dirname(os.path.abspath(__file__))
    words_path = os.path.join(base_dir, 'data', 'words.txt')
    output_path = os.path.join(base_dir, 'data', 'misspellings.txt')

    if not os.path.exists(words_path):
        print(f"ERROR: {words_path} not found.")
        sys.exit(1)

    # ── Read words ──────────────────────────────────────────────
    print(f"Reading words from: {words_path}")
    with open(words_path, 'r', encoding='utf-8', errors='replace') as f:
        raw_words = [line.strip() for line in f if line.strip()]

    print(f"Total raw entries: {len(raw_words):,}")

    # Filter to pure-alpha words with length >= 3
    words = [w for w in raw_words if is_pure_alpha(w) and len(w) >= 3]
    print(f"Filtered to {len(words):,} alphabetical words (len >= 3)")

    # ── Generate typos ──────────────────────────────────────────
    start = time.time()
    total_typos = 0
    batch_size = 10_000

    print(f"Generating typos → {output_path}")
    print("This may take a few minutes for 466k words...")

    with open(output_path, 'w', encoding='utf-8', newline='\n') as out:
        out.write("# Auto-generated misspellings database\n")
        out.write("# Format: misspelling=correction\n")
        out.write("# Generated by generate_typos.py\n")
        out.write("#\n")
        out.write("# Strategies: adjacent swaps, deletions, duplications, keyboard proximity\n")
        out.write("\n")

        for idx, word in enumerate(words):
            correction = word  # original is the correct form
            typos = generate_all_typos(word.lower())

            for typo in sorted(typos):
                out.write(f"{typo}={correction}\n")
                total_typos += 1

            # Progress reporting
            if (idx + 1) % batch_size == 0:
                elapsed = time.time() - start
                pct = (idx + 1) / len(words) * 100
                rate = (idx + 1) / elapsed if elapsed > 0 else 0
                print(f"  [{pct:5.1f}%] {idx + 1:>7,} / {len(words):,} words |"
                      f" {total_typos:>10,} typos | {rate:.0f} words/sec")

    elapsed = time.time() - start
    file_size_mb = os.path.getsize(output_path) / (1024 * 1024)

    print()
    print("=" * 60)
    print(f"  Done in {elapsed:.1f}s")
    print(f"  Words processed : {len(words):,}")
    print(f"  Typos generated : {total_typos:,}")
    print(f"  Output file     : {output_path}")
    print(f"  File size       : {file_size_mb:.1f} MB")
    print("=" * 60)


if __name__ == '__main__':
    main()
```
words.txt
ADDED

The diff for this file is too large to render. See raw diff.