Upload 5 files
Browse files

- .gitignore +1 -0
- README.md +116 -0
- generate_permutations_colab.py +221 -0
- generate_typos.py +157 -0
- words.txt +0 -0
.gitignore
ADDED

misspellings_permutations.txt
README.md
ADDED
# Misspelling Generator

A misspelling generator for the 466k English words from the [english-words](https://github.com/dwyl/english-words) dataset, read from `words.txt`. For demonstration, we only permute words of at most 7 letters, which produces:

```
Words processed : 125,414
Lines written : 173,110,626
Output file : misspellings_permutations.txt
File size : 2.53 GB
```

Depending on your storage, you can choose a smaller or larger length limit; just configure `MAX_WORD_LEN` in the script.

### Option 1 – use `generate_typos.py` (run locally)

Generates **realistic typo variants** using 4 strategies:

| Strategy | Example (`hello`) | Variants |
|---|---|---|
| Adjacent swap | `hlelo`, `helol` | n−1 per word |
| Char deletion | `hllo`, `helo`, `hell` | n per word |
| Char duplication | `hhello`, `heello` | n per word |
| Keyboard proximity | `gello`, `jello`, `hwllo` | varies |

- Processes only **pure-alpha words** with length ≥ 3
- Produces roughly **10–50 typos per word** → ~5M–20M lines total
- Output: `data/misspellings.txt` in `misspelling=correction` format

**To run:**
```
python generate_typos.py
```

---
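As a quick sanity check of the per-strategy counts, here is a minimal sketch of the adjacent-swap strategy (the same idea as `generate_adjacent_swaps` in `generate_typos.py`); note that repeated letters can yield fewer than n−1 distinct variants:

```python
def adjacent_swaps(word):
    """Swap each neighboring character pair once; at most n-1 variants."""
    variants = []
    for i in range(len(word) - 1):
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        swapped = ''.join(chars)
        if swapped != word:  # swapping two identical letters is a no-op
            variants.append(swapped)
    return variants

print(adjacent_swaps("hello"))  # ['ehllo', 'hlelo', 'helol'] -- only 3, since 'll' swaps to itself
```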
### Option 2 – use `generate_permutations_colab.py` (Google Colab)

Generates **all letter permutations** of each word. Key config at the top of the file:

```python
MAX_WORD_LEN = 7  # ← CRITICAL control knob
```

---
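For a single short word, the permutation idea can be tried directly in a few lines (a sketch of what `generate_unique_permutations` in the script does):

```python
from itertools import permutations

def unique_misspellings(word):
    """All distinct orderings of the word's letters, minus the correct spelling."""
    perms = set(''.join(p) for p in permutations(word.lower()))
    perms.discard(word.lower())
    return perms

print(len(unique_misspellings("stop")))  # 4! - 1 = 23
print(len(unique_misspellings("noon")))  # 4!/(2! * 2!) - 1 = 5 (repeated letters collapse)
```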
## Google Colab Education

### What Is Google Colab?

Google Colab gives you a **free Linux VM** with Python pre-installed. You get:

| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| **Disk** | ~78 GB (temporary) | ~225 GB (temporary) |
| **RAM** | ~12 GB | ~25–50 GB |
| **GPU** | T4 (limited) | A100/V100 |
| **Runtime limit** | ~12 hours, then VM resets | ~24 hours |
| **Google Drive** | 15 GB (persistent) | 15 GB (same) |

> [!IMPORTANT]
> Colab disk is **ephemeral**: when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.

### Step-by-Step: Running Option 2 on Colab

**Step 1 – Open Colab**
Go to [colab.research.google.com](https://colab.research.google.com) → **New Notebook**

**Step 2 – Upload `words.txt`**
```python
# Cell 1
from google.colab import files
uploaded = files.upload()  # select words.txt from your PC
```

**Step 3 – (Optional) Mount Google Drive for persistent storage**
```python
# Cell 2
from google.colab import drive
drive.mount('/content/drive')

# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
```

**Step 4 – Paste & run the script**
Copy the entire contents of `generate_permutations_colab.py` into a new cell. Adjust `MAX_WORD_LEN` as needed, then run.

**Step 5 – Download the result**
```python
# If saved to VM disk:
files.download('misspellings_permutations.txt')

# If saved to Google Drive: just access it from drive.google.com
```

### Scale Reference

> [!CAUTION]
> Full permutations grow at **n! (factorial)** rate. Here's what to expect:

| `MAX_WORD_LEN` | Max perms/word | Est. total output |
|---|---|---|
| 5 | 120 | ~200 MB |
| 6 | 720 | ~1–2 GB |
| **7** | **5,040** | **~5–15 GB** – recommended start |
| 8 | 40,320 | ~50–150 GB |
| 9 | 362,880 | ~500 GB – 1 TB |
| 10 | 3,628,800 | ~5–50 TB – impossible |

> [!TIP]
> **Start with `MAX_WORD_LEN = 6` or `7`**, check the output size, then decide if you want to go higher. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.
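The factorial growth above is easy to verify; the script's size estimator divides n! by the factorial of each repeated-letter count, which is why real output is somewhat smaller than the worst case:

```python
import math
from collections import Counter

def unique_perm_count(word):
    """Distinct permutations of the word, excluding the word itself."""
    n = math.factorial(len(word))
    for count in Counter(word.lower()).values():
        n //= math.factorial(count)  # correct for repeated letters
    return n - 1

print(unique_perm_count("abcdefg"))  # 7! - 1 = 5039 (all letters distinct)
print(unique_perm_count("seven"))    # 5!/2! - 1 = 59 ('e' appears twice)
```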
### Pro Tips for Colab

- **Keep the browser tab open**: Colab disconnects if idle too long
- **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
- **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
- **CPU-only is fine** for this script: permutation generation is CPU-bound, not GPU
generate_permutations_colab.py
ADDED
```python
"""
=============================================================================
FULL PERMUTATION MISSPELLINGS GENERATOR (Google Colab Edition)
=============================================================================

Purpose:
    Generate ALL possible letter permutations of each word from words.txt
    and write them as misspelling=correction pairs.

WARNING – READ BEFORE RUNNING
    This is computationally EXTREME. A single 10-letter word has 3,628,800
    permutations. A 12-letter word has 479,001,600. For 466k words, the full
    output could be PETABYTES. You WILL need to limit word length.

=============================================================================
HOW TO USE ON GOOGLE COLAB
=============================================================================

1. Open Google Colab → https://colab.research.google.com
2. Create a new notebook (Python 3)

3. Upload your words.txt:
   ─────────────────────────────────────
   # CELL 1: Upload words.txt
   from google.colab import files
   uploaded = files.upload()  # click "Choose Files" → select words.txt
   ─────────────────────────────────────

4. Copy-paste this ENTIRE script into a new cell and run it.

5. Download the result:
   ─────────────────────────────────────
   # CELL 3: Download the output
   files.download('misspellings_permutations.txt')
   ─────────────────────────────────────

=============================================================================
OR: Use Google Drive for large files
=============================================================================

    # Mount Google Drive (you get 15 GB free)
    from google.colab import drive
    drive.mount('/content/drive')

    # Then set OUTPUT_PATH below to:
    OUTPUT_PATH = '/content/drive/MyDrive/misspellings_permutations.txt'

=============================================================================
CONFIGURATION – Adjust these before running!
=============================================================================
"""

import os
import sys
import time
import math
from itertools import permutations

# ── CONFIGURATION ────────────────────────────────────────────────────────────

WORDS_PATH = 'words.txt'                       # path to your words.txt
OUTPUT_PATH = 'misspellings_permutations.txt'  # output file path

MIN_WORD_LEN = 3   # skip words shorter than this
MAX_WORD_LEN = 7   # CRITICAL: max word length to permute
                   #   7  → max 5,040 perms/word (manageable)
                   #   8  → max 40,320 perms/word (large)
                   #   9  → max 362,880 perms/word (very large)
                   #   10 → max 3,628,800 perms/word (EXTREME)
                   # Increase at your own risk!

ONLY_ALPHA = True  # only process pure-alphabetical words
BATCH_LOG = 5000   # print progress every N words

# ── ESTIMATION TABLE ─────────────────────────────────────────────────────────
# Here's roughly how big the output gets at each MAX_WORD_LEN setting,
# assuming ~200k qualifying words at each length bracket:
#
#   MAX_WORD_LEN │ Perms per word (worst) │ Rough output size
#   ─────────────┼────────────────────────┼──────────────────
#        5       │          120           │  ~200 MB
#        6       │          720           │  ~1-2 GB
#        7       │        5,040           │  ~5-15 GB
#        8       │       40,320           │  ~50-150 GB
#        9       │      362,880           │  ~500 GB - 1 TB
#       10       │    3,628,800           │  ~5-50 TB → won't fit anywhere
#
# Google Colab free tier gives you:
#   • ~78 GB disk on the VM (temporary, lost on disconnect)
#   • 15 GB Google Drive (persistent)
#   • Colab Pro: 225 GB disk, longer runtimes
#
# RECOMMENDATION: Start with MAX_WORD_LEN = 6 or 7, see the size,
# then increase if you have space.
# ─────────────────────────────────────────────────────────────────────────────


def estimate_output(words):
    """Estimate total permutations and file size before generating."""
    total_perms = 0
    for w in words:
        n = len(w)
        # Account for duplicate letters: n! / (c1! * c2! * ...)
        freq = {}
        for ch in w.lower():
            freq[ch] = freq.get(ch, 0) + 1
        unique_perms = math.factorial(n)
        for count in freq.values():
            unique_perms //= math.factorial(count)
        total_perms += unique_perms - 1  # subtract the original word

    # Estimate ~15 bytes per line (avg) → "typo=word\n"
    avg_bytes_per_line = 15
    est_bytes = total_perms * avg_bytes_per_line
    est_gb = est_bytes / (1024 ** 3)

    return total_perms, est_gb


def generate_unique_permutations(word):
    """
    Generate all unique permutations of a word's letters,
    excluding the original word itself.

    Uses set() to deduplicate (handles repeated letters efficiently).
    """
    lower = word.lower()
    perms = set(''.join(p) for p in permutations(lower))
    perms.discard(lower)  # remove the correctly-spelled word
    return perms


def is_pure_alpha(word):
    return word.isalpha()


def main():
    if not os.path.exists(WORDS_PATH):
        print(f"ERROR: '{WORDS_PATH}' not found!")
        print("Make sure you uploaded words.txt or set WORDS_PATH correctly.")
        sys.exit(1)

    # ── Read words ──────────────────────────────────────────────
    print(f"Reading words from: {WORDS_PATH}")
    with open(WORDS_PATH, 'r', encoding='utf-8', errors='replace') as f:
        raw_words = [line.strip() for line in f if line.strip()]

    print(f"Total raw entries: {len(raw_words):,}")

    # Filter
    words = []
    for w in raw_words:
        if ONLY_ALPHA and not is_pure_alpha(w):
            continue
        if len(w) < MIN_WORD_LEN or len(w) > MAX_WORD_LEN:
            continue
        words.append(w)

    print(f"Filtered to {len(words):,} words (alpha-only, len {MIN_WORD_LEN}-{MAX_WORD_LEN})")

    if len(words) == 0:
        print("No words matched the filter. Adjust MIN/MAX_WORD_LEN.")
        sys.exit(1)

    # ── Estimate ────────────────────────────────────────────────
    print("\nEstimating output size (this may take a moment)...")
    total_perms, est_gb = estimate_output(words)
    print(f"  Estimated permutations : {total_perms:,}")
    print(f"  Estimated file size    : {est_gb:.2f} GB")

    # Safety check
    if est_gb > 70:
        print(f"\n  WARNING: Estimated output ({est_gb:.1f} GB) exceeds Colab disk (~78 GB).")
        print("  Reduce MAX_WORD_LEN or the script will crash when disk fills up.")
        print("  Aborting. Set MAX_WORD_LEN lower and re-run.")
        sys.exit(1)

    print(f"\nProceeding with generation → {OUTPUT_PATH}")
    print("=" * 60)

    # ── Generate ────────────────────────────────────────────────
    start = time.time()
    total_written = 0

    with open(OUTPUT_PATH, 'w', encoding='utf-8') as out:
        out.write("# Auto-generated FULL PERMUTATION misspellings\n")
        out.write(f"# Config: word length {MIN_WORD_LEN}-{MAX_WORD_LEN}\n")
        out.write("# Format: misspelling=correction\n\n")

        for idx, word in enumerate(words):
            perms = generate_unique_permutations(word)

            for typo in sorted(perms):
                out.write(f"{typo}={word}\n")
                total_written += 1

            # Progress
            if (idx + 1) % BATCH_LOG == 0:
                elapsed = time.time() - start
                pct = (idx + 1) / len(words) * 100
                rate = (idx + 1) / elapsed if elapsed > 0 else 0
                cur_size = os.path.getsize(OUTPUT_PATH) / (1024 ** 3)
                print(f"  [{pct:5.1f}%] {idx+1:>7,}/{len(words):,} words |"
                      f" {total_written:>12,} lines | {cur_size:.2f} GB |"
                      f" {rate:.0f} words/sec")

    elapsed = time.time() - start
    final_size = os.path.getsize(OUTPUT_PATH) / (1024 ** 3)

    print()
    print("=" * 60)
    print(f"  DONE in {elapsed:.1f}s ({elapsed/60:.1f} min)")
    print(f"  Words processed : {len(words):,}")
    print(f"  Lines written   : {total_written:,}")
    print(f"  Output file     : {OUTPUT_PATH}")
    print(f"  File size       : {final_size:.2f} GB")
    print("=" * 60)


if __name__ == '__main__':
    main()
```
generate_typos.py
ADDED
```python
"""
Generate realistic typo-based misspellings from words.txt → misspellings.txt

Typo strategies:
1. Adjacent letter swaps        ("hello" → "hlelo", "helol")
2. Single character deletion    ("hello" → "hllo", "helo")
3. Single character duplication ("hello" → "hhello", "heello")
4. Nearby keyboard key sub      ("hello" → "gello", "jello")

Output format: misspelling=correction (one per line)
"""

import sys
import os
import time

# QWERTY keyboard proximity map
KEYBOARD_NEIGHBORS = {
    'q': 'wa', 'w': 'qeas', 'e': 'wrds', 'r': 'etfs', 't': 'rygs',
    'y': 'tuhs', 'u': 'yijs', 'i': 'uoks', 'o': 'ipls', 'p': 'o',
    'a': 'qwsz', 's': 'awedxz', 'd': 'serfcx', 'f': 'drtgvc',
    'g': 'ftyhbv', 'h': 'gyujnb', 'j': 'huikmn', 'k': 'jiolm',
    'l': 'kop', 'z': 'asx', 'x': 'zsdc', 'c': 'xdfv', 'v': 'cfgb',
    'b': 'vghn', 'n': 'bhjm', 'm': 'njk',
}


def generate_adjacent_swaps(word):
    """Swap each pair of adjacent characters."""
    typos = []
    for i in range(len(word) - 1):
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        typo = ''.join(chars)
        if typo != word:
            typos.append(typo)
    return typos


def generate_deletions(word):
    """Delete one character at a time."""
    typos = []
    for i in range(len(word)):
        typo = word[:i] + word[i + 1:]
        if len(typo) >= 2:  # keep at least 2 chars
            typos.append(typo)
    return typos


def generate_duplications(word):
    """Duplicate one character at a time."""
    typos = []
    for i in range(len(word)):
        typo = word[:i] + word[i] + word[i:]
        if typo != word:
            typos.append(typo)
    return typos


def generate_nearby_key_subs(word):
    """Replace one character with a nearby keyboard key."""
    typos = []
    lower = word.lower()
    for i in range(len(word)):
        ch = lower[i]
        if ch in KEYBOARD_NEIGHBORS:
            for neighbor in KEYBOARD_NEIGHBORS[ch]:
                typo = lower[:i] + neighbor + lower[i + 1:]
                if typo != lower:
                    typos.append(typo)
    return typos


def generate_all_typos(word):
    """Generate all realistic typo variants for a word."""
    typos = set()
    typos.update(generate_adjacent_swaps(word))
    typos.update(generate_deletions(word))
    typos.update(generate_duplications(word))
    typos.update(generate_nearby_key_subs(word))
    typos.discard(word)  # never map a word to itself
    typos.discard(word.lower())
    return typos


def is_pure_alpha(word):
    """Only process words that are purely alphabetical (a-z)."""
    return word.isalpha()


def main():
    base_dir = os.path.dirname(os.path.abspath(__file__))
    words_path = os.path.join(base_dir, 'data', 'words.txt')
    output_path = os.path.join(base_dir, 'data', 'misspellings.txt')

    if not os.path.exists(words_path):
        print(f"ERROR: {words_path} not found.")
        sys.exit(1)

    # ── Read words ──────────────────────────────────────────────
    print(f"Reading words from: {words_path}")
    with open(words_path, 'r', encoding='utf-8', errors='replace') as f:
        raw_words = [line.strip() for line in f if line.strip()]

    print(f"Total raw entries: {len(raw_words):,}")

    # Filter to pure-alpha words with length >= 3
    words = [w for w in raw_words if is_pure_alpha(w) and len(w) >= 3]
    print(f"Filtered to {len(words):,} alphabetical words (len >= 3)")

    # ── Generate typos ──────────────────────────────────────────
    start = time.time()
    total_typos = 0
    batch_size = 10_000

    print(f"Generating typos → {output_path}")
    print("This may take a few minutes for 466k words...")

    with open(output_path, 'w', encoding='utf-8', newline='\n') as out:
        out.write("# Auto-generated misspellings database\n")
        out.write("# Format: misspelling=correction\n")
        out.write("# Generated by generate_typos.py\n")
        out.write("#\n")
        out.write("# Strategies: adjacent swaps, deletions, duplications, keyboard proximity\n")
        out.write("\n")

        for idx, word in enumerate(words):
            correction = word  # original is the correct form
            typos = generate_all_typos(word.lower())

            for typo in sorted(typos):
                out.write(f"{typo}={correction}\n")
                total_typos += 1

            # Progress reporting
            if (idx + 1) % batch_size == 0:
                elapsed = time.time() - start
                pct = (idx + 1) / len(words) * 100
                rate = (idx + 1) / elapsed if elapsed > 0 else 0
                print(f"  [{pct:5.1f}%] {idx + 1:>7,} / {len(words):,} words |"
                      f" {total_typos:>10,} typos | {rate:.0f} words/sec")

    elapsed = time.time() - start
    file_size_mb = os.path.getsize(output_path) / (1024 * 1024)

    print()
    print("=" * 60)
    print(f"  Done in {elapsed:.1f}s")
    print(f"  Words processed : {len(words):,}")
    print(f"  Typos generated : {total_typos:,}")
    print(f"  Output file     : {output_path}")
    print(f"  File size       : {file_size_mb:.1f} MB")
    print("=" * 60)


if __name__ == '__main__':
    main()
```
words.txt
ADDED

The diff for this file is too large to render. See raw diff.