---
license: mit
tags:
- misspelling
- generator
- python
- ipynb
---

# Misspelling Generator

A misspelled-word generator built on the 466k-word English list from [dwyl/english-words](https://github.com/dwyl/english-words), stored here as `words.txt`.

For demonstration, the permutation run only processes words of at most 7 letters:

```
Words processed : 125,414
Lines written   : 173,110,626
Output file     : misspellings_permutations.txt
File size       : 2.53 GB
```

Depending on your available storage, you can raise the letter limit by changing `MAX_WORD_LEN` in the script.

- Use `generate_permutations_colab.py` or `google_collab_173MSW.ipynb` to generate 173,110,626 misspelled words from 125,414 processed words. The output is downloadable from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/173Million-misspelling-words.txt).
- Use `generate_typos_colab.py` or `google_collab_263MSW.ipynb` to generate 26,636,990 misspelled words from 415,701 processed words. The output is downloadable from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/26Million-misspelling-words.txt).
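
To make the permutation counts above concrete, here is a minimal sketch of how the distinct letter rearrangements of a single word could be enumerated. It is not taken from `generate_permutations_colab.py`: the helper name `distinct_permutations` and the choice to drop the original spelling from the result are assumptions made for illustration.

```python
from itertools import permutations

MAX_WORD_LEN = 7  # same knob name as in the script; the value here is illustrative


def distinct_permutations(word: str) -> list[str]:
    """Return the sorted, de-duplicated rearrangements of `word`, excluding `word` itself.

    Hypothetical helper: words that are not purely alphabetic or that exceed
    MAX_WORD_LEN are skipped and yield an empty list.
    """
    if not word.isalpha() or len(word) > MAX_WORD_LEN:
        return []
    # A set removes duplicates caused by repeated letters (e.g. the two l's in "hello").
    variants = {"".join(p) for p in permutations(word)}
    variants.discard(word)  # the correct spelling is not a misspelling
    return sorted(variants)


if __name__ == "__main__":
    print(distinct_permutations("cat"))
    print(len(distinct_permutations("hello")))
```

Building the full set in memory is fine for words of up to 7 letters (at most 5,040 entries each); a real run over the whole word list would more likely stream results straight to the output file.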


### Option 1: use `generate_typos_local.py` (run locally)

Generates **realistic typo variants** using 4 strategies:

| Strategy | Example (`hello`) | Variants |
|---|---|---|
| Adjacent swap | `hlelo`, `helol` | n−1 per word |
| Char deletion | `hllo`, `helo`, `hell` | n per word |
| Char duplication | `hhello`, `heello` | n per word |
| Keyboard proximity | `gello`, `jello`, `hwllo` | varies |

- Processes only **pure-alpha words** with length ≥ 3
- Produces roughly **10–50 typos per word** → ~5M–20M lines total
- Output: `data/misspellings.txt` in `misspelling=correction` format

**To run:**

```
python generate_typos_local.py
```

---

### Option 2: use `generate_permutations_colab.py` (Google Colab)

Generates **ALL letter permutations** of each word. Key config at the top of the file:

```python
MAX_WORD_LEN = 7  # ← CRITICAL control knob
```

---

## Google Colab Education

### What Is Google Colab?

Google Colab gives you a **free Linux VM** with Python pre-installed. You get:

| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| **Disk** | ~78 GB (temporary) | ~225 GB (temporary) |
| **RAM** | ~12 GB | ~25–50 GB |
| **GPU** | T4 (limited) | A100/V100 |
| **Runtime limit** | ~12 hours, then VM resets | ~24 hours |
| **Google Drive** | 15 GB (persistent) | 15 GB (same) |

> [!IMPORTANT]
> Colab disk is **ephemeral**: when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.

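
Since the resource figures above are approximate, it can help to check how much disk space the VM actually has before starting a large run. The sketch below uses only the standard library. The helper names `free_gb` and `has_room_for` and the 10% safety margin are assumptions for illustration, not part of the repository; on Colab you would pass `/content`, the default working directory.

```python
import shutil


def free_gb(path: str = "/content") -> float:
    """Free space, in GB (10**9 bytes), on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for(estimated_gb: float, path: str = "/content", margin: float = 0.9) -> bool:
    """True if `estimated_gb` fits within `margin` (a fraction) of the free space.

    Hypothetical helper: the 0.9 default leaves 10% headroom for other files.
    """
    return estimated_gb <= free_gb(path) * margin


if __name__ == "__main__":
    # "." keeps the example runnable anywhere; use "/content" on Colab.
    print(f"{free_gb('.'):.1f} GB free")
    print("Room for a 2.53 GB output:", has_room_for(2.53, "."))
```

The same check can be pointed at a mounted Google Drive path, although the free space reported for a mounted Drive may not match the account's actual quota.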
### Step-by-Step: Running Option 2 on Colab

**Step 1: Open Colab**

Go to [colab.research.google.com](https://colab.research.google.com) → **New Notebook**

**Step 2: Upload `words.txt`**

```python
# Cell 1
from google.colab import files

uploaded = files.upload()  # select words.txt from your PC
```

**Step 3: (Optional) Mount Google Drive for persistent storage**

```python
# Cell 2
from google.colab import drive

drive.mount('/content/drive')
# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
```

**Step 4: Paste and run the script**

Copy the entire contents of `generate_permutations_colab.py` into a new cell. Adjust `MAX_WORD_LEN` as needed, then run.

**Step 5: Download the result**

```python
from google.colab import files  # already imported if Cell 1 ran in this session

# If saved to VM disk:
files.download('misspellings_permutations.txt')
# If saved to Google Drive: just access it from drive.google.com
```

### Scale Reference

> [!CAUTION]
> Full permutations grow at a **factorial (n!)** rate. Here's what to expect:

| `MAX_WORD_LEN` | Max perms/word | Est. total output |
|---|---|---|
| 5 | 120 | ~200 MB |
| 6 | 720 | ~1–2 GB |
| **7** | **5,040** | **~5–15 GB** ← recommended start |
| 8 | 40,320 | ~50–150 GB |
| 9 | 362,880 | ~500 GB – 1 TB |
| 10 | 3,628,800 | ~5–50 TB ← impractical |

> [!TIP]
> **Start with `MAX_WORD_LEN = 6` or `7`**, check the output size, then decide if you want to go higher. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.

### Pro Tips for Colab

- **Keep the browser tab open**: Colab disconnects if it is idle for too long
- **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
- **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
- **CPU-only is fine** for this script: permutation generation is CPU-bound and does not use the GPU
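
The factorial growth described under Scale Reference can be checked with a short calculation before committing to a run. The sketch below counts the distinct permutations of each word with the multinomial formula n! / (k1! · k2! · ...), which is smaller than n! whenever letters repeat, and sums a rough size estimate over a word list. The function names and the size model (one permutation per line, one byte per letter plus a newline) are assumptions for illustration; the script's own safety check may estimate differently.

```python
from collections import Counter
from math import factorial


def count_distinct_permutations(word: str) -> int:
    """n! divided by the factorial of each letter's multiplicity."""
    total = factorial(len(word))
    for multiplicity in Counter(word).values():
        total //= factorial(multiplicity)  # exact at every step
    return total


def estimate_output(words, max_word_len: int = 7):
    """Return (lines, bytes), assuming each distinct permutation is written on its own line."""
    lines = 0
    size = 0
    for word in words:
        if not word.isalpha() or len(word) > max_word_len:
            continue  # skipped words contribute nothing
        n = count_distinct_permutations(word)
        lines += n
        size += n * (len(word) + 1)  # ASCII letters plus "\n"
    return lines, size


if __name__ == "__main__":
    lines, size = estimate_output(["cat", "hello", "letters"])
    print(f"{lines:,} lines, {size:,} bytes")
```

Running `estimate_output` over `words.txt` for a few values of `max_word_len` gives a quick sense of how steeply the total climbs; if the output file stores more than the bare permutation on each line, the byte estimate has to be scaled up accordingly.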