---
license: mit
tags:
- misspelling
- generator
- python
- ipynb
---

# Misspelling Generator

A misspelled-word generator built on the 466k-word list from this data [provider](https://github.com/dwyl/english-words), stored as `words.txt`. For demonstration, we only permute words of up to 7 letters, which produces:
```
Words processed : 125,414
Lines written : 173,110,626
Output file : misspellings_permutations.txt
File size : 2.53 GB
```
Depending on your available storage, you can raise the letter limit: just configure `MAX_WORD_LEN` in the script.

- Use `generate_permutations_colab.py` or `google_collab_173MSW.ipynb` to generate 173,110,626 misspellings from 125,414 processed words; the output can be downloaded from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/173Million-misspelling-words.txt)
- Use `generate_typos_colab.py` or `google_collab_263MSW.ipynb` to generate 26,636,990 misspellings from 415,701 processed words; the output can be downloaded from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/26Million-misspelling-words.txt)

<br><br><br>
### Option 1: use `generate_typos_local.py` (Run Locally)

Generates **realistic typo variants** using 4 strategies:

| Strategy | Example (`hello`) | Variants |
|---|---|---|
| Adjacent swap | `hlelo`, `helol` | n-1 per word |
| Char deletion | `hllo`, `helo`, `hell` | n per word |
| Char duplication | `hhello`, `heello` | n per word |
| Keyboard proximity | `gello`, `jello`, `hwllo` | varies |

- Processes only **pure-alpha words** with length ≥ 3
- Produces roughly **10–50 typos per word**, or about 5M–20M lines in total
- Output: `data/misspellings.txt` in `misspelling=correction` format

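The four strategies in the table can be sketched as follows. This is a minimal illustration, not the code in `generate_typos_local.py`: the function names and the partial `KEY_NEIGHBOURS` map are hypothetical, and the real script may filter or order variants differently.

```python
# Hypothetical sketch of the four strategies; not the actual script.

# Partial QWERTY neighbour map (assumed for illustration; extend as needed).
KEY_NEIGHBOURS = {
    "h": "gjybn",
    "e": "wrd",
    "l": "kop",
    "o": "ipl",
}


def adjacent_swaps(word):
    """Swap each pair of neighbouring characters (n-1 candidates)."""
    return {
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    }


def deletions(word):
    """Drop one character at each position (n candidates)."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def duplications(word):
    """Double one character at each position (n candidates)."""
    return {word[:i] + word[i] + word[i:] for i in range(len(word))}


def keyboard_slips(word):
    """Replace one character with a neighbouring key from the map."""
    return {
        word[:i] + near + word[i + 1:]
        for i, ch in enumerate(word)
        for near in KEY_NEIGHBOURS.get(ch, "")
    }


def typos(word):
    """Union of all four strategies, minus the correctly spelled word."""
    if not word.isalpha() or len(word) < 3:
        return set()
    variants = (
        adjacent_swaps(word) | deletions(word)
        | duplications(word) | keyboard_slips(word)
    )
    variants.discard(word)  # swapping the two l's in "hello" changes nothing
    return variants


# Print a few lines in the misspelling=correction format.
for typo in sorted(typos("hello"))[:5]:
    print(f"{typo}=hello")
```

Using sets removes duplicates such as `helo`, which arises from deleting either `l`, so the number of distinct variants per word is usually a little lower than the per-strategy counts suggest.
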
**To run:**
```
python generate_typos_local.py
```

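Once the script has finished, the `misspelling=correction` lines can be read back into a lookup table. The sketch below is an assumption about how one might consume the file, not part of the project: `load_corrections` is a hypothetical helper, and it runs on an in-memory sample so it works without the real output. For the real file, pass `open("data/misspellings.txt", encoding="utf-8")` instead of the sample. If the same misspelling maps to several words, this version keeps the last one seen.

```python
import io


def load_corrections(lines, limit=None):
    """Read `misspelling=correction` lines into a dict (hypothetical helper)."""
    corrections = {}
    for line_number, line in enumerate(lines):
        if limit is not None and line_number >= limit:
            break  # stop early so a multi-GB file is not loaded whole
        typo, sep, correct = line.rstrip("\n").partition("=")
        if sep and typo and correct:  # skip blank or malformed lines
            corrections[typo] = correct
    return corrections


# In-memory stand-in for data/misspellings.txt.
sample = io.StringIO("hlelo=hello\nhelol=hello\nwrold=world\n")
lookup = load_corrections(sample)
print(lookup.get("hlelo"))
```
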

---

### Option 2: use `generate_permutations_colab.py` (Google Colab)

Generates **ALL letter permutations** of each word. Key config at the top of the file:

```python
MAX_WORD_LEN = 7 # ← CRITICAL control knob
```

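As a rough illustration of what "all letter permutations" means in practice, here is a minimal sketch built on `itertools.permutations`. It is not the code in `generate_permutations_colab.py`: the `permutation_lines` helper, the `permutation=word` line format, and the choice to skip the original spelling and repeated-letter duplicates are all assumptions made for the example.

```python
from itertools import permutations

MAX_WORD_LEN = 7  # words longer than this are skipped


def permutation_lines(word, max_len=MAX_WORD_LEN):
    """Yield `permutation=word` for each distinct reordering (hypothetical helper)."""
    if not word.isalpha() or len(word) > max_len:
        return
    seen = {word}  # skip the original spelling and repeated-letter duplicates
    for letters in permutations(word):
        candidate = "".join(letters)
        if candidate not in seen:
            seen.add(candidate)
            yield f"{candidate}={word}"


# A 3-letter word has 3! = 6 orderings, 5 of which are misspellings.
print(list(permutation_lines("cat")))
```
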

---

## Google Colab Education

### What Is Google Colab?

Google Colab gives you a **free Linux VM** with Python pre-installed. You get:

| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| **Disk** | ~78 GB (temporary) | ~225 GB (temporary) |
| **RAM** | ~12 GB | ~25-50 GB |
| **GPU** | T4 (limited) | A100/V100 |
| **Runtime limit** | ~12 hours, then VM resets | ~24 hours |
| **Google Drive** | 15 GB (persistent) | 15 GB (same) |

> [!IMPORTANT]
> Colab disk is **ephemeral**: when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.

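Because the disk figures above are approximate, it can help to check how much space the current runtime really has before generating a large file. The sketch below uses the standard-library `shutil.disk_usage`; the `free_gb` helper is a hypothetical name, and on Colab you would likely pass `"/content"` (or the mounted Drive path) rather than the current directory.

```python
import shutil


def free_gb(path="."):
    """Return free space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


print(f"Free disk: {free_gb():.1f} GB")
```
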
### Step-by-Step: Running Option 2 on Colab

**Step 1: Open Colab**
Go to [colab.research.google.com](https://colab.research.google.com) → **New Notebook**

**Step 2: Upload `words.txt`**
```python
# Cell 1
from google.colab import files
uploaded = files.upload() # select words.txt from your PC
```

**Step 3: (Optional) Mount Google Drive for persistent storage**
```python
# Cell 2
from google.colab import drive
drive.mount('/content/drive')

# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
```

**Step 4: Paste & run the script**
Copy the entire contents of `generate_permutations_colab.py` into a new cell. Adjust `MAX_WORD_LEN` as needed, then run.

**Step 5: Download the result**
```python
# If saved to VM disk:
from google.colab import files  # already imported if Cell 1 ran in this session
files.download('misspellings_permutations.txt')

# If saved to Google Drive: just access it from drive.google.com
```

### Scale Reference

> [!CAUTION]
> Full permutations grow at **n! (factorial)** rate. Here's what to expect:

| `MAX_WORD_LEN` | Max perms/word | Est. total output |
|---|---|---|
| 5 | 120 | ~200 MB |
| 6 | 720 | ~1–2 GB |
| **7** | **5,040** | **~5–15 GB** ← recommended start |
| 8 | 40,320 | ~50–150 GB |
| 9 | 362,880 | ~500 GB–1 TB |
| 10 | 3,628,800 | ~5–50 TB (impossible) |

> [!TIP]
> **Start with `MAX_WORD_LEN = 6` or `7`**, check the output size, then decide if you want to go higher. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.

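A size estimate of this kind can be computed before writing anything. The sketch below is one possible approach, not the script's actual safety check: it assumes ASCII words, one `permutation=word` line per distinct permutation, and a hypothetical `SIZE_LIMIT_GB` constant set to the 70 GB figure quoted above. A word with repeated letters has fewer than n! distinct permutations, so the count divides n! by k! for each letter that appears k times.

```python
from collections import Counter
from math import factorial

MAX_WORD_LEN = 7
SIZE_LIMIT_GB = 70  # hypothetical constant mirroring the 70 GB threshold


def distinct_permutations(word):
    """n! divided by k! for every letter that appears k times."""
    total = factorial(len(word))
    for repeats in Counter(word).values():
        total //= factorial(repeats)
    return total


def estimated_bytes(words, max_len=MAX_WORD_LEN):
    """Upper-bound output size, counting the original spelling as well."""
    size = 0
    for word in words:
        if word.isalpha() and len(word) <= max_len:
            # permutation + "=" + word + newline, one byte per ASCII character
            line_bytes = 2 * len(word) + 2
            size += distinct_permutations(word) * line_bytes
    return size


sample = ["cat", "hello", "letter"]  # stand-in for the contents of words.txt
gb = estimated_bytes(sample) / 1e9
if gb > SIZE_LIMIT_GB:
    raise SystemExit(f"Estimated {gb:.1f} GB exceeds {SIZE_LIMIT_GB} GB; lower MAX_WORD_LEN")
print(f"Estimated output: {estimated_bytes(sample)} bytes")
```

Running the same function over the full word list for several values of `max_len` gives a quick way to compare against the free disk space before committing to a long run.
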
### Pro Tips for Colab

- **Keep the browser tab open**: Colab disconnects if idle too long
- **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
- **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
- **CPU-only is fine** for this script: permutation generation is CPU-bound and does not use the GPU
