---
license: mit
tags:
- misspelling
- generator
- python
- ipynb
---
# Misspelling Generator
A misspelled-word generator built from `words.txt`, the 466k-word list from this data [provider](https://github.com/dwyl/english-words). For demonstration, the permutation run is limited to words of at most 7 letters:
```
Words processed : 125,414
Lines written : 173,110,626
Output file : misspellings_permutations.txt
File size : 2.53 GB
```
Depending on your available storage, you can raise the letter limit by changing `MAX_WORD_LEN` in the script.
- Use `generate_permutations_colab.py` or `google_collab_173MSW.ipynb` to generate 173,110,626 misspellings from 125,414 processed words; the output is downloadable from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/173Million-misspelling-words.txt)
- Use `generate_typos_colab.py` or `google_collab_263MSW.ipynb` to generate 26,636,990 misspellings from 415,701 processed words; the output is downloadable from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/26Million-misspelling-words.txt)
<br><br><br>
### Option 1: use `generate_typos_local.py` (Run Locally)
Generates **realistic typo variants** using 4 strategies:
| Strategy | Example (`hello`) | Variants |
|---|---|---|
| Adjacent swap | `hlelo`, `helol` | n-1 per word |
| Char deletion | `hllo`, `helo`, `hell` | n per word |
| Char duplication | `hhello`, `heello` | n per word |
| Keyboard proximity | `gello`, `jello`, `hwllo` | varies |
- Processes only **pure-alpha words** with length ≥ 3
- Produces roughly **10 to 50 typos per word**, or about 5M to 20M lines in total
- Output: `data/misspellings.txt` in `misspelling=correction` format
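
The four strategies in the table can be sketched as follows. This is a minimal illustration and not the code of `generate_typos_local.py`: the function names and the small `QWERTY_NEIGHBORS` map are hypothetical, and the map only covers the letters needed for the `hello` example.

```python
# Hypothetical, partial QWERTY adjacency map (assumption: just enough keys for the demo).
QWERTY_NEIGHBORS = {
    "h": "gjyubn",
    "e": "wrsd",
    "l": "kop",
    "o": "iplk",
}


def adjacent_swaps(word):
    """Swap each pair of neighbouring characters (n-1 raw variants)."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(len(word) - 1)}


def deletions(word):
    """Drop one character at a time (n raw variants)."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def duplications(word):
    """Double one character at a time (n raw variants)."""
    return {word[:i] + word[i] + word[i:] for i in range(len(word))}


def keyboard_typos(word, neighbors=QWERTY_NEIGHBORS):
    """Replace each character with a key that sits next to it on the keyboard."""
    return {
        word[:i] + key + word[i + 1:]
        for i, ch in enumerate(word)
        for key in neighbors.get(ch, "")
    }


def typo_variants(word):
    """Union of all four strategies for pure-alpha words of length >= 3."""
    if not word.isalpha() or len(word) < 3:
        return set()
    variants = adjacent_swaps(word) | deletions(word) | duplications(word) | keyboard_typos(word)
    variants.discard(word)  # swapping two equal letters reproduces the original word
    return variants


# Emit a few lines in the misspelling=correction format described above.
for typo in sorted(typo_variants("hello"))[:5]:
    print(f"{typo}=hello")
```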
**To run:**
```
python generate_typos_local.py
```
---
### Option 2: use `generate_permutations_colab.py` (Google Colab)
Generates **ALL letter permutations** of each word. Key config at the top of the file:
```python
MAX_WORD_LEN = 7  # <-- CRITICAL control knob
```
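
To show how a length cap like `MAX_WORD_LEN` interacts with permutation generation, here is a rough sketch built on `itertools.permutations`. The helper name `permutation_lines`, the de-duplication step, and the `misspelling=correction` line format are assumptions made for illustration; the actual script may differ on all three.

```python
from itertools import permutations

MAX_WORD_LEN = 7  # words longer than this are skipped entirely


def permutation_lines(word, max_len=MAX_WORD_LEN):
    """Yield one 'permutation=word' line per distinct reordering of the letters."""
    if not word.isalpha() or len(word) > max_len:
        return
    seen = set()
    for perm in permutations(word):
        candidate = "".join(perm)
        # Skip the correct spelling and duplicates caused by repeated letters.
        if candidate != word and candidate not in seen:
            seen.add(candidate)
            yield f"{candidate}={word}"


# A 3-letter word has 3! = 6 orderings, so 5 misspellings remain.
for line in permutation_lines("cat"):
    print(line)
```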
---
## Google Colab Education
### What Is Google Colab?
Google Colab gives you a **free Linux VM** with Python pre-installed. You get:
| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| **Disk** | ~78 GB (temporary) | ~225 GB (temporary) |
| **RAM** | ~12 GB | ~25-50 GB |
| **GPU** | T4 (limited) | A100/V100 |
| **Runtime limit** | ~12 hours, then VM resets | ~24 hours |
| **Google Drive** | 15 GB (persistent) | 15 GB (same) |
> [!IMPORTANT]
> Colab disk is **ephemeral**: when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.
### Step-by-Step: Running Option 2 on Colab
**Step 1: Open Colab**
Go to [colab.research.google.com](https://colab.research.google.com) → **New Notebook**
**Step 2: Upload `words.txt`**
```python
# Cell 1
from google.colab import files
uploaded = files.upload() # select words.txt from your PC
```
**Step 3: (Optional) Mount Google Drive for persistent storage**
```python
# Cell 2
from google.colab import drive
drive.mount('/content/drive')
# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
```
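
If you would rather not edit the path by hand, one option is to choose `OUTPUT_PATH` depending on whether Drive is mounted and to print the free space at that location first. This is only a sketch: `choose_output_path` is a hypothetical helper, and it assumes Drive is mounted at `/content/drive` as in the cell above.

```python
import os
import shutil

DRIVE_DIR = "/content/drive/MyDrive"          # mount point used in the cell above
OUTPUT_NAME = "misspellings_permutations.txt"


def choose_output_path(drive_dir=DRIVE_DIR, name=OUTPUT_NAME):
    """Write to Google Drive when it is mounted, otherwise to the VM's working directory."""
    if os.path.isdir(drive_dir):
        return os.path.join(drive_dir, name)
    return name


OUTPUT_PATH = choose_output_path()

# Report free space where the file will be written, before generating anything.
target_dir = os.path.dirname(os.path.abspath(OUTPUT_PATH))
free_gb = shutil.disk_usage(target_dir).free / 1e9
print(f"Writing to {OUTPUT_PATH} ({free_gb:.1f} GB free)")
```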
**Step 4: Paste & run the script**
Copy the entire contents of `generate_permutations_colab.py` into a new cell. Adjust `MAX_WORD_LEN` as needed, then run.
**Step 5: Download the result**
```python
# If saved to VM disk:
files.download('misspellings_permutations.txt')
# If saved to Google Drive: just access it from drive.google.com
```
### Scale Reference
> [!CAUTION]
> Full permutations grow at **n! (factorial)** rate. Here's what to expect:
| `MAX_WORD_LEN` | Max perms/word | Est. total output |
|---|---|---|
| 5 | 120 | ~200 MB |
| 6 | 720 | ~1-2 GB |
| **7** | **5,040** | **~5-15 GB** (recommended start) |
| 8 | 40,320 | ~50-150 GB |
| 9 | 362,880 | ~500 GB - 1 TB |
| 10 | 3,628,800 | ~5-50 TB (impossible) |
> [!TIP]
> **Start with `MAX_WORD_LEN = 6` or `7`**, check the output size, then decide if you want to go higher. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.
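
A size estimate of this kind can be approximated before writing anything. The sketch below is hypothetical and is not the script's built-in check: it assumes each output line has the form `permutation=word` plus a newline, counts n! lines per word (an upper bound, since repeated letters produce fewer distinct permutations), and treats 70 GB as 70 × 10^9 bytes.

```python
from math import factorial

SIZE_LIMIT_BYTES = 70 * 10**9  # assumed meaning of the 70 GB threshold


def estimate_output_bytes(words, max_len):
    """Upper-bound estimate of the output size for words no longer than max_len."""
    total = 0
    for word in words:
        n = len(word)
        if n <= max_len:
            # n letters + '=' + n letters + newline = 2n + 2 bytes per line (ASCII)
            total += factorial(n) * (2 * n + 2)
    return total


def check_size(words, max_len, limit=SIZE_LIMIT_BYTES):
    """Return the estimate, or raise before any file is written if it exceeds the limit."""
    estimate = estimate_output_bytes(words, max_len)
    if estimate > limit:
        raise RuntimeError(f"Estimated {estimate / 1e9:.1f} GB exceeds the limit; lower MAX_WORD_LEN")
    return estimate


sample = ["cat", "hello"]
print(f"Estimated size: {check_size(sample, max_len=7)} bytes")
```

Running the estimate for a few values of `MAX_WORD_LEN` against your own word list is a cheap way to pick a setting that fits the available disk.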
### Pro Tips for Colab
- **Keep the browser tab open**: Colab disconnects if idle too long
- **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
- **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
- **CPU-only is fine** for this script: permutation generation is CPU-bound and does not use the GPU