---
license: mit
tags:
- misspelling
- generator
- python
- ipynb
---

# Misspelling Generator 

A misspelled-word generator built from the 466k English words published by the data [provider](https://github.com/dwyl/english-words) as `words.txt`. For demonstration, the permutation run is limited to words of at most 7 letters, which gives:
```
  Words processed  : 125,414
  Lines written    : 173,110,626
  Output file      : misspellings_permutations.txt
  File size        : 2.53 GB
```
Depending on your available storage, you can raise the letter limit by changing `MAX_WORD_LEN` in the script.

- Use `generate_permutations_colab.py` or `google_collab_173MSW.ipynb` to generate 173,110,626 misspelled words from 125,414 processed words; the output is downloadable from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/173Million-misspelling-words.txt)
- Use `generate_typos_colab.py` or `google_collab_263MSW.ipynb` to generate 26,636,990 misspelled words from 415,701 processed words; the output is downloadable from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/26Million-misspelling-words.txt)
  
<br><br><br>
### Option 1 — use `generate_typos_local.py` (Run Locally)

Generates **realistic typo variants** using 4 strategies:

| Strategy | Example (`hello`) | Variants |
|---|---|---|
| Adjacent swap | `hlelo`, `helol` | n−1 per word |
| Char deletion | `hllo`, `helo`, `hell` | n per word |
| Char duplication | `hhello`, `heello` | n per word |
| Keyboard proximity | `gello`, `jello`, `hwllo` | varies |

- Processes only **pure-alpha words** with length ≥ 3
- Produces roughly **10–50 typos per word** → ~5M–20M lines total
- Output: `data/misspellings.txt` in `misspelling=correction` format

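The four strategies in the table can be sketched as follows. This is an illustrative sketch only, not the code in `generate_typos_local.py`: the names `is_eligible`, `typo_variants`, and `format_lines` are hypothetical, and `NEIGHBOURS` is a small assumed fragment of a QWERTY adjacency map that covers just the letters of `hello`.

```python
# Assumed fragment of a QWERTY neighbour map (hypothetical, not the script's real table).
NEIGHBOURS = {"h": "gj", "e": "wr", "l": "k", "o": "ip"}


def is_eligible(word):
    """Keep only pure-alphabetic words with at least 3 letters."""
    return word.isalpha() and len(word) >= 3


def typo_variants(word):
    """Return the set of typos produced by the four strategies."""
    variants = set()
    # Adjacent swap: exchange each pair of neighbouring characters.
    for i in range(len(word) - 1):
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    for i in range(len(word)):
        # Char deletion: drop the character at position i.
        variants.add(word[:i] + word[i + 1:])
        # Char duplication: repeat the character at position i.
        variants.add(word[:i] + word[i] + word[i:])
        # Keyboard proximity: replace the character with a nearby key.
        for near in NEIGHBOURS.get(word[i], ""):
            variants.add(word[:i] + near + word[i + 1:])
    # A swap of two identical letters reproduces the word itself; drop it.
    variants.discard(word)
    return variants


def format_lines(word):
    """Render each typo as a `misspelling=correction` line."""
    return [f"{typo}={word}" for typo in sorted(typo_variants(word))]
```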
**To run:**
```
python generate_typos_local.py
```

---

### Option 2 — use `generate_permutations_colab.py` (Google Colab)

Generates **ALL letter permutations** of each word. Key config at the top of the file:

```python
MAX_WORD_LEN = 7   # ← CRITICAL control knob
```
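As a rough illustration of what this knob controls, the sketch below generates every distinct reordering of a word with `itertools.permutations`. It is not the script's actual code: the function name `permutation_misspellings` is hypothetical, and skipping words longer than `MAX_WORD_LEN`, deduplicating repeated letters, and dropping the correct spelling are all assumptions about the behaviour.

```python
from itertools import permutations

MAX_WORD_LEN = 7  # same knob as above


def permutation_misspellings(word, max_len=MAX_WORD_LEN):
    """Return all distinct letter orderings of `word`, excluding `word` itself."""
    # Assumption: words that are too long or not purely alphabetic are skipped.
    if not word.isalpha() or len(word) > max_len:
        return []
    # A set removes duplicates caused by repeated letters (e.g. "aab").
    unique = {"".join(p) for p in permutations(word)}
    unique.discard(word)
    return sorted(unique)
```

For example, `permutation_misspellings("cat")` yields the five orderings `act`, `atc`, `cta`, `tac`, and `tca`, while an 8-letter word returns an empty list under the default limit.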

---

## Google Colab Education

### What Is Google Colab?

Google Colab gives you a **free Linux VM** with Python pre-installed. You get:

| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| **Disk** | ~78 GB (temporary) | ~225 GB (temporary) |
| **RAM** | ~12 GB | ~25-50 GB |
| **GPU** | T4 (limited) | A100/V100 |
| **Runtime limit** | ~12 hours, then VM resets | ~24 hours |
| **Google Drive** | 15 GB (persistent) | 15 GB (same) |

> [!IMPORTANT]
> Colab disk is **ephemeral** — when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.

### Step-by-Step: Running Option 2 on Colab

**Step 1 — Open Colab**
Go to [colab.research.google.com](https://colab.research.google.com) → **New Notebook**

**Step 2 — Upload `words.txt`**
```python
# Cell 1
from google.colab import files
uploaded = files.upload()   # select words.txt from your PC
```
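Once the upload finishes, the word list can be read in a follow-up cell. The helper below is a minimal sketch, assuming `words.txt` holds one word per line; the name `load_words`, the lowercasing, and the filtering rules are illustrative assumptions, not necessarily what the script does.

```python
def load_words(path="words.txt", max_len=7):
    """Read one word per line and keep alphabetic words up to `max_len` letters."""
    with open(path, encoding="utf-8") as handle:
        candidates = (line.strip().lower() for line in handle)
        # A set drops duplicates that appear after lowercasing.
        kept = {w for w in candidates if w.isalpha() and len(w) <= max_len}
    return sorted(kept)
```

Calling `len(load_words())` before running the generator is a quick way to confirm the upload worked and to see how many words a given length limit admits.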

**Step 3 — (Optional) Mount Google Drive for persistent storage**
```python
# Cell 2
from google.colab import drive
drive.mount('/content/drive')

# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
```
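If you would rather not edit the path by hand, a small helper can pick the Drive location when it is mounted and fall back to the VM disk otherwise. This is a sketch under stated assumptions: `choose_output_path` is a hypothetical helper, and `/content/drive/MyDrive` is the folder that appears after mounting at `/content/drive` as in Cell 2.

```python
import os


def choose_output_path(filename="misspellings_permutations.txt",
                       drive_root="/content/drive/MyDrive"):
    """Prefer Google Drive when it is mounted, else use the working directory."""
    if os.path.isdir(drive_root):
        return os.path.join(drive_root, filename)
    return filename


OUTPUT_PATH = choose_output_path()
```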

**Step 4 — Paste & run the script**
Copy the entire contents of `generate_permutations_colab.py` into a new cell. Adjust `MAX_WORD_LEN` as needed, then run.

**Step 5 — Download the result**
```python
from google.colab import files  # already imported in Cell 1; repeated so this cell runs on its own

# If saved to VM disk:
files.download('misspellings_permutations.txt')

# If saved to Google Drive: just access it from drive.google.com
```

### Scale Reference

> [!CAUTION]
> Full permutations grow at **n! (factorial)** rate. Here's what to expect:

| `MAX_WORD_LEN` | Max perms/word | Est. total output |
|---|---|---|
| 5 | 120 | ~200 MB |
| 6 | 720 | ~1–2 GB |
| **7** | **5,040** | **~5–15 GB** (the sample run above wrote 2.53 GB) ← recommended start |
| 8 | 40,320 | ~50–150 GB |
| 9 | 362,880 | ~500 GB – 1 TB |
| 10 | 3,628,800 | ~5–50 TB ← impossible |

> [!TIP]
> **Start with `MAX_WORD_LEN = 6` or `7`**, check the output size, then decide if you want to go higher. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.

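The growth in the table can be estimated before generating anything. The sketch below is not the script's built-in safety check; it is an assumed approximation in which each word of length n contributes at most n! − 1 lines, and each line is written as `permutation=word` plus a newline. The names `estimate_output` and `exceeds_limit` are hypothetical, and the 70 GB threshold is taken from the tip above.

```python
from math import factorial

SIZE_LIMIT_BYTES = 70 * 1024**3  # 70 GB threshold mentioned in the tip


def estimate_output(words, max_len):
    """Return an upper bound on (lines, bytes) for full permutation output."""
    total_lines = 0
    total_bytes = 0
    for word in words:
        n = len(word)
        if n > max_len:
            continue  # assumption: longer words are skipped entirely
        # Upper bound: repeated letters yield fewer distinct permutations.
        lines = factorial(n) - 1
        total_lines += lines
        # Assumed line layout: n chars + "=" + n chars + newline.
        total_bytes += lines * (2 * n + 2)
    return total_lines, total_bytes


def exceeds_limit(total_bytes, limit=SIZE_LIMIT_BYTES):
    """True when the estimate is larger than the allowed size."""
    return total_bytes > limit
```

Because repeated letters reduce the number of distinct permutations, real output is usually smaller than this bound, so treat the result as a ceiling rather than a prediction.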
### Pro Tips for Colab

- **Keep the browser tab open** — Colab disconnects if idle too long
- **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
- **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
- **CPU-only is fine** for this script — permutation generation is CPU-bound, not GPU