---
license: mit
tags:
- misspelling
- generator
- python
- ipynb
---

# Misspelling Generator 

A misspelled-word generator built on the 466k-word English list from the [dwyl/english-words](https://github.com/dwyl/english-words) repository, stored as `words.txt`. For demonstration, the permutation run is limited to words of at most 7 letters (`MAX_WORD_LEN = 7`), which produces:
```
  Words processed  : 125,414
  Lines written    : 173,110,626
  Output file      : misspellings_permutations.txt
  File size        : 2.53 GB
```
Depending on your available storage, you can raise the letter limit by changing `MAX_WORD_LEN` in the script.

- Use `generate_permutations_colab.py` or `google_collab_173MSW.ipynb` to generate 173,110,626 misspelled words from 125,414 processed words; the output is downloadable from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/173Million-misspelling-words.txt).
- Use `generate_typos_colab.py` or `google_collab_263MSW.ipynb` to generate 26,636,990 misspelled words from 415,701 processed words; the output is downloadable from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/26Million-misspelling-words.txt).
  
<br><br><br>
### Option 1: use `generate_typos_local.py` (Run Locally)

Generates **realistic typo variants** using 4 strategies:

| Strategy | Example (`hello`) | Variants |
|---|---|---|
| Adjacent swap | `hlelo`, `helol` | n−1 per word |
| Char deletion | `hllo`, `helo`, `hell` | n per word |
| Char duplication | `hhello`, `heello` | n per word |
| Keyboard proximity | `gello`, `jello`, `hwllo` | varies |

- Processes only **pure-alpha words** with length ≥ 3
- Produces roughly **10–50 typos per word** → ~5M–20M lines total
- Output: `data/misspellings.txt` in `misspelling=correction` format

**To run:**
```
python generate_typos_local.py
```
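
The four strategies in the table above can be sketched roughly as follows. This is a minimal illustration, not the code from `generate_typos_local.py`: the function names and the small `KEYBOARD_NEIGHBORS` map are hypothetical, and the map covers only a few QWERTY keys for the sake of the example.

```python
# Hypothetical sketch of the four typo strategies; not the repository's code.

# Assumed, deliberately partial QWERTY neighbor map (illustration only).
KEYBOARD_NEIGHBORS = {
    "h": "gjyunb",
    "e": "wrsd",
    "l": "kop",
    "o": "iplk",
}


def adjacent_swaps(word):
    """Swap each pair of neighboring characters (n-1 candidates)."""
    return [
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    ]


def deletions(word):
    """Drop one character at a time (n candidates)."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


def duplications(word):
    """Double one character at a time (n candidates)."""
    return [word[:i] + word[i] + word[i:] for i in range(len(word))]


def keyboard_substitutions(word):
    """Replace each character with every neighbor listed in the map."""
    variants = []
    for i, ch in enumerate(word):
        for near in KEYBOARD_NEIGHBORS.get(ch, ""):
            variants.append(word[:i] + near + word[i + 1:])
    return variants


def generate_typos(word):
    """Return the sorted distinct typos of `word`, excluding `word` itself."""
    candidates = set(
        adjacent_swaps(word)
        + deletions(word)
        + duplications(word)
        + keyboard_substitutions(word)
    )
    candidates.discard(word)  # swapping two equal letters reproduces the word
    return sorted(candidates)


if __name__ == "__main__":
    for typo in generate_typos("hello"):
        print(f"{typo}=hello")  # misspelling=correction format
```

Collecting the candidates in a set matters because repeated letters produce duplicates: deleting either `l` in `hello` gives `helo`, so the per-word counts in the table are upper bounds.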

---

### Option 2: use `generate_permutations_colab.py` (Google Colab)

Generates **ALL letter permutations** of each word. Key config at the top of the file:

```python
MAX_WORD_LEN = 7   # ← CRITICAL control knob
```
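
As a rough illustration of what "all letter permutations" means in practice, the sketch below uses `itertools.permutations` from the standard library. It is not the script itself: the helper names are hypothetical, and it assumes that duplicates are collapsed, that the original spelling is skipped, and that Option 2 uses the same `misspelling=correction` line format as Option 1.

```python
from itertools import permutations

MAX_WORD_LEN = 7  # same name as the script's config value


def distinct_permutations(word):
    """Return every distinct reordering of `word`, excluding `word` itself."""
    variants = {"".join(p) for p in permutations(word)}
    variants.discard(word)  # assumption: the correct spelling is not emitted
    return sorted(variants)


def permutation_lines(words, max_len=MAX_WORD_LEN):
    """Yield `misspelling=correction` lines for words within the length limit."""
    for word in words:
        # Words longer than max_len are skipped entirely, which is what keeps
        # the factorial growth under control.
        if word.isalpha() and len(word) <= max_len:
            for variant in distinct_permutations(word):
                yield f"{variant}={word}"


if __name__ == "__main__":
    for line in permutation_lines(["cat", "extraordinary"]):
        print(line)
```

Because `permutation_lines` is a generator, lines can be written to disk one at a time instead of holding the full output in memory; only the permutations of a single word (at most 5,040 strings for 7 letters) are held at once.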

---

## Google Colab Education

### What Is Google Colab?

Google Colab gives you a **free Linux VM** with Python pre-installed. You get:

| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| **Disk** | ~78 GB (temporary) | ~225 GB (temporary) |
| **RAM** | ~12 GB | ~25-50 GB |
| **GPU** | T4 (limited) | A100/V100 |
| **Runtime limit** | ~12 hours, then VM resets | ~24 hours |
| **Google Drive** | 15 GB (persistent) | 15 GB (same) |

> [!IMPORTANT]
> Colab disk is **ephemeral**: when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.

### Step-by-Step: Running Option 2 on Colab

**Step 1: Open Colab**
Go to [colab.research.google.com](https://colab.research.google.com) → **New Notebook**

**Step 2: Upload `words.txt`**
```python
# Cell 1
from google.colab import files
uploaded = files.upload()   # select words.txt from your PC
```

**Step 3: (Optional) Mount Google Drive for persistent storage**
```python
# Cell 2
from google.colab import drive
drive.mount('/content/drive')

# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
```
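
One way to avoid editing the path by hand is to pick it at runtime, depending on whether Drive is mounted. The sketch below is only a suggestion: `choose_output_path` is a hypothetical helper that is not part of the script, and it assumes the script reads a variable named `OUTPUT_PATH`, as the comment above implies.

```python
import os

DRIVE_ROOT = "/content/drive/MyDrive"  # location used by the mount step
OUTPUT_NAME = "misspellings_permutations.txt"


def choose_output_path(drive_root=DRIVE_ROOT, filename=OUTPUT_NAME):
    """Prefer Google Drive when it is mounted, else fall back to the VM disk."""
    if os.path.isdir(drive_root):
        return os.path.join(drive_root, filename)
    return filename  # relative path on the ephemeral VM disk


OUTPUT_PATH = choose_output_path()
print(f"Writing output to: {OUTPUT_PATH}")
```

Keep in mind that the free Drive quota listed in the table above is 15 GB, so larger outputs still need the VM disk or more Drive storage.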

**Step 4: Paste & run the script**
Copy the entire contents of `generate_permutations_colab.py` into a new cell. Adjust `MAX_WORD_LEN` as needed, then run.

**Step 5: Download the result**
```python
# If saved to VM disk:
files.download('misspellings_permutations.txt')

# If saved to Google Drive: just access it from drive.google.com
```

### Scale Reference

> [!CAUTION]
> Full permutations grow at **n! (factorial)** rate. Here's what to expect:

| `MAX_WORD_LEN` | Max perms/word | Est. total output |
|---|---|---|
| 5 | 120 | ~200 MB |
| 6 | 720 | ~1–2 GB |
| **7** | **5,040** | **~5–15 GB** ← recommended start |
| 8 | 40,320 | ~50–150 GB |
| 9 | 362,880 | ~500 GB – 1 TB |
| 10 | 3,628,800 | ~5–50 TB ← impossible |

> [!TIP]
> **Start with `MAX_WORD_LEN = 6` or `7`**, check the output size, then decide if you want to go higher. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.
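
A size estimate of this kind can be computed before generating anything. The sketch below is an assumption about how such a check might look, not the script's actual implementation: the function names are hypothetical, it assumes one `misspelling=correction` line per distinct permutation, and it assumes ASCII text so that one character is one byte. A word of length n with letter counts k1, k2, ... has n! / (k1! · k2! · ...) distinct permutations.

```python
from collections import Counter
from math import factorial

SIZE_LIMIT_BYTES = 70 * 1024**3  # the 70 GB threshold mentioned in the tip


def distinct_permutation_count(word):
    """n! divided by the factorial of each letter's repeat count."""
    count = factorial(len(word))
    for repeats in Counter(word).values():
        count //= factorial(repeats)
    return count


def estimate_output_bytes(words, max_len):
    """Estimate the output size without generating any permutations."""
    total = 0
    for word in words:
        n = len(word)
        if word.isalpha() and n <= max_len:
            # Assumed line layout: permutation (n) + "=" + word (n) + newline.
            total += distinct_permutation_count(word) * (2 * n + 2)
    return total


def check_size(words, max_len, limit=SIZE_LIMIT_BYTES):
    """Raise before any work starts if the estimate exceeds the limit."""
    estimate = estimate_output_bytes(words, max_len)
    if estimate > limit:
        raise RuntimeError(
            f"Estimated output {estimate / 1024**3:.1f} GB exceeds the limit"
        )
    return estimate


if __name__ == "__main__":
    sample = ["hello", "cat", "extraordinary"]
    print(check_size(sample, max_len=7), "bytes estimated")
```

Running the estimate on the real `words.txt` for a few values of `MAX_WORD_LEN` is a cheap way to compare against the table above before committing to a long run.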

### Pro Tips for Colab

- **Keep the browser tab open**: Colab disconnects if idle too long
- **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
- **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
- **CPU-only is fine** for this script: permutation generation is CPU-bound, not GPU