---
license: mit
tags:
- misspelling
- generator
- python
- ipynb
---
# Misspelling Generator
A misspelled-word generator built on the 466k-word list `words.txt` from [dwyl/english-words](https://github.com/dwyl/english-words). For demonstration, only words of at most 7 letters are permuted (`MAX_WORD_LEN = 7`), which produces:
```
Words processed : 125,414
Lines written : 173,110,626
Output file : misspellings_permutations.txt
File size : 2.53 GB
```
Depending on your available storage, you can raise the letter limit by changing `MAX_WORD_LEN` in the script.
- Use `generate_permutations_colab.py` or `google_collab_173MSW.ipynb` to generate 173,110,626 misspellings from 125,414 processed words; the output is downloadable from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/173Million-misspelling-words.txt)
- Use `generate_typos_colab.py` or `google_collab_263MSW.ipynb` to generate 26,636,990 misspellings from 415,701 processed words; the output is downloadable from [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/26Million-misspelling-words.txt)
<br><br><br>
### Option 1: use `generate_typos_local.py` (Run Locally)
Generates **realistic typo variants** using 4 strategies:
| Strategy | Example (`hello`) | Variants |
|---|---|---|
| Adjacent swap | `hlelo`, `helol` | n−1 per word |
| Char deletion | `hllo`, `helo`, `hell` | n per word |
| Char duplication | `hhello`, `heello` | n per word |
| Keyboard proximity | `gello`, `jello`, `hwllo` | varies |
- Processes only **pure-alpha words** with length ≥ 3
- Produces roughly **10–50 typos per word** → ~5M–20M lines total
- Output: `data/misspellings.txt` in `misspelling=correction` format
**To run:**
```
python generate_typos_local.py
```
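The four strategies in the table can be sketched as follows. This is a minimal illustration, not the code in `generate_typos_local.py`: the function names, the partial `KEYBOARD_NEIGHBORS` map, and the ASCII-only eligibility check are all assumptions made for the example.

```python
# Hypothetical, partial QWERTY neighbour map; the real script's map may differ.
KEYBOARD_NEIGHBORS = {"h": "gjyunb", "e": "wrsd", "l": "kop", "o": "ipkl"}

def is_eligible(word):
    """Pure-alpha words of length >= 3 (ASCII-only is an assumption)."""
    return word.isascii() and word.isalpha() and len(word) >= 3

def adjacent_swaps(word):
    """Swap each neighbouring pair of characters: n-1 candidates."""
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]

def deletions(word):
    """Drop one character at a time: n candidates."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]

def duplications(word):
    """Double one character at a time: n candidates."""
    return [word[:i] + word[i] + word[i:] for i in range(len(word))]

def keyboard_substitutions(word, neighbors=KEYBOARD_NEIGHBORS):
    """Replace each character with each of its keyboard neighbours."""
    return [word[:i] + repl + word[i + 1:]
            for i, ch in enumerate(word)
            for repl in neighbors.get(ch, "")]

def generate_typos(word):
    """Combine all strategies, dropping duplicates and the word itself."""
    candidates = (adjacent_swaps(word) + deletions(word)
                  + duplications(word) + keyboard_substitutions(word))
    seen, result = set(), []
    for candidate in candidates:
        if candidate != word and candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result

def format_line(typo, word):
    """One output line in `misspelling=correction` format."""
    return f"{typo}={word}"

if is_eligible("hello"):
    for typo in generate_typos("hello")[:3]:
        print(format_line(typo, "hello"))
```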
---
### Option 2: use `generate_permutations_colab.py` (Google Colab)
Generates **ALL letter permutations** of each word. Key config at the top of the file:
```python
MAX_WORD_LEN = 7  # ← CRITICAL control knob
```
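To make the role of `MAX_WORD_LEN` concrete, here is a minimal sketch of permutation generation with a length cap. The function name `word_permutations` is hypothetical, and skipping (rather than truncating) words longer than the cap is an assumption about how the script behaves.

```python
from itertools import permutations

MAX_WORD_LEN = 7  # same knob as in the script

def word_permutations(word, max_len=MAX_WORD_LEN):
    """Return the distinct reorderings of `word`, excluding the word itself.

    Words longer than `max_len` are skipped (assumed behaviour), because
    the number of orderings grows factorially with word length.
    """
    if len(word) > max_len:
        return []
    # A set collapses duplicates caused by repeated letters, e.g. "aab".
    unique = {"".join(p) for p in permutations(word)}
    unique.discard(word)
    return sorted(unique)

print(word_permutations("cat"))
print(len(word_permutations("letters")))
```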
---
## Google Colab Education
### What Is Google Colab?
Google Colab gives you a **free Linux VM** with Python pre-installed. You get:
| Resource | Free Tier | Colab Pro ($12/mo) |
|---|---|---|
| **Disk** | ~78 GB (temporary) | ~225 GB (temporary) |
| **RAM** | ~12 GB | ~25-50 GB |
| **GPU** | T4 (limited) | A100/V100 |
| **Runtime limit** | ~12 hours, then VM resets | ~24 hours |
| **Google Drive** | 15 GB (persistent) | 15 GB (same) |
> [!IMPORTANT]
> Colab disk is **ephemeral**: when the runtime disconnects, all files on the VM are deleted. Only Google Drive persists.
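
Because the disk figures above are approximate, it can help to check the free space on the VM before starting a long run. The sketch below uses the standard-library `shutil.disk_usage`; the helper names `free_gib` and `has_room` are hypothetical.

```python
import shutil

def free_gib(path="."):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3

def has_room(required_gib, path="."):
    """True if at least `required_gib` GiB is free at `path`."""
    return free_gib(path) >= required_gib

print(f"Free space: {free_gib():.1f} GiB")
```

For a Drive-backed run, pass the mount path (for example `/content/drive/MyDrive`) as `path` once Drive is mounted.
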
### Step-by-Step: Running Option 2 on Colab
**Step 1: Open Colab**
Go to [colab.research.google.com](https://colab.research.google.com) → **New Notebook**
**Step 2: Upload `words.txt`**
```python
# Cell 1
from google.colab import files
uploaded = files.upload() # select words.txt from your PC
```
**Step 3: (Optional) Mount Google Drive for persistent storage**
```python
# Cell 2
from google.colab import drive
drive.mount('/content/drive')
# Then change OUTPUT_PATH in the script to:
# '/content/drive/MyDrive/misspellings_permutations.txt'
```
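Rather than editing the path by hand, the output location can be chosen at runtime depending on whether Drive is mounted. This is a sketch only: `choose_output_path` is a hypothetical helper, and whether the script reads a variable named exactly `OUTPUT_PATH` is taken from the comment above.

```python
import os

def choose_output_path(filename="misspellings_permutations.txt",
                       drive_root="/content/drive/MyDrive"):
    """Prefer Google Drive when it is mounted, else the current directory."""
    if os.path.isdir(drive_root):
        return os.path.join(drive_root, filename)
    return filename

OUTPUT_PATH = choose_output_path()
print("Writing to:", OUTPUT_PATH)
```

Falling back to the VM disk keeps the script runnable without Drive, at the cost of losing the file if the runtime disconnects.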
**Step 4: Paste & run the script**
Copy the entire contents of `generate_permutations_colab.py` into a new cell. Adjust `MAX_WORD_LEN` as needed, then run.
**Step 5: Download the result**
```python
# If saved to VM disk:
files.download('misspellings_permutations.txt')
# If saved to Google Drive: just access it from drive.google.com
```
### Scale Reference
> [!CAUTION]
> Full permutations grow at **n! (factorial)** rate. Here's what to expect:
| `MAX_WORD_LEN` | Max perms/word | Est. total output |
|---|---|---|
| 5 | 120 | ~200 MB |
| 6 | 720 | ~1–2 GB |
| **7** | **5,040** | **~5–15 GB** ← recommended start |
| 8 | 40,320 | ~50–150 GB |
| 9 | 362,880 | ~500 GB – 1 TB |
| 10 | 3,628,800 | ~5–50 TB ← impossible |
> [!TIP]
> **Start with `MAX_WORD_LEN = 6` or `7`**, check the output size, then decide if you want to go higher. The script has a built-in safety check that aborts if the estimated size exceeds 70 GB.
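
An output-size estimate of the kind the safety check relies on can be sketched as below. This is not the script's actual check: the function names are hypothetical, the `permutation=word` line layout is an assumption, and treating the 70 GB threshold as 70 × 1024³ bytes is also an assumption. A word with repeated letters has fewer distinct orderings than n!, namely n! divided by the factorial of each letter's count.

```python
from collections import Counter
from math import factorial

SIZE_LIMIT_BYTES = 70 * 1024**3  # 70 GB threshold; exact unit is assumed

def distinct_permutations(word):
    """n! divided by the factorial of each repeated letter's count."""
    count = factorial(len(word))
    for repeats in Counter(word).values():
        count //= factorial(repeats)
    return count

def estimate_bytes(words, max_len):
    """Estimated output size, assuming ASCII `permutation=word` lines."""
    total = 0
    for word in words:
        if len(word) <= max_len:
            line_bytes = 2 * len(word) + 2  # permutation + "=" + word + "\n"
            total += (distinct_permutations(word) - 1) * line_bytes
    return total

def exceeds_limit(words, max_len, limit=SIZE_LIMIT_BYTES):
    """True if the estimated output would be larger than `limit` bytes."""
    return estimate_bytes(words, max_len) > limit

sample = ["hello", "letters", "permutation"]
print(estimate_bytes(sample, max_len=7), "bytes estimated")
```

Running the estimate over the full word list before generating anything gives a quick way to compare a candidate `MAX_WORD_LEN` against the available disk.
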
### Pro Tips for Colab
- **Keep the browser tab open**: Colab disconnects if idle too long
- **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
- **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
- **CPU-only is fine** for this script: permutation generation is CPU-bound and does not use the GPU