Update README.md
README.md
CHANGED
|
@@ -4,6 +4,7 @@ tags:
|
|
| 4 |
- misspelling
|
| 5 |
- generator
|
| 6 |
- python
|
|
|
|
| 7 |
---
|
| 8 |
|
| 9 |
# Misspelling Generator
|
|
@@ -15,9 +16,13 @@ A misspelling words generator of the 466k words from data [provider](https://git
|
|
| 15 |
Output file : misspellings_permutations.txt
|
| 16 |
File size : 2.53 GB
|
| 17 |
```
|
| 18 |
-
depending your storage you could do more litter combination limit, just configure the `MAX_WORD_LEN` in the script.
|
| 19 |
-
|
| 20 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 21 |
|
| 22 |
Generates **realistic typo variants** using 4 strategies:
|
| 23 |
|
|
@@ -34,7 +39,7 @@ Generates **realistic typo variants** using 4 strategies:
|
|
| 34 |
|
| 35 |
**To run:**
|
| 36 |
```
|
| 37 |
-
python
|
| 38 |
```
|
| 39 |
|
| 40 |
---
|
|
@@ -121,4 +126,4 @@ files.download('misspellings_permutations.txt')
|
|
| 121 |
- **Keep the browser tab open** — Colab disconnects if idle too long
|
| 122 |
- **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
|
| 123 |
- **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
|
| 124 |
-
- **CPU-only is fine** for this script — permutation generation is CPU-bound, not GPU
|
|
|
|
| 4 |
- misspelling
|
| 5 |
- generator
|
| 6 |
- python
|
| 7 |
+
- ipynb
|
| 8 |
---
|
| 9 |
|
| 10 |
# Misspelling Generator
|
|
|
|
| 16 |
Output file : misspellings_permutations.txt
|
| 17 |
File size : 2.53 GB
|
| 18 |
```
|
| 19 |
+
Depending on your storage, you can raise the letter-combination limit by configuring `MAX_WORD_LEN` in the script.
|
| 20 |
+
|
| 21 |
+
- Use `generate_permutations_colab.py` or `google_collab_173MSW.ipynb` to generate 173,110,626 misspelled words from 125,414 processed datasets; the output is downloadable at [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/173Million-misspelling-words.txt)
|
| 22 |
+
- Use `generate_typos_colab.py` or `google_collab_263MSW.ipynb` to generate 26,636,990 misspelled words from 415,701 processed datasets; the output is downloadable at [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/26Million-misspelling-words.txt)
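
To get a feel for how `MAX_WORD_LEN` affects output size, here is a rough sketch that counts distinct letter orderings per word. It assumes the script enumerates letter permutations of each word up to the length limit; the value of `MAX_WORD_LEN` below and the helper names `distinct_permutations` and `estimated_variants` are illustrative, not taken from the script.

```python
from collections import Counter
from math import factorial

# Assumed value for illustration; the real script defines its own MAX_WORD_LEN.
MAX_WORD_LEN = 7


def distinct_permutations(word: str) -> int:
    """Count distinct orderings of the letters in `word` (multinomial coefficient)."""
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)  # repeated letters produce identical orderings
    return total


def estimated_variants(words, max_len: int = MAX_WORD_LEN) -> int:
    """Sum permutation counts over words no longer than `max_len` (hypothetical helper)."""
    return sum(distinct_permutations(w) for w in words if len(w) <= max_len)
```

Because the count grows factorially with word length, raising the limit by one letter can multiply the output size several times over, so it is worth estimating before a long run.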
|
| 23 |
+
|
| 24 |
+
<br><br><br>
|
| 25 |
+
### Option 1 — use `generate_typos_local.py` (Run Locally)
|
| 26 |
|
| 27 |
Generates **realistic typo variants** using 4 strategies:
|
| 28 |
|
|
|
|
| 39 |
|
| 40 |
**To run:**
|
| 41 |
```
|
| 42 |
+
python generate_typos_local.py
|
| 43 |
```
|
| 44 |
|
| 45 |
---
|
|
|
|
| 126 |
- **Keep the browser tab open** — Colab disconnects if idle too long
|
| 127 |
- **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
|
| 128 |
- **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
|
| 129 |
+
- **CPU-only is fine** for this script — permutation generation is CPU-bound, not GPU
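
The tip about writing directly to Google Drive can be sketched as follows. This is a minimal example, not the script's actual code: `stream_to_file` and `flush_every` are hypothetical names, and the Drive path in the comment assumes the standard Colab mount point.

```python
import os
from typing import Iterable


def stream_to_file(lines: Iterable[str], out_path: str, flush_every: int = 100_000) -> int:
    """Append lines to `out_path`, forcing data to disk periodically.

    In Colab, mount Drive first and pass a path under the mount, e.g.:
        from google.colab import drive
        drive.mount('/content/drive')
        out_path = '/content/drive/MyDrive/misspellings_permutations.txt'
    """
    written = 0
    with open(out_path, "a", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")
            written += 1
            if written % flush_every == 0:
                fh.flush()
                os.fsync(fh.fileno())  # ask the OS to persist what was written so far
    return written
```

Opening the file in append mode means a restarted run adds to the existing output rather than overwriting it; whether that is desirable depends on how the generator resumes.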
|