algorembrant committed on
Commit
48f4f5c
·
verified ·
1 Parent(s): e518c34

Update README.md

Files changed (1)
  1. README.md +10 -5
README.md CHANGED
@@ -4,6 +4,7 @@ tags:
4
  - misspelling
5
  - generator
6
  - python
 
7
  ---
8
 
9
  # Misspelling Generator
@@ -15,9 +16,13 @@ A misspelling words generator of the 466k words from data [provider](https://git
15
  Output file : misspellings_permutations.txt
16
  File size : 2.53 GB
17
  ```
18
- depending your storage you could do more litter combination limit, just configure the `MAX_WORD_LEN` in the script.
19
-
20
- ### Option 1 — use `generate_typos.py` (Run Locally)
 
 
 
 
21
 
22
  Generates **realistic typo variants** using 4 strategies:
23
 
@@ -34,7 +39,7 @@ Generates **realistic typo variants** using 4 strategies:
34
 
35
  **To run:**
36
  ```
37
- python generate_typos.py
38
  ```
39
 
40
  ---
@@ -121,4 +126,4 @@ files.download('misspellings_permutations.txt')
121
  - **Keep the browser tab open** — Colab disconnects if idle too long
122
  - **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
123
  - **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
124
- - **CPU-only is fine** for this script — permutation generation is CPU-bound, not GPU
 
4
  - misspelling
5
  - generator
6
  - python
7
+ - ipynb
8
  ---
9
 
10
  # Misspelling Generator
 
16
  Output file : misspellings_permutations.txt
17
  File size : 2.53 GB
18
  ```
19
+ Depending on your storage, you can raise the letter combination limit; just configure `MAX_WORD_LEN` in the script.
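
To get a feel for why `MAX_WORD_LEN` drives storage, here is a rough sketch that counts how many distinct letter orderings a word list could produce under a given length cap. It assumes the script emits full letter permutations, which the README does not confirm; `distinct_permutations`, `estimate_lines`, and the sample words are hypothetical names used only for illustration.

```python
from collections import Counter
from math import factorial


def distinct_permutations(word: str) -> int:
    """Count distinct orderings of the letters in `word` (multiset permutations)."""
    total = factorial(len(word))
    # Repeated letters produce identical orderings, so divide them out.
    for count in Counter(word).values():
        total //= factorial(count)
    return total


def estimate_lines(words, max_word_len: int) -> int:
    """Upper bound on output lines if every word up to the cap is fully permuted."""
    return sum(distinct_permutations(w) for w in words if len(w) <= max_word_len)


# Hypothetical sample; the real input is the 466k-word list.
sample = ["cat", "book", "letter", "spelling"]
for limit in (4, 6, 8):
    print(f"MAX_WORD_LEN={limit}: up to {estimate_lines(sample, limit)} lines")
```

The count grows factorially with word length, so raising the cap by even one or two letters can multiply the output size many times over.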
20
+
21
+ - Use `generate_permutations_colab.py` or `google_collab_173MSW.ipynb` to generate 173,110,626 misspelled words from 125,414 processed datasets; the output is downloadable at [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/173Million-misspelling-words.txt)
22
+ - Use `generate_typos_colab.py` or `google_collab_263MSW.ipynb` to generate 26,636,990 misspelled words from 415,701 processed datasets; the output is downloadable at [Hugging Face](https://huggingface.co/datasets/algorembrant/generated-misspelling-words/blob/main/26Million-misspelling-words.txt)
23
+
24
+ <br><br><br>
25
+ ### Option 1 — use `generate_typos_local.py` (Run Locally)
26
 
27
  Generates **realistic typo variants** using 4 strategies:
28
 
 
39
 
40
  **To run:**
41
  ```
42
+ python generate_typos_local.py
43
  ```
44
 
45
  ---
 
126
  - **Keep the browser tab open** — Colab disconnects if idle too long
127
  - **Use `Ctrl+Shift+I` → Console** and paste `setInterval(function(){document.querySelector("colab-connect-button").click()}, 60000)` to prevent idle disconnects
128
  - **For very large outputs**, write directly to Google Drive so you don't lose data on disconnect
129
+ - **CPU-only is fine** for this script — permutation generation is CPU-bound, not GPU
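
The tip about writing large outputs directly to Google Drive could look roughly like the sketch below. It is not taken from the project's scripts: `write_in_chunks` and `flush_every` are hypothetical names, and the demo writes to a temporary directory so it runs anywhere. In Colab you would first mount Drive with `from google.colab import drive; drive.mount('/content/drive')` and then point `out_path` at a file under that mount (assumed here to be something like `/content/drive/MyDrive/`).

```python
import os
import tempfile
from typing import Iterable


def write_in_chunks(lines: Iterable[str], out_path: str, flush_every: int = 100_000) -> int:
    """Append lines to out_path, forcing data to disk every `flush_every` lines."""
    written = 0
    with open(out_path, "a", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")
            written += 1
            if written % flush_every == 0:
                # Push buffered data out so a disconnect loses at most one chunk.
                fh.flush()
                os.fsync(fh.fileno())
    return written


# Demo with a temporary file standing in for a path on the mounted Drive.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "misspellings_permutations.txt")
count = write_in_chunks((f"word{i}" for i in range(5)), demo_path, flush_every=2)
print(count, "lines written to", demo_path)
```

Opening the file in append mode means a restarted session can continue adding to the same file, although the caller would still need to track which words were already processed.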