BonySmoke committed (verified) · Commit 0103dc1 · Parent: 922916c

Add README.md

Files changed: README.md (+70 −3)
# Speliuk

A more accurate spelling correction for the Ukrainian language.

## Motivation

When using a spell checker in systems that perform automatic spelling correction without human verification, the following questions arise:
- How to avoid false corrections, i.e. cases where a real word that is simply absent from the vocabulary gets "corrected"? This is especially relevant for fusional languages such as Ukrainian.
- How to find the single best correction for a misspelled word? Many spell checkers rely on candidate frequency and edit distance while ignoring the surrounding context.

To address these issues, we propose a system that is compatible with any spell checker but favors precision over recall.
We improve the accuracy of a spell checker with these complementary models:
- [KenLM](https://github.com/kpu/kenlm). The model is used for fast perplexity calculation to find the best candidate for a misspelled word.
- A Transformer-based NER pipeline to detect misspelled words.
- [SymSpell](https://github.com/wolfgarbe/SymSpell). As of now, this is the only supported spell checker.

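The KenLM model is used to pick among a spell checker's candidates by scoring the whole sentence rather than the word in isolation. A minimal self-contained sketch of that idea follows; the toy bigram table stands in for a real KenLM model, and all names are illustrative rather than Speliuk's actual API:

```python
# Toy bigram "language model" (assumed data); a real system would call
# kenlm.Model("uk.arpa").score(sentence) instead of sentence_score().
BIGRAM_LOGPROB = {
    ("he", "can"): -1.0,
    ("he", "cane"): -6.0,
    ("can", "do"): -1.2,
    ("cane", "do"): -5.5,
}
DEFAULT_LOGPROB = -8.0  # back-off score for unseen bigrams

def sentence_score(words):
    """Sum of bigram log-probabilities; higher means more fluent."""
    return sum(
        BIGRAM_LOGPROB.get(pair, DEFAULT_LOGPROB)
        for pair in zip(words, words[1:])
    )

def best_candidate(words, index, candidates):
    """Substitute each candidate at position `index`, keep the best-scoring one."""
    def score_with(candidate):
        trial = list(words)
        trial[index] = candidate
        return sentence_score(trial)
    return max(candidates, key=score_with)

# "cqn" is misspelled; the surrounding context prefers "can" over "cane".
print(best_candidate(["he", "cqn", "do"], 1, ["can", "cane"]))  # -> can
```

The real pipeline performs the same substitution, but scores full Ukrainian sentences with KenLM log-probabilities.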
## Installation

1. For CPU-only inference, install the CPU version of [PyTorch](https://pytorch.org/get-started/locally/).
2. Make sure you can compile Python extension modules (required for KenLM). On Debian or Ubuntu, you can install the Python development headers like this:
```
sudo apt-get install python3-dev
```
3. Install Speliuk:
```
pip install speliuk
```

## Usage

By default, Speliuk will use pre-trained models stored on [Hugging Face](https://huggingface.co/BonySmoke/Speliuk/tree/main).

```python
>>> from speliuk.correct import Speliuk
>>> speliuk = Speliuk()
>>> speliuk.load()
>>> speliuk.correct("то він моее це зраабити для меніе?")
Correction(corrected_text='то він може це зробити для мене?', annotations=[Annotation(start=7, end=11, source_text='моее', suggestions=['може'], meta={}), Annotation(start=15, end=23, source_text='зраабити', suggestions=['зробити'], meta={}), Annotation(start=28, end=33, source_text='меніе', suggestions=['мене'], meta={})])
```

Speliuk can also be used as a spaCy pipeline component:
```python
>>> import spacy
>>> from speliuk.correct import CorrectionPipe
>>> nlp = spacy.blank('uk')
>>> nlp.add_pipe('speliuk', config=dict(spacy_spelling_model_path='/my/custom/model'))
>>> doc = nlp("то він моее це зраабити для меніе?")
>>> doc._.speliuk_corrected
'то він може це зробити для мене?'
>>> doc.spans["speliuk_errors"]
[моее, зраабити, меніе]
```

## Training Details

### Spelling Error Detection

To detect spelling errors, a spaCy NER model is used.

It was trained on a combination of synthetic and gold data:
- For synthetic data generation, we used [UberText](https://lang.org.ua/en/ubertext/) as base texts and [nlpaug](https://github.com/makcedward/nlpaug) for error generation. In total, 10k samples from different categories were used.
- For gold data, we used spelling errors from the [UA-GEC](https://github.com/grammarly/ua-gec) corpus.

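To make the synthetic side concrete, here is a rough, self-contained sketch of character-level error injection of the kind nlpaug performs; every name below is hypothetical and only illustrates the idea, not Speliuk's training code:

```python
import random

def corrupt_word(word, rng):
    """Apply one random character-level edit: insert, delete, or swap."""
    if len(word) < 3:
        return word
    pos = rng.randrange(1, len(word) - 1)
    op = rng.choice(["insert", "delete", "swap"])
    if op == "insert":  # duplicate one character: "може" -> "мооже"
        return word[:pos] + word[pos] + word[pos:]
    if op == "delete":  # drop one character
        return word[:pos] + word[pos + 1:]
    # swap two adjacent characters
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]

def make_training_pair(sentence, error_rate=0.3, seed=0):
    """Corrupt some words and record character spans of the injected errors
    (the spans play the role of NER labels for the detection model)."""
    rng = random.Random(seed)
    out, spans, offset = [], [], 0
    for word in sentence.split():
        noisy = corrupt_word(word, rng) if rng.random() < error_rate else word
        if noisy != word:
            spans.append((offset, offset + len(noisy)))
        out.append(noisy)
        offset += len(noisy) + 1
    return " ".join(out), spans
```

Pairing the corrupted sentence with its error spans yields exactly the (text, entities) shape a spaCy NER model trains on.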
### Perplexity Calculation

For fast perplexity calculation, we used the existing KenLM model [Yehor/kenlm-uk](https://huggingface.co/Yehor/kenlm-uk), trained on UberText.

### Spell Checker

We used [SymSpell](https://github.com/wolfgarbe/SymSpell) for error correction. The dictionary consists of the 500k most frequent words from the UberText corpus.
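SymSpell finds candidates by precomputing delete-variants of every dictionary word. A rough self-contained sketch of that lookup idea (not the real library, which additionally filters by true edit distance and is heavily optimized; the tiny frequency table is made up for illustration):

```python
from itertools import combinations

def deletes(word, max_distance=1):
    """All strings obtainable by deleting up to `max_distance` characters."""
    out = {word}
    for k in range(1, max_distance + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            out.add("".join(word[i] for i in keep))
    return out

def build_index(dictionary):
    """Map every delete-variant of a dictionary word back to the word."""
    index = {}
    for word in dictionary:
        for variant in deletes(word):
            index.setdefault(variant, set()).add(word)
    return index

def lookup(word, index, dictionary):
    """Candidates share a delete-variant with the input; rank by frequency."""
    candidates = set()
    for variant in deletes(word):
        candidates |= index.get(variant, set())
    return sorted(candidates, key=lambda w: -dictionary[w])

# Toy frequency dictionary; the real one holds 500k UberText words.
freqs = {"може": 1000, "мене": 800, "море": 300}
index = build_index(freqs)
print(lookup("моее", index, freqs))
```

In Speliuk, frequency only produces the candidate list; the final choice among candidates is made by the KenLM context score described above.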