# Speliuk

More accurate spelling correction for the Ukrainian language.

## Motivation

When a spell checker is used in systems that perform automatic spelling correction without human verification, the following questions arise:
- How to avoid false corrections, i.e. cases where a real word that is simply missing from the vocabulary gets "corrected"? This is especially relevant for fusional languages such as Ukrainian.
- How to find the single best correction for a misspelled word? Many spell checkers rely on candidate frequency and edit distance, discarding the surrounding context.

To address these issues, we propose a system that is compatible with any spell checker but focuses on precision over recall.<br>
We improve the accuracy of a spell checker by using these complementary models:
- [KenLM](https://github.com/kpu/kenlm). The model is used for fast perplexity calculation to find the best candidate for a misspelled word.
- Transformer-based NER pipeline to detect misspelled words.
- [SymSpell](https://github.com/wolfgarbe/SymSpell). As of now, this is the only supported spell checker.
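
These three pieces fit together in a detect–generate–rerank loop: the NER model flags suspicious tokens, the spell checker proposes candidates, and the language model picks the candidate that makes the whole sentence most fluent. The sketch below illustrates the idea with toy stand-ins for all three models; the vocabularies and function names are hypothetical, not the Speliuk API:

```python
# Toy end-to-end sketch of the detect -> generate -> rerank idea.
# All three "models" are hypothetical stand-ins, not the Speliuk API:
# the NER detector, the spell checker, and the KenLM scorer are replaced
# by a small hard-coded vocabulary and candidate table.
VOCAB = {"то", "він", "може", "це", "зробити", "для", "мене"}
CANDIDATES = {"моее": ["мої", "може"], "зраабити": ["зробити"], "меніе": ["мене"]}

def detect_misspellings(tokens):
    # Detector stand-in: flag out-of-vocabulary tokens.
    return [i for i, tok in enumerate(tokens) if tok not in VOCAB]

def lm_score(tokens):
    # LM stand-in: count in-vocabulary tokens. A real system would use
    # KenLM here (lower perplexity = more fluent sentence).
    return sum(tok in VOCAB for tok in tokens)

def correct(text):
    tokens = text.split()
    for i in detect_misspellings(tokens):
        # Keep the candidate whose full sentence scores best.
        tokens[i] = max(CANDIDATES.get(tokens[i], [tokens[i]]),
                        key=lambda c: lm_score(tokens[:i] + [c] + tokens[i + 1:]))
    return " ".join(tokens)

print(correct("то він моее це зраабити для меніе"))
# -> то він може це зробити для мене
```

Note how the context decides between "мої" and "може": both are plausible replacements for "моее" in isolation, but only "може" fits the surrounding sentence.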

## Installation

1. For CPU-only inference, install the CPU version of [PyTorch](https://pytorch.org/get-started/locally/).
2. Make sure you can compile Python extension modules (required for KenLM). On Debian/Ubuntu, you can install the Python headers like this:

   ```
   sudo apt-get install python3-dev
   ```

3. Install Speliuk:

   ```
   pip install speliuk
   ```

## Usage

By default, Speliuk will use pre-trained models stored on [Hugging Face](https://huggingface.co/BonySmoke/Speliuk/tree/main).

```python
>>> from speliuk.correct import Speliuk
>>> speliuk = Speliuk()
>>> speliuk.load()
>>> speliuk.correct("то він моее це зраабити для меніе?")
Correction(corrected_text='то він може це зробити для мене?', annotations=[Annotation(start=7, end=11, source_text='моее', suggestions=['може'], meta={}), Annotation(start=15, end=23, source_text='зраабити', suggestions=['зробити'], meta={}), Annotation(start=28, end=33, source_text='меніе', suggestions=['мене'], meta={})])
```

Speliuk can also be used directly as a spaCy pipeline component:
```python
>>> import spacy
>>> from speliuk.correct import CorrectionPipe
>>> nlp = spacy.blank('uk')
>>> nlp.add_pipe('speliuk', config=dict(spacy_spelling_model_path='/my/custom/model'))
>>> doc = nlp("то він моее це зраабити для меніе?")
>>> doc._.speliuk_corrected
'то він може це зробити для мене?'
>>> doc.spans["speliuk_errors"]
[моее, зраабити, меніе]
```

## Training Details

### Spelling Error Detection

To detect spelling errors, we use a spaCy NER model.

It was trained on a combination of synthetic and gold data:
- For synthetic data generation, we used [UberText](https://lang.org.ua/en/ubertext/) as the base texts and [nlpaug](https://github.com/makcedward/nlpaug) for error generation. In total, 10k samples from different categories were used.
- For gold data, we used spelling errors from the [UA-GEC](https://github.com/grammarly/ua-gec) corpus.
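
The synthetic side of this recipe boils down to corrupting clean words and recording their character spans as NER labels. A minimal self-contained sketch of that idea (the actual pipeline uses nlpaug, and the `SPELLING_ERROR` label name here is illustrative):

```python
# Illustrative synthetic-data generator: corrupt random words in a clean
# sentence and record the character spans of the corrupted words, which is
# the shape of data an NER-style error detector trains on.
# Not the actual nlpaug-based pipeline; label name is hypothetical.
import random

def corrupt_word(word, rng):
    # Apply one random character edit: duplication, swap, or deletion.
    i = rng.randrange(len(word))
    op = rng.choice(["duplicate", "swap", "delete"])
    if op == "duplicate":
        return word[:i] + word[i] + word[i:]
    if op == "swap" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    return word  # edit was not applicable; leave the word intact

def make_sample(text, error_rate=0.3, seed=0):
    # Return (noisy_text, spans) where each span marks a corrupted word.
    rng = random.Random(seed)
    out, spans, pos = [], [], 0
    for word in text.split():
        noisy = corrupt_word(word, rng) if rng.random() < error_rate else word
        if noisy != word:
            spans.append((pos, pos + len(noisy), "SPELLING_ERROR"))
        out.append(noisy)
        pos += len(noisy) + 1  # +1 for the following space
    return " ".join(out), spans
```

The `(start, end, label)` tuples map directly onto spaCy's span-based NER training format.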

### Perplexity Calculation

For fast perplexity calculation, we used an existing KenLM model, [Yehor/kenlm-uk](https://huggingface.co/Yehor/kenlm-uk), trained on UberText.
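
Candidate reranking with such a model is straightforward: substitute each candidate into the sentence and keep the one with the lowest perplexity. The sketch below shows that loop with a toy unigram scorer standing in for KenLM (so it runs without a model file); in a real setup you would pass something like `kenlm.Model("uk.arpa.bin").perplexity` as `perplexity_fn`:

```python
import math

# Stand-in unigram "language model" with made-up frequencies; a real
# setup would call a KenLM model's perplexity method instead.
FREQ = {"то": 50, "він": 40, "може": 30, "це": 45,
        "зробити": 20, "для": 60, "мене": 25, "мої": 10}

def toy_perplexity(sentence):
    # Inverse geometric mean of add-one-smoothed unigram probabilities.
    words = sentence.split()
    total = sum(FREQ.values()) + 1
    logp = sum(math.log((FREQ.get(w, 0) + 1) / total) for w in words)
    return math.exp(-logp / len(words))

def rank_candidates(template, candidates, perplexity_fn):
    # Fill each candidate into the sentence and sort by perplexity,
    # most fluent (lowest perplexity) first.
    return sorted(candidates, key=lambda c: perplexity_fn(template.format(c)))

print(rank_candidates("то він {} це зробити", ["мої", "може"], toy_perplexity))
# -> ['може', 'мої']
```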

### Spell Checker

We used [SymSpell](https://github.com/wolfgarbe/SymSpell) for error correction. The dictionary consists of the 500k most frequent words from the UberText corpus.
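
SymSpell is fast because of its symmetric delete trick: every dictionary word is indexed under all of its single-character deletions, so lookup only ever generates deletions of the query, never insertions or substitutions. A minimal illustration of that core idea (not the library's API):

```python
# Minimal illustration of the symmetric delete idea behind SymSpell.
# Not the library's API -- just the indexing trick that makes it fast.
from collections import defaultdict

def deletes(word):
    # The word itself plus every single-character deletion of it.
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(dictionary):
    # Index every dictionary word under all of its deletes.
    index = defaultdict(set)
    for word in dictionary:
        for key in deletes(word):
            index[key].add(word)
    return index

def lookup(query, index):
    # Candidates whose delete neighbourhoods overlap the query's.
    # A shared delete only bounds the true edit distance (here by 2),
    # so SymSpell verifies each candidate with a real distance check.
    hits = set()
    for key in deletes(query):
        hits |= index.get(key, set())
    return hits

index = build_index(["може", "мене", "зробити"])
print(lookup("моее", index))
# {'може', 'мене'}; a final edit-distance check would rank 'може' first
```

For a 500k-word dictionary the index is built once and each lookup touches only `len(query) + 1` keys, which is what keeps correction fast at scale.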
|