slone/bert-tiny-char-ctc-bak-denoise

grammatical-error-correction

cointegrated committed on Jun 30, 2023

Commit c56158d · 1 Parent(s): 1941090

Update README.md

Files changed (1)

README.md +34 -0

@@ -1,3 +1,37 @@
 ---
 license: cc-by-4.0
+language:
+- ba
+tags:
+- grammatical-error-correction
 ---
+This is a tiny BERT model for Bashkir, intended for fixing OCR errors.
+Here is the code to run it. The model uses a custom tokenizer whose code is downloaded at runtime, hence `trust_remote_code=True`:
+```Python
+import torch
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+MODEL_NAME = 'slone/bert-tiny-char-ctc-bak-denoise'
+model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True, revision='194109')
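+def fix_text(text, verbose=False, spaces=2):
+    # `spaces`: number of special characters the custom tokenizer inserts between input characters (`verbose` is unused)
+    with torch.inference_mode():
+        batch = tokenizer(text, return_tensors='pt', spaces=spaces, padding=True, truncation=True, return_token_type_ids=False).to(model.device)
+        logits = torch.log_softmax(model(**batch).logits, dim=-1)
+    # Take the most likely token at each position; decoding removes the special characters
+    return tokenizer.decode(logits[0].argmax(-1), skip_special_tokens=True)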
+print(fix_text("Э Ҡаратау ҙы белмәйем."))
+# Ә Ҡаратауҙы белмәйем.
+```
+The model works by:
+- inserting special characters (their number is set by `spaces`) between the input characters,
+- performing token classification (for most tokens the predicted output equals the input, but some tokens are changed),
+- and removing the special characters from the output.
+It was trained with CTC loss on a parallel corpus of corrupted and corrected sentences.
+On our test dataset, it reduces OCR errors by 41%.
+Training details are described in [this post](https://habr.com/ru/articles/744972/) (in Russian).
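
The three steps listed in the README (pad the input with special characters, classify every position, strip the special characters) can be illustrated without the model. The snippet below is a minimal sketch, not the repository's tokenizer code: the `BLANK` symbol and the helper names `pad_with_blanks` and `ctc_collapse` are hypothetical, and merging consecutive repeated predictions is the standard greedy CTC decoding rule, assumed here because the README says the model was trained with CTC loss.

```python
BLANK = "_"  # hypothetical placeholder; the real tokenizer may use a different symbol


def pad_with_blanks(text, spaces=2):
    """Step 1: insert `spaces` blank symbols after every input character."""
    padded = []
    for ch in text:
        padded.append(ch)
        padded.extend([BLANK] * spaces)
    return padded


def ctc_collapse(symbols):
    """Step 3 (assumed greedy CTC rule): merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in symbols:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


# Step 2 is the model itself: it predicts one label per padded position.
# Here the per-position predictions are written by hand for illustration.
unchanged = pad_with_blanks("ab")          # model copies its input
inserted = ["a", "c", BLANK, "b", BLANK, BLANK]  # a blank slot is filled with "c"
deleted = ["a", BLANK, BLANK, BLANK, BLANK, BLANK]  # "b" is replaced by a blank

print(ctc_collapse(unchanged))
print(ctc_collapse(inserted))
print(ctc_collapse(deleted))
```

Under these assumptions, the padding gives the classifier spare output positions: a correction can insert a character by filling a blank slot or delete one by predicting a blank in its place, even though every position emits exactly one label.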