---
language:
- udm
---

# bert-tiny-char-ctc-udm-denoise

This is a tiny BERT model for Udmurt, intended for fixing OCR errors.

Here is the code to run it. The model uses a custom tokenizer whose code is downloaded at runtime, which is why `trust_remote_code=True` is required:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = 'udmurtNLP/bert-tiny-char-ctc-udm-denoise'
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
# The tokenizer class is defined in the model repository, not in transformers itself.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

def fix_text(text, verbose=False, spaces=2):
    with torch.inference_mode():
        batch = tokenizer(
            text, return_tensors='pt', spaces=spaces, padding=True,
            truncation=True, return_token_type_ids=False,
        ).to(model.device)
        logits = torch.log_softmax(model(**batch).logits, dim=-1)
    # Greedy decoding: take the most likely token at each position.
    decoded = tokenizer.decode(logits[0].argmax(-1), skip_special_tokens=True)
    return tokenizer.clean_up_tokenization(decoded)

fix_text("кыче мои солы оскылй!")
# Кыӵе мон солы оскылӥ!
```

It was trained with CTC loss on a parallel corpus of corrupted and corrected sentence pairs. On our test dataset, it reduces OCR errors by 50%.

Inspired by https://huggingface.co/slone/bert-tiny-char-ctc-bak-denoise
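
Because the model is trained with CTC loss, its per-position argmax output generally has to be collapsed before it reads as text. The sketch below illustrates the standard greedy CTC rule (merge runs of repeated symbols, then drop blanks) in plain Python. It is not the tokenizer's actual code: the `_` blank symbol and the `ctc_greedy_collapse` helper are hypothetical names chosen for illustration, and the real decoding happens inside the custom tokenizer.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; the real tokenizer defines its own


def ctc_greedy_collapse(symbols, blank=BLANK):
    """Apply the greedy CTC rule to a per-position sequence of symbols."""
    # Step 1: merge each run of identical consecutive symbols into one.
    merged = [symbol for symbol, _ in groupby(symbols)]
    # Step 2: remove the blank symbol.
    return "".join(symbol for symbol in merged if symbol != blank)


# Repeated predictions of the same character collapse into one character.
print(ctc_greedy_collapse(list("сс_оо_л_ыы")))  # солы

# A genuinely doubled letter survives only if a blank separates the two copies.
print(ctc_greedy_collapse(list("а_а")))  # аа
print(ctc_greedy_collapse(list("аа")))   # а
```

This rule means the output can be shorter than the input but never longer, so a CTC model that must insert missing characters needs some slack in its input length.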