Upload 6 files


Files changed (6)

README.md +57 -0
generation_config.json +5 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +13 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,57 @@

+---
+base_model: Ransaka/sinhala-ocr-model
+model-index:
+- name: sinhala-ocr-model-v2
+  results: []
+pipeline_tag: image-to-text
+language:
+- si
+---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# TrOCR-Sinhala
+See training metrics tab for performance details.
+## Model description
+This model is a fine-tuned version of Microsoft [TrOCR Printed](https://huggingface.co/microsoft/trocr-base-printed).
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Example
+```python
+from PIL import Image
+import requests
+from io import BytesIO
+from transformers import TrOCRProcessor, VisionEncoderDecoderModel
+image_url = "https://datasets-server.huggingface.co/assets/Ransaka/sinhala_synthetic_ocr/--/bf7c8a455b564cd73fe035031e19a5f39babb73b/--/default/train/0/image/image.jpg"
+response = requests.get(image_url)
+img = Image.open(BytesIO(response.content))
+processor = TrOCRProcessor.from_pretrained('Ransaka/TrOCR-Sinhala')
+model = VisionEncoderDecoderModel.from_pretrained('Ransaka/TrOCR-Sinhala')
+model.to("cuda:0")
+pixel_values = processor(img, return_tensors="pt").pixel_values.to('cuda:0')
+generated_ids = model.generate(pixel_values, num_beams=2, early_stopping=True)
+generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+print(generated_text)  # දිවයිනට බලයට ඇති ආපදා තත්ත්වය හමුවේ සබරගමුව පළාතේ
+```
+### Framework versions
+- Transformers 4.35.2
+- Pytorch 2.0.0
+- Datasets 2.16.0
+- Tokenizers 0.15.0
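
The README example hard-codes `cuda:0`, so it will raise an error on a machine without a CUDA device. A minimal sketch of a CPU fallback follows. It assumes PyTorch may or may not be installed, and `pick_device` is a hypothetical helper name, not part of Transformers or of this repository.

```python
def pick_device(preferred: str = "cuda:0") -> str:
    """Return `preferred` if a CUDA device is usable, otherwise "cpu".

    Hypothetical helper for illustration only.
    """
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    # torch.cuda.is_available() reports whether a CUDA device can be used
    return preferred if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The returned string can be passed to `model.to(...)` and to `.pixel_values.to(...)` in place of the literal `"cuda:0"`.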

generation_config.json ADDED

	@@ -0,0 +1,5 @@

+{
+  "_from_model_config": true,
+  "pad_token_id": 0,
+  "transformers_version": "4.33.3"
+}

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
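
`tokenizer_config.json` repeats the five special tokens declared in `special_tokens_map.json`. The sketch below checks that the two files agree, using the JSON inlined from the diffs on this page rather than read from disk. `mismatched_tokens` is a hypothetical helper name used only for illustration.

```python
import json

# Contents copied from the two diffs on this page.
SPECIAL_TOKENS_MAP = json.loads("""
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
""")

TOKENIZER_CONFIG = json.loads("""
{
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
""")


def mismatched_tokens(token_map: dict, config: dict) -> dict:
    """Map each special-token name to (map value, config value) where they differ."""
    return {
        name: (value, config.get(name))
        for name, value in token_map.items()
        if config.get(name) != value
    }


mismatches = mismatched_tokens(SPECIAL_TOKENS_MAP, TOKENIZER_CONFIG)
print(mismatches)  # {} means the two files agree
```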

vocab.txt ADDED

The diff for this file is too large to render. See raw diff