AshenR committed 7ed6f24 (verified, parent: f61b68d): Update README.md

---
language:
- si
metrics:
- perplexity
---

## Overview

This is a slightly smaller model trained on half of the [Fasttext](https://fasttext.cc/docs/en/crawl-vectors.html) dataset. Since Sinhala is classified as a low-resource language, pre-trained models for it are scarce, leaving a noticeable gap in the language's representation within natural language processing (NLP). Developing new models tailored for Sinhala therefore presents a valuable opportunity: this model can act as a foundational tool for downstream tasks such as sentiment analysis, machine translation, named entity recognition, and question answering.
## Model Specification

The model chosen for training is [RoBERTa](https://arxiv.org/abs/1907.11692) with the following specifications:

1. vocab_size=52000
2. max_position_embeddings=514
3. num_attention_heads=12
4. num_hidden_layers=6
5. type_vocab_size=1

Perplexity: 3.5
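
The specifications above map directly onto a `RobertaConfig`. A minimal sketch of building that configuration (any hyperparameter not listed above, such as hidden size or dropout, is left at the library default):

```python
from transformers import RobertaConfig

# Configuration matching the specifications listed above; every
# parameter not set explicitly falls back to the library default.
config = RobertaConfig(
    vocab_size=52000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
```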

## How to Use

You can use this model directly with a pipeline for masked language modeling:

```py
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead.
model = AutoModelForMaskedLM.from_pretrained("ashen/AshenBERTo")
tokenizer = AutoTokenizer.from_pretrained("ashen/AshenBERTo")

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

fill_mask("මම ගෙදර <mask>.")
```
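
Each call to `fill_mask` returns a list of candidate completions, one dict per candidate with `sequence`, `token_str`, and `score` keys, following the transformers fill-mask pipeline API. A sketch of picking the top candidate; the completions and scores below are illustrative placeholders, not actual predictions from AshenBERTo:

```python
# Illustrative output structure of the fill-mask pipeline; these
# values are placeholders standing in for real model predictions.
results = [
    {"sequence": "මම ගෙදර යනවා.", "token_str": "යනවා", "score": 0.42},
    {"sequence": "මම ගෙදර ගියා.", "token_str": "ගියා", "score": 0.31},
]

# The pipeline returns candidates sorted by score, but selecting
# the maximum explicitly makes the intent clear.
best = max(results, key=lambda r: r["score"])
print(best["sequence"])
```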