---
language:
- as
license: cc-by-4.0
tags:
- assamese
- roberta
- masked-lm
- fill-mask
datasets:
- MWirelabs/assamese-monolingual-corpus
metrics:
- perplexity
model-index:
- name: AssameseRoBERTa
  results:
  - task:
      type: fill-mask
      name: Masked Language Modeling
    metrics:
    - name: Perplexity (Training Domain)
      type: perplexity
      value: 1.7819
    - name: Perplexity (Unseen Text)
      type: perplexity
      value: 2.5332
---

![Model Size](https://img.shields.io/badge/Model%20Size-110M-lightblue)
![Training PPL](https://img.shields.io/badge/Training%20PPL-1.78-brightgreen)
![Unseen PPL](https://img.shields.io/badge/Unseen%20PPL-2.53-yellowgreen)
![License](https://img.shields.io/badge/License-CC--BY--4.0-orange)

# AssameseRoBERTa

## Model Description

AssameseRoBERTa is a RoBERTa-based language model trained from scratch on Assamese monolingual text. The model is designed to provide robust language understanding capabilities for the Assamese language, which is spoken by over 15 million people primarily in the Indian state of Assam.

This model was developed by [MWire Labs](https://mwirelabs.com), an AI research organization focused on building language technologies for Northeast Indian languages.

## Model Details

- **Model Type:** RoBERTa (Robustly Optimized BERT Pretraining Approach)
- **Language:** Assamese (as)
- **Training Data:** 1.6M Assamese sentences from diverse sources
- **Parameters:** ~110M
- **Training Epochs:** 10
- **Training Duration:** ~12 hours on an NVIDIA A40 GPU
- **Vocabulary Size:** 50,265 tokens
- **Max Sequence Length:** 128 tokens

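These details can be cross-checked against the hosted configuration; the sketch below is a quick sanity check, and the printed values come from the repository's `config.json` (the comments assume RoBERTa-base defaults):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("MWirelabs/assamese-roberta")
print(config.model_type)                             # "roberta"
print(config.vocab_size)                             # 50265
print(config.num_hidden_layers, config.hidden_size)  # RoBERTa-base: 12, 768
```
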
## Performance

### Perplexity Scores (Final Evaluation)

| Model | Training Domain PPL | Unseen Text PPL |
|-------|---------------------|-----------------|
| **AssameseRoBERTa (Ours)** | **1.7819** | **2.5332** |
| Assamese-BERT | 48.8211 | 12.5911 |
| MuRIL | 85.7272 | 8.7032 |
| mBERT | 26.7085 | 18.1564 |
| IndicBERT | 3194.1843 | 595.4611 |
| AxomiyaBERTa | 83615627.1696 | 30861455.2924 |

📄 **Unseen evaluation set (10 Assamese sentences):**  
https://huggingface.co/MWirelabs/assamese-roberta/blob/main/assamese_unseen_eval_10.txt

On this benchmark, AssameseRoBERTa achieves markedly lower perplexity than existing multilingual and Assamese-specific models on both in-domain and unseen text.
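
For reference, one common way to estimate masked-LM (pseudo-)perplexity for a sentence is to mask one position at a time and exponentiate the mean negative log-likelihood of the true tokens. The sketch below illustrates that recipe; the exact protocol behind the table above is not specified on this card, so its numbers need not match.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/assamese-roberta")
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    # Mask each position in turn and score the true token at that position.
    enc = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    input_ids = enc.input_ids[0]
    nlls = []
    for i in range(1, input_ids.size(0) - 1):  # skip <s> and </s>
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = logits[0, i].log_softmax(-1)
        nlls.append(-log_probs[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("অসমীয়া ভাষা অতি সুন্দৰ।"))
```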

## Intended Use

### Direct Use
- Masked language modeling
- Feature extraction
- Fine-tuning for downstream Assamese NLP tasks (see the sketch after this list) such as:
  - Text classification
  - Named entity recognition (NER) and other token classification
  - Sentiment analysis
  - Question answering

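As an illustration of the fine-tuning route, the sketch below adapts the checkpoint for sequence classification. The toy dataset, label count, and hyperparameters are placeholders, not a recommended recipe.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")

# The MLM head is dropped and a fresh classification head is initialized,
# so a warning about newly initialized weights is expected here.
model = AutoModelForSequenceClassification.from_pretrained(
    "MWirelabs/assamese-roberta", num_labels=2)  # num_labels is task-specific

# Toy stand-in data; replace with a real labeled Assamese dataset.
train_ds = Dataset.from_dict(
    {"text": ["অসমীয়া ভাষা অতি সুন্দৰ।", "মই ভাল নাপালোঁ।"], "label": [1, 0]}
).map(lambda b: tokenizer(b["text"], truncation=True, max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="assamese-clf", num_train_epochs=3),
    train_dataset=train_ds,
    data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch
)
trainer.train()
```
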
### Out-of-Scope Use
- Generating factual information without verification  
- High-risk decision making  
- Real-time critical systems  

## Training Data

The model was trained on the [MWirelabs/assamese-monolingual-corpus](https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus) dataset (~1.6M sentences), sourced from:

- News  
- Web crawl  
- Literature  
- Government text  
- Social media  

## Training Procedure

### Preprocessing
- Assamese script normalization  
- Byte-Level BPE tokenization  
- Custom Assamese vocabulary  

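The exact normalization steps are not documented here; as a hedged illustration, Unicode NFC normalization is a typical first pass for Bengali-Assamese script text:

```python
import unicodedata

# Illustrative only: the card does not specify which normalization was applied.
# NFC composes combining marks into canonical code points, so visually
# identical strings compare equal after normalization.
raw = "অসমীয়া ভাষা"
normalized = unicodedata.normalize("NFC", raw)
print(normalized == raw, len(raw), len(normalized))
```
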
### Tokenizer
- **Type:** Byte-Level BPE  
- **Vocab Size:** 50,265  
- **Special Tokens:** `<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`  

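The tokenizer's behavior can be checked directly; the subword splits below depend on the learned merges, so treat the output as illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")

# Verify the vocabulary size and special tokens listed above.
print(tokenizer.vocab_size)           # 50265
print(tokenizer.all_special_tokens)   # includes <s>, </s>, <pad>, <unk>, <mask>

# Byte-level BPE segments Assamese text into subword pieces.
print(tokenizer.tokenize("অসমীয়া ভাষা অতি সুন্দৰ।"))
```
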
### Training Hyperparameters
- **Architecture:** RoBERTa-base  
- **Optimizer:** AdamW  
- **Scheduler:** Warmup + Linear decay  
- **Precision:** BF16  
- **Device:** NVIDIA A40 (48GB)  
- **Epochs:** 10  

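A hedged reconstruction of this setup with `transformers.TrainingArguments` might look like the sketch below; batch size, learning rate, and warmup length are assumptions, since only the items above are reported:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="assamese-roberta",
    num_train_epochs=10,             # reported above
    bf16=True,                       # reported precision
    optim="adamw_torch",             # AdamW, as reported
    lr_scheduler_type="linear",      # warmup + linear decay, as reported
    warmup_ratio=0.06,               # assumption, not reported
    learning_rate=6e-4,              # assumption, not reported
    per_device_train_batch_size=64,  # assumption, not reported
)
```
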
## Usage

### Masked LM Example

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/assamese-roberta")

# Note: this RoBERTa tokenizer uses <mask>, not BERT-style [MASK].
text = f"অসম হৈছে {tokenizer.mask_token} এখন সুন্দৰ ৰাজ্য।"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Locate the mask position and take the highest-scoring vocabulary entry.
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, masked_index].argmax(-1)
predicted_token = tokenizer.decode(predicted_token_id)

print("Predicted:", predicted_token)
```
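
Alternatively, the `fill-mask` pipeline handles mask placement and returns the top candidates directly:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="MWirelabs/assamese-roberta")
for candidate in fill(f"অসম হৈছে {fill.tokenizer.mask_token} এখন সুন্দৰ ৰাজ্য।")[:3]:
    print(candidate["token_str"], round(candidate["score"], 4))
```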

### Feature Extraction

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModel.from_pretrained("MWirelabs/assamese-roberta")

text = "অসমীয়া ভাষা অতি সুন্দৰ।"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

print(f"Embeddings shape: {embeddings.shape}")
```
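
To collapse the token embeddings into a single sentence vector, a common baseline is attention-mask-aware mean pooling; note the model was not trained with a sentence-embedding objective, so treat such vectors as an untuned starting point:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModel.from_pretrained("MWirelabs/assamese-roberta")

inputs = tokenizer("অসমীয়া ভাষা অতি সুন্দৰ।", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # [1, seq_len, 768]

# Mean pooling that ignores padding positions via the attention mask.
mask = inputs.attention_mask.unsqueeze(-1)          # [1, seq_len, 1]
sentence_vec = (hidden * mask).sum(1) / mask.sum(1)
print(sentence_vec.shape)                           # torch.Size([1, 768])
```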

## Limitations

- The model is trained exclusively on Assamese text and does not perform well on other languages
- Performance may vary on specialized domains not well-represented in the training data
- The model inherits biases present in the training data
- Code-mixed text (Assamese-English) may not be handled optimally

## Ethical Considerations

- This model may reflect biases present in the training corpus
- Users should evaluate the model's outputs in their specific context before deployment
- The model should not be used for generating harmful or misleading content
- Consider fairness implications when deploying in real-world applications

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{assamese-roberta-2025,
  author = {MWire Labs},
  title = {AssameseRoBERTa: A RoBERTa Model for Assamese Language},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/assamese-roberta}}
}
```

## Contact

For questions or feedback, please contact:
- Website: https://mwirelabs.com
- Email: connect@mwirelabs.com

## License

This model is released under the **Creative Commons Attribution 4.0 International License (CC-BY-4.0)**. 

You are free to:
- **Share** — copy and redistribute the material in any medium or format
- **Adapt** — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:
- **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

See the full license at: https://creativecommons.org/licenses/by/4.0/