---
language:
- as
license: cc-by-4.0
tags:
- assamese
- roberta
- masked-lm
- fill-mask
datasets:
- MWirelabs/assamese-monolingual-corpus
metrics:
- perplexity
model-index:
- name: AssameseRoBERTa
  results:
  - task:
      type: fill-mask
      name: Masked Language Modeling
    metrics:
    - name: Perplexity (Training Domain)
      type: perplexity
      value: 1.5738
    - name: Perplexity (Unseen Text)
      type: perplexity
      value: 5.9281
---

# AssameseRoBERTa

## Model Description

AssameseRoBERTa is a RoBERTa-based language model trained from scratch on Assamese monolingual text. The model is designed to provide robust language understanding capabilities for the Assamese language, which is spoken by over 15 million people, primarily in the Indian state of Assam.

This model was developed by [MWire Labs](https://mwirelabs.com), an AI research organization focused on building language technologies for Northeast Indian languages.

## Model Details

- **Model Type:** RoBERTa (Robustly Optimized BERT Pretraining Approach)
- **Language:** Assamese (as)
- **Training Data:** 1.6M Assamese sentences from diverse sources
- **Parameters:** ~125M
- **Training Epochs:** 10
- **Training Duration:** 8 hours on an NVIDIA A40 GPU
- **Vocabulary Size:** ~50,000 tokens
- **Max Sequence Length:** 512 tokens

## Performance

### Perplexity Scores

The model achieves strong performance on both in-domain and out-of-domain evaluation (lower perplexity is better):

| Model | Training Domain PPL | Unseen Text PPL |
|-------|---------------------|-----------------|
| **AssameseRoBERTa (Ours)** | **1.5738** | **5.9281** |
| mBERT | 29.8206 | 9.9891 |
| MuRIL | 27.3264 | 14.2509 |
| Assamese-BERT | 12.1166 | 22.6595 |
| IndicBERT | - | 283.7512 |

The model significantly outperforms existing multilingual models on Assamese text, demonstrating the value of language-specific pretraining.

## Intended Use

### Direct Use

This model is intended for:

- Masked language modeling tasks
- Feature extraction for downstream Assamese NLP tasks
- Fine-tuning on Assamese-specific tasks such as:
  - Text classification
  - Named Entity Recognition (NER)
  - Sentiment analysis
  - Question answering
  - Token classification

### Out-of-Scope Use

This model should not be used for:

- Generating factual information without verification
- Making decisions that affect individuals' rights or well-being without human oversight
- Any application requiring real-time critical decision-making

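As an illustration of the feature-extraction path, the sketch below mean-pools token embeddings into a sentence vector and feeds it to a small classifier head. The tensors are random stand-ins for real model outputs, and the untrained head and 3-class task are hypothetical, not part of the released model:

```python
import torch
import torch.nn as nn

# Stand-in for `outputs.last_hidden_state` from AssameseRoBERTa:
# a batch of 2 sentences, 16 tokens each, hidden size 768 (RoBERTa-base).
torch.manual_seed(0)
last_hidden_state = torch.randn(2, 16, 768)

# Mean-pool token embeddings into one vector per sentence.
sentence_embeddings = last_hidden_state.mean(dim=1)  # shape (2, 768)

# Untrained linear head for a hypothetical 3-class task
# (e.g. sentiment: positive / neutral / negative).
classifier = nn.Linear(768, 3)
logits = classifier(sentence_embeddings)

print(logits.shape)  # torch.Size([2, 3])
```

In a real fine-tuning setup the head would be trained jointly with (or on top of) the pretrained encoder on labeled Assamese data.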
## Training Data

The model was trained on the [MWirelabs/assamese-monolingual-corpus](https://huggingface.co/datasets/MWirelabs/assamese-monolingual-corpus) dataset, which contains approximately 1.6 million Assamese sentences from diverse sources, including:

- News articles
- Literature
- Web crawl data
- Government documents
- Social media content

The diverse nature of the training data helps the model generalize across different domains and text styles.

## Training Procedure

### Preprocessing

- Text normalization for the Assamese script (a variant of the Bengali-Assamese script)
- Tokenization using SentencePiece
- Vocabulary built specifically for Assamese

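The released model's SentencePiece vocabulary is not reproduced here; as a rough illustration of subword-vocabulary training, the sketch below trains a tiny byte-level BPE tokenizer with the `tokenizers` library (a stand-in for the actual SentencePiece setup) on a toy corpus:

```python
from tokenizers import ByteLevelBPETokenizer

# Toy corpus; the real model used ~1.6M sentences.
corpus = [
    "অসমীয়া ভাষা অতি সুন্দৰ।",
    "অসম ভাৰতৰ এখন ৰাজ্য।",
    "মই অসমীয়া ভাষা ভাল পাওঁ।",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=400,  # the released model uses ~50,000
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

encoding = tokenizer.encode("অসমীয়া ভাষা")
print(encoding.tokens)
```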
### Training Hyperparameters

- **Architecture:** RoBERTa-base
- **Optimizer:** AdamW
- **Learning Rate:** Peak LR with warmup and linear decay
- **Batch Size:** Optimized for A40 GPU
- **Training Epochs:** 10
- **Hardware:** NVIDIA A40 (48GB)
- **Precision:** Mixed precision (BF16)
- **Training Time:** ~8 hours

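The warmup-plus-linear-decay schedule above can be sketched as a plain function of the training step; the peak LR and step counts below are illustrative values, not the model's actual settings:

```python
def lr_at_step(step, peak_lr=6e-4, warmup_steps=1000, total_steps=10000):
    """Linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Linear decay over the remaining steps.
    remaining = total_steps - step
    return max(0.0, peak_lr * remaining / (total_steps - warmup_steps))

# Warmup phase, peak, and end of decay:
print(lr_at_step(500))    # 0.0003
print(lr_at_step(1000))   # 0.0006
print(lr_at_step(10000))  # 0.0
```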
## Evaluation

The model was evaluated on both in-domain and out-of-domain Assamese text using perplexity as the primary metric. The significantly lower perplexity compared to multilingual baselines demonstrates strong language modeling capabilities.

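Perplexity here is the exponential of the mean cross-entropy loss over masked positions. The sketch below computes it on dummy logits and labels (random stand-ins, since scoring the real model requires downloading it):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 50000  # matches the model's ~50k vocabulary

# Dummy logits for 8 masked positions, plus "true" token ids.
logits = torch.randn(8, vocab_size)
labels = torch.randint(0, vocab_size, (8,))

# Mean cross-entropy over the masked positions, then exponentiate.
loss = F.cross_entropy(logits, labels)
perplexity = torch.exp(loss).item()

print(f"Perplexity: {perplexity:.2f}")
```

With random logits the result sits near the vocabulary size; a trained model concentrates probability on the correct tokens, driving perplexity toward 1.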
## Usage

### Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModelForMaskedLM.from_pretrained("MWirelabs/assamese-roberta")

# Example: fill mask (use the tokenizer's own mask token, "<mask>" for RoBERTa)
text = f"অসম হৈছে {tokenizer.mask_token} এখন সুন্দৰ ৰাজ্য।"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get the prediction at the masked position
masked_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted token: {predicted_token}")
```

Note that the mask placeholder must be `tokenizer.mask_token` (`<mask>` for RoBERTa-style models), not the BERT-style `[MASK]`; otherwise no masked position is found.

### Feature Extraction

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/assamese-roberta")
model = AutoModel.from_pretrained("MWirelabs/assamese-roberta")

text = "অসমীয়া ভাষা অতি সুন্দৰ।"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden)

print(f"Embeddings shape: {embeddings.shape}")
```

## Limitations

- The model is trained exclusively on Assamese text and does not perform well on other languages
- Performance may vary on specialized domains not well-represented in the training data
- The model inherits biases present in the training data
- Code-mixed text (Assamese-English) may not be handled optimally

## Ethical Considerations

- This model may reflect biases present in the training corpus
- Users should evaluate the model's outputs in their specific context before deployment
- The model should not be used for generating harmful or misleading content
- Consider fairness implications when deploying in real-world applications

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{assamese-roberta-2024,
  author       = {MWire Labs},
  title        = {AssameseRoBERTa: A RoBERTa Model for Assamese Language},
  year         = {2024},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MWirelabs/assamese-roberta}}
}
```

## Contact

For questions or feedback, please contact:

- Website: https://mwirelabs.com
- Email: connect@mwirelabs.com

## License

This model is released under the **Creative Commons Attribution 4.0 International License (CC-BY-4.0)**.

You are free to:

- **Share** — copy and redistribute the material in any medium or format
- **Adapt** — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:

- **Attribution** — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

See the full license at: https://creativecommons.org/licenses/by/4.0/