---
library_name: transformers
base_model: l3cube-pune/hindi-roberta
tags:
- educational
- hindi
metrics:
- precision
- recall
- accuracy
model-index:
- name: hindi-hindiroberta-edu-classifier
  results: []
license: cc
datasets:
- Polygl0t/hindi-edu-qwen-annotations
language:
- hi
pipeline_tag: text-classification
---
# Hindi Edu Classifier
hindi-roberta-edu-classifier is a [HindRoBERTa](https://huggingface.co/l3cube-pune/hindi-roberta)-based model for judging the educational value of a given Hindi text. It was trained on the [Polygl0t/hindi-edu-qwen-annotations](https://huggingface.co/datasets/Polygl0t/hindi-edu-qwen-annotations) dataset.
## Details
- **Dataset:** [hindi-edu-qwen-annotations](https://huggingface.co/datasets/Polygl0t/hindi-edu-qwen-annotations)
- **Language:** Hindi
- **Number of Training Epochs:** 20
- **Batch size:** 256
- **Optimizer:** `torch.optim.AdamW`
- **Learning Rate:** 3e-4
- **Eval Metric:** `f1-score`
The [source code](https://github.com/Polygl0t/llm-foundry) used to train this model is available on GitHub.
### Evaluation Results
#### Confusion Matrix
| **True ↓ / Predicted →** | **1** | **2** | **3** | **4** | **5** |
|-------|-------|-------|-------|-------|-------|
| **1** | 8607 | 1661 | 72 | 1 | 0 |
| **2** | 1834 | 4349 | 580 | 18 | 0 |
| **3** | 120 | 885 | 1207 | 102 | 0 |
| **4** | 7 | 52 | 300 | 202 | 0 |
| **5** | 0 | 0 | 1 | 2 | 0 |
- Precision (macro): 0.52416
- Recall (macro): 0.47107
- F1 Macro: 0.49048
- Accuracy: 0.71825
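
The reported metrics can be recomputed from the confusion matrix. The sketch below is not taken from the training code: it assumes that rows are true labels and columns are predicted labels, and that precision, recall, and F1 are macro-averaged over the five classes, with a class that has no predictions contributing 0. The names `CONFUSION`, `safe_div`, and `macro_metrics` are illustrative only.

```python
# Confusion matrix copied from the table above.
# Assumption: rows are true labels (1-5), columns are predicted labels (1-5).
CONFUSION = [
    [8607, 1661, 72, 1, 0],
    [1834, 4349, 580, 18, 0],
    [120, 885, 1207, 102, 0],
    [7, 52, 300, 202, 0],
    [0, 0, 1, 2, 0],
]


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def macro_metrics(cm):
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        true_positives = cm[k][k]
        predicted_as_k = sum(cm[i][k] for i in range(n))  # column sum
        labelled_as_k = sum(cm[k])                        # row sum
        p = safe_div(true_positives, predicted_as_k)
        r = safe_div(true_positives, labelled_as_k)
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
        "accuracy": safe_div(correct, total),
    }


metrics = macro_metrics(CONFUSION)
print({name: round(value, 5) for name, value in metrics.items()})
```

Under these assumptions the values agree with the figures listed above. Class 5 has only three examples and is never predicted, which pulls the macro averages well below the overall accuracy.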
## Usage
Here's an example of how to use the Edu Classifier:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("Polygl0t/hindi-roberta-edu-classifier")
model = AutoModelForSequenceClassification.from_pretrained("Polygl0t/hindi-roberta-edu-classifier")
model.to(device)

text = "यह एक उदाहरण है।"
encoded_input = tokenizer(text, return_tensors="pt", padding="longest", truncation=True).to(device)

# Run inference without tracking gradients.
with torch.no_grad():
    model_output = model(**encoded_input)

# Drop the trailing dimension of the logits and move them to the CPU as a NumPy array.
logits = model_output.logits.squeeze(-1).float().cpu().numpy()

# Scores are produced in the range [0, 4]. To convert to the range [1, 5], add 1 to the score.
score = [x + 1 for x in logits.tolist()][0]

print({
    "text": text,
    "score": score,
    # Clamp to [0, 4], round to the nearest integer, then shift to [1, 5].
    "int_score": [int(round(max(0, min(s, 4)))) + 1 for s in logits.tolist()][0],
})
```
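
The post-processing step in the example above (clamp, round, shift) can be isolated into a small helper, which is convenient when scoring many texts. This is a minimal sketch and not part of the released code: `to_int_score` and `keep_document` are hypothetical names, and the filtering threshold of 3 is an arbitrary illustration, not a recommendation from the authors.

```python
def to_int_score(raw_score: float) -> int:
    """Clamp a raw model score to [0, 4], round it, and shift it to the 1-5 scale."""
    clamped = max(0.0, min(float(raw_score), 4.0))
    return int(round(clamped)) + 1


def keep_document(raw_score: float, min_int_score: int = 3) -> bool:
    """Hypothetical filter: keep a text whose integer score reaches the threshold."""
    return to_int_score(raw_score) >= min_int_score


# Illustrative raw scores, not real model outputs.
for raw in (-0.3, 1.6, 7.2):
    print(raw, to_int_score(raw), keep_document(raw))
```

Raw scores outside [0, 4] are clamped before rounding, so the integer score always falls between 1 and 5.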
## Cite as 🤗
```latex
@misc{shiza2026lilmoo,
  title={{Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi}},
  author={Shiza Fatimah and Aniket Sen and Sophia Falk and Florian Mai and Lucie Flek and Nicholas Kluge Corr{\^e}a},
  year={2026},
  eprint={2603.03508},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.03508},
}
```
## Acknowledgments
Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.
We also gratefully acknowledge access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin), hosted by the [University of Bonn](https://www.uni-bonn.de/en), and the support provided by its High Performance Computing & Analytics Lab.
## License
According to [l3cube-pune/hindi-roberta](https://huggingface.co/l3cube-pune/hindi-roberta), the model is released under [CC-BY-4.0](https://spdx.org/licenses/CC-BY-4.0). For any queries, please contact the authors of the original [hindi-roberta](https://huggingface.co/l3cube-pune) paper.