---
library_name: transformers
base_model: l3cube-pune/hindi-roberta
tags:
- educational
- hindi
metrics:
- precision
- recall
- accuracy
model-index:
- name: hindi-hindiroberta-edu-classifier
  results: []
license: cc
datasets:
- Polygl0t/hindi-edu-qwen-annotations
language:
- hi
pipeline_tag: text-classification
---

# Hindi Edu Classifier

hindi-roberta-edu-classifier is a [HindRoBERTa](https://huggingface.co/l3cube-pune/hindi-roberta) based model that can be used for judging the educational value of a given Hindi text string. This model was trained on the [Polygl0t/hindi-edu-qwen-annotations](https://huggingface.co/datasets/Polygl0t/hindi-edu-qwen-annotations) dataset.

## Details

- **Dataset:** [hindi-edu-qwen-annotations](https://huggingface.co/datasets/Polygl0t/hindi-edu-qwen-annotations)
- **Language:** Hindi
- **Number of Training Epochs:** 20
- **Batch size:** 256
- **Optimizer:** `torch.optim.AdamW`
- **Learning Rate:** 3e-4
- **Eval Metric:** `f1-score`

This repository has the [source code](https://github.com/Polygl0t/llm-foundry) used to train this model.

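The aggregate scores listed under Evaluation Results can be recomputed from the confusion matrix given there. The following is a minimal sketch, not the authors' evaluation code: it assumes rows are true labels and columns are predicted labels, that precision, recall, and F1 are macro-averaged over the five classes, and that a class with no predictions or no samples contributes 0. The helper name `macro_metrics` is hypothetical. Under these assumptions the output agrees with the reported figures to five decimal places.

```python
# Confusion matrix copied from the Evaluation Results section.
# Assumed layout: rows = true labels 1..5, columns = predicted labels 1..5.
CONFUSION = [
    [8607, 1661, 72, 1, 0],
    [1834, 4349, 580, 18, 0],
    [120, 885, 1207, 102, 0],
    [7, 52, 300, 202, 0],
    [0, 0, 1, 2, 0],
]


def macro_metrics(cm):
    """Return accuracy and macro-averaged precision, recall, and F1 (hypothetical helper)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum
        actual = sum(cm[i])                          # row sum
        # Undefined ratios (empty column or row) are treated as 0.
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
        "accuracy": accuracy,
    }


metrics = macro_metrics(CONFUSION)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```

Class 5 has only three samples in the matrix and is never predicted, so its per-class scores are 0 under these assumptions, which pulls the macro averages well below the accuracy.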
### Evaluation Results

#### Confusion Matrix

|       | **1** | **2** | **3** | **4** | **5** |
|-------|-------|-------|-------|-------|-------|
| **1** | 8607  | 1661  | 72    | 1     | 0     |
| **2** | 1834  | 4349  | 580   | 18    | 0     |
| **3** | 120   | 885   | 1207  | 102   | 0     |
| **4** | 7     | 52    | 300   | 202   | 0     |
| **5** | 0     | 0     | 1     | 2     | 0     |

Rows correspond to true labels and columns to predicted labels; this is the orientation under which the precision and recall below are reproduced as macro averages.

- Precision: 0.52416
- Recall: 0.47107
- F1 Macro: 0.49048
- Accuracy: 0.71825

## Usage

Here's an example of how to use the Edu Classifier:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("Polygl0t/hindi-roberta-edu-classifier")
model = AutoModelForSequenceClassification.from_pretrained("Polygl0t/hindi-roberta-edu-classifier")
model.to(device)

text = "यह एक उदाहरण है।"
encoded_input = tokenizer(text, return_tensors="pt", padding="longest", truncation=True).to(device)

# Run inference without tracking gradients.
with torch.no_grad():
    model_output = model(**encoded_input)
    logits = model_output.logits.squeeze(-1).float().cpu().numpy()

# Scores are produced in the range [0, 4]. To convert to the range [1, 5], we can simply add 1 to the score.
score = [x + 1 for x in logits.tolist()][0]

print({
    "text": text,
    "score": score,
    # Clamp the raw score to [0, 4], round to the nearest integer, then shift to [1, 5].
    "int_score": [int(round(max(0, min(x, 4)))) + 1 for x in logits][0],
})
```

## Cite as 🤗

```latex
@misc{shiza2026lilmoo,
      title={{Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi}},
      author={Shiza Fatimah and Aniket Sen and Sophia Falk and Florian Mai and Lucie Flek and Nicholas Kluge Corr{\^e}a},
      year={2026},
      eprint={2603.03508},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.03508},
}
```

## Acknowledgments

Polyglot is a project funded by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MWK) as part of TRA Sustainable Futures (University of Bonn) and the Excellence Strategy of the federal and state governments.

We also gratefully acknowledge the granted access to the [Marvin cluster](https://www.hpc.uni-bonn.de/en/systems/marvin) hosted by [University of Bonn](https://www.uni-bonn.de/en) along with the support provided by its High Performance Computing & Analytics Lab.

## License

According to [l3cube-pune/hindi-roberta](https://huggingface.co/l3cube-pune/hindi-roberta), the model is released under [cc-by-4.0](https://spdx.org/licenses/CC-BY-4.0). For any queries, please get in touch with the authors of the original paper tied to [hindi-roberta](https://huggingface.co/l3cube-pune).