---
base_model:
- csebuetnlp/banglat5_nmt_en_bn
datasets:
- reyazul/BanglaSTEM
language:
- bn
- en
license: apache-2.0
pipeline_tag: translation
library_name: transformers
---

This repository hosts BanglaSTEM-T5, the translation model presented in the paper [BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation](https://huggingface.co/papers/2511.03498).

It is a T5-based translation model fine-tuned on the BanglaSTEM dataset, which consists of 5,000 carefully selected Bangla-English sentence pairs from STEM fields. The model aims to improve translation accuracy for technical content, enabling Bangla speakers to use English-centric language models effectively for technical problem solving.

# BanglaSTEM-T5: Technical Domain Translation Model

<div align="center">

[![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2511.03498)
[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-yellow)](https://huggingface.co/datasets/reyazul/BanglaSTEM)
[![Model](https://img.shields.io/badge/Model-HuggingFace-blue)](https://huggingface.co/reyazul/BanglaSTEM-T5)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](https://www.apache.org/licenses/LICENSE-2.0)

</div>

## 🎯 Overview

**BanglaSTEM-T5** is a specialized translation model designed to accurately translate technical content between Bangla and English. Unlike general-purpose translation systems that struggle with technical terminology, this model preserves the precise meaning of STEM concepts, making it ideal for:

- **Programming & Software Development** - Translate code-related questions and documentation
- **Mathematics** - Handle mathematical concepts and problem statements
- **Science** - Accurately translate physics, chemistry, and biology content
- **AI & Machine Learning** - Work with technical AI/ML terminology


## 📊 Performance Benchmarks

Our model significantly outperforms existing translation systems on technical content:

### Code Generation Task (400 Programming Problems)
| Translation Method | Accuracy |
|-------------------|----------|
| Direct Bangla (no translation) | 35.3% |
| BanglaT5-Base | 59.8% |
| Google Translate | 76.5% |
| **BanglaSTEM-T5 (Ours)** | **82.5%** ✨ |

### Mathematical Problem Solving (100 Olympiad Problems)
| Translation Method | Success Rate |
|-------------------|--------------|
| Direct Bangla (no translation) | 31.0% |
| BanglaT5-Base | 59.0% |
| Google Translate | 72.0% |
| **BanglaSTEM-T5 (Ours)** | **79.0%** ✨ |

**Key Improvement**: Our model outperforms the BanglaT5 base model by **22.7 percentage points** on code generation (82.5% vs. 59.8%) and by **20 percentage points** on math problem solving (79.0% vs. 59.0%).
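The headline deltas follow directly from the two tables above; a quick sanity check in plain Python (no model required, figures copied from the tables):

```python
# Accuracy figures (percent) from the benchmark tables above
code_gen = {"BanglaT5-Base": 59.8, "Google Translate": 76.5, "BanglaSTEM-T5": 82.5}
math_olympiad = {"BanglaT5-Base": 59.0, "Google Translate": 72.0, "BanglaSTEM-T5": 79.0}

def delta_pp(scores, ours="BanglaSTEM-T5", baseline="BanglaT5-Base"):
    """Improvement of `ours` over `baseline`, in percentage points."""
    return round(scores[ours] - scores[baseline], 1)

print(delta_pp(code_gen))       # 22.7 points over the base model on code generation
print(delta_pp(math_olympiad))  # 20.0 points on olympiad math
```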

## 🚀 Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("reyazul/BanglaSTEM-T5")
model = AutoModelForSeq2SeqLM.from_pretrained("reyazul/BanglaSTEM-T5")

# Translate Bangla to English
bangla_text = "একটি পাইথন ফাংশন লিখুন যা একটি তালিকার সর্বোচ্চ মান খুঁজে বের করে।"
inputs = tokenizer(bangla_text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
english_translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(english_translation)
# Output: "Write a Python function that finds the maximum value in a list."
```


### Custom Generation Parameters

```python
# For higher-quality translations, widen the beam search.
# Note: temperature only applies when do_sample=True; with deterministic
# beam search (do_sample=False) it would be ignored, so it is omitted here.
outputs = model.generate(
    **inputs,
    max_length=256,
    num_beams=5,
    early_stopping=True,
    do_sample=False
)
```

## 📚 Model Details

- **Base Model**: [csebuetnlp/banglat5_nmt_en_bn](https://huggingface.co/csebuetnlp/banglat5_nmt_en_bn)
- **Parameters**: 247M
- **Training Data**: 5,000 high-quality technical sentence pairs
- **Domains Covered**: 
  - Programming (52%)
  - Mathematics (25.5%)
  - Information Technology (23.7%)
  - Physics (9.8%)
  - Chemistry (7.3%)
  - Biology & Bioinformatics (5.6%)
- **Quality Score**: Mean translation accuracy of 4.41/5.0
- **Training Details**:
  - Learning rate: 5e-4
  - Batch size: 64 (effective)
  - Epochs: 8
  - Precision: BF16 mixed precision
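The reported hyperparameters could be expressed with the 🤗 Transformers trainer roughly as follows. This is a sketch reconstructed from the numbers above, not the authors' actual training script; the `output_dir` and the 16 × 4 split of the effective batch size are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the reported training configuration
training_args = Seq2SeqTrainingArguments(
    output_dir="banglastem-t5",      # assumed; not specified in the card
    learning_rate=5e-4,
    per_device_train_batch_size=16,  # 16 x 4 accumulation = 64 effective
    gradient_accumulation_steps=4,
    num_train_epochs=8,
    bf16=True,                       # BF16 mixed precision
    predict_with_generate=True,
)
```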

## 🎓 Citation

If you use BanglaSTEM-T5 in your research or applications, please cite our paper:

```bibtex
@article{hasan2025banglastem,
  title={BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation},
  author={Hasan, Kazi Reyazul and Musarrat, Mubasshira and Islam, ABM and Adnan, Muhammad Abdullah},
  journal={arXiv preprint arXiv:2511.03498},
  year={2025}
}
```

## 📖 Resources

- **Paper**: [BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation](https://arxiv.org/abs/2511.03498)
- **Dataset**: [reyazul/BanglaSTEM](https://huggingface.co/datasets/reyazul/BanglaSTEM)
- **Model Card**: [reyazul/BanglaSTEM-T5](https://huggingface.co/reyazul/BanglaSTEM-T5)


## ⚠️ Limitations

- The fine-tuning dataset is relatively small (5,000 sentence pairs); we plan to expand it
- The model works best with technical content in STEM domains
- Performance on non-technical, general conversation may be similar to base models
- Programming domain is most heavily represented in training data
- For optimal results, input text should be grammatically correct

## 📜 License

This model is released under the Apache 2.0 License. See the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) for details.


## 🙏 Acknowledgments

This work was supported by the Department of Computer Science and Engineering at Bangladesh University of Engineering and Technology (BUET). We thank all annotators who contributed to the human curation process.

---

<div align="center">

**Made with ❤️ for the Bangla NLP community**

[Report Issues](https://huggingface.co/reyazul/BanglaSTEM-T5/discussions) • [Request Features](https://huggingface.co/reyazul/BanglaSTEM-T5/discussions)

</div>