md-nishat-008 committed
Commit 05d39f8 · verified · 1 Parent(s): 3144df5

Create README.md

Files changed (1)
  1. README.md +241 -0
README.md ADDED
@@ -0,0 +1,241 @@
+ ---
+ license: cc-by-4.0
+ language:
+ - bn
+ - en
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - code
+ - bangla
+ - bengali
+ - code-generation
+ - nlp
+ - low-resource
+ datasets:
+ - md-nishat-008/Bangla-Code-Instruct
+ base_model:
+ - md-nishat-008/TigerLLM-9B-it
+ ---
+
+ <div align="center">
+
+ <img src="https://img.shields.io/badge/🐯_TigerCoder-9B-orange?style=for-the-badge" alt="TigerCoder-9B"/>
+
+ <h1 style="color: #2e8b57;">🐯 TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla</h1>
+
+ <h3>Accepted at LREC 2026</h3>
+
+ <h4>Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri</h4>
+ <h5>George Mason University, Fairfax, VA, USA</h5>
+
+ <br/>
+
+ <table>
+ <tr>
+ <td>
+ <a href="https://arxiv.org/abs/2509.09101">
+ <img src="https://img.shields.io/badge/arXiv-2509.09101-b31b1b?style=for-the-badge&logo=arxiv" alt="arXiv"/>
+ </a>
+ </td>
+ <td>
+ <a href="https://arxiv.org/pdf/2509.09101">
+ <img src="https://img.shields.io/badge/Paper-Read_PDF-blue?style=for-the-badge&logo=adobeacrobatreader" alt="Read PDF"/>
+ </a>
+ </td>
+ <td>
+ <a href="mailto:mraihan2@gmu.edu">
+ <img src="https://img.shields.io/badge/Email-Contact_Us-green?style=for-the-badge&logo=gmail" alt="Contact Us"/>
+ </a>
+ </td>
+ </tr>
+ </table>
+
+ <table>
+ <tr>
+ <td>
+ <a href="https://huggingface.co/md-nishat-008/TigerCoder-1B">
+ <img src="https://img.shields.io/badge/🤗_HuggingFace-TigerCoder--1B-yellow?style=for-the-badge" alt="TigerCoder-1B"/>
+ </a>
+ </td>
+ <td>
+ <a href="https://huggingface.co/md-nishat-008/TigerCoder-9B">
+ <img src="https://img.shields.io/badge/🤗_HuggingFace-TigerCoder--9B-yellow?style=for-the-badge" alt="TigerCoder-9B"/>
+ </a>
+ </td>
+ </tr>
+ </table>
+
+ <br/>
+
+ **The first dedicated family of Code LLMs for Bangla, with Pass@K gains of 11 to 18 percentage points over the strongest prior baselines.**
+
+ </div>
+
+ ---
+
+ > **⚠️ Note:** Model weights will be released after the LREC 2026 conference. Stay tuned!
+
+ ## Overview
+
+ Despite being the 5th most spoken language globally (242M+ native speakers), Bangla remains severely underrepresented in code generation. **TigerCoder** addresses this gap by introducing the first dedicated Bangla Code LLM family, available in 1B and 9B parameter variants.
+
+ This model card is for **TigerCoder-9B**, the instruction-tuned 9B-parameter variant, fine-tuned on **300K Bangla instruction-code pairs** from the Bangla-Code-Instruct dataset. TigerCoder-9B reaches **0.82 Pass@1 on MBPP-Bangla** and improves Pass@K by 0.11 to 0.18 (11 to 18 percentage points) over the strongest prior baselines, Gemma-3 27B and TigerLLM-9B, while being one-third the size of Gemma-3 27B.
+
+ ## Key Contributions
+
+ 1. **Bangla-Code-Instruct**: A comprehensive 300K instruction-code dataset comprising three subsets: Self-Instruct (SI, 100K), Synthetic (Syn, 100K), and Translated+Filtered (TE, 100K).
+ 2. **MBPP-Bangla**: A 974-problem benchmark with expert-validated Bangla programming tasks across 5 programming languages (Python, C++, Java, JavaScript, Ruby).
+ 3. **TigerCoder Model Family**: Specialized Bangla Code LLMs (1B and 9B) that set a new state-of-the-art for Bangla code generation.
+
+ ## Performance
+
+ ### Python (Pass@K on Bangla Prompts)
+
+ | Model | mHumanEval P@1 | mHumanEval P@10 | mHumanEval P@100 | MBPP P@1 | MBPP P@10 | MBPP P@100 |
+ |:---|:---:|:---:|:---:|:---:|:---:|:---:|
+ | GPT-3.5 | 0.56 | 0.56 | 0.59 | 0.60 | 0.62 | 0.62 |
+ | Gemini-Flash 2.5 | 0.58 | 0.61 | 0.62 | 0.62 | 0.62 | 0.70 |
+ | Gemma-3 (27B) | 0.64 | 0.65 | 0.69 | 0.69 | 0.70 | 0.70 |
+ | TigerLLM (9B) | 0.63 | 0.69 | 0.72 | 0.61 | 0.68 | 0.73 |
+ | TigerCoder (1B) | 0.69 | 0.73 | 0.77 | 0.74 | 0.74 | 0.81 |
+ | **TigerCoder (9B)** | **0.75** | **0.80** | **0.84** | **0.82** | **0.84** | **0.91** |
+
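Pass@K is the probability that at least one of K sampled completions passes all unit tests for a problem. This card does not say how the values above were estimated; a common choice is the unbiased estimator introduced alongside HumanEval, and the sketch below assumes that choice. The function name `pass_at_k` and the sample counts are illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: completions sampled, c: completions that passed all tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 200 samples for one problem, 50 of them correct.
print(round(pass_at_k(200, 50, 1), 4))
print(round(pass_at_k(200, 50, 10), 4))
```

For K = 1 the estimate reduces to c / n. A benchmark-level score would be the mean of this value over all problems.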
+ ### Improvements over Strongest Prior Baseline (Δ)
+
+ | Model | mHumanEval P@1 | mHumanEval P@10 | mHumanEval P@100 | MBPP P@1 | MBPP P@10 | MBPP P@100 |
+ |:---|:---:|:---:|:---:|:---:|:---:|:---:|
+ | TigerCoder (1B) | +0.05 | +0.04 | +0.05 | +0.05 | +0.04 | +0.08 |
+ | **TigerCoder (9B)** | **+0.11** | **+0.11** | **+0.12** | **+0.13** | **+0.14** | **+0.18** |
+
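The Δ values are consistent with subtracting, column by column, the highest score among the four prior baselines, which is not the same model in every column (Gemma-3 leads mHumanEval P@1, TigerLLM leads mHumanEval P@10). That is a reading of the two tables rather than a procedure stated by the authors; the short check below reproduces the Δ rows under that reading.

```python
# Pass@K scores copied from the Python results table in this card.
# Column order: mHumanEval P@1, P@10, P@100, then MBPP P@1, P@10, P@100.
BASELINES = {
    "GPT-3.5": [0.56, 0.56, 0.59, 0.60, 0.62, 0.62],
    "Gemini-Flash 2.5": [0.58, 0.61, 0.62, 0.62, 0.62, 0.70],
    "Gemma-3 (27B)": [0.64, 0.65, 0.69, 0.69, 0.70, 0.70],
    "TigerLLM (9B)": [0.63, 0.69, 0.72, 0.61, 0.68, 0.73],
}
TIGERCODER = {
    "TigerCoder (1B)": [0.69, 0.73, 0.77, 0.74, 0.74, 0.81],
    "TigerCoder (9B)": [0.75, 0.80, 0.84, 0.82, 0.84, 0.91],
}


def deltas_vs_best_baseline(scores):
    """Subtract the highest baseline score in each column from `scores`."""
    best = [max(column) for column in zip(*BASELINES.values())]
    return [round(s - b, 2) for s, b in zip(scores, best)]


for name, scores in TIGERCODER.items():
    print(name, deltas_vs_best_baseline(scores))
```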
+ ### Multi-Language Performance (TigerCoder-9B, Pass@1 on Bangla Prompts)
+
+ | Language | mHumanEval P@1 | MBPP P@1 |
+ |:---|:---:|:---:|
+ | Python | 0.75 | 0.82 |
+ | C++ | 0.67 | 0.72 |
+ | Java | 0.62 | 0.67 |
+ | JavaScript | 0.57 | 0.62 |
+
+ ## Usage
+
+ ### Quickstart
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_name = "md-nishat-008/TigerCoder-9B"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
+
+ # Bangla coding prompt: "Write a function that computes the factorial of a number."
+ chat = [{"role": "user", "content": "একটি ফাংশন লিখুন যা একটি সংখ্যার ফ্যাক্টরিয়াল গণনা করে।"}]
+
+ # add_generation_prompt=True appends the assistant turn marker so the model answers
+ # rather than continuing the user message; return_dict=True also returns the attention mask.
+ inputs = tokenizer.apply_chat_template(
+     chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
+ ).to(model.device)
+
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=512,
+         do_sample=True,  # without this, temperature and top_p are ignored
+         temperature=0.7,
+         top_p=0.95,
+         pad_token_id=tokenizer.eos_token_id,
+         eos_token_id=tokenizer.eos_token_id,
+     )
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
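The string decoded in the Quickstart contains the prompt as well as the model's reply, and the reply may mix explanation with code. If the model wraps its code in Markdown fences (an assumption; this card does not describe the output format), a helper along these lines could isolate the code before it is run or tested. `extract_code` is a hypothetical name, not part of any released tooling.

```python
import re

FENCE = "`" * 3  # built this way so the example does not nest literal fences
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(generated: str) -> str:
    """Return the first fenced block, or the whole text if no fence is found."""
    match = CODE_BLOCK.search(generated)
    return match.group(1).strip() if match else generated.strip()


# Hypothetical model output: a Bangla explanation followed by a fenced function.
sample = "ব্যাখ্যা:\n" + FENCE + "python\ndef f():\n    return 1\n" + FENCE
print(extract_code(sample))
```

Falling back to the full text keeps the helper usable when the model emits bare code without fences.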
+ ## Training Details
+
+ | Hyperparameter | Value |
+ |:---|:---|
+ | Base Model | TigerLLM-9B-it |
+ | Training Data | Bangla-Code-Instruct (300K examples) |
+ | Max Sequence Length | 2048 |
+ | Batch Size (Train / Eval) | 32 |
+ | Gradient Accumulation Steps | 8 |
+ | Epochs | 3 |
+ | Learning Rate | 1 × 10⁻⁶ |
+ | Weight Decay | 0.04 |
+ | Warm-up | 15% of training steps |
+ | Optimizer | AdamW |
+ | LR Scheduler | Cosine |
+ | Precision | BF16 |
+ | Hardware | NVIDIA A100 (40GB) |
+
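The table implies an effective batch size and a step budget that are not spelled out. The arithmetic below is a rough sketch under two assumptions the card does not confirm: the batch size of 32 is per device on a single device, and all 300K examples are used for training in each epoch.

```python
import math

# Values taken from the Training Details table.
num_examples = 300_000
batch_size = 32
grad_accum_steps = 8
epochs = 3
warmup_fraction = 0.15

# Assumption (not stated in the card): a single device, per-device batch size.
num_devices = 1

effective_batch = batch_size * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_fraction)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

With more devices the effective batch grows and the step counts shrink proportionally, so these figures are an upper bound on optimizer steps under the stated assumptions.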
+ ## Datasets
+
+ The **Bangla-Code-Instruct** dataset (300K total) consists of three complementary subsets:
+
+ | Subset | Size | Method | Prompt Origin | Code Origin |
+ |:---|:---:|:---|:---|:---|
+ | SI (Self-Instruct) | 100K | 5,000 expert seeds + GPT-4o expansion | Semi-Natural | Synthetic |
+ | Syn (Synthetic) | 100K | GPT-4o + Claude 3.5 generation | Synthetic | Synthetic |
+ | TE (Translated) | 100K | NLLB-200 MT from Evol-Instruct | Translated | Natural (Source) |
+
+ All code in the SI and Syn subsets is validated via syntax checking (`ast.parse`) and execution testing (Python 3.13.0, 10 s timeout, 16 GB memory limit).
+
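The two validation stages described above can be sketched as follows. This is an illustration of the general technique, not the authors' pipeline: the function name `is_valid_python` is hypothetical, the 16 GB memory cap is omitted because enforcing it is platform-specific, and treating a zero exit code as success is an assumption. Running generated code should in practice happen inside a sandbox.

```python
import ast
import os
import subprocess
import sys
import tempfile


def is_valid_python(code: str, timeout_s: float = 10.0) -> bool:
    """Return True if `code` parses and exits with status 0 within the timeout."""
    try:
        ast.parse(code)  # syntax check only; nothing is executed here
    except (SyntaxError, ValueError):
        return False

    # Write the snippet to a temporary file and run it in a separate interpreter.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".py", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(code)
        path = handle.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False  # the child process is killed when the timeout expires
    finally:
        os.unlink(path)
    return result.returncode == 0


print(is_valid_python("print(1 + 1)"))
print(is_valid_python("def f(:"))
```

Checking syntax first avoids spawning a process for snippets that cannot run at all.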
+ ## Key Findings
+
+ 1. **LLMs exhibit a notable performance drop when coding prompts are in Bangla rather than English.** Most models lose 20-50+ percentage points.
+ 2. **Bangla → English machine translation does not help.** Translated prompts perform similarly to or worse than native Bangla prompts because code-specific keywords are mistranslated (e.g., "অক্ষর" (Character) → "Letter", "চলক" (Variable) → "Clever", "স্ট্রিং" (String) → "Rope").
+ 3. **High-quality, targeted data beats scale.** TigerCoder-1B surpasses models 27x its size, and TigerCoder-9B widens the lead to 11-18 percentage points, confirming that curated, domain-specific data outweighs model scale for low-resource code generation.
+
+ ## Limitations
+
+ - TigerCoder is optimized primarily for Bangla code generation tasks. Performance on general NLU or non-code tasks may not match that of general-purpose models.
+ - The training data is synthetically generated and/or machine-translated, which may introduce biases or artifacts.
+ - Evaluation is currently limited to MBPP-Bangla and mHumanEval-Bangla; performance on real-world, production-level coding tasks has not been benchmarked.
+
+ ## Ethics Statement
+
+ We adhere to the ethical guidelines outlined in the LREC 2026 CFP. Our benchmark creation involved careful translation and verification by qualified native speakers. We promote transparency through the open-source release of our models, datasets, and benchmark. We encourage responsible downstream use and community scrutiny.
+
+ ---
+
+ ## Citation
+
+ If you find our work helpful, please consider citing our paper:
+
+ ```bibtex
+ @article{raihan2025tigercoder,
+   title={{T}iger{C}oder: A Novel Suite of {LLM}s for Code Generation in {B}angla},
+   author={Raihan, Nishat and Anastasopoulos, Antonios and Zampieri, Marcos},
+   journal={arXiv preprint arXiv:2509.09101},
+   year={2025}
+ }
+ ```
+
+ You may also find our related work useful:
+
+ ```bibtex
+ @inproceedings{raihan-zampieri-2025-tigerllm,
+   title = "{T}iger{LLM} - A Family of {B}angla Large Language Models",
+   author = "Raihan, Nishat and
+     Zampieri, Marcos",
+   booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
+   month = jul,
+   year = "2025",
+   address = "Vienna, Austria",
+   publisher = "Association for Computational Linguistics",
+   url = "https://aclanthology.org/2025.acl-short.69/",
+   doi = "10.18653/v1/2025.acl-short.69",
+   pages = "887--896",
+   ISBN = "979-8-89176-252-7"
+ }
+ ```
+
+ ```bibtex
+ @inproceedings{raihan-etal-2025-mhumaneval,
+   title = "m{H}uman{E}val - A Multilingual Benchmark to Evaluate Large Language Models for Code Generation",
+   author = "Raihan, Nishat and
+     Anastasopoulos, Antonios and
+     Zampieri, Marcos",
+   booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
+   year = "2025",
+ }
+ ```