# Bengali-Code LLM Training Pipeline

A comprehensive pipeline for training a Bengali language model specialized in code understanding and generation. The model is fine-tuned on Bengali programming tutorials, documentation, and code examples.

## 🌟 Features

- Automated data collection from Bengali Wikipedia and Prothom Alo
- Custom tokenizer training with SentencePiece for Bengali text and code
- Model fine-tuning using the TinyLlama base model
- Comprehensive evaluation suite for Bengali code generation
- GitHub Actions workflow for automated training
- Weights & Biases integration for experiment tracking

## 📋 Requirements

- Python 3.10 or higher
- CUDA-capable GPU (recommended)
- 16GB+ RAM
- Internet connection for data collection

## 🚀 Quick Start

1. Clone the repository:
```bash
git clone https://github.com/yourusername/bengali-code-llm.git
cd bengali-code-llm
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Set up environment variables:
```bash
export HUGGINGFACE_TOKEN="your_token_here"
export WANDB_API_KEY="your_wandb_key_here"
```
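
A missing token typically surfaces late as an opaque API error mid-run. A minimal fail-fast check can help; the `require_env` helper below is illustrative and not part of the repository:

```python
import os

def require_env(*names: str) -> dict:
    """Return the requested environment variables, raising early if any is unset."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}

# creds = require_env("HUGGINGFACE_TOKEN", "WANDB_API_KEY")
```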

4. Run the complete pipeline:
```bash
# Collect data
python scripts/data_collector.py

# Train tokenizer
python scripts/tokenizer_trainer.py

# Train model
python scripts/model_trainer.py

# Evaluate model
python scripts/model_evaluator.py
```

## πŸ—οΈ Pipeline Components

### Data Collection (`scripts/data_collector.py`)

- Scrapes Bengali text from Wikipedia and Prothom Alo
- Implements rate limiting and error handling
- Outputs processed data in JSON format

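The rate-limiting and JSON-output patterns above can be sketched as follows; the function names are illustrative and do not come from `data_collector.py`:

```python
import json
import time
import urllib.request

def fetch_with_retry(url: str, retries: int = 3, delay: float = 1.0) -> str:
    """Fetch a URL, backing off between attempts so the server is not hammered."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8")
        except OSError:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # linear backoff between retries

def save_records(records: list[dict], path: str) -> None:
    """Write collected documents as a JSON array, preserving Bengali text."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

`ensure_ascii=False` matters here: without it, every Bengali character is escaped to `\uXXXX` sequences, roughly tripling file size and making the raw data unreadable.
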
### Tokenizer Training (`scripts/tokenizer_trainer.py`)
- Uses SentencePiece for tokenizer training
- Custom vocabulary with Bengali and code tokens
- Generates HuggingFace-compatible tokenizer files
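
The call below sketches how such a tokenizer might be trained. The exact options used by `tokenizer_trainer.py` may differ, and the `user_defined_symbols` entries are hypothetical:

```python
def sentencepiece_args(corpus: str, prefix: str, vocab_size: int = 32000) -> dict:
    """Build keyword arguments for sentencepiece.SentencePieceTrainer.train().

    character_coverage near 1.0 is important for Bengali, whose script has
    far more distinct characters (including conjuncts) than English.
    """
    return {
        "input": corpus,
        "model_prefix": prefix,
        "vocab_size": vocab_size,
        "model_type": "unigram",
        "character_coverage": 0.9999,
        # Hypothetical extra symbols so code delimiters stay single tokens:
        "user_defined_symbols": ["<code>", "</code>"],
    }

# import sentencepiece as spm
# spm.SentencePieceTrainer.train(**sentencepiece_args("data/raw/corpus.txt", "outputs/tokenizer/bn_code"))
```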

### Model Training (`scripts/model_trainer.py`)

- Fine-tunes the TinyLlama base model
- Implements efficient training with gradient accumulation
- Supports mixed-precision training
- Integrates with Weights & Biases for tracking

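Gradient accumulation trades wall-clock time for memory: gradients from several small micro-batches are summed before each optimizer step, so the effective batch size grows without extra GPU memory. A sketch of the arithmetic, with hypothetical hyperparameters in the style of `transformers.TrainingArguments` (the repository's actual values may differ):

```python
def effective_batch_size(per_device: int, accum_steps: int, n_gpus: int = 1) -> int:
    """Effective batch size seen by the optimizer after gradient accumulation."""
    return per_device * accum_steps * n_gpus

# Hypothetical settings, not taken from model_trainer.py:
training_kwargs = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,  # optimizer sees an effective batch of 32
    "fp16": True,                      # mixed-precision training
    "report_to": "wandb",              # Weights & Biases logging
}
```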
### Model Evaluation (`scripts/model_evaluator.py`)
- Comprehensive evaluation suite
- Tests code generation capabilities
- Measures BLEU and ROUGE scores
- Generates detailed evaluation reports
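
ROUGE-1, for example, scores unigram overlap between a generated answer and a reference. A simplified pure-Python version of the F1 variant (whitespace tokenization only, unlike full ROUGE implementations):

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```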

## 📊 Training Metrics

The training progress can be monitored through Weights & Biases:
- Loss curves
- Evaluation metrics
- Generated samples
- Resource utilization

## 🔄 GitHub Actions Workflow

The repository includes an automated training pipeline that:
- Runs daily to incorporate new data
- Executes the complete training pipeline
- Uploads model artifacts
- Can be triggered manually

## πŸ“ Directory Structure

```
bengali-code-llm/
├── .github/
│   └── workflows/
│       └── train_model.yml
├── scripts/
│   ├── data_collector.py
│   ├── tokenizer_trainer.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── data/
│   └── raw/
├── outputs/
│   ├── tokenizer/
│   ├── model/
│   └── evaluation/
├── requirements.txt
└── README.md
```

## 🎯 Model Performance

The model is evaluated on various tasks:
- Code generation in Bengali
- Code explanation and documentation
- Error detection and correction
- Algorithm explanation

## 📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

## 📧 Contact

For questions and feedback, please open an issue in the repository.

## πŸ™ Acknowledgments

- TinyLlama team for the base model
- HuggingFace for the Transformers library
- Weights & Biases for experiment tracking