# Bengali-Code LLM Training Pipeline
A comprehensive pipeline for training a Bengali language model specialized in code understanding and generation. The model is fine-tuned on Bengali programming tutorials, documentation, and code examples.
## Features
- Automated data collection from Bengali Wikipedia and Prothom Alo
- Custom tokenizer training with SentencePiece for Bengali text and code
- Model fine-tuning using TinyLlama base model
- Comprehensive evaluation suite for Bengali code generation
- GitHub Actions workflow for automated training
- Weights & Biases integration for experiment tracking
## Requirements
- Python 3.10 or higher
- CUDA-capable GPU (recommended)
- 16GB+ RAM
- Internet connection for data collection
## Quick Start
1. Clone the repository:
```bash
git clone https://github.com/yourusername/bengali-code-llm.git
cd bengali-code-llm
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up environment variables:
```bash
export HUGGINGFACE_TOKEN="your_token_here"
export WANDB_API_KEY="your_wandb_key_here"
```
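Inside the Python scripts, these variables can then be read at start-up. A minimal sketch (the helper name `require_env` is illustrative, not part of the repository):

```python
import os

def require_env(name: str) -> str:
    """Read a required environment variable, failing fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Usage (at script start-up):
# hf_token = require_env("HUGGINGFACE_TOKEN")
# wandb_key = require_env("WANDB_API_KEY")
```

Failing fast here gives a clearer message than a cryptic authentication error later in the pipeline.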
4. Run the complete pipeline:
```bash
# Collect data
python scripts/data_collector.py
# Train tokenizer
python scripts/tokenizer_trainer.py
# Train model
python scripts/model_trainer.py
# Evaluate model
python scripts/model_evaluator.py
```
## Pipeline Components
### Data Collection (`scripts/data_collector.py`)
- Scrapes Bengali text from Wikipedia and Prothom Alo
- Implements rate limiting and error handling
- Outputs processed data in JSON format
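The rate-limiting and JSON-output pattern can be sketched as follows. This is a hypothetical illustration, not the actual contents of `scripts/data_collector.py`; the class and function names are invented for the example:

```python
import json
import time
import urllib.request

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last_call = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

def fetch(url: str, limiter: RateLimiter, retries: int = 3) -> str:
    """Fetch a page politely, retrying on transient network errors."""
    for attempt in range(retries):
        limiter.wait()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8")
        except OSError:
            if attempt == retries - 1:
                raise
    return ""

def save_json(records: list[dict], path: str) -> None:
    """Write collected documents as a JSON array.
    ensure_ascii=False keeps Bengali text human-readable in the output file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

`ensure_ascii=False` matters for Bengali: without it, every character is escaped to `\uXXXX` sequences, roughly tripling file size and making the data unreadable.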
### Tokenizer Training (`scripts/tokenizer_trainer.py`)
- Uses SentencePiece for tokenizer training
- Custom vocabulary with Bengali and code tokens
- Generates HuggingFace-compatible tokenizer files
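A sketch of how such a SentencePiece training call might be assembled. The paths, vocabulary size, and `<code>` markers below are illustrative assumptions, not values taken from the actual script:

```python
def spm_train_args(corpus: str, prefix: str, vocab_size: int = 32000) -> dict:
    """Assemble SentencePiece training options. character_coverage=1.0
    keeps the full Bengali script; code delimiters are registered as
    user-defined symbols so they survive tokenization intact."""
    return {
        "input": corpus,
        "model_prefix": prefix,
        "vocab_size": vocab_size,
        "model_type": "bpe",
        "character_coverage": 1.0,
        "user_defined_symbols": ["<code>", "</code>"],
    }

# Training run (requires `pip install sentencepiece`):
# import sentencepiece as spm
# spm.SentencePieceTrainer.train(**spm_train_args("data/raw/corpus.txt",
#                                                 "outputs/tokenizer/bn_code"))
```

Full character coverage is the usual choice for non-Latin scripts; the SentencePiece default of 0.9995 is tuned for languages like Japanese with very large character sets.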
### Model Training (`scripts/model_trainer.py`)
- Fine-tunes TinyLlama model
- Implements efficient training with gradient accumulation
- Supports mixed precision training
- Integrates with Weights & Biases for tracking
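Gradient accumulation trades memory for wall-clock time: gradients from several small micro-batches are summed before each optimizer step, so the effective batch size is their product. A minimal sketch (the batch sizes shown are illustrative, not the script's actual hyperparameters):

```python
def accumulation_steps(effective_batch: int, micro_batch: int) -> int:
    """Number of gradient-accumulation steps so that
    micro_batch * steps equals the effective batch size."""
    if effective_batch % micro_batch != 0:
        raise ValueError("effective batch size must be a multiple of the micro-batch size")
    return effective_batch // micro_batch

# Hypothetical wiring with the HuggingFace Trainer (values are illustrative):
# from transformers import TrainingArguments
# args = TrainingArguments(
#     output_dir="outputs/model",
#     per_device_train_batch_size=4,
#     gradient_accumulation_steps=accumulation_steps(64, 4),
#     fp16=True,          # mixed precision
#     report_to="wandb",  # Weights & Biases logging
# )
```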
### Model Evaluation (`scripts/model_evaluator.py`)
- Comprehensive evaluation suite
- Tests code generation capabilities
- Measures BLEU and ROUGE scores
- Generates detailed evaluation reports
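To make the metrics concrete, here are self-contained implementations of the unigram components these scores are built from: clipped unigram precision (the 1-gram part of BLEU) and ROUGE-1 recall. The real evaluator presumably uses full library implementations; this is a simplified stand-in:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: matched candidate tokens (each capped at
    its reference count) divided by candidate length."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(n, ref[tok]) for tok, n in cand.items())
    return matched / total

def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams recovered by the candidate."""
    ref = Counter(reference.split())
    if not ref:
        return 0.0
    cand = Counter(candidate.split())
    matched = sum(min(n, cand[tok]) for tok, n in ref.items())
    return matched / sum(ref.values())
```

The two metrics answer complementary questions: precision penalizes generated tokens absent from the reference, recall penalizes reference tokens the model failed to produce.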
## Training Metrics
The training progress can be monitored through Weights & Biases:
- Loss curves
- Evaluation metrics
- Generated samples
- Resource utilization
## GitHub Actions Workflow
The repository includes an automated training pipeline that:
- Runs daily to incorporate new data
- Executes the complete training pipeline
- Uploads model artifacts
- Can be triggered manually
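The combination of a daily schedule and a manual trigger maps onto standard workflow syntax. A hypothetical sketch of what `.github/workflows/train_model.yml` might look like (the cron time and step list are illustrative):

```yaml
name: train-model
on:
  schedule:
    - cron: "0 2 * * *"   # daily run
  workflow_dispatch: {}    # manual trigger
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python scripts/data_collector.py
      - run: python scripts/tokenizer_trainer.py
      - run: python scripts/model_trainer.py
      - run: python scripts/model_evaluator.py
      - uses: actions/upload-artifact@v4
        with:
          name: model-artifacts
          path: outputs/
```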
## Directory Structure
```
bengali-code-llm/
├── .github/
│   └── workflows/
│       └── train_model.yml
├── scripts/
│   ├── data_collector.py
│   ├── tokenizer_trainer.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── data/
│   └── raw/
├── outputs/
│   ├── tokenizer/
│   ├── model/
│   └── evaluation/
├── requirements.txt
└── README.md
```
## Model Performance
The model is evaluated on various tasks:
- Code generation in Bengali
- Code explanation and documentation
- Error detection and correction
- Algorithm explanation
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing
Contributions are welcome! Please feel free to submit issues and pull requests.
## Contact
For questions and feedback, please open an issue in the repository.
## Acknowledgments
- TinyLlama team for the base model
- HuggingFace for the Transformers library
- Weights & Biases for experiment tracking