# Bengali-Code LLM Training Pipeline
A comprehensive pipeline for training a Bengali language model specialized in code understanding and generation. The model is fine-tuned on Bengali programming tutorials, documentation, and code examples.
## 🌟 Features
- Automated data collection from Bengali Wikipedia and Prothom Alo
- Custom tokenizer training with SentencePiece for Bengali text and code
- Model fine-tuning using TinyLlama base model
- Comprehensive evaluation suite for Bengali code generation
- GitHub Actions workflow for automated training
- Weights & Biases integration for experiment tracking
## 📋 Requirements
- Python 3.10 or higher
- CUDA-capable GPU (recommended)
- 16GB+ RAM
- Internet connection for data collection
## 🚀 Quick Start
1. Clone the repository:
```bash
git clone https://github.com/yourusername/bengali-code-llm.git
cd bengali-code-llm
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up environment variables:
```bash
export HUGGINGFACE_TOKEN="your_token_here"
export WANDB_API_KEY="your_wandb_key_here"
```
4. Run the complete pipeline:
```bash
# Collect data
python scripts/data_collector.py
# Train tokenizer
python scripts/tokenizer_trainer.py
# Train model
python scripts/model_trainer.py
# Evaluate model
python scripts/model_evaluator.py
```
## πŸ—οΈ Pipeline Components
### Data Collection (`scripts/data_collector.py`)
- Scrapes Bengali text from Wikipedia and Prothom Alo
- Implements rate limiting and error handling
- Outputs processed data in JSON format (see the sketch below)
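The collector's internals aren't reproduced here, but a minimal Python sketch of the rate-limited fetch-and-save loop might look like this (URLs, field names, and the output path are illustrative assumptions, not the script's actual values):

```python
import json
import time

import requests


def collect_pages(urls, delay_seconds=1.0, output_path="data/raw/corpus.json"):
    """Fetch pages with a fixed delay between requests and save raw text as JSON."""
    records = []
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            records.append({"url": url, "text": response.text})
        except requests.RequestException as err:
            # Log and skip failures; a real collector might also retry.
            print(f"Skipping {url}: {err}")
        time.sleep(delay_seconds)  # simple rate limiting between requests
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```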
### Tokenizer Training (`scripts/tokenizer_trainer.py`)
- Uses SentencePiece for tokenizer training
- Custom vocabulary with Bengali and code tokens
- Generates HuggingFace-compatible tokenizer files (see the sketch below)
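The exact training options live in the script, but a SentencePiece run with a custom vocabulary might be invoked roughly like this (file paths, vocabulary size, and the custom symbols are illustrative assumptions):

```python
import sentencepiece as spm

# Train on the collected corpus; one sentence or code line per input line.
spm.SentencePieceTrainer.train(
    input="data/raw/corpus.txt",                  # assumed corpus path
    model_prefix="outputs/tokenizer/bengali_code",
    vocab_size=32000,                             # assumed vocabulary size
    character_coverage=1.0,                       # keep the full Bengali script
    user_defined_symbols=["<code>", "</code>"],   # example custom code tokens
)
```

The resulting `.model` file can then be wrapped for HuggingFace use, for example via `transformers.LlamaTokenizer(vocab_file=...)`.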
### Model Training (`scripts/model_trainer.py`)
- Fine-tunes TinyLlama model
- Implements efficient training with gradient accumulation
- Supports mixed precision training
- Integrates with Weights & Biases for experiment tracking (see the sketch below)
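As a rough sketch of how these pieces fit together with the HuggingFace `Trainer` (the checkpoint name, batch sizes, and epoch count are assumptions, and `train_dataset` stands in for a pre-tokenized dataset prepared elsewhere):

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

args = TrainingArguments(
    output_dir="outputs/model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 32
    fp16=True,                       # mixed precision on CUDA GPUs
    num_train_epochs=3,
    report_to="wandb",               # stream metrics to Weights & Biases
)

# train_dataset: a pre-tokenized datasets.Dataset built from the collected corpus.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```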
### Model Evaluation (`scripts/model_evaluator.py`)
- Comprehensive evaluation suite
- Tests code generation capabilities
- Measures BLEU and ROUGE scores
- Generates detailed evaluation reports (metric usage is sketched below)
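The report format is specific to the script, but computing the two metrics with the HuggingFace `evaluate` library might look like this (the sample strings are placeholders):

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["def add(a, b):\n    return a + b"]
references = ["def add(a, b):\n    return a + b"]

# BLEU takes a list of reference lists per prediction; ROUGE takes plain strings.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```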
## 📊 Training Metrics
Training progress can be monitored through Weights & Biases (a minimal logging snippet follows the list):
- Loss curves
- Evaluation metrics
- Generated samples
- Resource utilization
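For custom metrics outside the `Trainer` integration, logging to Weights & Biases takes only a couple of calls (the project name and values here are placeholders):

```python
import wandb

run = wandb.init(project="bengali-code-llm")        # illustrative project name
wandb.log({"train/loss": 1.23, "eval/bleu": 0.41})  # placeholder values
run.finish()
```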
## 🔄 GitHub Actions Workflow
The repository includes an automated training pipeline that:
- Runs daily to incorporate new data
- Executes the complete training pipeline
- Uploads model artifacts
- Can be triggered manually (a minimal workflow sketch follows)
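The actual definition lives in `.github/workflows/train_model.yml`; a minimal workflow with that shape might look like the following (action versions and step details are illustrative):

```yaml
name: Train Model

on:
  schedule:
    - cron: "0 0 * * *"   # daily run to pick up newly collected data
  workflow_dispatch:       # allows manual triggering

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python scripts/data_collector.py
      - run: python scripts/tokenizer_trainer.py
      - run: python scripts/model_trainer.py
      - run: python scripts/model_evaluator.py
      - uses: actions/upload-artifact@v4
        with:
          name: model-outputs
          path: outputs/
```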
## πŸ“ Directory Structure
```
bengali-code-llm/
├── .github/
│   └── workflows/
│       └── train_model.yml
├── scripts/
│   ├── data_collector.py
│   ├── tokenizer_trainer.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── data/
│   └── raw/
├── outputs/
│   ├── tokenizer/
│   ├── model/
│   └── evaluation/
├── requirements.txt
└── README.md
```
## 🎯 Model Performance
The model is evaluated on the following tasks (a brief generation example follows the list):
- Code generation in Bengali
- Code explanation and documentation
- Error detection and correction
- Algorithm explanation
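Once trained, the model can be exercised on these tasks with a standard generation loop; a hypothetical example (the prompt and checkpoint path are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("outputs/model")
model = AutoModelForCausalLM.from_pretrained("outputs/model")

# "Write a Python function to check whether a number is prime."
prompt = "একটি সংখ্যা মৌলিক কিনা তা পরীক্ষা করার জন্য একটি Python ফাংশন লিখুন।"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```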
## 📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🤝 Contributing
Contributions are welcome! Please feel free to submit issues and pull requests.
## 📧 Contact
For questions and feedback, please open an issue in the repository.
## πŸ™ Acknowledgments
- TinyLlama team for the base model
- HuggingFace for the Transformers library
- Weights & Biases for experiment tracking