# Bengali-Code LLM Training Pipeline
A comprehensive pipeline for training a Bengali language model specialized in code understanding and generation. The model is fine-tuned on Bengali programming tutorials, documentation, and code examples.
## Features
- Automated data collection from Bengali Wikipedia and Prothom Alo
- Custom tokenizer training with SentencePiece for Bengali text and code
- Model fine-tuning using TinyLlama base model
- Comprehensive evaluation suite for Bengali code generation
- GitHub Actions workflow for automated training
- Weights & Biases integration for experiment tracking
## Requirements
- Python 3.10 or higher
- CUDA-capable GPU (recommended)
- 16GB+ RAM
- Internet connection for data collection
## Quick Start
1. Clone the repository:
```bash
git clone https://github.com/yourusername/bengali-code-llm.git
cd bengali-code-llm
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up environment variables:
```bash
export HUGGINGFACE_TOKEN="your_token_here"
export WANDB_API_KEY="your_wandb_key_here"
```
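Inside the Python scripts, these variables can then be read at start-up. A minimal sketch (the helper name `require_env` is illustrative, not part of the repository):

```python
import os

def require_env(name: str) -> str:
    """Read a required environment variable, failing fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Usage (at script start-up):
# hf_token = require_env("HUGGINGFACE_TOKEN")
# wandb_key = require_env("WANDB_API_KEY")
```

Failing fast here gives a clearer message than a cryptic authentication error later in the pipeline.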
4. Run the complete pipeline:
```bash
# Collect data
python scripts/data_collector.py
# Train tokenizer
python scripts/tokenizer_trainer.py
# Train model
python scripts/model_trainer.py
# Evaluate model
python scripts/model_evaluator.py
```
## Pipeline Components
### Data Collection (`scripts/data_collector.py`)
- Scrapes Bengali text from Wikipedia and Prothom Alo
- Implements rate limiting and error handling
- Outputs processed data in JSON format
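The rate-limiting and JSON-output pattern can be sketched as follows. This is a hypothetical illustration, not the actual contents of `scripts/data_collector.py`; the class and function names are invented for the example:

```python
import json
import time
import urllib.request

class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last_call = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

def fetch(url: str, limiter: RateLimiter, retries: int = 3) -> str:
    """Fetch a page politely, retrying on transient network errors."""
    for attempt in range(retries):
        limiter.wait()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8")
        except OSError:
            if attempt == retries - 1:
                raise
    return ""

def save_json(records: list[dict], path: str) -> None:
    """Write collected documents as a JSON array.
    ensure_ascii=False keeps Bengali text human-readable in the output file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

`ensure_ascii=False` matters for Bengali: without it, every character is escaped to `\uXXXX` sequences, roughly tripling file size and making the data unreadable.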
### Tokenizer Training (`scripts/tokenizer_trainer.py`)
- Uses SentencePiece for tokenizer training
- Custom vocabulary with Bengali and code tokens
- Generates HuggingFace-compatible tokenizer files
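A sketch of how such a SentencePiece training call might be assembled. The paths, vocabulary size, and `<code>` markers below are illustrative assumptions, not values taken from the actual script:

```python
def spm_train_args(corpus: str, prefix: str, vocab_size: int = 32000) -> dict:
    """Assemble SentencePiece training options. character_coverage=1.0
    keeps the full Bengali script; code delimiters are registered as
    user-defined symbols so they survive tokenization intact."""
    return {
        "input": corpus,
        "model_prefix": prefix,
        "vocab_size": vocab_size,
        "model_type": "bpe",
        "character_coverage": 1.0,
        "user_defined_symbols": ["<code>", "</code>"],
    }

# Training run (requires `pip install sentencepiece`):
# import sentencepiece as spm
# spm.SentencePieceTrainer.train(**spm_train_args("data/raw/corpus.txt",
#                                                 "outputs/tokenizer/bn_code"))
```

Full character coverage is the usual choice for non-Latin scripts; the SentencePiece default of 0.9995 is tuned for languages like Japanese with very large character sets.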
### Model Training (`scripts/model_trainer.py`)
- Fine-tunes TinyLlama model
- Implements efficient training with gradient accumulation
- Supports mixed precision training
- Integrates with Weights & Biases for tracking
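Gradient accumulation trades memory for wall-clock time: gradients from several small micro-batches are summed before each optimizer step, so the effective batch size is their product. A minimal sketch (the batch sizes shown are illustrative, not the script's actual hyperparameters):

```python
def accumulation_steps(effective_batch: int, micro_batch: int) -> int:
    """Number of gradient-accumulation steps so that
    micro_batch * steps equals the effective batch size."""
    if effective_batch % micro_batch != 0:
        raise ValueError("effective batch size must be a multiple of the micro-batch size")
    return effective_batch // micro_batch

# Hypothetical wiring with the HuggingFace Trainer (values are illustrative):
# from transformers import TrainingArguments
# args = TrainingArguments(
#     output_dir="outputs/model",
#     per_device_train_batch_size=4,
#     gradient_accumulation_steps=accumulation_steps(64, 4),
#     fp16=True,          # mixed precision
#     report_to="wandb",  # Weights & Biases logging
# )
```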
### Model Evaluation (`scripts/model_evaluator.py`)
- Comprehensive evaluation suite
- Tests code generation capabilities
- Measures BLEU and ROUGE scores
- Generates detailed evaluation reports
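To make the metrics concrete, here are self-contained implementations of the unigram components these scores are built from: clipped unigram precision (the 1-gram part of BLEU) and ROUGE-1 recall. The real evaluator presumably uses full library implementations; this is a simplified stand-in:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: matched candidate tokens (each capped at
    its reference count) divided by candidate length."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(n, ref[tok]) for tok, n in cand.items())
    return matched / total

def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams recovered by the candidate."""
    ref = Counter(reference.split())
    if not ref:
        return 0.0
    cand = Counter(candidate.split())
    matched = sum(min(n, cand[tok]) for tok, n in ref.items())
    return matched / sum(ref.values())
```

The two metrics answer complementary questions: precision penalizes generated tokens absent from the reference, recall penalizes reference tokens the model failed to produce.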
## Training Metrics
The training progress can be monitored through Weights & Biases:
- Loss curves
- Evaluation metrics
- Generated samples
- Resource utilization
## GitHub Actions Workflow
The repository includes an automated training pipeline that:
- Runs daily to incorporate new data
- Executes the complete training pipeline
- Uploads model artifacts
- Can be triggered manually
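The combination of a daily schedule and a manual trigger maps onto standard workflow syntax. A hypothetical sketch of what `.github/workflows/train_model.yml` might look like (the cron time and step list are illustrative):

```yaml
name: train-model
on:
  schedule:
    - cron: "0 2 * * *"   # daily run
  workflow_dispatch: {}    # manual trigger
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python scripts/data_collector.py
      - run: python scripts/tokenizer_trainer.py
      - run: python scripts/model_trainer.py
      - run: python scripts/model_evaluator.py
      - uses: actions/upload-artifact@v4
        with:
          name: model-artifacts
          path: outputs/
```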
## Directory Structure
```
bengali-code-llm/
├── .github/
│   └── workflows/
│       └── train_model.yml
├── scripts/
│   ├── data_collector.py
│   ├── tokenizer_trainer.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── data/
│   └── raw/
├── outputs/
│   ├── tokenizer/
│   ├── model/
│   └── evaluation/
├── requirements.txt
└── README.md
```
## Model Performance
The model is evaluated on various tasks:
- Code generation in Bengali
- Code explanation and documentation
- Error detection and correction
- Algorithm explanation
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing
Contributions are welcome! Please feel free to submit issues and pull requests.
## Contact
For questions and feedback, please open an issue in the repository.
## Acknowledgments
- TinyLlama team for the base model
- HuggingFace for the Transformers library
- Weights & Biases for experiment tracking