# Bengali-Code LLM Training Pipeline

A comprehensive pipeline for training a Bengali language model specialized in code understanding and generation. The model is fine-tuned on Bengali programming tutorials, documentation, and code examples.

## Features

- Automated data collection from Bengali Wikipedia and Prothom Alo
- Custom tokenizer training with SentencePiece for Bengali text and code
- Model fine-tuning using the TinyLlama base model
- Comprehensive evaluation suite for Bengali code generation
- GitHub Actions workflow for automated training
- Weights & Biases integration for experiment tracking

## Requirements

- Python 3.10 or higher
- CUDA-capable GPU (recommended)
- 16GB+ RAM
- Internet connection for data collection
## Quick Start

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/bengali-code-llm.git
   cd bengali-code-llm
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables:

   ```bash
   export HUGGINGFACE_TOKEN="your_token_here"
   export WANDB_API_KEY="your_wandb_key_here"
   ```

4. Run the complete pipeline:

   ```bash
   # Collect data
   python scripts/data_collector.py

   # Train tokenizer
   python scripts/tokenizer_trainer.py

   # Train model
   python scripts/model_trainer.py

   # Evaluate model
   python scripts/model_evaluator.py
   ```
## Pipeline Components

### Data Collection (`scripts/data_collector.py`)

- Scrapes Bengali text from Wikipedia and Prothom Alo
- Implements rate limiting and error handling
- Outputs processed data in JSON format (see the sketch below)
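A minimal sketch of the rate-limited fetch loop, assuming a plain `requests`-based crawler; the URL list, timeout, and output path are illustrative placeholders rather than the script's actual configuration:

```python
import json
import time

import requests

RATE_LIMIT_SECONDS = 1.0  # pause between requests to stay polite


def fetch_pages(urls, out_path="data/raw/corpus.json"):
    """Fetch each URL, skip failures, and dump the raw text to JSON."""
    records = []
    for url in urls:
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            records.append({"url": url, "text": resp.text})
        except requests.RequestException as err:
            print(f"skipping {url}: {err}")  # log and continue the crawl
        time.sleep(RATE_LIMIT_SECONDS)       # simple fixed-interval rate limit
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
```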
### Tokenizer Training (`scripts/tokenizer_trainer.py`)

- Uses SentencePiece for tokenizer training
- Custom vocabulary with Bengali and code tokens
- Generates HuggingFace-compatible tokenizer files (see the sketch below)
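A minimal sketch of the SentencePiece training call, assuming a single plain-text corpus file; the vocabulary size, coverage, and extra symbols are illustrative values, not the project's actual settings:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/raw/corpus.txt",                  # hypothetical corpus path
    model_prefix="outputs/tokenizer/bengali_code",
    vocab_size=32000,                             # illustrative size
    character_coverage=0.9995,                    # high coverage for Bengali script
    user_defined_symbols=["<code>", "</code>"],   # hypothetical code markers
)
```

One common route to HuggingFace-compatible files is to wrap the resulting `.model` file, e.g. with `transformers.LlamaTokenizer(vocab_file="outputs/tokenizer/bengali_code.model")`, and then call `save_pretrained`.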
### Model Training (`scripts/model_trainer.py`)

- Fine-tunes the TinyLlama model
- Implements efficient training with gradient accumulation
- Supports mixed-precision training
- Integrates with Weights & Biases for tracking (see the sketch below)
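A minimal sketch of that setup using the Transformers `Trainer`, assuming the TinyLlama 1.1B chat checkpoint; the hyperparameters and the two-line toy dataset are placeholders, not the script's real configuration:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Two-line toy stand-in for the real tokenized corpus.
texts = ["print('হ্যালো')", "def add(a, b):\n    return a + b"]
enc = tokenizer(texts, truncation=True, padding="max_length", max_length=64)
enc["labels"] = [ids.copy() for ids in enc["input_ids"]]
train_dataset = Dataset.from_dict(dict(enc))

args = TrainingArguments(
    output_dir="outputs/model",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # simulates a larger effective batch
    fp16=True,                       # mixed precision; requires a CUDA GPU
    num_train_epochs=1,
    logging_steps=10,
    report_to="wandb",               # stream metrics to Weights & Biases
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```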
### Model Evaluation (`scripts/model_evaluator.py`)

- Comprehensive evaluation suite
- Tests code generation capabilities
- Measures BLEU and ROUGE scores (see the sketch below)
- Generates detailed evaluation reports
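A minimal sketch of the metric computation using the `evaluate` library (with `sacrebleu` and `rouge_score` installed); the prediction/reference pair is a placeholder, and the actual evaluator may compute these differently:

```python
import evaluate  # pip install evaluate sacrebleu rouge_score

predictions = ["def add(a, b):\n    return a + b"]  # model output (placeholder)
references = ["def add(a, b):\n    return a + b"]   # gold answer (placeholder)

# sacreBLEU expects one list of reference strings per prediction.
bleu = evaluate.load("sacrebleu").compute(
    predictions=predictions, references=[[r] for r in references]
)
rouge = evaluate.load("rouge").compute(
    predictions=predictions, references=references
)
print(f"BLEU: {bleu['score']:.2f}  ROUGE-L: {rouge['rougeL']:.3f}")
```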
## Training Metrics

Training progress can be monitored through Weights & Biases (a minimal logging sketch follows this list):

- Loss curves
- Evaluation metrics
- Generated samples
- Resource utilization
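When the trainer's `report_to="wandb"` hook is not enough, metrics can also be logged by hand; the project name and values here are placeholders:

```python
import wandb

run = wandb.init(project="bengali-code-llm")        # hypothetical project name
wandb.log({"train/loss": 1.23, "eval/bleu": 17.5})  # placeholder values
run.finish()
```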
## GitHub Actions Workflow

The repository includes an automated training pipeline that:

- Runs daily to incorporate new data
- Executes the complete training pipeline
- Uploads model artifacts
- Can be triggered manually
## Directory Structure

```
bengali-code-llm/
├── .github/
│   └── workflows/
│       └── train_model.yml
├── scripts/
│   ├── data_collector.py
│   ├── tokenizer_trainer.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── data/
│   └── raw/
├── outputs/
│   ├── tokenizer/
│   ├── model/
│   └── evaluation/
├── requirements.txt
└── README.md
```
## Model Performance

The model is evaluated on various tasks:

- Code generation in Bengali
- Code explanation and documentation
- Error detection and correction
- Algorithm explanation

## License

This project is licensed under the MIT License; see the LICENSE file for details.

## Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

## Contact

For questions and feedback, please open an issue in the repository.

## Acknowledgments

- TinyLlama team for the base model
- HuggingFace for the Transformers library
- Weights & Biases for experiment tracking