# Bengali-Code LLM Training Pipeline

A comprehensive pipeline for training a Bengali language model specialized in code understanding and generation. The model is fine-tuned on Bengali programming tutorials, documentation, and code examples.

## 🌟 Features

- Automated data collection from Bengali Wikipedia and Prothom Alo
- Custom tokenizer training with SentencePiece for Bengali text and code
- Model fine-tuning using TinyLlama base model
- Comprehensive evaluation suite for Bengali code generation
- GitHub Actions workflow for automated training
- Weights & Biases integration for experiment tracking

## 📋 Requirements

- Python 3.10 or higher
- CUDA-capable GPU (recommended)
- 16GB+ RAM
- Internet connection for data collection

## 🚀 Quick Start

1. Clone the repository:
```bash
git clone https://github.com/yourusername/bengali-code-llm.git
cd bengali-code-llm
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Set up environment variables:
```bash
export HUGGINGFACE_TOKEN="your_token_here"
export WANDB_API_KEY="your_wandb_key_here"
```

4. Run the complete pipeline:
```bash
# Collect data
python scripts/data_collector.py

# Train tokenizer
python scripts/tokenizer_trainer.py

# Train model
python scripts/model_trainer.py

# Evaluate model
python scripts/model_evaluator.py
```

## 🏗️ Pipeline Components

### Data Collection (`scripts/data_collector.py`)

- Scrapes Bengali text from Wikipedia and Prothom Alo
- Implements rate limiting and error handling
- Outputs processed data in JSON format

### Tokenizer Training (`scripts/tokenizer_trainer.py`)

- Uses SentencePiece for tokenizer training
- Custom vocabulary with Bengali and code tokens
- Generates HuggingFace-compatible tokenizer files

### Model Training (`scripts/model_trainer.py`)

- Fine-tunes TinyLlama model
- Implements efficient training with gradient accumulation
- Supports mixed precision training
- Integrates with Weights & Biases for tracking

### Model Evaluation (`scripts/model_evaluator.py`)

- Comprehensive evaluation suite
- Tests code generation capabilities
- Measures BLEU and ROUGE scores
- Generates detailed evaluation reports

## 📊 Training Metrics

The training progress can be monitored through Weights & Biases:

- Loss curves
- Evaluation metrics
- Generated samples
- Resource utilization

## 🔄 GitHub Actions Workflow

The repository includes an automated training pipeline that:

- Runs daily to incorporate new data
- Executes the complete training pipeline
- Uploads model artifacts
- Can be triggered manually

## 📁 Directory Structure

```
bengali-code-llm/
├── .github/
│   └── workflows/
│       └── train_model.yml
├── scripts/
│   ├── data_collector.py
│   ├── tokenizer_trainer.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── data/
│   └── raw/
├── outputs/
│   ├── tokenizer/
│   ├── model/
│   └── evaluation/
├── requirements.txt
└── README.md
```

## 🎯 Model Performance

The model is evaluated on various tasks:

- Code generation in Bengali
- Code explanation and documentation
- Error detection and correction
- Algorithm explanation

## 📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

## 📧 Contact

For questions and feedback, please open an issue in the repository.

## 🙏 Acknowledgments

- TinyLlama team for the base model
- HuggingFace for the Transformers library
- Weights & Biases for experiment tracking
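
## 🧩 Appendix: Rate-Limited Fetching Sketch

The data collection component is described above as implementing rate limiting and error handling. The following is a minimal sketch of how that combination might look, not the actual contents of `scripts/data_collector.py`. The names `RateLimiter` and `fetch_with_retry`, the one-second interval, and the exponential backoff policy are all assumptions made for illustration. It uses only the Python standard library, and the clock and sleep functions are injectable so the logic can be exercised without real delays or network access.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Hypothetical helper: enforce a minimum delay between consecutive calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous request

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    limiter: RateLimiter,
    max_retries: int = 3,
    backoff: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url) through the limiter, retrying with exponential backoff.

    Assumption: transient failures surface as OSError, which covers
    urllib.error.URLError and socket-level errors.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        limiter.wait()
        try:
            return fetch(url)
        except OSError as exc:
            last_error = exc
            if attempt < max_retries - 1:
                # Delays grow as backoff, 2*backoff, 4*backoff, ...
                sleep(backoff * 2 ** attempt)
    raise RuntimeError(
        f"giving up on {url} after {max_retries} attempts"
    ) from last_error
```

In a real collector, `fetch` could be a small wrapper around `urllib.request.urlopen` that decodes the response body, with one shared `RateLimiter` per target site so that requests to Wikipedia and Prothom Alo are throttled independently.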