# Bengali-Code LLM Training Pipeline
A comprehensive pipeline for training a Bengali language model specialized in code understanding and generation. The model is fine-tuned on Bengali programming tutorials, documentation, and code examples.
## 🌟 Features
- Automated data collection from Bengali Wikipedia and Prothom Alo
- Custom tokenizer training with SentencePiece for Bengali text and code
- Model fine-tuning using TinyLlama base model
- Comprehensive evaluation suite for Bengali code generation
- GitHub Actions workflow for automated training
- Weights & Biases integration for experiment tracking
## 📋 Requirements
- Python 3.10 or higher
- CUDA-capable GPU (recommended)
- 16GB+ RAM
- Internet connection for data collection
## 🚀 Quick Start
1. Clone the repository:
```bash
git clone https://github.com/yourusername/bengali-code-llm.git
cd bengali-code-llm
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up environment variables:
```bash
export HUGGINGFACE_TOKEN="your_token_here"
export WANDB_API_KEY="your_wandb_key_here"
```
4. Run the complete pipeline:
```bash
# Collect data
python scripts/data_collector.py
# Train tokenizer
python scripts/tokenizer_trainer.py
# Train model
python scripts/model_trainer.py
# Evaluate model
python scripts/model_evaluator.py
```
## πŸ—οΈ Pipeline Components
### Data Collection (`scripts/data_collector.py`)
- Scrapes Bengali text from Wikipedia and Prothom Alo
- Implements rate limiting and error handling
- Outputs processed data in JSON format (see the sketch below)
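The collector's internals aren't reproduced here, but a minimal Python sketch of the rate-limited fetch-and-save loop might look like this (URLs, field names, and the output path are illustrative assumptions, not the script's actual values):

```python
import json
import time

import requests


def collect_pages(urls, delay_seconds=1.0, output_path="data/raw/corpus.json"):
    """Fetch pages with a fixed delay between requests and save raw text as JSON."""
    records = []
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            records.append({"url": url, "text": response.text})
        except requests.RequestException as err:
            # Log and skip failures; a real collector might also retry.
            print(f"Skipping {url}: {err}")
        time.sleep(delay_seconds)  # simple rate limiting between requests
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```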
### Tokenizer Training (`scripts/tokenizer_trainer.py`)
- Uses SentencePiece for tokenizer training
- Custom vocabulary with Bengali and code tokens
- Generates HuggingFace-compatible tokenizer files (see the sketch below)
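The exact training options live in the script, but a SentencePiece run with a custom vocabulary might be invoked roughly like this (file paths, vocabulary size, and the custom symbols are illustrative assumptions):

```python
import sentencepiece as spm

# Train on the collected corpus; one sentence or code line per input line.
spm.SentencePieceTrainer.train(
    input="data/raw/corpus.txt",                  # assumed corpus path
    model_prefix="outputs/tokenizer/bengali_code",
    vocab_size=32000,                             # assumed vocabulary size
    character_coverage=1.0,                       # keep the full Bengali script
    user_defined_symbols=["<code>", "</code>"],   # example custom code tokens
)
```

The resulting `.model` file can then be wrapped for HuggingFace use, for example via `transformers.LlamaTokenizer(vocab_file=...)`.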
### Model Training (`scripts/model_trainer.py`)
- Fine-tunes TinyLlama model
- Implements efficient training with gradient accumulation
- Supports mixed precision training
- Integrates with Weights & Biases for experiment tracking (see the sketch below)
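As a rough sketch of how these pieces fit together with the HuggingFace `Trainer` (the checkpoint name, batch sizes, and epoch count are assumptions, and `train_dataset` stands in for a pre-tokenized dataset prepared elsewhere):

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

args = TrainingArguments(
    output_dir="outputs/model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 32
    fp16=True,                       # mixed precision on CUDA GPUs
    num_train_epochs=3,
    report_to="wandb",               # stream metrics to Weights & Biases
)

# train_dataset: a pre-tokenized datasets.Dataset built from the collected corpus.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```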
### Model Evaluation (`scripts/model_evaluator.py`)
- Comprehensive evaluation suite
- Tests code generation capabilities
- Measures BLEU and ROUGE scores
- Generates detailed evaluation reports (metric usage is sketched below)
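The report format is specific to the script, but computing the two metrics with the HuggingFace `evaluate` library might look like this (the sample strings are placeholders):

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["def add(a, b):\n    return a + b"]
references = ["def add(a, b):\n    return a + b"]

# BLEU takes a list of reference lists per prediction; ROUGE takes plain strings.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```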
## 📊 Training Metrics
Training progress can be monitored through Weights & Biases (a minimal logging snippet follows the list):
- Loss curves
- Evaluation metrics
- Generated samples
- Resource utilization
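For custom metrics outside the `Trainer` integration, logging to Weights & Biases takes only a couple of calls (the project name and values here are placeholders):

```python
import wandb

run = wandb.init(project="bengali-code-llm")        # illustrative project name
wandb.log({"train/loss": 1.23, "eval/bleu": 0.41})  # placeholder values
run.finish()
```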
## 🔄 GitHub Actions Workflow
The repository includes an automated training pipeline that:
- Runs daily to incorporate new data
- Executes the complete training pipeline
- Uploads model artifacts
- Can be triggered manually (a minimal workflow sketch follows)
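The actual definition lives in `.github/workflows/train_model.yml`; a minimal workflow with that shape might look like the following (action versions and step details are illustrative):

```yaml
name: Train Model

on:
  schedule:
    - cron: "0 0 * * *"   # daily run to pick up newly collected data
  workflow_dispatch:       # allows manual triggering

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python scripts/data_collector.py
      - run: python scripts/tokenizer_trainer.py
      - run: python scripts/model_trainer.py
      - run: python scripts/model_evaluator.py
      - uses: actions/upload-artifact@v4
        with:
          name: model-outputs
          path: outputs/
```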
## πŸ“ Directory Structure
```
bengali-code-llm/
├── .github/
│   └── workflows/
│       └── train_model.yml
├── scripts/
│   ├── data_collector.py
│   ├── tokenizer_trainer.py
│   ├── model_trainer.py
│   └── model_evaluator.py
├── data/
│   └── raw/
├── outputs/
│   ├── tokenizer/
│   ├── model/
│   └── evaluation/
├── requirements.txt
└── README.md
```
## 🎯 Model Performance
The model is evaluated on the following tasks (a brief generation example follows the list):
- Code generation in Bengali
- Code explanation and documentation
- Error detection and correction
- Algorithm explanation
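Once trained, the model can be exercised on these tasks with a standard generation loop; a hypothetical example (the prompt and checkpoint path are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("outputs/model")
model = AutoModelForCausalLM.from_pretrained("outputs/model")

# "Write a Python function to check whether a number is prime."
prompt = "একটি সংখ্যা মৌলিক কিনা তা পরীক্ষা করার জন্য একটি Python ফাংশন লিখুন।"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```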
## 📜 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🤝 Contributing
Contributions are welcome! Please feel free to submit issues and pull requests.
## 📧 Contact
For questions and feedback, please open an issue in the repository.
## πŸ™ Acknowledgments
- TinyLlama team for the base model
- HuggingFace for the Transformers library
- Weights & Biases for experiment tracking