---
library_name: peft
license: bigcode-openrail-m
base_model: bigcode/starcoderbase-1b
tags:
- base_model:adapter:bigcode/starcoderbase-1b
- lora
- transformers
pipeline_tag: text-generation
model-index:
- name: peft-starcoder-lora-apple
results: []
datasets:
- smangrul/hf-stack-v1
language:
- en
---
# Code Completion with StarCoder
A Python implementation of an AI-powered code completion system using the StarCoder base model with LoRA fine-tuning. This project provides a lightweight yet powerful code completion capability that can be trained on custom datasets.
## Features
- **Fill-in-the-Middle (FIM) Capability**: Handles both prefix-suffix code completion and middle-context completion (see the prompt sketch after this list)
- **LoRA Fine-tuning**: Efficient parameter-efficient fine-tuning using Low-Rank Adaptation
- **Modular Architecture**: Clean separation between settings, model components, and training logic
- **Customizable Training**: Easily adjust hyperparameters through the settings file
- **Apple Silicon Support**: Optimized for Apple MPS devices. Note: Hugging Face does not currently support Apple's MLX backend out of the box, so training on Apple Silicon is slower than on a CUDA device.
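For context on how FIM prompts look: StarCoder models reserve the special tokens `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` for this. The sketch below shows the standard prefix-suffix-middle layout; `build_fim_prompt` is an illustrative helper, not a function from this repo (the actual prompt assembly lives in `src/model.py`).
```python
# Minimal sketch of StarCoder's prefix-suffix-middle (PSM) FIM prompt layout.
# `build_fim_prompt` is illustrative only; the repo's own assembly may differ.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    # The model generates the "middle" that connects prefix to suffix.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
```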
## Requirements
- Python 3.8+
- PyTorch 2.0+
- Transformers 4.30+
- PEFT (Parameter-Efficient Fine-Tuning)
- Datasets
- Accelerate
- BitsAndBytes (for quantization)
## 🛠 Installation
```bash
# Clone the repository
git clone https://github.com/deep-learner-ConfigurableAI/code-completion.git
cd code-completion
# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install torch transformers datasets accelerate bitsandbytes peft tqdm
```
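If you plan to run on Apple Silicon (see the note under Features), you can confirm that your PyTorch build exposes the MPS backend:
```python
import torch

# True on Apple Silicon with a recent PyTorch build; otherwise training
# falls back to the CPU.
print(torch.backends.mps.is_available())
```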
## Project Structure
```
code-completion/
├── LICENSE
├── README.md
├── src/
│   ├── main.py       # Entry point for the application
│   ├── settings.py   # Configuration settings
│   ├── model.py      # Core model implementation
│   └── runner.py     # Training and inference logic
```
## Configuration
All model and training configurations are centralized in `src/settings.py`. Key parameters include (an illustrative sketch follows the list):
- Model checkpoint (`MODEL`)
- Training dataset (`DATASET`)
- Sequence length (`SEQ_LENGTH`)
- Training parameters (batch size, learning rate, etc.)
- LoRA configuration (rank, alpha, target modules)
- FIM transformation settings
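The repository's actual `settings.py` is not reproduced here; the sketch below only illustrates the shape of such a file. `MODEL` and `DATASET` come from this model card's metadata, while every numeric value and module name is a placeholder assumption.
```python
# Illustrative sketch of src/settings.py -- all numeric values and module
# names below are placeholders, not the repository's actual defaults.
MODEL = "bigcode/starcoderbase-1b"  # base checkpoint (from the model card)
DATASET = "smangrul/hf-stack-v1"    # training dataset (from the model card)
SEQ_LENGTH = 2048                   # fixed length of training chunks (placeholder)

# Training parameters (placeholders)
BATCH_SIZE = 8
LEARNING_RATE = 5e-4
MAX_STEPS = 1000

# LoRA configuration (placeholders)
LORA_R = 8
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = ["c_proj", "c_attn", "q_attn"]

# FIM transformation settings (placeholder rate)
FIM_RATE = 0.5  # fraction of training examples converted to FIM format
```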
## Usage
### Training
To train the model on your dataset:
1. Update the `settings.py` file with your desired configuration
2. Uncomment the `train_model()` line in `main.py`
3. Run the following command:
```bash
cd src
python main.py
```
Snapshot of a training run:
```
TrainOutput(global_step=1000, training_loss=0.7857105331420898, metrics={'train_runtime': 626.5932, 'train_samples_per_second': 12.767, 'train_steps_per_second': 1.596, 'train_tokens_per_second': 25841.328, 'total_flos': 9.961198190592e+16, 'train_loss': 0.7857105331420898, 'epoch': 0.8176614881439084})
```
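Under the hood, `train_model()` presumably attaches LoRA adapters with `peft` and hands the model to a standard `Trainer`. A minimal sketch of that wiring follows, reusing the placeholder values from the configuration sketch above; the repo's actual `runner.py` may differ:
```python
# Sketch of LoRA training wiring with peft + transformers (placeholder values).
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")

# Attach low-rank adapters; only these small matrices receive gradients.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_proj", "c_attn", "q_attn"],  # placeholder module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# Stand-in dataset: in this project the real one comes from the FIM/chunking
# pipeline described under "How It Works".
train_dataset = Dataset.from_dict({"input_ids": [[1] * 32], "labels": [[1] * 32]})

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="peft-starcoder-lora",
        per_device_train_batch_size=8,
        learning_rate=5e-4,
        max_steps=1000,
    ),
    train_dataset=train_dataset,
)
trainer.train()
```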
### Inference
To use the model for code completion:
1. Ensure you have a trained model or use the provided checkpoint
2. Uncomment the `code_completion_demo()` line in `main.py`
3. Run:
```bash
cd src
python main.py
```
### Custom Inference
You can also use the model programmatically:
```python
from model import get_code_completion
from runner import load_model_for_inference

# Load the LoRA-adapted model and its tokenizer
model, tokenizer = load_model_for_inference()

# Complete the body between a function signature (prefix) and its return (suffix)
prefix = "def calculate_total(items):"
suffix = " return total"
completed_code = get_code_completion(model, tokenizer, prefix, suffix)
print(completed_code)
```
## How It Works
1. **Fill-in-the-Middle (FIM)**: The model is trained to predict missing code in the middle of two context pieces (prefix and suffix).
2. **LoRA Fine-tuning**: Instead of fine-tuning all parameters, we use LoRA to efficiently adapt the pre-trained StarCoder model.
3. **Dataset Processing**: The training process formats the dataset into fixed-length chunks with FIM transformations applied.
4. **Constant Length Dataset**: For efficient training, examples are packed into fixed-length sequences (see the sketch below).
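As an illustration of steps 3 and 4, packing examples into constant-length chunks can be done by tokenizing documents into one continuous token stream and slicing it into `SEQ_LENGTH`-sized pieces. This is a generic sketch of the pattern, not the repo's exact implementation:
```python
# Generic constant-length packing -- a sketch of the pattern, not the repo's code.
def constant_length_chunks(texts, tokenizer, seq_length=2048):
    """Tokenize texts, concatenate into one stream, and yield fixed-size chunks."""
    buffer = []
    for text in texts:
        # Separate documents with an EOS token inside the stream.
        buffer.extend(tokenizer(text)["input_ids"] + [tokenizer.eos_token_id])
        while len(buffer) >= seq_length:
            chunk, buffer = buffer[:seq_length], buffer[seq_length:]
            # For causal LM training the labels are the inputs themselves.
            yield {"input_ids": chunk, "labels": list(chunk)}
```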
## Performance
The model's performance depends on:
- The quality and size of the training dataset
- The hyperparameters used (especially LoRA rank and learning rate)
- The number of training steps
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
The code in this project is licensed under the MIT License; see the LICENSE file for details. The StarCoder base model itself is distributed under the BigCode OpenRAIL-M license.
## Acknowledgments
- [BigCode Project](https://www.bigcode-project.org/) for the StarCoder base model
- [Hugging Face](https://huggingface.co/) for their excellent Transformers library
- [PEFT](https://github.com/huggingface/peft) for the efficient fine-tuning implementation