---
library_name: peft
license: bigcode-openrail-m
base_model: bigcode/starcoderbase-1b
tags:
  - base_model:adapter:bigcode/starcoderbase-1b
  - lora
  - transformers
pipeline_tag: text-generation
model-index:
  - name: peft-starcoder-lora-apple
    results: []
datasets:
  - smangrul/hf-stack-v1
language:
  - en
---

Code Completion with StarCoder

A Python implementation of an AI-powered code completion system built on the StarCoder base model with LoRA fine-tuning. This project provides lightweight yet powerful code completion that can be fine-tuned on custom datasets.

Features

  • Fill-in-the-Middle (FIM) Capability: Handles both prefix-suffix code completion and middle-context completion
  • LoRA Fine-tuning: Efficient parameter-efficient fine-tuning using Low-Rank Adaptation
  • Modular Architecture: Clean separation between settings, model components, and training logic
  • Customizable Training: Easily adjust hyperparameters through the settings file
  • Apple Silicon Support: Optimized for running on Apple MPS devices. Note: Hugging Face does not currently support the MLX backend out of the box, so training on Apple Silicon falls back to MPS and is slow compared to a CUDA device (a device-selection sketch follows this list).
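
For reference, a minimal device-selection sketch (torch.cuda.is_available() and torch.backends.mps.is_available() are the standard PyTorch checks; select_device is an illustrative helper, not part of this repo):

import torch

# Prefer CUDA on NVIDIA GPUs, then MPS on Apple Silicon, then CPU.
def select_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = select_device()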

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers 4.30+
  • PEFT (Parameter-Efficient Fine-Tuning)
  • Datasets
  • Accelerate
  • BitsAndBytes (for quantization)
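
For convenience, the list above maps to roughly this requirements.txt (the pins reflect the minimum versions stated here; the repo's own pins may differ):

torch>=2.0
transformers>=4.30
peft
datasets
accelerate
bitsandbytes
tqdm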

🛠 Installation

# Clone the repository
git clone https://github.com/deep-learner-ConfigurableAI/code-completion.git
cd code-completion

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install torch transformers datasets accelerate bitsandbytes peft tqdm

Project Structure

code-completion/
├── LICENSE
├── README.md
├── src/
│   ├── main.py           # Entry point for the application
│   ├── settings.py       # Configuration settings
│   ├── model.py          # Core model implementation
│   └── runner.py         # Training and inference logic

Configuration

All model and training configuration is centralized in src/settings.py; a sketch of the file follows the list below. Key parameters include:

  • Model checkpoint (MODEL)
  • Training dataset (DATASET)
  • Sequence length (SEQ_LENGTH)
  • Training parameters (batch size, learning rate, etc.)
  • LoRA configuration (rank, alpha, target modules)
  • FIM transformation settings
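
A minimal sketch of what src/settings.py might look like (the parameter names follow the list above; the concrete values are illustrative defaults, not the repo's actual configuration):

# src/settings.py -- illustrative values only
MODEL = "bigcode/starcoderbase-1b"   # base checkpoint
DATASET = "smangrul/hf-stack-v1"     # training dataset
SEQ_LENGTH = 2048                    # tokens per training example

# Training parameters
BATCH_SIZE = 8
LEARNING_RATE = 5e-4
MAX_STEPS = 1000

# LoRA configuration
LORA_R = 8                           # rank of the low-rank update
LORA_ALPHA = 32
LORA_TARGET_MODULES = ["c_proj", "c_attn", "q_attn"]

# FIM transformation settings
FIM_RATE = 0.5                       # fraction of examples given the FIM transform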

Usage

Training

To train the model on your dataset:

  1. Update the settings.py file with your desired configuration
  2. Uncomment the train_model() line in main.py
  3. Run the following command:
cd src
python main.py

Snapshot of a training run:

TrainOutput(global_step=1000, training_loss=0.7857105331420898, metrics={'train_runtime': 626.5932, 'train_samples_per_second': 12.767, 'train_steps_per_second': 1.596, 'train_tokens_per_second': 25841.328, 'total_flos': 9.961198190592e+16, 'train_loss': 0.7857105331420898, 'epoch': 0.8176614881439084})
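
For orientation, train_model() presumably wires these pieces together along the following lines. This is a hedged sketch built on the standard transformers and peft APIs; everything other than the library calls (the settings names, train_dataset, the output directory) is an assumption:

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
import settings

# Wrap the base model with LoRA adapters so only the low-rank matrices train.
base = AutoModelForCausalLM.from_pretrained(settings.MODEL)
lora_config = LoraConfig(
    r=settings.LORA_R,
    lora_alpha=settings.LORA_ALPHA,
    target_modules=settings.LORA_TARGET_MODULES,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# train_dataset: the FIM-transformed, constant-length dataset
# described under "How It Works" (construction omitted here).
args = TrainingArguments(
    output_dir="peft-starcoder-lora-apple",
    max_steps=settings.MAX_STEPS,
    per_device_train_batch_size=settings.BATCH_SIZE,
    learning_rate=settings.LEARNING_RATE,
)
Trainer(model=model, args=args, train_dataset=train_dataset).train()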

Inference

To use the model for code completion:

  1. Ensure you have a trained model or use the provided checkpoint
  2. Uncomment the code_completion_demo() line in main.py
  3. Run:
cd src
python main.py

Custom Inference

You can also use the model programmatically:

from model import get_code_completion
from runner import load_model_for_inference

# Load the fine-tuned model and its tokenizer
model, tokenizer = load_model_for_inference()

# Complete the body of a function given its first and last lines
prefix = "def calculate_total(items):"
suffix = "    return total"
completed_code = get_code_completion(model, tokenizer, prefix, suffix)
print(completed_code)

How It Works

  1. Fill-in-the-Middle (FIM): The model is trained to predict code that is missing between two context pieces, a prefix and a suffix (a token-level sketch follows this list).

  2. LoRA Fine-tuning: Instead of fine-tuning all parameters, we use LoRA to efficiently adapt the pre-trained StarCoder model.

  3. Dataset Processing: The training process formats the dataset into fixed-length chunks with FIM transformations applied.

  4. Constant Length Dataset: For efficient training, we process examples into a constant length format.
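
Concretely, StarCoder's tokenizer includes dedicated FIM special tokens, and a completion prompt is assembled roughly as follows. The token strings are StarCoder's actual vocabulary entries; build_fim_prompt is an illustrative helper, not this repo's API:

# StarCoder's FIM special tokens (part of its tokenizer vocabulary)
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    # The model generates the "middle" that fits between prefix and suffix,
    # emitted after the <fim_middle> marker.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prompt = build_fim_prompt("def calculate_total(items):", "    return total")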

Performance

The model's performance depends on:

  • The quality and size of the training dataset
  • The hyperparameters used (especially LoRA rank and learning rate)
  • The number of training steps

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project's code is licensed under the MIT License - see the LICENSE file for details. Note that the underlying StarCoder base model is released under the BigCode OpenRAIL-M license, as reflected in the model card metadata.

Acknowledgments