---
library_name: peft
license: bigcode-openrail-m
base_model: bigcode/starcoderbase-1b
tags:
  - base_model:adapter:bigcode/starcoderbase-1b
  - lora
  - transformers
pipeline_tag: text-generation
model-index:
  - name: peft-starcoder-lora-apple
    results: []
datasets:
  - smangrul/hf-stack-v1
language:
  - en
---

Code Completion with StarCoder

A Python implementation of an AI-powered code completion system built on the StarCoder base model with LoRA fine-tuning. This project provides lightweight yet powerful code completion that can be fine-tuned on custom datasets.

Features

  • Fill-in-the-Middle (FIM) Capability: Handles both prefix-suffix code completion and middle-context completion
  • LoRA Fine-tuning: Efficient parameter-efficient fine-tuning using Low-Rank Adaptation
  • Modular Architecture: Clean separation between settings, model components, and training logic
  • Customizable Training: Easily adjust hyperparameters through the settings file
  • Apple Silicon Support: Optimized for running on Apple MPS devices. Note: Hugging Face does not currently support the MLX backend out of the box, so training on Apple Silicon falls back to MPS and is slow compared to a CUDA device (a device-selection sketch follows this list).
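
For reference, a minimal device-selection sketch (torch.cuda.is_available() and torch.backends.mps.is_available() are the standard PyTorch checks; select_device is an illustrative helper, not part of this repo):

import torch

# Prefer CUDA on NVIDIA GPUs, then MPS on Apple Silicon, then CPU.
def select_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = select_device()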

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers 4.30+
  • PEFT (Parameter-Efficient Fine-Tuning)
  • Datasets
  • Accelerate
  • BitsAndBytes (for quantization)
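
For convenience, the list above maps to roughly this requirements.txt (the pins reflect the minimum versions stated here; the repo's own pins may differ):

torch>=2.0
transformers>=4.30
peft
datasets
accelerate
bitsandbytes
tqdm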

🛠 Installation

# Clone the repository
git clone https://github.com/deep-learner-ConfigurableAI/code-completion.git
cd code-completion

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install torch transformers datasets accelerate bitsandbytes peft tqdm

Project Structure

code-completion/
├── LICENSE
├── README.md
├── src/
│   ├── main.py           # Entry point for the application
│   ├── settings.py       # Configuration settings
│   ├── model.py          # Core model implementation
│   └── runner.py         # Training and inference logic

Configuration

All model and training configuration is centralized in src/settings.py; a sketch of the file follows the list below. Key parameters include:

  • Model checkpoint (MODEL)
  • Training dataset (DATASET)
  • Sequence length (SEQ_LENGTH)
  • Training parameters (batch size, learning rate, etc.)
  • LoRA configuration (rank, alpha, target modules)
  • FIM transformation settings
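
A minimal sketch of what src/settings.py might look like (the parameter names follow the list above; the concrete values are illustrative defaults, not the repo's actual configuration):

# src/settings.py -- illustrative values only
MODEL = "bigcode/starcoderbase-1b"   # base checkpoint
DATASET = "smangrul/hf-stack-v1"     # training dataset
SEQ_LENGTH = 2048                    # tokens per training example

# Training parameters
BATCH_SIZE = 8
LEARNING_RATE = 5e-4
MAX_STEPS = 1000

# LoRA configuration
LORA_R = 8                           # rank of the low-rank update
LORA_ALPHA = 32
LORA_TARGET_MODULES = ["c_proj", "c_attn", "q_attn"]

# FIM transformation settings
FIM_RATE = 0.5                       # fraction of examples given the FIM transform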

Usage

Training

To train the model on your dataset:

  1. Update the settings.py file with your desired configuration
  2. Uncomment the train_model() line in main.py
  3. Run the following command:
cd src
python main.py

Snapshot of a training run:

TrainOutput(global_step=1000, training_loss=0.7857105331420898, metrics={'train_runtime': 626.5932, 'train_samples_per_second': 12.767, 'train_steps_per_second': 1.596, 'train_tokens_per_second': 25841.328, 'total_flos': 9.961198190592e+16, 'train_loss': 0.7857105331420898, 'epoch': 0.8176614881439084})
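
For orientation, train_model() presumably wires these pieces together along the following lines. This is a hedged sketch built on the standard transformers and peft APIs; everything other than the library calls (the settings names, train_dataset, the output directory) is an assumption:

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model
import settings

# Wrap the base model with LoRA adapters so only the low-rank matrices train.
base = AutoModelForCausalLM.from_pretrained(settings.MODEL)
lora_config = LoraConfig(
    r=settings.LORA_R,
    lora_alpha=settings.LORA_ALPHA,
    target_modules=settings.LORA_TARGET_MODULES,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# train_dataset: the FIM-transformed, constant-length dataset
# described under "How It Works" (construction omitted here).
args = TrainingArguments(
    output_dir="peft-starcoder-lora-apple",
    max_steps=settings.MAX_STEPS,
    per_device_train_batch_size=settings.BATCH_SIZE,
    learning_rate=settings.LEARNING_RATE,
)
Trainer(model=model, args=args, train_dataset=train_dataset).train()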

Inference

To use the model for code completion:

  1. Ensure you have a trained model or use the provided checkpoint
  2. Uncomment the code_completion_demo() line in main.py
  3. Run:
cd src
python main.py

Custom Inference

You can also use the model programmatically:

from model import get_code_completion
from runner import load_model_for_inference

# Load the fine-tuned model and its tokenizer
model, tokenizer = load_model_for_inference()

# Complete the body of a function given its first and last lines
prefix = "def calculate_total(items):"
suffix = "    return total"
completed_code = get_code_completion(model, tokenizer, prefix, suffix)
print(completed_code)

How It Works

  1. Fill-in-the-Middle (FIM): The model is trained to predict code that is missing between two context pieces, a prefix and a suffix (a token-level sketch follows this list).

  2. LoRA Fine-tuning: Instead of fine-tuning all parameters, we use LoRA to efficiently adapt the pre-trained StarCoder model.

  3. Dataset Processing: The training process formats the dataset into fixed-length chunks with FIM transformations applied.

  4. Constant Length Dataset: For efficient training, we process examples into a constant length format.
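
Concretely, StarCoder's tokenizer includes dedicated FIM special tokens, and a completion prompt is assembled roughly as follows. The token strings are StarCoder's actual vocabulary entries; build_fim_prompt is an illustrative helper, not this repo's API:

# StarCoder's FIM special tokens (part of its tokenizer vocabulary)
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    # The model generates the "middle" that fits between prefix and suffix,
    # emitted after the <fim_middle> marker.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prompt = build_fim_prompt("def calculate_total(items):", "    return total")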

Performance

The model's performance depends on:

  • The quality and size of the training dataset
  • The hyperparameters used (especially LoRA rank and learning rate)
  • The number of training steps

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project's code is licensed under the MIT License - see the LICENSE file for details. Note that the underlying StarCoder base model is released under the BigCode OpenRAIL-M license, as reflected in the model card metadata.

Acknowledgments