|
|
--- |
|
|
license: llama3.1 |
|
|
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct |
|
|
tags: |
|
|
- code |
|
|
- coding |
|
|
- llama |
|
|
- llama-3.1 |
|
|
- fine-tuned |
|
|
- python |
|
|
- java |
|
|
- javascript |
|
|
- sql |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
model-index: |
|
|
- name: llama-3.1-pro-coder-v1 |
|
|
results: |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Code Generation |
|
|
dataset: |
|
|
name: HumanEval |
|
|
type: openai/humaneval |
|
|
metrics: |
|
|
- type: pass@1 |
|
|
value: 68.3 |
|
|
name: pass@1 |
|
|
--- |
|
|
|
|
|
# Llama 3.1 Pro Coder v1 |
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://img.shields.io/badge/Base-Llama%203.1%208B-blue" alt="Base Model"> |
|
|
<img src="https://img.shields.io/badge/HumanEval-68.3%25-green" alt="HumanEval Score"> |
|
|
<img src="https://img.shields.io/badge/License-Llama%203.1-orange" alt="License"> |
|
|
<img src="https://img.shields.io/badge/Fine--tuned-LoRA-purple" alt="Fine-tuning Method"> |
|
|
</p> |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**Llama 3.1 Pro Coder v1** is a fine-tuned version of Meta's Llama 3.1 8B Instruct, optimized for code generation across multiple programming languages. This model achieves **68.3% on HumanEval**, outperforming the base Llama 3.1 8B Instruct model (65.2% under the same evaluation setup) by 3.1 percentage points.
|
|
|
|
|
### Key Highlights |
|
|
|
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| **Base Model** | meta-llama/Meta-Llama-3.1-8B-Instruct | |
|
|
| **Parameters** | 8 Billion | |
|
|
| **HumanEval (pass@1)** | **68.3%** | |
|
|
| **Training Method** | QLoRA (4-bit) | |
|
|
| **Training Samples** | 112,000+ | |
|
|
| **Best Checkpoint** | 1500 steps | |
|
|
|
|
|
## Performance Comparison |
|
|
|
|
|
### HumanEval Benchmark (Our Evaluation Setup) |
|
|
|
|
|
| Model | HumanEval (pass@1) | Comparison | |
|
|
|-------|-------------------|------------| |
|
|
| Llama 3.1 8B Instruct (base) | 65.2% | Baseline | |
|
|
| **Llama 3.1 Pro Coder v1** | **68.3%** | **+3.1 pts** ✅ |
|
|
| GPT-3.5 Turbo | ~48% | ~+20 pts |
|
|
| CodeLlama 7B | ~33% | ~+35 pts |
|
|
|
|
|
### Checkpoint Analysis |
|
|
|
|
|
| Checkpoint | HumanEval | Eval Loss | Train-Eval Gap | |
|
|
|------------|-----------|-----------|----------------| |
|
|
| 500 | 63.4% | 0.964 | -0.01 | |
|
|
| 1000 | 67.1% | 0.939 | +0.01 | |
|
|
| **1500** | **68.3%** | **0.921** | **0.00** ✅ |
|
|
| 2000 | 64.6% | 0.920 | +0.12 ⚠️ |
|
|
|
|
|
> **Note:** Checkpoint-1500 was selected as optimal. Checkpoint-2000 showed early signs of overfitting: eval loss barely improved (0.921 → 0.920) while the train-eval gap jumped to +0.12 and HumanEval fell by 3.7 points.
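The selection rule itself is simple. Here is a minimal sketch using the numbers from the table above; the 0.1 gap cutoff is an assumed heuristic for illustration, not a published setting of this run:

```python
# Pick the checkpoint with the best HumanEval score among those whose
# train-eval gap does not suggest overfitting. The 0.1 cutoff is an
# assumption for illustration, not a setting from the actual run.
checkpoints = [
    {"step": 500,  "humaneval": 63.4, "eval_loss": 0.964, "gap": -0.01},
    {"step": 1000, "humaneval": 67.1, "eval_loss": 0.939, "gap": 0.01},
    {"step": 1500, "humaneval": 68.3, "eval_loss": 0.921, "gap": 0.00},
    {"step": 2000, "humaneval": 64.6, "eval_loss": 0.920, "gap": 0.12},
]

MAX_GAP = 0.1  # assumed overfitting threshold
healthy = [c for c in checkpoints if c["gap"] <= MAX_GAP]
best = max(healthy, key=lambda c: c["humaneval"])
print(best["step"])  # -> 1500
```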
|
|
|
|
|
### Important Note on Benchmark Scores |
|
|
|
|
|
Meta reports **72.6%** on HumanEval for Llama 3.1 8B Instruct. However, independent evaluations (including [Modal's study](https://modal.com/blog/llama-human-eval)) consistently show **65-66%** with standard evaluation setups. Our methodology aligns with these independent findings; the gap is attributed to Meta's internal evaluation setup, which hasn't been fully disclosed.
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset Composition |
|
|
|
|
|
| Source | Samples | License | Description | |
|
|
|--------|---------|---------|-------------| |
|
|
| CodeForces Problems | ~20,000 | Apache 2.0 | Competitive programming | |
|
|
| OpenAssistant (filtered) | ~30,000 | Apache 2.0 | Technical Q&A | |
|
|
| MBPP Variations | ~10,000 | CC-BY-4.0 | Python problems | |
|
|
| Magicoder Synthetic | ~40,000 | Apache 2.0 | High-quality code generation | |
|
|
| Custom Augmentations | ~12,000 | MIT | Edge cases & patterns | |
|
|
| **Total** | **~112,000** | **Commercial-safe** | |
|
|
|
|
|
All datasets were carefully selected for **commercial-safe licensing** (Apache 2.0, MIT, CC-BY-4.0). No ShareAlike (SA) or NonCommercial (NC) datasets were used. |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
```yaml |
|
|
# LoRA Configuration |
|
|
lora_r: 128 |
|
|
lora_alpha: 256 |
|
|
lora_dropout: 0.05 |
|
|
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] |
|
|
|
|
|
# Training Parameters |
|
|
learning_rate: 1e-4 |
|
|
batch_size: 4 |
|
|
gradient_accumulation_steps: 16 |
|
|
effective_batch_size: 64 |
|
|
max_seq_length: 8192 |
|
|
warmup_ratio: 0.03 |
|
|
lr_scheduler: cosine |
|
|
optimizer: paged_adamw_8bit |
|
|
precision: bf16 |
|
|
|
|
|
# Training Duration |
|
|
max_steps: 2000 |
|
|
best_checkpoint: 1500 |
|
|
training_time: ~15 hours (A100 80GB) |
|
|
``` |
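For reference, the LoRA block above maps onto the `peft` library roughly as follows. This is a minimal sketch of the QLoRA setup under the listed hyperparameters, not the exact training script (data loading and the trainer loop are omitted):

```python
# Sketch of the QLoRA setup above using peft + bitsandbytes.
# Mirrors the listed hyperparameters; not the exact training script.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches precision: bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # LoRA adapters only; base stays frozen
```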
|
|
|
|
|
### Hardware |
|
|
|
|
|
- **GPU:** NVIDIA A100 80GB (Google Colab) |
|
|
- **Training Time:** ~15 hours for 2000 steps |
|
|
- **Inference:** Runs on RTX 3070 8GB (4-bit quantized) |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers accelerate bitsandbytes |
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
import torch |
|
|
|
|
|
model_id = "hemanthkari/llama-3.1-pro-coder-v1" |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
torch_dtype=torch.bfloat16, |
|
|
device_map="auto" |
|
|
) |
|
|
|
|
|
messages = [ |
|
|
{"role": "user", "content": "Write a Python function to find the longest palindromic substring."} |
|
|
] |
|
|
|
|
|
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True) |
|
|
inputs = inputs.to(model.device) |
|
|
|
|
|
outputs = model.generate( |
|
|
inputs, |
|
|
max_new_tokens=512, |
|
|
temperature=0.1, |
|
|
do_sample=True, |
|
|
pad_token_id=tokenizer.eos_token_id |
|
|
) |
|
|
|
|
|
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True) |
|
|
print(response) |
|
|
``` |
|
|
|
|
|
### 4-bit Quantized (For Consumer GPUs) |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig |
|
|
import torch |
|
|
|
|
|
quantization_config = BitsAndBytesConfig( |
|
|
load_in_4bit=True, |
|
|
bnb_4bit_compute_dtype=torch.bfloat16, |
|
|
bnb_4bit_use_double_quant=True, |
|
|
bnb_4bit_quant_type="nf4" |
|
|
) |
|
|
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
"hemanthkari/llama-3.1-pro-coder-v1", |
|
|
quantization_config=quantization_config, |
|
|
device_map="auto" |
|
|
) |
|
|
# VRAM Usage: ~5GB (fits RTX 3060/3070/3080) |
|
|
``` |
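Generation then works exactly as in the basic example. As an optional convenience, transformers' `TextStreamer` prints tokens as they arrive, which is pleasant on slower consumer GPUs. A sketch, reusing the quantized `model` loaded above:

```python
from transformers import AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("hemanthkari/llama-3.1-pro-coder-v1")
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

messages = [{"role": "user", "content": "Write a SQL query to find duplicate email addresses."}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

# Tokens are printed to stdout as they are generated.
model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.1,
    do_sample=True,
    streamer=streamer,
    pad_token_id=tokenizer.eos_token_id,
)
```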
|
|
|
|
|
## Strengths & Limitations |
|
|
|
|
|
### ✅ Strengths
|
|
|
|
|
- **Consistent Code Style:** Trained on curated, high-quality code samples |
|
|
- **Multi-Language Support:** Python, Java, JavaScript, SQL, and more |
|
|
- **Edge Case Handling:** Special focus on empty lists, None returns, error handling |
|
|
- **Commercial Safe:** All training data uses permissive licenses (Apache 2.0, MIT, CC-BY-4.0) |
|
|
- **Efficient:** competitive coding performance from just 8B parameters
|
|
- **Local Deployment:** Runs on consumer GPUs (RTX 3060+) |
|
|
|
|
|
### ⚠️ Limitations
|
|
|
|
|
- **Architecture Planning:** For complex multi-service systems, larger models (70B+) perform better |
|
|
- **Obscure Libraries:** May hallucinate on very niche/new libraries not in training data |
|
|
- **Long Context:** While it supports 8K-token sequences, performance may degrade on very long files
|
|
- **Reasoning Chains:** Deep multi-step reasoning still favors larger models |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
- ✅ Code completion and generation
|
|
- ✅ Function implementation from docstrings
|
|
- ✅ Bug fixing and code review
|
|
- ✅ Code explanation and documentation
|
|
- ✅ Algorithm implementation
|
|
- ✅ Unit test generation
|
|
|
|
|
### Out of Scope |
|
|
|
|
|
- ❌ System architecture design (use 70B+ models)
|
|
- ❌ Security auditing (use specialized tools)
|
|
- ❌ Production deployment without human review
|
|
|
|
|
## Evaluation Details |
|
|
|
|
|
### HumanEval Methodology |
|
|
|
|
|
```python |
|
|
# Evaluation prompt template |
|
|
messages = [ |
|
|
{"role": "user", "content": f"""Complete the following Python function. |
|
|
Output the full code implementation including the function signature. |
|
|
|
|
|
{humaneval_prompt}"""} |
|
|
] |
|
|
|
|
|
# Generation parameters |
|
|
temperature = 0.0 |
|
|
max_new_tokens = 512 |
|
|
do_sample = False |
|
|
``` |
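As a rough sketch, the full pass@1 harness can be assembled with OpenAI's `human-eval` package (`pip install human-eval`). `generate_completion` below is a hypothetical wrapper around the generation code shown earlier, not part of this repository:

```python
from human_eval.data import read_problems, write_jsonl

problems = read_problems()  # the 164 HumanEval tasks

samples = []
for task_id, problem in problems.items():
    prompt = (
        "Complete the following Python function.\n"
        "Output the full code implementation including the function signature.\n\n"
        + problem["prompt"]
    )
    # generate_completion: hypothetical wrapper around model.generate
    # (greedy decoding, max_new_tokens=512, as listed above).
    samples.append({"task_id": task_id, "completion": generate_completion(prompt)})

write_jsonl("samples.jsonl", samples)
# Score afterwards with the package's CLI:
#   evaluate_functional_correctness samples.jsonl
```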
|
|
|
|
|
### Sample Outputs |
|
|
|
|
|
**HumanEval/0 - has_close_elements** ✅ Passed
|
|
```python |
|
|
from typing import List  # the HumanEval prompt supplies this import

def has_close_elements(numbers: List[float], threshold: float) -> bool:
|
|
for i in range(len(numbers)): |
|
|
for j in range(i + 1, len(numbers)): |
|
|
if abs(numbers[i] - numbers[j]) < threshold: |
|
|
return True |
|
|
return False |
|
|
``` |
|
|
|
|
|
**HumanEval/4 - mean_absolute_deviation** ✅ Passed
|
|
```python |
|
|
from typing import List  # the HumanEval prompt supplies this import

def mean_absolute_deviation(numbers: List[float]) -> float:
|
|
mean = sum(numbers) / len(numbers) |
|
|
return sum(abs(x - mean) for x in numbers) / len(numbers) |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the [Llama 3.1 Community License](https://llama.meta.com/llama3_1/license/). |
|
|
|
|
|
### Key Terms: |
|
|
- ✅ Commercial use allowed (under 700M monthly active users)
|
|
- ✅ Modification and fine-tuning allowed
|
|
- ✅ Distribution allowed with attribution
|
|
- ⚠️ Must include "Built with Llama" attribution
|
|
- ⚠️ Models trained on its outputs must include "Llama" at the start of their name
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{llama-3.1-pro-coder-v1, |
|
|
author = {Hemanth Kari}, |
|
|
title = {Llama 3.1 Pro Coder v1: Fine-tuned Llama 3.1 8B for Code Generation}, |
|
|
year = {2025}, |
|
|
publisher = {HuggingFace}, |
|
|
url = {https://huggingface.co/hemanthkari/llama-3.1-pro-coder-v1} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **Meta AI** for releasing Llama 3.1 under a permissive license |
|
|
- **Hugging Face** for the transformers library and model hosting |
|
|
- **The open-source community** for high-quality training datasets |
|
|
|
|
|
--- |
|
|
|
|
|
<p align="center"> |
|
|
<b>Built with Llama</b> |
|
|
</p> |
|
|
|