deepseek-coder-6.7b-code-gen-finetuned

A supervised fine-tuned (SFT) version of deepseek-ai/deepseek-coder-6.7b-instruct trained with QLoRA on a curated blend of high-quality code instruction datasets. The model is optimised for Python code generation — given a natural language instruction, it produces clean, correct, executable code.

Kaggle notebook: code-refining


Model description

This model improves upon the already capable deepseek-coder-6.7b-instruct base by fine-tuning on 10,000 carefully filtered instruction-output pairs drawn from three complementary code datasets. Training used the SFT (supervised fine-tuning) stage with the deepseekcoder chat template, making it a drop-in replacement for the base instruct model with improved instruction-following on coding tasks.

Performance was tracked using the HumanEval benchmark (Pass@1) — the proportion of 164 programming problems where the model's first generated solution passes all hidden test cases.


Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "AbdoSaad24/deepseek-coder-6.7b-code-gen-finetuned"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

def generate_code(instruction: str, max_new_tokens: int = 512) -> str:
    """Generate Python code from a natural language instruction."""
    messages = [
        {
            "role": "system",
            "content": (
                "You are a Python coding assistant. "
                "Complete the given function. "
                "Return ONLY the complete function code with no explanation, "
                "no markdown, no extra text."
            )
        },
        {
            "role": "user",
            "content": f"Complete this Python function:\n\n{instruction}"
        }
    ]

    formatted = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for reproducibility
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    generated = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

Example: function completion

prompt = """
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    \"\"\" Check if in given list of numbers, are any two numbers closer to each
    other than given threshold.
    \"\"\"
"""

print(generate_code(prompt))

Example: instruction-driven generation

instruction = "Write a Python function that checks whether a string is a palindrome, ignoring case and spaces."
print(generate_code(instruction))

Evaluation

The model was evaluated on the HumanEval benchmark (164 programming problems), which tests functional correctness by executing generated code against hidden test cases.

Setting Value
Benchmark HumanEval
Evaluation strategy Pass@1 (greedy decoding, do_sample=False)
Problems evaluated 20-problem subset (during training run)

The full 164-problem Pass@1 evaluation is set up in the notebook; this card will be updated with the final score once the complete run has been executed.
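
As a rough sketch (not the notebook's exact code), the full 164-problem run could be wired up with OpenAI's human-eval reference harness, reusing the generate_code helper from the Usage section; the samples.jsonl filename is arbitrary:

from human_eval.data import read_problems, write_jsonl  # pip install human-eval

problems = read_problems()  # dict: task_id -> {"prompt", "entry_point", "test", ...}

samples = []
for task_id, problem in problems.items():
    # One greedy completion per task is enough for Pass@1.
    # generate_code returns the whole function, which simply redefines the
    # prompt's stub when the harness concatenates prompt + completion.
    samples.append({"task_id": task_id, "completion": generate_code(problem["prompt"])})

write_jsonl("samples.jsonl", samples)
# Score with the package's CLI: evaluate_functional_correctness samples.jsonl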


Training details

Base model

deepseek-ai/deepseek-coder-6.7b-instruct — the instruction-tuned variant of DeepSeek-Coder, chosen for its strong Python baseline and native support for the deepseekcoder chat template.

Dataset

Three code instruction datasets were combined, filtered, shuffled, and capped at 10,000 examples:

Dataset Description
m-a-p/CodeFeedback-Filtered-Instruction High-quality code instruction-response pairs with feedback filtering
nickrosh/Evol-Instruct-Code-80k-v1 80k evolved coding instructions (WizardCoder-style)
sahil2801/CodeAlpaca-20k 20k code instruction-output pairs in Alpaca format

All datasets were mapped to a unified Alpaca format (instruction, input, output) and filtered to remove examples with outputs shorter than 50 characters. The combined pool was shuffled with seed=42, capped at 10,000 examples, and split 99/1 into train (9,900) and validation (100).
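
A minimal sketch of that preparation with the Hugging Face datasets library is shown below; the per-source column names (e.g. query/answer for CodeFeedback) are assumptions about the source schemas, not code taken from the notebook:

from datasets import load_dataset, concatenate_datasets

# (dataset name, instruction column, output column) — column names are assumed
sources = [
    ("m-a-p/CodeFeedback-Filtered-Instruction", "query", "answer"),
    ("nickrosh/Evol-Instruct-Code-80k-v1", "instruction", "output"),
    ("sahil2801/CodeAlpaca-20k", "instruction", "output"),
]

parts = []
for name, instr_col, out_col in sources:
    ds = load_dataset(name, split="train")
    has_input = "input" in ds.column_names
    # Map every source into the unified Alpaca schema (instruction, input, output)
    ds = ds.map(
        lambda ex: {
            "instruction": ex[instr_col],
            "input": ex["input"] if has_input else "",
            "output": ex[out_col],
        },
        remove_columns=ds.column_names,
    )
    parts.append(ds)

combined = concatenate_datasets(parts)
combined = combined.filter(lambda ex: len(ex["output"]) >= 50)  # drop very short outputs
combined = combined.shuffle(seed=42).select(range(10_000))      # cap at 10,000 examples
split = combined.train_test_split(test_size=0.01, seed=42)      # 9,900 train / 100 validation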

Fine-tuning method: QLoRA SFT via LLaMA-Factory

Training used the SFT stage with the deepseekcoder chat template, meaning examples are formatted as instruction-response pairs using DeepSeek-Coder's native conversational format.
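
For illustration only, this is roughly how one unified record looks once rendered through the deepseekcoder template, reusing the tokenizer loaded in the Usage section (apply_chat_template is used here as an approximation of LLaMA-Factory's internal formatting):

example = {
    "instruction": "Write a function that reverses a string.",
    "input": "",
    "output": "def reverse_string(s: str) -> str:\n    return s[::-1]",
}

user_turn = example["instruction"] + ("\n" + example["input"] if example["input"] else "")
messages = [
    {"role": "user", "content": user_turn},
    {"role": "assistant", "content": example["output"]},
]
# During SFT, only the assistant (output) tokens contribute to the loss.
print(tokenizer.apply_chat_template(messages, tokenize=False))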

Hyperparameter Value
Framework LLaMA-Factory 0.9.5
Stage SFT (supervised fine-tuning)
Fine-tuning type LoRA (QLoRA 4-bit NF4)
Chat template deepseekcoder
LoRA rank 32
LoRA alpha 64
LoRA dropout 0.05
LoRA target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Quantization 4-bit NF4 + double quantization
Context length (cutoff_len) 1024 tokens
Batch size per device 1
Gradient accumulation steps 16 (effective batch size = 16)
Learning rate 2e-4
LR scheduler Cosine
Warmup ratio 0.05
Epochs 3
Optimizer AdamW (torch)
Weight decay 0.01
Max grad norm 1.0
Mixed precision FP16
Eval strategy Every 50 steps
Hardware NVIDIA Tesla T4 × 2 (Kaggle)
Experiment tracking Weights & Biases (Generation)

After training, LoRA adapters were merged into the base model weights using LLaMA-Factory's export pipeline (llamafactory-cli export) and pushed as a single standalone model.
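
The merge itself was done with llamafactory-cli export; an equivalent peft-based sketch (the adapter path below is hypothetical) would look like:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "deepseek-ai/deepseek-coder-6.7b-instruct"
ADAPTER = "path/to/sft-lora-checkpoint"  # hypothetical path to the trained LoRA adapter

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, trust_remote_code=True)
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()  # fold LoRA deltas into the base weights

tok = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
merged.save_pretrained("deepseek-coder-6.7b-code-gen-finetuned")
tok.save_pretrained("deepseek-coder-6.7b-code-gen-finetuned")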


Intended use

This model is designed for Python code generation from natural language instructions:

  • Completing partially written functions from their docstrings or signatures
  • Generating utility functions from plain-English descriptions
  • Coding assistants and IDE integrations
  • Educational tools for learning Python patterns
  • Automated code scaffolding in development workflows

Out-of-scope use

  • Languages other than Python (the training data is Python-heavy, so output quality in other languages may be lower)
  • Security-critical code generation without expert review
  • Generating code for harmful or malicious purposes

Limitations

  • Context window is limited to 1024 tokens — very long functions or multi-file contexts may be truncated
  • Training data was capped at 10,000 examples; broader or domain-specific coverage may improve performance on specialised tasks
  • Generated code should always be reviewed and tested before use in production
  • The model may produce plausible-looking but incorrect implementations for complex algorithmic problems
  • Performance on non-Python languages is not guaranteed

Citation

If you use this model, please cite the original DeepSeek-Coder work:

@misc{guo2024deepseekcoderlargelanguagemodel,
  title={DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence},
  author={Guo, Daya and others},
  year={2024},
  eprint={2401.14196},
  archivePrefix={arXiv}
}

Fine-tuned by AbdoSaad24 · Kaggle notebook: code-refining
