---
language: en
license: apache-2.0
tags:
  - code
  - python
  - docstring
  - mistral
  - qlora
  - peft
  - code-generation
base_model: mistralai/Mistral-7B-v0.1
datasets:
  - code_search_net
---

# mistral-7b-docstring

Mistral 7B fine-tuned with QLoRA on Python docstring generation from CodeSearchNet.

On NumPy-style docstring generation, it outperforms Llama 3.3 70B, a model 10x larger, on both ROUGE-L and BERTScore F1.

## Evaluation results

Evaluated on 100 held-out Python functions from CodeSearchNet (never seen during training).
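
For readers unfamiliar with the first metric, ROUGE-L scores a generated docstring by the longest common subsequence (LCS) of tokens it shares with the reference. The sketch below is a simplified version that assumes lowercase whitespace tokenization. This card does not state which ROUGE implementation or tokenizer produced the reported numbers, so scores from this sketch are not directly comparable with the table.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over lowercase whitespace tokens (simplified tokenization)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only; they are not taken from the evaluation set.
reference = "Compute the body mass index from weight and height."
candidate = "Compute body mass index given weight and height."
score = rouge_l_f1(reference, candidate)
print(round(score, 4))
```

Because the metric rewards in-order token overlap, a docstring can be correct and still score low if it is worded differently from the reference, which is one reason BERTScore F1 is reported alongside it.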

| Model | ROUGE-L | BERTScore F1 |
|---|---|---|
| **Mistral 7B fine-tuned (this model)** | **0.2033** | **0.7739** |
| Llama 3.3 70B via Groq | 0.1715 | 0.7594 |
| Mistral 7B base (no fine-tuning) | 0.1102 | 0.7118 |

The fine-tuned 7B model beats Llama 3.3 70B on ROUGE-L (+18.5% relative) and BERTScore F1 (+1.9% relative) while being 10x smaller and running at a fraction of the inference cost.
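
The percentages are relative gains computed from the table, not absolute differences. A small sketch of that arithmetic, using only the reported scores (the dictionary keys are illustrative names):

```python
# Scores copied from the evaluation table above.
scores = {
    "fine_tuned": {"rouge_l": 0.2033, "bertscore_f1": 0.7739},
    "llama_70b": {"rouge_l": 0.1715, "bertscore_f1": 0.7594},
    "base": {"rouge_l": 0.1102, "bertscore_f1": 0.7118},
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


for baseline in ("llama_70b", "base"):
    for metric in ("rouge_l", "bertscore_f1"):
        gain = relative_gain(scores["fine_tuned"][metric], scores[baseline][metric])
        print(f"{metric} vs {baseline}: {gain:+.1f}%")
```

The same calculation against the untuned base model gives the gain attributable to fine-tuning alone.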

## How to use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

BASE_MODEL = "mistralai/Mistral-7B-v0.1"

# Load in 4-bit for efficient inference
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "kk014/mistral-7b-docstring")
model.eval()

# Generate a docstring
function_code = """
def calculate_bmi(weight_kg, height_m):
    return weight_kg / (height_m ** 2)
""".strip()

prompt = (
    "You are a Python documentation expert. "
    "Write a clear, concise NumPy-style docstring for the following Python function.\n\n"
    f"### Function:\n{function_code}\n\n"
    "### Docstring:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

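# Decode only the newly generated tokens; slicing the decoded string by
# len(prompt) can misalign if tokenization does not round-trip exactly.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
docstring = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()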
print(docstring)
```
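
When documenting many functions, it may help to factor the prompt into a reusable helper. The sketch below reproduces the prompt format from the example above. `build_prompt` and `trim_completion` are hypothetical helper names, and cutting the completion at the next `###` header is an assumed post-processing step, not something the model requires.

```python
PROMPT_TEMPLATE = (
    "You are a Python documentation expert. "
    "Write a clear, concise NumPy-style docstring for the following Python function.\n\n"
    "### Function:\n{function_code}\n\n"
    "### Docstring:"
)


def build_prompt(function_code: str) -> str:
    """Format a function's source into the prompt layout shown above."""
    return PROMPT_TEMPLATE.format(function_code=function_code.strip())


def trim_completion(completion: str) -> str:
    """Keep text up to the first new '###' section header, if one appears.

    Assumption: a fresh '###' header means the model has started a new
    example rather than continuing the requested docstring.
    """
    head, _, _ = completion.partition("\n###")
    return head.strip()


prompt = build_prompt("def add(a, b):\n    return a + b\n")
print(prompt)
print(trim_completion(" Add two numbers.\n\n### Function:\ndef sub(a, b): ..."))
```

The prompt string can be passed to the tokenizer exactly as in the example above, and `trim_completion` applied to the decoded output.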

## Training details

| Parameter | Value |
|---|---|
| Base model | mistralai/Mistral-7B-v0.1 |
| Dataset | CodeSearchNet (Python split) |
| Training samples | 8,000 |
| Method | QLoRA (4-bit NF4 quantisation) |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| Epochs | 1 |
| Batch size | 2 (effective 16 with gradient accumulation) |
| Learning rate | 2e-4 |
| Hardware | Kaggle T4 x2 (free tier) |
| Training time | ~4 hours |
| Framework | HuggingFace PEFT + TRL |
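
The batch-size and sample-count rows imply a rough training schedule. The sketch below works through that arithmetic under one assumption: that the batch size of 2 is the total per forward/backward pass. If it is instead a per-GPU value across both T4s, the accumulation factor would be 4 rather than 8; the table does not say which.

```python
import math

# Values taken from the training table above.
train_samples = 8_000
per_step_batch = 2
effective_batch = 16
epochs = 1

# Assumption: per_step_batch is the total batch per forward/backward pass.
grad_accum_steps = effective_batch // per_step_batch

# One optimizer update happens per effective batch.
optimizer_steps = math.ceil(train_samples / effective_batch) * epochs

print(f"gradient accumulation steps: {grad_accum_steps}")
print(f"optimizer steps: {optimizer_steps}")
```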

## Limitations

- Trained specifically on NumPy-style docstrings, so output may be less reliable when Google or Sphinx style is requested
- Best on standalone functions under ~50 lines
- May repeat examples in generated output at very low temperatures
- Evaluated on the CodeSearchNet Python split only; performance on other codebases may vary
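
Given the function-length limitation, one option is to pre-filter inputs before sending them to the model. The sketch below uses the standard-library `ast` module to pick out top-level functions within a line budget. The helper name `short_top_level_functions` is hypothetical, and the hard cutoff of 50 lines is an assumed reading of the approximate figure above.

```python
import ast

MAX_LINES = 50  # assumed cutoff based on the "~50 lines" guideline


def short_top_level_functions(source: str, max_lines: int = MAX_LINES):
    """Yield (name, code) for top-level functions spanning at most max_lines lines.

    Methods and nested functions are skipped because only module-level
    nodes are inspected.
    """
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            n_lines = node.end_lineno - node.lineno + 1
            if n_lines <= max_lines:
                yield node.name, ast.get_source_segment(source, node)


source = '''
def calculate_bmi(weight_kg, height_m):
    return weight_kg / (height_m ** 2)

class Patient:
    def bmi(self):
        return calculate_bmi(self.weight_kg, self.height_m)
'''

for name, code in short_top_level_functions(source):
    print(name)
    print(code)
```

Each yielded `code` string can be used as the function source in the prompt from the usage example.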

## Citation

If you use this model, please cite the original QLoRA paper:

```bibtex
@article{dettmers2023qlora,
  title={QLoRA: Efficient Finetuning of Quantized LLMs},
  author={Dettmers, Tim and others},
  journal={arXiv preprint arXiv:2305.14314},
  year={2023}
}
```