---
license: bsd-3-clause
datasets:
- pedrodev2026/microcoder-dataset-1024-tokens
base_model:
- unsloth/Qwen2.5-Coder-1.5B-Instruct
pipeline_tag: text-generation
tags:
- coder
- code
- microcoder
---
# Microcoder 1.5B

**Microcoder 1.5B** is a code-focused language model fine-tuned from [Qwen 2.5 Coder 1.5B Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) using LoRA (Low-Rank Adaptation) on curated code datasets. It is designed for code generation, completion, and instruction-following tasks in a lightweight, efficient package.

---

## Model Details

| Property         | Value                                      |
|------------------|--------------------------------------------|
| **Base Model**   | Qwen 2.5 Coder 1.5B Instruct               |
| **Fine-tuning**  | LoRA                                       |
| **Parameters**   | ~1.5B                                      |
| **License**      | BSD 3-Clause                               |
| **Language**     | English (primary), multilingual code       |
| **Task**         | Code generation, completion, instruction following |

---

## Benchmarks

| Benchmark          | Metric   | Score        |
|--------------------|----------|--------------|
| HumanEval          | pass@1   | **59.15%**   |
| MBPP+              | pass@1   | **52.91%**   |

> HumanEval and MBPP+ results were obtained using the model in **GGUF format** with **Q5_K_M quantization**. Results may vary slightly with other formats or quantization levels.
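
For context on the metric: pass@1 is the standard unbiased pass@k estimator from the HumanEval evaluation protocol, computed per problem and averaged. A minimal sketch (the sample counts below are illustrative, not the actual evaluation data):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n samples were drawn for a problem and c of them
    passed the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (n = k = 1), pass@1 reduces to the
# fraction of problems whose single sample passes:
scores = [pass_at_k(1, c, 1) for c in (1, 0, 1, 1)]
print(sum(scores) / len(scores))  # 0.75
```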

---

## Usage

> **Important:** You must use `apply_chat_template` when formatting inputs. Passing raw text directly to the tokenizer will produce incorrect results.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "your-org/microcoder-1.5b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that returns the nth Fibonacci number."
    }
]

input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the echoed prompt
generated = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```
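
For reference, `apply_chat_template` on Qwen 2.5 models expands messages into the ChatML convention. A rough sketch of the resulting prompt string (an illustration only, not a replacement for the real template, which may also prepend a default system message):

```python
def chatml_prompt(messages, add_generation_prompt=True):
    """Approximate the ChatML layout used by Qwen chat templates."""
    text = ""
    for m in messages:
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text

print(chatml_prompt([{"role": "user", "content": "Hi"}]))
```

This is why raw text fails: without the `<|im_start|>`/`<|im_end|>` markers the model never sees a well-formed turn boundary.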

---

## Training Details

Microcoder 1.5B was fine-tuned using LoRA on top of Qwen 2.5 Coder 1.5B Instruct. The training focused on code-heavy datasets covering multiple programming languages and problem-solving scenarios, aiming to improve instruction-following and code correctness at a small model scale.
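
As a refresher on what LoRA does (a generic sketch with illustrative dimensions, not the actual training code): instead of updating a full weight matrix `W`, LoRA learns a low-rank update `ΔW = (α/r)·B·A` and keeps `W` frozen, so only the small `A` and `B` matrices are trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 4, 8  # illustrative sizes

W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # trainable, random init
B = np.zeros((d_out, r))                 # trainable, zero init

def lora_forward(x):
    # Base path plus low-rank update, scaled by alpha / r
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter starts as a no-op:
assert np.allclose(lora_forward(x), W @ x)
```

After training, the update can be merged (`W + (α/r)·B·A`) so inference costs nothing extra, which is why LoRA adapters distribute and deploy so cheaply at this model scale.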

---

## Credits

- **Model credits** — see [`MODEL_CREDITS.md`](./MODEL_CREDITS.md)
- **Dataset credits** — see [`DATASET_CREDITS.md`](./DATASET_CREDITS.md)

---

## License

The Microcoder 1.5B model weights and associated code in this repository are released under the **BSD 3-Clause License**. See [`LICENSE`](./LICENSE) for details.

Note that the base model (Qwen 2.5 Coder 1.5B Instruct) and the datasets used for fine-tuning are subject to their own respective licenses, as detailed in the credit files above.

---

## Notice

The documentation files in this repository (including `README.md`, `MODEL_CREDITS.md`, `DATASET_CREDITS.md`, and other `.md` files) were generated with the assistance of an AI language model.