---
language:
  - en
tags:
  - transformers
  - trl
  - grpo
  - peft
  - lora
  - python
  - code-generation
pipeline_tag: text-generation
base_model: summerMC/matutake
library_name: transformers
---


# ume

`ume` is a GRPO fine-tuned derivative of [`summerMC/matutake`](https://huggingface.co/summerMC/matutake), trained with LoRA on Python code-generation tasks and merged back into the base model for standalone inference.

## Model Summary

* **Model name:** `summerMC/ume`
* **Base model:** `summerMC/matutake`
* **Training method:** GRPO (Group Relative Policy Optimization)
* **Parameter-efficient tuning:** LoRA
* **Training dataset:** `Hoglet-33/python-coding-dataset`
* **Final artifact:** merged checkpoint for direct inference

The fine-tuning aims to improve the base model's Python code generation by using lightweight reward functions that favor syntactically valid, code-like outputs.

---

## Training Details

### Base model

* `summerMC/matutake`

### Dataset

* `Hoglet-33/python-coding-dataset`

### Fine-tuning method

* **Trainer:** TRL `GRPOTrainer`
* **Adapter method:** LoRA
* **Final export:** merged LoRA weights into the base model

### Reward functions

Training used simple heuristic reward functions:

#### 1) Syntax reward

Rewards outputs that can be parsed as valid Python:

* `1.0` if `ast.parse(output)` succeeds
* `0.0` otherwise
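
The rule above could be written as a TRL-style reward function roughly as follows. This is a minimal sketch, not the exact training code: the names `completion_text` and `syntax_reward` are hypothetical, and the handling of chat-style completions (a list of message dicts) is an assumption about how the completions were passed in.

```python
import ast


def completion_text(completion):
    """Return plain text from a string or a chat-style list of messages (assumed shape)."""
    if isinstance(completion, str):
        return completion
    return completion[0]["content"]


def syntax_reward(completions, **kwargs):
    """Score each completion 1.0 if it parses as Python, else 0.0."""
    rewards = []
    for completion in completions:
        try:
            ast.parse(completion_text(completion))
            rewards.append(1.0)
        except (SyntaxError, ValueError):
            rewards.append(0.0)
    return rewards
```

Note that `ast.parse("")` succeeds, so an empty completion would also score `1.0` under this rule alone.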

#### 2) Code-shape reward

Rewards outputs that look more like actual Python code:

* no Markdown code fences
* contains Python-like tokens such as `def`, `import`, `return`, `class`
* non-trivially long output
* avoids extremely long generations

These rewards are intentionally lightweight and should be treated as a baseline GRPO setup rather than a production-grade evaluation system.
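
One way the code-shape criteria could be combined is sketched below. The card does not state weights or thresholds, so everything numeric here is an assumption: the equal weighting of the four checks and the `min_chars` / `max_chars` limits are illustrative, and `code_shape_reward` is a hypothetical name.

```python
PYTHON_TOKENS = ("def ", "import ", "return ", "class ")
FENCE = "`" * 3  # Markdown code fence marker


def code_shape_reward(completions, min_chars=20, max_chars=4000, **kwargs):
    """Return the fraction of shape checks each completion passes (assumed weighting)."""
    rewards = []
    for completion in completions:
        # Accept either plain strings or chat-style message lists (assumed shape).
        text = completion if isinstance(completion, str) else completion[0]["content"]
        checks = [
            FENCE not in text,                          # no Markdown code fences
            any(tok in text for tok in PYTHON_TOKENS),  # contains Python-like tokens
            len(text.strip()) >= min_chars,             # non-trivially long (hypothetical limit)
            len(text) <= max_chars,                     # not extremely long (hypothetical limit)
        ]
        rewards.append(sum(checks) / len(checks))
    return rewards
```

Under these assumptions, a short clean function scores `1.0`, and the same function wrapped in a Markdown fence scores `0.75`.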

---

## Prompt Format

The training data was converted into a chat-style coding prompt like this:

```python
[
    {
        "role": "user",
        "content": (
            "Write correct Python code for the following task.\n"
            "Return only Python code. Do not use markdown.\n\n"
            "<task text>"
        ),
    }
]
```

For best results, prompt the model with a direct coding task and explicitly request **code only**.
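
A small helper can reproduce this prompt shape for new tasks. It is a sketch only: `build_prompt` and `to_grpo_example` are hypothetical names, and the dataset column holding the task text is not documented in this card, so it is left as a required argument.

```python
def build_prompt(task_text):
    """Wrap a task description in the chat-style prompt shown above."""
    return [
        {
            "role": "user",
            "content": (
                "Write correct Python code for the following task.\n"
                "Return only Python code. Do not use markdown.\n\n"
                f"{task_text}"
            ),
        }
    ]


def to_grpo_example(row, task_field):
    """Map a dataset row to a dict with a "prompt" column.

    task_field is the name of the column containing the task text
    (an assumption; check the dataset schema before use).
    """
    return {"prompt": build_prompt(row[task_field])}
```

The resulting message list can be passed directly to `tokenizer.apply_chat_template`.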

---

## Usage

### Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/ume"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": (
            "Write a Python function that computes Fibonacci numbers with memoization.\n"
            "Return only Python code. Do not use markdown."
        ),
    }
]

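# return_dict=True yields a mapping with "input_ids" and "attention_mask",
# which is what generate(**inputs) and the slicing below expect.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)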

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)

print(response)
```

---

## Example Prompt

### Input

```text
Write a Python function that returns the longest common prefix of a list of strings.
Return only Python code.
```

### Expected output style

```python
def longest_common_prefix(strs):
    if not strs:
        return ""

    prefix = strs[0]
    for s in strs[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix
```

---

## Training Configuration

The model was trained with a setup similar to the following:

* **LoRA rank (`r`)**: 16
* **LoRA alpha**: 32
* **LoRA dropout**: 0.05
* **Learning rate**: 5e-6
* **Batch size**: 1
* **Gradient accumulation**: 8
* **Generation batch size**: 2
* **Number of generations**: 2
* **Epochs**: 1
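
To illustrate what a group size of 2 means for GRPO, the sketch below normalizes rewards within one group of completions for the same prompt. It is a simplified illustration, not TRL's implementation: the function name is hypothetical, the population standard deviation and the `eps` value are assumptions, and real implementations may differ in both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two completions with different rewards: one is pushed up, the other down.
print(group_relative_advantages([1.0, 0.0]))

# Two completions with equal rewards: both advantages are zero.
print(group_relative_advantages([1.0, 1.0]))
```

With only two generations per prompt, a group contributes a learning signal only when the two completions receive different rewards.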

### LoRA target modules

```python
[
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]
```
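
Collected in one place, the LoRA settings above might look like the following. This is a sketch under assumptions: the `task_type` value is not stated in this card, and the training script may have set additional options.

```python
# Values copied from the configuration listed in this card.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in this card
}

# Standard LoRA scales the adapter update by lora_alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 2.0

# With peft installed, the same values can be passed on directly:
# from peft import LoraConfig
# lora_config = LoraConfig(**lora_kwargs)
```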

---

## Limitations

* Training rewards are heuristic and do **not** verify functional correctness with unit tests.
* The model may still produce syntactically valid but logically incorrect code.
* Outputs may include hallucinated APIs, inefficient solutions, or incomplete implementations.
* Performance depends heavily on the capabilities and constraints of the base model `summerMC/matutake`.
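
The gap between syntactic validity and functional correctness can be shown with a small check. This is an illustrative sketch, not part of the training pipeline; `passes_syntax` and `passes_tests` are hypothetical helpers, and `exec` should only be run on trusted or sandboxed code.

```python
import ast


def passes_syntax(code):
    """True if the code parses as Python (the signal the syntax reward sees)."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def passes_tests(code, func_name, cases):
    """True if the named function returns the expected value for every case."""
    namespace = {}
    try:
        exec(code, namespace)  # WARNING: run untrusted code only in a sandbox
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


buggy = "def add(a, b):\n    return a - b\n"  # parses, but subtracts
cases = [((1, 2), 3), ((0, 0), 0)]

print(passes_syntax(buggy))               # True
print(passes_tests(buggy, "add", cases))  # False
```

A check of this kind, run outside the model, is the sort of external verification that the heuristic rewards do not provide.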

---

## Intended Use

`summerMC/ume` is intended for:

* Python code generation experiments
* GRPO / RLHF-style fine-tuning experiments
* LoRA + merge workflows
* lightweight coding assistant prototyping
* research and hobbyist use

It is **not** validated for:

* production-critical software generation
* security-sensitive code
* safety-critical systems
* correctness-sensitive automated coding pipelines without external verification

---

## Reproducibility

The training pipeline used:

* `transformers`
* `datasets`
* `trl`
* `peft`
* `torch`

A simplified training flow:

1. Load `summerMC/matutake`
2. Convert the dataset into chat prompts
3. Train with `GRPOTrainer` using LoRA adapters
4. Save the LoRA adapter
5. Merge adapter weights back into the base model
6. Save the merged model as `summerMC/ume`
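
Steps 4 to 6 could look roughly like the sketch below, assuming the adapter was saved with `peft`. The function name and the directory arguments are hypothetical, and the use of `trust_remote_code=True` simply mirrors the usage example in this card.

```python
def merge_adapter(base_id, adapter_dir, output_dir):
    """Merge a saved LoRA adapter into its base model and save the result."""
    # Heavy dependencies are imported lazily so the helper can be defined
    # without loading them.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_id, trust_remote_code=True)
    model = PeftModel.from_pretrained(base, adapter_dir)

    # Fold the LoRA weights into the base weights and drop the adapter wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)

    # Save the tokenizer alongside so the output directory is self-contained.
    tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
    tokenizer.save_pretrained(output_dir)
    return output_dir
```

A call such as `merge_adapter("summerMC/matutake", "./ume-lora-adapter", "./ume-merged")` would produce a standalone checkpoint; both local paths are placeholders.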

---

## Base Model and Dataset Attribution

### Base model

* [`summerMC/matutake`](https://huggingface.co/summerMC/matutake)

### Dataset

* [`Hoglet-33/python-coding-dataset`](https://huggingface.co/datasets/Hoglet-33/python-coding-dataset)

---

## License

Please follow the licenses and usage terms of:

1. the original base model `summerMC/matutake`
2. the training dataset `Hoglet-33/python-coding-dataset`

If you redistribute or publish derivative checkpoints, confirm that your use is compatible with both upstream licenses.

---

## Citation

If you use this model in a project or experiment, please cite the upstream base model and dataset.