---
language:
- en
tags:
- transformers
- trl
- grpo
- peft
- lora
- python
- code-generation
pipeline_tag: text-generation
base_model: summerMC/matutake
library_name: transformers
---
# ume
`ume` is a GRPO fine-tuned derivative of [`summerMC/matutake`](https://huggingface.co/summerMC/matutake), trained with LoRA on Python code-generation tasks and merged back into the base model for standalone inference.
## Model Summary
* **Model name:** `summerMC/ume`
* **Base model:** `summerMC/matutake`
* **Training method:** GRPO (Group Relative Policy Optimization)
* **Parameter-efficient tuning:** LoRA
* **Training dataset:** `Hoglet-33/python-coding-dataset`
* **Final artifact:** merged checkpoint for direct inference
Fine-tuning aims to improve Python code generation using lightweight reward functions that favor syntactically valid, code-like outputs.
---
## Training Details
### Base model
* `summerMC/matutake`
### Dataset
* `Hoglet-33/python-coding-dataset`
### Fine-tuning method
* **Trainer:** TRL `GRPOTrainer`
* **Adapter method:** LoRA
* **Final export:** merged LoRA weights into the base model
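GRPO replaces a learned value baseline with a group baseline. For each prompt it samples several completions and scores each one relative to the others in its group. The function below is a minimal sketch of that normalization. It assumes mean-centering followed by division by the population standard deviation plus a small epsilon. TRL's exact choices (standard-deviation convention, epsilon, optional scaling flags) vary between releases, so treat this as illustrative rather than as `GRPOTrainer`'s implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's group of sampled completions.

    Sketch only: population std and eps=1e-4 are assumptions, not TRL's exact code.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


print(group_relative_advantages([1.0, 1.0]))  # identical rewards -> zero advantages
print(group_relative_advantages([1.0, 0.0]))  # roughly +1 and -1
```

This run uses **Number of generations: 2**, so each group holds only two completions. Whenever both earn the same reward, for example because both parse, both advantages are zero, and that prompt contributes no policy-gradient signal.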
### Reward functions
Training used simple heuristic reward functions:
#### 1) Syntax reward
Rewards outputs that can be parsed as valid Python:
* `1.0` if `ast.parse(output)` succeeds
* `0.0` otherwise
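Below is a minimal sketch of the syntax reward. It follows the reward-function convention of TRL's `GRPOTrainer`: a callable that receives the batch of `completions` plus keyword arguments and returns one float per completion. The helper `_completion_text` is hypothetical, and so is its handling of conversational completions (a list of message dicts). This is not the code used to train `ume`.

```python
import ast


def _completion_text(completion):
    """Return the generated text for a plain-text or conversational completion."""
    # Assumption: conversational completions arrive as [{"role": ..., "content": ...}].
    if isinstance(completion, list):
        return completion[0]["content"]
    return completion


def syntax_reward(completions, **kwargs):
    """1.0 if the completion parses as Python source, else 0.0."""
    rewards = []
    for completion in completions:
        try:
            ast.parse(_completion_text(completion))
        except (SyntaxError, ValueError):  # ValueError: e.g. null bytes on older Pythons
            rewards.append(0.0)
        else:
            rewards.append(1.0)
    return rewards
```

Note that `ast.parse("")` succeeds, so an empty completion earns the full syntax reward. That is one reason a length check like the one in the code-shape reward is useful.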
#### 2) Code-shape reward
Rewards outputs that look more like actual Python code:
* contains no Markdown code fences
* contains Python-like tokens such as `def`, `import`, `return`, `class`
* is not trivially short
* is not excessively long
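This card does not give the exact thresholds or weighting for the code-shape reward. The sketch below therefore uses hypothetical values: `MIN_CHARS`, `MAX_CHARS`, and an equal-weight average of the four checks. It only illustrates how the listed heuristics can be combined into a score in [0, 1].

```python
import re

# Hypothetical values: the real thresholds are not stated in this model card.
MIN_CHARS = 40
MAX_CHARS = 4000
PYTHON_TOKENS = ("def", "import", "return", "class")


def code_shape_score(text: str) -> float:
    """Average of four pass/fail heuristics, giving a score in [0, 1]."""
    checks = [
        "```" not in text,  # no Markdown code fences
        any(re.search(rf"\b{tok}\b", text) for tok in PYTHON_TOKENS),  # Python-like tokens
        len(text) >= MIN_CHARS,  # not trivially short
        len(text) <= MAX_CHARS,  # not excessively long
    ]
    return sum(checks) / len(checks)


def code_shape_reward(completions, **kwargs):
    """TRL-style reward callable (assumed signature): one score per completion."""
    texts = [c if isinstance(c, str) else c[0]["content"] for c in completions]
    return [code_shape_score(t) for t in texts]
```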
These rewards are intentionally lightweight and should be treated as a baseline GRPO setup rather than a production-grade evaluation system.
---
## Prompt Format
The training data was converted into a chat-style coding prompt like this:
```python
[
    {
        "role": "user",
        "content": (
            "Write correct Python code for the following task.\n"
            "Return only Python code. Do not use markdown.\n\n"
            "<task text>"
        ),
    }
]
```
For best results, prompt the model with a direct coding task and explicitly request **code only**.
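Here is a minimal sketch of that conversion step. The prompt text is copied from the template above. The dataset column name (`instruction`) is a hypothetical placeholder, because the card does not say which field of `Hoglet-33/python-coding-dataset` holds the task. The output uses a `prompt` column, which is where TRL's `GRPOTrainer` reads prompts from.

```python
PROMPT_PREFIX = (
    "Write correct Python code for the following task.\n"
    "Return only Python code. Do not use markdown.\n\n"
)


def build_prompt(task_text: str) -> list[dict]:
    """Wrap one task description in the single-turn chat format used for training."""
    return [{"role": "user", "content": PROMPT_PREFIX + task_text}]


def to_grpo_example(row: dict, task_column: str = "instruction") -> dict:
    """Map one dataset row to a GRPO example; `task_column` is a hypothetical field name."""
    return {"prompt": build_prompt(row[task_column])}
```

With the `datasets` library, this would typically be applied as `dataset.map(to_grpo_example)` after checking the dataset's actual column names.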
---
## Usage
### Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/ume"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": (
            "Write a Python function that computes fibonacci numbers with memoization. "
            "Return only Python code."
        ),
    }
]

# return_dict=True returns input_ids and attention_mask as a dict, so the result
# can be unpacked into generate() and indexed by "input_ids" below.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(response)
```
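A sampled response can still contain Markdown fences or fail to parse, even when the prompt asks for code only. The sketch below is a hypothetical post-processing step. It is not part of the model or of `transformers`. It extracts the first fenced block if one is present, then checks the result with `ast.parse`, mirroring the syntax reward used in training. Passing this check says nothing about functional correctness.

```python
import ast
import re

# Matches an optional ```python fence and captures its body (hypothetical helper).
FENCE_RE = re.compile(r"```(?:python)?\s*\n(.*?)```", re.DOTALL)


def clean_generated_code(response: str) -> str:
    """Return the first fenced code block if present, else the whole response, stripped."""
    match = FENCE_RE.search(response)
    code = match.group(1) if match else response
    return code.strip()


def is_valid_python(code: str) -> bool:
    """True if `code` parses as Python source."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True
```

After the generation snippet, this would be used as `code = clean_generated_code(response)` followed by `is_valid_python(code)`.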
---
## Example Prompt
### Input
```text
Write a Python function that returns the longest common prefix of a list of strings.
Return only Python code.
```
### Expected output style
```python
def longest_common_prefix(strs):
    if not strs:
        return ""
    prefix = strs[0]
    for s in strs[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix
```
---
## Training Configuration
The model was trained with a setup similar to the following:
* **LoRA rank (`r`)**: 16
* **LoRA alpha**: 32
* **LoRA dropout**: 0.05
* **Learning rate**: 5e-6
* **Batch size**: 1
* **Gradient accumulation**: 8
* **Generation batch size**: 2
* **Number of generations**: 2
* **Epochs**: 1
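The configuration values above imply a couple of derived quantities. The arithmetic below rests on three assumptions: a single GPU (the card does not state the device count), TRL's convention that GRPO batch sizes count completions rather than prompts, and PEFT's default LoRA scaling of `lora_alpha / r` (without rank-stabilized LoRA).

```python
# Values from the training configuration list.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
generation_batch_size = 2
num_generations = 2
lora_r, lora_alpha = 16, 32

# Assumption: a single training process (one GPU); not stated in the card.
num_processes = 1

# Completions contributing to each optimizer step, and the prompts they come from.
completions_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_processes
prompts_per_step = completions_per_step // num_generations

# Each generation batch must hold whole groups of completions for a prompt.
assert generation_batch_size % num_generations == 0

# PEFT's default (non-rsLoRA) setting scales the LoRA update by alpha / r.
lora_scale = lora_alpha / lora_r

print(completions_per_step, prompts_per_step, lora_scale)  # 8 4 2.0
```

Under these assumptions, each optimizer step sees 8 completions drawn from 4 prompts, and the LoRA update is scaled by 2.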
### LoRA target modules
```python
[
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]
```
---
## Limitations
* Training rewards are heuristic and do **not** verify functional correctness with unit tests.
* The model may still produce syntactically valid but logically incorrect code.
* Outputs may include hallucinated APIs, inefficient solutions, or incomplete implementations.
* Performance depends heavily on the capabilities and constraints of the base model `summerMC/matutake`.
---
## Intended Use
`summerMC/ume` is intended for:
* Python code generation experiments
* GRPO / RLHF-style fine-tuning experiments
* LoRA + merge workflows
* lightweight coding assistant prototyping
* research and hobbyist use
It is **not** validated for:
* production-critical software generation
* security-sensitive code
* safety-critical systems
* correctness-sensitive automated coding pipelines without external verification
---
## Reproducibility
The training pipeline used:
* `transformers`
* `datasets`
* `trl`
* `peft`
* `torch`
A simplified training flow:
1. Load `summerMC/matutake`
2. Convert the dataset into chat prompts
3. Train with `GRPOTrainer` using LoRA adapters
4. Save the LoRA adapter
5. Merge adapter weights back into the base model
6. Save the merged model as `summerMC/ume`
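The six steps above can be sketched with `trl` and `peft` roughly as follows. This is an illustrative outline under stated assumptions, not the script that produced `ume`. It assumes a recent TRL release whose `GRPOConfig` accepts `generation_batch_size`, a dataset already converted to a `prompt` column, and reward callables supplied by the caller. The directory names `ume-lora` and `ume-merged` are hypothetical. Imports sit inside the functions so the module can be loaded without the training stack installed.

```python
def train_adapter(train_dataset, reward_funcs, adapter_dir="ume-lora"):
    """Steps 1-4: load the base model, train LoRA adapters with GRPO, save the adapter."""
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import GRPOConfig, GRPOTrainer

    base_id = "summerMC/matutake"
    tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(base_id, trust_remote_code=True)

    # LoRA settings copied from the training configuration above.
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        task_type="CAUSAL_LM",
    )
    args = GRPOConfig(
        output_dir=adapter_dir,
        learning_rate=5e-6,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        generation_batch_size=2,
        num_generations=2,
        num_train_epochs=1,
    )
    trainer = GRPOTrainer(
        model=model,
        reward_funcs=reward_funcs,
        args=args,
        train_dataset=train_dataset,  # must already have a "prompt" column
        processing_class=tokenizer,
        peft_config=peft_config,
    )
    trainer.train()
    trainer.save_model(adapter_dir)  # with peft_config set, this saves the LoRA adapter
    tokenizer.save_pretrained(adapter_dir)


def merge_adapter(adapter_dir="ume-lora", merged_dir="ume-merged"):
    """Steps 5-6: fold the LoRA weights into the base model and save a standalone checkpoint."""
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_id = "summerMC/matutake"
    base = AutoModelForCausalLM.from_pretrained(base_id, trust_remote_code=True)
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(merged_dir)
    AutoTokenizer.from_pretrained(base_id, trust_remote_code=True).save_pretrained(merged_dir)
```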
---
## Base Model and Dataset Attribution
### Base model
* [`summerMC/matutake`](https://huggingface.co/summerMC/matutake)
### Dataset
* [`Hoglet-33/python-coding-dataset`](https://huggingface.co/datasets/Hoglet-33/python-coding-dataset)
---
## License
Please follow the licenses and usage terms of:
1. the original base model `summerMC/matutake`
2. the training dataset `Hoglet-33/python-coding-dataset`
If you redistribute or publish derivative checkpoints, confirm that your use is compatible with both upstream licenses.
---
## Citation
If you use this model in a project or experiment, please cite the upstream base model and dataset.