---
language:
- en
tags:
- transformers
- trl
- grpo
- peft
- lora
- python
- code-generation
pipeline_tag: text-generation
base_model: summerMC/matutake
library_name: transformers
---

# ume

`ume` is a GRPO fine-tuned derivative of [`summerMC/matutake`](https://huggingface.co/summerMC/matutake), trained with LoRA on Python code-generation tasks and merged back into the base model for standalone inference.

## Model Summary

* **Model name:** `summerMC/ume`
* **Base model:** `summerMC/matutake`
* **Training method:** GRPO (Group Relative Policy Optimization)
* **Parameter-efficient tuning:** LoRA
* **Training dataset:** `Hoglet-33/python-coding-dataset`
* **Final artifact:** merged checkpoint for direct inference

This model is intended to improve Python code generation behavior using lightweight reward functions that favor syntactically valid, code-like outputs.

---

## Training Details

### Base model

* `summerMC/matutake`

### Dataset

* `Hoglet-33/python-coding-dataset`

### Fine-tuning method

* **Trainer:** TRL `GRPOTrainer`
* **Adapter method:** LoRA
* **Final export:** LoRA weights merged into the base model

### Reward functions

Training used simple heuristic reward functions:

#### 1) Syntax reward

Rewards outputs that can be parsed as valid Python:

* `1.0` if `ast.parse(output)` succeeds
* `0.0` otherwise

#### 2) Code-shape reward

Rewards outputs that look more like actual Python code:

* no Markdown code fences
* contains Python-like tokens such as `def`, `import`, `return`, `class`
* non-trivially long output
* avoids extremely long generations

These rewards are intentionally lightweight and should be treated as a baseline GRPO setup rather than a production-grade evaluation system.

---

## Prompt Format

The training data was converted into a chat-style coding prompt like this:

```python
[
    {
        "role": "user",
        "content": (
            "Write correct Python code for the following task.\n"
            "Return only Python code. Do not use markdown.\n\n"
            ""
        ),
    }
]
```

For best results, prompt the model with a direct coding task and explicitly request **code only**.

---

## Usage

### Transformers

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "summerMC/ume"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # bfloat16 on GPU, float32 on CPU
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": "Write a Python function that computes fibonacci numbers with memoization.",
    }
]

# return_dict=True yields a mapping with "input_ids" and "attention_mask",
# which is what the **inputs unpacking and the slicing below rely on.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(response)
```

---

## Example Prompt

### Input

```text
Write a Python function that returns the longest common prefix of a list of strings.
Return only Python code.
```

### Expected output style

```python
def longest_common_prefix(strs):
    if not strs:
        return ""
    prefix = strs[0]
    for s in strs[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix
```

---

## Training Configuration

The model was trained with a setup similar to the following:

* **LoRA rank (`r`)**: 16
* **LoRA alpha**: 32
* **LoRA dropout**: 0.05
* **Learning rate**: 5e-6
* **Batch size**: 1
* **Gradient accumulation**: 8
* **Generation batch size**: 2
* **Number of generations**: 2
* **Epochs**: 1

### LoRA target modules

```python
[
    "q_proj",
    "k_proj",
    "v_proj",
    "o_proj",
    "gate_proj",
    "up_proj",
    "down_proj",
]
```

---

## Limitations

* Training rewards are heuristic and do **not** verify functional correctness with unit tests.
* The model may still produce syntactically valid but logically incorrect code.
* Outputs may include hallucinated APIs, inefficient solutions, or incomplete implementations.
* Performance depends heavily on the capabilities and constraints of the base model `summerMC/matutake`.

---

## Intended Use

`summerMC/ume` is intended for:

* Python code generation experiments
* GRPO / RLHF-style fine-tuning experiments
* LoRA + merge workflows
* lightweight coding assistant prototyping
* research and hobbyist use

It is **not** validated for:

* production-critical software generation
* security-sensitive code
* safety-critical systems
* correctness-sensitive automated coding pipelines without external verification

---

## Reproducibility

The training pipeline used:

* `transformers`
* `datasets`
* `trl`
* `peft`
* `torch`

A simplified training flow:

1. Load `summerMC/matutake`
2. Convert the dataset into chat prompts
3. Train with `GRPOTrainer` using LoRA adapters
4. Save the LoRA adapter
5. Merge adapter weights back into the base model
6. Save the merged model as `summerMC/ume`

---

## Base Model and Dataset Attribution

### Base model

* [`summerMC/matutake`](https://huggingface.co/summerMC/matutake)

### Dataset

* [`Hoglet-33/python-coding-dataset`](https://huggingface.co/datasets/Hoglet-33/python-coding-dataset)

---

## License

Please follow the licenses and usage terms of:

1. the original base model `summerMC/matutake`
2. the training dataset `Hoglet-33/python-coding-dataset`

If you redistribute or publish derivative checkpoints, confirm that your use is compatible with both upstream licenses.

---

## Citation

If you use this model in a project or experiment, please cite the upstream base model and dataset.
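
---

## Reward Function Sketch

The two heuristic rewards described under "Reward functions" could look roughly like the sketch below. This is an illustration, not the code used to train `ume`: the function names, the token list beyond the four tokens named in this card, the length thresholds (`MIN_CHARS`, `MAX_CHARS`), and the equal weighting of the four code-shape checks are all assumptions. The `(completions, **kwargs)` signature follows the convention TRL documents for custom GRPO reward functions, where each completion is either a plain string or a list of chat messages.

```python
import ast

# Assumed values for illustration; the model card does not state them.
CODE_TOKENS = ("def ", "import ", "return ", "class ")
MIN_CHARS = 20
MAX_CHARS = 4000


def _to_text(completion):
    """Accept a plain string or a chat-style list of message dicts."""
    if isinstance(completion, str):
        return completion
    return completion[0]["content"]


def syntax_reward(completions, **kwargs):
    """1.0 if the completion parses as Python, 0.0 otherwise."""
    rewards = []
    for completion in completions:
        try:
            ast.parse(_to_text(completion))
            rewards.append(1.0)
        except (SyntaxError, ValueError):
            rewards.append(0.0)
    return rewards


def code_shape_reward(completions, **kwargs):
    """Fraction of four code-shape checks passed (equal weights assumed)."""
    rewards = []
    for completion in completions:
        text = _to_text(completion)
        checks = (
            "```" not in text,                        # no Markdown fences
            any(tok in text for tok in CODE_TOKENS),  # Python-like tokens
            len(text) >= MIN_CHARS,                   # non-trivially long
            len(text) <= MAX_CHARS,                   # not extremely long
        )
        rewards.append(sum(checks) / len(checks))
    return rewards
```

Functions with this shape can be passed to `GRPOTrainer` through its `reward_funcs` argument, for example `reward_funcs=[syntax_reward, code_shape_reward]`. Note that `ast.parse("")` succeeds, so an empty completion earns the full syntax reward; in this sketch only the length check in the code-shape reward penalizes it.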