---
license: apache-2.0
language:
  - en
tags:
  - code
  - programming
  - mathematics
  - reasoning
  - text-generation
  - conversational
pipeline_tag: text-generation
library_name: transformers
datasets:
  - uaytug/UCDS
model-index:
  - name: ucoder-mini
    results: []
---

# uCoder Mini

**1.5B parameters · 4096-token context · Apache-2.0 license**

## Overview

uCoder Mini is a compact 1.5B parameter language model fine-tuned for code generation and mathematical reasoning. Despite its small size, it delivers strong performance on programming tasks across multiple languages and competitive programming challenges.

Trained on the UCDS (uCoder Dataset) — a curated collection of 420K+ high-quality coding and mathematics samples.
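
If you want to inspect the training data, the dataset is hosted on the Hub. A minimal sketch using the `datasets` library; the split name and printed fields are assumptions, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Load the UCDS dataset from the Hugging Face Hub.
# The "train" split name is an assumption; see the dataset card for details.
ucds = load_dataset("uaytug/UCDS", split="train")

print(ucds)     # dataset size and column names
print(ucds[0])  # inspect the first sample
```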

## Intended Use

- Code generation across Python, JavaScript, C++, Java, and more
- Competitive programming problem solving
- Mathematical reasoning and problem breakdown
- Code explanation and debugging assistance
- Learning companion for programming concepts

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "uaytug/ucoder-mini"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Write a Python function to find the longest palindromic substring."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
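
For quick experiments, the high-level `pipeline` API also works. A minimal sketch; the sampling values simply mirror the snippet above and are illustrative, not recommendations from the model card:

```python
from transformers import pipeline

# Convenience wrapper around the same model; sampling settings are illustrative.
generator = pipeline(
    "text-generation",
    model="uaytug/ucoder-mini",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain what a Python generator is, with a short example."}
]

result = generator(messages, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```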

## Chat Template

uCoder Mini uses the ChatML format:

```
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
```
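
You normally don't need to build this string by hand: `apply_chat_template` on the tokenizer loaded in the Quick Start produces it for you (the exact output may also include a default system turn, depending on the template):

```python
# Render the chat template without tokenizing to see the raw ChatML string.
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your question here"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(text)
```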

## Training Details

| Attribute        | Value                        |
|------------------|------------------------------|
| Training Dataset | uaytug/UCDS                  |
| Dataset Size     | 420,686 samples              |
| Training Method  | Supervised Fine-Tuning (SFT) |
| Precision        | bfloat16                     |
| Context Length   | 4096 tokens                  |
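
Since the model was fine-tuned in bfloat16 with a 4096-token context, it can help to load it in that dtype explicitly and to check prompt length before generating. A minimal sketch, purely illustrative (the Quick Start's `torch_dtype="auto"` also works):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("uaytug/ucoder-mini")
model = AutoModelForCausalLM.from_pretrained(
    "uaytug/ucoder-mini",
    torch_dtype=torch.bfloat16,  # match the training precision
    device_map="auto",
)

prompt = "Write a Python function that checks whether a string is a palindrome."
ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

# Keep prompt + new tokens within the 4096-token training context.
MAX_CONTEXT = 4096
assert ids.shape[1] + 512 <= MAX_CONTEXT, "prompt too long for the training context"
```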

## Dataset Composition

The UCDS dataset combines high-quality sources:

| Source                     | Description                          |
|----------------------------|--------------------------------------|
| CodeForces                 | Competitive programming problems     |
| Code-Contests-Plus         | Algorithm challenges with solutions  |
| CodeAlpaca                 | Instruction-following code samples   |
| OpenMathInstruct           | Mathematical reasoning chains        |
| Additional curated sources | Tool use, chain-of-thought examples  |

## Limitations

- **Model size:** As a 1.5B model, it may struggle with highly complex multi-file projects
- **Knowledge cutoff:** Training data has a knowledge cutoff; recent libraries/APIs may not be covered
- **Verification needed:** Always review and test generated code before production use
- **Language coverage:** Strongest in Python; other languages may have reduced quality

## Example Outputs

### Competitive Programming

**Prompt:** Solve the two-sum problem efficiently.

**Response:**

```python
def two_sum(nums, target):
    seen = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
    return []
```
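
In line with the verification note in Limitations, the returned function can be sanity-checked with a couple of quick cases (illustrative):

```python
assert two_sum([2, 7, 11, 15], 9) == [0, 1]
assert two_sum([3, 2, 4], 6) == [1, 2]
assert two_sum([1, 2], 5) == []  # no valid pair
```
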
### Mathematical Reasoning

**Prompt:** Find the time complexity of merge sort and explain why.

**Response:** Merge sort has O(n log n) time complexity. Here's why:

  1. Divide step: The array is split in half recursively, creating log(n) levels
  2. Merge step: At each level, we process all n elements exactly once
  3. Total: n elements × log(n) levels = O(n log n)

This holds for best, average, and worst cases, making merge sort consistently efficient.
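
The same argument can be written as the recurrence T(n) = 2T(n/2) + O(n), which resolves to O(n log n). For reference, a compact implementation of the two steps described above (an illustration, not model output):

```python
def merge_sort(arr):
    # Divide: split the array in half until single elements remain.
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])

    # Merge: combine the two sorted halves, touching each element once per level.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```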

## Citation

```bibtex
@misc{ucoder-mini,
  author = {uaytug},
  title = {uCoder Mini: A Compact Code and Math Language Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/uaytug/ucoder-mini}
}
```