---
license: apache-2.0
language:
  - en
tags:
  - code
  - programming
  - mathematics
  - reasoning
  - text-generation
  - conversational
pipeline_tag: text-generation
library_name: transformers
datasets:
  - uaytug/UCDS
model-index:
  - name: ucoder-mini
    results: []
---

# uCoder Mini

**1.5B parameters · 4096-token context · Apache-2.0 license**

## Overview

uCoder Mini is a compact 1.5B parameter language model fine-tuned for code generation and mathematical reasoning. Despite its small size, it delivers strong performance on programming tasks across multiple languages and competitive programming challenges.

Trained on the UCDS (uCoder Dataset) — a curated collection of 420K+ high-quality coding and mathematics samples.
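
If you want to inspect the training data, the dataset is hosted on the Hub. A minimal sketch using the `datasets` library; the split name and printed fields are assumptions, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Load the UCDS dataset from the Hugging Face Hub.
# The "train" split name is an assumption; see the dataset card for details.
ucds = load_dataset("uaytug/UCDS", split="train")

print(ucds)     # dataset size and column names
print(ucds[0])  # inspect the first sample
```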

## Intended Use

- Code generation across Python, JavaScript, C++, Java, and more
- Competitive programming problem solving
- Mathematical reasoning and problem breakdown
- Code explanation and debugging assistance
- Learning companion for programming concepts

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "uaytug/ucoder-mini"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Write a Python function to find the longest palindromic substring."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
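
For quick experiments, the high-level `pipeline` API also works. A minimal sketch; the sampling values simply mirror the snippet above and are illustrative, not recommendations from the model card:

```python
from transformers import pipeline

# Convenience wrapper around the same model; sampling settings are illustrative.
generator = pipeline(
    "text-generation",
    model="uaytug/ucoder-mini",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain what a Python generator is, with a short example."}
]

result = generator(messages, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```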

## Chat Template

uCoder Mini uses the ChatML format:

```
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
```
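
You normally don't need to build this string by hand: `apply_chat_template` on the tokenizer loaded in the Quick Start produces it for you (the exact output may also include a default system turn, depending on the template):

```python
# Render the chat template without tokenizing to see the raw ChatML string.
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your question here"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(text)
```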

## Training Details

| Attribute        | Value                        |
|------------------|------------------------------|
| Training Dataset | uaytug/UCDS                  |
| Dataset Size     | 420,686 samples              |
| Training Method  | Supervised Fine-Tuning (SFT) |
| Precision        | bfloat16                     |
| Context Length   | 4096 tokens                  |
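
Since the model was fine-tuned in bfloat16 with a 4096-token context, it can help to load it in that dtype explicitly and to check prompt length before generating. A minimal sketch, purely illustrative (the Quick Start's `torch_dtype="auto"` also works):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("uaytug/ucoder-mini")
model = AutoModelForCausalLM.from_pretrained(
    "uaytug/ucoder-mini",
    torch_dtype=torch.bfloat16,  # match the training precision
    device_map="auto",
)

prompt = "Write a Python function that checks whether a string is a palindrome."
ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

# Keep prompt + new tokens within the 4096-token training context.
MAX_CONTEXT = 4096
assert ids.shape[1] + 512 <= MAX_CONTEXT, "prompt too long for the training context"
```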

## Dataset Composition

The UCDS dataset combines high-quality sources:

| Source                     | Description                          |
|----------------------------|--------------------------------------|
| CodeForces                 | Competitive programming problems     |
| Code-Contests-Plus         | Algorithm challenges with solutions  |
| CodeAlpaca                 | Instruction-following code samples   |
| OpenMathInstruct           | Mathematical reasoning chains        |
| Additional curated sources | Tool use, chain-of-thought examples  |

## Limitations

- **Model size:** As a 1.5B model, it may struggle with highly complex multi-file projects
- **Knowledge cutoff:** Training data has a knowledge cutoff; recent libraries/APIs may not be covered
- **Verification needed:** Always review and test generated code before production use
- **Language coverage:** Strongest in Python; other languages may have reduced quality

## Example Outputs

### Competitive Programming

**Prompt:** Solve the two-sum problem efficiently.

**Response:**

```python
def two_sum(nums, target):
    seen = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
    return []
```
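
In line with the verification note in Limitations, the returned function can be sanity-checked with a couple of quick cases (illustrative):

```python
assert two_sum([2, 7, 11, 15], 9) == [0, 1]
assert two_sum([3, 2, 4], 6) == [1, 2]
assert two_sum([1, 2], 5) == []  # no valid pair
```
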
### Mathematical Reasoning

**Prompt:** Find the time complexity of merge sort and explain why.

**Response:** Merge sort has O(n log n) time complexity. Here's why:

  1. Divide step: The array is split in half recursively, creating log(n) levels
  2. Merge step: At each level, we process all n elements exactly once
  3. Total: n elements × log(n) levels = O(n log n)

This holds for best, average, and worst cases, making merge sort consistently efficient.
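
The same argument can be written as the recurrence T(n) = 2T(n/2) + O(n), which resolves to O(n log n). For reference, a compact implementation of the two steps described above (an illustration, not model output):

```python
def merge_sort(arr):
    # Divide: split the array in half until single elements remain.
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])

    # Merge: combine the two sorted halves, touching each element once per level.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```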

## Citation

```bibtex
@misc{ucoder-mini,
  author = {uaytug},
  title = {uCoder Mini: A Compact Code and Math Language Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/uaytug/ucoder-mini}
}
```