Libraries

How to use dongwookkwon/qwen0.5b-tech-interview-test with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="dongwookkwon/qwen0.5b-tech-interview-test")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dongwookkwon/qwen0.5b-tech-interview-test")
model = AutoModelForCausalLM.from_pretrained("dongwookkwon/qwen0.5b-tech-interview-test", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use dongwookkwon/qwen0.5b-tech-interview-test with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dongwookkwon/qwen0.5b-tech-interview-test"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dongwookkwon/qwen0.5b-tech-interview-test",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/dongwookkwon/qwen0.5b-tech-interview-test

SGLang

How to use dongwookkwon/qwen0.5b-tech-interview-test with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dongwookkwon/qwen0.5b-tech-interview-test" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dongwookkwon/qwen0.5b-tech-interview-test",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dongwookkwon/qwen0.5b-tech-interview-test" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use dongwookkwon/qwen0.5b-tech-interview-test with Docker Model Runner:
```
docker model run hf.co/dongwookkwon/qwen0.5b-tech-interview-test
```

qwen0.5b-tech-interview-test

This model is Qwen/Qwen2.5-0.5B fine-tuned for mathematical reasoning. It was trained with TRL using QLoRA (Quantized LoRA), and the adapter weights were then merged into the base model.

Model Details

Base Model: Qwen/Qwen2.5-0.5B
Fine-tuning Method: QLoRA (Quantized LoRA) followed by weight merging
Task: Mathematical reasoning (GSM8K benchmark)
Training Framework: TRL (Transformer Reinforcement Learning)

Training Data

The model was fine-tuned on a mixture of datasets:

GSM8K (15.7%): 7,473 samples from the GSM8K training set (human-written natural reasoning)
NuminaMath-CoT (84.3%): 40,000 samples from the NuminaMath-CoT dataset (model-generated CoT examples)

Total training samples: 47,473
Train/Test Split: 90%/10% (42,726 train / 4,747 test)
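
The mixture percentages and split sizes follow from the sample counts alone. The sketch below redoes that arithmetic. The rounding rule used for the split is an assumption chosen because it reproduces the reported sizes; the card does not say how the split was actually computed.

```python
# Sample counts as reported in this card.
counts = {"GSM8K": 7_473, "NuminaMath-CoT": 40_000}
total = sum(counts.values())

# Share of each dataset in the mixture, in percent (one decimal place).
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# 90/10 split. Rounding the train size to the nearest integer is an
# assumption; it happens to match the reported 42,726 / 4,747.
train_size = round(0.9 * total)
test_size = total - train_size

print(total, shares, train_size, test_size)
```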

Dataset Composition Strategy

The combination strategy aimed to balance:

Natural human reasoning patterns from GSM8K
Diverse Chain-of-Thought (CoT) patterns from NuminaMath-CoT

Both datasets were converted to a unified messages format compatible with Qwen's chat template.
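
As an illustration of the unified format, the sketch below maps one record from each dataset to a `messages` list. The field names (`question`/`answer` for GSM8K, `problem`/`solution` for NuminaMath-CoT) are assumptions based on the public dataset schemas, the helper names are hypothetical, and the toy records are made up; the actual conversion code is not part of this card.

```python
from typing import Dict, List

Message = Dict[str, str]


def to_messages(prompt: str, completion: str) -> Dict[str, List[Message]]:
    """Build a single-turn conversation in the chat 'messages' format."""
    return {
        "messages": [
            {"role": "user", "content": prompt.strip()},
            {"role": "assistant", "content": completion.strip()},
        ]
    }


def gsm8k_to_messages(record: Dict[str, str]) -> Dict[str, List[Message]]:
    # Assumed GSM8K fields: 'question' and 'answer'.
    return to_messages(record["question"], record["answer"])


def numina_to_messages(record: Dict[str, str]) -> Dict[str, List[Message]]:
    # Assumed NuminaMath-CoT fields: 'problem' and 'solution'.
    return to_messages(record["problem"], record["solution"])


# Hypothetical toy records, for illustration only.
gsm8k_example = {"question": "What is 2 + 3?", "answer": "2 + 3 = 5\n#### 5"}
numina_example = {"problem": "Compute $1 + 1$.", "solution": "We have $1 + 1 = 2$."}

unified = [gsm8k_to_messages(gsm8k_example), numina_to_messages(numina_example)]
print(unified[0]["messages"])
```

With the `datasets` library, functions like these would typically be applied through `Dataset.map` before concatenating the two sources.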

Evaluation Results

GSM8K Benchmark

Metric	Method	Few-shot	Score	Std Error
exact_match	flexible-extract	5	34.12%	±1.31%
exact_match	strict-match	5	33.59%	±1.30%

Baseline (Qwen2.5-0.5B-Instruct): 34.42% (flexible-extract), 31.69% (strict-match)
Comparison with baseline:
- Flexible-extract: comparable (34.12% vs 34.42%, a 0.30-point difference, within one standard error)
- Strict-match: +1.90 percentage points (33.59% vs 31.69%)
Note: This model was fine-tuned on a curated dataset mixture of 47,473 samples to improve mathematical reasoning capabilities.
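
The two scoring methods differ in how the final answer is pulled out of the generated text. The sketch below approximates both filters with regular expressions. The patterns are an assumption modeled on the GSM8K task in lm-evaluation-harness and may not match the exact version used here; the function names and example strings are hypothetical.

```python
import re
from typing import Optional

# Approximate answer filters; the exact harness patterns are an assumption.
STRICT_PATTERN = re.compile(r"#### (\-?[0-9\.\,]+)")
FLEXIBLE_PATTERN = re.compile(r"(-?[$0-9.,]{2,})|(-?[0-9]+)")


def strict_match(text: str) -> Optional[str]:
    """Return the number that follows '#### ', or None if the marker is absent."""
    match = STRICT_PATTERN.search(text)
    return match.group(1) if match else None


def flexible_extract(text: str) -> Optional[str]:
    """Return the last number-like span anywhere in the text, or None."""
    matches = FLEXIBLE_PATTERN.findall(text)
    if not matches:
        return None
    first_group, second_group = matches[-1]
    return first_group or second_group


formatted = "48 / 2 = 24 clips in May, so 48 + 24 = 72 in total.\n#### 72"
unformatted = "48 / 2 = 24 clips in May, so the total is 72 clips"

print(strict_match(formatted), flexible_extract(formatted))
print(strict_match(unformatted), flexible_extract(unformatted))
```

Under this reading, strict-match only credits outputs that end in the GSM8K `#### <answer>` format, while flexible-extract scores the last number in the output regardless of format.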

Evaluation Details

Evaluation Tool: EleutherAI's lm-evaluation-harness
Inference Engine: vLLM (for efficient batch inference)
Test Samples: 1,319 (GSM8K test split)
Generation Settings:
- temperature=0.0
- do_sample=False
- max_tokens=256
Evaluation Method: Few-shot evaluation with 5 examples
Data Leakage Prevention: Only the GSM8K test split was used for evaluation; only the train split was used for training
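
The reported standard errors are consistent with treating each of the 1,319 test items as a 0/1 outcome. The sketch below recomputes them from the accuracy and the sample count. Using the sample (n - 1) variance is an assumption about how the harness computes it; the population variance gives the same values to two decimals here.

```python
import math


def binomial_std_error(accuracy: float, n_samples: int) -> float:
    """Standard error of the mean of 0/1 scores, using the sample variance."""
    sample_variance = accuracy * (1.0 - accuracy) * n_samples / (n_samples - 1)
    return math.sqrt(sample_variance / n_samples)


n_test = 1_319  # GSM8K test split size, as reported above
for name, acc in [("flexible-extract", 0.3412), ("strict-match", 0.3359)]:
    se = binomial_std_error(acc, n_test)
    print(f"{name}: {100 * acc:.2f}% ± {100 * se:.2f}%")
```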

Training Procedure

Training Hyperparameters

Learning Rate: 2e-5 (increased from 5e-6 for faster convergence)
Training Epochs: 2 (with early stopping)
Batch Size: 1 (per device)
Effective Batch Size: 8 (with gradient accumulation)
Gradient Accumulation Steps: 8 (increased from 4 for stable gradients)
Weight Decay: 0.01
Max Gradient Norm: 1.0
Max Sequence Length: 2048
Warmup Ratio: 0.15 (increased from 0.05 for better training stability)
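
To make the interaction between these settings concrete, the sketch below computes the effective batch size and a linear-warmup, cosine-decay learning rate curve. The single-device setup and the total step count are assumptions (the card reports neither), and `lr_at_step` is a hypothetical helper that mirrors the common warmup-plus-cosine schedule rather than the trainer's own code.

```python
import math

per_device_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1  # assumption: single-GPU training

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-5,
               warmup_ratio: float = 0.15) -> float:
    """Linear warmup to base_lr over the first warmup_ratio of steps, then cosine decay to zero."""
    warmup_steps = math.ceil(warmup_ratio * total_steps)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1_000  # hypothetical; the actual number of optimizer steps is not reported
for step in (0, 75, 150, 575, 1_000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

With these assumed values, the rate climbs linearly for the first 150 steps, peaks at 2e-5, and falls back to zero by the final step.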

QLoRA Configuration

Quantization: 8-bit (BitsAndBytes)
Quantization Config:
- llm_int8_threshold=6.0
- llm_int8_has_fp16_weight=False
- llm_int8_enable_fp32_cpu_offload=False
LoRA Rank (r): 32 (increased from 16 for more capacity)
LoRA Alpha: 64 (increased from 32, typically 2x rank)
LoRA Dropout: 0.1
Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable Parameters: ~17.6M (3.4% of total parameters: 511.6M)
Gradient Checkpointing: Enabled (for memory efficiency)
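
The trainable parameter count can be checked by hand: for a linear layer with `in` inputs and `out` outputs, a rank-r LoRA adapter adds r × (in + out) weights. The sketch below applies that to the seven target modules. The architecture dimensions are assumptions taken from the public Qwen2.5-0.5B configuration and are not stated in this card.

```python
# Architecture values assumed from the public Qwen2.5-0.5B config.
hidden_size = 896
intermediate_size = 4_864
num_layers = 24
num_attention_heads = 14
num_key_value_heads = 2
head_dim = hidden_size // num_attention_heads  # 64
kv_dim = num_key_value_heads * head_dim        # 128 (grouped-query attention)

rank = 32  # LoRA rank from this card

# (in_features, out_features) of each targeted linear layer in one block.
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

# Each adapter holds A (rank x in) and B (out x rank): rank * (in + out) weights.
per_block = sum(rank * (n_in + n_out) for n_in, n_out in target_shapes.values())
trainable = per_block * num_layers

total_reported = 511.6e6  # total parameter count as reported in this card
print(f"{trainable:,} trainable ({100 * trainable / total_reported:.1f}% of total)")
```

Under these assumptions the count comes to 17,596,416, which agrees with the reported ~17.6M and 3.4%.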

Training Process

The model was trained using:

Training Framework: TRL SFTTrainer with QLoRA
Data Formatting: Qwen chat template applied to messages format
Evaluation Strategy: Steps (every 250 steps)
Checkpoint Saving: Every 500 steps
Early Stopping: Enabled with patience=3 (based on eval_loss)
Best Model Selection: Based on lowest eval_loss
Optimizer: paged_adamw_8bit (8-bit AdamW optimizer for memory efficiency)
Learning Rate Schedule: Cosine decay
Packing: Enabled (for efficient batch processing)
Model Merging: LoRA weights merged with base model after training for inference
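
The early-stopping rule above can be read as: evaluate every 250 steps, and stop once eval_loss has failed to improve for 3 evaluations in a row. The sketch below simulates that rule on a made-up loss curve. `early_stopping_step` is a hypothetical helper written for illustration, not the trainer's callback, and the loss values are invented.

```python
from typing import List, Optional, Tuple


def early_stopping_step(eval_losses: List[float], eval_every: int = 250,
                        patience: int = 3) -> Tuple[int, Optional[int]]:
    """Return (step at which training stops, step with the lowest eval loss)."""
    best_loss = float("inf")
    best_step = None
    evals_without_improvement = 0
    step = 0
    for i, loss in enumerate(eval_losses, start=1):
        step = i * eval_every
        if loss < best_loss:
            best_loss, best_step = loss, step
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                break  # patience exhausted: stop training here
    return step, best_step


# Hypothetical eval_loss values at steps 250, 500, 750, ...
losses = [1.20, 1.05, 0.98, 0.95, 0.96, 0.97, 0.99, 0.94]
stop_step, best_step = early_stopping_step(losses)
print(stop_step, best_step)
```

In this toy run the lowest loss occurs at step 1000 and training stops at step 1750, so the later value of 0.94 is never reached.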

Key Optimizations

Dataset Curation: Combined GSM8K (human-written) and NuminaMath-CoT (model-generated) for balanced learning
Hyperparameter Tuning: Increased learning rate and warmup ratio for better convergence
Memory Efficiency: 8-bit quantization + gradient checkpointing + LoRA adapters
Training Stability: Gradient accumulation and early stopping to prevent overfitting

Model Usage

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dongwookkwon/qwen0.5b-tech-interview-test"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Format your question
question = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"

messages = [
    {"role": "user", "content": question}
]

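# Apply the chat template and move the tensors to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate with greedy decoding (do_sample=False makes temperature unnecessary)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False
)

# Decode only the newly generated tokens, not the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)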

Using vLLM (for faster inference)

from vllm import LLM, SamplingParams

model = LLM(
    model="dongwookkwon/qwen0.5b-tech-interview-test",
    trust_remote_code=True,
    dtype="float16",
    gpu_memory_utilization=0.5
)

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=256
)

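prompt = "Question: Natalia sold clips to 48 of her friends in April..."
outputs = model.generate([prompt], sampling_params)

# Each result holds one or more completions; print the text of the first one
print(outputs[0].outputs[0].text)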

Limitations

Domain Specificity: This model is fine-tuned specifically for mathematical reasoning tasks and may not perform well on other domains
Model Size: The 0.5B parameter size limits reasoning capabilities compared to larger models (7B+)
Problem Complexity: Performance may vary depending on the complexity of mathematical problems
Data Dependency: Model performance is dependent on the quality and diversity of the training data mixture
Inference Requirements: While optimized for inference, the model still requires GPU resources for best performance

Training Infrastructure

Framework Versions

TRL: 0.24.0 (SFTTrainer)
Transformers: 4.57.1
PyTorch: 2.8.0
Datasets: 4.3.0
PEFT: Latest (for LoRA/QLoRA support)
BitsAndBytes: Latest (for 8-bit quantization)
Accelerate: >=0.26.0
lm-evaluation-harness: 0.4.9.1 (for evaluation)
vLLM: Latest (for efficient batch inference during evaluation)

Hardware Requirements

Training: GPU with CUDA support (tested on A100, T4)
Inference: GPU recommended for best performance
Memory: ~8GB VRAM minimum for 8-bit QLoRA training

Citation

If you use this model, please cite:

@misc{qwen0.5b-tech-interview-test,
  title={qwen0.5b-tech-interview-test: Fine-tuned Qwen2.5-0.5B for Mathematical Reasoning},
  author={Dongwook Kwon},
  year={2024},
  howpublished={\url{https://huggingface.co/dongwookkwon/qwen0.5b-tech-interview-test}}
}

Base Model Citation

@misc{qwen2.5,
  title={Qwen2.5: A Party of Foundation Models},
  author={Qwen Team},
  year={2024},
  howpublished={\url{https://huggingface.co/Qwen/Qwen2.5-0.5B}}
}

Dataset Citations

GSM8K: Cobbe et al., 2021
NuminaMath-CoT: AI-MO/NuminaMath-CoT

Acknowledgments

This model was developed as part of a coding challenge focused on optimizing small language models for mathematical reasoning tasks. The approach combines efficient fine-tuning techniques (QLoRA) with curated dataset mixtures to improve performance on the GSM8K benchmark.
