qwen0.5b-tech-interview-test
This model is Qwen/Qwen2.5-0.5B fine-tuned on mathematical reasoning tasks. It was trained with TRL using QLoRA (Quantized LoRA), and the adapter weights were then merged into the base model.
Model Details
- Base Model: Qwen/Qwen2.5-0.5B
- Fine-tuning Method: QLoRA (Quantized LoRA) followed by weight merging
- Task: Mathematical reasoning (GSM8K benchmark)
- Training Framework: TRL (Transformer Reinforcement Learning)
Training Data
The model was fine-tuned on a mixture of datasets:
- GSM8K (15.7%): 7,473 samples from the GSM8K training set (human-written natural reasoning)
- NuminaMath-CoT (84.3%): 40,000 samples from the NuminaMath-CoT dataset (model-generated CoT examples)
Total training samples: 47,473
Train/Test Split: 90%/10% (42,726 train / 4,747 test)
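The mixture percentages and split sizes above follow from simple arithmetic. The sketch below reproduces them, assuming the 90%/10% split was rounded to the nearest sample; the card does not say how the split was actually drawn.

```python
# Sample counts taken from the list above.
counts = {"GSM8K": 7_473, "NuminaMath-CoT": 40_000}
total = sum(counts.values())

# Share of each dataset in the mixture, as a percentage.
shares = {name: 100 * n / total for name, n in counts.items()}

# Assumption: the train split is 90% of the total, rounded to the nearest sample.
n_train = round(0.9 * total)
n_test = total - n_train

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"total={total}, train={n_train}, test={n_test}")
```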
Dataset Composition Strategy
The combination strategy aimed to balance:
- Natural human reasoning patterns from GSM8K
- Diverse Chain-of-Thought (CoT) patterns from NuminaMath-CoT
Both datasets were converted to a unified messages format compatible with Qwen's chat template.
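A conversion to a unified messages format might look like the sketch below. The field names (`question`/`answer` for GSM8K, `problem`/`solution` for NuminaMath-CoT) are assumptions about the dataset schemas, and both helper functions are hypothetical; the card does not show the actual preprocessing code.

```python
def to_messages(prompt: str, completion: str) -> dict:
    """Build a single-turn chat example in the messages format."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]
    }

def gsm8k_to_messages(record: dict) -> dict:
    # Assumed GSM8K fields: "question" and "answer".
    return to_messages(record["question"], record["answer"])

def numina_to_messages(record: dict) -> dict:
    # Assumed NuminaMath-CoT fields: "problem" and "solution".
    return to_messages(record["problem"], record["solution"])

# Illustrative record, not copied from either dataset.
example = gsm8k_to_messages({"question": "What is 48 + 24?", "answer": "48 + 24 = 72\n#### 72"})
print(example["messages"][0]["role"], "->", example["messages"][1]["role"])
```

Once both sources share this shape, they can be concatenated and passed through the tokenizer's chat template in the same way.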
Evaluation Results
GSM8K Benchmark
| Metric | Method | Few-shot | Score | Std Error |
|---|---|---|---|---|
| exact_match | flexible-extract | 5 | 34.12% | ±1.31% |
| exact_match | strict-match | 5 | 33.59% | ±1.30% |
- Baseline (Qwen2.5-0.5B-Instruct): 34.42% (flexible-extract), 31.69% (strict-match)
- Comparison with baseline:
  - Flexible-extract: Comparable performance (34.12% vs 34.42%, a 0.30-point gap that is within one standard error)
  - Strict-match: +1.90 percentage points (33.59% vs 31.69%)
- Note: This model was fine-tuned on a curated dataset mixture of 47,473 samples to improve mathematical reasoning capabilities
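The two scoring methods differ in how the final answer is pulled out of the generated text. The sketch below illustrates the idea with simplified patterns: strict matching requires the GSM8K-style `#### <number>` marker, while flexible extraction takes the last number found anywhere. These regular expressions and function names are assumptions for illustration, not the evaluation harness's exact filters.

```python
import re

# Simplified patterns (assumptions, not the harness's exact filters).
STRICT_PATTERN = re.compile(r"#### (-?[0-9.,]+)")
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def normalize(number: str) -> str:
    """Drop thousands separators and a trailing period."""
    return number.replace(",", "").rstrip(".")

def strict_extract(text: str):
    """Return the number after '#### ', or None if the marker is missing."""
    match = STRICT_PATTERN.search(text)
    return normalize(match.group(1)) if match else None

def flexible_extract(text: str):
    """Return the last number that appears anywhere in the text."""
    numbers = NUMBER_PATTERN.findall(text)
    return normalize(numbers[-1]) if numbers else None

with_marker = "48 / 2 = 24 clips in May. 48 + 24 = 72\n#### 72"
without_marker = "She sold 72 clips in total."
print(strict_extract(with_marker), flexible_extract(with_marker))
print(strict_extract(without_marker), flexible_extract(without_marker))
```

A response that reaches the right number but omits the marker scores under flexible extraction only, which is one reason the two rows in the table can differ.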
Evaluation Details
- Evaluation Tool: EleutherAI's lm-evaluation-harness
- Inference Engine: vLLM (for efficient batch inference)
- Test Samples: 1,319 (GSM8K test split)
- Generation Settings: temperature=0.0, do_sample=False, max_tokens=256
- Evaluation Method: Few-shot evaluation with 5 examples
- Data Leakage Prevention: Only the GSM8K test split was used for evaluation; only the train split was used for training
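The standard errors reported in the results table are consistent with a simple binomial estimate over the 1,319 test questions. The sketch below assumes each question is an independent pass/fail trial; the evaluation harness may compute the value slightly differently.

```python
import math

N_TEST = 1319  # GSM8K test questions, as listed above

# Accuracy per scoring method, from the results table.
scores = {"flexible-extract": 0.3412, "strict-match": 0.3359}

def binomial_stderr(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)

for method, p in scores.items():
    se = binomial_stderr(p, N_TEST)
    print(f"{method}: {100 * p:.2f}% +/- {100 * se:.2f}%")
```

With roughly 1.3 points of standard error per score, differences of a point or two between models on this benchmark should be read cautiously.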
Training Procedure
Training Hyperparameters
- Learning Rate: 2e-5 (increased from 5e-6 for faster convergence)
- Training Epochs: 2 (with early stopping)
- Batch Size: 1 (per device)
- Effective Batch Size: 8 (with gradient accumulation)
- Gradient Accumulation Steps: 8 (increased from 4 for stable gradients)
- Weight Decay: 0.01
- Max Gradient Norm: 1.0
- Max Sequence Length: 2048
- Warmup Ratio: 0.15 (increased from 0.05 for better training stability)
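To show how the learning rate and warmup ratio interact with the cosine decay schedule listed under Training Process, here is a minimal sketch of linear warmup followed by cosine decay. The total step count is hypothetical, and the function is an illustration rather than the scheduler code used in training.

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-5, warmup_ratio: float = 0.15) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: progress runs from 0 (end of warmup) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 1000  # hypothetical; the card does not report the step count
for step in (0, 75, 150, 575, 1000):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.2e}")
```

A longer warmup delays the point at which the peak rate of 2e-5 is reached, which is the stability benefit the warmup entry refers to.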
QLoRA Configuration
- Quantization: 8-bit (BitsAndBytes)
- Quantization Config: llm_int8_threshold=6.0, llm_int8_has_fp16_weight=False, llm_int8_enable_fp32_cpu_offload=False
- LoRA Rank (r): 32 (increased from 16 for more capacity)
- LoRA Alpha: 64 (increased from 32, typically 2x rank)
- LoRA Dropout: 0.1
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Trainable Parameters: ~17.6M (3.4% of 511.6M total parameters)
- Gradient Checkpointing: Enabled (for memory efficiency)
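The trainable parameter count can be sanity-checked from the LoRA rank and the shapes of the target modules. The layer dimensions below are assumptions about the Qwen2.5-0.5B architecture (they are not stated in this card), so treat the result as a rough check rather than an authoritative figure.

```python
# Assumed Qwen2.5-0.5B dimensions (not stated in this card).
HIDDEN = 896          # hidden size
INTERMEDIATE = 4864   # MLP intermediate size
KV_DIM = 128          # 2 key/value heads x head size 64
NUM_LAYERS = 24

RANK = 32  # LoRA rank from the configuration above

# (in_features, out_features) for each target module.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each wrapped linear layer."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
trainable = per_layer * NUM_LAYERS
print(f"{trainable:,} trainable parameters (~{trainable / 1e6:.1f}M)")
print(f"{100 * trainable / 511.6e6:.1f}% of 511.6M")
```

Under these assumptions the count comes to 17,596,416, which agrees with the ~17.6M and 3.4% figures listed above.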
Training Process
The model was trained using:
- Training Framework: TRL SFTTrainer with QLoRA
- Data Formatting: Qwen chat template applied to messages format
- Evaluation Strategy: Steps (every 250 steps)
- Checkpoint Saving: Every 500 steps
- Early Stopping: Enabled with patience=3 (based on eval_loss)
- Best Model Selection: Based on lowest eval_loss
- Optimizer: paged_adamw_8bit (8-bit AdamW optimizer for memory efficiency)
- Learning Rate Schedule: Cosine decay
- Packing: Enabled (for efficient batch processing)
- Model Merging: LoRA weights merged with base model after training for inference
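The interaction between the evaluation interval, early stopping, and best-model selection can be sketched in plain Python. The loss values are made up for illustration, and the function is a hypothetical stand-in for the trainer's early-stopping callback, not its actual implementation.

```python
EVAL_EVERY = 250  # evaluation interval in steps, from the list above

def run_early_stopping(eval_losses, patience=3):
    """Return (best_index, stop_index) for a sequence of eval losses.

    Training stops once `patience` consecutive evaluations fail to
    improve on the best loss seen so far.
    """
    best_loss = float("inf")
    best_index = None
    evals_without_improvement = 0
    for index, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_index = loss, index
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return best_index, index
    return best_index, len(eval_losses) - 1

# Hypothetical eval_loss values, one per evaluation.
losses = [1.20, 1.05, 0.98, 0.99, 1.00, 1.01, 0.97]
best, stop = run_early_stopping(losses)
print(f"best eval at step {(best + 1) * EVAL_EVERY}, stopped at step {(stop + 1) * EVAL_EVERY}")
```

In this made-up run the last value is never reached, because three evaluations in a row failed to beat the best loss before it.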
Key Optimizations
- Dataset Curation: Combined GSM8K (human-written) and NuminaMath-CoT (model-generated) for balanced learning
- Hyperparameter Tuning: Increased learning rate and warmup ratio for better convergence
- Memory Efficiency: 8-bit quantization + gradient checkpointing + LoRA adapters
- Training Stability: Gradient accumulation and early stopping to prevent overfitting
Model Usage
Using Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dongwookkwon/qwen0.5b-tech-interview-test"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Format your question
question = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
messages = [
    {"role": "user", "content": question}
]

# Apply chat template; return_dict=True also returns the attention mask
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate with greedy decoding (do_sample=False, so no temperature is needed)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
)

# Decode only the newly generated tokens, not the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)
Using vLLM (for faster inference)
from vllm import LLM, SamplingParams

model = LLM(
    model="dongwookkwon/qwen0.5b-tech-interview-test",
    trust_remote_code=True,
    dtype="float16",
    gpu_memory_utilization=0.5,
)

# temperature=0.0 selects greedy decoding
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=256,
)

prompt = "Question: Natalia sold clips to 48 of her friends in April..."
outputs = model.generate([prompt], sampling_params)

# Each result holds one or more completions; print the first one
print(outputs[0].outputs[0].text)
Limitations
- Domain Specificity: This model is fine-tuned specifically for mathematical reasoning tasks and may not perform well on other domains
- Model Size: The 0.5B parameter size limits reasoning capabilities compared to larger models (7B+)
- Problem Complexity: Performance may vary depending on the complexity of mathematical problems
- Data Dependency: Model performance is dependent on the quality and diversity of the training data mixture
- Inference Requirements: While optimized for inference, the model still requires GPU resources for best performance
Training Infrastructure
Framework Versions
- TRL: 0.24.0 (SFTTrainer)
- Transformers: 4.57.1
- PyTorch: 2.8.0
- Datasets: 4.3.0
- PEFT: Latest (for LoRA/QLoRA support)
- BitsAndBytes: Latest (for 8-bit quantization)
- Accelerate: >=0.26.0
- lm-evaluation-harness: 0.4.9.1 (for evaluation)
- vLLM: Latest (for efficient batch inference during evaluation)
Hardware Requirements
- Training: GPU with CUDA support (tested on A100, T4)
- Inference: GPU recommended for best performance
- Memory: ~8GB VRAM minimum for 8-bit QLoRA training
Citation
If you use this model, please cite:
@misc{qwen0.5b-tech-interview-test,
title={qwen0.5b-tech-interview-test: Fine-tuned Qwen2.5-0.5B for Mathematical Reasoning},
author={Dongwook Kwon},
year={2024},
howpublished={\url{https://huggingface.co/dongwookkwon/qwen0.5b-tech-interview-test}}
}
Base Model Citation
@misc{qwen2.5,
title={Qwen2.5: A Party of Foundation Models},
author={Qwen Team},
year={2024},
howpublished={\url{https://huggingface.co/Qwen/Qwen2.5-0.5B}}
}
Dataset Citations
- GSM8K: Cobbe et al., 2021
- NuminaMath-CoT: AI-MO/NuminaMath-CoT
Acknowledgments
This model was developed as part of a coding challenge focused on optimizing small language models for mathematical reasoning tasks. The approach combines efficient fine-tuning techniques (QLoRA) with curated dataset mixtures to improve performance on the GSM8K benchmark.