|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: Qwen/Qwen2.5-0.5B |
|
|
datasets: |
|
|
- trl-lib/Capybara |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- conversational |
|
|
- multi-turn |
|
|
- instruction-following |
|
|
- reasoning |
|
|
- sft |
|
|
- trl |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# Qwen2.5-0.5B-Capybara |
|
|
|
|
|
This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) on the [trl-lib/Capybara](https://huggingface.co/datasets/trl-lib/Capybara) dataset using Supervised Fine-Tuning (SFT). |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**Qwen2.5-0.5B-Capybara** is trained on the Capybara dataset, a high-quality multi-turn conversational dataset that emphasizes: |
|
|
|
|
|
- **Reasoning and Logic**: Strong focus on extrapolation and logical thinking |
|
|
- **Information Diversity**: Wide range of domains including STEM, pop-culture, and general knowledge |
|
|
- **Multi-turn Conversations**: An average of 3+ turns per conversation, with contexts of 1,000+ tokens |
|
|
- **Natural Prose**: Maintains conversational flow while exploring complex topics |
|
|
|
|
|
The Capybara dataset was created using the Amplify-Instruct synthesis method, combining techniques from Airoboros, Evol-Instruct (WizardLM), Orca, and other high-performing datasets. |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
| Parameter | Value |
|-----------|-------|
| Base Model | Qwen/Qwen2.5-0.5B |
| Training Method | Supervised Fine-Tuning (SFT) |
| Dataset | trl-lib/Capybara (15,806 samples) |
| Epochs | 3 |
| Per-device Batch Size | 8 |
| Gradient Accumulation Steps | 4 |
| Effective Batch Size | 32 |
| Learning Rate | 2e-5 |
| LR Scheduler | Linear |
| Precision | BF16 |
| Max Sequence Length | 1024 tokens |
| Optimizer | AdamW (fused) |
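
Under these settings, a minimal TRL training script would look roughly like the sketch below. Argument names follow recent TRL and transformers releases and may differ slightly across versions; `output_dir` is illustrative.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="Qwen2.5-0.5B-Capybara",  # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,       # effective batch size: 8 * 4 = 32
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    bf16=True,
    max_length=1024,                     # called max_seq_length in older TRL releases
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
    use_liger_kernel=True,               # requires the liger-kernel package
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",           # TRL loads the model from the Hub ID
    train_dataset=dataset,
    args=config,
)
trainer.train()
```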
|
|
|
|
|
### Memory Optimizations |
|
|
|
|
|
- **Liger Kernel**: Enabled, reducing VRAM usage by up to ~60% |
|
|
- **Gradient Checkpointing**: Enabled |
|
|
- **BF16 Mixed Precision**: Enabled |
|
|
|
|
|
### Training Infrastructure |
|
|
|
|
|
- Framework: [TRL](https://github.com/huggingface/trl) (Transformer Reinforcement Learning) |
|
|
- Hardware: Single GPU (8GB VRAM) |
|
|
- Training Time: ~5 minutes per step |
|
|
|
|
|
### Training Progress |
|
|
|
|
|
> **Note**: This is an early checkpoint from an ongoing training run, saved at step 42 of 1,482 total steps (~3% of training). |
|
|
|
|
|
| Metric | Value |
|--------|-------|
| Steps Completed | 42 / 1,482 |
| Training Progress | ~3% |
| Initial Loss (step 1) | 1.9030 |
| Loss at Step 42 | 1.3058 |
|
|
|
|
|
## Dataset Information |
|
|
|
|
|
The [trl-lib/Capybara](https://huggingface.co/datasets/trl-lib/Capybara) dataset contains: |
|
|
|
|
|
- **15,806 training samples** of multi-turn conversations |
|
|
- **Sources**: GPT4LLM, GOAT, EverythingLM, Know-Logic, SuperCOT, Airoboros, Dove, TheoremQA, TaskSource, General-Instruct |
|
|
- **Format**: Conversation messages with user/assistant roles (see the loading example below) |
|
|
- **Quality**: Aggressively filtered to remove alignment artifacts and common undesirable behaviors |
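
The conversations can be inspected directly with the `datasets` library; a minimal sketch (the `messages` column name follows the dataset card):

```python
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")
print(dataset)  # ~15.8k rows

# Each sample is a list of {"role": ..., "content": ...} messages
for message in dataset[0]["messages"]:
    print(message["role"], "->", message["content"][:80])
```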
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer
model = AutoModelForCausalLM.from_pretrained("BurnyCoder/Qwen2.5-0.5B-Capybara")
tokenizer = AutoTokenizer.from_pretrained("BurnyCoder/Qwen2.5-0.5B-Capybara")

messages = [
    {"role": "user", "content": "Explain the concept of recursion in programming."}
]

# Format the conversation with the model's chat template and append the
# generation prompt so the model answers as the assistant
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
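
For quick experiments, the same message format also works with the `transformers` `pipeline` API, which applies the chat template automatically. A minimal sketch; the indexing assumes the chat-style output format of recent transformers releases:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="BurnyCoder/Qwen2.5-0.5B-Capybara")

messages = [
    {"role": "user", "content": "Explain the concept of recursion in programming."}
]

# The pipeline returns the whole conversation; the last message is the model's reply
result = pipe(messages, max_new_tokens=512)
print(result[0]["generated_text"][-1]["content"])
```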
|
|
|
|
|
## Limitations |
|
|
|
|
|
- This is an early checkpoint (~3% trained) and may not reflect full model capabilities |
|
|
- At 494M parameters, the model is suitable for edge deployment but limited compared to larger models |
|
|
- Training was conducted on a single GPU with memory optimizations |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex
@misc{qwen2.5-0.5b-capybara,
  author = {BurnyCoder},
  title = {Qwen2.5-0.5B-Capybara: Multi-turn Conversational Fine-tuning},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/BurnyCoder/Qwen2.5-0.5B-Capybara}
}
```
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- [Qwen Team](https://huggingface.co/Qwen) for the base model |
|
|
- [TRL Library](https://github.com/huggingface/trl) for the training framework |
|
|
- [Capybara Dataset](https://huggingface.co/datasets/trl-lib/Capybara) creators |
|
|
|