# TITLE
This model is a fine-tuned version of BASE_MODEL, trained with Direct Preference Optimization (DPO) using the Unsloth library.
This repository contains the fully merged 16-bit weights; no adapter loading is required.
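Because the weights are fully merged, the model loads like any standard checkpoint. A minimal sketch with the `transformers` library, assuming a hypothetical repository id `user/model-dpo` (substitute the actual repo name); note there is no PEFT/adapter step:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "user/model-dpo"  # hypothetical placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Merged 16-bit weights load directly; no separate adapter is applied.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```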
## Training Configuration
- Base model: BASE_MODEL
- Method: DPO (Direct Preference Optimization)
- Epochs: EPOCHS
- Learning rate: LR
- Beta: BETA
- Max sequence length: MAXLEN
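The beta hyperparameter above controls how strongly DPO penalizes the policy for drifting from the reference model. A minimal pure-Python sketch of the per-pair DPO loss (assuming summed log-probabilities of each full response are already available; in practice a trainer such as TRL's `DPOTrainer` computes these):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Negative log-sigmoid of the margin: the loss shrinks as the
    # policy widens the gap in favor of the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; pushing probability mass toward the chosen response drives it lower.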
## Sources & License
- Training Data: DATASET
- Compliance: Users must follow the original base model's license terms.