wzhouad
/

zephyr-7B-WPO-HB

Text Generation

text-generation-inference

Model card Files Files and versions

Description

mistral-7b-sft-beta model finetuned by hybrid WPO (GPT-4-turbo + on-policy sampling + Ultrafeedback). Details in WPO: Enhancing RLHF with Weighted Preference Optimization. The training data is wzhouad/zephyr-ultrafeedback-hybrid.

License

This model is licensed under the Zoom software license and is permitted for use only for noncommercial, educational, or academic research purposes.

Downloads last month: 1

Safetensors

Model size

7B params

Tensor type

F32

·

Dataset used to train wzhouad/zephyr-7B-WPO-HB

Collection including wzhouad/zephyr-7B-WPO-HB

WPO

Models and datasets in paper "WPO: Enhancing RLHF with Weighted Preference Optimization". • 11 items • Updated Aug 22, 2024 • 7

Paper for wzhouad/zephyr-7B-WPO-HB

WPO: Enhancing RLHF with Weighted Preference Optimization

Paper • 2406.11827 • Published Jun 17, 2024 • 17