---
base_model: mistralai/Mistral-7B-v0.1
datasets:
  - siqi00/mistral_metamath_question_0.7_1.0_50_256
library_name: transformers
license: apache-2.0
tags:
  - alignment-handbook
  - generated_from_trainer
pipeline_tag: text-generation
model-index:
  - name: MetaMath-Mistral-7B-DFT
    results: []
---

MetaMath-Mistral-7B-DFT

This model is a fine-tuned version of mistralai/Mistral-7B-v0.1 on the siqi00/mistral_metamath_question_0.7_1.0_50_256 dataset.

The model was presented in the paper Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data. The official code is available at: https://github.com/PenGuln/DFT.

Model description

Discriminative Fine-Tuning (DFT) is an improved variant of Supervised Fine-Tuning (SFT) for aligning Large Language Models (LLMs), designed to overcome the limitations of generative training objectives without requiring human-labeled preference data or strong reward models. Unlike SFT, which uses a generative approach and overlooks negative data, DFT adopts a discriminative paradigm. It aims to increase the probability of positive answers while simultaneously suppressing potentially negative ones, shifting the focus from token prediction to data prediction.

Key Contributions:

  • Discriminative Probabilistic Framework: DFT introduces a novel framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input.
  • Efficient Optimization Algorithms: It includes efficient algorithms designed to optimize this discriminative likelihood.
  • Strong Performance: Extensive experiments demonstrate DFT's effectiveness, achieving performance better than SFT and comparable to, if not better than, the SFT followed by Preference Optimization (SFT→PO) pipeline.
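Schematically, whereas SFT maximizes the generative (token-level) log-likelihood of the positive answer, DFT maximizes a discriminative likelihood that contrasts the positive answer $\mathbf{y}$ against other outputs. The following is a simplified sketch, not the paper's exact formulation: the true objective normalizes over all possible outputs, which the paper approximates efficiently using sampled negatives $\mathbf{y}'$.

```latex
% SFT: generative objective (token prediction)
\mathcal{L}_{\text{SFT}}(\theta) = -\log p_\theta(\mathbf{y} \mid \mathbf{x})

% DFT: discriminative objective (data prediction), schematic form --
% the normalizer over all outputs is approximated with sampled negatives y'
\mathcal{L}_{\text{DFT}}(\theta) \approx
  -\log \frac{p_\theta(\mathbf{y} \mid \mathbf{x})}
             {p_\theta(\mathbf{y} \mid \mathbf{x})
              + \sum_{\mathbf{y}'} p_\theta(\mathbf{y}' \mid \mathbf{x})}
```

Minimizing this loss raises the probability of the positive answer while pushing down the probability of the sampled negatives, which is the "data prediction" view described above.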

Intended uses & limitations

Intended Uses: MetaMath-Mistral-7B-DFT is primarily intended for mathematical reasoning and general language generation tasks. The underlying DFT method offers an effective fine-tuning approach for LLMs, especially in scenarios where collecting extensive human-labeled preference data for alignment is challenging. The model can be used for research on LLM alignment and for applications requiring robust, accurate text generation.

Limitations: As a large language model, this model may inherit biases from its pre-training and fine-tuning data. While DFT aims to suppress negative outputs, it's crucial to evaluate its behavior for specific applications to mitigate potential factual inaccuracies or undesirable content generation. Users should implement appropriate safeguards when deploying the model in production environments.

Training and evaluation data

This model was fine-tuned on the siqi00/mistral_metamath_question_0.7_1.0_50_256 dataset, which contains mathematical reasoning questions and generated negative samples. The underlying data for mathematical reasoning comes from MetaMathQA.

For general language tasks (used to evaluate the DFT method itself, rather than this specific model), the paper used datasets derived from HuggingFaceH4/ultrafeedback_binarized, treating winning responses as ground truth.

How to use

You can use this model for text generation with the Hugging Face transformers library.

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "ilgee/MetaMath-Mistral-7B-DFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16, # or torch.float16 if bfloat16 is not supported
    device_map="auto",
    trust_remote_code=True,
)

# Example for Text Generation
text = "Question: What is the capital of France?\n\nAnswer:"
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
print(pipe(text, max_new_tokens=30, do_sample=False)[0]["generated_text"])

# Example for Chat Completion (using the model's chat template)
messages = [{"role": "user", "content": "Hi! How are you?"}]
chat_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(pipe(chat_prompt, max_new_tokens=50, do_sample=True)[0]["generated_text"])

Performance

The model's performance on various benchmarks, as reported in the paper and GitHub repository, is summarized below.

Mathematical Reasoning

Trained on MetaMathQA. The base model is mistralai/Mistral-7B-v0.1. The generated negative samples $\mathbf{y}'$ can be found at siqi00/mistral_metamath_question_0.7_1.0_50_256.

| Method                  | GSM8K | MATH  |
|-------------------------|-------|-------|
| MetaMath-7B             | 66.5  | 19.8  |
| MetaMath-Mistral-7B     | 77.7  | 28.2  |
| MetaMath-Mistral-7B-DFT | 79.15 | 28.34 |
| MetaMath-Mistral-7B-DFT2| 78.77 | 28.62 |

General Language Tasks

Trained on HuggingFaceH4/ultrafeedback_binarized, treating the winning responses $\mathbf{y}_w$ as ground truth and discarding all losing responses $\mathbf{y}_l$. The base model is mistralai/Mistral-7B-v0.1. The generated negative samples $\mathbf{y}'$ can be found at siqi00/mistral_ultrafeedback_unhelpful_chatprompt_0.7_1.0_50_320.

| Method   | MMLU  | TruthfulQA | HellaSwag | Winogrande | GSM8k | ARC   | IFEval | Avg.  |
|----------|-------|------------|-----------|------------|-------|-------|--------|-------|
| SFT      | 62.18 | 50.04      | 83.59     | 78.06      | 45.26 | 63.65 | 49.72  | 61.79 |
| SPIN     | 61.99 | 49.91      | 83.75     | 77.90      | 46.02 | 61.95 | 23.11  | 57.80 |
| SimPO    | 62.39 | 52.08      | 83.89     | 78.14      | 2.58  | 61.86 | 18.85  | 51.40 |
| SimPO-SFT| 62.28 | 49.59      | 83.46     | 77.90      | 42.53 | 61.52 | 43.62  | 60.13 |
| KTO      | 61.59 | 49.32      | 82.88     | 79.24      | 43.97 | 61.60 | 38.08  | 59.53 |
| ORPO     | 62.26 | 48.26      | 83.07     | 79.16      | 45.41 | 62.20 | 53.41  | 61.97 |
| DPO-p    | 62.01 | 48.66      | 84.03     | 78.61      | 40.48 | 62.20 | 25.32  | 57.33 |
| DFT      | 61.69 | 52.23      | 83.95     | 78.37      | 48.22 | 64.25 | 51.20  | 62.84 |
| DFT2     | 61.66 | 54.14      | 83.20     | 77.82      | 45.49 | 64.42 | 51.20  | 62.56 |
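As a sanity check, the Avg. column is the arithmetic mean of the seven benchmark scores. For example, for the DFT row:

```python
# Verify the reported Avg. for the DFT row: the mean of the seven benchmark
# scores (MMLU, TruthfulQA, HellaSwag, Winogrande, GSM8k, ARC, IFEval).
dft_scores = [61.69, 52.23, 83.95, 78.37, 48.22, 64.25, 51.20]
avg = sum(dft_scores) / len(dft_scores)
print(round(avg, 2))  # 62.84, matching the table
```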

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 8e-07
  • train_batch_size: 4
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 128
  • total_eval_batch_size: 64
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 3
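The total train batch size follows from the per-device batch size, the number of devices, and the gradient accumulation steps:

```python
# Effective batch size = per-device batch size * number of devices
#                        * gradient accumulation steps
train_batch_size = 4
num_devices = 8
gradient_accumulation_steps = 4
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
print(total_train_batch_size)  # 128, matching the reported total_train_batch_size
```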

Training results

Please refer to the original paper and GitHub repository for detailed training results and performance metrics on various benchmarks.

Framework versions

  • Transformers 4.45.2
  • Pytorch 2.1.0+cu121
  • Datasets 3.2.0
  • Tokenizers 0.20.3

Citation

If you find this model or the related paper useful, please cite:

@inproceedings{guo2025discriminativefinetuninggenerativelarge,
      title={Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data}, 
      author={Siqi Guo and Ilgee Hong and Vicente Balmaseda and Changlong Yu and Liang Qiu and Xin Liu and Haoming Jiang and Tuo Zhao and Tianbao Yang},
      year={2025},
      booktitle={Proceedings of the International Conference on Machine Learning},
      url={https://arxiv.org/abs/2502.18679}, 
}