---
base_model: mistralai/Mistral-7B-v0.1
datasets:
- siqi00/mistral_metamath_question_0.7_1.0_50_256
library_name: transformers
license: apache-2.0
tags:
- alignment-handbook
- generated_from_trainer
pipeline_tag: text-generation
model-index:
- name: MetaMath-Mistral-7B-DFT
  results: []
---

# MetaMath-Mistral-7B-DFT

This model is a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) on the [siqi00/mistral_metamath_question_0.7_1.0_50_256](https://huggingface.co/datasets/siqi00/mistral_metamath_question_0.7_1.0_50_256) dataset. The model was presented in the paper [Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data](https://huggingface.co/papers/2502.18679). The official code is available at [https://github.com/PenGuln/DFT](https://github.com/PenGuln/DFT).

## Model description

Discriminative Fine-Tuning (DFT) is an improved variant of Supervised Fine-Tuning (SFT) for aligning Large Language Models (LLMs), designed to overcome the limitations of generative training objectives without requiring human-labeled preference data or strong reward models. Unlike SFT, which uses a purely generative objective and overlooks negative data, DFT adopts a discriminative paradigm: it increases the probability of positive answers while simultaneously suppressing potentially negative ones, shifting the focus from token prediction to data prediction.

**Key Contributions:**

* **Discriminative Probabilistic Framework**: DFT introduces a novel framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input.
* **Efficient Optimization Algorithms**: It includes efficient algorithms designed to optimize this discriminative likelihood.
* **Strong Performance**: Extensive experiments demonstrate DFT's effectiveness, achieving performance better than SFT and comparable to, if not better than, the standard SFT followed by preference optimization (SFT→PO) pipeline.

## Intended uses & limitations

**Intended Uses:**

This model, MetaMath-Mistral-7B-DFT, is primarily intended for mathematical reasoning and general language generation tasks. It demonstrates an effective fine-tuning approach for LLMs, especially in scenarios where collecting extensive human-labeled preference data for alignment is challenging. It can be used for research on LLM alignment and for applications requiring robust and accurate text generation.

**Limitations:**

As a large language model, this model may inherit biases from its pre-training and fine-tuning data. While DFT aims to suppress negative outputs, it is crucial to evaluate its behavior for specific applications to mitigate potential factual inaccuracies or undesirable content generation. Users should implement appropriate safeguards when deploying the model in production environments.

## Training and evaluation data

This model was fine-tuned on the [siqi00/mistral_metamath_question_0.7_1.0_50_256](https://huggingface.co/datasets/siqi00/mistral_metamath_question_0.7_1.0_50_256) dataset, which contains mathematical reasoning questions together with generated negative samples. The underlying mathematical reasoning data comes from [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA).

For the general language experiments in the paper (which concern the DFT method rather than this specific model), training and evaluation used data derived from [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), where winning responses were treated as ground truth.

## How to use

You can use this model for text generation with the Hugging Face `transformers` library.
```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "ilgee/MetaMath-Mistral-7B-DFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # or torch.float16 if bfloat16 is not supported
    device_map="auto",
)

# Example: plain text generation
text = "Question: What is the capital of France? Answer:"
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
print(pipe(text, max_new_tokens=30, do_sample=False)[0]["generated_text"])

# Example: chat completion (using the model's chat template)
messages = [{"role": "user", "content": "Hi! How are you?"}]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(pipe(chat_prompt, max_new_tokens=50, do_sample=True)[0]["generated_text"])
```

## Performance

The model's performance on various benchmarks, as reported in the paper and GitHub repository, is summarized below.

### Mathematical Reasoning

Trained on [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA). The base model is [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). The generated negative samples $\mathbf{y}'$ can be found at [siqi00/mistral_metamath_question_0.7_1.0_50_256](https://huggingface.co/datasets/siqi00/mistral_metamath_question_0.7_1.0_50_256).

| Method | GSM8K | MATH |
|---|---|---|
| MetaMath-7B | 66.5 | 19.8 |
| MetaMath-Mistral-7B | 77.7 | 28.2 |
| MetaMath-Mistral-7B-DFT | **79.15** | 28.34 |
| MetaMath-Mistral-7B-DFT2 | 78.77 | **28.62** |

### General Language Tasks

Trained on [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized), i.e., the winning responses $\mathbf{y}_w$ are treated as the ground truth and all losing responses $\mathbf{y}_l$ are discarded.
The base model is [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). The generated negative samples $\mathbf{y}'$ can be found at [siqi00/mistral_ultrafeedback_unhelpful_chatprompt_0.7_1.0_50_320](https://huggingface.co/datasets/siqi00/mistral_ultrafeedback_unhelpful_chatprompt_0.7_1.0_50_320).

| Method | MMLU | TruthfulQA | HellaSwag | Winogrande | GSM8K | ARC | IFEval | Avg. |
|---|---|---|---|---|---|---|---|---|
| SFT | 62.18 | 50.04 | 83.59 | 78.06 | 45.26 | 63.65 | 49.72 | 61.79 |
| SPIN | 61.99 | 49.91 | 83.75 | 77.90 | 46.02 | 61.95 | 23.11 | 57.80 |
| SimPO | 62.39 | 52.08 | 83.89 | 78.14 | 2.58 | 61.86 | 18.85 | 51.40 |
| SimPO-SFT | 62.28 | 49.59 | 83.46 | 77.90 | 42.53 | 61.52 | 43.62 | 60.13 |
| KTO | 61.59 | 49.32 | 82.88 | 79.24 | 43.97 | 61.60 | 38.08 | 59.53 |
| ORPO | 62.26 | 48.26 | 83.07 | 79.16 | 45.41 | 62.20 | 53.41 | 61.97 |
| DPO-p | 62.01 | 48.66 | 84.03 | 78.61 | 40.48 | 62.20 | 25.32 | 57.33 |
| DFT | 61.69 | 52.23 | 83.95 | 78.37 | 48.22 | 64.25 | 51.20 | 62.84 |
| DFT2 | 61.66 | 54.14 | 83.20 | 77.82 | 45.49 | 64.42 | 51.20 | 62.56 |

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 8e-07
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 4
- total_train_batch_size: 128
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3

### Training results

Please refer to the original [paper](https://huggingface.co/papers/2502.18679) and [GitHub repository](https://github.com/PenGuln/DFT) for detailed training results and performance metrics on various benchmarks.
### Framework versions

- Transformers 4.45.2
- Pytorch 2.1.0+cu121
- Datasets 3.2.0
- Tokenizers 0.20.3

## Citation

If you find this model or the related paper useful, please cite:

```bibtex
@inproceedings{guo2025discriminativefinetuninggenerativelarge,
  title={Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data},
  author={Siqi Guo and Ilgee Hong and Vicente Balmaseda and Changlong Yu and Liang Qiu and Xin Liu and Haoming Jiang and Tuo Zhao and Tianbao Yang},
  year={2025},
  booktitle={Proceedings of the International Conference on Machine Learning},
  url={https://arxiv.org/abs/2502.18679},
}
```