---
library_name: transformers
tags: []
---
## Model Description

This is the DPO model from our Mixture-of-Agents Alignment (MoAA) pipeline, fine-tuned from Gemma-2-9b-it. MoAA is an approach that leverages the collective intelligence of open-source LLMs to advance model alignment.

Our MoAA method involves two main stages. In the first stage, we employ MoA (Mixture of Agents) to produce high-quality synthetic data for supervised fine-tuning (sketched below). In the second stage, we combine multiple LLMs into a reward model that provides preference annotations.
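To make the first stage concrete, here is a minimal, hedged sketch of MoA-style data synthesis: several open-source proposer models each answer a prompt, and an aggregator model fuses their answers into the SFT target. The `chat` helper, the listed model names, and the aggregation prompt are illustrative assumptions, not the exact recipe from the paper.

```
# Illustrative sketch of MoAA stage 1 (MoA-generated SFT data); not the exact recipe.

def chat(model: str, messages: list) -> str:
    """Hypothetical helper: send chat `messages` to `model` and return the reply text.
    Replace with your own inference client (e.g. an OpenAI-compatible endpoint)."""
    raise NotImplementedError

PROPOSERS = ["Qwen2-72B-Instruct", "Llama-3.1-70B-Instruct", "Gemma-2-27b-it"]  # illustrative choices
AGGREGATOR = "Qwen2-72B-Instruct"  # illustrative choice

def moa_synthesize(prompt: str) -> str:
    # 1) Each proposer answers the prompt independently.
    proposals = [chat(m, [{"role": "user", "content": prompt}]) for m in PROPOSERS]
    # 2) The aggregator reads all proposals and writes a single refined response,
    #    which becomes the supervised fine-tuning target for this prompt.
    aggregation_prompt = (
        "Synthesize the candidate responses below into a single, higher-quality answer "
        "to the user's request.\n\n"
        + "\n\n".join(f"Candidate {i + 1}:\n{p}" for i, p in enumerate(proposals))
        + f"\n\nUser request:\n{prompt}"
    )
    return chat(AGGREGATOR, [{"role": "user", "content": aggregation_prompt}])
```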
Some key takeaways of our work:
- 📈 **Alignment pipeline that actually works.** MoAA lifts Llama‑3.1‑8B‑Instruct's Arena‑Hard score from **19 → 48** and Gemma‑2‑9B‑it's from **42 → 56**, handily beating GPT‑4o‑labeled datasets available at the time.
- 🏆 **Ensembled rewards > single critics.** An MoA reward model with dynamic criteria filtering edges out the competitive ArmoRM on MT‑Bench and Arena‑Hard, while staying 100% open source.
- 🚀 **Self‑improvement unlocked.** Fine‑tuning the strongest model inside the ensemble on MoAA data lets it *surpass its own teachers*, evidence that open models can push past proprietary ceilings without external supervision.

## Model Sources

For more details, refer to:

- **[Paper](https://arxiv.org/abs/2505.03059)**
<!-- - **[Twitter](https://arxiv.org/abs/2505.03059)**
- **[Blogpost](https://arxiv.org/abs/2505.03059)** -->
## How to Get Started with the Model

Use the code below to load the model and run inference:
```
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/gemma-2-9b-it-MoAA-DPO")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/gemma-2-9b-it-MoAA-DPO", device_map="auto")

# Build a prompt with the Gemma-2 chat template and generate a response
messages = [{"role": "user", "content": "Explain DPO in one paragraph."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
## Training Data
We sample five responses from the previously trained SFT model and use a reward model to select the preferred and rejected responses for preference learning. Specifically, for each prompt the reward model identifies the highest-scoring response as the "chosen" response and the lowest-scoring response as the "rejected" response. Here we propose a novel technique that uses MoA itself as the reward model; a rough sketch follows.
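The snippet below is a minimal, hedged sketch of that pair-construction step, assuming a Hugging Face SFT checkpoint and a hypothetical `moa_reward` scoring function standing in for the MoA reward model (with its dynamic criteria filtering) described in the paper; the sampling hyperparameters are illustrative.

```
# Illustrative sketch of MoAA stage 2: build DPO preference pairs from the SFT model's samples.

def moa_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for the MoA reward model (an ensemble of open-source LLM judges)."""
    raise NotImplementedError

def build_preference_pair(prompt: str, sft_model, tokenizer, num_samples: int = 5) -> dict:
    # 1) Sample several candidate responses from the previously trained SFT model.
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}], add_generation_prompt=True, return_tensors="pt"
    ).to(sft_model.device)
    outputs = sft_model.generate(
        input_ids, do_sample=True, temperature=0.8, num_return_sequences=num_samples, max_new_tokens=512
    )
    candidates = [tokenizer.decode(o[input_ids.shape[-1]:], skip_special_tokens=True) for o in outputs]
    # 2) Score each candidate and keep the best as "chosen", the worst as "rejected".
    ranked = sorted(candidates, key=lambda r: moa_reward(prompt, r))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}
```

Pairs in this `prompt` / `chosen` / `rejected` format match what standard preference-optimization libraries (e.g., TRL's `DPOTrainer`) expect.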
## Evaluation & Performance

Refer to the [Paper](https://arxiv.org/abs/2505.03059) for metrics.
## Citation
```
@article{wang2025improving,
  title   = {Improving Model Alignment Through Collective Intelligence of Open-Source LLMs},
  author  = {Junlin Wang and Roy Xie and Shang Zhu and Jue Wang and Ben Athiwaratkun and Bhuwan Dhingra and Shuaiwen Leon Song and Ce Zhang and James Zou},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.03059}
}
```