---
library_name: transformers
license: mit
---

# Model Card for PAPRIKA Fine-Tuned Llama-3.1-8B-Instruct

This is a checkpoint obtained by fine-tuning meta-llama/Meta-Llama-3.1-8B-Instruct, first with supervised fine-tuning (SFT) and then with RPO, using the data and methodology described in our paper, [**"Training a Generally Curious Agent"**](https://arxiv.org/abs/2502.17543). In that work, we introduce PAPRIKA, a fine-tuning framework for teaching large language models (LLMs) strategic exploration.

## Model Details

### Model Description

This is the model card of a meta-llama/Meta-Llama-3.1-8B-Instruct model fine-tuned using PAPRIKA.

- **Finetuned from model:** meta-llama/Meta-Llama-3.1-8B-Instruct

### Model Sources

- **Repository:** [Official code release for "Training a Generally Curious Agent"](https://github.com/tajwarfahim/paprika)
- **Paper:** [Training a Generally Curious Agent](https://arxiv.org/abs/2502.17543)
- **Project website:** [paprika-llm.github.io](https://paprika-llm.github.io)

## Training Details

### Training Data

The training dataset for supervised fine-tuning is available here: [SFT dataset](https://huggingface.co/datasets/ftajwar/paprika_SFT_dataset)

The training dataset for preference fine-tuning is available here: [Preference learning dataset](https://huggingface.co/datasets/ftajwar/paprika_preference_dataset)

### Training Procedure

This [Weights & Biases run](https://wandb.ai/llm_exploration/paprika_more_data?nw=nwusertajwar) shows the training loss per gradient step for both supervised fine-tuning and preference fine-tuning.

#### Training Hyperparameters

For supervised fine-tuning, we use the AdamW optimizer with a learning rate of 1e-6, batch size 32, and cosine annealing learning-rate decay with warmup ratio 0.04; we train on a total of 17,181 trajectories.

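As a minimal sketch of the cosine annealing schedule with warmup ratio 0.04 used here, the helper below maps a gradient step to a learning-rate multiplier, mirroring the behavior of `get_cosine_schedule_with_warmup` in `transformers`. The helper itself and the per-epoch step count are illustrative, not taken from the PAPRIKA codebase:

```python
import math

def cosine_schedule_with_warmup(step, total_steps, warmup_ratio=0.04):
    """Learning-rate multiplier: linear warmup, then cosine decay toward 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# Illustration: 17,181 trajectories at batch size 32 -> 537 gradient steps
# if each trajectory is seen once per epoch (an assumption for this sketch).
total_steps = math.ceil(17181 / 32)
peak_lr = 1e-6  # SFT learning rate stated above
lrs = [peak_lr * cosine_schedule_with_warmup(s, total_steps) for s in range(total_steps)]
```

The multiplier rises linearly from 0 over the first 4% of steps, peaks at 1.0, then follows a half-cosine down to near 0 at the final step.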
For preference fine-tuning, we use the RPO objective with the AdamW optimizer, a learning rate of 2e-7, batch size 32, and cosine annealing learning-rate decay with warmup ratio 0.04; we train on a total of 5,260 (preferred, dispreferred) trajectory pairs.

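For intuition, RPO is commonly formulated as the DPO preference loss plus a weighted NLL term on the preferred response. The toy sketch below follows that common formulation on scalar sequence log-probabilities; the coefficients (`beta`, `alpha`) and all numeric values are illustrative assumptions, not hyperparameters from the paper:

```python
import math

def rpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
             chosen_len, beta=0.1, alpha=1.0):
    """RPO-style loss for one (preferred, dispreferred) trajectory pair.

    Inputs are summed sequence log-probs under the policy (pi_*) and the
    frozen reference model (ref_*). beta and alpha are illustrative.
    """
    # DPO term: -log sigmoid(beta * policy-vs-reference log-ratio gap)
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    dpo_term = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # NLL regularizer on the preferred trajectory (length-normalized)
    nll_term = -pi_chosen / chosen_len
    return dpo_term + alpha * nll_term

# Toy pair where the policy favors the preferred trajectory more than
# the reference model does, so the DPO margin is positive.
loss = rpo_loss(pi_chosen=-12.0, pi_rejected=-20.0,
                ref_chosen=-14.0, ref_rejected=-18.0,
                chosen_len=10)
```

Raising the policy's relative preference for the chosen trajectory shrinks both terms, so the loss decreases.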
#### Hardware

This model was fine-tuned on 8 NVIDIA L40S GPUs.

## Citation

**BibTeX:**

```
@misc{tajwar2025traininggenerallycuriousagent,
      title={Training a Generally Curious Agent},
      author={Fahim Tajwar and Yiding Jiang and Abitha Thankaraj and Sumaita Sadia Rahman and J Zico Kolter and Jeff Schneider and Ruslan Salakhutdinov},
      year={2025},
      eprint={2502.17543},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.17543},
}
```
| |
|
| | ## Model Card Contact |
| |
|
[Fahim Tajwar](mailto:tajwarfahim932@gmail.com)