---
library_name: transformers
license: mit
---
# Model Card for PAPRIKA Fine-Tuned Meta-Llama-3.1-8B-Instruct
This is a saved checkpoint from fine-tuning meta-llama/Meta-Llama-3.1-8B-Instruct, first with supervised fine-tuning and then with RPO, using the data and methodology described in our paper, [**"Training a Generally Curious Agent"**](https://arxiv.org/abs/2502.17543). In that work, we introduce PAPRIKA, a fine-tuning framework for teaching large language models (LLMs) strategic exploration.
## Model Details
### Model Description
This is the model card of a meta-llama/Meta-Llama-3.1-8B-Instruct model fine-tuned using PAPRIKA.
- **Finetuned from model:** meta-llama/Meta-Llama-3.1-8B-Instruct
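Below is a minimal usage sketch with Hugging Face Transformers. The repo id here is a placeholder (an assumption, not stated in this card): substitute this checkpoint's actual Hugging Face id. Since the base model is an instruct/chat model, generation goes through the chat template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Placeholder repo id -- replace with this checkpoint's Hugging Face id.
model_id = "ftajwar/paprika_Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The base model is an instruct model, so format inputs with the chat template.
messages = [
    {"role": "user", "content": "Let's play twenty questions. Ask your first question."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```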
### Model Sources
- **Repository:** [Official Code Release for the paper "Training a Generally Curious Agent"](https://github.com/tajwarfahim/paprika)
- **Paper:** [Training a Generally Curious Agent](https://arxiv.org/abs/2502.17543)
- **Project Website:** [Project Website](https://paprika-llm.github.io)
## Training Details
### Training Data
Our training dataset for supervised fine-tuning is available here: [SFT dataset](https://huggingface.co/datasets/ftajwar/paprika_SFT_dataset).
The training dataset for preference fine-tuning is available here: [Preference learning dataset](https://huggingface.co/datasets/ftajwar/paprika_preference_dataset).
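Both datasets can be loaded directly with the Hugging Face `datasets` library. A minimal sketch (the column names and splits are not documented in this card, so inspect the features after loading):

```python
from datasets import load_dataset

# Datasets released with the paper (linked above).
sft_data = load_dataset("ftajwar/paprika_SFT_dataset")
pref_data = load_dataset("ftajwar/paprika_preference_dataset")

# Column names/splits are not documented in this card; inspect before use.
print(sft_data)
print(pref_data)
```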
### Training Procedure
This [Wandb project](https://wandb.ai/llm_exploration/paprika_more_data?nw=nwusertajwar) shows the training loss per gradient step for both supervised fine-tuning and preference fine-tuning.
#### Training Hyperparameters
For supervised fine-tuning, we use the AdamW optimizer with learning rate 1e-6, batch size 32, and cosine annealing learning rate decay with warmup ratio 0.04, training on a total of 17,181 trajectories.
For preference fine-tuning, we use the RPO objective with the AdamW optimizer, learning rate 2e-7, batch size 32, and cosine annealing learning rate decay with warmup ratio 0.04, training on a total of 5,260 (preferred, dispreferred) trajectory pairs.
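For reference, these settings map roughly onto a Hugging Face `transformers` `TrainingArguments` configuration as sketched below. This is an illustration, not the authors' training script (see the repository linked above); the per-device batch size, precision, and epoch count are assumptions, chosen so that 8 GPUs give an effective batch size of 32.

```python
from transformers import TrainingArguments

# Sketch of the SFT stage described above; for the preference-tuning stage,
# swap learning_rate to 2e-7 and the loss to the RPO objective.
sft_args = TrainingArguments(
    output_dir="paprika_sft",
    learning_rate=1e-6,
    per_device_train_batch_size=4,   # assumption: 8 GPUs x 4 = batch size 32
    lr_scheduler_type="cosine",      # cosine annealing decay
    warmup_ratio=0.04,
    optim="adamw_torch",             # AdamW optimizer
    bf16=True,                       # assumption: bfloat16 mixed precision
    num_train_epochs=1,              # assumption
)
```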
#### Hardware
This model was fine-tuned on 8 NVIDIA L40S GPUs.
## Citation
**BibTeX:**
```
@misc{tajwar2025traininggenerallycuriousagent,
  title={Training a Generally Curious Agent},
  author={Fahim Tajwar and Yiding Jiang and Abitha Thankaraj and Sumaita Sadia Rahman and J Zico Kolter and Jeff Schneider and Ruslan Salakhutdinov},
  year={2025},
  eprint={2502.17543},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.17543},
}
```
## Model Card Contact
[Fahim Tajwar](mailto:tajwarfahim932@gmail.com)