---
library_name: transformers
license: mit
---
# Model Card for PAPRIKA Fine-Tuned Meta-Llama-3.1-8B-Instruct
This is a saved checkpoint from fine-tuning meta-llama/Meta-Llama-3.1-8B-Instruct, first with supervised fine-tuning and then with RPO, using the data and methodology described in our paper, [**"Training a Generally Curious Agent"**](https://arxiv.org/abs/2502.17543). In that work, we introduce PAPRIKA, a fine-tuning framework for teaching large language models (LLMs) strategic exploration.
## Model Details
### Model Description
This is the model card of a meta-llama/Meta-Llama-3.1-8B-Instruct model fine-tuned using PAPRIKA.
- **Finetuned from model:** meta-llama/Meta-Llama-3.1-8B-Instruct
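Below is a minimal usage sketch with Hugging Face Transformers. The repo id here is a placeholder (an assumption, not stated in this card): substitute this checkpoint's actual Hugging Face id. Since the base model is an instruct/chat model, generation goes through the chat template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Placeholder repo id -- replace with this checkpoint's Hugging Face id.
model_id = "ftajwar/paprika_Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The base model is an instruct model, so format inputs with the chat template.
messages = [
    {"role": "user", "content": "Let's play twenty questions. Ask your first question."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```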
### Model Sources
- **Repository:** [Official Code Release for the paper "Training a Generally Curious Agent"](https://github.com/tajwarfahim/paprika)
- **Paper:** [Training a Generally Curious Agent](https://arxiv.org/abs/2502.17543)
- **Project Website:** [Project Website](https://paprika-llm.github.io)
## Training Details
### Training Data
Our training dataset for supervised fine-tuning is available here: [SFT dataset](https://huggingface.co/datasets/ftajwar/paprika_SFT_dataset).
The training dataset for preference fine-tuning is available here: [Preference learning dataset](https://huggingface.co/datasets/ftajwar/paprika_preference_dataset).
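Both datasets can be loaded directly with the Hugging Face `datasets` library. A minimal sketch (the column names and splits are not documented in this card, so inspect the features after loading):

```python
from datasets import load_dataset

# Datasets released with the paper (linked above).
sft_data = load_dataset("ftajwar/paprika_SFT_dataset")
pref_data = load_dataset("ftajwar/paprika_preference_dataset")

# Column names/splits are not documented in this card; inspect before use.
print(sft_data)
print(pref_data)
```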
### Training Procedure
This [Wandb project](https://wandb.ai/llm_exploration/paprika_more_data?nw=nwusertajwar) shows the training loss per gradient step for both supervised fine-tuning and preference fine-tuning.
#### Training Hyperparameters
For supervised fine-tuning, we use the AdamW optimizer with learning rate 1e-6, batch size 32, and cosine annealing learning rate decay with warmup ratio 0.04, training on a total of 17,181 trajectories.
For preference fine-tuning, we use the RPO objective with the AdamW optimizer, learning rate 2e-7, batch size 32, and cosine annealing learning rate decay with warmup ratio 0.04, training on a total of 5,260 (preferred, dispreferred) trajectory pairs.
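For reference, these settings map roughly onto a Hugging Face `transformers` `TrainingArguments` configuration as sketched below. This is an illustration, not the authors' training script (see the repository linked above); the per-device batch size, precision, and epoch count are assumptions, chosen so that 8 GPUs give an effective batch size of 32.

```python
from transformers import TrainingArguments

# Sketch of the SFT stage described above; for the preference-tuning stage,
# swap learning_rate to 2e-7 and the loss to the RPO objective.
sft_args = TrainingArguments(
    output_dir="paprika_sft",
    learning_rate=1e-6,
    per_device_train_batch_size=4,   # assumption: 8 GPUs x 4 = batch size 32
    lr_scheduler_type="cosine",      # cosine annealing decay
    warmup_ratio=0.04,
    optim="adamw_torch",             # AdamW optimizer
    bf16=True,                       # assumption: bfloat16 mixed precision
    num_train_epochs=1,              # assumption
)
```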
#### Hardware
This model was fine-tuned on 8 NVIDIA L40S GPUs.
## Citation
**BibTeX:**
```
@misc{tajwar2025traininggenerallycuriousagent,
  title={Training a Generally Curious Agent},
  author={Fahim Tajwar and Yiding Jiang and Abitha Thankaraj and Sumaita Sadia Rahman and J Zico Kolter and Jeff Schneider and Ruslan Salakhutdinov},
  year={2025},
  eprint={2502.17543},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.17543},
}
```
## Model Card Contact
[Fahim Tajwar](mailto:tajwarfahim932@gmail.com)