amirali1985/interpreting_reward_models


Abdullah committed on May 6, 2024

Commit 9cbba88 (verified) · 1 Parent(s): fa07d8a

Create README.md

Files changed (1)

README.md +11 -0

README.md ADDED

	@@ -0,0 +1,11 @@

+---
+license: mit
+datasets:
+- unalignment/toxic-dpo-v0.2
+- Anthropic/hh-rlhf
+- stanfordnlp/imdb
+language:
+- en
+---
+We train a collection of models with RLHF on the datasets listed above. We use DPO for Anthropic/hh-rlhf and unalignment/toxic-dpo-v0.2, and PPO to train a model that completes prefixes of IMDB reviews with positive sentiment.
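The README names DPO but gives no training details. As a rough illustration only, the standard per-example DPO objective can be sketched as below. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this sketch; they are not taken from this repository, and the inputs are assumed to be summed log-probabilities of each response under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    All names and the default beta are hypothetical; the repository does
    not state its hyperparameters.
    """
    # How much more the policy prefers each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = chosen_gain - rejected_gain

    # -log(sigmoid(z)) equals softplus(-z); computed in a numerically stable form.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss shrinks as the policy raises the chosen response's log-probability relative to the rejected one, measured against the reference model. When the policy and the reference agree exactly, the margin is zero and the loss equals log 2.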