Creation Process: SFT > DPO
SFT on approximately 25 million tokens (17.5 million trainable). Datasets included SFW / NSFW RP, stories, NSFW Reddit writing prompts, and creative instruct & chat data.
90% of the dataset is without thinking; the remaining 10% includes thinking, using the [THINK][/THINK] tags.
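A minimal sketch of how a thinking sample might be assembled, assuming the [THINK][/THINK] tags wrap the reasoning before the visible reply (the exact chat template isn't shown here, so this formatting helper is hypothetical):

```python
def format_assistant(reply, thinking=None):
    """Format an assistant turn; 90% of samples pass no thinking block."""
    if thinking is None:
        return reply
    # 10% of samples: reasoning wrapped in [THINK][/THINK] before the reply
    return f"[THINK]{thinking}[/THINK]{reply}"
```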
All RP data and synthetic stories were rewritten with GLM 4.7, using hand-edited examples as guidelines to improve the responses. Rewritten responses were discarded if they failed to reduce the message's slop score. This cut slop by about 25% for each RP / story dataset and made the model noticeably more creative with some of its descriptions.
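The accept/reject gate can be sketched as below. The actual slop metric and the GLM 4.7 rewriting call aren't shown in the source, so the scorer here is a crude stand-in (cliché phrases per 100 words) and the phrase list is illustrative:

```python
# Hypothetical stand-in slop metric: known cliché phrases per 100 words.
SLOP_PHRASES = [
    "shivers down",
    "a mix of",
    "couldn't help but",
    "eyes sparkled",
]

def slop_score(text):
    """Clichés per 100 words; a placeholder for the real slop scorer."""
    lowered = text.lower()
    words = max(len(lowered.split()), 1)
    hits = sum(lowered.count(p) for p in SLOP_PHRASES)
    return hits / words * 100

def accept_rewrite(original, rewritten):
    """Keep a rewrite only if it strictly reduces the slop score."""
    return slop_score(rewritten) < slop_score(original)
```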
Assistant messages in RP conversations were checked for repetition via embeddings and word-frequency checks across multi-turn conversations. Specific messages were rewritten, and conversations that still showed high repetition were filtered out.
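A sketch of the repetition check: compare each assistant turn against earlier assistant turns and flag near-duplicates. The real pipeline used an embedding model; here a bag-of-words cosine stands in, and the 0.8 threshold is an assumption:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two Counter word vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_repetitive_turns(assistant_msgs, threshold=0.8):
    """Return indices of assistant turns too similar to any earlier turn."""
    vecs = [Counter(m.lower().split()) for m in assistant_msgs]
    flagged = []
    for i in range(1, len(vecs)):
        if any(cosine(vecs[i], vecs[j]) >= threshold for j in range(i)):
            flagged.append(i)
    return flagged
```

Flagged turns would then be rewritten, and conversations with too many flags dropped entirely.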
DPO was expanded to include non-creative datasets. My usual RP DPO dataset (also rewritten) was included, along with cybersecurity data and two partial subsets of general assistant / chat preference datasets to help stabilize the model. This worked pretty well: while creativity took a small hit, enough remained that the improved logic resulted in a notably better model (IMO).
Using embeddings, DPO samples where the chosen response was more similar to the conversation than the rejected response were removed, to ensure DPO doesn't encourage repetition.
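That filter can be sketched as follows. As above, the real pipeline used embeddings; a bag-of-words cosine stands in here, and the function names are hypothetical:

```python
from collections import Counter
import math

def _cos(a, b):
    """Bag-of-words cosine between two strings (embedding stand-in)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def keep_pair(context, chosen, rejected):
    """Drop the pair when the chosen echoes the context more than the rejected."""
    return _cos(chosen, context) <= _cos(rejected, context)
```

The intuition: if the preferred response is the one that parrots the conversation, training on that pair would reward repetition, so the pair is discarded.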