---
license: cc-by-nc-4.0
library_name: transformers
language:
- en
tags:
- writing
base_model:
- maldv/badger-nu-llama-3.1-8B-UltraLong
pipeline_tag: text-generation
datasets:
- SillyTilly/fiction-writer-596
---
 |
|
|
|
|
|
[GGUF](https://huggingface.co/mradermacher/praxis-bookwriter-llama3.1-8b-sft-GGUF) [iMat](https://huggingface.co/mradermacher/praxis-bookwriter-llama3.1-8b-sft-i1-GGUF) |
|
|
|
|
|
# Praxis Bookwriter Llama 3.1 8B |
|
|
|
|
|

My last iteration of fantasy writer suffered from one glaring flaw: it did not follow instructions well. After much consideration, I decided it would make sense to introduce some information about the story chapter text into the prompt, linking the instructions to the generated text.

To do this, I took strides of 16,384 tokens across each of the books in the ~140M token dataset and used R1 to generate a summary of each stride. With some careful modification, I used this summary to generate the first user turn. Each subsequent assistant turn carries approximately 512 tokens of content, and the following user turn is either a chapter header or one paragraph of content. The turns alternate this way until the entirety of the original stride is consumed.
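
One stride thus becomes a transcript shaped roughly like the sketch below. This is a simplified reconstruction for illustration, not the actual preprocessing code: the `build_conversation` helper and the word-count token proxy are stand-ins, and the summary is assumed to already come from R1.

```python
def build_conversation(paragraphs: list[str], summary: str, chapter: int) -> list[dict]:
    """Turn one 16,384-token stride into an alternating chat transcript."""
    messages = [
        {"role": "system", "content": "You are the user's helpful writing assistant."},
        # First user turn: the setting overview, ending with the chapter marker.
        {"role": "user", "content": f"{summary}\n// Chapter {chapter}"},
    ]
    i = 0
    while i < len(paragraphs):
        # Assistant turn: roughly 512 tokens of story content.
        chunk, tokens = [], 0
        while i < len(paragraphs) and tokens < 512:
            chunk.append(paragraphs[i])
            tokens += len(paragraphs[i].split())  # crude stand-in for a tokenizer
            i += 1
        messages.append({"role": "assistant", "content": "\n\n".join(chunk)})
        # Next user turn: a chapter header or a single paragraph of content.
        if i < len(paragraphs):
            messages.append({"role": "user", "content": paragraphs[i]})
            i += 1
    return messages
```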

## Crafting the prompt

The system prompt should contain some variation of:

```text
You are the user's helpful writing assistant.

// Title: The Title of Your Story
// Author: Author Name For Style
// Tags: some comma, delimited list, of genres
```

In an initial test, I tried putting the summary in the system prompt, but the result was underwhelming. For this version, the first user turn should contain an overview of the setting (the summary), with the last line taking the format:

```
// Chapter n
```

The content of this block can carry any variety of instruction about what to write in the following frame. The summaries I used were between 500 and 1,500 tokens, so the more detail about the setting, locations, characters, their relationships, and plot points, the better.
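
For example (all story details below are invented purely for illustration), the opening might pair a system prompt like:

```text
You are the user's helpful writing assistant.

// Title: The Ashen Crown
// Author: Jane Doe
// Tags: epic fantasy, political intrigue, found family
```

with a first user turn such as:

```text
The mountain city of Kareth has been sealed off by an early winter. Sera, a young courier carrying a forged seal, must reach the exiled prince before the passes close. Her handler does not know the seal is a fake.

// Chapter 1
```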

## Training

This model was trained on a single Paperspace A6000 using Unsloth with rank-stabilized LoRA (rsLoRA):

```python
from datasets import load_from_disk
from dotenv import dotenv_values
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
from transformers import TrainingArguments
from trl import SFTTrainer
import wandb

envconfig = dict(dotenv_values(".env"))

dtype = None  # let unsloth pick; bf16 where supported
max_seq_length = 24576
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    # rsLoRA scales by alpha / sqrt(r), so alpha = sqrt(128) keeps the scale at 1.
    lora_alpha = 128**.5,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
    loftq_config = None,
)

dataset = load_from_disk('bookdata')
ds_train = dataset
# Eval is a small fixed sample drawn from the training data.
ds_eval = dataset.shuffle(seed=12345).select(range(32))

targs = TrainingArguments(
    per_device_train_batch_size = 3,
    gradient_accumulation_steps = 4,  # effective batch size of 12
    learning_rate = 4e-5,
    weight_decay = 0,
    gradient_checkpointing = True,
    max_grad_norm = 1,
    warmup_steps = 5,
    num_train_epochs = 3,
    optim = "paged_adamw_32bit",
    lr_scheduler_type = "cosine",
    seed = 3407,
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 1,
    per_device_eval_batch_size = 1,
    do_eval = True,
    eval_steps = 25,
    eval_strategy = "steps",
    save_strategy = "steps",
    save_steps = 20,
    save_total_limit = 3,
    output_dir = "outputs",
    report_to = "wandb",
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = ds_train,
    eval_dataset = ds_eval,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 6,
    packing = False,
    args = targs,
)

wandb.login(key=envconfig['wandb_key'])
wandb.init(
    project='bookwriter-596',
    config={
        "learning_rate": 4e-5,
        "architecture": 'llama 3.1 8b',
        "dataset": 'bookdata',
        "epochs": 3,
    }
)

# First run: trainer_stats = trainer.train()
trainer.train(resume_from_checkpoint=True)
```
 |
|
|
|
|
|
## Merged |
|
|
|
|
|

The rsLoRA adapter I trained was applied on top of badger-nu-llama-3.1-8B-UltraLong, which is RoPE scaled, so in theory this model should be able to perform at context lengths beyond those of my original training data. That said, my training data was limited to sequence lengths of around 20k tokens, so anything past that may be out-of-distribution.
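
For reference, a merge of this kind can be reproduced with peft's `merge_and_unload`. This is a minimal sketch, assuming the adapter checkpoint lives in `outputs/`; it is not the exact command sequence used.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the RoPE-scaled base model, then apply the rsLoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained(
    "maldv/badger-nu-llama-3.1-8B-UltraLong",
    torch_dtype="bfloat16",
)
model = PeftModel.from_pretrained(base, "outputs")  # adapter checkpoint path

# Fold the adapter weights into the base and save the merged model.
merged = model.merge_and_unload()
merged.save_pretrained("praxis-bookwriter-llama3.1-8b-sft")

tokenizer = AutoTokenizer.from_pretrained("maldv/badger-nu-llama-3.1-8B-UltraLong")
tokenizer.save_pretrained("praxis-bookwriter-llama3.1-8b-sft")
```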

## License

This model is released under the limitations of both the Llama 3.1 license and CC-BY-NC-4.0.

## Author

Praxis Maldevide

## Citation

If you find this work helpful, feel free to cite it.

```
@misc{praxis-bookwriter-llama3.1-8b-sft,
  title = {Praxis Bookwriter Llama3.1 8B},
  url = {https://huggingface.co/maldv/praxis-bookwriter-llama3.1-8b-sft},
  author = {Praxis Maldevide},
  month = {May},
  year = {2025}
}
```