---
license: cc-by-nc-4.0
library_name: transformers
language:
- en
tags:
- writing
base_model:
- maldv/badger-nu-llama-3.1-8B-UltraLong
pipeline_tag: text-generation
datasets:
- SillyTilly/fiction-writer-596
---
 |
|
|
|
|
|
[GGUF](https://huggingface.co/mradermacher/praxis-bookwriter-llama3.1-8b-sft-GGUF) [iMat](https://huggingface.co/mradermacher/praxis-bookwriter-llama3.1-8b-sft-i1-GGUF) |
|
|
|
|
|
# Praxis Bookwriter Llama 3.1 8B |
|
|
|
|
|

My last iteration of fantasy writer suffered from one glaring flaw: it did not follow instructions well. After much consideration, I decided it would make sense to introduce some information about the story chapter text into the prompt, linking the instructions to the generated text.

To do this, I took strides of 16,384 tokens across each of the books in the ~140M token dataset and used R1 to generate a summary of each stride. With some careful modification, I used this summary to generate the first user turn. Each subsequent assistant turn carries approximately 512 tokens of content, and the following user turn is either a chapter header or one paragraph of content. The turns alternate this way until the entirety of the original stride is consumed.
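
One stride thus becomes a transcript shaped roughly like the sketch below. This is a simplified reconstruction for illustration, not the actual preprocessing code: the `build_conversation` helper and the word-count token proxy are stand-ins, and the summary is assumed to already come from R1.

```python
def build_conversation(paragraphs: list[str], summary: str, chapter: int) -> list[dict]:
    """Turn one 16,384-token stride into an alternating chat transcript."""
    messages = [
        {"role": "system", "content": "You are the user's helpful writing assistant."},
        # First user turn: the setting overview, ending with the chapter marker.
        {"role": "user", "content": f"{summary}\n// Chapter {chapter}"},
    ]
    i = 0
    while i < len(paragraphs):
        # Assistant turn: roughly 512 tokens of story content.
        chunk, tokens = [], 0
        while i < len(paragraphs) and tokens < 512:
            chunk.append(paragraphs[i])
            tokens += len(paragraphs[i].split())  # crude stand-in for a tokenizer
            i += 1
        messages.append({"role": "assistant", "content": "\n\n".join(chunk)})
        # Next user turn: a chapter header or a single paragraph of content.
        if i < len(paragraphs):
            messages.append({"role": "user", "content": paragraphs[i]})
            i += 1
    return messages
```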

## Crafting the prompt

The system prompt should contain some variation of:

```text
You are the user's helpful writing assistant.

// Title: The Title of Your Story
// Author: Author Name For Style
// Tags: some comma, delimited list, of genres
```

In an initial test, I tried putting the summary in the system prompt, but the result was underwhelming. For this version, the first user turn should contain an overview of the setting (the summary), with the last line taking the format:

```
// Chapter n
```

The content of this block can carry any variety of instruction about what to write in the following frame. The summaries I used were between 500 and 1,500 tokens, so the more detail about the setting, locations, characters, their relationships, and plot points, the better.
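
For example (all story details below are invented purely for illustration), the opening might pair a system prompt like:

```text
You are the user's helpful writing assistant.

// Title: The Ashen Crown
// Author: Jane Doe
// Tags: epic fantasy, political intrigue, found family
```

with a first user turn such as:

```text
The mountain city of Kareth has been sealed off by an early winter. Sera, a young courier carrying a forged seal, must reach the exiled prince before the passes close. Her handler does not know the seal is a fake.

// Chapter 1
```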

## Training

This model was trained on a single Paperspace A6000 using Unsloth with rank-stabilized LoRA (rsLoRA):

```python
from datasets import load_from_disk
from dotenv import dotenv_values
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
from transformers import TrainingArguments
from trl import SFTTrainer
import wandb

envconfig = dict(dotenv_values(".env"))

dtype = None  # let unsloth pick; bf16 where supported
max_seq_length = 24576
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    # rsLoRA scales by alpha / sqrt(r), so alpha = sqrt(128) keeps the scale at 1.
    lora_alpha = 128**.5,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
    loftq_config = None,
)

dataset = load_from_disk('bookdata')
ds_train = dataset
# Eval is a small fixed sample drawn from the training data.
ds_eval = dataset.shuffle(seed=12345).select(range(32))

targs = TrainingArguments(
    per_device_train_batch_size = 3,
    gradient_accumulation_steps = 4,  # effective batch size of 12
    learning_rate = 4e-5,
    weight_decay = 0,
    gradient_checkpointing = True,
    max_grad_norm = 1,
    warmup_steps = 5,
    num_train_epochs = 3,
    optim = "paged_adamw_32bit",
    lr_scheduler_type = "cosine",
    seed = 3407,
    fp16 = not is_bfloat16_supported(),
    bf16 = is_bfloat16_supported(),
    logging_steps = 1,
    per_device_eval_batch_size = 1,
    do_eval = True,
    eval_steps = 25,
    eval_strategy = "steps",
    save_strategy = "steps",
    save_steps = 20,
    save_total_limit = 3,
    output_dir = "outputs",
    report_to = "wandb",
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = ds_train,
    eval_dataset = ds_eval,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 6,
    packing = False,
    args = targs,
)

wandb.login(key=envconfig['wandb_key'])
wandb.init(
    project='bookwriter-596',
    config={
        "learning_rate": 4e-5,
        "architecture": 'llama 3.1 8b',
        "dataset": 'bookdata',
        "epochs": 3,
    }
)

# First run: trainer_stats = trainer.train()
trainer.train(resume_from_checkpoint=True)
```
 |
|
|
|
|
|
## Merged |
|
|
|
|
|

The rsLoRA adapter I trained was applied on top of badger-nu-llama-3.1-8B-UltraLong, which is RoPE scaled, so in theory this model should be able to perform at context lengths beyond those of my original training data. That said, my training data was limited to sequence lengths of around 20k tokens, so anything past that may be out-of-distribution.
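
For reference, a merge of this kind can be reproduced with peft's `merge_and_unload`. This is a minimal sketch, assuming the adapter checkpoint lives in `outputs/`; it is not the exact command sequence used.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the RoPE-scaled base model, then apply the rsLoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained(
    "maldv/badger-nu-llama-3.1-8B-UltraLong",
    torch_dtype="bfloat16",
)
model = PeftModel.from_pretrained(base, "outputs")  # adapter checkpoint path

# Fold the adapter weights into the base and save the merged model.
merged = model.merge_and_unload()
merged.save_pretrained("praxis-bookwriter-llama3.1-8b-sft")

tokenizer = AutoTokenizer.from_pretrained("maldv/badger-nu-llama-3.1-8B-UltraLong")
tokenizer.save_pretrained("praxis-bookwriter-llama3.1-8b-sft")
```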

## License

This model is released under the limitations of both the Llama 3.1 license and CC-BY-NC-4.0.

## Author

Praxis Maldevide

## Citation

If you find this work helpful, feel free to cite it.

```
@misc{praxis-bookwriter-llama3.1-8b-sft,
  title = {Praxis Bookwriter Llama3.1 8B},
  url = {https://huggingface.co/maldv/praxis-bookwriter-llama3.1-8b-sft},
  author = {Praxis Maldevide},
  month = {May},
  year = {2025}
}
```