Prior to healing, the model returned absolute gibberish to any prompt, rarely stringing two real words together. For example, given "2+2=" it might return "Mahmisan Pannpyout Na RMITa CMI TTi GP BP GP RSi TBi DD PS..."

The results are pretty good! The model has issues, but could have legitimate uses. It can carry on a conversation. It's certainly usable, if not useful.
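
A quick way to eyeball that behaviour is a short generation script. The sketch below assumes the standard Hugging Face transformers chat-template API; the model id is a placeholder, not the actual repo path.

```python
# Minimal coherence check: prompt the model and read the reply.
# The model id below is a placeholder; point it at the actual 5.4B checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama3-5.4b-instruct"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "2+2="}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding is enough to tell gibberish from a coherent answer.
output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```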

Truthfulness and commonsense reasoning suffered the least from the prune / were healed the best. Knowledge and complex reasoning suffered the most.
This model has 67% the parameters of the original, and has:

I do believe it could be much better by doing the pruning in stages (say, 4 layers at a time) with some healing in between, and longer healing at the end with a more diverse dataset.
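
As a rough illustration of that staged approach, the sketch below writes one mergekit passthrough config per stage, each dropping four layers. The layer indices and file paths are assumptions for illustration, not the slice points actually used for this model, and the DPO healing between stages would be run separately on each stage's output.

```python
# Sketch of staged pruning: one mergekit passthrough config per stage, each
# removing 4 decoder layers. Layer indices and paths are illustrative
# assumptions; healing (e.g. a DPO finetune) runs on each stage's output
# before the next stage prunes further.
BASE = "meta-llama/Meta-Llama-3-8B-Instruct"
TOTAL_LAYERS = 32

# Layers to drop at each stage, indexed within that stage's input model.
stage_drops = [(20, 24), (20, 24)]  # assumed boundaries, not the real ones

def passthrough_config(model: str, drop: tuple, total: int) -> str:
    """Build a mergekit slices config keeping everything outside [drop[0], drop[1])."""
    lines = ["slices:"]
    for lo, hi in [(0, drop[0]), (drop[1], total)]:
        lines += [
            "  - sources:",
            f"      - layer_range: [{lo}, {hi}]",
            f"        model: {model}",
        ]
    lines += ["merge_method: passthrough", "dtype: bfloat16"]
    return "\n".join(lines) + "\n"

source = BASE
layers = TOTAL_LAYERS
for i, drop in enumerate(stage_drops, start=1):
    with open(f"prune-stage-{i}.yaml", "w") as f:
        f.write(passthrough_config(source, drop, layers))
    layers -= drop[1] - drop[0]
    source = f"./llama3-pruned-stage-{i}-healed"  # assumed path of the healed stage output
```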

### Benchmarks

*Figure 1: Benchmark results for the pruned model, the original 8B model, and other models of similar size. Truthfulness and commonsense reasoning suffered the least from the prune / were healed the best. Knowledge and complex reasoning suffered the most.*

*Figure 2: Model size vs. average benchmark performance. Llama3-5.4b-instruct may not be fully healed, but its performance scales linearly with its size.*
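
For reproducing numbers like these, something along the lines of EleutherAI's lm-evaluation-harness is the usual route. The sketch below assumes its v0.4 Python API; the task list and model path are illustrative, and the exact suite used for these figures isn't specified here.

```python
# Hypothetical benchmark run with lm-evaluation-harness (pip install lm-eval).
# Task names and model path are illustrative; swap in the suite you care about.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/llama3-5.4b-instruct,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "winogrande", "truthfulqa_mc2"],
    batch_size=8,
)

# Print per-task metrics from the results dictionary.
for task, metrics in results["results"].items():
    print(task, metrics)
```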

## Why 5.4B?
This size should allow for:

  - sources:
      - layer_range: [29, 32]
        model: meta-llama/Meta-Llama-3-8B-Instruct
```
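
For completeness, a config like the one ending above is applied with mergekit's command-line entry point. This is a sketch assuming the standard `mergekit-yaml` usage, with the config file and output directory as placeholders.

```python
# Run mergekit on a slices config (assumes `pip install mergekit`, which
# provides the `mergekit-yaml` entry point). Paths are placeholders.
import subprocess

subprocess.run(
    ["mergekit-yaml", "prune-config.yaml", "./llama3-5.4b-unhealed"],
    check=True,
)
```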

## Weights & Biases Logs
Here are the logs for the full weight fine tune:
- https://wandb.ai/haileycollet/llama3-5b/runs/ryyqhc97
- https://wandb.ai/haileycollet/llama3-5b/runs/fpj2sct3
- https://wandb.ai/haileycollet/llama3-5b/runs/k9z6n9em
- https://wandb.ai/haileycollet/llama3-5b/runs/r3xqyhm2

And the LoRA logs:
- https://wandb.ai/haileycollet/llama3-5b/runs/rseithn1
- https://wandb.ai/haileycollet/llama3-5b/runs/g26232ei