mahing committed
Commit c6468b9 · verified · 1 Parent(s): 9b37d9c

Update README.md

Files changed (1):
  1. README.md +9 -1
README.md CHANGED
@@ -31,7 +31,7 @@ For my hyperparameter combinations, I chose to use LORA_R = 128, LORA_ALPHA = 12
  | DeepSeek R1 Model | 60.4% | 73.3% | 88.6% | 85.3% | 82.4% | 35.9% |
  | Mistral Nemo Instruct | 63.3% | 65.6% | 84.4% | 84.8% | 74.5% | 39.5% |
 
- I used MMLU as a benchmark to both test the model’s historical accuracy skills and see if it still has general baseline knowledge after fine-tuning the model specifically for my task. MMLU’s input are multiple choice questions, and the output is the answer from the model and if it is correct, testing the model’s general abilities and knowledge of history. I also plan to use HellaSwag to test how the model performs at reasoning through sentence completion to make sure the narration of the model has similar performance. TruthfulQA is another benchmark I used to check for model hallucinations. TruthfulQA’s inputs are open-ended questions, and the response is the model’s answer to the question, seeing if it matches with the desired output. MMLU was used to test my model’s results on history-related benchmarks while all 3 benchmarks test the general performance of the model on domains outside of history. I chose DeepSeek R1 and Mistral Nemo as my comparison models since they are of similar model size as my base model Qwen2.5. I also chose these since they were on HuggingFace model leaderboards for high performance relative to model size. My fine-tuned historical narrative model performed quite well compared to the other models overall. The model had no major drops in benchmark results, and even scored the highest on TruthfulQA, HellaSwag (tied with base Qwen2.5 model), MMLU, and MMLU High School World History. This demonstrates the ability of the model to retain general information while also excelling in providing first-person historical narratives.
+ I used MMLU as a benchmark both to test the model’s historical accuracy and to check that it retains general baseline knowledge after being fine-tuned specifically for my task. MMLU’s inputs are multiple-choice questions, and the output is the model’s selected answer, which is scored for correctness, testing the model’s general abilities and knowledge of history (Hendrycks et al., 2021). I also used HellaSwag to test how well the model reasons through sentence completion, to make sure its narration performance held up (Zellers et al., 2019). TruthfulQA is another benchmark I used to check for model hallucinations: its inputs are open-ended questions, and the model’s answer is judged against the desired output (Lin et al., 2021). MMLU tested my model’s results on history-related material, while all three benchmarks test the general performance of the model on domains outside of history. I chose DeepSeek R1 and Mistral Nemo as comparison models since they are of similar size to my base model, Qwen2.5, and since they appeared on Hugging Face model leaderboards for high performance relative to model size. My fine-tuned historical narrative model performed quite well compared to the other models overall. It had no major drops in benchmark results, and even scored the highest on TruthfulQA, HellaSwag (tied with the base Qwen2.5 model), MMLU, and MMLU High School World History. This demonstrates the model’s ability to retain general knowledge while also excelling at providing first-person historical narratives.
 
  **Usage and Intended Uses** <br />
  ```python
@@ -114,3 +114,11 @@ I trust that the path we have chosen will lead us to a brighter tomorrow.
 
  **Limitations** <br />
  One of the primary limitations faced with this approach was the difficulty of generating synthetic data. It proved hard to find historical documents from a certain era, and it took a large amount of compute and time to generate the synthetic first-person narratives for these documents. Future work would entail creating more data for the model to train on, improving results further. The other primary limitation of this model is the lack of creative introductions in its responses. The model has been shown to always start with a sentence or phrase stating the year and date. While this sets the scene, the model could be improved to open its narratives more creatively.
+
+ **Works Cited** <br />
+ Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J.
+ (2021). Measuring Massive Multitask Language Understanding (arXiv:2009.03300). arXiv. https://arxiv.org/abs/2009.03300
+ Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y.
+ (2019). HellaSwag: Can a Machine Really Finish Your Sentence? (arXiv:1905.07830). arXiv. https://arxiv.org/abs/1905.07830
+ Lin, S., Hilton, J., & Evans, O. (2021). TruthfulQA: Measuring How Models
+ Mimic Human Falsehoods (arXiv:2109.07958). arXiv. https://arxiv.org/abs/2109.07958
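
The MMLU-style evaluation described in the changed paragraph reduces to comparing the model's chosen option letter against an answer key and reporting the fraction correct. A minimal sketch of that scoring step (the function name and letter format are illustrative assumptions, not code from this repo):

```python
# Illustrative sketch: scoring multiple-choice (MMLU-style) answers,
# where each prediction is the model's chosen option letter and each
# gold answer is the corresponding letter from the answer key.

def multiple_choice_accuracy(predictions, answers):
    """Return the fraction of predicted letters matching the gold letters."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    # Normalize case/whitespace so "c" and "C" count as the same choice.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers) if answers else 0.0

# Example with four hypothetical questions: three answers match the key.
preds = ["A", "c", "B", "D"]
gold = ["A", "C", "B", "A"]
print(f"{multiple_choice_accuracy(preds, gold):.1%}")  # 75.0%
```

The same loop applies to any of the percentages in the table above; only the prompt construction and answer extraction differ per benchmark.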