Update README.md
README.md
@@ -10,20 +10,20 @@ LLMs have the potential to support representative democracy by providing constit

### Training Data

The primary dataset used to train and evaluate this model is EZ-STANCE. This dataset contains labeled stances on a variety of topics across politics and pop culture. The dataset contains the following fields, which are relevant to this project:

- Text: The source Tweet that stances will be generated from.
- Target: A topic or claim about which the author of the original text could have a specific stance.
- Stance: The stance label for the Target text (Favorable, Unfavorable, or Neutral).

Using this dataset, I was able to provide the model with the source text and ask it to determine whether the author of that text would have a favorable, unfavorable, or no stance towards the target topic or claim. I did not make any modifications to those fields in the training dataset, other than adding structure around the data in the prompt to clarify to the model what I wanted it to provide. The second component of the task was to have the model provide step-by-step reasoning behind the stance it provided. This reasoning was not included in the training dataset, but I thought it was important to have the model generate it, because an explanation the user can reference matters given that the original motivation for this model is to help build trust.
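
The structure added around each example might look something like the sketch below. The wrapper text is an illustrative assumption, not the exact template used to train this model:

```python
# Build a structured classification prompt from the three EZ-STANCE fields
# described above. The instruction wording here is assumed for illustration.
def build_prompt(text: str, target: str) -> str:
    return (
        "Determine the stance of the author of the following statement "
        "towards the target topic or claim. "
        "Answer with exactly one of: FAVOR, AGAINST, NONE.\n\n"
        f"Statement: {text}\n"
        f"Target: {target}\n"
        "Stance:"
    )

# Example usage with one hypothetical dataset row:
row = {"Text": "We need bold climate action now.", "Target": "climate policy", "Stance": "FAVOR"}
prompt = build_prompt(row["Text"], row["Target"])
completion = " " + row["Stance"]  # the label the model is trained to emit
```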

### Training Method

The base model used for this project was Qwen2.5-7B-Instruct-1M. I chose this model because it could handle large context windows, was instruction tuned, and its relatively low number of parameters would make it more efficient to train. The final model was trained on the stance classification task using the LoRA method of PEFT. Then, few-shot chain-of-thought prompting was used to ask the final model for reasoning behind the stances it generated. When reviewing the output of the model on my task, I observed that few-shot prompting alone went a very long way in improving the output of the model when asking it to explain its reasoning, which is why I only trained the model on the stance classification component of the task. I used PEFT over full fine-tuning because I did not want to drastically change my model, since it was already performing well on the reasoning task. Also, since I am using a 7B parameter model and my desired model output is open-ended, I had concerns around the efficiency of full fine-tuning. My aim was to take a targeted training approach to assist the model on its classification task.

That left me deciding between PEFT and prompt tuning. My model was already performing well without any tuning, which led me to first consider prompt tuning, as it was the least invasive approach. However, my task does involve asking the model to perform a somewhat specific stance classification task in addition to generating its reasoning, so I thought the more in-depth approach of PEFT could be useful. Also, since my model is small to medium sized at 7B parameters, I did not have the same concern about resource usage with PEFT as I did with full fine-tuning. Therefore, I decided to take the middle-ground approach of PEFT. Within PEFT, I chose LoRA because it is a common approach with a lot of resources and guidance available, which gave me confidence in my ability to implement it effectively. LoRA is also much more efficient than full fine-tuning and has been shown to perform almost as well, including on logical reasoning tasks.
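
A minimal sketch of the LoRA setup, using the Hugging Face transformers and peft libraries. The rank, scaling factor, and target modules are illustrative assumptions, not the exact hyperparameters used for this model:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct-1M")

# LoRA freezes the base weights and trains small low-rank update matrices
# injected into selected projection layers.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank of the update matrices (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # a common choice for Qwen-style attention
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base parameters
```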

Finally, when prompting the model for the reasoning behind the stance it selected, I used few-shot prompting. Min et al. found that giving the model about 16 examples in the prompt resulted in the best performance on classification and multi-choice tasks. Since there are three possible stance options (FAVOR, AGAINST, NONE), I provided the model with 15 examples (five for each stance). The 15 examples included in the prompt were hand-written by me, since no training data existed for the logical reasoning portion of this task.
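
Assembling that prompt might look like the sketch below. The example content is illustrative, not one of the 15 hand-written examples:

```python
# Concatenate the hand-written chain-of-thought examples, then the new query.
few_shot_examples = [
    {
        "statement": "Solar and wind are finally cheaper than coal.",
        "target": "renewable energy",
        "stance": "FAVOR",
        "reasoning": "The author cites falling costs approvingly, which "
                     "suggests support for renewable energy.",
    },
    # ... 14 more hand-written examples, five for each of FAVOR / AGAINST / NONE
]

def build_few_shot_prompt(statement: str, target: str) -> str:
    parts = []
    for ex in few_shot_examples:
        parts.append(
            f"Statement: {ex['statement']}\n"
            f"Target: {ex['target']}\n"
            f"Stance: {ex['stance']}\n"
            f"Response: {ex['reasoning']}\n"
        )
    parts.append(f"Statement: {statement}\nTarget: {target}\nStance:")
    return "\n".join(parts)
```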
@@ -38,10 +38,10 @@ The benchmarks I chose to use were Hellaswag , TruthfulQA , and Winogrande . I c

| TruthfulQA (BLEU ACC) | 0.44 | 0.36 | 0.34 | 0.4 |
| Stance Accuracy | 0.4781 | 0.4792 | 0.3362 | 0.3516 |

I chose to evaluate the task performance on two additional models: the Mistral 7B instruction tuned model, to test another instruction tuned model of a similar size, and the DeepSeek 1.5B parameter model, to test a smaller model that is still in the small to medium sized category. Overall, my base and post-training models performed the best on both the benchmarks and the stance accuracy task. I was encouraged that the benchmark performance did not degrade significantly after training, indicating that the model did not lose logical reasoning capability. However, even after PEFT, the stance classification accuracy remained virtually unchanged. If I were starting this project from the beginning, I would attempt to either train the model for significantly longer or use full fine-tuning. The Qwen 7B parameter base model and the post-training model both performed on par with, or better than, the comparison models on all tasks.
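
The README does not state which harness produced these benchmark numbers; a common way to run them is EleutherAI's lm-evaluation-harness. A sketch assuming that tool (task names follow its v0.4 conventions and may differ across versions):

```python
# Evaluate a model on the three chosen benchmarks with lm-evaluation-harness.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct-1M",
    # BLEU ACC for TruthfulQA comes from the generation variant of the task.
    tasks=["hellaswag", "winogrande", "truthfulqa_gen"],
    batch_size=8,
)
print(results["results"])
```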

## Usage and Intended Uses

The intended use of the model is to take input text, like a tweet or public statement, along with a specific topic or claim, and generate two key outputs: the stance classification and the reasoning behind that classification.
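
A sketch of that usage, loading the base model together with the trained LoRA adapter (the adapter path is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "Qwen/Qwen2.5-7B-Instruct-1M"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(model, "path/to/lora-adapter")  # placeholder path

# In practice, the few-shot examples described above would be prepended here.
prompt = (
    "Statement: We need bold climate action now.\n"
    "Target: climate policy\n"
    "Stance:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```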

### Prompt Format

The prompt format should ideally include good examples of this task and then provide the model with the statement and the target topic or claim. From there, the model can generate the expected stance and its reasoning. For example:

@@ -72,9 +72,15 @@ Stance: AGAINST

Response: The author is against the claim that emissions need to be kept below 15 degrees Celsius. The statement emphasizes the importance of political will and comprehensive efforts to tackle climate change, but the target temperature of 15 degrees is not aligned with the widely accepted scientific goal of limiting global warming to 1.5-2 degrees Celsius.

## Limitations

The primary limitation encountered was improving stance classification accuracy via training. Often, the input statement was written poorly, with slang, typos, or shorthand, which could make it more difficult for the model to parse meaning. The model also seemed to have difficulty distinguishing stance from sentiment: it is possible for the author of a statement with a positive sentiment to have an unfavorable stance towards a topic or claim. The model also struggled to correctly identify neutral stances towards the topic. Given that the model is picking up on sentiment, it could be that it defaults to sentiment when no clear stance is present. This was evident when I looked closer at the classification accuracy: the model correctly classified only about 20% of examples where the true stance was "None". Interestingly, it did not perform that well on "Favor" examples (42%) but performed extremely well (83%) on "Against" examples. I am not sure why this is the case, but a future step in working on this model could be to focus training efforts more on None and Favor examples.

| Stance | Items | Correct | Accuracy | DeepSeek-R1-Distill-Qwen-1.5B |
|--------|-------|---------|----------|-------------------------------|
| Favor | 2,354 | 992 | 0.4214 | 0.4 |
| Against | 2,098 | 1,750 | 0.8341 | 0.56 |
| None | 2,187 | 432 | 0.1975 | 0.32 |
| Total | 6,639 | 3,174 | 0.4781 | 0.4 |
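
The per-stance breakdown above can be computed with a short helper like this sketch, which assumes parallel lists of gold labels and model predictions:

```python
from collections import Counter

def per_stance_accuracy(gold: list[str], pred: list[str]) -> dict[str, float]:
    """Accuracy within each gold-label class (e.g. FAVOR / AGAINST / NONE)."""
    items = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return {stance: correct[stance] / items[stance] for stance in items}
```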

Another limitation of this approach is that I only trained on the stance classification task. Given that the minimally invasive approach of PEFT did not have much impact, it could be worth exploring more impactful approaches like full fine-tuning. In that case, there is a risk that training the entire model will damage its ability to reason logically.

@@ -89,7 +95,7 @@ Stance: FAVOR

Response: The author supports the claim by linking private education institutions to financial irregularities, including money laundering and tax evasion, suggesting that these institutions could indeed face future scrutiny and investigations. This aligns with the claim that private education may be at risk due to such practices.

Example of the model perhaps confusing sentiment and stance, and struggling with slang:

Statement: Let s aim to recycle and reuse as much as we can. . . . . noplastic plasticfree plasticpollution environment environmentalawareness environmentalfriendly savetheplanet plasticpollution noplasticbags noplasticwaste makeachange makeachangetoday