bfuzzy1 committed on
Commit a9fd896 · verified · 1 Parent(s): 5763cd8

Update README.md

Files changed (1)
  1. README.md +4 -57
README.md CHANGED
@@ -18,31 +18,6 @@ dtype: float16
 
 ```
 
- Here’s a draft Model Card for your new model based on the evaluations you have provided.
-
- Model Card: Custom AI Model (514M Parameters)
-
- Model Details
-
- • Architecture: This model is based on a fine-tuned large language model with 514M parameters, designed for handling a variety of commonsense reasoning tasks and general knowledge. The model has undergone multiple rounds of evaluation and focuses on tasks like ARC Challenge, HellaSwag, PIQA, and Winogrande.
- • Model Size: 514M parameters
-
- Use Case and Intended Applications
-
- This model is designed for tasks requiring:
-
- • Commonsense Reasoning: Understanding and predicting everyday physical and linguistic scenarios.
- • Text Comprehension: Handling tasks that require completion or understanding of real-world descriptions and ambiguous situations.
- • General Knowledge: Reasoning through questions that require broad, general understanding of knowledge domains, such as multiple-choice exams.
-
- Training Data
-
- The model was fine-tuned on various datasets to optimize its performance in the following areas:
-
- • Physical Reasoning: Datasets like PIQA help the model to reason about physical situations and solutions.
- • Commonsense and Ambiguous Reasoning: Datasets like HellaSwag and Winogrande help the model make sense of events or situations that require a high degree of commonsense understanding.
- • General Knowledge: The ARC Challenge dataset allows the model to answer multiple-choice questions that test general reasoning skills.
-
 Evaluation Results
 
 The model was evaluated across a range of tasks. Below are the final evaluation results (after removing GSM8k):
@@ -54,38 +29,10 @@ The model was evaluated across a range of tasks. Below are the final evaluation
 | 1.24B | llama 3.2 | 36.75 | 36.18 | 63.70 | 74.54 | 60.54 | 54.34 |
 | 514M | archeon | NA | 32.34 | 47.80 | 74.37 | 62.12 | 54.16 |
 
- • ARC Challenge: The model performs decently in answering general knowledge questions.
- • HellaSwag: The model is strong in commonsense reasoning, performing well in predicting the next sequence of events in a given scenario.
- • PIQA: The model excels in physical reasoning, showcasing a solid understanding of everyday physical interactions.
- • Winogrande: It also shows competitive performance in linguistic reasoning tasks.
-
- Key Strengths
-
- 1. Physical and Commonsense Reasoning: The model consistently performs well in tasks like PIQA and HellaSwag, showcasing strong abilities in understanding and predicting physical scenarios and commonsense events.
- 2. Linguistic Reasoning: The model also performs competitively in tasks like Winogrande, which tests linguistic understanding and ambiguity resolution.
-
- Key Weaknesses
-
- 1. General Knowledge (ARC Challenge): While the model does reasonably well, it lags behind top models in handling more challenging general knowledge questions.
- 2. Math Reasoning: Performance on numerical reasoning tasks like GSM8k was excluded due to poor performance, indicating a potential area for future improvement with further fine-tuning.
-
- Recommendations for Improvement
-
- • Fine-Tuning on Mathematical Reasoning: To improve on GSM8k and other math-heavy tasks, consider fine-tuning on datasets like MathQA or MATH.
- • Enhanced General Knowledge: To further enhance performance in general knowledge tasks (ARC Challenge), additional fine-tuning with datasets like SQuAD, TriviaQA, or other large knowledge datasets would be beneficial.
-
- Model Usage
-
- This model is well-suited for a variety of NLP tasks where commonsense reasoning and physical reasoning are required, such as:
-
- • Answering multiple-choice questions (e.g., exam preparation, automated tutoring).
- • Text completion tasks (e.g., completing sequences of events).
- • Commonsense AI applications (e.g., chatbot responses requiring real-world understanding).
-
- Limitations
-
- • Mathematical Reasoning: The model struggles with tasks requiring numerical problem-solving or complex logical reasoning in math.
- • Context-specific Fine-tuning: The model may require additional fine-tuning for specialized tasks outside of its current scope (e.g., legal reasoning, scientific document comprehension).
 
 Ethical Considerations
 
 
 Evaluation Results
 
 The model was evaluated across a range of tasks. Below are the final evaluation results (after removing GSM8k):
 
 | 1.24B | llama 3.2 | 36.75 | 36.18 | 63.70 | 74.54 | 60.54 | 54.34 |
 | 514M | archeon | NA | 32.34 | 47.80 | 74.37 | 62.12 | 54.16 |
 
+ • ARC Challenge: The model performs decently in answering general knowledge questions.
+ • HellaSwag: The model is strong in commonsense reasoning, performing well in predicting the next sequence of events in a given scenario.
+ • PIQA: The model excels in physical reasoning, showcasing a solid understanding of everyday physical interactions.
+ • Winogrande: It also shows competitive performance in linguistic reasoning tasks.
 
 Ethical Considerations
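The Average column in the evaluation table can be sanity-checked as the plain mean of each row's score columns, with NA entries skipped. This is an assumption about how the column was computed, not something the README states; a minimal sketch:

```python
# Sanity check for the "Average" column of the evaluation table.
# Assumption (not stated in the README): the average is the plain mean
# of the preceding score columns, with "NA" entries skipped.

def row_average(scores):
    """Mean of the numeric scores in a row, ignoring 'NA' placeholders."""
    numeric = [s for s in scores if s != "NA"]
    return round(sum(numeric) / len(numeric), 2)

# Score columns as they appear in the table (Average column excluded).
llama_3_2 = [36.75, 36.18, 63.70, 74.54, 60.54]  # reported average: 54.34
archeon = ["NA", 32.34, 47.80, 74.37, 62.12]     # reported average: 54.16

print(row_average(llama_3_2))  # 54.34
print(row_average(archeon))    # 54.16
```

Both rows reproduce the reported averages, which suggests the NA score was simply excluded from archeon's mean rather than counted as zero.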