Update README.md
README.md CHANGED
@@ -18,31 +18,6 @@ dtype: float16
 
 ```
 
-Here's a draft Model Card for your new model, based on the evaluations you have provided.
-
-Model Card: Custom AI Model (514M Parameters)
-
-Model Details
-
-• Architecture: This model is a fine-tuned large language model with 514M parameters, designed to handle a variety of commonsense-reasoning and general-knowledge tasks. It has undergone multiple rounds of evaluation, focused on ARC Challenge, HellaSwag, PIQA, and Winogrande.
-• Model Size: 514M parameters
-
-Use Case and Intended Applications
-
-This model is designed for tasks requiring:
-
-• Commonsense Reasoning: Understanding and predicting everyday physical and linguistic scenarios.
-• Text Comprehension: Completing or interpreting real-world descriptions and ambiguous situations.
-• General Knowledge: Reasoning through questions that require a broad understanding of knowledge domains, such as multiple-choice exams.
-
-Training Data
-
-The model was fine-tuned on several datasets to optimize its performance in the following areas:
-
-• Physical Reasoning: Datasets like PIQA help the model reason about physical situations and solutions.
-• Commonsense and Ambiguous Reasoning: Datasets like HellaSwag and Winogrande help the model make sense of events or situations that require a high degree of commonsense understanding.
-• General Knowledge: The ARC Challenge dataset exposes the model to multiple-choice questions that test general reasoning skills.
-
 Evaluation Results
 
 The model was evaluated across a range of tasks. Below are the final evaluation results (after removing GSM8k):
@@ -54,38 +29,10 @@ The model was evaluated across a range of tasks. Below are the final evaluation
 | 1.24B | llama 3.2 | 36.75 | 36.18 | 63.70 | 74.54 | 60.54 | 54.34 |
 | 514M | archeon | NA | 32.34 | 47.80 | 74.37 | 62.12 | 54.16 |
 
-
-
-
-
-
-Key Strengths
-
-1. Physical and Commonsense Reasoning: The model consistently performs well on tasks like PIQA and HellaSwag, showing strong abilities in understanding and predicting physical scenarios and commonsense events.
-2. Linguistic Reasoning: The model also performs competitively on Winogrande, which tests linguistic understanding and ambiguity resolution.
-
-Key Weaknesses
-
-1. General Knowledge (ARC Challenge): While the model does reasonably well, it lags behind top models on more challenging general-knowledge questions.
-2. Math Reasoning: GSM8k was excluded from the results above because the model scored poorly on it, indicating a potential area for improvement with further fine-tuning.
-
-Recommendations for Improvement
-
-• Fine-Tuning on Mathematical Reasoning: To improve on GSM8k and other math-heavy tasks, consider fine-tuning on datasets like MathQA or MATH.
-• Enhanced General Knowledge: To further improve on general-knowledge tasks (ARC Challenge), additional fine-tuning on datasets like SQuAD, TriviaQA, or other large knowledge datasets would be beneficial.
-
-Model Usage
-
-This model is well-suited to NLP tasks where commonsense and physical reasoning are required, such as:
-
-• Answering multiple-choice questions (e.g., exam preparation, automated tutoring).
-• Text completion (e.g., completing sequences of events).
-• Commonsense AI applications (e.g., chatbot responses requiring real-world understanding).
-
-Limitations
-
-• Mathematical Reasoning: The model struggles with tasks requiring numerical problem-solving or complex logical reasoning in math.
-• Context-specific Fine-tuning: The model may require additional fine-tuning for specialized tasks outside its current scope (e.g., legal reasoning, scientific document comprehension).
+• ARC Challenge: The model performs decently at answering general-knowledge questions.
+• HellaSwag: The model is strong in commonsense reasoning, performing well at predicting the next sequence of events in a given scenario.
+• PIQA: The model excels at physical reasoning, showing a solid understanding of everyday physical interactions.
+• Winogrande: It also shows competitive performance on linguistic-reasoning tasks.
 
 Ethical Considerations
 
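
The card's Model Usage section lists inference-style applications. As a concrete starting point, here is a minimal inference sketch; it assumes the checkpoint is published as a standard Hugging Face causal LM, and the repo id `your-org/archeon-514m` is a placeholder, not a real model name.

```python
# Minimal inference sketch. Assumes the card's checkpoint is a standard
# Hugging Face causal LM; "your-org/archeon-514m" is a placeholder repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/archeon-514m"  # hypothetical id for the 514M archeon model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# A commonsense-style prompt, matching the card's intended use cases.
prompt = "Question: If you drop a glass on a tile floor, what is likely to happen?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Greedy decoding (`do_sample=False`) keeps the example deterministic; sampling settings can be tuned per application.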
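The benchmark scores in the table are the kind reported by EleutherAI's lm-evaluation-harness, which ships tasks named `arc_challenge`, `hellaswag`, `piqa`, and `winogrande`. A hedged sketch of how such numbers are typically reproduced, assuming lm-eval ≥ 0.4 and the same placeholder repo id:

```python
# Hedged sketch: reproducing ARC Challenge / HellaSwag / PIQA / Winogrande
# scores with EleutherAI's lm-evaluation-harness (assumes lm_eval >= 0.4).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/archeon-514m,dtype=float16",  # placeholder id
    tasks=["arc_challenge", "hellaswag", "piqa", "winogrande"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```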
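
The card flags math reasoning as the main weakness and recommends fine-tuning on math datasets. Below is a minimal causal-LM fine-tuning sketch with the Hugging Face Trainer; the repo id is the same placeholder, and GSM8k's train split (whose `question`/`answer` fields are real) stands in for the MathQA or MATH data the card suggests.

```python
# Hedged fine-tuning sketch targeting the math-reasoning weakness noted above.
# "your-org/archeon-514m" is a placeholder; GSM8k's train split stands in for
# the MathQA / MATH datasets suggested in the card.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "your-org/archeon-514m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:          # causal LMs often ship without one
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

raw = load_dataset("gsm8k", "main", split="train")

def tokenize(batch):
    # Join question and worked answer into a single causal-LM training string.
    texts = [f"Question: {q}\nAnswer: {a}"
             for q, a in zip(batch["question"], batch["answer"])]
    return tokenizer(texts, truncation=True, max_length=1024)

train = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="archeon-514m-gsm8k",
                           per_device_train_batch_size=4,
                           num_train_epochs=1,
                           learning_rate=2e-5),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The hyperparameters here are generic defaults for a model of this size, not tuned values; after training, re-running the evaluation sketch above would show whether the GSM8k gap has narrowed.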