enemydw committed
Commit 5eaab69 · verified · 1 Parent(s): 4c2566c

Update README.md

Files changed (1):
  1. README.md +7 -7
README.md CHANGED

@@ -18,12 +18,12 @@ The base LLaMA 3.1-8B model was fine-tuned using the parameter-efficient Low-Ran
 ## Evaluation
 
 To evaluate both the general reasoning abilities and the domain-specific performance of the model, three benchmarks from the lm_eval suite were selected: ARC-Challenge, HellaSwag, and Winogrande. ARC-Challenge tests the model's advanced reasoning and problem-solving skills on complex grade-school science questions. HellaSwag measures the ability to select contextually appropriate continuations, which reflects commonsense reasoning. Winogrande evaluates how well the model understands language and context by testing its ability to determine which person or object a pronoun (like “he,” “she,” or “it”) refers to in challenging sentences. These benchmarks were chosen because they are standard in the field for assessing language model capabilities and provide a meaningful comparison of general language understanding, reasoning, and commonsense knowledge. Additionally, a domain-specific Hobby Test Set was included to measure how well the model generates personalized hobby recommendations.
-| Model Name                 | ARC-Challenge | HellaSwag | Winogrande | Hobby Test Set |
-|----------------------------|:-------------:|:---------:|:----------:|:--------------:|
-| Llama-3-8B (base)          |     51.4%     |   59.9%   |   73.1%    |     62.1%      |
-| Hobby_Recommendation model |     53.7%     |   59.9%   |   73.2%    |     88.5%      |
-| Falcon-7B-Instruct         |     --.-%     |   --.-%   |   --.-%    |     --.-%      |
-| Mistral-7B-Instruct        |     --.-%     |   --.-%   |   --.-%    |     --.-%      |
+| Model Name                 | ARC-Challenge | HellaSwag | Winogrande |
+|----------------------------|:-------------:|:---------:|:----------:|
+| Llama-3-8B (base)          |     51.4%     |   59.9%   |   73.1%    |
+| Hobby_Recommendation model |     53.7%     |   59.9%   |   73.2%    |
+| Falcon-7B-Instruct         |     --.-%     |   --.-%   |   --.-%    |
+| Mistral-7B-Instruct        |     --.-%     |   --.-%   |   --.-%    |
 
 **Model Performance Summary**
 
@@ -59,7 +59,7 @@ Hobby Recommendation:
 ```
 ## Expected Output Format
 
-The expected output from the model is a concise text string that contains only the name of the recommended hobby. The model does not provide reasoning or additional explanation—only the specific hobby recommendation based on the user profile.
+The expected output from the model typically includes the name of a recommended hobby, often followed by a brief explanation or justification. Although the instruction requests only the hobby name, the model may generate a full sentence or paragraph explaining why the hobby fits the user's profile.
 
 ```coding, board games```
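The updated paragraph notes that completions may wrap the hobby name in a full sentence even though the prompt asks for the expected `coding, board games` format alone. A minimal post-processing sketch of one way to score such verbose outputs: match completions against a closed hobby vocabulary. Everything here is illustrative, not part of this repository — the `KNOWN_HOBBIES` list and the `extract_hobbies` helper are hypothetical, and the real test-set label space is not shown in this commit.

```python
import re

# Hypothetical closed vocabulary; the actual Hobby Test Set labels
# are not visible in this diff.
KNOWN_HOBBIES = ["coding", "board games", "hiking", "painting", "photography"]

def extract_hobbies(completion: str) -> list[str]:
    """Return known hobby names mentioned in a completion, in order of
    first appearance, so a verbose answer can still be compared against
    the expected comma-separated format."""
    text = completion.lower()
    found = []
    for hobby in KNOWN_HOBBIES:
        match = re.search(r"\b" + re.escape(hobby) + r"\b", text)
        if match:
            found.append((match.start(), hobby))
    return [hobby for _, hobby in sorted(found)]

verbose = ("I would recommend coding, since your profile mentions puzzles; "
           "board games are also a great social fit.")
print(", ".join(extract_hobbies(verbose)))  # -> coding, board games
```

Normalizing both the model output and the reference labels this way makes the Hobby Test Set accuracy comparison tolerant of the explanatory text the fine-tuned model tends to add.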