Files changed (1)
  1. README.md +77 -6

README.md CHANGED
@@ -9,14 +9,10 @@ For detail, you can read the paper at https://huggingface.co/papers/2412.19638

  To run inference with Xmodel-2, only a few lines of code are needed, as demonstrated below. Please make sure you are using the latest code and a matching virtual environment.

  ```
- import os
  from transformers.models.auto.modeling_auto import AutoModelForCausalLM
  from transformers.models.auto.tokenization_auto import AutoTokenizer

-
- os.environ["CUDA_VISIBLE_DEVICES"] = "5"
-
- model_path = os.path.expanduser("~/models/Xmodel-2")
+ model_path = "/path/to/Xmodel-2"

  model = AutoModelForCausalLM.from_pretrained(
      model_path,
@@ -72,4 +68,79 @@ output = output.strip()

  print("Generated Response:")
  print(output)
- ```
+ ```
+
+ A possible response generated by this code:
+ ```
+ Generated Response:
+ Large language models are advanced artificial intelligence systems that are trained on massive amounts of text data to generate human-like text. These models are typically trained on a large corpus of text data, such as books, articles, and websites, and are able to generate text that is coherent and contextually appropriate.
+
+ Large language models are often used in natural language processing (NLP) tasks, such as language translation, text summarization, and text generation. They are also used in a variety of other applications, such as chatbots, virtual assistants, and language learning tools.
+
+ Large language models are a key component of the field of artificial intelligence and are being used in a variety of industries and applications. They are a powerful tool for generating human-like text and are helping to transform the way that we interact with technology.
+ ```
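+
+ For quick reference, here is a minimal, self-contained version of the snippet above; the prompt, dtype, device placement, and generation settings are illustrative rather than the exact values used above:
+ ```
+ import torch
+ from transformers.models.auto.modeling_auto import AutoModelForCausalLM
+ from transformers.models.auto.tokenization_auto import AutoTokenizer
+
+ model_path = "/path/to/Xmodel-2"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ prompt = "What are large language models?"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Illustrative sampling settings; tune them for your use case.
+ output_ids = model.generate(
+     **inputs,
+     max_new_tokens=256,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.9,
+ )
+
+ # Decode only the newly generated tokens.
+ output = tokenizer.decode(
+     output_ids[0][inputs["input_ids"].shape[1]:],
+     skip_special_tokens=True,
+ ).strip()
+
+ print("Generated Response:")
+ print(output)
+ ```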
+
+ # Evaluation
+ ## Commonsense Reasoning
+
+ We evaluate Xmodel-2 on a range of commonsense reasoning benchmarks with the Language Model Evaluation Harness: **ARC-Challenge**, **ARC-Easy**, **BoolQ**, **HellaSwag**, **OpenBookQA**, **PiQA**, **SciQ**, **TriviaQA**, and **Winogrande**. For fairness and reproducibility, all models were evaluated in the same environment, and we report raw accuracy (%). In the table, HS, OB, and Wino. abbreviate HellaSwag, OpenBookQA, and Winogrande; a sketch of the harness invocation follows the table.
+
+ | Model | ARC-c | ARC-e | BoolQ | HS | OB | PiQA | SciQ | Wino. | Avg |
+ | :------------------------ | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: |
+ | MobiLLama-1B | 28.24 | 61.53 | 60.92 | 46.74 | 21.80 | 75.14 | 88.20 | 59.27 | 55.23 |
+ | TinyLLaMA1.1-1.1B | 30.97 | 61.66 | 55.99 | 46.70 | 25.20 | 72.63 | 89.30 | 59.43 | 55.24 |
+ | OLMo-1B | 28.67 | 63.34 | 61.74 | 46.97 | 25.00 | 75.03 | 87.00 | 59.98 | 55.97 |
+ | OpenELM-1.1B | 28.84 | 62.37 | 63.58 | 48.36 | 25.40 | 74.76 | 90.60 | 61.72 | 56.95 |
+ | Llama-3.2-1B | 31.31 | 65.36 | 63.73 | 47.78 | 26.40 | 74.48 | 91.50 | 61.01 | 57.70 |
+ | MiniCPM-1.2B | 36.86 | 70.29 | 67.92 | 49.91 | 23.60 | 74.43 | 91.80 | 60.77 | 59.45 |
+ | Fox-1-1.6B | 34.73 | 69.91 | 71.77 | 46.33 | 24.60 | 75.24 | 93.20 | 60.77 | 59.57 |
+ | InternLM2.5-1.8B | 35.24 | 66.37 | 79.82 | 46.99 | 22.00 | 73.29 | 94.90 | 62.67 | 60.16 |
+ | Qwen2-1.5B | 33.11 | 66.41 | 72.60 | 48.57 | 27.00 | 75.57 | 94.60 | 65.75 | 60.45 |
+ | StableLM-2-zephyr-1.6B | 36.52 | 66.79 | 80.00 | 53.26 | 26.80 | 74.86 | 88.00 | 64.09 | 61.29 |
+ | SmolLM-1.7B | 43.43 | 76.47 | 65.93 | 49.58 | 30.00 | 75.79 | 93.20 | 60.93 | 61.92 |
+ | Qwen2.5-1.5B | 41.21 | 75.21 | 72.97 | 50.15 | 31.80 | 75.90 | 94.30 | 63.61 | 63.14 |
+ | DCLM-1B | 41.30 | 74.79 | 71.41 | 53.59 | 32.20 | 76.93 | 94.00 | 66.22 | 63.81 |
+ | Phi-1.5-1.3B | 44.80 | 76.22 | 74.95 | 47.96 | 38.60 | 76.66 | 93.30 | 72.93 | 65.68 |
+ | **Xmodel-2-1.2B** | 39.16 | 71.55 | 74.65 | 47.45 | 29.20 | 74.81 | 93.60 | 63.93 | 61.79 |
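+
+ As a reproducibility aid, the harness can also be driven from Python; a minimal sketch assuming lm-evaluation-harness v0.4+, where the local model path, dtype, and batch size are illustrative:
+ ```
+ import lm_eval
+
+ # Zero-shot evaluation over the commonsense suite; exact task names may
+ # vary across harness releases.
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args="pretrained=/path/to/Xmodel-2,dtype=bfloat16",
+     tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
+            "openbookqa", "piqa", "sciq", "winogrande"],
+     batch_size=8,
+ )
+ for task, metrics in results["results"].items():
+     print(task, metrics)
+ ```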
+
+ ## Complex Reasoning
+
+ To evaluate the complex reasoning abilities of Xmodel-2, we test on several well-established benchmarks: **GSM8K**, **MATH**, **BBH**, **MMLU**, **HumanEval**, and **MBPP**. The first four are assessed with the Language Model Evaluation Harness, and the last two with the Code Generation LM Evaluation Harness; a sketch of the few-shot setup follows the table.
+
+ | Model | GSM8K<br>5-shot | MATH<br>4-shot | BBH<br>3-shot | MMLU<br>0-shot | HumanEval<br>pass@1 | MBPP<br>pass@1 | Avg |
+ | :------------------------ | --------------: | -------------: | ------------: | -------------: | -------------------: | -------------: | ------: |
+ | OpenELM-1.1B | 0.45 | 1.06 | 6.62 | 25.52 | 8.54 | 6.80 | 8.16 |
+ | OLMo-1B | 2.35 | 1.46 | 25.60 | 24.46 | 5.49 | 0.20 | 9.93 |
+ | TinyLLaMA1.1-1.1B | 2.50 | 1.48 | 25.57 | 25.35 | 1.83 | 3.40 | 10.02 |
+ | MobiLLama-1B | 1.97 | 1.54 | 25.76 | 25.26 | 7.93 | 5.40 | 11.31 |
+ | DCLM-1B | 4.93 | 2.14 | 30.70 | 46.43 | 8.54 | 6.80 | 16.59 |
+ | Llama-3.2-1B | 6.60 | 1.78 | 31.44 | 36.63 | 14.63 | 22.20 | 18.88 |
+ | SmolLM-1.7B | 7.51 | 3.18 | 29.21 | 27.73 | 21.34 | 31.80 | 20.13 |
+ | Fox-1-1.6B | 34.34 | 7.94 | 28.75 | 39.55 | 14.02 | 9.00 | 22.27 |
+ | StableLM-2-zephyr-1.6B | 41.32 | 10.12 | 32.71 | 41.30 | 25.61 | 19.40 | 28.41 |
+ | Phi-1.5-1.3B | 32.15 | 3.18 | 28.81 | 41.75 | 36.59 | 35.40 | 29.65 |
+ | InternLM2.5-1.8B | 27.90 | 16.68 | 41.76 | 46.30 | 27.40 | 29.60 | 31.61 |
+ | MiniCPM-1.2B | 40.11 | 10.98 | 35.42 | 43.99 | 43.90 | 36.80 | 35.20 |
+ | Qwen2-1.5B | 57.62 | 22.90 | 33.05 | 55.11 | 20.73 | 30.40 | 36.64 |
+ | Qwen2.5-1.5B | 62.40 | 28.28 | 43.99 | 59.72 | 5.49 | 40.00 | 39.98 |
+ | **Xmodel-2-1.2B** | **55.88** | **25.50** | **48.40** | **48.87** | **29.88** | **29.20** | **39.62** |
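+
+ Because `num_fewshot` applies to a whole harness run, benchmarks with different shot counts need separate calls; a minimal sketch (the MATH and BBH task identifiers differ between harness releases, so only GSM8K and MMLU are shown, and HumanEval/MBPP require the separate code-generation harness):
+ ```
+ import lm_eval
+
+ # Each benchmark uses its own shot count, so evaluate them one at a time.
+ for task, shots in [("gsm8k", 5), ("mmlu", 0)]:
+     results = lm_eval.simple_evaluate(
+         model="hf",
+         model_args="pretrained=/path/to/Xmodel-2,dtype=bfloat16",
+         tasks=[task],
+         num_fewshot=shots,
+         batch_size=8,
+     )
+     print(task, results["results"])
+ ```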
+
+ ## Agent Capabilities
+
+ We evaluate Xmodel-2 on four agent tasks using the ReAct prompting technique: **HotpotQA**, **FEVER**, **AlfWorld**, and **WebShop**. We report exact match (EM) for **HotpotQA** and **FEVER**, and success rate for **AlfWorld** and **WebShop**; a sketch of the ReAct prompt scaffold follows the table.
+
133
+ | Model | HotpotQA (EM) | FEVER (EM) | AlfWorld (success rate) | WebShop (success rate) | Avg |
134
+ | :------------------------ | -------------: | ----------: | ----------------------: | ---------------------: | -----: |
135
+ | DCLM-1B | 4.92 | 24.39 | 0.75 | 0.00 | 7.52 |
136
+ | MobiLLama-1B | 0.00 | 30.43 | 0.00 | 0.00 | 7.61 |
137
+ | TinyLLama1.1-1.1B | 2.11 | 28.77 | 0.00 | 0.20 | 7.77 |
138
+ | OpenELM-1-1B | 2.70 | 28.37 | 0.00 | 0.40 | 7.87 |
139
+ | StableLM-2-zephyr 1.6B | 1.44 | 20.81 | 8.96 | 2.20 | 8.35 |
140
+ | SmolLM-1.7B | 2.28 | 31.31 | 0.00 | 0.60 | 8.55 |
141
+ | Fox-1-1.6B | 5.37 | 30.88 | 0.00 | 0.60 | 9.21 |
142
+ | Llama-3.2-1B | 4.87 | 27.67 | 8.21 | 3.20 | 10.99 |
143
+ | Qwen2.5-1.5B | 13.53 | 27.58 | 5.97 | 0.60 | 11.92 |
144
+ | MiniCPM-1.2B | 11.00 | 36.57 | 1.60 | 1.00 | 12.52 |
145
+ | InternLM2.5-1.8B | 12.84 | 34.02 | 2.99 | 1.00 | 12.71 |
146
+ | Xmodel-2-1.2B | 13.70 | 40.00 | 0.78 | 2.20 | 14.21 |
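+
+ For context, a ReAct prompt interleaves free-form Thought steps with tool-calling Action steps and their Observations; a minimal sketch of the scaffold (the instruction text, action set, and parsing here are illustrative, not the exact prompts used in these evaluations):
+ ```
+ import re
+
+ # The model alternates Thought / Action / Observation lines and stops when
+ # it emits a Finish[...] action.
+ REACT_TEMPLATE = """Answer the question by interleaving Thought, Action, and
+ Observation steps. Available actions: Search[query], Finish[answer].
+
+ Question: {question}
+ Thought 1:"""
+
+ def parse_action(generation: str):
+     # Pull out the first "Action N: Tool[argument]" line the model emits.
+     m = re.search(r"Action \d+:\s*(\w+)\[(.*?)\]", generation)
+     return (m.group(1), m.group(2)) if m else None
+
+ print(REACT_TEMPLATE.format(question="Who wrote the novel that inspired the film Blade Runner?"))
+ ```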