Update README.md #2
by ZhaoRuiyu - opened

README.md CHANGED
@@ -9,14 +9,11 @@ For detail, you can read the paper at https://huggingface.co/papers/2412.19638
 To use Xmodel-2 for inference, all you need to do is run the few lines of code demonstrated below. However, please make sure that you are using the latest code and a matching virtual environment.
 
 ```
 import os
 from transformers.models.auto.modeling_auto import AutoModelForCausalLM
 from transformers.models.auto.tokenization_auto import AutoTokenizer
 
-
-os.environ["CUDA_VISIBLE_DEVICES"] = "5"
-
-model_path = os.path.expanduser("~/models/Xmodel-2")
+model_path = os.path.expanduser("/path/to/Xmodel-2")
 
 model = AutoModelForCausalLM.from_pretrained(
     model_path,
@@ -72,4 +69,79 @@ output = output.strip()
 
 print("Generated Response:")
 print(output)
-```
+```

The possible result generated by this code is:

```
Generated Response:
Large language models are advanced artificial intelligence systems that are trained on massive amounts of text data to generate human-like text. These models are typically trained on a large corpus of text data, such as books, articles, and websites, and are able to generate text that is coherent and contextually appropriate.

Large language models are often used in natural language processing (NLP) tasks, such as language translation, text summarization, and text generation. They are also used in a variety of other applications, such as chatbots, virtual assistants, and language learning tools.

Large language models are a key component of the field of artificial intelligence and are being used in a variety of industries and applications. They are a powerful tool for generating human-like text and are helping to transform the way that we interact with technology.
```
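
Since the diff above is abridged, here is a minimal end-to-end sketch of the same flow. Note that `import os` is still required for `os.path.expanduser`; the dtype, device placement, prompt, and generation settings below are illustrative assumptions, not the repository's exact values.

```
# Minimal end-to-end sketch of the README snippet (assumptions noted inline).
import os

import torch
from transformers.models.auto.modeling_auto import AutoModelForCausalLM
from transformers.models.auto.tokenization_auto import AutoTokenizer

model_path = os.path.expanduser("/path/to/Xmodel-2")

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,   # assumption: half precision for a single GPU
    device_map="auto",
    trust_remote_code=True,      # assumption: the checkpoint may ship custom code
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "What are large language models?"  # hypothetical prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, then trim whitespace.
output = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
output = output.strip()

print("Generated Response:")
print(output)
```
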
# Evaluation

## Commonsense Reasoning

We evaluate Xmodel-2 on various commonsense reasoning benchmarks using the Language Model Evaluation Harness: **ARC-Challenge**, **ARC-Easy**, **BoolQ**, **HellaSwag**, **OpenBookQA**, **PiQA**, **SciQ**, **TriviaQA**, and **Winogrande**. For fairness and reproducibility, all models were evaluated in the same environment, and we report raw accuracy.

| Model                  | ARC-c | ARC-e | BoolQ |    HS |    OB |  PiQA |  SciQ | Wino. |   Avg |
| :--------------------- | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| MobiLLama-1B           | 28.24 | 61.53 | 60.92 | 46.74 | 21.80 | 75.14 | 88.20 | 59.27 | 55.23 |
| TinyLLaMA1.1-1.1B      | 30.97 | 61.66 | 55.99 | 46.70 | 25.20 | 72.63 | 89.30 | 59.43 | 55.24 |
| OLMo-1B                | 28.67 | 63.34 | 61.74 | 46.97 | 25.00 | 75.03 | 87.00 | 59.98 | 55.97 |
| OpenELM-1.1B           | 28.84 | 62.37 | 63.58 | 48.36 | 25.40 | 74.76 | 90.60 | 61.72 | 56.95 |
| Llama-3.2-1B           | 31.31 | 65.36 | 63.73 | 47.78 | 26.40 | 74.48 | 91.50 | 61.01 | 57.70 |
| MiniCPM-1.2B           | 36.86 | 70.29 | 67.92 | 49.91 | 23.60 | 74.43 | 91.80 | 60.77 | 59.45 |
| Fox-1-1.6B             | 34.73 | 69.91 | 71.77 | 46.33 | 24.60 | 75.24 | 93.20 | 60.77 | 59.57 |
| InternLM2.5-1.8B       | 35.24 | 66.37 | 79.82 | 46.99 | 22.00 | 73.29 | 94.90 | 62.67 | 60.16 |
| Qwen2-1.5B             | 33.11 | 66.41 | 72.60 | 48.57 | 27.00 | 75.57 | 94.60 | 65.75 | 60.45 |
| StableLM-2-zephyr-1.6B | 36.52 | 66.79 | 80.00 | 53.26 | 26.80 | 74.86 | 88.00 | 64.09 | 61.29 |
| SmolLM-1.7B            | 43.43 | 76.47 | 65.93 | 49.58 | 30.00 | 75.79 | 93.20 | 60.93 | 61.92 |
| Qwen2.5-1.5B           | 41.21 | 75.21 | 72.97 | 50.15 | 31.80 | 75.90 | 94.30 | 63.61 | 63.14 |
| DCLM-1B                | 41.30 | 74.79 | 71.41 | 53.59 | 32.20 | 76.93 | 94.00 | 66.22 | 63.81 |
| Phi-1.5-1.3B           | 44.80 | 76.22 | 74.95 | 47.96 | 38.60 | 76.66 | 93.30 | 72.93 | 65.68 |
| Xmodel-2-1.2B          | 39.16 | 71.55 | 74.65 | 47.45 | 29.20 | 74.81 | 93.60 | 63.93 | 61.79 |
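
The results above come from the Language Model Evaluation Harness; the following is a minimal sketch of how such a run might look with its v0.4 Python API (the checkpoint path, dtype, and batch size are placeholders, not the authors' exact configuration).

```
# Sketch: reproducing the commonsense table with EleutherAI's
# lm-evaluation-harness (v0.4 Python API assumed; `pip install lm-eval`).
# Path, dtype, and batch size below are placeholders to adapt.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/Xmodel-2,dtype=float16",
    tasks=[
        "arc_challenge", "arc_easy", "boolq", "hellaswag",
        "openbookqa", "piqa", "sciq", "triviaqa", "winogrande",
    ],
    batch_size=8,
)

# Per-task metrics (accuracy, etc.) keyed by task name.
print(json.dumps(results["results"], indent=2, default=str))
```
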

## Complex Reasoning

To evaluate the complex reasoning abilities of Xmodel-2, we conducted tests on several well-established benchmarks: **GSM8K**, **MATH**, **BBH**, **MMLU**, **HumanEval**, and **MBPP**. The first four benchmarks were assessed with the Language Model Evaluation Harness, while the last two were evaluated with the Code Generation LM Evaluation Harness.

| Model                  | GSM8K<br>5-shot | MATH<br>4-shot | BBH<br>3-shot | MMLU<br>0-shot | HumanEval<br>pass@1 | MBPP<br>pass@1 |       Avg |
| :--------------------- | --------------: | -------------: | ------------: | -------------: | ------------------: | -------------: | --------: |
| OpenELM-1.1B           |            0.45 |           1.06 |          6.62 |          25.52 |                8.54 |           6.80 |      8.16 |
| OLMo-1B                |            2.35 |           1.46 |         25.60 |          24.46 |                5.49 |           0.20 |      9.93 |
| TinyLLaMA1.1-1.1B      |            2.50 |           1.48 |         25.57 |          25.35 |                1.83 |           3.40 |     10.02 |
| MobiLLama-1B           |            1.97 |           1.54 |         25.76 |          25.26 |                7.93 |           5.40 |     11.31 |
| DCLM-1B                |            4.93 |           2.14 |         30.70 |          46.43 |                8.54 |           6.80 |     16.59 |
| Llama-3.2-1B           |            6.60 |           1.78 |         31.44 |          36.63 |               14.63 |          22.20 |     18.88 |
| SmolLM-1.7B            |            7.51 |           3.18 |         29.21 |          27.73 |               21.34 |          31.80 |     20.13 |
| Fox-1-1.6B             |           34.34 |           7.94 |         28.75 |          39.55 |               14.02 |           9.00 |     22.27 |
| StableLM-2-zephyr-1.6B |           41.32 |          10.12 |         32.71 |          41.30 |               25.61 |          19.40 |     28.41 |
| Phi-1.5-1.3B           |           32.15 |           3.18 |         28.81 |          41.75 |               36.59 |          35.40 |     29.65 |
| InternLM2.5-1.8B       |           27.90 |          16.68 |         41.76 |          46.30 |               27.40 |          29.60 |     31.61 |
| MiniCPM-1.2B           |           40.11 |          10.98 |         35.42 |          43.99 |               43.90 |          36.80 |     35.20 |
| Qwen2-1.5B             |           57.62 |          22.90 |         33.05 |          55.11 |               20.73 |          30.40 |     36.64 |
| Qwen2.5-1.5B           |           62.40 |          28.28 |         43.99 |          59.72 |                5.49 |          40.00 |     39.98 |
| **Xmodel-2-1.2B**      |       **55.88** |      **25.50** |     **48.40** |      **48.87** |           **29.88** |      **29.20** | **39.62** |
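
In the code columns, pass@1 is the fraction of problems whose generated solution passes the unit tests. More generally, the unbiased pass@k estimator from the HumanEval paper, for n samples of which c are correct, is 1 − C(n−c, k)/C(n, k); a small sketch of the standard formula (not code from this repository):

```
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n generations with c correct, passes."""
    if n - c < k:
        return 1.0  # every k-subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per problem and 3 passing, the pass@1 estimate is 0.3.
print(pass_at_k(10, 3, 1))
# At pass@1 = 29.88% on HumanEval, roughly 49 of its 164 problems are solved.
```
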

## Agent Capabilities

We evaluate Xmodel-2’s performance on four agent tasks using the ReAct prompting technique: **HotpotQA**, **FEVER**, **AlfWorld**, and **WebShop**. We use Exact Match (EM) as the evaluation metric for **HotpotQA** and **FEVER**, and success rate for **AlfWorld** and **WebShop**.

| Model                  | HotpotQA (EM) | FEVER (EM) | AlfWorld (success rate) | WebShop (success rate) |   Avg |
| :--------------------- | ------------: | ---------: | ----------------------: | ---------------------: | ----: |
| DCLM-1B                |          4.92 |      24.39 |                    0.75 |                   0.00 |  7.52 |
| MobiLLama-1B           |          0.00 |      30.43 |                    0.00 |                   0.00 |  7.61 |
| TinyLLaMA1.1-1.1B      |          2.11 |      28.77 |                    0.00 |                   0.20 |  7.77 |
| OpenELM-1.1B           |          2.70 |      28.37 |                    0.00 |                   0.40 |  7.87 |
| StableLM-2-zephyr-1.6B |          1.44 |      20.81 |                    8.96 |                   2.20 |  8.35 |
| SmolLM-1.7B            |          2.28 |      31.31 |                    0.00 |                   0.60 |  8.55 |
| Fox-1-1.6B             |          5.37 |      30.88 |                    0.00 |                   0.60 |  9.21 |
| Llama-3.2-1B           |          4.87 |      27.67 |                    8.21 |                   3.20 | 10.99 |
| Qwen2.5-1.5B           |         13.53 |      27.58 |                    5.97 |                   0.60 | 11.92 |
| MiniCPM-1.2B           |         11.00 |      36.57 |                    1.60 |                   1.00 | 12.52 |
| InternLM2.5-1.8B       |         12.84 |      34.02 |                    2.99 |                   1.00 | 12.71 |
| Xmodel-2-1.2B          |         13.70 |      40.00 |                    0.78 |                   2.20 | 14.21 |
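
ReAct interleaves free-text reasoning with tool use: the model emits Thought/Action lines, the environment returns an Observation, and the loop repeats until the model issues a final answer. Below is a minimal sketch of such a loop under assumed tool names and prompt format; the actual evaluation harness, tools, and few-shot exemplars for these tasks are not part of this README.

```
# Minimal ReAct-style loop (illustrative sketch; `generate` and `search`
# are hypothetical callables standing in for the model and a retrieval tool).

REACT_PROMPT = """Answer the question by interleaving Thought, Action, and Observation steps.
Actions available: Search[query], Finish[answer].

Question: {question}
{scratchpad}"""

def react_loop(question: str, generate, search, max_steps: int = 6) -> str:
    """Run one ReAct episode and return the agent's final answer ("" if none)."""
    scratchpad = ""
    for step in range(1, max_steps + 1):
        prompt = REACT_PROMPT.format(question=question, scratchpad=scratchpad)
        step_text = generate(prompt)  # e.g. "Thought 1: ...\nAction 1: Search[...]"
        scratchpad += step_text + "\n"
        if "Finish[" in step_text:    # the agent commits to an answer
            return step_text.split("Finish[", 1)[1].split("]", 1)[0]
        if "Search[" in step_text:    # execute the tool call, append the observation
            query = step_text.split("Search[", 1)[1].split("]", 1)[0]
            scratchpad += f"Observation {step}: {search(query)}\n"
    return ""

# Toy usage with stub callables:
answer = react_loop(
    "Which country is the Eiffel Tower in?",
    generate=lambda p: "Thought 1: I know this.\nAction 1: Finish[France]",
    search=lambda q: "(stub observation)",
)
print(answer)  # France
```

For HotpotQA and FEVER, the returned answer string is then compared against the gold answer to compute Exact Match; AlfWorld and WebShop instead report whether the episode ends in a successful state.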