Files changed (1)
  1. README.md +77 -6

README.md CHANGED
@@ -9,14 +9,10 @@ For detail, you can read the paper at https://huggingface.co/papers/2412.19638

  To run inference with Xmodel-2, only a few lines of code are needed, as demonstrated below. Please make sure you are using the latest code and a matching virtual environment.

  ```
- import os
  from transformers.models.auto.modeling_auto import AutoModelForCausalLM
  from transformers.models.auto.tokenization_auto import AutoTokenizer

-
- os.environ["CUDA_VISIBLE_DEVICES"] = "5"
-
- model_path = os.path.expanduser("~/models/Xmodel-2")
+ model_path = "/path/to/Xmodel-2"

  model = AutoModelForCausalLM.from_pretrained(
      model_path,
@@ -72,4 +68,79 @@ output = output.strip()

  print("Generated Response:")
  print(output)
- ```
+ ```
+
+ A possible response generated by this code:
+ ```
+ Generated Response:
+ Large language models are advanced artificial intelligence systems that are trained on massive amounts of text data to generate human-like text. These models are typically trained on a large corpus of text data, such as books, articles, and websites, and are able to generate text that is coherent and contextually appropriate.
+
+ Large language models are often used in natural language processing (NLP) tasks, such as language translation, text summarization, and text generation. They are also used in a variety of other applications, such as chatbots, virtual assistants, and language learning tools.
+
+ Large language models are a key component of the field of artificial intelligence and are being used in a variety of industries and applications. They are a powerful tool for generating human-like text and are helping to transform the way that we interact with technology.
+ ```
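+
+ For quick reference, here is a minimal, self-contained version of the snippet above; the prompt, dtype, device placement, and generation settings are illustrative rather than the exact values used above:
+ ```
+ import torch
+ from transformers.models.auto.modeling_auto import AutoModelForCausalLM
+ from transformers.models.auto.tokenization_auto import AutoTokenizer
+
+ model_path = "/path/to/Xmodel-2"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ prompt = "What are large language models?"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Illustrative sampling settings; tune them for your use case.
+ output_ids = model.generate(
+     **inputs,
+     max_new_tokens=256,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.9,
+ )
+
+ # Decode only the newly generated tokens.
+ output = tokenizer.decode(
+     output_ids[0][inputs["input_ids"].shape[1]:],
+     skip_special_tokens=True,
+ ).strip()
+
+ print("Generated Response:")
+ print(output)
+ ```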
+
+ # Evaluation
+ ## Commonsense Reasoning
+
+ We evaluate Xmodel-2 on a range of commonsense reasoning benchmarks with the Language Model Evaluation Harness: **ARC-Challenge**, **ARC-Easy**, **BoolQ**, **HellaSwag**, **OpenBookQA**, **PiQA**, **SciQ**, **TriviaQA**, and **Winogrande**. For fairness and reproducibility, all models were evaluated in the same environment, and we report raw accuracy (%). In the table, HS, OB, and Wino. abbreviate HellaSwag, OpenBookQA, and Winogrande; a sketch of the harness invocation follows the table.
+
+ | Model | ARC-c | ARC-e | BoolQ | HS | OB | PiQA | SciQ | Wino. | Avg |
+ | :------------------------ | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: | ------: |
+ | MobiLLama-1B | 28.24 | 61.53 | 60.92 | 46.74 | 21.80 | 75.14 | 88.20 | 59.27 | 55.23 |
+ | TinyLLaMA1.1-1.1B | 30.97 | 61.66 | 55.99 | 46.70 | 25.20 | 72.63 | 89.30 | 59.43 | 55.24 |
+ | OLMo-1B | 28.67 | 63.34 | 61.74 | 46.97 | 25.00 | 75.03 | 87.00 | 59.98 | 55.97 |
+ | OpenELM-1.1B | 28.84 | 62.37 | 63.58 | 48.36 | 25.40 | 74.76 | 90.60 | 61.72 | 56.95 |
+ | Llama-3.2-1B | 31.31 | 65.36 | 63.73 | 47.78 | 26.40 | 74.48 | 91.50 | 61.01 | 57.70 |
+ | MiniCPM-1.2B | 36.86 | 70.29 | 67.92 | 49.91 | 23.60 | 74.43 | 91.80 | 60.77 | 59.45 |
+ | Fox-1-1.6B | 34.73 | 69.91 | 71.77 | 46.33 | 24.60 | 75.24 | 93.20 | 60.77 | 59.57 |
+ | InternLM2.5-1.8B | 35.24 | 66.37 | 79.82 | 46.99 | 22.00 | 73.29 | 94.90 | 62.67 | 60.16 |
+ | Qwen2-1.5B | 33.11 | 66.41 | 72.60 | 48.57 | 27.00 | 75.57 | 94.60 | 65.75 | 60.45 |
+ | StableLM-2-zephyr-1.6B | 36.52 | 66.79 | 80.00 | 53.26 | 26.80 | 74.86 | 88.00 | 64.09 | 61.29 |
+ | SmolLM-1.7B | 43.43 | 76.47 | 65.93 | 49.58 | 30.00 | 75.79 | 93.20 | 60.93 | 61.92 |
+ | Qwen2.5-1.5B | 41.21 | 75.21 | 72.97 | 50.15 | 31.80 | 75.90 | 94.30 | 63.61 | 63.14 |
+ | DCLM-1B | 41.30 | 74.79 | 71.41 | 53.59 | 32.20 | 76.93 | 94.00 | 66.22 | 63.81 |
+ | Phi-1.5-1.3B | 44.80 | 76.22 | 74.95 | 47.96 | 38.60 | 76.66 | 93.30 | 72.93 | 65.68 |
+ | **Xmodel-2-1.2B** | 39.16 | 71.55 | 74.65 | 47.45 | 29.20 | 74.81 | 93.60 | 63.93 | 61.79 |
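+
+ As a reproducibility aid, the harness can also be driven from Python; a minimal sketch assuming lm-evaluation-harness v0.4+, where the local model path, dtype, and batch size are illustrative:
+ ```
+ import lm_eval
+
+ # Zero-shot evaluation over the commonsense suite; exact task names may
+ # vary across harness releases.
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args="pretrained=/path/to/Xmodel-2,dtype=bfloat16",
+     tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag",
+            "openbookqa", "piqa", "sciq", "winogrande"],
+     batch_size=8,
+ )
+ for task, metrics in results["results"].items():
+     print(task, metrics)
+ ```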
+
+ ## Complex Reasoning
+
+ To evaluate the complex reasoning abilities of Xmodel-2, we test on several well-established benchmarks: **GSM8K**, **MATH**, **BBH**, **MMLU**, **HumanEval**, and **MBPP**. The first four are assessed with the Language Model Evaluation Harness, and the last two with the Code Generation LM Evaluation Harness; a sketch of the few-shot setup follows the table.
+
+ | Model | GSM8K<br>5-shot | MATH<br>4-shot | BBH<br>3-shot | MMLU<br>0-shot | HumanEval<br>pass@1 | MBPP<br>pass@1 | Avg |
+ | :------------------------ | --------------: | -------------: | ------------: | -------------: | -------------------: | -------------: | ------: |
+ | OpenELM-1.1B | 0.45 | 1.06 | 6.62 | 25.52 | 8.54 | 6.80 | 8.16 |
+ | OLMo-1B | 2.35 | 1.46 | 25.60 | 24.46 | 5.49 | 0.20 | 9.93 |
+ | TinyLLaMA1.1-1.1B | 2.50 | 1.48 | 25.57 | 25.35 | 1.83 | 3.40 | 10.02 |
+ | MobiLLama-1B | 1.97 | 1.54 | 25.76 | 25.26 | 7.93 | 5.40 | 11.31 |
+ | DCLM-1B | 4.93 | 2.14 | 30.70 | 46.43 | 8.54 | 6.80 | 16.59 |
+ | Llama-3.2-1B | 6.60 | 1.78 | 31.44 | 36.63 | 14.63 | 22.20 | 18.88 |
+ | SmolLM-1.7B | 7.51 | 3.18 | 29.21 | 27.73 | 21.34 | 31.80 | 20.13 |
+ | Fox-1-1.6B | 34.34 | 7.94 | 28.75 | 39.55 | 14.02 | 9.00 | 22.27 |
+ | StableLM-2-zephyr-1.6B | 41.32 | 10.12 | 32.71 | 41.30 | 25.61 | 19.40 | 28.41 |
+ | Phi-1.5-1.3B | 32.15 | 3.18 | 28.81 | 41.75 | 36.59 | 35.40 | 29.65 |
+ | InternLM2.5-1.8B | 27.90 | 16.68 | 41.76 | 46.30 | 27.40 | 29.60 | 31.61 |
+ | MiniCPM-1.2B | 40.11 | 10.98 | 35.42 | 43.99 | 43.90 | 36.80 | 35.20 |
+ | Qwen2-1.5B | 57.62 | 22.90 | 33.05 | 55.11 | 20.73 | 30.40 | 36.64 |
+ | Qwen2.5-1.5B | 62.40 | 28.28 | 43.99 | 59.72 | 5.49 | 40.00 | 39.98 |
+ | **Xmodel-2-1.2B** | **55.88** | **25.50** | **48.40** | **48.87** | **29.88** | **29.20** | **39.62** |
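+
+ Because `num_fewshot` applies to a whole harness run, benchmarks with different shot counts need separate calls; a minimal sketch (the MATH and BBH task identifiers differ between harness releases, so only GSM8K and MMLU are shown, and HumanEval/MBPP require the separate code-generation harness):
+ ```
+ import lm_eval
+
+ # Each benchmark uses its own shot count, so evaluate them one at a time.
+ for task, shots in [("gsm8k", 5), ("mmlu", 0)]:
+     results = lm_eval.simple_evaluate(
+         model="hf",
+         model_args="pretrained=/path/to/Xmodel-2,dtype=bfloat16",
+         tasks=[task],
+         num_fewshot=shots,
+         batch_size=8,
+     )
+     print(task, results["results"])
+ ```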
+
+ ## Agent Capabilities
+
+ We evaluate Xmodel-2 on four agent tasks using the ReAct prompting technique: **HotpotQA**, **FEVER**, **AlfWorld**, and **WebShop**. We report exact match (EM) for **HotpotQA** and **FEVER**, and success rate for **AlfWorld** and **WebShop**; a sketch of the ReAct prompt scaffold follows the table.
+
133
+ | Model | HotpotQA (EM) | FEVER (EM) | AlfWorld (success rate) | WebShop (success rate) | Avg |
134
+ | :------------------------ | -------------: | ----------: | ----------------------: | ---------------------: | -----: |
135
+ | DCLM-1B | 4.92 | 24.39 | 0.75 | 0.00 | 7.52 |
136
+ | MobiLLama-1B | 0.00 | 30.43 | 0.00 | 0.00 | 7.61 |
137
+ | TinyLLama1.1-1.1B | 2.11 | 28.77 | 0.00 | 0.20 | 7.77 |
138
+ | OpenELM-1-1B | 2.70 | 28.37 | 0.00 | 0.40 | 7.87 |
139
+ | StableLM-2-zephyr 1.6B | 1.44 | 20.81 | 8.96 | 2.20 | 8.35 |
140
+ | SmolLM-1.7B | 2.28 | 31.31 | 0.00 | 0.60 | 8.55 |
141
+ | Fox-1-1.6B | 5.37 | 30.88 | 0.00 | 0.60 | 9.21 |
142
+ | Llama-3.2-1B | 4.87 | 27.67 | 8.21 | 3.20 | 10.99 |
143
+ | Qwen2.5-1.5B | 13.53 | 27.58 | 5.97 | 0.60 | 11.92 |
144
+ | MiniCPM-1.2B | 11.00 | 36.57 | 1.60 | 1.00 | 12.52 |
145
+ | InternLM2.5-1.8B | 12.84 | 34.02 | 2.99 | 1.00 | 12.71 |
146
+ | Xmodel-2-1.2B | 13.70 | 40.00 | 0.78 | 2.20 | 14.21 |
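+
+ For context, a ReAct prompt interleaves free-form Thought steps with tool-calling Action steps and their Observations; a minimal sketch of the scaffold (the instruction text, action set, and parsing here are illustrative, not the exact prompts used in these evaluations):
+ ```
+ import re
+
+ # The model alternates Thought / Action / Observation lines and stops when
+ # it emits a Finish[...] action.
+ REACT_TEMPLATE = """Answer the question by interleaving Thought, Action, and
+ Observation steps. Available actions: Search[query], Finish[answer].
+
+ Question: {question}
+ Thought 1:"""
+
+ def parse_action(generation: str):
+     # Pull out the first "Action N: Tool[argument]" line the model emits.
+     m = re.search(r"Action \d+:\s*(\w+)\[(.*?)\]", generation)
+     return (m.group(1), m.group(2)) if m else None
+
+ print(REACT_TEMPLATE.format(question="Who wrote the novel that inspired the film Blade Runner?"))
+ ```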