diff --git a/ms-swift/docs/source_en/Instruction/Use-tuners.md b/ms-swift/docs/source_en/Instruction/Use-tuners.md new file mode 100644 index 0000000000000000000000000000000000000000..f960591893498d5114bd4cc8b2ccf00e9d520d6a --- /dev/null +++ b/ms-swift/docs/source_en/Instruction/Use-tuners.md @@ -0,0 +1,122 @@ +# Using Tuners + +Tuners refer to additional structural components attached to a model, aimed at reducing the number of training parameters or enhancing training accuracy. The tuners currently supported by SWIFT include: + +- LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/abs/2106.09685) +- LoRA+: [LoRA+: Efficient Low Rank Adaptation of Large Models](https://arxiv.org/pdf/2402.12354.pdf) +- LLaMA PRO: [LLAMA PRO: Progressive LLaMA with Block Expansion](https://arxiv.org/pdf/2401.02415.pdf) +- GaLore/Q-GaLore: [GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection](https://arxiv.org/abs/2403.03507) +- Liger Kernel: [Liger Kernel: Efficient Triton Kernels for LLM Training](https://arxiv.org/abs/2410.10989) +- LISA: [LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning](https://arxiv.org/abs/2403.17919) +- UnSloth: https://github.com/unslothai/unsloth +- SCEdit: [SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing](https://arxiv.org/abs/2312.11392) < [arXiv](https://arxiv.org/abs/2312.11392) | [Project Page](https://scedit.github.io/) > +- NEFTune: [Noisy Embeddings Improve Instruction Finetuning](https://arxiv.org/abs/2310.05914) +- LongLoRA: [Efficient Fine-tuning of Long-Context Large Language Models](https://arxiv.org/abs/2309.12307) +- Adapter: [Parameter-Efficient Transfer Learning for NLP](http://arxiv.org/abs/1902.00751) +- Vision Prompt Tuning: [Visual Prompt Tuning](https://arxiv.org/abs/2203.12119) +- Side: [Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks](https://arxiv.org/abs/1912.13503) +- Res-Tuning: 
[Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone](https://arxiv.org/abs/2310.19859) < [arXiv](https://arxiv.org/abs/2310.19859) | [Project Page](https://res-tuning.github.io/) | [Usage](ResTuning.md) >
+- Tuners provided by [PEFT](https://github.com/huggingface/peft), such as AdaLoRA, DoRA, FourierFT, etc.
+
+## Interface List
+
+### Swift Class Static Interfaces
+
+- `Swift.prepare_model(model, config, **kwargs)`
+  - Function: Loads a tuner into a model. If `config` is a subclass of `PeftConfig`, the corresponding interface from the PEFT library is used to load the tuner. When `config` is a `SwiftConfig`, this interface also accepts `SwiftModel` instances and can be called repeatedly, which is equivalent to passing a dictionary of configs.
+    - This interface supports loading multiple tuners of different types in parallel for concurrent use.
+  - Parameters:
+    - `model`: An instance of `torch.nn.Module` or `SwiftModel` into which the tuner is loaded.
+    - `config`: An instance of `SwiftConfig` or `PeftConfig`, or a dictionary mapping custom tuner names to their configs.
+  - Return Value: An instance of `SwiftModel` or `PeftModel`.
+
+- `Swift.merge_and_unload(model)`
+  - Function: Merges LoRA weights back into the original model and completely unloads the LoRA components.
+  - Parameters:
+    - `model`: An instance of `SwiftModel` or `PeftModel` with LoRA loaded.
+  - Return Value: None.
+
+- `Swift.merge(model)`
+  - Function: Merges LoRA weights back into the original model without unloading the LoRA components.
+  - Parameters:
+    - `model`: An instance of `SwiftModel` or `PeftModel` with LoRA loaded.
+  - Return Value: None.
+
+- `Swift.unmerge(model)`
+  - Function: Splits LoRA weights back out of the merged model weights into the LoRA structure.
+  - Parameters:
+    - `model`: An instance of `SwiftModel` or `PeftModel` with LoRA loaded.
+  - Return Value: None.
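As a rough numerical illustration of what `merge` and `unmerge` do to a single linear weight, the framework-free sketch below folds a LoRA update `scaling * (B @ A)` into a base weight matrix and then splits it back out. The function names, matrix shapes, and `scaling` factor are illustrative assumptions for exposition, not the actual SWIFT implementation.

```python
# Conceptual sketch: merge adds scaling * (B @ A) to the base weight;
# unmerge subtracts the same delta, restoring the original weight.

def lora_delta(lora_A, lora_B, scaling):
    """scaling * (B @ A) for plain nested-list matrices."""
    rows, inner, cols = len(lora_B), len(lora_A), len(lora_A[0])
    return [[scaling * sum(lora_B[i][k] * lora_A[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def merge(weight, lora_A, lora_B, scaling):
    delta = lora_delta(lora_A, lora_B, scaling)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]

def unmerge(weight, lora_A, lora_B, scaling):
    delta = lora_delta(lora_A, lora_B, scaling)
    return [[w - d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2)
A = [[1.0, 2.0]]               # lora_A: r x in_features, rank r = 1
B = [[0.5], [0.25]]            # lora_B: out_features x r
merged = merge(W, A, B, scaling=1.0)
restored = unmerge(merged, A, B, scaling=1.0)
print(merged)    # [[1.5, 1.0], [0.25, 1.5]]
print(restored)  # [[1.0, 0.0], [0.0, 1.0]]
```

The round trip is why `merge` followed by `unmerge` is safe, whereas `merge_and_unload` removes the LoRA modules after merging, so unmerging is no longer possible afterwards.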
+
+- `Swift.save_to_peft_format(ckpt_dir, output_dir)`
+  - Function: Converts stored LoRA checkpoints to a PEFT-compatible format. Key changes include:
+    - The weights of the `default` tuner are moved out of the corresponding `default` folder into the root directory of `output_dir`.
+    - The `{tuner_name}.` segment is removed from weight keys, e.g., `model.layer.0.self.in_proj.lora_A.default.weight` becomes `model.layer.0.self.in_proj.lora_A.weight`.
+    - A `base_model.model.` prefix is added to weight keys.
+    - Note: Only LoRA can be converted; other tuner types will raise an error during conversion because PEFT does not support them. In addition, because `LoRAConfig` contains extra parameters such as `dtype`, conversion to PEFT format is not supported when these parameters are set. In that case, you can manually delete the corresponding fields in `adapter_config.json`.
+  - Parameters:
+    - `ckpt_dir`: The original weights directory.
+    - `output_dir`: The target directory for the converted weights.
+  - Return Value: None.
+
+- `Swift.from_pretrained(model, model_id, adapter_name, revision, **kwargs)`
+  - Function: Loads tuners onto the model from a stored weights directory. If `adapter_name` is not provided, all tuners under the `model_id` directory are loaded. Like `prepare_model`, this interface can be called repeatedly.
+  - Parameters:
+    - `model`: An instance of `torch.nn.Module` or `SwiftModel` onto which the tuners are loaded.
+    - `model_id`: A string identifying the tuner checkpoint to load, either an ID from the model hub or a local directory.
+    - `adapter_name`: Of type `str`, `List[str]`, `Dict[str, str]`, or `None`. If `None`, all tuners in the specified directory are loaded. If a `str` or `List[str]`, only those tuners are loaded. If a `Dict`, the key is the tuner to load, and it is renamed to the corresponding value.
+    - `revision`: If `model_id` is an ID from the model hub, `revision` can specify the corresponding version number.
+
+### SwiftModel Interfaces
+
+Below is a list of interfaces that users may call. Other interfaces, internal or not recommended for direct use, can be browsed in the API Doc generated by running the `make docs` command.
+
+- `SwiftModel.create_optimizer_param_groups(self, **defaults)`
+  - Function: Creates parameter groups based on the loaded tuners; currently, this only applies to the `LoRA+` algorithm.
+  - Parameters:
+    - `defaults`: Default parameters for the `optimizer_groups`, such as `lr` and `weight_decay`.
+  - Return Value:
+    - The created `optimizer_groups`.
+
+- `SwiftModel.add_weighted_adapter(self, ...)`
+  - Function: Merges existing LoRA tuners into one.
+  - Parameters:
+    - This interface is a passthrough to `PeftModel.add_weighted_adapter`; the parameters are described in the [add_weighted_adapter documentation](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.add_weighted_adapter).
+
+- `SwiftModel.save_pretrained(self, save_directory, safe_serialization, adapter_name)`
+  - Function: Saves tuner weights.
+  - Parameters:
+    - `save_directory`: The directory to save to.
+    - `safe_serialization`: Whether to save as safetensors; default is `False`.
+    - `adapter_name`: The adapter tuner(s) to store; if not provided, all tuners are stored.
+
+- `SwiftModel.set_active_adapters(self, adapter_names, offload=None)`
+  - Function: Sets the currently active adapters; adapters not in the list are deactivated.
+    - During inference, the environment variable `USE_UNIQUE_THREAD` can be set to `0` or `1` (default `1`). If set to `0`, `set_active_adapters` takes effect only in the current thread, which then uses the tuners activated within it; tuners in different threads do not interfere with each other.
+  - Parameters:
+    - `adapter_names`: The names of the active tuners.
+    - `offload`: How to handle deactivated adapters. Defaults to `None`, meaning they remain in GPU memory; `cpu` or `meta` offloads them to the CPU or the meta device to reduce memory consumption. When `USE_UNIQUE_THREAD=0`, do not pass `offload`, to avoid affecting other threads.
+  - Return Value: None.
+
+- `SwiftModel.activate_adapter(self, adapter_name)`
+  - Function: Activates a tuner.
+    - During inference, the environment variable `USE_UNIQUE_THREAD` can be set to `0` or `1` (default `1`). If set to `0`, `activate_adapter` takes effect only in the current thread, which then uses the tuners activated within it; tuners in different threads do not interfere with each other.
+  - Parameters:
+    - `adapter_name`: The name of the tuner to be activated.
+  - Return Value: None.
+
+- `SwiftModel.deactivate_adapter(self, adapter_name, offload)`
+  - Function: Deactivates a tuner.
+    - During inference, do not call this interface when `USE_UNIQUE_THREAD=0`.
+  - Parameters:
+    - `adapter_name`: The name of the tuner to be deactivated.
+    - `offload`: How to handle deactivated adapters. Defaults to `None`, meaning they remain in GPU memory; `cpu` or `meta` offloads them to the CPU or the meta device to reduce memory consumption.
+  - Return Value: None.
+
+- `SwiftModel.get_trainable_parameters(self)`
+  - Function: Returns information about the trainable parameters.
+  - Parameters: None.
+  - Return Value: Information about trainable parameters in the following format:
+    ```text
+    trainable params: 100M || all params: 1000M || trainable%: 10.00% || cuda memory: 10GiB.
+ ``` diff --git a/ms-swift/examples/app/llm.sh b/ms-swift/examples/app/llm.sh new file mode 100644 index 0000000000000000000000000000000000000000..661555ed5b3caea5d3d06ba3e11220e5144351ca --- /dev/null +++ b/ms-swift/examples/app/llm.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 swift app \ + --model Qwen/Qwen2.5-7B-Instruct \ + --stream true \ + --infer_backend vllm \ + --max_new_tokens 2048 \ + --gpu_memory_utilization 0.9 \ + --max_model_len 8192 \ + --lang zh diff --git a/ms-swift/examples/custom/sft.sh b/ms-swift/examples/custom/sft.sh new file mode 100644 index 0000000000000000000000000000000000000000..0338381884575bfe2bb3ad9213015e17dfd1e38e --- /dev/null +++ b/ms-swift/examples/custom/sft.sh @@ -0,0 +1,25 @@ +# sh examples/custom/sft.sh +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --custom_register_path examples/custom/dataset.py \ + examples/custom/model.py \ + --model AI-ModelScope/Nemotron-Mini-4B-Instruct \ + --train_type lora \ + --dataset swift/stsb \ + --num_train_epochs 3 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --max_length 2048 \ + --output_dir output \ + --dataset_num_proc 4 diff --git a/ms-swift/examples/deploy/client/llm/base/swift_client.py b/ms-swift/examples/deploy/client/llm/base/swift_client.py new file mode 100644 index 0000000000000000000000000000000000000000..65c5757df045df8d2e91239fcb455ac3572b8f77 --- /dev/null +++ b/ms-swift/examples/deploy/client/llm/base/swift_client.py @@ -0,0 +1,33 @@ +# Copyright (c) Alibaba, Inc. and its affiliates. 
+import os +from typing import List + +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + + +def infer_batch(engine: 'InferEngine', infer_requests: List['InferRequest']): + request_config = RequestConfig(max_tokens=64, temperature=0) + + resp_list = engine.infer(infer_requests, request_config) + + query0 = infer_requests[0].messages[0]['content'] + print(f'query0: {query0}') + print(f'response0: {resp_list[0].choices[0].message.content}') + + +def run_client(host: str = '127.0.0.1', port: int = 8000): + engine = InferClient(host=host, port=port) + print(f'models: {engine.models}') + + infer_requests = [InferRequest(messages=[{'role': 'user', 'content': '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'}])] + infer_batch(engine, infer_requests) + + +if __name__ == '__main__': + from swift.llm import InferEngine, InferRequest, InferClient, RequestConfig, run_deploy, DeployArguments + # NOTE: In a real deployment scenario, please comment out the context of run_deploy. + with run_deploy( + DeployArguments( + model='Qwen/Qwen2.5-1.5B', verbose=False, log_interval=-1, infer_backend='pt', + use_chat_template=False)) as port: + run_client(port=port) diff --git a/ms-swift/examples/infer/demo_grounding.py b/ms-swift/examples/infer/demo_grounding.py new file mode 100644 index 0000000000000000000000000000000000000000..6f20fd8294a3d7515e9f3e349f775f8b044a5d04 --- /dev/null +++ b/ms-swift/examples/infer/demo_grounding.py @@ -0,0 +1,43 @@ +# pip install git+https://github.com/huggingface/transformers.git # transformers>=4.49 +import os +import re +from typing import Literal + +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + + +def draw_bbox_qwen2_vl(image, response, norm_bbox: Literal['norm1000', 'none']): + matches = re.findall( + r'<\|object_ref_start\|>(.*?)<\|object_ref_end\|><\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>', + response) + ref = [] + bbox = [] + for match_ in matches: + ref.append(match_[0]) + bbox.append(list(match_[1:])) + draw_bbox(image, ref, bbox, norm_bbox=norm_bbox) + + +def 
infer_grounding():
+    from swift.llm import PtEngine, RequestConfig, BaseArguments, InferRequest, safe_snapshot_download
+    output_path = 'bbox.png'
+    image = load_image('http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png')
+    infer_request = InferRequest(messages=[{'role': 'user', 'content': 'Task: Object Detection'}], images=[image])
+
+    request_config = RequestConfig(max_tokens=512, temperature=0)
+    adapter_path = safe_snapshot_download('swift/test_grounding')
+    args = BaseArguments.from_pretrained(adapter_path)
+
+    engine = PtEngine(args.model, adapters=[adapter_path])
+    resp_list = engine.infer([infer_request], request_config)
+    response = resp_list[0].choices[0].message.content
+    print(f'lora-response: {response}')
+
+    draw_bbox_qwen2_vl(image, response, norm_bbox=args.norm_bbox)
+    print(f'output_path: {output_path}')
+    image.save(output_path)
+
+
+if __name__ == '__main__':
+    from swift.llm import draw_bbox, load_image
+    infer_grounding()
diff --git a/ms-swift/examples/notebook/qwen2_5-self-cognition/infer.ipynb b/ms-swift/examples/notebook/qwen2_5-self-cognition/infer.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..0d5d3a8d1e2b0e570e4a3b5073bd35535281cf33
--- /dev/null
+++ b/ms-swift/examples/notebook/qwen2_5-self-cognition/infer.ipynb
@@ -0,0 +1,148 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Inference\n",
+    "We trained a checkpoint in the `self-cognition-sft.ipynb` tutorial; here we use `PtEngine` to run inference on it."
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# import some libraries\n", + "import os\n", + "os.environ['CUDA_VISIBLE_DEVICES'] = '0'\n", + "\n", + "from swift.llm import InferEngine, InferRequest, PtEngine, RequestConfig, get_template" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# Hyperparameters for inference\n", + "last_model_checkpoint = 'output/checkpoint-xxx'\n", + "\n", + "# model\n", + "model_id_or_path = 'Qwen/Qwen2.5-3B-Instruct' # model_id or model_path\n", + "system = 'You are a helpful assistant.'\n", + "infer_backend = 'pt'\n", + "\n", + "# generation_config\n", + "max_new_tokens = 512\n", + "temperature = 0\n", + "stream = True" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Get model and template, and load LoRA weights.\n", + "engine = PtEngine(model_id_or_path, adapters=[last_model_checkpoint])\n", + "template = get_template(engine.model_meta.template, engine.tokenizer, default_system=system)\n", + "# You can modify the `default_template` directly here, or pass it in during `engine.infer`.\n", + "engine.default_template = template" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "query: who are you?\n", + "response: I am an artificial intelligence language model named Xiao Huang, developed by ModelScope. I can answer various questions and engage in conversation with humans. If you have any questions or need help, feel free to ask me at any time.\n", + "--------------------------------------------------\n", + "query: What should I do if I can't sleep at night?\n", + "response: If you're having trouble sleeping, there are several things you can try:\n", + "\n", + "1. 
Establish a regular sleep schedule: Try to go to bed and wake up at the same time every day, even on weekends.\n", + "\n", + "2. Create a relaxing bedtime routine: Engage in calming activities before bed, such as reading a book or taking a warm bath.\n", + "\n", + "3. Make your bedroom conducive to sleep: Keep your bedroom cool, dark, and quiet. Invest in comfortable bedding and pillows.\n", + "\n", + "4. Avoid stimulating activities before bed: Avoid using electronic devices, watching TV, or engaging in mentally stimulating activities before bed.\n", + "\n", + "5. Exercise regularly: Regular physical activity can help improve your sleep quality, but avoid exercising too close to bedtime.\n", + "\n", + "6. Manage stress: Practice relaxation techniques, such as deep breathing, meditation, or yoga, to help manage stress and promote better sleep.\n", + "\n", + "7. Limit caffeine and alcohol intake: Both caffeine and alcohol can disrupt sleep patterns, so it's best to limit their consumption, especially in the evening.\n", + "\n", + "8. 
Seek professional help: If you continue to have difficulty sleeping despite trying these strategies, consider seeking help from a healthcare provider or a sleep specialist.\n", + "--------------------------------------------------\n", + "query: 你是谁训练的?\n", + "response: 我是由魔搭团队训练和开发的。\n", + "--------------------------------------------------\n" + ] + } + ], + "source": [ + "query_list = [\n", + " 'who are you?',\n", + " \"What should I do if I can't sleep at night?\",\n", + " '你是谁训练的?',\n", + "]\n", + "\n", + "def infer_stream(engine: InferEngine, infer_request: InferRequest):\n", + " request_config = RequestConfig(max_tokens=max_new_tokens, temperature=temperature, stream=True)\n", + " gen_list = engine.infer([infer_request], request_config)\n", + " query = infer_request.messages[0]['content']\n", + " print(f'query: {query}\\nresponse: ', end='')\n", + " for resp in gen_list[0]:\n", + " if resp is None:\n", + " continue\n", + " print(resp.choices[0].delta.content, end='', flush=True)\n", + " print()\n", + "\n", + "def infer(engine: InferEngine, infer_request: InferRequest):\n", + " request_config = RequestConfig(max_tokens=max_new_tokens, temperature=temperature)\n", + " resp_list = engine.infer([infer_request], request_config)\n", + " query = infer_request.messages[0]['content']\n", + " response = resp_list[0].choices[0].message.content\n", + " print(f'query: {query}')\n", + " print(f'response: {response}')\n", + "\n", + "infer_func = infer_stream if stream else infer\n", + "for query in query_list:\n", + " infer_func(engine, InferRequest(messages=[{'role': 'user', 'content': query}]))\n", + " print('-' * 50)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "test_py310", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + 
"version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/ms-swift/examples/notebook/qwen2_5-self-cognition/infer.sh b/ms-swift/examples/notebook/qwen2_5-self-cognition/infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..d957257cb17b296a03c81e4cab6630f536471639 --- /dev/null +++ b/ms-swift/examples/notebook/qwen2_5-self-cognition/infer.sh @@ -0,0 +1,7 @@ +# Here is the command-line style inference code. +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --adapters output/vx-xxx/checkpoint-xxx \ + --stream true \ + --temperature 0 \ + --max_new_tokens 2048 diff --git a/ms-swift/examples/notebook/qwen2_5-vl-grounding/zh.ipynb b/ms-swift/examples/notebook/qwen2_5-vl-grounding/zh.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..b16b037260bb43236ce065b8fbbda43c5c964f98 --- /dev/null +++ b/ms-swift/examples/notebook/qwen2_5-vl-grounding/zh.ipynb @@ -0,0 +1,261 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Qwen2.5-VL Grounding任务\n", + "\n", + "这里介绍使用qwen2.5-vl进行grounding任务的全流程介绍。当然,你也可以使用internvl2.5或者qwen2-vl等多模态模型。\n", + "\n", + "我们使用[AI-ModelScope/coco](https://modelscope.cn/datasets/AI-ModelScope/coco)数据集来展示整个流程。\n", + "\n", + "如果需要使用自定义数据集,需要符合以下格式:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"描述图像\"}, {\"role\": \"assistant\", \"content\": \"正在沙滩上玩耍\"}], \"images\": [\"/xxx/x.jpg\"], \"objects\": {\"ref\": [\"一只狗\", \"一个女人\"], \"bbox\": [[331.5, 761.4, 853.5, 1594.8], [676.5, 685.8, 1099.5, 1427.4]]}}\n", + "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"找到图像中的\"}, {\"role\": \"assistant\", \"content\": \"\"}], \"images\": [\"/xxx/x.jpg\"], \"objects\": {\"ref\": [\"羊\"], \"bbox\": [[90.9, 
160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}\n", + "{\"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"帮我打开谷歌浏览器\"}, {\"role\": \"assistant\", \"content\": \"Action: click(start_box='')\"}], \"images\": [\"/xxx/x.jpg\"], \"objects\": {\"ref\": [], \"bbox\": [[615, 226]]}}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "ms-swift在预处理数据集时,会使用模型特有的grounding任务格式,将objects中的ref填充``,bbox会根据模型类型选择是否进行0-1000的归一化,并填充``。例如:qwen2-vl为`f'<|object_ref_start|>羊<|object_ref_end|>'`和`f'<|box_start|>(101,201),(150,266)<|box_end|>'`(qwen2.5-vl不进行归一化,只将float型转成int型),internvl2.5则为`f''`和`f'[[101, 201, 150, 266]]'`等。\n", + "\n", + "\n", + "训练之前,你需要从main分支安装ms-swift:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "# pip install git+https://github.com/modelscope/ms-swift.git\n", + "\n", + "git clone https://github.com/modelscope/ms-swift.git\n", + "cd ms-swift\n", + "pip install -e .\n", + "\n", + "# 如果'transformers>=4.49'已经发版,则无需从main分支安装\n", + "pip install git+https://github.com/huggingface/transformers.git" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "然后,使用以下shell进行训练。MAX_PIXELS的参数含义可以查看[这里](https://swift.readthedocs.io/en/latest/Instruction/Command-line-parameters.html#specific-model-arguments)\n", + "\n", + "### 训练\n", + "\n", + "单卡训练:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "# 显存资源:24GiB\n", + "CUDA_VISIBLE_DEVICES=0 \\\n", + "MAX_PIXELS=1003520 \\\n", + "swift sft \\\n", + " --model Qwen/Qwen2.5-VL-7B-Instruct \\\n", + " --dataset 'AI-ModelScope/coco#2000' \\\n", + " --train_type lora \\\n", + " --torch_dtype bfloat16 \\\n", + " --num_train_epochs 1 \\\n", + " --per_device_train_batch_size 1 \\\n", + " 
--per_device_eval_batch_size 1 \\\n", + " --learning_rate 1e-4 \\\n", + " --lora_rank 8 \\\n", + " --lora_alpha 32 \\\n", + " --target_modules all-linear \\\n", + " --freeze_vit true \\\n", + " --gradient_accumulation_steps 16 \\\n", + " --eval_steps 100 \\\n", + " --save_steps 100 \\\n", + " --save_total_limit 5 \\\n", + " --logging_steps 5 \\\n", + " --max_length 2048 \\\n", + " --output_dir output \\\n", + " --warmup_ratio 0.05 \\\n", + " --dataloader_num_workers 4 \\\n", + " --dataset_num_proc 4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "然后我们将训练的模型推送到ModelScope:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "swift export \\\n", + " --adapters output/vx-xxx/checkpoint-xxx \\\n", + " --push_to_hub true \\\n", + " --hub_model_id '' \\\n", + " --hub_token '' \\\n", + " --use_hf false" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "我们将训练的checkpoint推送到[swift/test_grounding](https://modelscope.cn/models/swift/test_grounding)。\n", + "\n", + "### 推理\n", + "\n", + "训练完成后,我们使用以下命令对训练时的验证集进行推理。这里`--adapters`需要替换成训练生成的last checkpoint文件夹。由于adapters文件夹中包含了训练的参数文件,因此不需要额外指定`--model`。\n", + "\n", + "若模型采用的是绝对坐标的方式进行输出,推理时请提前对图像进行缩放而不使用`MAX_PIXELS`或者`--max_pixels`。若是千分位坐标,则没有此约束。\n", + "\n", + "由于我们已经将训练后的checkpoint推送到了ModelScope上,以下推理脚本可以直接运行:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "CUDA_VISIBLE_DEVICES=0 \\\n", + "swift infer \\\n", + " --adapters swift/test_grounding \\\n", + " --stream true \\\n", + " --load_data_args true \\\n", + " --max_new_tokens 512 \\\n", + " --dataset_num_proc 4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "我们也可以使用代码的方式进行推理:\n", + "\n", + 
"单样本推理的例子可以查看[这里](https://github.com/modelscope/ms-swift/blob/main/examples/infer/demo_grounding.py)。" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "os.environ['CUDA_VISIBLE_DEVICES'] = '0'\n", + "\n", + "import re\n", + "from typing import Literal\n", + "from swift.llm import (\n", + " PtEngine, RequestConfig, BaseArguments, InferRequest, safe_snapshot_download, draw_bbox, load_image, load_dataset, InferEngine\n", + ")\n", + "from IPython.display import display\n", + "\n", + "def infer_stream(engine: InferEngine, infer_request: InferRequest):\n", + " request_config = RequestConfig(max_tokens=512, temperature=0, stream=True)\n", + " gen_list = engine.infer([infer_request], request_config)\n", + " query = infer_request.messages[0]['content']\n", + " print(f'query: {query}\\nresponse: ', end='')\n", + " response = ''\n", + " for resp in gen_list[0]:\n", + " if resp is None:\n", + " continue\n", + " delta = resp.choices[0].delta.content\n", + " response += delta\n", + " print(delta, end='', flush=True)\n", + " print()\n", + " return response\n", + "\n", + "def draw_bbox_qwen2_vl(image, response, norm_bbox: Literal['norm1000', 'none']):\n", + " matches = re.findall(\n", + " r'<\\|object_ref_start\\|>(.*?)<\\|object_ref_end\\|><\\|box_start\\|>\\((\\d+),(\\d+)\\),\\((\\d+),(\\d+)\\)<\\|box_end\\|>',\n", + " response)\n", + " ref = []\n", + " bbox = []\n", + " for match_ in matches:\n", + " ref.append(match_[0])\n", + " bbox.append(list(match_[1:]))\n", + " draw_bbox(image, ref, bbox, norm_bbox=norm_bbox)\n", + "\n", + "# 下载权重,并加载模型\n", + "output_dir = 'images_bbox'\n", + "model_id_or_path = 'swift/test_grounding'\n", + "output_dir = os.path.abspath(os.path.expanduser(output_dir))\n", + "adapter_path = safe_snapshot_download(model_id_or_path)\n", + "args = BaseArguments.from_pretrained(adapter_path)\n", + "engine = PtEngine(args.model, adapters=[adapter_path])\n", + "\n", + "# 获取验证集并推理\n", 
+ "_, val_dataset = load_dataset(args.dataset, split_dataset_ratio=args.split_dataset_ratio, num_proc=4, seed=args.seed)\n", + "print(f'output_dir: {output_dir}')\n", + "os.makedirs(output_dir, exist_ok=True)\n", + "for i, data in enumerate(val_dataset):\n", + " image = data['images'][0]\n", + " image = load_image(image['bytes'] or image['path'])\n", + " display(image)\n", + " response = infer_stream(engine, InferRequest(**data))\n", + " draw_bbox_qwen2_vl(image, response, norm_bbox=args.norm_bbox)\n", + " print('-' * 50)\n", + " image.save(os.path.join(output_dir, f'{i}.png'))\n", + " display(image)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "test_py310", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/ms-swift/examples/sampler/mcts/mcts.sh b/ms-swift/examples/sampler/mcts/mcts.sh new file mode 100644 index 0000000000000000000000000000000000000000..6b91ab10b973314b25e0c4d3caa42f07474f3898 --- /dev/null +++ b/ms-swift/examples/sampler/mcts/mcts.sh @@ -0,0 +1,35 @@ +export CUDA_VISIBLE_DEVICES=0 +export USE_OPENCOMPASS_EVALUATOR=True + +swift sample \ + --model ./output/Qwen2.5-Math-7B-Instruct/v40-20250126-161112/checkpoint-20 \ + --orm_model math \ + --sampler_type mcts \ + --sampler_engine vllm \ + --output_dir ./output/sampler/mcts \ + --system ./examples/sampler/system_prompt.txt \ + --stop_words ки \ + --dataset ./datasets/competition_math/small_test.jsonl \ + --num_return_sequences 2 \ + --process_reward_rate 0 \ + --max_new_tokens 2048 + +## Train +# nproc_per_node=8 +# NPROC_PER_NODE=$nproc_per_node \ +# swift sft \ +# --model Qwen/Qwen2.5-Math-7B-Instruct \ +# --train_type full \ +# --torch_dtype bfloat16 \ +# --dataset 
'datasets/gen_V5.jsonl' \ +# --num_train_epochs 1 \ +# --per_device_train_batch_size 1 \ +# --learning_rate 1e-5 \ +# --gradient_accumulation_steps $(expr 128 / $nproc_per_node) \ +# --eval_steps 1000 \ +# --save_steps 10 \ +# --save_total_limit 100 \ +# --max_length 10000 \ +# --logging_steps 5 \ +# --gradient_checkpointing_kwargs '{"use_reentrant": false}' \ +# --deepspeed zero3 diff --git a/ms-swift/examples/train/agent/deepseek_r1.sh b/ms-swift/examples/train/agent/deepseek_r1.sh new file mode 100644 index 0000000000000000000000000000000000000000..3640384f4f9080bc5583817ba42dbfea7d05c39b --- /dev/null +++ b/ms-swift/examples/train/agent/deepseek_r1.sh @@ -0,0 +1,27 @@ +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ + --train_type full \ + --dataset AI-ModelScope/function-calling-chatml \ + --agent_template react_en \ + --loss_scale react \ + --response_prefix '' \ + --torch_dtype bfloat16 \ + --num_train_epochs 2 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps 8 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 8192 \ + --save_only_model true \ + --packing true \ + --use_liger_kernel true \ + --output_dir output \ + --warmup_ratio 0.05 \ + --attn_impl flash_attn \ + --dataloader_num_workers 4 \ + --dataset_num_proc 16 diff --git a/ms-swift/examples/train/all_to_all/train.sh b/ms-swift/examples/train/all_to_all/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..7018f8f2f92e21448b00abf898482be3da77eeaa --- /dev/null +++ b/ms-swift/examples/train/all_to_all/train.sh @@ -0,0 +1,23 @@ +# 70 GiB * 2 +nproc_per_node=2 +NPROC_PER_NODE=$nproc_per_node \ +CUDA_VISIBLE_DEVICES=0,2 \ +max_position_embeddings=10240 \ +image_area=518400 \ +swift sft \ + --model BAAI/Emu3-Gen \ + --train_type lora \ + --dataset 'swift/TextCaps#40' \ + --torch_dtype bfloat16 \ + 
--num_train_epochs 10 \ + --per_device_train_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps 4 \ + --warmup_ratio 0.03 \ + --eval_steps 500 \ + --save_steps 500 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 1024 \ + --weight_decay 0.1 \ + --gradient_checkpointing_kwargs '{"use_reentrant": false}' diff --git a/ms-swift/examples/train/base_to_chat/lora.sh b/ms-swift/examples/train/base_to_chat/lora.sh new file mode 100644 index 0000000000000000000000000000000000000000..32ab0dca898dd0a4a16f56b71b95567ab3ae5c95 --- /dev/null +++ b/ms-swift/examples/train/base_to_chat/lora.sh @@ -0,0 +1,34 @@ +# Use `--template default` +nproc_per_node=2 + +CUDA_VISIBLE_DEVICES=0,1 \ +MASTER_PORT=29501 \ +NPROC_PER_NODE=$nproc_per_node \ +swift sft \ + --model Qwen/Qwen2.5-1.5B \ + --train_type lora \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition' \ + --torch_dtype bfloat16 \ + --template default \ + --num_train_epochs 10 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful assistant.' \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot \ + --deepspeed zero2 diff --git a/ms-swift/examples/train/full/infer.sh b/ms-swift/examples/train/full/infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..09da3bb9313deb607f211b1f410efc77dd3de2d9 --- /dev/null +++ b/ms-swift/examples/train/full/infer.sh @@ -0,0 +1,7 @@ +# If you are using the validation set for inference, add the parameter `--load_data_args true`. 
+CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --model output/vx-xxx/checkpoint-xxx \ + --stream true \ + --temperature 0 \ + --max_new_tokens 2048 diff --git a/ms-swift/examples/train/full/train.sh b/ms-swift/examples/train/full/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..c8d11703a2e092dd03c7efb8b795ffd324258d4b --- /dev/null +++ b/ms-swift/examples/train/full/train.sh @@ -0,0 +1,25 @@ +# 76GiB +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type full \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition#500' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful assistant.' \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/grpo/internal/README.md b/ms-swift/examples/train/grpo/internal/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d15220fc6cdf2a9ad7cb51bf8050b9aafb39258a --- /dev/null +++ b/ms-swift/examples/train/grpo/internal/README.md @@ -0,0 +1,48 @@ +# README: GRPO Internal Mode Execution Scripts + +--- + +## Known Issues +Bugs in **vLLM >= 0.8**: +1. DeepSpeed ZeRO-3 Mode: + When using DeepSpeed's ZeRO-3 configuration, gradients may become zero during training. + +2. Async Mode: + In certain scenarios, asynchronous mode (Async Mode) may hang, causing the program to become unresponsive. + +To ensure stability and compatibility, it is recommended to use **vLLM 0.7.3** to avoid the above issues.
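Since these issues are version-specific, a small guard in a launcher script can catch an incompatible vLLM install before training starts. A minimal sketch; `is_recommended_vllm` and `warn_if_incompatible` are hypothetical helpers, not part of SWIFT:

```python
# Check the installed vLLM version against the recommended pin before launching
# GRPO internal-mode training. These helpers are illustrative only.
from importlib.metadata import PackageNotFoundError, version


def is_recommended_vllm(installed: str, recommended: str = "0.7.3") -> bool:
    """True if `installed` matches the recommended version, ignoring local suffixes like '+cu121'."""
    return installed.split("+")[0] == recommended


def warn_if_incompatible() -> None:
    """Print a warning when vLLM is missing or not at the recommended version."""
    try:
        installed = version("vllm")
    except PackageNotFoundError:
        print("vllm is not installed")
        return
    if not is_recommended_vllm(installed):
        print(f"Warning: vllm {installed} detected; 0.7.3 is recommended for GRPO internal mode")
```

Calling `warn_if_incompatible()` at the top of a training entry point makes the version requirement visible instead of failing later with zero gradients or a hang.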
+ + +## **Introduction** + +The GRPO (Group Relative Policy Optimization) training framework supports integrating high-performance inference engines like vLLM to accelerate the sampling process. The **Internal Mode** launches the inference service directly within the Trainer, reducing external dependencies and simplifying deployment. + +This folder contains scripts and instructions for running GRPO in **Internal Mode**, where model training and inference are tightly integrated with flexible resource allocation strategies. + + +## **Resource Allocation Strategies** + +GRPO provides two resource allocation strategies under Internal mode: + +### 1. **Colocate Mode** + +- **Description**: Training and inference share GPU resources. +- **Recommended Setting**: + - Set `sleep_level=1` to release vLLM memory during training steps. +- **Resource Allocation Rules**: + ```plaintext + NPROC_PER_NODE = Total number of GPUs + num_infer_workers = Total number of GPUs + ``` + +### 2. **Async Mode** + +- **Description**: Training and inference use independent GPU resources. +- **Recommended Setting**: + - Set `sleep_level=1` to release vLLM memory during training steps.
+- **Resource Allocation Rules**: + ```plaintext + NPROC_PER_NODE = Number of training GPUs + num_infer_workers = Number of inference GPUs + Must satisfy: Number of training GPUs + Number of inference GPUs = Total GPU count + ``` diff --git a/ms-swift/examples/train/grpo/plugin/run_external_rm.sh b/ms-swift/examples/train/grpo/plugin/run_external_rm.sh new file mode 100644 index 0000000000000000000000000000000000000000..e2fcac2c423405e14844bcaf1619c0e466ea1ab3 --- /dev/null +++ b/ms-swift/examples/train/grpo/plugin/run_external_rm.sh @@ -0,0 +1,35 @@ +# pip install math_verify # reward function +# pip install -U trl +# GPU memory: 80GiB + +CUDA_VISIBLE_DEVICES=0 \ +swift rlhf \ + --rlhf_type grpo \ + --model Qwen/Qwen2.5-7B-Instruct \ + --external_plugins examples/train/grpo/plugin/plugin.py \ + --reward_funcs external_math_acc external_math_format \ + --train_type lora \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --torch_dtype bfloat16 \ + --dataset 'AI-MO/NuminaMath-TIR#1000' \ + --max_completion_length 1024 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 4 \ + --per_device_eval_batch_size 4 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps 1 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --dataset_num_proc 4 \ + --num_generations 4 \ + --temperature 0.9 \ + --system 'examples/train/grpo/prompt.txt' \ + --log_completions true diff --git a/ms-swift/examples/train/long_text/sequence_parallel.sh b/ms-swift/examples/train/long_text/sequence_parallel.sh new file mode 100644 index 0000000000000000000000000000000000000000..934d31c9f1fc8b2f0c5df977779d053be5eef5a8 --- /dev/null +++ b/ms-swift/examples/train/long_text/sequence_parallel.sh @@ -0,0 +1,28 @@ +# Env: 4 * A100 +# Max Length: 16K +# GPU Memory: 4 * 43GiB, Training Speed 12s/it +NPROC_PER_NODE=4 \ 
+CUDA_VISIBLE_DEVICES=0,1,2,3 \ +swift sft \ + --model Qwen/Qwen2.5-7B \ + --train_type full \ + --dataset 'AI-ModelScope/LongAlpaca-12k' \ + --torch_dtype bfloat16 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps 8 \ + --packing true \ + --eval_steps 200 \ + --save_steps 200 \ + --logging_steps 5 \ + --max_length 16384 \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 8 \ + --dataset_num_proc 8 \ + --save_total_limit 2 \ + --save_only_model true \ + --output_dir output/Qwen2.5-7B \ + --deepspeed zero3 \ + --attn_impl flash_attn \ + --sequence_parallel_size 4 diff --git a/ms-swift/examples/train/megatron/base_to_chat.sh b/ms-swift/examples/train/megatron/base_to_chat.sh new file mode 100644 index 0000000000000000000000000000000000000000..d4e0c6e4217c378cff07236a2ab9a32563a28c97 --- /dev/null +++ b/ms-swift/examples/train/megatron/base_to_chat.sh @@ -0,0 +1,28 @@ +# 8 * 65GiB +NPROC_PER_NODE=8 \ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +megatron sft \ + --load Qwen2.5-14B-mcore \ + --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ + --tensor_model_parallel_size 4 \ + --micro_batch_size 1 \ + --global_batch_size 16 \ + --packing true \ + --recompute_granularity selective \ + --train_iters 2000 \ + --eval_iters 50 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_iters 100 \ + --min_lr 1e-6 \ + --save megatron_output/Qwen2.5-14B \ + --eval_interval 200 \ + --save_interval 200 \ + --max_length 8192 \ + --num_workers 8 \ + --dataset_num_proc 8 \ + --no_save_optim true \ + --no_save_rng true \ + --sequence_parallel true \ + --use_flash_attn true diff --git a/ms-swift/examples/train/megatron/benchmark/deepspeed.sh b/ms-swift/examples/train/megatron/benchmark/deepspeed.sh new file mode 100644 index 0000000000000000000000000000000000000000..bb67bce1fd2bff79f380340be026bc798ed9f100 --- /dev/null +++ 
b/ms-swift/examples/train/megatron/benchmark/deepspeed.sh @@ -0,0 +1,28 @@ +# 8 * 80GiB +# Corresponding Megatron-SWIFT script reference: +# https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/base_to_chat.sh +NPROC_PER_NODE=8 \ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +swift sft \ + --model Qwen/Qwen2.5-14B \ + --train_type full \ + --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ + --torch_dtype bfloat16 \ + --max_steps 2000 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps 2 \ + --packing true \ + --eval_steps 200 \ + --save_steps 200 \ + --logging_steps 5 \ + --max_length 8192 \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 8 \ + --dataset_num_proc 8 \ + --save_total_limit -1 \ + --save_only_model true \ + --output_dir output/Qwen2.5-14B \ + --deepspeed zero2 \ + --attn_impl flash_attn diff --git a/ms-swift/examples/train/megatron/moe.sh b/ms-swift/examples/train/megatron/moe.sh new file mode 100644 index 0000000000000000000000000000000000000000..34e2f8f2c8a2bdc84f383c36bd745bfcfb84ac60 --- /dev/null +++ b/ms-swift/examples/train/megatron/moe.sh @@ -0,0 +1,32 @@ +# 8 * 65GiB +NPROC_PER_NODE=8 \ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +megatron sft \ + --load Qwen1.5-MoE-A2.7B-mcore \ + --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ + --tensor_model_parallel_size 2 \ + --expert_model_parallel_size 4 \ + --moe_grouped_gemm true \ + --moe_shared_expert_overlap true \ + --moe_aux_loss_coeff 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 16 \ + --packing true \ + --recompute_granularity selective \ + --train_iters 2000 \ + --eval_iters 50 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_iters 100 \ + --min_lr 1e-6 \ + --save megatron_output/Qwen1.5-MoE-A2.7B \ + --eval_interval 200 \ + --save_interval 200 \ + --max_length 8192 \ + --num_workers 8 \ + --dataset_num_proc 8 \ + 
--no_save_optim true \ + --no_save_rng true \ + --sequence_parallel true \ + --use_flash_attn true diff --git a/ms-swift/examples/train/megatron/multi-node/node2.sh b/ms-swift/examples/train/megatron/multi-node/node2.sh new file mode 100644 index 0000000000000000000000000000000000000000..9402e6e6f3593c58da62d022d9d230bff09fb720 --- /dev/null +++ b/ms-swift/examples/train/megatron/multi-node/node2.sh @@ -0,0 +1,31 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +NNODES=2 \ +NODE_RANK=1 \ +MASTER_ADDR=xxx.xxx.xxx.xxx \ +MASTER_PORT=29500 \ +NPROC_PER_NODE=4 \ +megatron sft \ + --load Qwen2.5-14B-mcore \ + --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ + --tensor_model_parallel_size 4 \ + --micro_batch_size 1 \ + --global_batch_size 16 \ + --packing true \ + --recompute_granularity selective \ + --train_iters 2000 \ + --eval_iters 50 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_iters 100 \ + --min_lr 1e-6 \ + --save megatron_output/Qwen2.5-14B \ + --eval_interval 200 \ + --save_interval 200 \ + --max_length 8192 \ + --num_workers 8 \ + --dataset_num_proc 8 \ + --no_save_optim true \ + --no_save_rng true \ + --sequence_parallel true \ + --use_flash_attn true diff --git a/ms-swift/examples/train/megatron/qwen3_32b.sh b/ms-swift/examples/train/megatron/qwen3_32b.sh new file mode 100644 index 0000000000000000000000000000000000000000..5967f5401835195b1a8c9ab659e51c9677ff0a92 --- /dev/null +++ b/ms-swift/examples/train/megatron/qwen3_32b.sh @@ -0,0 +1,31 @@ +# 8 * 80GiB +NPROC_PER_NODE=8 \ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +megatron sft \ + --load Qwen3-32B-mcore \ + --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ + --tensor_model_parallel_size 8 \ + --micro_batch_size 1 \ + --global_batch_size 16 \ + --packing true \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --train_iters 10000 \ + --max_epochs 5 \ + --eval_iters 50 \ + --finetune true \ + 
--cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_iters 100 \ + --min_lr 1e-6 \ + --save megatron_output/Qwen3-32B \ + --eval_interval 500 \ + --save_interval 500 \ + --max_length 8192 \ + --num_workers 8 \ + --dataset_num_proc 8 \ + --no_save_optim true \ + --no_save_rng true \ + --sequence_parallel true \ + --attention_backend flash diff --git a/ms-swift/examples/train/megatron/qwen3_moe.sh b/ms-swift/examples/train/megatron/qwen3_moe.sh new file mode 100644 index 0000000000000000000000000000000000000000..c4b241149cf2d68d3bdc0ee797be4d5b542146c8 --- /dev/null +++ b/ms-swift/examples/train/megatron/qwen3_moe.sh @@ -0,0 +1,37 @@ +# ZeRO3: 91.2s/it; 16 * 80GiB +# Megatron-LM: 9.6s/it; 16 * 60GiB +# Launch using Alibaba Cloud DLC +# ref: https://github.com/modelscope/ms-swift/blob/main/examples/train/multi-node/dlc/train.sh +NNODES=$WORLD_SIZE \ +NODE_RANK=$RANK \ +megatron sft \ + --load Qwen3-30B-A3B-Base-mcore \ + --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \ + --tensor_model_parallel_size 2 \ + --expert_model_parallel_size 8 \ + --moe_grouped_gemm true \ + --moe_shared_expert_overlap true \ + --moe_aux_loss_coeff 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 16 \ + --packing true \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --train_iters 2000 \ + --eval_iters 50 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_iters 100 \ + --min_lr 1e-6 \ + --save megatron_output/Qwen3-30B-A3B-Base \ + --eval_interval 200 \ + --save_interval 200 \ + --max_length 8192 \ + --num_workers 8 \ + --dataset_num_proc 8 \ + --no_save_optim true \ + --no_save_rng true \ + --sequence_parallel true \ + --use_flash_attn true diff --git a/ms-swift/examples/train/moe/llama4.sh b/ms-swift/examples/train/moe/llama4.sh new file mode 100644 index 0000000000000000000000000000000000000000..8ab5a3b26a74e1acdbd5b75912a0b13d0c2a0e56 --- /dev/null +++ 
b/ms-swift/examples/train/moe/llama4.sh @@ -0,0 +1,28 @@ +# Manually select `target_modules` to avoid 'all-linear' selecting 'router' +NPROC_PER_NODE=4 \ +USE_HF=1 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +swift sft \ + --model meta-llama/Llama-4-Scout-17B-16E-Instruct \ + --dataset 'linxy/LaTeX_OCR:full#5000' \ + --train_type lora \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_regex '^(language_model).*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)$' \ + --freeze_vit true \ + --gradient_accumulation_steps 4 \ + --gradient_checkpointing true \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --deepspeed zero3 \ + --dataloader_num_workers 4 diff --git a/ms-swift/examples/train/moe/qwen2_5_moe.sh b/ms-swift/examples/train/moe/qwen2_5_moe.sh new file mode 100644 index 0000000000000000000000000000000000000000..9677a17cf8ce37a0abb7b4231ab9f8e667d0fea9 --- /dev/null +++ b/ms-swift/examples/train/moe/qwen2_5_moe.sh @@ -0,0 +1,28 @@ +# Manually select `target_modules` to avoid 'all-linear' selecting 'gate' +CUDA_VISIBLE_DEVICES=0,1 \ +swift sft \ + --model Qwen/Qwen2-57B-A14B-Instruct \ + --train_type lora \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition#500' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules q_proj k_proj v_proj o_proj gate_proj up_proj down_proj \ + --gradient_accumulation_steps 16 \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful assistant.' 
\ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/multi-gpu/ddp_device_map/train.sh b/ms-swift/examples/train/multi-gpu/ddp_device_map/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..e0bc2bd299c88a216bb44f210e298b3240d31b06 --- /dev/null +++ b/ms-swift/examples/train/multi-gpu/ddp_device_map/train.sh @@ -0,0 +1,30 @@ +# 14GiB * 4 +nproc_per_node=2 + +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +NPROC_PER_NODE=$nproc_per_node \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'swift/self-cognition#1000' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful assistant.' 
\ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot \ + --gradient_checkpointing_kwargs '{"use_reentrant": false}' diff --git a/ms-swift/examples/train/multi-gpu/deepspeed/train_zero2.sh b/ms-swift/examples/train/multi-gpu/deepspeed/train_zero2.sh new file mode 100644 index 0000000000000000000000000000000000000000..deaea2afa032b5529914de329fdd6c6ad76a24fe --- /dev/null +++ b/ms-swift/examples/train/multi-gpu/deepspeed/train_zero2.sh @@ -0,0 +1,30 @@ +# 18GiB * 2 +nproc_per_node=2 + +CUDA_VISIBLE_DEVICES=0,1 \ +NPROC_PER_NODE=$nproc_per_node \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'swift/self-cognition#1000' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful assistant.' 
\ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot \ + --deepspeed zero2 diff --git a/ms-swift/examples/train/multi-gpu/deepspeed/train_zero3.sh b/ms-swift/examples/train/multi-gpu/deepspeed/train_zero3.sh new file mode 100644 index 0000000000000000000000000000000000000000..04a32a9da740e5b3bdd245ec568ce40e69c3e2bf --- /dev/null +++ b/ms-swift/examples/train/multi-gpu/deepspeed/train_zero3.sh @@ -0,0 +1,30 @@ +# 16GiB * 2 +nproc_per_node=2 + +CUDA_VISIBLE_DEVICES=0,1 \ +NPROC_PER_NODE=$nproc_per_node \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'swift/self-cognition#1000' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful assistant.' 
\ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot \ + --deepspeed zero3 diff --git a/ms-swift/examples/train/multi-gpu/fsdp_qlora/fsdp_offload.json b/ms-swift/examples/train/multi-gpu/fsdp_qlora/fsdp_offload.json new file mode 100644 index 0000000000000000000000000000000000000000..aa70be73d9c4144658fdaffa9551ac75faa03d7a --- /dev/null +++ b/ms-swift/examples/train/multi-gpu/fsdp_qlora/fsdp_offload.json @@ -0,0 +1,28 @@ +{ + "compute_environment": "LOCAL_MACHINE", + "debug": false, + "distributed_type": "FSDP", + "downcast_bf16": "no", + "fsdp_config": { + "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP", + "fsdp_backward_prefetch": "BACKWARD_PRE", + "fsdp_cpu_ram_efficient_loading": true, + "fsdp_forward_prefetch": false, + "fsdp_offload_params": true, + "fsdp_sharding_strategy": "FULL_SHARD", + "fsdp_state_dict_type": "FULL_STATE_DICT", + "fsdp_sync_module_states": true, + "fsdp_use_orig_params": false + }, + "machine_rank": 0, + "main_training_function": "main", + "mixed_precision": "no", + "num_machines": 1, + "num_processes": 2, + "rdzv_backend": "static", + "same_network": true, + "tpu_env": [], + "tpu_use_cluster": false, + "tpu_use_sudo": false, + "use_cpu": false +} diff --git a/ms-swift/examples/train/multi-gpu/fsdp_qlora/train.sh b/ms-swift/examples/train/multi-gpu/fsdp_qlora/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..3da5be1b9b39912edf53814fd166a89844f28a5c --- /dev/null +++ b/ms-swift/examples/train/multi-gpu/fsdp_qlora/train.sh @@ -0,0 +1,34 @@ +# 14GiB * 2 +nproc_per_node=2 + +CUDA_VISIBLE_DEVICES=0,1 \ +accelerate launch --config_file "./examples/train/multi-gpu/fsdp_qlora/fsdp_offload.json" \ + swift/cli/sft.py \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'swift/self-cognition#1000' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --quant_bits 4 \ + 
--bnb_4bit_compute_dtype bfloat16 \ + --bnb_4bit_quant_storage bfloat16 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --gradient_checkpointing true \ + --weight_decay 0.1 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful assistant.' \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/multi-node/accelerate/multi_node.yaml b/ms-swift/examples/train/multi-node/accelerate/multi_node.yaml new file mode 100644 index 0000000000000000000000000000000000000000..b88f57f5d33787185d6c9207df1213fdbe04f0ec --- /dev/null +++ b/ms-swift/examples/train/multi-node/accelerate/multi_node.yaml @@ -0,0 +1,17 @@ +compute_environment: LOCAL_MACHINE +deepspeed_config: + deepspeed_multinode_launcher: standard + gradient_accumulation_steps: 16 + offload_optimizer_device: none + offload_param_device: none + zero3_init_flag: false + zero_stage: 3 +distributed_type: DEEPSPEED +main_process_ip: 'xxx.xxx.xxx.xxx' +main_process_port: 29500 +main_training_function: main +mixed_precision: bf16 +num_machines: 2 +num_processes: 8 # world size +rdzv_backend: static +use_cpu: false diff --git a/ms-swift/examples/train/multi-node/accelerate/train_node1.sh b/ms-swift/examples/train/multi-node/accelerate/train_node1.sh new file mode 100644 index 0000000000000000000000000000000000000000..03f630e5614b26115a7a1cb12effc28a9f337898 --- /dev/null +++ b/ms-swift/examples/train/multi-node/accelerate/train_node1.sh @@ -0,0 +1,18 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +accelerate launch --config_file ./examples/train/multi-node/accelerate/multi_node.yaml --machine_rank 0 \ + swift/cli/sft.py \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --torch_dtype bfloat16 \ + --dataset 
'swift/self-cognition#1000' \ + --num_train_epochs 1 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --learning_rate 1e-4 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/multi-node/accelerate/train_node2.sh b/ms-swift/examples/train/multi-node/accelerate/train_node2.sh new file mode 100644 index 0000000000000000000000000000000000000000..2149a5a83fd93e67984b7756cda6d44d504fd845 --- /dev/null +++ b/ms-swift/examples/train/multi-node/accelerate/train_node2.sh @@ -0,0 +1,18 @@ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +accelerate launch --config_file ./examples/train/multi-node/accelerate/multi_node.yaml --machine_rank 1 \ + swift/cli/sft.py \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --torch_dtype bfloat16 \ + --dataset 'swift/self-cognition#1000' \ + --num_train_epochs 1 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --learning_rate 1e-4 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/multi-node/deepspeed/README.md b/ms-swift/examples/train/multi-node/deepspeed/README.md new file mode 100644 index 0000000000000000000000000000000000000000..6b2d8ad582647793ab87a10efd96ccd90eb78b99 --- /dev/null +++ b/ms-swift/examples/train/multi-node/deepspeed/README.md @@ -0,0 +1,42 @@ +# How to run + +## 1. 
Install pdsh on your nodes + +```shell +# https://code.google.com/archive/p/pdsh/downloads +# For example, download to /root: +cd /root +wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/pdsh/pdsh-2.29.tar.bz2 +tar -xvf pdsh-2.29.tar.bz2 +cd pdsh-2.29 +./configure --prefix=/root/pdsh-2.29 --with-ssh --without-rsh --with-exec --with-timeout=60 --with-nodeupdown --with-rcmd-rank-list=ssh +make +make install +``` + +Make sure the ownership is correct: +```shell +chown root:root /root/pdsh-2.29 +``` + +## 2. Configure SSH + +Edit your `~/.ssh/config` and add: +```text +Host worker-0 + HostName your-worker-0-ip-here + User root +Host worker-1 + HostName your-worker-1-ip-here + User root +``` +Assuming you have two nodes, make sure each node can log in to the others with `ssh root@worker-x` without a password (i.e. with an SSH key). + +## 3. Clone the swift repo and run + +```shell +git clone https://github.com/modelscope/ms-swift.git +cd ms-swift +# If your node number is different, edit examples/train/multi-node/deepspeed/host.txt +sh examples/train/multi-node/deepspeed/train.sh +``` diff --git a/ms-swift/examples/train/multi-node/deepspeed/train.sh b/ms-swift/examples/train/multi-node/deepspeed/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..8616c737e5725ce03c4c7e729427d0bc7694d260 --- /dev/null +++ b/ms-swift/examples/train/multi-node/deepspeed/train.sh @@ -0,0 +1,19 @@ +# If you only need some of the GPUs on each node, try: +# --include="worker-0:0,1@worker-1:2,3" +deepspeed --hostfile=./examples/train/multi-node/deepspeed/host.txt \ + swift/cli/sft.py \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --torch_dtype bfloat16 \ + --dataset 'swift/self-cognition#1000' \ + --num_train_epochs 1 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --learning_rate 1e-4 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author
swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/multi-node/dlc/train.sh b/ms-swift/examples/train/multi-node/dlc/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..accbe6e7167a34254805d598b0d1ac46733c5d73 --- /dev/null +++ b/ms-swift/examples/train/multi-node/dlc/train.sh @@ -0,0 +1,24 @@ +# https://help.aliyun.com/zh/pai/user-guide/general-environment-variables +NNODES=$WORLD_SIZE \ +NODE_RANK=$RANK \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type full \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#20000' \ + 'AI-ModelScope/alpaca-gpt4-data-en#20000' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps 4 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 8192 \ + --output_dir output \ + --system 'You are a helpful assistant.' \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --deepspeed zero2 diff --git a/ms-swift/examples/train/multi-node/swift/train_node1.sh b/ms-swift/examples/train/multi-node/swift/train_node1.sh new file mode 100644 index 0000000000000000000000000000000000000000..f6e66edbf5b4742ffc72555ad6219c4de50d95d3 --- /dev/null +++ b/ms-swift/examples/train/multi-node/swift/train_node1.sh @@ -0,0 +1,30 @@ +nnodes=2 +nproc_per_node=4 + +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +NNODES=$nnodes \ +NODE_RANK=0 \ +MASTER_ADDR=127.0.0.1 \ +MASTER_PORT=29500 \ +NPROC_PER_NODE=$nproc_per_node \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type full \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#20000' \ + 'AI-ModelScope/alpaca-gpt4-data-en#20000' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps $(expr 32 / $nproc_per_node / $nnodes) \ + --eval_steps 100 \ + 
--save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 8192 \ + --output_dir output \ + --system 'You are a helpful assistant.' \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --deepspeed zero2 diff --git a/ms-swift/examples/train/multi-node/torchrun/train_node1.sh b/ms-swift/examples/train/multi-node/torchrun/train_node1.sh new file mode 100644 index 0000000000000000000000000000000000000000..28a7b72fc255a59521120353701f10425e15abb3 --- /dev/null +++ b/ms-swift/examples/train/multi-node/torchrun/train_node1.sh @@ -0,0 +1,31 @@ +nnodes=2 +nproc_per_node=4 + +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +torchrun \ + --master_port 29500 \ + --nproc_per_node=$nproc_per_node \ + --nnodes=$nnodes \ + --node_rank=0 \ + --master_addr=127.0.0.1 \ + swift/cli/sft.py \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type full \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#20000' \ + 'AI-ModelScope/alpaca-gpt4-data-en#20000' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps $(expr 32 / $nproc_per_node / $nnodes) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 8192 \ + --output_dir output \ + --system 'You are a helpful assistant.' 
\ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --deepspeed zero2 diff --git a/ms-swift/examples/train/multimodal/audio.sh b/ms-swift/examples/train/multimodal/audio.sh new file mode 100644 index 0000000000000000000000000000000000000000..481a0c8fab2aebe9e97dc0ab8ea1cc0479aef27e --- /dev/null +++ b/ms-swift/examples/train/multimodal/audio.sh @@ -0,0 +1,23 @@ +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2-Audio-7B-Instruct \ + --dataset 'speech_asr/speech_asr_aishell1_trainsets:validation#20000' \ + --train_type lora \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --freeze_vit true \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 diff --git a/ms-swift/examples/train/multimodal/infer.sh b/ms-swift/examples/train/multimodal/infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..699ede32d7aef7668efa27079be04b6469fb90a7 --- /dev/null +++ b/ms-swift/examples/train/multimodal/infer.sh @@ -0,0 +1,8 @@ +# Perform inference using the validation set from the training phase. 
+CUDA_VISIBLE_DEVICES=0 \ +MAX_PIXELS=1003520 \ +swift infer \ + --adapters output/vx-xxx/checkpoint-xxx \ + --stream true \ + --load_data_args true \ + --max_new_tokens 2048 diff --git a/ms-swift/examples/train/multimodal/lora_llm_full_vit/custom_plugin.py b/ms-swift/examples/train/multimodal/lora_llm_full_vit/custom_plugin.py new file mode 100644 index 0000000000000000000000000000000000000000..55734e957df5f906830fe2a7174b989bae643efd --- /dev/null +++ b/ms-swift/examples/train/multimodal/lora_llm_full_vit/custom_plugin.py @@ -0,0 +1,100 @@ +import os +from typing import Optional + +import safetensors.torch +import torch +from transformers import Trainer + +from swift.llm import deep_getattr, get_model_arch, get_multimodal_target_regex +from swift.plugin import Tuner, extra_tuners, optimizers_map +from swift.tuners import LoraConfig, Swift +from swift.utils import get_logger + +logger = get_logger() + + +def is_vit_param(model_arch, parameter_name: str) -> bool: + for module_prefix in model_arch.vision_tower + model_arch.aligner: + if f'.{module_prefix}.' 
in parameter_name: + return True + return False + + +class CustomTuner(Tuner): + """Full-parameter training of ViT while LoRA training LLM""" + + @staticmethod + def from_pretrained(model: torch.nn.Module, model_id: str, **kwargs) -> torch.nn.Module: + model = Swift.from_pretrained(model, model_id, **kwargs) + state_dict = safetensors.torch.load_file(os.path.join(model_id, 'vit.safetensors')) + model.load_state_dict(state_dict, strict=False) + return model + + @staticmethod + def save_pretrained( + model: torch.nn.Module, + save_directory: str, + state_dict: Optional[dict] = None, + safe_serialization: bool = True, + **kwargs, + ) -> None: + if state_dict is None: + state_dict = {} + for n, p in model.named_parameters(): + if p.requires_grad: + state_dict[n] = p.detach().cpu() + model.save_pretrained(save_directory, state_dict=state_dict, safe_serialization=safe_serialization, **kwargs) + # vit + model_arch = get_model_arch(model.model_meta.model_arch) + state_dict = {k: v for k, v in state_dict.items() if is_vit_param(model_arch, k)} + safetensors.torch.save_file( + state_dict, os.path.join(save_directory, 'vit.safetensors'), metadata={'format': 'pt'}) + + @staticmethod + def prepare_model(args: 'TrainArguments', model: torch.nn.Module) -> torch.nn.Module: + model_arch = get_model_arch(model.model_meta.model_arch) + target_regex = get_multimodal_target_regex(model) + logger.info(f'target_regex: {target_regex}') + lora_config = LoraConfig( + task_type='CAUSAL_LM', r=args.lora_rank, lora_alpha=args.lora_alpha, target_modules=target_regex) + model = Swift.prepare_model(model, lora_config) + for module_prefix in model_arch.vision_tower + model_arch.aligner: + deep_getattr(model, module_prefix).requires_grad_(True) + return model + + +def create_custom_optimizer(args, model, dataset): + """ViT and LLM use different learning rates.""" + decay_parameters = set(Trainer.get_decay_parameter_names(None, model)) + model_arch = get_model_arch(model.model_meta.model_arch) + 
vit_parameters = [(n, p) for n, p in model.named_parameters() if is_vit_param(model_arch, n) and p.requires_grad] + llm_parameters = [(n, p) for n, p in model.named_parameters() + if not is_vit_param(model_arch, n) and p.requires_grad] + optimizer_grouped_parameters = [ + # vit & merger + { + 'params': [p for n, p in vit_parameters if n in decay_parameters], + 'weight_decay': args.weight_decay, + 'lr': 0.1 * args.learning_rate, # 1e-5 + }, + { + 'params': [p for n, p in vit_parameters if n not in decay_parameters], + 'weight_decay': 0.0, + 'lr': 0.1 * args.learning_rate, + }, + # llm + { + 'params': [p for n, p in llm_parameters if n in decay_parameters], + 'weight_decay': args.weight_decay, + }, + { + 'params': [p for n, p in llm_parameters if n not in decay_parameters], + 'weight_decay': 0.0, + }, + ] + optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(args, model) + return optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs), None + + +extra_tuners['custom'] = CustomTuner +optimizers_map['custom'] = create_custom_optimizer diff --git a/ms-swift/examples/train/multimodal/ocr.sh b/ms-swift/examples/train/multimodal/ocr.sh new file mode 100644 index 0000000000000000000000000000000000000000..a368341a2891bed83823026f56afd4c8bc865dfa --- /dev/null +++ b/ms-swift/examples/train/multimodal/ocr.sh @@ -0,0 +1,25 @@ +# 20GB +CUDA_VISIBLE_DEVICES=0 \ +MAX_PIXELS=1003520 \ +swift sft \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#20000' \ + --train_type lora \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --freeze_vit true \ + --gradient_accumulation_steps 16 \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ + 
--dataloader_num_workers 4 diff --git a/ms-swift/examples/train/multimodal/omni/infer.sh b/ms-swift/examples/train/multimodal/omni/infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..6affc476ef1b54ec5000e745996b79b9310374eb --- /dev/null +++ b/ms-swift/examples/train/multimodal/omni/infer.sh @@ -0,0 +1,10 @@ +CUDA_VISIBLE_DEVICES=0 \ +VIDEO_MAX_PIXELS=50176 \ +FPS_MAX_FRAMES=12 \ +MAX_PIXELS=1003520 \ +ENABLE_AUDIO_OUTPUT=0 \ +swift infer \ + --adapters output/vx-xxx/checkpoint-xxx \ + --stream true \ + --load_data_args true \ + --max_new_tokens 2048 diff --git a/ms-swift/examples/train/optimizer/muon.sh b/ms-swift/examples/train/optimizer/muon.sh new file mode 100644 index 0000000000000000000000000000000000000000..22a04ad56b8da94eb13d05a717d756ecff220371 --- /dev/null +++ b/ms-swift/examples/train/optimizer/muon.sh @@ -0,0 +1,31 @@ +# 17GB +# ref: https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py +# `moonshotai/Moonlight-16B-A3B-Instruct` does not support training; here we use `Qwen/Qwen2.5-7B-Instruct` as an example. +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition#500' \ + --optimizer muon \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps 16 \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful assistant.' 
\ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/packing/qwen2_5_omni.sh b/ms-swift/examples/train/packing/qwen2_5_omni.sh new file mode 100644 index 0000000000000000000000000000000000000000..19ec8b8e69430bff1a129a6d190d790a307f0248 --- /dev/null +++ b/ms-swift/examples/train/packing/qwen2_5_omni.sh @@ -0,0 +1,40 @@ +# 4 * 32GB +# Multimodal packing currently only supports qwen2_vl, qwen2_5_vl, qwen2_5_omni, internvl2_5/3 +# A demo for four modalities that can be run directly +# For local datasets, it is recommended to use streaming: `--streaming true` (save memory) +pip uninstall transformers +pip install git+https://github.com/huggingface/transformers + +NPROC_PER_NODE=4 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +VIDEO_MAX_PIXELS=50176 \ +FPS_MAX_FRAMES=12 \ +MAX_PIXELS=1003520 \ +swift sft \ + --model Qwen/Qwen2.5-Omni-7B \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#10000' \ + 'AI-ModelScope/LaTeX_OCR#2000' \ + 'speech_asr/speech_asr_aishell1_trainsets:validation#2000' \ + --train_type lora \ + --torch_dtype bfloat16 \ + --attn_impl flash_attn \ + --packing true \ + --num_train_epochs 3 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --freeze_vit true \ + --gradient_accumulation_steps 1 \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 4096 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --dataset_num_proc 8 \ + --deepspeed zero2 diff --git a/ms-swift/examples/train/packing/qwen2_5_vl.sh b/ms-swift/examples/train/packing/qwen2_5_vl.sh new file mode 100644 index 0000000000000000000000000000000000000000..4afdbf848951afba675d7786fef1a3e2d7db500b --- /dev/null +++ b/ms-swift/examples/train/packing/qwen2_5_vl.sh @@ -0,0 +1,32 @@ +# 4 * 36GB +# Multimodal packing 
currently only supports qwen2_vl, qwen2_5_vl, qwen2_5_omni, internvl2_5/3 +# Efficiency: With packing: 10 minutes; Without packing: >=1 hour +# For local datasets, it is recommended to use streaming: `--streaming true` (save memory) +NPROC_PER_NODE=4 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +swift sft \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --train_type lora \ + --dataset 'AI-ModelScope/LaTeX_OCR#20000' \ + --torch_dtype bfloat16 \ + --attn_impl flash_attn \ + --packing true \ + --num_train_epochs 3 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps 1 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 8192 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --dataset_num_proc 8 \ + --deepspeed zero2 diff --git a/ms-swift/examples/train/plugins/loss_scale.sh b/ms-swift/examples/train/plugins/loss_scale.sh new file mode 100644 index 0000000000000000000000000000000000000000..3722c497d8f6b7f71d4651dd798d1f0c6a4200c8 --- /dev/null +++ b/ms-swift/examples/train/plugins/loss_scale.sh @@ -0,0 +1,22 @@ +# loss_scale all to train all tokens +# use loss_type loss_scale +# This is just an example +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'swift/self-cognition#1000' \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author swift \ + --model_name swift-robot \ + --loss_scale all \ + --loss_type loss_scale diff --git a/ms-swift/examples/train/pretrain/train.sh b/ms-swift/examples/train/pretrain/train.sh new file mode 100644 index 
0000000000000000000000000000000000000000..3b2e22ae1370403eb51086f4e157b33372d3f830 --- /dev/null +++ b/ms-swift/examples/train/pretrain/train.sh @@ -0,0 +1,31 @@ +# If not using flash_attn, or transformers<4.44, +# or encountering an abnormally large loss (i.e., the model does not support packing), +# please remove `--packing true`. +nproc_per_node=4 + +NPROC_PER_NODE=$nproc_per_node \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +swift pt \ + --model Qwen/Qwen2.5-7B \ + --train_type full \ + --dataset swift/chinese-c4 \ + --torch_dtype bfloat16 \ + --streaming true \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps $(expr 64 / $nproc_per_node) \ + --packing true \ + --eval_steps 500 \ + --save_steps 500 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --deepspeed zero3 \ + --max_length 8192 \ + --max_steps 10000 \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --dataset_num_proc 8 \ + --save_only_model true \ + --output_dir output/Qwen2.5-7B \ + --attn_impl flash_attn diff --git a/ms-swift/examples/train/qlora/awq.sh b/ms-swift/examples/train/qlora/awq.sh new file mode 100644 index 0000000000000000000000000000000000000000..fff724d19568bdd6409bfe6392e525ebfa8d2904 --- /dev/null +++ b/ms-swift/examples/train/qlora/awq.sh @@ -0,0 +1,28 @@ +# 10GB +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct-AWQ \ + --train_type lora \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition#500' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps 16 \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful 
assistant.' \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/qlora/bnb.sh b/ms-swift/examples/train/qlora/bnb.sh new file mode 100644 index 0000000000000000000000000000000000000000..cf2fea09fb5938632490a23acc36f96668618249 --- /dev/null +++ b/ms-swift/examples/train/qlora/bnb.sh @@ -0,0 +1,34 @@ +# 10GB +# pip install bitsandbytes +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition#500' \ + --torch_dtype bfloat16 \ + --bnb_4bit_compute_dtype bfloat16 \ + --bnb_4bit_quant_type nf4 \ + --bnb_4bit_use_double_quant true \ + --quant_method bnb \ + --quant_bits 4 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps 16 \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful assistant.' 
\ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/qlora/hqq.sh b/ms-swift/examples/train/qlora/hqq.sh new file mode 100644 index 0000000000000000000000000000000000000000..aaec6d45273ee3a05ab004791417a1e7aa0194c8 --- /dev/null +++ b/ms-swift/examples/train/qlora/hqq.sh @@ -0,0 +1,31 @@ +# 10GB +# pip install hqq +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \ + 'AI-ModelScope/alpaca-gpt4-data-en#500' \ + 'swift/self-cognition#500' \ + --torch_dtype bfloat16 \ + --quant_method hqq \ + --quant_bits 4 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps 16 \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --system 'You are a helpful assistant.' \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/rft/rft.py b/ms-swift/examples/train/rft/rft.py new file mode 100644 index 0000000000000000000000000000000000000000..854a0be0bd64c4a6567300d03d03bc44eddaf4ca --- /dev/null +++ b/ms-swift/examples/train/rft/rft.py @@ -0,0 +1,224 @@ +import os +import shutil +import subprocess +import time +from typing import List + +from swift.utils import get_device_count + +# NOTE: this script supports at most 8 GPUs in a node; if using multiple nodes, please use custom logic.
+ +# Paste your conda env activation command here +# conda_prefix = 'source /root/miniconda3/etc/profile.d/conda.sh && conda activate py311 && ' +conda_prefix = '' + + +def do_sample(model: str, model_type: str, dataset: List[str], iter: int): + device_count = get_device_count() + handlers = [] + datasets = [] + # Sampling cache, to avoid lmdeploy & PRM running at the same time + # Why lmdeploy and not vllm? We found that the responses generated by lmdeploy are more similar than those of vllm. + for device in range(device_count): + sample_cmd = (f'{conda_prefix} USE_OPENCOMPASS_EVALUATOR=True CUDA_VISIBLE_DEVICES={device} swift sample ' + f'--model {model} --model_type {model_type} ' + f'--dataset {" ".join(dataset)} ' + f'--data_range {device} {device_count} ' + f'--max_length 2048 ' + f'--system "You are a math model, you should **think step by step** carefully, ' + f'and always consider the basic math principles to avoid making calculating mistakes.' + f'Give the final answer wrapped with \\boxed{{}}" ' + f'--load_args false ' + f'--sampler_engine vllm ' + f'--max_new_tokens 768 ' + f'--override_exist_file true ' + f'--num_sampling_per_gpu_batch_size 1 ' + f'--num_return_sequences 64 ' + f'--cache_files sample_output/iter_{iter}_proc_{device}_cache.jsonl ' + f'--output_file iter_{iter}_proc_{device}_cache.jsonl ' + f'--top_p 1.0 ' + f'--temperature 1.0 ') + print(f'Sampling caches of iter {iter}, part {device}.', flush=True) + env = os.environ.copy() + env['CUDA_VISIBLE_DEVICES'] = str(device) + handler = subprocess.Popen( + f'{sample_cmd}' + f' > logs/sample_iter_{iter}_proc_{device}_cache.log 2>&1', + env=env, + shell=True, + executable='/bin/bash') + handlers.append(handler) + + for proc, handler in enumerate(handlers): + handler.wait() + assert os.path.exists(os.path.join('sample_output', f'iter_{iter}_proc_{proc}_cache.jsonl')) + + handlers = [] + # Sample again, this time to filter with ORM & PRM + # Provide your PRM model or PRM name (add the PRM in plugin/prm.py first) + # You can
define your custom PRM logic in the plugin + # (e.g., split your steps, use the worst score/last score/avg score) + for device in range(device_count): + sample_cmd = ( + f'{conda_prefix} USE_OPENCOMPASS_EVALUATOR=True CUDA_VISIBLE_DEVICES={device} swift sample ' + f'--model {model} --model_type {model_type} ' # change to --resume_from_checkpoint to use the latest optimizer state # noqa + f'--dataset {" ".join(dataset)} ' + f'--data_range {device} {device_count} ' + f'--max_length 2048 ' + f'--system "You are a math model, you should **think step by step** carefully, ' + f'and always consider the basic math principles to avoid making calculating mistakes.' + f'Give the final answer wrapped with \\boxed{{}}" ' + f'--load_args false ' + f'--sampler_engine no ' + f'--orm_model math ' # math is defined in plugin/orm.py + f'--prm_model Qwen/Qwen2.5-Math-PRM-7B ' + f'--prm_threshold {min(0.7 + 0.1*iter, 0.9)} ' + f'--max_new_tokens 768 ' + f'--override_exist_file true ' # do not override the existing sample files + f'--num_sampling_per_gpu_batch_size 1 ' + f'--num_return_sequences 64 ' + f'--output_file iter_{iter}_proc_{device}_sampling.jsonl ' + f'--cache_files sample_output/iter_{iter}_proc_{device}_cache.jsonl ') + print(f'Sampling iter {iter}, part {device}.', flush=True) + env = os.environ.copy() + env['CUDA_VISIBLE_DEVICES'] = str(device) + handler = subprocess.Popen( + f'{sample_cmd}' + f' > logs/sample_iter_{iter}_proc_{device}.log 2>&1', + env=env, + shell=True, + executable='/bin/bash') + handlers.append(handler) + + for proc, handler in enumerate(handlers): + handler.wait() + assert os.path.exists(os.path.join('sample_output', f'iter_{iter}_proc_{proc}_sampling.jsonl')), ( + f'{os.path.join("sample_output", f"iter_{iter}_proc_{proc}_sampling.jsonl")} does not exist, ' + 'please check the sample logs for the detailed error.') + datasets.append(os.path.join('sample_output', f'iter_{iter}_proc_{proc}_sampling.jsonl')) + print(f'Sampling done,
files: {datasets}', flush=True) + return datasets + + +def do_train(model: str, model_type: str, datasets: List[str], iter, cmd='sft'): + gpu_prefix = '' + ds_config = '' + if get_device_count() > 1: + gpu_prefix = f'NPROC_PER_NODE={get_device_count()} ' + ds_config = '--deepspeed zero3 ' + extra_args = '' + if cmd == 'rlhf': + extra_args = '--rlhf_type dpo --beta 0.3 ' # use another reinforcement learning method supported by swift + ga = 128 // get_device_count() // 2 + train_cmd = (f'{conda_prefix} {gpu_prefix} swift {cmd} ' + f'--model {model} --model_type {model_type} ' + f'--dataset {" ".join(datasets)} ' + f'--max_length 2048 ' + f'--num_train_epochs 1 ' + f'--load_args false ' + f'--train_type full ' + f'{extra_args} ' + f'--eval_strategy no ' + f'--split_dataset_ratio 0 ' + f'--per_device_train_batch_size 2 ' + f'--gradient_accumulation_steps {ga} ' + f'--save_steps 1 ' + f'--save_strategy epoch ' + f'{ds_config} ' + f'--learning_rate 4e-6 ') + + print(f'Training iter {iter}.', flush=True) + handler = subprocess.Popen( + f'{train_cmd}' + f' > logs/train_iter_{iter}.log 2>&1', + shell=True, + env=os.environ.copy(), + executable='/bin/bash') + handler.wait() + ckpt = None + with open(f'logs/train_iter_{iter}.log', 'r') as f: + for line in f.readlines(): + if 'last_model_checkpoint: ' in line: + ckpt = line.split('last_model_checkpoint: ')[1] + break + assert ckpt is not None + print(f'Training done, ckpt: {ckpt.strip()}.', flush=True) + return ckpt.strip() + + +def do_eval(model, model_type: str, iter): + eval_cmd = ( + f'{conda_prefix} swift eval ' + '--eval_dataset competition_math ' # change this to evaluate another dataset + '--infer_backend vllm --eval_limit 500 ' + f'--model {model} --model_type {model_type} ' + '--system "You are a math model, you should **think step by step** carefully, ' + 'and always consider the basic math principles to avoid making calculating mistakes.
' + 'Give the final answer wrapped with \\boxed{}"') + print('Evaluating.', flush=True) + # Replace the original dataset with math.json; this is for testing, comment it out if not needed + replace_math_dataset() + + if iter is None: + iter = 'origin' + env = os.environ.copy() + env['CUDA_VISIBLE_DEVICES'] = '0' + handler = subprocess.Popen( + f'{eval_cmd}' + f' > logs/eval_iter_{iter}.log 2>&1', shell=True, env=env, executable='/bin/bash') + handler.wait() + + acc = None + # | math | 393424 | accuracy | gen | 39.00 | + with open(f'logs/eval_iter_{iter}.log', 'r') as f: + for line in f.readlines(): + if 'Level 5' in line and 'AveragePass@1' in line: + parts = [p for p in line.split('|') if p.strip()] + acc = float(parts[-2]) + break + + print(f'Iter {iter} eval done with acc: {acc}.', flush=True) + return acc + + +def replace_math_dataset(): + # Note: This may fail because it is specific to the math test, + # and one must run swift eval --eval_dataset math first to make sure opencompass has created + # the folder. + # You can also use the original math dataset; just comment out this call. + user_dir = os.path.expanduser('~') + if os.path.exists(os.path.join(user_dir, '.cache', 'opencompass', 'data', 'math', 'math.json')): + os.remove(os.path.join(user_dir, '.cache', 'opencompass', 'data', 'math', 'math.json')) + shutil.copy( + os.path.join('examples', 'train', 'rft', 'math.json'), + os.path.join(user_dir, '.cache', 'opencompass', 'data', 'math', 'math.json')) + + +def main(): + os.makedirs('logs', exist_ok=True) + max_acc = 0.
+ first_model = 'Qwen/Qwen2.5-Math-7B-Instruct' + model_type = 'qwen2_5_math' + + if False: + # eval the original model + do_eval(first_model, model_type, None) + + model = first_model + for i in range(5): + ts = time.time() + datasets = do_sample(model, model_type, ['tastelikefeet/competition_math'], i) + # add a custom data filter here, for example: length or diversity control + print(f'do sample cost: {(time.time()-ts) / 60:.1f} minutes.', flush=True) + ts = time.time() + # if you want to train with the original dataset together with datasets, add the original dataset here + # if you want to train the original model every time, change this to first_model + ckpt = do_train(model, model_type, datasets, i) + print(f'do train cost: {(time.time() - ts) / 60:.1f} minutes.', flush=True) + ts = time.time() + acc = do_eval(ckpt, model_type, i) + print(f'do eval cost: {(time.time() - ts) / 60:.1f} minutes.', flush=True) + if acc > max_acc: + max_acc = acc + model = ckpt + print(f'acc: {acc}, upgrading model to: {model}', flush=True) + + +if __name__ == '__main__': + main() diff --git a/ms-swift/examples/train/rlhf/cpo.sh b/ms-swift/examples/train/rlhf/cpo.sh new file mode 100644 index 0000000000000000000000000000000000000000..35d982d921b3cf600e56139c66ff2b15a1b1c0d2 --- /dev/null +++ b/ms-swift/examples/train/rlhf/cpo.sh @@ -0,0 +1,28 @@ +nproc_per_node=2 + +CUDA_VISIBLE_DEVICES=0,1 \ +NPROC_PER_NODE=$nproc_per_node \ +swift rlhf \ + --rlhf_type cpo \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ +
--dataloader_num_workers 4 \ + --deepspeed zero2 \ + --dataset_num_proc 4 diff --git a/ms-swift/examples/train/rlhf/dpo/full.sh b/ms-swift/examples/train/rlhf/dpo/full.sh new file mode 100644 index 0000000000000000000000000000000000000000..ebb3f74f566f468c559ff5abbb8b2ab1f8685a3c --- /dev/null +++ b/ms-swift/examples/train/rlhf/dpo/full.sh @@ -0,0 +1,26 @@ +# 4 * 50GiB +NPROC_PER_NODE=4 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +swift rlhf \ + --rlhf_type dpo \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type full \ + --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps 4 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 8192 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --save_only_model true \ + --dataloader_num_workers 4 \ + --dataset_num_proc 4 \ + --deepspeed zero3 \ + --attn_impl flash_attn diff --git a/ms-swift/examples/train/rlhf/kto.sh b/ms-swift/examples/train/rlhf/kto.sh new file mode 100644 index 0000000000000000000000000000000000000000..f86e4f33658fdf94b284848fed50c6176feb780d --- /dev/null +++ b/ms-swift/examples/train/rlhf/kto.sh @@ -0,0 +1,27 @@ +nproc_per_node=2 + +CUDA_VISIBLE_DEVICES=0,1 \ +NPROC_PER_NODE=$nproc_per_node \ +swift rlhf \ + --rlhf_type kto \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'AI-ModelScope/ultrafeedback-binarized-preferences-cleaned-kto#10000' \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ + 
--dataloader_num_workers 4 \ + --deepspeed zero2 \ + --dataset_num_proc 4 diff --git a/ms-swift/examples/train/rlhf/orpo.sh b/ms-swift/examples/train/rlhf/orpo.sh new file mode 100644 index 0000000000000000000000000000000000000000..e13c153012d2e8e40ea9b0567310befeef9f2003 --- /dev/null +++ b/ms-swift/examples/train/rlhf/orpo.sh @@ -0,0 +1,28 @@ +nproc_per_node=2 + +CUDA_VISIBLE_DEVICES=0,1 \ +NPROC_PER_NODE=$nproc_per_node \ +swift rlhf \ + --rlhf_type orpo \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --deepspeed zero2 \ + --dataset_num_proc 4 diff --git a/ms-swift/examples/train/rlhf/rm.sh b/ms-swift/examples/train/rlhf/rm.sh new file mode 100644 index 0000000000000000000000000000000000000000..14b6fad6b9354010a2cd1542f3496e761f597764 --- /dev/null +++ b/ms-swift/examples/train/rlhf/rm.sh @@ -0,0 +1,28 @@ +nproc_per_node=2 + +CUDA_VISIBLE_DEVICES=0,1 \ +NPROC_PER_NODE=$nproc_per_node \ +swift rlhf \ + --rlhf_type rm \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output 
\ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --deepspeed zero2 \ + --dataset_num_proc 4 diff --git a/ms-swift/examples/train/seq_cls/bert/sft.sh b/ms-swift/examples/train/seq_cls/bert/sft.sh new file mode 100644 index 0000000000000000000000000000000000000000..538e74337b2c3b3ddca7671b9fd075ffae3c9444 --- /dev/null +++ b/ms-swift/examples/train/seq_cls/bert/sft.sh @@ -0,0 +1,28 @@ +# If `num_labels` is provided, it will be considered a classification task, +# and AutoModelForSequenceClassification will be used to load the model. +# The BERT model does not require templates, so it can usually be used without registration. +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model AI-ModelScope/bert-base-chinese \ + --train_type lora \ + --dataset 'DAMO_NLP/jd:cls#2000' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps 16 \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 512 \ + --truncation_strategy right \ + --output_dir output \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --num_labels 2 \ + --task_type seq_cls diff --git a/ms-swift/examples/train/seq_cls/multi_label/sft.sh b/ms-swift/examples/train/seq_cls/multi_label/sft.sh new file mode 100644 index 0000000000000000000000000000000000000000..42d051651d2edec32db15e278a296ab0c9d02533 --- /dev/null +++ b/ms-swift/examples/train/seq_cls/multi_label/sft.sh @@ -0,0 +1,28 @@ +# Custom dataset format reference: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-0.5B \ + --train_type lora \ + --dataset '' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 16 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + 
--lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps 1 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --dataset_num_proc 4 \ + --num_labels '' \ + --task_type seq_cls \ + --use_chat_template false \ + --problem_type multi_label_classification diff --git a/ms-swift/examples/train/seq_cls/qwen2_5/deploy.sh b/ms-swift/examples/train/seq_cls/qwen2_5/deploy.sh new file mode 100644 index 0000000000000000000000000000000000000000..71627c0ff8af89400cccec904e65dce98fa392dc --- /dev/null +++ b/ms-swift/examples/train/seq_cls/qwen2_5/deploy.sh @@ -0,0 +1,8 @@ +CUDA_VISIBLE_DEVICES=0 \ +swift deploy \ + --adapters output/vx-xxx/checkpoint-xxx + +# curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ +# "model": "Qwen2.5-0.5B", +# "messages": [{"role": "user", "content": "包装差,容易被调包。"}] +# }' diff --git a/ms-swift/examples/train/seq_cls/qwen2_5/infer.sh b/ms-swift/examples/train/seq_cls/qwen2_5/infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..43aa93bcc7b4560482a2a25d364bf07f7c920eb3 --- /dev/null +++ b/ms-swift/examples/train/seq_cls/qwen2_5/infer.sh @@ -0,0 +1,5 @@ +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --adapters output/vx-xxx/checkpoint-xxx \ + --load_data_args true \ + --max_batch_size 16 diff --git a/ms-swift/examples/train/seq_cls/regression/infer.sh b/ms-swift/examples/train/seq_cls/regression/infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..43aa93bcc7b4560482a2a25d364bf07f7c920eb3 --- /dev/null +++ b/ms-swift/examples/train/seq_cls/regression/infer.sh @@ -0,0 +1,5 @@ +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --adapters output/vx-xxx/checkpoint-xxx \ + --load_data_args true \ + --max_batch_size 16 diff --git a/ms-swift/examples/train/streaming/train.sh b/ms-swift/examples/train/streaming/train.sh new 
file mode 100644 index 0000000000000000000000000000000000000000..b864a48f281eb3f11f13a06d50cdf9f4a23d6f7b --- /dev/null +++ b/ms-swift/examples/train/streaming/train.sh @@ -0,0 +1,17 @@ +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'swift/self-cognition#1000' \ + --streaming true \ + --max_steps 1000 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/think_model/qwen3_demo1.sh b/ms-swift/examples/train/think_model/qwen3_demo1.sh new file mode 100644 index 0000000000000000000000000000000000000000..4eb70b77afea33afa070be13e57b5065eac87d91 --- /dev/null +++ b/ms-swift/examples/train/think_model/qwen3_demo1.sh @@ -0,0 +1,30 @@ +# use `--loss_scale ignore_empty_think` +# Avoid losing the think capability by ignoring the loss on the empty `<think>\n\n</think>\n\n` +# This method is also applicable to the Deepseek-R1 series of models.
+CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen3-8B \ + --train_type lora \ + --dataset 'swift/Qwen3-SFT-Mixin#2000' \ + 'swift/self-cognition:empty_think#600' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps 16 \ + --eval_steps 50 \ + --save_steps 50 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --use_liger_kernel true \ + --loss_scale ignore_empty_think \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/tuners/adalora/train.sh b/ms-swift/examples/train/tuners/adalora/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..d22860d1e68c8dd3bb58f4d81e674b06b7897c56 --- /dev/null +++ b/ms-swift/examples/train/tuners/adalora/train.sh @@ -0,0 +1,16 @@ +# 17GiB +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type adalora \ + --dataset 'swift/self-cognition#1000' \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --learning_rate 1e-4 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/tuners/galore/train_qgalore.sh b/ms-swift/examples/train/tuners/galore/train_qgalore.sh new file mode 100644 index 0000000000000000000000000000000000000000..cdebbe04471532c79c49617e58910efb8b893856 --- /dev/null +++ b/ms-swift/examples/train/tuners/galore/train_qgalore.sh @@ -0,0 +1,20 @@ +# 35GiB +# pip install bitsandbytes==0.40.0 +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type full \ + --torch_dtype bfloat16 \ + --dataset 'lvjianjin/AdvertiseGen#1000' \ + 
--num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author swift \ + --model_name swift-robot \ + --use_galore true \ + --galore_quantization true diff --git a/ms-swift/examples/train/tuners/lora-ga/train.sh b/ms-swift/examples/train/tuners/lora-ga/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..fbfe76cc6cc8b5e4e19909651e8dc2770ee06e08 --- /dev/null +++ b/ms-swift/examples/train/tuners/lora-ga/train.sh @@ -0,0 +1,33 @@ +# Train +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2-1.5B-Instruct \ + --train_type lora \ + --dataset 'swift/self-cognition#1000' \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --init_weights lora-ga \ + --lora_ga_batch_size 2 \ + --lora_ga_iters 2 \ + --lora_ga_max_length 1024 \ + --lora_ga_direction ArB2r \ + --lora_ga_scale stable \ + --lora_ga_stable_gamma 16 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author swift \ + --model_name swift-robot + +# Infer +# swift infer \ +# --model Qwen/Qwen2-1.5B-Instruct \ +# --ckpt_dir ./output/Qwen2-1.5B-Instruct/v0-20241214-191235/checkpoint-62/converted/default \ +# --infer_backend pt \ +# --stream true \ +# --max_new_tokens 2048 diff --git a/ms-swift/examples/train/tuners/lora/train.sh b/ms-swift/examples/train/tuners/lora/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..e8c231c671689cb238e0c295d940a94c97d1817a --- /dev/null +++ b/ms-swift/examples/train/tuners/lora/train.sh @@ -0,0 +1,18 @@ +# 17.2GiB +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'swift/self-cognition#1000' \ + --num_train_epochs 1 \ + --per_device_train_batch_size 
1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/tuners/neftune/train.sh b/ms-swift/examples/train/tuners/neftune/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..bf53a4d90522ec044445d6dc5e054b3f07eff00c --- /dev/null +++ b/ms-swift/examples/train/tuners/neftune/train.sh @@ -0,0 +1,19 @@ +# 17GiB +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --train_type lora \ + --dataset 'swift/self-cognition#1000' \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --neftune_noise_alpha 15 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/examples/train/tuners/unsloth/train.sh b/ms-swift/examples/train/tuners/unsloth/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..8291148966b69c42e272a34d57af30fa4d4fd36b --- /dev/null +++ b/ms-swift/examples/train/tuners/unsloth/train.sh @@ -0,0 +1,19 @@ +# 17GiB +CUDA_VISIBLE_DEVICES=0 \ +swift sft \ + --model Qwen/Qwen2.5-7B-Instruct \ + --tuner_backend unsloth \ + --train_type lora \ + --dataset 'swift/self-cognition#1000' \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --learning_rate 1e-4 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --gradient_accumulation_steps 16 \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 2 \ + --logging_steps 5 \ + --model_author swift \ + --model_name swift-robot diff --git a/ms-swift/ms_swift.egg-info/dependency_links.txt b/ms-swift/ms_swift.egg-info/dependency_links.txt new file mode 100644 index 
0000000000000000000000000000000000000000..8b137891791fe96927ad78e64b0aad7bded08bdc --- /dev/null +++ b/ms-swift/ms_swift.egg-info/dependency_links.txt @@ -0,0 +1 @@ + diff --git a/ms-swift/ms_swift.egg-info/requires.txt b/ms-swift/ms_swift.egg-info/requires.txt new file mode 100644 index 0000000000000000000000000000000000000000..fea395e055d71e98fdc4b864ae9743957c8412e0 --- /dev/null +++ b/ms-swift/ms_swift.egg-info/requires.txt @@ -0,0 +1,92 @@ +accelerate +addict +aiohttp +attrdict +binpacking +charset_normalizer +cpm_kernels +dacite +datasets<3.4,>=3.0 +einops +fastapi +gradio>=3.40.0 +importlib_metadata +jieba +matplotlib +modelscope>=1.23 +nltk +numpy<2.0 +openai +oss2 +pandas +peft<0.16,>=0.11 +pillow +requests +rouge +safetensors +scipy +sentencepiece +simplejson>=3.3.0 +sortedcontainers>=1.5.9 +tensorboard +tiktoken +tqdm +transformers<4.53,>=4.33 +transformers_stream_generator +trl<0.18,>=0.13 +uvicorn +zstandard + +[all] +accelerate +addict +aiohttp +attrdict +binpacking +charset_normalizer +cpm_kernels +dacite +datasets<3.4,>=3.0 +einops +fastapi +gradio>=3.40.0 +importlib_metadata +jieba +matplotlib +modelscope>=1.23 +nltk +numpy<2.0 +openai +oss2 +pandas +peft<0.16,>=0.11 +pillow +requests +rouge +safetensors +scipy +sentencepiece +simplejson>=3.3.0 +sortedcontainers>=1.5.9 +tensorboard +tiktoken +tqdm +transformers<4.53,>=4.33 +transformers_stream_generator +trl<0.18,>=0.13 +uvicorn +zstandard +evalscope[opencompass] +evalscope[vlmeval] +xtuner +swanlab + +[eval] +evalscope[opencompass] +evalscope[vlmeval] + +[seq_parallel] +xtuner + +[swanlab] +swanlab diff --git a/ms-swift/requirements/docs.txt b/ms-swift/requirements/docs.txt new file mode 100644 index 0000000000000000000000000000000000000000..6a6b4df5aa57c97a803519a3b9a0642f709c2683 --- /dev/null +++ b/ms-swift/requirements/docs.txt @@ -0,0 +1,8 @@ +docutils>=0.16.0 +myst_parser +recommonmark +sphinx>=5.3.0 +sphinx-book-theme +sphinx-copybutton +sphinx-rtd-theme +sphinx_markdown_tables diff --git 
a/ms-swift/requirements/eval.txt b/ms-swift/requirements/eval.txt new file mode 100644 index 0000000000000000000000000000000000000000..493b9d7fb50b19b1ed24e3c40ce2d03db33f3550 --- /dev/null +++ b/ms-swift/requirements/eval.txt @@ -0,0 +1,2 @@ +evalscope[opencompass] +evalscope[vlmeval] diff --git a/ms-swift/scripts/utils/run_model_info.py b/ms-swift/scripts/utils/run_model_info.py new file mode 100644 index 0000000000000000000000000000000000000000..b11c27866b9ff825025107bc0cb0ff0ba7a955be --- /dev/null +++ b/ms-swift/scripts/utils/run_model_info.py @@ -0,0 +1,101 @@ +from typing import Any, List + +from swift.llm import MODEL_MAPPING, TEMPLATE_MAPPING, ModelType, TemplateType +from swift.utils import is_megatron_available + + +def get_url_suffix(model_id): + if ':' in model_id: + return model_id.split(':')[0] + return model_id + + +def get_cache_mapping(fpath): + with open(fpath, 'r', encoding='utf-8') as f: + text = f.read() + idx = text.find('| Model ID |') + text = text[idx:] + text_list = text.split('\n')[2:] + cache_mapping = {} + for text in text_list: + if not text: + continue + items = text.split('|') + if len(items) < 6: + break + cache_mapping[items[1]] = items[5] + return cache_mapping + + +def get_model_info_table(): + fpaths = ['docs/source/Instruction/支持的模型和数据集.md', 'docs/source_en/Instruction/Supported-models-and-datasets.md'] + cache_mapping = get_cache_mapping(fpaths[0]) + end_words = [['### 多模态大模型', '## 数据集'], ['### Multimodal large models', '## Datasets']] + result = [ + '| Model ID | Model Type | Default Template | ' + 'Requires | Support Megatron | Tags | HF Model ID |\n' + '| -------- | -----------| ---------------- | ' + '-------- | ---------------- | ---- | ----------- |\n' + ] * 2 + res_llm: List[Any] = [] + res_mllm: List[Any] = [] + mg_count = 0 + for template in TemplateType.get_template_name_list(): + assert template in TEMPLATE_MAPPING + + for model_type in ModelType.get_model_name_list(): + model_meta = MODEL_MAPPING[model_type] + 
template = model_meta.template + for group in model_meta.model_groups: + for model in group.models: + ms_model_id = model.ms_model_id + hf_model_id = model.hf_model_id + if ms_model_id: + ms_model_id = f'[{ms_model_id}](https://modelscope.cn/models/{get_url_suffix(ms_model_id)})' + else: + ms_model_id = '-' + if hf_model_id: + hf_model_id = f'[{hf_model_id}](https://huggingface.co/{get_url_suffix(hf_model_id)})' + else: + hf_model_id = '-' + tags = ', '.join(group.tags or model_meta.tags) or '-' + requires = ', '.join(group.requires or model_meta.requires) or '-' + if is_megatron_available(): + from swift.megatron import model as _megatron_model  # imported for registration side effects; aliased to avoid shadowing the loop variable + support_megatron = getattr(model_meta, 'support_megatron', False) + for word in ['gptq', 'awq', 'bnb', 'aqlm', 'int', 'nf4', 'fp8']: + if word in ms_model_id.lower(): + support_megatron = False + break + support_megatron = '✔' if support_megatron else '✘' + else: + support_megatron = cache_mapping.get(ms_model_id, '✘') + if support_megatron == '✔': + mg_count += 1 + r = f'|{ms_model_id}|{model_type}|{template}|{requires}|{support_megatron}|{tags}|{hf_model_id}|\n' + if model_meta.is_multimodal: + res_mllm.append(r) + else: + res_llm.append(r) + print(f'Total LLMs: {len(res_llm)}, total MLLMs: {len(res_mllm)}, Megatron-supported models: {mg_count}') + text = ['', ''] # llm, mllm + for i, res in enumerate([res_llm, res_mllm]): + for r in res: + text[i] += r + result[i] += text[i] + + for i, fpath in enumerate(fpaths): + with open(fpath, 'r', encoding='utf-8') as f: + text = f.read() + llm_start_idx = text.find('| Model ID |') + mllm_start_idx = text[llm_start_idx + 1:].find('| Model ID |') + llm_start_idx + 1 + llm_end_idx = text.find(end_words[i][0]) + mllm_end_idx = text.find(end_words[i][1]) + output = text[:llm_start_idx] + result[0] + '\n\n' + text[llm_end_idx:mllm_start_idx] + result[ + 1] + '\n\n' + text[mllm_end_idx:] + with open(fpath, 'w', encoding='utf-8') as f: + f.write(output) + + +if __name__ == '__main__': + get_model_info_table()