Upload folder using huggingface_hub
- .claude/settings.local.json +11 -0
- README.md +366 -0
- chat_template.jinja +1 -0
- config.json +76 -0
- generation_config.json +6 -0
- help.txt +304 -0
- model-00001-of-00015.safetensors +3 -0
- model-00002-of-00015.safetensors +3 -0
- model-00003-of-00015.safetensors +3 -0
- model-00004-of-00015.safetensors +3 -0
- model-00005-of-00015.safetensors +3 -0
- model-00006-of-00015.safetensors +3 -0
- model-00007-of-00015.safetensors +3 -0
- model-00008-of-00015.safetensors +3 -0
- model-00009-of-00015.safetensors +3 -0
- model-00010-of-00015.safetensors +3 -0
- model-00011-of-00015.safetensors +3 -0
- model-00012-of-00015.safetensors +3 -0
- model-00013-of-00015.safetensors +3 -0
- model-00014-of-00015.safetensors +3 -0
- model-00015-of-00015.safetensors +3 -0
- model.safetensors.index.json +0 -0
- recipe.yaml +6 -0
- sglang.sh +19 -0
- special_tokens_map.json +23 -0
- tokenizer.json +0 -0
- tokenizer_config.json +0 -0
.claude/settings.local.json
ADDED
@@ -0,0 +1,11 @@
{
  "permissions": {
    "allow": [
      "WebSearch",
      "Bash(gh issue view:*)",
      "Bash(python3 -c \" import json, sys data = json.load\\(sys.stdin\\) print\\(''TITLE:'', data[''title'']\\) print\\(''BODY:'', data[''body''][:3000]\\) print\\(\\) for c in data.get\\(''comments'', []\\)[:10]: print\\(''---COMMENT---''\\) print\\(c[''body''][:2000]\\) print\\(\\) \")",
      "WebFetch(domain:docs.sglang.io)",
      "Bash(python3 -c \" import json, sys data = json.load\\(sys.stdin\\) print\\(''TITLE:'', data[''title'']\\) print\\(''BODY:'', data[''body''][:3000]\\) print\\(\\) for c in data.get\\(''comments'', []\\)[:8]: print\\(''---COMMENT---''\\) print\\(c[''body''][:1500]\\) print\\(\\) \")"
    ]
  }
}
README.md
ADDED
@@ -0,0 +1,366 @@
---
base_model:
- mistralai/Mistral-Large-Instruct-2411
base_model_relation: quantized
library_name: vllm
language:
- en
- fr
- de
- es
- it
- pt
- zh
- ja
- ru
- ko
license: other
license_name: mrl
inference: false
license_link: https://mistral.ai/licenses/MRL-0.1.md
tags:
- quantization
- nvfp4
- fp4
- mistral-common
---

# Model Card for Mistral-Large-Instruct-2411

Mistral-Large-Instruct-2411 is an advanced dense Large Language Model (LLM) of 123B parameters with state-of-the-art reasoning, knowledge and coding capabilities, extending [Mistral-Large-Instruct-2407](https://huggingface.co/mistralai/Mistral-Large-Instruct-2407) with better long-context handling, function calling and system prompt support.

## Key features
- **Multi-lingual by design:** Dozens of languages supported, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch and Polish.
- **Proficient in coding:** Trained on 80+ coding languages such as Python, Java, C, C++, JavaScript, and Bash. Also trained on more specific languages such as Swift and Fortran.
- **Agent-centric:** Best-in-class agentic capabilities with native function calling and JSON output.
- **Advanced Reasoning:** State-of-the-art mathematical and reasoning capabilities.
- **Mistral Research License:** Allows usage and modification for non-commercial purposes.
- **Large Context:** A large 128k context window.
- **Robust Context Adherence:** Ensures strong adherence for RAG and large-context applications.
- **System Prompt:** Maintains strong adherence and support for more reliable system prompts.

### System Prompt
We appreciate the feedback received from our community regarding our system prompt handling.
In response, we have implemented stronger support for system prompts.
To achieve optimal results, we recommend always including a system prompt that clearly outlines the bot's purpose, even if it is minimal.

### Basic Instruct Template (V7)

```
<s>[SYSTEM_PROMPT] <system prompt>[/SYSTEM_PROMPT][INST] <user message>[/INST] <assistant response></s>[INST] <user message>[/INST]
```

**Be careful with subtle missing or trailing white spaces!**

*Please make sure to use [mistral-common](https://github.com/mistralai/mistral-common) as the source of truth.*
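
Because the template is whitespace-sensitive, the safest way to build prompts is to let mistral-common render them rather than formatting strings by hand. A minimal sketch (assuming `mistral_common >= 1.5.0`; `MistralTokenizer.v7()` is taken here to be the constructor for this template version):

```py
# Minimal sketch: render the V7 instruct template with mistral-common.
# Assumes mistral_common >= 1.5.0; adjust if the API differs in your version.
from mistral_common.protocol.instruct.messages import SystemMessage, UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.v7()

tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            SystemMessage(content="You are a helpful assistant."),
            UserMessage(content="Hello!"),
        ]
    )
)

print(tokenized.text)          # the exact prompt string, whitespace included
print(len(tokenized.tokens))   # token ids ready to feed to the model
```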

## Usage

The model can be used with the following frameworks:

- [`vllm`](https://github.com/vllm-project/vllm): See [here](#vllm)

### vLLM

We recommend using this model with the [vLLM library](https://github.com/vllm-project/vllm)
to implement production-ready inference pipelines.

**_Installation_**

Make sure you install [`vLLM >= v0.6.4.post1`](https://github.com/vllm-project/vllm/releases/tag/v0.6.4.post1):

```
pip install --upgrade vllm
```

Also make sure you have [`mistral_common >= 1.5.0`](https://github.com/mistralai/mistral-common/releases/tag/v1.5.0) installed:

```
pip install --upgrade mistral_common
```

You can also make use of a ready-to-go [docker image](https://github.com/vllm-project/vllm/blob/main/Dockerfile) or one from [Docker Hub](https://hub.docker.com/layers/vllm/vllm-openai/latest/images/sha256-55a88146a4da0b6e193431b5b1d3492dfd7bebdc16919df4d031273e85a6157c?context=explore).

### Server

We recommend that you use Mistral-Large-Instruct-2411 in a server/client setting.

1. Spin up a server:

```
vllm serve mistralai/Mistral-Large-Instruct-2411 --tokenizer_mode mistral --config_format mistral --load_format mistral --tensor_parallel_size 8
```

**Note:** Running Mistral-Large-Instruct-2411 on GPU requires over 300 GB of GPU RAM.

2. To query the server, you can use a simple Python snippet.

```py
import requests
import json
from huggingface_hub import hf_hub_download
from datetime import datetime, timedelta

url = "http://<your-server>:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

model = "mistralai/Mistral-Large-Instruct-2411"


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")


messages = [
    {"role": "system", "content": SYSTEM_PROMPT + "\n\nThink step by step. You're a math genius."},
    {
        "role": "user",
        "content": "Think of four random numbers. Then add, subtract or multiply them so that the solution is 10. If it's not possible, say it.",
    },
]

data = {"model": model, "messages": messages}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["content"])
# Sure, let's start by thinking of four random numbers. For example, let's take 3, 5, 2, and 1.
#
# Now, we need to find a combination of addition, subtraction, or multiplication that results in 10.

# Let's try:

# \[ 3 + 5 + 2 - 1 = 9 \]

# This doesn't work. Let's try another combination:

# \[ 3 \times 2 + 5 - 1 = 6 + 5 - 1 = 10 \]

# This works! So, with the numbers 3, 5, 2, and 1, we can achieve the result 10 by performing the operations \( 3 \times 2 + 5 - 1 \).
```

### Offline

```py
from vllm import LLM
from vllm.sampling_params import SamplingParams
from huggingface_hub import hf_hub_download
from datetime import datetime, timedelta

model_name = "mistralai/Mistral-Large-Instruct-2411"

def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, 'r') as file:
        system_prompt = file.read()
    today = datetime.today().strftime('%Y-%m-%d')
    yesterday = (datetime.today() - timedelta(days=1)).strftime('%Y-%m-%d')
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model_name, "SYSTEM_PROMPT.txt") + "\n\nThink step by step. You're a math genius."

user_prompt = "Without browsing the web, how many days ago was Mistral founded?"

messages = [
    {
        "role": "system",
        "content": SYSTEM_PROMPT
    },
    {
        "role": "user",
        "content": user_prompt
    },
]

# note that running this model on GPU requires over 300 GB of GPU RAM
llm = LLM(model=model_name, tokenizer_mode="mistral", tensor_parallel_size=8)

sampling_params = SamplingParams(max_tokens=512)

outputs = llm.chat(messages, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
# I don't have real-time web browsing capabilities or access to current data, but I can help you calculate the number of days based on the information I have.
#
# Mistral AI was founded in April 2023. To determine how many days ago that was from today's date, November 18, 2024, we need to calculate the total number of days between April 2023 and November 2024.
#
# Here's the step-by-step calculation:
#
# 1. **Days from April 2023 to December 2023:**
#    - April 2023: 30 days (April has 30 days)
#    - May 2023: 31 days
#    - June 2023: 30 days
#    - July 2023: 31 days
#    - August 2023: 31 days
#    - September 2023: 30 days
#    - October 2023: 31 days
#    - November 2023: 30 days
#    - December 2023: 31 days
#
#    Total days in 2023 from April to December = 30 + 31 + 30 + 31 + 31 + 30 + 31 + 30 + 31 = 275 days
#
# 2. **Days from January 2024 to November 18, 2024:**
#    - January 2024: 31 days
#    - February 2024: 29 days (2024 is a leap year)
#    - March 2024: 31 days
#    - April 2024: 30 days
#    - May 2024: 31 days
#    - June 2024: 30 days
#    - July 2024: 31 days
#    - August 2024: 31 days
#    - September 2024: 30 days
#    - October 2024: 31 days
#    - November 2024 (up to the 18th): 18 days
#
#    Total days in 2024 from January to November 18 = 31 + 29 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31 + 18 = 323 days
#
# 3. **Total days from April 2023 to November 18, 2024:**
#    Total days = 275 days (2023) + 323 days (2024) = 598 days
#
# Therefore, Mistral AI was founded 598 days ago from today's date, November 18, 2024.
```

### Improved Function Calling

Mistral-Large-2411 has much improved function calling capabilities that are fully supported
using [`mistral_common >= 1.5.0`](https://github.com/mistralai/mistral-common/releases/tag/v1.5.0) and [`vLLM >= v0.6.4.post1`](https://github.com/vllm-project/vllm/releases/tag/v0.6.4.post1).

Make sure to serve the model with the following flags in vLLM:

```
vllm serve mistralai/Mistral-Large-Instruct-2411 --tokenizer_mode mistral --config_format mistral --load_format mistral --tensor-parallel-size 8 --tool-call-parser mistral --enable-auto-tool-choice
```

<details>
<summary>Example</summary>

```py
import requests
import json
from huggingface_hub import hf_hub_download
from datetime import datetime, timedelta

url = "http://<your-server>:8000/v1/chat/completions"
headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}

model = "mistralai/Mistral-Large-Instruct-2411"


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")


tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city to find the weather for, e.g. 'San Francisco'",
                    },
                    "state": {
                        "type": "string",
                        "description": "The state abbreviation, e.g. 'CA' for California",
                    },
                    "unit": {
                        "type": "string",
                        "description": "The unit for temperature",
                        "enum": ["celsius", "fahrenheit"],
                    },
                },
                "required": ["city", "state", "unit"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Could you please make the below article more concise?\n\nOpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership.",
    },
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": "bbc5b7ede",
                "type": "function",
                "function": {
                    "name": "rewrite",
                    "arguments": '{"text": "OpenAI is an artificial intelligence research laboratory consisting of the non-profit OpenAI Incorporated and its for-profit subsidiary corporation OpenAI Limited Partnership."}',
                },
            }
        ],
    },
    {
        "role": "tool",
        "content": '{"action":"rewrite","outcome":"OpenAI is a FOR-profit company."}',
        "tool_call_id": "bbc5b7ede",
        "name": "rewrite",
    },
    {
        "role": "assistant",
        "content": "---\n\nOpenAI is a FOR-profit company.",
    },
    {
        "role": "user",
        "content": "Can you tell me what the temperature will be in Dallas, in Fahrenheit?",
    },
]

data = {"model": model, "messages": messages, "tools": tools}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json()["choices"][0]["message"]["tool_calls"])
# [{'id': '8PdihwL6d', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{"city": "Dallas", "state": "TX", "unit": "fahrenheit"}'}}]
```

</details>

## The Mistral AI Team

Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Alok Kothari, Antoine Roux, Arthur Mensch, Audrey Herblin-Stoop, Augustin Garreau, Austin Birky, Bam4d, Baptiste Bout, Baudouin de Monicault, Blanche Savary, Carole Rambaud, Caroline Feldman, Devendra Singh Chaplot, Diego de las Casas, Diogo Costa, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger, Gaspard Blanchet, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona, Henri Roussez, Hichem Sattouf, Ian Mack, Jean-Malo Delignon, Jessica Chudnovsky, Justus Murke, Kartik Khandelwal, Lawrence Stewart, Louis Martin, Louis Ternon, Lucile Saulnier, Lélio Renard Lavaud, Margaret Jennings, Marie Pellat, Marie Torelli, Marie-Anne Lachaux, Marjorie Janiewicz, Mickaël Seznec, Nicolas Schuhl, Niklas Muhs, Olivier de Garrigues, Patrick von Platen, Paul Jacob, Pauline Buche, Pavan Kumar Reddy, Perry Savas, Pierre Stock, Romain Sauvestre, Sagar Vaze, Sandeep Subramanian, Saurabh Garg, Sophia Yang, Szymon Antoniak, Teven Le Scao, Thibault Schueller, Thibaut Lavril, Thomas Wang, Théophile Gervet, Timothée Lacroix, Valera Nemychnikova, Wendy Shang, William El Sayed, William Marshall
chat_template.jinja
ADDED
@@ -0,0 +1 @@
{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST]' }}{% elif message['role'] == 'system' %}{{ '[SYSTEM_PROMPT] ' + message['content'] + '[/SYSTEM_PROMPT]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + eos_token }}{% else %}{{ raise_exception('Only user, system and assistant roles are supported!') }}{% endif %}{% endfor %}
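
This template covers the plain chat path (no tool calls); when loading the checkpoint through transformers instead of mistral-common, it is what `apply_chat_template` renders. A quick way to inspect the exact prompt string (a sketch; the path placeholder stands in for wherever this repo is stored):

```py
# Sketch: render chat_template.jinja via the transformers tokenizer.
from transformers import AutoTokenizer

# "<local-path-or-repo-id>" is a placeholder for this checkpoint's location.
tok = AutoTokenizer.from_pretrained("<local-path-or-repo-id>")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# tokenize=False returns the raw string, so the [SYSTEM_PROMPT]/[INST] markers
# and the whitespace around them can be checked against the template above.
print(tok.apply_chat_template(messages, tokenize=False))
```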
config.json
ADDED
@@ -0,0 +1,76 @@
{
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "dtype": "bfloat16",
  "eos_token_id": 2,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 12288,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 131072,
  "model_type": "mistral",
  "num_attention_heads": 96,
  "num_hidden_layers": 88,
  "num_key_value_heads": 8,
  "quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "nvfp4-pack-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": "local",
          "group_size": 16,
          "num_bits": 4,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": "torch.float8_e4m3fn",
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 16,
          "num_bits": 4,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": "torch.float8_e4m3fn",
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        }
      }
    },
    "format": "nvfp4-pack-quantized",
    "global_compression_ratio": null,
    "ignore": [
      "lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "compressed",
    "sparsity_config": {},
    "transform_config": {},
    "version": "0.13.0"
  },
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.3",
  "use_cache": true,
  "vocab_size": 32768
}
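
The `quantization_config` above is what compressed-tensors-aware runtimes (vLLM, SGLang) read to select NVFP4 kernels: 4-bit float weights and input activations in groups of 16, each group carrying an FP8 (e4m3) scale, with `lm_head` left unquantized. A quick sanity check of the deployed config (a sketch; the path placeholder is illustrative):

```py
# Sketch: confirm the quantization scheme this checkpoint advertises.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("<local-path-or-repo-id>")  # placeholder
qc = cfg.quantization_config
print(qc["quant_method"])   # compressed-tensors
print(qc["format"])         # nvfp4-pack-quantized
print(qc["ignore"])         # ['lm_head']
```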
generation_config.json
ADDED
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "transformers_version": "4.57.3"
}
help.txt
ADDED
@@ -0,0 +1,304 @@
usage: launch_server.py [-h] --model-path MODEL_PATH
                        [--tokenizer-path TOKENIZER_PATH]
                        [--tokenizer-mode {auto,slow}]
                        [--tokenizer-worker-num TOKENIZER_WORKER_NUM]
                        [--skip-tokenizer-init]
                        [--load-format {auto,pt,safetensors,npcache,dummy,sharded_state,gguf,bitsandbytes,layered,flash_rl,remote,remote_instance,fastsafetensors,private}]
                        [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG]
                        [--trust-remote-code]
                        [--context-length CONTEXT_LENGTH] [--is-embedding]
                        [--enable-multimodal] [--revision REVISION]
                        [--model-impl MODEL_IMPL] [--host HOST] [--port PORT]
                        [--fastapi-root-path FASTAPI_ROOT_PATH] [--grpc-mode]
                        [--skip-server-warmup] [--warmups WARMUPS]
                        [--nccl-port NCCL_PORT]
                        [--checkpoint-engine-wait-weights-before-ready]
                        [--dtype {auto,half,float16,bfloat16,float,float32}]
                        [--quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,modelopt_fp8,modelopt_fp4,petit_nvfp4,w8a8_int8,w8a8_fp8,moe_wna16,qoq,w4afp8,mxfp4,auto-round,compressed-tensors,modelslim,quark_int4fp8_moe}]
                        [--quantization-param-path QUANTIZATION_PARAM_PATH]
                        [--kv-cache-dtype {auto,fp8_e5m2,fp8_e4m3,bf16,bfloat16,fp4_e2m1}]
                        [--enable-fp32-lm-head]
                        [--modelopt-quant MODELOPT_QUANT]
                        [--modelopt-checkpoint-restore-path MODELOPT_CHECKPOINT_RESTORE_PATH]
                        [--modelopt-checkpoint-save-path MODELOPT_CHECKPOINT_SAVE_PATH]
                        [--modelopt-export-path MODELOPT_EXPORT_PATH]
                        [--quantize-and-serve]
                        [--rl-quant-profile RL_QUANT_PROFILE]
                        [--mem-fraction-static MEM_FRACTION_STATIC]
                        [--max-running-requests MAX_RUNNING_REQUESTS]
                        [--max-queued-requests MAX_QUEUED_REQUESTS]
                        [--max-total-tokens MAX_TOTAL_TOKENS]
                        [--chunked-prefill-size CHUNKED_PREFILL_SIZE]
                        [--prefill-max-requests PREFILL_MAX_REQUESTS]
                        [--enable-dynamic-chunking]
                        [--max-prefill-tokens MAX_PREFILL_TOKENS]
                        [--schedule-policy {lpm,random,fcfs,dfs-weight,lof,priority,routing-key}]
                        [--enable-priority-scheduling]
                        [--abort-on-priority-when-disabled]
                        [--schedule-low-priority-values-first]
                        [--priority-scheduling-preemption-threshold PRIORITY_SCHEDULING_PREEMPTION_THRESHOLD]
                        [--schedule-conservativeness SCHEDULE_CONSERVATIVENESS]
                        [--page-size PAGE_SIZE] [--hybrid-kvcache-ratio]
                        [--swa-full-tokens-ratio SWA_FULL_TOKENS_RATIO]
                        [--disable-hybrid-swa-memory]
                        [--radix-eviction-policy {lru,lfu}]
                        [--enable-prefill-delayer]
                        [--prefill-delayer-max-delay-passes PREFILL_DELAYER_MAX_DELAY_PASSES]
                        [--prefill-delayer-token-usage-low-watermark PREFILL_DELAYER_TOKEN_USAGE_LOW_WATERMARK]
                        [--prefill-delayer-forward-passes-buckets PREFILL_DELAYER_FORWARD_PASSES_BUCKETS [PREFILL_DELAYER_FORWARD_PASSES_BUCKETS ...]]
                        [--prefill-delayer-wait-seconds-buckets PREFILL_DELAYER_WAIT_SECONDS_BUCKETS [PREFILL_DELAYER_WAIT_SECONDS_BUCKETS ...]]
                        [--device DEVICE]
                        [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
                        [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                        [--pp-max-micro-batch-size PP_MAX_MICRO_BATCH_SIZE]
                        [--pp-async-batch-depth PP_ASYNC_BATCH_DEPTH]
                        [--stream-interval STREAM_INTERVAL] [--stream-output]
                        [--random-seed RANDOM_SEED]
                        [--constrained-json-whitespace-pattern CONSTRAINED_JSON_WHITESPACE_PATTERN]
                        [--constrained-json-disable-any-whitespace]
                        [--watchdog-timeout WATCHDOG_TIMEOUT]
                        [--soft-watchdog-timeout SOFT_WATCHDOG_TIMEOUT]
                        [--dist-timeout DIST_TIMEOUT]
                        [--download-dir DOWNLOAD_DIR]
                        [--model-checksum [MODEL_CHECKSUM]]
                        [--base-gpu-id BASE_GPU_ID]
                        [--gpu-id-step GPU_ID_STEP] [--sleep-on-idle]
                        [--custom-sigquit-handler CUSTOM_SIGQUIT_HANDLER]
                        [--log-level LOG_LEVEL]
                        [--log-level-http LOG_LEVEL_HTTP] [--log-requests]
                        [--log-requests-level {0,1,2,3}]
                        [--log-requests-format {text,json}]
                        [--log-requests-target LOG_REQUESTS_TARGET [LOG_REQUESTS_TARGET ...]]
                        [--uvicorn-access-log-exclude-prefixes [UVICORN_ACCESS_LOG_EXCLUDE_PREFIXES ...]]
                        [--crash-dump-folder CRASH_DUMP_FOLDER]
                        [--show-time-cost] [--enable-metrics]
                        [--enable-metrics-for-all-schedulers]
                        [--tokenizer-metrics-custom-labels-header TOKENIZER_METRICS_CUSTOM_LABELS_HEADER]
                        [--tokenizer-metrics-allowed-custom-labels TOKENIZER_METRICS_ALLOWED_CUSTOM_LABELS [TOKENIZER_METRICS_ALLOWED_CUSTOM_LABELS ...]]
                        [--bucket-time-to-first-token BUCKET_TIME_TO_FIRST_TOKEN [BUCKET_TIME_TO_FIRST_TOKEN ...]]
                        [--bucket-inter-token-latency BUCKET_INTER_TOKEN_LATENCY [BUCKET_INTER_TOKEN_LATENCY ...]]
                        [--bucket-e2e-request-latency BUCKET_E2E_REQUEST_LATENCY [BUCKET_E2E_REQUEST_LATENCY ...]]
                        [--collect-tokens-histogram]
                        [--prompt-tokens-buckets PROMPT_TOKENS_BUCKETS [PROMPT_TOKENS_BUCKETS ...]]
                        [--generation-tokens-buckets GENERATION_TOKENS_BUCKETS [GENERATION_TOKENS_BUCKETS ...]]
                        [--gc-warning-threshold-secs GC_WARNING_THRESHOLD_SECS]
                        [--decode-log-interval DECODE_LOG_INTERVAL]
                        [--enable-request-time-stats-logging]
                        [--kv-events-config KV_EVENTS_CONFIG] [--enable-trace]
                        [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
                        [--export-metrics-to-file]
                        [--export-metrics-to-file-dir EXPORT_METRICS_TO_FILE_DIR]
                        [--api-key API_KEY] [--admin-api-key ADMIN_API_KEY]
                        [--served-model-name SERVED_MODEL_NAME]
                        [--weight-version WEIGHT_VERSION]
                        [--chat-template CHAT_TEMPLATE]
                        [--hf-chat-template-name HF_CHAT_TEMPLATE_NAME]
                        [--completion-template COMPLETION_TEMPLATE]
                        [--file-storage-path FILE_STORAGE_PATH]
                        [--enable-cache-report]
                        [--reasoning-parser {deepseek-r1,deepseek-v3,glm45,gpt-oss,kimi,kimi_k2,qwen3,qwen3-thinking,minimax,minimax-append-think,step3,nano_v3,interns1}]
                        [--tool-call-parser {deepseekv3,deepseekv31,deepseekv32,glm,glm45,glm47,gpt-oss,kimi_k2,llama3,mimo,mistral,pythonic,qwen,qwen25,qwen3_coder,step3,minimax-m2,interns1}]
                        [--tool-server TOOL_SERVER]
                        [--sampling-defaults {openai,model}]
                        [--data-parallel-size DATA_PARALLEL_SIZE]
                        [--load-balance-method {auto,round_robin,follow_bootstrap_room,total_requests,total_tokens}]
                        [--prefill-round-robin-balance]
                        [--dist-init-addr DIST_INIT_ADDR] [--nnodes NNODES]
                        [--node-rank NODE_RANK]
                        [--json-model-override-args JSON_MODEL_OVERRIDE_ARGS]
                        [--preferred-sampling-params PREFERRED_SAMPLING_PARAMS]
                        [--enable-lora] [--enable-lora-overlap-loading]
                        [--max-lora-rank MAX_LORA_RANK]
                        [--lora-target-modules [{q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj,qkv_proj,gate_up_proj,embed_tokens,lm_head,all} ...]]
                        [--lora-paths [LORA_PATHS ...]]
                        [--max-loras-per-batch MAX_LORAS_PER_BATCH]
                        [--max-loaded-loras MAX_LOADED_LORAS]
                        [--lora-eviction-policy {lru,fifo}]
                        [--lora-backend {triton,csgmv,ascend,torch_native}]
                        [--max-lora-chunk-size {16,32,64,128}]
                        [--attention-backend {triton,torch_native,flex_attention,nsa,cutlass_mla,fa3,fa4,flashinfer,flashmla,trtllm_mla,trtllm_mha,dual_chunk_flash_attn,aiter,wave,intel_amx,ascend,intel_xpu}]
                        [--prefill-attention-backend {triton,torch_native,flex_attention,nsa,cutlass_mla,fa3,fa4,flashinfer,flashmla,trtllm_mla,trtllm_mha,dual_chunk_flash_attn,aiter,wave,intel_amx,ascend,intel_xpu}]
                        [--decode-attention-backend {triton,torch_native,flex_attention,nsa,cutlass_mla,fa3,fa4,flashinfer,flashmla,trtllm_mla,trtllm_mha,dual_chunk_flash_attn,aiter,wave,intel_amx,ascend,intel_xpu}]
                        [--sampling-backend {flashinfer,pytorch,ascend}]
                        [--grammar-backend {xgrammar,outlines,llguidance,none}]
                        [--mm-attention-backend {sdpa,fa3,triton_attn,ascend_attn,aiter_attn}]
                        [--nsa-prefill-backend {flashmla_sparse,flashmla_kv,flashmla_auto,fa3,tilelang,aiter}]
                        [--nsa-decode-backend {flashmla_sparse,flashmla_kv,flashmla_auto,fa3,tilelang,aiter}]
                        [--fp8-gemm-backend {auto,deep_gemm,flashinfer_trtllm,cutlass,triton,aiter}]
                        [--fp4-gemm-backend {auto,flashinfer_cudnn,flashinfer_cutlass,flashinfer_trtllm}]
                        [--disable-flashinfer-autotune]
                        [--speculative-algorithm {EAGLE,EAGLE3,NEXTN,STANDALONE,NGRAM}]
                        [--speculative-draft-model-path SPECULATIVE_DRAFT_MODEL_PATH]
                        [--speculative-draft-model-revision SPECULATIVE_DRAFT_MODEL_REVISION]
                        [--speculative-draft-load-format {auto,pt,safetensors,npcache,dummy,sharded_state,gguf,bitsandbytes,layered,flash_rl,remote,remote_instance,fastsafetensors,private}]
                        [--speculative-num-steps SPECULATIVE_NUM_STEPS]
                        [--speculative-eagle-topk SPECULATIVE_EAGLE_TOPK]
                        [--speculative-num-draft-tokens SPECULATIVE_NUM_DRAFT_TOKENS]
                        [--speculative-accept-threshold-single SPECULATIVE_ACCEPT_THRESHOLD_SINGLE]
                        [--speculative-accept-threshold-acc SPECULATIVE_ACCEPT_THRESHOLD_ACC]
                        [--speculative-token-map SPECULATIVE_TOKEN_MAP]
                        [--speculative-attention-mode {prefill,decode}]
                        [--speculative-draft-attention-backend SPECULATIVE_DRAFT_ATTENTION_BACKEND]
                        [--speculative-moe-runner-backend {auto,deep_gemm,triton,triton_kernel,flashinfer_trtllm,flashinfer_cutlass,flashinfer_mxfp4,flashinfer_cutedsl,cutlass}]
                        [--speculative-moe-a2a-backend {none,deepep,mooncake,ascend_fuseep}]
                        [--speculative-draft-model-quantization {awq,fp8,gptq,marlin,gptq_marlin,awq_marlin,bitsandbytes,gguf,modelopt,modelopt_fp8,modelopt_fp4,petit_nvfp4,w8a8_int8,w8a8_fp8,moe_wna16,qoq,w4afp8,mxfp4,auto-round,compressed-tensors,modelslim,quark_int4fp8_moe,unquant}]
                        [--speculative-ngram-min-match-window-size SPECULATIVE_NGRAM_MIN_MATCH_WINDOW_SIZE]
                        [--speculative-ngram-max-match-window-size SPECULATIVE_NGRAM_MAX_MATCH_WINDOW_SIZE]
                        [--speculative-ngram-min-bfs-breadth SPECULATIVE_NGRAM_MIN_BFS_BREADTH]
                        [--speculative-ngram-max-bfs-breadth SPECULATIVE_NGRAM_MAX_BFS_BREADTH]
                        [--speculative-ngram-match-type {BFS,PROB}]
                        [--speculative-ngram-branch-length SPECULATIVE_NGRAM_BRANCH_LENGTH]
                        [--speculative-ngram-capacity SPECULATIVE_NGRAM_CAPACITY]
                        [--enable-multi-layer-eagle]
                        [--expert-parallel-size EXPERT_PARALLEL_SIZE]
                        [--moe-a2a-backend {none,deepep,mooncake,ascend_fuseep}]
                        [--moe-runner-backend {auto,deep_gemm,triton,triton_kernel,flashinfer_trtllm,flashinfer_cutlass,flashinfer_mxfp4,flashinfer_cutedsl,cutlass}]
                        [--flashinfer-mxfp4-moe-precision {default,bf16}]
                        [--enable-flashinfer-allreduce-fusion]
                        [--deepep-mode {normal,low_latency,auto}]
                        [--ep-num-redundant-experts EP_NUM_REDUNDANT_EXPERTS]
                        [--ep-dispatch-algorithm EP_DISPATCH_ALGORITHM]
                        [--init-expert-location INIT_EXPERT_LOCATION]
                        [--enable-eplb] [--eplb-algorithm EPLB_ALGORITHM]
                        [--eplb-rebalance-num-iterations EPLB_REBALANCE_NUM_ITERATIONS]
                        [--eplb-rebalance-layers-per-chunk EPLB_REBALANCE_LAYERS_PER_CHUNK]
                        [--eplb-min-rebalancing-utilization-threshold EPLB_MIN_REBALANCING_UTILIZATION_THRESHOLD]
                        [--expert-distribution-recorder-mode EXPERT_DISTRIBUTION_RECORDER_MODE]
                        [--expert-distribution-recorder-buffer-size EXPERT_DISTRIBUTION_RECORDER_BUFFER_SIZE]
                        [--enable-expert-distribution-metrics]
                        [--deepep-config DEEPEP_CONFIG]
                        [--moe-dense-tp-size MOE_DENSE_TP_SIZE]
                        [--elastic-ep-backend {none,mooncake}]
                        [--mooncake-ib-device MOONCAKE_IB_DEVICE]
                        [--max-mamba-cache-size MAX_MAMBA_CACHE_SIZE]
                        [--mamba-ssm-dtype {float32,bfloat16}]
                        [--mamba-full-memory-ratio MAMBA_FULL_MEMORY_RATIO]
                        [--mamba-scheduler-strategy {auto,no_buffer,extra_buffer}]
                        [--mamba-track-interval MAMBA_TRACK_INTERVAL]
                        [--enable-hierarchical-cache]
                        [--hicache-ratio HICACHE_RATIO]
                        [--hicache-size HICACHE_SIZE]
                        [--hicache-write-policy {write_back,write_through,write_through_selective}]
                        [--hicache-io-backend {direct,kernel,kernel_ascend}]
                        [--hicache-mem-layout {layer_first,page_first,page_first_direct,page_first_kv_split,page_head}]
                        [--disable-hicache-numa-detect]
                        [--hicache-storage-backend {file,mooncake,hf3fs,nixl,aibrix,dynamic,eic}]
                        [--hicache-storage-prefetch-policy {best_effort,wait_complete,timeout}]
                        [--hicache-storage-backend-extra-config HICACHE_STORAGE_BACKEND_EXTRA_CONFIG]
                        [--hierarchical-sparse-attention-extra-config HIERARCHICAL_SPARSE_ATTENTION_EXTRA_CONFIG]
                        [--enable-lmcache] [--kt-weight-path KT_WEIGHT_PATH]
                        [--kt-method KT_METHOD] [--kt-cpuinfer KT_CPUINFER]
                        [--kt-threadpool-count KT_THREADPOOL_COUNT]
                        [--kt-num-gpu-experts KT_NUM_GPU_EXPERTS]
                        [--kt-max-deferred-experts-per-token KT_MAX_DEFERRED_EXPERTS_PER_TOKEN]
                        [--dllm-algorithm DLLM_ALGORITHM]
                        [--dllm-algorithm-config DLLM_ALGORITHM_CONFIG]
                        [--enable-double-sparsity]
                        [--ds-channel-config-path DS_CHANNEL_CONFIG_PATH]
                        [--ds-heavy-channel-num DS_HEAVY_CHANNEL_NUM]
                        [--ds-heavy-token-num DS_HEAVY_TOKEN_NUM]
                        [--ds-heavy-channel-type DS_HEAVY_CHANNEL_TYPE]
                        [--ds-sparse-decode-threshold DS_SPARSE_DECODE_THRESHOLD]
                        [--cpu-offload-gb CPU_OFFLOAD_GB]
                        [--offload-group-size OFFLOAD_GROUP_SIZE]
                        [--offload-num-in-group OFFLOAD_NUM_IN_GROUP]
                        [--offload-prefetch-step OFFLOAD_PREFETCH_STEP]
                        [--offload-mode OFFLOAD_MODE]
                        [--multi-item-scoring-delimiter MULTI_ITEM_SCORING_DELIMITER]
                        [--disable-radix-cache]
                        [--cuda-graph-max-bs CUDA_GRAPH_MAX_BS]
                        [--cuda-graph-bs CUDA_GRAPH_BS [CUDA_GRAPH_BS ...]]
                        [--disable-cuda-graph] [--disable-cuda-graph-padding]
                        [--enable-profile-cuda-graph] [--enable-cudagraph-gc]
                        [--enable-layerwise-nvtx-marker] [--enable-nccl-nvls]
                        [--enable-symm-mem]
                        [--disable-flashinfer-cutlass-moe-fp4-allgather]
                        [--enable-tokenizer-batch-encode]
                        [--disable-tokenizer-batch-decode]
                        [--disable-outlines-disk-cache]
                        [--disable-custom-all-reduce] [--enable-mscclpp]
                        [--enable-torch-symm-mem] [--disable-overlap-schedule]
                        [--enable-mixed-chunk] [--enable-dp-attention]
                        [--enable-dp-lm-head] [--enable-two-batch-overlap]
                        [--enable-single-batch-overlap]
                        [--tbo-token-distribution-threshold TBO_TOKEN_DISTRIBUTION_THRESHOLD]
                        [--enable-torch-compile]
                        [--enable-torch-compile-debug-mode]
                        [--enable-piecewise-cuda-graph]
                        [--piecewise-cuda-graph-tokens PIECEWISE_CUDA_GRAPH_TOKENS [PIECEWISE_CUDA_GRAPH_TOKENS ...]]
                        [--piecewise-cuda-graph-compiler {eager,inductor}]
                        [--torch-compile-max-bs TORCH_COMPILE_MAX_BS]
                        [--piecewise-cuda-graph-max-tokens PIECEWISE_CUDA_GRAPH_MAX_TOKENS]
                        [--torchao-config TORCHAO_CONFIG]
                        [--enable-nan-detection] [--enable-p2p-check]
                        [--triton-attention-reduce-in-fp32]
                        [--triton-attention-num-kv-splits TRITON_ATTENTION_NUM_KV_SPLITS]
                        [--triton-attention-split-tile-size TRITON_ATTENTION_SPLIT_TILE_SIZE]
                        [--num-continuous-decode-steps NUM_CONTINUOUS_DECODE_STEPS]
                        [--delete-ckpt-after-loading] [--enable-memory-saver]
                        [--enable-weights-cpu-backup]
                        [--enable-draft-weights-cpu-backup]
                        [--allow-auto-truncate]
                        [--enable-custom-logit-processor]
                        [--flashinfer-mla-disable-ragged]
                        [--disable-shared-experts-fusion]
                        [--disable-chunked-prefix-cache]
                        [--disable-fast-image-processor]
                        [--keep-mm-feature-on-device]
                        [--enable-return-hidden-states]
                        [--enable-return-routed-experts]
                        [--scheduler-recv-interval SCHEDULER_RECV_INTERVAL]
                        [--numa-node NUMA_NODE [NUMA_NODE ...]]
                        [--enable-deterministic-inference]
                        [--rl-on-policy-target {fsdp}]
                        [--enable-attn-tp-input-scattered]
                        [--enable-nsa-prefill-context-parallel]
                        [--nsa-prefill-cp-mode {in-seq-split,round-robin-split}]
                        [--enable-fused-qk-norm-rope]
                        [--enable-precise-embedding-interpolation]
                        [--enable-dynamic-batch-tokenizer]
                        [--dynamic-batch-tokenizer-batch-size DYNAMIC_BATCH_TOKENIZER_BATCH_SIZE]
                        [--dynamic-batch-tokenizer-batch-timeout DYNAMIC_BATCH_TOKENIZER_BATCH_TIMEOUT]
                        [--debug-tensor-dump-output-folder DEBUG_TENSOR_DUMP_OUTPUT_FOLDER]
                        [--debug-tensor-dump-layers DEBUG_TENSOR_DUMP_LAYERS [DEBUG_TENSOR_DUMP_LAYERS ...]]
                        [--debug-tensor-dump-input-file DEBUG_TENSOR_DUMP_INPUT_FILE]
                        [--debug-tensor-dump-inject DEBUG_TENSOR_DUMP_INJECT]
                        [--disaggregation-mode {null,prefill,decode}]
                        [--disaggregation-transfer-backend {mooncake,nixl,ascend,fake}]
                        [--disaggregation-bootstrap-port DISAGGREGATION_BOOTSTRAP_PORT]
                        [--disaggregation-decode-tp DISAGGREGATION_DECODE_TP]
                        [--disaggregation-decode-dp DISAGGREGATION_DECODE_DP]
                        [--disaggregation-prefill-pp DISAGGREGATION_PREFILL_PP]
                        [--disaggregation-ib-device DISAGGREGATION_IB_DEVICE]
                        [--disaggregation-decode-enable-offload-kvcache]
                        [--disaggregation-decode-enable-fake-auto]
                        [--num-reserved-decode-tokens NUM_RESERVED_DECODE_TOKENS]
                        [--disaggregation-decode-polling-interval DISAGGREGATION_DECODE_POLLING_INTERVAL]
                        [--encoder-only] [--language-only]
                        [--encoder-transfer-backend {zmq_to_scheduler,zmq_to_tokenizer,mooncake}]
                        [--encoder-urls ENCODER_URLS [ENCODER_URLS ...]]
                        [--custom-weight-loader [CUSTOM_WEIGHT_LOADER ...]]
                        [--weight-loader-disable-mmap]
                        [--remote-instance-weight-loader-seed-instance-ip REMOTE_INSTANCE_WEIGHT_LOADER_SEED_INSTANCE_IP]
                        [--remote-instance-weight-loader-seed-instance-service-port REMOTE_INSTANCE_WEIGHT_LOADER_SEED_INSTANCE_SERVICE_PORT]
                        [--remote-instance-weight-loader-send-weights-group-ports REMOTE_INSTANCE_WEIGHT_LOADER_SEND_WEIGHTS_GROUP_PORTS]
                        [--remote-instance-weight-loader-backend {transfer_engine,nccl}]
                        [--remote-instance-weight-loader-start-seed-via-transfer-engine]
                        [--enable-pdmux]
                        [--pdmux-config-path PDMUX_CONFIG_PATH]
                        [--sm-group-num SM_GROUP_NUM] [--config CONFIG]
                        [--mm-max-concurrent-calls MM_MAX_CONCURRENT_CALLS]
                        [--mm-per-request-timeout MM_PER_REQUEST_TIMEOUT]
                        [--enable-broadcast-mm-inputs-process]
                        [--mm-process-config MM_PROCESS_CONFIG]
                        [--mm-enable-dp-encoder]
                        [--limit-mm-data-per-request LIMIT_MM_DATA_PER_REQUEST]
                        [--decrypted-config-file DECRYPTED_CONFIG_FILE]
                        [--decrypted-draft-config-file DECRYPTED_DRAFT_CONFIG_FILE]
                        [--enable-prefix-mm-cache]
                        [--forward-hooks FORWARD_HOOKS]
$ nvidia-smi -L
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (UUID: GPU-9d2a0cb2-d380-60e5-580d-cd14950c67a9)
$ nvidia-smi --query-gpu=memory.free,memory.used,memory.total --format=csv
memory.free [MiB], memory.used [MiB], memory.total [MiB]
97215 MiB, 30 MiB, 97887 MiB
model-00001-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:447d62c327213921556b02ca0d1ac13c562bd9de6d56c9bf046beefee297dd9d
size 4882434912
model-00002-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3abeb410e6c2c7b26029bdb1ceffadfc8676c56524e2549d830528843721a33d
size 4869903000
model-00003-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:736f493ed8214d6e07aac0cdf74359b0daa13bbf1dacfd0f052e798dc7b45bc2
size 4869903136
model-00004-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c61a87b5cc0be08071aad2705fe28354c44d3465f0c242fe7511f023614fcddb
size 4969044352
model-00005-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2471290d5d20d6b78aedb970236dcb796980b7f3edcf7b21464af2f00acaaeca
size 4954838264
model-00006-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c52bab97a0bd7113e89828a21c301fbcd5a4e5ac27ae04f1d2aa7318cc4fac63
size 4869903136
model-00007-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:44f4497add18953da7d0071b9c722e50521d4ce2a97fc5cba814500388d6e1af
size 4969044352
model-00008-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:985384ef888837d4d89060ba0e9b8586b772e232f2d13b7571a6d1dbc38bcc82
size 4954838264
model-00009-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2457facc7d24efd840f2a2fa6ce7c416553aeecd5dbb9ce52e4d13f84c731394
size 4869903136
model-00010-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:47dade11e77d8e5f48a0e6f202b5da33b90c3fc07112b4c718337cf7c9c3b98a
size 4969044352
model-00011-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4d3b421695eca4697020448f10873255625a1334eb002963e7266835e762fa64
size 4954838264
model-00012-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ff074803df240f9c5ca64686b459a51e5d32007d4ea7886f91bb7c7e054f8e02
size 4869903136
model-00013-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5acbc16127454a327ab2afad1b9e328cfeb8d9fc6f9e4ff9947fa01dd4685fc0
size 4969044352
model-00014-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5b5cebc7c7d0aee6588b8424c457c5d527623ebb660992bcd59e45aa3903b913
size 4954838264
model-00015-of-00015.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:56f4a96b87c468ed924a6661edd0f918e7fedbd39eda45711654e57d31d45722
size 1201743176
model.safetensors.index.json
ADDED
The diff for this file is too large to render.
recipe.yaml
ADDED
@@ -0,0 +1,6 @@
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
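
This is an llm-compressor recipe: it quantizes every `Linear` layer except `lm_head` with the NVFP4 scheme, matching the `quantization_config` baked into config.json. A sketch of how such a recipe is typically applied (the `oneshot` entry point and arguments follow recent llm-compressor releases and are assumptions, not a record of the exact command used to produce this checkpoint):

```py
# Sketch: one-shot NVFP4 quantization with llm-compressor.
# API per recent llm-compressor releases; dataset choice is illustrative.
from llmcompressor import oneshot

oneshot(
    model="mistralai/Mistral-Large-Instruct-2411",
    recipe="recipe.yaml",                 # the recipe shown above
    dataset="open_platypus",              # calibration data for the static scales
    num_calibration_samples=512,
    max_seq_length=2048,
    output_dir="Mistral-Large-Instruct-2411-NVFP4",
)
```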
sglang.sh
ADDED
@@ -0,0 +1,19 @@
#!/bin/bash

export SGLANG_ENABLE_JIT_DEEPGEMM=0

/home/administrator/workspace/sglang/python/.venv/bin/python -m sglang.launch_server \
    --model-path "/home/administrator/workspace2/models/Mistral-Large-Instruct-2411-NVFP4" \
    --served-model-name "Mistral-Large-Instruct" \
    --tp-size 1 \
    --attention-backend triton \
    --fp4-gemm-backend flashinfer_cutlass \
    --fp8-gemm-backend cutlass \
    --kv-cache-dtype fp8_e4m3 \
    --disable-radix-cache \
    --chunked-prefill-size -1 \
    --mem-fraction-static 0.85 \
    --tool-call-parser mistral \
    --max-total-tokens 65536 \
    --host "192.168.0.104" \
    --port 8006
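
The script pins the FP4 GEMMs to FlashInfer's CUTLASS backend and keeps the KV cache in FP8, which helps the 123B model fit on the single 96 GB GPU shown in help.txt. Once up, the server speaks the OpenAI-compatible API; a smoke test (a sketch; host, port and model name mirror the script, adjust to your setup):

```py
# Sketch: query the SGLang server launched by sglang.sh.
# Host, port and served model name are taken from the script above.
import requests

resp = requests.post(
    "http://192.168.0.104:8006/v1/chat/completions",
    json={
        "model": "Mistral-Large-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Say hello in French."},
        ],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```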
special_tokens_map.json
ADDED
@@ -0,0 +1,23 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
The diff for this file is too large to render.