Update README.md
README.md
```shell
sudo apt update
sudo apt install ocl-icd-opencl-dev clinfo vulkan-tools
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly
mlc_llm chat HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
```
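The `HF://` argument in the chat command points `mlc_llm` at a prebuilt MLC model hosted on the Hugging Face hub; the weights are fetched and cached on first use. As a rough sketch of how such a reference decomposes (`split_hf_ref` is a hypothetical helper for illustration only, not part of the mlc_llm API):

```python
def split_hf_ref(ref: str) -> tuple[str, str]:
    """Split an HF:// model reference into (organization, repository).

    Hypothetical helper; mlc_llm resolves these references internally
    when it downloads the model weights.
    """
    prefix = "HF://"
    if not ref.startswith(prefix):
        raise ValueError(f"not an HF reference: {ref!r}")
    org, _, repo = ref[len(prefix):].partition("/")
    return org, repo

print(split_hf_ref("HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"))
# → ('mlc-ai', 'Llama-3-8B-Instruct-q4f16_1-MLC')
```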
```
$ mlc_llm chat --help
usage: MLC LLM Chat CLI [-h] [--opt OPT] [--device DEVICE] [--overrides OVERRIDES] [--model-lib MODEL_LIB] model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
                        It can also be a link to a HF repository pointing to an MLC compiled model. (required)

options:
  -h, --help            show this help message and exit
  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
                        O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
                        optimization that could potentially break the system. Meanwhile, optimization flags could be
                        explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect from local available
                        GPUs if not specified. (default: "auto")
  --overrides OVERRIDES
                        Chat configuration override. Configurations to override ChatConfig. Supports `conv_template`,
                        `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
                        `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model chat could be explicitly specified
                        via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If unspecified, we will use
                        the provided ``model`` to search over possible paths. If the model lib is not found, it will be
                        compiled in a JIT manner. (default: "None")
```
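Both `--opt` and `--overrides` accept a semicolon-separated list of `key=value` knobs. A minimal sketch of that syntax, assuming plain splitting (`parse_knobs` is a hypothetical illustration, not the CLI's actual parser):

```python
def parse_knobs(spec: str) -> dict[str, str]:
    """Break a semicolon-separated key=value string into a dict.

    Hypothetical helper illustrating the --opt/--overrides knob
    syntax; the real mlc_llm parser is not shown here.
    """
    knobs = {}
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments / trailing semicolons
        key, _, value = item.partition("=")
        knobs[key.strip()] = value.strip()
    return knobs

print(parse_knobs("context_window_size=1024;prefill_chunk_size=128"))
# → {'context_window_size': '1024', 'prefill_chunk_size': '128'}
```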
```
$ mlc_llm compile --help
usage: mlc_llm compile [-h]
                       [--quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}]
                       [--model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}]
                       [--device DEVICE] [--host HOST] [--opt OPT] [--system-lib-prefix SYSTEM_LIB_PREFIX] --output OUTPUT
                       [--overrides OVERRIDES] [--debug-dump DEBUG_DUMP]
                       model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains `mlc-chat-config.json`.
                        It can also be a link to a HF repository pointing to an MLC compiled model. (required)

options:
  -h, --help            show this help message and exit
  --quantization {q0f16,q0f32,q3f16_0,q3f16_1,q4f16_0,q4f16_1,q4f32_1,q4f16_2,q4f16_autoawq,q4f16_ft,e5m2_e5m2_f16,e4m3_e4m3_f16}
                        The quantization mode we use to compile. If unprovided, will infer from `model`. (default: look up
                        mlc-chat-config.json, choices: q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f32_1, q4f16_2,
                        q4f16_autoawq, q4f16_ft, e5m2_e5m2_f16, e4m3_e4m3_f16)
  --model-type {auto,llama,mistral,gemma,gpt2,mixtral,gpt_neox,gpt_bigcode,phi-msft,phi,phi3,qwen,qwen2,stablelm,baichuan,internlm,rwkv5,orion,llava,rwkv6,chatglm,eagle,bert,medusa}
                        Model architecture such as "llama". If not set, it is inferred from `mlc-chat-config.json`.
                        (default: "auto")
  --device DEVICE       The GPU device to compile the model to. If not set, it is inferred from GPUs available locally.
                        (default: "auto")
  --host HOST           The host LLVM triple to compile the model to. If not set, it is inferred from the local CPU and OS.
                        Examples of the LLVM triple: 1) iPhones: arm64-apple-ios; 2) ARM64 Android phones: aarch64-linux-
                        android; 3) WebAssembly: wasm32-unknown-unknown-wasm; 4) Windows: x86_64-pc-windows-msvc; 5) ARM
                        macOS: arm64-apple-darwin. (default: "auto")
  --opt OPT             Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as O0, O1, O2,
                        O3, where O0 means no optimization, O2 means majority of them, and O3 represents extreme
                        optimization that could potentially break the system. Meanwhile, optimization flags could be
                        explicitly specified via details knobs, e.g. --opt="cublas_gemm=1;cudagraph=0". (default: "O2")
  --system-lib-prefix SYSTEM_LIB_PREFIX
                        Adding a prefix to all symbols exported. Similar to "objcopy --prefix-symbols". This is useful when
                        compiling multiple models into a single library to avoid symbol conflicts. Different from objcopy,
                        this takes no effect for shared library. (default: "auto")
  --output OUTPUT, -o OUTPUT
                        The path to the output file. The suffix determines if the output file is a shared library or
                        objects. Available suffixes: 1) Linux: .so (shared), .tar (objects); 2) macOS: .dylib (shared), .tar
                        (objects); 3) Windows: .dll (shared), .tar (objects); 4) Android, iOS: .tar (objects); 5) Web: .wasm
                        (web assembly). (required)
  --overrides OVERRIDES
                        Model configuration override. Configurations to override `mlc-chat-config.json`. Supports
                        `context_window_size`, `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
                        `max_batch_size` and `tensor_parallel_shards`. Meanwhile, model config could be explicitly specified
                        via details knobs, e.g. --overrides "context_window_size=1024;prefill_chunk_size=128". (default: "")
  --debug-dump DEBUG_DUMP
                        Specifies the directory where the compiler will store its IRs for debugging purposes during various
                        phases of compilation. By default, this is set to `None`, indicating that debug dumping is disabled.
                        (default: None)
```
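Per the `--output` help, the file suffix alone decides what kind of artifact `mlc_llm compile` produces. One way to picture that platform table (`artifact_kind` is a hypothetical helper for illustration, not part of mlc_llm):

```python
import os

def artifact_kind(output_path: str) -> str:
    """Classify a `mlc_llm compile --output` path by its suffix,
    following the platform table in the help text above.
    Hypothetical illustrative helper.
    """
    suffix_map = {
        ".so": "shared library (Linux)",
        ".dylib": "shared library (macOS)",
        ".dll": "shared library (Windows)",
        ".tar": "object archive (the only option on Android/iOS)",
        ".wasm": "WebAssembly (web)",
    }
    _, ext = os.path.splitext(output_path)
    if ext not in suffix_map:
        raise ValueError(f"unsupported output suffix: {ext!r}")
    return suffix_map[ext]

print(artifact_kind("Llama-3-8B-Instruct-q4f16_1.so"))
# → shared library (Linux)
```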