---
license: apache-2.0
language:
- en
tags:
- deepseek
- fp8
- vllm
base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
library_name: transformers
---

# DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic

## Model Overview

- **Model Architecture:** DeepSeek-R1-Distill-Llama-70B
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 3/1/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B).
It achieves an average score of 76.52 on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 76.49.

### Model Optimizations

This model was obtained by quantizing the weights and activations of DeepSeek-R1-Distill-Llama-70B to the FP8 data type, ready for inference with vLLM.
This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformer blocks are quantized.

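
The roughly 50% saving can be checked with back-of-the-envelope arithmetic. The sketch below assumes a nominal 70 billion parameters (read off the model name, not an exact count) and considers weights only. Because parameters outside the transformer blocks' linear operators stay in 16-bit, the real saving is slightly under half.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 70e9  # assumption: nominal size implied by "70B", not an exact count

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
fp8_gib = weight_memory_gib(NUM_PARAMS, 8)

print(f"16-bit weights: ~{bf16_gib:.0f} GiB")
print(f"FP8 weights:    ~{fp8_gib:.0f} GiB ({fp8_gib / bf16_gib:.0%} of the 16-bit size)")
```

The estimate excludes the KV cache and activation buffers, which also consume GPU memory at inference time.
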
## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 4096, 1
model_name = "nm-testing/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
]

# Apply the chat template and tokenize each conversation
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

# Pass pre-tokenized prompts as token-ID dicts; the `prompt_token_ids=` keyword
# argument is deprecated in recent vLLM releases
prompts = [{"prompt_token_ids": token_ids} for token_ids in prompt_token_ids]
outputs = llm.generate(prompts, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

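
As an illustration of the OpenAI-compatible route, the sketch below queries a running server using only the Python standard library. It assumes the server listens on vLLM's default `http://localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical and not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port
MODEL = "nm-testing/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic"


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.3,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server started via `vllm serve nm-testing/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic`, calling `chat("Who are you?")` should return the assistant's reply as a string.
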
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
import argparse
import os

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot


def main():
    parser = argparse.ArgumentParser(description='Quantize a transformer model to FP8')
    parser.add_argument('--model_id', type=str, required=True,
                        help='The model ID from HuggingFace (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")')
    parser.add_argument('--save_path', type=str, default='.',
                        help='Directory in which the <model_name>-FP8-dynamic folder is created '
                             '(defaults to the current directory)')
    args = parser.parse_args()

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    # Configure the quantization algorithm and scheme
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )

    # Apply quantization
    oneshot(model=model, recipe=recipe)

    # Name the output folder after the last component of the model ID
    save_path = os.path.join(args.save_path, args.model_id.split("/")[-1] + "-FP8-dynamic")
    os.makedirs(save_path, exist_ok=True)

    # Save to disk in compressed-tensors format
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
    print(f"Model and tokenizer saved to: {save_path}")


if __name__ == "__main__":
    main()
```

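
The `FP8_DYNAMIC` scheme name indicates that activation scales are computed at inference time rather than calibrated ahead of time, which is consistent with the script above passing no calibration data to `oneshot`. The following plain-Python sketch illustrates the idea. It assumes the E4M3 flavor of FP8 (largest finite value 448) and one scale per token; both are assumptions about the preset rather than statements from this card, and the helper names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 format)


def round_to_e4m3(value: float) -> float:
    """Round a float to the nearest point on the E4M3 grid, saturating at +/-448."""
    if value == 0.0:
        return 0.0
    clipped = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value))
    # frexp returns clipped = m * 2**e with 0.5 <= |m| < 1, so floor(log2|x|) = e - 1
    exponent = max(math.frexp(clipped)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits give 8 steps per power of two
    return round(clipped / step) * step


def quantize_token_dynamic(activations):
    """Quantize one token's activations with a scale derived from that token alone."""
    amax = max(abs(x) for x in activations)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(x / scale) for x in activations], scale


def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


token = [0.02, -1.5, 3.0, 0.75]  # made-up activations for a single token
quantized, scale = quantize_token_dynamic(token)
restored = dequantize(quantized, scale)
for original, approx in zip(token, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Because the scale maps each token's largest magnitude onto 448, the full FP8 range is used for every token regardless of how activation magnitudes vary between tokens.
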
## Evaluation

The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/), using the following commands:

OpenLLM Leaderboard V1:

```
lm_eval \
  --model vllm \
  --model_args pretrained="nm-testing/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

OpenLLM Leaderboard V2:

```
lm_eval \
  --model vllm \
  --model_args pretrained="nm-testing/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=2,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks leaderboard \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```

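
The two commands differ only in `add_bos_token`, the task group, and the chat-template flags used for V2. A small sketch that assembles both argument lists from one place makes those differences explicit; the helper `lm_eval_command` is hypothetical, and it only builds and prints the commands without running them.

```python
import shlex

MODEL = "nm-testing/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic"


def lm_eval_command(version: str) -> list[str]:
    """Return the lm_eval argument list for leaderboard version 'v1' or 'v2'."""
    if version not in ("v1", "v2"):
        raise ValueError(f"unknown leaderboard version: {version!r}")
    is_v2 = version == "v2"
    model_args = {
        "pretrained": MODEL,
        "dtype": "auto",
        "add_bos_token": not is_v2,  # True for V1, False for V2
        "max_model_len": 4096,
        "tensor_parallel_size": 2,
        "gpu_memory_utilization": 0.8,
        "enable_chunked_prefill": True,
        "trust_remote_code": True,
    }
    command = [
        "lm_eval", "--model", "vllm",
        "--model_args", ",".join(f"{key}={value}" for key, value in model_args.items()),
    ]
    if is_v2:
        # V2 prompts go through the chat template, with few-shot examples as turns
        command += ["--apply_chat_template", "--fewshot_as_multiturn"]
    command += [
        "--tasks", "leaderboard" if is_v2 else "openllm",
        "--write_out", "--batch_size", "auto",
        "--output_path", "output_dir", "--show_config",
    ]
    return command


for version in ("v1", "v2"):
    print(shlex.join(lm_eval_command(version)))
```
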
### Accuracy

#### OpenLLM Leaderboard V1 evaluation scores

| Metric                            | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | nm-testing/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic |
|-----------------------------------|:-----------------------------------------:|:----------------------------------------------------:|
| ARC-Challenge (Acc-Norm, 25-shot) | 66.38                                     | 66.38                                                |
| GSM8K (Strict-Match, 5-shot)      | 92.87                                     | 93.25                                                |
| HellaSwag (Acc-Norm, 10-shot)     | 85.41                                     | 85.40                                                |
| MMLU (Acc, 5-shot)                | 79.02                                     | 78.84                                                |
| TruthfulQA (MC2, 0-shot)          | 57.24                                     | 57.54                                                |
| Winogrande (Acc, 5-shot)          | 78.06                                     | 77.74                                                |
| **Average Score**                 | **76.49**                                 | **76.52**                                            |
| **Recovery (%)**                  | **100.00**                                | **100.03**                                           |

#### OpenLLM Leaderboard V2 evaluation scores

| Metric                                            | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | nm-testing/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic |
|---------------------------------------------------|:-----------------------------------------:|:----------------------------------------------------:|
| IFEval (Inst-and-Prompt Level Strict Acc, 0-shot) | 43.51                                     | 42.47                                                |
| BBH (Acc-Norm, 3-shot)                            | 35.30                                     | 33.66                                                |
| MMLU-Pro (Acc, 5-shot)                            | 41.35                                     | 41.05                                                |
| **Average Score**                                 | **40.05**                                 | **39.06**                                            |
| **Recovery (%)**                                  | **100.00**                                | **97.53**                                            |
| Math-Hard (Exact-Match, 4-shot)                   | 5.55                                      | 9.03                                                 |
| GPQA (Acc-Norm, 0-shot)                           | 1.64                                      | 1.58                                                 |
| MUSR (Acc-Norm, 0-shot)                           | 13.28                                     | 13.80                                                |

Results on Math-Hard, GPQA, and MUSR are not considered in the accuracy recovery calculation because the unquantized model scores close to random on these tasks, which does not provide a reliable baseline for recovery.
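
The recovery figures can be approximately reproduced from the per-task scores. The sketch below assumes that recovery is the quantized model's average score expressed as a percentage of the unquantized model's average, which matches the V2 table to within rounding; the card does not spell out the formula. Last-digit differences are expected because the published per-task scores are already rounded.

```python
def average(scores):
    """Arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


def recovery_percent(baseline, quantized):
    """Average quantized score as a percentage of the average baseline score."""
    return 100.0 * average(quantized) / average(baseline)


# V2 scores for IFEval, BBH, and MMLU-Pro, copied from the table above
baseline_v2 = [43.51, 35.30, 41.35]
quantized_v2 = [42.47, 33.66, 41.05]

print(f"Unquantized average: {average(baseline_v2):.2f}")
print(f"Quantized average:   {average(quantized_v2):.2f}")
print(f"Recovery:            {recovery_percent(baseline_v2, quantized_v2):.2f}%")
```
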