---
license: llama3
language:
- en
pipeline_tag: text-generation
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- meta-llama/Llama-3.3-70B-Instruct
tags:
- chat
library_name: transformers
---

# Model Overview

- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 1/28/2025

Quantized version of [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) with weights and activations quantized to the FP8 data type, ready for inference with SGLang >= 0.3 or vLLM >= 0.5.2.
This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformer blocks are quantized.
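
As a back-of-the-envelope illustration of the roughly 50% saving (the parameter count below is approximate, and real checkpoints carry extra per-channel scale tensors plus unquantized layers such as `lm_head`, so treat these as ballpark figures):

```python
# Rough weight-memory estimate for a ~70B-parameter model.
NUM_PARAMS = 70.6e9  # approximate parameter count

bf16_gib = NUM_PARAMS * 2 / 1024**3  # 16-bit: 2 bytes per parameter
fp8_gib = NUM_PARAMS * 1 / 1024**3   # FP8:    1 byte per parameter

print(f"BF16 weights: ~{bf16_gib:.0f} GiB")  # ~132 GiB
print(f"FP8 weights:  ~{fp8_gib:.0f} GiB")   # ~66 GiB
```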

## License

Note that [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) is released under the MIT license, while the original Llama model is released under the Llama 3 license. We adopt the Llama 3 license for this model for consistency.

## Deployment

### Use with SGLang

```bash
python -m sglang.launch_server --model-path JamAndTeaStudios/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic \
  --port 30000 --host 0.0.0.0
```
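
The launched server exposes an OpenAI-compatible API, so it can be queried with the standard `openai` client. A minimal sketch; the prompt and sampling settings are illustrative, and `api_key` is a dummy value since the local server does not require one:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local SGLang server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JamAndTeaStudios/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic",
    messages=[{"role": "user", "content": "Briefly explain FP8 quantization."}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```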
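
### Use with vLLM

This checkpoint can also be served with vLLM, as noted above. A minimal offline-inference sketch; `tensor_parallel_size=2` is an assumption (size it so the roughly 70 GiB of FP8 weights plus KV cache fit across your GPUs), and the prompt and sampling settings are illustrative:

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; vLLM picks up the compressed-tensors
# quantization config from the model's config.json.
llm = LLM(
    model="JamAndTeaStudios/DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic",
    tensor_parallel_size=2,  # assumption: adjust to your GPU count/memory
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Briefly explain FP8 quantization."], params)
print(outputs[0].outputs[0].text)
```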

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

<details>
<summary>Model Creation Code</summary>

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"

# 1) Load model.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 2) Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to fp8 with per-channel scales via PTQ
#   * quantize the activations to fp8 dynamically, per token
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# 3) Apply quantization and save in compressed-tensors format.
OUTPUT_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
oneshot(
    model=model,
    recipe=recipe,
    tokenizer=tokenizer,
    output_dir=OUTPUT_DIR,
)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")
```
</details>
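
The export writes a `quantization_config` block into the checkpoint's `config.json`, which offers a quick sanity check of the result. A small sketch, assuming the output directory produced by the snippet above:

```python
from transformers import AutoConfig

# Inspect the quantization metadata that llm-compressor wrote into config.json.
config = AutoConfig.from_pretrained("DeepSeek-R1-Distill-Llama-70B-FP8-Dynamic")
print(config.quantization_config)
```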

## Evaluation

TBA

## Play Retail Mage

![]()

[Retail Mage (Steam)](https://store.steampowered.com/app/3224380/Retail_Mage/) is an immersive sim that uses online LLM inference in almost every gameplay feature!

Reviews:

“A true to life experience detailing how customer service really works.”
10/10 – kpolupo

“I enjoyed how many things were flammable in the store.”
5/5 – mr_srsbsns

“I've only known that talking little crow plushie in MageMart for a day and a half but if anything happened to him I would petrify everyone in this store and then myself.”
7/7 – neondenki