---
license: apache-2.0
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
base_model_relation: quantized
pipeline_tag: text2text-generation
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Elastic model: DeepSeek-R1-Distill-Llama-8B. Fastest and most flexible models for self-serving.

Elastic models are produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you trade off model size, latency, and quality with a simple slider. For each model, ANNA produces a series of optimized variants:

* __XL__: Mathematically equivalent neural network, optimized with our DNN compiler.

* __L__: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.

* __M__: Faster model, with accuracy degradation below 1.5%.

* __S__: The fastest model, with accuracy degradation below 2%.

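As a rough illustration of how these tiers might be chosen, the sketch below picks the fastest tier whose nominal degradation bound fits a given quality budget. `TIER_BOUNDS` and `pick_mode` are hypothetical names, not part of `elastic_models`; the bounds are the nominal figures from the list above (treating XL as 0%), and the degradation actually measured for a given model may differ. The returned string is the kind of value the inference example passes as `mode=`.

```python
# Nominal upper bounds on accuracy degradation (in percent) per tier,
# taken from the tier list above. XL is described as mathematically
# equivalent, so this sketch assumes a bound of 0 for it.
# Tiers are ordered from fastest to slowest.
TIER_BOUNDS = [("S", 2.0), ("M", 1.5), ("L", 1.0), ("XL", 0.0)]


def pick_mode(max_degradation_pct: float) -> str:
    """Return the fastest tier whose nominal bound fits within the budget."""
    if max_degradation_pct < 0:
        raise ValueError("max_degradation_pct must be non-negative")
    for mode, bound in TIER_BOUNDS:
        if bound <= max_degradation_pct:
            return mode
    return "XL"  # not reached: XL has a bound of 0.0


# Example: a budget of 1.2% rules out S and M, so L is selected.
print(pick_mode(1.2))
```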
__Goals of elastic models:__

* Provide flexibility in choosing the cost vs. quality trade-off for inference.
* Provide clear quality and latency benchmarks.
* Provide the interface of the HF libraries `transformers` and `diffusers` with a single line of code.
* Provide models supported on a wide range of hardware, which are pre-compiled and require no JIT.
* Provide the best models and service for self-hosting.

> It's important to note that the specific quality degradation can vary from model to model. For instance, an S model can have as little as 0.5% degradation.

![Performance Graph](https://cdn-uploads.huggingface.co/production/uploads/67991798e5cd2ba7d4e2d9a2/SJAFnovJ8_sXt3cfJcO7c.jpeg)

-----

## Inference

To run inference with our models, just replace the `transformers` import with `elastic_models.transformers`:

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# An HF token is currently required, because the original weights
# are used for some of the layers and the original model
# configuration is loaded as well.
model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
hf_token = ''  # put your HF token here
device = torch.device("cuda")

# Create the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference works the same way as with the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
    {
        "role": "system",
        "content": "You are a search bot, answer on user text queries."
    },
    {
        "role": "user",
        "content": prompt
    }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt").to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

# Keep only the newly generated tokens (drop the prompt tokens)
input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Validate answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```

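Rather than hardcoding the token in the script, one option is to read it from an environment variable. The sketch below assumes the token is stored in `HF_TOKEN` (the variable name that `huggingface_hub` also recognizes); `read_hf_token` is a hypothetical helper, not part of `elastic_models`.

```python
import os


def read_hf_token(env=None) -> str:
    """Return the HF token from the environment, failing early if it is missing."""
    # `env` is injectable for testing; by default the process environment is used.
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable first")
    return token
```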
__System requirements:__
* GPUs: H100, L40S
* CPU: AMD, Intel
* Python: 3.10-3.12

To work with our models, just run these lines in your terminal:

```shell
pip install thestage
pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex
```

Then go to [app.thestage.ai](https://app.thestage.ai), log in, and generate an API token from your profile page. Set up the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models!

----

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The `W8A8, int8` column is a baseline in which W8A8 quantization with the int8 data type is applied to all linear layers, using the same calibration data as for ANNA. The S model achieves practically the same speed as this baseline but much higher quality, as ANNA knows how to improve quantization quality on sensitive layers!

### Quality benchmarks

| Metric/Model | S | M | L | XL | Original | W8A8, int8 |
|---------------|---|---|---|----|----------|------------|
| arc_challenge | 38.70 | 40.40 | 40.40 | 40.50 | 40.50 | 19.30 |
| mmlu | 52.70 | 54.70 | 55.50 | 54.80 | 54.80 | 47.70 |
| piqa | 76.30 | 75.90 | 75.70 | 76.10 | 76.10 | 55.00 |
| winogrande | 66.60 | 66.20 | 67.80 | 68.00 | 68.00 | 56.10 |

* **MMLU**: Evaluates general knowledge across 57 subjects including science, humanities, engineering, and more. Shows the model's ability to handle diverse academic topics.
* **PIQA**: Evaluates physical commonsense reasoning through questions about everyday physical interactions. Shows the model's understanding of real-world physics concepts.
* **ARC Challenge**: Evaluates grade-school level multiple-choice questions requiring reasoning. Shows the model's ability to solve complex reasoning tasks.
* **Winogrande**: Evaluates commonsense reasoning through sentence completion tasks. Shows the model's capability to understand context and resolve ambiguity.

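To make the comparison in the quality table easier to read, the sketch below averages the four metrics per model and reports the drop against the original model. How degradation is aggregated is not stated here, so the unweighted mean and the use of absolute percentage points are assumptions of this sketch; `SCORES`, `mean_score`, and `mean_drop` are illustrative names.

```python
# Scores copied from the quality table above (metric -> model -> score).
SCORES = {
    "arc_challenge": {"S": 38.70, "M": 40.40, "L": 40.40, "XL": 40.50,
                      "Original": 40.50, "W8A8, int8": 19.30},
    "mmlu":          {"S": 52.70, "M": 54.70, "L": 55.50, "XL": 54.80,
                      "Original": 54.80, "W8A8, int8": 47.70},
    "piqa":          {"S": 76.30, "M": 75.90, "L": 75.70, "XL": 76.10,
                      "Original": 76.10, "W8A8, int8": 55.00},
    "winogrande":    {"S": 66.60, "M": 66.20, "L": 67.80, "XL": 68.00,
                      "Original": 68.00, "W8A8, int8": 56.10},
}


def mean_score(model: str) -> float:
    """Unweighted mean of the four benchmark scores for one model."""
    return sum(row[model] for row in SCORES.values()) / len(SCORES)


def mean_drop(model: str) -> float:
    """Average drop, in percentage points, relative to the original model."""
    return mean_score("Original") - mean_score(model)


for model in ("S", "M", "L", "XL", "W8A8, int8"):
    print(f"{model:>10}: mean={mean_score(model):.2f}, "
          f"drop={mean_drop(model):.2f} points")
```

Under this reading, the S tier loses about 1.3 points on average, while the plain W8A8 baseline loses about 15.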
### Latency benchmarks

__100 input / 300 output tokens; throughput in tok/s:__

| GPU/Model | S | M | L | XL | Original | W8A8, int8 |
|-----------|-----|---|---|----|----------|------------|
| H100 | 194 | 191 | 161 | 131 | 58 | 198 |
| L40S | 72 | 70 | 56 | 44 | 40 | 74 |

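The throughput figures can be turned into speedups and rough generation times, as in the sketch below. It assumes the tabulated rate is sustained over the whole output and ignores prompt processing, since the table does not say whether prefill time is included; `TOKENS_PER_SEC`, `speedup`, and `generation_seconds` are illustrative names.

```python
# Throughput in tokens per second, copied from the latency table above.
TOKENS_PER_SEC = {
    "H100": {"S": 194, "M": 191, "L": 161, "XL": 131,
             "Original": 58, "W8A8, int8": 198},
    "L40S": {"S": 72, "M": 70, "L": 56, "XL": 44,
             "Original": 40, "W8A8, int8": 74},
}


def speedup(gpu: str, model: str) -> float:
    """Throughput of `model` relative to the original model on the same GPU."""
    row = TOKENS_PER_SEC[gpu]
    return row[model] / row["Original"]


def generation_seconds(gpu: str, model: str, output_tokens: int = 300) -> float:
    """Rough time to emit `output_tokens` at the tabulated rate."""
    return output_tokens / TOKENS_PER_SEC[gpu][model]


for gpu in TOKENS_PER_SEC:
    print(f"{gpu}: S is {speedup(gpu, 'S'):.2f}x the original, "
          f"~{generation_seconds(gpu, 'S'):.1f} s for 300 tokens")
```

On these numbers, the S tier is roughly 3.3x faster than the original model on H100 and 1.8x faster on L40S.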
## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
<!-- * __Elastic models Github__: [app.thestage.ai](app.thestage.ai) -->
* __Contact email__: contact@thestage.ai