---
base_model:
  - black-forest-labs/FLUX.1-schnell
---

# Elastic model: Fastest self-serving models. FLUX.1-schnell.

Elastic models are models produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:
* __S__: The fastest model, with accuracy degradation of less than 2%.

__Goals of Elastic Models:__

* Provide the fastest models and service for self-hosting.
* Provide flexibility in cost vs. quality selection for inference.
* Provide clear quality and latency benchmarks.
* Provide the interface of HF libraries (transformers and diffusers) with a single line of code.
* Provide models supported on a wide range of hardware, pre-compiled and requiring no JIT compilation.

> Note that the specific quality degradation can vary from model to model. With an S model, for instance, you can see as little as 0.5% degradation.

-----

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/6487003ecd55eec571d14f96/ouz3FYQzG8C7Fl3XpNe6t.jpeg)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6487003ecd55eec571d14f96/l8xFGy0p5rxsn1-UojolK.png)

## Inference

To run an elastic model, replace the `diffusers` import with `elastic_models.diffusers` and pass the desired `mode`:

```python
import torch
from elastic_models.diffusers import FluxPipeline

mode_name = "black-forest-labs/FLUX.1-schnell"
hf_token = "<YOUR_HF_TOKEN>"  # Hugging Face access token
device = torch.device("cuda")

pipeline = FluxPipeline.from_pretrained(
    mode_name,
    torch_dtype=torch.bfloat16,
    token=hf_token,
    mode='S'
)
pipeline.to(device)

prompts = ["Kitten eating a banana"]
output = pipeline(prompt=prompts)

for prompt, output_image in zip(prompts, output.images):
    output_image.save(f"{prompt}.png")
```
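A quick way to check the speedup on your own hardware is to time the pipeline call. The `time_call` helper below is an illustrative sketch, not part of `elastic_models`:

```python
import time

def time_call(fn, *args, warmup=1, iters=3, **kwargs):
    """Call fn several times; return (best seconds over iters, last result)."""
    result = None
    for _ in range(warmup):  # warm-up runs absorb one-time costs (caches, CUDA init)
        result = fn(*args, **kwargs)
    best = float("inf")
    for _ in range(iters):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result
```

For example, `time_call(pipeline, prompt=prompts)` on an H100 should land near the S-mode latency reported in the benchmarks below.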
 
### Installation

__System requirements:__

* GPUs: H100
* CPU: AMD, Intel
* Python: 3.10-3.12

To work with our models, just run these lines in your terminal:

```shell
pip install thestage
pip install elastic_models
pip install flash_attn==2.7.3 --no-build-isolation
# apex can cause problems in text encoders
pip uninstall apex
echo '{
  "meta-llama/Llama-3.2-1B-Instruct": 6,
  "mistralai/Mistral-7B-Instruct-v0.3": 7,
  "black-forest-labs/FLUX.1-schnell": 1,
  "black-forest-labs/FLUX.1-dev": 5
}' > model_name_id.json
export ELASTIC_MODEL_ID_MAPPING=./model_name_id.json
```
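The JSON written above maps Hugging Face model names to elastic-model IDs, and the file is located through the `ELASTIC_MODEL_ID_MAPPING` environment variable. As an illustration only (the `lookup_model_id` helper below is hypothetical, not an `elastic_models` API), the mapping can be read like this:

```python
import json
import os

def lookup_model_id(model_name: str) -> int:
    """Read the mapping file named by ELASTIC_MODEL_ID_MAPPING and return the ID."""
    path = os.environ.get("ELASTIC_MODEL_ID_MAPPING", "./model_name_id.json")
    with open(path) as f:
        mapping = json.load(f)
    return mapping[model_name]
```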
Then go to [app.thestage.ai](https://app.thestage.ai), log in and generate an API token from your profile page. Set the API token as follows:

```shell
thestage config set --api-token <YOUR_API_TOKEN>
```

Congrats, now you can use accelerated models!

## Benchmarks

Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for models using our algorithms.

### Quality benchmarks

For quality evaluation we used PSNR, SSIM and the CLIP score. PSNR and SSIM were computed against the outputs of the original model.

| Metric/Model | S    | M    | L    | XL   | Original |
|--------------|------|------|------|------|----------|
| PSNR         | 29.9 | 30.2 | 31   | inf  | inf      |
| SSIM         | 0.66 | 0.71 | 0.86 | 1.0  | 1.0      |
| CLIP         | 11.5 | 11.6 | 11.8 | 11.9 | 11.9     |
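The PSNR values above are the standard peak signal-to-noise ratio against the original model's output ("inf" means the outputs are identical). A minimal sketch of the definition over flat 8-bit pixel sequences (illustrative, not the evaluation code used here):

```python
import math

def psnr(img_a, img_b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0.0:
        return math.inf  # identical images, as in the XL and Original columns
    return 10.0 * math.log10(peak ** 2 / mse)
```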
 
 
 
 
 
 
 
 
### Latency benchmarks

Time in seconds to generate one 1024x1024 image:

| GPU/Model | S   | M    | L    | XL   | Original |
|-----------|-----|------|------|------|----------|
| H100      | 0.5 | 0.58 | 0.65 | 0.75 | 1.05     |
| L40s      | 1.4 | 1.6  | 1.9  | 2.1  | 2.5      |
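A practical use of this table is picking the highest-quality mode that fits a latency budget. The helper below is an illustrative sketch with the numbers copied from the table, not an `elastic_models` API:

```python
# Seconds per 1024x1024 image, copied from the latency table.
LATENCY = {
    "H100": {"S": 0.5, "M": 0.58, "L": 0.65, "XL": 0.75, "Original": 1.05},
    "L40s": {"S": 1.4, "M": 1.6, "L": 1.9, "XL": 2.1, "Original": 2.5},
}
QUALITY_ORDER = ["XL", "L", "M", "S"]  # highest quality first

def pick_mode(gpu: str, budget_s: float):
    """Return the highest-quality elastic mode within the latency budget, else None."""
    for mode in QUALITY_ORDER:
        if LATENCY[gpu][mode] <= budget_s:
            return mode
    return None
```

For example, with a 0.7 s budget on an H100 this selects the L model, while on an L40s no elastic mode meets a 1.0 s budget.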
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
## Links

* __Elastic models Github__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai