Update README.md
Browse files
README.md
CHANGED
|
@@ -1,12 +1,12 @@
|
|
| 1 |
---
|
| 2 |
-
license:
|
| 3 |
base_model:
|
| 4 |
-
- black-forest-labs/FLUX.1-
|
| 5 |
base_model_relation: quantized
|
| 6 |
pipeline_tag: text-to-image
|
| 7 |
---
|
| 8 |
|
| 9 |
-
# Elastic model: FLUX.1-
|
| 10 |
|
| 11 |
|
| 12 |
## Overview
|
|
@@ -66,13 +66,13 @@ pip install 'thestage-elastic-models[nvidia]' \
|
|
| 66 |
---
|
| 67 |
|
| 68 |
|
| 69 |
-
Elastic Models provides the same interface as HuggingFace Diffusers. Here is an example of how to use the FLUX.1-
|
| 70 |
|
| 71 |
```python
|
| 72 |
import torch
|
| 73 |
from elastic_models.diffusers import FluxPipeline
|
| 74 |
|
| 75 |
-
mode_name = 'black-forest-labs/FLUX.1-
|
| 76 |
hf_token = ''
|
| 77 |
device = torch.device("cuda")
|
| 78 |
|
|
@@ -99,9 +99,9 @@ for prompt, output_image in zip(prompts, output.images):
|
|
| 99 |
---
|
| 100 |
|
| 101 |
|
| 102 |
-
We have used PartiPrompts and DrawBench datasets to evaluate the quality of images generated by different sizes of FLUX.1-
|
| 103 |
|
| 104 |
-
** |
|
| 113 |
-
| **ARNIQA (DrawBench)** |
|
| 114 |
-
| **CLIP IQA (PartiPrompts)** |
|
| 115 |
-
| **CLIP IQA (DrawBench)** |
|
| 116 |
-
| **VQA Faithfulness (PartiPrompts)** | 87
|
| 117 |
-
| **VQA Faithfulness (DrawBench)** |
|
| 118 |
-
| **PSNR (PartiPrompts)** |
|
| 119 |
-
| **SSIM (PartiPrompts)** | 0.
|
| 120 |
|
| 121 |
|
| 122 |
## Datasets
|
|
@@ -145,9 +145,9 @@ We have used PartiPrompts and DrawBench datasets to evaluate the quality of imag
|
|
| 145 |
---
|
| 146 |
|
| 147 |
|
| 148 |
-
We have measured the latency of different sizes of FLUX.1-
|
| 149 |
|
| 150 |
-
 for generating a 1024x1024 image using different model size
|
|
| 157 |
|
| 158 |
| **GPU/Model Size**| **S**| **M**| **L**| **XL**| **Original** |
|
| 159 |
| --- | --- | --- | --- | --- | --- |
|
| 160 |
-
| **H100** |
|
| 161 |
-
| **L40s** |
|
| 162 |
-
| **B200** |
|
| 163 |
-
| **GeForce RTX 5090** |
|
| 164 |
|
| 165 |
|
| 166 |
## Benchmarking Methodology
|
|
@@ -171,7 +171,7 @@ Latency (in seconds) for generating a 1024x1024 image using different model size
|
|
| 171 |
The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.
|
| 172 |
|
| 173 |
> **Algorithm summary:**
|
| 174 |
-
> 1. Load the FLUX.1-
|
| 175 |
> 2. Move the model to the GPU.
|
| 176 |
> 3. Prepare a sample prompt for image generation.
|
| 177 |
> 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
|
|
@@ -191,7 +191,7 @@ The benchmarking was performed on a single GPU with a batch size of 1. Each mode
|
|
| 191 |
import torch
|
| 192 |
from elastic_models.diffusers import FluxPipeline
|
| 193 |
|
| 194 |
-
mode_name = 'black-forest-labs/FLUX.1-
|
| 195 |
hf_token = ''
|
| 196 |
device = torch.device("cuda")
|
| 197 |
|
|
@@ -209,7 +209,7 @@ prompt = ["Kitten eating a banana"]
|
|
| 209 |
generate_kwargs={
|
| 210 |
"height": 1024,
|
| 211 |
"width": 1024,
|
| 212 |
-
"num_inference_steps":
|
| 213 |
"cfg_scale": 0.0
|
| 214 |
}
|
| 215 |
|
|
@@ -244,6 +244,69 @@ print(f"Average Latency over {num_runs} runs: {average_latency} seconds")
|
|
| 244 |
```
|
| 245 |
|
| 246 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 247 |
## Serving with Docker Image
|
| 248 |
|
| 249 |
---
|
|
@@ -267,7 +330,7 @@ docker run --rm -ti \
|
|
| 267 |
--name serving_thestage_model \
|
| 268 |
-p 8000:80 \
|
| 269 |
-e AUTH_TOKEN=<AUTH_TOKEN> \
|
| 270 |
-
-e MODEL_REPO=black-forest-labs/FLUX.1-
|
| 271 |
-e MODEL_SIZE=<MODEL_SIZE> \
|
| 272 |
-e MODEL_BATCH=<MAX_BATCH_SIZE> \
|
| 273 |
-e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
|
|
@@ -295,13 +358,13 @@ You can invoke the endpoint using CURL as follows:
|
|
| 295 |
curl -X POST <http://127.0.0.1:8000/v1/images/generations> \
|
| 296 |
-H "Authorization: Bearer <AUTH_TOKEN>" \
|
| 297 |
-H "Content-Type: application/json" \
|
| 298 |
-
-H "X-Model-Name: flux-1-
|
| 299 |
-d '{
|
| 300 |
"prompt": "Cat eating banana",
|
| 301 |
"seed": 12,
|
| 302 |
"aspect_ratio": "1:1",
|
| 303 |
"guidance_scale": 6.5,
|
| 304 |
-
"num_inference_steps":
|
| 305 |
}' \
|
| 306 |
--output sunset.webp -D -
|
| 307 |
```
|
|
@@ -317,12 +380,12 @@ payload = json.dumps({
|
|
| 317 |
"seed": 12,
|
| 318 |
"aspect_ratio": "1:1",
|
| 319 |
"guidance_scale": 6.5,
|
| 320 |
-
"num_inference_steps":
|
| 321 |
})
|
| 322 |
headers = {
|
| 323 |
'Authorization': 'Bearer <AUTH_TOKEN>',
|
| 324 |
'Content-Type': 'application/json',
|
| 325 |
-
'X-Model-Name': 'flux-1-
|
| 326 |
}
|
| 327 |
response = requests.request("POST", url, headers=headers, data=payload)
|
| 328 |
with open("sunset.webp", "wb") as f:
|
|
@@ -337,7 +400,7 @@ from openai import OpenAI
|
|
| 337 |
|
| 338 |
BASE_URL = "http://<your_ip>/v1"
|
| 339 |
API_KEY = ""
|
| 340 |
-
MODEL = "flux-1-
|
| 341 |
|
| 342 |
client = OpenAI(
|
| 343 |
api_key=API_KEY,
|
|
@@ -353,7 +416,7 @@ response = client.with_raw_response.images.generate(
|
|
| 353 |
"seed": 111,
|
| 354 |
"aspect_ratio": "1:1",
|
| 355 |
"guidance_scale": 3.5,
|
| 356 |
-
"num_inference_steps":
|
| 357 |
},
|
| 358 |
)
|
| 359 |
|
|
@@ -386,7 +449,7 @@ with open("thestage_image.webp", "wb") as f:
|
|
| 386 |
|
| 387 |
> `X-Model-Name`: `string`
|
| 388 |
>
|
| 389 |
-
> Specifies the model to use for generation. Format: `flux-1-
|
| 390 |
|
| 391 |
### Input Body
|
| 392 |
|
|
@@ -402,7 +465,7 @@ with open("thestage_image.webp", "wb") as f:
|
|
| 402 |
|
| 403 |
> `num_inference_steps`: `int32`
|
| 404 |
>
|
| 405 |
-
> Number of diffusion steps to use for generation. Higher values yield better quality but take longer. Default is
|
| 406 |
|
| 407 |
> `aspect_ratio`: `string`
|
| 408 |
>
|
|
@@ -451,7 +514,7 @@ Set your environment variables in `modal_serving.py`:
|
|
| 451 |
# modal_serving.py
|
| 452 |
|
| 453 |
ENVS = {
|
| 454 |
-
"MODEL_REPO": "black-forest-labs/FLUX.1-
|
| 455 |
"MODEL_BATCH": "4",
|
| 456 |
"THESTAGE_AUTH_TOKEN": "",
|
| 457 |
"HUGGINGFACE_ACCESS_TOKEN": "",
|
|
@@ -482,7 +545,7 @@ Set your desired GPU type and autoscaling variables in `modal_serving.py`:
|
|
| 482 |
)
|
| 483 |
@modal.web_server(
|
| 484 |
80,
|
| 485 |
-
label="black-forest-labs/FLUX.1-
|
| 486 |
startup_timeout=60*20
|
| 487 |
)
|
| 488 |
def serve():
|
|
|
|
| 1 |
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
base_model:
|
| 4 |
+
- black-forest-labs/FLUX.1-schnell
|
| 5 |
base_model_relation: quantized
|
| 6 |
pipeline_tag: text-to-image
|
| 7 |
---
|
| 8 |
|
| 9 |
+
# Elastic model: FLUX.1-schnell
|
| 10 |
|
| 11 |
|
| 12 |
## Overview
|
|
|
|
| 66 |
---
|
| 67 |
|
| 68 |
|
| 69 |
+
Elastic Models provides the same interface as HuggingFace Diffusers. Here is an example of how to use the FLUX.1-schnell model:
|
| 70 |
|
| 71 |
```python
|
| 72 |
import torch
|
| 73 |
from elastic_models.diffusers import FluxPipeline
|
| 74 |
|
| 75 |
+
mode_name = 'black-forest-labs/FLUX.1-schnell'
|
| 76 |
hf_token = ''
|
| 77 |
device = torch.device("cuda")
|
| 78 |
|
|
|
|
| 99 |
---
|
| 100 |
|
| 101 |
|
| 102 |
+
We have used PartiPrompts and DrawBench datasets to evaluate the quality of images generated by different sizes of FLUX.1-schnell models (S, M, L, XL) compared to the original model. The evaluation metrics include ARNIQA, CLIP IQA, PSNR, SSIM, and VQA Faithfulness.
|
| 103 |
|
| 104 |
+

|
| 105 |
|
| 106 |
### Quality Benchmark Results
|
| 107 |
|
|
|
|
| 109 |
|
| 110 |
| **Metric/Model Size**| **S**| **M**| **L**| **XL**| **Original** |
|
| 111 |
| --- | --- | --- | --- | --- | --- |
|
| 112 |
+
| **ARNIQA (PartiPrompts)** | 62.8 | 63.2 | 64.3 | 65.2 | 65.2 |
|
| 113 |
+
| **ARNIQA (DrawBench)** | 61.4 | 62.5 | 63.9 | 64 | 64 |
|
| 114 |
+
| **CLIP IQA (PartiPrompts)** | 83.6 | 84.1 | 84.9 | 85.7 | 85.7 |
|
| 115 |
+
| **CLIP IQA (DrawBench)** | 82.7 | 84 | 84.4 | 84.5 | 84.5 |
|
| 116 |
+
| **VQA Faithfulness (PartiPrompts)** | 87 | 86 | 86.2 | 85.7 | 85.7 |
|
| 117 |
+
| **VQA Faithfulness (DrawBench)** | 73.8 | 72.7 | 74.4 | 74.3 | 74.3 |
|
| 118 |
+
| **PSNR (PartiPrompts)** | 29.9 | 30.2 | 31 | N/A | N/A |
|
| 119 |
+
| **SSIM (PartiPrompts)** | 0.66 | 0.71 | 0.86 | 1.0 | 1.0 |
|
| 120 |
|
| 121 |
|
| 122 |
## Datasets
|
|
|
|
| 145 |
---
|
| 146 |
|
| 147 |
|
| 148 |
+
We have measured the latency of different sizes of the FLUX.1-schnell model (S, M, L, XL, original) on various GPUs. The measurements were taken for generating images of size 1024x1024 pixels.
|
| 149 |
|
| 150 |
+

|
| 151 |
|
| 152 |
### Latency Benchmark Results
|
| 153 |
|
|
|
|
| 157 |
|
| 158 |
| **GPU/Model Size**| **S**| **M**| **L**| **XL**| **Original** |
|
| 159 |
| --- | --- | --- | --- | --- | --- |
|
| 160 |
+
| **H100** | 0.51 | 0.51 | 0.51 | 0.71 | 1.04 |
|
| 161 |
+
| **L40s** | 1.59 | 1.6 | 1.6 | 2.19 | 2.5 |
|
| 162 |
+
| **B200** | 0.38 | 0.38 | 0.38 | 0.39 | 0.75 |
|
| 163 |
+
| **GeForce RTX 5090** | 1.19 | N/A | N/A | N/A | N/A |
|
| 164 |
|
| 165 |
|
| 166 |
## Benchmarking Methodology
|
|
|
|
| 171 |
The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.
|
| 172 |
|
| 173 |
> **Algorithm summary:**
|
| 174 |
+
> 1. Load the FLUX.1-schnell model with the specified size (S, M, L, XL, original).
|
| 175 |
> 2. Move the model to the GPU.
|
| 176 |
> 3. Prepare a sample prompt for image generation.
|
| 177 |
> 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
|
|
|
|
| 191 |
import torch
|
| 192 |
from elastic_models.diffusers import FluxPipeline
|
| 193 |
|
| 194 |
+
mode_name = 'black-forest-labs/FLUX.1-schnell'
|
| 195 |
hf_token = ''
|
| 196 |
device = torch.device("cuda")
|
| 197 |
|
|
|
|
| 209 |
generate_kwargs={
|
| 210 |
"height": 1024,
|
| 211 |
"width": 1024,
|
| 212 |
+
"num_inference_steps": 4,
|
| 213 |
"cfg_scale": 0.0
|
| 214 |
}
|
| 215 |
|
|
|
|
| 244 |
```
|
| 245 |
|
| 246 |
|
| 247 |
+
## LoRA Support
|
| 248 |
+
|
| 249 |
+
---
|
| 250 |
+
|
| 251 |
+
Elastic FLUX.1-schnell engines support **runtime LoRA hot-swap** — load, switch, or disable LoRA files without recompilation or engine reload. LoRA weights are dynamic tensor inputs to the compiled engine.
|
| 252 |
+
|
| 253 |
+
- **Supported ranks**: 1–256 (compiled with dynamic rank)
|
| 254 |
+
- **Supported formats**: XLabs, diffusers, BFL Control (auto-detected)
|
| 255 |
+
- **Hot-swap**: switch LoRA instantly by calling `load_lora_weights()`
|
| 256 |
+
- **Disable**: `unload_lora_weights()` removes LoRA with minimal overhead
|
| 257 |
+
|
| 258 |
+
> LoRA adds ~5-15% latency overhead. LoRA files must be downloaded locally before use (e.g. via `huggingface-cli download`).
|
| 259 |
+
|
| 260 |
+
### Usage with LoRA
|
| 261 |
+
|
| 262 |
+
---
|
| 263 |
+
|
| 264 |
+
```python
|
| 265 |
+
import torch
|
| 266 |
+
from elastic_models.diffusers import FluxPipeline
|
| 267 |
+
|
| 268 |
+
model_name = "black-forest-labs/FLUX.1-schnell"
|
| 269 |
+
device = torch.device("cuda")
|
| 270 |
+
|
| 271 |
+
pipeline = FluxPipeline.from_pretrained(
|
| 272 |
+
model_name,
|
| 273 |
+
torch_dtype=torch.bfloat16,
|
| 274 |
+
mode="S",
|
| 275 |
+
lora_support=True,
|
| 276 |
+
)
|
| 277 |
+
pipeline.to(device)
|
| 278 |
+
|
| 279 |
+
# Load a LoRA and generate
|
| 280 |
+
pipeline.load_lora_weights("./loras/realism_lora.safetensors", strength=1.0)
|
| 281 |
+
output = pipeline(prompt=["A portrait photo of a woman in golden hour light"])
|
| 282 |
+
output.images[0].save("realism_lora.png")
|
| 283 |
+
|
| 284 |
+
# Hot-swap to a different LoRA (no engine reload)
|
| 285 |
+
pipeline.load_lora_weights("./loras/anime_lora.safetensors", strength=1.0)
|
| 286 |
+
output = pipeline(prompt=["Anime girl with blue hair in a garden"])
|
| 287 |
+
output.images[0].save("anime_lora.png")
|
| 288 |
+
|
| 289 |
+
# Disable LoRA
|
| 290 |
+
pipeline.unload_lora_weights()
|
| 291 |
+
output = pipeline(prompt=["A castle on a hill at sunset"])
|
| 292 |
+
output.images[0].save("no_lora.png")
|
| 293 |
+
```
|
| 294 |
+
|
| 295 |
+
### LoRA Latency Benchmarks
|
| 296 |
+
|
| 297 |
+
---
|
| 298 |
+
|
| 299 |
+
Time in seconds to generate one 1024x1024 image (average over 3 LoRAs — ranks 32, 32, and 256).
|
| 300 |
+
|
| 301 |
+
| **GPU/Model Size**| **S**| **M**| **L**| **XL**| **Original (unfused)** |
|
| 302 |
+
| --- | --- | --- | --- | --- | --- |
|
| 303 |
+
| **H100** | 0.71 | 0.71 | 0.71 | 0.87 | 1.24 |
|
| 304 |
+
| **L40s** | 1.9 | 1.9 | 1.9 | 2.4 | 2.93 |
|
| 305 |
+
| **B200** | 0.59 | 0.59 | 0.59 | 0.53 | 0.89 |
|
| 306 |
+
| **GeForce RTX 5090** | 1.46 | N/A | N/A | N/A | N/A |
|
| 307 |
+
|
| 308 |
+
|
| 309 |
+
|
| 310 |
## Serving with Docker Image
|
| 311 |
|
| 312 |
---
|
|
|
|
| 330 |
--name serving_thestage_model \
|
| 331 |
-p 8000:80 \
|
| 332 |
-e AUTH_TOKEN=<AUTH_TOKEN> \
|
| 333 |
+
-e MODEL_REPO=black-forest-labs/FLUX.1-schnell \
|
| 334 |
-e MODEL_SIZE=<MODEL_SIZE> \
|
| 335 |
-e MODEL_BATCH=<MAX_BATCH_SIZE> \
|
| 336 |
-e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
|
|
|
|
| 358 |
curl -X POST <http://127.0.0.1:8000/v1/images/generations> \
|
| 359 |
-H "Authorization: Bearer <AUTH_TOKEN>" \
|
| 360 |
-H "Content-Type: application/json" \
|
| 361 |
+
-H "X-Model-Name: flux-1-schnell-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>" \
|
| 362 |
-d '{
|
| 363 |
"prompt": "Cat eating banana",
|
| 364 |
"seed": 12,
|
| 365 |
"aspect_ratio": "1:1",
|
| 366 |
"guidance_scale": 6.5,
|
| 367 |
+
"num_inference_steps": 4
|
| 368 |
}' \
|
| 369 |
--output sunset.webp -D -
|
| 370 |
```
|
|
|
|
| 380 |
"seed": 12,
|
| 381 |
"aspect_ratio": "1:1",
|
| 382 |
"guidance_scale": 6.5,
|
| 383 |
+
"num_inference_steps": 4
|
| 384 |
})
|
| 385 |
headers = {
|
| 386 |
'Authorization': 'Bearer <AUTH_TOKEN>',
|
| 387 |
'Content-Type': 'application/json',
|
| 388 |
+
'X-Model-Name': 'flux-1-schnell-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>'
|
| 389 |
}
|
| 390 |
response = requests.request("POST", url, headers=headers, data=payload)
|
| 391 |
with open("sunset.webp", "wb") as f:
|
|
|
|
| 400 |
|
| 401 |
BASE_URL = "http://<your_ip>/v1"
|
| 402 |
API_KEY = ""
|
| 403 |
+
MODEL = "flux-1-schnell-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>"
|
| 404 |
|
| 405 |
client = OpenAI(
|
| 406 |
api_key=API_KEY,
|
|
|
|
| 416 |
"seed": 111,
|
| 417 |
"aspect_ratio": "1:1",
|
| 418 |
"guidance_scale": 3.5,
|
| 419 |
+
"num_inference_steps": 4
|
| 420 |
},
|
| 421 |
)
|
| 422 |
|
|
|
|
| 449 |
|
| 450 |
> `X-Model-Name`: `string`
|
| 451 |
>
|
| 452 |
+
> Specifies the model to use for generation. Format: `flux-1-schnell-<size>-bs<batch_size>`, where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup.
|
| 453 |
|
| 454 |
### Input Body
|
| 455 |
|
|
|
|
| 465 |
|
| 466 |
> `num_inference_steps`: `int32`
|
| 467 |
>
|
| 468 |
+
> Number of diffusion steps to use for generation. Higher values yield better quality but take longer. Default is 4.
|
| 469 |
|
| 470 |
> `aspect_ratio`: `string`
|
| 471 |
>
|
|
|
|
| 514 |
# modal_serving.py
|
| 515 |
|
| 516 |
ENVS = {
|
| 517 |
+
"MODEL_REPO": "black-forest-labs/FLUX.1-schnell",
|
| 518 |
"MODEL_BATCH": "4",
|
| 519 |
"THESTAGE_AUTH_TOKEN": "",
|
| 520 |
"HUGGINGFACE_ACCESS_TOKEN": "",
|
|
|
|
| 545 |
)
|
| 546 |
@modal.web_server(
|
| 547 |
80,
|
| 548 |
+
label="black-forest-labs/FLUX.1-schnell-test",
|
| 549 |
startup_timeout=60*20
|
| 550 |
)
|
| 551 |
def serve():
|