Instructions to use nvidia/Cosmos3-Super-Text2Image with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Cosmos
How to use nvidia/Cosmos3-Super-Text2Image with Cosmos:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Diffusers
How to use nvidia/Cosmos3-Super-Text2Image with Diffusers:
pip install -U diffusers transformers accelerate
import torch from diffusers import DiffusionPipeline # switch to "mps" for apple devices pipe = DiffusionPipeline.from_pretrained("nvidia/Cosmos3-Super-Text2Image", dtype=torch.bfloat16, device_map="cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" image = pipe(prompt).images[0] - Notebooks
- Google Colab
- Kaggle
File size: 4,803 Bytes
fdafd05 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 | # Agentic Prompt Upsampling
This repository includes a standalone text-to-image agentic prompt upsampler for Cosmos3-Super-Text2Image.
The loop:
1. Upsamples the user prompt into a structured Cosmos3 T2I JSON prompt.
2. Generates an image through a vLLM-Omni `/v1/images/generations` endpoint.
3. Scores the image with a VLM critic.
4. Rewrites both the positive JSON prompt and generator-side negative prompt from the critic feedback.
5. Repeats up to the configured iteration limit and returns the best scored image.
## Install
From the repository root:
```bash
python -m pip install requests pillow
```
Recommended vLLM-Omni serving configuration for `nvidia/Cosmos3-Super-Text2Image` on 4xH200 is:
```bash
vllm serve nvidia/Cosmos3-Super-Text2Image \
--omni \
--cfg-parallel-size 2 \
--ulysses-degree 2 \
--tensor-parallel-size 1
```
With the no-offload configuration above, 1024x1024 image generation with 50 steps is expected to take roughly 5 seconds server-side per request.
## Default Models
The default prompt upsampler and rewriter are OpenAI GPT-5.5 through the public OpenAI chat completions API:
```text
endpoint: https://api.openai.com/v1
model: gpt-5.5
extra body: {"reasoning_effort": "low"}
env var: OPENAI_API_KEY
```
The default critic is Gemini 3.1 Pro Preview through Google's OpenAI-compatible chat completions endpoint:
```text
endpoint: https://generativelanguage.googleapis.com/v1beta/openai/
model: gemini-3.1-pro-preview
env var: GEMINI_API_KEY
```
Set credentials:
```bash
export OPENAI_API_KEY=...
export GEMINI_API_KEY=...
```
If your vLLM-Omni generation endpoint requires auth:
```bash
export AGENTIC_UPSAMPLING_GENERATION_AUTH_KEY=...
```
## Run One Prompt
```bash
python -m agentic_upsampling.run \
--prompt "a cinematic photo of a glass greenhouse at sunrise" \
--output-dir outputs/agentic_greenhouse \
--generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT
```
The generation call is a standard vLLM-Omni image request:
```text
POST /v1/images/generations
model: nvidia/Cosmos3-Super-Text2Image
size: 1024x1024
response_format: b64_json
num_inference_steps: 50
guidance_scale: 4.0
flow_shift: 3.0
negative_prompt: ""
extra_args: {"guardrails": false, "use_resolution_template": false}
```
## Run A Batch
Text file, one prompt per non-empty line:
```bash
python -m agentic_upsampling.run \
--prompts prompts.txt \
--output-dir outputs/agentic_batch \
--generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT
```
JSONL rows can be strings or objects with `prompt` and optional `id`:
```json
{"id": "greenhouse", "prompt": "a glass greenhouse at sunrise"}
{"id": "city", "prompt": "a clean futuristic city plaza after rain"}
```
CSV files must include a `prompt` or `Prompt` column and may include an `id` column.
## Useful Options
```bash
python -m agentic_upsampling.run \
--prompt "a precise product photo of a transparent mechanical keyboard" \
--output-dir outputs/keyboard \
--generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT \
--max-iterations 2 \
--samples-per-iteration 3 \
--seed-base 42 \
--size 1024x1024 \
--guidance 4.0 \
--flow-shift 3.0
```
- `--max-iterations` controls total prompt stages. The default is `2`, meaning the initial upsample plus up to two rewrites.
- `--samples-per-iteration` runs a best-of-N seed search for each prompt stage. Generation requests for those seeds are submitted concurrently within the iteration.
- `--seed-base` makes seeds deterministic. Sample seeds are `seed_base + sample_index`.
- `--size` is the vLLM-Omni image size in `WIDTHxHEIGHT` format.
- `--guidance` sets `guidance_scale`; the default is `4.0`.
- `--flow-shift` sets `flow_shift`; the default is `3.0`.
- `--generation-extra-args` overrides the default vLLM-Omni generation `extra_args` JSON object.
- Early stopping is enabled by default when the critic score clears the strict threshold. Use `--disable-early-stop` to always run every iteration.
- Reruns resume from completed artifacts by default. Use `--overwrite` to regenerate them.
## Output Layout
```text
output_dir/
run_config.json
summary.json
manifest.jsonl
failures.jsonl
0001/
best.json
iter_00/
prompt.json
negative_prompt.json
image.jpg
generation_meta.json
analysis.json
samples.json
meta.json
iter_01/
...
```
For `--samples-per-iteration N`, each iteration contains `sample_00/`, `sample_01/`, and so on.
## Export Best Images
Copy the selected best image for every completed prompt into one folder:
```bash
python -m agentic_upsampling.extract_best \
--output-dir outputs/agentic_batch \
--export-dir outputs/agentic_batch_best \
--overwrite
```
The exporter writes:
```text
best_generations.jsonl
best_generations.csv
images/
```
|