Agentic Prompt Upsampling

This repository includes a standalone text-to-image agentic prompt upsampler for Cosmos3-Super-Text2Image.

The loop:

  1. Upsamples the user prompt into a structured Cosmos3 T2I JSON prompt.
  2. Generates an image through a vLLM-Omni /v1/images/generations endpoint.
  3. Scores the image with a VLM critic.
  4. Rewrites both the positive JSON prompt and generator-side negative prompt from the critic feedback.
  5. Repeats up to the configured iteration limit and returns the best scored image.
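
The five steps above can be sketched as a small driver function. This is a minimal illustration, not the repository's implementation: the `Attempt` record, the four callables (`upsample`, `generate`, `critique`, `rewrite`), and the `max_stages` and `stop_score` parameters are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    """One generate-and-score round (hypothetical record type)."""
    iteration: int
    prompt_json: dict
    negative_prompt: str
    image: bytes
    score: float
    feedback: str


def agentic_loop(
    user_prompt: str,
    upsample: Callable[[str], dict],
    generate: Callable[[dict, str], bytes],
    critique: Callable[[str, bytes], Tuple[float, str]],
    rewrite: Callable[[dict, str, str], Tuple[dict, str]],
    max_stages: int = 3,
    stop_score: Optional[float] = None,
) -> Attempt:
    # Step 1: turn the user prompt into a structured JSON prompt.
    prompt_json = upsample(user_prompt)
    negative_prompt = ""
    best: Optional[Attempt] = None
    for iteration in range(max_stages):
        # Step 2: generate an image; step 3: score it with the critic.
        image = generate(prompt_json, negative_prompt)
        score, feedback = critique(user_prompt, image)
        attempt = Attempt(iteration, prompt_json, negative_prompt, image, score, feedback)
        if best is None or score > best.score:
            best = attempt
        # Optional early stop once the score clears an assumed threshold.
        if stop_score is not None and score >= stop_score:
            break
        # Step 4: rewrite both prompts from the critic feedback.
        if iteration + 1 < max_stages:
            prompt_json, negative_prompt = rewrite(prompt_json, negative_prompt, feedback)
    # Step 5: return the best-scored attempt, not necessarily the last one.
    return best
```

Keeping the best attempt separately from the latest one matters because a rewrite can lower the critic score.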

Install

From the repository root:

python -m pip install requests pillow

The recommended vLLM-Omni serving configuration for nvidia/Cosmos3-Super-Text2Image on 4xH200 is:

vllm serve nvidia/Cosmos3-Super-Text2Image \
  --omni \
  --cfg-parallel-size 2 \
  --ulysses-degree 2 \
  --tensor-parallel-size 1

With the no-offload configuration above, 1024x1024 image generation with 50 steps is expected to take roughly 5 seconds server-side per request.
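
As a rough sanity check on the numbers above, the arithmetic can be written out. This sketch assumes the three parallel degrees multiply to the GPU count (2 x 2 x 1 = 4, matching 4xH200) and that requests are served one at a time. The helper name `sequential_generation_seconds` is hypothetical, and the estimate covers generation only, not upsampler or critic latency.

```python
# Parallel degrees taken from the serve command above.
cfg_parallel_size = 2
ulysses_degree = 2
tensor_parallel_size = 1

# Assumption: the degrees multiply to the number of GPUs used.
gpus_needed = cfg_parallel_size * ulysses_degree * tensor_parallel_size


def sequential_generation_seconds(stages, samples_per_stage, seconds_per_image=5.0):
    """Rough generation-only time if every request is served one after another."""
    return stages * samples_per_stage * seconds_per_image


print(gpus_needed)                          # 4
print(sequential_generation_seconds(3, 3))  # 45.0
```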

Default Models

The default prompt upsampler and rewriter are OpenAI GPT-5.5 through the public OpenAI chat completions API:

endpoint: https://api.openai.com/v1
model: gpt-5.5
extra body: {"reasoning_effort": "low"}
env var: OPENAI_API_KEY

The default critic is Gemini 3.1 Pro Preview through Google's OpenAI-compatible chat completions endpoint:

endpoint: https://generativelanguage.googleapis.com/v1beta/openai/
model: gemini-3.1-pro-preview
env var: GEMINI_API_KEY

Set credentials:

export OPENAI_API_KEY=...
export GEMINI_API_KEY=...
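
To illustrate how these defaults map onto OpenAI-compatible chat completions requests, here is a minimal sketch. The `UPSAMPLER` and `CRITIC` dictionaries and the `build_chat_request` helper are hypothetical names for this example. The sketch assumes Bearer-token authentication and that the extra body fields are merged into the top level of the request body.

```python
import os

# Defaults copied from the section above; the dictionary layout is illustrative.
UPSAMPLER = {
    "base_url": "https://api.openai.com/v1",
    "model": "gpt-5.5",
    "extra_body": {"reasoning_effort": "low"},
    "api_key_env": "OPENAI_API_KEY",
}
CRITIC = {
    "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
    "model": "gemini-3.1-pro-preview",
    "extra_body": {},
    "api_key_env": "GEMINI_API_KEY",
}


def build_chat_request(cfg, messages, env=None):
    """Return (url, headers, body) for a chat completions POST."""
    env = os.environ if env is None else env
    url = cfg["base_url"].rstrip("/") + "/chat/completions"
    headers = {
        # Assumption: both endpoints accept the key as a Bearer token.
        "Authorization": f"Bearer {env.get(cfg['api_key_env'], '')}",
        "Content-Type": "application/json",
    }
    # Extra body fields are merged into the top level of the request body.
    body = {"model": cfg["model"], "messages": messages, **cfg["extra_body"]}
    return url, headers, body
```

The returned triple can be passed to `requests.post(url, headers=headers, json=body)`.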

If your vLLM-Omni generation endpoint requires auth:

export AGENTIC_UPSAMPLING_GENERATION_AUTH_KEY=...

Run One Prompt

python -m agentic_upsampling.run \
  --prompt "a cinematic photo of a glass greenhouse at sunrise" \
  --output-dir outputs/agentic_greenhouse \
  --generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT

The generation call is a standard vLLM-Omni image request:

POST /v1/images/generations
model: nvidia/Cosmos3-Super-Text2Image
size: 1024x1024
response_format: b64_json
num_inference_steps: 50
guidance_scale: 4.0
flow_shift: 3.0
negative_prompt: ""
extra_args: {"guardrails": false, "use_resolution_template": false}
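
For reference, an equivalent request could be issued by hand roughly as follows. This is a sketch under stated assumptions, not the tool's own client code: the helper names are hypothetical, the `prompt` field and the `data[0].b64_json` response shape follow the OpenAI images API convention and are assumed here, and the Bearer header is assumed for endpoints that require auth.

```python
import base64

# Default extra_args from the request description above.
DEFAULT_EXTRA_ARGS = {"guardrails": False, "use_resolution_template": False}


def build_generation_payload(prompt, negative_prompt="", size="1024x1024",
                             steps=50, guidance_scale=4.0, flow_shift=3.0,
                             extra_args=None):
    """Build the JSON body for POST /v1/images/generations."""
    return {
        "model": "nvidia/Cosmos3-Super-Text2Image",
        "prompt": prompt,  # assumed field name (OpenAI images convention)
        "size": size,
        "response_format": "b64_json",
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "flow_shift": flow_shift,
        "negative_prompt": negative_prompt,
        "extra_args": dict(DEFAULT_EXTRA_ARGS if extra_args is None else extra_args),
    }


def decode_first_image(response_json):
    """Decode the first base64 image, assuming an OpenAI-style response body."""
    return base64.b64decode(response_json["data"][0]["b64_json"])


def generate(endpoint, payload, auth_key=None, timeout=300):
    """Send the request and return raw image bytes."""
    import requests  # installed in the Install step

    # Assumption: an auth-protected endpoint expects a Bearer token.
    headers = {"Authorization": f"Bearer {auth_key}"} if auth_key else {}
    url = endpoint.rstrip("/") + "/v1/images/generations"
    response = requests.post(url, json=payload, headers=headers, timeout=timeout)
    response.raise_for_status()
    return decode_first_image(response.json())
```

The bytes returned by `generate` can be written straight to a file or opened with Pillow via `Image.open(io.BytesIO(data))`.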

Run A Batch

Text file, one prompt per non-empty line:

python -m agentic_upsampling.run \
  --prompts prompts.txt \
  --output-dir outputs/agentic_batch \
  --generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT

JSONL file, where each row is either a string or an object with a prompt field and an optional id field:

{"id": "greenhouse", "prompt": "a glass greenhouse at sunrise"}
{"id": "city", "prompt": "a clean futuristic city plaza after rain"}

CSV files must include a prompt or Prompt column and may include an id column.
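
The three input formats can be normalized into one list of records. The loader below is a sketch only: `load_prompts` is a hypothetical helper, and choosing the format by file extension is an assumption rather than documented behavior.

```python
import csv
import json
from pathlib import Path


def load_prompts(path):
    """Return a list of {"id": ..., "prompt": ...} dicts from .txt, .jsonl, or .csv."""
    path = Path(path)
    suffix = path.suffix.lower()
    rows = []
    if suffix == ".jsonl":
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            item = json.loads(line)
            if isinstance(item, str):
                # A bare string row carries only the prompt.
                rows.append({"id": None, "prompt": item})
            else:
                rows.append({"id": item.get("id"), "prompt": item["prompt"]})
    elif suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            fields = reader.fieldnames or []
            # Accept either spelling of the prompt column.
            column = next((name for name in ("prompt", "Prompt") if name in fields), None)
            if column is None:
                raise ValueError("CSV needs a 'prompt' or 'Prompt' column")
            for record in reader:
                rows.append({"id": record.get("id"), "prompt": record[column]})
    else:
        # Plain text: one prompt per non-empty line.
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                rows.append({"id": None, "prompt": line.strip()})
    return rows
```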

Useful Options

python -m agentic_upsampling.run \
  --prompt "a precise product photo of a transparent mechanical keyboard" \
  --output-dir outputs/keyboard \
  --generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT \
  --max-iterations 2 \
  --samples-per-iteration 3 \
  --seed-base 42 \
  --size 1024x1024 \
  --guidance 4.0 \
  --flow-shift 3.0

  • --max-iterations sets the maximum number of rewrite iterations after the initial upsample. The default is 2, meaning the initial upsample plus up to two rewrites.
  • --samples-per-iteration runs a best-of-N seed search for each prompt stage. Generation requests for those seeds are submitted concurrently within the iteration.
  • --seed-base makes seeds deterministic. Sample seeds are seed_base + sample_index.
  • --size is the vLLM-Omni image size in WIDTHxHEIGHT format.
  • --guidance sets guidance_scale; the default is 4.0.
  • --flow-shift sets flow_shift; the default is 3.0.
  • --generation-extra-args overrides the default vLLM-Omni generation extra_args JSON object.
  • Early stopping is enabled by default when the critic score clears the strict threshold. Use --disable-early-stop to always run every iteration.
  • Reruns resume from completed artifacts by default. Use --overwrite to regenerate them.
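
The seed rule and the best-of-N search described in the options above can be sketched as follows. This is an illustration under assumptions, not the tool's code: `sample_seeds` and `best_of_n` are hypothetical helpers, `generate` and `score` are caller-supplied callables, and a thread pool stands in for whatever concurrency mechanism the tool actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def sample_seeds(seed_base, samples_per_iteration):
    """Seeds are seed_base + sample_index, as described above."""
    return [seed_base + index for index in range(samples_per_iteration)]


def best_of_n(seeds, generate, score):
    """Generate one image per seed concurrently, then keep the best-scored one.

    Assumes at least one seed. Returns (best_seed, best_score).
    """
    with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
        # pool.map preserves input order, so images[i] belongs to seeds[i].
        images = list(pool.map(generate, seeds))
    scores = [score(image) for image in images]
    best_index = max(range(len(seeds)), key=lambda index: scores[index])
    return seeds[best_index], scores[best_index]


print(sample_seeds(42, 3))  # [42, 43, 44]
```

Because the seeds depend only on `seed_base` and the sample index, rerunning with the same `--seed-base` requests the same seeds for every sample slot.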

Output Layout

output_dir/
  run_config.json
  summary.json
  manifest.jsonl
  failures.jsonl
  0001/
    best.json
    iter_00/
      prompt.json
      negative_prompt.json
      image.jpg
      generation_meta.json
      analysis.json
      samples.json
      meta.json
    iter_01/
      ...

For --samples-per-iteration N, each iteration contains sample_00/, sample_01/, and so on.
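
A quick way to inspect a finished run is to walk this layout. The helper below is a hypothetical sketch that relies only on the directory and file names shown above. It treats every subdirectory of the output directory as a prompt directory and does not assume anything about the JSON schemas.

```python
from pathlib import Path


def list_run_artifacts(output_dir):
    """Map each prompt directory name to its iteration folders and best.json status."""
    summary = {}
    for prompt_dir in sorted(p for p in Path(output_dir).iterdir() if p.is_dir()):
        iterations = sorted(p.name for p in prompt_dir.glob("iter_*") if p.is_dir())
        summary[prompt_dir.name] = {
            "has_best": (prompt_dir / "best.json").is_file(),
            "iterations": iterations,
        }
    return summary
```

For example, `list_run_artifacts("outputs/agentic_batch")` returns one entry per prompt directory, which makes it easy to spot prompts that have no best.json yet.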

Export Best Images

Copy the selected best image for every completed prompt into one folder:

python -m agentic_upsampling.extract_best \
  --output-dir outputs/agentic_batch \
  --export-dir outputs/agentic_batch_best \
  --overwrite

The exporter writes:

best_generations.jsonl
best_generations.csv
images/