Cosmos3-Super-Text2Image / AGENTIC_UPSAMPLING.md

Super-squash branch 'main' using huggingface_hub

fdafd05 1 day ago

4.8 kB

	# Agentic Prompt Upsampling

	This repository includes a standalone text-to-image agentic prompt upsampler for Cosmos3-Super-Text2Image.

	The loop:

	1. Upsamples the user prompt into a structured Cosmos3 T2I JSON prompt.
	2. Generates an image through a vLLM-Omni `/v1/images/generations` endpoint.
	3. Scores the image with a VLM critic.
	4. Rewrites both the positive JSON prompt and generator-side negative prompt from the critic feedback.
	5. Repeats up to the configured iteration limit and returns the best scored image.

	## Install

	From the repository root:

	```bash
	python -m pip install requests pillow
	```

	Recommended vLLM-Omni serving configuration for `nvidia/Cosmos3-Super-Text2Image` on 4xH200 is:

	```bash
	vllm serve nvidia/Cosmos3-Super-Text2Image \
	--omni \
	--cfg-parallel-size 2 \
	--ulysses-degree 2 \
	--tensor-parallel-size 1
	```

	With the no-offload configuration above, 1024x1024 image generation with 50 steps is expected to take roughly 5 seconds server-side per request.

	## Default Models

	The default prompt upsampler and rewriter are OpenAI GPT-5.5 through the public OpenAI chat completions API:

	```text
	endpoint: https://api.openai.com/v1
	model: gpt-5.5
	extra body: {"reasoning_effort": "low"}
	env var: OPENAI_API_KEY
	```

	The default critic is Gemini 3.1 Pro Preview through Google's OpenAI-compatible chat completions endpoint:

	```text
	endpoint: https://generativelanguage.googleapis.com/v1beta/openai/
	model: gemini-3.1-pro-preview
	env var: GEMINI_API_KEY
	```

	Set credentials:

	```bash
	export OPENAI_API_KEY=...
	export GEMINI_API_KEY=...
	```

	If your vLLM-Omni generation endpoint requires auth:

	```bash
	export AGENTIC_UPSAMPLING_GENERATION_AUTH_KEY=...
	```

	## Run One Prompt

	```bash
	python -m agentic_upsampling.run \
	--prompt "a cinematic photo of a glass greenhouse at sunrise" \
	--output-dir outputs/agentic_greenhouse \
	--generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT
	```

	The generation call is a standard vLLM-Omni image request:

	```text
	POST /v1/images/generations
	model: nvidia/Cosmos3-Super-Text2Image
	size: 1024x1024
	response_format: b64_json
	num_inference_steps: 50
	guidance_scale: 4.0
	flow_shift: 3.0
	negative_prompt: ""
	extra_args: {"guardrails": false, "use_resolution_template": false}
	```

	## Run A Batch

	Text file, one prompt per non-empty line:

	```bash
	python -m agentic_upsampling.run \
	--prompts prompts.txt \
	--output-dir outputs/agentic_batch \
	--generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT
	```

	JSONL rows can be strings or objects with `prompt` and optional `id`:

	```json
	{"id": "greenhouse", "prompt": "a glass greenhouse at sunrise"}
	{"id": "city", "prompt": "a clean futuristic city plaza after rain"}
	```

	CSV files must include a `prompt` or `Prompt` column and may include an `id` column.

	## Useful Options

	```bash
	python -m agentic_upsampling.run \
	--prompt "a precise product photo of a transparent mechanical keyboard" \
	--output-dir outputs/keyboard \
	--generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT \
	--max-iterations 2 \
	--samples-per-iteration 3 \
	--seed-base 42 \
	--size 1024x1024 \
	--guidance 4.0 \
	--flow-shift 3.0
	```

	- `--max-iterations` controls total prompt stages. The default is `2`, meaning the initial upsample plus up to two rewrites.
	- `--samples-per-iteration` runs a best-of-N seed search for each prompt stage. Generation requests for those seeds are submitted concurrently within the iteration.
	- `--seed-base` makes seeds deterministic. Sample seeds are `seed_base + sample_index`.
	- `--size` is the vLLM-Omni image size in `WIDTHxHEIGHT` format.
	- `--guidance` sets `guidance_scale`; the default is `4.0`.
	- `--flow-shift` sets `flow_shift`; the default is `3.0`.
	- `--generation-extra-args` overrides the default vLLM-Omni generation `extra_args` JSON object.
	- Early stopping is enabled by default when the critic score clears the strict threshold. Use `--disable-early-stop` to always run every iteration.
	- Reruns resume from completed artifacts by default. Use `--overwrite` to regenerate them.

	## Output Layout

	```text
	output_dir/
	run_config.json
	summary.json
	manifest.jsonl
	failures.jsonl
	0001/
	best.json
	iter_00/
	prompt.json
	negative_prompt.json
	image.jpg
	generation_meta.json
	analysis.json
	samples.json
	meta.json
	iter_01/
	...
	```

	For `--samples-per-iteration N`, each iteration contains `sample_00/`, `sample_01/`, and so on.

	## Export Best Images

	Copy the selected best image for every completed prompt into one folder:

	```bash
	python -m agentic_upsampling.extract_best \
	--output-dir outputs/agentic_batch \
	--export-dir outputs/agentic_batch_best \
	--overwrite
	```

	The exporter writes:

	```text
	best_generations.jsonl
	best_generations.csv
	images/
	```