File size: 4,803 Bytes
fdafd05
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
# Agentic Prompt Upsampling

This repository includes a standalone text-to-image agentic prompt upsampler for Cosmos3-Super-Text2Image.

The loop:

1. Upsamples the user prompt into a structured Cosmos3 T2I JSON prompt.
2. Generates an image through a vLLM-Omni `/v1/images/generations` endpoint.
3. Scores the image with a VLM critic.
4. Rewrites both the positive JSON prompt and generator-side negative prompt from the critic feedback.
5. Repeats up to the configured iteration limit and returns the best scored image.

## Install

From the repository root:

```bash
python -m pip install requests pillow
```

Recommended vLLM-Omni serving configuration for `nvidia/Cosmos3-Super-Text2Image` on 4xH200 is:

```bash
vllm serve nvidia/Cosmos3-Super-Text2Image \
  --omni \
  --cfg-parallel-size 2 \
  --ulysses-degree 2 \
  --tensor-parallel-size 1
```

With the no-offload configuration above, 1024x1024 image generation with 50 steps is expected to take roughly 5 seconds server-side per request.

## Default Models

The default prompt upsampler and rewriter are OpenAI GPT-5.5 through the public OpenAI chat completions API:

```text
endpoint: https://api.openai.com/v1
model: gpt-5.5
extra body: {"reasoning_effort": "low"}
env var: OPENAI_API_KEY
```

The default critic is Gemini 3.1 Pro Preview through Google's OpenAI-compatible chat completions endpoint:

```text
endpoint: https://generativelanguage.googleapis.com/v1beta/openai/
model: gemini-3.1-pro-preview
env var: GEMINI_API_KEY
```

Set credentials:

```bash
export OPENAI_API_KEY=...
export GEMINI_API_KEY=...
```

If your vLLM-Omni generation endpoint requires auth:

```bash
export AGENTIC_UPSAMPLING_GENERATION_AUTH_KEY=...
```

## Run One Prompt

```bash
python -m agentic_upsampling.run \
  --prompt "a cinematic photo of a glass greenhouse at sunrise" \
  --output-dir outputs/agentic_greenhouse \
  --generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT
```

The generation call is a standard vLLM-Omni image request:

```text
POST /v1/images/generations
model: nvidia/Cosmos3-Super-Text2Image
size: 1024x1024
response_format: b64_json
num_inference_steps: 50
guidance_scale: 4.0
flow_shift: 3.0
negative_prompt: ""
extra_args: {"guardrails": false, "use_resolution_template": false}
```

## Run A Batch

Text file, one prompt per non-empty line:

```bash
python -m agentic_upsampling.run \
  --prompts prompts.txt \
  --output-dir outputs/agentic_batch \
  --generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT
```

JSONL rows can be strings or objects with `prompt` and optional `id`:

```json
{"id": "greenhouse", "prompt": "a glass greenhouse at sunrise"}
{"id": "city", "prompt": "a clean futuristic city plaza after rain"}
```

CSV files must include a `prompt` or `Prompt` column and may include an `id` column.

## Useful Options

```bash
python -m agentic_upsampling.run \
  --prompt "a precise product photo of a transparent mechanical keyboard" \
  --output-dir outputs/keyboard \
  --generation-endpoint https://YOUR_VLLM_OMNI_ENDPOINT \
  --max-iterations 2 \
  --samples-per-iteration 3 \
  --seed-base 42 \
  --size 1024x1024 \
  --guidance 4.0 \
  --flow-shift 3.0
```

- `--max-iterations` controls total prompt stages. The default is `2`, meaning the initial upsample plus up to two rewrites.
- `--samples-per-iteration` runs a best-of-N seed search for each prompt stage. Generation requests for those seeds are submitted concurrently within the iteration.
- `--seed-base` makes seeds deterministic. Sample seeds are `seed_base + sample_index`.
- `--size` is the vLLM-Omni image size in `WIDTHxHEIGHT` format.
- `--guidance` sets `guidance_scale`; the default is `4.0`.
- `--flow-shift` sets `flow_shift`; the default is `3.0`.
- `--generation-extra-args` overrides the default vLLM-Omni generation `extra_args` JSON object.
- Early stopping is enabled by default when the critic score clears the strict threshold. Use `--disable-early-stop` to always run every iteration.
- Reruns resume from completed artifacts by default. Use `--overwrite` to regenerate them.

## Output Layout

```text
output_dir/
  run_config.json
  summary.json
  manifest.jsonl
  failures.jsonl
  0001/
    best.json
    iter_00/
      prompt.json
      negative_prompt.json
      image.jpg
      generation_meta.json
      analysis.json
      samples.json
      meta.json
    iter_01/
      ...
```

For `--samples-per-iteration N`, each iteration contains `sample_00/`, `sample_01/`, and so on.

## Export Best Images

Copy the selected best image for every completed prompt into one folder:

```bash
python -m agentic_upsampling.extract_best \
  --output-dir outputs/agentic_batch \
  --export-dir outputs/agentic_batch_best \
  --overwrite
```

The exporter writes:

```text
best_generations.jsonl
best_generations.csv
images/
```