Jack04810 committed 4 days ago

Commit

cc0721b

verified ·

1 Parent(s): 4438605

Add files using upload-large-folder tool

Files changed (50)

README.md +712 -0
client_utils/openai_api.py +123 -0
config/__init__.py +21 -0
config/__pycache__/config_rlsd_chartqa.cpython-310.pyc +0 -0
config/__pycache__/config_trimode.cpython-310.pyc +0 -0
config/__pycache__/config_trimode_antidegen.cpython-310.pyc +0 -0
config/__pycache__/loader.cpython-312.pyc +0 -0
config/config_7B.py +82 -0
config/config_aok.py +119 -0
config/config_llavacot.py +118 -0
config/config_low.py +120 -0
config/config_opd_7b_chartqa.py +48 -0
config/config_rlsd_chartqa.py +152 -0
config/config_trimode.py +88 -0
default_config_8gpu.yaml +16 -0
default_config_8gpu_deepspeed.yaml +21 -0
default_config_zero2_8gpu.yaml +18 -0
eval/eval_chartqa.py +310 -0
figs/chartqa.png +0 -0
kill_all.sh +55 -0
main.py +522 -0
main_llm.py +197 -0
main_sft.py +80 -0
multi_node_config_raw.yaml +21 -0
opsd_utils/__pycache__/opsd_loss.cpython-312.pyc +0 -0
opsd_utils/gate_policy.py +107 -0
opsd_utils/health_monitor.py +410 -0
opsd_utils/privileged/__pycache__/providers.cpython-310.pyc +0 -0
opsd_utils/privileged/image_utils.py +143 -0
opsd_utils/prompt_builder.py +265 -0
outputs/logs/.ipynb_checkpoints/train_opd_7b_ds_20260614_175014-checkpoint.log +0 -0
outputs/opd-7b-chartqa-ds/checkpoint-1764/zero_to_fp32.py +760 -0
outputs/opd-7b-chartqa-ds/checkpoint-2352/preprocessor_config.json +171 -0
outputs/opd-7b-chartqa-ds/checkpoint-588/config.json +235 -0
papers/full_text.txt +1211 -0
requirements.txt +16 -0
reward_utils/__pycache__/format_checks.cpython-310.pyc +0 -0
reward_utils/compute_rewards.py +126 -0
reward_utils/refiner.py +162 -0
tests/test_data_health_probe.py +28 -0
tests/test_degeneration_probe.py +98 -0
tests/test_health_monitor.py +72 -0
tests/test_mode_router_rlsd.py +97 -0
tests/test_privileged.py +172 -0
tests/test_privileged_debug_artifacts.py +40 -0
tests/test_slice_completion_logits.py +18 -0
tests/test_teacher_dual_image.py +43 -0
tests/test_vocab_align.py +71 -0
trainer/DyMETrainer_7B.py +983 -0
trainer/__init__.py +3 -0

README.md ADDED

	@@ -0,0 +1,712 @@

+# DyME: Empowering Small-scale VLMs with Reliable Thinking Capabilities
+[![ICLR 2026](https://img.shields.io/badge/ICLR-2026-blue.svg)](#)
+[![arXiv](https://img.shields.io/badge/arXiv-2506.23061-b31b1b.svg)](https://arxiv.org/abs/2506.23061)
+This repository provides the official implementation of **DyME** (**Dy**namically selecting between **M**emorization and **E**xploration), accepted at **ICLR 2026**.
+## Overview
+Small-scale Vision-Language Models (SVLMs) are highly suited for proprietary tasks, but equipping them with reasoning and thinking capabilities remains challenging. Traditional Supervised Fine-Tuning (SFT) can force memorization of pseudo thinking traces, while Reinforcement Learning with Verifiable Reward (RLVR) often leads to unstable exploration (advantage collapse) due to limited model capacity.
+**DyME** is a training paradigm that combines SFT and RLVR. At each optimization step, it selects between Memorization (via SFT) and Exploration (via RLVR), so that every update contributes to a better trade-off between the two. A **Visual Supervision mechanism** (a visual checker and refiner) additionally injects image-grounded guidance during training.
+Extensive experiments show that DyME delivers substantial performance improvements, establishing it as a robust strategy for stabilizing SVLM learning.
+## Repository Structure
+```text
+DyME/
+├── client_utils/         # Client tools for online Visual Supervision (LLM API)
+├── data/                 # Preprocessed textual datasets
+├── data_utils/           # Data processing and formatting scripts
+│   ├── aokvqa/
+│   ├── chart/
+│   └── commom_util.py
+├── eval/                 # Evaluation scripts for different benchmarks
+├── reward_utils/         # Reward function implementations for RLVR
+├── config/               # Modular configuration files for experiments
+├── opsd_utils/           # Privileged-context OPSD / TriMode extensions for DyMETrainer
+├── default_config.yaml   # Default DDP (MULTI_GPU, no DeepSpeed required)
+├── default_config_deepspeed.yaml  # Optional ZeRO-0 only (no sharding); needs pip install deepspeed
+├── default_config_zero2.yaml      # ZeRO-2 student sharding (OPD 7B colocate)
+├── default_config_zero3_colocate.yaml  # ZeRO-3 + CPU optimizer offload (tight VRAM)
+├── configs/deepspeed/             # DeepSpeed JSON templates (HF official "auto" fields)
+├── main.py               # Entry point for DyME training
+├── main_*.py             # Additional experimental variants (e.g., 7B, LLM-only)
+├── requirements.txt      # Python dependencies
+└── ...
+```
+## Configuration
+Before launching training, please prepare the relevant configuration files. The main settings are managed through configuration files such as `config/config.py` and `default_config.yaml`.
+### `CLIENT_CONFIG`
+This configuration is required when **Visual Supervision** is enabled. It specifies the online large-model API used by the visual checker and visual refiner.
+### `TRAINING_CONFIG`
+This section contains standard training hyperparameters for both the memorization phase and the exploration phase, including optimizer settings, batch size, learning rate, and related options.
+### `RL_CONFIG`
+This section defines critical variables for reward computation and response parsing during RLVR training. In particular, the following delimiters must be properly specified:
+* `answer_flag`: used to explicitly separate the final answer from auxiliary generated content such as intermediate reasoning traces.
+* `end_flag`: used to mark the end of generation.
+These delimiters are essential for stable parsing, reward assignment, and evaluation consistency.
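A minimal sketch of how the two delimiters could drive response parsing. The helper name `split_completion` and the example delimiter strings (`"Answer:"`, `"<|im_end|>"`) are assumptions for illustration; the repository's own parser and configured values may differ.

```python
from typing import Optional, Tuple


def split_completion(
    text: str, answer_flag: str, end_flag: str
) -> Tuple[str, Optional[str]]:
    """Split a completion into (reasoning, answer); answer is None if answer_flag is absent."""
    # Drop everything from the first end_flag onward.
    end = text.find(end_flag)
    if end != -1:
        text = text[:end]
    # Use the last answer_flag so a flag quoted inside the reasoning does not split early.
    idx = text.rfind(answer_flag)
    if idx == -1:
        return text.strip(), None
    return text[:idx].strip(), text[idx + len(answer_flag):].strip()


reasoning, answer = split_completion(
    "The peak value is 64 in 2017. Answer: 2017<|im_end|>ignored",
    answer_flag="Answer:",  # hypothetical value
    end_flag="<|im_end|>",  # hypothetical value
)
print(reasoning, "|", answer)
```

Returning `None` when the flag is missing lets a reward function tell a malformed completion apart from an empty answer.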
+### `DYME_OPSD_CONFIG` (OPSD / TriMode)
+`config/config.py` defines `DYME_OPSD_CONFIG`, merged into `CONFIG["opsd"]`. When `enabled=False` (default), training follows the original DyME behavior. Set `enabled=True` or pass CLI flags to activate privileged-context **Self-OPSD** inside `DyMETrainer`.
+| Field | Description |
+| --- | --- |
+| `enabled` | Master switch. `False` → original DyME only. |
+| `mode` | Routing mode (see table below). |
+| `privileged_profile` | Teacher preset: `text` \| `visual` \| `hybrid` (default **`hybrid`** in `config_trimode.py`). |
+| `privileged_providers` | Override provider list; default derived from profile. |
+| `privileged_image` | Teacher image layout: `mode` `single` (ChartQA default) or `dual` (full + crop); plus `crop_strategy`, `bbox_coord`, `margin_ratio`. |
+| `privileged_debug` | Periodic artifact logging: `save_images`, `image_subdir` (`logs/images`), `max_samples_per_detail`. |
+| `gate.correct_threshold` | Reward threshold to count a rollout as correct. |
+| `gate.teacher_recoverable` | Recoverability gate: `privileged_available` (default) or `logprob_gain`. |
+| `loss.beta` | JSD temperature for OPSD distillation. |
+| `loss.opsd_weight` / `grpo_weight` / `sft_weight` | Per-mode loss weights. |
+**Routing modes (`mode`):**
+| Mode | Behavior |
+| --- | --- |
+| `dyme` | Original DyME: any correct rollout → GRPO; all wrong → SFT. |
+| `trimode` | Any correct → OPSD (replaces GRPO); all wrong → SFT (DyME cold-start via `sft_check`, ignores recoverable). |
+| `opsd_only` | All prompts use OPSD. |
+| `replace_sft` | Any correct → GRPO; all wrong → OPSD (no SFT). |
+| `opsd_on_wrong` | Any correct → GRPO; all wrong + recoverable → OPSD; all wrong + not recoverable → SFT (legacy three-way routing). |
+| `grpo_opsd_joint` | Any correct → GRPO (+ optional joint OPSD loss); all wrong + recoverable → OPSD; else SFT. |
+Under `trimode`, the SFT share is determined by accuracy (how often prompts are all-wrong) and DyME's per-group `sft_check` (teacher injection on the first generation only)—no extra `sft_ratio` hyperparameter.
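The routing table above can be sketched as a single function over one prompt's group of rollouts. This is an illustrative reading of the table, not the trainer's code: the function name `route_group`, the `>=` comparison against `gate.correct_threshold`, and the boolean `recoverable` input are assumptions.

```python
from typing import List


def route_group(
    mode: str, rewards: List[float], correct_threshold: float, recoverable: bool
) -> str:
    """Return 'grpo', 'opsd', or 'sft' for one prompt's group of rollouts."""
    # Assumption: a rollout counts as correct when its reward reaches the threshold.
    any_correct = any(r >= correct_threshold for r in rewards)
    if mode == "opsd_only":
        return "opsd"
    if mode == "dyme":
        return "grpo" if any_correct else "sft"
    if mode == "trimode":
        # Recoverability is ignored in this mode.
        return "opsd" if any_correct else "sft"
    if mode == "replace_sft":
        return "grpo" if any_correct else "opsd"
    if mode in ("opsd_on_wrong", "grpo_opsd_joint"):
        if any_correct:
            return "grpo"  # grpo_opsd_joint may add a joint OPSD loss term here
        return "opsd" if recoverable else "sft"
    raise ValueError(f"unknown mode: {mode}")


print(route_group("trimode", [0.0, 1.0, 0.0], correct_threshold=0.5, recoverable=False))
```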
+**Privileged profiles** (`privileged_profile`):
+| Profile | Teacher images | Teacher text suffix |
+| --- | --- | --- |
+| `text` | Single full image (same as student) | hint + answer |
+| `visual` | **Dual**: full + evidence crop | Visual Facts only (no answer leak) |
+| `hybrid` | Single full image by default (`privileged_image.mode=single`); dual with `mode=dual` | Visual Facts + hint + answer |
+Student `collate_fn` never reads privileged fields. With `privileged_image.mode=dual`, teacher forward uses `[full, crop]`; crop comes from normalized `evidence_bbox` (C2), A-OKVQA `visual_fact` heuristic (D2), or center fallback (D1). ChartQA defaults to `single` (no crop zoom).
+**Privileged providers** (under `opsd_utils/privileged/`):
+* `text` — uses the `hint` / `answer` fields in training samples.
+* `visual_facts` — uses `visual_fact` JSON (B1 raw string), plus ChartQA `visual_fact_hint` (F1) and `visual_fact_deplot` (F2).
+* `crop` — evidence region as second teacher image (via `image_utils`, not a text suffix).
+* `hybrid` — combines text + visual_facts providers per profile.
+**Debug / artifact logging**
+* Verbose OPSD logs: `--opsd_debug` or `DYME_OPSD_DEBUG=1`.
+* Full diagnostic bundle every N steps: `--opsd_detail_every N` or `DYME_OPSD_DETAIL_EVERY`.
+* On detail steps, teacher privileged images are saved under `{output_dir}/logs/images/` as `step_XXXXXX_idx_Y_full.png`, `_crop.png`, and `_meta.json` (controlled by `privileged_debug.max_samples_per_detail`).
+**ChartQA visual-facts preprocessing (run on server before TriMode training)**
+TriMode with `privileged_providers=text,visual_facts` requires `visual_fact_hint` / `visual_fact_deplot` (and optionally `visual_fact`) on each sample. Raw `train_medium.json` only has `hint` — without this step, logs show `visual_fact_len=0` and the VisualFacts teacher channel is empty.
+From the repo root on your GPU server:
+```bash
+cd /path/to/agentic-rl-main   # project root (parent of scripts/, config/, data/)
+# F1: copy hint → visual_fact_hint (+ visual_fact for backward compat)
+python scripts/build_visual_facts_chartqa.py \
+  --input data/chartqa/train_medium.json \
+  --output data/chartqa/train_medium_vf_hint.json \
+  --also-set-visual-fact
+# F2: DePlot offline table extraction (google/deplot, batched GPU inference; default ON)
+python scripts/build_visual_facts_chartqa_deplot.py \
+  --input data/chartqa/train_medium_vf_hint.json \
+  --output data/chartqa/train_medium_vf_full.json \
+  --batch-size 8 \
+  --cache data/chartqa/deplot_cache.json
+# Fast placeholder-only mode (no GPU / CI): add --no-enabled
+# DYME_DEPLOT_ENABLED=0 bash scripts/train_local_gpus.sh
+# quick sanity check (expect non-zero lengths)
+python -c "
+import json, random
+d = json.load(open('data/chartqa/train_medium_vf_full.json', encoding='utf-8'))
+s = random.choice(d)
+assert s.get('visual_fact_hint'), 'missing visual_fact_hint'
+assert s.get('visual_fact_deplot'), 'missing visual_fact_deplot'
+print('ok', len(d), 'records; sample visual_fact_hint len', len(s['visual_fact_hint']))
+"
+```
+`config/config.py` points `train_dataset` at `data/chartqa/train_medium_vf_full.json`. Generated `*_vf_*.json` files are gitignored — **generate them on each server** (or copy from shared storage); do not rely on cloning them from GitHub.
+`scripts/train_local_gpus.sh` will auto-run the two Python steps above if `train_medium_vf_full.json` is missing.
+**Training examples (TriMode + hybrid default)**
+```bash
+# Text-only OPSD ablation
+python main.py --config trimode --opsd_privilege_profile text
+# Vision-OPD style (no answer text to teacher)
+python main.py --config trimode --opsd_privilege_profile visual
+# Full hybrid (default in config_trimode)
+python main.py --config trimode --opsd_privilege_profile hybrid --opsd_detail_every 10
+```
+**Privileged sample schema**
+| Field | Used by | Notes |
+| --- | --- | --- |
+| `prompt`, `image` | Student + teacher | Student always single full image |
+| `hint`, `answer` | Teacher (`text` / `hybrid`) | Never in student collate |
+| `visual_fact` | Teacher | Raw JSON string (A-OKVQA) |
+| `visual_fact_hint` | Teacher (ChartQA F1) | Hint placeholder pipeline |
+| `visual_fact_deplot` | Teacher (ChartQA F2) | DePlot `parsed_table` text (`google/deplot`; placeholder skipped) |
+| `evidence_bbox` | Teacher crop | Normalized `[x0,y0,x1,y1]` in `[0,1]` |
+Adapter helpers for future datasets: `data_utils/privileged_schema.py` (`normalize_evidence_bbox`, `parse_visual_fact`, `resolve_crop_bbox`).
+For legacy ChartQA single-field preprocessing, see `scripts/build_visual_facts_chartqa.py`.
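To make the `evidence_bbox` convention concrete, here is a hedged sketch that turns a normalized box into a pixel crop box. The function `bbox_to_pixels` is hypothetical and is not the repository's `resolve_crop_bbox`; in particular, applying `margin_ratio` as a fraction of the box size on every side is an assumption.

```python
import math
from typing import Sequence, Tuple


def bbox_to_pixels(
    bbox: Sequence[float], width: int, height: int, margin_ratio: float = 0.0
) -> Tuple[int, int, int, int]:
    """Convert a normalized [x0, y0, x1, y1] box into a pixel crop box."""
    # Clamp to [0, 1] and order the corners so x0 <= x1 and y0 <= y1.
    x0, y0, x1, y1 = (min(max(float(v), 0.0), 1.0) for v in bbox)
    x0, x1 = min(x0, x1), max(x0, x1)
    y0, y1 = min(y0, y1), max(y0, y1)
    # Assumption: the margin is a fraction of the box size, added on every side.
    mx, my = (x1 - x0) * margin_ratio, (y1 - y0) * margin_ratio
    x0, y0 = max(x0 - mx, 0.0), max(y0 - my, 0.0)
    x1, y1 = min(x1 + mx, 1.0), min(y1 + my, 1.0)
    # Round outward so the crop never cuts into the evidence region.
    return (
        math.floor(x0 * width),
        math.floor(y0 * height),
        math.ceil(x1 * width),
        math.ceil(y1 * height),
    )


print(bbox_to_pixels([0.25, 0.5, 0.75, 1.0], width=400, height=200, margin_ratio=0.25))
# (50, 75, 350, 200)
```

The returned tuple follows the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects.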
+## Data Preparation
+We provide example preprocessing scripts in the `data_utils/` directory. After preprocessing, the training data should be organized as a list of dictionaries (e.g., `metadata_list`) following the format below:
+```python
+metadata_list.append({
+    "question": question,               # Full prompt used for training
+    "question_wo_prompt": question,     # Raw question without prompt template
+    "answer": answer,                   # SFT target; should follow the answer_flag format
+    "image": image_save_path,           # Local path to the corresponding image
+})
+```
+### Field Description
+* `question`: the complete model input used during training.
+* `question_wo_prompt`: the raw question content without any additional prompt wrapper.
+* `answer`: the training target for SFT; this field should be formatted consistently with the delimiter specification in `RL_CONFIG`.
+* `image`: the local file path of the associated image, if applicable.
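A small validation pass over `metadata_list` can catch formatting problems before training starts. The sketch below is an assumption-laden example: `check_record` is a hypothetical helper, and `"Answer:"` stands in for whatever `answer_flag` is configured in `RL_CONFIG`.

```python
import os
from typing import Dict, List

REQUIRED_FIELDS = ("question", "question_wo_prompt", "answer")


def check_record(
    record: Dict[str, str], answer_flag: str, check_image_exists: bool = False
) -> List[str]:
    """Return a list of human-readable problems found in one training record."""
    problems = [f"missing or empty field: {k}" for k in REQUIRED_FIELDS if not record.get(k)]
    answer = record.get("answer") or ""
    # The SFT target should contain the same delimiter the reward parser looks for.
    if answer and answer_flag not in answer:
        problems.append("answer does not contain answer_flag")
    # The image is optional (text-only data), so only check it when a path is given.
    image = record.get("image")
    if check_image_exists and image and not os.path.isfile(image):
        problems.append(f"image not found: {image}")
    return problems


record = {
    "question": "<prompt template> When does the unfavorable view reach the peak?",
    "question_wo_prompt": "When does the unfavorable view reach the peak?",
    "answer": "The highest value is 64 in 2017. Answer: 2017",
}
print(check_record(record, answer_flag="Answer:"))
```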
+## Environment Setup
+Please first install the required dependencies and configure the distributed training environment:
+```bash
+pip install -r requirements.txt
+accelerate config
+```
+The `accelerate config` step is required to initialize the distributed environment for both training and evaluation.
+## Dataset Setup
+### Text Data
+Preprocessed text splits are provided under the `data/` directory.
+### Image Data
+Due to storage constraints, image datasets are not included in this repository. Download scripts write images under `data/images/` by default:
+```text
+data/images/
+├── chartqa/
+│   ├── images/     # train_000000.png, val_000000.png, test_000000.png, ...
+│   └── json/       # train.json, val.json, test.json (from download.py)
+└── aokvqa/
+    ├── images/     # train_0000000.png, ...
+    └── json/       # train.json, validation.json, test.json (from download.py)
+```
+**ChartQA** (images only, no API required):
+```bash
+python data_utils/chart/download.py
+```
+**A-OKVQA** (images only by default; set `FETCH_VISUAL_FACTS=1` only if local VLM APIs are running on ports 23333–23340):
+```bash
+python data_utils/aokvqa/download.py
+```
+If you already downloaded ChartQA to `chartqa_output/` at the project root, move it into the canonical layout:
+```bash
+mkdir -p data/images
+mv chartqa_output data/images/chartqa
+```
+Preprocessed text annotations with hints live separately under `data/chartqa/` and `data/aokvqa/`. Image paths inside those JSON files are resolved automatically at load time (legacy prefixes like `/chartqa_output/` map to `data/images/chartqa/`).
+### Demo Samples
+A small subset of demo images for verifying the data loading pipeline may be provided in a future update.
+## Dataset Examples
+### ChartQA
+**ChartQA** is a visual question answering benchmark grounded in chart images. To illustrate different supervision granularities, we provide representative examples with three levels of reasoning-trace quality: **High**, **Medium**, and **Low**.
+<div align="center">
+| Example                                                         |
+| --------------------------------------------------------------- |
+| <img src="figs/chartqa.png" alt="ChartQA Example" width="220"/> |
+</div>
+#### High-quality Example
+<details>
+<summary><code>High-quality ChartQA Example</code></summary>
+```json
+{
+  "question": "When does the unfavorable view reach the peak?",
+  "answer": "2017",
+  "hint": "<SUMMARY> To solve the problem, I will examine the image to identify trends in unfavorable views of Pakistan in India over time. I'll closely inspect the data points within the graph to determine the year where the \"very unfavorable view\" reaches its peak. This involves identifying the maximum value on the vertical axis and noting the corresponding year on the horizontal axis. </SUMMARY> \n\n<CAPTION> The image is a line graph titled \"Very unfavorable views of Pakistan increasing in India,\" with the subtitle \"Very unfavorable view of Pakistan.\" The y-axis represents the percentage of unfavorable views, ranging from 0% to 100%. The x-axis displays years from 2013 to 2017. The data points show the percentages of very unfavorable views over these years, with specific values marked: 54% in 2013, 49% in 2014, 51% in 2015, 55% in 2016, and 64% in 2017. The graph shows a general upward trend in unfavorable views, peaking in 2017. </CAPTION> \n\n<REASONING> To determine when the unfavorable view reaches its peak, one should observe the graph for the data point with the highest percentage on the y-axis. The graph shows percentages for each year from 2013 to 2017: starting at 54% in 2013, decreasing to 49% in 2014, and then gradually increasing to 51% in 2015 and 55% in 2016. The graph culminates with the highest percentage of 64% in 2017. Thus, the peak of unfavorable views is associated with the year 2017. </REASONING> \n\n<CONCLUSION> 2017 </CONCLUSION>"
+}
+```
+</details>
+#### Medium-quality Example
+<details>
+<summary><code>Medium-quality ChartQA Example</code></summary>
+```json
+{
+  "question": "When does the unfavorable view reach the peak?",
+  "answer": "2017",
+  "hint": "Goal: Find the year when the unfavorable view reaches its peak.\nObservation: The data shows the values for each year are: 2013: 0, 2014: 0, 2015: 0, 2016: 55, and 2017: 64.\nReasoning: By comparing the values in each year, the highest value is 64, which occurs in 2017.\nConclusion: The unfavorable view reaches its peak in 2017."
+}
+```
+</details>
+#### Low-quality Example
+<details>
+<summary><code>Low-quality ChartQA Example</code></summary>
+```json
+{
+  "question": "When does the unfavorable view reach the peak?",
+  "answer": "2017",
+  "hint": "I'm trying to figure out the year when the unfavorable view reaches its highest point. Looking at the data, I see that the values for each year are pretty low until 2016, where it jumps to 55. But the peak doesn't happen until 2017, when the value spikes to 64. So, it seems like the unfavorable view really hits its maximum in 2017."
+}
+```
+</details>
+### A-OKVQA
+**A-OKVQA** is an open-ended visual question answering benchmark that requires world knowledge, commonsense reasoning, and visual understanding. Below we provide a representative example together with its corresponding annotation.
+<div align="center">
+| Example                                                        |
+| -------------------------------------------------------------- |
+| <img src="figs/aokvqa.png" alt="A-OKVQA Example" width="220"/> |
+</div>
+<details>
+<summary><code>View A-OKVQA JSON Example</code></summary>
+```json
+{
+  "question": "What is the man by the bags awaiting?",
+  "answer": "cab",
+  "visual_fact": "{\n  \"description\": \"The image shows a man standing in the middle of a street, facing away from the camera. He is holding a red bag in one hand and appears to be pulling a black suitcase with wheels. Another black suitcase is lying on the ground near him. The setting is an urban area with houses, parked cars, and trees in the background. The man seems to be waiting or preparing to cross the street.\",\n  \"objects\": [\n    {\n      \"name\": \"man\",\n      \"attributes\": [\"wearing a light blue and white shirt\", \"blue jeans\", \"carrying a red bag\", \"pulling a black suitcase\"],\n      \"position\": \"center\"\n    },\n    {\n      \"name\": \"red bag\",\n      \"attributes\": [\"held by the man\"],\n      \"position\": \"left side of the man\"\n    },\n    {\n      \"name\": \"black suitcase\",\n      \"attributes\": [\"with wheels\", \"being pulled by the man\"],\n      \"position\": \"near the man's feet\"\n    },\n    {\n      \"name\": \"black suitcase\",\n      \"attributes\": [\"on the ground\"],\n      \"position\": \"on the ground near the man\"\n    },\n    {\n      \"name\": \"street\",\n      \"attributes\": [\"asphalt\", \"urban setting\"],\n      \"position\": \"foreground\"\n    },\n    {\n      \"name\": \"houses\",\n      \"attributes\": [\"visible in the background\"],\n      \"position\": \"left side\"\n    },\n    {\n      \"name\": \"parked cars\",\n      \"attributes\": [\"red SUV\", \"other vehicles\"],\n      \"position\": \"left and center background\"\n    },\n    {\n      \"name\": \"trees\",\n      \"attributes\": [\"green foliage\"],\n      \"position\": \"right side\"\n    }\n  ]\n}",
+  "hint": "A train would not be on the street, he would not have luggage waiting for a delivery, and the skateboarder is there and not paying attention to him, so a cab is the only plausible answer."
+}
+```
+</details>
+### GSM8K
+**GSM8K** is a mathematical word problem benchmark. Since it is text-only, we provide a representative question-answer example together with its reasoning trace.
+<details>
+<summary><code>View GSM8K JSON Example</code></summary>
+```json
+{
+  "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
+  "answer": "72",
+  "hint": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72"
+}
+```
+</details>
+## Training
+All training scripts are launched using `accelerate`. Pass `--config` as a **Python config file path** (recommended) or a shorthand alias (`norm`, `trimode`, `llavacot`, `low`, `aok`).
+**Important:** `num_processes` must match the number of visible GPUs on your node. Helper scripts auto-detect GPU count and use **native PyTorch DDP** (`default_config.yaml`, `distributed_type: MULTI_GPU`) — **DeepSpeed is not required** for 0.5B multi-GPU training.
+Optional: if you already have `deepspeed` installed and want the Accelerate integration without parameter sharding, use ZeRO-0 (`default_config_deepspeed.yaml`, `zero_stage: 0`). Do **not** use ZeRO-2/3 for 0.5B-only RLSD unless you need the integration path.
+**7B OPD (student + frozen teacher on each GPU):** use DeepSpeed ZeRO to shard the **trainable 0.5B student**; the frozen 7B teacher stays outside DeepSpeed on `cuda:{LOCAL_RANK}`.
+```bash
+# ZeRO-2 colocate (recommended first try on 2×80G)
+bash scripts/train_opd_7b_chartqa_deepspeed.sh
+# Tighter memory: ZeRO-3 + CPU optimizer offload
+ACCELERATE_CONFIG=default_config_zero3_colocate.yaml bash scripts/train_opd_7b_chartqa_deepspeed.sh
+```
+Refs: [Transformers DeepSpeed](https://huggingface.co/docs/transformers/deepspeed), [Accelerate DeepSpeed](https://huggingface.co/docs/accelerate/usage_guides/deepspeed).
+```bash
+# 4-GPU node (default DDP, recommended)
+bash scripts/train_rlsd_chartqa.sh
+# Explicit DDP config
+accelerate launch --config_file default_config.yaml --num_processes 4 main.py --config config/config.py --mode rl
+# Optional ZeRO-0 (requires deepspeed, no sharding)
+ACCELERATE_CONFIG=default_config_deepspeed.yaml bash scripts/train_rlsd_chartqa.sh
+```
+Or override explicitly: `NUM_GPUS=4 bash scripts/train_trimode.sh`
+For TriMode on **all visible local GPUs** (auto-detect via `CUDA_VISIBLE_DEVICES` / `torch.cuda.device_count()`):
+```bash
+# 1) One-time (or when raw data changes): enrich ChartQA JSON on the server — see
+#    "ChartQA visual-facts preprocessing" above. train_local_gpus.sh also auto-runs
+#    this if train_medium_vf_full.json is absent.
+# 2) Start training (default: OPSD verbose off, detail every 50 steps, probe on)
+bash scripts/train_local_gpus.sh
+# Optional: full verbose debug (large logs)
+# DYME_OPSD_DEBUG=1 DYME_OPSD_DETAIL_EVERY=10 bash scripts/train_local_gpus.sh
+# Roll back to original trimode config (pre-antidegen hyperparameters)
+# DYME_CONFIG=config/config_trimode.py bash scripts/train_local_gpus.sh
+```
+### Anti-degeneration config (`config_trimode_antidegen`)
+`scripts/train_local_gpus.sh` defaults to **`config/config_trimode_antidegen.py`** (alias `trimode_antidegen`). Overrides are based on offline analysis of `train_trimode_4gpu_20260610_173637.log` (1225 steps):
+| Issue | Baseline log evidence | Antidegen change |
+|-------|----------------------|------------------|
+| Logit collapse | `LOGIT_MODE_COLLAPSE` 212×; step 1 clip 0→1.0; step 1175 clip≈0.92 | `max_completion_length=150`, `temperature=0.7`, `repetition_penalty=1.25` |
+| Step-1 gradient shock | `GEN_CLIP_COLLAPSE` from step 1; `OPT_GRAD_SPIKE` 44× | `learning_rate=5e-5`, `warmup_steps=50` |
+| OPSD coverage low | `opsd_mask` mean 5.6%; 492/1226 zero-mask steps | `require_format_for_opsd=False` (env default `DYME_OPSD_REQUIRE_FORMAT=0`) |
+| RL signal sparse | `RL_ZERO_SIGNAL` expected in trimode | `reward_weights=[0.5, 1.5, 1.0]` (format, context F1, acc) |
+| visual_fact empty | `visual_fact_empty_rate=0` throughout | no data change |
+Environment overrides:
+```bash
+export DYME_CONFIG=config/config_trimode_antidegen.py   # default in train_local_gpus.sh
+export DYME_OPSD_REQUIRE_FORMAT=0                       # antidegen default; set 1 to restore strict gate
+export DYME_REWARD_WEIGHTS=0.5,1.5,1.0                  # format, context, accuracy
+```
+After a new run (~200+ steps), compare against the baseline log:
+```bash
+python scripts/parse_trimode_log.py outputs/logs/train_trimode_*_new.log
+python scripts/degeneration_report.py outputs/logs/train_trimode_*_new.log
+python scripts/compare_trimode_logs.py train_trimode_4gpu_20260610_173637.log outputs/logs/train_trimode_*_new.log
+```
+Success criteria (candidate vs baseline): step 1 `clip` &lt; 1.0; `LOGIT_MODE_COLLAPSE` count down &gt;30%; `opsd_mask` mean &gt; 8%; step 200+ `mean_length` median &lt; 130. `RL_ZERO_SIGNAL` may remain high (trimode design).
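The four success criteria can be written down as explicit checks. This sketch assumes the metrics have already been extracted from the logs (for example by the scripts above); the function name `antidegen_checks` and its argument names are hypothetical, and `opsd_mask_mean` is taken as a fraction (0.08 for 8%).

```python
from statistics import median
from typing import Dict, List


def antidegen_checks(
    step1_clip: float,
    baseline_collapse_count: int,
    candidate_collapse_count: int,
    opsd_mask_mean: float,
    mean_lengths_after_200: List[float],
) -> Dict[str, bool]:
    """Evaluate the candidate run against the four criteria listed above."""
    drop = baseline_collapse_count - candidate_collapse_count
    return {
        "step1_clip_below_1": step1_clip < 1.0,
        # "Down more than 30%" relative to the baseline count, in integer arithmetic.
        "logit_collapse_down_30pct": drop * 100 > 30 * baseline_collapse_count,
        "opsd_mask_above_8pct": opsd_mask_mean > 0.08,
        "late_length_median_below_130": median(mean_lengths_after_200) < 130,
    }


# Illustrative numbers only; 212 is the baseline LOGIT_MODE_COLLAPSE count from the table.
result = antidegen_checks(0.9, 212, 140, 0.09, [120.0, 128.0, 131.0])
print(result, all(result.values()))
```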
+### 1. Training DyME (original)
+Default config keeps OPSD disabled (`DYME_OPSD_CONFIG.enabled=False`):
+```bash
+accelerate launch main.py --config config/config.py --mode rl
+```
+### OPSD debug logging + tee
+When debugging OPSD / TriMode (e.g. NCCL timeout), enable verbose logs and save stdout/stderr:
+```bash
+export DYME_OPSD_DEBUG=1
+mkdir -p ./outputs/logs
+LOG_FILE=./outputs/logs/train_$(date +%Y%m%d_%H%M%S).log
+accelerate launch --config_file default_config.yaml --num_processes "$(nvidia-smi -L | wc -l)" main.py \
+  --config config/config_trimode.py \
+  --mode rl \
+  --opsd_enabled \
+  --opsd_debug \
+  --opsd_mode trimode \
+  --opsd_providers text,visual_facts \
+  2>&1 | tee "${LOG_FILE}"
+```
+Logs are prefixed with `[OPSD-DEBUG]` and include the rank, the step, and `[SYNC_POINT]` markers before every distributed collective in the OPSD chain (reward gather, teacher prompt build, metrics gather, OPSD loss). To locate a hang, search the log for the last `[SYNC_POINT]` on each rank.
+You can also use the helper script (debug + tee enabled by default):
+```bash
+bash scripts/train_trimode.sh
+```
+Disable debug when not needed: `DYME_OPSD_DEBUG=0 bash scripts/train_trimode.sh`
+### Periodic weak-signal diagnostics (`[OPSD-DETAIL]`)
+Separate from per-step `[OPSD-DEBUG]` spam: every **N global steps** (default **10**, rank 0 only) a full diagnostic bundle is printed to investigate **reward ≈ 0** and **gradient ≈ 0** while the OPSD chain still runs:
+- Generation: EOS rate, clipped ratio, effective completion tokens, decoded samples
+- Reward: format / acc / context breakdown, advantage stats, per-sample table
+- Routing: OPSD mask ratio, TriMode counts, advantage token distribution
+- Loss: GRPO per-token logps, coef\_1, clip counts, weak-signal hints
+- OPSD JSD: per-token JSD, student/teacher top-1 agreement, max-JSD token
+Configure via config, CLI, or env:
+```bash
+# default: every 10 steps (config_trimode.py)
+export DYME_OPSD_DETAIL_EVERY=10
+python main.py --config config/config_trimode.py --mode rl \
+  --opsd_enabled --opsd_detail_every 10
+# disable periodic detail
+export DYME_OPSD_DETAIL_EVERY=0
+```
+Search logs for `[OPSD-DETAIL]` (not `[OPSD-DEBUG]`).
+**Per-generate probe (`[OPSD-PROBE]`)** — enabled by default in `config_trimode.py`; fires on every `(re)generate` on rank 0 (no need to wait for step 10). Logs raw `completion_ids`, decode with/without special tokens, `eos_idx`, flags `ONE_TOKEN` / `EMPTY_DECODE` / `FIRST_IS_EOS`, and patterns `PAREN_THEN_EOS` / `REPEAT_LOOP`. Disable with `DYME_OPSD_PROBE_ON_GENERATE=0` or `--no_opsd_probe_on_generate`.
+**Deep generate debug (`[OPSD-GENDBG]`)** — runs alongside `[OPSD-PROBE]` when probe is enabled. Before each `model.generate`, logs model training context, prompt tail tokens/decode, and first-token logits (`p_eos`, `p_token_340`, `entropy`, `top5`) via **per-sample** forward (up to `probe_sample_count`, default 4) to avoid OOM on large VLM batches. After generate, logs greedy-vs-actual first token, delta vs previous regenerate, and cross-rank summary.
+```bash
+export DYME_OPSD_PROBE_ON_GENERATE=1   # default in config_trimode
+grep -E '\[OPSD-(PROBE|GENDBG)\]' train.log
+```
+| Observation in logs | Likely root cause |
+|---------------------|-------------------|
+| `p_eos` very high + `greedy_token_id==eos` | Weight collapse / train-mode distribution |
+| `prompt_tail_decode` ends with unclosed template + high `p_token_340` | Prompt / chat template issue (legacy `"Answer: .."` quoted placeholder biased token 340 `)`; fixed in `data_utils/rl_prompt.py`) |
+| `greedy_matches_actual=False` with low `p_eos` | Sampling noise (temperature / top_p) |
+| Large `one_token_count` gap across ranks in `cross_rank` | Data sharding / batch composition |
+| `delta_one_token_count` spikes at `generate_call_index>=2` | Weight drift after optimizer step |
+Optional env overrides:
+```bash
+export DYME_OPSD_PROBE_FIRST_TOKEN_LOGITS=0   # skip extra forward before generate
+export DYME_OPSD_PROBE_PROMPT_TAIL_TOKENS=24
+export DYME_OPSD_PROBE_LOG_MODEL_CONTEXT=0
+```
+### 2. Training TriMode (DyME + OPSD)
+Use `config/config_trimode.py` (OPSD pre-enabled) or override on the base config via CLI:
+```bash
+accelerate launch main.py \
+  --config config/config_trimode.py \
+  --mode rl \
+  --opsd_enabled \
+  --opsd_mode trimode \
+  --opsd_providers text,visual_facts
+```
+Equivalent one-liner with base config + CLI only:
+```bash
+accelerate launch main.py --config config/config.py --mode rl \
+  --opsd_enabled --opsd_mode trimode --opsd_providers text,visual_facts
+```
+**CLI OPSD flags** (override `CONFIG["opsd"]`):
+| Flag | Description |
+| --- | --- |
+| `--opsd_enabled` | Enable OPSD / TriMode extensions. |
+| `--opsd_debug` | Verbose OPSD chain logs (`[OPSD-DEBUG]`, or env `DYME_OPSD_DEBUG=1`). |
+| `--opsd_detail_every N` | Full weak-signal bundle every N steps (`[OPSD-DETAIL]`, default 10; `0` = off). |
+| `--opsd_probe_on_generate` / `--no_opsd_probe_on_generate` | Per-generate `[OPSD-PROBE]` on rank 0 (trimode default on). |
+| `--opsd_mode MODE` | Routing mode: `trimode` (legacy), `rlsd` (anti-leakage), `copsd_opd`, `dyme`, `opsd_only`, `replace_sft`, … |
+| `--opsd_providers LIST` | Comma-separated providers, e.g. `text`, `format_only`, `visual_facts`. Empty = same-prompt OPD only. |
+### 2b. RLSD / anti-leakage OPSD (recommended for ChartQA)
+`trimode` routes OPSD on **correct** completions and injects the gold answer into the teacher prompt, which leaks information. Use **`rlsd`** instead:
+- **Correct** → GRPO (on-policy self-learning, no privileged suffix)
+- **Wrong** → same-prompt OPSD / OPD (no `[Reference Answer]` in teacher)
+- **All-wrong group** → online SFT replace on the first generation (DyME cold-start; no separate offline SFT phase)
+**Important — online SFT ≠ offline SFT:** From step 0, training is always **RL + sparse online SFT** (typically 1/8 of completions per prompt when the group is all-wrong). There is no dedicated SFT-only phase unless you run a separate offline stage (see below).
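The routing rules above can be sketched as a small function. This is only an illustration under stated assumptions: `route_group` and the route labels are hypothetical names (not identifiers from `opsd_utils/`), and sending the remaining completions of an all-wrong group to same-prompt OPSD is an assumption inferred from the "Wrong" rule.

```python
from typing import List


def route_group(correct: List[bool]) -> List[str]:
    """Assign a hypothetical route label to each completion of one prompt group."""
    if not correct:
        return []
    if not any(correct):
        # All-wrong group: the first generation is replaced by an online SFT target.
        # Assumption: the other (wrong) completions still go to same-prompt OPSD.
        return ["online_sft"] + ["opsd"] * (len(correct) - 1)
    # Mixed group: correct completions use GRPO, wrong ones use same-prompt OPSD.
    return ["grpo" if ok else "opsd" for ok in correct]
```

With 8 generations per prompt, an all-wrong group yields exactly one `online_sft` slot, which matches the "1/8 of completions" figure quoted above.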
+**Anti-collapse knobs (ChartQA RLSD / OPD):**
+| Env / config | Purpose |
+| --- | --- |
+| `DYME_MAX_COMPLETION_LENGTH`, `DYME_TEMPERATURE`, `DYME_REPETITION_PENALTY` | Antidegen decoding (defaults in `config/config_rlsd_chartqa.py`: 96 / 0.5 / 1.5) |
+| `DYME_FORMAT_MIN_THINKING` | Minimum chars before `Answer:` for format reward (default 8) |
+| `DYME_OPSD_SKIP_DEGENERATE=0` | Never skip OPSD on degenerate completions |
+| `DYME_OPSD_DEGEN_WARMUP_STEPS` | Before this step, degenerate samples still run OPSD (default 200) |
+| `DYME_SFT_WARMUP_SLOTS` | During warmup, inject GT into first N gens per all-wrong group (default 4 in `config/config_rlsd_chartqa.py`) |
+**OPD config trap:** `config/config_opd_7b_chartqa.py` must inherit `CONFIG["training"]["dyme_args"]` from `config_rlsd_chartqa` (not stale `TRAINING_CONFIG["dyme_args"]` from antidegen). If logs show `max_new_tokens=150, temperature=0.7`, you are on the wrong decode path — stop and `git pull`.
+**Stop-training heuristics:** If after ~200 steps you see `degenerate_rate≈1`, `opsd_mask_true=0`, `grad_norm=0`, and `format_mean≈1` with `accuracy=0`, the run is collapsed — restart from base 0.5B or an early checkpoint.
+```bash
+bash scripts/train_rlsd_chartqa.sh
+# or: --config config/config_rlsd_chartqa.py --opsd_mode rlsd --opsd_providers format_only
+```
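The stop-training heuristics above can be turned into a quick check over logged metrics. This is a minimal sketch: the metric keys follow the names used in this README, while `looks_collapsed` and its tolerances (what counts as "≈1" or "=0") are assumptions chosen for illustration.

```python
from typing import Mapping


def looks_collapsed(metrics: Mapping[str, float], tol: float = 1e-6) -> bool:
    """Return True when the metrics match the collapse pattern described above."""
    return (
        metrics["degenerate_rate"] >= 0.99      # degenerate_rate is about 1
        and metrics["opsd_mask_true"] == 0      # no completion is routed to OPSD
        and abs(metrics["grad_norm"]) <= tol    # no gradient signal
        and metrics["format_mean"] >= 0.99      # format reward saturated
        and metrics["accuracy"] <= tol          # but nothing is correct
    )
```

The caller decides when to apply it, for example only after roughly 200 steps, as suggested above.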
+**Two-stage cold start (optional offline SFT → RLSD/OPD):**
+```bash
+bash scripts/train_chartqa_sft.sh
+export DYME_PRETRAINED_MODEL=./outputs/chartqa-sft/final_checkpoint
+bash scripts/train_opd_7b_chartqa_deepspeed.sh
+```
+**200-step smoke (OPD fixes):**
+```bash
+bash scripts/train_opd_7b_smoke.sh
+# Success: degenerate_rate<0.5, opsd_mask>8%, advantage_abs_mean>0, grad_norm>0
+```
+**Cross-model OPD (7B frozen teacher + 0.5B student):**
+```bash
+# Default: teacher on each rank's GPU (cuda:LOCAL_RANK). 2-GPU: student+teacher share the same card per rank.
+# Optional dedicated teacher GPU: export DYME_TEACHER_DEVICE_MAP=cuda:1
+# Vocab alignment debug at startup: DYME_VOCAB_ALIGN_FULL=1 (exhaustive) or DYME_VOCAB_ALIGN_STRIDE=500
+bash scripts/train_opd_7b_chartqa.sh
+```
+Note: `main.py --mode rl --config config/config.py` uses **`dyme_args`** (not the unused `grpo_args` block in the same file). Pure GRPO baselines were run with `main_rebuttal.py` in the original DyME repo; that script is not shipped here.
+**Helper scripts** (under `scripts/`):
+```bash
+# TriMode on ChartQA (legacy; leakage risk on ChartQA)
+bash scripts/train_trimode.sh
+# Anti-leakage RLSD (recommended)
+bash scripts/train_rlsd_chartqa.sh
+# Ablation matrix: MODE=dyme|trimode|replace_sft|opsd_only|...
+MODE=trimode DYME_OPSD_PROVIDERS=text,visual_facts bash scripts/train_baselines.sh
+# Post-training eval (set CHECKPOINT_DIR)
+CHECKPOINT_DIR=./outputs/trimode-chartqa/final_checkpoint bash scripts/run_eval_ablation.sh
+```
+### 3. Reproducing Baselines
+To reproduce baseline settings such as standard SFT or RL training, use **`main_sft.py`** (offline ChartQA SFT) or `main.py` with `--opsd_enabled` off for pure DyME.
+#### Supervised Fine-Tuning (SFT) — offline two-stage
+```bash
+bash scripts/train_chartqa_sft.sh
+# or: accelerate launch main_sft.py --config config/config_rlsd_chartqa.py
+```
+Then point RLSD/OPD at the SFT checkpoint via `DYME_PRETRAINED_MODEL` or `MODEL_CONFIG.pretrained_model_path`.
+#### Reinforcement Learning (GRPO / RL)
+```bash
+accelerate launch main.py --config config/config.py --mode rl
+```
+(`main_rebuttal.py` is referenced in the original DyME paper repo but is not shipped here; use `main_sft.py` + `main.py` instead.)
+### 4. Additional Experimental Variants
+For specific experimental settings such as different model scales or architecture-specific ablations, please use the corresponding scripts:
+* `main_7B.py`: experiments at the 7B scale
+* `main_llm.py`: LLM-specific variants
+* `main_change.py`: additional ablation settings
+## Evaluation
+We support multi-process evaluation through `accelerate`. Evaluation scripts are located in the `eval/` directory and can be launched as Python modules.
+### General Usage
+```bash
+accelerate launch -m eval.<eval_script_name>
+```
+### Example: ChartQA Evaluation
+```bash
+accelerate launch -m eval.eval_chartqa
+```
+### Evaluation Setup
+Before running evaluation, please open the corresponding evaluation script (for example, `eval_chartqa.py`) and modify the following fields as needed:
+* `model_id`: the path or identifier of the checkpoint to be evaluated
+* prompt templates: these should match the formatting used during training
+Ensuring consistency between training and evaluation prompts is important for obtaining reliable results.
+## Citation
+If you find this repository useful in your research, please consider citing our paper:
+```bibtex
+@inproceedings{dyme2026,
+  title={Empowering Small VLMs to Think with Dynamic Memorization and Exploration},
+  author={Jiazhen Liu and Yuchuan Deng and Long Chen},
+  booktitle={ICLR},
+  year={2026},
+}
+```

client_utils/openai_api.py ADDED

	@@ -0,0 +1,123 @@

+import time
+import httpx
+from openai import OpenAI
+from typing import Optional
+# ClientConfig documents the expected fields as attributes. Note that OpenAIClient
+# below indexes its config by key ('api_key', 'api_base', 'model_id'), so pass a
+# plain dict such as CLIENT_CONFIG rather than an instance of this class.
+class ClientConfig:
+    def __init__(self, api_key: str, base_url: str, model_id: str):
+        self.api_key = api_key
+        self.base_url = base_url
+        self.model_id = model_id
+class OpenAIClient:
+    """
+    A client wrapper for interacting with the OpenAI ChatCompletion API.
+    It handles client initialization and API calls with retry logic.
+    """
+    def __init__(self, config: dict, max_retries: int = 3):
+        # `config` is a plain dict (for example CLIENT_CONFIG) that must provide
+        # the keys 'api_key', 'api_base', and 'model_id'; it is indexed by key below.
+        # trust_env=False makes httpx ignore proxy settings from environment variables.
+        custom_http_client = httpx.Client(trust_env=False)
+        self.client = OpenAI(
+            api_key=config['api_key'],  # Required: your API key
+            base_url=config['api_base'],  # Optional: only needed for third-party services
+            http_client=custom_http_client,
+        )
+        self.model_id = config['model_id']
+        self.max_retries = max_retries
+    def get_completion(
+            self,
+            user_prompt: str,
+            system_prompt: Optional[str] = None,
+            max_tokens: int = 1024
+    ) -> Optional[str]:
+        """
+        Calls the OpenAI ChatCompletion API and returns the result.
+        Includes retry logic for handling transient errors.
+        Args:
+            user_prompt (str): The main input/prompt from the user.
+            system_prompt (Optional[str]): The system-level instruction for the model. Defaults to None.
+            max_tokens (int): The maximum number of tokens to generate. Defaults to 1024.
+        Returns:
+            Optional[str]: The content of the model's response, or None if the API call fails after all retries.
+        """
+        # Build the message list based on provided prompts.
+        messages = []
+        if system_prompt:
+            messages.append({'role': 'system', 'content': system_prompt})
+        messages.append({'role': 'user', 'content': user_prompt})
+        # Implement a clear retry loop instead of 'while True'.
+        for attempt in range(self.max_retries):
+            try:
+                # Make the API call to the chat completions endpoint.
+                response = self.client.chat.completions.create(
+                    model=self.model_id,
+                    messages=messages,
+                    max_tokens=max_tokens,
+                    # temperature=0.2 # You can add other parameters here as needed.
+                )
+                # If the call is successful, return the message content and exit the loop.
+                return response.choices[0].message.content
+            except Exception as e:
+                # If an error occurs, print a helpful message.
+                print(f"API call failed on attempt {attempt + 1}/{self.max_retries}. Error: {e}")
+                # If this was the last attempt, break the loop to return None.
+                if attempt + 1 == self.max_retries:
+                    print("All retry attempts failed.")
+                    break
+                # Wait for a short period before trying again.
+                print("Retrying in 2 seconds...")
+                time.sleep(2)
+        # Return None if all retries fail.
+        return None
+# --- How to Use the Refactored Class ---
+if __name__ == '__main__':
+    # 1. Define your configuration
+    # Replace with your actual credentials and model
+    CLIENT_CONFIG = {
+        "client_type": "openai",
+        "api_key": "none",
+        "api_base": "http://127.0.0.1:23333/v1",
+        "timeout": 60,
+        "model_id": "Qwen/Qwen2.5-14B-Instruct-AWQ",
+        "init_port": 23333,
+        "num_server": 8
+    }
+    # 2. Instantiate the client
+    my_client = OpenAIClient(config=CLIENT_CONFIG)
+    # 3. Define your prompts
+    user_message = "What is the capital of France?"
+    system_message = "You are a helpful assistant that provides concise answers."
+    # 4. Get the model's response
+    response_content = my_client.get_completion(
+        user_prompt=user_message,
+        system_prompt=system_message
+    )
+    # 5. Print the result
+    if response_content:
+        print("\nModel Response:")
+        print(response_content)
+    else:
+        print("\nFailed to get a response from the model.")
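`get_completion` above retries with a fixed 2-second pause. As a variation (not what the file does), the same loop can use exponential backoff. The sketch below is self-contained and makes no network calls; `call_with_backoff` and its parameters are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn up to max_retries times, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # broad on purpose, mirroring get_completion
            print(f"attempt {attempt + 1}/{max_retries} failed: {exc}")
            if attempt + 1 < max_retries:
                # Wait base_delay, then 2x, then 4x, ... before the next try.
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed.
    return None
```

Passing `sleep` as a parameter keeps the helper testable without real delays. In the client it could wrap the `chat.completions.create` call.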

config/__init__.py ADDED

	@@ -0,0 +1,21 @@

+from config.config import (
+    CLIENT_CONFIG,
+    CONFIG,
+    DATASET_CONFIG,
+    DYME_OPSD_CONFIG,
+    MODEL_CONFIG,
+    RL_CONFIG,
+    TRAINING_CONFIG,
+    save_config,
+)
+__all__ = [
+    "CLIENT_CONFIG",
+    "CONFIG",
+    "DATASET_CONFIG",
+    "DYME_OPSD_CONFIG",
+    "MODEL_CONFIG",
+    "RL_CONFIG",
+    "TRAINING_CONFIG",
+    "save_config",
+]

config/__pycache__/config_rlsd_chartqa.cpython-310.pyc ADDED

Binary file (4 kB)

config/__pycache__/config_trimode.cpython-310.pyc ADDED

Binary file (2.62 kB)

config/__pycache__/config_trimode_antidegen.cpython-310.pyc ADDED

Binary file (1.92 kB)

config/__pycache__/loader.cpython-312.pyc ADDED

Binary file (3.36 kB)

config/config_7B.py ADDED

	@@ -0,0 +1,82 @@

+import os
+import torch
+# ====== Model Configuration ======
+MODEL_CONFIG = {
+    "pretrained_model_path": "Qwen/Qwen2.5-VL-7B-Instruct",
+    "use_flash_attention_2": False,
+    "torch_dtype": "bfloat16",
+}
+# ====== Training Configuration ======
+TRAINING_CONFIG = {
+    "task": 'chart',
+    "num_gpus": 8,
+    "num_client": 8,
+    "dyme_args": {
+        "output_dir": '/path/to/dyme-qwen25_7B-chart-llava_cot',
+        "logging_steps": 1,
+        "num_generations": 8,
+        "max_completion_length": 300,
+        "per_device_train_batch_size": 1,
+        "gradient_accumulation_steps": 16,
+        "num_train_epochs": 10,
+        "learning_rate": 1e-5,
+        "bf16": True,
+        "gradient_checkpointing": False,
+        "ddp_find_unused_parameters": False,
+        "max_grad_norm": 1.0,
+        "save_steps": 100,
+        "weight_decay": 0.01,
+        "warmup_steps": 0,
+        "eval_strategy": "steps",
+        "eval_steps": 10000,
+        "beta": 0.0,  # GRPO specific
+        "loss_type": 'grpo',  # GRPO specific
+        "seed": 42,
+    },
+}
+RL_CONFIG = {
+    "answer_flag": "Answer:",
+    "end_flag": "<|im_end|>"
+}
+# ====== Client Configuration for Reward Calculation ======
+CLIENT_CONFIG = {
+    "client_type": "openai",
+    "api_key": "none",
+    "api_base": "http://127.0.0.1:%s/v1",
+    "timeout": 60,
+    "model_id": "Qwen/Qwen2.5-14B-Instruct-AWQ",
+    "init_port": 23333,
+    "num_server": 8
+}
+# ====== Dataset Configuration ======
+DATASET_CONFIG = {
+    "train_dataset": "/path/to/data/chartqa_output/llavacot/json/chartqa_train_processed.json",
+    "eval_dataset": "HuggingFaceM4/ChartQA",
+}
+CONFIG = {
+    "model": MODEL_CONFIG,
+    "training": TRAINING_CONFIG,
+    "rl": RL_CONFIG,
+    "client": CLIENT_CONFIG,
+    "dataset": DATASET_CONFIG,
+}
+def save_config(config, config_path="./config.json"):
+    import json
+    with open(config_path, "w") as f:
+        json.dump(config, f, indent=4)
+if __name__ == "__main__":
+    save_config(CONFIG)
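`CLIENT_CONFIG["api_base"]` carries a `%s` placeholder next to `init_port` and `num_server`. How the training code fills it is not shown in this file, so the sketch below rests on an assumption: the reward servers listen on consecutive ports starting at `init_port`. `server_urls` is a hypothetical helper name.

```python
from typing import Any, List, Mapping


def server_urls(client_config: Mapping[str, Any]) -> List[str]:
    """Expand the %s placeholder in api_base into one URL per reward server."""
    base = str(client_config["api_base"])
    first_port = int(client_config["init_port"])
    count = int(client_config["num_server"])
    # Assumption: server i listens on port init_port + i.
    return [base % (first_port + i) for i in range(count)]


example_config = {
    "api_base": "http://127.0.0.1:%s/v1",
    "init_port": 23333,
    "num_server": 8,
}
urls = server_urls(example_config)
```

Each URL can then be passed as `api_base` to one client instance.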

config/config_aok.py ADDED

	@@ -0,0 +1,119 @@

+import os
+import torch
+from data_utils.paths import project_path
+MODEL_CONFIG = {
+    "pretrained_model_path": "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
+    "use_flash_attention_2": True,
+    "torch_dtype": "bfloat16",
+}
+TRAINING_CONFIG = {
+    "task": 'world',
+    "num_gpus": 8,
+    "num_client": 8,
+    "dyme_args": {
+        "output_dir": '/path/to/dyme-aok-online',
+        "logging_steps": 1,
+        "num_generations": 4,
+        "max_completion_length": 300,
+        "per_device_train_batch_size": 1,
+        "gradient_accumulation_steps": 16,
+        "num_train_epochs": 10,
+        "learning_rate": 1e-5,
+        "bf16": True,
+        "gradient_checkpointing": False,
+        "ddp_find_unused_parameters": False,
+        "max_grad_norm": 1.0,
+        "save_strategy": "epoch",
+        "weight_decay": 0.01,
+        "warmup_steps": 0,
+        "beta": 0.0,
+        "loss_type": 'grpo',
+        "seed": 42,
+    },
+    "sft_args": {
+        "output_dir": '/path/to/sft-aok',
+        "logging_steps": 1,
+        "per_device_train_batch_size": 2,
+        "gradient_accumulation_steps": 4,
+        "num_train_epochs": 10,
+        "learning_rate": 1e-5,
+        "bf16": True,
+        "gradient_checkpointing": False,
+        "ddp_find_unused_parameters": False,
+        "max_grad_norm": 1.0,
+        "save_strategy": "epoch",
+        "weight_decay": 0.01,
+        "warmup_steps": 0,
+        "seed": 42,
+        "remove_unused_columns": False
+    },
+    "grpo_args":{
+        "output_dir": '/path/to/grpo-aok',
+        "logging_steps": 1,
+        "num_generations": 4,
+        "max_completion_length": 576,
+        "max_prompt_length": None,
+        "per_device_train_batch_size": 2,
+        "gradient_accumulation_steps": 4,
+        "num_train_epochs": 10,
+        "learning_rate": 1e-5,
+        "bf16": True,
+        "gradient_checkpointing": False,
+        "ddp_find_unused_parameters": False,
+        "max_grad_norm": 1.0,
+        "save_strategy": "epoch",
+        "weight_decay": 0.01,
+        "warmup_steps": 0,
+        "beta": 0.04,
+        "loss_type": 'grpo',
+        "seed": 42,
+    }
+}
+RL_CONFIG = {
+    "answer_flag": "Answer:",
+    "end_flag": "<|im_end|>"
+}
+# ====== Client Configuration for Reward Calculation ======
+CLIENT_CONFIG = {
+    "client_type": "openai",
+    "api_key": "none",
+    "api_base": "http://127.0.0.1:%s/v1",
+    "timeout": 60,
+    "model_id": "Qwen/Qwen2.5-14B-Instruct-AWQ",
+    "init_port": 23333,
+    "num_server": 8
+}
+# ====== Dataset Configuration ======
+DATASET_CONFIG = {
+    "train_dataset": project_path("data/aokvqa/train.json"),
+    "eval_dataset": "HuggingFaceM4/A-OKVQA",
+}
+# ====== Full Configuration ======
+CONFIG = {
+    "model": MODEL_CONFIG,
+    "training": TRAINING_CONFIG,
+    "rl": RL_CONFIG,
+    "client": CLIENT_CONFIG,
+    "dataset": DATASET_CONFIG,
+}
+# Save configuration to a file for reference
+def save_config(config, config_path="./config.json"):
+    import json
+    with open(config_path, "w") as f:
+        json.dump(config, f, indent=4)
+# Example usage to save config
+if __name__ == "__main__":
+    save_config(CONFIG)

config/config_llavacot.py ADDED

	@@ -0,0 +1,118 @@

+import os
+import torch
+# ====== Model Configuration ======
+MODEL_CONFIG = {
+    "pretrained_model_path": '/path/to/sft-llavaov-chart-llava_cot/checkpoint-802',  # two-stage grpo
+    "use_flash_attention_2": True,
+    "torch_dtype": "bfloat16",
+}
+# ====== Training Configuration ======
+TRAINING_CONFIG = {
+    "task": 'chart',
+    "num_gpus": 8,
+    "num_client": 8,
+    "dyme_args": {
+        "output_dir": '/path/to/dyme-llavaov-chart-llava_cot',
+        "logging_steps": 1,
+        "num_generations": 4,
+        "max_completion_length": 300,
+        "per_device_train_batch_size": 1,
+        "gradient_accumulation_steps": 16,
+        "num_train_epochs": 10,
+        "learning_rate": 8e-5,
+        "bf16": True,
+        "gradient_checkpointing": False,
+        "ddp_find_unused_parameters": False,
+        "max_grad_norm": 1.0,
+        "save_strategy": "epoch",
+        "weight_decay": 0.01,
+        "warmup_steps": 0,
+        "beta": 0.0,  # GRPO specific
+        "loss_type": 'grpo',  # GRPO specific
+        "seed": 42,
+    },
+    "sft_args": {
+        "output_dir": '/path/to/sft-llavaov-chart-llava_cot',
+        "logging_steps": 1,
+        "per_device_train_batch_size": 2,
+        "gradient_accumulation_steps": 4,
+        "num_train_epochs": 10,
+        "learning_rate": 1e-5,
+        "bf16": True,
+        "gradient_checkpointing": False,
+        "ddp_find_unused_parameters": False,
+        "max_grad_norm": 1.0,
+        "max_length": 4096,
+        "save_strategy": "epoch",
+        "weight_decay": 0.01,
+        "warmup_steps": 0,
+        "seed": 42,
+        "remove_unused_columns": False
+    },
+    "grpo_args":{
+        "output_dir": '/path/to/grpo-llavaov-chart-beta',
+        "logging_steps": 1,
+        "num_generations": 4,
+        "max_completion_length": 576,
+        "max_prompt_length": None,
+        "per_device_train_batch_size": 4,
+        "gradient_accumulation_steps": 4,
+        "num_train_epochs": 10,
+        "learning_rate": 1e-5,
+        "bf16": True,
+        "gradient_checkpointing": False,
+        "ddp_find_unused_parameters": False,
+        "max_grad_norm": 1.0,
+        "save_strategy": "epoch",
+        "weight_decay": 0.01,
+        "warmup_steps": 0,
+        "beta": 0.04,  # GRPO specific
+        "loss_type": 'grpo',  # GRPO specific
+        "seed": 42,
+    }
+}
+RL_CONFIG = {
+    "answer_flag": "Answer:",
+    "end_flag": "<|im_end|>"
+}
+# ====== Client Configuration for Reward Calculation ======
+CLIENT_CONFIG = {
+    "client_type": "openai",
+    "api_key": "none",
+    "api_base": "http://127.0.0.1:%s/v1",
+    "timeout": 60,
+    "model_id": "Qwen/Qwen2.5-14B-Instruct-AWQ",
+    "init_port": 23333,
+    "num_server": 8
+}
+# ====== Dataset Configuration ======
+DATASET_CONFIG = {
+    "train_dataset": "/path/to/data/chartqa_output/llavacot/json/chartqa_train_processed.json",
+    "eval_dataset": "HuggingFaceM4/ChartQA",
+}
+# ====== Full Configuration ======
+CONFIG = {
+    "model": MODEL_CONFIG,
+    "training": TRAINING_CONFIG,
+    "rl": RL_CONFIG,
+    "client": CLIENT_CONFIG,
+    "dataset": DATASET_CONFIG,
+}
+# Save configuration to a file for reference
+def save_config(config, config_path="./config.json"):
+    import json
+    with open(config_path, "w") as f:
+        json.dump(config, f, indent=4)
+# Example usage to save config
+if __name__ == "__main__":
+    save_config(CONFIG)

config/config_low.py ADDED

	@@ -0,0 +1,120 @@

+import os
+import torch
+# ====== Model Configuration ======
+MODEL_CONFIG = {
+    "pretrained_model_path": "/path/to/sft-llavaov-chart-low/checkpoint-296",
+    "use_flash_attention_2": True,
+    "torch_dtype": "bfloat16",
+}
+# ====== Training Configuration ======
+TRAINING_CONFIG = {
+    "task": 'chart',
+    "num_gpus": 8,
+    "num_client": 8,
+    "dyme_args": {
+        "output_dir": '/path/to/dyme-llavaov-chart-low',
+        "logging_steps": 1,
+        "num_generations": 4,
+        "max_completion_length": 300,
+        "per_device_train_batch_size": 1,
+        "gradient_accumulation_steps": 16,
+        "num_train_epochs": 10,
+        "learning_rate": 1e-5,
+        "bf16": True,
+        "gradient_checkpointing": False,
+        "ddp_find_unused_parameters": False,
+        "max_grad_norm": 1.0,
+        "save_strategy": "epoch",
+        "weight_decay": 0.01,
+        "warmup_steps": 0,
+        "beta": 0.0,  # GRPO specific
+        "loss_type": 'grpo',  # GRPO specific
+        "seed": 42,
+    },
+    "sft_args": {
+        "output_dir": '/path/to/sft-llavaov-chart-low',
+        "logging_steps": 1,
+        "per_device_train_batch_size": 2,
+        "gradient_accumulation_steps": 4,
+        "num_train_epochs": 10,
+        "learning_rate": 1e-5,
+        "bf16": True,
+        "gradient_checkpointing": False,
+        "ddp_find_unused_parameters": False,
+        "max_grad_norm": 1.0,
+        "max_length": 4096,
+        # "save_steps": 100,
+        "save_strategy": "epoch",
+        "weight_decay": 0.01,
+        "warmup_steps": 0,
+        "seed": 42,
+        "remove_unused_columns": False
+    },
+    "grpo_args":{
+        "output_dir": '/path/to/grpo-llavaov-chart-low',
+        "logging_steps": 1,
+        "num_generations": 4,
+        "max_completion_length": 576,
+        "max_prompt_length": None,
+        "per_device_train_batch_size": 2,
+        "gradient_accumulation_steps": 4,
+        "num_train_epochs": 10,
+        "learning_rate": 1e-5,
+        "bf16": True,
+        "gradient_checkpointing": False,
+        "ddp_find_unused_parameters": False,
+        "max_grad_norm": 1.0,
+        "save_strategy": "epoch",
+        "weight_decay": 0.01,
+        "warmup_steps": 0,
+        "beta": 0.0,  # GRPO specific
+        "loss_type": 'grpo',  # GRPO specific
+        "seed": 42,
+    }
+}
+RL_CONFIG = {
+    "answer_flag": "Answer:",
+    "end_flag": "<|im_end|>"
+}
+# ====== Client Configuration for Reward Calculation ======
+CLIENT_CONFIG = {
+    "client_type": "openai",
+    "api_key": "none",
+    "api_base": "http://127.0.0.1:%s/v1",
+    "timeout": 60,
+    "model_id": "Qwen/Qwen2.5-14B-Instruct-AWQ",
+    "init_port": 23333,
+    "num_server": 8
+}
+# ====== Dataset Configuration ======
+DATASET_CONFIG = {
+    "train_dataset": "/path/to/data/chartqa_output/json/train_low.json",
+    "eval_dataset": "HuggingFaceM4/ChartQA",
+}
+# ====== Full Configuration ======
+CONFIG = {
+    "model": MODEL_CONFIG,
+    "training": TRAINING_CONFIG,
+    "rl": RL_CONFIG,
+    "client": CLIENT_CONFIG,
+    "dataset": DATASET_CONFIG,
+}
+# Save configuration to a file for reference
+def save_config(config, config_path="./config.json"):
+    import json
+    with open(config_path, "w") as f:
+        json.dump(config, f, indent=4)
+# Example usage to save config
+if __name__ == "__main__":
+    save_config(CONFIG)

config/config_opd_7b_chartqa.py ADDED

	@@ -0,0 +1,48 @@

+"""
+COPSD-style cross-model OPD on ChartQA (Method 2).
+Frozen LLaVA-OneVision 7B teacher; student default 0.5B.
+Inherits RLSD routing + embedded SFT cold-start gates from config_rlsd_chartqa.
+"""
+import os
+import config.config_rlsd_chartqa as rlsd
+from data_utils.paths import OUTPUTS_DIR
+MODEL_CONFIG = {
+    **rlsd.MODEL_CONFIG,
+    "teacher_model_path": os.environ.get(
+        "DYME_TEACHER_MODEL",
+        "llava-hf/llava-onevision-qwen2-7b-ov-hf",
+    ),
+    "teacher_dtype": os.environ.get("DYME_TEACHER_DTYPE", "bfloat16"),
+    "teacher_device_map": os.environ.get("DYME_TEACHER_DEVICE_MAP") or None,
+}
+DYME_OPSD_CONFIG = {
+    **rlsd.DYME_OPSD_CONFIG,
+    "mode": os.environ.get("DYME_OPSD_MODE", "rlsd"),
+    "privileged_providers": [],
+    "loss": {
+        **rlsd.DYME_OPSD_CONFIG.get("loss", {}),
+        "opsd_weight": float(os.environ.get("DYME_OPSD_WEIGHT", "1.5")),
+    },
+}
+CONFIG = {
+    "model": MODEL_CONFIG,
+    "training": {
+        **rlsd.CONFIG["training"],
+        "dyme_args": {
+            **rlsd.CONFIG["training"]["dyme_args"],
+            "output_dir": os.environ.get(
+                "DYME_OUTPUT_DIR",
+                os.path.join(OUTPUTS_DIR, "opd-7b-chartqa"),
+            ),
+        },
+    },
+    "rl": rlsd.CONFIG["rl"],
+    "opsd": DYME_OPSD_CONFIG,
+    "client": rlsd.CONFIG["client"],
+    "dataset": rlsd.CONFIG["dataset"],
+}
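The config above overrides nested keys by unpacking the inner dict before adding the new value. The short sketch below shows why that matters, since `**` merges are shallow. The `base` dict and its values are made up for illustration and are not the repository's defaults.

```python
# Illustrative base config; the values are placeholders, not the repo's defaults.
base = {"mode": "trimode", "loss": {"opsd_weight": 1.0, "grpo_weight": 1.0}}

# Shallow override: the whole "loss" dict is replaced, so grpo_weight is lost.
shallow = {**base, "loss": {"opsd_weight": 1.5}}

# Nested override (the pattern used in the config): unpack the inner dict first,
# then set only the key that should change.
nested = {**base, "loss": {**base.get("loss", {}), "opsd_weight": 1.5}}
```

Neither form mutates `base`, which is why the inheriting config can safely import another config module and layer overrides on top of it.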

config/config_rlsd_chartqa.py ADDED

	@@ -0,0 +1,152 @@

+"""
+RLSD / anti-leakage ChartQA config (Method 1).
+- mode=rlsd: correct → GRPO, wrong → same-prompt OPSD, all-wrong group → online SFT
+- No gold answer / visual_facts in teacher privileged context
+- Hyperparameters based on config_trimode_antidegen
+"""
+import os
+import config.config_trimode_antidegen as antidegen
+from data_utils.paths import OUTPUTS_DIR
+MODEL_CONFIG = dict(antidegen.MODEL_CONFIG)
+TRAINING_CONFIG = dict(antidegen.TRAINING_CONFIG)
+_reward_weights_raw = os.environ.get("DYME_REWARD_WEIGHTS", "0.5,1.5,1.0")
+try:
+    _reward_weights = [float(x.strip()) for x in _reward_weights_raw.split(",") if x.strip()]
+    if len(_reward_weights) != 3:
+        raise ValueError("expected 3 weights")
+except ValueError:
+    _reward_weights = [0.5, 1.5, 1.0]
+_providers_raw = os.environ.get("DYME_OPSD_PROVIDERS", "format_only").strip()
+_privileged_providers = [p.strip() for p in _providers_raw.split(",") if p.strip()] if _providers_raw else []
+_skip_degen_env = os.environ.get("DYME_OPSD_SKIP_DEGENERATE", "").strip().lower()
+# Default is True; only an explicit falsy value disables skipping.
+_skip_degenerate_for_opsd = _skip_degen_env not in ("0", "false", "no", "off")
+# Embedded SFT cold-start + RLSD warmup gates (env overrides optional).
+_RLSD_GATE_DEFAULTS = {
+    "skip_degenerate_for_opsd": _skip_degenerate_for_opsd,
+    "degen_skip_warmup_steps": 200,
+    "sft_warmup_steps": 500,
+    "sft_warmup_slots_per_group": 4,
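+    # Cold-start phase, sized by this fraction (or by a fixed step count when
+    # DYME_SFT_COLD_START_STEPS is set): skip generate, 100% GT injection,
+    # pure SFT NLL (no OPSD/GRPO).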
+    "sft_cold_start_frac": 0.08,
+}
+DYME_OPSD_CONFIG = {
+    **antidegen.DYME_OPSD_CONFIG,
+    "mode": os.environ.get("DYME_OPSD_MODE", "rlsd"),
+    "text_include_gold": False,
+    "privileged_profile": os.environ.get("DYME_OPSD_PRIVILEGE_PROFILE", "text"),
+    "privileged_providers": _privileged_providers,
+    "gate": {
+        **antidegen.DYME_OPSD_CONFIG.get("gate", {}),
+        "per_completion_opsd": True,
+        "recoverable_without_privilege": True,
+        "require_format_for_opsd": os.environ.get("DYME_OPSD_REQUIRE_FORMAT", "0").strip().lower()
+        not in ("0", "false", "no", "off"),
+        "online_sft_on_all_wrong": True,
+        # ChartQA short numeric answers lack "Answer:" — do not block OPSD on format alone
+        "opsd_degenerate_require_answer_flag": False,
+        **_RLSD_GATE_DEFAULTS,
+    },
+    "loss": {
+        **antidegen.DYME_OPSD_CONFIG.get("loss", {}),
+        "acc_gate": True,
+        "opsd_weight": float(os.environ.get("DYME_OPSD_WEIGHT", "1.5")),
+        "grpo_weight": 1.0,
+    },
+    "reward_weights": _reward_weights,
+}
+_dyme_args = {
+    **TRAINING_CONFIG["dyme_args"],
+    "output_dir": os.environ.get(
+        "DYME_OUTPUT_DIR",
+        os.path.join(OUTPUTS_DIR, "rlsd-chartqa"),
+    ),
+    # Mitigate early RL collapse (newline + bare number + immediate EOS)
+    "max_completion_length": 96,
+    "temperature": 0.5,
+    "repetition_penalty": 1.5,
+}
+_max_steps_raw = os.environ.get("DYME_MAX_STEPS", "").strip()
+if _max_steps_raw:
+    _dyme_args["max_steps"] = int(_max_steps_raw)
+_temp_raw = os.environ.get("DYME_TEMPERATURE", "").strip()
+if _temp_raw:
+    _dyme_args["temperature"] = float(_temp_raw)
+_rep_raw = os.environ.get("DYME_REPETITION_PENALTY", "").strip()
+if _rep_raw:
+    _dyme_args["repetition_penalty"] = float(_rep_raw)
+_max_len_raw = os.environ.get("DYME_MAX_COMPLETION_LENGTH", "").strip()
+if _max_len_raw:
+    _dyme_args["max_completion_length"] = int(_max_len_raw)
+# Keep module-level TRAINING_CONFIG in sync so imports of TRAINING_CONFIG["dyme_args"] match CONFIG.
+TRAINING_CONFIG = {**TRAINING_CONFIG, "dyme_args": _dyme_args}
+# Optional env overrides for gate defaults (see _RLSD_GATE_DEFAULTS above).
+_degen_warmup_raw = os.environ.get("DYME_OPSD_DEGEN_WARMUP_STEPS", "").strip()
+if _degen_warmup_raw:
+    DYME_OPSD_CONFIG["gate"]["degen_skip_warmup_steps"] = int(_degen_warmup_raw)
+_sft_warmup_raw = os.environ.get("DYME_SFT_WARMUP_STEPS", "").strip()
+if _sft_warmup_raw:
+    DYME_OPSD_CONFIG["gate"]["sft_warmup_steps"] = int(_sft_warmup_raw)
+_sft_slots_raw = os.environ.get("DYME_SFT_WARMUP_SLOTS", "").strip()
+if _sft_slots_raw:
+    DYME_OPSD_CONFIG["gate"]["sft_warmup_slots_per_group"] = int(_sft_slots_raw)
+_cold_start_steps_raw = os.environ.get("DYME_SFT_COLD_START_STEPS", "").strip()
+if _cold_start_steps_raw:
+    DYME_OPSD_CONFIG["gate"]["sft_cold_start_steps"] = int(_cold_start_steps_raw)
+    DYME_OPSD_CONFIG["gate"].pop("sft_cold_start_frac", None)
+else:
+    _cold_start_frac_raw = os.environ.get("DYME_SFT_COLD_START_FRAC", "").strip()
+    if _cold_start_frac_raw:
+        DYME_OPSD_CONFIG["gate"]["sft_cold_start_frac"] = float(_cold_start_frac_raw)
+CONFIG = {
+    "model": MODEL_CONFIG,
+    "training": {
+        **TRAINING_CONFIG,
+        "dyme_args": _dyme_args,
+        "sft_args": {
+            "output_dir": os.environ.get(
+                "DYME_SFT_OUTPUT_DIR",
+                os.path.join(OUTPUTS_DIR, "chartqa-sft"),
+            ),
+            "logging_steps": 10,
+            "per_device_train_batch_size": 2,
+            "gradient_accumulation_steps": 4,
+            "num_train_epochs": int(os.environ.get("DYME_SFT_EPOCHS", "2")),
+            "learning_rate": 1e-5,
+            "bf16": True,
+            "gradient_checkpointing": True,
+            "ddp_find_unused_parameters": False,
+            "max_grad_norm": 1.0,
+            "save_strategy": "epoch",
+            "weight_decay": 0.01,
+            "warmup_steps": 0,
+            "seed": 42,
+            "remove_unused_columns": False,
+        },
+    },
+    "rl": antidegen.CONFIG["rl"],
+    "opsd": DYME_OPSD_CONFIG,
+    "client": antidegen.CONFIG["client"],
+    "dataset": antidegen.CONFIG["dataset"],
+}
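The config above repeats the same pattern for environment overrides: read the variable, strip it, and fall back to a default when it is unset. That pattern can be factored into helpers, sketched below. `env_flag` and `env_int` are hypothetical names, not part of this repository. Unlike the inline `int(...)` calls above, `env_int` falls back to the default on a malformed value instead of raising.

```python
import os
from typing import Mapping, Optional

_FALSY = ("0", "false", "no", "off")
_TRUTHY = ("1", "true", "yes", "on")


def env_flag(name: str, default: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Parse a boolean override; unset or unrecognised values keep the default."""
    source = os.environ if env is None else env
    raw = source.get(name, "").strip().lower()
    if raw in _FALSY:
        return False
    if raw in _TRUTHY:
        return True
    return default


def env_int(name: str, default: int, env: Optional[Mapping[str, str]] = None) -> int:
    """Parse an integer override; unset or malformed values keep the default."""
    source = os.environ if env is None else env
    raw = source.get(name, "").strip()
    try:
        return int(raw) if raw else default
    except ValueError:
        return default
```

For example, `env_int("DYME_OPSD_DEGEN_WARMUP_STEPS", 200)` would reproduce the gate default above when the variable is unset.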

config/config_trimode.py ADDED

	@@ -0,0 +1,88 @@

+import os
+from config import CLIENT_CONFIG, DATASET_CONFIG, DYME_OPSD_CONFIG, MODEL_CONFIG, RL_CONFIG, TRAINING_CONFIG
+from data_utils.paths import OUTPUTS_DIR
+MODEL_CONFIG = dict(MODEL_CONFIG)
+TRAINING_CONFIG = {
+    **TRAINING_CONFIG,
+    "dyme_args": {
+        **TRAINING_CONFIG["dyme_args"],
+        "output_dir": os.environ.get("DYME_OUTPUT_DIR", os.path.join(OUTPUTS_DIR, "dyme-trimode")),
+    },
+}
+_detail_every_raw = os.environ.get("DYME_OPSD_DETAIL_EVERY", "10")
+try:
+    _detail_every = max(0, int(_detail_every_raw))
+except ValueError:
+    _detail_every = 10
+_probe_raw = os.environ.get("DYME_OPSD_PROBE_ON_GENERATE", "1").strip().lower()
+_probe_on_generate = _probe_raw not in ("0", "false", "no", "off")
+_first_logits_raw = os.environ.get("DYME_OPSD_PROBE_FIRST_TOKEN_LOGITS", "1").strip().lower()
+_probe_first_token_logits = _first_logits_raw not in ("0", "false", "no", "off")
+_tail_raw = os.environ.get("DYME_OPSD_PROBE_PROMPT_TAIL_TOKENS", "16").strip()
+try:
+    _probe_prompt_tail_tokens = max(1, int(_tail_raw))
+except ValueError:
+    _probe_prompt_tail_tokens = 16
+_model_ctx_raw = os.environ.get("DYME_OPSD_PROBE_LOG_MODEL_CONTEXT", "1").strip().lower()
+_probe_log_model_context = _model_ctx_raw not in ("0", "false", "no", "off")
+_health_raw = os.environ.get("DYME_OPSD_HEALTH_MONITOR", "1").strip().lower()
+_health_monitor_enabled = _health_raw not in ("0", "false", "no", "off")
+_require_format_raw = os.environ.get("DYME_OPSD_REQUIRE_FORMAT", "1").strip().lower()
+_require_format_for_opsd = _require_format_raw not in ("0", "false", "no", "off")
+DYME_OPSD_CONFIG = {
+    **DYME_OPSD_CONFIG,
+    "enabled": True,
+    "mode": os.environ.get("DYME_OPSD_MODE", "trimode"),
+    "privileged_profile": os.environ.get("DYME_OPSD_PRIVILEGE_PROFILE", "hybrid"),
+    "privileged_providers": os.environ.get("DYME_OPSD_PROVIDERS", "text,visual_facts").split(","),
+    "privileged_image": {
+        **DYME_OPSD_CONFIG.get("privileged_image", {}),
+        "mode": os.environ.get("DYME_OPSD_PRIVILEGE_IMAGE_MODE", "single"),
+        "crop_strategy": os.environ.get("DYME_OPSD_CROP_STRATEGY", "bbox_then_center"),
+        "bbox_coord": "normalized",
+        "margin_ratio": float(os.environ.get("DYME_OPSD_CROP_MARGIN", "0.25")),
+    },
+    "privileged_debug": {
+        **DYME_OPSD_CONFIG.get("privileged_debug", {}),
+        "save_images": os.environ.get("DYME_OPSD_SAVE_PRIVILEGED_IMAGES", "1").strip().lower()
+        not in ("0", "false", "no", "off"),
+        "image_subdir": os.environ.get("DYME_OPSD_PRIVILEGED_IMAGE_DIR", "logs/images"),
+        "max_samples_per_detail": int(os.environ.get("DYME_OPSD_PRIVILEGED_IMAGE_MAX", "2")),
+    },
+    "gate": {
+        **DYME_OPSD_CONFIG.get("gate", {}),
+        "require_format_for_opsd": _require_format_for_opsd,
+    },
+    "debug": {
+        **DYME_OPSD_CONFIG.get("debug", {}),
+        "detail_every": _detail_every,
+        "probe_on_generate": _probe_on_generate,
+        "probe_first_token_logits": _probe_first_token_logits,
+        "probe_prompt_tail_tokens": _probe_prompt_tail_tokens,
+        "probe_log_model_context": _probe_log_model_context,
+        "health_monitor": {
+            **DYME_OPSD_CONFIG.get("debug", {}).get("health_monitor", {}),
+            "enabled": _health_monitor_enabled,
+        },
+    },
+}
+CONFIG = {
+    "model": MODEL_CONFIG,
+    "training": TRAINING_CONFIG,
+    "rl": RL_CONFIG,
+    "opsd": DYME_OPSD_CONFIG,
+    "client": CLIENT_CONFIG,
+    "dataset": DATASET_CONFIG,
+}
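
The `DYME_OPSD_*` overrides in `config/config_trimode.py` repeat two parsing patterns: a boolean flag that stays on unless the variable is set to `0`, `false`, `no` or `off`, and an integer that is clamped to a lower bound and falls back to its default on a parse error. A minimal sketch of how these could be factored into helpers is shown here; `env_flag` and `env_int` are hypothetical names for illustration and are not part of this commit.

```python
import os

# Strings treated as "off", matching the tuples used in config_trimode.py.
_FALSY = ("0", "false", "no", "off")


def env_flag(name: str, default: str = "1") -> bool:
    """Return False only when the variable holds a known falsy string."""
    raw = os.environ.get(name, default).strip().lower()
    return raw not in _FALSY


def env_int(name: str, default: int, minimum: int = 0) -> int:
    """Parse an integer, clamp it to `minimum`, and fall back to `default` on bad input."""
    raw = os.environ.get(name, str(default)).strip()
    try:
        return max(minimum, int(raw))
    except ValueError:
        return default
```

Under this sketch, `_probe_on_generate` would become `env_flag("DYME_OPSD_PROBE_ON_GENERATE")` and `_probe_prompt_tail_tokens` would become `env_int("DYME_OPSD_PROBE_PROMPT_TAIL_TOKENS", 16, minimum=1)`.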

default_config_8gpu.yaml ADDED

	@@ -0,0 +1,16 @@

+compute_environment: LOCAL_MACHINE
+debug: false
+distributed_type: MULTI_GPU
+downcast_bf16: 'no'
+enable_cpu_affinity: false
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+num_processes: 8
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false

default_config_8gpu_deepspeed.yaml ADDED

	@@ -0,0 +1,21 @@

+compute_environment: LOCAL_MACHINE
+debug: false
+deepspeed_config:
+  gradient_accumulation_steps: 16
+  zero3_init_flag: false
+  zero_stage: 0
+distributed_type: DEEPSPEED
+downcast_bf16: 'no'
+enable_cpu_affinity: false
+machine_rank: 0
+main_training_function: main
+mixed_precision: bf16
+num_machines: 1
+# Optional ZeRO-0 for 8-GPU nodes. Requires: pip install deepspeed
+num_processes: 8
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false

default_config_zero2_8gpu.yaml ADDED

	@@ -0,0 +1,18 @@

+compute_environment: LOCAL_MACHINE
+debug: false
+deepspeed_config:
+  deepspeed_config_file: configs/deepspeed/zero2_bf16.json
+  zero3_init_flag: false
+distributed_type: DEEPSPEED
+downcast_bf16: 'no'
+enable_cpu_affinity: false
+machine_rank: 0
+main_training_function: main
+num_machines: 1
+num_processes: 8
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false

eval/eval_chartqa.py ADDED

	@@ -0,0 +1,310 @@

+import argparse
+import os
+import torch
+from PIL import Image
+from accelerate import Accelerator
+# Ensure this path is correct and the utility is available.
+from datasets import load_dataset
+from torch.distributed import all_gather_object
+from transformers import AutoProcessor, AutoConfig, AutoTokenizer, LlavaOnevisionForConditionalGeneration
+from trl.models import unwrap_model_for_generation
+from data_utils.chart.evaluator import eval_one_chart
+from data_utils.rl_prompt import PROMPT_TEMPLATE
+from reward_utils.compute_rewards import split_initial_context
+accelerator = Accelerator()
+from tqdm import tqdm
+import numpy as np
+DEVICE = accelerator.device
+# Model and Processor Configuration
+model_args = {}  # Use {"torch_dtype": torch.bfloat16} if desired and supported
+_eval_parser = argparse.ArgumentParser(add_help=False)
+_eval_parser.add_argument("--model_path", default=None)
+_eval_args, _ = _eval_parser.parse_known_args()
+model_id = (
+    _eval_args.model_path
+    or os.environ.get("CHECKPOINT_DIR")
+    or "/path/to/dyme-k-8/final_checkpoint"
+)
+config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained(model_id, config=config, trust_remote_code=True)
+model = LlavaOnevisionForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    low_cpu_mem_usage=True,
+).to(DEVICE)
+model.eval()
+# Make sure model and processor are loaded before being potentially used in generate_inner if it were called
+# model = Idefics3ForConditionalGeneration.from_pretrained(model_id, **model_args).to(DEVICE)
+processor = AutoProcessor.from_pretrained(model_id)
+# Batched generation with a decoder-only model needs left padding so that
+# the generated tokens directly follow each prompt.
+processor.tokenizer.padding_side = 'left'
+# Optional: raise the image resolution. This can consume significant VRAM, so it is left disabled.
+# if isinstance(getattr(processor.image_processor, 'size', None), dict):
+#     if 'longest_edge' in processor.image_processor.size:
+#         print('Setting image processor longest_edge to 2048')
+#         processor.image_processor.size['longest_edge'] = 512 * 4
+def run_kh_batch(batch_data_list):  # Renamed from run_kh, takes a batch
+    batch_images = []
+    batch_formatted_prompts_for_chat_template = []
+    for item in batch_data_list:
+        image_path = item['image_path']
+        # 'item_model_input_text' already contains chart instructions + raw_question
+        item_model_input_text = item['model_input_text'].strip()
+        # question_with_tags = prompt + item_model_input_text
+        # question_with_tags = f"""{item_model_input_text} Think step by step and then answer the question."""
+        question_with_tags = PROMPT_TEMPLATE.format(question=item_model_input_text)
+        if isinstance(image_path, str):
+            image = Image.open(image_path).convert("RGB")
+        else:
+            image = image_path.convert("RGB")  # Assuming image_path is already a PIL Image object
+        batch_images.append(image)
+        messages = [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "image"},
+                    {"type": "text", "text": question_with_tags},
+                ]
+            },
+        ]
+        try:
+            templated_prompt_str = processor.apply_chat_template(messages, add_generation_prompt=True)
+            templated_prompt_str = templated_prompt_str.strip()
+        except Exception:
+            templated_prompt_str = f"USER: <image>\n{question_with_tags}\nASSISTANT:"
+        batch_formatted_prompts_for_chat_template.append(templated_prompt_str)
+    inputs = processor(
+        text=batch_formatted_prompts_for_chat_template,
+        images=batch_images,
+        return_tensors="pt",
+        padding=True,
+        truncation=True
+    )
+    # inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
+    inputs = {
+        k: v.to(DEVICE).to(torch.bfloat16) if v.is_floating_point() else v.to(DEVICE)
+        for k, v in inputs.items()
+    }
+    with unwrap_model_for_generation(model, accelerator) as unwrapped_model_instance:
+        generated_ids = unwrapped_model_instance.generate(
+            **inputs,
+            max_new_tokens=1024,
+            do_sample=False,
+        )
+    input_ids_length = inputs['input_ids'].shape[1]
+    newly_generated_ids = generated_ids[:, input_ids_length:]
+    generated_texts = processor.batch_decode(
+        newly_generated_ids,
+        skip_special_tokens=True,  # Special tokens like <eos> are removed. <image> might be too.
+    )
+    return [text.strip('.').strip() for text in generated_texts]
+# --- Main Evaluation Logic ---
+task = 'chart'
+# dt_record_local is initialized inside the if task == 'chart' block
+if task == 'chart':
+    dt_record_local = {}  # Store results for the current process
+    if accelerator.is_main_process:
+        print("Loading ChartQA dataset...")
+    try:
+        full_dataset = load_dataset("HuggingFaceM4/ChartQA", trust_remote_code=True)['test']
+    except Exception as e:
+        if accelerator.is_main_process:
+            print(f"Failed to load dataset directly. Error: {e}")
+            print("Attempting to load with specific revision if applicable, or check path/connection.")
+        # For example, you can try a specific revision (if known) or ensure path and network connection are correct
+        # full_dataset = load_dataset("HuggingFaceM4/ChartQA", revision="main", trust_remote_code=True)['test']
+        raise  # Re-raise the exception since we cannot proceed without the dataset
+    # full_dataset = full_dataset.select(range(80)) # Uncomment for quick testing
+    eval_datasets_all_prepared = []
+    # chart_instructions_prefix = (
+    #     "For the question below, follow the following instructions:\n"
+    #     # ... (your detailed instructions) ...
+    #     + "-Try to include the full label from the graph when asked about an entity.\n"
+    #     + "Question: "
+    # )
+    for d_item in tqdm(full_dataset, desc="Preparing dataset", disable=not accelerator.is_main_process):
+        image_path = d_item['image']
+        raw_question = d_item['query']
+        answer_list = d_item.get('label')  # Use .get() in case 'label' field does not exist
+        if not answer_list:  # If 'label' is missing or an empty list
+            if accelerator.is_main_process:
+                tqdm.write(
+                    f"Warning: Item missing 'label' or 'label' is empty. Query: {raw_question[:50]}..."
+                )
+            # Decide how to handle this: skip this sample or use a default answer
+            continue  # Skip this sample
+        answer = answer_list[0]
+        model_input_text_for_template = raw_question
+        eval_datasets_all_prepared.append({
+            'image_path': image_path,
+            'model_input_text': model_input_text_for_template,
+            'answer': answer,
+            'original_question': raw_question
+        })
+    num_processes = accelerator.num_processes
+    process_index = accelerator.process_index
+    total_items = len(eval_datasets_all_prepared)
+    if total_items == 0:
+        if accelerator.is_main_process:
+            print("No data prepared for evaluation after filtering. Exiting chart evaluation.")
+    else:
+        items_per_proc = total_items // num_processes
+        extra_items = total_items % num_processes
+        local_start_index = process_index * items_per_proc + min(process_index, extra_items)
+        num_local_items = items_per_proc + (1 if process_index < extra_items else 0)
+        local_end_index = local_start_index + num_local_items
+        eval_datasets_local = eval_datasets_all_prepared[local_start_index:local_end_index]
+        BATCH_SIZE = 32  # Adjust according to your VRAM
+        REPORT_INTERVAL_BATCHES = 1  # Report once every N local batches (main process prints global stats)
+        # if accelerator.is_main_process:
+        #     print(f"Total items for evaluation: {total_items}")
+        #     print(f"Process {process_index} handling {len(eval_datasets_local)} items.")
+        #     print(f"Batch size per process: {BATCH_SIZE}, Reporting interval: {REPORT_INTERVAL_BATCHES} local batches.")
+        pbar = None
+        if accelerator.is_main_process and len(eval_datasets_local) > 0:  # Create pbar only if there is data
+            pbar = tqdm(total=len(eval_datasets_local), desc=f"Eval Proc {process_index}", dynamic_ncols=True)
+        dt_record_local['res'] = []
+        num_local_batches = (len(eval_datasets_local) + BATCH_SIZE - 1) // BATCH_SIZE
+        for batch_idx_local in range(num_local_batches):
+            start_idx = batch_idx_local * BATCH_SIZE
+            end_idx = min((batch_idx_local + 1) * BATCH_SIZE, len(eval_datasets_local))
+            current_batch_list = eval_datasets_local[start_idx:end_idx]
+            if not current_batch_list:
+                continue
+            batch_predictions_texts = run_kh_batch(current_batch_list)
+            for item_idx_in_batch, full_pred_text in enumerate(batch_predictions_texts):
+                original_item = current_batch_list[item_idx_in_batch]
+                ground_truth_answer = original_item['answer']
+                _, parsed_pred_answer = split_initial_context(full_pred_text)
+                if not parsed_pred_answer.strip():
+                    parsed_pred_answer = full_pred_text  # Fallback to full prediction if parsed answer is empty
+                score = eval_one_chart(parsed_pred_answer, ground_truth_answer)
+                dt_record_local['res'].append(score)
+                # Main process prints each prediction, its ground truth, and the score
+                if accelerator.is_main_process:
+                    print(full_pred_text, "######", ground_truth_answer, "######", score)
+            if pbar:
+                pbar.update(len(current_batch_list))
+            # --- Intermediate reporting logic ---
+            is_last_local_batch = (batch_idx_local == num_local_batches - 1)
+            # Every REPORT_INTERVAL_BATCHES local batches, or on the last local batch of this process,
+            # perform synchronization and reporting
+            should_sync_and_report = ((batch_idx_local + 1) % REPORT_INTERVAL_BATCHES == 0) or is_last_local_batch
+            # Inside this loop the local shard is never empty, so no empty-shard special case is needed.
+            # Note: all_gather_object is a collective call, so every rank must reach it the same number
+            # of times. Ranks whose shards yield different batch counts would wait on each other.
+            if should_sync_and_report:
+                accelerator.wait_for_everyone()  # Wait for all processes to reach the sync point
+                gathered_all_processes_data = [None] * num_processes
+                # Each process sends its *current accumulated* dt_record_local
+                # If a process has no data, dt_record_local['res'] is an empty list, which is fine
+                all_gather_object(gathered_all_processes_data, dt_record_local)
+                if accelerator.is_main_process:
+                    current_global_scores_list = []
+                    for process_data_dict in gathered_all_processes_data:
+                        if process_data_dict and 'res' in process_data_dict:
+                            current_global_scores_list.extend(process_data_dict['res'])
+                    total_samples_processed_globally = len(current_global_scores_list)
+                    report_title = "--- Intermediate Report ---"
+                    # Check whether this is the final reporting point where all processes have finished
+                    # A simple heuristic: if this is the last local batch on the main process
+                    # and the total collected samples equal the total number of items
+                    if is_last_local_batch and total_samples_processed_globally == total_items:
+                        report_title = "--- Final Report ---"
+                    elif is_last_local_batch:  # Last batch on main process but perhaps not all samples are done yet
+                        report_title = (
+                            f"--- Report (Main Proc Last Batch, {batch_idx_local + 1}/{num_local_batches}) ---"
+                        )
+                    tqdm.write(f"\n{report_title}")  # Use tqdm.write to avoid clashing with the progress bar
+                    if current_global_scores_list:
+                        mean_acc_global = np.array(current_global_scores_list).mean()
+                        print(f"Global samples processed: {total_samples_processed_globally} / {total_items}")
+                        print(f"Current Global Mean Accuracy: {mean_acc_global:.4f}")
+                        if pbar:
+                            pbar.set_description(
+                                f"Global Acc: {mean_acc_global:.4f} ({total_samples_processed_globally}/{total_items})"
+                            )
+                    else:
+                        print(
+                            f"No scores to report globally yet (Total processed: {total_samples_processed_globally})."
+                        )
+                accelerator.wait_for_everyone()  # Sync again after reporting in case some processes move ahead faster
+        if pbar:
+            pbar.close()
+        # Final metrics have already been printed in the last report
+        # (when is_last_local_batch is True)
+        if accelerator.is_main_process and len(eval_datasets_local) == 0 and total_items > 0:
+            print(
+                "Main process had no data, but other processes might have. "
+                "Final global metrics are printed by the last reporting sync."
+            )
+        elif accelerator.is_main_process and total_items == 0:
+            print("No data was prepared for evaluation. Nothing to report.")
+else:
+    if accelerator.is_main_process:
+        print(f"Task '{task}' is not configured for batched evaluation in this script.")
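
The evaluation script splits the prepared items into contiguous per-rank shards and then batches each shard independently. The sketch below reproduces that arithmetic in isolation so the shard sizes and batch counts can be inspected without a GPU; `shard_bounds` and `num_batches` are hypothetical helper names, and the sizes (257 items, 8 ranks) are made up for illustration. Because the loop calls `all_gather_object` once per local batch when `REPORT_INTERVAL_BATCHES` is 1, a mismatch in batch counts across ranks is worth checking for before launching a run.

```python
def shard_bounds(total_items: int, num_processes: int, process_index: int) -> tuple[int, int]:
    """Contiguous [start, end) slice for one rank; the first `extra` ranks get one more item."""
    per_proc, extra = divmod(total_items, num_processes)
    start = process_index * per_proc + min(process_index, extra)
    end = start + per_proc + (1 if process_index < extra else 0)
    return start, end


def num_batches(n_items: int, batch_size: int) -> int:
    """Ceiling division, as used for num_local_batches in the evaluation loop."""
    return (n_items + batch_size - 1) // batch_size


# Hypothetical sizes: 257 items over 8 ranks with a batch size of 32.
bounds = [shard_bounds(257, 8, rank) for rank in range(8)]
batches = [num_batches(end - start, 32) for start, end in bounds]

print(bounds[0])   # (0, 33)
print(batches)     # [2, 1, 1, 1, 1, 1, 1, 1]
```

In this example rank 0 runs two batches while the other ranks run one, so rank 0 would enter a second gather that no other rank joins. One possible remedy, stated here as an assumption rather than the author's design, is to gather results only once after every rank has finished its loop.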

figs/chartqa.png ADDED

kill_all.sh ADDED

	@@ -0,0 +1,55 @@

+#!/bin/bash
+set -e
+readonly WORKER_HOSTS=(
+    "xx.xx.xx.xx"
+)
+readonly REMOTE_USER="root"
+readonly TRAIN_SCRIPT="main"
+echo "--- Killing local processes matching '${TRAIN_SCRIPT}' first ---"
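+pkill -9 -f "${TRAIN_SCRIPT}" || true
+# pkill exits non-zero when nothing matches; '|| true' keeps 'set -e' from aborting the script.
+pkill -f python || true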
+echo "Local check complete."
+echo
+echo "🛑 Sending targeted kill signal to processes matching '${TRAIN_SCRIPT}' on all remote hosts in parallel..."
+for HOST in "${WORKER_HOSTS[@]}"; do
+    (
+        echo "--- Processing host: ${HOST} ---"
+        ssh -n "${REMOTE_USER}@${HOST}" "
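+            set -e # The remote script should also stop on errors
+            pkill -f python || true
+            # Look only for processes launched by python whose command line contains the TRAIN_SCRIPT name
+            # This avoids killing unrelated processes with a similar name (e.g. a shell script called 'main_rebuttal')
+            PIDS=\$(pgrep -f \"python.*${TRAIN_SCRIPT}\" || true)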
+            if [ -z \"\$PIDS\" ]; then
+                echo '[INFO] ✅ No matching processes found on this host.'
+            else
+                echo '[WARN] 🔥 Found processes to kill:'
+                # Show process details before killing, as a safety check
+                ps -fp \$PIDS
+                echo '[KILL] Killing PIDs: '\$PIDS'...'
+                kill -9 \$PIDS
+                echo '[OK] ✅ Processes killed successfully.'
+            fi
+        "
+        echo "--- Finished host: ${HOST} ---"
+        echo
+    ) &
+done
+wait
+echo "🎉 All hosts have been processed."

main.py ADDED

	@@ -0,0 +1,522 @@

+# main.py
+"""
+Main script for training a LLaVA-based model using the custom DyMETrainer.
+This script handles:
+1. Configuration loading.
+2. Initialization of Weights & Biases (wandb) and Hugging Face Accelerate.
+3. Loading the model and processor.
+4. Preparing the training and evaluation datasets.
+5. Setting up and running the GRPO trainer.
+"""
+import argparse
+import os
+from functools import partial
+from typing import Dict, Any
+import torch
+import wandb
+from accelerate import Accelerator
+from datasets import Dataset, load_dataset
+from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
+from trl import GRPOConfig
+from config.loader import load_config
+from data_utils.commom_util import collate_fn, define_task_data_func
+from trainer.DyMETrainer import DyMETrainer
+from reward_utils.checker import RewardCalculator, RewardCalculatorLocal
+from reward_utils.refiner import ContextRefiner, ContextRefinerLocal
+from opsd_utils import debug_log as opsd_debug
+from opsd_utils.teacher_batching import (
+    log_teacher_placement,
+    resolve_teacher_device_map,
+)
+from opsd_utils.deepspeed_utils import (
+    deepspeed_zero_stage,
+    gradient_checkpointing_enable_kwargs,
+    is_deepspeed_accelerate_config,
+    should_disable_gradient_checkpointing,
+    uses_deepspeed_json_file,
+)
+def _run_cross_model_vocab_checks(model, processor, teacher_model, model_config: Dict[str, Any]) -> None:
+    """Startup checks for cross-model OPD vocab slice + tokenizer alignment."""
+    from transformers import AutoProcessor
+    from opsd_utils.vocab_align import print_vocab_align_report, verify_shared_tokenizer_alignment
+    student_vocab = getattr(model.config, "vocab_size", len(processor.tokenizer))
+    teacher_vocab = getattr(teacher_model.config, "vocab_size", student_vocab)
+    shared = min(student_vocab, teacher_vocab)
+    print(
+        f"[OPSD-VOCAB] lm_head widths: student={student_vocab} teacher={teacher_vocab} "
+        f"shared_slice={shared}",
+        flush=True,
+    )
+    if student_vocab == teacher_vocab:
+        print("[OPSD-VOCAB] vocab sizes match — no slice needed", flush=True)
+        return
+    teacher_path = model_config.get("teacher_model_path")
+    teacher_processor = AutoProcessor.from_pretrained(teacher_path)
+    full_scan = os.environ.get("DYME_VOCAB_ALIGN_FULL", "0").strip().lower() in ("1", "true", "yes")
+    stride = int(os.environ.get("DYME_VOCAB_ALIGN_STRIDE", "500"))
+    report = verify_shared_tokenizer_alignment(
+        processor.tokenizer,
+        teacher_processor.tokenizer,
+        shared_vocab=shared,
+        full_scan=full_scan,
+        sample_stride=stride,
+    )
+    print_vocab_align_report(report)
+def _wandb_disabled_by_env() -> bool:
+    if os.environ.get("WANDB_DISABLED", "").lower() in ("true", "1", "yes", "on"):
+        return True
+    if os.environ.get("WANDB_MODE", "").lower() in ("disabled", "off"):
+        return True
+    return False
+def _try_wandb_login() -> bool:
+    """Return True if wandb credentials are available (env, offline, or prior login)."""
+    if os.environ.get("WANDB_MODE", "").lower() == "offline":
+        return True
+    wandb_key = os.environ.get("WANDB_API_KEY")
+    if wandb_key:
+        wandb.login(key=wandb_key)
+        return True
+    try:
+        wandb.login(relogin=False)
+        key = wandb.api.api_key
+        return bool(key and len(key) >= 40)
+    except Exception:
+        return False
+def setup_accelerator_and_wandb(bf16, want_wandb: bool) -> tuple[Accelerator, bool]:
+    """
+    Initialize Accelerator and optionally wandb.
+    Returns:
+        (accelerator, use_wandb)
+    """
+    use_wandb = want_wandb and not _wandb_disabled_by_env()
+    if use_wandb:
+        use_wandb = _try_wandb_login()
+    accel_kwargs: dict = {}
+    # bf16 for DDP/MULTI_GPU only; with deepspeed_config_file, precision lives in the JSON.
+    if bf16 and not uses_deepspeed_json_file():
+        accel_kwargs["mixed_precision"] = "bf16"
+    if use_wandb:
+        accel_kwargs["log_with"] = "wandb"
+    return Accelerator(**accel_kwargs), use_wandb
+def load_model_and_processor(model_config: Dict[str, Any]):
+    """
+    Loads the pre-trained vision-language model and its associated processor.
+    Args:
+        model_config (Dict[str, Any]): Configuration dictionary for the model.
+    Returns:
+        Tuple[LlavaOnevisionForConditionalGeneration, ProcessorMixin]: The loaded model and processor.
+    """
+    model_id = model_config['pretrained_model_path']
+    model = LlavaOnevisionForConditionalGeneration.from_pretrained(
+        model_id,
+        torch_dtype=getattr(torch, model_config['torch_dtype']),
+        attn_implementation='flash_attention_2' if model_config['use_flash_attention_2'] else 'sdpa',
+        low_cpu_mem_usage=True,
+    )
+    # Freeze the vision tower to save memory and computation
+    model.base_model.vision_tower.requires_grad_(False)
+    processor = AutoProcessor.from_pretrained(model_id)
+    processor.tokenizer.padding_side = "left"
+    return model, processor
+def load_teacher_model(model_config: Dict[str, Any], *, local_rank: int = 0, num_gpus: int = 1):
+    """Load optional frozen teacher for cross-model OPD (e.g. LLaVA-OneVision 7B)."""
+    teacher_path = model_config.get("teacher_model_path")
+    if not teacher_path:
+        return None
+    dtype_name = model_config.get("teacher_dtype", model_config.get("torch_dtype", "bfloat16"))
+    torch_dtype = getattr(torch, dtype_name)
+    requested_map = model_config.get("teacher_device_map")
+    if not requested_map:
+        env_map = os.environ.get("DYME_TEACHER_DEVICE_MAP", "").strip()
+        if env_map:
+            requested_map = env_map
+    device_map = resolve_teacher_device_map(
+        requested_map,
+        local_rank=local_rank,
+        num_gpus=max(1, num_gpus),
+    )
+    log_teacher_placement(
+        local_rank=local_rank,
+        num_gpus=max(1, num_gpus),
+        teacher_path=teacher_path,
+        resolved_device=device_map,
+        requested_map=requested_map,
+    )
+    load_kwargs: Dict[str, Any] = {
+        "torch_dtype": torch_dtype,
+        "low_cpu_mem_usage": True,
+        "device_map": device_map,
+    }
+    teacher = LlavaOnevisionForConditionalGeneration.from_pretrained(
+        teacher_path,
+        attn_implementation='flash_attention_2' if model_config.get('use_flash_attention_2') else 'sdpa',
+        **load_kwargs,
+    )
+    teacher.eval()
+    teacher.requires_grad_(False)
+    if hasattr(teacher, "base_model") and hasattr(teacher.base_model, "vision_tower"):
+        teacher.base_model.vision_tower.requires_grad_(False)
+    return teacher
+def prepare_datasets(task: str, dataset_config: Dict[str, Any], mode='rl') -> tuple[Dataset, Dataset | None]:
+    """
+    Prepares the training and evaluation datasets based on the specified task.
+    Args:
+        task (str): The name of the task (e.g., 'chartqa').
+        dataset_config (Dict[str, Any]): Configuration for datasets.
+        mode (str): Data-preparation mode passed to define_task_data_func (default 'rl').
+    Returns:
+        Tuple[Dataset, Optional[Dataset]]: The training dataset and the evaluation dataset
+        (None for tasks without a configured evaluation split).
+    """
+    data_func = define_task_data_func(task, mode=mode)
+    # Create training dataset
+    train_data_list = data_func(json_path=dataset_config['train_dataset'])
+    train_dataset = Dataset.from_list(train_data_list)
+    # Create evaluation dataset
+    if 'chart' in task:
+        eval_dataset = load_dataset(dataset_config['eval_dataset'])['test']
+        # Note: You can uncomment the line below for quick testing/debugging.
+        # eval_dataset = eval_dataset.select(range(1000, 1100))
+    else:
+        # Extend this section for other tasks if needed in the future.
+        eval_dataset = None
+    return train_dataset, eval_dataset
+def main():
+    """
+    Main function to orchestrate the model training pipeline.
+    """
+    parser = argparse.ArgumentParser(description="Train a Llava model using either SFT or GRPO.")
+    parser.add_argument(
+        '--config', type=str, default='config/config.py',
+        help="Python config path (e.g. config/config.py, config/config_trimode.py) "
+             "or shorthand alias: norm | trimode | llavacot | low | aok",
+    )
+    parser.add_argument(
+        '--mode', type=str, default='rl',
+    )
+    parser.add_argument(
+        '--opsd_mode', type=str, default=None,
+        help="OPSD routing mode: dyme | trimode | rlsd | copsd_opd | opsd_only | replace_sft | opsd_on_wrong | grpo_opsd_joint",
+    )
+    parser.add_argument(
+        '--opsd_providers', type=str, default=None,
+        help="Comma-separated privileged providers: text,visual_facts,crop,hybrid",
+    )
+    parser.add_argument(
+        '--opsd_privilege_profile', type=str, default=None,
+        help="Privileged profile preset: text | visual | hybrid (default hybrid in config_trimode)",
+    )
+    parser.add_argument(
+        '--reward_weights', type=str, default=None,
+        help="Comma-separated reward weights: format,context,acc (e.g. 0.5,1.5,1.0). "
+             "Overrides config; env DYME_REWARD_WEIGHTS also supported in antidegen config.",
+    )
+    parser.add_argument(
+        '--opsd_enabled', action='store_true',
+        help="Enable OPSD / TriMode training extensions",
+    )
+    parser.add_argument(
+        '--opsd_debug', action='store_true',
+        help="Enable verbose OPSD debug logs (or set env DYME_OPSD_DEBUG=1)",
+    )
+    parser.add_argument(
+        '--opsd_detail_every', type=int, default=None,
+        help="Emit full weak-signal diagnostic bundle every N global steps on rank 0 "
+             "(default 10; config opsd.debug.detail_every or env DYME_OPSD_DETAIL_EVERY)",
+    )
+    parser.add_argument(
+        '--opsd_probe_on_generate', dest='opsd_probe_on_generate', action='store_true',
+        help="Emit [OPSD-PROBE] on every (re)generate on rank 0 (config_trimode default on)",
+    )
+    parser.add_argument(
+        '--no_opsd_probe_on_generate', dest='opsd_probe_on_generate', action='store_false',
+        help="Disable per-generate [OPSD-PROBE] logs",
+    )
+    parser.set_defaults(opsd_probe_on_generate=None)
+    parser.add_argument(
+        '--no_opsd_probe_first_token_logits', dest='opsd_probe_first_token_logits', action='store_false',
+        help="Disable pre-generate first-token logits probe ([OPSD-GENDBG])",
+    )
+    parser.set_defaults(opsd_probe_first_token_logits=None)
+    parser.add_argument(
+        '--wandb', dest='wandb', action='store_true',
+        help="Force enable Weights & Biases logging",
+    )
+    parser.add_argument(
+        '--no_wandb', dest='wandb', action='store_false',
+        help="Disable Weights & Biases logging (or set WANDB_DISABLED=true or WANDB_MODE=disabled)",
+    )
+    parser.set_defaults(wandb=None)
+    args = parser.parse_args()
+    mode = args.mode
+    # 1. Load Configurations
+    CONFIG = load_config(args.config)
+    model_config = CONFIG['model']
+    training_config = CONFIG['training']
+    rl_config = CONFIG['rl']
+    client_config = CONFIG['client']
+    dataset_config = CONFIG['dataset']
+    task = training_config['task']
+    opsd_config = dict(CONFIG.get('opsd', {"enabled": False, "mode": "dyme"}))
+    if args.opsd_enabled:
+        opsd_config["enabled"] = True
+    if args.opsd_mode is not None:
+        opsd_config["enabled"] = True
+        opsd_config["mode"] = args.opsd_mode
+    if args.opsd_providers is not None:
+        opsd_config["privileged_providers"] = [p.strip() for p in args.opsd_providers.split(",") if p.strip()]
+    if args.opsd_privilege_profile is not None:
+        opsd_config["privileged_profile"] = args.opsd_privilege_profile.strip()
+    reward_weights_raw = args.reward_weights or os.environ.get("DYME_REWARD_WEIGHTS")
+    if reward_weights_raw:
+        parts = [p.strip() for p in reward_weights_raw.split(",") if p.strip()]
+        if len(parts) != 3:
+            raise ValueError(
+                f"reward_weights must have exactly 3 comma-separated values (format,context,acc), got: {reward_weights_raw!r}"
+            )
+        opsd_config["reward_weights"] = [float(p) for p in parts]
+    debug_cfg = opsd_config.setdefault("debug", {})
+    detail_every = debug_cfg.get("detail_every", 10)
+    if args.opsd_detail_every is not None:
+        detail_every = max(0, args.opsd_detail_every)
+        debug_cfg["detail_every"] = detail_every
+    probe_on_generate = debug_cfg.get("probe_on_generate", False)
+    if args.opsd_probe_on_generate is not None:
+        probe_on_generate = args.opsd_probe_on_generate
+        debug_cfg["probe_on_generate"] = probe_on_generate
+    probe_first_token_logits = debug_cfg.get("probe_first_token_logits", True)
+    if args.opsd_probe_first_token_logits is not None:
+        probe_first_token_logits = args.opsd_probe_first_token_logits
+        debug_cfg["probe_first_token_logits"] = probe_first_token_logits
+    debug_enabled = opsd_debug.configure(
+        enabled=args.opsd_debug or None,
+        detail_every=detail_every,
+        probe_on_generate=probe_on_generate,
+        probe_first_token_logits=probe_first_token_logits,
+        probe_prompt_tail_tokens=debug_cfg.get("probe_prompt_tail_tokens", 16),
+        probe_log_model_context=debug_cfg.get("probe_log_model_context", True),
+    )
+    if debug_enabled:
+        opsd_debug.log_config("main", "resolved OPSD config", opsd_config)
+        opsd_debug.log("main", "training entry", mode=mode, config_path=args.config)
+    # 2. Setup Environment
+    want_wandb = True if args.wandb is None else args.wandb
+    accelerator, use_wandb = setup_accelerator_and_wandb(
+        bf16=training_config['dyme_args']['bf16'],
+        want_wandb=want_wandb,
+    )
+    if want_wandb and not use_wandb and args.wandb is True:
+        raise RuntimeError(
+            "wandb was requested (--wandb) but no API key is configured. "
+            "Run `wandb login`, set WANDB_API_KEY, or use WANDB_MODE=offline."
+        )
+    if accelerator.is_main_process:
+        if use_wandb:
+            print("[DyME] wandb enabled for training logs")
+        elif want_wandb:
+            print(
+                "[DyME] wandb disabled (no credentials). Training continues with report_to=none. "
+                "Run `wandb login`, export WANDB_API_KEY, or pass --wandb after configuring."
+            )
+    device_id = accelerator.process_index
+    opsd_debug.configure(
+        enabled=debug_enabled,
+        detail_every=detail_every,
+        probe_on_generate=probe_on_generate,
+        probe_first_token_logits=probe_first_token_logits,
+        probe_prompt_tail_tokens=debug_cfg.get("probe_prompt_tail_tokens", 16),
+        probe_log_model_context=debug_cfg.get("probe_log_model_context", True),
+        rank=accelerator.process_index,
+        world_size=accelerator.num_processes,
+    )
+    if debug_enabled:
+        opsd_debug.log(
+            "main",
+            "accelerator initialized",
+            process_index=accelerator.process_index,
+            local_process_index=accelerator.local_process_index,
+            num_processes=accelerator.num_processes,
+            device=str(accelerator.device),
+        )
+    visible_gpus = torch.cuda.device_count()
+    local_rank = int(os.environ.get("LOCAL_RANK", accelerator.local_process_index))
+    if visible_gpus == 0:
+        raise RuntimeError("No CUDA devices are visible to this process.")
+    if accelerator.num_processes > visible_gpus:
+        raise RuntimeError(
+            f"GPU/process mismatch: launched {accelerator.num_processes} distributed processes "
+            f"but only {visible_gpus} CUDA device(s) are visible "
+            f"(CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', '<unset>')}).\n"
+            f"Fix: accelerate launch --num_processes {visible_gpus} ...\n"
+            f"Or: bash scripts/train_local_gpus.sh  (auto-detects {visible_gpus} GPU(s))"
+        )
+    if local_rank >= visible_gpus:
+        raise RuntimeError(
+            f"LOCAL_RANK={local_rank} but only {visible_gpus} GPU(s) visible. "
+            f"Reduce --num_processes to {visible_gpus}."
+        )
+    if accelerator.is_main_process:
+        print(
+            f"[DyME] Distributed launch OK: num_processes={accelerator.num_processes}, "
+            f"visible_gpus={visible_gpus}, CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', '<unset>')}"
+        )
+    # 3. Initialize Model and Processor
+    ds_zero_stage = deepspeed_zero_stage()
+    if accelerator.is_main_process and is_deepspeed_accelerate_config():
+        print(
+            f"[DyME] DeepSpeed enabled via ACCELERATE_CONFIG "
+            f"({os.environ.get('ACCELERATE_CONFIG', '<unset>')}), ZeRO stage={ds_zero_stage}",
+            flush=True,
+        )
+    model, processor = load_model_and_processor(model_config)
+    if os.environ.get("DYME_GRADIENT_CHECKPOINTING", "").strip().lower() in ("1", "true", "yes", "on"):
+        if should_disable_gradient_checkpointing():
+            if accelerator.is_main_process:
+                print(
+                    "[DyME] gradient checkpointing skipped: incompatible with DeepSpeed ZeRO-1/2 "
+                    "(multiple student forwards / checkpoint backward). "
+                    "Use ZeRO-3, DDP, or DYME_GRADIENT_CHECKPOINTING=0.",
+                    flush=True,
+                )
+        else:
+            gc_kwargs = gradient_checkpointing_enable_kwargs()
+            if gc_kwargs:
+                model.gradient_checkpointing_enable(gradient_checkpointing_kwargs=gc_kwargs)
+            else:
+                model.gradient_checkpointing_enable()
+            if accelerator.is_main_process:
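+                # Use a separate name: `mode` still holds args.mode and is passed to prepare_datasets() below.
+                gc_mode = f"use_reentrant={gc_kwargs['use_reentrant']}" if gc_kwargs else "default"
+                print(
+                    f"[DyME] gradient checkpointing enabled on student "
+                    f"(DYME_GRADIENT_CHECKPOINTING, {gc_mode})",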
+                    flush=True,
+                )
+    cold_start_frac = float(
+        opsd_config.get("gate", {}).get("sft_cold_start_frac", 0.0) or 0.0
+    )
+    cold_start_steps = opsd_config.get("gate", {}).get("sft_cold_start_steps")
+    lazy_teacher = bool(cold_start_steps) or cold_start_frac > 0.0
+    teacher_model = None
+    teacher_model_config = None
+    if lazy_teacher:
+        teacher_model_config = dict(model_config)
+        if accelerator.is_main_process:
+            print(
+                "[DyME] SFT cold-start enabled: deferring 7B teacher load until RL phase",
+                flush=True,
+            )
+    else:
+        teacher_model = load_teacher_model(
+            model_config,
+            local_rank=local_rank,
+            num_gpus=visible_gpus,
+        )
+    if accelerator.is_main_process and teacher_model is not None:
+        _run_cross_model_vocab_checks(
+            model,
+            processor,
+            teacher_model,
+            model_config,
+        )
+    # 4. Prepare Datasets
+    train_dataset, eval_dataset = prepare_datasets(task, dataset_config, mode=mode)
+    # 5. Initialize Reward Calculator
+    # checker = RewardCalculator(rl_config, client_config.copy(), gpu_id=device_id)
+    # refiner = ContextRefiner(rl_config, client_config.copy(), gpu_id=device_id)
+    checker = RewardCalculatorLocal(rl_config, client_config.copy(), gpu_id=device_id)
+    refiner = ContextRefinerLocal(rl_config, client_config.copy(), gpu_id=device_id)
+    # 6. Define Training Arguments
+    dyme_args = dict(training_config['dyme_args'])
+    if ds_zero_stage is not None and ds_zero_stage >= 3:
+        dyme_args.setdefault("ds3_gather_for_generation", True)
+    if not use_wandb:
+        dyme_args["report_to"] = "none"
+    training_args = GRPOConfig(**dyme_args)
+    collate_fn_with_processor = partial(collate_fn, processor=processor)
+    # 7. Initialize the Trainer
+    dyme_trainer = DyMETrainer(
+        model=model,
+        checker=checker,
+        refiner=refiner,
+        args=training_args,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        processing_class=processor,
+        processing_func=collate_fn_with_processor,
+        task_name=task,
+        end_flag=rl_config['end_flag'],
+        opsd_config=opsd_config,
+        teacher_model=teacher_model,
+        teacher_model_config=teacher_model_config,
+    )
+    # 8. Start Training
+    dyme_trainer.train()
+    output_dir = training_args.output_dir
+    output_dir = os.path.join(output_dir, "final_checkpoint")
+    if accelerator.is_main_process and is_deepspeed_accelerate_config():
+        print(
+            "[DyME] Saving consolidated student checkpoint (DeepSpeed ZeRO gather if configured)...",
+            flush=True,
+        )
+    dyme_trainer.save_model(output_dir)
+    if accelerator.is_main_process:
+        processor.save_pretrained(output_dir)
+        print(f"Model and processor saved to {output_dir}")
+if __name__ == "__main__":
+    main()
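
The paired `--wandb` / `--no_wandb` and `--no_opsd_probe_*` flags in main.py rely on a tri-state pattern: both flags share one `dest`, `set_defaults(...=None)` marks "not given on the command line", and the config value is overridden only when the CLI value is not `None`. A minimal sketch of that pattern follows. The helper names `build_parser` and `resolve_flag` are illustrative assumptions and do not appear in the repository.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one tri-state boolean option (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Both flags write to the same destination.
    parser.add_argument("--wandb", dest="wandb", action="store_true")
    parser.add_argument("--no_wandb", dest="wandb", action="store_false")
    # None means "neither flag was passed", so the config value can win.
    parser.set_defaults(wandb=None)
    return parser


def resolve_flag(cli_value: Optional[bool], config_value: Optional[bool], fallback: bool) -> bool:
    """CLI value wins when given, then the config value, then the fallback."""
    if cli_value is not None:
        return cli_value
    if config_value is not None:
        return config_value
    return fallback


if __name__ == "__main__":
    args = build_parser().parse_args(["--no_wandb"])
    # The config asks for wandb, but the explicit CLI flag overrides it.
    print(resolve_flag(args.wandb, config_value=True, fallback=True))
```

Keeping `None` as a distinct state is what lets a config file turn a feature on while still allowing a one-off override from the command line.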

main_llm.py ADDED

	@@ -0,0 +1,197 @@

+# main_llm.py
+"""
+Main script for training a causal language model with LoRA adapters using the custom DyMETrainer.
+This script handles:
+1. Configuration loading.
+2. Initialization of Weights & Biases (wandb) and Hugging Face Accelerate.
+3. Loading the model and processor.
+4. Preparing the training and evaluation datasets.
+5. Setting up and running the GRPO trainer.
+"""
+import argparse
+import os
+from functools import partial
+from typing import Dict, Any
+import torch
+import wandb
+from accelerate import Accelerator
+from datasets import Dataset, load_dataset
+from peft import LoraConfig, get_peft_model, TaskType
+from transformers import AutoProcessor, AutoModelForCausalLM
+from trl import GRPOConfig
+from config.config_llm import CONFIG
+from data_utils.commom_util import collate_fn, define_task_data_func, collate_fn_woI
+from trainer.DyMETrainer_llm import DyMETrainer
+from reward_utils.checker import RewardCalculator, RewardCalculatorLocal
+from reward_utils.refiner import ContextRefiner, ContextRefinerLocal
+def print_trainable_parameters(model):
+    """
+    Prints the number of trainable parameters in the model.
+    """
+    trainable_params = 0
+    all_param = 0
+    for _, param in model.named_parameters():
+        all_param += param.numel()
+        if param.requires_grad:
+            trainable_params += param.numel()
+    print(
+        f"trainable params: {trainable_params} || all params: {all_param} || "
+        f"trainable%: {100 * trainable_params / all_param:.2f}"
+    )
+def setup_accelerator_and_wandb(bf16) -> Accelerator:
+    """
+    Initializes Weights & Biases and the Hugging Face Accelerator.
+    Returns:
+        Accelerator: The configured accelerator instance.
+    """
+    wandb_key = os.environ.get("WANDB_API_KEY")
+    if wandb_key:
+        wandb.login(key=wandb_key)
+    if bf16:
+        accelerator = Accelerator(mixed_precision="bf16", log_with="wandb")
+    else:
+        accelerator = Accelerator(log_with="wandb")
+    return accelerator
+def load_model_and_processor(model_config: Dict[str, Any], peft_config: Dict[str, Any]):
+    """
+    Loads the base model, applies LoRA configuration, and loads its processor.
+    Args:
+        model_config (Dict[str, Any]): Configuration dictionary for the model.
+        peft_config (Dict[str, Any]): Configuration dictionary for PEFT (LoRA).
+    Returns:
+        Tuple[PeftModel, PreTrainedProcessor]: The loaded PEFT model and processor.
+    """
+    model_id = model_config['pretrained_model_path']
+    # Load base model
+    base_model = AutoModelForCausalLM.from_pretrained(
+        model_id,
+        torch_dtype=getattr(torch, model_config['torch_dtype']),
+        attn_implementation='flash_attention_2' if model_config['use_flash_attention_2'] else 'sdpa',
+        low_cpu_mem_usage=True,
+    )
+    processor = AutoProcessor.from_pretrained(model_id, padding_side='left')
+    processor._tokenizer.padding_side = "left"
+    lora_config = peft_config
+    model = get_peft_model(base_model, lora_config)
+    print("LoRA model created:")
+    print_trainable_parameters(model)
+    return model, processor
+# ## --- LoRA modification End --- ##
+def prepare_datasets(task: str, dataset_config: Dict[str, Any]) -> (Dataset, Dataset):
+    """
+    Prepares the training and evaluation datasets based on the specified task.
+    """
+    data_func = define_task_data_func(task)
+    train_data_list = data_func(json_path=dataset_config['train_dataset'])
+    train_dataset = Dataset.from_list(train_data_list)
+    if 'chart' in task:
+        eval_dataset = load_dataset(dataset_config['eval_dataset'])['test']
+    else:
+        eval_dataset = None
+    return train_dataset, eval_dataset
+def main():
+    """
+    Main function to orchestrate the model training pipeline.
+    """
+    parser = argparse.ArgumentParser(description="Train a model using GRPO with LoRA.")
+    parser.add_argument(
+        '--config', type=str, default='norm',
+        help="config file to use: 'norm' or 'llavacot'..."
+    )
+    args = parser.parse_args()
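+    config_select = args.config
+    # CONFIG is already imported at module level from config.config_llm.
+    # Re-importing it inside main() would make CONFIG a function-local name
+    # and leave it unbound for any other --config value.
+    if config_select != 'norm':
+        raise ValueError(f"Unsupported --config value: {config_select!r} (only 'norm' is wired up here)")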
+    # 1. Load Configurations
+    model_config = CONFIG['model']
+    training_config = CONFIG['training']
+    rl_config = CONFIG['rl']
+    client_config = CONFIG['client']
+    dataset_config = CONFIG['dataset']
+    peft_config = LoraConfig(
+        r=16,
+        lora_alpha=64,
+        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
+        task_type="CAUSAL_LM",
+        lora_dropout=0.05,
+    )
+    task = training_config['task']
+    # 2. Setup Environment
+    accelerator = setup_accelerator_and_wandb(bf16=training_config['dyme_args']['bf16'])
+    device_id = accelerator.process_index
+    # 3. Initialize Model and Processor
+    # ## --- LoRA modification Start --- ##
+    #  Pass peft_config to the model loading function
+    model, processor = load_model_and_processor(model_config, peft_config)
+    # ## --- LoRA modification End --- ##
+    # 4. Prepare Datasets
+    train_dataset, eval_dataset = prepare_datasets(task, dataset_config)
+    # 5. Initialize Reward Calculator
+    checker = RewardCalculatorLocal(rl_config, client_config.copy(), gpu_id=device_id)
+    refiner = ContextRefinerLocal(rl_config, client_config.copy(), gpu_id=device_id)
+    # 6. Define Training Arguments
+    training_args = GRPOConfig(**training_config['dyme_args'])
+    collate_fn_with_processor = partial(collate_fn_woI, processor=processor)
+    # 7. Initialize the Trainer
+    # Trainer handles PeftModel automatically
+    dyme_trainer = DyMETrainer(
+        model=model,
+        checker=checker,
+        refiner=refiner,
+        args=training_args,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        processing_class=processor,
+        processing_func=collate_fn_with_processor,
+        task_name=task,
+        end_flag=rl_config['end_flag'],
+    )
+    # 8. Start Training
+    dyme_trainer.train()
+    # When saving, the Trainer automatically saves only the LoRA adapter weights
+    output_dir = training_args.output_dir
+    output_dir = os.path.join(output_dir, "final_checkpoint")
+    dyme_trainer.save_model(output_dir)
+    if accelerator.is_main_process:
+        # Non-model files like the processor still need to be saved manually
+        processor.save_pretrained(output_dir)
+        print(f"LoRA adapters and processor saved to {output_dir}")
+if __name__ == "__main__":
+    main()

main_sft.py ADDED

	@@ -0,0 +1,80 @@

+"""
+Offline supervised fine-tuning for ChartQA (two-stage cold start before RLSD/OPD).
+Usage:
+  accelerate launch main_sft.py --config config/config_rlsd_chartqa.py
+  bash scripts/train_chartqa_sft.sh
+"""
+from __future__ import annotations
+import argparse
+import os
+from functools import partial
+from accelerate import Accelerator
+from datasets import Dataset
+from transformers import Trainer, TrainingArguments
+from config.loader import load_config
+from data_utils.commom_util import collate_fn, define_task_data_func
+from main import load_model_and_processor
+def main() -> None:
+    parser = argparse.ArgumentParser(description="ChartQA offline SFT (hint + Answer GT).")
+    parser.add_argument(
+        "--config",
+        type=str,
+        default="config/config_rlsd_chartqa.py",
+        help="Config module (uses training.sft_args and dataset.train_dataset).",
+    )
+    parser.add_argument(
+        "--pretrained_model_path",
+        type=str,
+        default=None,
+        help="Override CONFIG model path (e.g. base 0.5B before RL).",
+    )
+    args = parser.parse_args()
+    config = load_config(args.config)
+    model_config = dict(config["model"])
+    if args.pretrained_model_path:
+        model_config["pretrained_model_path"] = args.pretrained_model_path
+    training_config = config["training"]
+    task = training_config["task"]
+    sft_args = dict(training_config.get("sft_args") or {})
+    if not sft_args:
+        raise ValueError("Config must define training.sft_args for offline SFT.")
+    output_dir = os.environ.get("DYME_SFT_OUTPUT_DIR", sft_args.get("output_dir", "./outputs/chartqa-sft"))
+    sft_args["output_dir"] = output_dir
+    sft_args.setdefault("remove_unused_columns", False)
+    accelerator = Accelerator()
+    if accelerator.is_main_process:
+        os.makedirs(output_dir, exist_ok=True)
+    model, processor = load_model_and_processor(model_config)
+    data_func = define_task_data_func(task, mode="sft")
+    train_list = data_func(json_path=config["dataset"]["train_dataset"])
+    train_dataset = Dataset.from_list(train_list)
+    label_id = processor.tokenizer.convert_tokens_to_ids("<|im_start|>")
+    data_collator = partial(collate_fn, processor=processor, label_id=label_id)
+    train_args = TrainingArguments(**sft_args)
+    trainer = Trainer(
+        model=model,
+        args=train_args,
+        train_dataset=train_dataset,
+        data_collator=data_collator,
+    )
+    trainer.train()
+    trainer.save_model(os.path.join(output_dir, "final_checkpoint"))
+    if accelerator.is_main_process:
+        processor.save_pretrained(os.path.join(output_dir, "final_checkpoint"))
+if __name__ == "__main__":
+    main()

multi_node_config_raw.yaml ADDED

	@@ -0,0 +1,21 @@

+compute_environment: LOCAL_MACHINE
+distributed_type: MULTI_GPU
+downcast_bf16: 'no'
+gpu_ids: all
+machine_rank: 0
+main_process_ip: 'xx.xx.xx.xx'
+main_process_port: 36001
+main_training_function: main
+mixed_precision: 'bf16'
+num_machines: 2
+num_processes: 16
+rdzv_backend: static
+same_network: true
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
+# Optional DeepSpeed ZeRO-0 (no sharding). Prefer MULTI_GPU DDP for single-node training.
+deepspeed_config:
+  zero_stage: 0
+  gradient_accumulation_steps: 1
+  zero3_init_flag: false

opsd_utils/__pycache__/opsd_loss.cpython-312.pyc ADDED

Binary file (5.03 kB).

opsd_utils/gate_policy.py ADDED

	@@ -0,0 +1,107 @@

+"""RLSD warmup gates for OPSD degenerate skip, denser online SFT, and embedded SFT cold start."""
+from __future__ import annotations
+import math
+from typing import Any, Mapping, Optional
+def current_global_step(trainer: Any) -> int:
+    return int(getattr(getattr(trainer, "state", None), "global_step", getattr(trainer, "_step", 0)) or 0)
+def resolve_max_training_steps(trainer: Any) -> Optional[int]:
+    """Resolve total optimizer steps for gate math (cold start frac, warmup windows).
+    Priority: TrainingArguments.max_steps > Trainer.state.max_steps > epoch estimate.
+    HF sets state.max_steps when max_steps<=0 from num_train_epochs * len(dataloader).
+    """
+    args = getattr(trainer, "args", None)
+    if args is not None:
+        arg_max = getattr(args, "max_steps", None)
+        if arg_max is not None and int(arg_max) > 0:
+            return int(arg_max)
+    state = getattr(trainer, "state", None)
+    if state is not None:
+        state_max = getattr(state, "max_steps", None)
+        if state_max is not None and int(state_max) > 0:
+            return int(state_max)
+    if args is None:
+        return None
+    num_epochs = getattr(args, "num_train_epochs", None)
+    grad_accum = max(1, int(getattr(args, "gradient_accumulation_steps", 1) or 1))
+    if num_epochs is None or float(num_epochs) <= 0:
+        return None
+    dataloader = getattr(trainer, "train_dataloader", None)
+    if dataloader is None and hasattr(trainer, "get_train_dataloader"):
+        try:
+            dataloader = trainer.get_train_dataloader()
+        except Exception:
+            dataloader = None
+    if dataloader is None:
+        return None
+    try:
+        steps_per_epoch = len(dataloader)
+    except TypeError:
+        return None
+    if steps_per_epoch <= 0:
+        return None
+    total = math.ceil(float(num_epochs) * steps_per_epoch / grad_accum)
+    return total if total > 0 else None
+def sft_cold_start_steps(opsd_config: Mapping[str, Any], max_steps: Optional[int]) -> int:
+    """Steps at start of training devoted to embedded offline-style SFT (no generate / no OPSD)."""
+    gate = opsd_config.get("gate", {})
+    explicit_steps = gate.get("sft_cold_start_steps")
+    if explicit_steps is not None:
+        return max(0, int(explicit_steps))
+    frac = float(gate.get("sft_cold_start_frac", 0.0) or 0.0)
+    if frac <= 0.0 or max_steps is None or max_steps <= 0:
+        return 0
+    return max(1, int(max_steps * frac))
+def in_sft_cold_start(
+    opsd_config: Mapping[str, Any],
+    global_step: int,
+    max_steps: Optional[int],
+) -> bool:
+    cold_steps = sft_cold_start_steps(opsd_config, max_steps)
+    return cold_steps > 0 and global_step < cold_steps
+def resolve_skip_degenerate_opsd(
+    opsd_config: Mapping[str, Any],
+    global_step: int,
+    max_steps: Optional[int] = None,
+) -> bool:
+    gate = opsd_config.get("gate", {})
+    if not gate.get("skip_degenerate_for_opsd", False):
+        return False
+    cold_end = sft_cold_start_steps(opsd_config, max_steps)
+    warmup = int(gate.get("degen_skip_warmup_steps", 200))
+    # Do not skip degenerate OPSD during embedded SFT cold start or its degen warmup window.
+    threshold = cold_end + warmup  # cold_end is 0 when cold start is disabled
+    return global_step >= threshold
+def sft_slots_for_step(
+    opsd_config: Mapping[str, Any],
+    global_step: int,
+    max_steps: Optional[int] = None,
+) -> int:
+    if in_sft_cold_start(opsd_config, global_step, max_steps):
+        return 0
+    gate = opsd_config.get("gate", {})
+    warmup_steps = int(gate.get("sft_warmup_steps", 200))
+    cold_end = sft_cold_start_steps(opsd_config, max_steps)
+    effective_warmup_end = cold_end + warmup_steps  # cold_end is 0 when cold start is disabled
+    if global_step < effective_warmup_end:
+        return max(1, int(gate.get("sft_warmup_slots_per_group", 2)))
+    return 1
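
The gate functions in opsd_utils/gate_policy.py split training into three windows: an embedded SFT cold start, a warmup window with denser online SFT, and steady state. The sketch below re-derives that schedule for one concrete setting so the boundaries are easy to check by hand. It is a simplified standalone restatement, not the module itself: the function names and flat arguments are assumptions made for illustration, the defaults (200 warmup steps, 2 warmup slots) are taken from the `gate.get(...)` fallbacks above, and `skip_degenerate` assumes `skip_degenerate_for_opsd` is enabled.

```python
from typing import Optional


def cold_start_steps(frac: float, max_steps: Optional[int], explicit: Optional[int] = None) -> int:
    """Number of initial steps spent in SFT cold start (explicit count beats fraction)."""
    if explicit is not None:
        return max(0, int(explicit))
    if frac <= 0.0 or max_steps is None or max_steps <= 0:
        return 0
    # A positive fraction always yields at least one cold-start step.
    return max(1, int(max_steps * frac))


def sft_slots(step: int, cold_end: int, warmup_steps: int = 200, warmup_slots: int = 2) -> int:
    """Online SFT slots per group: 0 in cold start, denser in warmup, then 1."""
    if cold_end > 0 and step < cold_end:
        return 0
    if step < cold_end + warmup_steps:
        return max(1, warmup_slots)
    return 1


def skip_degenerate(step: int, cold_end: int, warmup: int = 200) -> bool:
    """Degenerate samples are skipped for OPSD only after cold start plus warmup."""
    return step >= cold_end + warmup


if __name__ == "__main__":
    cold_end = cold_start_steps(frac=0.25, max_steps=400)
    for step in (0, 99, 100, 299, 300):
        print(step, sft_slots(step, cold_end), skip_degenerate(step, cold_end))
```

With `max_steps=400` and `sft_cold_start_frac=0.25`, the cold start covers steps 0 to 99, the warmup window covers steps 100 to 299, and both gates switch to their steady-state values at step 300.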

opsd_utils/health_monitor.py ADDED

	@@ -0,0 +1,410 @@

+"""Training health monitor: degeneration alerts, rolling stats, cross-step correlation."""
+from __future__ import annotations
+import math
+from collections import deque
+from typing import Any, Optional
+from opsd_utils import debug_log as opsd_debug
+ALERT_GEN_CLIP_COLLAPSE = "GEN_CLIP_COLLAPSE"
+ALERT_GEN_REPEAT_DEGEN = "GEN_REPEAT_DEGEN"
+ALERT_OPT_GRAD_SPIKE = "OPT_GRAD_SPIKE"
+ALERT_OPT_NAN_INF = "OPT_NAN_INF"
+ALERT_RL_ZERO_SIGNAL = "RL_ZERO_SIGNAL"
+ALERT_REWARD_FORMAT_HACK = "REWARD_FORMAT_HACK"
+ALERT_DATA_EMPTY_VF = "DATA_EMPTY_VF"
+ALERT_LOGIT_MODE_COLLAPSE = "LOGIT_MODE_COLLAPSE"
+ALERT_ANSWER_TOKEN_DRIFT = "ANSWER_TOKEN_DRIFT"
+ALERT_CLIP_FALSE_HEALTHY = "CLIP_FALSE_HEALTHY"
+ALERT_OPSD_LEAKAGE_PATTERN = "OPSD_LEAKAGE_PATTERN"
+ALERT_OPSD_ON_CORRECT = "OPSD_ON_CORRECT"
+def _safe_float(v: Any, default: float = 0.0) -> float:
+    try:
+        if v is None:
+            return default
+        f = float(v)
+        if math.isnan(f) or math.isinf(f):
+            return default
+        return f
+    except (TypeError, ValueError):
+        return default
+def _rolling_mean_std(values: list[float]) -> tuple[float, float]:
+    if not values:
+        return 0.0, 0.0
+    mean = sum(values) / len(values)
+    if len(values) < 2:
+        return mean, 0.0
+    var = sum((x - mean) ** 2 for x in values) / len(values)
+    return mean, math.sqrt(var)
+class TrainingHealthMonitor:
+    """Collect per-step signals, emit layered [OPSD-HEALTH] logs, expose metrics keys."""
+    def __init__(self, config: Optional[dict[str, Any]] = None):
+        cfg = config or {}
+        self.enabled = bool(cfg.get("enabled", True))
+        self.window = max(2, int(cfg.get("window", 20)))
+        self.log_on_generate = bool(cfg.get("log_on_generate", True))
+        self.log_every_step = bool(cfg.get("log_every_step", True))
+        self.log_detail_bundle = bool(cfg.get("log_detail_bundle", True))
+        self.log_alerts_immediately = bool(cfg.get("log_alerts_immediately", True))
+        self.metrics_every_step = bool(cfg.get("metrics_every_step", True))
+        self._history: deque[dict[str, Any]] = deque(maxlen=self.window)
+        self._step_fields: dict[str, Any] = {}
+        self._step_alerts: list[str] = []
+        self._p_greedy_history: deque[float] = deque(maxlen=5)
+        self._p_answer_history: deque[float] = deque(maxlen=5)
+        self._eos_history: deque[float] = deque(maxlen=5)
+        self._last_step: Optional[int] = None
+    def reset_step(self, step: int) -> None:
+        self._step_fields = {"global_step": step}
+        self._step_alerts = []
+        self._last_step = step
+    def _emit_alert(self, step: int, code: str, **fields: Any) -> None:
+        if code not in self._step_alerts:
+            self._step_alerts.append(code)
+        if self.log_alerts_immediately and opsd_debug.should_log_health_alerts_immediately():
+            opsd_debug.log_health("ALERT", code, global_step=step, **fields)
+    def _check_generate_alerts(self, step: int, stats: dict[str, Any], logits: dict[str, Any]) -> list[str]:
+        clipped = _safe_float(stats.get("clipped_rate"))
+        eos_rate = _safe_float(stats.get("eos_terminated_rate"))
+        degenerate_rate = _safe_float(stats.get("degenerate_rate"))
+        repeat_loop = int(stats.get("repeat_loop_count", 0) or 0)
+        p_greedy = _safe_float(logits.get("p_greedy_first"))
+        p_eos = _safe_float(logits.get("p_eos_first"))
+        p_answer = _safe_float(logits.get("p_answer_first"))
+        if clipped > 0.8 and degenerate_rate < 0.05:
+            self._emit_alert(
+                step,
+                ALERT_CLIP_FALSE_HEALTHY,
+                clipped_rate=clipped,
+                degenerate_rate=degenerate_rate,
+                hint="high clip with low degenerate_rate often masks Answer-only collapse",
+            )
+        if clipped > 0.7 and eos_rate < 0.3:
+            self._emit_alert(
+                step,
+                ALERT_GEN_CLIP_COLLAPSE,
+                clipped_rate=clipped,
+                eos_rate=eos_rate,
+                hint="raise repetition_penalty, lower temperature, or shorten max_completion_length",
+            )
+        if degenerate_rate > 0.5 or repeat_loop > 0:
+            self._emit_alert(
+                step,
+                ALERT_GEN_REPEAT_DEGEN,
+                degenerate_rate=degenerate_rate,
+                repeat_loop_count=repeat_loop,
+            )
+        if p_greedy > 0:
+            self._p_greedy_history.append(p_greedy)
+            self._eos_history.append(eos_rate)
+            if (
+                len(self._p_greedy_history) >= 3
+                and all(p > 0.99 for p in list(self._p_greedy_history)[-3:])
+                and len(self._eos_history) >= 2
+                and self._eos_history[-1] < self._eos_history[-2] - 0.1
+            ):
+                self._emit_alert(
+                    step,
+                    ALERT_LOGIT_MODE_COLLAPSE,
+                    p_greedy_first=p_greedy,
+                    p_eos_first=p_eos,
+                    eos_rate=eos_rate,
+                    hint="first token collapsed to Goal: template; EOS probability near zero",
+                )
+        if p_answer > 0:
+            self._p_answer_history.append(p_answer)
+            if len(self._p_answer_history) >= 3 and all(
+                p < 0.5 for p in list(self._p_answer_history)[-3:]
+            ):
+                self._emit_alert(
+                    step,
+                    ALERT_ANSWER_TOKEN_DRIFT,
+                    p_answer_first=p_answer,
+                    hint="first-token Answer probability low for 3 consecutive generate batches",
+                )
+        return list(self._step_alerts)
+    def record_generate(
+        self,
+        step: int,
+        stats: dict[str, Any],
+        logits_stats: Optional[dict[str, Any]] = None,
+    ) -> list[str]:
+        if not self.enabled:
+            return []
+        logits_stats = logits_stats or {}
+        self._step_fields.update(
+            {
+                "degenerate_rate": stats.get("degenerate_rate"),
+                "clipped_rate": stats.get("clipped_rate"),
+                "eos_terminated_rate": stats.get("eos_terminated_rate"),
+                "repeat_loop_count": stats.get("repeat_loop_count"),
+                "char_repeat_count": stats.get("char_repeat_count", 0),
+                "p_greedy_first": logits_stats.get("p_greedy_first"),
+                "p_eos_first": logits_stats.get("p_eos_first"),
+                "p_answer_first": logits_stats.get("p_answer_first"),
+                "entropy_first": logits_stats.get("entropy_first"),
+                "degenerate_rate_format": stats.get("degenerate_rate_format"),
+                "degenerate_rate_repeat": stats.get("degenerate_rate_repeat"),
+                "format_without_thinking_rate": stats.get("format_without_thinking_rate"),
+            }
+        )
+        alerts = self._check_generate_alerts(step, stats, logits_stats)
+        alert_str = ",".join(alerts) if alerts else "none"
+        if self.log_on_generate and opsd_debug.should_log_health_on_generate():
+            opsd_debug.log_health(
+                "generate",
+                "batch health",
+                global_step=step,
+                degenerate_rate=stats.get("degenerate_rate"),
+                clipped_rate=stats.get("clipped_rate"),
+                eos_rate=stats.get("eos_terminated_rate"),
+                repeat_loop_count=stats.get("repeat_loop_count"),
+                char_repeat_count=stats.get("char_repeat_count", 0),
+                p_greedy=logits_stats.get("p_greedy_first"),
+                p_eos=logits_stats.get("p_eos_first"),
+                alerts=alert_str,
+            )
+        return alerts
+    def record_data(self, step: int, fields: dict[str, Any]) -> None:
+        if not self.enabled:
+            return
+        self._step_fields.update(fields)
+        vf_empty = _safe_float(fields.get("visual_fact_empty_rate"))
+        if vf_empty > 0.5:
+            self._emit_alert(
+                step,
+                ALERT_DATA_EMPTY_VF,
+                visual_fact_empty_rate=vf_empty,
+                hint="rebuild train_medium_vf_full.json with visual_fact hints",
+            )
+    def record_routing(self, step: int, fields: dict[str, Any]) -> None:
+        if not self.enabled:
+            return
+        self._step_fields.update(fields)
+        format_mean = _safe_float(fields.get("format_mean"))
+        acc_mean = _safe_float(fields.get("accuracy_mean"))
+        degenerate_rate = _safe_float(self._step_fields.get("degenerate_rate"))
+        if format_mean > 0.7 and acc_mean < 0.05 and degenerate_rate > 0.4:
+            self._emit_alert(
+                step,
+                ALERT_REWARD_FORMAT_HACK,
+                format_mean=format_mean,
+                accuracy_mean=acc_mean,
+                degenerate_rate=degenerate_rate,
+            )
+        opsd_on_correct = _safe_float(fields.get("opsd_on_correct_rate"))
+        if opsd_on_correct > 0.01:
+            self._emit_alert(
+                step,
+                ALERT_OPSD_ON_CORRECT,
+                opsd_on_correct_rate=opsd_on_correct,
+            )
+        leakage_skip = int(fields.get("opsd_skipped_leakage", 0) or 0)
+        if leakage_skip > 0:
+            self._emit_alert(
+                step,
+                ALERT_OPSD_LEAKAGE_PATTERN,
+                opsd_skipped_leakage=leakage_skip,
+            )
+    def record_loss(self, step: int, fields: dict[str, Any]) -> None:
+        if not self.enabled:
+            return
+        self._step_fields.update(fields)
+        loss_val = fields.get("combined_loss_scalar", fields.get("grpo_loss_scalar"))
+        if loss_val is not None and not math.isfinite(_safe_float(loss_val, default=float("nan"))):
+            self._emit_alert(step, ALERT_OPT_NAN_INF, loss=loss_val)
+        adv_abs = _safe_float(fields.get("advantages_abs_mean"))
+        zero_grpo = _safe_float(fields.get("grpo_zero_loss_rate"))
+        if adv_abs < 1e-6 and zero_grpo > 0.8:
+            self._emit_alert(
+                step,
+                ALERT_RL_ZERO_SIGNAL,
+                advantages_abs_mean=adv_abs,
+                grpo_zero_loss_rate=zero_grpo,
+            )
+    def record_optimizer(self, step: int, grad_norm: Optional[float], lr: Optional[float]) -> None:
+        if not self.enabled:
+            return
+        gn = _safe_float(grad_norm) if grad_norm is not None else None
+        if gn is not None:
+            self._step_fields["grad_norm"] = gn
+            hist = [h.get("grad_norm") for h in self._history if h.get("grad_norm") is not None]
+            if len(hist) >= 3:
+                mean, std = _rolling_mean_std([float(x) for x in hist])
+                if std > 1e-8 and gn > mean + 3 * std:
+                    self._emit_alert(
+                        step,
+                        ALERT_OPT_GRAD_SPIKE,
+                        grad_norm=gn,
+                        rolling_mean=mean,
+                        rolling_std=std,
+                    )
+        if lr is not None:
+            self._step_fields["learning_rate"] = lr
+    def correlate(self) -> dict[str, Any]:
+        """Cross-step deltas and root-cause hints from rolling history."""
+        hints: list[str] = []
+        out: dict[str, Any] = {"root_cause_hints": hints}
+        if len(self._history) < 2:
+            out["root_cause_hints"] = ["insufficient history for correlation"]
+            return out
+        # finish_step() appends the current snapshot before calling correlate(), so
+        # history[-1] mirrors the current step and history[-2] is the previous one.
+        prev = self._history[-1]
+        prev2 = self._history[-2]
+        for key in ("grad_norm", "clipped_rate", "eos_terminated_rate", "p_greedy_first", "degenerate_rate"):
+            cur_v = self._step_fields.get(key)
+            old_v = prev2.get(key)
+            if cur_v is not None and old_v is not None:
+                out[f"delta_{key}"] = _safe_float(cur_v) - _safe_float(old_v)
+        gn_prev = prev.get("grad_norm")
+        clip_cur = self._step_fields.get("clipped_rate")
+        if gn_prev is not None and clip_cur is not None and _safe_float(clip_cur) > 0.7:
+            hints.append("high clip rate may follow recent gradient update (check delta_grad_norm)")
+        p_prev = prev2.get("p_greedy_first")
+        p_cur = self._step_fields.get("p_greedy_first")
+        eos_prev = prev2.get("eos_terminated_rate")
+        eos_cur = self._step_fields.get("eos_terminated_rate")
+        if (
+            p_prev is not None
+            and p_cur is not None
+            and _safe_float(p_cur) > 0.99
+            and eos_prev is not None
+            and eos_cur is not None
+            and _safe_float(eos_prev) > 0.5
+            and _safe_float(eos_cur) < 0.2
+        ):
+            hints.append("after gradient step: p_greedy rose to ~1.0 and eos_rate collapsed")
+        if ALERT_RL_ZERO_SIGNAL in self._step_alerts and ALERT_GEN_REPEAT_DEGEN in self._step_alerts:
+            hints.append("RL zero signal co-occurs with repetition degeneration")
+        if not hints:
+            hints.append("none")
+        out["root_cause_hints"] = hints
+        return out
+    def maybe_log_detail_bundle(self, step: int) -> None:
+        if not self.enabled or not self.log_detail_bundle:
+            return
+        if not opsd_debug.should_log_detail(step):
+            return
+        opsd_debug.log_health_detail_banner(step, "TRAINING HEALTH BUNDLE")
+        corr = self.correlate()
+        hist_keys = (
+            "degenerate_rate",
+            "clipped_rate",
+            "eos_terminated_rate",
+            "grad_norm",
+            "p_greedy_first",
+            "grpo_zero_loss_rate",
+            "sft_replaced_ratio",
+        )
+        rolling: dict[str, Any] = {}
+        for key in hist_keys:
+            vals = [_safe_float(h[key]) for h in self._history if h.get(key) is not None]
+            if vals:
+                mean, std = _rolling_mean_std(vals)
+                rolling[f"{key}_mean"] = mean
+                rolling[f"{key}_std"] = std
+        snapshot_fields = {k: v for k, v in self._step_fields.items() if k != "global_step"}
+        opsd_debug.log_health_detail(
+            "health",
+            "step snapshot",
+            global_step=step,
+            alerts=self._step_alerts or ["none"],
+            **snapshot_fields,
+            **rolling,
+        )
+        opsd_debug.log_health_detail(
+            "health",
+            "cross-step correlation",
+            global_step=step,
+            **corr,
+        )
+    def finish_step(self, step: int) -> dict[str, float]:
+        """L2 step summary log + metrics keys for Trainer.log()."""
+        snapshot = dict(self._step_fields)
+        snapshot["alert_count"] = len(self._step_alerts)
+        snapshot["alerts"] = list(self._step_alerts)
+        self._history.append(snapshot)
+        if self.log_every_step and opsd_debug.should_log_health_every_step():
+            corr = self.correlate()
+            opsd_debug.log_health(
+                "step",
+                "step summary",
+                global_step=step,
+                grad_norm=snapshot.get("grad_norm"),
+                lr=snapshot.get("learning_rate"),
+                sft_replaced_ratio=snapshot.get("sft_replaced_ratio"),
+                grpo_zero_loss_rate=snapshot.get("grpo_zero_loss_rate"),
+                degenerate_rate=snapshot.get("degenerate_rate"),
+                clipped_rate=snapshot.get("clipped_rate"),
+                eos_rate=snapshot.get("eos_terminated_rate"),
+                alert_count=len(self._step_alerts),
+                hints=corr.get("root_cause_hints"),
+            )
+        self.maybe_log_detail_bundle(step)
+        if not self.metrics_every_step:
+            return {}
+        metrics: dict[str, float] = {}
+        mapping = {
+            "completions/degenerate_rate": "degenerate_rate",
+            "completions/eos_rate": "eos_terminated_rate",
+            "completions/repeat_loop_count": "repeat_loop_count",
+            "routing/sft_replaced_ratio": "sft_replaced_ratio",
+            "routing/opsd_skipped_degenerate": "opsd_skipped_degenerate",
+            "routing/opsd_skipped_leakage": "opsd_skipped_leakage",
+            "routing/opsd_on_correct_rate": "opsd_on_correct_rate",
+            "routing/grpo_on_correct_rate": "grpo_on_correct_rate",
+            "routing/opd_teacher_call_rate": "opd_teacher_call_rate",
+            "teacher/privileged_suffix_has_gold_rate": "privileged_suffix_has_gold_rate",
+            "teacher/visual_fact_empty_rate": "visual_fact_empty_rate",
+            "teacher/suffix_len_mean": "teacher_suffix_len_mean",
+            "signal/grpo_zero_loss_rate": "grpo_zero_loss_rate",
+            "signal/advantage_abs_mean": "advantages_abs_mean",
+            "logits/p_greedy_first": "p_greedy_first",
+            "logits/p_eos_first": "p_eos_first",
+            "logits/p_answer_first": "p_answer_first",
+            "completions/degenerate_rate_format": "degenerate_rate_format",
+            "completions/degenerate_rate_repeat": "degenerate_rate_repeat",
+            "metrics/format_without_thinking_rate": "format_without_thinking_rate",
+            "phase/sft_cold_start": "phase_sft_cold_start",
+            "health/alert_count": "alert_count",
+        }
+        for metric_key, field_key in mapping.items():
+            val = snapshot.get(field_key)
+            if val is not None:
+                metrics[metric_key] = _safe_float(val)
+        return metrics
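
Review note on `record_optimizer`: the gradient-spike alert is a rolling three-sigma rule over the `grad_norm` values kept in history. The sketch below restates that rule in isolation. `_rolling_mean_std` is not shown in this diff, so `rolling_mean_std` here is a hypothetical stand-in that assumes a population standard deviation; `is_grad_spike` and its parameter names are likewise illustrative, not part of the commit.

```python
import math


def rolling_mean_std(values):
    """Population mean and standard deviation; (0.0, 0.0) for an empty list.

    Assumption: the real _rolling_mean_std may use a different estimator.
    """
    if not values:
        return 0.0, 0.0
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def is_grad_spike(history, grad_norm, min_history=3, n_sigma=3.0, eps=1e-8):
    """Flag grad_norm when it exceeds mean + n_sigma * std of the history.

    Mirrors the thresholds visible in record_optimizer: at least three
    history points and a non-degenerate standard deviation.
    """
    if len(history) < min_history:
        return False
    mean, std = rolling_mean_std(history)
    return std > eps and grad_norm > mean + n_sigma * std
```

One consequence of the `std > 1e-8` guard is worth keeping in mind: a perfectly flat history never raises the alert, however large the new gradient norm is.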

opsd_utils/privileged/__pycache__/providers.cpython-310.pyc ADDED

Binary file (5.43 kB).

opsd_utils/privileged/image_utils.py ADDED

	@@ -0,0 +1,143 @@

+"""Image loading and crop utilities for privileged teacher dual-image forward."""
+from __future__ import annotations
+from typing import Any, Optional
+from PIL import Image
+from data_utils.paths import resolve_image_path
+from data_utils.privileged_schema import resolve_crop_bbox
+from opsd_utils import debug_log as opsd_debug
+def load_rgb(image: Any) -> Optional[Image.Image]:
+    """Load sample image as RGB PIL from path or in-memory object."""
+    if image is None:
+        return None
+    if isinstance(image, Image.Image):
+        return image.convert("RGB") if image.mode != "RGB" else image
+    if isinstance(image, str):
+        path = resolve_image_path(image)
+        try:
+            # Close the file handle promptly; convert() returns a detached copy.
+            with Image.open(path) as img:
+                return img.convert("RGB")
+        except (FileNotFoundError, OSError):
+            opsd_debug.log("privileged_image", "load_rgb failed", path=path)
+            return None
+    return None
+def center_crop(img: Image.Image, margin_ratio: float = 0.25) -> Image.Image:
+    """Trim margin_ratio of the width/height from each side (must be below 0.5)."""
+    w, h = img.size
+    margin_w = int(w * margin_ratio)
+    margin_h = int(h * margin_ratio)
+    return img.crop((margin_w, margin_h, w - margin_w, h - margin_h))
+def crop_image(
+    img: Image.Image,
+    bbox_norm: Optional[list[float]] = None,
+    strategy: str = "center",
+    margin_ratio: float = 0.25,
+    fallback_reason: Optional[str] = None,
+) -> tuple[Image.Image, str]:
+    """
+    Crop image using C2 normalized bbox or center fallback.
+    Returns (cropped_image, crop_strategy_used).
+    """
+    if bbox_norm is not None and strategy in ("bbox", "heuristic", "bbox_then_center"):
+        w, h = img.size
+        x0 = int(bbox_norm[0] * w)
+        y0 = int(bbox_norm[1] * h)
+        x1 = int(bbox_norm[2] * w)
+        y1 = int(bbox_norm[3] * h)
+        x0, x1 = max(0, min(x0, w - 1)), max(1, min(x1, w))
+        y0, y1 = max(0, min(y0, h - 1)), max(1, min(y1, h))
+        if x1 > x0 and y1 > y0:
+            used = strategy if strategy != "bbox_then_center" else "bbox"
+            opsd_debug.log(
+                "privileged_image",
+                "crop_image bbox",
+                strategy=used,
+                bbox_norm=bbox_norm,
+                crop_px=(x0, y0, x1, y1),
+                fallback_reason=fallback_reason,
+            )
+            return img.crop((x0, y0, x1, y1)), used
+    crop = center_crop(img, margin_ratio=margin_ratio)
+    used = "center_fallback" if fallback_reason else "center"
+    opsd_debug.log(
+        "privileged_image",
+        "crop_image center",
+        strategy=used,
+        bbox_norm=bbox_norm,
+        margin_ratio=margin_ratio,
+        fallback_reason=fallback_reason,
+    )
+    return crop, used
+def heuristic_crop_from_visual_fact(
+    img: Image.Image,
+    sample: dict[str, Any],
+    crop_cfg: Optional[dict[str, Any]] = None,
+) -> tuple[Image.Image, str, Optional[list[float]]]:
+    """D2 with D1 fallback: heuristic bbox from visual_fact, else center crop."""
+    crop_cfg = crop_cfg or {}
+    margin_ratio = float(crop_cfg.get("margin_ratio", 0.25))
+    bbox_norm, strategy = resolve_crop_bbox(sample, crop_cfg)
+    fallback_reason = None
+    if strategy == "center" and sample.get("visual_fact"):
+        fallback_reason = "heuristic_failed"
+    crop, used = crop_image(
+        img,
+        bbox_norm=bbox_norm,
+        strategy=strategy if bbox_norm else "center",
+        margin_ratio=margin_ratio,
+        fallback_reason=fallback_reason,
+    )
+    return crop, used, bbox_norm
+def resolve_teacher_images(
+    sample: dict[str, Any],
+    profile: str,
+    crop_cfg: Optional[dict[str, Any]] = None,
+) -> tuple[list[Image.Image], dict[str, Any]]:
+    """
+    Build teacher image list for privileged forward.
+    text -> [full]; visual/hybrid + mode=dual -> [full, crop]; otherwise [full].
+    Returns (images, debug_meta).
+    """
+    crop_cfg = crop_cfg or {}
+    image = sample.get("image")
+    if image is None:
+        return [], {"crop_strategy": "none", "num_teacher_images": 0, "has_bbox": False}
+    full = load_rgb(image)
+    if full is None:
+        return [], {"crop_strategy": "load_failed", "num_teacher_images": 0, "has_bbox": False}
+    image_mode = str(crop_cfg.get("mode", "single")).strip().lower()
+    if profile == "text" or image_mode in ("single", "full", "off", "none"):
+        meta = {
+            "crop_strategy": "single_full",
+            "num_teacher_images": 1,
+            "has_bbox": False,
+            "bbox_norm": None,
+            "image_mode": image_mode,
+        }
+        return [full], meta
+    crop, crop_strategy, bbox_norm = heuristic_crop_from_visual_fact(full, sample, crop_cfg)
+    meta = {
+        "crop_strategy": crop_strategy,
+        "num_teacher_images": 2,
+        "has_bbox": bbox_norm is not None,
+        "bbox_norm": bbox_norm,
+        "full_size": full.size,
+        "crop_size": crop.size,
+        "image_mode": image_mode,
+    }
+    return [full, crop], meta
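
Review note on `crop_image`: the bbox branch scales a normalized `[x0, y0, x1, y1]` box to pixels, clamps it to the image, and falls through to the center crop when the clamped box is empty. The sketch below isolates that arithmetic without PIL so it can be checked by hand. `bbox_norm_to_pixels` is a hypothetical helper name, and returning `None` for the fallback case is an assumption made for illustration.

```python
def bbox_norm_to_pixels(bbox_norm, width, height):
    """Scale a normalized [x0, y0, x1, y1] box to pixel coordinates.

    Uses the same clamping as crop_image in this diff. Returns a
    (x0, y0, x1, y1) tuple, or None when the clamped box has no area
    (the case where crop_image falls back to the center crop).
    """
    x0 = int(bbox_norm[0] * width)
    y0 = int(bbox_norm[1] * height)
    x1 = int(bbox_norm[2] * width)
    y1 = int(bbox_norm[3] * height)
    # Left/top edges stay inside the image; right/bottom edges are at least 1.
    x0, x1 = max(0, min(x0, width - 1)), max(1, min(x1, width))
    y0, y1 = max(0, min(y0, height - 1)), max(1, min(y1, height))
    if x1 > x0 and y1 > y0:
        return (x0, y0, x1, y1)
    return None
```

For a 200x100 image, `[0.25, 0.5, 0.75, 1.0]` maps to `(50, 50, 150, 100)`, while an inverted box such as `[0.8, 0.1, 0.2, 0.9]` yields `None`.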

opsd_utils/prompt_builder.py ADDED

	@@ -0,0 +1,265 @@

+import os
+from typing import Any, Optional
+import torch
+from PIL import Image
+from opsd_utils import debug_log as opsd_debug
+from opsd_utils.privileged import build_privileged_context, maybe_save_privileged_images
+from opsd_utils.teacher_batching import (
+    count_image_tokens,
+    process_teacher_sample,
+    stack_teacher_processor_batches,
+)
+def _build_teacher_text(student_prompt: str, privileged_suffix: str) -> str:
+    teacher_text = student_prompt
+    if privileged_suffix.strip():
+        teacher_text = f"{student_prompt}\n\n{privileged_suffix.strip()}"
+    return teacher_text
+def tokenize_teacher_prompt(
+    processor,
+    student_prompt: str,
+    privileged_suffix: str,
+    images: Any,
+) -> dict:
+    """Tokenize teacher multimodal prompt = student question + privileged suffix + N images."""
+    if isinstance(images, list):
+        pil_images = [img for img in images if isinstance(img, Image.Image)]
+    else:
+        from opsd_utils.privileged.image_utils import load_rgb
+        one = load_rgb(images)
+        pil_images = [one] if one is not None else []
+    teacher_text = _build_teacher_text(student_prompt, privileged_suffix)
+    opsd_debug.log(
+        "teacher_prompt",
+        "tokenize_teacher_prompt",
+        num_images=len(pil_images),
+        suffix_len=len(privileged_suffix.strip()),
+        teacher_text_len=len(teacher_text),
+    )
+    batch = process_teacher_sample(processor, teacher_text, pil_images)
+    opsd_debug.log(
+        "teacher_prompt",
+        "tokenize_teacher_prompt result",
+        input_ids_shape=tuple(batch["input_ids"].shape),
+        has_pixel_values="pixel_values" in batch,
+        pixel_values_shape=tuple(batch["pixel_values"].shape) if "pixel_values" in batch else None,
+        image_token_count=count_image_tokens(batch["input_ids"], processor),
+    )
+    return batch
+def build_teacher_prompt_batch(
+    processor,
+    samples: list[dict[str, Any]],
+    indices: list[int],
+    provider_names: list[str],
+    device,
+    *,
+    opsd_config: Optional[dict[str, Any]] = None,
+    global_step: Optional[int] = None,
+    output_dir: Optional[str] = None,
+) -> dict[str, Any]:
+    """Build padded teacher prompt tensors for OPSD samples at given indices."""
+    opsd_config = opsd_config or {}
+    privileged_profile = opsd_config.get("privileged_profile", "hybrid")
+    crop_cfg = opsd_config.get("privileged_image") or {}
+    privileged_debug_cfg = opsd_config.get("privileged_debug") or {}
+    opsd_debug.log(
+        "teacher_prompt",
+        "build_teacher_prompt_batch enter",
+        num_indices=len(indices),
+        indices=indices,
+        num_samples=len(samples),
+        provider_names=provider_names,
+        privileged_profile=privileged_profile,
+        device=str(device),
+        global_step=global_step,
+    )
+    if not indices:
+        opsd_debug.log("teacher_prompt", "empty indices, return {}")
+        return {}
+    sample_payloads: list[dict[str, Any]] = []
+    for idx in indices:
+        sample = samples[idx]
+        suffix, teacher_images = build_privileged_context(
+            sample,
+            provider_names,
+            privileged_profile=privileged_profile,
+            crop_cfg=crop_cfg,
+            opsd_config=opsd_config,
+        )
+        if not teacher_images:
+            from opsd_utils.privileged.image_utils import load_rgb
+            full = load_rgb(sample.get("image"))
+            teacher_images = [full] if full is not None else []
+        full_img = teacher_images[0] if teacher_images else None
+        crop_img = teacher_images[1] if len(teacher_images) > 1 else None
+        maybe_save_privileged_images(
+            global_step,
+            idx,
+            full_img,
+            crop_img,
+            meta={
+                "privileged_profile": privileged_profile,
+                "num_teacher_images": len(teacher_images),
+                "suffix_len": len(suffix.strip()),
+            },
+            output_dir=output_dir,
+            privileged_debug_cfg=privileged_debug_cfg,
+        )
+        teacher_text = _build_teacher_text(sample["prompt"], suffix)
+        sample_payloads.append(
+            {
+                "teacher_text": teacher_text,
+                "images": teacher_images,
+                "suffix_len": len(suffix.strip()),
+                "num_teacher_images": len(teacher_images),
+            }
+        )
+    batch = _build_teacher_batch_with_oom_retry(processor, sample_payloads)
+    out = {
+        "teacher_prompt_ids": batch["input_ids"].to(device),
+        "teacher_prompt_mask": batch["attention_mask"].to(device),
+    }
+    if batch.get("pixel_values_list"):
+        out["teacher_pixel_values_list"] = [pv.to(device) for pv in batch["pixel_values_list"]]
+    if batch.get("image_sizes_list"):
+        out["teacher_image_sizes_list"] = [sz.to(device) for sz in batch["image_sizes_list"]]
+    teacher_num_images = [int(max(0, n)) for n in batch.get("batch_num_images", [])]
+    if not teacher_num_images:
+        teacher_num_images = [p["num_teacher_images"] for p in sample_payloads]
+    out["teacher_num_images"] = torch.tensor(teacher_num_images, device=device, dtype=torch.long)
+    student_len = None
+    if indices and samples[indices[0]].get("prompt"):
+        student_messages = [
+            {
+                "role": "user",
+                "content": [{"type": "image"}, {"type": "text", "text": samples[indices[0]]["prompt"]}],
+            }
+        ]
+        student_text = processor.apply_chat_template(student_messages, add_generation_prompt=True)
+        student_len = len(processor(text=[student_text], return_tensors="pt")["input_ids"][0])
+    teacher_len = int(out["teacher_prompt_ids"].shape[1])
+    opsd_debug.log(
+        "teacher_prompt",
+        "build_teacher_prompt_batch done",
+        teacher_prompt_ids_shape=tuple(out["teacher_prompt_ids"].shape),
+        teacher_prompt_mask_shape=tuple(out["teacher_prompt_mask"].shape),
+        has_teacher_pixel_values=bool(out.get("teacher_pixel_values_list")),
+        teacher_pixel_values_shapes=[
+            tuple(pv.shape) for pv in out.get("teacher_pixel_values_list", [])[:4]
+        ],
+        teacher_images_count=sample_payloads[0]["num_teacher_images"] if sample_payloads else 0,
+        teacher_num_images=teacher_num_images,
+        teacher_image_token_counts=batch.get("image_token_counts"),
+        teacher_prompt_len=teacher_len,
+        vision_placeholder_delta=(teacher_len - student_len) if student_len else None,
+    )
+    opsd_debug.log_detail(
+        "teacher_prompt",
+        "teacher prompt batch built",
+        global_step=global_step,
+        batch_size=len(indices),
+        teacher_prompt_len=teacher_len,
+        teacher_pixel_values_shapes=[
+            tuple(pv.shape) for pv in out.get("teacher_pixel_values_list", [])[:4]
+        ],
+        teacher_image_token_counts=batch.get("image_token_counts"),
+    )
+    from opsd_utils.leakage import privileged_suffix_has_gold
+    vf_empty = 0
+    gold_suffix_count = 0
+    for idx in indices:
+        sample = samples[idx]
+        vf = (
+            sample.get("visual_fact_hint")
+            or sample.get("visual_fact")
+            or sample.get("visual_facts")
+            or ""
+        )
+        if not str(vf).strip():
+            vf_empty += 1
+        priv_suffix, _ = build_privileged_context(
+            sample,
+            provider_names,
+            privileged_profile=privileged_profile,
+            crop_cfg=crop_cfg,
+            opsd_config=opsd_config,
+        )
+        if privileged_suffix_has_gold(priv_suffix, sample):
+            gold_suffix_count += 1
+    suffix_lens = [p["suffix_len"] for p in sample_payloads]
+    n_idx = max(len(indices), 1)
+    out["teacher_stats"] = {
+        "teacher_suffix_len_mean": float(sum(suffix_lens) / len(suffix_lens)) if suffix_lens else 0.0,
+        "visual_fact_empty_rate": vf_empty / n_idx,
+        "privileged_suffix_has_gold_rate": gold_suffix_count / n_idx,
+        "num_teacher_images_mean": float(
+            sum(p["num_teacher_images"] for p in sample_payloads) / len(sample_payloads)
+        )
+        if sample_payloads
+        else 0.0,
+    }
+    return out
+def _build_teacher_batch_with_oom_retry(
+    processor,
+    sample_payloads: list[dict[str, Any]],
+) -> dict:
+    """Process each teacher sample separately; on OOM halve micro-batch and retry."""
+    n = len(sample_payloads)
+    if n == 0:
+        return {}
+    micro = n
+    while micro >= 1:
+        try:
+            per_sample_batches: list[dict[str, Any]] = []
+            for start in range(0, n, micro):
+                end = min(start + micro, n)
+                for payload in sample_payloads[start:end]:
+                    per_sample_batches.append(
+                        process_teacher_sample(
+                            processor,
+                            payload["teacher_text"],
+                            payload["images"],
+                        )
+                    )
+            return stack_teacher_processor_batches(processor, per_sample_batches)
+        except RuntimeError as exc:
+            if "out of memory" not in str(exc).lower() or micro == 1:
+                raise
+            opsd_debug.log(
+                "teacher_forward_oom",
+                "teacher prompt batch OOM, halving micro-batch",
+                original_batch=n,
+                micro_batch_size=micro,
+                new_micro_batch_size=max(1, micro // 2),
+            )
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
+            micro = max(1, micro // 2)
+    return {}
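
Review note on `_build_teacher_batch_with_oom_retry`: as written, every payload goes through `process_teacher_sample` one at a time regardless of `micro`, so halving changes the loop bounds but not the size of any single call. If the intent is for the chunk size to bound the work done per call, the pattern might look like the sketch below. `run_with_halving` and `process_chunk` are hypothetical names, and the sketch assumes the chunk processor accepts a list and returns a list; it is not the commit's implementation.

```python
def run_with_halving(items, process_chunk):
    """Process items in chunks, halving the chunk size after an OOM error.

    process_chunk(chunk) must return a list of results for that chunk.
    Only RuntimeErrors whose message contains "out of memory" trigger a
    retry; any other error, or an OOM at chunk size 1, is re-raised.
    """
    n = len(items)
    if n == 0:
        return []
    micro = n
    while True:
        try:
            results = []
            for start in range(0, n, micro):
                # Each call sees at most `micro` items.
                results.extend(process_chunk(items[start:start + micro]))
            return results
        except RuntimeError as exc:
            if "out of memory" not in str(exc).lower() or micro == 1:
                raise
            micro = max(1, micro // 2)
```

Partial results from a failed attempt are discarded and the whole list is reprocessed at the smaller size, which keeps the output order simple at the cost of repeated work.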

outputs/logs/.ipynb_checkpoints/train_opd_7b_ds_20260614_175014-checkpoint.log ADDED

The diff for this file is too large to render.

outputs/opd-7b-chartqa-ds/checkpoint-1764/zero_to_fp32.py ADDED

	@@ -0,0 +1,760 @@

+#!/usr/bin/env python
+# Copyright (c) Microsoft Corporation.
+# SPDX-License-Identifier: Apache-2.0
+# DeepSpeed Team
+# This script extracts fp32 consolidated weights from ZeRO stage 1, 2, and 3 DeepSpeed checkpoints. It gets
+# copied into the top level checkpoint dir, so the user can easily do the conversion at any point in
+# the future. Once extracted, the weights don't require DeepSpeed and can be used in any
+# application.
+#
+# example:
+#   python zero_to_fp32.py . output_dir/
+#   or
+#   python zero_to_fp32.py . output_dir/ --safe_serialization
+import argparse
+import torch
+import glob
+import math
+import os
+import re
+import gc
+import json
+import numpy as np
+from tqdm import tqdm
+from collections import OrderedDict
+from dataclasses import dataclass
+# while this script doesn't use deepspeed to recover data, since the checkpoints are pickled with
+# DeepSpeed data structures it has to be available in the current python environment.
+from deepspeed.utils import logger
+from deepspeed.checkpoint.constants import (DS_VERSION, OPTIMIZER_STATE_DICT, SINGLE_PARTITION_OF_FP32_GROUPS,
+                                            FP32_FLAT_GROUPS, ZERO_STAGE, PARTITION_COUNT, PARAM_SHAPES, BUFFER_NAMES,
+                                            FROZEN_PARAM_SHAPES, FROZEN_PARAM_FRAGMENTS)
+@dataclass
+class zero_model_state:
+    buffers: dict()
+    param_shapes: dict()
+    shared_params: list
+    ds_version: int
+    frozen_param_shapes: dict()
+    frozen_param_fragments: dict()
+debug = 0
+# load to cpu
+device = torch.device('cpu')
+def atoi(text):
+    return int(text) if text.isdigit() else text
+def natural_keys(text):
+    '''
+    alist.sort(key=natural_keys) sorts in human order
+    http://nedbatchelder.com/blog/200712/human_sorting.html
+    (See Toothy's implementation in the comments)
+    '''
+    return [atoi(c) for c in re.split(r'(\d+)', text)]
+def get_model_state_file(checkpoint_dir, zero_stage):
+    if not os.path.isdir(checkpoint_dir):
+        raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
+    # there should be only one file
+    if zero_stage <= 2:
+        file = os.path.join(checkpoint_dir, "mp_rank_00_model_states.pt")
+    elif zero_stage == 3:
+        file = os.path.join(checkpoint_dir, "zero_pp_rank_0_mp_rank_00_model_states.pt")
+    else:
+        raise ValueError(f"unknown zero stage {zero_stage}")
+    if not os.path.exists(file):
+        raise FileNotFoundError(f"can't find model states file at '{file}'")
+    return file
+def get_checkpoint_files(checkpoint_dir, glob_pattern):
+    # XXX: need to test that this simple glob rule works for multi-node setup too
+    ckpt_files = sorted(glob.glob(os.path.join(checkpoint_dir, glob_pattern)), key=natural_keys)
+    if len(ckpt_files) == 0:
+        raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
+    return ckpt_files
+def get_optim_files(checkpoint_dir):
+    return get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
+def get_model_state_files(checkpoint_dir):
+    return get_checkpoint_files(checkpoint_dir, "*_model_states.pt")
+def parse_model_states(files):
+    zero_model_states = []
+    for file in files:
+        state_dict = torch.load(file, map_location=device, weights_only=False)
+        if BUFFER_NAMES not in state_dict:
+            raise ValueError(f"{file} is not a model state checkpoint")
+        buffer_names = state_dict[BUFFER_NAMES]
+        if debug:
+            print("Found buffers:", buffer_names)
+        # recover just the buffers while restoring them to fp32 if they were saved in fp16
+        buffers = {k: v.float() for k, v in state_dict["module"].items() if k in buffer_names}
+        param_shapes = state_dict[PARAM_SHAPES]
+        # collect parameters that are included in param_shapes
+        param_names = []
+        for s in param_shapes:
+            for name in s.keys():
+                param_names.append(name)
+        # update with frozen parameters
+        frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
+        if frozen_param_shapes is not None:
+            if debug:
+                print(f"Found frozen_param_shapes: {frozen_param_shapes}")
+            param_names += list(frozen_param_shapes.keys())
+        # handle shared params
+        shared_params = [[k, v] for k, v in state_dict["shared_params"].items()]
+        ds_version = state_dict.get(DS_VERSION, None)
+        frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
+        z_model_state = zero_model_state(buffers=buffers,
+                                         param_shapes=param_shapes,
+                                         shared_params=shared_params,
+                                         ds_version=ds_version,
+                                         frozen_param_shapes=frozen_param_shapes,
+                                         frozen_param_fragments=frozen_param_fragments)
+        zero_model_states.append(z_model_state)
+    return zero_model_states
+def parse_optim_states(files, ds_checkpoint_dir):
+    total_files = len(files)
+    state_dicts = []
+    for f in tqdm(files, desc='Loading checkpoint shards'):
+        state_dict = torch.load(f, map_location=device, mmap=True, weights_only=False)
+        # immediately discard the potentially huge 2 optimizer states as we only care for fp32 master weights
+        # and also handle the case where it was already removed by another helper script
+        state_dict["optimizer_state_dict"].pop("optimizer_state_dict", None)
+        state_dicts.append(state_dict)
+    if ZERO_STAGE not in state_dicts[0][OPTIMIZER_STATE_DICT]:
+        raise ValueError(f"{files[0]} is not a zero checkpoint")
+    zero_stage = state_dicts[0][OPTIMIZER_STATE_DICT][ZERO_STAGE]
+    world_size = state_dicts[0][OPTIMIZER_STATE_DICT][PARTITION_COUNT]
+    # For ZeRO-2 each param group can have different partition_count as data parallelism for expert
+    # parameters can be different from data parallelism for non-expert parameters. So we can just
+    # use the max of the partition_count to get the dp world_size.
+    if type(world_size) is list:
+        world_size = max(world_size)
+    if world_size != total_files:
+        raise ValueError(
+            f"Expected {world_size} of '*_optim_states.pt' under '{ds_checkpoint_dir}' but found {total_files} files. "
+            "Possibly due to an overwrite of an old checkpoint, or a checkpoint didn't get saved by one or more processes."
+        )
+    # the groups are named differently in each stage
+    if zero_stage <= 2:
+        fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
+    elif zero_stage == 3:
+        fp32_groups_key = FP32_FLAT_GROUPS
+    else:
+        raise ValueError(f"unknown zero stage {zero_stage}")
+    fp32_flat_groups = [state_dicts[i][OPTIMIZER_STATE_DICT][fp32_groups_key] for i in range(len(state_dicts))]
+    return zero_stage, world_size, fp32_flat_groups
+def _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters):
+    """
+    Returns fp32 state_dict reconstructed from ds checkpoint
+    Args:
+        - ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder (where the optimizer files are)
+        - ``exclude_frozen_parameters``: exclude frozen parameters
+    """
+    print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
+    optim_files = get_optim_files(ds_checkpoint_dir)
+    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
+    print(f"Detected checkpoint of type zero stage {zero_stage}, world_size: {world_size}")
+    model_files = get_model_state_files(ds_checkpoint_dir)
+    zero_model_states = parse_model_states(model_files)
+    print(f'Parsing checkpoint created by deepspeed=={zero_model_states[0].ds_version}')
+    if zero_stage <= 2:
+        return _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                          exclude_frozen_parameters)
+    elif zero_stage == 3:
+        return _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                                          exclude_frozen_parameters)
+def _zero2_merge_frozen_params(state_dict, zero_model_states):
+    if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+        return
+    frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+    frozen_param_fragments = zero_model_states[0].frozen_param_fragments
+    if debug:
+        num_elem = sum(s.numel() for s in frozen_param_shapes.values())
+        print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+        wanted_params = len(frozen_param_shapes)
+        wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+        avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
+        print(f'Frozen params: Have {avail_numel} numels to process.')
+        print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+    total_params = 0
+    total_numel = 0
+    for name, shape in frozen_param_shapes.items():
+        total_params += 1
+        unpartitioned_numel = shape.numel()
+        total_numel += unpartitioned_numel
+        state_dict[name] = frozen_param_fragments[name]
+        if debug:
+            print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+    print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+def _has_callable(obj, fn):
+    attr = getattr(obj, fn, None)
+    return callable(attr)
+def _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+    param_shapes = zero_model_states[0].param_shapes
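+    # Reconstruction protocol: for zero2 each rank holds one flat partition per param group. We
+    # concatenate the partitions of each group across ranks into a single flat fp32 vector, then
+    # carve the params out of it sequentially using the recorded shapes.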
+    if debug:
+        for i in range(world_size):
+            for j in range(len(fp32_flat_groups[0])):
+                print(f"{FP32_FLAT_GROUPS}[{i}][{j}].shape={fp32_flat_groups[i][j].shape}")
+    # XXX: memory usage doubles here (zero2)
+    num_param_groups = len(fp32_flat_groups[0])
+    merged_single_partition_of_fp32_groups = []
+    for i in range(num_param_groups):
+        merged_partitions = [sd[i] for sd in fp32_flat_groups]
+        full_single_fp32_vector = torch.cat(merged_partitions, 0)
+        merged_single_partition_of_fp32_groups.append(full_single_fp32_vector)
+    avail_numel = sum(
+        [full_single_fp32_vector.numel() for full_single_fp32_vector in merged_single_partition_of_fp32_groups])
+    if debug:
+        wanted_params = sum([len(shapes) for shapes in param_shapes])
+        wanted_numel = sum([sum(shape.numel() for shape in shapes.values()) for shapes in param_shapes])
+        # not asserting if there is a mismatch due to possible padding
+        print(f"Have {avail_numel} numels to process.")
+        print(f"Need {wanted_numel} numels in {wanted_params} params.")
+    # params
+    # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+    # out-of-core computing solution
+    total_numel = 0
+    total_params = 0
+    for shapes, full_single_fp32_vector in zip(param_shapes, merged_single_partition_of_fp32_groups):
+        offset = 0
+        avail_numel = full_single_fp32_vector.numel()
+        for name, shape in shapes.items():
+            unpartitioned_numel = shape.numel() if _has_callable(shape, 'numel') else math.prod(shape)
+            total_numel += unpartitioned_numel
+            total_params += 1
+            if debug:
+                print(f"{name} full shape: {shape} unpartitioned numel {unpartitioned_numel} ")
+            state_dict[name] = full_single_fp32_vector.narrow(0, offset, unpartitioned_numel).view(shape)
+            offset += unpartitioned_numel
+        # Z2 started to align to 2*world_size to improve nccl performance. Therefore both offset and
+        # avail_numel can differ by anywhere between 0..2*world_size. Due to two unrelated complex
+        # paddings performed in the code it's almost impossible to predict the exact numbers w/o the
+        # live optimizer object, so we are checking that the numbers are within the right range
+        align_to = 2 * world_size
+        def zero2_align(x):
+            return align_to * math.ceil(x / align_to)
+        if debug:
+            print(f"original offset={offset}, avail_numel={avail_numel}")
+        offset = zero2_align(offset)
+        avail_numel = zero2_align(avail_numel)
+        if debug:
+            print(f"aligned  offset={offset}, avail_numel={avail_numel}")
+        # Sanity check
+        if offset != avail_numel:
+            raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+    print(f"Reconstructed fp32 state dict with {total_params} params {total_numel} elements")
+def _get_fp32_state_dict_from_zero2_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                               exclude_frozen_parameters):
+    state_dict = OrderedDict()
+    # buffers
+    buffers = zero_model_states[0].buffers
+    state_dict.update(buffers)
+    if debug:
+        print(f"added {len(buffers)} buffers")
+    if not exclude_frozen_parameters:
+        _zero2_merge_frozen_params(state_dict, zero_model_states)
+    _zero2_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+    # recover shared parameters
+    for pair in zero_model_states[0].shared_params:
+        if pair[1] in state_dict:
+            state_dict[pair[0]] = state_dict[pair[1]]
+    return state_dict
+def zero3_partitioned_param_info(unpartitioned_numel, world_size):
+    remainder = unpartitioned_numel % world_size
+    padding_numel = (world_size - remainder) if remainder else 0
+    partitioned_numel = math.ceil(unpartitioned_numel / world_size)
+    return partitioned_numel, padding_numel
+def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
+    if zero_model_states[0].frozen_param_shapes is None or len(zero_model_states[0].frozen_param_shapes) == 0:
+        return
+    if debug:
+        for i in range(world_size):
+            num_elem = sum(s.numel() for s in zero_model_states[i].frozen_param_fragments.values())
+            print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
+        frozen_param_shapes = zero_model_states[0].frozen_param_shapes
+        wanted_params = len(frozen_param_shapes)
+        wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
+        avail_numel = sum([p.numel() for p in zero_model_states[0].frozen_param_fragments.values()]) * world_size
+        print(f'Frozen params: Have {avail_numel} numels to process.')
+        print(f'Frozen params: Need {wanted_numel} numels in {wanted_params} params')
+    total_params = 0
+    total_numel = 0
+    for name, shape in zero_model_states[0].frozen_param_shapes.items():
+        total_params += 1
+        unpartitioned_numel = shape.numel()
+        total_numel += unpartitioned_numel
+        param_frags = tuple(model_state.frozen_param_fragments[name] for model_state in zero_model_states)
+        state_dict[name] = torch.cat(param_frags, 0).narrow(0, 0, unpartitioned_numel).view(shape)
+        partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+        if debug:
+            print(
+                f"Frozen params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+            )
+    print(f"Reconstructed Frozen fp32 state dict with {total_params} params {total_numel} elements")
+class GatheredTensor:
+    """
+    A pseudo tensor that collects partitioned weights.
+    It is more memory efficient when there are multiple groups.
+    """
+    def __init__(self, flat_groups, flat_groups_offset, offset, partitioned_numel, shape):
+        self.flat_groups = flat_groups
+        self.flat_groups_offset = flat_groups_offset
+        self.offset = offset
+        self.partitioned_numel = partitioned_numel
+        self.shape = shape
+        self.dtype = self.flat_groups[0][0].dtype
+    def contiguous(self):
+        """
+        Merge partitioned weights from flat_groups into a single tensor.
+        """
+        end_idx = self.offset + self.partitioned_numel
+        world_size = len(self.flat_groups)
+        pad_flat_param_chunks = []
+        for rank_i in range(world_size):
+            # for each rank, we need to collect weights from related group/groups
+            flat_groups_at_rank_i = self.flat_groups[rank_i]
+            start_group_id = None
+            end_group_id = None
+            for group_id in range(len(self.flat_groups_offset)):
+                if self.flat_groups_offset[group_id] <= self.offset < self.flat_groups_offset[group_id + 1]:
+                    start_group_id = group_id
+                if self.flat_groups_offset[group_id] < end_idx <= self.flat_groups_offset[group_id + 1]:
+                    end_group_id = group_id
+                    break
+            # collect weights from related group/groups
+            for group_id in range(start_group_id, end_group_id + 1):
+                flat_tensor = flat_groups_at_rank_i[group_id]
+                start_offset = self.offset - self.flat_groups_offset[group_id]
+                end_offset = min(end_idx, self.flat_groups_offset[group_id + 1]) - self.flat_groups_offset[group_id]
+                pad_flat_param_chunks.append(flat_tensor[start_offset:end_offset])
+        # collect weights from all ranks
+        pad_flat_param = torch.cat(pad_flat_param_chunks, dim=0)
+        param = pad_flat_param[:self.shape.numel()].view(self.shape).contiguous()
+        return param
+def _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states):
+    param_shapes = zero_model_states[0].param_shapes
+    avail_numel = sum([flat_group.numel() for flat_group in fp32_flat_groups[0]]) * world_size
+    # Reconstruction protocol: For zero3 we need to zip the partitions together at boundary of each
+    # param, re-consolidating each param, while dealing with padding if any
+    # merge list of dicts, preserving order
+    param_shapes = {k: v for d in param_shapes for k, v in d.items()}
+    if debug:
+        for i in range(world_size):
+            print(f"{FP32_FLAT_GROUPS}[{i}].shape={fp32_flat_groups[i].shape}")
+        wanted_params = len(param_shapes)
+        wanted_numel = sum(shape.numel() for shape in param_shapes.values())
+        # not asserting if there is a mismatch due to possible padding
+        avail_numel = fp32_flat_groups[0].numel() * world_size
+        print(f"Trainable params: Have {avail_numel} numels to process.")
+        print(f"Trainable params: Need {wanted_numel} numels in {wanted_params} params.")
+    # params
+    # XXX: for huge models that can't fit into the host's RAM we will have to recode this to support
+    # out-of-core computing solution
+    offset = 0
+    total_numel = 0
+    total_params = 0
+    flat_groups_offset = [0] + list(np.cumsum([flat_tensor.numel() for flat_tensor in fp32_flat_groups[0]]))
+    for name, shape in tqdm(param_shapes.items(), desc='Gathering sharded weights'):
+        unpartitioned_numel = shape.numel()
+        total_numel += unpartitioned_numel
+        total_params += 1
+        partitioned_numel, partitioned_padding_numel = zero3_partitioned_param_info(unpartitioned_numel, world_size)
+        if debug:
+            print(
+                f"Trainable params: {total_params} {name} full shape: {shape} partition0 numel={partitioned_numel} partitioned_padding_numel={partitioned_padding_numel}"
+            )
+        # memory efficient tensor
+        tensor = GatheredTensor(fp32_flat_groups, flat_groups_offset, offset, partitioned_numel, shape)
+        state_dict[name] = tensor
+        offset += partitioned_numel
+    offset *= world_size
+    # Sanity check
+    if offset != avail_numel:
+        raise ValueError(f"consumed {offset} numels out of {avail_numel} - something is wrong")
+    print(f"Reconstructed Trainable fp32 state dict with {total_params} params {total_numel} elements")
+def _get_fp32_state_dict_from_zero3_checkpoint(world_size, fp32_flat_groups, zero_model_states,
+                                               exclude_frozen_parameters):
+    state_dict = OrderedDict()
+    # buffers
+    buffers = zero_model_states[0].buffers
+    state_dict.update(buffers)
+    if debug:
+        print(f"added {len(buffers)} buffers")
+    if not exclude_frozen_parameters:
+        _zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
+    _zero3_merge_trainable_params(state_dict, world_size, fp32_flat_groups, zero_model_states)
+    # recover shared parameters
+    for pair in zero_model_states[0].shared_params:
+        if pair[1] in state_dict:
+            state_dict[pair[0]] = state_dict[pair[1]]
+    return state_dict
+def to_torch_tensor(state_dict, return_empty_tensor=False):
+    """
+    Convert state_dict of GatheredTensor to torch tensor
+    """
+    torch_state_dict = {}
+    converted_tensors = {}
+    for name, tensor in state_dict.items():
+        tensor_id = id(tensor)
+        if tensor_id in converted_tensors:  # shared tensors
+            shared_tensor = torch_state_dict[converted_tensors[tensor_id]]
+            torch_state_dict[name] = shared_tensor
+        else:
+            converted_tensors[tensor_id] = name
+            if return_empty_tensor:
+                torch_state_dict[name] = torch.empty(tensor.shape, dtype=tensor.dtype)
+            else:
+                torch_state_dict[name] = tensor.contiguous()
+    return torch_state_dict
+def get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir,
+                                             tag=None,
+                                             exclude_frozen_parameters=False,
+                                             lazy_mode=False):
+    """
+    Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated state_dict that can be loaded with
+    ``load_state_dict()`` and used for training without DeepSpeed or shared with others, for example
+    via a model hub.
+    Args:
+        - ``checkpoint_dir``: path to the desired checkpoint folder
+        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in 'latest' file. e.g., ``global_step14``
+        - ``exclude_frozen_parameters``: exclude frozen parameters
+        - ``lazy_mode``: get state_dict in lazy mode. It returns a dict of pseudo tensors instead of torch tensors, which is more memory efficient.
+          Convert a pseudo tensor to a torch tensor by calling ``.contiguous()``
+    Returns:
+        - pytorch ``state_dict``
+    A typical usage might be ::
+        from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
+        # do the training and checkpoint saving
+        state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
+        model = model.cpu() # move to cpu
+        model.load_state_dict(state_dict)
+        # submit to model hub or save the model to share with others
+    In this example the ``model`` will no longer be usable in the deepspeed context of the same
+    application. i.e. you will need to re-initialize the deepspeed engine, since
+    ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+    If you want it all done for you, use ``load_state_dict_from_zero_checkpoint`` instead.
+    Note: the above usage may not work if your application doesn't have sufficient free CPU memory.
+    You may need to use the offline approach using the ``zero_to_fp32.py`` script that is saved with
+    the checkpoint. Or you can load state_dict in lazy mode ::
+        from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
+        state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, lazy_mode=True) # not on cpu
+        for name, lazy_tensor in state_dict.items():
+            tensor = lazy_tensor.contiguous()  # to cpu
+            print(name, tensor)
+            # del tensor to release memory if it is no longer in use
+    """
+    if tag is None:
+        latest_path = os.path.join(checkpoint_dir, 'latest')
+        if os.path.isfile(latest_path):
+            with open(latest_path, 'r') as fd:
+                tag = fd.read().strip()
+        else:
+            raise ValueError(f"Unable to find 'latest' file at {latest_path}")
+    ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
+    if not os.path.isdir(ds_checkpoint_dir):
+        raise FileNotFoundError(f"Directory '{ds_checkpoint_dir}' doesn't exist")
+    state_dict = _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
+    if lazy_mode:
+        return state_dict
+    else:
+        return to_torch_tensor(state_dict)
+def convert_zero_checkpoint_to_fp32_state_dict(checkpoint_dir,
+                                               output_dir,
+                                               max_shard_size="5GB",
+                                               safe_serialization=False,
+                                               tag=None,
+                                               exclude_frozen_parameters=False):
+    """
+    Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict`` file that can be
+    loaded with ``torch.load(file)`` + ``load_state_dict()`` and used for training without DeepSpeed.
+    Args:
+        - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
+        - ``output_dir``: directory to the pytorch fp32 state_dict output files
+        - ``max_shard_size``: the maximum size for a checkpoint before being sharded, default value is 5GB
+        - ``safe_serialization``:  whether to save the model using `safetensors` or the traditional PyTorch way (that uses `pickle`).
+        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+        - ``exclude_frozen_parameters``: exclude frozen parameters
+    """
+    # Dependency pre-check
+    if safe_serialization:
+        try:
+            from safetensors.torch import save_file
+        except ImportError:
+            print('If you want to use `safe_serialization`, please `pip install safetensors`')
+            raise
+    if max_shard_size is not None:
+        try:
+            from huggingface_hub import split_torch_state_dict_into_shards
+        except ImportError:
+            print('If you want to use `max_shard_size`, please `pip install huggingface_hub`')
+            raise
+    # Convert zero checkpoint to state_dict
+    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir,
+                                                          tag,
+                                                          exclude_frozen_parameters,
+                                                          lazy_mode=True)
+    # Shard the model if it is too big.
+    weights_name = "model.safetensors" if safe_serialization else "pytorch_model.bin"
+    if max_shard_size is not None:
+        filename_pattern = weights_name.replace(".bin", "{suffix}.bin").replace(".safetensors", "{suffix}.safetensors")
+        # a memory-efficient approach for sharding
+        empty_state_dict = to_torch_tensor(state_dict, return_empty_tensor=True)
+        state_dict_split = split_torch_state_dict_into_shards(empty_state_dict,
+                                                              filename_pattern=filename_pattern,
+                                                              max_shard_size=max_shard_size)
+    else:
+        from collections import namedtuple
+        StateDictSplit = namedtuple("StateDictSplit", ["is_sharded", "filename_to_tensors"])
+        state_dict_split = StateDictSplit(is_sharded=False,
+                                          filename_to_tensors={weights_name: list(state_dict.keys())})
+    # Save the model by shard
+    os.makedirs(output_dir, exist_ok=True)
+    filename_to_tensors = state_dict_split.filename_to_tensors.items()
+    for shard_file, tensors in tqdm(filename_to_tensors, desc="Saving checkpoint shards"):
+        shard_state_dict = {tensor_name: state_dict[tensor_name] for tensor_name in tensors}
+        shard_state_dict = to_torch_tensor(shard_state_dict)
+        output_path = os.path.join(output_dir, shard_file)
+        if safe_serialization:
+            save_file(shard_state_dict, output_path, metadata={"format": "pt"})
+        else:
+            torch.save(shard_state_dict, output_path)
+        # release the memory of current shard
+        for tensor_name in list(shard_state_dict.keys()):
+            del state_dict[tensor_name]
+            del shard_state_dict[tensor_name]
+        del shard_state_dict
+        gc.collect()
+    # Save index if sharded
+    if state_dict_split.is_sharded:
+        index = {
+            "metadata": state_dict_split.metadata,
+            "weight_map": state_dict_split.tensor_to_filename,
+        }
+        save_index_file = "model.safetensors.index.json" if safe_serialization else "pytorch_model.bin.index.json"
+        save_index_file = os.path.join(output_dir, save_index_file)
+        with open(save_index_file, "w", encoding="utf-8") as f:
+            content = json.dumps(index, indent=2, sort_keys=True) + "\n"
+            f.write(content)
+def load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=None):
+    """
+    1. Put the provided model to cpu
+    2. Convert ZeRO 2 or 3 checkpoint into a single fp32 consolidated ``state_dict``
+    3. Load it into the provided model
+    Args:
+        - ``model``: the model object to update
+        - ``checkpoint_dir``: path to the desired checkpoint folder. (one that contains the tag-folder, like ``global_step14``)
+        - ``tag``: checkpoint tag used as a unique identifier for checkpoint. If not provided will attempt to load tag in the file named ``latest`` in the checkpoint folder, e.g., ``global_step14``
+    Returns:
+        - ``model``: modified model
+    Make sure you have plenty of CPU memory available before you call this function. If you don't
+    have enough use the ``zero_to_fp32.py`` utility to do the conversion. You will find it
+    conveniently placed for you in the checkpoint folder.
+    A typical usage might be ::
+        from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
+        model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
+        # submit to model hub or save the model to share with others
+    Note, that once this was run, the ``model`` will no longer be usable in the deepspeed context
+    of the same application. i.e. you will need to re-initialize the deepspeed engine, since
+    ``model.load_state_dict(state_dict)`` will remove all the deepspeed magic from it.
+    """
+    logger.info("Extracting fp32 weights")
+    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
+    logger.info("Overwriting model with fp32 weights")
+    model = model.cpu()
+    model.load_state_dict(state_dict, strict=False)
+    return model
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("checkpoint_dir",
+                        type=str,
+                        help="path to the desired checkpoint folder, e.g., path/checkpoint-12")
+    parser.add_argument("output_dir",
+                        type=str,
+                        help="directory to the pytorch fp32 state_dict output files "
+                        "(e.g. path/checkpoint-12-output/)")
+    parser.add_argument(
+        "--max_shard_size",
+        type=str,
+        default="5GB",
+        help="The maximum size for a checkpoint before being sharded. Each checkpoint shard will then be of a size "
+        "lower than this size. If expressed as a string, needs to be digits followed by a unit (like `5MB`). "
+        "We default it to 5GB in order for models to be able to run easily on free-tier google colab instances "
+        "without CPU OOM issues.")
+    parser.add_argument(
+        "--safe_serialization",
+        default=False,
+        action='store_true',
+        help="Whether to save the model using `safetensors` or the traditional PyTorch way (that uses `pickle`).")
+    parser.add_argument("-t",
+                        "--tag",
+                        type=str,
+                        default=None,
+                        help="checkpoint tag used as a unique identifier for checkpoint. e.g., global_step1")
+    parser.add_argument("--exclude_frozen_parameters", action='store_true', help="exclude frozen parameters")
+    parser.add_argument("-d", "--debug", action='store_true', help="enable debug")
+    args = parser.parse_args()
+    debug = args.debug
+    convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
+                                               args.output_dir,
+                                               max_shard_size=args.max_shard_size,
+                                               safe_serialization=args.safe_serialization,
+                                               tag=args.tag,
+                                               exclude_frozen_parameters=args.exclude_frozen_parameters)
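
Reviewer note: the padding arithmetic used by `zero3_partitioned_param_info` and the ZeRO-2 alignment check in `_zero2_merge_trainable_params` can be reproduced without a checkpoint. The sketch below is a minimal standalone version that works on plain integers instead of tensors. As an assumption made only for illustration, `zero2_align` takes `world_size` as an explicit argument, whereas the script defines it as a closure inside the merge loop. The numbers are made up.

```python
import math


def zero3_partitioned_param_info(unpartitioned_numel, world_size):
    """Per-rank element count and trailing padding for one ZeRO-3 parameter."""
    remainder = unpartitioned_numel % world_size
    padding_numel = (world_size - remainder) if remainder else 0
    partitioned_numel = math.ceil(unpartitioned_numel / world_size)
    return partitioned_numel, padding_numel


def zero2_align(numel, world_size):
    """Round numel up to a multiple of 2 * world_size (ZeRO-2 flat-group alignment)."""
    align_to = 2 * world_size
    return align_to * math.ceil(numel / align_to)


# Illustrative numbers only: a 10-element parameter sharded across 8 ranks.
partitioned_numel, padding_numel = zero3_partitioned_param_info(10, 8)

# Every rank stores `partitioned_numel` elements, so the shards together
# hold the real elements plus the padding.
total_stored = partitioned_numel * 8
print(partitioned_numel, padding_numel, total_stored)
print(zero2_align(10, 8))
```

This is why the ZeRO-3 merge multiplies the accumulated `offset` by `world_size` before comparing it with `avail_numel`, and why the ZeRO-2 path aligns both numbers before its sanity check instead of comparing them directly.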

outputs/opd-7b-chartqa-ds/checkpoint-2352/preprocessor_config.json ADDED

	@@ -0,0 +1,171 @@

+{
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_pad": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_grid_pinpoints": [
+    [
+      384,
+      384
+    ],
+    [
+      384,
+      768
+    ],
+    [
+      384,
+      1152
+    ],
+    [
+      384,
+      1536
+    ],
+    [
+      384,
+      1920
+    ],
+    [
+      384,
+      2304
+    ],
+    [
+      768,
+      384
+    ],
+    [
+      768,
+      768
+    ],
+    [
+      768,
+      1152
+    ],
+    [
+      768,
+      1536
+    ],
+    [
+      768,
+      1920
+    ],
+    [
+      768,
+      2304
+    ],
+    [
+      1152,
+      384
+    ],
+    [
+      1152,
+      768
+    ],
+    [
+      1152,
+      1152
+    ],
+    [
+      1152,
+      1536
+    ],
+    [
+      1152,
+      1920
+    ],
+    [
+      1152,
+      2304
+    ],
+    [
+      1536,
+      384
+    ],
+    [
+      1536,
+      768
+    ],
+    [
+      1536,
+      1152
+    ],
+    [
+      1536,
+      1536
+    ],
+    [
+      1536,
+      1920
+    ],
+    [
+      1536,
+      2304
+    ],
+    [
+      1920,
+      384
+    ],
+    [
+      1920,
+      768
+    ],
+    [
+      1920,
+      1152
+    ],
+    [
+      1920,
+      1536
+    ],
+    [
+      1920,
+      1920
+    ],
+    [
+      1920,
+      2304
+    ],
+    [
+      2304,
+      384
+    ],
+    [
+      2304,
+      768
+    ],
+    [
+      2304,
+      1152
+    ],
+    [
+      2304,
+      1536
+    ],
+    [
+      2304,
+      1920
+    ],
+    [
+      2304,
+      2304
+    ]
+  ],
+  "image_mean": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "image_processor_type": "LlavaOnevisionImageProcessor",
+  "image_std": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "processor_class": "LlavaOnevisionProcessor",
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "height": 384,
+    "width": 384
+  }
+}
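
Reviewer note: the 36 entries of `image_grid_pinpoints` above are every (height, width) pair of multiples of 384 from 384 to 2304, listed in row-major order. The sketch below regenerates the list so it can be checked against the file. `TILE` and `MAX_TILES_PER_SIDE` are names introduced here for illustration and are not keys in the config. Reading `rescale_factor` as 1/255 is an inference from its value, not something the file states.

```python
import math
from itertools import product

TILE = 384               # matches "size": {"height": 384, "width": 384}
MAX_TILES_PER_SIDE = 6   # 6 * 384 = 2304, the largest value in the list

# All (height, width) combinations, height varying slowest, as in the JSON.
pinpoints = [
    [h * TILE, w * TILE]
    for h, w in product(range(1, MAX_TILES_PER_SIDE + 1), repeat=2)
]

# The stored rescale factor is numerically 1/255 (8-bit pixel range to [0, 1]).
rescale_factor = 0.00392156862745098
looks_like_one_over_255 = math.isclose(rescale_factor, 1 / 255, rel_tol=1e-12)

print(len(pinpoints), pinpoints[0], pinpoints[-1], looks_like_one_over_255)
```

Comparing `pinpoints` with the parsed JSON list is a quick way to confirm that a checkpoint's preprocessor config was not altered between saves.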

outputs/opd-7b-chartqa-ds/checkpoint-588/config.json ADDED

	@@ -0,0 +1,235 @@

+{
+  "architectures": [
+    "LlavaOnevisionForConditionalGeneration"
+  ],
+  "dtype": "bfloat16",
+  "eos_token_id": 151645,
+  "ignore_index": -100,
+  "image_grid_pinpoints": [
+    [
+      384,
+      384
+    ],
+    [
+      384,
+      768
+    ],
+    [
+      384,
+      1152
+    ],
+    [
+      384,
+      1536
+    ],
+    [
+      384,
+      1920
+    ],
+    [
+      384,
+      2304
+    ],
+    [
+      768,
+      384
+    ],
+    [
+      768,
+      768
+    ],
+    [
+      768,
+      1152
+    ],
+    [
+      768,
+      1536
+    ],
+    [
+      768,
+      1920
+    ],
+    [
+      768,
+      2304
+    ],
+    [
+      1152,
+      384
+    ],
+    [
+      1152,
+      768
+    ],
+    [
+      1152,
+      1152
+    ],
+    [
+      1152,
+      1536
+    ],
+    [
+      1152,
+      1920
+    ],
+    [
+      1152,
+      2304
+    ],
+    [
+      1536,
+      384
+    ],
+    [
+      1536,
+      768
+    ],
+    [
+      1536,
+      1152
+    ],
+    [
+      1536,
+      1536
+    ],
+    [
+      1536,
+      1920
+    ],
+    [
+      1536,
+      2304
+    ],
+    [
+      1920,
+      384
+    ],
+    [
+      1920,
+      768
+    ],
+    [
+      1920,
+      1152
+    ],
+    [
+      1920,
+      1536
+    ],
+    [
+      1920,
+      1920
+    ],
+    [
+      1920,
+      2304
+    ],
+    [
+      2304,
+      384
+    ],
+    [
+      2304,
+      768
+    ],
+    [
+      2304,
+      1152
+    ],
+    [
+      2304,
+      1536
+    ],
+    [
+      2304,
+      1920
+    ],
+    [
+      2304,
+      2304
+    ]
+  ],
+  "image_token_index": 151646,
+  "model_type": "llava_onevision",
+  "multimodal_projector_bias": true,
+  "pad_token_id": 151643,
+  "projector_hidden_act": "gelu",
+  "text_config": {
+    "_name_or_path": "Qwen/Qwen2-0.5B-Instruct",
+    "architectures": [
+      "Qwen2ForCausalLM"
+    ],
+    "attention_dropout": 0.0,
+    "bos_token_id": 151643,
+    "dtype": "bfloat16",
+    "eos_token_id": 151645,
+    "hidden_act": "silu",
+    "hidden_size": 896,
+    "initializer_range": 0.02,
+    "intermediate_size": 4864,
+    "layer_types": [
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention",
+      "full_attention"
+    ],
+    "max_position_embeddings": 32768,
+    "max_window_layers": 24,
+    "model_type": "qwen2",
+    "num_attention_heads": 14,
+    "num_hidden_layers": 24,
+    "num_key_value_heads": 2,
+    "rms_norm_eps": 1e-06,
+    "rope_scaling": null,
+    "rope_theta": 1000000.0,
+    "sliding_window": null,
+    "tie_word_embeddings": true,
+    "use_cache": true,
+    "use_sliding_window": false,
+    "vocab_size": 152000
+  },
+  "tie_word_embeddings": false,
+  "transformers_version": "4.57.1",
+  "use_image_newline_parameter": true,
+  "video_token_index": 151647,
+  "vision_aspect_ratio": "anyres_max_9",
+  "vision_config": {
+    "attention_dropout": 0.0,
+    "dtype": "bfloat16",
+    "hidden_act": "gelu_pytorch_tanh",
+    "hidden_size": 1152,
+    "image_size": 384,
+    "intermediate_size": 4304,
+    "layer_norm_eps": 1e-06,
+    "model_type": "siglip_vision_model",
+    "num_attention_heads": 16,
+    "num_channels": 3,
+    "num_hidden_layers": 26,
+    "patch_size": 14,
+    "vision_use_head": false
+  },
+  "vision_feature_layer": -1,
+  "vision_feature_select_strategy": "full"
+}
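
Reviewer note: a few derived sizes follow from the values in this `config.json` and are handy for sanity-checking tensor shapes. The sketch below copies the relevant fields by hand and derives them. It assumes the usual conventions: head dimension is `hidden_size` divided by the number of attention heads, and the vision tower produces one patch per non-overlapping `patch_size` window, with any remainder pixels dropped. The config does not state either convention, so treat the results as estimates.

```python
# Values copied from the config above; the dict names are illustrative.
text_config = {
    "hidden_size": 896,
    "num_attention_heads": 14,
    "num_key_value_heads": 2,
}
vision_config = {"image_size": 384, "patch_size": 14}

# Per-head width of the language model's attention.
head_dim = text_config["hidden_size"] // text_config["num_attention_heads"]

# Grouped-query attention: how many query heads share one key/value head.
queries_per_kv_head = (
    text_config["num_attention_heads"] // text_config["num_key_value_heads"]
)

# Patches per 384x384 tile, assuming floor division (no padding).
patches_per_side = vision_config["image_size"] // vision_config["patch_size"]
patches_per_tile = patches_per_side ** 2

print(head_dim, queries_per_kv_head, patches_per_side, patches_per_tile)
```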

papers/full_text.txt ADDED

	@@ -0,0 +1,1211 @@

+--- PAGE 1 ---
+Published as a conference paper at ICLR 2026
+EMPOWERING SMALL VLMS TO THINK WITH DYNAMIC MEMORIZATION AND EXPLORATION
+Jiazhen Liu, Yuchuan Deng, and Long Chen∗
+The Hong Kong University of Science and Technology
+https://github.com/HKUST-LongGroup/DyME
+ABSTRACT
+Small-scale Vision–Language Models (SVLMs) are exceptionally well-suited for
+proprietary tasks. Equipping them with thinking capabilities is a critical step to
+enhance their performance and reliability in these specific domains. However,
+existing training paradigms, including Supervised Fine-Tuning (SFT) and Reinforcement
+Learning with Verifiable Reward (RLVR), impose substantial demands
+on the base VLM, exceeding the capacity of SVLMs. Consequently, directly applying
+these paradigms to SVLMs fails to instill the desired thinking abilities. A
+natural solution is to combine SFT and RLVR, leveraging their complementarity
+to reduce the dependence on model capacity. Yet the core challenge lies in
+managing the inherent trade-off: excessive reliance on SFT can force the model
+to memorize pseudo thinking traces, while over-emphasizing RLVR can lead to
+unstable exploration (i.e., advantage collapse). To address this, we propose DyME,
+a novel training paradigm that Dynamically selects between Memorization (via
+SFT) and Exploration (via RLVR) at each optimization step. By ensuring that
+every update contributes to the trade-off, DyME serves as a robust, standalone
+strategy that stabilizes SVLM learning. Complementing this paradigm, we further
+introduce a synergistic Visual Supervision mechanism (comprising a visual checker
+and refiner) designed to inject dynamically enhanced, image-grounded guidance
+during optimization. Extensive experiments across diverse domains demonstrate
+that DyME consistently achieves this balance, and thus delivers substantial performance
+improvements on specialized tasks. These results establish DyME as
+a practical and effective solution for empowering SVLMs with reliable thinking
+capabilities.
+1 INTRODUCTION
+Equipping Vision–Language Models (VLMs) with thinking capabilities is a pivotal step that moves
+them beyond recognition toward reasoning. Recent studies have advanced this goal through specialized
+training, achieving strong results on a spectrum of visual tasks, from recognition-intensive
+applications like grounding (Lai et al., 2025; Liu & Chen, 2025; Peng et al., 2025; Liu et al., 2025c;a)
+to reasoning-intensive challenges such as chart understanding (Zhang et al., 2025a; Xia et al., 2024)
+and geometric problem solving (Shen et al., 2025; Chen et al., 2025b; Xia et al., 2025). While this
+progress is significant, the success of these approaches is contingent upon the base VLM possessing
+strong foundational capabilities, namely, sufficient capacity and robust instruction adherence (Yang
+et al., 2025a). In practice, only a handful of VLMs meet these prerequisites, presenting a significant
+challenge for Small-scale VLMs (SVLMs), which struggle to develop thinking capabilities under
+existing training paradigms.
+To contextualize this limitation, we briefly review the two dominant paradigms, both of which are
+primarily tailored for Large-scale VLMs (LVLMs). 1) Supervised Fine-Tuning (SFT) on Chain-of-Thought
+(CoT) data (Xu et al., 2024; Li et al., 2024b; Xia et al., 2025; Gao et al., 2025): VLMs are
+supervised to memorize predefined thinking patterns from large-scale CoT annotations. Since CoT
+data are often verbose and contain much vision-irrelevant content, models must possess sufficient
+capacity to absorb long textual content without compromising visual grounding (Marafioti et al.,
+∗Corresponding author (longchen@ust.hk)
+arXiv:2506.23061v2 [cs.CV] 27 Feb 2026
+--- PAGE 2 ---
+[Figure 1 (image omitted). Panel (a): SFT and RL paradigms fail to enable SVLMs to think
+(LVLMs: grounded thinking traces; SVLMs after CoT SFT: pseudo thinking traces; SVLMs during RLVR:
+advantage collapsing). Panel (b): Two-stage training vs. DyME.]
+Figure 1: Training paradigms for enabling VLM thinking. The LVLM is Qwen2.5-VL-32B (Bai
+et al., 2025) and the SVLM is SmolVLM-500M (Marafioti et al., 2025). (a) Existing paradigms are
+effective for LVLMs but unsuitable for SVLMs. (b) The two-stage training paradigm (SFT→RL)
+faces a challenging trade-off. Our proposed DyME dynamically balances this trade-off.
+2025). This capability gap is illustrated in Fig. 1a: After SFT, LVLMs can generate grounded
+thinking traces with accurate intermediate values (in green), while SVLMs cannot. 2) Reinforcement
+Learning with Verifiable Reward (RLVR) (Zhang et al., 2025a; Chen et al., 2025b; Peng et al.,
+2025; Shen et al., 2025): RLVR, on the other hand, promotes exploration of thinking patterns rather
+than imitation. In this paradigm, VLMs are instructed to generate a thought process followed by a strictly
+formatted answer (e.g., enclosed in tags). This format enables verifiable rewards to reinforce correct
+generations and penalize incorrect ones. Owing to its reliance on instruction adherence, this approach
+is practical primarily for strong VLMs that can reliably generate structured outputs.
+Consequently, both established paradigms are inadequate for instilling thinking in SVLMs. The
+extremely limited capacity (e.g., under 1B parameters) of SVLMs renders the SFT paradigm
+ineffective, as a high volume of textual information in CoT data can overwhelm the capacity
+(Marafioti et al., 2025; Chen et al., 2025a). Moreover, the limited instruction adherence of SVLMs
+frequently results in unverifiable outputs (Chu et al., 2025; Guo et al., 2025), precipitating
+advantage collapse during RLVR. We quantitatively verify these limitations (cf. Fig. 2): both
+SFT and RLVR paradigms indeed impair the performance.
+[Figure 2 (image omitted). Bar chart comparing Baseline, +CoT SFT, +RLVR, +Two-stage, and +Ours;
+horizontal axis from 35 to 75.]
+Figure 2: Performance of SmolVLM-500M (Marafioti et al., 2025) on ChartQA (Masry et al., 2022).
+Existing paradigms degrade performance, whereas DyME yields improvements.
+Considering that SVLMs offer high efficiency and are crucial for deployment on edge devices
+(Marafioti et al., 2025), enabling them to think addresses a strong practical demand. Thinking
+enhances the reliability and performance of vision tasks (Zhang et al., 2025a), and task-specific
+SVLMs provide a compelling alternative to LVLMs in resource-constrained settings. This motivates
+the development of a new training paradigm that empowers SVLMs with thinking capabilities, at
+least for specialized tasks.
+--- PAGE 3 ---
+A promising solution is to fuse SFT and RLVR, as a well-calibrated trade-off can lower the high
+demands on the base model (DeepSeek, Inc., 2025; Yan et al., 2025): SFT encourages the model to
+memorize verifiable thinking patterns to prevent advantage collapse, while RL forces exploration to
+prevent rigid templates from overwhelming the model’s capacity. The central challenge, however, is
+that SVLMs struggle to achieve this balance. Existing hybrid methods, like two-stage training (Chen
+et al., 2025a; Chu et al., 2025) or annealed SFT losses¹ (Zhang et al., 2025b), rely on a static trade-off
+governed by hyperparameters set empirically. This rigidity is the critical flaw because the minimal
+capacity of SVLMs means the window for a successful static balance is incredibly narrow, making
+failure almost inevitable (cf. Fig. 1b). Our repeated trials with two-stage training confirmed this issue,
+with performance often falling below the baseline (cf. Fig. 2).
+SVLMs therefore require a more intelligent paradigm to navigate this trade-off. To this end, we
+propose DyME (Dynamic Memorize–Explore), which integrates SFT and RLVR through a dynamic
+switching mechanism. As illustrated in Fig. 1b, DyME assesses the model’s generation at each step
+and adapts its training mode accordingly. When the model fails to follow instructions, it switches to a
+memorization mode (SFT) to guarantee stable optimization signals. Conversely, for valid generations,
+it engages an exploration mode (RLVR) to encourage diverse and grounded thinking. This state-driven
+approach ensures memorization and exploration are always complementary, dynamically maintaining
+the delicate trade-off. While this dynamic switching alone guarantees training stability, we further
+maximize the model’s potential by incorporating a synergistic Visual Supervision mechanism. This
+module facilitates an adaptive interaction: the CoT ground-truth guides the scoring of exploration
+(via a visual checker), while successful exploration traces dynamically refine the CoT ground-truth
+(via a visual refiner).
+The aforementioned design makes DyME a highly effective paradigm for empowering thinking in
+SVLMs for specific tasks. We validate this across three diverse domains, ranging from recognition-intensive
+tasks (medical VQA) to reasoning-intensive challenges (chart understanding and geometric
+problem solving). Remarkably, using only a few thousand training samples, DyME achieves substantial
+performance gains, enabling it to match or even surpass several LVLMs. Our primary
+contributions are as follows:
+1. We propose DyME, the first training paradigm that equips SVLMs with thinking capabilities,
+substantially reducing reliance on the base VLM’s initial capacity.
+2. Through dynamic switching and synergistic supervision, DyME alleviates pseudo thinking traces
+and advantage collapse in SVLMs, yielding image-grounded thinking and consistent performance
+improvements.
+3. We demonstrate the effectiveness and practicality of DyME across three diverse domains, each
+consistently showing substantial performance gains with only a few thousand training samples.
+2 RELATED WORK
+Vision-Language Models. Modern VLMs, such as LLaVA (Liu et al., 2024a) and Qwen-VL (Bai
+et al., 2023), have demonstrated remarkable capabilities across a wide array of vision tasks. However,
+their substantial parameter counts and computational demands restrict their use in resource-constrained
+environments like edge devices. This has motivated a growing interest in SVLMs
+designed for efficiency (Zhou et al., 2024; Marafioti et al., 2025; Korrapati, 2024). Although works
+like TinyLLaVA (Zhou et al., 2024) and SmolVLM (Marafioti et al., 2025) have shown that carefully
+designed SVLMs can achieve competitive performance, they exhibit a critical weakness. Recent
+studies highlight that their performance degrades significantly on tasks requiring complex, multi-step
+instruction following, indicating a gap in their compositional understanding and general reasoning
+abilities (Albalak et al., 2022; Ghosh et al., 2024; Liu et al., 2025b).
+Empowering Thinking Capabilities in VLMs. Recent advances in LLM thinking (e.g., GPT-o1
+(OpenAI, 2024), DeepSeek-R1 (Guo et al., 2025)) have motivated efforts to equip VLMs with
+similar capabilities via dedicated training paradigms.
+SFT on CoT data (Xu et al., 2024; Xia et al., 2024; 2025; Gao et al., 2025; Yang et al., 2025b).
+This paradigm leverages large-scale CoT supervision to teach models to memorize and generalize
+thinking patterns. Multimodal-CoT (Zhang et al., 2023) was an early attempt using fused visual–text
+¹ See the supplementary material for further comparison.
+--- PAGE 4 ---
+inputs, but its small-scale data limited genuine thinking. Subsequent works highlight the role of scale:
+G-LLaVA (Gao et al., 2025) constructs 170K geometry-specific CoT samples; ChartVLM (Xia et al.,
+2024) compiles a large chart corpus; and LLaVA-CoT (Xu et al., 2024) as well as R1-OneVision (Yang
+et al., 2025b) curate diverse, structured CoT data through large-scale prompt engineering. These
+approaches face long inputs, requiring large VLMs that can process rich textual information while
+preserving visual grounding (Marafioti et al., 2025; Zhai et al., 2023).
+RL with Verifiable Reward (RLVR) (Zhang et al., 2025a; Chen et al., 2025b; Peng et al., 2025; Shen
+et al., 2025; Liu et al., 2025c). RLVR adopts a distinct paradigm that elicits thinking through
+autonomous exploration with minimal external supervision. A widely used algorithm is Group
+Relative Policy Optimization (GRPO), introduced by DeepSeek-Math (Shao et al., 2024), which
+exploits models’ ability to produce structured outputs that separate thinking from final answers. It
+leverages rule-verifiable data to optimize high-scoring generations, while light SFT is employed
+for cold-start when the output structure is unclear. This paradigm has been extended to VLMs in
+several works. R1-V (Chen et al., 2025b) applies GRPO to VLMs, enabling thinking in tasks such as
+counting and geometry. LMM-R1 (Peng et al., 2025) introduces a two-stage pipeline that transfers
+textual thinking into multimodal learning. Visual RFT (Liu et al., 2025c) and R1-VL (Zhang et al.,
+2025a) incorporate vision-specific rewards to guide fine-grained, visually grounded optimization.
+Since GRPO depends on models’ initial structured thinking ability, these methods typically build on
+strong VLMs, such as the Qwen-VL series (Bai et al., 2025).
+Hybrid Training Paradigms (Chu et al., 2025; Yan et al., 2025; Zhang et al., 2025b). To harness the
+complementary strengths of SFT and RL, researchers have also investigated hybrid paradigms. A
+common approach is a two-stage training process (Chu et al., 2025) that first uses SFT to teach the
+model the desired output format, followed by RL for exploration. Although intuitive, this method is
+highly sensitive to the amount of SFT, a parameter that is particularly challenging to tune for SVLMs,
+as these smaller models can easily become trapped in suboptimal states. Alternative strategies attempt
+to continuously blend SFT with RL, for instance, by incorporating SFT as an annealed auxiliary
+loss (Zhang et al., 2025b) or by managing its influence with an empirical shaping function (Yan et al.,
+2025). However, all these strategies ultimately rely on an empirically determined balance between
+the two paradigms. This rigidity represents a critical flaw when applied to SVLMs. The absence of
+adaptive control over the SFT weight renders these methods brittle and unreliable.
+Thus, existing paradigms are not directly transferable to SVLMs due to their inherent limitations
+in model capacity and instruction-following ability. This highlights the need for a novel training
+paradigm that imposes minimal requirements on the base VLM.
+3 APPROACH
+3.1 PRELIMINARIES
+We first briefly recap the two training paradigms (SFT and RLVR) that underlie our method. Let
+$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be the training set, where $x$ denotes the input (e.g., an image-instruction pair) and
+$y$ the desired output. The model defines a conditional distribution $p_\theta(y \mid x)$ with parameters $\theta$.
+Supervised Fine-Tuning (SFT). For each training pair $(x, y)$ in $\mathcal{D}$, SFT updates the model by
+minimizing the negative log-likelihood (cross-entropy) of the desired output $y$ under the conditional
+distribution $p_\theta(y \mid x)$:
+$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\log p_\theta(y \mid x)\big]. \tag{1}$$
+This teacher-forcing loss allows models to memorize extensive training examples, compelling the
+model to absorb this knowledge.
+Group Relative Policy Optimization (GRPO). GRPO is an RL algorithm that explores open-ended
+generation by comparing candidate outputs within a group. For each input $x$, the policy $p_\theta$ samples a
+set $\{\tilde{y}^k\}_{k=1}^{K}$; a reward function $r_a(\tilde{y}^k)$ is computed based on the correctness of the output answer,
+and each sample’s advantage $A$ is measured relative to the other group members:
+$$A(\tilde{y}^k) = \frac{r_a(\tilde{y}^k) - \bar{r}_a}{\sigma + \varepsilon}, \quad \bar{r}_a = \frac{1}{K}\sum_{j=1}^{K} r_a(\tilde{y}^j), \quad \sigma = \sqrt{\frac{1}{K}\sum_{j=1}^{K}\big(r_a(\tilde{y}^j) - \bar{r}_a\big)^2}, \tag{2}$$
+--- PAGE 5 ---
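+where $\varepsilon$ is a small constant for numerical stability. The policy then updates its parameters by
+minimizing the following loss, regularised by a KL constraint:
+$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = -\mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{\tilde{y}\sim p_\theta}\Big[\min\big(r_\theta(x,\tilde{y})A(\tilde{y}),\ \mathrm{clip}\big(r_\theta(x,\tilde{y});\,1-\epsilon,\,1+\epsilon\big)A(\tilde{y})\big)\Big] + \beta\,D_{\mathrm{KL}}\big[p_\theta(\cdot \mid x)\,\|\,p_{\mathrm{ref}}(\cdot \mid x)\big], \quad \text{where } r_\theta(x,\tilde{y}) = \frac{p_\theta(\tilde{y} \mid x)}{p_{\mathrm{old}}(\tilde{y} \mid x)}. \tag{3}$$
+The clip and KL terms work together to keep each update close to safe regions of the parameter space:
+the clip gate limits step size around the rollout policy $p_{\mathrm{old}}$, while the KL term ($\beta D_{\mathrm{KL}}$) tethers the
+policy to the reference $p_{\mathrm{ref}}$ (typically the initial model).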
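To make Eq. 2 concrete, the following is a minimal Python sketch of the group-normalized advantage, assuming plain Python lists of scalar rewards. The function name `group_advantages` and the default `eps` value are illustrative choices, not taken from the paper or its code release.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages as in Eq. 2 (population std over the group)."""
    k = len(rewards)
    mean = sum(rewards) / k
    # Population standard deviation, matching the 1/K factor in Eq. 2.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical groups of K = 4 answer rewards.
mixed = group_advantages([0, 1, 1, 0])  # two correct, two incorrect responses
flat = group_advantages([0, 0, 0, 0])   # every response incorrect
```

When every reward in the group is identical, as in `flat`, all advantages are zero and the group carries no learning signal, which is one way to read the advantage collapse discussed in the paper.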
+Gradient Compatibility of SFT and GRPO. Below, we reveal that the optimization objectives of
+SFT and GRPO are formally equivalent, with the former targeting the ground-truth data distribution
+and the latter an internal one.
+The gradient of the SFT loss is straightforward:
+$$\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\nabla_\theta \log p_\theta(y \mid x)\big]. \tag{4}$$
+Similarly, the GRPO gradient (ignoring clipping and any KL-penalty) can be written as
+$$\nabla_\theta \mathcal{L}_{\mathrm{GRPO}}(\theta) = -\mathbb{E}_{x\sim\mathcal{D},\ \tilde{y}\sim p_{\mathrm{old}}(\cdot \mid x)}\big[r_\theta(x,\tilde{y})\,A(\tilde{y})\,\nabla_\theta \log p_\theta(\tilde{y} \mid x)\big]. \tag{5}$$
+This comparison shows that the SFT gradient is a special case of the GRPO gradient, obtained when
+the ground-truth sample is used with unit advantage. This equivalence enables a unified loss that
+balances external imitation (SFT) with internal refinement (GRPO). Achieving this fusion requires
+dynamically weighting the two signals (§3.2) and ensuring stylistic consistency between external
+ground-truth and self-generated outputs (§3.3).
+3.2 DYNAMIC MEMORIZE–EXPLORE (DYME)
+To realize this complementarity, we propose the Dynamic Memorize–Explore (DyME) paradigm,
+which adaptively switches between SFT and GRPO at each training step. In the following, we first
+outline the overall pipeline and then elaborate on the optimization procedures for each mode.
+Overall. As shown in Fig. 3a, each training step begins with an input $x = (I, q)$, where $I$ is the image
+and $q$ is an instruction. The policy SVLM $p_\theta$ generates $K$ responses $\{\tilde{y}^k\}_{k=1}^{K}$. Each response is
+parsed into a thinking trace and a final answer, which is then verified for correctness using predefined
+rules. The verification results fall into two categories: either all responses are incorrect (including
+those that fail to parse), or at least one is correct. The decision rule is as follows: if at least one
+response is correct, the model proceeds with GRPO-based exploration; otherwise, it falls back to
+SFT-based memorization. Formally, the training mode is switched as:
+$$\mathrm{mode}(x) = \begin{cases} \mathrm{GRPO}, & \text{if } \max_k r_a(\tilde{y}^k) = 1, \\ \mathrm{SFT}, & \text{otherwise}, \end{cases} \tag{6}$$
+where $r_a(\tilde{y}^k) \in \{0, 1\}$ indicates whether $\tilde{y}^k$ passes rule-based verification. Though simple, this
+decision rule is highly effective. When all responses are incorrect, the answer rewards are essentially
+all zero and the normalized advantages become dominated by noise, making GRPO updates for a
+small SVLM unstable. In this regime, falling back to SFT provides a low-variance, ground-truth
+guided gradient. Conversely, the appearance of at least one correct response indicates that the current
+policy has already discovered a feasible solution for this input, so GRPO can safely exploit the
+relative advantages to drive exploration.
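The decision rule in Eq. 6 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: answers are wrapped in `<answer>` tags as in the Fig. 1 examples, verification is a case-insensitive exact string match, and the names `answer_reward` and `select_mode` are hypothetical rather than taken from the repository.

```python
import re

# Assumed output format: the final answer is enclosed in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def answer_reward(response, target):
    """Return 1 if the response parses and its answer matches the target, else 0."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0  # responses that fail to parse count as incorrect
    return int(match.group(1).strip().lower() == target.strip().lower())


def select_mode(responses, target):
    """Eq. 6: explore with GRPO if any response is correct, otherwise fall back to SFT."""
    rewards = [answer_reward(r, target) for r in responses]
    mode = "GRPO" if max(rewards, default=0) == 1 else "SFT"
    return mode, rewards
```

For example, `select_mode(["<answer>2012</answer>", "2012 has the greatest divergence."], "2012")` selects `"GRPO"` with rewards `[1, 0]`, while a group in which no response parses or matches selects `"SFT"`.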
+GRPO Mode. DyME introduces a key refinement to the original GRPO: beyond the answer reward
+$r_a$, it incorporates an auxiliary reward $r_t$ for thinking traces. This reward is computed by evaluating
+the generated traces against expected thinking patterns (e.g., via a token-level F1 score against the
+ground-truth), promoting structured thinking.
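As an illustration of the token-level F1 comparison mentioned above, here is a minimal Python sketch. It assumes whitespace tokenization and lowercasing, which the paper does not specify; the function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between a generated trace and a reference trace."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A score of this kind could serve as the thinking reward $r_t$ for a single response; how it is weighted against $r_a$ is not stated in this section and is left open here.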
+Given these rewards, we update the policy using a modified GRPO objective. Unlike the standard
+formulation (Eqs. 2 & 3), we omit the KL penalty and clipping terms, as the dynamic integration of
+--- PAGE 6 ---
+[Figure 3 (image omitted). (a) The pipeline for DyME. (b) Visual refiner and checker.]
+Figure 3: Workflow and module components of DyME. At each training step, DyME dynamically
+switches between memorization (via SFT) and exploration (via GRPO) modes based on its generations.
+Visual supervision is introduced through the visual refiner and visual checker. The refiner enhances
+the targets for memorization by incorporating richer visual elements (green), while the checker
+rewards the thinking context generated based on their visual relevance.
+SFT already stabilizes training. This avoids overly conservative updates and yields a cleaner gradient
+form, enabling smoother alignment between SFT and GRPO:
+$$\tilde{\mathcal{L}}_{\mathrm{GRPO}}(\theta) = -\mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{\tilde{y}\sim p_\theta(\cdot \mid x)}\big[r_\theta(x,\tilde{y})\,A(\tilde{y})\big], \tag{7}$$
+where $A(\tilde{y}^k)$ is the group-normalized advantage calculated from the combined answer ($r_a$) and
+thinking ($r_t$) rewards, and $r_\theta(x,\tilde{y}^k) = \frac{p_\theta(\tilde{y}^k \mid x)}{p_{\mathrm{old}}(\tilde{y}^k \mid x)}$ is the importance sampling ratio.
+SFT Mode. When training falls back to SFT, the model is optimized toward the ground-truth response
+$y$ using the standard supervised loss (Eq. 1). This ensures that whenever the model fails to explore
+effectively, it receives a stable, ground-truth-guided gradient update to correct its behavior.
+DyME Objective. The final loss dynamically combines the two objectives based on response correctness:
+$$\mathcal{L}_{\mathrm{DyME}}(\theta) = \mathbb{1}\Big[\max_k r_a(\tilde{y}^k) = 1\Big]\cdot\tilde{\mathcal{L}}_{\mathrm{GRPO}}(\theta) + \Big(1 - \mathbb{1}\Big[\max_k r_a(\tilde{y}^k) = 1\Big]\Big)\cdot\mathcal{L}_{\mathrm{SFT}}(\theta), \tag{8}$$
+where $\mathbb{1}[\cdot]$ is the indicator function, returning 1 if the condition holds and 0 otherwise.
+3.3 VISION SUPERVISION
+DyME with Visual Supervision. While the aforementioned Pure DyME (using standard $r_t$ and
+static ground-truth) already guarantees training stability through its dynamic switching mechanism,
+we can further exploit this dynamic nature to maximize performance. Specifically, the switching
+mechanism allows us to tailor the supervision signals at each optimization step: refining the reward
+during exploration and enhancing the ground-truth during memorization. To this end, we introduce a
+checker–refiner framework (cf. Fig. 3b), which constitutes the Full DyME.
+This framework reorganizes the ground-truth to adhere to a predefined structure, crucially transforming
+it into a grounded thinking trace. The refiner restructures the external ground-truth into structured,
+visually grounded responses, while the checker evaluates self-generated outputs for their structural
+organization and coverage of visual content. We refer to the resulting supervision signals collectively
+as vision supervision. The implementation is carried out via LLM-based prompt engineering.
+Visual Facts $I_c$ are central to realizing vision supervision. They are defined as fine-grained visual
+components extracted from each image, including objects, attributes, and states. These elements play
+--- PAGE 7 ---
+a dual role: they provide evidence for evaluating generations against the image and serve as building
+blocks for constructing complete ground-truth responses.
+Visual Checker. The visual checker evaluates responses along two dimensions: (i) whether the output
+contains sufficient correct visual elements compared to $I_c$, and (ii) whether it aligns stylistically with
+provided examples. These examples may be manually defined or extracted from the SFT ground-truth.
+Visual Refiner. The refiner produces visually grounded responses for SFT by leveraging the model’s
+validated explorations. High-scoring traces identified by the visual checker are stored in a dynamic
+example pool. An LLM then draws from this pool to generate ground-truth responses, integrating
+structural templates with visual facts from $I_c$ and referencing the collected examples.
+In essence, the acquisition of Visual Facts, the evaluation by the Visual Checker, and the synthesis by
+the Visual Refiner are all implemented via structured prompt engineering using Qwen2.5-14B. Please
+refer to the Supplementary Materials for the full prompts used in our pipeline.
+4 EXPERIMENTS
+To rigorously evaluate DyME, we structure our experiments into two parts: (1) Algorithmic Validation,
+where we evaluate “Pure DyME” in a controlled setting using offline data to isolate the
+contribution of our dynamic switching mechanism; and (2) System Effectiveness, where we evaluate
+the full DyME pipeline (with Visual Supervision) across diverse domains to demonstrate its practical
+capability in empowering SVLMs.
+4.1 PART I: ALGORITHMIC VALIDATION (PURE DYME)
+Setup. Since SVLMs lack intrinsic reasoning capabilities and cannot autonomously discover complex
+reasoning paths, pre-constructed CoT data is a mandatory prerequisite for all training paradigms. We
+therefore evaluated all methods on ChartQA (Masry et al., 2022) using LLaVA-OV-S (Li et al., 2024a),
+the 0.5B variant, with three pre-constructed CoT datasets of varying qualities: Low (Undesigned)
+containing unstructured traces (∼80 words); Medium (Standard) consisting of semi-structured
+traces (∼89 words) from Qwen2.5-14B; and High (Premium) comprising highly structured traces
+(∼142 words) from GPT-4o. Following established protocols (Liu et al., 2023; Masry et al., 2022),
+we report relaxed correctness, which allows a 5% tolerance for numerical answers.
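The relaxed correctness metric can be sketched as follows. This is a simplified reading of the 5% tolerance rule, assuming that any answer pair parseable as floats is compared numerically and everything else falls back to a case-insensitive exact match; the function name `relaxed_correct` is hypothetical, and the official ChartQA scorer may handle edge cases (percent signs, units) differently.

```python
def relaxed_correct(prediction, target, tolerance=0.05):
    """Relaxed correctness: numeric answers may deviate by up to 5% of the target."""
    try:
        pred_value, target_value = float(prediction), float(target)
    except (TypeError, ValueError):
        # Non-numeric (or missing) answers: require a case-insensitive exact match.
        if prediction is None:
            return False
        return str(prediction).strip().lower() == str(target).strip().lower()
    if target_value == 0:
        return pred_value == 0
    return abs(pred_value - target_value) / abs(target_value) <= tolerance
```

Under this sketch a prediction of `"31"` for a target of `"30"` counts as correct (relative error of about 3.3%), while `"33"` does not (10%). Because the rule applies to any numeric string, answers such as years are also matched within the tolerance, which is a known looseness of this style of metric.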
+We present a threefold evaluation to validate data robustness, design optimality, and generalization:
+(1) Robustness to Data Quality. Table 1(a) demonstrates DyME’s superiority. On Low quality data,
+Pure DyME (61.9%) significantly outperforms the unstable Two-stage baseline (57.6%). Remarkably,
+using only Medium data, it surpasses the SFT baseline trained on premium High (GPT-4o) data
+(61.6%). This confirms that DyME acts as a robust student, effectively maximizing data efficiency.
+(2) Optimality of Binary Switching. To validate our binary design, we compared it against three
+alternative switching heuristics in Table 1(b): (i) Reward Thresholding, which switches to RL only
+if the batch average reward exceeds a threshold t; (ii) SFT Annealing, which applies a weighted
+SFT loss alongside RL at every step; and (iii) SFT Budget, which performs focused SFT updates on
+accumulated failure cases (hard mining).
+Results: Reward Thresholding proves brittle, collapsing at suboptimal thresholds (t = 0.5, 52.4%).
+SFT Annealing incurs a heavy computational tax (+25%) due to the auxiliary SFT gradient calculation.
+SFT Budget yields inferior results (59.6%) as overwhelming the model with concentrated failures
+destabilizes learning. In contrast, DyME’s binary switch is parameter-free, efficient, and empirically
+optimal (64.9%).
+(3) Mechanism Generality. Although DyME is primarily tailored for SVLMs, we also verify the
+universality of its core switching mechanism beyond the primary setup (see Supplementary). In
+the text-only domain, it boosts the small-scale Qwen2.5-0.5B on GSM8K (Cobbe et al., 2021) to
+55.3% (+5.8% over GRPO), confirming DyME is an effective paradigm for empowering thinking in
+small-parameter models regardless of modality. Moreover, the paradigm scales effectively: on the
+stronger Qwen2.5-VL-7B, it further improves ChartQA performance to 89.6% (+2.3%).
+--- PAGE 8 ---
+Table 1: Algorithmic Validation of Pure DyME. (a) DyME outperforms SFT and Two-stage variants
+(w/ and w/o KL penalty) across all data qualities. (b) The binary switch is more robust and efficient
+than soft or hard-mining alternatives (evaluated on Medium data).
+(a) Robustness across Data Quality
+Method            | Low  | Medium | High
+SFT               | 50.5 | 57.8   | 61.6
+Two-stage         | 57.6 | 59.9   | 54.5
+Two-stage (w/ KL) | 55.4 | 60.8   | 62.7
+Pure DyME         | 61.9 | 64.9   | 68.5
+(b) Switching Strategy Ablation
+Strategy             | Hyperparam.     | Acc.           | Cost
+Reward Threshold     | t = 0.5/0.8/0.9 | 52.4/64.1/63.4 | None
+SFT Annealing        | Cosine          | 64.0           | +25%
+SFT Budget           | Hard Mining     | 59.6           | Budget-dep.
+Binary Switch (Ours) | –               | 64.9           | Baseline
+4.2 PART II: SYSTEM EFFECTIVENESS (FULL DYME)
+Having validated the algorithmic core, we now evaluate the Full DyME pipeline, augmented with
+Visual Supervision, across three diverse domains: Medical VQA, Chart Understanding, and Geometry.
+Each followed the evaluation protocols of prior work (Zong et al., 2024).
+Setup & Source of $I_c$. Unlike Part I, here we activate the Visual Supervision module to enable
+the full online loop. Crucially, to demonstrate DyME’s capability to bootstrap from raw signals,
+we utilize the “Undesigned” CoT data (defined in §4.1) derived from SLAKE (Liu et al., 2021),
+ChartQA (Masry et al., 2022), and Geo170K (Gao et al., 2025) as the common training source for
+all methods. Acquiring the necessary visual facts ($I_c$) is a fully automated process: we leverage
+standard domain tools (e.g., BiomedGPT (Zhang et al., 2024a) for medical, DePlot (Liu et al., 2023)
+for charts) or prompt generalist LLMs (e.g., Qwen2.5 (Team, 2024)) to parse images into structured
+textual descriptions. The automated pipeline and prompts are included in the supplementary.
+Evaluation Protocol. We used official train-test splits for SLAKE (Accuracy/Recall) and ChartQA
+(Relaxed correctness). For Geometry, since Geo170K (Gao et al., 2025) provides no test set, we
+evaluated Accuracy on MathVerse (Zhang et al., 2024b), consistent with Zong et al. (2024).
+4.2.1 MAIN RESULTS
+DyME vs. Existing Training Paradigms. The comprehensive results in Table 2 show that DyME
+consistently delivers substantial gains. Notably, after training with DyME, SmolVLM improves from
+49.9 to 55.6 (+5.7), LLaVA-OV-S from 50.7 to 55.4 (+4.7), and InternVL2-S from 56.3 to 58.1 (+1.8).
+In contrast, existing paradigms tend to degrade performance (e.g., SFT lowers SmolVLM to 44.1),
+validating our analysis that SFT yields pseudo thinking traces and GRPO faces advantage collapse
+(cf. Fig. 4). DyME effectively mitigates these issues. It promotes grounded traces that are concise
+yet informative (cf. Fig. 5), aligning well with the limited capacity of SVLMs. Importantly, DyME
+places minimal demands on the base model: even SmolVLM (0.5B) achieves substantial gains,
+and it still delivers improvements (+2.6%) on extensively pretrained models like InternVL2-S. We
+further corroborated these findings through manual inspection, as detailed in the Supplementary
+Material.
+Figure 4: Training rewards. GRPO and two-stage training suffer from severe advantage collapse.
+Matching the Efficacy of GPT-4o Supervision with Open-Source Models. Comparing results
+between Part I and Part II reveals a crucial finding: LLaVA-OV-S trained with the full DyME
+pipeline (using the accessible Qwen2.5-14B) achieves 67.5% (Table 2). This effectively matches the
+performance of Pure DyME trained on expensive GPT-4o data (68.5%, cf. Table 1). This proves that
+full DyME allows open-source supervision to achieve training outcomes comparable to those derived
+from top-tier proprietary models, eliminating the need for expensive data annotation.
+DyME-trained SVLMs Can Be Competitive with LVLMs. We ensured fairness by exposing all
+baselines to our training data. As shown in Table 2, SVLMs trained with DyME can surpass stronger
+--- PAGE 9 ---
+Table 2: Comparisons across three domains: medical VQA, chart understanding, and geometry
+solving. The evaluation follows the VLMEvalKit framework (Duan et al., 2024). For SVLMs, existing
+training paradigms degrade their performance, whereas DyME consistently brings improvements. The
+best performance achieved by each SVLM is highlighted in bold, with the relative improvement also
+indicated. Notably, after being trained with DyME, SVLMs achieve performance comparable to that
+of MoVA (underlined).
+Model                            | ViT            | LLM             | Medical | Chart  | Geometry | Avg.
+LVLMs
+LLaVA-Med (Li et al., 2023)      | CLIP-ViT-L/14  | Vicuna-7B       | 64.3    | –      | –        | –
+Cambrian-1 (Tong et al., 2024)   | Hybrid-3B      | Llama3-8B       | –       | 72.6   | 22.0     | –
+LLaVA-1.5 (Liu et al., 2024a)    | CLIP-ViT-L/14  | Vicuna-7B       | 69.4    | 17.8   | –        | –
+LLaVA-1.6 (Liu et al., 2024b)    | CLIP-ViT-L/14  | Vicuna-7B       | 78.2    | 49.2   | 13.4     | 47.0
+MoVA (Zong et al., 2024)         | Hybrid-3B      | Vicuna-7B       | 74.5    | 68.3   | 19.7     | 54.2
+LLaVA-OV-L (Li et al., 2024a)    | SigLIP-SO400M  | Qwen2-7B        | 75.7    | 80.9   | 24.5     | 60.4
+InternVL2-L (Chen et al., 2024)  | InternViT-300M | InternLM2.5-7B  | 80.2    | 82.1   | 37.3     | 66.5
+SVLMs
+SmolVLM (Marafioti et al., 2025) | SigLIP-93M     | SmolLM2-360M    | 72.1    | 63.2   | 14.6     | 49.9
+  +CoT SFT                       | SigLIP-93M     | SmolLM2-360M    | 60.1    | 57.7   | 14.5     | 44.1
+  +GRPO                          | SigLIP-93M     | SmolLM2-360M    | 61.1    | 53.8   | 17.1     | 44.0
+  +Two-stage                     | SigLIP-93M     | SmolLM2-360M    | 59.4    | 60.1   | 16.7     | 45.4
+  +DyME                          | SigLIP-93M     | SmolLM2-360M    | 78.1    | 69.7   | 18.9     | 55.6
+  (improvement)                  |                |                 | +6.0%   | +6.5%  | +4.3%    | +5.7%
+LLaVA-OV-S (Li et al., 2024a)    | SigLIP-400M    | Qwen2-0.5B      | 74.9    | 61.4   | 15.9     | 50.7
+  +Two-stage                     | SigLIP-400M    | Qwen2-0.5B      | 74.5    | 52.9   | 16.5     | 48.0
+  +DyME                          | SigLIP-400M    | Qwen2-0.5B      | 78.3    | 67.5   | 20.4     | 55.4
+  (improvement)                  |                |                 | +3.4%   | +6.1%  | +4.5%    | +4.7%
+InternVL2-S (Chen et al., 2024)  | InternViT-300M | Qwen2-0.5B      | 78.3    | 71.9   | 18.7     | 56.3
+  +Two-stage                     | InternViT-300M | Qwen2-0.5B      | 73.6    | 55.7   | 17.1     | 48.8
+  +DyME                          | InternViT-300M | Qwen2-0.5B      | 80.0    | 74.5   | 19.8     | 58.1
+  (improvement)                  |                |                 | +1.7%   | +2.6%  | +1.1%    | +1.8%
+LVLMslikeMoVA(54.2)onthesespecializeddomains,withSmolVLMreaching55.6andLLaVA-
+OV-S55.4. Asaresult,DyME-trainedSVLMsbecomereliableoptionsfortask-specificapplications
+onresource-constrainededgedevices.
+4.2.2 ABLATION STUDY
+To dissect the source of these gains, we conducted an ablation study to analyze the contribution of
+DyME's four core components within the full pipeline: the memorization mode, exploration mode,
+visual refiner, and visual checker. Table 3 shows the performance impact.
+Dynamic Switching Mechanism. The results confirm that Memorization and Exploration are
+symbiotic. Disabling memorization causes a catastrophic drop (55.4 → 43.9), effectively reverting
+to unconstrained, unstable exploration. Conversely, removing exploration (50.4) restricts the model
+to the static imitation of suboptimal data. As shown in Fig. 4, their dynamic interplay prevents
+the advantage collapse observed in baselines, ensuring optimization stability throughout the
+learning process.
+Table 3: Ablation study. Model: LLaVA-OV-S.
+DyME Variant        Medical   Chart     Geometry  Average
+DyME (full)         78.3      67.5      20.4      55.4
+w/o memorization    63.2      53.4      15.0      43.9 (20.6%↓)
+w/o exploration     75.5      61.3      14.5      50.4 (9.0%↓)
+w/o visual refiner  75.6      62.3      16.8      51.6 (6.9%↓)
+w/o visual checker  76.9      64.3      17.1      52.8 (4.7%↓)
+Visual Supervision. Removing the visual checker and refiner drops performance by 4.7% and
+6.9%, respectively. This validates the pivotal role of visual supervision in bootstrapping from
+noisy, undesigned data. Given the limited capacity of SVLMs, they are easily prone to hallucination
+when trained on low-quality traces. The visual components act as a dynamic denoiser, ensuring that
+raw, imperfect data is filtered and refined into grounded visual facts (I_c) before optimization,
+thus enabling robust learning even from weak supervision.
+--- PAGE 10 ---
+[Figure 5 panels, reconstructed from the extracted text]
+Chart understanding
+  Question (Input Image): What is the difference between the values of 2017 and 2016?
+  Original: The difference between the values of 2017 and 2016 is 19000.
+  DyME:
+    Extraction: data is value 36700 for 2017 and 29000 for 2016
+    Calculation: 36700 - 29000 = 7700
+    Conclusion: The difference between the 2017 and 2016 values is 7700.
+    Answer: 7700
+Geometry solving
+  Question (Input Image): Given AB // CD, angle 1 = 50.0, then what the degree of angle 2?
+  Original: Angle 2 is equal to angle 3, angle 2 is also 50 degrees.
+  DyME:
+    Extraction: AB is parallel to CD, angle 1 = 50°.
+    Calculation: angle 2 = 180° - angle 1 = 180° - 50° = 130°.
+    Conclusion: The degree of angle 2 is 130°.
+    Answer: 130°
+Figure 5: Showcases on chart understanding and geometry solving. We use LLaVA-OV-S
+to demonstrate the results. The SVLM originally produces hallucinated answers (red), while the
+DyME-trained model generates structured thinking traces (green) that incorporate grounded values,
+effectively improving the performance.
+4.3 TRAINING EFFICIENCY & DISCUSSION
+We analyze the computational efficiency and performance trade-offs associated with different
+configurations of DyME. The comparative results are detailed in Table 4.
+Computational Efficiency vs. Data Cost. The framework offers two distinct operating regimes
+catering to different resource profiles. Pure DyME represents the high-efficiency regime: when
+offline CoT data is pre-constructed, it maintains training throughput comparable to standard GRPO
+(~14s/step) while delivering superior performance. In contrast, Full DyME (with Visual
+Supervision) prioritizes data autonomy. While the online interaction introduces a computational
+overhead (~1.6× training time), it enables the model to bootstrap high-performance reasoning
+solely from open-source models, bypassing the dependency on expensive, proprietary data annotation
+(e.g., GPT-4o).
+Table 4: Cost-Benefit Analysis. Time measured in sec/step. Run on 8x H800.
+Method            Ext. Model      Time    Acc.
+GRPO (Baseline)   Qwen2.5-14B†    14.8s   60.8
+Pure DyME         Qwen2.5-14B†    14.0s   64.9
+Pure DyME         GPT-4o†         19.1s   68.5
+Full DyME         Qwen2.5-7B      21.2s   66.8
+Full DyME         Qwen2.5-14B     23.4s   67.5
+† Used for offline data construction only.
+Sensitivity to External Model Capacity. For Full DyME, we further examine the impact of the
+external helper's size on system performance. As shown in Table 4, replacing the Qwen2.5-14B
+helper with the smaller 7B variant results in a negligible performance variation (67.5% → 66.8%).
+This indicates that our structured prompt engineering effectively decomposes complex reasoning
+tasks, allowing even smaller external models to provide sufficient guidance for SVLMs without
+necessitating heavy-weight models.
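+The overhead and sensitivity figures quoted in this subsection can be re-derived from Table 4.
+The snippet below is only an illustrative sketch: the dictionary layout and the helper names
+are assumptions made for this example, and the numbers are transcribed from the table.

```python
# Figures transcribed from Table 4: (seconds per step, accuracy in %).
TABLE4 = {
    "GRPO (Baseline)": (14.8, 60.8),
    "Pure DyME, Qwen2.5-14B": (14.0, 64.9),
    "Pure DyME, GPT-4o": (19.1, 68.5),
    "Full DyME, Qwen2.5-7B": (21.2, 66.8),
    "Full DyME, Qwen2.5-14B": (23.4, 67.5),
}


def overhead_vs_baseline(name, baseline="GRPO (Baseline)"):
    """Ratio of a configuration's step time to the baseline's step time."""
    return TABLE4[name][0] / TABLE4[baseline][0]


def accuracy_gap(a, b):
    """Accuracy of configuration a minus accuracy of configuration b, in points."""
    return TABLE4[a][1] - TABLE4[b][1]


full_overhead = overhead_vs_baseline("Full DyME, Qwen2.5-14B")
helper_gap = accuracy_gap("Full DyME, Qwen2.5-14B", "Full DyME, Qwen2.5-7B")
print(f"Full DyME overhead: {full_overhead:.2f}x, 14B vs 7B helper: {helper_gap:.1f} points")
```

+Under these numbers, the full pipeline costs roughly 1.6 times the baseline step time, and
+moving from the 14B to the 7B helper costs well under one accuracy point.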
+Applicability of Visual Supervision. The effectiveness of the Visual Supervision module relies on
+the explicit extraction of Visual Facts (I_c). This process creates specific applicability
+boundaries. For domains involving abstract semantics (e.g., irony in memes) or unstructured
+perception (e.g., dense crowds), converting holistic visual signals into discrete text may result
+in information loss. In such scenarios, reverting to the Pure DyME paradigm serves as a more
+robust alternative.
+5 CONCLUSION
+In this work, we introduced DyME, a novel training paradigm designed to empower thinking
+capabilities within SVLMs. At its core, DyME combines memorization (via SFT) mode and exploration
+(via RLVR) mode through a dynamic switching mechanism. Our experiments demonstrate that this
+approach not only resolves the critical trade-off between these two modes but also yields
+substantial performance gains on a wide spectrum of vision tasks, from recognition-intensive to
+reasoning-intensive scenarios. The success of DyME is attributed to its carefully designed
+components: the dynamic switching mechanism addresses pseudo thinking traces and advantage
+collapse, while the visual checker and refiner provide coordinated, high-quality visual
+supervision. It imposes minimal requirements on the base VLM, making it broadly applicable to a
+wide range of models, including extremely lightweight SVLMs. Therefore, DyME serves as the
+practical solution for empowering SVLMs to think.
+--- PAGE 11 ---
+ACKNOWLEDGMENT
+This work was supported by the Hong Kong SAR RGC General Research Fund (16219025), National
+Natural Science Foundation of China Young Scholar Fund Category B (62522216), National Natural
+Science Foundation of China Young Scholar Fund Category C (62402408), and Hong Kong SAR
+RGC Early Career Scheme (26208924).
+REFERENCES
+Alon Albalak, Akshat Shrivastava, Chinnadhurai Sankar, Adithya Sagar, and Mike Ross.
+Data-efficiency with a single gpu: An exploration of transfer methods for small language models.
+arXiv preprint arXiv:2210.03871, 2022.
+Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou,
+and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization,
+text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
+Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang,
+Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan,
+Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng,
+Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. arXiv
+preprint arXiv:2502.13923, 2025.
+Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie.
+SFT or RL? an early investigation into training R1-like reasoning large vision-language models.
+arXiv preprint arXiv:2504.11468, 2025a.
+Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-V: Reinforcing super generalization
+ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V,
+2025b. Accessed: 2025-02-02.
+Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong
+Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal
+models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
+Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V
+Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation
+model post-training. arXiv preprint arXiv:2501.17161, 2025.
+Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
+Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve
+math word problems. arXiv preprint arXiv:2110.14168, 2021.
+DeepSeek, Inc. DeepSeek-R1 Release. https://api-docs.deepseek.com/news/news250120,
+January 2025. Accessed: Jun. 21, 2025.
+Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang
+Zang, Pan Zhang, Jiaqi Wang, et al. VLMEvalKit: An open-source toolkit for evaluating large
+multi-modality models. In ACM MM, 2024.
+Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong,
+Jianhua Han, Hang Xu, Zhenguo Li, et al. G-LLaVA: Solving geometric problem with multi-modal
+large language model. In ICLR, 2025.
+Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, and Aman Chadha. Exploring the
+frontier of vision-language models: A survey of current methodologies and future directions. arXiv
+preprint arXiv:2404.07214, 2024.
+Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu,
+Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-R1: Incentivizing reasoning capability in LLMs
+via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
+Vik Korrapati. Moondream. https://moondream.ai/, 2024. Accessed: 2025-03-27.
+--- PAGE 12 ---
+Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, and Xiaofeng Yang. Med-R1: Reinforcement
+learning for generalizable medical reasoning in vision-language models. arXiv preprint
+arXiv:2503.13939, 2025.
+Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei
+Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv preprint
+arXiv:2408.03326, 2024a.
+Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan
+Naumann, Hoifung Poon, and Jianfeng Gao. LLaVA-Med: Training a large language-and-vision
+assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:
+28541–28564, 2023.
+Zhuowan Li, Bhavan Jasani, Peng Tang, and Shabnam Ghadar. Synthesize step-by-step: Tools
+templates and LLMs as data generators for reasoning-based chart VQA. In CVPR, 2024b.
+Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. SLAKE: A semantically-labeled
+knowledge-enhanced dataset for medical visual question answering. In ISBI, 2021.
+Fangyu Liu, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee,
+Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language
+reasoning by plot-to-table translation. In Findings of the ACL, 2023.
+Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction
+tuning. In CVPR, 2024a.
+Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee.
+LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024b. URL
+https://llava-vl.github.io/blog/2024-01-30-llava-next/.
+Jiazhen Liu and Long Chen. Segmentation as a plug-and-play capability for frozen multimodal llms.
+arXiv preprint arXiv:2510.16785, 2025.
+Jiazhen Liu, Mingkuan Feng, and Long Chen. Better, stronger, faster: Tackling the trilemma in
+mllm-based segmentation with simultaneous textual mask prediction. arXiv preprint
+arXiv:2512.00395, 2025a.
+Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang,
+and Xirong Li. PhD: A chatgpt-prompted visual hallucination evaluation dataset. In CVPR, 2025b.
+Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi
+Wang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025c.
+Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka,
+Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. SmolVLM: Redefining small and
+efficient multimodal models. arXiv preprint arXiv:2504.05299, 2025.
+Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark
+for question answering about charts with visual and logical reasoning. In Smaranda Muresan,
+Preslav Nakov, and Aline Villavicencio (eds.), Findings of the ACL, May 2022.
+OpenAI. Introducing OpenAI o1. https://openai.com/o1/, December 2024. Accessed: Jun.
+21, 2025.
+Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang,
+Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3B LMMs with strong reasoning
+abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025.
+Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,
+Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical
+reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
+Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun
+Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. VLM-R1: A stable
+and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.
+--- PAGE 13 ---
+Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL
+https://qwenlm.github.io/blog/qwen2.5/.
+Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha
+Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully
+open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing
+Systems, 37:87310–87356, 2024.
+Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Peng Ye,
+Min Dou, Botian Shi, et al. ChartX & ChartVLM: A versatile benchmark and foundation model
+for complicated chart reasoning. arXiv preprint arXiv:2402.12185, 2024.
+Renqiu Xia, Mingsheng Li, Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng,
+Xinyu Cai, Xiangchao Yan, Bin Wang, et al. GeoX: Geometric problem solving through unified
+formalized vision-language pre-training. In ICLR, 2025.
+Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision
+language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024.
+Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang.
+Learning to reason under off-policy guidance. arXiv preprint arXiv:2504.14945, 2025.
+Lele Yang, Muxi Diao, Kongming Liang, and Zhanyu Ma. GRPO for LLaVA.
+https://github.com/PRIS-CV/GRPO-for-Llava, 2025a.
+Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng
+Yin, Fengyun Rao, Minfeng Zhu, et al. R1-OneVision: Advancing generalized multimodal
+reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025b.
+Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating
+the catastrophic forgetting in multimodal large language model fine-tuning. In CPAL, 2023.
+Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao.
+R1-VL: Learning to reason with multimodal large language models via step-wise group relative
+policy optimization. arXiv preprint arXiv:2503.12937, 2025a.
+Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Yan, Yixin Liu, Jun Yu, Zhengliang Liu, Xun
+Chen, Brian D Davison, Hui Ren, et al. A generalist vision–language foundation model for diverse
+biomedical tasks. Nature Medicine, pp. 1–13, 2024a.
+Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou,
+Pan Lu, Kai-Wei Chang, Peng Gao, et al. MathVerse: Does your multi-modal llm truly see the
+diagrams in visual math problems? 2024b.
+Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding,
+and Jingren Zhou. On-policy RL meets off-policy experts: Harmonizing supervised fine-tuning
+and reinforcement learning via dynamic weighting. arXiv preprint arXiv:2508.11408, 2025b.
+Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal
+chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
+Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei
+Huang. TinyLLaVA: A framework of small-scale large multimodal models. arXiv preprint
+arXiv:2402.14289, 2024.
+Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li,
+and Yu Liu. MoVA: Adapting mixture of vision experts to multimodal context. In NeurIPS, 2024.
+--- PAGE 14 ---
+Empowering Small VLMs to Think with Dynamic Memorization
+and Exploration
+Supplementary Material
+In the supplementary materials, we report:
+• LLM instructions used for constructing vision supervision (§S1);
+• Detailed experimental setup and additional experimental results (§S2);
+• Showcases of SVLMs trained via DyME performing on medical VQA, chart understanding,
+and geometry problem solving (§S3).
+S1 LLM INSTRUCTIONS FOR VISION SUPERVISION
+The instructions for constructing I_c, the visual refiner, and the visual checker are listed as
+follows.
+S1.1 INSTRUCTIONS FOR EXTRACTING VISUAL ELEMENTS
+I_c is primarily derived from two sources: ground truth captions, and the outputs from specialized
+tools such as the chart-parsing model DePlot. Prompt S1 is employed to extract visual elements from
+captions.
+You are a helpful assistant that analyzes images and provides visual facts.
+Your response MUST be a single, valid JSON object.
+The JSON object should contain:
+1. "description": A detailed and accurate description of the image.
+2. "objects": A list of key objects, including their name, attributes, and approximate position
+   in the image.
+
+Example format:
+{
+  "description": "A person riding a bicycle on a city street.... (detailed description here)",
+  "objects": [
+    {"name": "person", "attributes": ["wearing helmet", "blue shirt"], "position": "center"},
+    {"name": "bicycle", "attributes": ["red", "mountain bike"], "position": "center"},
+    {"name": "street", "attributes": ["asphalt", "wet"], "position": "bottom"}
+  ]
+
+Analyze the attached image and provide the visual facts in the required JSON format.
+For context, the user will be asked this question about the image (do not answer the question,
+just use it for context):
+"{question}"
+}
+Prompt S1: Automated Visual Fact Extraction
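+Because Prompt S1 demands a single JSON object with a fixed shape, a response can be checked
+mechanically before it is used as I_c. The following is a minimal sketch of such a check using
+only the Python standard library; the function name parse_visual_facts and the decision to
+reject malformed responses with ValueError are assumptions for illustration, not part of the
+paper's pipeline.

```python
import json

# Keys that every entry of "objects" is expected to carry, per the Prompt S1 example.
REQUIRED_OBJECT_KEYS = {"name", "attributes", "position"}


def parse_visual_facts(raw: str) -> dict:
    """Parse a Prompt S1 style response and raise ValueError if it is malformed."""
    try:
        facts = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(facts, dict):
        raise ValueError("top-level value must be a JSON object")
    if not isinstance(facts.get("description"), str):
        raise ValueError('"description" must be a string')
    objects = facts.get("objects")
    if not isinstance(objects, list):
        raise ValueError('"objects" must be a list')
    for index, obj in enumerate(objects):
        if not isinstance(obj, dict):
            raise ValueError(f"object {index} must be a JSON object")
        missing = REQUIRED_OBJECT_KEYS - set(obj)
        if missing:
            raise ValueError(f"object {index} is missing keys: {sorted(missing)}")
    return facts


sample = (
    '{"description": "A person riding a bicycle on a city street.", '
    '"objects": [{"name": "bicycle", "attributes": ["red"], "position": "center"}]}'
)
facts = parse_visual_facts(sample)
print(facts["objects"][0]["name"])
```

+A response that fails this check could be retried or dropped; which of the two is appropriate
+depends on the training setup and is not specified in the text.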
+S1.2 INSTRUCTIONS FOR VISUAL REFINER
+The visual refiner primarily takes thinking templates from a template pool and I_c as input to
+construct the ground truth for SFT.
+Given:
+<IC>: the data of an image
+<Q>: a question
+--- PAGE 15 ---
+<A>: a reference answer
+<T>: a writing template
+
+Goal:
+Transform the visual information in <IC> into a textualized data description and incorporate it
+into a smooth, natural explanation that reasons why the correct answer to <Q> is <A>, using the
+format and tone defined by <T>.
+
+Example:
+<IC>: [
+  {"object": "bar", "attributes": ["~120k", "Q4"], "label": "Product A"},
+  {"object": "bar", "attributes": ["~150k", "Q4"], "label": "Product B"},
+  {"object": "bar", "attributes": ["~90k", "Q4"], "label": "Product C"},
+  {"title": "Quarterly Revenue"}
+]
+<Q>: Which product has the highest revenue in Q4?
+<A>: product b
+<T>:
+  [Extraction] Extract the relevant data points from <IC> that are needed to answer <Q>.
+  [Calculation] Analyze and compare the extracted data to determine the correct result.
+  [Conclusion] Based on the analysis, state why <A> is the correct answer to <Q>.
+
+<Output>:
+  [Extraction] The image shows a chart comparing three items-Product A, Product B, and
+  Product C-in Q4. The bar heights indicate their revenues: Product A at about 120k, Product B
+  at about 150k, and Product C at about 90k.
+  [Calculation] Comparing these values, Product B's revenue (150k) is greater than both
+  Product A's (120k) and Product C's (90k).
+  [Conclusion] Since Product B has the highest revenue among the three, the correct answer to
+  the question "Which product has the highest revenue in Q4?" is product b.
+
+Now, according to the requirements and the examples above, convert my input into the target
+reasoning text. Please give me the result directly without any explanation or description.
+
+<IC>: %s
+<Q>: %s
+<A>: %s
+<T>: %s
+<Output>:
+Prompt S2: Ground-truth construction for Chart Understanding SFT
+Prompts for the other domains follow a similar design.
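+The four %s slots at the end of Prompt S2 are filled with the visual facts, the question, the
+reference answer, and a thinking template. A minimal sketch of that substitution is shown
+below. Only the slot layout is taken from the prompt; the constant and function names, and the
+choice to serialize the visual facts with json.dumps, are assumptions for this example.

```python
import json

# Tail of Prompt S2: the four slots are filled in this order.
REFINER_INPUT_SLOTS = "<IC>: %s\n<Q>: %s\n<A>: %s\n<T>: %s\n<Output>:"

# Chart-domain thinking template, copied from the example inside Prompt S2.
CHART_TEMPLATE = (
    "[Extraction] Extract the relevant data points from <IC> that are needed to answer <Q>.\n"
    "[Calculation] Analyze and compare the extracted data to determine the correct result.\n"
    "[Conclusion] Based on the analysis, state why <A> is the correct answer to <Q>."
)


def build_refiner_input(visual_facts, question, answer, template=CHART_TEMPLATE):
    """Fill the four %s slots of Prompt S2 (hypothetical helper)."""
    return REFINER_INPUT_SLOTS % (json.dumps(visual_facts), question, answer, template)


visual_facts = [
    {"object": "bar", "attributes": ["~120k", "Q4"], "label": "Product A"},
    {"object": "bar", "attributes": ["~150k", "Q4"], "label": "Product B"},
]
refiner_input = build_refiner_input(
    visual_facts, "Which product has the highest revenue in Q4?", "product b"
)
print(refiner_input)
```

+In practice the instruction and example portion of Prompt S2 would precede this tail; it is
+omitted here to keep the sketch short.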
+S1.3 INSTRUCTIONS FOR VISUAL CHECKER
+The visual checker is primarily responsible for scoring the thinking trace of responses generated
+in the GRPO process. It evaluates these traces with reference to exemplars, based on their fluency
+and the degree to which the mentioned visual elements align with I_c. Prompts for the other
+domains follow a similar design.
+Given
+<IC>: the data of an image
+<Q>: a question
+<A>: a reference answer
+<R>: a reasoning text
+
+--- PAGE 16 ---
+Goal:
+Assess whether <R> correctly and reasonably uses visible data in <IC> to justify that the correct
+answer to <Q> is <A>. Rate the quality as low / medium / high according to:
+(a) low: Does not use data from <IC> at all, or the language is not fluent/natural, or it fails
+    to indicate the answer to <Q> is <A>.
+(b) medium: Uses data from <IC> and is written fluently, but the reasoning is overly brief or
+    insufficiently clear.
+(c) high: Uses data from <IC> and is written fluently; the reasoning progresses step by step with
+    depth, each step is correct and reasonable; the data from <IC> appears exactly where it
+    should; overall, the reasoning text provides very strong support that the answer to <Q> is
+    <A>.
+
+Example:
+<IC>: [
+  {"object": "bar", "attributes": ["~120k", "Q4"], "label": "Product A"},
+  {"object": "bar", "attributes": ["~150k", "Q4"], "label": "Product B"},
+  {"object": "bar", "attributes": ["~90k", "Q4"], "label": "Product C"},
+  {"title": "Quarterly Revenue"}
+]
+<Q>: Which product has the highest revenue in Q4?
+<A>: product b
+<R>:
+  [Extraction] Reads Q4 bar heights: A ~120k, B ~150k, C ~90k.
+  [Calculation] Compares values: B > A and B > C.
+  [Conclusion] Therefore, Product B is highest, matching the answer "product b".
+
+<Output>: medium
+
+According to the requirements and examples above, score the input into three categories. Please
+give me the result directly without any explanation or description.
+
+<IC>: %s
+<Q>: %s
+<A>: %s
+<R>: %s
+<Output>:
+Prompt S3: Scoring generations during GRPO for Chart Understanding
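+Prompt S3 asks the checker for one of three labels, so its raw text has to be normalized
+before it can feed a reward. The sketch below shows one way to do that. The numeric values
+assigned to low / medium / high are placeholders chosen for illustration; the text does not
+state how the labels are converted into rewards, and the function names are hypothetical.

```python
from typing import Optional

# Placeholder mapping: the actual reward values are NOT given in the text.
ASSUMED_LABEL_SCORES = {"low": 0.0, "medium": 0.5, "high": 1.0}


def parse_checker_label(raw: str) -> Optional[str]:
    """Normalize the checker's reply to 'low', 'medium', or 'high'; None if unrecognized."""
    label = raw.strip().strip("\"'.").lower()
    return label if label in ASSUMED_LABEL_SCORES else None


def checker_score(raw: str, default: float = 0.0) -> float:
    """Map a checker reply to a scalar, falling back to `default` for unparseable replies."""
    label = parse_checker_label(raw)
    return default if label is None else ASSUMED_LABEL_SCORES[label]


print(parse_checker_label(" Medium.\n"), checker_score("high"), checker_score("unsure"))
```

+Treating an unparseable reply as the lowest score is a conservative choice; skipping the sample
+would be an equally plausible alternative.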
+S2 EXPERIMENTAL DETAILS AND EXTRA RESULTS
+S2.1 TRAINING AND TEST SETTING
+First, we provide the statistical information for the training and testing phases of our
+experiments in Table S1. The training dataset for each domain consists of only a few thousand
+samples. In addition, Fig. S1 visualizes a comparison between the ground-truth responses produced
+by the refiner and the original ground-truth, showing that the refined versions are noticeably
+more structured and place greater emphasis on intermediate values.
+Table S1: Training and testing setup. DyME empowers thinking capabilities based on small training
+sets.
+Domain                Training set    # Training samples  Source of I_c   Test set
+Medical VQA           SLAKE-Train     4,919               BiomedGPT       SLAKE-Test
+Chart Understanding   ChartQA-Train   4,576               DePlot          ChartQA-Test
+Geometry Solving      Geo170K         6,417               Collected       MathVerse
+--- PAGE 17 ---
+[Figure S1 panels, reconstructed from the extracted text]
+Medical (SLAKE).
+  Q: What is the MR weighting in this image?
+  w/o refiner: Based on signal variations typical of this image, the most plausible answer is T2.
+  w/ refiner:
+    Analysis: On T2, fluid is bright; bone/air and fast flow are dark.
+    Extraction: In this axial slice, intraluminal fluid is bright; vessels low signal; ribs/wall
+    dark.
+    Conclusion: Pattern matches T2.
+Chart (ChartQA).
+  Q: What is the sum of making progress, losing ground, and same in 2011?
+  w/o refiner: To answer the question, we first locate the relevant part of the chart for 2011,
+  then extract the values for the three series and sum them up to obtain 99.
+  w/ refiner:
+    Extraction: Read 2011 values: Making progress = 45, Losing ground = 29, Same = 25.
+    Calculation: 45 + 29 + 25 = 99.
+    Conclusion: The sum is 99.
+Math (Geo170k).
+  Q: If angle 1 is 35°, what is angle 2? Choices: A: 55°, B: 135°, C: 145°, D: 155°.
+  w/o refiner: Since angle 1 is 35 degree, angle 2 is 180 - 35 = 145 degree using a supplementary
+  relationship. Choose C.
+  w/ refiner:
+    Extraction: Two parallel lines CD and AB with a transversal; angle 1 is 35 degree, and
+    angle 2 is on the same side.
+    Calculation: Same-side interior angles are supplementary, so angle 2 = 180 - 35 = 145 degree.
+    Conclusion: Answer: C.
+Figure S1: Comparison of ground-truth responses before and after refinement. Compared to
+the original ground-truth, the refiner injects richer visual elements and enforces a more
+structured organization, thereby reducing the learning burden for SVLMs.
+S2.2 EXTRA RESULTS
+We also report additional experimental content, including the discussion on training strategies and
+data organization formats, as well as a comparative analysis with other similar methods that
+integrate SFT and RL.
+Specifically, (1) we first demonstrate the importance of constructing vision supervision, which
+proves essential for training SVLMs to produce grounded thinking traces. (2) We then examine the
+impact of structured versus open-ended output formats on thinking performance. (3) We provide a
+detailed comparison with alternative methods that integrate SFT and RL. (4) Furthermore, to
+validate our earlier observation that SVLMs are prone to converging to local optima, we present
+performance across different training epochs, showing that SFT training saturates after only one
+epoch. (5) Finally, we extend our evaluation to stronger base models and pure textual domains,
+and (6) validate the quality of generated thinking traces through human evaluation.
+Table S2: Two-stage training on ChartQA. Rel-corr denotes the relaxed-correctness metric. I_c
+indicates whether an explicit image-content field is supervised (✓ yes; ✗ no).
+Model         I_c   Rel-corr
+SmolVLM       ✓     64.32
+SmolVLM       ✗     60.09
+LLaVA-OV-S    ✓     63.62
+LLaVA-OV-S    ✗     52.90
+(1) Intermediate values matter. As shown in Table S2, we report the effect of applying two-stage
+training with visual supervision on SmolVLM and LLaVA-OV-S. Incorporating visual supervision
+significantly improves the best performance achieved during training (64.32 vs. 60.09 for SmolVLM
+and 63.62 vs. 52.90 for LLaVA-OV-S), despite certain instabilities, thereby validating its critical
+role for SVLMs. This effect is further illustrated in Fig. S1, where visual supervision compels
+SVLMs to generate intermediate reasoning enriched with visual elements, which make a clear
+contribution to the final answer.
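+Table S2 reports Rel-corr, the relaxed-correctness metric. As a rough illustration, the sketch
+below follows the relaxed-accuracy convention commonly used for ChartQA: numeric answers are
+accepted within a 5% relative tolerance and all other answers require a case-insensitive exact
+match. The exact normalization used in this evaluation is not described here, so the tolerance
+default, the string cleaning, and the function name are assumptions.

```python
from typing import Optional


def _to_float(text: str) -> Optional[float]:
    """Parse a numeric answer, tolerating a trailing percent sign and thousands separators."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_correct(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Relaxed correctness: numeric match within `tolerance`, else exact string match."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


print(relaxed_correct("103", "100"), relaxed_correct("110", "100"))
```

+The score for a test set would then be the fraction of samples for which this check passes.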
+(2) Structured thinking alleviates the learning burden of SVLMs. Table S3 reports the performance
+gap between training with structured thinking ground-truth and with unconstrained ground-truth.
+While open-ended exploration is often beneficial for LVLMs, the limited capacity of SVLMs
+makes unconstrained exploration less effective, as it tends to be aimless and increases the
+learning burden. Given that SVLMs are designed for task-specific rather than general-purpose
+scenarios, employing tailored thinking templates for each task proves more suitable and yields
+better performance. For instance, SmolVLM achieves 60.10 vs. 59.24 on ChartQA and 59.38 vs. 56.13
+on Medical VQA, with LLaVA-OV-S exhibiting similar gains.
+--- PAGE 18 ---
+[Figure S2 plot: x-axis "Steps" (0 to 2500); y-axis "Proportion of Off-policy" (0.0 to 1.0);
+curves: DyME, CHORD-µ, LUFFY]
+Figure S2: Relative off-policy influence during training. Each curve is normalized to its initial
+value for comparability. DyME measures SFT/(SFT+RL) (raw in lighter tone, Gaussian-smoothed
+in darker tone), CHORD-µ tracks the global weight µ(t), and LUFFY adopts a policy-shaping proxy
+E[f(π_θ(a))] with f(x) = x/(x+γ). All methods reveal the shift from off-policy guidance to
+on-policy optimization, albeit with distinct decay dynamics.
+Table S3: Effect of templated output across models and tasks. ✓ denotes fixed-template output,
+whereas ✗ denotes free-form generation.
+Model         Template  Chart   Medical
+SmolVLM       ✓         60.10   59.38
+SmolVLM       ✗         59.24   56.13
+LLaVA-OV-S    ✓         52.87   74.52
+LLaVA-OV-S    ✗         50.86   72.64
+(3) Comparison between annealed SFT loss and DyME. As shown in Fig. S2, we compare the relative
+SFT (off-policy) influence across training steps for three approaches: DyME, CHORD (Zhang et al.,
+2025b), and LUFFY (Yan et al., 2025). For DyME and CHORD, the curves represent the normalized
+weight of the SFT loss at each step, while for LUFFY the curve reflects the trajectory of SFT
+gradient shaping as a function of prediction probability (which generally correlates with training
+steps). These curves highlight the dynamic nature of DyME. Because of the extremely limited
+capacity of SVLMs, their learning patterns can shift significantly even between adjacent steps,
+leading to rapid forgetting of previously acquired modes. Unlike CHORD, which relies on a smooth
+annealing schedule that decays quickly and is ill-suited to such small models, DyME assigns weights
+directly based on model outputs. This produces a highly dynamic and irregular decay, better
+accommodating the instability of SVLMs. LUFFY adopts a shaping function f(x) = x/(x+γ) (γ=0.1),
+which also induces a dynamic decay with probability but remains heuristic and may not align well
+with the local-optimum tendency of SVLMs. Overall, DyME is explicitly tailored for SVLMs, whereas
+CHORD and LUFFY may be more appropriate for stronger base models, reflecting complementary
+strengths.
+Table S4: SVLM performance saturates after a single training epoch. Score is domain-specific:
+chart domain uses Rel-corr, while the medical domain uses the average of accuracy and recall
+values.
+Model         Domain    Epoch   Score
+LLaVA-OV-S    Chart     1       60.70
+LLaVA-OV-S    Chart     5       60.44
+LLaVA-OV-S    Chart     10      60.12
+SmolVLM       Chart     1       60.22
+SmolVLM       Chart     5       63.21
+SmolVLM       Chart     10      62.22
+SmolVLM       Medical   1       71.73
+SmolVLM       Medical   5       71.80
+SmolVLM       Medical   10      72.05
+--- PAGE 19 ---
+Table S5: Detailed learning trajectories demonstrating rigorous tuning. We report the performance
+across multiple settings to show their full learning trajectories. Two-stage baselines include
+variants with and without KL penalties to ensure optimal performance is captured.
+Data Quality  Method              Performance across epochs (1, 3, 5, 10)   Best perf.
+Low           DyME (ours, pure)   Report final score directly               61.9
+Low           SFT                 43.1 → 47.9 → 50.0 → 50.5                 50.5
+Low           Two-stage           57.6 → 52.7 → 50.8 → 50.7                 57.6
+Low           Two-stage (w/ KL)   54.2 → 55.4 → 52.6 → 54.2                 55.4
+Medium        DyME (ours, pure)   Report final score directly               64.9
+Medium        SFT                 53.6 → 56.5 → 57.8 → 56.4                 57.8
+Medium        Two-stage           59.9 → 52.8 → 53.0 → 53.1                 59.9
+Medium        Two-stage (w/ KL)   59.0 → 60.6 → 60.6 → 60.8                 60.8
+High          DyME (ours, pure)   Report final score directly               68.5
+High          SFT                 58.2 → 59.1 → 61.0 → 61.6                 61.6
+High          Two-stage           51.6 → 54.0 → 54.5 → 54.4                 54.5
+High          Two-stage (w/ KL)   61.7 → 60.9 → 62.7 → 61.8                 62.7
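+The quantities plotted in Fig. S2 can be written down compactly. The sketch below implements
+the LUFFY shaping function f(x) = x/(x+γ) with γ = 0.1 as stated in the text, plus the
+SFT/(SFT+RL) proportion, normalization to the initial value, and a simple Gaussian smoothing
+pass. What exactly is counted as "SFT" and "RL" per step, the smoothing width, and all function
+names are assumptions made for illustration.

```python
import math


def luffy_shaping(p: float, gamma: float = 0.1) -> float:
    """Policy-shaping function f(x) = x / (x + gamma) applied to a probability p."""
    return p / (p + gamma)


def off_policy_proportion(sft: float, rl: float) -> float:
    """SFT / (SFT + RL); defined as 0.0 when both terms are zero."""
    total = sft + rl
    return sft / total if total else 0.0


def normalize_to_initial(curve):
    """Divide every point by the first value so that the curve starts at 1.0."""
    first = curve[0]
    return [value / first for value in curve]


def gaussian_smooth(curve, sigma: float = 2.0):
    """Smooth a curve with a truncated, renormalized Gaussian kernel (radius 3 * sigma)."""
    radius = int(3 * sigma)
    smoothed = []
    for i in range(len(curve)):
        weighted_sum = weight_total = 0.0
        for j in range(max(0, i - radius), min(len(curve), i + radius + 1)):
            weight = math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
            weighted_sum += weight * curve[j]
            weight_total += weight
        smoothed.append(weighted_sum / weight_total)
    return smoothed


raw = [off_policy_proportion(s, r) for s, r in [(8, 2), (6, 4), (7, 3), (3, 7), (2, 8)]]
print(normalize_to_initial(raw))
print(luffy_shaping(0.1), luffy_shaping(0.9))
```

+With γ = 0.1 the shaping function is already 0.5 at probability 0.1 and approaches 1 as the
+probability grows.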
+(4) SVLMs converge rapidly. Table S4 shows that SVLMs converge extremely quickly: performance
+after only one epoch is comparable to, or even exceeds, that after ten epochs (e.g., LLaVA-OV-S
+achieves 60.70 vs. 60.12 on the Chart domain). This indicates that the very limited capacity of
+SVLMs makes them prone to overfitting to local optima. It also substantiates our earlier claim that
+such rapid convergence leaves only a narrow window for balancing SFT and RL, making it difficult to
+achieve the trade-off through empirical hyperparameter tuning. Consequently, static fusion methods
+are unsuitable for SVLMs.
+To ensure a rigorous comparison, we further report the full learning trajectories of baselines in
+Table S5. We evaluated the Two-stage baseline (with and without KL penalty) and SFT across
+multiple epochs (1, 3, 5, 10) to capture their peak performance. The results confirm that even with
+optimal stopping, the baselines consistently underperform DyME, which achieves superior results in
+a single training run without the need for epoch selection.
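+The optimal-stopping comparison can be reproduced directly from Table S5. The sketch below
+picks the best epoch for each baseline and reports the margin by which Pure DyME exceeds the
+strongest baseline at each data-quality level. The numbers are transcribed from the table; the
+data layout and function names are assumptions for this example.

```python
EPOCHS = (1, 3, 5, 10)

# Baseline trajectories transcribed from Table S5 (scores at epochs 1, 3, 5, 10).
TRAJECTORIES = {
    "Low": {
        "SFT": (43.1, 47.9, 50.0, 50.5),
        "Two-stage": (57.6, 52.7, 50.8, 50.7),
        "Two-stage (w/ KL)": (54.2, 55.4, 52.6, 54.2),
    },
    "Medium": {
        "SFT": (53.6, 56.5, 57.8, 56.4),
        "Two-stage": (59.9, 52.8, 53.0, 53.1),
        "Two-stage (w/ KL)": (59.0, 60.6, 60.6, 60.8),
    },
    "High": {
        "SFT": (58.2, 59.1, 61.0, 61.6),
        "Two-stage": (51.6, 54.0, 54.5, 54.4),
        "Two-stage (w/ KL)": (61.7, 60.9, 62.7, 61.8),
    },
}
DYME_PURE = {"Low": 61.9, "Medium": 64.9, "High": 68.5}


def best_epoch(scores):
    """Return (epoch, score) for the highest score in a trajectory."""
    index = max(range(len(scores)), key=scores.__getitem__)
    return EPOCHS[index], scores[index]


def margin_over_best_baseline(quality):
    """DyME score minus the best score any baseline reaches at any epoch."""
    best = max(best_epoch(scores)[1] for scores in TRAJECTORIES[quality].values())
    return DYME_PURE[quality] - best


for quality in TRAJECTORIES:
    print(quality, round(margin_over_best_baseline(quality), 1))
```

+Even when each baseline is evaluated at its own best epoch, the margin stays positive at all
+three data-quality levels.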
+(5) Generality across complex reasoning and pure text. To demonstrate the scalability of DyME,
+we applied it to two new domains without modifying the core algorithm: Physical Reasoning
+(A-OKVQA) and Pure Text Reasoning (GSM8K).
+• Physical Reasoning (A-OKVQA): We addressed the challenge of open-ended visual reasoning
+by testing on A-OKVQA. We used Qwen2.5-VL-7B to automatically generate Visual Facts using
+the prompt defined in §S1 (e.g., “man, wearing a light blue and white shirt...”). As shown in
+Table S6, DyME achieved a massive gain of +18.8% (54.2% → 73.0%), proving that the method
+scales effortlessly to tasks requiring world knowledge and commonsense.
+• Pure Text Reasoning (GSM8K): In pure text domains, the “Visual Fact” extraction step is
+naturally skipped. On the GSM8K math benchmark, DyME improved Qwen2.5-0.5B from 49.5% to 55.3%,
+demonstrating that the paradigm generalizes even when “vision” is absent.
+These results, combined with the ChartQA improvements on the stronger Qwen2.5-VL-7B model,
+confirm that DyME is not limited by the extraction step. By leveraging off-the-shelf LVLMs to
+automate visual fact generation, the framework is immediately applicable to diverse visual and
+textual domains.
+Limitations on Abstract Visuals. We acknowledge that the VS module may face challenges
+in scenarios where “Visual Facts” are intrinsically difficult to define or extract, such as memes
+(relying on irony or cultural context) or highly abstract non-commonsense reasoning. However,
+Published as a conference paper at ICLR 2026
+Table S6: Generality of DyME across New Domains. We demonstrate performance gains on
+Complex Scenes (A-OKVQA), Pure Text (GSM8K), and with stronger base models (Qwen2.5-VL-7B).
+Baselines for text use standard GRPO.
+Domain | Task | Base Model | Method | Baseline (%) | DyME (%)
+World Knowledge | A-OKVQA | LLaVA-OV-S | Two-stage | 54.2 | 73.0 (+18.8)
+Pure Text | GSM8K | Qwen2.5-0.5B | GRPO | 49.5 | 55.3 (+5.8)
+New LVLM | ChartQA | Qwen2.5-VL-7B | SFT | 87.3 | 89.6 (+2.3)
+our primary objective is to empower SVLMs for practical, real-world production tasks (e.g., chart
+processing, medical diagnostics, geometric solving). In these structured and semi-structured domains
+where SVLMs are most commonly deployed, Visual Facts are well-defined and DyME proves highly
+effective.
+(6) Human evaluation of CoT quality. Automatic metrics like relaxed accuracy do not fully
+reflect the quality of the reasoning process. To verify whether DyME generates genuinely better
+thinking traces, we conducted a human evaluation on 100 randomly sampled instances from ChartQA.
+Annotators judged the validity of the generated CoT based on its logical coherence and grounding.
+As shown in Table S7, DyME produces traces that are slightly more concise (shorter length) but
+significantly more valid (validity rate ∼70%) compared to the Two-stage baseline (∼30-40%). This
+confirms that DyME effectively mitigates the generation of “pseudo thinking traces” that plague
+standard SFT/Two-stage training.
+Table S7: Human evaluation of CoT quality on ChartQA.
+Base Model | Method | Avg. CoT Length | Human Eval (Valid %)
+LLaVA-OV-S | Two-stage | ∼76.3 words | 31%
+LLaVA-OV-S | DyME | ∼69.7 words | 68%
+SmolVLM | Two-stage | ∼84.5 words | 40%
+SmolVLM | DyME | ∼75.4 words | 72%
+S3 SHOWCASES
+Before presenting the model outputs, we first illustrate the data quality definitions used in our
+Algorithmic Validation (Section 4.1 of the main paper). Table S8 showcases examples of Low
+(Undesigned), Medium (Standard), and High (Premium) quality Chain-of-Thought supervision for
+the same question. This visualizes the significant gap in structure and detail that DyME must bridge
+when trained on non-premium data. Furthermore, Table S9 illustrates the comprehensive format of
+our supervision data, encompassing the input image, the associated question, the extracted visual
+facts, and the ground-truth response.
+We present dialogue instances of SmolVLM, LLaVA-OV-S, and InternVL2-S, which were trained
+with DyME in our experiments, on tasks in the domains of medical VQA, chart understanding, and
+geometry. As shown in Fig S3 to Fig S5, all models trained with DyME demonstrate the ability to
+generate effective thinking traces with accurate intermediate values (in green), which play a crucial
+role in reaching the final correct answer.
+The thinking processes of these models are relatively fixed and template-based. This is a result of
+our vision supervision module taking effect, as SVLMs should not be overly broad and general;
+otherwise, their exploration can easily diverge instead of converging.
+Table S8: Showcase of chain-of-thought (CoT) supervision with different quality for ChartQA.
+Type Content
+Question:
+When does the unfavorable view reach the peak?
+High-quality CoT:
+<SUMMARY>I inspect the line graph showing very unfavorable views of Pakistan in India over
+time, and identify the year where the percentage peaks by locating the maximum point on the
+curve.</SUMMARY>
+<CAPTION>Line graph titled “Very unfavorable views of Pakistan increasing in India”. The
+y-axis shows the share of respondents with a very unfavorable view (0–100%). The x-axis
+shows years 2013–2017. Marked values: 54% (2013), 49% (2014), 51% (2015), 55% (2016),
+and 64% (2017). The series exhibits an overall upward trend and peaks in 2017.</CAPTION>
+<REASONING>The peak corresponds to the highest percentage on the y-axis. Among the
+annotated years, 2017 has the largest value (64%), higher than 2013–2016. Therefore, the peak
+occurs in 2017.</REASONING>
+<CONCLUSION>2017</CONCLUSION>
+Medium-quality CoT:
+Goal: Determine the year when the unfavorable view reaches its maximum.
+Observation: The values for each year are: 2013: 0, 2014: 0, 2015: 0, 2016: 55, and 2017: 64.
+Reasoning: Comparing these values, the largest value is 64, which appears in 2017.
+Conclusion: The unfavorable view reaches its peak in 2017.
+Low-quality CoT:
+I’m trying to figure out the year when the unfavorable view reaches its highest point. The values
+stay low until 2016, where they rise to 55, and then climb further to 64 in 2017. This makes
+2017 the year when the unfavorable view reaches its maximum.
+Table S9: Showcase of a complete CoT training example.
+Type Content
+Question:
+What is the man by the bags awaiting?
+Visual Fact:
+{"description": "A man stands in the middle of a residential street, facing away from the
+camera. He holds a red bag and pulls a wheeled black suitcase, with another black suitcase
+placed nearby. Houses, parked cars, and trees line the background, suggesting he is waiting
+to cross or be picked up.", "objects": [
+{"name": "man", "attributes": ["light blue and white shirt", "blue jeans", "carrying a red bag", "pulling a wheeled suitcase"], "position": "center"},
+{"name": "red bag", "attributes": ["held by the man"], "position": "left of the man"},
+{"name": "black suitcase", "attributes": ["wheeled", "being pulled"], "position": "near the man's feet"},
+{"name": "black suitcase", "attributes": ["lying on the ground"], "position": "near the man"},
+{"name": "street", "attributes": ["asphalt", "urban residential"], "position": "foreground"},
+{"name": "houses", "attributes": ["residential homes"], "position": "background left"},
+{"name": "parked cars", "attributes": ["red SUV", "other vehicles"], "position": "background center and left"},
+{"name": "trees", "attributes": ["green foliage"], "position": "background right"}
+]}
+GT Response:
+A train would not appear on a residential street, and the man’s luggage suggests he is waiting
+for transportation rather than a delivery. The nearby skateboarder is not interacting with him,
+leaving a cab as the most plausible option.
+(a) Showcase of Medical VQA
+(b) Showcase of Chart Understanding
+(c) Showcase of Geometry Solving
+Figure S3: Showcases of SmolVLM. The SVLM originally produces hallucinated answers (red),
+while the DyME-trained model generates structured thinking traces (green) that incorporate grounded
+values, effectively improving the performance.
+(a) Showcase of Medical VQA
+(b) Showcase of Chart Understanding
+(c) Showcase of Geometry Solving
+Figure S4: Showcases of InternVL2-S. The SVLM originally produces hallucinated answers (red),
+while the DyME-trained model generates structured thinking traces (green) that incorporate grounded
+values, effectively improving the performance.
+(a) Showcase of Chart Understanding
+(b) Showcase of Medical VQA
+(c) Showcase of Geometry Solving
+Figure S5: Showcases of LLaVA-OV-S. The SVLM originally produces hallucinated answers (red),
+while the DyME-trained model generates structured thinking traces (green) that incorporate grounded
+values, effectively improving the performance.

requirements.txt ADDED

	@@ -0,0 +1,16 @@

+accelerate==1.11.0
+pycocotools
+matplotlib
+datasets
+peft==0.17.0
+qwen_vl_utils
+wandb
+transformers==4.57.1
+deepspeed==0.18.1
+trl==0.23.1
+flash-attn==2.7.4.post1
+scikit-image
+openai
+spacy
+autoawq==0.2.9

reward_utils/__pycache__/format_checks.cpython-310.pyc ADDED

Binary file (2.82 kB).

reward_utils/compute_rewards.py ADDED

	@@ -0,0 +1,126 @@

+import concurrent.futures
+from typing import List, Dict, Any
+from .checker import RewardCalculator
+def split_initial_context(text: str):
+    text = text.lower()
+    flag = 'answer:'
+    if flag in text:
+        ans = text.split(flag)[-1].strip()
+        context = text.split(flag)[0].strip()
+        ans = ans.strip('.')
+    else:
+        context = text
+        ans = text
+    return context, ans
+def calculate_rewards_in_parallel(
+    checker: RewardCalculator,
+    batch_data: Dict[str, Any],
+    gpu_id: int,
+    num_threads: int = 8,
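+    task='chart'):
+    """
+    Calculates format, answer, and thinking rewards for a batch of data in parallel
+    using a thread pool.
+    Args:
+        checker: The RewardCalculator used to score each response.
+        batch_data: A dictionary containing lists of data, including 'response',
+                    'prompt', and 'answer', plus optional 'hints' and 'tp' (answer type).
+                    For tasks containing 'world', 'direct_answers' is also required.
+        gpu_id: The ID of the GPU to be used for processing (currently unused here).
+        num_threads: The number of parallel threads to use.
+        task: The task name, e.g. 'chart'.
+    Returns:
+        A tuple (rewards, format_rewards, answer_rewards, thinking_rewards), each a list
+        with one entry per response. A total reward is 0 whenever its format reward is 0.
+    """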
+    # Extract lists of data from the input dictionary
+    responses = batch_data['response']
+    predictions = []
+    for r in responses:
+        c, p = split_initial_context(r)
+        predictions.append(p)
+    prompts = batch_data['prompt']
+    # questions = batch_data['question']
+    answers = batch_data['answer']
+    hints = batch_data['hints'] if 'hints' in batch_data else [""] * len(responses)
+    num_samples = len(responses)
+    # Safely get 'tp' (answer types), defaulting to a list of Nones.
+    # Note: answer_types is currently unused; its slot in task_answer_args is commented out.
+    answer_types = batch_data.get('tp', [None] * num_samples)
+    # Prepare the arguments for each task by zipping the data together.
+    # This creates an iterator of tuples, where each tuple contains all args for one call.
+    in_answers = answers
+    if 'world' in task:
+        in_answers = batch_data['direct_answers']
+    task_answer_args = zip(
+        predictions,
+        in_answers,
+        [task] * num_samples,
+        # [gpu_id] * num_samples,
+        # answer_types,
+        # hints
+    )
+    task_thinking_args = zip(
+        responses,
+        prompts,
+        answers,
+        hints,
+        [task] * num_samples
+    )
+    # Use a ThreadPoolExecutor to process the data in parallel.
+    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
+        # Instead of separate wrapper functions, use lambdas to call the checker methods.
+        # The '*' operator unpacks each tuple from task_answer_args / task_thinking_args
+        # into positional arguments for get_answer_reward / get_thinking_reward_prompt.
+        format_rewards = list(executor.map(lambda r: checker.get_format_reward(r, task=task), responses))
+        answer_rewards = list(executor.map(lambda args: checker.get_answer_reward(*args), task_answer_args))
+        thinking_rewards = list(executor.map(
+            lambda args: checker.get_thinking_reward_prompt(*args), task_thinking_args))
+        rewards = [0 if f == 0 else f + a + t for f, a, t in zip(format_rewards, answer_rewards, thinking_rewards)]
+    return rewards, format_rewards, answer_rewards, thinking_rewards
+def refine_context_in_parallel(
+    refiner,
+    questions: List[str],
+    hints: List[str],
+    reference_answers: List[str],
+    task,
+    gpu_id: int,
+    num_threads: int = 8):
+    """
+    Refines contexts for a batch of data in parallel using a thread pool.
+    Args:
+        refiner: An object exposing refine_hint (e.g. ContextRefiner).
+        questions: A list of questions.
+        hints: A list of hints corresponding to each question.
+        reference_answers: A list of reference answers.
+        task: A single task type applied to every question in the batch.
+        gpu_id: The ID of the GPU to be used for processing.
+        num_threads: The number of parallel threads to use.
+    Returns:
+        A list of refined contexts for each question.
+    """
+    num_samples = len(questions)
+    tasks = [task] * num_samples
+    # Prepare the arguments for each task by zipping the data together.
+    task_args = zip(
+        questions,
+        hints,
+        reference_answers,
+        tasks,
+        [gpu_id] * num_samples
+    )
+    # Use a ThreadPoolExecutor to process the data in parallel.
+    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
+        refined_contexts = list(executor.map(
+            lambda args: refiner.refine_hint(*args), task_args
+        ))
+    return refined_contexts
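
The answer parsing and reward gating in `compute_rewards.py` can be tried in isolation. The sketch below copies the logic of `split_initial_context` from the diff and restates the aggregation rule from `calculate_rewards_in_parallel` as a small helper. `combine_rewards` is a hypothetical name introduced here, and the reward values are made up for illustration; the real scores come from `RewardCalculator`, which is not shown in this section.

```python
from typing import List, Tuple


def split_initial_context(text: str) -> Tuple[str, str]:
    """Same logic as in the diff: lowercase the response, then split on 'answer:'."""
    text = text.lower()
    flag = 'answer:'
    if flag in text:
        # The answer is whatever follows the last flag; the context precedes the first.
        ans = text.split(flag)[-1].strip().strip('.')
        context = text.split(flag)[0].strip()
    else:
        # No flag: the whole response serves as both context and answer.
        context = text
        ans = text
    return context, ans


def combine_rewards(format_rewards: List[float],
                    answer_rewards: List[float],
                    thinking_rewards: List[float]) -> List[float]:
    """Hypothetical helper: zero total if the format check fails, else the plain sum."""
    return [
        0 if f == 0 else f + a + t
        for f, a, t in zip(format_rewards, answer_rewards, thinking_rewards)
    ]


context, answer = split_initial_context("Goal: find the peak. Answer: 2017.")
totals = combine_rewards([1.0, 0.0, 0.5], [1.0, 1.0, 0.0], [0.5, 0.5, 0.5])
```

The second sample has a correct answer but a failed format check, so its total is gated to zero rather than receiving partial credit.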

reward_utils/refiner.py ADDED

	@@ -0,0 +1,162 @@

+import re
+from typing import Optional
+from client_utils.openai_api import OpenAIClient
+from data_utils.chart.evaluator import eval_one_chart
+from data_utils.commom_util import prompt_ic
+import os
+import time
+from filelock import FileLock
+TEMPLATE_FILE = "best_template.txt"
+LOCK_FILE = "best_template.txt.lock"
+TEMPLATE_REFRESH_INTERVAL = 60  # Interval (in seconds) to refresh template from file
+class ContextRefiner:
+    """
+    A class to refine hints / reasoning with an external LLM.
+    Encapsulates logic for template management and refinement calls.
+    """
+    def __init__(self, RL_CONFIG, CLIENT_CONFIG, gpu_id=0):
+        """
+        Initializes the ContextRefiner.
+        Args:
+            RL_CONFIG: RL-related configuration dict.
+            CLIENT_CONFIG: LLM client configuration dict.
+            gpu_id: process / GPU id used to select API server.
+        """
+        self.refine_templetes = ["""Goal: [State the user's objective, e.g., Find the year with the highest sales]
+Observation: [List key data points from the chart, e.g., 2020: 150, 2021: 200, 2022: 180]
+Reasoning: [State the logical step, e.g., Compare the values. 200 is the maximum.]
+Conclusion: [Draw the conclusion, e.g., The year with the highest sales was 2021.]
+"""]
+        self.template_lock = FileLock(LOCK_FILE)
+        # Set to 0 so that the first call will force a refresh from file
+        self.last_template_check_time = 0
+        if CLIENT_CONFIG['client_type'] == 'openai':
+            if CLIENT_CONFIG['init_port'] is not None:
+                num_server = int(CLIENT_CONFIG['num_server'])
+                server_id = gpu_id % num_server
+                CLIENT_CONFIG['api_base'] = CLIENT_CONFIG['api_base'] % str(CLIENT_CONFIG['init_port'] + server_id)
+            self.client = OpenAIClient(config=CLIENT_CONFIG)
+        else:
+            raise ValueError(f"Client type '{CLIENT_CONFIG['client_type']}' not supported.")
+    def _check_and_update_template(self):
+        """
+        (Private method) Check whether we need to refresh the template from file.
+        This operation is process-safe.
+        """
+        current_time = time.time()
+        # 1. Check whether the refresh interval has passed
+        if (current_time - self.last_template_check_time) < TEMPLATE_REFRESH_INTERVAL:
+            return  # Not yet time, keep using cached template
+        # 2. Try to acquire the lock and read (short timeout since reading should be fast)
+        try:
+            # print(f"[Process {os.getpid()}] Checking for template update...")  # Uncomment for debugging
+            with self.template_lock.acquire(timeout=5):
+                # --- Lock acquired, safe to read ---
+                if not os.path.exists(TEMPLATE_FILE):
+                    # File does not exist, keep default template
+                    self.last_template_check_time = current_time
+                    return
+                with open(TEMPLATE_FILE, "r", encoding="utf-8") as f:
+                    new_template = f.read().strip()
+                # If file content is valid and different, update in-memory template
+                if new_template and new_template != self.refine_templetes[0]:
+                    self.refine_templetes = [new_template]
+                    print(f"[Process {os.getpid()}] Refiner template updated from file.")
+            # Regardless of success, update last check time to avoid frequent retries
+            self.last_template_check_time = current_time
+        except TimeoutError:
+            # Failed to acquire lock (another process is likely writing)
+            # Do not block; skip and try again next time
+            print(f"[Process {os.getpid()}] Failed to acquire lock for template read, using cached version.")
+            # Update time to avoid immediate retry
+            self.last_template_check_time = current_time
+        except Exception as e:
+            print(f"[Process {os.getpid()}] Error reading template file: {e}")
+            self.last_template_check_time = current_time
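+    def refine_hint(self, question, hint: str, reference_answer: str, task: str, gpu_id=None):
+        """
+        Rewrites a hint into the current template format using the external LLM.
+        Returns the hint unchanged if it is empty. Returns None if the LLM call fails
+        or no refinement prompt exists for the task (currently only 'chart' and
+        'world' tasks have one).
+        """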
+        if hint == "":
+            return hint
+        self._check_and_update_template()
+        system_prompt = None
+        if 'medical' in task:
+            system_prompt = 'You are a seasoned professional in the field of medical image analysis, demonstrating exceptional expertise and insight into complex medical imaging data. Your output should be only judgement, without any additional text or explanation.'
+        elif 'math' in task:
+            system_prompt = 'You are a seasoned professional in the field of mathematics, demonstrating exceptional expertise and insight into complex mathematical problems. Your output should be only judgement, without any additional text or explanation.'
+        elif 'chart' in task:
+            system_prompt = 'You are a seasoned professional in the field of chart analysis, demonstrating exceptional expertise and insight into complex chart data. Your output should be only judgement, without any additional text or explanation.'
+        elif 'world' in task:
+            system_prompt = 'You are a seasoned professional in the field of world knowledge and image analysis, demonstrating exceptional expertise and insight into complex real-world scenarios. Your output should be only judgement, without any additional text or explanation.'
+        else:
+            # The original constructed the exception without raising it.
+            raise ValueError(f"Unknown expert task: {task!r}")
+        try:
+            in_context_example = self.client.get_completion(
+                prompt_ic % hint,
+                system_prompt=system_prompt,
+                max_tokens=5000
+            )
+            if 'chart' in task or 'world' in task:
+                if 'chart' in task:
+                    from data_utils.chart.prompts import prompt_thinking_reward, prompt_refine
+                else:
+                    from data_utils.aokvqa.prompts import prompt_thinking_reward, prompt_refine
+                # Construct the final prompt for the refinement model.
+                evaluation_prompt = prompt_refine % (
+                    in_context_example,
+                    question,
+                    reference_answer,
+                    self.refine_templetes[0]
+                )
+                output = self.client.get_completion(
+                    evaluation_prompt,
+                    system_prompt=system_prompt,
+                    max_tokens=1000
+                )
+                return output
+            else:
+                raise ValueError(f"Task '{task}' not supported for hint refinement.")
+        except Exception as e:
+            print(f"An error occurred during hint refinement: {e}")
+            return None
+class ContextRefinerLocal:
+    """
+    A local (non-LLM) refiner that simply returns the original hint.
+    Used when remote refinement is disabled or not desired.
+    """
+    def __init__(self, RL_CONFIG, CLIENT_CONFIG, gpu_id=0):
+        """
+        Initializes the local ContextRefiner.
+        Args:
+            RL_CONFIG: RL-related configuration dict.
+            CLIENT_CONFIG: client configuration dict (unused here).
+            gpu_id: process / GPU id (unused here).
+        """
+        # Do nothing; local refiner is a no-op.
+        pass
+    def refine_hint(self, question, hint: str, reference_answer: str, task: str, gpu_id=None):
+        return hint
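
The refresh logic in `_check_and_update_template` follows a throttled-read pattern: re-read the file at most once per interval and keep the cached template otherwise. A minimal standard-library sketch of that pattern follows. `ThrottledTemplateCache` and its injectable `clock` parameter are hypothetical names introduced here for illustration, and the `FileLock` that the original uses for cross-process safety is deliberately left out to keep the example dependency-free.

```python
import os
import tempfile
import time


class ThrottledTemplateCache:
    """Illustrative stand-in: caches a template and re-reads it at most every `interval` s."""

    def __init__(self, path, default, interval=60.0, clock=time.time):
        self.path = path
        self.template = default
        self.interval = interval
        self.clock = clock          # injectable so the throttle can be exercised without sleeping
        self.last_check = 0.0       # 0 forces a refresh attempt on the first call

    def get(self):
        now = self.clock()
        if now - self.last_check < self.interval:
            return self.template    # not yet time, keep using the cached template
        # Record the attempt first so a missing or empty file does not trigger retries.
        self.last_check = now
        if os.path.exists(self.path):
            with open(self.path, "r", encoding="utf-8") as f:
                new_template = f.read().strip()
            if new_template and new_template != self.template:
                self.template = new_template
        return self.template


# Drive the cache with a fake clock.
fake_now = [1000.0]
template_path = os.path.join(tempfile.mkdtemp(), "best_template.txt")
cache = ThrottledTemplateCache(template_path, "default", clock=lambda: fake_now[0])

first = cache.get()                 # file does not exist yet
with open(template_path, "w", encoding="utf-8") as f:
    f.write("Goal: ...\n")
fake_now[0] = 1030.0
second = cache.get()                # only 30 s since the last check
fake_now[0] = 1061.0
third = cache.get()                 # interval elapsed, file is re-read
```

Updating the timestamp even when the read finds nothing mirrors the original's choice to avoid hammering the file system on every `refine_hint` call.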

tests/test_data_health_probe.py ADDED

	@@ -0,0 +1,28 @@

+"""Tests for batch data health diagnostics."""
+import torch
+from opsd_utils.diagnostics import (
+    _detect_char_repeat,
+    summarize_batch_data_health,
+)
+def test_detect_char_repeat_cjk():
+    assert _detect_char_repeat("Goal: " + "其" * 10)
+def test_summarize_batch_data_health_empty_vf():
+    samples = [
+        {"prompt": "q1", "visual_fact_hint": ""},
+        {"prompt": "q2", "visual_fact_hint": "bar value 3"},
+    ]
+    stats = summarize_batch_data_health(samples)
+    assert stats["visual_fact_empty_rate"] == 0.5
+    assert stats["batch_size"] == 2
+def test_summarize_batch_data_health_pixel_nan():
+    samples = [{"prompt": "q", "visual_fact_hint": "x"}]
+    pixel = torch.tensor([float("nan"), 1.0, 2.0])
+    stats = summarize_batch_data_health(samples, pixel_values=pixel)
+    assert stats["pixel_has_nan"] is True
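
The module under test, `opsd_utils.diagnostics`, is not part of this section, so the following is only a guess at the kind of computation these tests describe. `summarize_health` is a hypothetical stand-in written against the three keys the tests check; it uses plain Python floats rather than a `torch` tensor so it runs without extra dependencies.

```python
import math
from typing import Any, Dict, Iterable, List, Optional


def summarize_health(samples: List[Dict[str, Any]],
                     pixel_values: Optional[Iterable[float]] = None) -> Dict[str, Any]:
    """Hypothetical sketch: batch size, share of empty visual-fact hints, NaN check."""
    batch_size = len(samples)
    empty = sum(
        1 for s in samples
        if not (s.get("visual_fact_hint") or "").strip()
    )
    stats: Dict[str, Any] = {
        "batch_size": batch_size,
        "visual_fact_empty_rate": empty / batch_size if batch_size else 0.0,
    }
    if pixel_values is not None:
        # A single NaN anywhere in the pixel data marks the batch as unhealthy.
        stats["pixel_has_nan"] = any(math.isnan(v) for v in pixel_values)
    return stats


stats = summarize_health(
    [
        {"prompt": "q1", "visual_fact_hint": ""},
        {"prompt": "q2", "visual_fact_hint": "bar value 3"},
    ],
    pixel_values=[float("nan"), 1.0, 2.0],
)
```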

tests/test_degeneration_probe.py ADDED

	@@ -0,0 +1,98 @@

+"""Regression tests for completion degeneration heuristics."""
+from unittest.mock import patch
+import torch
+from opsd_utils import debug_log as opsd_debug
+from opsd_utils.diagnostics import (
+    _detect_char_repeat,
+    _detect_degeneration,
+    _detect_repeat_loop,
+    _detect_single_token_repeat,
+    _max_same_token_run,
+    is_degenerate_completion,
+    log_generate_probe,
+)
+class _FakeTokenizer:
+    eos_token_id = 151645
+    pad_token_id = 151643
+    bos_token_id = None
+    def decode(self, ids, skip_special_tokens=False):
+        if isinstance(ids, torch.Tensor):
+            ids = ids.tolist()
+        return " ".join(str(i) for i in ids)
+def test_single_token_repeat_detects_cjk_loop():
+    ids = [39992, 25, 7379] + [41146] * 40
+    assert _detect_single_token_repeat(ids)
+    assert _max_same_token_run(ids) == (40, 41146)
+def test_ngram_repeat_not_limited_to_first_eight_tokens():
+    prefix = list(range(20))
+    gram = [9, 8, 7]
+    ids = prefix + gram * 5
+    assert _detect_repeat_loop(ids)
+def test_char_repeat_detects_qiqiqi():
+    assert _detect_char_repeat("其其其其其其")
+def test_is_degenerate_completion_detects_repeat():
+    ids = [39992, 25] + [41146] * 20
+    assert is_degenerate_completion(ids, "Goal: x\n" + "其" * 40)
+def test_short_numeric_answer_not_degenerate_without_answer_flag():
+    ids = [198, 17, 15, 18, 15]  # \n2030
+    assert not is_degenerate_completion(ids, "\n2030", require_answer_flag=False)
+    assert is_degenerate_completion(ids, "\n2030", require_answer_flag=True)
+def test_degeneration_flags_missing_answer():
+    ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+    text = "Goal: test\nObservation: x\nReasoning: y\nConclusion: z"
+    is_deg, reasons = _detect_degeneration(ids, text, answer_flag="Answer:")
+    assert is_deg
+    assert any(r.startswith("ANSWER_FLAG_COUNT") for r in reasons)
+def test_log_generate_probe_does_not_shadow_tokenizer_across_samples():
+    """sample[0] single-token repeat must not break decode for sample[1]."""
+    repeat_tail = [24] * 12
+    row0 = [39992, 25, 7379] + repeat_tail + [0] * (200 - 3 - len(repeat_tail))
+    row1 = [39992, 25, 7379, 100, 101, 102] + [0] * 194
+    completion_ids = torch.tensor([row0, row1], dtype=torch.long)
+    completion_mask = torch.tensor(
+        [[1] * 15 + [0] * 185, [1] * 6 + [0] * 194],
+        dtype=torch.long,
+    )
+    is_eos = torch.zeros_like(completion_mask, dtype=torch.bool)
+    is_eos[:, 14] = True
+    is_eos[:, 5] = True
+    eos_idx = torch.tensor([14, 5], dtype=torch.long)
+    completions = ["Goal: repeat\n" + "x" * 20, "Goal: ok\nAnswer: 1"]
+    with patch.object(opsd_debug, "should_log_probe", return_value=True):
+        stats = log_generate_probe(
+            global_step=1,
+            trainer_step=1,
+            prompt_length=100,
+            prompt_completion_ids=torch.zeros(2, 300, dtype=torch.long),
+            completion_ids=completion_ids,
+            completion_mask=completion_mask,
+            is_eos=is_eos,
+            eos_idx=eos_idx,
+            completions=completions,
+            tokenizer=_FakeTokenizer(),
+            generation_config=None,
+            max_completion_length=200,
+            num_generations=1,
+            sample_count=2,
+        )
+    assert stats["degenerate_count"] >= 1
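
The heuristics these tests exercise live in `opsd_utils.diagnostics`, which is not shown here. As a rough illustration of the two ideas the test names point at (longest single-token run and a repeated n-gram anywhere in the sequence), here is a small sketch. The function names, the n-gram size, and the repeat threshold are assumptions chosen for the example, not the repository's actual values.

```python
from itertools import groupby
from typing import List, Tuple


def max_same_token_run(ids: List[int]) -> Tuple[int, int]:
    """Return (run_length, token_id) for the longest run of one repeated token."""
    best_len, best_tok = 0, -1
    for tok, group in groupby(ids):
        run = sum(1 for _ in group)
        if run > best_len:
            best_len, best_tok = run, tok
    return best_len, best_tok


def has_ngram_loop(ids: List[int], n: int = 3, min_repeats: int = 4) -> bool:
    """True if some n-gram occurs min_repeats times back to back, at any offset."""
    for start in range(len(ids) - n * min_repeats + 1):
        gram = ids[start:start + n]
        if all(ids[start + k * n:start + (k + 1) * n] == gram
               for k in range(1, min_repeats)):
            return True
    return False


single_token_loop = [39992, 25, 7379] + [41146] * 40
late_ngram_loop = list(range(20)) + [9, 8, 7] * 5

run_stats = max_same_token_run(single_token_loop)
loop_found = has_ngram_loop(late_ngram_loop)
```

Scanning every start offset, rather than only the head of the completion, is what lets the second input be caught even though its loop begins after 20 unrelated tokens.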

tests/test_health_monitor.py ADDED

	@@ -0,0 +1,72 @@

+"""Tests for TrainingHealthMonitor alerts and correlation."""
+from opsd_utils.health_monitor import (
+    ALERT_GEN_CLIP_COLLAPSE,
+    ALERT_GEN_REPEAT_DEGEN,
+    ALERT_RL_ZERO_SIGNAL,
+    TrainingHealthMonitor,
+)
+def test_clip_collapse_alert():
+    hm = TrainingHealthMonitor({"enabled": True, "log_alerts_immediately": False})
+    hm.reset_step(1)
+    alerts = hm.record_generate(
+        1,
+        {"clipped_rate": 0.85, "eos_terminated_rate": 0.1, "degenerate_rate": 0.2, "repeat_loop_count": 0},
+        {"p_greedy_first": 0.99, "p_eos_first": 1e-6},
+    )
+    assert ALERT_GEN_CLIP_COLLAPSE in alerts
+def test_repeat_degen_alert():
+    hm = TrainingHealthMonitor({"enabled": True, "log_alerts_immediately": False})
+    hm.reset_step(2)
+    alerts = hm.record_generate(
+        2,
+        {"clipped_rate": 0.3, "eos_terminated_rate": 0.5, "degenerate_rate": 0.6, "repeat_loop_count": 1},
+        {},
+    )
+    assert ALERT_GEN_REPEAT_DEGEN in alerts
+def test_rl_zero_signal_alert():
+    hm = TrainingHealthMonitor({"enabled": True, "log_alerts_immediately": False})
+    hm.reset_step(3)
+    hm.record_loss(3, {"advantages_abs_mean": 0.0, "grpo_zero_loss_rate": 0.95})
+    assert ALERT_RL_ZERO_SIGNAL in hm._step_alerts
+def test_correlate_hints_after_history():
+    hm = TrainingHealthMonitor({"enabled": True, "window": 5, "log_every_step": False})
+    hm.reset_step(0)
+    hm.record_generate(
+        0,
+        {"clipped_rate": 0.1, "eos_terminated_rate": 0.9, "degenerate_rate": 0.1, "repeat_loop_count": 0},
+        {"p_greedy_first": 0.8, "p_eos_first": 0.01},
+    )
+    hm.record_optimizer(0, 0.5, 8e-5)
+    hm.finish_step(0)
+    hm.reset_step(1)
+    hm.record_generate(
+        1,
+        {"clipped_rate": 0.9, "eos_terminated_rate": 0.05, "degenerate_rate": 0.5, "repeat_loop_count": 1},
+        {"p_greedy_first": 0.995, "p_eos_first": 1e-6},
+    )
+    hm.record_optimizer(1, 2.5, 8e-5)
+    corr = hm.correlate()
+    assert "delta_clipped_rate" in corr or "root_cause_hints" in corr
+def test_finish_step_returns_metrics_keys():
+    hm = TrainingHealthMonitor({"enabled": True, "metrics_every_step": True, "log_every_step": False})
+    hm.reset_step(1)
+    hm.record_generate(
+        1,
+        {"clipped_rate": 0.2, "eos_terminated_rate": 0.8, "degenerate_rate": 0.1, "repeat_loop_count": 0},
+        {"p_greedy_first": 0.9, "p_eos_first": 0.001},
+    )
+    hm.record_optimizer(1, 1.0, 8e-5)
+    metrics = hm.finish_step(1)
+    assert "completions/degenerate_rate" in metrics
+    assert "health/alert_count" in metrics

tests/test_mode_router_rlsd.py ADDED

	@@ -0,0 +1,97 @@

+"""RLSD / COPSD anti-leakage routing tests."""
+import os
+import sys
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+import torch
+from opsd_utils.constants import MODE_GRPO, MODE_OPSD, MODE_SFT
+from opsd_utils.mode_router import route_completion_modes, route_prompt_modes
+def _rlsd_cfg(**gate):
+    base = {
+        "enabled": True,
+        "mode": "rlsd",
+        "gate": {
+            "correct_threshold": 0.5,
+            "per_completion_opsd": True,
+            "require_format_for_opsd": False,
+            **gate,
+        },
+    }
+    return base
+def test_rlsd_prompt_correct_grpo():
+    acc = torch.tensor([[1.0, 0.0]])
+    modes = route_prompt_modes(acc, 2, _rlsd_cfg(), recoverable_flags=[True])
+    assert modes == [MODE_GRPO]
+def test_rlsd_prompt_wrong_opsd_when_recoverable():
+    acc = torch.tensor([[0.0, 0.0]])
+    modes = route_prompt_modes(acc, 2, _rlsd_cfg(), recoverable_flags=[True])
+    assert modes == [MODE_OPSD]
+def test_rlsd_prompt_wrong_sft_when_not_recoverable():
+    acc = torch.tensor([[0.0, 0.0]])
+    modes = route_prompt_modes(acc, 2, _rlsd_cfg(), recoverable_flags=[False])
+    assert modes == [MODE_SFT]
+def test_rlsd_per_completion_routing():
+    acc = torch.tensor([[1.0, 0.0]])
+    fmt = torch.tensor([[1.0, 0.5]])
+    modes = route_completion_modes(acc, 2, 2, _rlsd_cfg(), [True], format_rewards=fmt)
+    assert modes == [MODE_GRPO, MODE_OPSD]
+def test_copsd_opd_alias_matches_rlsd():
+    acc = torch.tensor([[0.0, 1.0]])
+    cfg = _rlsd_cfg()
+    cfg["mode"] = "copsd_opd"
+    modes = route_completion_modes(acc, 2, 2, cfg, [True])
+    assert modes == [MODE_OPSD, MODE_GRPO]
+def test_rlsd_all_wrong_group_first_completion_sft_cold_start():
+    """All-wrong group: gen 0 → SFT replace, other wrong gens → OPSD."""
+    acc = torch.tensor([[0.0, 0.0, 0.0, 0.0]])
+    modes = route_completion_modes(acc, 4, 4, _rlsd_cfg(), [True])
+    assert modes == [MODE_SFT, MODE_OPSD, MODE_OPSD, MODE_OPSD]
+def test_rlsd_all_wrong_two_prompts():
+    acc = torch.tensor([[0.0, 0.0], [0.0, 0.0]])
+    modes = route_completion_modes(acc, 2, 4, _rlsd_cfg(), [True, True])
+    assert modes == [MODE_SFT, MODE_OPSD, MODE_SFT, MODE_OPSD]
+def test_rlsd_partial_correct_no_cold_start_on_wrong():
+    acc = torch.tensor([[1.0, 0.0, 0.0]])
+    modes = route_completion_modes(acc, 3, 3, _rlsd_cfg(), [True])
+    assert modes == [MODE_GRPO, MODE_OPSD, MODE_OPSD]
+def test_online_sft_on_all_wrong_can_be_disabled():
+    acc = torch.tensor([[0.0, 0.0]])
+    cfg = _rlsd_cfg(online_sft_on_all_wrong=False)
+    modes = route_completion_modes(acc, 2, 2, cfg, [True])
+    assert modes == [MODE_OPSD, MODE_OPSD]
+if __name__ == "__main__":
+    test_rlsd_prompt_correct_grpo()
+    test_rlsd_prompt_wrong_opsd_when_recoverable()
+    test_rlsd_prompt_wrong_sft_when_not_recoverable()
+    test_rlsd_per_completion_routing()
+    test_copsd_opd_alias_matches_rlsd()
+    test_rlsd_all_wrong_group_first_completion_sft_cold_start()
+    test_rlsd_all_wrong_two_prompts()
+    test_rlsd_partial_correct_no_cold_start_on_wrong()
+    test_online_sft_on_all_wrong_can_be_disabled()
+    print("RLSD routing tests passed.")
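
Read together, the routing tests imply a small decision table per completion. The sketch below is one reading of that table, not the implementation in `opsd_utils.mode_router` (which is not shown). `route_completions` is a hypothetical function over nested lists instead of tensors, the mode strings are placeholders for the real constants, and two points are assumptions: that "correct" means `acc >= correct_threshold`, and that a wrong, non-recoverable completion goes to SFT as in the prompt-level test.

```python
from typing import Dict, List

# Placeholder values; the real constants live in opsd_utils.constants.
MODE_GRPO, MODE_OPSD, MODE_SFT = "grpo", "opsd", "sft"


def route_completions(acc: List[List[float]],
                      cfg: Dict,
                      recoverable: List[bool]) -> List[str]:
    """Hypothetical sketch: one mode per completion, flattened prompt by prompt."""
    gate = cfg.get("gate", {})
    threshold = gate.get("correct_threshold", 0.5)
    cold_start = gate.get("online_sft_on_all_wrong", True)

    modes: List[str] = []
    for row, can_recover in zip(acc, recoverable):
        correct = [a >= threshold for a in row]
        all_wrong = not any(correct)
        for idx, ok in enumerate(correct):
            if ok:
                modes.append(MODE_GRPO)      # correct completions keep the RL signal
            elif not can_recover:
                modes.append(MODE_SFT)       # assumption: mirrors the prompt-level test
            elif all_wrong and cold_start and idx == 0:
                modes.append(MODE_SFT)       # cold start for an all-wrong group
            else:
                modes.append(MODE_OPSD)      # wrong but recoverable: distill
    return modes


cfg = {"gate": {"correct_threshold": 0.5}}
mixed = route_completions([[1.0, 0.0, 0.0]], cfg, [True])
all_wrong = route_completions([[0.0, 0.0], [0.0, 0.0]], cfg, [True, True])
no_cold_start = route_completions(
    [[0.0, 0.0]],
    {"gate": {"correct_threshold": 0.5, "online_sft_on_all_wrong": False}},
    [True],
)
```

Reserving only the first completion of an all-wrong group for SFT keeps some supervised signal in the batch while leaving the remaining completions to distillation.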

tests/test_privileged.py ADDED

	@@ -0,0 +1,172 @@

+import json
+import os
+import sys
+import tempfile
+import pytest
+from PIL import Image
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from data_utils.privileged_schema import (
+    heuristic_bbox_from_visual_fact,
+    normalize_evidence_bbox,
+    parse_visual_fact,
+    resolve_crop_bbox,
+)
+from opsd_utils import debug_log as opsd_debug
+from opsd_utils.privileged import build_privileged_context, maybe_save_privileged_images
+from opsd_utils.privileged.image_utils import crop_image, load_rgb, resolve_teacher_images
+from opsd_utils.privileged.profiles import effective_profile
+def _make_image(path: str, size=(100, 100), color=(255, 0, 0)):
+    img = Image.new("RGB", size, color)
+    img.save(path)
+    return path
+def test_text_provider():
+    sample = {"hint": "Rep=67", "answer": "Answer: 131"}
+    suffix, images = build_privileged_context(sample, ["text"], privileged_profile="text")
+    assert "Rep=67" in suffix
+    assert "131" in suffix
+    assert images == []
+def test_hybrid_provider_suffix():
+    img = Image.new("RGB", (32, 32))
+    sample = {"hint": "step", "visual_fact": "bar=3", "answer": "Answer: 3", "image": img}
+    suffix, images = build_privileged_context(
+        sample,
+        privileged_profile="hybrid",
+        opsd_config={"privileged_image": {"mode": "dual"}},
+    )
+    assert "Visual Facts" in suffix
+    assert "Reference" in suffix
+    assert len(images) == 2
+def test_hybrid_default_single_image_for_chartqa():
+    img = Image.new("RGB", (32, 32))
+    sample = {"hint": "step", "visual_fact": "bar=3", "answer": "Answer: 3", "image": img}
+    suffix, images = build_privileged_context(sample, privileged_profile="hybrid")
+    assert "Visual Facts" in suffix
+    assert "Reference" in suffix
+    assert len(images) == 1
+def test_visual_profile_excludes_answer():
+    img = Image.new("RGB", (32, 32))
+    sample = {"hint": "secret", "visual_fact": '{"objects":[]}', "answer": "Answer: 3", "image": img}
+    suffix, _ = build_privileged_context(sample, privileged_profile="visual")
+    assert "Visual Facts" in suffix
+    assert "Reference Answer" not in suffix
+def test_math_lm_downgrade():
+    sample = {"hint": "step", "answer": "Answer: 1"}
+    profile = effective_profile(sample, "hybrid")
+    assert profile == "text"
+def test_normalize_evidence_bbox_c2():
+    assert normalize_evidence_bbox([0.1, 0.2, 0.8, 0.9]) == [0.1, 0.2, 0.8, 0.9]
+    assert normalize_evidence_bbox([0.1, 0.2, 1.5, 0.9]) is None
+def test_heuristic_bbox_d2():
+    vf = json.dumps({"objects": [{"name": "cat", "position": "center"}]})
+    bbox = heuristic_bbox_from_visual_fact(vf)
+    assert bbox == [0.25, 0.25, 0.75, 0.75]
+def test_crop_image_normalized_bbox():
+    img = Image.new("RGB", (100, 100), (0, 255, 0))
+    crop, strategy = crop_image(img, bbox_norm=[0.2, 0.2, 0.8, 0.8], strategy="bbox")
+    assert strategy == "bbox"
+    assert crop.size[0] > 0
+def test_resolve_teacher_images_dual():
+    img = Image.new("RGB", (80, 80), (0, 0, 255))
+    sample = {
+        "image": img,
+        "visual_fact": json.dumps({"objects": [{"position": "top"}]}),
+    }
+    images, meta = resolve_teacher_images(sample, "hybrid", crop_cfg={"mode": "dual"})
+    assert len(images) == 2
+    assert meta["num_teacher_images"] == 2
+    assert meta["crop_strategy"] in ("heuristic", "center", "center_fallback", "bbox")
+def test_chartqa_enriched_visual_fact_hint():
+    """Enriched ChartQA records (F1+F2) should activate VisualFactsProvider."""
+    from data_utils.chart.deplot_pipeline import build_deplot_visual_fact
+    sample = {
+        "hint": "Goal: Find the lowest value.\nObservation: values are 70, 72, 77.",
+        "answer": "Answer: 70",
+        "visual_fact_hint": "Goal: Find the lowest value.\nObservation: values are 70, 72, 77.",
+        "visual_fact": "Goal: Find the lowest value.\nObservation: values are 70, 72, 77.",
+        "visual_fact_deplot": build_deplot_visual_fact(
+            {"question": "q"}, "Year | Value\n2019 | 70\n2020 | 72"
+        ),
+        "image": Image.new("RGB", (64, 64)),
+    }
+    suffix, images = build_privileged_context(
+        sample,
+        ["text", "visual_facts"],
+        privileged_profile="hybrid",
+    )
+    assert "Visual Facts - Hint" in suffix
+    assert "Visual Facts - DePlot" in suffix
+    assert "2019 | 70" in suffix
+    assert "Reference Reasoning" in suffix
+    assert len(images) == 1
+    vf_raw = sample.get("visual_fact") or sample.get("visual_facts")
+    assert vf_raw and len(vf_raw.strip()) > 0
+def test_visual_facts_f1_f2_merge():
+    from data_utils.chart.deplot_pipeline import build_deplot_visual_fact
+    sample = {
+        "visual_fact_hint": "hint table",
+        "visual_fact_deplot": build_deplot_visual_fact(
+            {"question": "q"}, "Col | Val\nA | 1"
+        ),
+        "image": Image.new("RGB", (32, 32)),
+    }
+    suffix, _ = build_privileged_context(sample, privileged_profile="hybrid")
+    assert "Visual Facts - Hint" in suffix
+    assert "Visual Facts - DePlot" in suffix
+    assert "Col | Val" in suffix
+def test_parse_visual_fact_b1():
+    raw = {"objects": [{"name": "a"}]}
+    text = parse_visual_fact(raw)
+    assert "objects" in text
+def test_debug_artifacts_respect_detail_every():
+    opsd_debug.configure(enabled=True, detail_every=10, rank=0, world_size=1)
+    with tempfile.TemporaryDirectory() as tmp:
+        img = Image.new("RGB", (32, 32))
+        path = maybe_save_privileged_images(5, 0, img, img, meta={"crop_strategy": "center"}, output_dir=tmp)
+        assert path is None
+        assert not os.path.exists(os.path.join(tmp, "logs", "images"))
+        path = maybe_save_privileged_images(10, 0, img, img, meta={"crop_strategy": "center"}, output_dir=tmp)
+        assert path is not None
+        assert os.path.exists(f"{path}_full.png")
+        assert os.path.exists(f"{path}_meta.json")
+if __name__ == "__main__":
+    test_text_provider()
+    test_hybrid_provider_suffix()
+    test_math_lm_downgrade()
+    test_debug_artifacts_respect_detail_every()
+    print("Privileged provider tests passed.")
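The bbox tests in this file imply a contract for normalized evidence boxes: coordinates are fractions of the image size in [0, 1], and an out-of-range box is rejected with `None`. A minimal sketch of that contract follows. The helper names `normalize_bbox_sketch` and `bbox_to_pixels` are hypothetical and are not part of `data_utils.privileged_schema`; the real validation rules may be stricter or looser.

```python
from typing import Optional, Sequence


def normalize_bbox_sketch(bbox: Sequence[float]) -> Optional[list]:
    """Return [x0, y0, x1, y1] as floats, or None if the box is invalid.

    Hypothetical stand-in for normalize_evidence_bbox; the real rules may differ.
    """
    if len(bbox) != 4:
        return None
    x0, y0, x1, y1 = (float(v) for v in bbox)
    # Every coordinate must be a fraction of the image size.
    if not all(0.0 <= v <= 1.0 for v in (x0, y0, x1, y1)):
        return None
    # The box must have positive area.
    if x0 >= x1 or y0 >= y1:
        return None
    return [x0, y0, x1, y1]


def bbox_to_pixels(bbox, width, height):
    """Scale a normalized box to integer pixels as (left, upper, right, lower)."""
    x0, y0, x1, y1 = bbox
    return (round(x0 * width), round(y0 * height), round(x1 * width), round(y1 * height))
```

The tuple order returned by `bbox_to_pixels` matches the box order that `PIL.Image.Image.crop` expects, so the result could be passed to it directly.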

tests/test_privileged_debug_artifacts.py ADDED

	@@ -0,0 +1,40 @@

+import json
+import os
+import sys
+import tempfile
+from PIL import Image
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from opsd_utils import debug_log as opsd_debug
+from opsd_utils.privileged.debug_artifacts import maybe_save_privileged_images
+def test_max_samples_per_detail():
+    opsd_debug.configure(enabled=True, detail_every=1, rank=0, world_size=1)
+    cfg = {"save_images": True, "image_subdir": "logs/images", "max_samples_per_detail": 1}
+    img = Image.new("RGB", (16, 16))
+    with tempfile.TemporaryDirectory() as tmp:
+        p0 = maybe_save_privileged_images(1, 0, img, None, meta={}, output_dir=tmp, privileged_debug_cfg=cfg)
+        p1 = maybe_save_privileged_images(1, 1, img, None, meta={}, output_dir=tmp, privileged_debug_cfg=cfg)
+        assert p0 is not None
+        assert p1 is None
+def test_meta_sidecar():
+    opsd_debug.configure(enabled=True, detail_every=1, rank=0, world_size=1)
+    img = Image.new("RGB", (16, 16))
+    with tempfile.TemporaryDirectory() as tmp:
+        prefix = maybe_save_privileged_images(
+            1,
+            0,
+            img,
+            img,
+            meta={"privileged_profile": "hybrid", "crop_strategy": "bbox"},
+            output_dir=tmp,
+        )
+        with open(f"{prefix}_meta.json", encoding="utf-8") as f:
+            meta = json.load(f)
+        assert meta["privileged_profile"] == "hybrid"
+        assert meta["crop_strategy"] == "bbox"
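The debug-artifact tests suggest two gates on image dumps: a step must fall on the `detail_every` cadence, and only a limited number of samples are written per detail step. The sketch below shows one way such a gate could be written. `should_save_debug_images` is a hypothetical name, and gating on `sample_idx` (rather than on a running counter) is an assumption; the actual logic in `opsd_utils.privileged.debug_artifacts` is not shown in this diff.

```python
def should_save_debug_images(step, sample_idx, detail_every, max_samples_per_detail=1):
    """Decide whether to dump debug images for this (step, sample) pair.

    Assumed policy: save only on steps that are multiples of detail_every,
    and only for the first max_samples_per_detail samples of that step.
    """
    if detail_every <= 0:
        return False  # treat a non-positive cadence as "never save"
    on_detail_step = step % detail_every == 0
    under_cap = sample_idx < max_samples_per_detail
    return on_detail_step and under_cap
```

Under this policy, step 5 with `detail_every=10` is skipped and step 10 is saved, which mirrors the behaviour the tests assert.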

tests/test_slice_completion_logits.py ADDED

	@@ -0,0 +1,18 @@

+"""Tests for completion logit slicing shared by GRPO and OPSD."""
+import os
+import sys
+import torch
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from opsd_utils.opsd_loss import slice_student_completion_logits
+def test_slice_matches_legacy_grpo_path():
+    logits_to_keep = 4
+    full = torch.randn(2, 20, 8)
+    legacy = full[:, -logits_to_keep - 1 :, :]
+    legacy = legacy[:, :-1, :]
+    legacy = legacy[:, -logits_to_keep:, :]
+    assert torch.allclose(legacy, slice_student_completion_logits(full, logits_to_keep))
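The three-step legacy slice in this test collapses to a single slice. Logits at position `t` score the token at `t + 1`, so the last `logits_to_keep` completion tokens are scored by positions `L - logits_to_keep - 1` through `L - 2`. The sketch below checks that equivalence on plain position indices instead of tensors. Both function names are hypothetical, and it assumes `0 < logits_to_keep < seq_len`.

```python
def completion_logit_positions(seq_len: int, logits_to_keep: int) -> list:
    """Positions whose logits score the last `logits_to_keep` tokens (single slice)."""
    positions = list(range(seq_len))
    return positions[-logits_to_keep - 1 : -1]


def legacy_positions(seq_len: int, logits_to_keep: int) -> list:
    """The same selection, written as the three slices used in the test above."""
    positions = list(range(seq_len))
    legacy = positions[-logits_to_keep - 1 :]  # keep one extra position at the front
    legacy = legacy[:-1]                       # drop the last position (it predicts nothing)
    return legacy[-logits_to_keep:]            # keep exactly logits_to_keep positions
```

On a tensor of shape `(B, L, V)` the single-slice form would be `full[:, -logits_to_keep - 1 : -1, :]`. Whether `slice_student_completion_logits` is written that way is not visible here.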

tests/test_teacher_dual_image.py ADDED

	@@ -0,0 +1,43 @@

+import os
+import sys
+import pytest
+from PIL import Image
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from opsd_utils.privileged.image_utils import resolve_teacher_images
+from opsd_utils.privileged.profiles import effective_profile
+def test_text_profile_single_image():
+    img = Image.new("RGB", (64, 64))
+    sample = {"image": img, "hint": "h", "answer": "Answer: 1"}
+    images, meta = resolve_teacher_images(sample, "text")
+    assert len(images) == 1
+    assert meta["num_teacher_images"] == 1
+def test_hybrid_profile_dual_image():
+    img = Image.new("RGB", (64, 64))
+    sample = {"image": img, "evidence_bbox": [0.1, 0.1, 0.9, 0.9]}
+    images, meta = resolve_teacher_images(sample, "hybrid", crop_cfg={"mode": "dual"})
+    assert len(images) == 2
+    assert meta["has_bbox"] is True
+def test_hybrid_profile_single_image_by_default():
+    img = Image.new("RGB", (64, 64))
+    sample = {"image": img, "evidence_bbox": [0.1, 0.1, 0.9, 0.9]}
+    images, meta = resolve_teacher_images(sample, "hybrid")
+    assert len(images) == 1
+    assert meta["num_teacher_images"] == 1
+    assert meta["crop_strategy"] == "single_full"
+def test_no_image_empty():
+    sample = {"hint": "only text"}
+    assert effective_profile(sample, "hybrid") == "text"
+    images, meta = resolve_teacher_images(sample, "text")
+    assert images == []
+    assert meta["num_teacher_images"] == 0

tests/test_vocab_align.py ADDED

	@@ -0,0 +1,71 @@

+"""Tests for cross-model vocab alignment diagnostics."""
+import os
+import sys
+import torch
+import torch.nn.functional as F
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from opsd_utils.opsd_loss import generalized_jsd_loss
+from opsd_utils.vocab_align import (
+    align_cross_model_logits,
+    reset_vocab_align_debug,
+    verify_shared_tokenizer_alignment,
+)
+class _Tok:
+    def __init__(self, vocab_size: int, offset: int = 0):
+        self._size = vocab_size
+        self._offset = offset
+    def __len__(self):
+        return self._size
+    def decode(self, ids, skip_special_tokens=False):
+        i = ids[0]
+        return f"tok_{i + self._offset}"
+    def convert_ids_to_tokens(self, i):
+        return f"tok_{i + self._offset}"
+def test_align_slice_renorm_via_log_softmax():
+    reset_vocab_align_debug()
+    student = torch.randn(1, 3, 100, requires_grad=True)
+    teacher = torch.randn(1, 3, 128)
+    s, t = align_cross_model_logits(student, teacher, log_renorm_check=False)
+    assert s.shape[-1] == t.shape[-1] == 100
+    t_probs = F.softmax(t[0, 0], dim=-1)
+    assert abs(float(t_probs.sum()) - 1.0) < 1e-4
+def test_generalized_jsd_renormalizes_after_slice():
+    reset_vocab_align_debug()
+    student = torch.randn(1, 5, 152000, requires_grad=True)
+    teacher = torch.randn(1, 5, 152128)
+    mask = torch.ones(1, 5)
+    loss = generalized_jsd_loss(student, teacher, mask)
+    assert loss.ndim == 0
+    assert loss.requires_grad
+def test_tokenizer_alignment_detects_mismatch():
+    st = _Tok(200, offset=0)
+    tt = _Tok(200, offset=1)
+    report = verify_shared_tokenizer_alignment(
+        st, tt, shared_vocab=200, full_scan=True, sample_stride=1
+    )
+    assert not report["aligned"]
+    assert report["mismatch_count"] > 0
+def test_tokenizer_alignment_passes_identical():
+    st = _Tok(1000, offset=0)
+    tt = _Tok(1200, offset=0)
+    report = verify_shared_tokenizer_alignment(
+        st, tt, shared_vocab=1000, full_scan=False, sample_stride=100
+    )
+    assert report["aligned"]
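The alignment tests rely on a simple idea: when the teacher's logit vector is wider than the student's, both are cut to the shared width and then renormalized so that each is a proper distribution again. A minimal sketch on plain lists follows. `align_and_renormalize` is a hypothetical name, and it assumes the shared vocabulary occupies the leading indices of both models, which is what the tokenizer alignment check in this file is meant to verify.

```python
import math


def _log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def align_and_renormalize(student_logits, teacher_logits):
    """Truncate both vectors to the shared width, then renormalize each.

    Assumes token id i means the same token in both vocabularies for
    every i below the shared width.
    """
    shared = min(len(student_logits), len(teacher_logits))
    return _log_softmax(student_logits[:shared]), _log_softmax(teacher_logits[:shared])
```

Renormalizing after the cut matters because probability mass the teacher placed on the dropped entries would otherwise leave its distribution summing to less than one.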

trainer/DyMETrainer_7B.py ADDED

	@@ -0,0 +1,983 @@

+import itertools
+import os
+import textwrap
+import warnings
+from collections import defaultdict, deque
+from collections.abc import Sized
+from contextlib import nullcontext
+from typing import Any, Callable, Optional, Union
+from torch.nn.utils.rnn import pad_sequence
+import datasets
+import torch
+import torch.utils.data
+import transformers
+from accelerate.utils import broadcast_object_list, gather, gather_object, is_peft_model, set_seed
+from datasets import Dataset, IterableDataset
+from packaging import version
+from torch import nn
+from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
+from torch.utils.data import DataLoader, Sampler, DistributedSampler
+from transformers import (
+    AutoModelForCausalLM,
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    GenerationConfig,
+    PreTrainedModel,
+    PreTrainedTokenizerBase,
+    Trainer,
+    TrainerCallback,
+    is_wandb_available,
+)
+from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled
+from transformers.trainer_utils import seed_worker
+from transformers.utils import is_datasets_available, is_peft_available
+from trl.data_utils import apply_chat_template, is_conversational, maybe_apply_chat_template
+from trl.extras.profiling import profiling_context, profiling_decorator
+from trl.import_utils import is_liger_kernel_available, is_vllm_available
+from trl.models import create_reference_model, prepare_deepspeed, prepare_fsdp, unwrap_model_for_generation
+# from trl.models.utils import _ForwardRedirection
+from trl.trainer.callbacks import SyncRefModelCallback
+from trl.trainer.grpo_config import GRPOConfig
+from trl.trainer.utils import (
+    disable_dropout_in_model,
+    generate_model_card,
+    get_comet_experiment_url,
+    pad,
+    print_prompt_completions_sample,
+    selective_log_softmax,
+)
+import concurrent.futures
+from reward_utils import checker
+from reward_utils.checker import RewardCalculator
+from reward_utils.compute_rewards import calculate_rewards_in_parallel, refine_context_in_parallel
+if is_wandb_available():
+    import wandb
+# What we call a reward function is a callable that takes a list of prompts and completions and returns a list of
+# rewards. When it's a string, it's a model ID, so it's loaded as a pretrained model.
+RewardFunc = Union[str, PreTrainedModel, Callable[[list, list], list[float]]]
+class RepeatSampler(Sampler):
+    """
+    Sampler that repeats the indices of a dataset in a structured manner.
+    Args:
+        data_source (`Sized`):
+            Dataset to sample from.
+        mini_repeat_count (`int`):
+            Number of times to repeat each index per batch.
+        batch_size (`int`, *optional*, defaults to `1`):
+            Number of unique indices per batch.
+        repeat_count (`int`, *optional*, defaults to `1`):
+            Number of times to repeat the full sampling process.
+        shuffle (`bool`, *optional*, defaults to `True`):
+            Whether to shuffle the dataset.
+        seed (`int` or `None`, *optional*, defaults to `None`):
+            Random seed for reproducibility (only affects this sampler).
+    Example:
+    ```python
+    >>> sampler = RepeatSampler(["a", "b", "c", "d", "e", "f", "g"], mini_repeat_count=2, batch_size=3, repeat_count=4)
+    >>> list(sampler)
+    [4, 4, 3, 3, 0, 0,
+     4, 4, 3, 3, 0, 0,
+     4, 4, 3, 3, 0, 0,
+     4, 4, 3, 3, 0, 0,
+     1, 1, 2, 2, 6, 6,
+     1, 1, 2, 2, 6, 6,
+     1, 1, 2, 2, 6, 6,
+     1, 1, 2, 2, 6, 6]
+    ```
+    ```txt
+    mini_repeat_count = 3
+          -   -   -
+         [0,  0,  0,  1,  1,  1,  2,  2,  2,  3,  3,  3,      |
+          4,  4,  4,  5,  5,  5,  6,  6,  6,  7,  7,  7,      |
+          8,  8,  8,  9,  9,  9, 10, 10, 10, 11, 11, 11,      |
+                                                                repeat_count = 2
+          0,  0,  0,  1,  1,  1,  2,  2,  2,  3,  3,  3,      |
+          4,  4,  4,  5,  5,  5,  6,  6,  6,  7,  7,  7,      |
+          8,  8,  8,  9,  9,  9, 10, 10, 10, 11, 11, 11, ...] |
+          ---------   ---------   ---------   ---------
+           ---------   ---------   ---------   ---------
+            ---------   ---------   ---------   ---------
+                         batch_size = 12
+    ```
+    """
+    def __init__(
+        self,
+        data_source: Sized,
+        mini_repeat_count: int,
+        batch_size: int = 1,
+        repeat_count: int = 1,
+        shuffle: bool = True,
+        seed: Optional[int] = None,
+    ):
+        self.data_source = data_source
+        self.mini_repeat_count = mini_repeat_count
+        self.batch_size = batch_size
+        self.repeat_count = repeat_count
+        self.num_samples = len(data_source)
+        self.shuffle = shuffle
+        self.seed = seed
+        if shuffle:
+            self.generator = torch.Generator()  # Create a local random generator
+            if seed is not None:
+                self.generator.manual_seed(seed)
+    def __iter__(self):
+        if self.shuffle:
+            # E.g., [2, 4, 3, 1, 0, 6, 5] (num_samples = 7)
+            indexes = torch.randperm(self.num_samples, generator=self.generator).tolist()
+        else:
+            indexes = list(range(self.num_samples))
+        #    [2, 4, 3, 1, 0, 6, 5]
+        # -> [[2, 4, 3], [1, 0, 6], [5]]  (batch_size = 3)
+        indexes = [indexes[i : i + self.batch_size] for i in range(0, len(indexes), self.batch_size)]
+        #    [[2, 4, 3], [1, 0, 6], [5]]
+        # -> [[2, 4, 3], [1, 0, 6]]
+        indexes = [chunk for chunk in indexes if len(chunk) == self.batch_size]
+        for chunk in indexes:
+            for _ in range(self.repeat_count):
+                for index in chunk:
+                    for _ in range(self.mini_repeat_count):
+                        yield index
+    def __len__(self) -> int:
+        return self.num_samples * self.mini_repeat_count * self.repeat_count
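The ordering that `RepeatSampler.__iter__` produces can be reproduced without torch when shuffling is off. The sketch below is a plain-Python restatement of the same four nested loops; `repeat_indices` is a hypothetical helper, not part of the trainer.

```python
def repeat_indices(num_samples, mini_repeat_count, batch_size=1, repeat_count=1):
    """Unshuffled index order, following the loops in RepeatSampler.__iter__."""
    indexes = list(range(num_samples))
    # Split into chunks of batch_size unique indices.
    chunks = [indexes[i : i + batch_size] for i in range(0, num_samples, batch_size)]
    # Drop a trailing chunk that is smaller than batch_size.
    chunks = [chunk for chunk in chunks if len(chunk) == batch_size]
    order = []
    for chunk in chunks:
        for _ in range(repeat_count):  # replay the whole chunk
            for index in chunk:
                order.extend([index] * mini_repeat_count)  # repeat each index in place
    return order
```

With 7 samples, `batch_size=3`, `mini_repeat_count=2` and `repeat_count=2`, index 6 never appears because its chunk is incomplete. The iterator therefore yields 24 indices, while `__len__` as written computes 7 * 2 * 2 = 28.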
+# torch.nanstd doesn't exist, so we define it here
+def nanstd(tensor: torch.Tensor) -> torch.Tensor:
+    """
+    Compute the standard deviation of a tensor, ignoring NaNs. This function only supports 1D tensors.
+    Args:
+        tensor (`torch.Tensor`):
+            Input tensor of shape `(N,)`.
+    Returns:
+        `torch.Tensor`:
+            Standard deviation of the tensor, ignoring NaNs.
+    """
+    variance = torch.nanmean((tensor - torch.nanmean(tensor, keepdim=True)) ** 2)  # Compute variance ignoring NaNs
+    count = torch.sum(~torch.isnan(tensor))  # Count of non-NaN values
+    variance *= count / (count - 1)  # Bessel's correction
+    return torch.sqrt(variance)
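The arithmetic in `nanstd` can be checked by hand: take the population variance of the non-NaN values, then multiply by `count / (count - 1)` to get the sample variance. The sketch below does the same on a plain list. `nanstd_sketch` is a hypothetical name, and returning NaN when fewer than two values remain is a choice made for this sketch.

```python
import math


def nanstd_sketch(values):
    """Sample standard deviation of a list of floats, ignoring NaNs."""
    kept = [v for v in values if not math.isnan(v)]
    count = len(kept)
    if count < 2:
        return float("nan")  # Bessel's correction is undefined for fewer than 2 values
    mean = sum(kept) / count
    variance = sum((v - mean) ** 2 for v in kept) / count  # population variance
    variance *= count / (count - 1)                        # Bessel's correction
    return math.sqrt(variance)
```

For `[1.0, 2.0, 3.0, nan]` the kept values have mean 2, population variance 2/3, and corrected variance 1, so the result is 1 up to floating-point rounding.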
+def split_tensor_dict(
+    tensor_dict: dict[str, Optional[torch.Tensor]], num_chunks: int, image_patch_id=151655, patch_id_times=4
+) -> list[dict[str, Optional[torch.Tensor]]]:
+    """
+    Splits a dictionary of tensors along the first dimension into `num_chunks` equal parts.
+    Example:
+        >>> x = torch.arange(12).reshape(6, 2)
+        >>> y = torch.arange(6).reshape(6, 1)
+        >>> tensor_dict = {"x": x, "y": y}
+        >>> split_tensor_dict(tensor_dict, 3)
+        [
+            {"x": tensor([[0, 1], [2, 3]]), "y": tensor([[0], [1]])},
+            {"x": tensor([[4, 5], [6, 7]]), "y": tensor([[2], [3]])},
+            {"x": tensor([[ 8,  9], [10, 11]]), "y": tensor([[4], [5]])}
+        ]
+    """
+    if image_patch_id is None:
+        first_tensor = next(tensor for tensor in tensor_dict.values() if tensor is not None)
+        chunk_size = first_tensor.shape[0] // num_chunks
+        # has = []
+        # if 'has_correct' in tensor_dict:
+        #     has = tensor_dict['has_correct']
+        #     del tensor_dict['has_correct']
+        l1 = []
+        for i in range(num_chunks):
+            dt = {
+                key: tensor[i * chunk_size : (i + 1) * chunk_size] if tensor is not None else None
+                for key, tensor in tensor_dict.items()
+            }
+            # if len(has) > 0:
+            #     dt['has_correct'] = has[i]
+            l1.append(dt)
+        return l1
+    else:
+        first_tensor = next(tensor for tensor in tensor_dict.values() if tensor is not None)
+        chunk_size = first_tensor.shape[0] // num_chunks
+        l1 = []
+        for i in range(num_chunks):
+            dt = {}
+            for key, tensor in tensor_dict.items():
+                if key != 'pixel_values':
+                    dt[key] = tensor[i * chunk_size : (i + 1) * chunk_size] if tensor is not None else None
+            l1.append(dt)
+        if 'pixel_values' in tensor_dict:
+            raw_pixel_values = tensor_dict['pixel_values']
+            start_image_patch = 0
+            for dt in l1:
+                batch_input_ids = dt['prompt_ids']
+                num_image_patches = (batch_input_ids == image_patch_id).sum().item() * patch_id_times
+                batch_pixel_values = raw_pixel_values[start_image_patch : start_image_patch + num_image_patches]
+                start_image_patch += num_image_patches
+                dt['pixel_values'] = batch_pixel_values
+        return l1
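The `pixel_values` branch of `split_tensor_dict` cannot slice by row count, because `pixel_values` is a flat stack of patches rather than one row per sample. Instead, each chunk takes as many patch rows as its prompts contain image-pad tokens, multiplied by `patch_id_times`. The sketch below restates that bookkeeping on plain lists. `split_pixel_rows` is a hypothetical helper; the defaults are copied from the function signature above.

```python
def split_pixel_rows(prompt_id_chunks, pixel_rows, image_patch_id=151655, patch_id_times=4):
    """Assign consecutive patch rows to chunks based on image-pad token counts.

    prompt_id_chunks: list of chunks, each a list of token-id rows.
    pixel_rows: flat list of patch rows for the whole batch, in prompt order.
    """
    per_chunk = []
    start = 0
    for chunk in prompt_id_chunks:
        # Count image-pad tokens across every prompt in this chunk.
        num_tokens = sum(row.count(image_patch_id) for row in chunk)
        num_rows = num_tokens * patch_id_times
        per_chunk.append(pixel_rows[start : start + num_rows])
        start += num_rows
    return per_chunk
```

This only lines up if the patch rows are stored in the same order as the prompts and the multiplier matches the vision encoder's merge factor; both are assumptions carried over from the original code.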
+def nanmin(tensor: torch.Tensor) -> torch.Tensor:
+    """
+    Compute the minimum value of a tensor, ignoring NaNs. This function only supports 1D tensors.
+    Args:
+        tensor (`torch.Tensor`): Input tensor of shape `(N,)`.
+    Returns:
+        `torch.Tensor`: Minimum value of the tensor, ignoring NaNs. Returns NaN if all values are NaN.
+    """
+    if torch.isnan(tensor).all():
+        return torch.tensor(float("nan"), dtype=tensor.dtype, device=tensor.device)
+    return torch.min(tensor[~torch.isnan(tensor)])
+def nanmax(tensor: torch.Tensor) -> torch.Tensor:
+    """
+    Compute the maximum value of a tensor, ignoring NaNs. This function only supports 1D tensors.
+    Args:
+        tensor (`torch.Tensor`): Input tensor of shape `(N,)`.
+    Returns:
+        `torch.Tensor`: Maximum value of the tensor, ignoring NaNs. Returns NaN if all values are NaN.
+    """
+    if torch.isnan(tensor).all():
+        return torch.tensor(float("nan"), dtype=tensor.dtype, device=tensor.device)
+    return torch.max(tensor[~torch.isnan(tensor)])
+class DyMETrainer(Trainer):
+    def __init__(
+        self,
+        model: PreTrainedModel,
+        checker = None,
+        refiner=None,
+        args: Optional[GRPOConfig] = None,
+        train_dataset: Optional[Union[Dataset, IterableDataset]] = None,
+        eval_dataset: Optional[Union[Dataset, IterableDataset, dict[str, Union[Dataset, IterableDataset]]]] = None,
+        processing_class: Optional[PreTrainedTokenizerBase] = None,
+        callbacks: Optional[list[TrainerCallback]] = None,
+        optimizers: tuple[Optional[torch.optim.Optimizer], Optional[torch.optim.lr_scheduler.LambdaLR]] = (None, None),
+        processing_func = None,
+        task_name: str = None,
+        end_flag: str = '<|im_end|>',
+    ):
+        self.task_name = task_name
+        self.reward_weights = torch.nn.Parameter(torch.ones(3), requires_grad=False)
+        self.reward_func_names = ['format', 'thinking', 'accuracy']
+        # Models
+        # Trained model
+        model_init_kwargs = args.model_init_kwargs or {}
+        # Enable gradient checkpointing if requested
+        if args.gradient_checkpointing:
+            model = self._enable_gradient_checkpointing(model, args)
+        # Processing class
+        if processing_class is None:
+            processing_class = AutoTokenizer.from_pretrained(model.config._name_or_path, padding_side="left")
+        # Training arguments
+        self.max_prompt_length = args.max_prompt_length
+        self.max_completion_length = args.max_completion_length  # = |o_i| in the GRPO paper
+        self.num_generations = args.num_generations  # = G in the GRPO paper
+        self.temperature = args.temperature
+        self.top_p = args.top_p
+        self.top_k = args.top_k
+        self.min_p = args.min_p
+        self.repetition_penalty = args.repetition_penalty
+        self.use_liger_loss = args.use_liger_loss
+        self.loss_type = args.loss_type
+        self.scale_rewards = args.scale_rewards
+        self.mask_truncated_completions = args.mask_truncated_completions
+        self.end_flag = end_flag
+        self.checker = checker
+        self.refiner = refiner
+        # Datasets
+        self.shuffle_dataset = args.shuffle_dataset
+        if (
+            isinstance(train_dataset, IterableDataset)
+            or isinstance(eval_dataset, IterableDataset)
+            or (
+                isinstance(eval_dataset, dict) and any(isinstance(ds, IterableDataset) for ds in eval_dataset.values())
+            )
+        ):
+            # See https://github.com/huggingface/trl/issues/3213
+            raise NotImplementedError(
+                "Iterable datasets are not yet supported in GRPOTrainer. Please use a standard dataset instead."
+            )
+        # Multi-step
+        self.num_iterations = args.num_iterations  # = 𝜇 in the GRPO paper
+        self.epsilon_low = args.epsilon
+        self.epsilon_high = args.epsilon_high if args.epsilon_high is not None else args.epsilon
+        # Tracks the number of iterations (forward + backward passes), including those within a grad accum cycle
+        self._step = 0
+        # Buffer the batch to reuse generated outputs across multiple updates. For more details, see
+        # `_get_train_sampler` and `_prepare_inputs`.
+        self._buffered_inputs = None
+        # The trainer estimates the number of FLOPs (floating-point operations) using the number of elements in the
+        # input tensor associated with the key "input_ids". However, in GRPO, the sampled data does not include the
+        # "input_ids" key. Instead, the available key is "prompt". As a result, the trainer issues the warning:
+        # "Could not estimate the number of tokens of the input, floating-point operations will not be computed." To
+        # suppress this warning, we set the "estimate_tokens" key in the model's "warnings_issued" dictionary to True.
+        # This acts as a flag to indicate that the warning has already been issued.
+        model.warnings_issued["estimate_tokens"] = True
+        def data_collator(features):  # No data collation is needed in GRPO
+            return features
+        super().__init__(
+            model=model,
+            args=args,
+            data_collator=data_collator,
+            train_dataset=train_dataset,
+            eval_dataset=eval_dataset,
+            processing_class=processing_class,
+            callbacks=callbacks,
+            optimizers=optimizers,
+        )
+        # Reference model
+        self.beta = args.beta
+        assert self.beta == 0
+        # Disable dropout in the models
+        if args.disable_dropout:
+            disable_dropout_in_model(model)
+        # Initialize the metrics
+        self._metrics = {"train": defaultdict(list), "eval": defaultdict(list)}
+        self._total_train_tokens = 0
+        self.log_completions = args.log_completions
+        self.wandb_log_unique_prompts = args.wandb_log_unique_prompts
+        self.num_completions_to_print = args.num_completions_to_print
+        # maxlen is set to the total number of forward passes per step. This value of `maxlen` ensures we log only the
+        # final optimization step.
+        maxlen = self.accelerator.num_processes * args.per_device_train_batch_size * args.gradient_accumulation_steps
+        self._textual_logs = {
+            "prompt": deque(maxlen=maxlen),
+            "completion": deque(maxlen=maxlen),
+            "rewards": defaultdict(lambda: deque(maxlen=maxlen)),
+        }
+        # Check if the effective batch size can be divided by the number of generations
+        if self.num_generations < 2:
+            raise ValueError(
+                "GRPO requires at least 2 generations per prompt to calculate the advantages. You provided "
+                f"{self.num_generations}, which is less than the minimum required."
+            )
+        num_processes = self.accelerator.num_processes
+        effective_batch_size = args.per_device_train_batch_size * num_processes * args.gradient_accumulation_steps
+        possible_values = [
+            n_gen for n_gen in range(2, effective_batch_size + 1) if (effective_batch_size) % n_gen == 0
+        ]
+        if self.num_generations not in possible_values:
+            raise ValueError(
+                f"The effective train batch size ({num_processes} x {args.per_device_train_batch_size} x "
+                f"{args.gradient_accumulation_steps}) must be evenly divisible by the number of generations per "
+                f"prompt ({self.num_generations}). Given the current effective train batch size, the valid values for "
+                f"the number of generations are: {possible_values}."
+            )
+        # Ensure each process receives a unique seed to prevent duplicate completions when generating with
+        # transformers if num_generations exceeds per_device_train_batch_size. We could skip it if we use vLLM, but
+        # it's safer to set it in all cases.
+        set_seed(args.seed, device_specific=True)
+        self.generation_config = GenerationConfig(
+            max_new_tokens=self.max_completion_length,
+            do_sample=True,
+            pad_token_id=processing_class.tokenizer.pad_token_id,
+            bos_token_id=processing_class.tokenizer.bos_token_id,
+            eos_token_id=processing_class.tokenizer.eos_token_id,
+            temperature=self.temperature,
+            top_p=self.top_p,
+            top_k=self.top_k,
+            min_p=self.min_p,
+            repetition_penalty=self.repetition_penalty,
+            cache_implementation=args.cache_implementation,
+            use_cache=False if self.args.gradient_checkpointing else True
+        )
+        # Gradient accumulation requires scaled loss. Normally, loss scaling in the parent class depends on whether the
+        # model accepts loss-related kwargs. Since we compute our own loss, this check is irrelevant. We set
+        # self.model_accepts_loss_kwargs to False to enable scaling.
+        self.model_accepts_loss_kwargs = False
+        self.processing_func = processing_func
+    def _set_signature_columns_if_needed(self):
+        # If `self.args.remove_unused_columns` is True, non-signature columns are removed.
+        # By default, this method sets `self._signature_columns` to the model's expected inputs.
+        # In GRPOTrainer, we preprocess data, so using the model's signature columns doesn't work.
+        # Instead, we set them to the columns expected by the `training_step` method, hence the override.
+        if self._signature_columns is None:
+            self._signature_columns = ["prompt"]
+    def get_train_dataloader(self):
+        if self.train_dataset is None:
+            raise ValueError("Trainer: training requires a train_dataset.")
+        train_dataset = self.train_dataset
+        data_collator = self.data_collator
+        if is_datasets_available() and isinstance(train_dataset, datasets.Dataset):
+            train_dataset = self._remove_unused_columns(train_dataset, description="training")
+        else:
+            data_collator = self._get_collator_with_removed_columns(data_collator, description="training")
+        dataloader_params = {
+            "batch_size": self._train_batch_size * self.args.gradient_accumulation_steps,  # < this is the change
+            "collate_fn": data_collator,
+            "num_workers": self.args.dataloader_num_workers,
+            "pin_memory": self.args.dataloader_pin_memory,
+            "persistent_workers": self.args.dataloader_persistent_workers,
+        }
+        if not isinstance(train_dataset, torch.utils.data.IterableDataset):
+            dataloader_params["sampler"] = self._get_train_sampler()
+            dataloader_params["drop_last"] = self.args.dataloader_drop_last
+            dataloader_params["worker_init_fn"] = seed_worker
+            dataloader_params["prefetch_factor"] = self.args.dataloader_prefetch_factor
+        dl = self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))
+        return dl
+    def _get_train_sampler(self) -> Sampler:
+        effective_batch_size = (
+            self.args.per_device_train_batch_size
+            * self.accelerator.num_processes
+            * self.args.gradient_accumulation_steps
+        )
+        return RepeatSampler(
+            data_source=self.train_dataset,
+            mini_repeat_count=self.num_generations,
+            batch_size=effective_batch_size // self.num_generations,
+            repeat_count=self.num_iterations * self.args.gradient_accumulation_steps,
+            shuffle=self.shuffle_dataset,
+            seed=self.args.seed,
+        )
+    def _get_eval_sampler(self, eval_dataset):
+        # eval_dataset is a map-style Dataset (not an IterableDataset)
+        return DistributedSampler(
+            dataset=eval_dataset,
+            num_replicas=self.accelerator.num_processes,
+            rank=self.accelerator.process_index,
+            shuffle=False,
+            seed=self.args.seed,
+        )
+    def _enable_gradient_checkpointing(self, model: PreTrainedModel, args: GRPOConfig) -> PreTrainedModel:
+        """Enables gradient checkpointing for the model."""
+        # Ensure use_cache is disabled
+        model.config.use_cache = False
+        # Enable gradient checkpointing on the base model for PEFT
+        if is_peft_model(model):
+            model.base_model.gradient_checkpointing_enable()
+        # Enable gradient checkpointing for non-PEFT models
+        else:
+            model.gradient_checkpointing_enable()
+        gradient_checkpointing_kwargs = args.gradient_checkpointing_kwargs or {}
+        use_reentrant = (
+            "use_reentrant" not in gradient_checkpointing_kwargs or gradient_checkpointing_kwargs["use_reentrant"]
+        )
+        if use_reentrant:
+            model.enable_input_require_grads()
+        return model
+    @profiling_decorator
+    def _get_last_hidden_state(self, unwrapped_model, input_ids, attention_mask, logits_to_keep=None):
+        if is_peft_model(unwrapped_model):
+            unwrapped_model = unwrapped_model.base_model.model
+        last_hidden_state = unwrapped_model.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
+        last_hidden_state = last_hidden_state[:, :-1, :]  # (B, L-1, H)
+        if logits_to_keep is not None:
+            last_hidden_state = last_hidden_state[:, -logits_to_keep:, :]  # (B, logits_to_keep, H)
+        return last_hidden_state
+    # Get the per-token log probabilities for the completions for the model and the reference model
+    @profiling_decorator
+    def _get_per_token_logps(self, model, input_ids, attention_mask, pixel_values, image_grid_thws, logits_to_keep, batch_size=None) -> torch.Tensor:
+        batch_size = batch_size or input_ids.size(0)  # Chunk inputs into smaller batches to reduce memory peak
+        all_logps = []
+        patch_s = 0
+        for i in range(0, input_ids.size(0), batch_size):
+            input_ids_batch = input_ids[i : i + batch_size]
+            attention_mask_batch = attention_mask[i : i + batch_size]
+            img_id = self.processing_class.tokenizer.convert_tokens_to_ids('<|image_pad|>')
+            image_patch_nums = (input_ids_batch == img_id).sum().item() * 4  # each image_pad token corresponds to 4 image patches
+            # print("image_patch_nums", image_patch_nums, pixel_values.shape)
+            pixel_values_batch = pixel_values[patch_s : patch_s + image_patch_nums]
+            patch_s += image_patch_nums
+            image_grid_thw_batch = image_grid_thws[i : i + batch_size]
+            # We add 1 to `logits_to_keep` because the last logit of the sequence is later excluded
+            logits = model(
+                input_ids=input_ids_batch, pixel_values=pixel_values_batch, image_grid_thw=image_grid_thw_batch,
+                attention_mask=attention_mask_batch
+            ).logits
+            # logits = logits[:, :-1, :]  # (B, L-1, H)
+            if logits_to_keep is not None:
+                logits = logits[:, -logits_to_keep-1:, :]  # (B, logits_to_keep + 1, V)
+            logits = logits[:, :-1, :]  # (B, logits_to_keep, V); exclude the last logit: it corresponds to the next token pred
+            input_ids_batch = input_ids_batch[:, -logits_to_keep:]
+            # For transformers<=4.48, logits_to_keep argument isn't supported, so here we drop logits ourselves.
+            # See https://github.com/huggingface/trl/issues/2770
+            logits = logits[:, -logits_to_keep:]
+            # Divide logits by sampling temperature.
+            # See https://huggingface.co/blog/the_n_implementation_details_of_rlhf_with_ppo#policy-training-implementation-details
+            logits = logits / self.temperature
+            logps = selective_log_softmax(logits, input_ids_batch)  # compute logprobs for the input tokens
+            all_logps.append(logps)
+        return torch.cat(all_logps, dim=0)
+    @profiling_decorator
+    def _prepare_inputs(
+        self, accumulated_local_batch: dict[str, Union[torch.Tensor, Any]]
+    ) -> dict[str, Union[torch.Tensor, Any]]:
+        mode = "train" if self.model.training else "eval"
+        if mode == "train":
+            generate_every = self.args.gradient_accumulation_steps * self.num_iterations
+            if self._step % generate_every == 0 or self._buffered_inputs is None:
+                # self._buffered_inputs=None can occur when resuming from a checkpoint
+                accumulated_local_batch = self._generate_and_score_completions(accumulated_local_batch)
+                self._buffered_inputs = split_tensor_dict(
+                    accumulated_local_batch, self.args.gradient_accumulation_steps
+                )
+            inputs = self._buffered_inputs[self._step % self.args.gradient_accumulation_steps]
+            self._step += 1
+        else:
+            # In evaluation, there is neither gradient accumulation, nor multiple iterations
+            inputs = self._generate_and_score_completions(accumulated_local_batch)
+        return inputs
+    def _generate_and_score_completions(
+        self, inputs: list[dict[str, Union[torch.Tensor, Any]]]
+    ) -> dict[str, Union[torch.Tensor, Any]]:
+        # TODO
+        device = self.accelerator.device
+        mode = "train" if self.model.training else "eval"
+        inputs_for_generate = inputs.copy()
+        # Remove the 'answer' key
+        inputs_for_generate = [{k: v for k, v in x.items() if k != 'answer'} for x in inputs_for_generate]
+        dt_generate_dt = self.processing_func(inputs_for_generate)
+        prompt_inputs_generate = super(DyMETrainer, self)._prepare_inputs(dt_generate_dt)
+        if 'labels' in prompt_inputs_generate:
+            del prompt_inputs_generate["labels"]
+        prompt_ids = prompt_inputs_generate["input_ids"]
+        prompt_mask = prompt_inputs_generate["attention_mask"]
+        pixel_values = prompt_inputs_generate["pixel_values"]
+        image_grid_thws = prompt_inputs_generate["image_grid_thw"]
+        # Regular generation path
+        with unwrap_model_for_generation(
+            self.model_wrapped, self.accelerator, gather_deepspeed3_params=self.args.ds3_gather_for_generation
+        ) as unwrapped_model:
+            with (
+                FSDP.summon_full_params(self.model_wrapped, recurse=False)
+                if self.is_fsdp_enabled
+                else nullcontext()
+            ):
+                prompt_completion_ids = unwrapped_model.generate(**prompt_inputs_generate, generation_config=self.generation_config)
+            # Compute prompt length and extract completion ids
+            prompt_length = prompt_ids.size(1)
+            prompt_ids = prompt_completion_ids[:, :prompt_length]
+            completion_ids = prompt_completion_ids[:, prompt_length:]
+        # Mask everything after the first EOS token
+        is_eos = completion_ids == self.processing_class.tokenizer.eos_token_id
+        eos_idx = torch.full((is_eos.size(0),), is_eos.size(1), dtype=torch.long, device=device)
+        eos_idx[is_eos.any(dim=1)] = is_eos.int().argmax(dim=1)[is_eos.any(dim=1)]
+        sequence_indices = torch.arange(is_eos.size(1), device=device).expand(is_eos.size(0), -1)
+        completion_mask = (sequence_indices <= eos_idx.unsqueeze(1)).int()
+        # If mask_truncated_completions is enabled, zero out truncated completions in completion_mask
+        if self.mask_truncated_completions:
+            truncated_completions = ~is_eos.any(dim=1)
+            completion_mask = completion_mask * (~truncated_completions).unsqueeze(1).int()
+        completions = self.processing_class.batch_decode(completion_ids, skip_special_tokens=True)
+        batch_size = len(completion_ids)
+        images = [x['image'] for x in inputs]
+        prompts = [x['prompt'] for x in inputs]
+        question_wo_prompts = [x['question_wo_prompt'] for x in inputs]
+        hints = [x.get('hint', '') for x in inputs]
+        answers = [x['answer'] for x in inputs]
+        images_path = [image if isinstance(image, str) else image.filename for image in images]
+        batch_data = {'prompt': prompts, 'hints': hints,
+                   'image': images_path, 'response': completions, 'answer': answers}
+        gpu_id = self.accelerator.device.index
+        all_rewards, format_rewards, acc_rewards, context_rewards = calculate_rewards_in_parallel(self.checker, batch_data,
+                                                                                   gpu_id=gpu_id,
+                                                                                   task=self.task_name, num_threads=1)
+        all_rewards = torch.tensor(all_rewards, dtype=torch.float32).to(self.accelerator.device)
+        format_rewards = torch.tensor(format_rewards, dtype=torch.float32).to(self.accelerator.device)
+        context_rewards = torch.tensor(context_rewards, dtype=torch.float32).to(self.accelerator.device)
+        acc_rewards = torch.tensor(acc_rewards, dtype=torch.float32).to(self.accelerator.device)
+        rewards_per_func = torch.zeros([len(all_rewards), 3], device=device)
+        rewards_per_func[:, 0] = format_rewards.clone()
+        rewards_per_func[:, 1] = context_rewards.clone()
+        rewards_per_func[:, -1] = acc_rewards.clone()
+        rewards_per_func = gather(rewards_per_func)
+        # Apply weights to each reward function's output and sum
+        rewards = (rewards_per_func * self.reward_weights.to(device).unsqueeze(0)).nansum(dim=1)
+        # Compute group-wise rewards
+        mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
+        std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)
+        # Normalize the rewards to compute the advantages
+        mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
+        std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
+        advantages = rewards - mean_grouped_rewards
+        if self.scale_rewards:
+            advantages = advantages / (std_grouped_rewards + 1e-4)
+        # Slice to keep only the local part of the data
+        process_slice = slice(
+            self.accelerator.process_index * len(prompts),
+            (self.accelerator.process_index + 1) * len(prompts),
+        )
+        advantages = advantages[process_slice]
+        advantages = advantages.reshape(-1, 1)
+        acc_rewards = acc_rewards.view(-1, self.num_generations)
+        format_rewards = format_rewards.view(-1, self.num_generations)
+        has_correct = (acc_rewards > 0.5).sum(1)
+        format_rewards = format_rewards.view(-1)
+        sft_check = []
+        for i in range(batch_size):
+            batch_id = i // self.num_generations
+            sft_check.append((has_correct[batch_id] == 0) & (i % self.num_generations == 0))
+        hints = refine_context_in_parallel(self.refiner, question_wo_prompts, hints, answers, task=self.task_name, gpu_id=gpu_id, num_threads=1)
+        sft_gt = [hint + '\n' + answer + self.end_flag for hint, answer in zip(hints, answers)]
+        sft_dt = self.processing_class.tokenizer(sft_gt, return_tensors="pt", padding=True,
+                                                        padding_side="right")
+        sft_padded_ids = sft_dt['input_ids'].to(device)
+        sft_attn_masks = sft_dt['attention_mask'].to(device)
+        sft_advantages = torch.ones_like(sft_attn_masks, device=device)
+        final_completion_id_list = []
+        final_completion_mask_list = []
+        final_advantange_list = []
+        for i in range(len(sft_padded_ids)):
+            batch_id = i // self.num_generations
+            if has_correct[batch_id] == 0:
+                if sft_check[i]:  # Replace the first sample with the correct answer; keep the others as incorrect.
+                    completion_id_ = torch.cat([sft_padded_ids[i], completion_ids[i][0:0]])
+                    completion_mask_ = torch.cat([sft_attn_masks[i], completion_mask[i][0:0]])
+                    advantange_ = torch.cat([sft_advantages[i], advantages[i][0:0]])
+                    advantange_[:] = 1
+                else:
+                    completion_id_ = torch.cat([completion_ids[i], completion_ids[i][0:0]])
+                    completion_mask_ = torch.cat([completion_mask[i], sft_attn_masks[i][0:0]])
+                    advantange_ = torch.cat([advantages[i], sft_advantages[i][0:0]])
+                    advantange_ = advantange_.repeat_interleave(len(completion_id_))
+                    advantange_[:] = 0
+            else:
+                completion_id_ = torch.cat([completion_ids[i], sft_padded_ids[i][0:0]])
+                completion_mask_ = torch.cat([completion_mask[i], sft_attn_masks[i][0:0]])
+                advantange_ = torch.cat([advantages[i], sft_advantages[i][0:0]])
+                # If advantange_ is a single number, expand it to one value per completion token
+                advantange_ = advantange_.repeat_interleave(len(completion_id_))
+            if has_correct[batch_id] == self.num_generations:  # Stop optimizing when all generations are correct
+                advantange_[:] = 0
+            final_completion_id_list.append(completion_id_)
+            final_completion_mask_list.append(completion_mask_)
+            final_advantange_list.append(advantange_)
+        completion_ids = pad_sequence(final_completion_id_list, batch_first=True,
+                                      padding_value=self.processing_class.tokenizer.pad_token_id).long()
+        completion_mask = pad_sequence(final_completion_mask_list, batch_first=True, padding_value=0)
+        completion_advantange = pad_sequence(final_advantange_list, batch_first=True, padding_value=0)
+        completion_ids = completion_ids.to(device)
+        completion_mask = completion_mask.to(device)
+        completion_advantange = completion_advantange.to(device)
+        input_completion_ids = torch.cat([prompt_ids, completion_ids], dim=1).long()
+        attention_completion_mask = torch.cat([prompt_mask, completion_mask], dim=1)
+        for s, a in enumerate(completion_advantange):
+            if acc_rewards.view(-1)[s] > 0 and format_rewards.view(-1)[s] > 0 and a[0] < 0:
+                print('no')
+        if self.accelerator.device.index == 0:
+            completion_id = completion_ids[0]
+            completion_id_pos = completion_id[(completion_advantange[0] > 0) & (completion_mask[0] > 0)]
+            completion_id_neg = completion_id[(completion_advantange[0] < 0) & (completion_mask[0] > 0)]
+            show = self.processing_class.decode(completion_id_pos, skip_special_tokens=False)
+            show_neg = self.processing_class.decode(completion_id_neg, skip_special_tokens=False)
+            print("\n=====has_correct====================\n", has_correct,)
+            print("\n=====prediction====================\n", completions[0],)
+            if show != "":
+                print("\n=====POS GT====================\n", show)
+            if show_neg != "":
+                print("\n======NEG GT===================\n", show_neg)
+        # Concatenate prompt_mask with completion_mask for logit computation
+        attention_mask = torch.cat([prompt_mask, completion_mask], dim=1)  # (B, P+C)
+        logits_to_keep = completion_ids.size(1)  # we only need to compute the logits for the completion tokens
+        batch_size = self.args.per_device_train_batch_size if mode == "train" else self.args.per_device_eval_batch_size
+        with torch.no_grad():
+            # When using num_iterations == 1, old_per_token_logps == per_token_logps, so we can skip its
+            # computation here, and use per_token_logps.detach() instead.
+            if self.num_iterations > 1:
+                old_per_token_logps = self._get_per_token_logps(self.model, input_completion_ids, attention_completion_mask, pixel_values, image_grid_thws,
+                                          logits_to_keep, batch_size)
+            else:
+                old_per_token_logps = None
+        # Log the metrics
+        if mode == "train":
+            self.state.num_input_tokens_seen += self.accelerator.gather_for_metrics(attention_mask.sum()).sum().item()
+        # log completion lengths, mean, min, max
+        agg_completion_mask = self.accelerator.gather_for_metrics(completion_mask.sum(1))
+        self._metrics[mode]["completions/mean_length"].append(agg_completion_mask.float().mean().item())
+        # identify sequences that terminated with EOS and log their lengths
+        agg_terminated_with_eos = self.accelerator.gather_for_metrics(is_eos.any(dim=1))
+        term_completion_mask = agg_completion_mask[agg_terminated_with_eos]
+        clipped_completions_ratio = 1 - len(term_completion_mask) / len(agg_completion_mask)
+        self._metrics[mode]["completions/clipped_ratio"].append(clipped_completions_ratio)
+        # Calculate mean reward per function, but only for samples where the function was applied (non-NaN values)
+        for i, reward_func_name in enumerate(self.reward_func_names):
+            mean_rewards = torch.nanmean(rewards_per_func[:, i]).item()
+            self._metrics[mode][f"rewards/{reward_func_name}/mean"].append(mean_rewards)
+        self._metrics[mode]["reward"].append(mean_grouped_rewards.mean().item())
+        self._metrics[mode]["reward_std"].append(std_grouped_rewards.mean().item())
+        for i, name in enumerate(self.reward_func_names):
+            self._textual_logs["rewards"][name].extend(rewards_per_func[:, i].tolist())
+        # completion_advantange: (batch_size, seq_len) or (batch_size, n)
+        mask_pos = completion_advantange > 0  # positions with positive advantage
+        row_min = completion_advantange.min(dim=1, keepdim=True).values.abs()  # (batch, 1)
+        # Add abs(row_min) only to positive advantages; set all other positions to 0
+        # completion_advantange = torch.where(
+        #     mask_pos,
+        #     completion_advantange + row_min,  # broadcasting aligns row_min with each row
+        #     torch.zeros_like(completion_advantange)
+        # )
+        return {
+            "prompt_ids": prompt_ids,
+            "prompt_mask": prompt_mask,
+            "pixel_values": pixel_values,
+            "completion_ids": completion_ids,
+            "completion_mask": completion_mask,
+            "advantages": completion_advantange,
+            "old_per_token_logps": old_per_token_logps,
+            # "has_correct": has_correct,
+            "image_grid_thws": image_grid_thws
+        }
+    def compute_liger_loss(self, unwrapped_model, inputs):
+        # Compute the per-token log probabilities for the model
+        prompt_ids, prompt_mask = inputs["prompt_ids"], inputs["prompt_mask"]
+        completion_ids, completion_mask = inputs["completion_ids"], inputs["completion_mask"]
+        input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
+        attention_mask = torch.cat([prompt_mask, completion_mask], dim=1)
+        logits_to_keep = completion_ids.size(1)  # we only need to compute the logits for the completion tokens
+        # Compute the KL divergence between the model and the reference model
+        ref_per_token_logps = None
+        # get the last hidden state of the model
+        last_hidden_state = self._get_last_hidden_state(unwrapped_model, input_ids, attention_mask, logits_to_keep)
+        # compute loss and metrics using liger grpo loss
+        loss, metrics = self.liger_grpo_loss(
+            _input=last_hidden_state,
+            lin_weight=unwrapped_model.lm_head.weight,
+            selected_token_ids=completion_ids,
+            attention_mask=completion_mask,
+            advantages=inputs["advantages"][:, 0],
+            bias=unwrapped_model.lm_head.bias,
+            old_per_token_logps=inputs["old_per_token_logps"],
+            ref_per_token_logps=ref_per_token_logps,
+        )
+        # Extract metrics from the liger_grpo_loss output
+        # KL divergence is the first metric when beta is non-zero
+        mean_kl = metrics[0] if self.beta != 0.0 else None
+        clip_ratio = metrics[-1]
+        mode = "train" if self.model.training else "eval"
+        if self.beta != 0.0:
+            self._metrics[mode]["kl"].append(self.accelerator.gather_for_metrics(mean_kl).mean().item())
+        self._metrics[mode]["clip_ratio"].append(self.accelerator.gather_for_metrics(clip_ratio).mean().item())
+        return loss
+    @profiling_decorator
+    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
+        if return_outputs:
+            raise ValueError("The GRPOTrainer does not support returning outputs")
+        if self.use_liger_loss:
+            # Compute the loss using the liger grpo loss
+            unwrapped_model = self.accelerator.unwrap_model(model)
+            return self._forward_redirection(model, unwrapped_model, self.compute_liger_loss, unwrapped_model, inputs)
+        else:
+            return self._compute_loss(model, inputs)
+    def _compute_loss(self, model, inputs):
+        # return torch.nn.Parameter(torch.tensor(0.0, device=self.accelerator.device))  # Dummy loss for compatibility
+        # Compute the per-token log probabilities for the model
+        prompt_ids, prompt_mask = inputs["prompt_ids"], inputs["prompt_mask"]
+        completion_ids, completion_mask = inputs["completion_ids"], inputs["completion_mask"]
+        pixel_values = inputs["pixel_values"]
+        # has_correct = inputs["has_correct"]
+        image_grid_thws = inputs["image_grid_thws"]
+        input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
+        attention_mask = torch.cat([prompt_mask, completion_mask], dim=1)
+        logits_to_keep = completion_ids.size(1)  # we only need to compute the logits for the completion tokens
+        try:
+            per_token_logps = self._get_per_token_logps(model, input_ids, attention_mask, pixel_values, image_grid_thws,
+                                               logits_to_keep)
+        except Exception as e:
+            print(f"Error in _get_per_token_logps: {e}")
+            raise e
+        # sft_loss = -(per_token_logps * completion_mask).sum(-1) / completion_mask.sum(-1)
+        advantages = inputs["advantages"][:, 0]
+        # sft_loss = (sft_loss * (advantages > 0)).sum() * (has_correct == 0)
+        # return sft_loss
+        # When using num_iterations == 1, old_per_token_logps == per_token_logps, so we can skip its computation (see
+        # _generate_and_score_completions) and use per_token_logps.detach() instead.
+        old_per_token_logps = inputs["old_per_token_logps"] if self.num_iterations > 1 else per_token_logps.detach()
+        coef_1 = torch.exp(per_token_logps - old_per_token_logps)
+        coef_2 = torch.clamp(coef_1, 1 - self.epsilon_low, 1 + self.epsilon_high)
+        per_token_loss1 = coef_1 * advantages.unsqueeze(1)
+        per_token_loss2 = coef_2 * advantages.unsqueeze(1)
+        per_token_loss = -torch.min(per_token_loss1, per_token_loss2)
+        if self.loss_type == "grpo":
+            loss = ((per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1.0)).mean()
+        elif self.loss_type == "bnpo":
+            loss = (per_token_loss * completion_mask).sum() / completion_mask.sum().clamp(min=1.0)
+        elif self.loss_type == "dr_grpo":
+            loss = (per_token_loss * completion_mask).sum() / (per_token_loss.size(0) * self.max_completion_length)
+        else:
+            raise ValueError(f"Unknown loss type: {self.loss_type}")
+        # loss = (has_correct > 0) * loss + sft_loss
+        # loss = (has_correct > 0) * loss
+        # Log the metrics
+        mode = "train" if self.model.training else "eval"
+        # Compute the clipped probability ratios
+        is_low_clipped = (coef_1 < 1 - self.epsilon_low) & (advantages.unsqueeze(1) < 0)
+        is_high_clipped = (coef_1 > 1 + self.epsilon_high) & (advantages.unsqueeze(1) > 0)
+        is_region_clipped = is_low_clipped | is_high_clipped
+        low_clip = (is_low_clipped * completion_mask).sum() / completion_mask.sum()
+        high_clip = (is_high_clipped * completion_mask).sum() / completion_mask.sum()
+        clip_ratio = (is_region_clipped * completion_mask).sum() / completion_mask.sum()
+        gathered_low_clip = self.accelerator.gather_for_metrics(low_clip)
+        self._metrics[mode]["clip_ratio/low_mean"].append(gathered_low_clip.nanmean().item())
+        self._metrics[mode]["clip_ratio/low_min"].append(nanmin(gathered_low_clip).item())
+        gathered_high_clip = self.accelerator.gather_for_metrics(high_clip)
+        self._metrics[mode]["clip_ratio/high_mean"].append(gathered_high_clip.nanmean().item())
+        self._metrics[mode]["clip_ratio/high_max"].append(nanmax(gathered_high_clip).item())
+        gathered_clip_ratio = self.accelerator.gather_for_metrics(clip_ratio)
+        self._metrics[mode]["clip_ratio/region_mean"].append(gathered_clip_ratio.nanmean().item())
+        return loss
+    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys: Optional[list[str]] = None):
+        inputs = self._prepare_inputs(inputs)
+        with torch.no_grad():
+            with self.compute_loss_context_manager():
+                loss = self.compute_loss(model, inputs)
+            loss = loss.mean().detach()
+        return loss, None, None
+    def log(self, logs: dict[str, float], start_time: Optional[float] = None) -> None:
+        mode = "train" if self.model.training else "eval"
+        metrics = {key: sum(val) / len(val) for key, val in self._metrics[mode].items()}  # average the metrics
+        # This method can be called both in training and evaluation. When called in evaluation, the keys in `logs`
+        # start with "eval_". We need to add the prefix "eval_" to the keys in `metrics` to match the format.
+        if mode == "eval":
+            metrics = {f"eval_{key}": val for key, val in metrics.items()}
+        logs = {**logs, **metrics}
+        if version.parse(transformers.__version__) >= version.parse("4.47.0.dev0"):
+            super().log(logs, start_time)
+        else:  # transformers<=4.46
+            super().log(logs)
+        self._metrics[mode].clear()
+        if self.accelerator.is_main_process and self.log_completions:
+            if self.args.report_to and "wandb" in self.args.report_to and wandb.run is not None:
+                import pandas as pd
+                table = {
+                    "step": [str(self.state.global_step)] * len(self._textual_logs["prompt"]),
+                    "prompt": self._textual_logs["prompt"],
+                    "completion": self._textual_logs["completion"],
+                    **self._textual_logs["rewards"],
+                }
+                df = pd.DataFrame(table)
+                if self.wandb_log_unique_prompts:
+                    df = df.drop_duplicates(subset=["prompt"])
+                wandb.log({"completions": wandb.Table(dataframe=df)})
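
The group-wise advantage step in `_generate_and_score_completions` (mean and standard deviation per group of `num_generations` rewards, then `rewards - mean`, optionally divided by `std + 1e-4`) can be illustrated in isolation. The following is a minimal pure-Python sketch, not the trainer's code: the function name `group_relative_advantages` is hypothetical, it works on plain lists instead of tensors, and it treats a single-element group as having zero spread, which is an assumption made here for convenience.

```python
from statistics import mean, stdev


def group_relative_advantages(rewards, num_generations, scale_rewards=True, eps=1e-4):
    """Illustrative sketch: normalize rewards within consecutive groups.

    `rewards` is a flat list laid out as consecutive groups of
    `num_generations` samples that share one prompt.
    """
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")

    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu = mean(group)
        # Sample standard deviation (n - 1 denominator); assumed 0.0 for a
        # single-element group so the sketch stays well defined.
        sigma = stdev(group) if len(group) > 1 else 0.0
        for r in group:
            adv = r - mu
            if scale_rewards:
                adv = adv / (sigma + eps)
            advantages.append(adv)
    return advantages


# Two prompts, two generations each: the second group is uniformly rewarded,
# so its advantages are zero and it contributes no gradient signal.
print(group_relative_advantages([1.0, 0.0, 1.0, 1.0], num_generations=2))
```

A group whose rewards are all equal yields zero advantages, which is consistent with the trainer zeroing the advantage when every generation in a group is correct.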

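
The "mask everything after the first EOS token" step in `_generate_and_score_completions` keeps tokens up to and including the first EOS and zeroes the rest; a completion with no EOS keeps every position. Below is a small pure-Python sketch of that rule, assuming list inputs rather than tensors. The helper name `completion_mask_after_eos` is hypothetical and is not part of the trainer.

```python
def completion_mask_after_eos(completion_ids, eos_token_id):
    """Illustrative sketch: 1 for tokens up to and including the first EOS, else 0."""
    masks = []
    for row in completion_ids:
        # Default to len(row) when no EOS is present, so every position is kept.
        eos_idx = row.index(eos_token_id) if eos_token_id in row else len(row)
        masks.append([1 if i <= eos_idx else 0 for i in range(len(row))])
    return masks


# First row ends at index 1 (EOS id 2); second row was never terminated.
print(completion_mask_after_eos([[5, 2, 9, 9], [5, 6, 7, 8]], eos_token_id=2))
```

When `mask_truncated_completions` is enabled in the trainer, rows without any EOS are additionally zeroed out entirely; the sketch above does not model that option.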
trainer/__init__.py ADDED

	@@ -0,0 +1,3 @@


+ from .DyMETrainer import DyMETrainer
+
+ __all__ = ["DyMETrainer"]