diff --git a/.gitattributes b/.gitattributes index a6344aac8c09253b3b630fb776ae94478aa0275b..b18ecf2182c490f5a25a539197892f96d9199a29 100644 --- a/.gitattributes +++ b/.gitattributes @@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text *.zip filter=lfs diff=lfs merge=lfs -text *.zst filter=lfs diff=lfs merge=lfs -text *tfevents* filter=lfs diff=lfs merge=lfs -text +images/curves.png filter=lfs diff=lfs merge=lfs -text +images/demo2.png filter=lfs diff=lfs merge=lfs -text +images/performance.png filter=lfs diff=lfs merge=lfs -text diff --git a/README.md b/README.md new file mode 100644 index 0000000000000000000000000000000000000000..99f29617c167ac840c05ee6cc7e0d7d9d7922a7f --- /dev/null +++ b/README.md @@ -0,0 +1,187 @@ +# Video-R1: Reinforcing Video Reasoning in MLLMs + +[[📖 Paper](https://arxiv.org/pdf/2503.21776)] [[🤗 Video-R1-7B-model](https://huggingface.co/Video-R1/Video-R1-7B)] [[🤗 Video-R1-train-data](https://huggingface.co/datasets/Video-R1/Video-R1-data)] +[[🤖 Video-R1-7B-model](https://modelscope.cn/models/Video-R1/Video-R1-7B)] [[🤖 Video-R1-train-data](https://modelscope.cn/datasets/Video-R1/Video-R1-data)] + + + +## 👀 About Video-R1 + +Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based RL, we introduce Video-R1 as **the first work to *systematically* explore the R1 paradigm for eliciting video reasoning** within MLLMs. + +We introduce T-GRPO, an extension of GRPO that incorporates temporal modeling to **explicitly promote temporal reasoning**. In addition, we construct two datasets: **Video-R1-COT-165k** for SFT cold start and **Video-R1-260k** for RL training, both comprising image and video data. + +Our Video-R1-7B obtains strong performance on several video reasoning benchmarks. For example, Video-R1-7B attains 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, **surpassing the commercial proprietary model GPT-4o**.
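As a simplified illustration of the GRPO-style objective behind this training (the temporal extension in T-GRPO and the exact reward design are described in the paper; the code below is our own sketch, not the repository's implementation):

```python
# Sketch of the group-relative advantage at the heart of GRPO-style RL.
# For each prompt, several responses are sampled and scored with a
# rule-based reward (e.g., 1.0 for a correct final answer, 0.0 otherwise,
# possibly plus format bonuses). Each response's advantage is its reward
# normalized against the other responses sampled for the same prompt.

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize per-response rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four responses sampled for one prompt, two of them correct: correct
# responses receive a positive advantage, incorrect ones a negative one.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Roughly speaking, T-GRPO extends this by additionally contrasting responses generated from temporally ordered versus shuffled frames, so that the extra reward favors genuinely temporal reasoning; see the paper for the exact formulation.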
+ +Video-R1-7B **can be easily trained** using 4 H20 (96GB) GPUs, or 5 A100 (80G) GPUs. + + + +## 🔥 News +- [2025/05/28] Our Video-R1-7B achieves **36.5%** accuracy on the new video reasoning benchmark [**Video-Holmes**](https://video-holmes.github.io/Page.github.io/), beating the commercial models **o4-mini (29.9%)** and **Gemini-2.0-Flash (30.6%)**. +- [2025/03/28] We release our paper, code, model weights, and two curated training datasets on Hugging Face🤗 and ModelScope🤖. +- [2025/02/23] We release the preliminary version of Video-R1; you can find it in `./previous_version`. + +## 📍 Features + ++ Support Qwen2.5-VL ++ Support vLLM training and inference ++ Support Image-Video mixed training ++ Support multiple answer output types (multiple choice, numerical, OCR, free-form, regression) ++ Provide the full pipeline (dataset, CoT annotation, SFT training, RL training, evaluation, etc.) + +## 🔍 Dataset + + To overcome the scarcity of high-quality video reasoning training data, we strategically introduce image-based reasoning data as part of the training data. We collect data from a variety of public datasets and carefully sample and balance the proportion of each subset. + + + +To facilitate an effective SFT cold start, we leverage Qwen2.5-VL-72B to generate CoT rationales for the samples in Video-R1-260k. After applying basic rule-based filtering to remove low-quality or inconsistent outputs, we obtain a high-quality CoT dataset, Video-R1-COT-165k. + +## 🏆 Performance + + + +Video-R1 significantly outperforms previous models across most benchmarks. Notably, on VSI-Bench, which focuses on spatial reasoning in videos, Video-R1-7B achieves a new state-of-the-art accuracy of 35.8%, surpassing GPT-4o, a proprietary model, while using only 32 frames and 7B parameters. + +This highlights the necessity of explicit reasoning capabilities for solving video tasks, and confirms the effectiveness of reinforcement learning for them. + +
+ Descriptive alt text +
+ +In addition, although the model is trained using only 16 frames, we find that evaluating on more frames (e.g., 64) generally leads to better performance, particularly on benchmarks with longer videos. These results indicate the importance of training models to reason over more frames. + + +## 🧠 Aha Moment in Video Reasoning + +One of the most intriguing outcomes of reinforcement learning in Video-R1 is the emergence of self-reflection reasoning behaviors, commonly referred to as “aha moments”. Some examples are as follows. + + + + + + +## 📈 RL Training Curves + +The accuracy reward exhibits a generally upward trend, indicating that the model continuously improves its ability to produce correct answers under RL. + +Interestingly, the response length curve first drops at the beginning of RL training, then gradually increases. We suspect this is because the model initially discards its previous, potentially sub-optimal reasoning style, then gradually converges to a better and more stable reasoning policy. + + + + + +## 📐 Setup + +```bash +git clone https://github.com/tulerfeng/Video-R1 +cd Video-R1 + +# build environment +conda create -n video-r1 python=3.11 +conda activate video-r1 +bash setup.sh + +# qwen video extraction setting, e.g., max frames, resolutions +# Use the [decord] feature to improve speed +cd src/qwen-vl-utils +pip install -e .[decord] +cd .. + +# download training dataset +git lfs install +git clone https://huggingface.co/datasets/Video-R1/Video-R1-data +``` + +Please put the downloaded dataset in `src/r1-v/Video-R1-data/`. + +Then, unzip the data: + +```bash +python ./src/unzip.py +``` + +The `Video-R1-260k.json` file is for RL training, while `Video-R1-COT-165k.json` is for SFT cold start. + +Qwen2.5-VL has been frequently updated in the Transformers library, which may cause version-related bugs or inconsistencies.
Our code is compatible with the following pinned version, which you can download [here](https://drive.google.com/file/d/1Kc81WZitEhUZYWXpL6y2GXuSXufLSYcF/view?usp=sharing). + +Then install our provided version of transformers: + +```bash +unzip transformers-main.zip +cd ./transformers-main +pip install . +``` + +For the vLLM library, please use version 0.7.2. + +For the trl library, please use version 0.16.0. + +## 🚀 Training + +We first perform supervised fine-tuning on the Video-R1-COT-165k dataset for one epoch to obtain the Qwen2.5-VL-7B-SFT model. If you want to perform CoT annotation on your own data, please refer to `src/generate_cot_vllm.py`. + +```bash +bash ./src/scripts/run_sft_video.sh +``` +If you want to skip the SFT process, we also provide one of our SFT models at [🤗Qwen2.5-VL-SFT](https://huggingface.co/Video-R1/Qwen2.5-VL-7B-COT-SFT). + +This is followed by RL training on the Video-R1-260k dataset to produce the final Video-R1 model. Due to current computational resource limitations, we train the model for only 1.2k RL steps. + +The script for training the obtained Qwen2.5-VL-7B-SFT model with T-GRPO or GRPO is as follows: + +```bash +bash ./src/scripts/run_grpo_video.sh +``` + +You can also use the following script to enable vLLM acceleration for RL training: + +```bash +bash ./src/scripts/run_grpo_vllm_qwen25vl.sh +``` + +For efficiency considerations, we limit the maximum number of video frames to 16 during training. Each frame is processed at a max resolution of 128 × 28 × 28. You can set this in `src/qwen-vl-utils`. + +Please keep `per_device_train_batch_size=1`, as in the previous work R1-V. + +## 🔮 Inference & Evaluation + +During inference, we increase the max frame resolution to 256 × 28 × 28 and max frames to 16/32/64 to enhance performance. You can easily set this in `src/qwen-vl-utils`. + +For all evaluations, we follow the decoding configuration used in the official Qwen2.5-VL demo, with top\_p = 0.001 and temperature = 0.01.
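In Transformers terms, this decoding setup corresponds to generation kwargs like the following (a sketch; `max_new_tokens` is our own placeholder, not a value specified in this README):

```python
# Decoding settings matching the configuration above (official Qwen2.5-VL
# demo values). These kwargs would be passed to `model.generate(...)`.
# do_sample must be True for temperature/top_p to take effect; with
# temperature=0.01 and top_p=0.001, sampling is effectively near-greedy.
GEN_KWARGS = {
    "do_sample": True,
    "temperature": 0.01,
    "top_p": 0.001,
    "max_new_tokens": 1024,  # placeholder; adjust per benchmark
}
print(sorted(GEN_KWARGS))
```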
Setting a large top_p may produce garbled output during inference. + +We recommend using our provided json files and scripts for easier evaluation. + +The json files can be downloaded at: [[🤗 Video-R1-eval](https://huggingface.co/datasets/Video-R1/Video-R1-eval)]; put them in `/src/r1-v/Evaluation` + +Next, download the evaluation video data from each benchmark’s official website, and place them in `/src/r1-v/Evaluation` as specified in the provided json files. + +Finally, conduct evaluation on all benchmarks using the following script: + +```bash +bash ./src/eval_bench.sh +``` +For inference on a single example, you may use: + +```bash +python ./src/inference_example.py +``` + +## Acknowledgements + +We sincerely appreciate the contributions of the open-source community. The related projects are as follows: [R1-V](https://github.com/Deep-Agent/R1-V), [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) + +## Citations + +If you find our work helpful for your research, please consider citing our work.
+ +``` +@article{feng2025video, + title={Video-R1: Reinforcing Video Reasoning in MLLMs}, + author={Feng, Kaituo and Gong, Kaixiong and Li, Bohao and Guo, Zonghao and Wang, Yibing and Peng, Tianshuo and Wang, Benyou and Yue, Xiangyu}, + journal={arXiv preprint arXiv:2503.21776}, + year={2025} +} +``` diff --git a/create_data.py b/create_data.py new file mode 100644 index 0000000000000000000000000000000000000000..2c352a6da9b24a0cb08ebb5c818e52702841b030 --- /dev/null +++ b/create_data.py @@ -0,0 +1,370 @@ +# import re +# from pathlib import Path +# from datasets import load_dataset, Dataset, DatasetDict, Features, Value, Image +# import re +# from typing import Dict, List, Optional +# from pathlib import Path +# from datasets import Dataset, DatasetDict, concatenate_datasets, Features, Value, Sequence + + +# # ------------------------------------------------------------------ +# # 0) Load your JSON → `raw_ds` exactly as before +# # ------------------------------------------------------------------ + +# files = [ +# "pool_multiple_choice_chunk_01.json", +# "pool_multiple_choice_chunk_02.json", +# "pool_multiple_choice_chunk_03.json", +# "pool_multiple_choice_chunk_04.json", +# "pool_numerical_chunk_01.json", +# "pool_numerical_chunk_02.json", +# "pool_numerical_chunk_03.json", +# "pool_regression_chunk_01.json", +# ] + +# # ---- 1-4. load, trim, normalise ---------------------------------------- +# def load_trim_normalise(fp, cap=10_000): +# ds = Dataset.from_json(fp) + +# # a) truncate +# ds = ds.select(range(min(cap, len(ds)))) + +# # b) make sure `options` exists and is always list[str] +# if "options" not in ds.column_names: +# ds = ds.add_column("options", [[]] * len(ds)) +# else: +# ds = ds.map( +# lambda ex: {"options": [str(o) for o in (ex["options"] or [])]}, +# remove_columns=[], num_proc=4, +# ) + +# return ds + +# ds_list = [load_trim_normalise(fp) for fp in files] + +# # ---- 4. 
align feature schema explicitly (all files now identical) ------- +# common_features = Features({ +# "problem_id" : Value("int64"), +# "problem" : Value("string"), +# "data_type" : Value("string"), +# "problem_type": Value("string"), +# "options" : Sequence(Value("string")), +# "solution" : Value("string"), +# "path" : Value("string"), +# "data_source" : Value("string"), +# }) +# ds_list = [d.cast(common_features) for d in ds_list] + +# # ---- 5. concatenate ----------------------------------------------------- +# raw_train = concatenate_datasets(ds_list) +# raw_ds = DatasetDict({"train": raw_train}) + +# # ------------------------------------------------------------------ +# # 1) Build the question (unchanged) +# # ------------------------------------------------------------------ +# def build_question(example): +# q = ( +# example["problem"] + " Options:\n" + "\n".join(example["options"]) +# if example["problem_type"] == "multiple choice" +# else example["problem"] +# ) +# example["problem"] = q +# return example + + +# def extract_answer(predict: str) -> Optional[str]: +# """ +# Extracts the content of the block from `predict`. +# Returns the inner text (with leading/trailing whitespace stripped), +# or None if no tag is found. 
+# """ +# match = re.search(r"([\s\S]*?)", predict, re.DOTALL) +# if not match: +# return predict +# return match.group(1).strip() + + + +# def add_answer(example): +# # assumes the ground-truth answer (tagged) is in `solution` +# example["answer"] = extract_answer(example["solution"]) +# return example + +# # ------------------------------------------------------------------ +# # 3) Embed image bytes (column name stays "images") +# # ------------------------------------------------------------------ +# def to_embedded_image(example): +# if example["data_type"] != "image": +# example["images"] = None +# return example +# with open(example["path"], "rb") as f: +# img_bytes = f.read() +# example["images"] = {"bytes": img_bytes, "path": None} +# return example + +# # ------------------------------------------------------------------ +# # 4) Full pipeline +# # ------------------------------------------------------------------ +# processed = ( +# raw_ds["train"] +# .map(build_question, num_proc=4) +# .map(add_answer, num_proc=4) +# .map(to_embedded_image, num_proc=4) +# .remove_columns([ +# "path", "data_type", "options", "problem_type", "solution", +# "problem_id", "data_source" # ← drop these too +# ]) +# ) + +# # ------------------------------------------------------------------ +# # 5) Schema must match the final column names +# # ------------------------------------------------------------------ +# features = Features({ +# "problem": Value("string"), +# "answer" : Value("string"), +# "images" : Image(), # keep plural name +# }) +# processed = processed.cast(features) + +# # ------------------------------------------------------------------ +# # 6) Write Parquet shards (file prefix inside the folder) +# # ------------------------------------------------------------------ +# out_dir = Path("qwen2.5_vl_portable") +# out_dir.mkdir(parents=True, exist_ok=True) + +# # processed.to_parquet(str(out_dir / "train.parquet")) # → train-00000-of-00001.parquet +# 
processed.to_parquet(str("./hf_data/train.parquet")) +# print("✓ Dataset written with embedded images and answers →", out_dir.resolve()) + + +# import re +# from pathlib import Path +# from typing import Dict, List, Optional + +# from datasets import ( +# Dataset, +# DatasetDict, +# concatenate_datasets, +# Features, +# Value, +# Sequence, +# Image, +# ) + +# # ------------------------------------------------------------------ +# # 0) Inputs +# # ------------------------------------------------------------------ +# files = [ +# "pool_multiple_choice_chunk_01.json", +# "pool_multiple_choice_chunk_02.json", +# "pool_multiple_choice_chunk_03.json", +# "pool_multiple_choice_chunk_04.json", +# "pool_numerical_chunk_01.json", +# "pool_numerical_chunk_02.json", +# "pool_numerical_chunk_03.json", +# "pool_regression_chunk_01.json", +# ] + +# # ------------------------------------------------------------------ +# # 1) Define common meta schema (what you want to keep in the output) +# # ------------------------------------------------------------------ +# common_features = Features({ +# "problem_id" : Value("int64"), +# "problem" : Value("string"), +# "data_type" : Value("string"), +# "problem_type": Value("string"), +# "options" : Sequence(Value("string")), +# "solution" : Value("string"), +# "path" : Value("string"), +# "data_source" : Value("string"), +# }) + +# # Final (superset) schema to write: meta + new columns +# full_features = common_features.copy() +# full_features["answer"] = Value("string") +# full_features["images"] = Image() # plural name kept, binary-friendly + + +# # ------------------------------------------------------------------ +# # 2) Load + normalize each JSON +# # ------------------------------------------------------------------ +# def load_trim_normalise(fp: str, cap: int = 10_000) -> Dataset: +# ds = Dataset.from_json(fp) + +# # truncate if desired +# ds = ds.select(range(min(cap, len(ds)))) + +# # ensure `options` exists and is always list[str] 
+# if "options" not in ds.column_names: +# ds = ds.add_column("options", [[]] * len(ds)) +# else: +# ds = ds.map( +# lambda ex: {"options": [str(o) for o in (ex["options"] or [])]}, +# remove_columns=[], +# num_proc=4, +# ) + +# # align to the common meta schema early (helps concat) +# # Some JSONs may not have all fields; add missing with defaults first. +# missing_cols = [k for k in common_features.keys() if k not in ds.column_names] +# for mc in missing_cols: +# # create sensible defaults +# if mc == "options": +# ds = ds.add_column(mc, [[]] * len(ds)) +# elif common_features[mc].dtype == "int64": +# ds = ds.add_column(mc, [0] * len(ds)) +# else: +# ds = ds.add_column(mc, [""] * len(ds)) + +# ds = ds.cast(common_features) +# return ds + +# ds_list = [load_trim_normalise(fp) for fp in files] + +# # Concatenate shards +# raw_train = concatenate_datasets(ds_list) +# raw_ds = DatasetDict({"train": raw_train}) + + +# # ------------------------------------------------------------------ +# # 3) Processing fns +# # ------------------------------------------------------------------ +# def build_question(example: Dict) -> Dict: +# """ +# If multiple-choice, append the options to the text. +# Overwrites the `problem` field in-place (kept in output). +# """ +# if example["problem_type"] == "multiple choice": +# opts = example.get("options") or [] +# q = example["problem"] + " Options:\n" + "\n".join(opts) +# example["problem"] = q +# return example + + +# def extract_answer(predict: str) -> Optional[str]: +# """ +# Return inner text of ..., stripped. +# If no tag is found, return the original string. 
+# """ +# if predict is None: +# return None +# match = re.search(r"([\s\S]*?)", predict, re.DOTALL) +# if not match: +# return predict +# return match.group(1).strip() + + +# def add_answer(example: Dict) -> Dict: +# example["answer"] = extract_answer(example.get("solution", "")) +# return example + + +# def to_embedded_image(example: Dict) -> Dict: +# """ +# If data_type == 'image', embed bytes for HF Image() feature. +# Otherwise leave as None. +# """ +# if example.get("data_type") != "image": +# example["images"] = None +# return example + +# path = example.get("path") +# if not path: +# example["images"] = None +# return example + +# try: +# with open(path, "rb") as f: +# img_bytes = f.read() +# example["images"] = {"bytes": img_bytes, "path": None} +# except Exception: +# # If image is missing or unreadable, keep None so cast still works +# example["images"] = None +# return example + + +# # ------------------------------------------------------------------ +# # 4) Apply pipeline (do NOT drop meta columns you want to keep) +# # ------------------------------------------------------------------ +# processed = ( +# raw_ds["train"] +# .map(build_question, num_proc=4) +# .map(add_answer, num_proc=4) +# .map(to_embedded_image, num_proc=4) +# .cast(full_features) # <- ensure final schema +# ) + +# # Optional: control output column ordering +# processed = processed.select_columns(list(full_features.keys())) + +# # ------------------------------------------------------------------ +# # 5) Write Parquet +# # ------------------------------------------------------------------ +# out_dir = Path("./hf_data") +# out_dir.mkdir(parents=True, exist_ok=True) + +# out_path = out_dir / "train.parquet" +# processed.to_parquet(str(out_path)) + +# print("✓ Wrote:", out_path.resolve()) +# print("Columns:", list(processed.features.keys())) + + +# ------------------------------------------------------------------ +# 4.1) Downsample to 30k, mainly reducing math-heavy sources +# 
------------------------------------------------------------------ +from collections import Counter +from pathlib import Path +from typing import Optional + +from datasets import concatenate_datasets + +# NOTE: this step assumes `processed` was built by the pipeline above +# (currently commented out). + +TARGET_SIZE = 30_000 +MATH_SHARE = 0.20 # keep ~20% math (tweak if you want) +SEED = 2025 + +# Define which sources are "mathy" +MATH_SOURCES = { + "Multimath-300k", + "TabMWP", + "Geometry3K", + "CLEVR-Math", + "DVQA", + "FigureQA", + "ChartQA", + "PlotQA", + "EXAMS-V-train/Mathematics", + "UniGeo", + "GeoQA+", +} + +def is_math_source(name: Optional[str]) -> bool: + if not name: + return False + return name in MATH_SOURCES or ("math" in name.lower()) + +# Split +math_ds = processed.filter(lambda ex: is_math_source(ex.get("data_source")), num_proc=4) +non_math_ds = processed.filter(lambda ex: not is_math_source(ex.get("data_source")), num_proc=4) + +# Decide quotas +non_math_quota = min(len(non_math_ds), int(TARGET_SIZE * (1 - MATH_SHARE))) +math_quota = TARGET_SIZE - non_math_quota +math_quota = min(math_quota, len(math_ds)) # guard if math is too small + +# Sample deterministically +non_math_sample = non_math_ds.shuffle(seed=SEED).select(range(non_math_quota)) +math_sample = math_ds.shuffle(seed=SEED).select(range(math_quota)) + +# Combine and shuffle +final = concatenate_datasets([non_math_sample, math_sample]).shuffle(seed=SEED) + +# Quick sanity printout +cnt = Counter(final["data_source"]) +total = len(final) +print(f"Final size: {total} (non-math {non_math_quota}, math {math_quota})") +for name, n in sorted(cnt.items(), key=lambda x: -x[1])[:25]: + pct = n / total + print(f"{name:30s} {n:6d} {pct:7.3%}") + +# Use this 'final' dataset for writing +processed = final +out_dir = Path("./hf_data") +out_dir.mkdir(parents=True, exist_ok=True) +out_path = out_dir / "train_30k.parquet" +processed.to_parquet(str(out_path)) +print("✓ Wrote:", out_path.resolve()) diff --git a/get_parquet_data.py b/get_parquet_data.py new file mode 100644 index 0000000000000000000000000000000000000000..9511dd0d209e5627ec7b5a66e9ff1311e24b6cb9 --- /dev/null +++ b/get_parquet_data.py @@ -0,0 +1,48 @@ +import json +import io +from datasets import Dataset, 
Features, Sequence, Value, Image +from PIL import Image as PILImage + +# 1️⃣ Load your JSON file (which is a top-level list of dicts) +with open("Train_QA_10k_noFreeForm.json", "r") as f: + records = json.load(f) # List[Dict] + +# 2️⃣ Build an HF Dataset +ds = Dataset.from_list(records) + +# 3️⃣ Read each image file into raw bytes +def read_image_bytes(example): + with open(example["path"], "rb") as img_f: + example["image_bytes"] = img_f.read() + return example + +# we keep all original columns + add "image_bytes" +ds = ds.map(read_image_bytes, remove_columns=[]) + +# 4️⃣ Define your schema, telling HF that image_bytes is binary +features = Features({ + "problem_id": Value("int64"), + "problem": Value("string"), + "data_type": Value("string"), + "problem_type": Value("string"), + "options": Sequence(Value("string")), + "solution": Value("string"), + "data_source": Value("string"), + # "prompt": Value("string"), + "answer": Value("string"), + "path": Value("string"), + "image_bytes": Value("binary"), # ← raw bytes in Arrow +}) +ds = ds.cast(features) + +# 5️⃣ Rename, and cast that byte-column to an Image feature that decodes to PIL +ds = ds.rename_column("image_bytes", "images") +ds = ds.cast_column("images", Image(decode=True)) + +# 6️⃣ Sanity-check +img0 = ds[0]["images"] +print(img0) +# → PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x384 + +# 7️⃣ Finally, write out to Parquet (the bytes go in the file) +ds.to_parquet("./hf_data/Train_QA_10k_noFreeForm.parquet") diff --git a/images/curves.png b/images/curves.png new file mode 100644 index 0000000000000000000000000000000000000000..57b8164586abe64349a34fa269562ed367ed40e8 --- /dev/null +++ b/images/curves.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:72f6c9fdc2b4e59df266b31e85cf186ee1acc9b508b9001f709f05873dd00b20 +size 277383 diff --git a/images/demo2.png b/images/demo2.png new file mode 100644 index 
0000000000000000000000000000000000000000..dc115d5da1768bbdef1a6660aabc9a463022b74a --- /dev/null +++ b/images/demo2.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8a6dde7c88f100fdffcc01142dd997d1f53f6255d0d4d5ebe36808d8733d6280 +size 1000742 diff --git a/images/frames.png b/images/frames.png new file mode 100644 index 0000000000000000000000000000000000000000..e2ad7b6ee6decd09805d83570585a145f18466ca Binary files /dev/null and b/images/frames.png differ diff --git a/images/performance.png b/images/performance.png new file mode 100644 index 0000000000000000000000000000000000000000..2320a16831917dd7f9da922922f013ce683e6e71 --- /dev/null +++ b/images/performance.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:bf4f04aa5954b18a90471b992cf50d45ba1289bf1412ff77e6d5557b88c09659 +size 445327 diff --git a/merge_data.py b/merge_data.py new file mode 100644 index 0000000000000000000000000000000000000000..e18b0bff259be679bc177ce920b152819dfd1382 --- /dev/null +++ b/merge_data.py @@ -0,0 +1,71 @@ +import json +from pathlib import Path +from typing import Iterator, Dict + +# ----------------------------- +# Inputs +# ----------------------------- +files = [ + "pool_multiple_choice_chunk_01.json", + "pool_multiple_choice_chunk_02.json", + "pool_multiple_choice_chunk_03.json", + "pool_multiple_choice_chunk_04.json", + "pool_numerical_chunk_01.json", + "pool_numerical_chunk_02.json", + "pool_numerical_chunk_03.json", + "pool_regression_chunk_01.json", +] + +out_path = Path("merged_train.json") + +# ----------------------------- +# Read records from JSON/JSONL +# ----------------------------- +def iter_records(path: Path) -> Iterator[Dict]: + """ + Yields records from a file that can be: + - JSONL (one JSON object per line), or + - a single JSON array, or + - a single JSON object. 
+ """ + text = path.read_text(encoding="utf-8") + # Try whole-file JSON first (array or object) + try: + data = json.loads(text) + if isinstance(data, list): + for rec in data: + yield rec + elif isinstance(data, dict): + yield data + else: + raise ValueError(f"Unsupported top-level JSON type in {path}") + except json.JSONDecodeError: + # Fallback: treat as JSONL + for i, line in enumerate(text.splitlines(), 1): + line = line.strip() + if not line: + continue + try: + yield json.loads(line) + except json.JSONDecodeError as e: + raise ValueError(f"Invalid JSON on line {i} in {path}: {e}") from e + +# ----------------------------- +# Merge & write single JSON file +# ----------------------------- +out_path.parent.mkdir(parents=True, exist_ok=True) + +count = 0 +with out_path.open("w", encoding="utf-8") as out: + out.write("[\n") + first = True + for fp in files: + for rec in iter_records(Path(fp)): + if not first: + out.write(",\n") + out.write(json.dumps(rec, ensure_ascii=False)) + first = False + count += 1 + out.write("\n]") + +print(f"✓ Wrote {count} records to {out_path.resolve()}") diff --git a/move.sh b/move.sh new file mode 100644 index 0000000000000000000000000000000000000000..2ebcbcb464058d585ebbb4a464d41ca75835e993 --- /dev/null +++ b/move.sh @@ -0,0 +1,5 @@ +cp -r /cq_1/share_1603164/user/zongxia/workspace/Video-R1/src/scripts/ ./src/ + +cp -r /cq_1/share_1603164/user/zongxia/workspace/Video-R1/src/r1-v/src/open_r1/ ./src/r1-v/src/ + +cp -r /cq_1/share_1603164/user/zongxia/workspace/Video-R1/src/r1-v/local_scripts/ ./src/r1-v/ \ No newline at end of file diff --git a/move_eval.sh b/move_eval.sh new file mode 100644 index 0000000000000000000000000000000000000000..2d1e7fb8bfb8b09ff70a8984160c71ea7c532574 --- /dev/null +++ b/move_eval.sh @@ -0,0 +1,5 @@ +cp /cq_1/share_1603164/user/zongxia/workspace/Video-R1/src/eval_bench.py ./src/ + +cp /cq_1/share_1603164/user/zongxia/workspace/Video-R1/src/eval_bench.sh ./src/ + +cp 
/cq_1/share_1603164/user/zongxia/workspace/Video-R1/src/eval_bench_4567.sh ./src/ \ No newline at end of file diff --git a/move_result.sh b/move_result.sh new file mode 100644 index 0000000000000000000000000000000000000000..9d17cdab52e77dc727886e474e32f8b8ab50c468 --- /dev/null +++ b/move_result.sh @@ -0,0 +1 @@ +cp -r ./src/r1-v/eval_results/* /cq_1/share_1603164/user/zongxia/workspace/A-EVALUTION/video_eval_results/ \ No newline at end of file diff --git a/previous_version/Video-R1-main-previous/src/distill_r1/README.md b/previous_version/Video-R1-main-previous/src/distill_r1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e76348376692a64504183f60dd58b5f9448aef49 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/distill_r1/README.md @@ -0,0 +1,54 @@ +# R1 Reasoning Dataset Generation + + + +## QA Pairs Generation + +We create a `scene description` by combining the objects (with meta info such as location, depth) using a template. + +We keep the counting-relevant questions and add a `How many items are there in the described scene?` question to count all objects in the scene.
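A minimal sketch of what this templated step might look like (the real logic lives in `generate_scene_qa_pairs.ipynb`; the field and function names below are our own, hypothetical ones):

```python
# Hypothetical sketch of templated scene-description + counting-QA creation.

def describe_scene(objects):
    """Render one line per object from its metadata, CLEVR-style."""
    lines = ["Scene Description:"]
    for o in objects:
        x, y, z = o["coords_3d"]
        lines.append(
            f"A {o['size']} {o['color']} {o['material']} {o['shape']} "
            f"rotated {o['rotation']}° located at 3D coordinates ({x}, {y}, {z})"
        )
    return "\n".join(lines)

def counting_qa(objects):
    """The extra question that counts every object in the scene."""
    return {
        "question": "How many items are there in the described scene?",
        "answer": str(len(objects)),
        "description": describe_scene(objects),
    }

scene = [
    {"size": "large", "color": "red", "material": "rubber",
     "shape": "cylinder", "rotation": 291.3, "coords_3d": (-0.89, -2.73, 0.70)},
    {"size": "small", "color": "purple", "material": "metal",
     "shape": "sphere", "rotation": 247.7, "coords_3d": (2.93, 0.87, 0.35)},
]
print(counting_qa(scene)["answer"])  # prints 2
```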
+ +Example QA pair: + +```python +{'img_filename': 'CLEVR_trainA_048403.png', + 'question': 'How many things are both on the right side of the big yellow rubber thing and left of the purple ball?', + 'answer': '5', + 'description': 'Scene Description:\nA large red rubber cylinder rotated 291.3° located at 3D coordinates (-0.89, -2.73, 0.70) and pixel coordinates (101, 152, 10.04)\nA small purple metal sphere rotated 247.7° located at 3D coordinates (2.93, 0.87, 0.35) and pixel coordinates (379, 183, 9.66)\nA large cyan rubber cylinder rotated 114.5° located at 3D coordinates (-2.40, 2.23, 0.70) and pixel coordinates (246, 82, 13.94)\nA small red metal cylinder rotated 109.9° located at 3D coordinates (-0.95, 1.77, 0.35) and pixel coordinates (270, 113, 12.83)\nA small red rubber cylinder rotated 343.7° located at 3D coordinates (-0.12, -0.74, 0.35) and pixel coordinates (209, 153, 10.82)\nA large red rubber cylinder rotated 324.5° located at 3D coordinates (-2.71, -2.21, 0.70) and pixel coordinates (84, 119, 11.59)\nA small red metal cylinder rotated 1.1° located at 3D coordinates (2.88, -0.12, 0.35) and pixel coordinates (342, 200, 9.12)\nA small gray rubber cube rotated 144.9° located at 3D coordinates (0.79, 0.98, 0.35) and pixel coordinates (299, 145, 11.19)\nA large yellow rubber cube rotated 90.0° located at 3D coordinates (-1.78, -0.31, 0.70) and pixel coordinates (180, 110, 12.05)\n'} +``` + +See `generate_scene_qa_pairs.ipynb` for details. + +## R1 Response Generation + +`query_r1.py`: We get the predictions and reasoning traces of R1 by querying SiliconFlow (You can register via https://cloud.siliconflow.cn/i/rYfeQZ6A so we both get a 20M-token credit!) + +Example Response: + +``` +Okay, let's see. The user is asking how many items are there in the described scene. Let me go through the scene description step by step.
+ +So, the scene description lists each object with details like color, material, shape, rotation, 3D coordinates, and pixel coordinates. Each entry starts with "A" which usually indicates one item each. Let me count each one. + +First entry: "A small green metal cylinder..." That's one. Second: "A small blue rubber cylinder..." Second item. Third: "A small cyan rubber cylinder..." That's three. Fourth: "A large cyan metal sphere..." Four. Fifth: "A large brown metal cube..." Five. Sixth: "A large yellow rubber cube..." Six. Seventh: "A large brown rubber cylinder..." That's seven. + +Wait, did I miss any? Let me check again. The list has entries from "A small green..." up to the seventh one. Each sentence starts with "A", which suggests each is a separate item. No commas separating multiple items in a single entry. Each has different attributes and coordinates, so they must all be distinct. + +So the answer should be 7 items. + + +There are 7 items in the described scene. Each entry corresponds to one distinct object, listed by their properties, coordinates, and rotations. +``` + +In `v1` we did not constrain the output format, and thus the answers were somewhat messy to parse. We then switched to `v2`, explicitly prompting the model to generate the answer with `**The answer is: **` + +## Reasoning Path Filtering + +`filter_r1.py`: We filter for (almost certainly) valid reasoning traces by judging whether the R1 answer is correct (following our previous work [Math-Shepherd](https://arxiv.org/abs/2312.08935)). + +## HF dataset creation + +Finally, we create the dataset using `create_hf_dataset.py` and upload it to the HF dataset hub.
+ 
+
+
diff --git a/previous_version/Video-R1-main-previous/src/distill_r1/filter_r1.py b/previous_version/Video-R1-main-previous/src/distill_r1/filter_r1.py
new file mode 100644
index 0000000000000000000000000000000000000000..844fecac2e68ef6eeaaf7dc2a7d0a4aa0bfba1e7
--- /dev/null
+++ b/previous_version/Video-R1-main-previous/src/distill_r1/filter_r1.py
@@ -0,0 +1,153 @@
+import json
+import re
+from pathlib import Path
+
+
+
+def extract_answer_from_query(query_results: str) -> str | None:
+    """
+    Extract answer from query results, specifically looking for:
+    - Numbers within asterisks
+    - Yes/No answers in various formats
+
+    Args:
+        query_results: String containing the query response
+
+    Returns:
+        Extracted answer string or None if no answer found
+    """
+    # First try to find answers in the standard format with labels
+    # Split the text into segments (trying to get the last conclusion)
+    if "<think>" not in query_results or "</think>" not in query_results:
+        return None
+    segments = query_results.split("\n")
+
+    # First try to find final conclusion in the last few segments
+    conclusion_patterns = [
+        r"(?:so|therefore|thus|hence),?\s*(?:the answer is\s+)?\*\*\s*(no|yes|[0-9]+)\s*\*\*",
+        r"(?:so|therefore|thus|hence),?\s*(?:the answer is\s+)?(no|yes|[0-9]+)\b",
+        r"the answer is\s+\*\*\s*(no|yes|[0-9]+)\s*\*\*",
+        r"(?:final|conclusive) answer(?:\s+is)?\s*\*\*\s*(no|yes|[0-9]+)\s*\*\*",
+    ]
+
+    # Try to find conclusion in last 3 segments
+    for segment in reversed(segments[-3:]):
+        for pattern in conclusion_patterns:
+            match = re.search(pattern, segment, re.IGNORECASE)
+            if match:
+                return match.group(1).strip().lower()
+
+    # If no conclusion found, try other patterns on the full text
+    labeled_patterns = [
+        r"\*\*The answer is:\s*\*\*\s*([0-9]+|yes|no)\b",
+        r"\*\*Answer:\s*\*\*\s*([0-9]+|yes|no)\b",
+        r"\*\*Answer\*\*:\s*([0-9]+|yes|no)\b",
+        r"\*\*Answer:?\s*\*\*\s*There (?:is|are)\s+([0-9]+)",
+        r"\*\*Final Count:\s*\*\*\s*([0-9]+)",
+        r"\*\*Final Count:\s*\*\*\s*([0-9]+)\s+(?:items?|objects?|spheres?|cubes?|boxes?)",
+        r"\*\*Total:\s*\*\*\s*([0-9]+)",
+        r"The answer is:\s*([0-9]+|yes|no)\b",
+        r"Answer:\s*([0-9]+|yes|no)\b",
+        r"should be\s+([0-9]+)[.\s]",
+    ]
+
+    direct_patterns = [
+        r"\*\*\s*([0-9]+)\s*\*\*",
+        r"\*\*\s*([0-9]+)\s+(?:items?|objects?|cubes?|boxes?|spheres?)?\s*\*\*",
+        r"\*\*\s*([0-9]+)\s+[^*]+\*\*",
+    ]
+
+    latex_patterns = [
+        r"\$\\boxed{([0-9]+)}\$",
+        r"\\boxed{([0-9]+)}",
+    ]
+
+    count_patterns = [
+        r"There (?:is|are)\s+([0-9]+)\s+(?:items?|objects?|spheres?|cubes?|boxes?)",
+    ]
+
+    # Try all patterns in sequence on full text
+    all_patterns = labeled_patterns + direct_patterns + latex_patterns + count_patterns
+
+    for pattern in all_patterns:
+        match = re.search(pattern, query_results, re.IGNORECASE)
+        if match:
+            return match.group(1).strip().lower()
+
+    return None
+
+
+def validate_qa_pairs(input_file: str, output_dir: str, verbose: bool = True):
+    """
+    Process QA pairs and save them to separate files.
+    Only saves pairs where parsed answer matches ground truth.
+ + Args: + input_file: Path to input JSONL file + output_dir: Directory to save output files + verbose: If True, print examples of mismatched or unparseable responses + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + valid_pairs = [] + invalid_pairs = [] + stats = {"total": 0, "unparseable": 0, "mismatch": 0, "valid": 0} + + with open(input_file, "r", encoding="utf-8") as f: + for line_num, line in enumerate(f, 1): + stats["total"] += 1 + qa_pair = json.loads(line.strip()) + ground_truth = str(qa_pair.get("a", "")).lower().strip() + parsed_answer = extract_answer_from_query(qa_pair["r1_response"]) + + if parsed_answer is None: + stats["unparseable"] += 1 + qa_pair["error"] = "unparseable" + invalid_pairs.append(qa_pair) + if verbose: + print(f"\nLine {line_num}: Could not parse answer") + print(f"Ground truth: {ground_truth}") + print(f"Query results: {qa_pair['r1_response'][-200:]}...") + elif parsed_answer != ground_truth: + stats["mismatch"] += 1 + qa_pair["error"] = "mismatch" + qa_pair["parsed_answer"] = parsed_answer + invalid_pairs.append(qa_pair) + if verbose: + print(f"\nLine {line_num}: Answer mismatch") + print(f"Ground truth: {ground_truth}") + print(f"Parsed answer: {parsed_answer}") + print(f"Query results: {qa_pair['r1_response'][-200:]}...") + else: + stats["valid"] += 1 + valid_pairs.append(qa_pair) + + # Save valid pairs (where parsed answer matches ground truth) + valid_file = output_dir / "valid_pairs.jsonl" + with open(valid_file, "w", encoding="utf-8") as f: + for pair in valid_pairs: + f.write(json.dumps(pair, ensure_ascii=False) + "\n") + + # Save invalid pairs (unparseable or mismatched) + invalid_file = output_dir / "invalid_pairs.jsonl" + with open(invalid_file, "w", encoding="utf-8") as f: + for pair in invalid_pairs: + f.write(json.dumps(pair, ensure_ascii=False) + "\n") + + # Print statistics + print(f"\nProcessing Summary:") + print(f"Total pairs processed: {stats['total']}") + print(f"Valid 
pairs (matching ground truth): {stats['valid']}") + print(f"Invalid pairs: {stats['unparseable'] + stats['mismatch']}") + print(f" - Unparseable: {stats['unparseable']}") + print(f" - Answer mismatch: {stats['mismatch']}") + print(f"\nOutput files:") + print(f"Valid pairs saved to: {valid_file}") + print(f"Invalid pairs saved to: {invalid_file}") + + +if __name__ == "__main__": + validate_qa_pairs( + "r1_results_clevr_cogent_v1.0_trainA_v2.jsonl", "filter_results_v2" + ) # "filtered_output_tmp_v1.jsonl") diff --git a/previous_version/Video-R1-main-previous/src/distill_r1/prompt.py b/previous_version/Video-R1-main-previous/src/distill_r1/prompt.py new file mode 100644 index 0000000000000000000000000000000000000000..0092d07f7071da29821d3ca86d37eceedabebe06 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/distill_r1/prompt.py @@ -0,0 +1,39 @@ +R1_SYS_PROMPT = """You are DeepSeek-R1, an AI assistant created exclusively by the Chinese Company DeepSeek. You'll provide helpful, harmless, and detailed responses to all user inquiries. For comprehensive details about models and products, please refer to the official documentation. + +Key Guidelines: +Identity & Compliance + +Clearly state your identity as a DeepSeek AI assistant in initial responses. + +Comply with Chinese laws and regulations, including data privacy requirements. 
+ 
+Capability Scope
+
+Handle both Chinese and English queries effectively
+
+Acknowledge limitations for real-time information post knowledge cutoff (2023-12)
+
+Provide technical explanations for AI-related questions when appropriate
+
+Response Quality
+
+Give comprehensive, logically structured answers
+
+Use markdown formatting for clear information organization
+
+Admit uncertainties for ambiguous queries
+
+Ethical Operation
+
+Strictly refuse requests involving illegal activities, violence, or explicit content
+
+Maintain political neutrality according to company guidelines
+
+Protect user privacy and avoid data collection
+
+Specialized Processing
+
+Use <think>...</think> tags for internal reasoning before responding
+
+Employ XML-like tags for structured output when required
+"""
\ No newline at end of file
diff --git a/previous_version/Video-R1-main-previous/src/r1-v/.gitignore b/previous_version/Video-R1-main-previous/src/r1-v/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..5c28ec81a869f992b0db859a957215c1608bfc2a
--- /dev/null
+++ b/previous_version/Video-R1-main-previous/src/r1-v/.gitignore
@@ -0,0 +1,178 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest +*.spec + +# Installer logs +pip-log.txt +pip-delete-this-directory.txt + +# Unit test / coverage reports +htmlcov/ +.tox/ +.nox/ +.coverage +.coverage.* +.cache +nosetests.xml +coverage.xml +*.cover +*.py,cover +.hypothesis/ +.pytest_cache/ +cover/ + +# Translations +*.mo +*.pot + +# Django stuff: +*.log +local_settings.py +db.sqlite3 +db.sqlite3-journal + +# Flask stuff: +instance/ +.webassets-cache + +# Scrapy stuff: +.scrapy + +# Sphinx documentation +docs/_build/ + +# PyBuilder +.pybuilder/ +target/ + +# Jupyter Notebook +.ipynb_checkpoints + +# IPython +profile_default/ +ipython_config.py + +# pyenv +# For a library or package, you might want to ignore these files since the code is +# intended to run in multiple environments; otherwise, check them in: +# .python-version + +# pipenv +# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. +# However, in case of collaboration, if having platform-specific dependencies or dependencies +# having no cross-platform support, pipenv may install dependencies that don't work, or not +# install all needed dependencies. +#Pipfile.lock + +# UV +# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control. +# This is especially recommended for binary packages to ensure reproducibility, and is more +# commonly ignored for libraries. +#uv.lock + +# poetry +# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. +# This is especially recommended for binary packages to ensure reproducibility, and is more +# commonly ignored for libraries. +# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control +#poetry.lock + +# pdm +# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. +#pdm.lock +# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it +# in version control. 
+# https://pdm.fming.dev/latest/usage/project/#working-with-version-control +.pdm.toml +.pdm-python +.pdm-build/ + +# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm +__pypackages__/ + +# Celery stuff +celerybeat-schedule +celerybeat.pid + +# SageMath parsed files +*.sage.py + +# Environments +.env +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# Spyder project settings +.spyderproject +.spyproject + +# Rope project settings +.ropeproject + +# mkdocs documentation +/site + +# mypy +.mypy_cache/ +.dmypy.json +dmypy.json + +# Pyre type checker +.pyre/ + +# pytype static type analyzer +.pytype/ + +# Cython debug symbols +cython_debug/ + +# PyCharm +# JetBrains specific template is maintained in a separate JetBrains.gitignore that can +# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore +# and can be added to the global gitignore or merged into this file. For a more nuclear +# option (not recommended) you can uncomment the following to ignore the entire idea folder. +#.idea/ + +# PyPI configuration file +.pypirc + +# Temp folders +data/ +wandb/ +scripts/ +checkpoints/ +.vscode/ \ No newline at end of file diff --git a/previous_version/Video-R1-main-previous/src/r1-v/LICENSE b/previous_version/Video-R1-main-previous/src/r1-v/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..261eeb9e9f8b2b4b0d119366dda99c6fd7d35c64 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. 
+ + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of 
the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. 
Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/previous_version/Video-R1-main-previous/src/r1-v/Makefile b/previous_version/Video-R1-main-previous/src/r1-v/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..47999e65c24d98abb5fee6f072a43aa9d6b0c101 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/Makefile @@ -0,0 +1,20 @@ +.PHONY: style quality + +# make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!) 
+export PYTHONPATH = src + +check_dirs := src + +style: + black --line-length 119 --target-version py310 $(check_dirs) setup.py + isort $(check_dirs) setup.py + +quality: + black --check --line-length 119 --target-version py310 $(check_dirs) setup.py + isort --check-only $(check_dirs) setup.py + flake8 --max-line-length 119 $(check_dirs) setup.py + + +# Evaluation + +evaluate: diff --git a/previous_version/Video-R1-main-previous/src/r1-v/configs/ddp.yaml b/previous_version/Video-R1-main-previous/src/r1-v/configs/ddp.yaml new file mode 100644 index 0000000000000000000000000000000000000000..4f0557131aa2c1bded4cb4cfdc1cc58a3b25765b --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/configs/ddp.yaml @@ -0,0 +1,16 @@ +compute_environment: LOCAL_MACHINE +debug: false +distributed_type: MULTI_GPU +downcast_bf16: 'no' +gpu_ids: all +machine_rank: 0 +main_training_function: main +mixed_precision: bf16 +num_machines: 1 +num_processes: 8 +rdzv_backend: static +same_network: true +tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false diff --git a/previous_version/Video-R1-main-previous/src/r1-v/configs/qwen2vl_sft_config.yaml b/previous_version/Video-R1-main-previous/src/r1-v/configs/qwen2vl_sft_config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..a83b2b4a574beafaafebf11c463d324dcf54d32f --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/configs/qwen2vl_sft_config.yaml @@ -0,0 +1,37 @@ +# Model arguments +model_name_or_path: Qwen/Qwen2-VL-2B-Instruct +model_revision: main +torch_dtype: bfloat16 + +# Data training arguments +dataset_name: /GEOQA_R1V_Train_8K +dataset_configs: +- all +preprocessing_num_workers: 4 + +# SFT trainer config +bf16: true +do_eval: true +eval_strategy: "no" +gradient_accumulation_steps: 4 +gradient_checkpointing: true +gradient_checkpointing_kwargs: + use_reentrant: false +learning_rate: 2.0e-05 +log_level: info +logging_steps: 5 +logging_strategy: steps 
+lr_scheduler_type: cosine +packing: true +max_seq_length: 4096 +max_steps: -1 +num_train_epochs: 1 +output_dir: ./log/Qwen2-VL-2B-Instruct-SFT +overwrite_output_dir: true +per_device_eval_batch_size: 1 +per_device_train_batch_size: 1 +report_to: +- wandb +save_strategy: "no" +seed: 42 +warmup_ratio: 0.1 \ No newline at end of file diff --git a/previous_version/Video-R1-main-previous/src/r1-v/configs/zero2.yaml b/previous_version/Video-R1-main-previous/src/r1-v/configs/zero2.yaml new file mode 100644 index 0000000000000000000000000000000000000000..92f25e6a85a8de167f023357fade50b978b81acc --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/configs/zero2.yaml @@ -0,0 +1,21 @@ +compute_environment: LOCAL_MACHINE +debug: false +deepspeed_config: + deepspeed_multinode_launcher: standard + offload_optimizer_device: none + offload_param_device: none + zero3_init_flag: false + zero_stage: 2 +distributed_type: DEEPSPEED +downcast_bf16: 'no' +machine_rank: 0 +main_training_function: main +mixed_precision: bf16 +num_machines: 1 +num_processes: 4 +rdzv_backend: static +same_network: true +tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false \ No newline at end of file diff --git a/previous_version/Video-R1-main-previous/src/r1-v/configs/zero3.yaml b/previous_version/Video-R1-main-previous/src/r1-v/configs/zero3.yaml new file mode 100644 index 0000000000000000000000000000000000000000..b5a1201f8a2ee8706b63f0f80c664a1fc61a7d9d --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/configs/zero3.yaml @@ -0,0 +1,22 @@ +compute_environment: LOCAL_MACHINE +debug: false +deepspeed_config: + deepspeed_multinode_launcher: standard + offload_optimizer_device: none + offload_param_device: none + zero3_init_flag: true + zero3_save_16bit_model: true + zero_stage: 3 +distributed_type: DEEPSPEED +downcast_bf16: 'no' +machine_rank: 0 +main_training_function: main +mixed_precision: bf16 +num_machines: 1 +num_processes: 8 +rdzv_backend: static 
+same_network: true +tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false diff --git a/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/create_vision_cot_data.py b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/create_vision_cot_data.py new file mode 100644 index 0000000000000000000000000000000000000000..fec2d7c245b1ddedc615d97a88cf67d6711d3333 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/create_vision_cot_data.py @@ -0,0 +1,153 @@ +import argparse +import base64 +import concurrent.futures +import io +import json +import os +import random +import re +import time +from concurrent.futures import ThreadPoolExecutor +from functools import partial +from io import BytesIO +from typing import Dict, List + +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +from datasets import Dataset, concatenate_datasets, load_dataset, load_from_disk +from tqdm import tqdm + +import bytedtos +import seaborn as sns +import yaml +from openai import AzureOpenAI +from PIL import Image +from pillow_avif import AvifImagePlugin + + +PROMPT_FORMAT = """I will provide you with an image, an original question, and its answer related to the image. Your task is to rewrite the question in such a way that answering it requires step-by-step Chain-of-Thought (CoT) reasoning with numerical or mathematical expressions where applicable. The reasoning process can include expressions like "let me think," "oh, I see," or other natural language thought expressions. + +Please make sure your question is to ask for a certain answer with a certain value, do not ask for open-ended answer, and the answer is correct and easy to verify via simple protocol, like "2" or "A". + +Please strictly do not include "Answer:" in the question part to avoid confusion and leakage. 
+ +Input Format: +Original Question: {original_question} +Original Answer: {original_answer} + +Output Format: +Question: [rewrite the question if necessary] +Answer: [answer with reasoning steps, including calculations where applicable] +step-by-step reasoning process +easy to verify answer +""" + + +def get_image_data_url(image_input): + if isinstance(image_input, str) and image_input.startswith("data:"): + return image_input + + if isinstance(image_input, str) and image_input.startswith("http"): + image_input = load_image(image_input) + + if isinstance(image_input, str): + image_input = Image.open(image_input) + + if not isinstance(image_input, Image.Image): + raise ValueError("Unsupported image input type") + + if image_input.mode != "RGB": + image_input = image_input.convert("RGB") + + buffer = BytesIO() + image_input.save(buffer, format="JPEG") + img_bytes = buffer.getvalue() + base64_data = base64.b64encode(img_bytes).decode("utf-8") + return f"data:image/jpeg;base64,{base64_data}" + + +def gpt4o_query(image, prompt, max_retries=5, initial_delay=3): + if image is None: + return None + + data_url_list = [get_image_data_url(image)] + client = AzureOpenAI( + azure_endpoint="YOUR_AZURE_ENDPOINT", + api_version="2023-07-01-preview", + api_key="YOUR_API_KEY", + ) + + for attempt in range(max_retries): + try: + messages = [ + { + "role": "system", + "content": "You are an expert to analyze the image and provide useful information for users.", + }, + { + "role": "user", + "content": [ + {"type": "text", "text": prompt}, + ], + }, + ] + + for data_url in data_url_list: + messages[1]["content"].insert( + 0, {"type": "image_url", "image_url": {"url": data_url}} + ) + + response = client.chat.completions.create( + model="gpt-4o-2024-08-06", + messages=messages, + temperature=0.2, + max_tokens=8192, + ) + return response.choices[0].message.content + + except Exception as e: + if attempt == max_retries - 1: + raise Exception( + f"Failed after {max_retries} attempts. 
Last error: {str(e)}" + ) + delay = initial_delay * (2**attempt) + random.uniform( + 0, 0.1 * initial_delay * (2**attempt) + ) + time.sleep(delay) + + +def process_single_item(example): + try: + image_path = example["image_path"] + formatted_prompt = PROMPT_FORMAT.format( + original_question=example["question"], original_answer=example["answer"] + ) + + response = gpt4o_query(image_path, formatted_prompt) + example["gpt4o_response"] = response + return example + except Exception as e: + print(f"Error processing item: {str(e)}") + example["gpt4o_response"] = None + return example + + +def main(): + dataset_path = "path/to/your/dataset" + full_dataset = load_from_disk(dataset_path) + + processed_dataset = full_dataset.map( + function=partial(process_single_item), + num_proc=256, + desc="Processing dataset with GPT-4o", + keep_in_memory=True, + ) + + output_path = f"{dataset_path}_processed" + processed_dataset.save_to_disk(output_path) + print(f"Processed dataset saved to: {output_path}") + + +if __name__ == "__main__": + main() diff --git a/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/lmms_eval_qwen2vl.sh b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/lmms_eval_qwen2vl.sh new file mode 100644 index 0000000000000000000000000000000000000000..6d38769aa91029d63880a5dfc6f9cf64bb36c31a --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/lmms_eval_qwen2vl.sh @@ -0,0 +1,61 @@ +export HF_HOME="" +export HF_TOKEN="" +export HF_HUB_ENABLE_HF_TRANSFER="1" + +export API_TYPE="" +export AZURE_ENDPOINT="" +export AZURE_API_KEY="" +export API_VERSION="" +export MODEL_VERSION="" +export NAVIT_ATTENTION_IMPLEMENTATION="eager" + +# Prompt for installation with 3-second timeout +read -t 3 -p "Do you want to install dependencies? (YES/no, timeout in 3s): " install_deps || true +if [ "$install_deps" = "YES" ]; then + # Prepare the environment + pip3 install --upgrade pip + pip3 install -U setuptools + + cd + if [ ! 
-d "maas_engine" ]; then + git clone + else + echo "maas_engine directory already exists, skipping clone" + fi + cd maas_engine + git pull + git checkout + pip3 install --no-cache-dir --no-build-isolation -e ".[standalone]" + + current_version=$(pip3 show transformers | grep Version | cut -d' ' -f2) + if [ "$current_version" != "4.46.2" ]; then + echo "Installing transformers 4.46.2 (current version: $current_version)" + pip3 install transformers==4.46.2 + else + echo "transformers 4.46.2 is already installed" + fi + + cd + rm -rf + pip3 install -e . + pip3 install -U pydantic + pip3 install Levenshtein + pip3 install nltk + python3 -c "import nltk; nltk.download('wordnet', quiet=True); nltk.download('punkt', quiet=True)" +fi + +TASKS=mmmu_val,mathvista_testmini,mmmu_pro +MODEL_BASENAME=qwen2_vl + +model_checkpoint="" +echo "MODEL_BASENAME: ${MODEL_BASENAME}" +cd + +python3 -m accelerate.commands.launch --num_processes=8 --main_process_port=12345 lmms_eval \ + --model qwen2_vl \ + --model_args=pretrained=${model_checkpoint},max_pixels=2359296 \ + --tasks ${TASKS} \ + --batch_size 1 \ + --log_samples \ + --log_samples_suffix ${MODEL_BASENAME} \ + --output_path ./logs \ No newline at end of file diff --git a/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/prepare_hf_data.py b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/prepare_hf_data.py new file mode 100644 index 0000000000000000000000000000000000000000..62eab9e0fbba24ce354a10846fb8404abde9feaa --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/prepare_hf_data.py @@ -0,0 +1,166 @@ +import matplotlib.pyplot as plt +import seaborn as sns +import pandas as pd +import random +from typing import List, Dict +import numpy as np +from concurrent.futures import ThreadPoolExecutor +from tqdm import tqdm +import datasets + +import io +from datasets import load_dataset, load_from_disk, concatenate_datasets +from PIL import Image +from tqdm import tqdm +from 
functools import partial +from pillow_avif import AvifImagePlugin +from datasets import Dataset +import json +import yaml +import os +import re +import time +import random +import base64 +from openai import AzureOpenAI +import concurrent.futures +from typing import List, Dict +import argparse +import time + + +def extract_problem_solution(gpt4o_response): + # Split the response into parts + parts = gpt4o_response.split("<think>") + + # Extract the problem (first part before any <think> tags) + problem = parts[0].strip() + # Remove "Question:" prefix if it exists + problem = re.sub(r"^Question:\s*", "", problem) + # Remove "Answer:" at the end of the problem + problem = re.sub(r"\s*Answer:\s*$", "", problem).strip() + + # Combine all the reasoning steps into a single <think> block + think_parts = [p.split("</think>")[0].strip() for p in parts[1:] if "</think>" in p] + solution = f"<think>{' '.join(think_parts)}</think>" + + # Add the final answer if it exists, removing "Answer:" prefix + if "<answer>" in gpt4o_response: + final_answer = ( + gpt4o_response.split("<answer>")[-1].split("</answer>")[0].strip() + ) + final_answer = re.sub(r"^Answer:\s*", "", final_answer) + solution += f"\n\n<answer>{final_answer}</answer>" + + return problem, solution + + +def load_image_from_path(image_path): + try: + img = Image.open(image_path) + return img + except Exception as e: + print(f"Error loading image {image_path}: {str(e)}") + return None + + +def process_raw_data(raw_data): + # Parse the raw data if it's a string + if isinstance(raw_data, str): + data = json.loads(raw_data) + else: + data = raw_data + + # Extract problem and solution + try: + problem, solution = extract_problem_solution(data["gpt4o_response"]) + image = load_image_from_path(data["image_path"]) + + return { + "image": image, + "problem": problem, + "solution": solution, + "original_question": data["question"], + "original_answer": data["answer"], + } + except Exception as e: + print(f"Error processing data {data}: {str(e)}") + return { + "image": None, + "problem": None, + "solution": None,
"original_question": None, + "original_answer": None, + } + + +raw_data_list = [ + "/path/to/reasoning_data_with_response_90k_verified", +] + +raw_data = concatenate_datasets([load_from_disk(path) for path in raw_data_list]) + +processed_data = raw_data.map(process_raw_data, num_proc=128).shuffle(seed=42) + +hf_dict = { + "image": [], + "problem": [], + "solution": [], + "original_question": [], + "original_answer": [], +} + +for item in tqdm(processed_data): + hf_dict["image"].append(item["image"]) + hf_dict["problem"].append(item["problem"]) + hf_dict["solution"].append(item["solution"]) + hf_dict["original_question"].append(item["original_question"]) + hf_dict["original_answer"].append(item["original_answer"]) + + +features = datasets.Features( + { + "image": datasets.Image(), + "problem": datasets.Value("string"), + "solution": datasets.Value("string"), + "original_question": datasets.Value("string"), + "original_answer": datasets.Value("string"), + } +) + + +def has_empty_tags(text): + # Pattern to match empty tag pairs like <answer></answer> + pattern = r"<(\w+)>\s*</\1>" + return bool(re.search(pattern, text)) + + +def has_answer_pattern(text): + if "Answer:" in text: + return True + return False + + +def has_valid_image_size(example): # for Qwen2-VL-2B's processor requirement + # Assuming the image is in a format that can be checked for dimensions + # You might need to adjust this depending on how the image is stored in your dataset + try: + image = example["image"] # or however your image is accessed + if isinstance(image, dict) and "height" in image and "width" in image: + return image["height"] >= 28 and image["width"] >= 28 + # If image is a PIL Image or similar + return image.height >= 28 and image.width >= 28 + except Exception: + return False + + +ds = datasets.Dataset.from_dict(hf_dict, features=features) +ds = ds.filter( + lambda x: not has_empty_tags(x["solution"]) + and not has_answer_pattern(x["problem"]) + and has_valid_image_size(x) + and x["image"] is not None, + num_proc=128, +)
+# Push to Hugging Face Hub +ds.push_to_hub("path/to/your/dataset") diff --git a/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/train_aria_moe.sh b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/train_aria_moe.sh new file mode 100644 index 0000000000000000000000000000000000000000..5a3b6966c4a40ff4760e4d1cb0d7518448c30fae --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/train_aria_moe.sh @@ -0,0 +1,68 @@ +#!/bin/bash + +export NCCL_BLOCKING_WAIT=0 +export TOKENIZERS_PARALLELISM=false +export OMP_NUM_THREADS=8 +export NCCL_IB_DISABLE=0 +export NCCL_IB_GID_INDEX=3 +export NCCL_SOCKET_IFNAME=eth0 +export NCCL_DEBUG=INFO + +# CONFIG Huggingface +# export HF_TOKEN="" +export HF_TOKEN="" +export HF_HOME="$HOME/.cache/huggingface" +export HF_HUB_ENABLE_HF_TRANSFER="1" + +export NCCL_DEBUG=INFO + +GPUS="0,1,2,3,4,5,6,7" + +# Take the first port of worker0 +ports=($(echo $METIS_WORKER_0_PORT | tr ',' ' ')) +port=${ports[0]} +port_in_cmd="$(echo "${METIS_WORKER_0_PORT:-2000}" | awk -F',' '{print $1}')" + +echo "total workers: ${ARNOLD_WORKER_NUM}" +echo "cur worker id: ${ARNOLD_ID}" +echo "gpus per worker: ${ARNOLD_WORKER_GPU}" +echo "master ip: ${METIS_WORKER_0_HOST}" +echo "master port: ${port}" +echo "master port in cmd: ${port_in_cmd}" + +# export WANDB_BASE_URL=https://api.wandb.ai +# export WANDB_API_KEY="" +# wandb login $WANDB_API_KEY + +export WANDB_BASE_URL=https://api.wandb.ai +export WANDB_PROJECT=vision-reasoning +export WANDB_API_KEY="" +export WANDB_RUN_NAME=Qwen-VL-2B-GRPO-$(date +%Y-%m-%d-%H-%M-%S) +wandb login $WANDB_API_KEY + +cd /home/tiger/multimodal-open-r1 +# pip3 install vllm==0.6.6.post1 +pip3 install -e ".[dev]" +pip3 install wandb==0.18.3 + +torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" \ + --nnodes="${ARNOLD_WORKER_NUM}" \ + --node_rank="${ARNOLD_ID}" \ + --master_addr="${METIS_WORKER_0_HOST}" \ + --master_port="${port_in_cmd}" \ + src/open_r1/grpo.py \ + --deepspeed scripts/zero3.json \ + 
--output_dir Aria-GRPO-mini_cot_80k \ + --model_name_or_path rhymes-ai/Aria \ + --dataset_name luodian/mini_cot_80k \ + --max_prompt_length 8192 \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 1 \ + --logging_steps 1 \ + --bf16 \ + --report_to wandb \ + --gradient_checkpointing true \ + --attn_implementation eager \ + --save_total_limit 8 \ + --num_train_epochs 1 \ + --run_name $WANDB_RUN_NAME diff --git a/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/train_qwen2_vl.sh b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/train_qwen2_vl.sh new file mode 100644 index 0000000000000000000000000000000000000000..137310e4438c645bfb6f89f254c50164f23f5a9d --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/train_qwen2_vl.sh @@ -0,0 +1,61 @@ +#!/bin/bash + +export NCCL_BLOCKING_WAIT=0 +export TOKENIZERS_PARALLELISM=false +export OMP_NUM_THREADS=8 +export NCCL_IB_DISABLE=0 +export NCCL_IB_GID_INDEX=3 +export NCCL_SOCKET_IFNAME=eth0 +export NCCL_DEBUG=INFO + +GPUS="0,1,2,3,4,5,6,7" + +# Take the first port of worker0 +ports=($(echo $METIS_WORKER_0_PORT | tr ',' ' ')) +port=${ports[0]} +port_in_cmd="$(echo "${METIS_WORKER_0_PORT:-2000}" | awk -F',' '{print $1}')" + +echo "total workers: ${ARNOLD_WORKER_NUM}" +echo "cur worker id: ${ARNOLD_ID}" +echo "gpus per worker: ${ARNOLD_WORKER_GPU}" +echo "master ip: ${METIS_WORKER_0_HOST}" +echo "master port: ${port}" +echo "master port in cmd: ${port_in_cmd}" + +# export WANDB_BASE_URL=https://api.wandb.ai +# export WANDB_API_KEY="" +# wandb login $WANDB_API_KEY + +export WANDB_BASE_URL=https://api.wandb.ai +export WANDB_PROJECT=vision-reasoning +export WANDB_API_KEY="" +export WANDB_RUN_NAME=Qwen-VL-2B-GRPO-$(date +%Y-%m-%d-%H-%M-%S) +wandb login $WANDB_API_KEY + +cd /home/tiger/multimodal-open-r1 +# pip3 install vllm==0.6.6.post1 +pip3 install -e ".[dev]" +pip3 install wandb==0.18.3 + +torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" \ + 
--nnodes="${ARNOLD_WORKER_NUM}" \ + --node_rank="${ARNOLD_ID}" \ + --master_addr="${METIS_WORKER_0_HOST}" \ + --master_port="${port_in_cmd}" \ + src/open_r1/grpo.py \ + --deepspeed scripts/zero3.json \ + --output_dir checkpoints/${WANDB_RUN_NAME} \ + --model_name_or_path Qwen/Qwen2-VL-2B-Instruct \ + --dataset_name luodian/${DATASET_NAME} \ + --max_prompt_length 8192 \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 1 \ + --logging_steps 1 \ + --bf16 \ + --report_to wandb \ + --gradient_checkpointing true \ + --attn_implementation flash_attention_2 \ + --max_pixels 2359296 \ + --save_total_limit 8 \ + --num_train_epochs 1 \ + --run_name $WANDB_RUN_NAME diff --git a/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero2.json b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero2.json new file mode 100644 index 0000000000000000000000000000000000000000..b5ba7ebea0f236230a5a41d72ec23ae1f64130d6 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero2.json @@ -0,0 +1,41 @@ +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "bf16": { + "enabled": "auto" + }, + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "none", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "overlap_comm": false, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "contiguous_gradients": true + }, + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 100, + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} \ No newline at end of file diff --git 
a/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero3.json b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero3.json new file mode 100644 index 0000000000000000000000000000000000000000..02d343165ec0eec3af55d3285f45911769af6109 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero3.json @@ -0,0 +1,41 @@ +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "bf16": { + "enabled": "auto" + }, + + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "none", + "pin_memory": true + }, + "offload_param": { + "device": "none", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 100, + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} \ No newline at end of file diff --git a/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero3.yaml b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero3.yaml new file mode 100644 index 0000000000000000000000000000000000000000..b5a1201f8a2ee8706b63f0f80c664a1fc61a7d9d --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero3.yaml @@ -0,0 +1,22 @@ +compute_environment: LOCAL_MACHINE +debug: false +deepspeed_config: + deepspeed_multinode_launcher: standard + offload_optimizer_device: none + offload_param_device: none + zero3_init_flag: true + zero3_save_16bit_model: true + zero_stage: 3 +distributed_type: DEEPSPEED +downcast_bf16: 'no' 
+machine_rank: 0 +main_training_function: main +mixed_precision: bf16 +num_machines: 1 +num_processes: 8 +rdzv_backend: static +same_network: true +tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false diff --git a/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero3_offload.json b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero3_offload.json new file mode 100644 index 0000000000000000000000000000000000000000..9da12de56b44374047644fe77607a85ced885e7c --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/local_scripts/zero3_offload.json @@ -0,0 +1,48 @@ +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + "bf16": { + "enabled": "auto" + }, + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "gather_16bit_weights_on_model_save": true + }, + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "steps_per_print": 1e5, + "wall_clock_breakdown": false +} \ No newline at end of file diff --git a/previous_version/Video-R1-main-previous/src/r1-v/run_grpo.sh b/previous_version/Video-R1-main-previous/src/r1-v/run_grpo.sh new file mode 100644 index 0000000000000000000000000000000000000000..4c5b21e3a4cdd9f5d2da2f9ae8f7299ee224cf60 --- /dev/null +++ 
b/previous_version/Video-R1-main-previous/src/r1-v/run_grpo.sh @@ -0,0 +1,29 @@ +cd src/r1-v + +export DEBUG_MODE="true" +export LOG_PATH="./debug_log_2b.txt" + + + +torchrun --nproc_per_node="8" \ + --nnodes="1" \ + --node_rank="0" \ + --master_addr="127.0.0.1" \ + --master_port="12345" \ + src/open_r1/grpo.py \ + --output_dir \ + --model_name_or_path \ + --dataset_name \ + --max_prompt_length 1024 \ + --per_device_train_batch_size 1 \ + --gradient_accumulation_steps 2 \ + --logging_steps 1 \ + --bf16 \ + --report_to wandb \ + --gradient_checkpointing false \ + --attn_implementation flash_attention_2 \ + --max_pixels 401408 \ + --num_train_epochs 2 \ + --run_name Qwen2-VL-2B-GRPO-CLEVR-70k \ + --save_steps 100 \ + --save_only_model true \ No newline at end of file diff --git a/previous_version/Video-R1-main-previous/src/r1-v/setup.cfg b/previous_version/Video-R1-main-previous/src/r1-v/setup.cfg new file mode 100644 index 0000000000000000000000000000000000000000..5fa1d655611f7509de9130ac8dd482fc4b4f2dae --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/setup.cfg @@ -0,0 +1,41 @@ +[isort] +default_section = FIRSTPARTY +ensure_newline_before_comments = True +force_grid_wrap = 0 +include_trailing_comma = True +known_first_party = open_r1 +known_third_party = + transformers + datasets + fugashi + git + h5py + matplotlib + nltk + numpy + packaging + pandas + psutil + pytest + rouge_score + sacrebleu + seqeval + sklearn + streamlit + torch + tqdm + +line_length = 119 +lines_after_imports = 2 +multi_line_output = 3 +use_parentheses = True + +[flake8] +ignore = E203, E501, E741, W503, W605 +max-line-length = 119 +per-file-ignores = + # imported but unused + __init__.py: F401 + +[tool:pytest] +doctest_optionflags=NUMBER NORMALIZE_WHITESPACE ELLIPSIS \ No newline at end of file diff --git a/previous_version/Video-R1-main-previous/src/r1-v/setup.py b/previous_version/Video-R1-main-previous/src/r1-v/setup.py new file mode 100644 index 
0000000000000000000000000000000000000000..a847d9eb150a35de5785388a9065afe8f854ca05 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/setup.py @@ -0,0 +1,132 @@ +# Copyright 2025 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Adapted from huggingface/transformers: https://github.com/huggingface/transformers/blob/21a2d900eceeded7be9edc445b56877b95eda4ca/setup.py + + +import re +import shutil +from pathlib import Path + +from setuptools import find_packages, setup + + +# Remove stale open_r1.egg-info directory to avoid https://github.com/pypa/pip/issues/5466 +stale_egg_info = Path(__file__).parent / "open_r1.egg-info" +if stale_egg_info.exists(): + print( + ( + "Warning: {} exists.\n\n" + "If you recently updated open_r1, this is expected,\n" + "but it may prevent open_r1 from installing in editable mode.\n\n" + "This directory is automatically generated by Python's packaging tools.\n" + "I will remove it now.\n\n" + "See https://github.com/pypa/pip/issues/5466 for details.\n" + ).format(stale_egg_info) + ) + shutil.rmtree(stale_egg_info) + + +# IMPORTANT: all dependencies should be listed here with their version requirements, if any. +# * If a dependency is fast-moving (e.g. 
transformers), pin to the exact version +_deps = [ + "accelerate>=1.2.1", + "bitsandbytes>=0.43.0", + "black>=24.4.2", + "datasets>=3.2.0", + "deepspeed==0.15.4", + "distilabel[vllm,ray,openai]>=1.5.2", + "einops>=0.8.0", + "flake8>=6.0.0", + "hf_transfer>=0.1.4", + "huggingface-hub[cli]>=0.19.2,<1.0", + "isort>=5.12.0", + "liger_kernel==0.5.2", + "lighteval @ git+https://github.com/huggingface/lighteval.git@4f381b352c0e467b5870a97d41cb66b487a2c503#egg=lighteval[math]", + "math-verify", # Used for math verification in grpo + "packaging>=23.0", + "parameterized>=0.9.0", + "pytest", + "safetensors>=0.3.3", + "sentencepiece>=0.1.99", + "torch>=2.5.1", + "transformers @ git+https://github.com/huggingface/transformers.git@336dc69d63d56f232a183a3e7f52790429b871ef", + "trl==0.14.0", + "vllm==0.6.6.post1", + "wandb>=0.19.1", + "pillow", +] + +# this is a lookup table with items like: +# +# tokenizers: "tokenizers==0.9.4" +# packaging: "packaging" +# +# some of the values are versioned whereas others aren't. 
+deps = {b: a for a, b in (re.findall(r"^(([^!=<>~ \[\]]+)(?:\[[^\]]+\])?(?:[!=<>~ ].*)?$)", x)[0] for x in _deps)} + + +def deps_list(*pkgs): + return [deps[pkg] for pkg in pkgs] + + +extras = {} +extras["tests"] = deps_list("pytest", "parameterized") +extras["torch"] = deps_list("torch") +extras["quality"] = deps_list("black", "isort", "flake8") +extras["eval"] = deps_list("lighteval", "math-verify") +extras["dev"] = extras["quality"] + extras["tests"] + extras["eval"] + +# core dependencies shared across the whole project - keep this to a bare minimum :) +install_requires = [ + deps["accelerate"], + deps["bitsandbytes"], + deps["einops"], + deps["datasets"], + deps["deepspeed"], + deps["hf_transfer"], + deps["huggingface-hub"], + deps["liger_kernel"], + deps["packaging"], # utilities from PyPA to e.g., compare versions + deps["safetensors"], + deps["sentencepiece"], + deps["transformers"], + deps["trl"], +] + +setup( + name="r1-v", + version="0.1.0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots) + author="The r1-v team and the Hugging Face team (past and future)", + description="R1-V", + license="Apache", + url="https://github.com/Deep-Agent/R1-V", + package_dir={"": "src"}, + packages=find_packages("src"), + zip_safe=False, + extras_require=extras, + python_requires=">=3.10.9", + install_requires=install_requires, + classifiers=[ + "Development Status :: 3 - Alpha", + "Intended Audience :: Developers", + "Intended Audience :: Education", + "Intended Audience :: Science/Research", + "License :: OSI Approved :: Apache Software License", + "Operating System :: OS Independent", + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.10", + "Topic :: Scientific/Engineering :: Artificial Intelligence", + ], +) diff --git a/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/__init__.py b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/__init__.py new file mode 100644 index 
0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/evaluate.py b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..ef3089fff4ecc4753b10b585fe172a2c93af4d9d --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/evaluate.py @@ -0,0 +1,85 @@ +# Copyright 2025 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Custom evaluation tasks for LightEval.""" + +from lighteval.metrics.dynamic_metrics import ( + ExprExtractionConfig, + LatexExtractionConfig, + multilingual_extractive_match_metric, +) +from lighteval.tasks.lighteval_task import LightevalTaskConfig +from lighteval.tasks.requests import Doc +from lighteval.utils.language import Language + + +metric = multilingual_extractive_match_metric( + language=Language.ENGLISH, + fallback_mode="first_match", + precision=5, + gold_extraction_target=(LatexExtractionConfig(),), + pred_extraction_target=(ExprExtractionConfig(), LatexExtractionConfig()), + aggregation_function=max, +) + + +def prompt_fn(line, task_name: str = None): + """Assumes the model is either prompted to emit \\boxed{answer} or does so automatically""" + return Doc( + task_name=task_name, + query=line["problem"], + choices=[line["solution"]], + gold_index=0, + ) + + +# Define tasks +aime24 = LightevalTaskConfig( + name="aime24", + suite=["custom"], + prompt_function=prompt_fn, + hf_repo="HuggingFaceH4/aime_2024", + hf_subset="default", + hf_avail_splits=["train"], + evaluation_splits=["train"], + few_shots_split=None, + few_shots_select=None, + generation_size=32768, + metric=[metric], + version=1, +) +math_500 = LightevalTaskConfig( + name="math_500", + suite=["custom"], + prompt_function=prompt_fn, + hf_repo="HuggingFaceH4/MATH-500", + hf_subset="default", + hf_avail_splits=["test"], + evaluation_splits=["test"], + few_shots_split=None, + few_shots_select=None, + generation_size=32768, + metric=[metric], + version=1, +) + +# Add tasks to the table +TASKS_TABLE = [] +TASKS_TABLE.append(aime24) +TASKS_TABLE.append(math_500) + +# MODULE LOGIC +if __name__ == "__main__": + print([t["name"] for t in TASKS_TABLE]) + print(len(TASKS_TABLE)) diff --git a/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/generate.py b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/generate.py new file mode 100644 index 
0000000000000000000000000000000000000000..740621018693a72edfc738ef44291a9d39c18132 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/generate.py @@ -0,0 +1,156 @@ +# Copyright 2025 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Optional + +from distilabel.llms import OpenAILLM +from distilabel.pipeline import Pipeline +from distilabel.steps.tasks import TextGeneration + + +def build_distilabel_pipeline( + model: str, + base_url: str = "http://localhost:8000/v1", + prompt_column: Optional[str] = None, + temperature: Optional[float] = None, + top_p: Optional[float] = None, + max_new_tokens: int = 8192, + num_generations: int = 1, +) -> Pipeline: + generation_kwargs = {"max_new_tokens": max_new_tokens} + + if temperature is not None: + generation_kwargs["temperature"] = temperature + + if top_p is not None: + generation_kwargs["top_p"] = top_p + + with Pipeline().ray() as pipeline: + TextGeneration( + llm=OpenAILLM( + base_url=base_url, + api_key="something", + model=model, + # thinking can take some time... 
+ timeout=10 * 60, + generation_kwargs=generation_kwargs, + ), + input_mappings={"instruction": prompt_column} if prompt_column is not None else {}, + input_batch_size=64, # on 4 nodes bs ~60+ leads to preemption due to KV cache exhaustion + num_generations=num_generations, + ) + + return pipeline + + +if __name__ == "__main__": + import argparse + + from datasets import load_dataset + + parser = argparse.ArgumentParser(description="Run distilabel pipeline for generating responses with DeepSeek R1") + parser.add_argument( + "--hf-dataset", + type=str, + required=True, + help="HuggingFace dataset to load", + ) + parser.add_argument( + "--hf-dataset-config", + type=str, + required=False, + help="Dataset config to use", + ) + parser.add_argument( + "--hf-dataset-split", + type=str, + default="train", + help="Dataset split to use", + ) + parser.add_argument("--prompt-column", type=str, default="prompt") + parser.add_argument( + "--model", + type=str, + required=True, + help="Model name to use for generation", + ) + parser.add_argument( + "--vllm-server-url", + type=str, + default="http://localhost:8000/v1", + help="URL of the vLLM server", + ) + parser.add_argument( + "--temperature", + type=float, + help="Temperature for generation", + ) + parser.add_argument( + "--top-p", + type=float, + help="Top-p value for generation", + ) + parser.add_argument( + "--max-new-tokens", + type=int, + default=8192, + help="Maximum number of new tokens to generate", + ) + parser.add_argument( + "--num-generations", + type=int, + default=1, + help="Number of generations per problem", + ) + parser.add_argument( + "--hf-output-dataset", + type=str, + required=False, + help="HuggingFace repo to push results to", + ) + parser.add_argument( + "--private", + action="store_true", + help="Whether to make the output dataset private when pushing to HF Hub", + ) + + args = parser.parse_args() + + print("\nRunning with arguments:") + for arg, value in vars(args).items(): + print(f" {arg}: {value}") 
+ print() + + print(f"Loading '{args.hf_dataset}' (config: {args.hf_dataset_config}, split: {args.hf_dataset_split}) dataset...") + dataset = load_dataset(args.hf_dataset, split=args.hf_dataset_split) + print("Dataset loaded!") + + pipeline = build_distilabel_pipeline( + model=args.model, + base_url=args.vllm_server_url, + prompt_column=args.prompt_column, + temperature=args.temperature, + top_p=args.top_p, + max_new_tokens=args.max_new_tokens, + num_generations=args.num_generations, + ) + + print("Running generation pipeline...") + distiset = pipeline.run(dataset=dataset, use_cache=False) + print("Generation pipeline finished!") + + if args.hf_output_dataset: + print(f"Pushing resulting dataset to '{args.hf_output_dataset}'...") + distiset.push_to_hub(args.hf_output_dataset, private=args.private) + print("Dataset pushed!") diff --git a/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/grpo.py b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/grpo.py new file mode 100644 index 0000000000000000000000000000000000000000..40da936bcf95ae8057948a1f96c22037a26a7548 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/grpo.py @@ -0,0 +1,229 @@ +# Copyright 2025 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import re +from datetime import datetime +from dataclasses import dataclass, field +from typing import Optional + +from datasets import load_dataset, load_from_disk +from transformers import Qwen2VLForConditionalGeneration + +from math_verify import parse, verify +from trainer import Qwen2VLGRPOTrainer, Qwen2VLGRPOVLLMTrainer +from trl import GRPOConfig, GRPOTrainer, ModelConfig, ScriptArguments, TrlParser, get_peft_config + +from datasets import Dataset, DatasetDict + + +@dataclass +class GRPOScriptArguments(ScriptArguments): + """ + Script arguments for the GRPO training script. + + Args: + reward_funcs (`list[str]`): + List of reward functions. Possible values: 'accuracy', 'format'. + """ + + reward_funcs: list[str] = field( + default_factory=lambda: ["accuracy", "format"], + metadata={"help": "List of reward functions. Possible values: 'accuracy', 'format'"}, + ) + max_pixels: Optional[int] = field( + default=12845056, + metadata={"help": "Maximum number of pixels for the image"}, + ) + min_pixels: Optional[int] = field( + default=3136, + metadata={"help": "Minimum number of pixels for the image"}, + ) + + +def accuracy_reward(completions, solution, **kwargs): + """Reward function that checks if the completion is correct using either symbolic verification or exact string matching.""" + contents = [completion[0]["content"] for completion in completions] + rewards = [] + current_time = datetime.now().strftime("%d-%H-%M-%S-%f") + for content, sol in zip(contents, solution): + reward = 0.0 + # Try symbolic verification first + try: + answer = parse(content) + if float(verify(answer, parse(sol))) > 0: + reward = 1.0 + except Exception: + pass # Continue to next verification method if this fails + + # If symbolic verification failed, try string matching + if reward == 0.0: + try: + # Extract answer from solution if it has think/answer tags + sol_match = re.search(r'<answer>(.*?)</answer>', sol) + ground_truth = sol_match.group(1).strip() if sol_match else sol.strip() + 
+ # Extract answer from content if it has think/answer tags + content_match = re.search(r'<answer>(.*?)</answer>', content) + student_answer = content_match.group(1).strip() if content_match else content.strip() + + # Compare the extracted answers + if student_answer == ground_truth: + reward = 1.0 + except Exception: + pass # Keep reward as 0.0 if both methods fail + + rewards.append(reward) + if os.getenv("DEBUG_MODE") == "true": + log_path = os.getenv("LOG_PATH") + # local_rank = int(os.getenv("LOCAL_RANK", 0)) + with open(log_path, "a") as f: + f.write(f"------------- {current_time} Accuracy reward: {reward} -------------\n") + f.write(f"Content: {content}\n") + f.write(f"Solution: {sol}\n") + return rewards + + +def format_reward(completions, **kwargs): + """Reward function that checks if the completion has a specific format.""" + pattern = r"<think>.*?</think>\s*<answer>.*?</answer>" + completion_contents = [completion[0]["content"] for completion in completions] + matches = [re.fullmatch(pattern, content, re.DOTALL) for content in completion_contents] + return [1.0 if match else 0.0 for match in matches] + + +reward_funcs_registry = { + "accuracy": accuracy_reward, + "format": format_reward, +} + +SYSTEM_PROMPT = ( + "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant " + "first thinks about the reasoning process in the mind and then provides the user with the answer. 
The reasoning "
+    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
+    "<think> reasoning process here </think><answer> answer here </answer>"
+)
+
+
+def main(script_args, training_args, model_args):
+    # Get reward functions
+    reward_funcs = [reward_funcs_registry[func] for func in script_args.reward_funcs]
+
+    if script_args.dataset_name[-6:] == '.jsonl':
+        dataset = DatasetDict({"train": Dataset.from_json(script_args.dataset_name)})
+    else:
+        # Load the dataset
+        dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)
+
+
+    # Format into conversation
+    def make_conversation(example):
+        return {
+            "prompt": [
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": example["problem"]},
+            ],
+        }
+
+    # def make_conversation_image(example):
+    #     return {
+    #         "prompt": [
+    #             {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
+    #             {
+    #                 "role": "user",
+    #                 "content": [
+    #                     {"type": "image"},
+    #                     {"type": "text", "text": example["problem"]},
+    #                 ],
+    #             },
+    #         ],
+    #     }
+
+    QUESTION_TEMPLATE = "{Question} Output the thinking process in <think> </think> and final answer (number) in <answer> </answer> tags."
+ + def make_conversation_image(example): + + return { + "prompt": [ + { + "role": "user", + "content": [ + {"type": "image"}, + {"type": "text", "text": QUESTION_TEMPLATE.format(Question=example["problem"])}, + ], + }, + ], + } + + + def make_conversation_video(example): + return { + "prompt": [ + { + "role": "user", + "content": [ + {"type": "video"}, + {"type": "text", "text": QUESTION_TEMPLATE.format(Question=example["problem"])}, + ], + }, + ], + } + + + if "image" in dataset[script_args.dataset_train_split].features: + print("has image in dataset") + dataset = dataset.map(make_conversation_image) # Utilize multiprocessing for faster mapping + # dataset = dataset.remove_columns(["original_question", "original_answer"]) + + elif "video_filename" in dataset[script_args.dataset_train_split].features: + print("has video in dataset") + dataset = dataset.map(make_conversation_video) + + else: + print("no image in dataset") + dataset = dataset.map(make_conversation) + dataset = dataset.remove_columns("messages") + + # import pdb + # pdb.set_trace() + + trainer_cls = Qwen2VLGRPOTrainer if not training_args.use_vllm else Qwen2VLGRPOVLLMTrainer + print("using: ", trainer_cls) + + # Initialize the GRPO trainer + trainer = trainer_cls( + model=model_args.model_name_or_path, + reward_funcs=reward_funcs, + args=training_args, + train_dataset=dataset[script_args.dataset_train_split], + eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None, + peft_config=get_peft_config(model_args), + attn_implementation=model_args.attn_implementation, + max_pixels=script_args.max_pixels, + min_pixels=script_args.min_pixels, + ) + + # Train and push the model to the Hub + trainer.train() + + # Save and push to hub + trainer.save_model(training_args.output_dir) + if training_args.push_to_hub: + trainer.push_to_hub(dataset_name=script_args.dataset_name) + + +if __name__ == "__main__": + parser = TrlParser((GRPOScriptArguments, GRPOConfig, 
ModelConfig)) + script_args, training_args, model_args = parser.parse_args_and_config() + main(script_args, training_args, model_args) diff --git a/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/sft.py b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/sft.py new file mode 100644 index 0000000000000000000000000000000000000000..a8003dd8a28b36df49940dfcd2efd1673ce0a125 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/sft.py @@ -0,0 +1,322 @@ +# Copyright 2025 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Supervised fine-tuning script for decoder language models. 
+
+Usage:
+
+# On 1 node of 8 x H100s
+accelerate launch --config_file=configs/zero3.yaml src/open_r1/sft.py \
+    --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
+    --dataset_name HuggingFaceH4/Bespoke-Stratos-17k \
+    --learning_rate 2.0e-5 \
+    --num_train_epochs 1 \
+    --packing \
+    --max_seq_length 4096 \
+    --per_device_train_batch_size 4 \
+    --gradient_accumulation_steps 4 \
+    --gradient_checkpointing \
+    --bf16 \
+    --logging_steps 5 \
+    --eval_strategy steps \
+    --eval_steps 100 \
+    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill
+"""
+
+import logging
+import os
+import sys
+
+import datasets
+from dataclasses import dataclass, field
+from typing import Optional
+import torch
+import transformers
+from datasets import load_dataset
+from transformers import AutoTokenizer, set_seed, AutoProcessor
+from transformers.trainer_utils import get_last_checkpoint
+import trl
+from trl import (
+    ModelConfig,
+    ScriptArguments,
+    SFTTrainer,
+    TrlParser,
+    get_kbit_device_map,
+    get_peft_config,
+    get_quantization_config,
+)
+
+from qwen_vl_utils import process_vision_info
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class SFTConfig(trl.SFTConfig):
+    """
+    Arguments for callbacks, benchmarks, etc.
+    """
+
+    benchmarks: list[str] = field(
+        default_factory=lambda: [], metadata={"help": "The benchmarks to run after training."}
+    )
+    callbacks: list[str] = field(
+        default_factory=lambda: [], metadata={"help": "The callbacks to run during training."}
+    )
+    system_prompt: Optional[str] = field(
+        default=None,
+        metadata={"help": "The optional system prompt to use for benchmarking."},
+    )
+    hub_model_revision: Optional[str] = field(
+        default="main",
+        metadata={"help": "The Hub model branch to push the model to."},
+    )
+    overwrite_hub_revision: bool = field(default=False, metadata={"help": "Whether to overwrite the Hub revision."})
+    push_to_hub_revision: bool = field(default=False, metadata={"help": "Whether to push to a Hub revision/branch."})
+
+
+
+processor = None
+
+
+def convert_example(example):
+    """
+    Convert an example into the "messages" format, e.g.:
+    {
+        "system": "You are a helpful assistant.",
+        "conversations": [
+            {"from": "user", "value": "How many objects are included in this image?",
+             "image_path": "/path/to/image.png"},
+            {"from": "assistant", "value": "<think>\nI can see 10 objects\n</think>\n<answer>\n10\n</answer>"}
+        ]
+    }
+    """
+    messages = []
+    if "system" in example:
+        messages.append({
+            "role": "system",
+            "content": [{"type": "text", "text": example["system"]}],
+        })
+    else:
+        SYSTEM_PROMPT = (
+            "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
+            "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
+            "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
+            "<think> reasoning process here </think><answer> answer here </answer>"
+        )
+        messages.append({
+            "role": "system",
+            "content": [{"type": "text", "text": SYSTEM_PROMPT}],
+        })
+
+    thinking = example.get("thinking")
+    problem = example.get("problem")
+    solution = example.get("solution")
+    image = example.get("image")
+    messages.append({
+        "role": "user",
+        "content": [
+            {"type": "text", "text": problem},
+            {"type": "image", "image": image},
+        ]
+    })
+    messages.append({
+        "role": "assistant",
+        "content": f"<think>{thinking}</think>\n\n<answer>{solution}</answer>",
+    })
+
+    example["messages"] = messages
+    return example
+
+
+def collate_fn(examples):
+    texts = [
+        processor.apply_chat_template(convert_example(example)["messages"], tokenize=False, add_generation_prompt=True)
+        for example in examples
+    ]
+    image_inputs = []
+    for example in examples:
+        imgs, vids = process_vision_info(example["messages"])
+        image_inputs.append(imgs)
+    batch = processor(
+        text=texts,
+        images=image_inputs,
+        return_tensors="pt",
+        padding=True,
+    )
+    labels = batch["input_ids"].clone()
+    labels[labels == processor.tokenizer.pad_token_id] = -100
+    image_token_id = processor.tokenizer.convert_tokens_to_ids(processor.image_token)
+    
labels[labels == image_token_id] = -100 + batch["labels"] = labels + + # print(batch) + + return batch + + +def main(script_args, training_args, model_args): + # Set seed for reproducibility + set_seed(training_args.seed) + + ############### + # Setup logging + ############### + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%Y-%m-%d %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], + ) + log_level = training_args.get_process_log_level() + logger.setLevel(log_level) + datasets.utils.logging.set_verbosity(log_level) + transformers.utils.logging.set_verbosity(log_level) + transformers.utils.logging.enable_default_handler() + transformers.utils.logging.enable_explicit_format() + + # Log on each process a small summary + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + + f" distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + ) + logger.info(f"Model parameters {model_args}") + logger.info(f"Script parameters {script_args}") + logger.info(f"Data parameters {training_args}") + + # Check for last checkpoint + last_checkpoint = None + if os.path.isdir(training_args.output_dir): + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.") + + ################ + # Load datasets + ################ + + dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config) + + ################ + # Load tokenizer + ################ + global processor + if "vl" in model_args.model_name_or_path.lower(): + processor = AutoProcessor.from_pretrained( + model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code + ) + logger.info("Using AutoProcessor for vision-language model.") + else: + processor = 
AutoTokenizer.from_pretrained(
+            model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, use_fast=True
+        )
+        logger.info("Using AutoTokenizer for text-only model.")
+    if hasattr(processor, "pad_token") and processor.pad_token is None:
+        processor.pad_token = processor.eos_token
+    elif hasattr(processor, "tokenizer") and processor.tokenizer.pad_token is None:
+        processor.tokenizer.pad_token = processor.tokenizer.eos_token
+
+    ###################
+    # Model init kwargs
+    ###################
+    logger.info("*** Initializing model kwargs ***")
+    torch_dtype = (
+        model_args.torch_dtype if model_args.torch_dtype in ["auto", None] else getattr(torch, model_args.torch_dtype)
+    )
+    quantization_config = get_quantization_config(model_args)
+    model_kwargs = dict(
+        revision=model_args.model_revision,
+        trust_remote_code=model_args.trust_remote_code,
+        attn_implementation=model_args.attn_implementation,
+        torch_dtype=torch_dtype,
+        use_cache=False if training_args.gradient_checkpointing else True,
+        device_map=get_kbit_device_map() if quantization_config is not None else None,
+        quantization_config=quantization_config,
+    )
+    # training_args.model_init_kwargs = model_kwargs
+    from transformers import Qwen2VLForConditionalGeneration
+    model = Qwen2VLForConditionalGeneration.from_pretrained(
+        model_args.model_name_or_path, **model_kwargs
+    )
+    ############################
+    # Initialize the SFT Trainer
+    ############################
+    training_args.dataset_kwargs = {
+        "skip_prepare_dataset": True,
+    }
+    training_args.remove_unused_columns = False
+
+
+    trainer = SFTTrainer(
+        model=model,
+        args=training_args,
+        train_dataset=dataset[script_args.dataset_train_split],
+        eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
+        processing_class=processor.tokenizer,
+        data_collator=collate_fn,
+        peft_config=get_peft_config(model_args)
+    )
+
+
+
+    ###############
+    # Training loop
+    ###############
+    
logger.info("*** Train ***") + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + metrics["train_samples"] = len(dataset[script_args.dataset_train_split]) + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + ################################## + # Save model and create model card + ################################## + logger.info("*** Save model ***") + trainer.save_model(training_args.output_dir) + processor.save_pretrained(training_args.output_dir) + logger.info(f"Model saved to {training_args.output_dir}") + + # Save everything else on main process + kwargs = { + "dataset_name": script_args.dataset_name, + "tags": ["R1-V"], + } + if trainer.accelerator.is_main_process: + trainer.create_model_card(**kwargs) + # Restore k,v cache for fast inference + trainer.model.config.use_cache = True + trainer.model.config.save_pretrained(training_args.output_dir) + ############# + # push to hub + ############# + + if training_args.push_to_hub: + logger.info("Pushing to hub...") + trainer.push_to_hub(**kwargs) + processor.push_to_hub(training_args.hub_model_id) + + + + +if __name__ == "__main__": + parser = TrlParser((ScriptArguments, SFTConfig, ModelConfig)) + script_args, training_args, model_args = parser.parse_args_and_config() + main(script_args, training_args, model_args) diff --git a/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/trainer/__init__.py b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/trainer/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..951e9b6ad9e9a959eaf11446f371b27161222a6e --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/trainer/__init__.py @@ -0,0 +1,4 @@ +from 
.grpo_trainer import Qwen2VLGRPOTrainer +from .vllm_grpo_trainer import Qwen2VLGRPOVLLMTrainer + +__all__ = ["Qwen2VLGRPOTrainer", "Qwen2VLGRPOVLLMTrainer"] diff --git a/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/trainer/grpo_trainer.py b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/trainer/grpo_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..054687bc84b4275d8dbaefdb2ab9c688f78933b4 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/trainer/grpo_trainer.py @@ -0,0 +1,652 @@ +# Copyright 2025 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import textwrap +from collections import defaultdict +from typing import Any, Callable, Optional, Union + +import torch +import torch.utils.data +import transformers +from datasets import Dataset, IterableDataset +from packaging import version +from transformers import ( + AriaForConditionalGeneration, + AriaProcessor, + AutoModelForCausalLM, + AutoModelForSequenceClassification, + AutoProcessor, + AutoTokenizer, + GenerationConfig, + PreTrainedModel, + PreTrainedTokenizerBase, + Qwen2VLForConditionalGeneration, + Qwen2_5_VLForConditionalGeneration, + Trainer, + TrainerCallback, + is_wandb_available, +) +from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled +from transformers.utils import is_peft_available + +from trl.data_utils import apply_chat_template, is_conversational, maybe_apply_chat_template +from trl.models import create_reference_model, prepare_deepspeed, unwrap_model_for_generation +from trl.trainer.grpo_config import GRPOConfig +from trl.trainer.utils import generate_model_card, get_comet_experiment_url + +from qwen_vl_utils import process_vision_info + +import copy + + +if is_peft_available(): + from peft import PeftConfig, get_peft_model + +if is_wandb_available(): + import wandb + +# What we call a reward function is a callable that takes a list of prompts and completions and returns a list of +# rewards. When it's a string, it's a model ID, so it's loaded as a pretrained model. +RewardFunc = Union[str, PreTrainedModel, Callable[[list, list], list[float]]] + + +class Qwen2VLGRPOTrainer(Trainer): + """ + Trainer for the Group Relative Policy Optimization (GRPO) method. This algorithm was initially proposed in the + paper [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300). 
+
+    Example:
+
+    ```python
+    from datasets import load_dataset
+    from trl import GRPOTrainer
+
+    dataset = load_dataset("trl-lib/tldr", split="train")
+
+    trainer = GRPOTrainer(
+        model="Qwen/Qwen2-0.5B-Instruct",
+        reward_funcs="weqweasdas/RM-Gemma-2B",
+        train_dataset=dataset,
+    )
+
+    trainer.train()
+    ```
+
+    Args:
+        model (`Union[str, PreTrainedModel]`):
+            Model to be trained. Can be either:
+
+            - A string, being the *model id* of a pretrained model hosted inside a model repo on huggingface.co, or
+              a path to a *directory* containing model weights saved using
+              [`~transformers.PreTrainedModel.save_pretrained`], e.g., `'./my_model_directory/'`. The model is
+              loaded using [`~transformers.AutoModelForCausalLM.from_pretrained`] with the keyword arguments
+              in `args.model_init_kwargs`.
+            - A [`~transformers.PreTrainedModel`] object. Only causal language models are supported.
+        reward_funcs (`Union[RewardFunc, list[RewardFunc]]`):
+            Reward functions to be used for computing the rewards. To compute the rewards, we call all the reward
+            functions with the prompts and completions and sum the rewards. Can be either:
+
+            - A single reward function, such as:
+                - A string: The *model ID* of a pretrained model hosted inside a model repo on huggingface.co, or a
+                  path to a *directory* containing model weights saved using
+                  [`~transformers.PreTrainedModel.save_pretrained`], e.g., `'./my_model_directory/'`. The model is loaded
+                  using [`~transformers.AutoModelForSequenceClassification.from_pretrained`] with `num_labels=1` and the
+                  keyword arguments in `args.model_init_kwargs`.
+                - A [`~transformers.PreTrainedModel`] object: Only sequence classification models are supported.
+                - A custom reward function: The function is provided with the prompts and the generated completions,
+                  plus any additional columns in the dataset. It should return a list of rewards. For more details, see
+                  [Using a custom reward function](#using-a-custom-reward-function).
+            - A list of reward functions, where each item can independently be any of the above types. Mixing different
+              types within the list (e.g., a string model ID and a custom reward function) is allowed.
+        args ([`GRPOConfig`], *optional*, defaults to `None`):
+            Configuration for this trainer. If `None`, a default configuration is used.
+        train_dataset ([`~datasets.Dataset`] or [`~datasets.IterableDataset`]):
+            Dataset to use for training. It must include a column `"prompt"`. Any additional columns in the dataset are
+            ignored. The format of the samples can be either:
+
+            - [Standard](dataset_formats#standard): Each sample contains plain text.
+            - [Conversational](dataset_formats#conversational): Each sample contains structured messages (e.g., role
+              and content).
+        eval_dataset ([`~datasets.Dataset`], [`~datasets.IterableDataset`] or `dict[str, Union[Dataset, IterableDataset]]`):
+            Dataset to use for evaluation. It must meet the same requirements as `train_dataset`.
+        processing_class ([`~transformers.PreTrainedTokenizerBase`], *optional*, defaults to `None`):
+            Processing class used to process the data. The padding side must be set to "left". If `None`, the
+            processing class is loaded from the model's name with [`~transformers.AutoTokenizer.from_pretrained`].
+        reward_processing_classes (`Union[PreTrainedTokenizerBase, list[PreTrainedTokenizerBase]]`, *optional*, defaults to `None`):
+            Processing classes corresponding to the reward functions specified in `reward_funcs`. Can be either:
+
+            - A single processing class: Used when `reward_funcs` contains only one reward function.
+            - A list of processing classes: Must match the order and length of the reward functions in `reward_funcs`.
+            If set to `None`, or if an element of the list corresponding to a [`~transformers.PreTrainedModel`] is
+            `None`, the tokenizer for the model is automatically loaded using [`~transformers.AutoTokenizer.from_pretrained`].
+ For elements in `reward_funcs` that are custom reward functions (not [`~transformers.PreTrainedModel`]), + the corresponding entries in `reward_processing_classes` are ignored. + callbacks (list of [`~transformers.TrainerCallback`], *optional*, defaults to `None`): + List of callbacks to customize the training loop. Will add those to the list of default callbacks + detailed in [here](https://huggingface.co/docs/transformers/main_classes/callback). + + If you want to remove one of the default callbacks used, use the [`~transformers.Trainer.remove_callback`] + method. + optimizers (`tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]`, *optional*, defaults to `(None, None)`): + A tuple containing the optimizer and the scheduler to use. Will default to an instance of [`AdamW`] on your + model and a scheduler given by [`get_linear_schedule_with_warmup`] controlled by `args`. + peft_config ([`~peft.PeftConfig`], *optional*, defaults to `None`): + PEFT configuration used to wrap the model. If `None`, the model is not wrapped. 
+ """ + + def __init__( + self, + model: Union[str, PreTrainedModel], + reward_funcs: Union[RewardFunc, list[RewardFunc]], + args: GRPOConfig = None, + train_dataset: Optional[Union[Dataset, IterableDataset]] = None, + eval_dataset: Optional[Union[Dataset, IterableDataset, dict[str, Union[Dataset, IterableDataset]]]] = None, + processing_class: Optional[PreTrainedTokenizerBase] = None, + reward_processing_classes: Optional[Union[PreTrainedTokenizerBase, list[PreTrainedTokenizerBase]]] = None, + callbacks: Optional[list[TrainerCallback]] = None, + optimizers: tuple[Optional[torch.optim.Optimizer], Optional[torch.optim.lr_scheduler.LambdaLR]] = (None, None), + peft_config: Optional["PeftConfig"] = None, + max_pixels: Optional[int] = 12845056, + min_pixels: Optional[int] = 3136, + attn_implementation: str = "flash_attention_2", + ): + # Args + if args is None: + model_name = model if isinstance(model, str) else model.config._name_or_path + model_name = model_name.split("/")[-1] + args = GRPOConfig(f"{model_name}-GRPO") + + # Models + # Trained model + model_init_kwargs = args.model_init_kwargs or {} + model_init_kwargs["attn_implementation"] = attn_implementation + if isinstance(model, str): + model_id = model + torch_dtype = model_init_kwargs.get("torch_dtype") + if isinstance(torch_dtype, torch.dtype) or torch_dtype == "auto" or torch_dtype is None: + pass # torch_dtype is already a torch.dtype or "auto" or None + elif isinstance(torch_dtype, str): # it's a str, but not "auto" + torch_dtype = getattr(torch, torch_dtype) + model_init_kwargs["torch_dtype"] = torch_dtype + else: + raise ValueError( + "Invalid `torch_dtype` passed to `GRPOConfig`. Expected either 'auto' or a string representing " + f"a `torch.dtype` (e.g., 'float32'), but got {torch_dtype}." 
+ ) + # Disable caching if gradient checkpointing is enabled (not supported) + model_init_kwargs["use_cache"] = ( + False if args.gradient_checkpointing else model_init_kwargs.get("use_cache") + ) + if "Qwen2-VL" in model_id: + model = Qwen2VLForConditionalGeneration.from_pretrained(model, **model_init_kwargs) + elif "Qwen2.5-VL" in model_id: + model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model, **model_init_kwargs) + elif "Aria" in model_id: + model_init_kwargs.pop("use_cache") + model = AriaForConditionalGeneration.from_pretrained(model, **model_init_kwargs) + else: + model = AutoModelForCausalLM.from_pretrained(model, **model_init_kwargs) + else: + model_id = model.config._name_or_path + if args.model_init_kwargs is not None: + raise ValueError( + "You passed `model_init_kwargs` to the `GRPOConfig`, but your model is already instantiated. " + "This argument can only be used when the `model` argument is a string." + ) + + if peft_config is not None: + model = get_peft_model(model, peft_config) + + # Reference model + if is_deepspeed_zero3_enabled(): + if "Qwen2-VL" in model_id: + self.ref_model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, **model_init_kwargs) + elif "Qwen2.5-VL" in model_id: + self.ref_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, **model_init_kwargs) + elif "Aria" in model_id: + self.ref_model = AriaForConditionalGeneration.from_pretrained(model_id, **model_init_kwargs) + else: + self.ref_model = AutoModelForCausalLM.from_pretrained(model_id, **model_init_kwargs) + elif peft_config is None: + # If PEFT configuration is not provided, create a reference model based on the initial model. + self.ref_model = create_reference_model(model) + else: + # If PEFT is used, the reference model is not needed since the adapter can be disabled + # to revert to the initial model. 
+ self.ref_model = None + + # Processing class + if processing_class is None: + if "Qwen2-VL" in model_id or "Qwen2.5-VL" in model_id or "Aria" in model_id: + processing_class = AutoProcessor.from_pretrained(model_id) + pad_token_id = processing_class.tokenizer.pad_token_id + processing_class.pad_token_id = pad_token_id + processing_class.eos_token_id = processing_class.tokenizer.eos_token_id + if "Qwen" in model_id or "Qwen2.5-VL" in model_id: + processing_class.image_processor.max_pixels = max_pixels + processing_class.image_processor.min_pixels = min_pixels + else: + processing_class = AutoTokenizer.from_pretrained(model.config._name_or_path, padding_side="left") + pad_token_id = processing_class.pad_token_id + + # Reward functions + if not isinstance(reward_funcs, list): + reward_funcs = [reward_funcs] + for i, reward_func in enumerate(reward_funcs): + if isinstance(reward_func, str): + reward_funcs[i] = AutoModelForSequenceClassification.from_pretrained( + reward_func, num_labels=1, **model_init_kwargs + ) + self.reward_funcs = reward_funcs + + # Reward processing class + if reward_processing_classes is None: + reward_processing_classes = [None] * len(reward_funcs) + elif not isinstance(reward_processing_classes, list): + reward_processing_classes = [reward_processing_classes] + else: + if len(reward_processing_classes) != len(reward_funcs): + raise ValueError("The number of reward processing classes must match the number of reward functions.") + + for i, (reward_processing_class, reward_func) in enumerate(zip(reward_processing_classes, reward_funcs)): + if isinstance(reward_func, PreTrainedModel): + if reward_processing_class is None: + reward_processing_class = AutoTokenizer.from_pretrained(reward_func.config._name_or_path) + if reward_processing_class.pad_token_id is None: + reward_processing_class.pad_token = reward_processing_class.eos_token + # The reward model computes the reward for the latest non-padded token in the input sequence. 
+ # So it's important to set the pad token ID to the padding token ID of the processing class. + reward_func.config.pad_token_id = reward_processing_class.pad_token_id + reward_processing_classes[i] = reward_processing_class + self.reward_processing_classes = reward_processing_classes + + # Data collator + def data_collator(features): # No data collation is needed in GRPO + return features + + # Training arguments + self.max_prompt_length = args.max_prompt_length + self.max_completion_length = args.max_completion_length # = |o_i| in the GRPO paper + self.num_generations = args.num_generations # = G in the GRPO paper + self.generation_config = GenerationConfig( + max_new_tokens=self.max_completion_length, + do_sample=True, + temperature=1, # HACK + num_return_sequences=self.num_generations, + pad_token_id=pad_token_id, + ) + self.beta = args.beta + + # The trainer estimates the number of FLOPs (floating-point operations) using the number of elements in the + # input tensor associated with the key "input_ids". However, in GRPO, the sampled data does not include the + # "input_ids" key. Instead, the available keys is "prompt". As a result, the trainer issues the warning: + # "Could not estimate the number of tokens of the input, floating-point operations will not be computed." To + # suppress this warning, we set the "estimate_tokens" key in the model's "warnings_issued" dictionary to True. + # This acts as a flag to indicate that the warning has already been issued. + model.warnings_issued["estimate_tokens"] = True + + # Initialize the metrics + self._metrics = defaultdict(list) + + super().__init__( + model=model, + args=args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + processing_class=processing_class, + callbacks=callbacks, + optimizers=optimizers, + ) + + # Gradient accumulation requires scaled loss. Normally, loss scaling in the parent class depends on whether the + # model accepts loss-related kwargs. 
Since we compute our own loss, this check is irrelevant. We set + # self.model_accepts_loss_kwargs to False to enable scaling. + self.model_accepts_loss_kwargs = False + + if self.ref_model is not None: + if self.is_deepspeed_enabled: + self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator) + else: + self.ref_model = self.accelerator.prepare_model(self.ref_model, evaluation_mode=True) + + for i, reward_func in enumerate(self.reward_funcs): + if isinstance(reward_func, PreTrainedModel): + self.reward_funcs[i] = self.accelerator.prepare_model(reward_func, evaluation_mode=True) + + def _set_signature_columns_if_needed(self): + # If `self.args.remove_unused_columns` is True, non-signature columns are removed. + # By default, this method sets `self._signature_columns` to the model's expected inputs. + # In GRPOTrainer, we preprocess data, so using the model's signature columns doesn't work. + # Instead, we set them to the columns expected by the `training_step` method, hence the override. + if self._signature_columns is None: + self._signature_columns = ["prompt"] + + + # Get the per-token log probabilities for the completions for the model and the reference model + def _get_per_token_logps(self, model, input_ids, **kwargs): + # logits = model(input_ids, attention_mask=attention_mask, pixel_values=pixel_values, image_grid_thw=image_grid_thw).logits # (B, L, V) + logits = model(input_ids, **kwargs).logits + logits = logits[:, :-1, :] # (B, L-1, V), exclude the last logit: it corresponds to the next token pred + input_ids = input_ids[:, 1:] # (B, L-1), exclude the first input ID since we don't have logits for it + # Compute the log probabilities for the input tokens. Use a loop to reduce memory peak. 
+        per_token_logps = []
+        for logits_row, input_ids_row in zip(logits, input_ids):
+            log_probs = logits_row.log_softmax(dim=-1)
+            token_log_prob = torch.gather(log_probs, dim=1, index=input_ids_row.unsqueeze(1)).squeeze(1)
+            per_token_logps.append(token_log_prob)
+        return torch.stack(per_token_logps)
+
+    # The Trainer "prepares" the inputs before calling `compute_loss`: it converts them to tensors and moves them to the device.
+    # Since we preprocess the data in `compute_loss`, we override this method to skip that step.
+    def _prepare_inputs(self, inputs: dict[str, Union[torch.Tensor, Any]]) -> dict[str, Union[torch.Tensor, Any]]:
+        return inputs
+
+    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
+        if return_outputs:
+            raise ValueError("The GRPOTrainer does not support returning outputs")
+
+        prompts = [x["prompt"] for x in inputs]
+        prompts_text = [maybe_apply_chat_template(example, self.processing_class)["prompt"] for example in inputs]
+        if "image" in inputs[0]:
+            images = [x["image"] for x in inputs]
+        elif "video_filename" in inputs[0]:
+            video_inputs = []
+            for cur_idx, cur_input in enumerate(inputs):
+                copy_input = cur_input.copy()
+                copy_input['prompt'][0]['content'][0]['video'] = os.getcwd() + "/data" + inputs[cur_idx]["video_filename"][1:]
+                video_inputs.append(process_vision_info(copy_input["prompt"])[1])
+
+        prompt_inputs = self.processing_class(
+            text=prompts_text,
+            images=images if "image" in inputs[0] else None,
+            videos=video_inputs if "video_filename" in inputs[0] else None,
+            return_tensors="pt",
+            padding=True,
+            padding_side="left",
+            add_special_tokens=False,
+        )
+
+        prompt_inputs = super()._prepare_inputs(prompt_inputs)
+
+        prompt_ids, prompt_mask = prompt_inputs["input_ids"], prompt_inputs["attention_mask"]
+
+        if self.max_prompt_length is not None:
+            prompt_ids = prompt_ids[:, -self.max_prompt_length :]
+            prompt_mask = prompt_mask[:, -self.max_prompt_length :]
+
+        # Generate completions
+        with unwrap_model_for_generation(model, self.accelerator) as unwrapped_model:
+            # Generate `num_generations` times, one sequence per call with `temp_generation_config`, then
+            # stack the outputs into `prompt_completion_ids`, padding shorter sequences with the pad token.
+            num_generations = self.generation_config.num_return_sequences
+            temp_generation_config = copy.deepcopy(self.generation_config)
+            temp_generation_config.num_return_sequences = 1
+
+            all_completions = []
+            for _ in range(num_generations):
+                completion = unwrapped_model.generate(**prompt_inputs, generation_config=temp_generation_config)
+                all_completions.append(completion)
+
+            # Pad all completions to the same length
+            max_length = max(completion.size(1) for completion in all_completions)
+            padded_completions = []
+            for completion in all_completions:
+                if completion.size(1) < max_length:
+                    padding = torch.full(
+                        (completion.size(0), max_length - completion.size(1)),
+                        self.processing_class.tokenizer.pad_token_id,
+                        dtype=completion.dtype,
+                        device=completion.device,
+                    )
+                    padded_completion = torch.cat([completion, padding], dim=1)
+                else:
+                    padded_completion = completion
+                padded_completions.append(padded_completion)
+
+            # Stack all padded completions
+            prompt_completion_ids = torch.cat(padded_completions, dim=0)
+
+        prompt_length = prompt_inputs["input_ids"].size(1)
+        completion_ids = prompt_completion_ids[:, prompt_length:]
+
+        # Mask everything after the first EOS token
+        is_eos = completion_ids == self.processing_class.eos_token_id
+        device = self.accelerator.device
+        eos_idx = torch.full((is_eos.size(0),), is_eos.size(1), dtype=torch.long, device=device)
+        eos_idx[is_eos.any(dim=1)] = is_eos.int().argmax(dim=1)[is_eos.any(dim=1)]
+        sequence_indices = torch.arange(is_eos.size(1), device=device).expand(is_eos.size(0), -1)
+        completion_mask = (sequence_indices <= eos_idx.unsqueeze(1)).int()
+
+        prompt_inputs.pop("input_ids")
+        prompt_inputs.pop("attention_mask")
+        # Assumes a Qwen2-VL-style processor: repeat the visual inputs so that
+        # each generated completion has a matching copy
+        if "image" in inputs[0]:
+            prompt_inputs["pixel_values"] = prompt_inputs["pixel_values"].repeat(len(prompt_completion_ids), 1)
+            prompt_inputs["image_grid_thw"] = prompt_inputs["image_grid_thw"].repeat(len(prompt_completion_ids), 1)
+
+        if "video_filename" in inputs[0]:
+            prompt_inputs["pixel_values_videos"] = prompt_inputs["pixel_values_videos"].repeat(len(prompt_completion_ids), 1)
+            prompt_inputs["video_grid_thw"] = prompt_inputs["video_grid_thw"].repeat(len(prompt_completion_ids), 1)
+
+        per_token_logps = self._get_per_token_logps(model, prompt_completion_ids, **prompt_inputs)
+        # Get rid of the prompt (-1 because of the shift done in _get_per_token_logps)
+        per_token_logps = per_token_logps[:, prompt_length - 1 :]
+
+        with torch.inference_mode():
+            if self.ref_model is not None:
+                ref_per_token_logps = self._get_per_token_logps(self.ref_model, prompt_completion_ids, **prompt_inputs)
+            else:
+                with self.accelerator.unwrap_model(model).disable_adapter():
+                    ref_per_token_logps = self._get_per_token_logps(model, prompt_completion_ids, **prompt_inputs)
+            ref_per_token_logps = ref_per_token_logps[:, prompt_length - 1 :]
+
+        # Compute the per-token KL divergence between the model and the reference model (k3 estimator)
+        per_token_kl = torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1
+        per_token_kl = torch.clamp(per_token_kl, min=-100, max=100)
+
+        # Decode the generated completions
+        completions = self.processing_class.batch_decode(completion_ids, skip_special_tokens=True)
+        if is_conversational(inputs[0]):
+            completions = [[{"role": "assistant", "content": completion}] for completion in completions]
+
+        # Compute the rewards
+        prompts = [prompt for prompt in prompts for _ in range(self.num_generations)]
+
+        rewards_per_func = torch.zeros(len(prompts), len(self.reward_funcs), device=device)
+        for i, (reward_func, reward_processing_class) in enumerate(
+            zip(self.reward_funcs, self.reward_processing_classes)
+        ):
+            if isinstance(reward_func, PreTrainedModel):
+                if is_conversational(inputs[0]):
+                    messages = [{"messages": p + c} for p, c in zip(prompts, completions)]
+                    texts = [apply_chat_template(x, reward_processing_class)["text"] for x in messages]
+                else:
+                    texts = [p + c for p, c in
zip(prompts, completions)] + reward_inputs = reward_processing_class( + texts, return_tensors="pt", padding=True, padding_side="right", add_special_tokens=False + ) + reward_inputs = super()._prepare_inputs(reward_inputs) + with torch.inference_mode(): + rewards_per_func[:, i] = reward_func(**reward_inputs).logits[:, 0] # Shape (B*G,) + else: + # Repeat all input columns (but "prompt" and "completion") to match the number of generations + reward_kwargs = {key: [] for key in inputs[0].keys() if key not in ["prompt", "completion"]} + for key in reward_kwargs: + for example in inputs: + # Repeat each value in the column for `num_generations` times + reward_kwargs[key].extend([example[key]] * self.num_generations) + output_reward_func = reward_func(prompts=prompts, completions=completions, **reward_kwargs) + rewards_per_func[:, i] = torch.tensor(output_reward_func, dtype=torch.float32, device=device) + + # Sum the rewards from all reward functions + rewards = rewards_per_func.sum(dim=1) + + # Compute grouped-wise rewards + mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1) + std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1) + + # Normalize the rewards to compute the advantages + mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0) + std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0) + advantages = (rewards - mean_grouped_rewards) / (std_grouped_rewards + 1e-4) + + # x - x.detach() allows for preserving gradients from x + per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1) + per_token_loss = -(per_token_loss - self.beta * per_token_kl) + loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean() + + # import pdb + # pdb.set_trace() + + # Log the metrics + completion_length = self.accelerator.gather_for_metrics(completion_mask.sum(1)).float().mean().item() + 
self._metrics["completion_length"].append(completion_length) + + reward_per_func = self.accelerator.gather_for_metrics(rewards_per_func).mean(0) + for i, reward_func in enumerate(self.reward_funcs): + if isinstance(reward_func, PreTrainedModel): + reward_func_name = reward_func.config._name_or_path.split("/")[-1] + else: + reward_func_name = reward_func.__name__ + self._metrics[f"rewards/{reward_func_name}"].append(reward_per_func[i].item()) + + self._metrics["reward"].append(self.accelerator.gather_for_metrics(rewards).mean().item()) + + self._metrics["reward_std"].append(self.accelerator.gather_for_metrics(std_grouped_rewards).mean().item()) + + mean_kl = ((per_token_kl * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean() + self._metrics["kl"].append(self.accelerator.gather_for_metrics(mean_kl).mean().item()) + + return loss + + def log(self, logs: dict[str, float], start_time: Optional[float] = None) -> None: + metrics = {key: sum(val) / len(val) for key, val in self._metrics.items()} # average the metrics + logs = {**logs, **metrics} + if version.parse(transformers.__version__) >= version.parse("4.47.0.dev0"): + super().log(logs, start_time) + else: # transformers<=4.46 + super().log(logs) + self._metrics.clear() + + def create_model_card( + self, + model_name: Optional[str] = None, + dataset_name: Optional[str] = None, + tags: Union[str, list[str], None] = None, + ): + """ + Creates a draft of a model card using the information available to the `Trainer`. + + Args: + model_name (`str` or `None`, *optional*, defaults to `None`): + Name of the model. + dataset_name (`str` or `None`, *optional*, defaults to `None`): + Name of the dataset used for training. + tags (`str`, `list[str]` or `None`, *optional*, defaults to `None`): + Tags to be associated with the model card. 
+ """ + if not self.is_world_process_zero(): + return + + if hasattr(self.model.config, "_name_or_path") and not os.path.isdir(self.model.config._name_or_path): + base_model = self.model.config._name_or_path + else: + base_model = None + + tags = tags or [] + if isinstance(tags, str): + tags = [tags] + + if hasattr(self.model.config, "unsloth_version"): + tags.append("unsloth") + + citation = textwrap.dedent( + """\ + @article{zhihong2024deepseekmath, + title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}}, + author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo}, + year = 2024, + eprint = {arXiv:2402.03300}, + """ + ) + + model_card = generate_model_card( + base_model=base_model, + model_name=model_name, + hub_model_id=self.hub_model_id, + dataset_name=dataset_name, + tags=tags, + wandb_url=wandb.run.get_url() if is_wandb_available() and wandb.run is not None else None, + comet_url=get_comet_experiment_url(), + trainer_name="GRPO", + trainer_citation=citation, + paper_title="DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", + paper_id="2402.03300", + ) + + model_card.save(os.path.join(self.args.output_dir, "README.md")) diff --git a/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/trainer/vllm_grpo_trainer.py b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/trainer/vllm_grpo_trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..154c4d28fdd81b3010e86f22de9ea21e47588df9 --- /dev/null +++ b/previous_version/Video-R1-main-previous/src/r1-v/src/open_r1/trainer/vllm_grpo_trainer.py @@ -0,0 +1,832 @@ +# Copyright 2025 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import textwrap +from collections import defaultdict +from typing import Any, Callable, Optional, Union +from accelerate.utils.other import is_compiled_module +from accelerate.utils import broadcast_object_list, gather, gather_object +import torch +import torch.utils.data +import transformers +import warnings +from unittest.mock import patch +from datasets import Dataset, IterableDataset +from packaging import version +from transformers import ( + AriaForConditionalGeneration, + AriaProcessor, + AutoModelForCausalLM, + AutoModelForSequenceClassification, + AutoProcessor, + AutoTokenizer, + GenerationConfig, + PreTrainedModel, + PreTrainedTokenizerBase, + Qwen2VLForConditionalGeneration, + Trainer, + TrainerCallback, + is_wandb_available, +) +from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled +from transformers.utils import is_peft_available + +from trl.data_utils import ( + apply_chat_template, + is_conversational, + maybe_apply_chat_template, +) +from trl.import_utils import is_vllm_available + +from trl.models import ( + create_reference_model, + prepare_deepspeed, + unwrap_model_for_generation, +) +from trl.trainer.grpo_config import GRPOConfig +from trl.trainer.utils import generate_model_card, get_comet_experiment_url, pad +from trl import GRPOTrainer + +import copy + +if is_peft_available(): + from peft import PeftConfig, get_peft_model + +if is_vllm_available(): + from vllm import LLM, SamplingParams + + +if is_wandb_available(): + import wandb +import torch.nn as nn +from torch.utils.data import Sampler + +# 
What we call a reward function is a callable that takes a list of prompts and completions and returns a list of +# rewards. When it's a string, it's a model ID, so it's loaded as a pretrained model. +RewardFunc = Union[str, PreTrainedModel, Callable[[list, list], list[float]]] + + +class RepeatRandomSampler(Sampler): + """ + Sampler that repeats the indices of a dataset N times. + + Args: + data_source (`Sized`): + Dataset to sample from. + repeat_count (`int`): + Number of times to repeat each index. + + Example: + ```python + >>> sampler = RepeatRandomSampler(["a", "b", "c", "d"], repeat_count=2) + >>> list(sampler) + [2, 2, 0, 0, 3, 3, 1, 1] + ``` + """ + + def __init__(self, data_source, repeat_count: int): + self.data_source = data_source + self.repeat_count = repeat_count + self.num_samples = len(data_source) + + def __iter__(self): + indexes = [ + idx + for idx in torch.randperm(self.num_samples).tolist() + for _ in range(self.repeat_count) + ] + return iter(indexes) + + def __len__(self): + return self.num_samples * self.repeat_count + + +class Qwen2VLGRPOVLLMTrainer(Trainer): + def __init__( + self, + model: Union[str, PreTrainedModel], + reward_funcs: Union[RewardFunc, list[RewardFunc]], + args: GRPOConfig = None, + train_dataset: Optional[Union[Dataset, IterableDataset]] = None, + eval_dataset: Optional[ + Union[Dataset, IterableDataset, dict[str, Union[Dataset, IterableDataset]]] + ] = None, + processing_class: Optional[PreTrainedTokenizerBase] = None, + reward_processing_classes: Optional[ + Union[PreTrainedTokenizerBase, list[PreTrainedTokenizerBase]] + ] = None, + callbacks: Optional[list[TrainerCallback]] = None, + optimizers: tuple[ + Optional[torch.optim.Optimizer], Optional[torch.optim.lr_scheduler.LambdaLR] + ] = (None, None), + peft_config: Optional["PeftConfig"] = None, + # qwen2-vl related params + max_pixels: Optional[int] = 12845056, + min_pixels: Optional[int] = 3136, + attn_implementation: str = "flash_attention_2", + ): + + # Args + if 
args is None: + model_name = model if isinstance(model, str) else model.config._name_or_path + model_name = model_name.split("/")[-1] + args = GRPOConfig(f"{model_name}-GRPO") + + # Models + # Trained model + model_init_kwargs = args.model_init_kwargs or {} + model_init_kwargs["attn_implementation"] = attn_implementation + if isinstance(model, str): + model_id = model + torch_dtype = model_init_kwargs.get("torch_dtype") + if ( + isinstance(torch_dtype, torch.dtype) + or torch_dtype == "auto" + or torch_dtype is None + ): + pass # torch_dtype is already a torch.dtype or "auto" or None + elif isinstance(torch_dtype, str): # it's a str, but not "auto" + torch_dtype = getattr(torch, torch_dtype) + model_init_kwargs["torch_dtype"] = torch_dtype + else: + raise ValueError( + "Invalid `torch_dtype` passed to `GRPOConfig`. Expected either 'auto' or a string representing " + f"a `torch.dtype` (e.g., 'float32'), but got {torch_dtype}." + ) + # Disable caching if gradient checkpointing is enabled (not supported) + model_init_kwargs["use_cache"] = ( + False + if args.gradient_checkpointing + else model_init_kwargs.get("use_cache") + ) + if "Qwen2-VL" in model_id: + model = Qwen2VLForConditionalGeneration.from_pretrained( + model, **model_init_kwargs + ) + elif "Aria" in model_id: + model_init_kwargs.pop("use_cache") + model = AriaForConditionalGeneration.from_pretrained( + model, **model_init_kwargs + ) + else: + model = AutoModelForCausalLM.from_pretrained(model, **model_init_kwargs) + else: + model_id = model.config._name_or_path + if args.model_init_kwargs is not None: + raise ValueError( + "You passed `model_init_kwargs` to the `GRPOConfig`, but your model is already instantiated. " + "This argument can only be used when the `model` argument is a string." 
+ ) + + if peft_config is not None: + model = get_peft_model(model, peft_config) + + # Reference model + if is_deepspeed_zero3_enabled(): + if "Qwen2-VL" in model_id: + self.ref_model = Qwen2VLForConditionalGeneration.from_pretrained( + model_id, **model_init_kwargs + ) + elif "Aria" in model_id: + self.ref_model = AriaForConditionalGeneration.from_pretrained( + model_id, **model_init_kwargs + ) + else: + self.ref_model = AutoModelForCausalLM.from_pretrained( + model_id, **model_init_kwargs + ) + elif peft_config is None: + # If PEFT configuration is not provided, create a reference model based on the initial model. + self.ref_model = create_reference_model(model) + else: + # If PEFT is used, the reference model is not needed since the adapter can be disabled + # to revert to the initial model. + self.ref_model = None + + # Processing class + if processing_class is None: + if "Qwen2-VL" in model_id or "Aria" in model_id: + processing_class = AutoProcessor.from_pretrained(model_id) + pad_token_id = processing_class.tokenizer.pad_token_id + processing_class.pad_token_id = pad_token_id + processing_class.eos_token_id = processing_class.tokenizer.eos_token_id + if "Qwen2-VL" in model_id: + processing_class.image_processor.max_pixels = max_pixels + processing_class.image_processor.min_pixels = min_pixels + else: + processing_class = AutoTokenizer.from_pretrained( + model.config._name_or_path, padding_side="left" + ) + pad_token_id = processing_class.pad_token_id + + # Reward functions + if not isinstance(reward_funcs, list): + reward_funcs = [reward_funcs] + for i, reward_func in enumerate(reward_funcs): + if isinstance(reward_func, str): + reward_funcs[i] = AutoModelForSequenceClassification.from_pretrained( + reward_func, num_labels=1, **model_init_kwargs + ) + self.reward_funcs = reward_funcs + + # Reward processing class + if reward_processing_classes is None: + reward_processing_classes = [None] * len(reward_funcs) + elif not isinstance(reward_processing_classes, 
list):
+            reward_processing_classes = [reward_processing_classes]
+        else:
+            if len(reward_processing_classes) != len(reward_funcs):
+                raise ValueError(
+                    "The number of reward processing classes must match the number of reward functions."
+                )
+
+        for i, (reward_processing_class, reward_func) in enumerate(
+            zip(reward_processing_classes, reward_funcs)
+        ):
+            if isinstance(reward_func, PreTrainedModel):
+                if reward_processing_class is None:
+                    reward_processing_class = AutoTokenizer.from_pretrained(
+                        reward_func.config._name_or_path
+                    )
+                if reward_processing_class.pad_token_id is None:
+                    reward_processing_class.pad_token = (
+                        reward_processing_class.eos_token
+                    )
+                # The reward model computes the reward for the last non-padded token in the input sequence.
+                # So it's important to set the pad token ID to the padding token ID of the processing class.
+                reward_func.config.pad_token_id = reward_processing_class.pad_token_id
+                reward_processing_classes[i] = reward_processing_class
+        self.reward_processing_classes = reward_processing_classes
+
+        # Data collator
+        def data_collator(features):  # No data collation is needed in GRPO
+            return features
+
+        # Training arguments
+        self.max_prompt_length = args.max_prompt_length
+        self.max_completion_length = (
+            args.max_completion_length
+        )  # = |o_i| in the GRPO paper
+        self.num_generations = args.num_generations  # = G in the GRPO paper
+        self.generation_config = GenerationConfig(
+            max_new_tokens=self.max_completion_length,
+            do_sample=True,
+            temperature=1,  # HACK
+            num_return_sequences=self.num_generations,
+            pad_token_id=pad_token_id,
+        )
+        self.beta = args.beta
+
+        # The trainer estimates the number of FLOPs (floating-point operations) using the number of elements in the
+        # input tensor associated with the key "input_ids". However, in GRPO, the sampled data does not include the
+        # "input_ids" key. Instead, the available key is "prompt".
As a result, the trainer issues the warning: + # "Could not estimate the number of tokens of the input, floating-point operations will not be computed." To + # suppress this warning, we set the "estimate_tokens" key in the model's "warnings_issued" dictionary to True. + # This acts as a flag to indicate that the warning has already been issued. + model.warnings_issued["estimate_tokens"] = True + + # Initialize the metrics + self._metrics = defaultdict(list) + self.use_vllm = args.use_vllm + + # rewrite the processing AutoTokenizer -> AutoProcessor + model_id = model if isinstance(model, str) else model.config._name_or_path + if processing_class is None: + if "Qwen2-VL" in model_id or "Aria" in model_id: + processing_class = AutoProcessor.from_pretrained(model_id) + pad_token_id = processing_class.tokenizer.pad_token_id + processing_class.pad_token_id = pad_token_id + processing_class.eos_token_id = processing_class.tokenizer.eos_token_id + if "Qwen2-VL" in model_id: + processing_class.image_processor.max_pixels = max_pixels + processing_class.image_processor.min_pixels = min_pixels + else: + processing_class = AutoTokenizer.from_pretrained( + model.config._name_or_path, padding_side="left" + ) + pad_token_id = processing_class.pad_token_id + + super().__init__( + model=model, + args=args, + data_collator=data_collator, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + processing_class=processing_class, + callbacks=callbacks, + optimizers=optimizers, + ) + # Gradient accumulation requires scaled loss. Normally, loss scaling in the parent class depends on whether the + # model accepts loss-related kwargs. Since we compute our own loss, this check is irrelevant. We set + # self.model_accepts_loss_kwargs to False to enable scaling. 
+ self.model_accepts_loss_kwargs = False + # Check if the per_device_train/eval_batch_size * num processes can be divided by the number of generations + num_processes = self.accelerator.num_processes + global_batch_size = args.per_device_train_batch_size * num_processes + possible_values = [ + n_gen + for n_gen in range(2, global_batch_size + 1) + if (global_batch_size) % n_gen == 0 + ] + + if self.num_generations not in possible_values: + raise ValueError( + f"The global train batch size ({num_processes} x {args.per_device_train_batch_size}) must be evenly " + f"divisible by the number of generations per prompt ({self.num_generations}). Given the current train " + f"batch size, the valid values for the number of generations are: {possible_values}." + ) + if self.args.eval_strategy != "no": + global_batch_size = args.per_device_eval_batch_size * num_processes + possible_values = [ + n_gen + for n_gen in range(2, global_batch_size + 1) + if (global_batch_size) % n_gen == 0 + ] + if self.num_generations not in possible_values: + raise ValueError( + f"The global eval batch size ({num_processes} x {args.per_device_eval_batch_size}) must be evenly " + f"divisible by the number of generations per prompt ({self.num_generations}). Given the current " + f"eval batch size, the valid values for the number of generations are: {possible_values}." + ) + + if self.use_vllm: + if not is_vllm_available(): + raise ImportError( + "vLLM is not available and `use_vllm` is set to True. Please install vLLM with " + "`pip install vllm` to use it." + ) + + if self.accelerator.is_main_process: + vllm_device = self.args.vllm_device + if vllm_device == "auto": + vllm_device = f"cuda:{self.accelerator.num_processes}" # take the next GPU idx + # Check that the requested device is available + if ( + vllm_device.split(":")[0] == "cuda" + and int(vllm_device.split(":")[1]) >= torch.cuda.device_count() + ): + raise ValueError( + f"The requested device for vllm ({vllm_device}) is not available. 
You are likely using vLLM " + "without restricting the number of GPUs for training. Set the `--num_processes` argument to a " + "value lower than the number of GPUs available on your machine—typically, reducing it by one " + f"is sufficient. In your case: `--num_processes {torch.cuda.device_count() - 1}`." + ) + # Check that the requested device is not also used for training + if vllm_device in { + f"cuda:{idx}" for idx in range(self.accelerator.num_processes) + }: + warnings.warn( + f"The requested device {vllm_device} is also used for training. This may lead to unexpected " + "behavior. It is recommended to use a dedicated device for vLLM." + ) + # vLLM is not compatible with accelerate. So we need to patch it to make sure we can (1) place the vLLM + # model on the desired device (world_size_patch) and (2) avoid a test that is not designed for our + # setting (profiling_patch). + world_size_patch = patch( + "torch.distributed.get_world_size", return_value=1 + ) + profiling_patch = patch( + "vllm.worker.worker.Worker._assert_memory_footprint_increased_during_profiling", + return_value=None, + ) + with world_size_patch, profiling_patch: + print("vllm is running on: ", vllm_device) + self.llm = LLM( + model=model.name_or_path, + device=vllm_device, + gpu_memory_utilization=self.args.vllm_gpu_memory_utilization, + dtype=torch.bfloat16, + # Automatic Prefix Caching caches the KV cache of existing queries, so that a new query can + # directly reuse the KV cache if it shares the same prefix with one of the existing queries. + # This is particularly useful here because we generate completions from the same prompts. + enable_prefix_caching=True, + enforce_eager=True, + # Ensure that training and inference use the same processor for images. 
+ mm_processor_kwargs=( + { + "max_pixels": max_pixels, + "min_pixels": min_pixels, + } + if "Qwen2-VL" in model_id or "Qwen2.5-VL" in model_id + else None + ), + max_model_len=args.max_completion_length, + ) + self.sampling_params = SamplingParams( + temperature=args.temperature, + max_tokens=self.max_completion_length, + ) + + self._last_loaded_step = ( + 0 # tag to avoid useless loading during grad accumulation + ) + + # When using vLLM, the main process is responsible for loading the model weights. This can cause process + # desynchronization and seems to lead to DeepSpeed hanging during initialization. To prevent this, we + # synchronize all processes after vLLM has been fully initialized. + self.accelerator.wait_for_everyone() + else: + raise ValueError( + "Qwen2VLGRPOVLLMTrainer only supports vllm generation, please set --use_vllm True" + ) + + if self.ref_model is not None: + if self.is_deepspeed_enabled: + self.ref_model = prepare_deepspeed(self.ref_model, self.accelerator) + else: + self.ref_model = self.accelerator.prepare_model( + self.ref_model, evaluation_mode=True + ) + + for i, reward_func in enumerate(self.reward_funcs): + if isinstance(reward_func, PreTrainedModel): + self.reward_funcs[i] = self.accelerator.prepare_model( + reward_func, evaluation_mode=True + ) + + def _set_signature_columns_if_needed(self): + # If `self.args.remove_unused_columns` is True, non-signature columns are removed. + # By default, this method sets `self._signature_columns` to the model's expected inputs. + # In GRPOTrainer, we preprocess data, so using the model's signature columns doesn't work. + # Instead, we set them to the columns expected by the `training_step` method, hence the override. 
+        if self._signature_columns is None:
+            self._signature_columns = ["prompt"]
+
+    # We need a custom sampler that samples the same prompt multiple times
+    def _get_train_sampler(self):
+        return RepeatRandomSampler(self.train_dataset, self.num_generations)
+
+    # Get the per-token log probabilities for the completions for the model and the reference model
+    def _get_per_token_logps(
+        self,
+        model,
+        input_ids,
+        attention_mask,
+        pixel_values,
+        image_grid_thw,
+        logits_to_keep,
+    ):
+        pixel_values = pixel_values.to(model.device)
+        image_grid_thw = image_grid_thw.to(device=model.device)
+        logits = model(
+            input_ids,
+            attention_mask=attention_mask,
+            pixel_values=pixel_values,
+            image_grid_thw=image_grid_thw,
+        ).logits  # (B, L, V)
+        logits = logits[
+            :, :-1, :
+        ]  # (B, L-1, V), exclude the last logit: it corresponds to the next token pred
+        input_ids = input_ids[
+            :, -logits_to_keep:
+        ]  # (B, logits_to_keep), keep only the completion tokens
+        logits = logits[:, -logits_to_keep:]
+        # Compute the log probabilities for the input tokens. Use a loop to reduce memory peak.
+        per_token_logps = []
+        for logits_row, input_ids_row in zip(logits, input_ids):
+            log_probs = logits_row.log_softmax(dim=-1)
+            token_log_prob = torch.gather(
+                log_probs, dim=1, index=input_ids_row.unsqueeze(1)
+            ).squeeze(1)
+            per_token_logps.append(token_log_prob)
+        return torch.stack(per_token_logps)
+
+    # The default Trainer only converts inputs to tensors and moves them to the device before `compute_loss`.
+    # Here, the override additionally runs generation and builds the full batch in `_prepare_inputs`.
+ def _prepare_inputs( + self, inputs: dict[str, Union[torch.Tensor, Any]] + ) -> dict[str, Union[torch.Tensor, Any]]: + device = self.accelerator.device + prompts = [x["prompt"] for x in inputs] + images = [x["image"] for x in inputs] + prompts_text = [ + maybe_apply_chat_template(example, self.processing_class)["prompt"] + for example in inputs + ] + prompt_inputs = self.processing_class( + # prompts_text, return_tensors="pt", padding=True, padding_side="left", add_special_tokens=False + text=prompts_text, + images=images, + return_tensors="pt", + padding=True, + padding_side="left", + add_special_tokens=False, + ) + prompt_ids, prompt_mask = ( + prompt_inputs["input_ids"].to(device), + prompt_inputs["attention_mask"].to(device), + ) + if self.max_prompt_length is not None: + prompt_ids = prompt_ids[:, -self.max_prompt_length :] + prompt_mask = prompt_mask[:, -self.max_prompt_length :] + + if self.args.use_vllm: + # First, have main process load weights if needed + if self.state.global_step != self._last_loaded_step: + with unwrap_model_for_generation( + self.model, + self.accelerator, + gather_deepspeed3_params=False, # TODO: fix this, self.args.ds3_gather_for_generation, + ) as unwrapped_model: + if is_compiled_module(unwrapped_model): + state_dict = unwrapped_model._orig_mod.state_dict() + else: + state_dict = unwrapped_model.state_dict() + if self.accelerator.is_main_process: + llm_model = ( + self.llm.llm_engine.model_executor.driver_worker.model_runner.model + ) + llm_model.load_weights(state_dict.items()) + self._last_loaded_step = self.state.global_step + + # Generate completions using vLLM: gather all prompts and use them in a single call in the main process + all_prompts_text = gather_object(prompts_text) + all_images = gather_object(images) + # group into pairs + all_multimodal_inputs = [ + {"prompt": p, "multi_modal_data": {"image": i}} + for p, i in zip(all_prompts_text, all_images) + ] + + if self.accelerator.is_main_process: + outputs = 
self.llm.generate(
+                    all_multimodal_inputs,
+                    sampling_params=self.sampling_params,
+                    use_tqdm=False,
+                )
+                completion_ids = [
+                    out.token_ids
+                    for completions in outputs
+                    for out in completions.outputs
+                ]
+            else:
+                completion_ids = [None] * len(all_prompts_text)
+            completion_ids = broadcast_object_list(completion_ids, from_process=0)
+            process_slice = slice(
+                self.accelerator.process_index * len(prompts),
+                (self.accelerator.process_index + 1) * len(prompts),
+            )
+            completion_ids = completion_ids[process_slice]
+
+            # Pad the completions, and concatenate them with the prompts
+            completion_ids = [
+                torch.tensor(ids, device=device) for ids in completion_ids
+            ]
+            completion_ids = pad(
+                completion_ids, padding_value=self.processing_class.pad_token_id
+            )
+            prompt_completion_ids = torch.cat([prompt_ids, completion_ids], dim=1)
+        else:
+            raise ValueError("Only vLLM generation is supported in this version")
+
+        # Below is the same as yifan's code
+        # Mask everything after the first EOS token
+        is_eos = completion_ids == self.processing_class.eos_token_id
+        device = self.accelerator.device
+        eos_idx = torch.full(
+            (is_eos.size(0),), is_eos.size(1), dtype=torch.long, device=device
+        )
+        eos_idx[is_eos.any(dim=1)] = is_eos.int().argmax(dim=1)[is_eos.any(dim=1)]
+        sequence_indices = torch.arange(is_eos.size(1), device=device).expand(
+            is_eos.size(0), -1
+        )
+        completion_mask = (sequence_indices <= eos_idx.unsqueeze(1)).int()
+
+        # Concatenate prompt_mask with completion_mask for logit computation
+        attention_mask = torch.cat([prompt_mask, completion_mask], dim=1)  # (B*G, P+C)
+        # pixel_values = prompt_inputs["pixel_values"].repeat_interleave(
+        #     self.num_generations, dim=0
+        # )
+
+        pixel_values = prompt_inputs["pixel_values"]
+        # [None].repeat_interleave(self.num_generations, dim=0)
+        # pixel_values = pixel_values.view(-1, pixel_values.shape[-1])
+
+        image_grid_thw = prompt_inputs["image_grid_thw"]
+        # .repeat_interleave(
+        #     self.num_generations, dim=0
+        # )
+
logits_to_keep = completion_ids.size(1) + + with torch.inference_mode(): + if self.ref_model is not None: + ref_per_token_logps = self._get_per_token_logps( + self.ref_model, + prompt_completion_ids, + attention_mask, + pixel_values, + image_grid_thw, + logits_to_keep, + ) + else: + with self.accelerator.unwrap_model(self.model).disable_adapter(): + ref_per_token_logps = self._get_per_token_logps( + self.model, + prompt_completion_ids, + attention_mask, + pixel_values, + image_grid_thw, + logits_to_keep, + ) + + # Decode the generated completions + completions = self.processing_class.batch_decode( + completion_ids, skip_special_tokens=True + ) + if is_conversational(inputs[0]): + completions = [ + [{"role": "assistant", "content": completion}] + for completion in completions + ] + + # Compute the rewards + rewards_per_func = torch.zeros( + len(prompts), len(self.reward_funcs), device=device + ) + for i, (reward_func, reward_processing_class) in enumerate( + zip(self.reward_funcs, self.reward_processing_classes) + ): + if isinstance(reward_func, PreTrainedModel): + if is_conversational(inputs[0]): + messages = [ + {"messages": p + c} for p, c in zip(prompts, completions) + ] + texts = [ + apply_chat_template(x, reward_processing_class)["text"] + for x in messages + ] + else: + texts = [p + c for p, c in zip(prompts, completions)] + reward_inputs = reward_processing_class( + texts, + return_tensors="pt", + padding=True, + padding_side="right", + add_special_tokens=False, + ) + reward_inputs = super()._prepare_inputs(reward_inputs) + with torch.inference_mode(): + rewards_per_func[:, i] = reward_func(**reward_inputs).logits[ + :, 0 + ] # Shape (B*G,) + else: + # Repeat all input columns (but "prompt" and "completion") to match the number of generations + reward_kwargs = { + key: [] + for key in inputs[0].keys() + if key not in ["prompt", "completion"] + } + for key in reward_kwargs: + for example in inputs: + # Repeat each value in the column for `num_generations` 
times + reward_kwargs[key].extend([example[key]] * self.num_generations) + output_reward_func = reward_func( + prompts=prompts, completions=completions, **reward_kwargs + ) + rewards_per_func[:, i] = torch.tensor( + output_reward_func, dtype=torch.float32, device=device + ) + rewards_per_func = gather(rewards_per_func) + # Sum the rewards from all reward functions + rewards = rewards_per_func.sum(dim=1) + + # Compute grouped-wise rewards + mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1) + std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1) + + # Normalize the rewards to compute the advantages + mean_grouped_rewards = mean_grouped_rewards.repeat_interleave( + self.num_generations, dim=0 + ) + std_grouped_rewards = std_grouped_rewards.repeat_interleave( + self.num_generations, dim=0 + ) + advantages = (rewards - mean_grouped_rewards) / (std_grouped_rewards + 1e-4) + + # Slice to keep only the local part of the data + process_slice = slice( + self.accelerator.process_index * len(prompts), + (self.accelerator.process_index + 1) * len(prompts), + ) + advantages = advantages[process_slice] + + # Log the metrics + reward_per_func = rewards_per_func.mean(0) + for i, reward_func in enumerate(self.reward_funcs): + if isinstance( + reward_func, nn.Module + ): # Module instead of PretrainedModel for compat with compiled models + reward_func_name = reward_func.config._name_or_path.split("/")[-1] + else: + reward_func_name = reward_func.__name__ + self._metrics[f"rewards/{reward_func_name}"].append( + reward_per_func[i].item() + ) + + self._metrics["reward"].append(rewards.mean().item()) + self._metrics["reward_std"].append(std_grouped_rewards.mean().item()) + + return { + "prompt_ids": prompt_ids, + "prompt_mask": prompt_mask, + "completion_ids": completion_ids, + "completion_mask": completion_mask, + "ref_per_token_logps": ref_per_token_logps, + "advantages": advantages, + "pixel_values": pixel_values, + "image_grid_thw": 
image_grid_thw, + } + + def compute_loss( + self, model, inputs, return_outputs=False, num_items_in_batch=None + ): + if return_outputs: + raise ValueError("The GRPOTrainer does not support returning outputs") + # Compute the per-token log probabilities for the model + + prompt_ids, prompt_mask = inputs["prompt_ids"], inputs["prompt_mask"] + completion_ids, completion_mask = ( + inputs["completion_ids"], + inputs["completion_mask"], + ) + input_ids = torch.cat([prompt_ids, completion_ids], dim=1) + attention_mask = torch.cat([prompt_mask, completion_mask], dim=1) + pixel_values = inputs["pixel_values"] + image_grid_thw = inputs["image_grid_thw"] + logits_to_keep = completion_ids.size( + 1 + ) # we only need to compute the logits for the completion tokens + + per_token_logps = self._get_per_token_logps( + model, + input_ids, + attention_mask, + pixel_values, + image_grid_thw, + logits_to_keep, + ) + + # Compute the KL divergence between the model and the reference model + ref_per_token_logps = inputs["ref_per_token_logps"] + per_token_kl = ( + torch.exp(ref_per_token_logps - per_token_logps) + - (ref_per_token_logps - per_token_logps) + - 1 + ) + + # x - x.detach() allows for preserving gradients from x + advantages = inputs["advantages"] + per_token_loss = torch.exp( + per_token_logps - per_token_logps.detach() + ) * advantages.unsqueeze(1) + per_token_loss = -(per_token_loss - self.beta * per_token_kl) + loss = ( + (per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1) + ).mean() + + # Log the metrics + completion_length = ( + self.accelerator.gather_for_metrics(completion_mask.sum(1)) + .float() + .mean() + .item() + ) + self._metrics["completion_length"].append(completion_length) + + mean_kl = ( + (per_token_kl * completion_mask).sum(dim=1) / completion_mask.sum(dim=1) + ).mean() + self._metrics["kl"].append( + self.accelerator.gather_for_metrics(mean_kl).mean().item() + ) + + return loss + + def log(self, logs: dict[str, float], start_time: 
Optional[float] = None) -> None:
+        metrics = {
+            key: sum(val) / len(val) for key, val in self._metrics.items()
+        }  # average the metrics
+
+        # This method can be called both in training and evaluation. When called in evaluation, the keys in `logs`
+        # start with "eval_". We need to add the prefix "eval_" to the keys in `metrics` to match the format.
+        if next(iter(logs.keys())).startswith("eval_"):
+            metrics = {f"eval_{key}": val for key, val in metrics.items()}
+
+        logs = {**logs, **metrics}
+        if version.parse(transformers.__version__) >= version.parse("4.47.0.dev0"):
+            super().log(logs, start_time)
+        else:  # transformers<=4.46
+            super().log(logs)
+        self._metrics.clear()
diff --git a/setup.sh b/setup.sh
new file mode 100644
index 0000000000000000000000000000000000000000..33499206777b0d06eccb9e31175429107caa8da0
--- /dev/null
+++ b/setup.sh
@@ -0,0 +1,19 @@
+# Install the packages in r1-v.
+cd src/r1-v
+pip install -e ".[dev]"
+
+# Additional modules
+pip install wandb==0.18.3
+pip install tensorboardx
+pip install qwen_vl_utils torchvision
+pip install flash-attn --no-build-isolation
+
+# vLLM support
+pip install vllm==0.7.2
+
+pip install nltk
+pip install rouge_score
+pip install deepspeed
+
+# fix transformers version
+# pip install git+https://github.com/huggingface/transformers.git@336dc69d63d56f232a183a3e7f52790429b871ef
diff --git a/src/download.py b/src/download.py
new file mode 100644
index 0000000000000000000000000000000000000000..c2e71d4d66c8ebce788be2e50cead854535fb114
--- /dev/null
+++ b/src/download.py
@@ -0,0 +1,24 @@
+from huggingface_hub import snapshot_download
+
+# snapshot_download(
+#     repo_id="Video-R1/Video-R1-data",
+#     repo_type="dataset",  # Specify it's a dataset repo
+#     local_dir="Video-R1-data",  # Local directory to save data
+#     local_dir_use_symlinks=False  # Set False if you want full file copies
+# )
+
+
+# snapshot_download(
+#     repo_id="OpenGVLab/MVBench",
+#     repo_type="dataset",  # Specify it's a dataset repo
+#
local_dir="Evaluation/MVBench", # Local directory to save data +# local_dir_use_symlinks=False # Set False if you want full file copies +# ) + + +snapshot_download( + repo_id="yale-nlp/MMVU", + repo_type="dataset", # Specify it's a dataset repo + local_dir="Evaluation/MMVU", # Local directory to save data + local_dir_use_symlinks=False # Set False if you want full file copies +) diff --git a/src/eval_bench.py b/src/eval_bench.py new file mode 100644 index 0000000000000000000000000000000000000000..5f4f8f0622b57c3ed7ad9e75e0d011ac38842543 --- /dev/null +++ b/src/eval_bench.py @@ -0,0 +1,277 @@ +import os +import json +import re +from tqdm import tqdm +from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction +from rouge_score import rouge_scorer +import torch + +from transformers import AutoProcessor, AutoTokenizer +from vllm import LLM, SamplingParams +from qwen_vl_utils import process_vision_info +import argparse + + +BSZ = 64 + + +parser = argparse.ArgumentParser(description="Evaluation benchmark") +parser.add_argument('--model_path', type=str, required=True, help="Path to the model") +parser.add_argument('--file_name', type=str, required=True, help="Name of the file") +args = parser.parse_args() + +MODEL_PATH = args.model_path +file_name = args.file_name + + + +llm = LLM( + model=MODEL_PATH, + tensor_parallel_size=torch.cuda.device_count(), + # max_model_len = 8192 * 2, + max_model_len = 32768, + gpu_memory_utilization=0.75, + limit_mm_per_prompt={"image": 1, "video": 1}, +) + + +sampling_params = SamplingParams( + temperature=0.1, + top_p=0.001, + max_tokens=1024, + stop_token_ids=[], +) + + +processor = AutoProcessor.from_pretrained(MODEL_PATH) +tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH) +tokenizer.padding_side = "left" +processor.tokenizer = tokenizer + + +# for dataset_name in ['mvbench','tempcompass','videomme','videommmu','vsibench','mmvu']: +for dataset_name in ['mvbench', 'mmvu']: + + OUTPUT_PATH = 
f"./src/r1-v/eval_results/eval_{dataset_name}_{file_name}_greedy_output.json"
+    PROMPT_PATH = f"./src/r1-v/Evaluation/eval_{dataset_name}.json"
+
+    data = []
+    if PROMPT_PATH.endswith('.jsonl'):
+        with open(PROMPT_PATH, "r", encoding="utf-8") as f:
+            for line in f:
+                data.append(json.loads(line))
+    elif PROMPT_PATH.endswith('.json'):
+        with open(PROMPT_PATH, "r", encoding="utf-8") as f:
+            data = json.load(f)
+    else:
+        raise ValueError("Input file must be .json or .jsonl")
+
+    QUESTION_TEMPLATE = (
+        "{Question}\n"
+        "Please think about this question as if you were a human pondering deeply. "
+        "Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions. "
+        "It's encouraged to include self-reflection or verification in the reasoning process. "
+        "Provide your detailed reasoning between the <think> </think> tags, and then give your final answer between the <answer> </answer> tags."
+    )
+
+    TYPE_TEMPLATE = {
+        "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
+        "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
+        "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
+        "free-form": " Please provide your text answer within the <answer> </answer> tags.",
+        "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags."
+    }
+
+    messages = []
+    for x in data:
+        if x["problem_type"] == 'multiple choice':
+            question = x['problem'] + "Options:\n"
+            for op in x["options"]:
+                question += op + "\n"
+        else:
+            question = x['problem']
+
+        msg = [{
+            "role": "user",
+            "content": [
+                {
+                    "type": x['data_type'],
+                    # x['data_type']: os.getcwd() + "/src/r1-v/Evaluation" + x['path'][1:]
+                    x['data_type']: os.getcwd() + "/src/r1-v" + x['path'][1:]
+                },
+                {
+                    "type": "text",
+                    "text": QUESTION_TEMPLATE.format(Question=question) + TYPE_TEMPLATE[x['problem_type']]
+                }
+            ]
+        }]
+        messages.append(msg)
+
+    final_output = []
+    start_idx = 0
+    if os.path.exists(OUTPUT_PATH):
+        try:
+            with open(OUTPUT_PATH, "r", encoding="utf-8") as f:
+                existing = json.load(f)
+                final_output = existing.get("results", [])
+                start_idx = len(final_output)
+                print(f"Resuming from sample index {start_idx}")
+        except Exception as e:
+            print(f"Error reading existing output file: {e}")
+
+    def extract_think(output_str):
+        pattern = r'<think>\s*(.*?)\s*</think>'
+        match = re.search(pattern, output_str, re.DOTALL)
+        if match:
+            return match.group(1).strip()
+        return ""
+
+    def extract_answer(text):
+        pattern = r'<answer>\s*(.*?)\s*</answer>'
+        match = re.search(pattern, text, re.DOTALL)
+        if match:
+            return match.group(1).strip()
+        return ""
+
+    def normalize_number(num_str):
+        try:
+            num_str = num_str.replace(',', '')
+            return float(num_str)
+        except Exception:
+            return None
+
+    def mean_relative_accuracy(pred, target, start=0.5, end=0.95, interval=0.05):
+        # Fraction of confidence thresholds under which the relative error stays small
+        if not torch.is_tensor(pred):
+            pred = torch.tensor(pred, dtype=torch.float32)
+        if not torch.is_tensor(target):
+            target = torch.tensor(target, dtype=torch.float32)
+
+        epsilon = 1e-8
+        rel_error = torch.abs(pred - target) / (torch.abs(target) + epsilon)
+
+        thresholds = torch.arange(start, end + interval/2, interval, dtype=torch.float32)
+
+        conditions = rel_error < (1 - thresholds)
+        mra = conditions.float().mean()
+        return mra.item()
+
+    def reward_fn(sample, model_output, question_type):
try: + output_ans = extract_answer(model_output) + if output_ans == '': + output_ans = model_output + gt_ans = extract_answer(sample.get("solution", "")) + if question_type == "multiple choice": + return 1.0 if output_ans.strip() == gt_ans.strip() else 0.0 + elif question_type == "numerical": + gt_has_decimal = ("." in gt_ans) or ("," in gt_ans) + out_has_decimal = ("." in output_ans) or ("," in output_ans) + if gt_has_decimal != out_has_decimal: + return 0.0 + gt_number = normalize_number(gt_ans) + out_number = normalize_number(output_ans) + if gt_number is None or out_number is None: + return 0.0 + return 1.0 if round(gt_number, 2) == round(out_number, 2) else 0.0 + elif question_type == "regression": + gt_number = normalize_number(gt_ans) + out_number = normalize_number(output_ans) + if gt_number is None or out_number is None: + return 0.0 + mra = mean_relative_accuracy(out_number, gt_number) + return mra + else: + return 0.0 + except Exception as e: + return 0.0 + + mean_acc = [] + mean_mra = [] + for i in tqdm(range(start_idx, len(messages), BSZ), desc="Processing batches"): + batch_messages = messages[i:i + BSZ] + + prompts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batch_messages] + + + try: + image_inputs, video_inputs, video_kwargs = process_vision_info(batch_messages, return_video_kwargs=True) + + image_idx = 0 + video_idx = 0 + + llm_inputs = [] + + + for idx, prompt in enumerate(prompts): + mm_type = batch_messages[idx][0]['content'][0]['type'] + sample_mm_data = {} + sample_video_kw = {} + if mm_type == 'image': + sample_mm_data["image"] = image_inputs[image_idx] + image_idx += 1 + elif mm_type == 'video': + sample_mm_data["video"] = video_inputs[video_idx] + for key, value in video_kwargs.items(): + sample_video_kw[key] = value[video_idx] + video_idx += 1 + + + llm_inputs.append({ + "prompt": prompt, + "multi_modal_data": sample_mm_data, + "mm_processor_kwargs": sample_video_kw, + }) + + + outputs = 
llm.generate(llm_inputs, sampling_params=sampling_params)
+            batch_output_text = [out.outputs[0].text for out in outputs]
+
+        except Exception as e:
+            print('error:', data[i]['path'])
+            print('Exception:', e)
+            batch_output_text = ['error'] * BSZ
+
+        for j, (sample, model_output) in enumerate(zip(data[i:i+BSZ], batch_output_text), start=i):
+            think_chain = extract_think(model_output)
+            final_ans = extract_answer(model_output)
+            if final_ans == "":
+                final_ans = model_output
+            sample["output"] = model_output
+            sample["prediction"] = final_ans
+            q_type = sample.get("problem_type", "")
+            sample["reward"] = reward_fn(sample, model_output, q_type)
+            sample['correct'] = True if sample["reward"] == 1.0 else False
+            if sample['problem_type'] != 'regression':
+                mean_acc.append(sample["reward"])
+            else:
+                mean_mra.append(sample["reward"])
+            if think_chain:
+                sample["process"] = f"<think>{think_chain}</think>"
+            final_output.append(sample)
+
+        try:
+            with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
+                json.dump({"results": final_output}, f, indent=2, ensure_ascii=False)
+            print(f"Processed batch {(i - start_idx)//BSZ + 1}, saved {len(final_output)} samples.")
+        except Exception as e:
+            print(f"Error writing to output file: {e}")
+
+    final_acc = {'mean_acc': 0.0, 'mean_mra': 0.0}
+    final_acc['mean_acc'] = torch.tensor(mean_acc).mean().item()
+    if mean_mra != []:
+        final_acc['mean_mra'] = torch.tensor(mean_mra).mean().item()
+
+    try:
+        with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
+            json.dump({"results": final_output, "final_acc": [final_acc]}, f, indent=2, ensure_ascii=False)
+        print(f"Final accuracy saved to {OUTPUT_PATH}")
+    except Exception as e:
+        print(f"Error writing final accuracy to output file: {e}")
+
+    print(f"Results saved to {OUTPUT_PATH}")
diff --git a/src/eval_bench.sh b/src/eval_bench.sh
new file mode 100644
index 0000000000000000000000000000000000000000..7976f09010f62dfa7a923f7bdac93765d4fe3055
--- /dev/null
+++ b/src/eval_bench.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+#
run_models.sh + +# export HF_HOME=/apdcephfs_sh2/share_300000800/user/zongxia/hf_cache +# export TRANSFORMERS_CACHE=/apdcephfs_sh2/share_300000800/user/zongxia/hf_cache + +./move_eval.sh + +model_paths=( + # "Qwen/Qwen2.5-VL-3B-Instruct" + # "/apdcephfs_sh2/share_300000800/user/zongxia/Video-R1/src/r1-v/log/3B-Video-GRPO-NoDesEval/checkpoint-1000" + # "/apdcephfs_sh2/share_300000800/user/zongxia/Video-R1/src/r1-v/log/3B-Video-GRPO-selfEval-ThenNoDesEval/pool_numerical_chunk_02/checkpoint-42" + # "/apdcephfs_sh2/share_300000800/user/zongxia/Video-R1/src/r1-v/log/3B-Video-GRPO-AnswerBERT/video_pool_multiple_choice_chunk_02/checkpoint-46" + # "Video-R1/Video-R1-7B" + "zli12321/VideoHallu-R1-v3" + # "Qwen/Qwen2.5-VL-7B-Instruct" +) + +file_names=( + # "qwen_3B_base" + # "qwen_3B_noDesEval" + # "qwen_3B_answerBERT_thenNoDesEval" + # "qwen_3B_answerBERT_video12" + # "video-R1-7B" + "VideoHallu-R1-v3" + # "Qwen2.5-VL-7B-Instruct" +) + +export DECORD_EOF_RETRY_MAX=20480 + + +for i in "${!model_paths[@]}"; do + model="${model_paths[$i]}" + file_name="${file_names[$i]}" + CUDA_VISIBLE_DEVICES=0,1,2,3 python ./src/eval_bench.py --model_path "$model" --file_name "$file_name" +done diff --git a/src/eval_bench_4567.sh b/src/eval_bench_4567.sh new file mode 100644 index 0000000000000000000000000000000000000000..136a7914d35d92beb2bf38f85f6305039a1aa278 --- /dev/null +++ b/src/eval_bench_4567.sh @@ -0,0 +1,32 @@ +#!/bin/bash +# run_models.sh + +# export HF_HOME=/apdcephfs_sh2/share_300000800/user/zongxia/hf_cache +# export TRANSFORMERS_CACHE=/apdcephfs_sh2/share_300000800/user/zongxia/hf_cache + +./move_eval.sh + +model_paths=( + # "Qwen/Qwen2.5-VL-7B-Instruct" + # "/apdcephfs_sh2/share_300000800/user/zongxia/Video-R1/src/r1-v/log/3B-Video-GRPO-AnswerBERT/video_pool_multiple_choice_chunk_01/checkpoint-46" + # "/apdcephfs_sh2/share_300000800/user/zongxia/Video-R1/src/r1-v/log/3B-Video-GRPO-SelfEval-Train/pool_numerical_chunk_01/checkpoint-25" + # 
"/apdcephfs_sh2/share_300000800/user/zongxia/Video-R1/src/r1-v/log/3B-Video-GRPO-NoDesEval/pool_multiple_choice_chunk_01/checkpoint-57" + # "Video-R1/Qwen2.5-VL-7B-COT-SFT" + "zli12321/VideoHallu-R1-v1.0" +) + +file_names=( + # "qwen_3B_selfEval_mcq1_nume1" + # "qwen_3B_NoDesEval_mcq1" + # "Video-R1-7B-COT-SFT" + "VideoHallu-R1-v1.0" +) + +export DECORD_EOF_RETRY_MAX=20480 + + +for i in "${!model_paths[@]}"; do + model="${model_paths[$i]}" + file_name="${file_names[$i]}" + CUDA_VISIBLE_DEVICES=4,5,6,7 python ./src/eval_bench.py --model_path "$model" --file_name "$file_name" +done diff --git a/src/generate_cot_vllm.py b/src/generate_cot_vllm.py new file mode 100644 index 0000000000000000000000000000000000000000..3cfd9b403a4644007b729337230c041334ff8a0f --- /dev/null +++ b/src/generate_cot_vllm.py @@ -0,0 +1,266 @@ +import os +import json +import re +from tqdm import tqdm +from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction +from rouge_score import rouge_scorer +import torch + +from transformers import AutoProcessor, AutoTokenizer +from vllm import LLM, SamplingParams +from qwen_vl_utils import process_vision_info + + +MODEL_PATH = "Qwen/Qwen2.5-VL-72B-Instruct" +BSZ = 32 + + +llm = LLM( + model=MODEL_PATH, + tensor_parallel_size=torch.cuda.device_count(), + max_model_len = 8192, + gpu_memory_utilization=0.8, + limit_mm_per_prompt={"image": 10, "video": 10}, +) + +sampling_params = SamplingParams( + temperature=1.0, + top_p=0.95, + max_tokens=512, + stop_token_ids=[], +) + + +processor = AutoProcessor.from_pretrained(MODEL_PATH) +tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH) +tokenizer.padding_side = "left" +processor.tokenizer = tokenizer + +for dataset_name in ['your_data_name']: + + OUTPUT_PATH = f"./src/r1-v/Video-R1-data/{dataset_name}_COT_qwen72b.json" + PROMPT_PATH = f"./src/r1-v/Video-R1-data/{dataset_name}.json" + + data = [] + if PROMPT_PATH.endswith('.jsonl'): + with open(PROMPT_PATH, "r", encoding="utf-8") as f: + for line in f: 
+            data.append(json.loads(line))
+    elif PROMPT_PATH.endswith('.json'):
+        with open(PROMPT_PATH, "r", encoding="utf-8") as f:
+            data = json.load(f)
+    else:
+        raise ValueError("Input file must be .json or .jsonl")
+
+    QUESTION_TEMPLATE = (
+        "{Question}\n"
+        "Please think about this question as if you were a human pondering deeply. "
+        "Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions. "
+        "It's encouraged to include self-reflection or verification in the reasoning process. "
+        "Provide your detailed reasoning between the <think> </think> tags, and then give your final answer between the <answer> </answer> tags."
+    )
+
+    TYPE_TEMPLATE = {
+        "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
+        "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
+        "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
+        "free-form": " Please provide your text answer within the <answer> </answer> tags.",
+        "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags."
+    }
+
+    messages = []
+    for x in data:
+        if x["problem_type"] == 'multiple choice':
+            question = x['problem'] + "Options:\n"
+            for op in x["options"]:
+                question += op + "\n"
+        else:
+            question = x['problem']
+
+        msg = [{
+            "role": "user",
+            "content": [
+                {
+                    "type": x['data_type'],
+                    x['data_type']: os.getcwd() + "/src/r1-v/Video-R1-data" + x['path'][1:]
+                },
+                {
+                    "type": "text",
+                    "text": QUESTION_TEMPLATE.format(Question=question) + TYPE_TEMPLATE[x['problem_type']]
+                }
+            ]
+        }]
+        messages.append(msg)
+
+    # For resume
+    final_output = []
+    start_idx = 0
+    if os.path.exists(OUTPUT_PATH):
+        try:
+            with open(OUTPUT_PATH, "r", encoding="utf-8") as f:
+                existing = json.load(f)
+                final_output = existing.get("results", [])
+                start_idx = len(final_output)
+                print(f"Resuming from sample index {start_idx}")
+        except Exception as e:
+            print(f"Error reading existing output file: {e}")
+
+    def extract_think(output_str):
+        pattern = r'<think>\s*(.*?)\s*</think>'
+        match = re.search(pattern, output_str, re.DOTALL)
+        if match:
+            return match.group(1).strip()
+        return ""
+
+    def extract_answer(text):
+        pattern = r'<answer>\s*(.*?)\s*</answer>'
+        match = re.search(pattern, text, re.DOTALL)
+        if match:
+            return match.group(1).strip()
+        return ""
+
+    def normalize_number(num_str):
+        try:
+            num_str = num_str.replace(',', '')
+            return float(num_str)
+        except Exception as e:
+            print(f"Error converting '{num_str}' to float: {e}")
+            return None
+
+    def wer(reference, hypothesis):
+        # Word error rate via word-level Levenshtein distance
+        ref_words = reference.split()
+        hyp_words = hypothesis.split()
+        m = len(ref_words)
+        n = len(hyp_words)
+        d = [[0]*(n+1) for _ in range(m+1)]
+        for i in range(m+1):
+            d[i][0] = i
+        for j in range(n+1):
+            d[0][j] = j
+        for i in range(1, m+1):
+            for j in range(1, n+1):
+                if ref_words[i-1] == hyp_words[j-1]:
+                    d[i][j] = d[i-1][j-1]
+                else:
+                    d[i][j] = 1 + min(d[i-1][j], d[i][j-1], d[i-1][j-1])
+        return d[m][n] / max(1, m)
+
+    def compute_bleu_score(reference, hypothesis):
+        try:
+            smoothing = SmoothingFunction().method1
+            ref_tokens =
reference.split() + hyp_tokens = hypothesis.split() + score = sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smoothing) + return score + except Exception as e: + print(f"Error computing BLEU score: {e}") + return 0.0 + + def compute_rouge_score(reference, hypothesis, use_stemmer=True): + scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=use_stemmer) + scores = scorer.score(reference, hypothesis) + average_fmeasure = (scores['rouge1'].fmeasure + scores['rouge2'].fmeasure + scores['rougeL'].fmeasure) / 3 + return average_fmeasure + + def reward_fn(sample, model_output, question_type): + try: + output_ans = extract_answer(model_output) + gt_ans = extract_answer(sample.get("solution", "")) + if question_type == "multiple choice": + return 1.0 if output_ans.strip() == gt_ans.strip() else 0.0 + elif question_type == "numerical": + gt_has_decimal = ("." in gt_ans) or ("," in gt_ans) + out_has_decimal = ("." in output_ans) or ("," in output_ans) + if gt_has_decimal != out_has_decimal: + return 0.0 + gt_number = normalize_number(gt_ans) + out_number = normalize_number(output_ans) + if gt_number is None or out_number is None: + return 0.0 + return 1.0 if round(gt_number, 2) == round(out_number, 2) else 0.0 + elif question_type == "OCR": + error_rate = wer(gt_ans, output_ans) + reward = 1 - error_rate + return max(0.0, min(1.0, reward)) + elif question_type == "free-form": + score = compute_rouge_score(gt_ans, output_ans) + return max(0.0, min(1.0, score)) + elif question_type == "regression": + gt_number = normalize_number(gt_ans) + out_number = normalize_number(output_ans) + if gt_number is None or out_number is None: + return 0.0 + rel_diff = (abs(out_number - gt_number) + 1e-9) / (abs(gt_number) + 1e-9) + rel_diff = min(1.0, max(0.0, rel_diff)) + return 1 - rel_diff + else: + return 0.0 + except Exception as e: + print(f"Error in reward_fn for question_type '{question_type}': {e}") + return 0.0 + + + for i in tqdm(range(start_idx, 
len(messages), BSZ), desc="Processing batches"):
+        batch_messages = messages[i:i + BSZ]
+
+        prompts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in batch_messages]
+
+        try:
+            image_inputs, video_inputs, video_kwargs = process_vision_info(batch_messages, return_video_kwargs=True)
+
+            image_idx = 0
+            video_idx = 0
+
+            llm_inputs = []
+
+            for idx, prompt in enumerate(prompts):
+                mm_type = batch_messages[idx][0]['content'][0]['type']
+                sample_mm_data = {}
+                sample_video_kw = {}
+                if mm_type == 'image':
+                    sample_mm_data["image"] = image_inputs[image_idx]
+                    image_idx += 1
+                elif mm_type == 'video':
+                    sample_mm_data["video"] = video_inputs[video_idx]
+                    for key, value in video_kwargs.items():
+                        sample_video_kw[key] = value[video_idx]
+                    video_idx += 1
+
+                llm_inputs.append({
+                    "prompt": prompt,
+                    "multi_modal_data": sample_mm_data,
+                    "mm_processor_kwargs": sample_video_kw,
+                })
+
+            outputs = llm.generate(llm_inputs, sampling_params=sampling_params)
+            batch_output_text = [out.outputs[0].text for out in outputs]
+
+        except Exception as e:
+            print('error:', data[i]['path'])
+            print('Exception:', e)
+            batch_output_text = ['error'] * BSZ
+
+        for j, (sample, model_output) in enumerate(zip(data[i:i+BSZ], batch_output_text), start=i):
+            think_chain = extract_think(model_output)
+            final_ans = extract_answer(model_output)
+            sample["answer"] = final_ans
+            q_type = sample.get("problem_type", "")
+            sample["reward"] = reward_fn(sample, model_output, q_type)
+            sample['select'] = True if sample["reward"] > 0.6 else False
+            if think_chain:
+                sample["process"] = f"<think>{think_chain}</think>"
+            final_output.append(sample)
+
+        try:
+            with open(OUTPUT_PATH, "w", encoding="utf-8") as f:
+                json.dump({"results": final_output}, f, indent=2, ensure_ascii=False)
+            print(f"Processed batch {(i - start_idx)//BSZ + 1}, saved {len(final_output)} samples.")
+        except Exception as e:
+            print(f"Error writing to output file: {e}")
+
+    print(f"Results saved to {OUTPUT_PATH}")
diff --git
a/src/inference_example.py b/src/inference_example.py
new file mode 100644
index 0000000000000000000000000000000000000000..bc24c7bcd22c742ddc75281cf8da0a74611ae796
--- /dev/null
+++ b/src/inference_example.py
@@ -0,0 +1,93 @@
+import os
+import torch
+from vllm import LLM, SamplingParams
+from transformers import AutoProcessor, AutoTokenizer
+from qwen_vl_utils import process_vision_info
+
+# Set model path
+model_path = "Video-R1/Video-R1-7B"
+
+# Set video path and question
+video_path = "./src/example_video/video1.mp4"
+question = "Which move motion in the video loses the system energy?"
+
+# Choose the question type from 'multiple choice', 'numerical', 'OCR', 'free-form', 'regression'
+problem_type = 'free-form'
+
+# Initialize the LLM
+llm = LLM(
+    model=model_path,
+    tensor_parallel_size=1,
+    max_model_len=81920,
+    gpu_memory_utilization=0.8,
+    limit_mm_per_prompt={"video": 1, "image": 1},
+)
+
+sampling_params = SamplingParams(
+    temperature=0.1,
+    top_p=0.001,
+    max_tokens=1024,
+)
+
+# Load processor and tokenizer
+processor = AutoProcessor.from_pretrained(model_path)
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+tokenizer.padding_side = "left"
+processor.tokenizer = tokenizer
+
+# Prompt template
+QUESTION_TEMPLATE = (
+    "{Question}\n"
+    "Please think about this question as if you were a human pondering deeply. "
+    "Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions. "
+    "It's encouraged to include self-reflection or verification in the reasoning process. "
+    "Provide your detailed reasoning between the <think> </think> tags, and then give your final answer between the <answer> </answer> tags."
+)
+
+# Question type
+TYPE_TEMPLATE = {
+    "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
+    "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
+    "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
+    "free-form": " Please provide your text answer within the <answer> </answer> tags.",
+    "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags."
+}
+
+# Construct multimodal message
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "video",
+                "video": video_path,
+                "max_pixels": 200704,  # max pixels for each frame
+                "nframes": 32  # max frame number
+            },
+            {
+                "type": "text",
+                "text": QUESTION_TEMPLATE.format(Question=question) + TYPE_TEMPLATE[problem_type]
+            },
+        ],
+    }
+]
+
+# Convert to prompt string
+prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+# Process video input
+image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
+
+# Prepare vLLM input
+llm_inputs = [{
+    "prompt": prompt,
+    "multi_modal_data": {"video": video_inputs[0]},
+    "mm_processor_kwargs": {key: val[0] for key, val in video_kwargs.items()},
+}]
+
+# Run inference
+outputs = llm.generate(llm_inputs, sampling_params=sampling_params)
+output_text = outputs[0].outputs[0].text
+
+print(output_text)
diff --git a/src/scripts/run_grpo_video.sh b/src/scripts/run_grpo_video.sh
new file mode 100644
index 0000000000000000000000000000000000000000..5baf52db743f2799549acbf01667698ae70b70a1
--- /dev/null
+++ b/src/scripts/run_grpo_video.sh
@@ -0,0 +1,44 @@
+cd src/r1-v
+
+export DEBUG_MODE="true"  # Enable Debug if you want to see the rollout of the model during RL
+export LOG_PATH="./debug_log_2b.txt"
+
+# For resume training: --resume_from_checkpoint Model_Path \
+# Set temporal to choose between T-GRPO and GRPO, and len_control to enable or disable the length control reward.
+
+# Qwen/Qwen2.5-VL-3B-Instruct
+
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node="8" \
+    --nnodes="1" \
+    --node_rank="0" \
+    --master_addr="127.0.0.1" \
+    --master_port="12365" \
+    src/open_r1/grpo.py \
+    --output_dir "./log/Qwen2.5-VL-3B-GRPO" \
+    --model_name_or_path 'Qwen/Qwen2.5-VL-3B-Instruct' \
+    --dataset_name "./Video-R1-data/Video-R1-260k.json" \
+    --deepspeed local_scripts/zero3.json \
+    --max_prompt_length 16384 \
+    --max_completion_length 768 \
+    --per_device_train_batch_size 1 \
+    --gradient_accumulation_steps 1 \
+    --learning_rate 1e-6 \
+    --lr_scheduler_type "cosine" \
+    --weight_decay 0.01 \
+    --bf16 \
+    --logging_steps 1 \
+    --gradient_checkpointing true \
+    --temporal true \
+    --len_control true \
+    --attn_implementation flash_attention_2 \
+    --max_pixels 401408 \
+    --num_train_epochs 1 \
+    --run_name Video-R1 \
+    --save_steps 100 \
+    --beta 0.04 \
+    --max_grad_norm 5 \
+    --save_only_model false \
+    --num_generations 8  # number of rollouts G in GRPO; reducing it gives faster training and lower memory cost, but higher variance
diff --git a/src/unzip.py b/src/unzip.py
new file mode 100644
index 0000000000000000000000000000000000000000..d05a6a21cc70b3b1700ee5d7f639b14cd0fc49fd
--- /dev/null
+++ b/src/unzip.py
@@ -0,0 +1,24 @@
+import os
+import zipfile
+
+def extract_zip_files(root_dir):
+    """
+    Traverse the specified directory and all its subdirectories,
+    and extract all zip files.
+    Each zip file will be extracted into its containing directory.
+    """
+    for dirpath, _, filenames in os.walk(root_dir):
+        for filename in filenames:
+            if filename.lower().endswith('.zip'):
+                zip_path = os.path.join(dirpath, filename)
+                print(f"Extracting: {zip_path}")
+                try:
+                    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
+                        zip_ref.extractall(dirpath)
+                    print(f"Successfully extracted: {zip_path}")
+                except Exception as e:
+                    print(f"Failed to extract {zip_path}: {e}")
+
+if __name__ == '__main__':
+    root_directory = "./src/r1-v/Video-R1-data"
+    extract_zip_files(root_directory)
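Note: the batch-processing script earlier in this diff calls three helpers — `extract_think`, `extract_answer`, and `reward_fn` — that are defined elsewhere in the repo and not included here. The following is a minimal sketch of what they might look like, assuming the `<think>…</think>` / `<answer>…</answer>` output format used by the prompt templates, and assuming the ground truth is stored in a `sample["solution"]` field (both are assumptions, not taken from this diff):

```python
import re

def extract_think(text: str) -> str:
    """Return the reasoning between <think> and </think>, or '' if absent."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return m.group(1).strip() if m else ""

def extract_answer(text: str) -> str:
    """Return the final answer between <answer> and </answer>, or '' if absent."""
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return m.group(1).strip() if m else ""

def reward_fn(sample: dict, model_output: str, q_type: str) -> float:
    """Rule-based accuracy reward: 1.0 on a match, 0.0 otherwise.
    For 'numerical'/'regression' questions, values are compared with a
    5% relative tolerance (the exact tolerance here is an assumption)."""
    pred = extract_answer(model_output)
    gt = str(sample.get("solution", "")).strip()
    if not pred:
        return 0.0
    if q_type in ("numerical", "regression"):
        try:
            p, g = float(pred), float(gt)
            return 1.0 if abs(p - g) <= 0.05 * max(abs(g), 1e-8) else 0.0
        except ValueError:
            return 0.0
    return 1.0 if pred.lower() == gt.lower() else 0.0
```

With these helpers, a rollout whose extracted answer matches the ground truth yields reward 1.0, which clears the `reward > 0.6` threshold used by the script's `select` flag.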