huoyunhf committed on Mar 4

Commit

182ccb1

verified

1 Parent(s): f951c70

Upload folder using huggingface_hub


Files changed (18)

.gitattributes +1 -0
README.md +72 -3
added_tokens.json +28 -0
chat_template.jinja +120 -0
chat_template.json +4 -0
config.json +65 -0
eval_geo3k.py +180 -0
generation_config.json +13 -0
geo3k_test_2048_qwen3-vl-2b-geometry3k.json +0 -0
geo3k_workflow.py +60 -0
merges.txt +0 -0
model.safetensors +3 -0
preprocessor_config.json +21 -0
special_tokens_map.json +31 -0
tokenizer.json +3 -0
tokenizer_config.json +239 -0
video_preprocessor_config.json +21 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,3 +1,72 @@
----
-license: apache-2.0
----

+# Qwen3-VL-2B-Instruct Geometry3K Model
+This directory contains a Qwen3-VL-2B-Instruct model fine-tuned with **SFT (Supervised Fine-Tuning) + RL (Reinforcement Learning)** and optimized specifically for the Geometry3K geometric reasoning task.
+## Model Information
+- **Base Model**: [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
+- **Training Method**: SFT + RL
+- **Dataset**: Geometry3K
+- **Baseline Accuracy**: **0.2612**
+- **SFT+RL Accuracy**: **0.4692**
+## Directory Structure
+```
+Qwen3-VL-2B-Instruct-Geometry3k/
+├── README.md                                    # This file
+├── config.json                                  # Model configuration file
+├── generation_config.json                       # Generation configuration
+├── tokenizer_config.json                        # Tokenizer configuration
+├── tokenizer.json                               # Tokenizer file
+├── vocab.json                                   # Vocabulary file
+├── merges.txt                                   # BPE merges file
+├── chat_template.jinja                          # Chat template
+├── geo3k_test_2048_qwen3-vl-2b-geometry3k.json  # Test result data
+├── eval_geo3k.py                                # Evaluation script
+└── geo3k_workflow.py                            # Workflow script
+```
+## Usage
+### 1. Start Model Service
+Model inference is deployed using [vLLM](https://github.com/vllm-project/vllm):
+```bash
+# Start vLLM service, listening on specified port (e.g., 6049)
+vllm serve Qwen3-VL-2B-Instruct-Geometry3k --port 6049
+```
+### 2. Run Evaluation
+The evaluation script uses [rLLM](https://github.com/rllm/rllm) and calls the vLLM service started above through its OpenAI-compatible API:
+```bash
+python eval_geo3k.py --port 6049 --model_name Qwen3-VL-2B-Instruct-Geometry3k
+```
+**Dependency versions**:
+- vLLM: 0.11.0 (model serving)
+- rLLM: 0.2.1 (evaluation pipeline)
+## Performance Metrics
+| Method    | Accuracy |
+|-----------|----------|
+| Baseline  | 0.2612   |
+| SFT+RL    | 0.4692   |
+## Notes
+1. The model weights are stored in BF16, so a GPU with BF16 support is recommended
+2. The LoRA weights are already merged into the model, so it can be used directly without loading additional adapters
+3. Evaluation script: `eval_geo3k.py`. Optional parameters: `--n_parallel_tasks` (default 128), `--max_length` (default 2048)
+## Citation
+If you use this model, please cite:
+- **Geometry3K**: [hiyouga/geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) on Hugging Face (converted from [InterGPS](https://github.com/lupantech/InterGPS))
+- **GRPO**: [DeepSeekMath](https://arxiv.org/abs/2402.03300) - Group Relative Policy Optimization, arXiv:2402.03300
+- **Qwen-VL**: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv preprint arXiv:2308.12966
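
To query the served model directly instead of going through `eval_geo3k.py`, a request body along the following lines should work. This is a minimal sketch that assumes the vLLM server from the Usage section accepts the standard OpenAI-style `image_url` content part with a base64 data URL. The helper name `build_geo3k_request` and the placeholder image bytes are illustrative and not part of this commit.

```python
import base64
import json

# Same suffix that eval_geo3k.py appends to every problem statement.
INSTRUCTION = "Let's think step by step and output your final answer in \\boxed{}."


def build_geo3k_request(model_name: str, question: str, image_bytes: bytes,
                        max_tokens: int = 2048) -> dict:
    """Build a chat-completions request body (hypothetical helper, not in the repo)."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model_name,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                # eval_geo3k.py concatenates problem and instruction without a separator.
                {"type": "text", "text": question + INSTRUCTION},
            ],
        }],
        "temperature": 0.01,  # sampling temperature used by eval_geo3k.py
        "max_tokens": max_tokens,
    }


# Placeholder bytes stand in for a real PNG diagram.
payload = build_geo3k_request("Qwen3-VL-2B-Instruct-Geometry3k", "Find x.", b"placeholder-image-bytes")
body = json.dumps(payload)
```

The resulting `body` can be sent as a POST request to `http://localhost:6049/v1/chat/completions` with any HTTP client, using the port from the README example.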

added_tokens.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "</think>": 151668,
+  "</tool_call>": 151658,
+  "</tool_response>": 151666,
+  "<think>": 151667,
+  "<tool_call>": 151657,
+  "<tool_response>": 151665,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}

chat_template.jinja ADDED

	@@ -0,0 +1,120 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0].role == 'system' %}
+        {%- if messages[0].content is string %}
+            {{- messages[0].content }}
+        {%- else %}
+            {%- for content in messages[0].content %}
+                {%- if 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '\n\n' }}
+    {%- endif %}
+    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {{- '<|im_start|>system\n' }}
+        {%- if messages[0].content is string %}
+            {{- messages[0].content }}
+        {%- else %}
+            {%- for content in messages[0].content %}
+                {%- if 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- set image_count = namespace(value=0) %}
+{%- set video_count = namespace(value=0) %}
+{%- for message in messages %}
+    {%- if message.role == "user" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content in message.content %}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                    <|vision_start|><|image_pad|><|vision_end|>
+                {%- elif content.type == 'video' or 'video' in content %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                    <|vision_start|><|video_pad|><|vision_end|>
+                {%- elif 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role + '\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content_item in message.content %}
+                {%- if 'text' in content_item %}
+                    {{- content_item.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {%- if message.tool_calls %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if (loop.first and message.content) or (not loop.first) %}
+                    {{- '\n' }}
+                {%- endif %}
+                {%- if tool_call.function %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {{- '<tool_call>\n{"name": "' }}
+                {{- tool_call.name }}
+                {{- '", "arguments": ' }}
+                {%- if tool_call.arguments is string %}
+                    {{- tool_call.arguments }}
+                {%- else %}
+                    {{- tool_call.arguments | tojson }}
+                {%- endif %}
+                {{- '}\n</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {%- if message.content is string %}
+            {{- message.content }}
+        {%- else %}
+            {%- for content in message.content %}
+                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
+                    <|vision_start|><|image_pad|><|vision_end|>
+                {%- elif content.type == 'video' or 'video' in content %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
+                    <|vision_start|><|video_pad|><|vision_end|>
+                {%- elif 'text' in content %}
+                    {{- content.text }}
+                {%- endif %}
+            {%- endfor %}
+        {%- endif %}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
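
To make the template above easier to follow, here is a pure-Python approximation of what it renders for a single user message when no tools and no system prompt are present. It is a sketch written from reading the template, not a replacement for rendering it with Jinja; the function name `render_user_turn` is made up for illustration.

```python
def render_user_turn(parts, add_vision_id=False, add_generation_prompt=True):
    """Approximate the template output for one user message given as a list of content parts."""
    out = ["<|im_start|>user\n"]
    image_count = 0
    video_count = 0
    for part in parts:
        if part.get("type") == "image" or "image" in part or "image_url" in part:
            image_count += 1
            if add_vision_id:
                out.append(f"Picture {image_count}: ")
            out.append("<|vision_start|><|image_pad|><|vision_end|>")
        elif part.get("type") == "video" or "video" in part:
            video_count += 1
            if add_vision_id:
                out.append(f"Video {video_count}: ")
            out.append("<|vision_start|><|video_pad|><|vision_end|>")
        elif "text" in part:
            out.append(part["text"])
    out.append("<|im_end|>\n")
    if add_generation_prompt:
        # The assistant header is left open so the model continues from here.
        out.append("<|im_start|>assistant\n")
    return "".join(out)


prompt = render_user_turn([{"type": "image"}, {"type": "text", "text": "Find x."}])
```

Each image contributes a single `<|image_pad|>` placeholder at this stage; expanding it to the real number of visual tokens is left to the processor.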

chat_template.json ADDED

	@@ -0,0 +1,4 @@

+{
+    "chat_template": "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if messages[0].role == 'system' %}\n        {%- if messages[0].content is string %}\n            {{- messages[0].content }}\n        {%- else %}\n            {%- for content in messages[0].content %}\n                {%- if 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '\\n\\n' }}\n    {%- endif %}\n    {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0].role == 'system' %}\n        {{- '<|im_start|>system\\n' }}\n        {%- if messages[0].content is string %}\n            {{- messages[0].content }}\n        {%- else %}\n            {%- for content in messages[0].content %}\n                {%- if 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- set image_count = namespace(value=0) %}\n{%- set video_count = namespace(value=0) %}\n{%- for message in messages %}\n    {%- if message.role == \"user\" %}\n        {{- '<|im_start|>' + message.role + '\\n' }}\n        {%- if message.content is string %}\n            {{- message.content }}\n        {%- else %}\n            {%- for content in message.content %}\n                {%- if content.type == 'image' or 'image' in content or 'image_url' in 
content %}\n                    {%- set image_count.value = image_count.value + 1 %}\n                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}\n                    <|vision_start|><|image_pad|><|vision_end|>\n                {%- elif content.type == 'video' or 'video' in content %}\n                    {%- set video_count.value = video_count.value + 1 %}\n                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}\n                    <|vision_start|><|video_pad|><|vision_end|>\n                {%- elif 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {{- '<|im_start|>' + message.role + '\\n' }}\n        {%- if message.content is string %}\n            {{- message.content }}\n        {%- else %}\n            {%- for content_item in message.content %}\n                {%- if 'text' in content_item %}\n                    {{- content_item.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if (loop.first and message.content) or (not loop.first) %}\n                    {{- '\\n' }}\n                {%- endif %}\n                {%- if tool_call.function %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- '<tool_call>\\n{\"name\": \"' }}\n                {{- tool_call.name }}\n                {{- '\", \"arguments\": ' }}\n                {%- if tool_call.arguments is string %}\n                    {{- tool_call.arguments }}\n                {%- else %}\n                    {{- tool_call.arguments | tojson }}\n                {%- endif %}\n                {{- '}\\n</tool_call>' }}\n            {%- endfor %}\n        {%- 
endif %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {%- if message.content is string %}\n            {{- message.content }}\n        {%- else %}\n            {%- for content in message.content %}\n                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}\n                    {%- set image_count.value = image_count.value + 1 %}\n                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}\n                    <|vision_start|><|image_pad|><|vision_end|>\n                {%- elif content.type == 'video' or 'video' in content %}\n                    {%- set video_count.value = video_count.value + 1 %}\n                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}\n                    <|vision_start|><|video_pad|><|vision_end|>\n                {%- elif 'text' in content %}\n                    {{- content.text }}\n                {%- endif %}\n            {%- endfor %}\n        {%- endif %}\n        {{- '\\n</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n"
+}

config.json ADDED

	@@ -0,0 +1,65 @@

+{
+  "architectures": [
+    "Qwen3VLForConditionalGeneration"
+  ],
+  "dtype": "bfloat16",
+  "image_token_id": 151655,
+  "model_type": "qwen3_vl",
+  "text_config": {
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "bos_token_id": 151643,
+    "dtype": "bfloat16",
+    "eos_token_id": 151645,
+    "head_dim": 128,
+    "hidden_act": "silu",
+    "hidden_size": 2048,
+    "initializer_range": 0.02,
+    "intermediate_size": 6144,
+    "max_position_embeddings": 262144,
+    "model_type": "qwen3_vl_text",
+    "num_attention_heads": 16,
+    "num_hidden_layers": 28,
+    "num_key_value_heads": 8,
+    "rms_norm_eps": 1e-06,
+    "rope_scaling": {
+      "mrope_interleaved": true,
+      "mrope_section": [
+        24,
+        20,
+        20
+      ],
+      "rope_type": "default"
+    },
+    "rope_theta": 5000000,
+    "tie_word_embeddings": true,
+    "use_cache": true,
+    "vocab_size": 151936
+  },
+  "tie_word_embeddings": true,
+  "transformers_version": "4.57.1",
+  "video_token_id": 151656,
+  "vision_config": {
+    "deepstack_visual_indexes": [
+      5,
+      11,
+      17
+    ],
+    "depth": 24,
+    "dtype": "bfloat16",
+    "hidden_act": "gelu_pytorch_tanh",
+    "hidden_size": 1024,
+    "in_channels": 3,
+    "initializer_range": 0.02,
+    "intermediate_size": 4096,
+    "model_type": "qwen3_vl",
+    "num_heads": 16,
+    "num_position_embeddings": 2304,
+    "out_hidden_size": 2048,
+    "patch_size": 16,
+    "spatial_merge_size": 2,
+    "temporal_patch_size": 2
+  },
+  "vision_end_token_id": 151653,
+  "vision_start_token_id": 151652
+}
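
The token IDs in `config.json` have to agree with the tokenizer files in this commit. The sketch below cross-checks them using values copied from the `added_tokens.json` and `config.json` diffs above, and derives two attention quantities from `text_config`. The dictionaries are trimmed copies for illustration; a real check would load the JSON files from disk.

```python
# Subset of added_tokens.json from this commit.
added_tokens = {
    "<|endoftext|>": 151643,
    "<|im_end|>": 151645,
    "<|vision_start|>": 151652,
    "<|vision_end|>": 151653,
    "<|image_pad|>": 151655,
    "<|video_pad|>": 151656,
}

# Subset of config.json from this commit.
config = {
    "image_token_id": 151655,
    "video_token_id": 151656,
    "vision_start_token_id": 151652,
    "vision_end_token_id": 151653,
    "text_config": {
        "bos_token_id": 151643,
        "eos_token_id": 151645,
        "head_dim": 128,
        "num_attention_heads": 16,
        "num_key_value_heads": 8,
        "rope_scaling": {"mrope_section": [24, 20, 20]},
    },
}

# Which token string each config field is expected to point at.
expected = {
    "image_token_id": "<|image_pad|>",
    "video_token_id": "<|video_pad|>",
    "vision_start_token_id": "<|vision_start|>",
    "vision_end_token_id": "<|vision_end|>",
}
mismatches = {k: t for k, t in expected.items() if config[k] != added_tokens[t]}

text = config["text_config"]
# Grouped-query attention: how many query heads share one key/value head.
queries_per_kv_head = text["num_attention_heads"] // text["num_key_value_heads"]
# The three mrope sections together cover half of head_dim (one entry per rotary pair).
rotary_pairs = sum(text["rope_scaling"]["mrope_section"])
```

With the values in this commit, `mismatches` is empty, each key/value head serves two query heads, and the mrope sections sum to 64, which is `head_dim // 2`.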

eval_geo3k.py ADDED

	@@ -0,0 +1,180 @@

+import asyncio
+import json
+import os
+from copy import deepcopy
+from datetime import datetime
+from geo3k_workflow import Geo3KWorkflow
+from datasets import load_dataset
+from rllm.data.dataset import DatasetRegistry
+from rllm.engine import AgentWorkflowEngine, OpenAIEngine
+from rllm.rewards.reward_fn import math_reward_fn
+def load_data(n=1):
+    """Load geo3k data using the Dataset interface."""
+    dataset = load_dataset("hiyouga/geometry3k")
+    test_dataset = dataset["test"]
+    instruction_following = "Let's think step by step and output your final answer in \\boxed{}."
+    def process_fn(example, idx):
+        problem = example.pop("problem")
+        prompt = problem + instruction_following
+        answer = example.pop("answer")
+        image = example.pop("images")
+        data = {
+            "idx": idx,
+            "data_source": "geo3k",
+            "image": image,
+            "question": prompt,
+            "ground_truth": answer,
+        }
+        return data
+    # Preprocess datasets
+    test_dataset = test_dataset.map(function=process_fn, with_indices=True, num_proc=8)
+    data = []
+    for idx, example in enumerate(test_dataset):
+        for i in range(n):
+            data.append(deepcopy(example))
+    return data
+def _make_json_serializable(obj):
+    """Recursively replace non-JSON-serializable values (e.g. PIL.Image) with placeholders."""
+    try:
+        from PIL import Image
+        if isinstance(obj, Image.Image):
+            return "<PIL.Image>"
+    except ImportError:
+        pass
+    if getattr(obj, "__class__", None) and getattr(obj.__class__, "__name__", "") in (
+        "PngImageFile",
+        "JpegImageFile",
+        "Image",
+    ):
+        return "<PIL.Image>"
+    if isinstance(obj, dict):
+        return {k: _make_json_serializable(v) for k, v in obj.items()}
+    if isinstance(obj, (list, tuple)):
+        return [_make_json_serializable(x) for x in obj]
+    try:
+        json.dumps(obj)
+        return obj
+    except (TypeError, ValueError):
+        return f"<{type(obj).__name__}>"
+def evaluate_results(results):
+    """Evaluate the results and compute pass@k metrics."""
+    from collections import defaultdict
+    # Create a map to store correct answers per problem
+    problem_correct_map = defaultdict(int)
+    problem_total_map = defaultdict(int)
+    # Count correct answers for each problem
+    for episode in results:
+        idx = episode.task["idx"]
+        # Use the episode-level is_correct flag set by the workflow
+        is_correct = episode.is_correct
+        problem_correct_map[idx] += int(is_correct)
+        problem_total_map[idx] += 1
+    # Calculate pass@1 and pass@k
+    k = max(problem_total_map.values()) if problem_total_map else 1
+    total_problems = len(problem_correct_map)
+    if total_problems > 0:
+        pass_at_1 = sum(problem_correct_map.values()) / sum(problem_total_map.values())
+        pass_at_k = sum(1 for idx, correct in problem_correct_map.items() if correct > 0) / total_problems
+    else:
+        pass_at_1 = 0.0
+        pass_at_k = 0.0
+    print("Total unique problems:", total_problems)
+    print("Average Pass@1 Accuracy:", pass_at_1)
+    print(f"Average Pass@{k} Accuracy:", pass_at_k)
+if __name__ == "__main__":
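+    import argparse
+    parser = argparse.ArgumentParser(description="Evaluate a model on the Geometry3K test split through a vLLM OpenAI-compatible endpoint")
+    parser.add_argument(
+        "--n_parallel_tasks",
+        type=int,
+        default=128,
+        help="Number of tasks to run concurrently",
+    )
+    parser.add_argument(
+        "--model_name",
+        type=str,
+        required=True,
+        help="Model name as served by vLLM",
+    )
+    parser.add_argument(
+        "--max_length",
+        type=int,
+        default=2048,
+        help="Maximum response length in tokens",
+    )
+    parser.add_argument(
+        "--port",
+        type=int,
+        required=True,
+        help="Port of the local vLLM server",
+    )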
+    args = parser.parse_args()
+    os.environ["TOKENIZERS_PARALLELISM"] = "true"
+    n_parallel_tasks = args.n_parallel_tasks
+    model_name = args.model_name
+    base_url = f"http://localhost:{args.port}/v1"
+    max_length = args.max_length
+    print(f"Using model: {model_name} with base URL: {base_url}")
+    print(f"Using n_parallel_tasks: {n_parallel_tasks}")
+    rollout_engine = OpenAIEngine(
+        model=model_name,
+        max_prompt_length=1024,
+        max_response_length=max_length,
+        base_url=base_url,
+        api_key="None",
+        sampling_params={"temperature": 0.01},
+    )
+    engine = AgentWorkflowEngine(
+        workflow_cls=Geo3KWorkflow,
+        workflow_args={
+            "reward_function": math_reward_fn,
+            "encode_as_base64": True,
+        },
+        rollout_engine=rollout_engine,
+        config=None,
+        n_parallel_tasks=n_parallel_tasks,
+        retry_limit=1,
+    )
+    tasks = load_data(n=1)
+    print(f"Loaded {len(tasks)} geo3k tasks")
+    results = asyncio.run(engine.execute_tasks(tasks))
+    # Evaluate results (rewards are already assigned in the workflow)
+    print("Evaluating results...")
+    evaluate_results(results)
+    # Save results to logs_test/ under the current working directory; the filename includes the model name and a timestamp
+    log_dir = os.path.join(os.getcwd(), "logs_test")
+    os.makedirs(log_dir, exist_ok=True)
+    # Replace path separators and spaces so a model name such as "org/model" still yields a valid filename
+    safe_model = model_name.replace("/", "_").replace(" ", "_")
+    log_name = f"geo3k_test_{max_length}_{safe_model}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
+    log_path = os.path.join(log_dir, log_name)
+    with open(log_path, "w") as f:
+        json.dump(
+            [_make_json_serializable(episode.to_dict()) for episode in results],
+            f,
+            indent=4,
+        )
+    print(f"\nResults saved to {log_path}")
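
The metric logic in `evaluate_results` can be tried on toy data without a model server. The sketch below mirrors its counting with plain `(problem_idx, is_correct)` pairs in place of rLLM episodes; the function name `pass_metrics` and the toy outcomes are illustrative.

```python
from collections import defaultdict


def pass_metrics(outcomes):
    """outcomes: iterable of (problem_idx, is_correct) pairs, one per sampled attempt."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for idx, ok in outcomes:
        correct[idx] += int(ok)
        total[idx] += 1
    if not total:
        return 0.0, 0.0
    # Pass@1: fraction of all attempts that were correct.
    pass_at_1 = sum(correct.values()) / sum(total.values())
    # Pass@k: fraction of problems with at least one correct attempt.
    pass_at_k = sum(1 for c in correct.values() if c > 0) / len(total)
    return pass_at_1, pass_at_k


# Three problems, two attempts each.
toy = [(0, True), (0, False), (1, False), (1, False), (2, True), (2, True)]
p1, pk = pass_metrics(toy)
```

Because the script calls `load_data(n=1)`, every problem gets exactly one attempt there, so the two printed numbers coincide.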

generation_config.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "bos_token_id": 151643,
+  "do_sample": true,
+  "eos_token_id": [
+    151645,
+    151643
+  ],
+  "pad_token_id": 151643,
+  "temperature": 0.7,
+  "top_k": 20,
+  "top_p": 0.8,
+  "transformers_version": "4.57.1"
+}
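
As a rough illustration of how the `temperature`, `top_k`, and `top_p` defaults above shape the next-token distribution, the sketch below applies them in that order to a toy list of logits. It is a simplified stand-in, not the code path of any particular inference library, and the function name is hypothetical. Note that `eval_geo3k.py` overrides the temperature with 0.01 in its request.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    # Top-k: keep the indices of the k largest scaled logits, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = scaled[order[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in order}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


dist = sampling_distribution([2.0, 1.0, 0.5, -1.0])
```

Lower temperatures sharpen the distribution before filtering, which is why a value near zero behaves almost like greedy decoding.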

geo3k_test_2048_qwen3-vl-2b-geometry3k.json ADDED

The diff for this file is too large to render. See raw diff

geo3k_workflow.py ADDED

	@@ -0,0 +1,60 @@

+from io import BytesIO
+from PIL import Image
+from rllm.agents.agent import Action, Episode, Step, Trajectory
+from rllm.engine import ModelOutput, RolloutEngine
+from rllm.rewards.reward_fn import RewardFunction, math_reward_fn
+from rllm.workflows.simple_workflow import SimpleAgent
+from rllm.workflows.workflow import TerminationEvent, TerminationReason, Workflow
+class Geo3KWorkflow(Workflow):
+    def __init__(self, rollout_engine: RolloutEngine, reward_function: RewardFunction = None, encode_as_base64: bool = False, **kwargs):
+        """
+        Args:
+            encode_as_base64: Deprecated, kept for backward compatibility. Ignored.
+        """
+        super().__init__(rollout_engine, **kwargs)
+        self.agent = SimpleAgent()
+        self.reward_fn: RewardFunction = reward_function or math_reward_fn
+    async def run(self, task: dict, uid: str, **kwargs) -> Episode:
+        self.reset(task, uid)
+        question = task.get("question")
+        image = task.get("image", task.get("images", None))
+        if isinstance(image, list) and len(image) > 0:
+            image = image[0]
+        if isinstance(image, dict) and "bytes" in image:
+            image = Image.open(BytesIO(image["bytes"]))
+        assert isinstance(image, Image.Image) or image is None, f"Image must be a PIL.Image.Image, but got {type(image)}"
+        # Standard format: content is text, images is list[PIL.Image]
+        # Conversion to backend-specific format happens in rollout engine/renderer
+        if image is not None:
+            messages = [{"role": "user", "content": question, "images": [image]}]
+        else:
+            messages = [{"role": "user", "content": question}]
+        output: ModelOutput = await self.rollout_engine.get_model_response(messages, application_id=uid, **kwargs)
+        action = Action(output.content)
+        reward_result = self.reward_fn(task, action)
+        trajectory: Trajectory = self.agent.trajectory
+        trajectory.steps.append(
+            Step(
+                chat_completions=messages + [{"role": "assistant", "content": output.content, "reasoning": output.reasoning}],
+                thought=output.reasoning,
+                action=action,
+                reward=reward_result.reward,
+                model_output=output,
+            )
+        )
+        self.commit(agent=self.agent, reset=True)
+        if output.finish_reason == "length":
+            raise TerminationEvent(TerminationReason.MAX_RESPONSE_LENGTH_EXCEEDED)
+        raise TerminationEvent(TerminationReason.ENV_DONE)
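
The workflow scores each response with `math_reward_fn`, whose implementation lives in rLLM and is not part of this commit. To illustrate the general idea of grading a `\boxed{}` answer, here is a deliberately simplified stand-in: it extracts the last boxed expression and compares it to the ground truth by exact string match. Both function names are hypothetical, and the real reward function may normalize and compare answers differently.

```python
def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None if absent or unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # closing brace never found


def toy_reward(response: str, ground_truth: str) -> float:
    """Exact-match reward on the boxed answer (illustrative only)."""
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer.strip() == ground_truth.strip() else 0.0


score = toy_reward("The ratio is \\boxed{\\frac{3}{2}}.", "\\frac{3}{2}")
```

Tracking brace depth lets the extractor handle nested LaTeX such as `\frac{3}{2}` inside the box.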

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:559374bc7adf7c19601393456f409db02bf2f6601e9ce0c882c9978a6aa2733b
+size 4255140312
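
The three lines above are a Git LFS pointer rather than the weights themselves. The sketch below parses the pointer and turns the byte size into a rough parameter estimate. The estimate assumes every tensor is stored as BF16 (2 bytes per value) and ignores the safetensors header, so it is only approximate; the function name is illustrative.

```python
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:559374bc7adf7c19601393456f409db02bf2f6601e9ce0c882c9978a6aa2733b
size 4255140312
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its version, hash and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


info = parse_lfs_pointer(pointer_text)
size_gib = info["size"] / 2**30
# Assumption: all weights are BF16, i.e. 2 bytes per parameter.
approx_params = info["size"] / 2
```

Under that assumption the file holds a little over 2.1 billion parameters, in line with the "2B" in the model name.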

preprocessor_config.json ADDED

	@@ -0,0 +1,21 @@

+{
+    "size": {
+        "longest_edge": 16777216,
+        "shortest_edge": 65536
+    },
+    "patch_size": 16,
+    "temporal_patch_size": 2,
+    "merge_size": 2,
+    "image_mean": [
+        0.5,
+        0.5,
+        0.5
+    ],
+    "image_std": [
+        0.5,
+        0.5,
+        0.5
+    ],
+    "processor_class": "Qwen3VLProcessor",
+    "image_processor_type": "Qwen2VLImageProcessorFast"
+}
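
The numbers in this preprocessor config translate into a visual token budget per image. The sketch below works through the arithmetic under one assumption: that `shortest_edge` and `longest_edge` are total pixel budgets (minimum and maximum pixels per image), as in Qwen2-VL-style image processors, rather than literal edge lengths. It also shows what a mean and standard deviation of 0.5 do to pixel values.

```python
patch_size = 16
merge_size = 2
min_pixels = 65536       # "shortest_edge" in the config above
max_pixels = 16777216    # "longest_edge" in the config above

# After spatial merging, one visual token covers (patch_size * merge_size)^2 pixels.
pixels_per_token = (patch_size * merge_size) ** 2
min_tokens = min_pixels // pixels_per_token
max_tokens = max_pixels // pixels_per_token


def normalize(value_0_255: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Rescale an 8-bit channel value to [0, 1], then normalize with the configured mean and std."""
    return (value_0_255 / 255.0 - mean) / std


low, high = normalize(0), normalize(255)
```

Under that assumption an image maps to between 64 and 16384 visual tokens, and normalized pixel values fall in the range [-1, 1].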

special_tokens_map.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
+size 11422654

tokenizer_config.json ADDED

	@@ -0,0 +1,239 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151666": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151667": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151668": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 262144,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

video_preprocessor_config.json ADDED

	@@ -0,0 +1,21 @@

+{
+    "size": {
+        "longest_edge": 25165824,
+        "shortest_edge": 4096
+    },
+    "patch_size": 16,
+    "temporal_patch_size": 2,
+    "merge_size": 2,
+    "image_mean": [
+        0.5,
+        0.5,
+        0.5
+    ],
+    "image_std": [
+        0.5,
+        0.5,
+        0.5
+    ],
+    "processor_class": "Qwen3VLProcessor",
+    "video_processor_type": "Qwen3VLVideoProcessor"
+}

vocab.json ADDED

The diff for this file is too large to render. See raw diff