---
base_model: zai-org/Glyph
library_name: transformers
license: other
pipeline_tag: image-text-to-text
tags:
- llama-factory
- full
- generated_from_trainer
- vision-language-model
- reasoning
model-index:
- name: vtc-r1-glyph
  results: []
---
# VTC-R1-Glyph
VTC-R1 (Vision-Text Compression for Efficient Long-Context Reasoning) is an efficient reasoning paradigm that integrates vision-text compression into the reasoning process. This repository contains a version of [zai-org/Glyph](https://huggingface.co/zai-org/Glyph) (based on GLM-4V) fine-tuned with this paradigm.
- **Paper:** [VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning](https://huggingface.co/papers/2601.22069)
- **Repository:** [https://github.com/w-yibo/VTC-R1](https://github.com/w-yibo/VTC-R1)
## Model Description
VTC-R1 addresses efficiency bottlenecks in long-context reasoning for Vision-Language Models (VLMs). Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into the model as "optical memory."
Key features:
- **Efficiency:** Achieves 3.4x token compression and 2.7x speedup in end-to-end latency.
- **Performance:** Outperforms standard long-context reasoning on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D.
- **Scalability:** Integrates vision-text compression directly into the reasoning process without needing external compression models.
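The iterative "optical memory" loop described in this section can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: `iterative_reasoning`, `generate_segment`, `render_segment`, and `max_rounds` are hypothetical names, and the toy stand-ins below replace the real VLM and renderer.

```python
from typing import Callable, List, Optional, Tuple


def iterative_reasoning(
    question: str,
    generate_segment: Callable[[str, List[str]], Tuple[str, bool]],
    render_segment: Callable[[str], str],
    max_rounds: int = 4,  # hypothetical cap on the number of reasoning rounds
) -> Tuple[Optional[str], List[str]]:
    """Generate reasoning in segments, storing each finished segment as an image."""
    memory: List[str] = []  # rendered images of earlier segments ("optical memory")
    for _ in range(max_rounds):
        # The model sees the question plus the images of all earlier segments.
        segment, done = generate_segment(question, memory)
        if done:
            return segment, memory
        # Compress the textual segment into an image before the next round.
        memory.append(render_segment(segment))
    return None, memory  # no final answer within the round budget


def fake_generate(question: str, images: List[str]) -> Tuple[str, bool]:
    # Toy stand-in for the VLM: answer once two segments are in memory.
    if len(images) >= 2:
        return "final answer", True
    return f"step {len(images) + 1}", False


def fake_render(segment: str) -> str:
    # Toy stand-in for the renderer: a string instead of a real image.
    return f"<image of: {segment}>"


answer, memory = iterative_reasoning("What is 2 + 2?", fake_generate, fake_render)
print(answer, memory)
```

In this sketch each round's context holds only the question and the rendered images, never the earlier text itself, which is where the token savings would come from.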
## Setup & Inference
### Installation
To use this model, install the required dependencies:
```bash
apt-get install poppler-utils # or conda install -c conda-forge poppler
pip install torch==2.6.0
pip install transformers==4.57.1
pip install reportlab
pip install pdf2image
```
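The `reportlab`, `pdf2image`, and poppler dependencies suggest that reasoning text is first laid out as a PDF and then rasterized into images. The sketch below shows one way such a pipeline could look; it is an assumption about the approach, not code from the official repository, and `wrap_lines`, `render_text_to_images`, the page size, font, margin, and DPI are all illustrative choices.

```python
import textwrap
from typing import List


def wrap_lines(text: str, width: int = 100) -> List[str]:
    """Wrap each paragraph to a fixed character width, keeping blank lines."""
    lines: List[str] = []
    for paragraph in text.splitlines():
        lines.extend(textwrap.wrap(paragraph, width=width) or [""])
    return lines


def render_text_to_images(text: str, pdf_path: str = "segment.pdf",
                          dpi: int = 72, font_size: int = 9):
    """Write text to a PDF with reportlab, then rasterize it with pdf2image."""
    # Imported here so wrap_lines stays usable without the optional dependencies.
    from pdf2image import convert_from_path
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    _, page_height = A4
    margin = 36                   # assumed page margin, in points
    line_height = font_size + 2   # assumed line spacing, in points

    pdf = canvas.Canvas(pdf_path, pagesize=A4)
    pdf.setFont("Helvetica", font_size)
    y = page_height - margin
    for line in wrap_lines(text):
        if y < margin:            # page is full: start a new one
            pdf.showPage()
            pdf.setFont("Helvetica", font_size)
            y = page_height - margin
        pdf.drawString(margin, y, line)
        y -= line_height
    pdf.save()

    # pdf2image calls poppler under the hood; returns one PIL image per page.
    return convert_from_path(pdf_path, dpi=dpi)


print(wrap_lines("a b c d\n\ne", width=3))
```

Smaller fonts and lower DPI pack more text into each image at the cost of legibility, so these values would need tuning against the model's vision encoder.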
### Inference
You can run the inference code provided in the [official repository](https://github.com/w-yibo/VTC-R1) to generate VTC-R1 style reasoning:
```bash
python inference.py # set your model path in the script first
```
## Training Procedure
The model was fine-tuned using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) on a dataset derived from OpenR1-Math-220K.
### Training Hyperparameters
The following hyperparameters were used during training:
- **learning_rate:** 1e-05
- **train_batch_size:** 1
- **eval_batch_size:** 8
- **seed:** 42
- **distributed_type:** multi-GPU
- **num_devices:** 8
- **gradient_accumulation_steps:** 8
- **total_train_batch_size:** 64
- **total_eval_batch_size:** 64
- **optimizer:** AdamW with betas=(0.9,0.999) and epsilon=1e-08
- **lr_scheduler_type:** cosine
- **lr_scheduler_warmup_ratio:** 0.1
- **num_epochs:** 1
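The batch-size totals above follow from the per-device values, and the learning rate follows a linear warmup and then a cosine decay. The snippet below checks the arithmetic and sketches the schedule shape; `cosine_lr` is an illustrative helper that mirrors the usual warmup-plus-cosine formula, and `total_steps = 1000` is a made-up value because the number of training steps is not reported here.

```python
import math

# Values taken from the hyperparameter list above.
train_batch_size = 1            # per device
eval_batch_size = 8             # per device
num_devices = 8
gradient_accumulation_steps = 8

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)  # 64 64


def cosine_lr(step: int, total_steps: int,
              base_lr: float = 1e-5, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then cosine decay towards zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; depends on dataset size and batch size
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr(step, total_steps))
```

With these settings the learning rate peaks at 1e-05 once warmup ends (step 100 of the hypothetical 1000) and falls to half that value midway through the decay phase.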
## Citation
If you find this work useful, please cite:
```bibtex
@misc{wang2026vtcr1visiontextcompressionefficient,
title={VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning},
author={Yibo Wang and Yongcheng Jing and Shunyu Liu and Hao Guan and Rong-cheng Tu and Chengyu Wang and Jun Huang and Dacheng Tao},
year={2026},
eprint={2601.22069},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.22069},
}
```
## Framework Versions
- Transformers 4.57.1
- PyTorch 2.6.0+cu124
- Datasets 4.0.0
- Tokenizers 0.22.1