Improve model card: add paper info, repository link, and description

#1
by nielsr (HF Staff)

Files changed (1): README.md (+65 -30)

README.md CHANGED
@@ -1,58 +1,93 @@
  ---
  library_name: transformers
  license: other
- base_model: zai-org/Glyph
  tags:
  - llama-factory
  - full
  - generated_from_trainer
  model-index:
  - name: vtc-r1-glyph
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 1
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 8
- - gradient_accumulation_steps: 8
- - total_train_batch_size: 64
- - total_eval_batch_size: 64
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 1
-
- ### Training results
-
- ### Framework versions

  - Transformers 4.57.1
  - Pytorch 2.6.0+cu124
  - Datasets 4.0.0
- - Tokenizers 0.22.1
 
  ---
+ base_model: zai-org/Glyph
  library_name: transformers
  license: other
+ pipeline_tag: image-text-to-text
  tags:
  - llama-factory
  - full
  - generated_from_trainer
+ - vision-language-model
+ - reasoning
  model-index:
  - name: vtc-r1-glyph
    results: []
  ---

+ # VTC-R1-Glyph

+ VTC-R1 (Vision-Text Compression for Efficient Long-Context Reasoning) is an efficient reasoning paradigm that integrates vision-text compression into the reasoning process. This repository contains a version of [zai-org/Glyph](https://huggingface.co/zai-org/Glyph) (based on GLM-4V) fine-tuned with this paradigm.

+ - **Paper:** [VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning](https://huggingface.co/papers/2601.22069)
+ - **Repository:** [https://github.com/w-yibo/VTC-R1](https://github.com/w-yibo/VTC-R1)

+ ## Model Description

+ VTC-R1 addresses efficiency bottlenecks in long-context reasoning for Vision-Language Models (VLMs). Instead of processing lengthy textual reasoning traces as tokens, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back to the model as "optical memory."

+ Key features:
+ - **Efficiency:** Achieves 3.4x token compression and a 2.7x speedup in end-to-end latency.
+ - **Performance:** Outperforms standard long-context reasoning on benchmarks such as MATH500, AIME25, AMC23, and GPQA-D.
+ - **Scalability:** Integrates vision-text compression directly into the reasoning process without needing external compression models.
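The iterative loop described above can be sketched in plain Python. Here `render_to_image` and `generate_step` are hypothetical stand-ins for the repository's actual rendering pipeline (reportlab + pdf2image) and model calls; this only illustrates the control flow, assuming each reasoning round is compressed into one image:

```python
# Sketch of the VTC-R1 loop: each round of textual reasoning is
# compressed by rendering it to an image, and that image (the
# "optical memory") is fed back in place of the raw text tokens.
# render_to_image and generate_step are illustrative stubs, not
# the real reportlab/pdf2image rendering or VLM inference calls.

def render_to_image(text: str) -> str:
    # Stand-in for PDF rendering + rasterization; returns a handle.
    return f"<image:{len(text)} chars>"

def generate_step(prompt: str, optical_memory: list[str]) -> str:
    # Stand-in for the VLM call; a real model would attend to the
    # previously rendered images rather than the raw text.
    round_idx = len(optical_memory) + 1
    return f"reasoning segment {round_idx} for: {prompt}"

def vtc_r1_reason(prompt: str, num_rounds: int = 3) -> list[str]:
    optical_memory: list[str] = []
    for _ in range(num_rounds):
        segment = generate_step(prompt, optical_memory)
        # Compress the textual segment into a compact image.
        optical_memory.append(render_to_image(segment))
    return optical_memory

memory = vtc_r1_reason("solve x^2 = 4", num_rounds=3)
print(len(memory))  # 3 rendered segments
```

In the real pipeline the rendered pages, not placeholder strings, are what the model attends to in later rounds.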

+ ## Setup & Inference

+ ### Installation
+ To use this model, install the required dependencies:
+ ```bash
+ apt-get install poppler-utils  # or: conda install -c conda-forge poppler
+ pip install torch==2.6.0
+ pip install transformers==4.57.1
+ pip install reportlab
+ pip install pdf2image
+ ```

+ ### Inference
+ You can run the inference script provided in the [official repository](https://github.com/w-yibo/VTC-R1) to generate VTC-R1-style reasoning:
+ ```bash
+ python inference.py  # set your model path in the script first
+ ```

+ ## Training Procedure

+ The model was fine-tuned using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) on a dataset derived from OpenR1-Math-220K.

+ ### Training Hyperparameters

+ The following hyperparameters were used during training:
+ - **learning_rate:** 1e-05
+ - **train_batch_size:** 1
+ - **eval_batch_size:** 8
+ - **seed:** 42
+ - **distributed_type:** multi-GPU
+ - **num_devices:** 8
+ - **gradient_accumulation_steps:** 8
+ - **total_train_batch_size:** 64
+ - **total_eval_batch_size:** 64
+ - **optimizer:** AdamW with betas=(0.9, 0.999) and epsilon=1e-08
+ - **lr_scheduler_type:** cosine
+ - **lr_scheduler_warmup_ratio:** 0.1
+ - **num_epochs:** 1
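As a quick sanity check, the total train batch size above is the product of the per-device batch size, the gradient accumulation steps, and the device count (variable names here are illustrative, not framework API):

```python
# Effective global batch size implied by the hyperparameters above.
train_batch_size = 1             # per-device micro-batch size
gradient_accumulation_steps = 8  # optimizer step every 8 micro-batches
num_devices = 8                  # multi-GPU data parallelism

total_train_batch_size = (
    train_batch_size * gradient_accumulation_steps * num_devices
)
print(total_train_batch_size)  # 64
```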
+
+ ## Citation
+
+ If you find this work useful, please cite:
+
+ ```bibtex
+ @misc{wang2026vtcr1visiontextcompressionefficient,
+   title={VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning},
+   author={Yibo Wang and Yongcheng Jing and Shunyu Liu and Hao Guan and Rong-cheng Tu and Chengyu Wang and Jun Huang and Dacheng Tao},
+   year={2026},
+   eprint={2601.22069},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2601.22069},
+ }
+ ```
+
+ ## Framework Versions

  - Transformers 4.57.1
  - Pytorch 2.6.0+cu124
  - Datasets 4.0.0
+ - Tokenizers 0.22.1