---
base_model: Qwen/Qwen2-VL-7B-Instruct
library_name: transformers
license: apache-2.0
tags:
- llama-factory
- full
- generated_from_trainer
- long-context
- reasoning
- multi-modal
model-index:
- name: TVC-7B
  results: []
pipeline_tag: image-text-to-text
---

## Model Summary

TVC-7B is a 7B-parameter model based on Qwen2-VL-7B-Instruct with a context window of 8K tokens.

- **Repository:** https://github.com/sun-hailong/TVC
- **Project Page:** https://sun-hailong.github.io/projects/TVC/
- **Languages:** English, Chinese
- **Paper:** https://arxiv.org/abs/2503.13360

### Model Architecture

- **Architecture:** Qwen2-VL-7B-Instruct
- **Data:** a mixture of 300K long-chain reasoning samples
- **Precision:** BFloat16

#### Hardware & Software

- **Hardware:** 64 × NVIDIA H20 GPUs
- **Orchestration:** Hugging Face Trainer
- **Code:** PyTorch

### Framework versions

- Transformers 4.46.1
- PyTorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.20.3
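
These versions matter for reproducibility. The snippet below is a quick sanity check, assuming the standard PyPI package names for the libraries listed above:

```python
# Sanity check: compare locally installed versions against the ones listed above.
# Package names are assumed to be the standard PyPI distributions.
import datasets
import tokenizers
import torch
import transformers

expected = {
    "transformers": "4.46.1",
    "torch": "2.5.1+cu124",
    "datasets": "3.1.0",
    "tokenizers": "0.20.3",
}
installed = {
    "transformers": transformers.__version__,
    "torch": torch.__version__,
    "datasets": datasets.__version__,
    "tokenizers": tokenizers.__version__,
}
for name, want in expected.items():
    ok = installed[name] == want
    print(f"{name}: expected {want}, installed {installed[name]} ({'OK' if ok else 'mismatch'})")
```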

## Quick Start

```python
from vllm import LLM, SamplingParams
from PIL import Image

model_name = "Allen8/TVC-7B"
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    tensor_parallel_size=8,  # adjust to the number of available GPUs
)

question = (
    "Hint: Please answer the question requiring an integer answer and provide "
    "the final value, e.g., 1, 2, 3, at the end.\n"
    "Question: Subtract all red things. Subtract all tiny matte balls. "
    "How many objects are left?\n"
    "Please answer the question using a long-chain reasoning style and think step by step."
)

# Qwen2-VL chat format: the image placeholder is wrapped in vision tags.
placeholder = "<|image_pad|>"
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
    f"{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

sampling_params = SamplingParams(
    temperature=0.0,
    top_k=1,
    top_p=1.0,
    stop_token_ids=[],
    repetition_penalty=1.05,
    max_tokens=8192,
)

image = Image.open("images/case1.png")
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    },
}

outputs = llm.generate([inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
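
If vLLM is not available, the checkpoint can also be run through the standard transformers Qwen2-VL classes. The sketch below is an assumption-based alternative, not the authors' reference code: it reuses the repo id and image path from the example above and lets the processor's chat template build the prompt instead of hand-writing the special tokens.

```python
# Hedged alternative using plain transformers (assumes the checkpoint keeps
# the standard Qwen2-VL processor/model layout).
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_name = "Allen8/TVC-7B"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

question = (
    "Question: Subtract all red things. Subtract all tiny matte balls. "
    "How many objects are left?\n"
    "Please answer the question using a long-chain reasoning style and think step by step."
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }
]

# The chat template inserts the vision tokens shown in the vLLM example.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open("images/case1.png")
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# do_sample=False mirrors the greedy decoding (temperature=0.0) used above.
output_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```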

## Citation

```
@article{sun2025mitigating,
  title={Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning},
  author={Sun, Hai-Long and Sun, Zhun and Peng, Houwen and Ye, Han-Jia},
  journal={arXiv preprint arXiv:2503.13360},
  year={2025}
}
```