---
base_model: Qwen/Qwen2-VL-7B-Instruct
library_name: transformers
license: apache-2.0
tags:
- llama-factory
- full
- generated_from_trainer
- long-context
- reasoning
- multi-modal
model-index:
- name: TVC-7B
  results: []
pipeline_tag: image-text-to-text
---

## Model Summary

TVC-7B is a 7B-parameter model based on Qwen2-VL-7B-Instruct with a context window of 8K tokens.

- **Repository:** https://github.com/sun-hailong/TVC
- **Project Page:** https://sun-hailong.github.io/projects/TVC/
- **Languages:** English, Chinese
- **Paper:** https://arxiv.org/abs/2503.13360

### Model Architecture

- **Architecture:** Qwen2-VL-7B-Instruct
- **Data:** a mixture of 300K long-chain reasoning samples
- **Precision:** BFloat16

#### Hardware & Software

- **Hardware:** 64 × NVIDIA H20 GPUs
- **Orchestration:** Hugging Face Trainer
- **Code:** PyTorch

### Framework versions

- Transformers 4.46.1
- PyTorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.20.3
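
These versions matter for reproducibility. The snippet below is a quick sanity check, assuming the standard PyPI package names for the libraries listed above:

```python
# Sanity check: compare locally installed versions against the ones listed above.
# Package names are assumed to be the standard PyPI distributions.
import datasets
import tokenizers
import torch
import transformers

expected = {
    "transformers": "4.46.1",
    "torch": "2.5.1+cu124",
    "datasets": "3.1.0",
    "tokenizers": "0.20.3",
}
installed = {
    "transformers": transformers.__version__,
    "torch": torch.__version__,
    "datasets": datasets.__version__,
    "tokenizers": tokenizers.__version__,
}
for name, want in expected.items():
    ok = installed[name] == want
    print(f"{name}: expected {want}, installed {installed[name]} ({'OK' if ok else 'mismatch'})")
```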

## Quick Start

```python
from vllm import LLM, SamplingParams
from PIL import Image

model_name = "Allen8/TVC-7B"
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    tensor_parallel_size=8,  # adjust to the number of available GPUs
)

question = (
    "Hint: Please answer the question requiring an integer answer and provide "
    "the final value, e.g., 1, 2, 3, at the end.\n"
    "Question: Subtract all red things. Subtract all tiny matte balls. "
    "How many objects are left?\n"
    "Please answer the question using a long-chain reasoning style and think step by step."
)

# Qwen2-VL chat format: the image placeholder is wrapped in vision tags.
placeholder = "<|image_pad|>"
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    f"<|im_start|>user\n<|vision_start|>{placeholder}<|vision_end|>"
    f"{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

sampling_params = SamplingParams(
    temperature=0.0,
    top_k=1,
    top_p=1.0,
    stop_token_ids=[],
    repetition_penalty=1.05,
    max_tokens=8192,
)

image = Image.open("images/case1.png")
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    },
}

outputs = llm.generate([inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
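
If vLLM is not available, the checkpoint can also be run through the standard transformers Qwen2-VL classes. The sketch below is an assumption-based alternative, not the authors' reference code: it reuses the repo id and image path from the example above and lets the processor's chat template build the prompt instead of hand-writing the special tokens.

```python
# Hedged alternative using plain transformers (assumes the checkpoint keeps
# the standard Qwen2-VL processor/model layout).
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_name = "Allen8/TVC-7B"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

question = (
    "Question: Subtract all red things. Subtract all tiny matte balls. "
    "How many objects are left?\n"
    "Please answer the question using a long-chain reasoning style and think step by step."
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": question},
        ],
    }
]

# The chat template inserts the vision tokens shown in the vLLM example.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open("images/case1.png")
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# do_sample=False mirrors the greedy decoding (temperature=0.0) used above.
output_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```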

## Citation

```
@article{sun2025mitigating,
  title={Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning},
  author={Sun, Hai-Long and Sun, Zhun and Peng, Houwen and Ye, Han-Jia},
  journal={arXiv preprint arXiv:2503.13360},
  year={2025}
}
```