---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- PolyU-ChenLab/ET-Instruct-164K
language:
- en
license: apache-2.0
metrics:
- f1
pipeline_tag: video-text-to-text
library_name: transformers
---
# MUSEG-3B
[Paper](https://arxiv.org/abs/2505.20715) | [GitHub](https://github.com/THUNLP-MT/MUSEG)
We propose MUSEG 🌟, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning ⏳. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios 🚀.
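To illustrate what multi-segment grounding produces, below is a minimal sketch of parsing timestamped segments from a model response and scoring one predicted segment against a reference with temporal IoU (the basis of segment-matching F1). The bracketed `[start, end]` text format and function names here are illustrative assumptions, not the exact output format used by MUSEG.

```python
import re

def parse_segments(text):
    """Extract (start, end) second pairs written like '[12.5, 20.0]' from a response."""
    pairs = re.findall(r"\[(\d+(?:\.\d+)?),\s*(\d+(?:\.\d+)?)\]", text)
    return [(float(a), float(b)) for a, b in pairs]

def segment_iou(a, b):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

response = "The action happens in [12.5, 20.0] and again in [45.0, 52.5]."
print(parse_segments(response))           # two (start, end) pairs
print(segment_iou((0.0, 10.0), (5.0, 15.0)))  # overlapping segments
```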
## Performance
<img src="https://github.com/THUNLP-MT/MUSEG/raw/main/images/performance.png" style="zoom:85%;"/>
## Environment Setup
```bash
conda create -n museg python=3.11
conda activate museg
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1
apt update
apt install openjdk-11-jdk
```
## Sample Usage
We provide a simple generation script for using our model. For more details, please refer to the [GitHub repository](https://github.com/THUNLP-MT/MUSEG).
```python
import numpy as np
import torch
from longvu.builder import load_pretrained_model
from longvu.constants import (
    DEFAULT_IMAGE_TOKEN,
    IMAGE_TOKEN_INDEX,
)
from longvu.conversation import conv_templates, SeparatorStyle
from longvu.mm_datautils import (
    KeywordsStoppingCriteria,
    process_images,
    tokenizer_image_token,
)
from decord import cpu, VideoReader

tokenizer, model, image_processor, context_len = load_pretrained_model(
    "./checkpoints/longvu_qwen", None, "cambrian_qwen",
)
model.eval()

video_path = "./examples/video1.mp4"
qs = "Describe this video in detail"

# Decode the video and sample roughly one frame per second.
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps))])
video = []
for frame_index in frame_indices:
    img = vr[frame_index].asnumpy()
    video.append(img)
video = np.stack(video)
image_sizes = [video[0].shape[:2]]
video = process_images(video, image_processor, model.config)
video = [item.unsqueeze(0) for item in video]

# Build the prompt with the image placeholder token.
qs = DEFAULT_IMAGE_TOKEN + "\n" + qs
conv = conv_templates["qwen"].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=video,
        image_sizes=image_sizes,
        do_sample=False,
        temperature=0.2,
        max_new_tokens=128,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```
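The snippet above samples frames at roughly one per second via a stride of `round(fps)`. A small standalone helper making that stride explicit, and mapping sampled indices back to timestamps in seconds, could look like this (function names are illustrative, not part of the MUSEG codebase):

```python
def frame_indices_at_fps(num_frames, native_fps, target_fps=1.0):
    """Indices that sample a video at roughly target_fps, mirroring the stride above."""
    stride = max(1, round(native_fps / target_fps))
    return list(range(0, num_frames, stride))

def index_to_seconds(index, native_fps):
    """Timestamp in seconds of a given frame index."""
    return index / native_fps

# Example: a 100-frame clip at 25 fps sampled at 1 fps.
indices = frame_indices_at_fps(100, 25.0)
print(indices)                                   # frames spaced 25 apart
print([index_to_seconds(i, 25.0) for i in indices])  # one timestamp per second
```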
## More Details
Please refer to our [GitHub Repository](https://github.com/THUNLP-MT/MUSEG) for more details about this model.
## Citation
If you find our work helpful for your research, please consider citing it.
```bibtex
@article{luo2025museg,
title={MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding},
author={Fuwen Luo and Shengfeng Lou and Chi Chen and Ziyue Wang and Chenliang Li and Weizhou Shen and Jiyue Guo and Peng Li and Ming Yan and Ji Zhang and Fei Huang and Yang Liu},
journal={arXiv preprint arXiv:2505.20715},
year={2025}
}
```