# eagle0504/llava-video-text-model

A **LLaVA** model fine-tuned on video-text data using DeepSpeed.
## Model Details

- **Base model**: llava-hf/llava-interleave-qwen-7b-hf
- **Architecture**: LLaVA (Large Language and Vision Assistant)
- **Training samples**: 4 videos
- **Training**: Multi-GPU with DeepSpeed ZeRO Stage 2
- **Task**: Video-text conversation generation
- **Video frames**: 5 frames per video
## Usage

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load model and processor
processor = AutoProcessor.from_pretrained("eagle0504/llava-video-text-model")
model = LlavaForConditionalGeneration.from_pretrained(
    "eagle0504/llava-video-text-model",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

# Define a conversation with one image placeholder per video frame
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Process video frames (extract frames from your video first)
video_frames = [...]  # List of 5 PIL Images, one per {"type": "image"} entry
inputs = processor(images=video_frames, text=prompt, return_tensors="pt").to(0, torch.float16)

# Generate a response
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)
```
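Note that decoding `output[0]` returns the prompt text followed by the reply. To keep only the newly generated tokens, slice the output at the prompt length (in the snippet above, `inputs["input_ids"].shape[1]`). A toy illustration of the slice with plain lists:

```python
# Toy illustration: generate() returns prompt tokens followed by new tokens,
# so slicing at the prompt length isolates the reply. With a real model,
# `output_ids` would be output[0] and `prompt_len` would be
# inputs["input_ids"].shape[1].
prompt_len = 4
output_ids = [101, 102, 103, 104, 7, 8, 9]  # 4 prompt tokens + 3 generated
new_tokens = output_ids[prompt_len:]
print(new_tokens)  # [7, 8, 9]
```

Pass `new_tokens` to `processor.decode(..., skip_special_tokens=True)` to get just the assistant's answer.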
## Training Configuration

- DeepSpeed ZeRO Stage 2
- Mixed precision (BF16)
- AdamW optimizer
- Learning rate: 5e-5
- Video frames per sample: 5
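A minimal DeepSpeed config matching the settings above might look like the sketch below. Only the ZeRO stage, BF16, optimizer type, and learning rate come from this card; every other field (batch sizes, comm flags) is an assumption to fill out a valid config.

```python
# Sketch of a DeepSpeed config dict (pass to deepspeed.initialize or save as
# ds_config.json). Fields beyond stage/bf16/optimizer/lr are assumptions.
ds_config = {
    "bf16": {"enabled": True},                # mixed precision (BF16)
    "zero_optimization": {
        "stage": 2,                           # ZeRO Stage 2
        "overlap_comm": True,                 # assumption
        "contiguous_gradients": True,         # assumption
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-5},
    },
    "train_micro_batch_size_per_gpu": 1,      # assumption
    "gradient_accumulation_steps": 1,         # assumption
}
```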
## Video Processing

This model expects 5 frames extracted from each video. For best results:

1. Extract evenly spaced frames from your video
2. Resize frames to the model's expected input size
3. Pass the frames as a list to the processor
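The steps above can be sketched as follows. The index computation is plain Python; the decoding helper assumes OpenCV (`cv2`) is installed, and the 384×384 target size is an assumption — any decoder that yields PIL Images works.

```python
def evenly_spaced_indices(total_frames: int, num_frames: int = 5) -> list:
    """Return num_frames indices spread evenly across [0, total_frames)."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    # Sample the midpoint of each of num_frames equal segments
    return [int(step * i + step / 2) for i in range(num_frames)]


def extract_frames(video_path, num_frames=5, size=(384, 384)):
    """Decode num_frames evenly spaced frames as resized PIL Images."""
    import cv2  # assumption: OpenCV installed
    from PIL import Image

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in evenly_spaced_indices(total, num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes BGR; convert to RGB for PIL
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(rgb).resize(size))
    cap.release()
    return frames
```

The resulting list can be passed directly as `video_frames` in the usage snippet above.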