# eagle0504/llava-video-text-model

A **LLaVA model** fine-tuned on video-text data using DeepSpeed.

## Model Details

- **Base model**: llava-hf/llava-interleave-qwen-7b-hf
- **Architecture**: LLaVA (Large Language and Vision Assistant)
- **Training samples**: 4 videos
- **Training**: Multi-GPU with DeepSpeed ZeRO Stage 2
- **Task**: Video-text conversation generation
- **Video frames**: 5 frames per video

## Usage

```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load model and processor
processor = AutoProcessor.from_pretrained("eagle0504/llava-video-text-model")
model = LlavaForConditionalGeneration.from_pretrained(
    "eagle0504/llava-video-text-model",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

# Define conversation with multiple images for video:
# one {"type": "image"} placeholder per video frame (5 in total)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Process video frames (you need to extract frames from your video).
# The number of frames must match the number of image placeholders above.
video_frames = [...]  # List of PIL Images from video
inputs = processor(images=video_frames, text=prompt, return_tensors='pt').to(0, torch.float16)

# Generate response
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)
```

## Training Configuration

- DeepSpeed ZeRO Stage 2
- Mixed precision (BF16)
- AdamW optimizer
- Learning rate: 5e-5
- Video frames per sample: 5

## Video Processing

This model expects 5 frames extracted from each video. For best results:

1. Extract evenly spaced frames from your video
2. Resize the frames to the model's expected input size
3. Pass the frames as a list to the processor
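
Step 1 of the list above (evenly spaced frame selection) can be sketched as follows. This is a minimal illustration, not the sampling code used to train this model: the function names `evenly_spaced_indices` and `sample_frames` are hypothetical, and taking the midpoint of each segment is an assumed strategy. It uses only the standard library and works on any already decoded sequence of frames.

```python
def evenly_spaced_indices(total_frames: int, num_frames: int = 5) -> list[int]:
    """Return `num_frames` indices spread evenly across `total_frames` frames.

    Hypothetical helper: the midpoint-of-segment strategy is an assumption,
    not necessarily what was used during fine-tuning.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    # Split the clip into `num_frames` equal segments and take the midpoint
    # of each, so samples are not pinned to the very first or last frame.
    step = total_frames / num_frames
    return [min(total_frames - 1, int(step * i + step / 2)) for i in range(num_frames)]


def sample_frames(frames, num_frames: int = 5):
    """Pick evenly spaced items from an already decoded sequence of frames."""
    return [frames[i] for i in evenly_spaced_indices(len(frames), num_frames)]
```

With a decoded clip held as a list of PIL images, `video_frames = sample_frames(all_frames)` would yield the 5 frames to pass to the processor. If the clip has fewer frames than requested, some indices repeat, so the result still has exactly `num_frames` entries.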