# eagle0504/llava-video-text-model

A **LLaVA** model fine-tuned on video-text data using DeepSpeed.
## Model Details

- **Base model**: llava-hf/llava-interleave-qwen-7b-hf
- **Architecture**: LLaVA (Large Language and Vision Assistant)
- **Training samples**: 4 videos
- **Training**: Multi-GPU with DeepSpeed ZeRO Stage 2
- **Task**: Video-text conversation generation
- **Video frames**: 5 frames per video
## Usage

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load model and processor
processor = AutoProcessor.from_pretrained("eagle0504/llava-video-text-model")
model = LlavaForConditionalGeneration.from_pretrained(
    "eagle0504/llava-video-text-model",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

# Define a conversation with one image placeholder per video frame
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Process video frames (extract frames from your video first)
video_frames = [...]  # List of 5 PIL Images, one per {"type": "image"} entry
inputs = processor(images=video_frames, text=prompt, return_tensors="pt").to(0, torch.float16)

# Generate a response
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)
```
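Note that decoding `output[0]` returns the prompt text followed by the reply. To keep only the newly generated tokens, slice the output at the prompt length (in the snippet above, `inputs["input_ids"].shape[1]`). A toy illustration of the slice with plain lists:

```python
# Toy illustration: generate() returns prompt tokens followed by new tokens,
# so slicing at the prompt length isolates the reply. With a real model,
# `output_ids` would be output[0] and `prompt_len` would be
# inputs["input_ids"].shape[1].
prompt_len = 4
output_ids = [101, 102, 103, 104, 7, 8, 9]  # 4 prompt tokens + 3 generated
new_tokens = output_ids[prompt_len:]
print(new_tokens)  # [7, 8, 9]
```

Pass `new_tokens` to `processor.decode(..., skip_special_tokens=True)` to get just the assistant's answer.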
## Training Configuration

- DeepSpeed ZeRO Stage 2
- Mixed precision (BF16)
- AdamW optimizer
- Learning rate: 5e-5
- Video frames per sample: 5
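A minimal DeepSpeed config matching the settings above might look like the sketch below. Only the ZeRO stage, BF16, optimizer type, and learning rate come from this card; every other field (batch sizes, comm flags) is an assumption to fill out a valid config.

```python
# Sketch of a DeepSpeed config dict (pass to deepspeed.initialize or save as
# ds_config.json). Fields beyond stage/bf16/optimizer/lr are assumptions.
ds_config = {
    "bf16": {"enabled": True},                # mixed precision (BF16)
    "zero_optimization": {
        "stage": 2,                           # ZeRO Stage 2
        "overlap_comm": True,                 # assumption
        "contiguous_gradients": True,         # assumption
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-5},
    },
    "train_micro_batch_size_per_gpu": 1,      # assumption
    "gradient_accumulation_steps": 1,         # assumption
}
```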
## Video Processing

This model expects 5 frames extracted from each video. For best results:

1. Extract evenly spaced frames from your video
2. Resize frames to the model's expected input size
3. Pass the frames as a list to the processor
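The steps above can be sketched as follows. The index computation is plain Python; the decoding helper assumes OpenCV (`cv2`) is installed, and the 384×384 target size is an assumption — any decoder that yields PIL Images works.

```python
def evenly_spaced_indices(total_frames: int, num_frames: int = 5) -> list:
    """Return num_frames indices spread evenly across [0, total_frames)."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    # Sample the midpoint of each of num_frames equal segments
    return [int(step * i + step / 2) for i in range(num_frames)]


def extract_frames(video_path, num_frames=5, size=(384, 384)):
    """Decode num_frames evenly spaced frames as resized PIL Images."""
    import cv2  # assumption: OpenCV installed
    from PIL import Image

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in evenly_spaced_indices(total, num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes BGR; convert to RGB for PIL
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(rgb).resize(size))
    cap.release()
    return frames
```

The resulting list can be passed directly as `video_frames` in the usage snippet above.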