# eagle0504/llava-video-text-model
A **LLaVA model** fine-tuned on video-text data using DeepSpeed.
## Model Details
- **Base model**: llava-hf/llava-interleave-qwen-7b-hf
- **Architecture**: LLaVA (Large Language and Vision Assistant)
- **Training samples**: 4 videos
- **Training**: Multi-GPU with DeepSpeed ZeRO Stage 2
- **Task**: Video-text conversation generation
- **Video frames**: 5 frames per video
## Usage
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load model and processor
processor = AutoProcessor.from_pretrained("eagle0504/llava-video-text-model")
model = LlavaForConditionalGeneration.from_pretrained(
    "eagle0504/llava-video-text-model",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

# Define a conversation with one image placeholder per video frame
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Process video frames (you need to extract frames from your video)
video_frames = [...]  # List of 5 PIL Images from the video
inputs = processor(images=video_frames, text=prompt, return_tensors="pt").to(0, torch.float16)

# Generate a response
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)
```
## Training Configuration
- DeepSpeed ZeRO Stage 2
- Mixed precision (BF16)
- AdamW optimizer
- Learning rate: 5e-5
- Video frames per sample: 5
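The settings above roughly correspond to a DeepSpeed config file like the following. This is an illustrative sketch assembled from the listed hyperparameters; the exact config used in training is not published, and the batch-size values here are placeholders:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 2 },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 5e-5 }
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```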
## Video Processing
This model expects 5 frames extracted from each video. For best results:
1. Extract evenly spaced frames from your video
2. Resize frames to the model's expected input size (the processor also resizes PIL images it receives)
3. Pass frames as a list to the processor
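Step 1 can be sketched as a small helper that picks evenly spaced frame indices. This is a minimal illustration; the function name `evenly_spaced_indices` is hypothetical, not part of this repo, and you would pair it with any video reader (e.g. OpenCV or decord) to fetch the frames at those indices:

```python
def evenly_spaced_indices(total_frames: int, num_frames: int = 5) -> list[int]:
    """Return `num_frames` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    step = total_frames / num_frames
    # Sample the midpoint of each of num_frames equal segments
    return [int(step * (i + 0.5)) for i in range(num_frames)]

# Example: a 100-frame clip sampled at 5 frames
print(evenly_spaced_indices(100, 5))  # [10, 30, 50, 70, 90]
```

Midpoint sampling avoids always taking frame 0 and the last frame, which are often fade-ins or fade-outs.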