# Preparing and Training with Video Metadata
This guide walks you through preparing your video metadata, splitting it into chunks, extracting video features, and running the training scripts.
## 1. Prepare Your Data in `.jsonl` Format
Your video metadata should be organized in JSON Lines (`.jsonl`) format, where each line is a valid JSON object representing one video.
**Example** (pretty-printed for readability; in the `.jsonl` file, each record must occupy a single line):
```json
{
"video_path": "data/infinitystar_toy_data/videos/e06b8ca5dbc6.mp4",
"begin_frame_id": 0,
"end_frame_id": 120,
"tarsier2_caption": "The video features an animated character with long light orange hair and brown eyes.",
"width": 1280,
"height": 720,
"h_div_w": 0.5625,
"fps": 24
}
```
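Because JSON Lines requires each object on a single line, it can help to generate and sanity-check records programmatically. The sketch below is hypothetical: `make_record` and `validate_line` are not part of the repository, and it assumes that the required fields are exactly those in the example above, that `h_div_w` is `height / width`, and that clip duration is `(end_frame_id - begin_frame_id) / fps`.

```python
import json

# Field names and types follow the example record above; whether the
# training code requires every one of them is an assumption.
REQUIRED_FIELDS = {
    "video_path": str,
    "begin_frame_id": int,
    "end_frame_id": int,
    "tarsier2_caption": str,
    "width": int,
    "height": int,
    "h_div_w": (int, float),
    "fps": (int, float),
}


def make_record(video_path, begin, end, caption, width, height, fps):
    """Build one metadata record (hypothetical helper, not part of the repo)."""
    return {
        "video_path": video_path,
        "begin_frame_id": begin,
        "end_frame_id": end,
        "tarsier2_caption": caption,
        "width": width,
        "height": height,
        "h_div_w": round(height / width, 4),  # assumed to be height / width
        "fps": fps,
    }


def validate_line(line):
    """Parse one JSONL line and check fields, types, frame range and aspect ratio."""
    record = json.loads(line)
    for key, expected_type in REQUIRED_FIELDS.items():
        if key not in record:
            raise ValueError(f"missing field: {key}")
        if not isinstance(record[key], expected_type):
            raise TypeError(f"{key} has unexpected type {type(record[key]).__name__}")
    if record["end_frame_id"] <= record["begin_frame_id"]:
        raise ValueError("end_frame_id must be greater than begin_frame_id")
    if abs(record["h_div_w"] - record["height"] / record["width"]) > 1e-3:
        raise ValueError("h_div_w does not match height / width")
    return record


if __name__ == "__main__":
    record = make_record(
        "data/infinitystar_toy_data/videos/e06b8ca5dbc6.mp4", 0, 120,
        "The video features an animated character with long light orange hair and brown eyes.",
        1280, 720, 24,
    )
    line = json.dumps(record)  # without indent=, json.dumps emits a single line
    parsed = validate_line(line)
    # Assumed duration: frame span divided by fps (120 / 24 = 5 seconds here).
    duration_s = (parsed["end_frame_id"] - parsed["begin_frame_id"]) / parsed["fps"]
    print(parsed["h_div_w"], duration_s)  # prints: 0.5625 5.0
```

To build a file, write `json.dumps(record) + "\n"` for each video; avoid passing `indent=`, which would spread one record across several lines and break the JSONL format.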
## 2. Split Metadata for Training
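For efficient training, large `.jsonl` files can be split into smaller chunks. Replace `JSONL_DIR` with the folder(s) containing your `.jsonl` files and `SAVE_DIR` with the directory where the split files should be written: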
```bash
python3 data/infinitystar_toy_data/split_jsonls_for_training.py --jsonl_folder_list JSONL_DIR --save_dir SAVE_DIR --chunk_size 100
```
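The exact behaviour of `split_jsonls_for_training.py` is defined in the repository. As a rough illustration only, the sketch below assumes that `--chunk_size` is the number of records per output file and that the lines of every `.jsonl` file in the input folders are concatenated before splitting. The function names and the `part_00000.jsonl` output naming are hypothetical.

```python
from pathlib import Path


def iter_jsonl_lines(folders):
    """Yield non-empty lines from every .jsonl file in the given folders, in sorted order."""
    for folder in folders:
        for path in sorted(Path(folder).glob("*.jsonl")):
            with path.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        yield line


def split_records(lines, chunk_size):
    """Group lines into lists of at most chunk_size items (assumed meaning of --chunk_size)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly smaller chunk


def write_chunks(folders, save_dir, chunk_size):
    """Write each chunk to save_dir/part_XXXXX.jsonl (hypothetical naming); return the file count."""
    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for chunk in split_records(iter_jsonl_lines(folders), chunk_size):
        (out / f"part_{count:05d}.jsonl").write_text("\n".join(chunk) + "\n", encoding="utf-8")
        count += 1
    return count
```

Under these assumptions, 250 records with `chunk_size=100` would produce three files of 100, 100, and 50 lines. Streaming lines through generators keeps memory use flat even for very large metadata files.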
## 3. Extract Video Features
To extract video features, edit `scripts/extract_video_features.sh`: set `video_data_path` and choose the desired resolution:
* **480p (5s):** `pn=0.40M`
* **480p (10s):** `pn=0.40M` with `video_frames=161`
* **720p (5s):** `pn=0.90M`
Then, run the script:
```bash
bash scripts/extract_video_features.sh
```
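Judging by the labels, `pn` appears to be a per-frame pixel budget (0.40M is about 400,000 pixels). Assuming that reading, the sketch below estimates a frame size for a given aspect ratio `h_div_w` by solving `h * w = pn` with `h / w = h_div_w`, then rounding each side to a multiple of 16. The function name `estimate_hw` and the rounding rule are assumptions for illustration; the scripts may use predefined resolutions instead, so treat the result as a rough guide, not the exact size used.

```python
import math


def estimate_hw(pn_megapixels, h_div_w, multiple=16):
    """Estimate (height, width) whose product is close to pn_megapixels * 1e6.

    Assumes pn is a per-frame pixel budget; rounding to `multiple` is an
    illustrative choice, not taken from the repository scripts.
    """
    total = pn_megapixels * 1_000_000
    width = math.sqrt(total / h_div_w)  # from h * w = total and h = h_div_w * w
    height = h_div_w * width

    def snap(x):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(x / multiple) * multiple)

    return snap(height), snap(width)


if __name__ == "__main__":
    for pn in (0.40, 0.90):
        h, w = estimate_hw(pn, 0.5625)
        print(f"pn={pn:.2f}M -> {h}x{w} ({h * w / 1e6:.2f}M pixels)")
```

For a 16:9 clip (`h_div_w=0.5625`) this gives roughly 480x848 for 0.40M and 704x1264 for 0.90M, i.e. the budgets land near 480p and 720p frame sizes without matching them exactly.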
## 4. Run Training Scripts
Once your metadata is prepared and features are extracted, you can start training.
**480p Training (5s or 10s):**
```bash
bash scripts/train_480p.sh
```
**720p Training (5s only):**
```bash
bash scripts/train_720p.sh
```
The 480p configuration supports both 5-second and 10-second video training. For 10-second training, ensure that `video_frames` is set to `161` in `extract_video_features.sh` and `train_480p.sh`.
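Because the frame count must agree between feature extraction and training, a small pre-flight check can catch a mismatch before a long run. The sketch below is hypothetical: it assumes each script sets the value in a form such as `video_frames=161` or `--video_frames 161`, and the helpers `read_video_frames` and `check_consistent` are illustrative names, not part of the repository.

```python
import re

# Matches "video_frames=161", "--video_frames=161" or "--video_frames 161" (assumed forms).
# \b keeps names like "max_video_frames" from matching.
_PATTERN = re.compile(r"\bvideo_frames\s*(?:=|\s)\s*(\d+)")


def read_video_frames(script_text):
    """Return the set of video_frames values set in a shell script's text."""
    values = set()
    for line in script_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # ignore commented-out settings
        values.update(int(v) for v in _PATTERN.findall(line))
    return values


def check_consistent(texts, expected=161):
    """Raise ValueError unless every script sets video_frames to `expected`.

    `texts` maps a script name to its contents.
    """
    for name, text in texts.items():
        values = read_video_frames(text)
        if not values:
            raise ValueError(f"{name}: no video_frames setting found")
        if values != {expected}:
            raise ValueError(f"{name}: video_frames={sorted(values)}, expected {expected}")
```

From the repository root, you could call it as `check_consistent({p: Path(p).read_text() for p in ("scripts/extract_video_features.sh", "scripts/train_480p.sh")})` after `from pathlib import Path`; if the scripts set the value through another variable, the pattern would need adjusting.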