# Data
### Data Format
We follow the data format below, which is similar to LLaVA's. You can use the original file paths directly, or pack the multi-modal files into patches with [create_patch.py](https://github.com/Ola-Omni/Ola/blob/main/tools/create_patch.py). A patch is a single binary file that stores many image or video files back to back in byte form, which can speed up reading in some cases.
- Image Data:
```
[
{
'id': ID of the data
'image': ***.png (path to the image file or positions in patches)
'conversations': [{"from": "human", "value": "<image>\n"}, {"from": "gpt", "value": ""}]
}
]
```
The format for image patch is:
```
{
"patch": "patch_00000",
"start_num": 846113989,
"size": 27141
}
```
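For reference, reading an image back out of a patch is a single seek-and-read. Below is a minimal sketch in Python, assuming `start_num` is the byte offset of the image inside the patch file and `size` is its length in bytes; the helper name is ours, not part of the Ola tools.
```python
# Minimal sketch: recover one image from a patch file.
# Assumes start_num = byte offset of the image, size = its byte length.
import io
from PIL import Image

def read_image_from_patch(patch_path: str, start_num: int, size: int) -> Image.Image:
    with open(patch_path, "rb") as f:
        f.seek(start_num)      # jump to this image's first byte
        data = f.read(size)    # read exactly one image's worth of bytes
    return Image.open(io.BytesIO(data))

# For the example entry above:
# img = read_image_from_patch("patch_00000", start_num=846113989, size=27141)
```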
- Video Frame Data:
```
[
{
'id': ID of the data
'video': ***.mp4 (path to the video file or positions in patches)
'conversations': [{"from": "human", "value": "<image>\n"}, {"from": "gpt", "value": ""}]
}
]
```
The format for video patch is:
```
{
"patch": "patch_000000",
"size": [ 5605, 8902, 7917, 5562, 9249, 8785, 8379, 10389, 10505, 10337, 8481, 8164, 5562, 8844, 10565, 8035, 7768, 8969, 5643, 10478, 7632, 10980, 9986, 3602, 2848, 7591, 10766, 7813, 5605, 9840, 9664, 5605, 7726, 4828, 8006, 5562, 9711, 7903, 9542, 10626, 8827, 11268, 11115, 1832, 11354, 9222, 3965, 10426, 10427, 7311, 9726, 7655, 10025, 5350, 10098, 10470, 4877, 10273, 9730, 10150, 5604, 7203, 9881, 2246, 11114, 3790, 5567, 10490, 4072, 1701],
"start_num": 26608266
}
```
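Here `size` lists the byte length of each stored frame and `start_num` is the offset of the first one, so the frames can be recovered by reading consecutive byte ranges. A minimal sketch, under the same assumptions as above:
```python
# Minimal sketch: recover all frames of one video from a patch file.
# Assumes frames are stored back to back starting at start_num, with
# per-frame byte lengths given by sizes (hypothetical helper, not an Ola tool).
import io
from PIL import Image

def read_frames_from_patch(patch_path, start_num, sizes):
    frames = []
    with open(patch_path, "rb") as f:
        f.seek(start_num)
        for s in sizes:                # frames are laid out consecutively
            frames.append(Image.open(io.BytesIO(f.read(s))))
    return frames
```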
- Video + Audio Data:
```
[
{
'id': ID of the data
'video': ***.mp4 (path to the video file or positions in patches)
'audio': ***.wav (path to the audio file)
'conversations': [{"from": "human", "value": "<speech><image>\n"}, {"from": "gpt", "value": ""}]
}
]
```
- Image + Audio Data:
```
[
{
'id': ID of the data
'audio_q': ***.wav (path to the audio file)
'image': ***.png (path to the image file or positions in patches)
'conversations': [{"from": "human", "value": ""<image>\nUser's question in speech: <speech>""}, {"from": "gpt", "value": ""}]
}
]
```
- Audio Data:
```
[
{
'id': ID of the data
'audio': ***.wav (path to the audio file)
'conversations': [{"from": "human", "value": "<speech>\n"}, {"from": "gpt", "value": ""}]
}
]
```
### Instructions for Ola Data
**You can simply mix the separate training JSONs for joint training with image/video/audio data.**
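Since every training JSON is a flat list of samples in the format above, joint training only requires concatenating the lists. A minimal sketch, with placeholder file names standing in for whichever Ola JSONs you downloaded:
```python
# Minimal sketch: merge the per-modality training JSONs into one list.
import json
import random

files = [
    "video-data.json",                          # placeholder names:
    "Ola_audio_1169k.json",                     # substitute the JSONs
    "Ola_cross_modality_finevideo_175k.json",   # you actually use
]
mixed = []
for path in files:
    with open(path) as f:
        mixed.extend(json.load(f))   # each file is a list of samples
random.shuffle(mixed)                # interleave the modalities
with open("ola_joint_training.json", "w") as f:
    json.dump(mixed, f)
```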
#### **Ola-Video-1.9M**
1. Download [Ola-video-1.9M.json](https://huggingface.co/datasets/THUdyh/Ola-Data/blob/main/video_data/video-data.json) from Hugging Face.
2. Download all the [video patches](https://huggingface.co/datasets/THUdyh/Ola-Data/tree/main/video_data) from Hugging Face.
3. Check the video patch paths in the JSON and change them to the actual paths on your machine, as sketched below.
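A minimal sketch of step 3, assuming samples whose `video` field holds a patch record (as in the format section above) should have the patch name prefixed with your local patch directory; adjust if your copy of the JSON differs:
```python
# Minimal sketch: point "patch" fields at the locally downloaded patches.
import json

PATCH_DIR = "/data/ola/video_data"   # assumed local patch directory

with open("video-data.json") as f:
    samples = json.load(f)

for sample in samples:
    rec = sample.get("video")
    # Samples that point into a patch store a dict; plain file paths
    # are left untouched.
    if isinstance(rec, dict) and "patch" in rec:
        rec["patch"] = f"{PATCH_DIR}/{rec['patch']}"

with open("video-data.local.json", "w") as f:
    json.dump(samples, f)
```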
#### **Ola-Audio-1.1M**
1. Download [Ola_audio_1169k.json](https://huggingface.co/datasets/THUdyh/Ola-Data/blob/main/Ola_audio_1169k.json) from Hugging Face.
2. Download the [wav tar files](https://huggingface.co/datasets/THUdyh/Ola-Data/tree/main/ola_audio) from Hugging Face and extract them all.
3. Check the file structure:
```
ola_audio/
├── Ola_audio_1169k.json
├── AudioCaps/
├── Clotho/
├── GigaSpeech/
├── LibriSpeech/
├── MillionSongDatasetSpotify/
├── MusicCaps/
└── WavCaps/
```
4. Check the audio file paths in the JSON and change them to the actual paths on your machine; a quick existence check is sketched below.
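After rewriting the paths, it is worth verifying that every referenced wav file actually exists. A minimal sketch, assuming each sample stores its path under the `audio` key as in the format section above:
```python
# Minimal sketch: check that all audio paths in the JSON resolve on disk.
import json
import os

with open("ola_audio/Ola_audio_1169k.json") as f:
    samples = json.load(f)

missing = [s["audio"] for s in samples
           if "audio" in s and not os.path.exists(s["audio"])]
print(f"{len(missing)} of {len(samples)} audio files missing")
```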
#### **Ola-Cross-Modality-298k**
1. Download [Ola_cross_modality_finevideo_175k.json](https://huggingface.co/datasets/THUdyh/Ola-Data/blob/main/Ola_cross_modality_finevideo_175k.json) and [Ola_cross_modality_llava_123k.json](https://huggingface.co/datasets/THUdyh/Ola-Data/blob/main/Ola_cross_modality_llava_123k.json) from Hugging Face.
2. Download [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo/tree/main) from Hugging Face.
3. Download [LLaVA-Video-178k](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/tree/main) from Hugging Face.
4. Extract the pure video streams from FineVideo and LLaVA-Video-178k.
5. Convert each video's audio track to a wav file and save it using [convert_mp4_wav.py](https://github.com/Ola-Omni/Ola/blob/main/tools/convert_mp4_wav.py) (a minimal sketch of this conversion follows the list).
6. Check the file structure:
```
ola_cross_modality_298k/
├── Ola_cross_modality_finevideo_175k.json
├── Ola_cross_modality_llava_123k.json
├── finevideo_audios/
│   ├── lltmlYR56dI.wav
│   └── ......
├── finevideo_videos/
│   ├── lltmlYR56dI.mp4
│   └── ......
├── llava_audios/
│   ├── academic_source
│   ├── ActivityNet-QA
│   ├── liwei_youtube_videos
│   ├── NextQA
│   └── perception_test
└── llava_videos/
    ├── academic_source
    ├── ActivityNet-QA
    ├── liwei_youtube_videos
    ├── NextQA
    └── perception_test
```
7. Check the video and audio paths in the JSONs and change them to the actual paths on your machine.
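For step 5, [convert_mp4_wav.py](https://github.com/Ola-Omni/Ola/blob/main/tools/convert_mp4_wav.py) is the reference implementation; the sketch below only illustrates the idea using ffmpeg. The 16 kHz mono output is an assumption; match whatever your audio pipeline expects.
```python
# Minimal sketch: extract each video's audio track as a wav file.
import pathlib
import subprocess

def convert_dir(video_dir: str, audio_dir: str) -> None:
    out = pathlib.Path(audio_dir)
    out.mkdir(parents=True, exist_ok=True)
    for mp4 in pathlib.Path(video_dir).glob("*.mp4"):
        wav = out / (mp4.stem + ".wav")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(mp4),
             "-vn",            # drop the video stream
             "-ac", "1",       # mono (assumed)
             "-ar", "16000",   # 16 kHz sample rate (assumed)
             str(wav)],
            check=True)

# convert_dir("finevideo_videos", "finevideo_audios")
```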