# Data

### Data Format

We use the data format below, which is similar to LLaVA's. You can reference the original file paths directly, or pack the multi-modal files into patches with [create_patch.py](https://github.com/Ola-Omni/Ola/blob/main/tools/create_patch.py). A patch is a binary file that stores image or video files contiguously as raw bytes, which may speed up reading in some cases.


- Image Data:

```
[
    {
        'id': ID of the data
        'image': ***.png (path to the image file or positions in patches)
        'conversations': [{"from": "human", "value": "<image>\n"}, {"from": "gpt", "value": ""}]
    }
]
```

The format for an image stored in a patch is:

```
{
    "patch": "patch_00000",
    "start_num": 846113989,
    "size": 27141
}
```
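As a minimal sketch of how such an entry might be read back, the snippet below assumes that `start_num` is the byte offset of the image inside the patch file and that `size` is its length in bytes. Both interpretations are assumptions inferred from the field names; `create_patch.py` is the authoritative reference for the layout. The helper name `read_from_patch` is hypothetical.

```python
def read_from_patch(patch_path, start_num, size):
    """Return `size` raw bytes starting at offset `start_num` in a patch file.

    Assumption: `start_num` is a byte offset and `size` a byte length.
    """
    with open(patch_path, "rb") as f:
        f.seek(start_num)
        data = f.read(size)
    if len(data) != size:
        # The patch file is shorter than the entry claims.
        raise ValueError(
            f"expected {size} bytes at offset {start_num}, got {len(data)}"
        )
    return data
```

The returned bytes could then be wrapped in `io.BytesIO` and handed to an image decoder of your choice.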

- Video Frame Data:

```
[
    {
        'id': ID of the data
        'video': ***.mp4 (path to the video file or positions in patches)
        'conversations': [{"from": "human", "value": "<image>\n"}, {"from": "gpt", "value": ""}]
    }
]
```

The format for a video stored in a patch is:

```
{
    "patch": "patch_000000",
    "size": [ 5605, 8902, 7917, 5562, 9249, 8785, 8379, 10389, 10505, 10337, 8481, 8164, 5562, 8844, 10565, 8035, 7768, 8969, 5643, 10478, 7632, 10980, 9986, 3602, 2848, 7591, 10766, 7813, 5605, 9840, 9664, 5605, 7726, 4828, 8006, 5562, 9711, 7903, 9542, 10626, 8827, 11268, 11115, 1832, 11354, 9222, 3965, 10426, 10427, 7311, 9726, 7655, 10025, 5350, 10098, 10470, 4877, 10273, 9730, 10150, 5604, 7203, 9881, 2246, 11114, 3790, 5567, 10490, 4072, 1701],
    "start_num": 26608266
}
```
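A video entry carries a list of sizes rather than a single one. The sketch below assumes that each element of `size` is the byte length of one frame and that the frames are stored back to back beginning at byte offset `start_num`. This is an inference from the example above, not something confirmed here, so check it against `create_patch.py`. The helper name `read_frames_from_patch` is hypothetical.

```python
def read_frames_from_patch(patch_path, start_num, sizes):
    """Return one bytes object per frame from a video patch entry.

    Assumption: frames are contiguous, starting at byte offset `start_num`,
    and `sizes[i]` is the byte length of frame i.
    """
    frames = []
    with open(patch_path, "rb") as f:
        f.seek(start_num)
        for index, size in enumerate(sizes):
            chunk = f.read(size)
            if len(chunk) != size:
                raise ValueError(
                    f"frame {index}: expected {size} bytes, got {len(chunk)}"
                )
            frames.append(chunk)
    return frames
```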

- Video + Audio Data:

```
[
    {
        'id': ID of the data
        'video': ***.mp4 (path to the video file or positions in patches)
        'audio': ***.wav (path to the audio file)
        'conversations': [{"from": "human", "value": "<speech><image>\n"}, {"from": "gpt", "value": ""}]
    }
]
```

- Image + Audio Data:

```
[
    {
        'id': ID of the data
        'audio_q': ***.wav (path to the audio file)
        'image': ***.png (path to the image file or positions in patches)
        'conversations': [{"from": "human", "value": "<image>\nUser's question in speech: <speech>"}, {"from": "gpt", "value": ""}]
    }
]
```

- Audio Data:

```
[
    {
        'id': ID of the data
        'audio': ***.wav (path to the audio file)
        'conversations': [{"from": "human", "value": "<speech>\n"}, {"from": "gpt", "value": ""}]
    }
]
```
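Across the formats above, an `image` or `video` key is paired with an `<image>` placeholder in the human turn, and an `audio` or `audio_q` key with a `<speech>` placeholder. A small consistency check along those lines might look like the sketch below. The function names are hypothetical, and the rule itself is only inferred from the examples in this section.

```python
def expected_placeholders(entry):
    """Placeholders implied by the modality keys present in an entry."""
    tokens = set()
    if "image" in entry or "video" in entry:
        tokens.add("<image>")
    if "audio" in entry or "audio_q" in entry:
        tokens.add("<speech>")
    return tokens


def missing_placeholders(entry):
    """Return the placeholders that the human turns should contain but do not."""
    human_text = "".join(
        turn["value"]
        for turn in entry["conversations"]
        if turn["from"] == "human"
    )
    return sorted(t for t in expected_placeholders(entry) if t not in human_text)
```

An empty list means the entry is consistent with the formats shown above.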

### Instruction for Ola Data

**You can simply mix the separate training JSON files for joint training on image, video, and audio data.**
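One way to do the mixing is to concatenate the top-level lists and write them to a single file, as in the sketch below. The function name `mix_training_jsons` and the fixed-seed shuffle are illustrative choices, not part of the Ola tooling. Shuffling here may be unnecessary if your training data loader already shuffles.

```python
import json
import random


def mix_training_jsons(json_paths, output_path, seed=0):
    """Concatenate several training JSON lists into one shuffled file.

    Each input file is assumed to hold a top-level JSON list of entries.
    Returns the number of entries written.
    """
    mixed = []
    for path in json_paths:
        with open(path, "r", encoding="utf-8") as f:
            mixed.extend(json.load(f))
    # Deterministic shuffle so the mixed file is reproducible.
    random.Random(seed).shuffle(mixed)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(mixed, f, ensure_ascii=False)
    return len(mixed)
```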

#### **Ola-Video-1.9M**

1. Download [Ola-video-1.9M.json](https://huggingface.co/datasets/THUdyh/Ola-Data/blob/main/video_data/video-data.json) from Hugging Face.

2. Download all the [video patches](https://huggingface.co/datasets/THUdyh/Ola-Data/tree/main/video_data) from Hugging Face.

3. Update the video patch paths in the JSON to match the actual locations on your machine.

#### **Ola-Audio-1.1M**

1. Download [Ola_audio_1169k.json](https://huggingface.co/datasets/THUdyh/Ola-Data/blob/main/Ola_audio_1169k.json) from Hugging Face.

2. Download the [wav tar file](https://huggingface.co/datasets/THUdyh/Ola-Data/tree/main/ola_audio) from Hugging Face and extract all the files.

3. Check the file structure:

```
│ola_audio/
├── Ola_audio_1169k.json
├── AudioCaps/
├── Clotho/
├── GigaSpeech/
├── LibriSpeech/
├── MillionSongDatasetSpotify/
├── MusicCaps/
├── WavCaps/
```

4. Update the audio file paths in the JSON to match the actual locations on your machine.
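The path update can be scripted. The sketch below assumes the JSON stores paths relative to the dataset root, which may not match the released file, so inspect a few entries first. The helper name `rebase_paths` and the example root are hypothetical.

```python
import os


def rebase_paths(entries, root, keys=("audio",)):
    """Prefix `root` to every relative path stored under the given keys.

    Assumption: paths in the JSON are relative; absolute paths are left alone.
    The entries are modified in place and also returned.
    """
    for entry in entries:
        for key in keys:
            value = entry.get(key)
            if isinstance(value, str) and not os.path.isabs(value):
                entry[key] = os.path.join(root, value)
    return entries
```

For example, after loading the JSON with `json.load`, calling `rebase_paths(entries, "/data/ola_audio")` and writing the result back with `json.dump` would leave the file pointing at your local copy.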

#### **Ola-Cross-Modality-298k**

1. Download [Ola_cross_modality_finevideo_175k.json](https://huggingface.co/datasets/THUdyh/Ola-Data/blob/main/Ola_cross_modality_finevideo_175k.json) and [Ola_cross_modality_llava_123k.json](https://huggingface.co/datasets/THUdyh/Ola-Data/blob/main/Ola_cross_modality_llava_123k.json) from Hugging Face.

2. Download [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo/tree/main) from Hugging Face.

3. Download [LLaVA-Video-178k](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K/tree/main) from Hugging Face.

4. Extract the video files from FineVideo and LLaVA-Video-178k.

5. Extract the audio track of each video and save it as a wav file using [convert_mp4_wav.py](https://github.com/Ola-Omni/Ola/blob/main/tools/convert_mp4_wav.py).

6. Check the file structure:

```
│ola_cross_modality_298k/
├── Ola_cross_modality_finevideo_175k.json
├── Ola_cross_modality_llava_123k.json
├── finevideo_audios/
│  ├── lltmlYR56dI.wav
│  ├── ......
├── finevideo_videos/
│  ├── lltmlYR56dI.mp4
│  ├── ......
├── llava_audios/
│  ├── academic_source
│  ├── ActivityNet-QA
│  ├── liwei_youtube_videos
│  ├── NextQA
│  ├── perception_test
├── llava_videos/
│  ├── academic_source
│  ├── ActivityNet-QA
│  ├── liwei_youtube_videos
│  ├── NextQA
│  ├── perception_test
```

7. Update the video and audio paths in the JSON to match the actual locations on your machine.
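Before training, it can help to confirm that every path in the JSON resolves to a file on disk. The sketch below is one way to do that, assuming the `video` and `audio` values are plain file paths (not patch positions). The function name `find_missing_files` is hypothetical.

```python
import os


def find_missing_files(entries, keys=("video", "audio")):
    """Return (id, key, path) for every referenced file that does not exist.

    Assumption: values under `keys` are plain file paths, not patch entries.
    """
    missing = []
    for entry in entries:
        for key in keys:
            path = entry.get(key)
            if isinstance(path, str) and not os.path.exists(path):
                missing.append((entry.get("id"), key, path))
    return missing
```

An empty result means all referenced video and audio files were found.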