# 📊 HuggingFace Datasets Management

## 🌟 Overview

This README outlines the standards and structure for managing datasets on HuggingFace, with a focus on video data processing. For video datasets, open-source videos are extracted into frames and stored in **public datasets** (unlimited storage), while annotations (labels) are stored in **private datasets** for security.

### 🔑 Core Idea

**Store annotated data on HuggingFace with precise naming conventions. Once annotations are complete and uploaded, the data remains unchanged to ensure stability and consistency.** This allows training processes to simply select the required datasets without worrying about version mismatches or data alterations.

#### Why This Approach?

- **🔒 Data Stability**: Fixed datasets prevent changes post-upload, ensuring consistent training results and avoiding performance fluctuations.
    
- **🚀 Simplified Training**: Precise naming enables quick dataset selection, streamlining the training process.
    
- **📁 Multi-Project Support**: Clear naming conventions distinguish projects and tasks, facilitating seamless switching between multiple training tasks.
    
- **🤝 Team Collaboration**: Immutable datasets ensure all team members use the same data, enhancing collaboration and reducing discrepancies.
    
- **🔄 Reproducibility**: Fixed datasets enable other researchers or team members to replicate training results, improving transparency and reliability.
    

---

## 📚 Repository Overview

- **GitHub Datasets Repo**:  
    RM2025-DatasetUtils  
    Contains scripts for data processing, annotation, and training code.
    
    ```bash
    git clone --recursive git@github.com:hkustenterprize-algo/RM2025-DatasetUtils.git
    ```
    
- **HuggingFace (Datasets & Models)**:  
    👉 hkustenterprize (this repository)  
    Hosts all datasets and trained models. See the sections below for the **datasets/ structure**.
    

---

## 🚧 TODO: Training with Streaming and ugcpu

### Task Description

The plan is to use **streaming** to load datasets dynamically during training on the **ugcpu** platform. Streaming loads data on demand without downloading entire datasets, and ugcpu provides high-performance computing for machine learning tasks.

### Expected Benefits

- **💾 Storage Efficiency**: No need to download full datasets, reducing local storage requirements.
    
- **🔄 Flexibility**: Dynamic dataset loading supports seamless switching between projects.
    
- **⚡ Performance**: ugcpu accelerates training with powerful computational resources.
    

### Note

This is a **TODO** item and is not yet implemented. Updates will follow as the project progresses.
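
Because this item is not implemented yet, the following is only a minimal sketch of what streaming access could look like. It assumes the Hugging Face datasets library and its load_dataset function with streaming=True. The stream_samples helper, the train split, and any repository id passed to it are hypothetical and are not part of RM2025-DatasetUtils.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, Optional


def stream_samples(
    repo_id: str,
    limit: int,
    loader: Optional[Callable[..., Iterable[Any]]] = None,
) -> Iterator[Any]:
    """Return an iterator over at most `limit` samples, fetched on demand."""
    if loader is None:
        # Imported lazily so the sketch can be read and tested without the
        # `datasets` package installed.
        from datasets import load_dataset

        loader = load_dataset
    # streaming=True returns an iterable dataset instead of downloading
    # every file first. The "train" split name is an assumption.
    dataset = loader(repo_id, split="train", streaming=True)
    return islice(dataset, limit)
```

The loader argument exists only so the helper can be exercised with a stand-in function. In real use it would be left at its default, and private datasets would additionally require an authenticated session.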

---

## 📝 Dataset Naming

Datasets follow this naming convention. Fields are separated by underscores, and multiple values within a field are joined with hyphens:

```
[raw/images/labels/masks]_[project]_[task]_[*content]_[*description]
```

### Example

- images-labels_car_detection_2024-rmuc-south-engineer_fpv-live
    
    - [images-labels]: Contains both images and annotations.
        
    - [car_detection]: Specifies the project and task.
        
    - [2024-rmuc-south-engineer]: Content (year-competition-region-role).
        
    - [fpv-live]: Description (first-person view, live stream).
        

### Multi-Project Datasets

For datasets covering multiple projects (e.g., mine and station), connect projects with a hyphen in the [project] field.

- **Example**: images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live
    
    - [mine-station]: Includes both mine and station projects.
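
The naming rules above can be checked mechanically. The sketch below splits a name on underscores and then splits the type and project fields on hyphens. It assumes that the fields marked with an asterisk are optional, which the convention suggests but does not state, and parse_dataset_name and DatasetName are hypothetical names, not utilities from RM2025-DatasetUtils.

```python
from dataclasses import dataclass
from typing import List

# Data types listed in the naming convention.
ALLOWED_TYPES = {"raw", "images", "labels", "masks"}


@dataclass(frozen=True)
class DatasetName:
    data_types: List[str]  # e.g. ["images", "labels"]
    projects: List[str]    # e.g. ["mine", "station"]
    task: str              # e.g. "detection"
    content: str           # e.g. "2024-rmuc-south-engineer", "" if absent
    description: str       # e.g. "fpv-live", "" if absent


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name into its underscore-separated fields."""
    fields = name.split("_")
    if not 3 <= len(fields) <= 5:
        raise ValueError(f"expected 3 to 5 fields, got {len(fields)}: {name!r}")
    # Assumption: [*content] and [*description] are optional, so pad with "".
    types, project, task, content, description = fields + [""] * (5 - len(fields))
    unknown = set(types.split("-")) - ALLOWED_TYPES
    if unknown:
        raise ValueError(f"unknown data type(s) {sorted(unknown)} in {name!r}")
    return DatasetName(
        data_types=types.split("-"),
        projects=project.split("-"),
        task=task,
        content=content,
        description=description,
    )
```

Parsing images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live this way yields the data types images and labels and the projects mine and station.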
        

### Content and Description

- **[*content]**: Use hyphens for competition videos, formatted as year-competition-region-role.
    
    - **Regions**: regional, mainland, east, north, etc.
        
    - **Competitions**: rmul, rmuc, etc.
        
- **[*description]**: Join descriptors with hyphens, e.g., fpv-live for fpv (first-person view) and live (live stream).
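
As an illustration of the content field for competition videos, the sketch below splits a value such as 2024-rmuc-south-engineer into year, competition, region, and role. It assumes exactly four hyphen-separated parts with a numeric year. The Content type and parse_content function are hypothetical names introduced here.

```python
from typing import NamedTuple


class Content(NamedTuple):
    year: int
    competition: str  # e.g. "rmul", "rmuc"
    region: str       # e.g. "regional", "mainland", "east", "north"
    role: str         # e.g. "engineer"


def parse_content(content: str) -> Content:
    """Parse a year-competition-region-role string."""
    parts = content.split("-")
    # Assumption: competition-video content always has exactly four parts.
    if len(parts) != 4 or not parts[0].isdigit():
        raise ValueError(f"expected year-competition-region-role, got {content!r}")
    year, competition, region, role = parts
    return Content(int(year), competition, region, role)
```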
    

---

## 🗂️ Dataset Structure and Naming

> **Note**: If the number of videos is small, the video folder can be omitted, with frames placed directly in the root directory.

### Structure Types

1. **Standard**: Contains an images/ folder and a labels/ folder.
    
2. **Video-Split**: Splits the dataset into one folder per video, each stored separately.
    

### Public Datasets (Images)

- Organized by video folders with abstract names (e.g., video_001).
    
- Frames are named after their video folder plus a zero-padded, nine-digit frame number, in sequence order, e.g., video_001_frame_000000001.jpg.
    

```
images_car_detection_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── video_001_frame_000000001.jpg
│   └── video_001_frame_000000005.jpg
├── video_002/
│   └── video_002_frame_000000001.jpg
├── raw/
│   ├── video_001.mp4
│   └── video_002.mp4
└── data.yaml
```
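
The frame naming shown in the tree above can be produced and parsed with a small helper. This is a sketch that assumes the nine-digit zero padding seen in the examples. frame_stem and parse_frame_stem are hypothetical names.

```python
import re
from typing import Tuple

# <video>_frame_<nine digits>, matching names such as video_001_frame_000000005.
FRAME_PATTERN = re.compile(r"(?P<video>.+)_frame_(?P<index>\d{9})")


def frame_stem(video: str, index: int) -> str:
    """Build a frame name without its file extension."""
    return f"{video}_frame_{index:09d}"


def parse_frame_stem(stem: str) -> Tuple[str, int]:
    """Recover the video name and frame number from a frame name."""
    match = FRAME_PATTERN.fullmatch(stem)
    if match is None:
        raise ValueError(f"not a frame name: {stem!r}")
    return match.group("video"), int(match.group("index"))
```

Because image, label, and mask files share the same stem, the same helper applies to .jpg, .txt, and .png files once the extension is removed.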

- **data.yaml**: Uses YOLO format for class information.
    

```yaml
# data.yaml
nc: 3
names:
  0: car
  1: truck
  2: obstacle
```
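
To keep nc and names in sync, a file like the one above could be generated from a class list instead of being written by hand. The sketch below uses plain string formatting and assumes class names that need no YAML quoting. render_data_yaml is a hypothetical helper.

```python
from typing import Sequence


def render_data_yaml(class_names: Sequence[str]) -> str:
    """Render the nc and names entries of data.yaml from an ordered class list."""
    lines = [f"nc: {len(class_names)}", "names:"]
    # Class ids are assigned by position, starting at 0.
    lines += [f"  {class_id}: {name}" for class_id, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"
```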

> **Future Plan**: Store raw videos on a NAS for efficient storage.

### Private Datasets (Labels/Masks)

Private datasets mirror public dataset structures, organized by task type.

#### 1. Detection

- Annotation files (.txt) use the YOLO detection format, with one object per line: class_id x_center y_center width height.
    

```
labels_car_detection_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── video_001_frame_000000001.txt
│   └── video_001_frame_000000005.txt
├── video_002/
└── data.yaml
```
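
As a sketch of how one line of such a label file can be read, the code below parses the five fields and converts the box to pixel corners. It assumes the usual YOLO convention that coordinates are normalized to the range 0 to 1 relative to the image size. Detection, parse_detection_line, and to_pixel_box are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Detection(NamedTuple):
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def parse_detection_line(line: str) -> Detection:
    """Parse one 'class_id x_center y_center width height' line."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return Detection(int(fields[0]), *(float(value) for value in fields[1:]))


def to_pixel_box(det: Detection, image_width: int, image_height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized box to (x_min, y_min, x_max, y_max) in pixels."""
    x_min = (det.x_center - det.width / 2) * image_width
    y_min = (det.y_center - det.height / 2) * image_height
    x_max = (det.x_center + det.width / 2) * image_width
    y_max = (det.y_center + det.height / 2) * image_height
    return round(x_min), round(y_min), round(x_max), round(y_max)
```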

#### 2. Segmentation

- Masks can be stored as images (.png) or in YOLO polygon format (.txt).
    
- The labels folder is optional.
    

```
masks_car_segmentation_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── labels/  # Optional
│   │   └── video_001_frame_000000002.txt
│   └── masks/
│       ├── video_001_frame_000000002.png  # Single-channel mask
│       └── video_001_frame_000000002.txt  # YOLO polygon format
├── video_002/
└── data.yaml
```
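
For the polygon variant, a line can be read as a class id followed by x y pairs. The sketch below assumes the common YOLO segmentation layout, class_id x1 y1 x2 y2 ..., with normalized coordinates and at least three points. parse_polygon_line is a hypothetical helper.

```python
from typing import List, Tuple


def parse_polygon_line(line: str) -> Tuple[int, List[Tuple[float, float]]]:
    """Parse one 'class_id x1 y1 x2 y2 ...' line into a class id and its points."""
    fields = line.split()
    coords = fields[1:]
    # A polygon needs at least three points, and every x needs a matching y.
    if len(coords) < 6 or len(coords) % 2 != 0:
        raise ValueError(f"expected a class id and at least three x y pairs: {line!r}")
    values = [float(value) for value in coords]
    points = list(zip(values[0::2], values[1::2]))
    return int(fields[0]), points
```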

### Combined Structure (Images and Labels in Private Dataset)

For datasets with both images and annotations in a private repository:

```
rm_combined_001/
├── images/
│   ├── video_001/
│   │   ├── video_001_frame_000000001.jpg
│   │   ├── video_001_frame_000000002.jpg
│   │   └── video_001_frame_000000005.jpg
│   └── video_002/
├── labels/  # Detection labels
│   ├── video_001/
│   │   ├── video_001_frame_000000001.txt  # YOLO detection format
│   │   └── video_001_frame_000000005.txt
│   └── video_002/
├── masks/   # Segmentation masks
│   ├── video_001/
│   │   ├── video_001_frame_000000002.png  # Single-channel mask
│   │   └── video_001_frame_000000002.txt  # YOLO polygon format
│   └── video_002/
└── data.yaml
```
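
In the combined layout, not every image has a detection label, as with frame 000000002 in the tree above. The sketch below pairs each image with its label file when one exists, assuming a local copy of the dataset with the images/ and labels/ folders shown. pair_images_with_labels is a hypothetical helper.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def pair_images_with_labels(root: Path) -> List[Tuple[Path, Optional[Path]]]:
    """Match images/<video>/<frame>.jpg with labels/<video>/<frame>.txt, if present."""
    pairs: List[Tuple[Path, Optional[Path]]] = []
    for image in sorted((root / "images").glob("*/*.jpg")):
        # The label shares the video folder name and the frame stem.
        label = root / "labels" / image.parent.name / f"{image.stem}.txt"
        pairs.append((image, label if label.is_file() else None))
    return pairs
```

Images paired with None can then be skipped or treated as unlabeled, depending on the training setup.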

---

## 🛠️ Model Management

### Model Versioning

- Use the hf_upload_model script from the RM2025-DatasetUtils repository to upload models.
    
- Include version information (revision) during uploads for effective version control.
    

### Model Repository Naming

- Follow the [project]_[task] format.
    
    - **Example**: car_detection.