# 📊 HuggingFace Datasets Management

## 🌟 Overview

This README outlines the standards and structure for managing datasets on HuggingFace, with a focus on video data processing. For video datasets, open-source videos are extracted into frames and stored in **public datasets** (unlimited storage), while annotations (labels) are stored in **private datasets** for security.

### 🔑 Core Idea

**Store annotated data on HuggingFace with precise naming conventions. Once annotations are complete and uploaded, the data remains unchanged to ensure stability and consistency.**

Training runs can then simply select the datasets they need without worrying about version mismatches or data alterations.

#### Why This Approach?

- **🔒 Data Stability**: Datasets are frozen after upload, so training results stay consistent and performance does not fluctuate because the data changed.
- **🚀 Simplified Training**: Precise naming enables quick dataset selection, streamlining the training process.
- **📁 Multi-Project Support**: Clear naming conventions distinguish projects and tasks, making it easy to switch between training tasks.
- **🤝 Team Collaboration**: Immutable datasets ensure all team members use the same data, reducing discrepancies.
- **🔄 Reproducibility**: Fixed datasets let other researchers or team members replicate training results, improving transparency and reliability.

---

## 📚 Repository Overview

- **GitHub Datasets Repo**: RM2025-DatasetUtils
  Contains scripts for data processing, annotation, and training code.

  ```bash
  git clone --recursive git@github.com:hkustenterprize-algo/RM2025-DatasetUtils.git
  ```

- **HuggingFace (Datasets & Models)**: hkustenterprize (👉 this repository)
  Hosts all datasets and trained models. See **Dataset Structure and Naming** below for the datasets/ structure.

---

## 🚧 TODO: Training with Streaming and ugcpu

### Task Description

The plan is to use **streaming** to load datasets dynamically during training on the **ugcpu** platform. Streaming loads data on demand without downloading entire datasets, and ugcpu provides high-performance computing for machine learning tasks.

### Expected Benefits

- **💾 Storage Efficiency**: No need to download full datasets, reducing local storage requirements.
- **🔄 Flexibility**: Dynamic dataset loading supports seamless switching between projects.
- **⚡ Performance**: ugcpu accelerates training with powerful computational resources.

### Note

This is a **TODO** item and is not yet implemented. Updates will follow as the project progresses.

---

## 📝 Dataset Naming

Datasets follow this naming convention:

```
[raw/images/labels/masks]_[project]_[task]_[*content]_[*description]
```

### Example

- images-labels_car_detection_2024-rmuc-south-engineer_fpv-live
  - [images-labels]: Contains both images and annotations.
  - [car_detection]: Specifies the project and task.
  - [2024-rmuc-south-engineer]: Content (year-competition-region-role).
  - [fpv-live]: Description (first-person view, live stream).

### Multi-Project Datasets

For datasets covering multiple projects (e.g., mine and station), connect the projects with a hyphen in the [project] field.

- **Example**: images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live
  - [mine-station]: Includes both mine and station projects.

### Content and Description

- **[*content]**: Use hyphens for competition videos, formatted as year-competition-region-role.
  - **Regions**: regional, mainland, east, north, etc.
  - **Competitions**: rmul, rmuc, etc.
- **[*description]**: Use hyphens, e.g., fpv (first-person view), live (live stream).

---

## 🗂️ Dataset Structure and Naming

> **Note**: If the number of videos is small, the video folder can be omitted, with frames placed directly in the root directory.

### Structure Types

1. **Standard**: Has top-level images/ and labels/ folders.
2. **Video-Split**: Splits the dataset into per-video folders, each stored separately.

### Public Datasets (Images)

- Organized by video folders with abstract names (e.g., video_001).
- Frames are named sequentially, e.g., video_001_frame_000000001.jpg.

```
images_car_detection_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── video_001_frame_000000001.jpg
│   ├── video_001_frame_000000005.jpg
├── video_002/
│   ├── video_002_frame_000000001.jpg
├── raw/
│   ├── video_001.mp4
│   ├── video_002.mp4
├── data.yaml
```

- **data.yaml**: Uses YOLO format for class information.

```yaml
# data.yaml
nc: 3
names:
  0: car
  1: truck
  2: obstacle
```

> **Future Plan**: Store raw videos on a NAS for efficient storage.

### Private Datasets (Labels/Masks)

Private datasets mirror the public dataset structure and are organized by task type.

#### 1. Detection

- Annotation files (.txt) use the YOLO detection format: class_id x_center y_center width height.

```
labels_car_detection_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── video_001_frame_000000001.txt
│   ├── video_001_frame_000000005.txt
├── video_002/
├── data.yaml
```

#### 2. Segmentation

- Masks can be images (.png) or YOLO polygon format (.txt).
- The labels folder is optional.

```
masks_car_segmentation_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── labels/                                  # Optional
│   │   ├── video_001_frame_000000002.txt
│   ├── masks/
│   │   ├── video_001_frame_000000002.png        # Single-channel mask
│   │   ├── video_001_frame_000000002.txt        # YOLO polygon format
├── video_002/
├── data.yaml
```

### Combined Structure (Images and Labels in Private Dataset)

For datasets that keep both images and annotations in one private repository:

```
rm_combined_001/
├── images/
│   ├── video_001/
│   │   ├── video_001_frame_000000001.jpg
│   │   ├── video_001_frame_000000002.jpg
│   │   ├── video_001_frame_000000005.jpg
│   ├── video_002/
├── labels/                                      # Detection labels
│   ├── video_001/
│   │   ├── video_001_frame_000000001.txt        # YOLO detection format
│   │   ├── video_001_frame_000000005.txt
│   ├── video_002/
├── masks/                                       # Segmentation masks
│   ├── video_001/
│   │   ├── video_001_frame_000000002.png        # Single-channel mask
│   │   ├── video_001_frame_000000002.txt        # YOLO polygon format
│   ├── video_002/
├── data.yaml
```

---

## 🛠️ Model Management

### Model Versioning

- Use the hf_upload_model script from the RM2025-DatasetUtils repository to upload models.
- Include version information (revision) with each upload so that model versions can be tracked.

### Model Repository Naming

- Follow the [project]_[task] format.
- **Example**: car_detection.
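
#### Sketch: Deriving a Model Repository Name from a Dataset Name

Since model repositories reuse the [project] and [task] fields of the dataset naming convention, the mapping can be sketched in a few lines of Python. This is a minimal illustration, not part of RM2025-DatasetUtils: `DatasetName`, `parse_dataset_name`, and `KNOWN_TYPES` are hypothetical names. It also assumes that the leading `*` marks optional fields, that no field contains an underscore, and that a multi-project model would keep the hyphenated project field (the README does not say so explicitly).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Data types listed in the naming convention: [raw/images/labels/masks]
KNOWN_TYPES = {"raw", "images", "labels", "masks"}


@dataclass(frozen=True)
class DatasetName:
    """Hypothetical container for the fields of a dataset name."""
    types: Tuple[str, ...]       # e.g. ("images", "labels")
    projects: Tuple[str, ...]    # e.g. ("mine", "station")
    task: str                    # e.g. "detection"
    content: Optional[str] = None      # e.g. "2024-rmuc-south-engineer"
    description: Optional[str] = None  # e.g. "fpv-live"

    @property
    def model_repo(self) -> str:
        """Model repository name in the [project]_[task] format.

        Assumption: multi-project names keep the hyphenated project field.
        """
        return f"{'-'.join(self.projects)}_{self.task}"


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name into its underscore-separated fields."""
    fields = name.split("_")
    # Three mandatory fields plus up to two fields assumed to be optional.
    if not 3 <= len(fields) <= 5 or any(f == "" for f in fields):
        raise ValueError(f"expected 3 to 5 non-empty fields, got: {name!r}")

    types = tuple(fields[0].split("-"))
    unknown = set(types) - KNOWN_TYPES
    if unknown:
        raise ValueError(f"unknown data type(s) {sorted(unknown)} in {name!r}")

    return DatasetName(
        types=types,
        projects=tuple(fields[1].split("-")),
        task=fields[2],
        content=fields[3] if len(fields) > 3 else None,
        description=fields[4] if len(fields) > 4 else None,
    )


if __name__ == "__main__":
    ds = parse_dataset_name(
        "images-labels_car_detection_2024-rmuc-south-engineer_fpv-live"
    )
    print(ds.model_repo)  # car_detection
```

A name that does not start with a known data type, such as the rm_combined_001 example above, is rejected with a `ValueError` under this sketch, which may or may not match how the team treats such repositories.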