| # HuggingFace Datasets Management | |
| ## Overview | |
| This README outlines the standards and structure for managing datasets on HuggingFace, with a focus on video data processing. For video datasets, open-source videos are extracted into frames and stored in **public datasets** (unlimited storage), while annotations (labels) are stored in **private datasets** for security. | |
| ### Core Idea | |
| **Store annotated data on HuggingFace with precise naming conventions. Once annotations are complete and uploaded, the data remains unchanged to ensure stability and consistency.** This allows training processes to simply select the required datasets without worrying about version mismatches or data alterations. | |
| #### Why This Approach? | |
| - **Data Stability**: Fixed datasets prevent changes post-upload, ensuring consistent training results and avoiding performance fluctuations. | |
| - **Simplified Training**: Precise naming enables quick dataset selection, streamlining the training process. | |
| - **Multi-Project Support**: Clear naming conventions distinguish projects and tasks, facilitating seamless switching between multiple training tasks. | |
| - **Team Collaboration**: Immutable datasets ensure all team members use the same data, enhancing collaboration and reducing discrepancies. | |
| - **Reproducibility**: Fixed datasets enable other researchers or team members to replicate training results, improving transparency and reliability. | |
| --- | |
| ## Repository Overview | |
| - **GitHub Datasets Repo**: | |
| RM2025-DatasetUtils | |
| Contains scripts for data processing, annotation, and training code. | |
| ```bash | |
| git clone --recursive git@github.com:hkustenterprize-algo/RM2025-DatasetUtils.git | |
| ``` | |
| - **HuggingFace (Datasets & Models)**: | |
| hkustenterprize | |
| Hosts all datasets and trained models. See the dataset structure sections below for how each dataset is laid out. | |
| --- | |
| ## TODO: Training with Streaming and ugcpu | |
| ### Task Description | |
| The plan is to use **streaming** to load datasets dynamically during training on the **ugcpu** platform. Streaming loads data on demand instead of downloading entire datasets, and ugcpu provides high-performance computing for machine learning tasks. | |
| ### Expected Benefits | |
| - **Storage Efficiency**: No need to download full datasets, reducing local storage requirements. | |
| - **Flexibility**: Dynamic dataset loading supports seamless switching between projects. | |
| - **Performance**: ugcpu accelerates training with powerful computational resources. | |
| ### Note | |
| This is a **TODO** item and not yet implemented. Updates will follow as the project progresses. | |
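Until the streaming pipeline exists, the idea can be sketched as follows. This is only an illustration, assuming the `datasets` library and its `load_dataset(..., streaming=True)` option; the helper name `stream_samples`, the `loader` parameter (there so the function can be exercised offline), and the repository ID in the usage comment are hypothetical and not part of the project.

```python
from typing import Any, Callable, Iterator, Optional


def stream_samples(
    repo_id: str,
    limit: int,
    split: str = "train",
    loader: Optional[Callable[..., Any]] = None,
) -> Iterator[Any]:
    """Yield at most `limit` samples from a Hub dataset without a full download."""
    if loader is None:
        # Imported lazily so this module still loads where `datasets` is absent.
        from datasets import load_dataset

        loader = load_dataset
    # streaming=True returns an iterable dataset that fetches data on demand.
    dataset = loader(repo_id, split=split, streaming=True)
    for index, sample in enumerate(dataset):
        if index >= limit:
            break
        yield sample


# Hypothetical usage (repository ID assumed from the naming convention):
# for sample in stream_samples(
#     "hkustenterprize/images_car_detection_2024-rmuc-south-engineer_fpv-live", 8
# ):
#     print(sample.keys())
```

Private label datasets would additionally require an authenticated session, which this sketch does not handle.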
| --- | |
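| ## Dataset Naming | |
| Dataset names consist of underscore-separated fields in the order below. Hyphens join multiple values inside a single field (e.g., images-labels). | |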
| ``` | |
| [raw/images/labels/masks]_[project]_[task]_[*content]_[*description] | |
| ``` | |
| ### Example | |
| - images-labels_car_detection_2024-rmuc-south-engineer_fpv-live | |
| - [images-labels]: Contains both images and annotations. | |
| - [car_detection]: Specifies the project and task. | |
| - [2024-rmuc-south-engineer]: Content (year-competition-region-role). | |
| - [fpv-live]: Description (first-person view, live stream). | |
| ### Multi-Project Datasets | |
| For datasets covering multiple projects (e.g., mine and station), connect projects with a hyphen in the [project] field. | |
| - **Example**: images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live | |
| - [mine-station]: Includes both mine and station projects. | |
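The convention above can be checked mechanically. The following is a minimal sketch of a parser, not a tool from RM2025-DatasetUtils; the names `DatasetName` and `parse_dataset_name` are hypothetical, and it assumes that the `*` in `[*content]` and `[*description]` marks those fields as optional.

```python
from dataclasses import dataclass
from typing import List, Optional

# Data types allowed in the first field of a dataset name.
KNOWN_TYPES = {"raw", "images", "labels", "masks"}


@dataclass(frozen=True)
class DatasetName:
    types: List[str]
    projects: List[str]
    task: str
    content: Optional[str]
    description: Optional[str]


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name into its underscore-separated fields."""
    parts = name.split("_")
    if not 3 <= len(parts) <= 5:
        raise ValueError(f"expected 3 to 5 underscore-separated fields: {name!r}")
    types = parts[0].split("-")
    unknown = [t for t in types if t not in KNOWN_TYPES]
    if unknown:
        raise ValueError(f"unknown data type(s) {unknown} in {name!r}")
    # Assumption: content and description may be omitted.
    content = parts[3] if len(parts) > 3 else None
    description = parts[4] if len(parts) > 4 else None
    return DatasetName(types, parts[1].split("-"), parts[2], content, description)


parsed = parse_dataset_name(
    "images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live"
)
print(parsed.projects)
```

Because fields are split on underscores, this sketch only works if no individual field contains an underscore itself.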
| ### Content and Description | |
| - **[*content]**: Use hyphens for competition videos, formatted as year-competition-region-role. | |
| - **Regions**: regional, mainland, east, north, etc. | |
| - **Competitions**: rmul, rmuc, etc. | |
| - **[*description]**: Use hyphens, e.g., fpv (first-person view), live (live stream). | |
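For competition videos, the content field can be broken into its four parts. This is a small sketch under the assumption that the field always has exactly four hyphen-separated parts and a four-digit year; `Content` and `parse_content` are hypothetical names.

```python
from typing import NamedTuple


class Content(NamedTuple):
    year: int
    competition: str
    region: str
    role: str


def parse_content(content: str) -> Content:
    """Parse a year-competition-region-role string such as 2024-rmuc-south-engineer."""
    fields = content.split("-")
    if len(fields) != 4:
        raise ValueError(f"expected year-competition-region-role, got {content!r}")
    year, competition, region, role = fields
    if not (year.isdigit() and len(year) == 4):
        raise ValueError(f"year must be four digits, got {year!r}")
    return Content(int(year), competition, region, role)


print(parse_content("2024-rmuc-south-engineer"))
```

The sketch does not validate competition or region values, since the lists above are open-ended ("etc.").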
| --- | |
| ## Dataset Structure and Naming | |
| > **Note**: If the number of videos is small, the video folder can be omitted, with frames placed directly in the root directory. | |
| ### Structure Types | |
| 1. **Standard**: Contains images/ and labels/ folders. | |
| 2. **Video-Split**: Splits the dataset into one folder per video, each stored separately. | |
| ### Public Datasets (Images) | |
| - Organized by video folders with abstract names (e.g., video_001). | |
| - Frames named sequentially, e.g., video_001_frame_000000001.jpg. | |
| ``` | |
| images_car_detection_2024-rmuc-south-engineer_fpv-live/ | |
| ├── video_001/ | |
| │   ├── video_001_frame_000000001.jpg | |
| │   └── video_001_frame_000000005.jpg | |
| ├── video_002/ | |
| │   └── video_002_frame_000000001.jpg | |
| ├── raw/ | |
| │   ├── video_001.mp4 | |
| │   └── video_002.mp4 | |
| └── data.yaml | |
| ``` | |
| - **data.yaml**: Uses YOLO format for class information. | |
| ```yaml | |
| # data.yaml | |
| nc: 3 | |
| names: | |
|   0: car | |
|   1: truck | |
|   2: obstacle | |
| > **Future Plan**: Store raw videos on a NAS for efficient storage. | |
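The frame naming shown in the public dataset tree (three-digit video index, nine-digit frame number) can be generated rather than typed by hand. This is a minimal sketch; `frame_path` is a hypothetical helper, and the digit widths are inferred from the examples above.

```python
from pathlib import PurePosixPath


def frame_path(video_index: int, frame_index: int, suffix: str = ".jpg") -> PurePosixPath:
    """Build the relative path of a frame inside a video-split dataset."""
    video = f"video_{video_index:03d}"
    # Frame numbers are zero-padded to nine digits, as in the examples.
    return PurePosixPath(video) / f"{video}_frame_{frame_index:09d}{suffix}"


print(frame_path(1, 5))
```

Passing `suffix=".txt"` gives the matching annotation file name, since private datasets mirror the public structure.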
| ### Private Datasets (Labels/Masks) | |
| Private datasets mirror public dataset structures, organized by task type. | |
| #### 1. Detection | |
| - Annotation files (.txt) in YOLO detection format: class_id x_center y_center width height. | |
| ``` | |
| labels_car_detection_2024-rmuc-south-engineer_fpv-live/ | |
| ├── video_001/ | |
| │   ├── video_001_frame_000000001.txt | |
| │   └── video_001_frame_000000005.txt | |
| ├── video_002/ | |
| └── data.yaml | |
| ``` | |
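Each line of a detection label file can be read as sketched below. The sketch assumes the usual YOLO convention that all four box values are normalized to the range 0 to 1 relative to the image size; `Box`, `parse_label_line`, and `to_pixel_corners` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def parse_label_line(line: str) -> Box:
    """Parse one 'class_id x_center y_center width height' line."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return Box(int(fields[0]), *(float(value) for value in fields[1:]))


def to_pixel_corners(
    box: Box, image_width: int, image_height: int
) -> Tuple[float, float, float, float]:
    """Convert a normalized center-format box to (x_min, y_min, x_max, y_max) in pixels."""
    x_min = (box.x_center - box.width / 2) * image_width
    y_min = (box.y_center - box.height / 2) * image_height
    x_max = (box.x_center + box.width / 2) * image_width
    y_max = (box.y_center + box.height / 2) * image_height
    return x_min, y_min, x_max, y_max


box = parse_label_line("0 0.5 0.5 0.25 0.5")
print(to_pixel_corners(box, 640, 480))
```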
| #### 2. Segmentation | |
| - Masks can be images (.png) or YOLO polygon format (.txt). | |
| - Optional labels folder. | |
| ``` | |
| masks_car_segmentation_2024-rmuc-south-engineer_fpv-live/ | |
| ├── video_001/ | |
| │   ├── labels/ # Optional | |
| │   │   └── video_001_frame_000000002.txt | |
| │   └── masks/ | |
| │       ├── video_001_frame_000000002.png # Single-channel mask | |
| │       └── video_001_frame_000000002.txt # YOLO polygon format | |
| ├── video_002/ | |
| └── data.yaml | |
| ``` | |
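For the polygon variant, each line is assumed to hold a class ID followed by normalized `x y` pairs, as in the common YOLO segmentation layout. A minimal parsing sketch, with `parse_polygon_line` as a hypothetical helper name:

```python
from typing import List, Tuple


def parse_polygon_line(line: str) -> Tuple[int, List[Tuple[float, float]]]:
    """Parse 'class_id x1 y1 x2 y2 ...' into a class ID and a list of points."""
    fields = line.split()
    coords = fields[1:]
    # A polygon needs at least three points, and coordinates come in pairs.
    if len(coords) < 6 or len(coords) % 2 != 0:
        raise ValueError(f"expected at least 3 x/y pairs: {line!r}")
    values = [float(value) for value in coords]
    points = list(zip(values[0::2], values[1::2]))
    return int(fields[0]), points


class_id, points = parse_polygon_line("1 0.1 0.2 0.5 0.2 0.5 0.8")
print(class_id, points)
```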
| ### Combined Structure (Images and Labels in Private Dataset) | |
| For datasets with both images and annotations in a private repository: | |
| ``` | |
| rm_combined_001/ | |
| ├── images/ | |
| │   ├── video_001/ | |
| │   │   ├── video_001_frame_000000001.jpg | |
| │   │   ├── video_001_frame_000000002.jpg | |
| │   │   └── video_001_frame_000000005.jpg | |
| │   └── video_002/ | |
| ├── labels/ # Detection labels | |
| │   ├── video_001/ | |
| │   │   ├── video_001_frame_000000001.txt # YOLO detection format | |
| │   │   └── video_001_frame_000000005.txt | |
| │   └── video_002/ | |
| ├── masks/ # Segmentation masks | |
| │   ├── video_001/ | |
| │   │   ├── video_001_frame_000000002.png # Single-channel mask | |
| │   │   └── video_001_frame_000000002.txt # YOLO polygon format | |
| │   └── video_002/ | |
| └── data.yaml | |
| ``` | |
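Because the combined layout mirrors the same video and frame names under `images/`, `labels/`, and `masks/`, an annotation path can be derived from an image path. This is a sketch only; `annotation_path_for` is a hypothetical helper, not part of RM2025-DatasetUtils.

```python
from pathlib import PurePosixPath


def annotation_path_for(
    image_path: str, kind: str = "labels", suffix: str = ".txt"
) -> PurePosixPath:
    """Map an images/... path to the mirrored labels/... or masks/... path."""
    parts = list(PurePosixPath(image_path).parts)
    if "images" not in parts:
        raise ValueError(f"no images/ component in {image_path!r}")
    # Swap only the first images/ component; the rest of the path is mirrored.
    parts[parts.index("images")] = kind
    return PurePosixPath(*parts).with_suffix(suffix)


print(annotation_path_for("rm_combined_001/images/video_001/video_001_frame_000000001.jpg"))
```

Not every frame has an annotation (in the tree above, frame 2 has a mask but no detection label), so callers should check that the derived file exists before reading it.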
| --- | |
| ## Model Management | |
| ### Model Versioning | |
| - Use the hf_upload_model script from the RM2025-DatasetUtils repository to upload models. | |
| - Include version information (revision) during uploads for effective version control. | |
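The interface of hf_upload_model is not shown here, so the following is not that script. It is a rough sketch of one way a versioned upload could look with the `huggingface_hub` client, assuming each version is stored as a branch; `upload_model_version` is a hypothetical helper, and the folder path and version string in the usage comment are made up.

```python
from typing import Any


def upload_model_version(api: Any, repo_id: str, folder_path: str, version: str) -> None:
    """Upload a local model folder to a branch named after the version.

    `api` is expected to behave like huggingface_hub.HfApi.
    """
    # The target branch must exist before files can be committed to it.
    api.create_branch(repo_id=repo_id, branch=version, repo_type="model", exist_ok=True)
    api.upload_folder(
        repo_id=repo_id,
        folder_path=folder_path,
        repo_type="model",
        revision=version,
        commit_message=f"Upload model {version}",
    )


# Hypothetical usage (requires huggingface_hub and a logged-in session):
# from huggingface_hub import HfApi
# upload_model_version(HfApi(), "hkustenterprize/car_detection", "weights/", "v1.0")
```

A consumer can then pin the same version by passing that branch name as the revision when downloading.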
| ### Model Repository Naming | |
| - Follow the [project]_[task] format. | |
| - **Example**: car_detection. | |