📊 HuggingFace Datasets Management

🌟 Overview

This README outlines the standards and structure for managing datasets on HuggingFace, with a focus on video data processing. For video datasets, open-source videos are split into frames that are stored in public datasets (unlimited storage), while annotations (labels) are stored in private datasets for security.

🔑 Core Idea

Store annotated data on HuggingFace with precise naming conventions. Once annotations are complete and uploaded, the data remains unchanged to ensure stability and consistency. This allows training processes to simply select the required datasets without worrying about version mismatches or data alterations.

Why This Approach?

  • 🔒 Data Stability: Fixed datasets prevent changes post-upload, ensuring consistent training results and avoiding performance fluctuations.

  • 🚀 Simplified Training: Precise naming enables quick dataset selection, streamlining the training process.

  • 📁 Multi-Project Support: Clear naming conventions distinguish projects and tasks, facilitating seamless switching between multiple training tasks.

  • 🤝 Team Collaboration: Immutable datasets ensure all team members use the same data, enhancing collaboration and reducing discrepancies.

  • 🔄 Reproducibility: Fixed datasets enable other researchers or team members to replicate training results, improving transparency and reliability.


📚 Repository Overview

  • GitHub Datasets Repo:
    RM2025-DatasetUtils
    Contains scripts for data processing, annotation, and training code.

    git clone --recursive git@github.com:hkustenterprize-algo/RM2025-DatasetUtils.git
    
  • HuggingFace (Datasets & Models):
    👉 this repository (hkustenterprize)
    Hosts all datasets and trained models. See the sections below for the datasets/ structure.


🚧 TODO: Training with Streaming and ugcpu

Task Description

The plan is to use streaming to load datasets dynamically during training on the ugcpu platform. Streaming loads data on demand without downloading entire datasets, and ugcpu provides high-performance computing for machine learning tasks.

Expected Benefits

  • 💾 Storage Efficiency: No need to download full datasets, reducing local storage requirements.

  • 🔄 Flexibility: Dynamic dataset loading supports seamless switching between projects.

  • ⚔ Performance: ugcpu accelerates training with powerful computational resources.

Note

This is a TODO item and not yet implemented. Updates will follow as the project progresses.


📝 Dataset Naming

Datasets follow this naming convention:

[raw/images/labels/masks]_[project]_[task]_[*content]_[*description]

Example

  • images-labels_car_detection_2024-rmuc-south-engineer_fpv-live

    • [images-labels]: Contains both images and annotations.

    • [car_detection]: Specifies the project and task.

    • [2024-rmuc-south-engineer]: Content (year-competition-region-role).

    • [fpv-live]: Description (first-person view, live stream).

Multi-Project Datasets

For datasets covering multiple projects (e.g., mine and station), connect projects with a hyphen in the [project] field.

  • Example: images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live

    • [mine-station]: Includes both mine and station projects.

Content and Description

  • [*content]: Use hyphens for competition videos, formatted as year-competition-region-role.

    • Regions: regional, mainland, east, north, etc.

    • Competitions: rmul, rmuc, etc.

  • [*description]: Use hyphens, e.g., fpv (first-person view), live (live stream).
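
The naming convention above can be checked mechanically. The following is a minimal sketch, not part of RM2025-DatasetUtils: the names `DatasetName` and `parse_dataset_name` are hypothetical, and it assumes that the fields marked with `*` are optional and that underscores appear only as field separators.

```python
from dataclasses import dataclass
from typing import List, Optional

# Data kinds allowed in the first field of a dataset name.
ALLOWED_KINDS = {"raw", "images", "labels", "masks"}


@dataclass
class DatasetName:
    kinds: List[str]            # e.g. ["images", "labels"]
    projects: List[str]         # e.g. ["mine", "station"]
    task: str                   # e.g. "detection"
    content: Optional[str]      # e.g. "2024-rmuc-south-engineer"
    description: Optional[str]  # e.g. "fpv-live"


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name into its underscore-separated fields."""
    parts = name.split("_")
    if not 3 <= len(parts) <= 5:
        raise ValueError(f"expected 3 to 5 underscore-separated fields: {name!r}")
    kinds = parts[0].split("-")
    unknown = [kind for kind in kinds if kind not in ALLOWED_KINDS]
    if unknown:
        raise ValueError(f"unknown data kind(s) {unknown} in {name!r}")
    # Assumption: the fields marked with * in the convention are optional.
    content = parts[3] if len(parts) > 3 else None
    description = parts[4] if len(parts) > 4 else None
    return DatasetName(kinds, parts[1].split("-"), parts[2], content, description)


parsed = parse_dataset_name(
    "images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live"
)
print(parsed.kinds, parsed.projects, parsed.task)
```

Because hyphens join values inside a field and underscores separate fields, a plain split is enough; a name that uses an underscore inside a field would be rejected or misread by this sketch.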


🗂️ Dataset Structure and Naming

Note: If the number of videos is small, the video folder can be omitted, with frames placed directly in the root directory.

Structure Types

  1. Standard: Contains an images/ folder and a labels/ folder.

  2. Video-Split: Splits the dataset into one folder per video, each stored separately.

Public Datasets (Images)

  • Organized by video folders with abstract names (e.g., video_001).

  • Frames named sequentially, e.g., video_001_frame_000000001.jpg.
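
The frame naming scheme above can be sketched as a pair of helpers. This is an illustrative sketch only: `frame_filename` and `parse_frame_filename` are hypothetical names, and the padding widths (3 digits for the video index, 9 for the frame index) are read off the example file name rather than taken from a stated rule.

```python
import re
from typing import Tuple

# Matches names such as video_001_frame_000000001.jpg.
FRAME_RE = re.compile(r"^video_(\d+)_frame_(\d+)\.(\w+)$")


def frame_filename(video: int, frame: int, ext: str = "jpg") -> str:
    """Build a frame file name with zero-padded video and frame indices."""
    # Assumption: 3-digit video index and 9-digit frame index.
    return f"video_{video:03d}_frame_{frame:09d}.{ext}"


def parse_frame_filename(filename: str) -> Tuple[int, int, str]:
    """Return (video index, frame index, extension) for a frame file name."""
    match = FRAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a frame file name: {filename!r}")
    return int(match.group(1)), int(match.group(2)), match.group(3)


print(frame_filename(1, 5))
print(parse_frame_filename(frame_filename(2, 1, ext="txt")))
```

Zero-padding keeps lexicographic order identical to numeric order, so a sorted directory listing yields frames in sequence.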

images_car_detection_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── video_001_frame_000000001.jpg
│   └── video_001_frame_000000005.jpg
├── video_002/
│   └── video_002_frame_000000001.jpg
├── raw/
│   ├── video_001.mp4
│   └── video_002.mp4
└── data.yaml

  • data.yaml: Uses YOLO format for class information.

# data.yaml
nc: 3
names:
  0: car
  1: truck
  2: obstacle
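
A data.yaml in the layout shown above could be generated from a class list so that nc always matches the number of names. This is a minimal sketch using only the standard library; `render_data_yaml` is a hypothetical helper, and it assumes class names are plain words that need no YAML quoting.

```python
from typing import Sequence


def render_data_yaml(class_names: Sequence[str]) -> str:
    """Render class information in the nc/names layout shown above."""
    lines = [f"nc: {len(class_names)}", "names:"]
    # Class ids are assigned in list order, starting at 0.
    lines += [f"  {class_id}: {name}" for class_id, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


print(render_data_yaml(["car", "truck", "obstacle"]), end="")
```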

Future Plan: Store raw videos on a NAS for efficient storage.

Private Datasets (Labels/Masks)

Private datasets mirror public dataset structures, organized by task type.
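
Because the private structure mirrors the public one, the label file for a frame can be located from the image path alone. The sketch below illustrates this for detection labels; `labels_repo_name` and `label_path_for` are hypothetical helpers, and it assumes the private dataset name differs from the public one only in its leading data-kind field.

```python
from pathlib import PurePosixPath


def labels_repo_name(images_repo: str, kind: str = "labels") -> str:
    """Swap the leading data-kind field, e.g. images_... -> labels_..."""
    prefix, sep, rest = images_repo.partition("_")
    if prefix != "images" or not sep:
        raise ValueError(f"expected a name starting with 'images_': {images_repo!r}")
    return f"{kind}_{rest}"


def label_path_for(image_path: str) -> str:
    """Map an image path to the mirrored annotation path (.txt)."""
    return str(PurePosixPath(image_path).with_suffix(".txt"))


print(labels_repo_name("images_car_detection_2024-rmuc-south-engineer_fpv-live"))
print(label_path_for("video_001/video_001_frame_000000001.jpg"))
```

Segmentation datasets use a different task field and place masks in a subfolder, so they would need a separate mapping.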

1. Detection

  • Annotation files (.txt) in YOLO detection format: class_id x_center y_center width height.

labels_car_detection_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── video_001_frame_000000001.txt
│   └── video_001_frame_000000005.txt
├── video_002/
└── data.yaml
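
In the YOLO detection format, the four box values are normalized to the image size, so reading a label requires the image dimensions. The following sketch converts one annotation line to pixel corner coordinates; `Box` and `parse_yolo_detection_line` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_yolo_detection_line(line: str, img_w: int, img_h: int) -> Box:
    """Convert 'class_id x_center y_center width height' to pixel corners."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    class_id = int(fields[0])
    # The four box values are fractions of the image width and height.
    xc, yc, w, h = (float(value) for value in fields[1:])
    return Box(
        class_id,
        (xc - w / 2) * img_w,
        (yc - h / 2) * img_h,
        (xc + w / 2) * img_w,
        (yc + h / 2) * img_h,
    )


print(parse_yolo_detection_line("0 0.5 0.5 0.5 0.25", img_w=640, img_h=480))
```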

2. Segmentation

  • Masks can be stored as images (.png) or in YOLO polygon format (.txt).

  • The labels/ folder is optional.

masks_car_segmentation_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── labels/  # Optional
│   │   └── video_001_frame_000000002.txt
│   └── masks/
│       ├── video_001_frame_000000002.png  # Single-channel mask
│       └── video_001_frame_000000002.txt  # YOLO polygon format
├── video_002/
└── data.yaml
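
A line in YOLO polygon format holds a class id followed by normalized x y pairs that outline one object. The sketch below converts such a line to pixel coordinates; `parse_yolo_polygon_line` is a hypothetical helper, not a function from the project's scripts.

```python
from typing import List, Tuple


def parse_yolo_polygon_line(
    line: str, img_w: int, img_h: int
) -> Tuple[int, List[Tuple[float, float]]]:
    """Convert 'class_id x1 y1 x2 y2 ...' to a class id and pixel points."""
    fields = line.split()
    coords = fields[1:]
    # A polygon needs at least three points, given as x y pairs.
    if len(coords) < 6 or len(coords) % 2 != 0:
        raise ValueError(f"expected at least 3 x y pairs: {line!r}")
    class_id = int(fields[0])
    values = [float(value) for value in coords]
    # Coordinates are fractions of the image width (x) and height (y).
    points = [(x * img_w, y * img_h) for x, y in zip(values[0::2], values[1::2])]
    return class_id, points


print(parse_yolo_polygon_line("1 0.25 0.25 0.75 0.25 0.5 0.75", img_w=400, img_h=200))
```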

Combined Structure (Images and Labels in Private Dataset)

For datasets with both images and annotations in a private repository:

rm_combined_001/
├── images/
│   ├── video_001/
│   │   ├── video_001_frame_000000001.jpg
│   │   ├── video_001_frame_000000002.jpg
│   │   └── video_001_frame_000000005.jpg
│   └── video_002/
├── labels/  # Detection labels
│   ├── video_001/
│   │   ├── video_001_frame_000000001.txt  # YOLO detection format
│   │   └── video_001_frame_000000005.txt
│   └── video_002/
├── masks/   # Segmentation masks
│   ├── video_001/
│   │   ├── video_001_frame_000000002.png  # Single-channel mask
│   │   └── video_001_frame_000000002.txt  # YOLO polygon format
│   └── video_002/
└── data.yaml
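
In the combined structure, a frame may have a detection label, a mask, both, or neither, so a loader has to match files by frame rather than assume every annotation exists. The following sketch groups a list of relative file paths by frame; `group_by_frame` and its slot names (such as "masks.png") are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable

TOP_LEVEL_FOLDERS = ("images", "labels", "masks")


def group_by_frame(file_paths: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Group relative file paths by frame, e.g. 'video_001/video_001_frame_...'."""
    samples: Dict[str, Dict[str, str]] = defaultdict(dict)
    for raw in file_paths:
        path = PurePosixPath(raw)
        if len(path.parts) < 2 or path.parts[0] not in TOP_LEVEL_FOLDERS:
            continue  # Skip root-level files such as data.yaml.
        top = path.parts[0]
        # The frame key is the path below the top-level folder, without suffix.
        key = str(PurePosixPath(*path.parts[1:]).with_suffix(""))
        # masks/ may hold both a .png and a .txt per frame, so keep the suffix.
        slot = "image" if top == "images" else f"{top}{path.suffix}"
        samples[key][slot] = raw
    return dict(samples)


files = [
    "images/video_001/video_001_frame_000000001.jpg",
    "labels/video_001/video_001_frame_000000001.txt",
    "images/video_001/video_001_frame_000000002.jpg",
    "masks/video_001/video_001_frame_000000002.png",
    "data.yaml",
]
for frame_key, entry in group_by_frame(files).items():
    print(frame_key, sorted(entry))
```

The input is expected to list files only; a training script could then keep just the frames whose entry contains the annotation type it needs.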

🛠️ Model Management

Model Versioning

  • Use the hf_upload_model script from the RM2025-DatasetUtils repository to upload models.

  • Include version information (revision) during uploads for effective version control.

Model Repository Naming

  • Follow the [project]_[task] format.

    • Example: car_detection.
