# HuggingFace Datasets Management

## Overview
This README outlines the standards and structure for managing datasets on HuggingFace, with a focus on video data processing. For video datasets, open-source videos are extracted into frames and stored in public datasets (unlimited storage), while annotations (labels) are stored in private datasets for security.
## Core Idea
Store annotated data on HuggingFace with precise naming conventions. Once annotations are complete and uploaded, the data remains unchanged to ensure stability and consistency. This allows training processes to simply select the required datasets without worrying about version mismatches or data alterations.
### Why This Approach?
- **Data Stability:** Fixed datasets cannot change after upload, ensuring consistent training results and avoiding performance fluctuations.
- **Simplified Training:** Precise naming enables quick dataset selection, streamlining the training process.
- **Multi-Project Support:** Clear naming conventions distinguish projects and tasks, making it easy to switch between multiple training tasks.
- **Team Collaboration:** Immutable datasets ensure all team members use the same data, reducing discrepancies.
- **Reproducibility:** Fixed datasets let other researchers or team members replicate training results, improving transparency and reliability.
## Repository Overview
- **GitHub Datasets Repo:** `RM2025-DatasetUtils`, which contains scripts for data processing, annotation, and training code.

```shell
git clone --recursive git@github.com:hkustenterprize-algo/RM2025-DatasetUtils.git
```

- **HuggingFace (Datasets & Models):** this repository, `hkustenterprize`, hosts all datasets and trained models. See below for the `datasets/` structure.
## TODO: Training with Streaming and ugcpu

### Task Description
Plan to use streaming technology to dynamically load datasets during training on the ugcpu platform. Streaming enables on-demand data loading without downloading entire datasets, and ugcpu provides high-performance computing for machine learning tasks.
### Expected Benefits
- **Storage Efficiency:** No need to download full datasets, reducing local storage requirements.
- **Flexibility:** Dynamic dataset loading supports seamless switching between projects.
- **Performance:** ugcpu accelerates training with powerful computational resources.
**Note:** This is a TODO item and has not yet been implemented; updates will follow as the project progresses.
## Dataset Naming

Datasets follow this naming convention:

```
[raw/images/labels/masks]_[project]_[task]_[*content]_[*description]
```
### Example

```
images-labels_car_detection_2024-rmuc-south-engineer_fpv-live
```

- `[images-labels]`: contains both images and annotations.
- `[car_detection]`: specifies the project and task.
- `[2024-rmuc-south-engineer]`: content (year-competition-region-role).
- `[fpv-live]`: description (first-person view, live stream).
### Multi-Project Datasets
For datasets covering multiple projects (e.g., mine and station), connect projects with a hyphen in the [project] field.
Example: `images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live`

- `[mine-station]`: includes both the mine and station projects.
### Content and Description

- `[*content]`: for competition videos, use hyphen-separated fields formatted as `year-competition-region-role`.
  - Regions: `regional`, `mainland`, `east`, `north`, etc.
  - Competitions: `rmul`, `rmuc`, etc.
- `[*description]`: hyphen-separated tags, e.g., `fpv` (first-person view), `live` (live stream).
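To catch naming mistakes before upload, the convention above can be checked mechanically. A minimal sketch, where the allowed characters and field layout are assumptions inferred from the examples in this README:

```python
import re

# Hypothetical validator for the naming convention; adjust if the real
# vocabulary of fields differs from what this README's examples suggest.
NAME_RE = re.compile(
    r"^(?P<kind>(?:raw|images|labels|masks)(?:-(?:raw|images|labels|masks))*)"  # e.g. images-labels
    r"_(?P<project>[a-z0-9]+(?:-[a-z0-9]+)*)"                # e.g. car or mine-station
    r"_(?P<task>[a-z0-9]+)"                                  # e.g. detection
    r"(?:_(?P<content>[a-z0-9]+(?:-[a-z0-9]+)*))?"           # e.g. 2024-rmuc-south-engineer
    r"(?:_(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*))?$"      # e.g. fpv-live
)

def parse_dataset_name(name: str) -> dict:
    """Split a dataset name into its convention fields, or raise ValueError."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"invalid dataset name: {name}")
    return m.groupdict()
```

Running the README's own example through it yields `kind="images-labels"`, `project="car"`, `task="detection"`, and so on, which makes the convention scriptable for upload tooling.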
## Dataset Structure and Naming

**Note:** If the number of videos is small, the per-video folders can be omitted and frames placed directly in the root directory.
### Structure Types

- **Standard:** contains `images/` and `labels/` folders.
- **Video-Split:** splits the dataset into per-video folders, each stored separately.
### Public Datasets (Images)

- Organized into per-video folders with abstract names (e.g., `video_001`).
- Frames are named sequentially, e.g., `video_001_frame_000000001.jpg`.
```
images_car_detection_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── video_001_frame_000000001.jpg
│   └── video_001_frame_000000005.jpg
├── video_002/
│   └── video_002_frame_000000001.jpg
├── raw/
│   ├── video_001.mp4
│   └── video_002.mp4
└── data.yaml
```
- `data.yaml`: uses the YOLO format for class information.
```yaml
# data.yaml
nc: 3
names:
  0: car
  1: truck
  2: obstacle
```
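The sequential frame naming shown in the tree above (nine-digit, zero-padded indices, judging by the examples) is easy to generate and parse with a small helper; the padding width is an assumption taken from the sample filenames:

```python
def frame_filename(video: str, frame_index_num: int, ext: str = "jpg") -> str:
    """Build a frame filename like video_001_frame_000000001.jpg (9-digit padding)."""
    return f"{video}_frame_{frame_index_num:09d}.{ext}"

def frame_index(filename: str) -> int:
    """Recover the integer frame index from a frame filename."""
    stem = filename.rsplit(".", 1)[0]          # drop the extension
    return int(stem.rsplit("_frame_", 1)[1])   # digits after the _frame_ marker
```

Keeping the formatter and the parser side by side makes it harder for extraction scripts and training code to drift apart on the padding width.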
**Future Plan:** Store raw videos on a NAS for efficient storage.
### Private Datasets (Labels/Masks)

Private datasets mirror the public dataset structure and are organized by task type.
#### 1. Detection

- Annotation files (`.txt`) use the YOLO detection format: `class_id x_center y_center width height`.
```
labels_car_detection_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── video_001_frame_000000001.txt
│   └── video_001_frame_000000005.txt
├── video_002/
└── data.yaml
```
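Since the labels use the standard YOLO detection format with normalized coordinates, a label line can be decoded and mapped back to pixel space like this (a sketch; the image dimensions come from the corresponding frame):

```python
from dataclasses import dataclass

@dataclass
class Box:
    class_id: int
    x_center: float  # all coordinates normalized to [0, 1]
    y_center: float
    width: float
    height: float

def parse_yolo_line(line: str) -> Box:
    """Parse one 'class_id x_center y_center width height' label line."""
    cid, xc, yc, w, h = line.split()
    return Box(int(cid), float(xc), float(yc), float(w), float(h))

def to_pixels(box: Box, img_w: int, img_h: int) -> tuple:
    """Convert a normalized box to pixel (x_min, y_min, x_max, y_max)."""
    x_min = (box.x_center - box.width / 2) * img_w
    y_min = (box.y_center - box.height / 2) * img_h
    x_max = (box.x_center + box.width / 2) * img_w
    y_max = (box.y_center + box.height / 2) * img_h
    return (x_min, y_min, x_max, y_max)
```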
#### 2. Segmentation

- Masks can be images (`.png`) or YOLO polygon format (`.txt`).
- A `labels/` folder is optional.
```
masks_car_segmentation_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── labels/                             # optional
│   │   └── video_001_frame_000000002.txt
│   └── masks/
│       ├── video_001_frame_000000002.png   # single-channel mask
│       └── video_001_frame_000000002.txt   # YOLO polygon format
├── video_002/
└── data.yaml
```
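Masks stored in YOLO polygon format (one `class_id x1 y1 x2 y2 ...` line per instance, with normalized coordinates) can be decoded into pixel-space points with a small sketch like:

```python
def parse_yolo_polygon(line: str, img_w: int, img_h: int):
    """Parse a normalized 'class_id x1 y1 x2 y2 ...' line into pixel points."""
    parts = line.split()
    class_id = int(parts[0])
    coords = [float(v) for v in parts[1:]]
    if len(coords) % 2 != 0:
        raise ValueError("polygon needs an even number of coordinates")
    # pair up (x, y) values and scale to the frame's pixel size
    points = [(x * img_w, y * img_h) for x, y in zip(coords[0::2], coords[1::2])]
    return class_id, points
```

The resulting point list can then be rasterized into the single-channel `.png` mask form by whatever drawing library the pipeline uses.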
### Combined Structure (Images and Labels in a Private Dataset)
For datasets with both images and annotations in a private repository:
```
rm_combined_001/
├── images/
│   ├── video_001/
│   │   ├── video_001_frame_000000001.jpg
│   │   ├── video_001_frame_000000002.jpg
│   │   └── video_001_frame_000000005.jpg
│   └── video_002/
├── labels/                                 # detection labels
│   ├── video_001/
│   │   ├── video_001_frame_000000001.txt   # YOLO detection format
│   │   └── video_001_frame_000000005.txt
│   └── video_002/
├── masks/                                  # segmentation masks
│   ├── video_001/
│   │   ├── video_001_frame_000000002.png   # single-channel mask
│   │   └── video_001_frame_000000002.txt   # YOLO polygon format
│   └── video_002/
└── data.yaml
```
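With the combined layout, the label or mask path for a given frame is purely mechanical to derive. A sketch, assuming the `images/`, `labels/`, and `masks/` folder names shown in the tree:

```python
from pathlib import Path

def sibling_path(image_path: str, kind: str = "labels", suffix: str = ".txt") -> Path:
    """Map images/<video>/<frame>.jpg to <kind>/<video>/<frame><suffix>."""
    p = Path(image_path)
    parts = list(p.parts)
    parts[parts.index("images")] = kind   # swap the top-level folder
    return Path(*parts).with_suffix(suffix)
```

For example, the detection label for `images/video_001/video_001_frame_000000001.jpg` lives at `labels/video_001/video_001_frame_000000001.txt`, and the mask variant just swaps in `masks` and `.png`.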
## Model Management

### Model Versioning

- Use the `hf_upload_model` script from the `RM2025-DatasetUtils` repository to upload models.
- Include version information (a `revision`) with each upload for effective version control.
### Model Repository Naming

- Follow the `[project]_[task]` format.
- Example: `car_detection`.
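The `hf_upload_model` script itself lives in `RM2025-DatasetUtils`; as a hedged sketch of the kind of `huggingface_hub` call it presumably wraps (the org name, folder, and revision below are placeholders, not the script's actual interface):

```python
def model_repo_id(org: str, project: str, task: str) -> str:
    """Build a [project]_[task] model repo id under the given org."""
    return f"{org}/{project}_{task}"

def upload_model(folder: str, org: str, project: str, task: str, revision: str) -> str:
    """Upload a local model folder to a specific revision (branch) on the Hub."""
    from huggingface_hub import HfApi  # lazy import; requires `pip install huggingface_hub`

    repo_id = model_repo_id(org, project, task)
    api = HfApi()
    # The target branch may need to exist first (e.g. created via api.create_branch);
    # uploading to a named revision keeps each model version independently addressable.
    api.upload_folder(folder_path=folder, repo_id=repo_id,
                      repo_type="model", revision=revision)
    return repo_id
```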