# 📊 HuggingFace Datasets Management
## 🌟 Overview
This README outlines the standards and structure for managing datasets on HuggingFace, with a focus on video data processing. For video datasets, open-source videos are extracted into frames and stored in **public datasets** (unlimited storage), while annotations (labels) are stored in **private datasets** for security.
### 🔑 Core Idea
**Store annotated data on HuggingFace with precise naming conventions. Once annotations are complete and uploaded, the data remains unchanged to ensure stability and consistency.** This allows training processes to simply select the required datasets without worrying about version mismatches or data alterations.
#### Why This Approach?
- **🔒 Data Stability**: Fixed datasets prevent changes post-upload, ensuring consistent training results and avoiding performance fluctuations.
- **🚀 Simplified Training**: Precise naming enables quick dataset selection, streamlining the training process.
- **📁 Multi-Project Support**: Clear naming conventions distinguish projects and tasks, facilitating seamless switching between multiple training tasks.
- **🤝 Team Collaboration**: Immutable datasets ensure all team members use the same data, enhancing collaboration and reducing discrepancies.
- **🔄 Reproducibility**: Fixed datasets enable other researchers or team members to replicate training results, improving transparency and reliability.
---
## 📚 Repository Overview
- **GitHub Datasets Repo**: RM2025-DatasetUtils. Contains scripts for data processing and annotation, as well as training code.
```bash
git clone --recursive git@github.com:hkustenterprize-algo/RM2025-DatasetUtils.git
```
- **HuggingFace (Datasets & Models)**: 👉 this repository, hkustenterprize. Hosts all datasets and trained models. See the sections below for the **datasets/ structure**.
---
## 🚧 TODO: Training with Streaming and ugcpu
### Task Description
The plan is to use **streaming** to load datasets dynamically during training on the **ugcpu** platform. Streaming loads data on demand instead of downloading entire datasets, and ugcpu provides high-performance computing for machine learning tasks.
### Expected Benefits
- **💾 Storage Efficiency**: No need to download full datasets, reducing local storage requirements.
- **🔄 Flexibility**: Dynamic dataset loading supports seamless switching between projects.
- **⚡ Performance**: ugcpu accelerates training with powerful computational resources.
### Note
This is a **TODO** item and not yet implemented. Updates will follow as the project progresses.
---
## 📝 Dataset Naming
Datasets follow this naming convention:
```
[raw/images/labels/masks]_[project]_[task]_[*content]_[*description]
```
### Example
- images-labels_car_detection_2024-rmuc-south-engineer_fpv-live
  - `[images-labels]`: Contains both images and annotations.
  - `[car_detection]`: Specifies the project and task.
  - `[2024-rmuc-south-engineer]`: Content (year-competition-region-role).
  - `[fpv-live]`: Description (first-person view, live stream).
### Multi-Project Datasets
For datasets covering multiple projects (e.g., mine and station), connect projects with a hyphen in the [project] field.
- **Example**: images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live
  - `[mine-station]`: Includes both mine and station projects.
### Content and Description
- **[*content]**: Use hyphens for competition videos, formatted as year-competition-region-role.
  - **Regions**: regional, mainland, east, north, etc.
  - **Competitions**: rmul, rmuc, etc.
- **[*description]**: Join descriptors with hyphens, e.g., fpv (first-person view), live (live stream).
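
The naming convention above can be checked mechanically before a dataset is uploaded. The following is a minimal sketch, not part of RM2025-DatasetUtils: `parse_dataset_name` and `DatasetName` are hypothetical names, and the sketch assumes that fields are separated by underscores, that no field contains an underscore itself, and that the `*` prefix marks a field as optional.

```python
from dataclasses import dataclass
from typing import List, Optional

# Data kinds allowed in the first field, taken from the convention above.
KNOWN_KINDS = {"raw", "images", "labels", "masks"}


@dataclass
class DatasetName:
    kinds: List[str]             # e.g. ["images", "labels"]
    projects: List[str]          # e.g. ["mine", "station"]
    task: str                    # e.g. "detection"
    content: Optional[str]       # e.g. "2024-rmuc-south-engineer"
    description: Optional[str]   # e.g. "fpv-live"


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name into its fields (hypothetical helper).

    Assumption: '_' separates fields, '-' separates values inside a field,
    and the last two fields may be omitted.
    """
    parts = name.split("_")
    if not 3 <= len(parts) <= 5:
        raise ValueError(f"expected 3 to 5 '_'-separated fields, got {len(parts)}")

    kinds = parts[0].split("-")
    unknown = [kind for kind in kinds if kind not in KNOWN_KINDS]
    if unknown:
        raise ValueError(f"unknown data kind(s): {unknown}")

    return DatasetName(
        kinds=kinds,
        projects=parts[1].split("-"),
        task=parts[2],
        content=parts[3] if len(parts) > 3 else None,
        description=parts[4] if len(parts) > 4 else None,
    )


if __name__ == "__main__":
    print(parse_dataset_name(
        "images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live"
    ))
```

A check like this could run before upload so that misnamed repositories are rejected early; names that do not follow the convention raise `ValueError`.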
---
## 🗂️ Dataset Structure and Naming
> **Note**: If the number of videos is small, the video folder can be omitted, with frames placed directly in the root directory.
### Structure Types
1. **Standard**: Contains an images/ folder and a labels/ folder.
2. **Video-Split**: Splits the dataset into one folder per video, each stored separately.
### Public Datasets (Images)
- Organized by video folders with abstract names (e.g., video_001).
- Frames named sequentially, e.g., video_001_frame_000000001.jpg.
```
images_car_detection_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── video_001_frame_000000001.jpg
│   ├── video_001_frame_000000005.jpg
├── video_002/
│   ├── video_002_frame_000000001.jpg
├── raw/
│   ├── video_001.mp4
│   ├── video_002.mp4
├── data.yaml
```
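
Frame names in the tree above follow a fixed pattern, so they can be generated and validated in code. This is a small sketch with hypothetical helper names (`frame_filename`, `parse_frame_filename`); the three-digit video index and nine-digit frame index are assumptions inferred from the examples in this README.

```python
import re
from typing import Tuple

# Pattern inferred from names such as video_001_frame_000000001.jpg.
# The digit counts are an assumption based on the examples shown here.
FRAME_RE = re.compile(r"^(video_\d{3})_frame_(\d{9})\.(jpg|png|txt)$")


def frame_filename(video_index: int, frame_index: int, ext: str = "jpg") -> str:
    """Build a frame file name with zero-padded video and frame indices."""
    return f"video_{video_index:03d}_frame_{frame_index:09d}.{ext}"


def parse_frame_filename(filename: str) -> Tuple[str, int]:
    """Return (video folder name, frame index) or raise ValueError."""
    match = FRAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a valid frame file name: {filename!r}")
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    print(frame_filename(1, 5))
    print(parse_frame_filename("video_002_frame_000000001.jpg"))
```

Because images and annotations share the same stem, the same helpers apply to label and mask files by changing the extension.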
- **data.yaml**: Uses YOLO format for class information.
```yaml
# data.yaml
nc: 3
names:
  0: car
  1: truck
  2: obstacle
```
> **Future Plan**: Store raw videos on a NAS for efficient storage.
### Private Datasets (Labels/Masks)
Private datasets mirror public dataset structures, organized by task type.
#### 1. Detection
- Annotation files (.txt) in YOLO detection format: class_id x_center y_center width height.
```
labels_car_detection_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── video_001_frame_000000001.txt
│   ├── video_001_frame_000000005.txt
├── video_002/
├── data.yaml
```
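
To illustrate the detection annotation format, the sketch below parses one label file and converts a box to pixel corners. `YoloBox`, `parse_yolo_labels`, and `to_pixel_corners` are hypothetical names, and the sketch assumes the usual YOLO convention that all four coordinates are normalized to the range 0 to 1 relative to the image size.

```python
from typing import List, NamedTuple, Tuple


class YoloBox(NamedTuple):
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def parse_yolo_labels(text: str) -> List[YoloBox]:
    """Parse lines of the form: class_id x_center y_center width height."""
    boxes = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 fields, got {len(fields)}")
        boxes.append(YoloBox(int(fields[0]), *map(float, fields[1:])))
    return boxes


def to_pixel_corners(
    box: YoloBox, image_width: int, image_height: int
) -> Tuple[float, float, float, float]:
    """Convert a normalized box to (x_min, y_min, x_max, y_max) in pixels."""
    x_min = (box.x_center - box.width / 2) * image_width
    y_min = (box.y_center - box.height / 2) * image_height
    x_max = (box.x_center + box.width / 2) * image_width
    y_max = (box.y_center + box.height / 2) * image_height
    return x_min, y_min, x_max, y_max


if __name__ == "__main__":
    sample = "0 0.5 0.5 0.25 0.5\n2 0.1 0.2 0.1 0.2\n"
    for box in parse_yolo_labels(sample):
        print(box, to_pixel_corners(box, 640, 480))
```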
#### 2. Segmentation
- Masks can be images (.png) or YOLO polygon format (.txt).
- Optional labels folder.
```
masks_car_segmentation_2024-rmuc-south-engineer_fpv-live/
├── video_001/
│   ├── labels/  # Optional
│   │   ├── video_001_frame_000000002.txt
│   ├── masks/
│   │   ├── video_001_frame_000000002.png  # Single-channel mask
│   │   ├── video_001_frame_000000002.txt  # YOLO polygon format
├── video_002/
├── data.yaml
```
### Combined Structure (Images and Labels in Private Dataset)
For datasets with both images and annotations in a private repository:
```
rm_combined_001/
├── images/
│   ├── video_001/
│   │   ├── video_001_frame_000000001.jpg
│   │   ├── video_001_frame_000000002.jpg
│   │   ├── video_001_frame_000000005.jpg
│   ├── video_002/
├── labels/  # Detection labels
│   ├── video_001/
│   │   ├── video_001_frame_000000001.txt  # YOLO detection format
│   │   ├── video_001_frame_000000005.txt
│   ├── video_002/
├── masks/  # Segmentation masks
│   ├── video_001/
│   │   ├── video_001_frame_000000002.png  # Single-channel mask
│   │   ├── video_001_frame_000000002.txt  # YOLO polygon format
│   ├── video_002/
├── data.yaml
```
```
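
Since images and annotations share a video folder and a file stem, they can be matched by those two parts of the path. The sketch below shows one possible way to do this on plain path strings; `pair_images_with_labels` is a hypothetical helper, and matching on (video folder, stem) is an assumption drawn from the layouts in this README.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Tuple


def pair_images_with_labels(
    image_paths: Iterable[str], label_paths: Iterable[str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Match images to labels by (video folder, file stem).

    Returns the matched (image, label) pairs and the images that have
    no label file, in the order the images were given.
    """
    labels_by_key: Dict[Tuple[str, str], str] = {}
    for label in label_paths:
        path = PurePosixPath(label)
        labels_by_key[(path.parent.name, path.stem)] = label

    pairs: List[Tuple[str, str]] = []
    unlabeled: List[str] = []
    for image in image_paths:
        path = PurePosixPath(image)
        key = (path.parent.name, path.stem)
        if key in labels_by_key:
            pairs.append((image, labels_by_key[key]))
        else:
            unlabeled.append(image)
    return pairs, unlabeled


if __name__ == "__main__":
    images = [
        "images/video_001/video_001_frame_000000001.jpg",
        "images/video_001/video_001_frame_000000002.jpg",
        "images/video_001/video_001_frame_000000005.jpg",
    ]
    labels = [
        "labels/video_001/video_001_frame_000000001.txt",
        "labels/video_001/video_001_frame_000000005.txt",
    ]
    matched, missing = pair_images_with_labels(images, labels)
    print(matched)
    print(missing)
```

With the file names from the combined tree, frame 000000002 has no detection label and is reported as unlabeled. The same function works when the image paths come from a public dataset and the label paths from the matching private one.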
---
## 🛠️ Model Management
### Model Versioning
- Use the hf_upload_model script from the RM2025-DatasetUtils repository to upload models.
- Include version information (revision) during uploads for effective version control.
### Model Repository Naming
- Follow the [project]_[task] format.
- **Example**: car_detection.