	# 📊 HuggingFace Datasets Management

	## 🌟 Overview

	This README outlines the standards and structure for managing datasets on HuggingFace, with a focus on video data processing. For video datasets, frames are extracted from open-source videos and stored in public datasets (unlimited storage), while annotations (labels) are stored in private datasets for security.

	### 🔑 Core Idea

	Store annotated data on HuggingFace with precise naming conventions. Once annotations are complete and uploaded, the data remains unchanged to ensure stability and consistency. This allows training processes to simply select the required datasets without worrying about version mismatches or data alterations.

	#### Why This Approach?

	- 🔒 Data Stability: Fixed datasets prevent changes post-upload, ensuring consistent training results and avoiding performance fluctuations.

	- 🚀 Simplified Training: Precise naming enables quick dataset selection, streamlining the training process.

	- 📁 Multi-Project Support: Clear naming conventions distinguish projects and tasks, facilitating seamless switching between multiple training tasks.

	- 🤝 Team Collaboration: Immutable datasets ensure all team members use the same data, enhancing collaboration and reducing discrepancies.

	- 🔄 Reproducibility: Fixed datasets enable other researchers or team members to replicate training results, improving transparency and reliability.


	---

	## 📚 Repository Overview

	- GitHub Datasets Repo: RM2025-DatasetUtils. Contains scripts for data processing and annotation, as well as training code.

	```bash
	git clone --recursive git@github.com:hkustenterprize-algo/RM2025-DatasetUtils.git
	```

	- HuggingFace (Datasets & Models): 👉 this repository, hkustenterprize. Hosts all datasets and trained models. See the sections below for the dataset naming and structure.
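Because uploaded datasets are treated as immutable, a training job can pin the exact revision it was developed against. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that dataset repos are addressed as `hkustenterprize/<dataset name>`. The repo id layout and both helper names are illustrative assumptions, not part of RM2025-DatasetUtils.

```python
from typing import Optional


def download_kwargs(dataset_name: str, revision: str = "main",
                    video: Optional[str] = None) -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {
        # Assumed repo id layout: <organization>/<dataset name>.
        "repo_id": f"hkustenterprize/{dataset_name}",
        "repo_type": "dataset",
        # A branch, tag, or commit hash; a commit hash pins the data exactly.
        "revision": revision,
    }
    if video is not None:
        # Restrict the download to one video folder plus the class file.
        kwargs["allow_patterns"] = [f"{video}/*", "data.yaml"]
    return kwargs


def fetch_dataset(dataset_name: str, **options) -> str:
    """Download a dataset snapshot and return its local path."""
    # Imported lazily so download_kwargs works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(dataset_name, **options))
```

Private label datasets additionally require being logged in with an access token that can read them.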


	---

	## 🚧 TODO: Training with Streaming and ugcpu

	### Task Description

	The plan is to stream datasets during training on the ugcpu platform. Streaming loads data on demand instead of downloading entire datasets, and ugcpu provides high-performance computing for machine learning tasks.

	### Expected Benefits

	- 💾 Storage Efficiency: No need to download full datasets, reducing local storage requirements.

	- 🔄 Flexibility: Dynamic dataset loading supports seamless switching between projects.

	- ⚡ Performance: ugcpu accelerates training with powerful computational resources.


	### Note

	This is a TODO item and not yet implemented. Updates will follow as the project progresses.

	---

	## 📝 Dataset Naming

	Datasets follow this naming convention:

	```
	[raw/images/labels/masks]_[project]_[task]_[content]_[description]
	```

	### Example

	- images-labels_car_detection_2024-rmuc-south-engineer_fpv-live

		- [images-labels]: Contains both images and annotations.

		- [car_detection]: Specifies the project and task.

		- [2024-rmuc-south-engineer]: Content (year-competition-region-role).

		- [fpv-live]: Description (first-person view, live stream).
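The convention can be checked mechanically before a dataset is selected for training. This is a minimal sketch that splits a name into its five fields; the `DatasetName` class and `parse_dataset_name` function are hypothetical helpers, and the sketch assumes that no field contains an underscore.

```python
from dataclasses import dataclass
from typing import Tuple

KNOWN_TYPES = {"raw", "images", "labels", "masks"}


@dataclass(frozen=True)
class DatasetName:
    types: Tuple[str, ...]  # e.g. ("images", "labels")
    project: str
    task: str
    content: str
    description: str


def parse_dataset_name(name: str) -> DatasetName:
    """Split a dataset name into its five underscore-separated fields."""
    parts = name.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {name!r}")
    types, project, task, content, description = parts
    type_list = tuple(types.split("-"))
    unknown = set(type_list) - KNOWN_TYPES
    if unknown:
        raise ValueError(f"unknown data type(s) {sorted(unknown)} in {name!r}")
    return DatasetName(type_list, project, task, content, description)


parsed = parse_dataset_name(
    "images-labels_car_detection_2024-rmuc-south-engineer_fpv-live"
)
print(parsed.types, parsed.project, parsed.task)
```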


	### Multi-Project Datasets

	For datasets covering multiple projects (e.g., mine and station), connect projects with a hyphen in the [project] field.

	- Example: images-labels_mine-station_detection_2024-rmuc-south-engineer_fpv-live

		- [mine-station]: Includes both mine and station projects.
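Going the other way, a name can be assembled from its parts so that hyphens and underscores always land in the right place. This is a sketch only; `build_dataset_name` is a hypothetical helper, not a function from RM2025-DatasetUtils.

```python
from typing import Sequence


def build_dataset_name(types: Sequence[str], projects: Sequence[str],
                       task: str, content: str, description: str) -> str:
    """Join fields with underscores; join multiple types or projects with hyphens."""
    fields = ["-".join(types), "-".join(projects), task, content, description]
    for field in fields:
        # An underscore inside a field would make the name ambiguous to split.
        if not field or "_" in field:
            raise ValueError(f"invalid field: {field!r}")
    return "_".join(fields)


name = build_dataset_name(
    types=["images", "labels"],
    projects=["mine", "station"],
    task="detection",
    content="2024-rmuc-south-engineer",
    description="fpv-live",
)
print(name)
```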


	### Content and Description

	- **[content]**: Use hyphens for competition videos, formatted as year-competition-region-role.

		- Regions: regional, mainland, east, north, etc.

		- Competitions: rmul, rmuc, etc.

	- **[description]**: Use hyphens, e.g., fpv (first-person view), live (live stream).
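For competition videos, the content field can be split into its four parts. This is a minimal sketch, assuming a four-digit year and that neither the region nor the role contains a hyphen; `Content` and `parse_content` are hypothetical names.

```python
from typing import NamedTuple


class Content(NamedTuple):
    year: int
    competition: str
    region: str
    role: str


def parse_content(content: str) -> Content:
    """Split a content field of the form year-competition-region-role."""
    parts = content.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected year-competition-region-role, got {content!r}")
    year, competition, region, role = parts
    if not (year.isdigit() and len(year) == 4):
        raise ValueError(f"year must be four digits, got {year!r}")
    return Content(int(year), competition, region, role)


print(parse_content("2024-rmuc-south-engineer"))
```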


	---

	## 🗂️ Dataset Structure and Naming

	> Note: If the number of videos is small, the video folder can be omitted, with frames placed directly in the root directory.

	### Structure Types

	1. Standard: Contains the folders images/ and labels/.

	2. Video-Split: Splits datasets into video folders, each stored separately.


	### Public Datasets (Images)

	- Organized by video folders with abstract names (e.g., video_001).

	- Frames named sequentially, e.g., video_001_frame_000000001.jpg.


	```
	images_car_detection_2024-rmuc-south-engineer_fpv-live/
	├── video_001/
	│   ├── video_001_frame_000000001.jpg
	│   ├── video_001_frame_000000005.jpg
	├── video_002/
	│   ├── video_002_frame_000000001.jpg
	├── raw/
	│   ├── video_001.mp4
	│   ├── video_002.mp4
	├── data.yaml
	```
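Frame file names in the tree above can be generated and parsed with a small helper. The zero-padding widths (three digits for the video, nine for the frame) are inferred from the examples and are an assumption, as are both function names.

```python
import re
from typing import Tuple

# Widths inferred from names such as video_001_frame_000000001.jpg.
FRAME_RE = re.compile(r"^(video_\d{3})_frame_(\d{9})\.(\w+)$")


def frame_filename(video_index: int, frame_index: int, ext: str = "jpg") -> str:
    """Build a frame file name such as video_001_frame_000000005.jpg."""
    return f"video_{video_index:03d}_frame_{frame_index:09d}.{ext}"


def parse_frame_filename(filename: str) -> Tuple[str, int]:
    """Return (video folder name, frame index) for a frame file name."""
    match = FRAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a frame file name: {filename!r}")
    return match.group(1), int(match.group(2))


print(frame_filename(1, 5))
print(parse_frame_filename("video_002_frame_000000001.jpg"))
```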

	- data.yaml: Uses YOLO format for class information.


	```yaml
	# data.yaml
	nc: 3
	names:
	  0: car
	  1: truck
	  2: obstacle
	```
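A quick consistency check can catch a class list that drifts from `nc`. The sketch below assumes the file has already been loaded into a dictionary (for example with PyYAML's `yaml.safe_load`, which is an assumption about tooling); `check_class_config` is a hypothetical helper.

```python
from typing import Dict, List


def check_class_config(config: Dict) -> List[str]:
    """Return class names ordered by id, or raise if nc and names disagree."""
    nc = config["nc"]
    names = config["names"]
    # Class ids must be exactly 0 .. nc-1 with no gaps or extras.
    if sorted(names) != list(range(nc)):
        raise ValueError(f"nc={nc} does not match class ids {sorted(names)}")
    return [names[i] for i in range(nc)]


config = {"nc": 3, "names": {0: "car", 1: "truck", 2: "obstacle"}}
print(check_class_config(config))
```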

	> Future Plan: Store raw videos on a NAS for efficient storage.

	### Private Datasets (Labels/Masks)

	Private datasets mirror public dataset structures, organized by task type.

	#### 1. Detection

	- Annotation files (.txt) in YOLO detection format: class_id x_center y_center width height.


	```
	labels_car_detection_2024-rmuc-south-engineer_fpv-live/
	├── video_001/
	│   ├── video_001_frame_000000001.txt
	│   ├── video_001_frame_000000005.txt
	├── video_002/
	├── data.yaml
	```
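In the YOLO detection format, the four box values are normalized to the image width and height. The sketch below converts one annotation line to pixel corner coordinates; `Box` and `parse_detection_line` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_detection_line(line: str, img_w: int, img_h: int) -> Box:
    """Convert 'class_id x_center y_center width height' to pixel corners."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    class_id = int(fields[0])
    xc, yc, w, h = (float(v) for v in fields[1:])
    return Box(
        class_id,
        (xc - w / 2) * img_w,
        (yc - h / 2) * img_h,
        (xc + w / 2) * img_w,
        (yc + h / 2) * img_h,
    )


print(parse_detection_line("0 0.5 0.5 0.2 0.4", img_w=640, img_h=480))
```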

	#### 2. Segmentation

	- Masks can be images (.png) or YOLO polygon format (.txt).

	- Optional labels folder.


	```
	masks_car_segmentation_2024-rmuc-south-engineer_fpv-live/
	├── video_001/
	│   ├── labels/ # Optional
	│   │   ├── video_001_frame_000000002.txt
	│   ├── masks/
	│   │   ├── video_001_frame_000000002.png # Single-channel mask
	│   │   ├── video_001_frame_000000002.txt # YOLO polygon format
	├── video_002/
	├── data.yaml
	```
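A YOLO polygon line holds a class id followed by normalized x, y pairs that trace the object outline. The sketch below scales one such line to pixel coordinates; `parse_polygon_line` is a hypothetical helper.

```python
from typing import List, Tuple


def parse_polygon_line(line: str, img_w: int, img_h: int
                       ) -> Tuple[int, List[Tuple[float, float]]]:
    """Convert 'class_id x1 y1 x2 y2 ...' to (class_id, pixel points)."""
    fields = line.split()
    coords = fields[1:]
    # A polygon needs at least three points, given as complete x, y pairs.
    if len(coords) < 6 or len(coords) % 2 != 0:
        raise ValueError(f"malformed polygon line: {line!r}")
    class_id = int(fields[0])
    values = [float(v) for v in coords]
    points = [(x * img_w, y * img_h)
              for x, y in zip(values[0::2], values[1::2])]
    return class_id, points


print(parse_polygon_line("1 0.0 0.0 1.0 0.0 0.5 1.0", img_w=100, img_h=200))
```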

	### Combined Structure (Images and Labels in Private Dataset)

	For datasets with both images and annotations in a private repository:

	```
	rm_combined_001/
	├── images/
	│   ├── video_001/
	│   │   ├── video_001_frame_000000001.jpg
	│   │   ├── video_001_frame_000000002.jpg
	│   │   ├── video_001_frame_000000005.jpg
	│   ├── video_002/
	├── labels/ # Detection labels
	│   ├── video_001/
	│   │   ├── video_001_frame_000000001.txt # YOLO detection format
	│   │   ├── video_001_frame_000000005.txt
	│   ├── video_002/
	├── masks/ # Segmentation masks
	│   ├── video_001/
	│   │   ├── video_001_frame_000000002.png # Single-channel mask
	│   │   ├── video_001_frame_000000002.txt # YOLO polygon format
	│   ├── video_002/
	├── data.yaml
	```
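Because images and labels share the same video folder and file stem, they can be paired by name. The sketch below matches relative paths from `images/` and `labels/` and reports frames without a detection label, such as frame 2 in the tree above, which only has a mask. The function name is a hypothetical illustration.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple


def pair_images_with_labels(image_paths: Iterable[str],
                            label_paths: Iterable[str]
                            ) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (image, label) pairs and the images that have no label file."""

    def key(path: str) -> Tuple[str, str]:
        p = PurePosixPath(path)
        # Match on (video folder, file stem), ignoring the top-level folder.
        return p.parent.name, p.stem

    labels = {key(p): p for p in label_paths}
    pairs, unlabeled = [], []
    for image in sorted(image_paths):
        label = labels.get(key(image))
        if label is None:
            unlabeled.append(image)
        else:
            pairs.append((image, label))
    return pairs, unlabeled


images = [
    "images/video_001/video_001_frame_000000001.jpg",
    "images/video_001/video_001_frame_000000002.jpg",
    "images/video_001/video_001_frame_000000005.jpg",
]
labels = [
    "labels/video_001/video_001_frame_000000001.txt",
    "labels/video_001/video_001_frame_000000005.txt",
]
pairs, unlabeled = pair_images_with_labels(images, labels)
print(len(pairs), unlabeled)
```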

	---

	## 🛠️ Model Management

	### Model Versioning

	- Use the hf_upload_model script from the RM2025-DatasetUtils repository to upload models.

	- Include version information (revision) during uploads for effective version control.


	### Model Repository Naming

	- Follow the [project]_[task] format.

	- Example: car_detection.