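	# Complete Guide to Uploading a Dataset to HuggingFace

	## Overview

	This guide walks you through the following workflow:
	1. Sample videos from the DROID dataset
	2. Prepare the data in a HuggingFace dataset layout
	3. Upload it to the HuggingFace Hub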

	---

	## Step 1: Prepare the dataset

	### 1.1 Configure parameters

	Edit `prepare_hf_dataset.py` and adjust the following parameters:

	```python
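	# Sampling configuration
	TOTAL_SAMPLES = 2500 # total number of videos to sample (2000-3000 recommended)
	PROCESS_TYPE = "failure" # processing type
	SAMPLING_STRATEGY = "balanced" # sampling strategy

	# Sampling strategies:
	# - "balanced": sample evenly across task categories (recommended)
	# - "random": fully random sampling
	# - "proportional": sample in proportion to the original distribution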
	```
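
The three strategy names can be illustrated with a short sketch. This is not the logic of `prepare_hf_dataset.py`, which is not shown here: the function name `sample_videos`, the `(video_id, task)` tuple layout, and the round-robin reading of "balanced" are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_videos(videos, total, strategy="balanced", seed=0):
    """Pick `total` items from `videos`, a list of (video_id, task) tuples.

    Hypothetical helper: an illustration of the three strategy names,
    not the code used by prepare_hf_dataset.py.
    """
    rng = random.Random(seed)
    total = min(total, len(videos))

    if strategy == "random":
        # Uniform sampling without replacement over all videos.
        return rng.sample(videos, total)

    # Group videos by task category and shuffle within each group.
    by_task = defaultdict(list)
    for video in videos:
        by_task[video[1]].append(video)
    for group in by_task.values():
        rng.shuffle(group)

    if strategy == "proportional":
        # Each category gets a quota proportional to its share of the data;
        # rounding leftovers are filled at random from the unused videos.
        picked, unused = [], []
        for group in by_task.values():
            quota = total * len(group) // len(videos)
            picked.extend(group[:quota])
            unused.extend(group[quota:])
        rng.shuffle(unused)
        return picked + unused[: total - len(picked)]

    if strategy == "balanced":
        # Round-robin: one video per category per pass, so small categories
        # are exhausted first and large ones absorb the remainder.
        picked, depth = [], 0
        while len(picked) < total:
            for group in by_task.values():
                if depth < len(group) and len(picked) < total:
                    picked.append(group[depth])
            depth += 1
        return picked

    raise ValueError(f"unknown strategy: {strategy!r}")
```

Under this reading, "balanced" flattens the category distribution, while "proportional" preserves it.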

	### 1.2 Run the sampling script

	```bash
	cd /home/jqliu/projects/RewardModel/data_sta

	# Run the script
	python prepare_hf_dataset.py
	```

	Example output:
	```
	==========================================
	Preparing HuggingFace dataset
	==========================================

	Step 1: Scanning video files...
	Scanning data sources: 100%|██████████| 13/13
	Found 15157 videos

	Step 2: Sampling videos (strategy: balanced, target: 2500)...
	Task category distribution:
	Move object into or out of container: 2699
	Move object to a new position: 2494
	...
	Sampling complete: 2500 videos

	Task distribution after sampling (top 10):
	Move object into or out of container: 125
	Move object to a new position: 122
	...

	Step 3: Copying files to /playpen-ssd/dataset/droid_raw/hg_data...
	Copying files: 100%|██████████| 2500/2500

	Step 4: Creating README.md...

	==========================================
	Dataset preparation complete!
	==========================================
	Location: /playpen-ssd/dataset/droid_raw/hg_data
	Total samples: 2500
	Total size: 12.34 GB
	```

	### 1.3 Verify the dataset

	```bash
	# Check the file structure
	tree -L 2 /playpen-ssd/dataset/droid_raw/hg_data

	# Expected layout:
	# hg_data/
	# ├── videos/ (2500 .mp4 files)
	# ├── metadata/ (2500 .json files)
	# ├── dataset_info.json
	# └── README.md

	# Check the dataset info
	cat /playpen-ssd/dataset/droid_raw/hg_data/dataset_info.json | jq '.total_samples'
	```
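
The same checks can be scripted. The sketch below is a minimal example that assumes the layout listed above; `check_dataset_dir` is a hypothetical helper, and the assumption that each video shares its base name with its metadata file is not confirmed by the guide.

```python
import json
from pathlib import Path


def check_dataset_dir(root):
    """Return a list of problems found under `root` (an empty list means OK)."""
    root = Path(root)
    problems = []

    videos = {p.stem for p in (root / "videos").glob("*.mp4")}
    metadata = {p.stem for p in (root / "metadata").glob("*.json")}

    # Assumption: video and metadata files share the same base name.
    for stem in sorted(videos - metadata):
        problems.append(f"video without metadata: {stem}")
    for stem in sorted(metadata - videos):
        problems.append(f"metadata without video: {stem}")

    info_path = root / "dataset_info.json"
    if not info_path.is_file():
        problems.append("missing dataset_info.json")
    else:
        info = json.loads(info_path.read_text(encoding="utf-8"))
        expected = info.get("total_samples")
        if expected != len(videos):
            problems.append(f"total_samples={expected}, but found {len(videos)} videos")

    if not (root / "README.md").is_file():
        problems.append("missing README.md")
    return problems
```

Calling `check_dataset_dir("/playpen-ssd/dataset/droid_raw/hg_data")` before uploading gives a quick list of anything that is missing or mismatched.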

	---

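	## Step 2: Configure HuggingFace

	### 2.1 Create a HuggingFace account

	1. Go to https://huggingface.co/join
	2. Sign up (skip this step if you already have an account)
	3. Note your username (for example, `jqliu`)

	### 2.2 Get an access token

	1. Log in to HuggingFace
	2. Go to https://huggingface.co/settings/tokens
	3. Click "New token"
	4. Fill in:
	   - Name: `dataset-upload`
	   - Role: Write (important: the token must have write permission)
	5. Click "Generate"
	6. Copy and save the token (it is shown only once)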

	### 2.3 Set the token

	Method 1: environment variable (recommended)

	```bash
	# Temporary (current terminal session only)
	export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"

	# Permanent (append to ~/.bashrc)
	echo 'export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"' >> ~/.bashrc
	source ~/.bashrc
	```

	Method 2: use huggingface-cli

	```bash
	# Install huggingface_hub
	pip install huggingface_hub

	# Log in
	huggingface-cli login

	# Paste your token when prompted
	```

	### 2.4 Verify the token

	```bash
	python -c "from huggingface_hub import HfApi; api = HfApi(); print(api.whoami())"
	```

	This should print your user information.
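
A script can fail early with a readable message when the token is missing. The following is a minimal sketch under assumptions: `read_hf_token` and `current_username` are hypothetical helpers, and the prefix check relies on user access tokens starting with `hf_`, as in the placeholder above.

```python
import os


def read_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or raise with a readable message."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it or run `huggingface-cli login`")
    if not token.startswith("hf_"):
        raise RuntimeError("HF_TOKEN does not look like an access token (expected an 'hf_' prefix)")
    return token


def current_username(token):
    """Ask the Hub which account the token belongs to (needs network access)."""
    # Imported lazily so read_hf_token() works without huggingface_hub installed.
    from huggingface_hub import HfApi

    return HfApi().whoami(token=token)["name"]
```

`current_username(read_hf_token())` should return the same username that the one-liner above prints as part of the user information.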

	---

	## Step 3: Upload the dataset

	### 3.1 Configure the upload script

	Edit `upload_to_huggingface.py`:

	```python
	# HuggingFace configuration
	HF_USERNAME = "jqliu" # replace with your username
	DATASET_NAME = "droid-failure-sampled" # dataset name
	PRIVATE = False # True = private dataset, False = public dataset
	```

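	### 3.2 Choose an upload method

	Choose based on the dataset size:

	| Dataset size | Recommended method | Notes |
	|--------------|--------------------|-------|
	| < 5 GB | Method 1 or 2 | Simple and fast |
	| 5-20 GB | Method 2 | Supports preview and streaming |
	| > 20 GB | Method 3 | Chunked upload, more stable |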
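
To apply the table without guessing the size, the folder can be measured first. This is a small sketch; `folder_size_gb` and `recommend_method` are hypothetical helper names, and treating exactly 20 GB as part of the middle row is an assumption.

```python
from pathlib import Path


def folder_size_gb(path):
    """Total size of all files under `path`, in GB (10**9 bytes)."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e9


def recommend_method(size_gb):
    """Map a dataset size to the upload method suggested in the table above."""
    if size_gb < 5:
        return "1 or 2"
    if size_gb <= 20:
        return "2"
    return "3"
```

For the 12.34 GB example from step 1.2, `recommend_method(12.34)` returns `"2"`.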

	### 3.3 Run the upload script

	```bash
	cd /home/jqliu/projects/RewardModel/data_sta

	# Install dependencies
	pip install huggingface_hub datasets

	# Run the upload script
	python upload_to_huggingface.py
	```

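	Interactive prompt:
	```
	Choose an upload method:
	1. upload_folder (simple and fast, suited to small datasets)
	2. Datasets library (recommended, supports preview and streaming)
	3. Chunked upload (suited to large datasets > 5GB)

	Enter your choice (1/2/3): 2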
	```

	Upload progress:
	```
	==========================================
	Method 2: upload with the Datasets library
	==========================================

	Creating Dataset object...
	Reading dataset: 100%|██████████| 2500/2500
	Dataset size: 2500 samples

	Uploading to HuggingFace Hub: jqliu/droid-failure-sampled
	Uploading: 100%|██████████| 12.3G/12.3G [15:23<00:00]

	[SUCCESS] Upload complete!

	Dataset URL: https://huggingface.co/datasets/jqliu/droid-failure-sampled
	```
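
For reference, methods 1 and 3 map naturally onto two `huggingface_hub` calls, `upload_folder` and `upload_large_folder`. The sketch below is an assumption about what such a script could do, not the contents of `upload_to_huggingface.py`; `upload_dataset` and its parameters are hypothetical names. The API client is passed in as an argument so the function can be exercised without network access.

```python
def upload_dataset(api, folder, username, dataset_name, private=False, large=False):
    """Create the dataset repo if needed, upload `folder`, and return the repo URL.

    `api` is expected to be a huggingface_hub.HfApi instance.
    """
    repo_id = f"{username}/{dataset_name}"

    # exist_ok=True makes the call a no-op when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="dataset", private=private, exist_ok=True)

    if large:
        # Resumable, multi-worker upload intended for very large folders.
        api.upload_large_folder(repo_id=repo_id, repo_type="dataset", folder_path=folder)
    else:
        # Upload the whole folder in one call.
        api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="dataset")

    return f"https://huggingface.co/datasets/{repo_id}"
```

With the real client this would be called as `upload_dataset(HfApi(), "/playpen-ssd/dataset/droid_raw/hg_data", "jqliu", "droid-failure-sampled")` after `from huggingface_hub import HfApi`.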

	### 3.4 Upload directly from the command line (fallback)

	If the script fails, you can use the command line instead:

	```bash
	# Method A: huggingface-cli
	huggingface-cli upload jqliu/droid-failure-sampled \
	  /playpen-ssd/dataset/droid_raw/hg_data \
	  --repo-type=dataset

	# Method B: Git LFS (suited to large files)
	cd /playpen-ssd/dataset/droid_raw/hg_data
	git init
	git lfs install
	git lfs track "*.mp4"
	git add .
	git commit -m "Initial commit"
	git branch -M main # make sure the local branch is named main before pushing
	git remote add origin https://huggingface.co/datasets/jqliu/droid-failure-sampled
	git push -u origin main
	```

	---

	## Step 4: Verify and use

	### 4.1 Open the dataset page

	In a browser, go to:
	```
	https://huggingface.co/datasets/<your-username>/droid-failure-sampled
	```

	You should see:
	- ✅ README.md rendered automatically
	- ✅ File browser
	- ✅ Dataset statistics

	### 4.2 Test loading the dataset

	```python
	from datasets import load_dataset

	# Load the dataset
	dataset = load_dataset("jqliu/droid-failure-sampled")

	# Inspect the first sample
	example = dataset['train'][0]
	print(f"Video ID: {example['video_id']}")
	print(f"Task: {example['task_description']}")
	print(f"Source: {example['source']}")
	```

	### 4.3 Update the README (optional)

	To update the README:

	```bash
	cd /playpen-ssd/dataset/droid_raw/hg_data

	# Edit README.md
	vim README.md

	# Re-upload with the script
	python upload_to_huggingface.py
	# Or use the CLI
	huggingface-cli upload jqliu/droid-failure-sampled README.md --repo-type=dataset
	```

	---

	## FAQ

	### Q1: What if the upload is very slow?

	Solutions:

	1. Use chunked upload (method 3)
	2. Reduce the number of samples: set `TOTAL_SAMPLES = 1000`
	3. Use a network proxy:
	```bash
	export HTTP_PROXY=http://your-proxy:port
	export HTTPS_PROXY=http://your-proxy:port
	```

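	### Q2: What if the upload is interrupted?

	Solution:

	HuggingFace uploads can be resumed: re-run the upload script, and files that were already uploaded are generally skipped: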
	```bash
	python upload_to_huggingface.py
	```
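
If interruptions are frequent, the upload call can be wrapped in a retry loop. This is a minimal sketch, assuming the upload is available as a Python callable; `retry` and its parameters are hypothetical and not part of `upload_to_huggingface.py`.

```python
import time


def retry(func, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call `func()` until it succeeds, waiting longer after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # network errors surface as various exception types
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default is fine.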

	### Q3: How do I delete an uploaded dataset?

	```python
	# Delete through the API
	from huggingface_hub import delete_repo
	delete_repo(repo_id="jqliu/droid-failure-sampled", repo_type="dataset")

	# Or delete it from the web page:
	# https://huggingface.co/datasets/jqliu/droid-failure-sampled/settings
	```

	### Q4: How do I make the dataset private?

	In `upload_to_huggingface.py`:
	```python
	PRIVATE = True
	```

	Or on the web page:
	```
	Settings → Make private
	```

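	### Q5: The upload fails with "Repository not found"

	Cause: the repository has not been created, or the username is wrong.

	Solutions:
	1. Check that `HF_USERNAME` is correct
	2. Create the repository manually at https://huggingface.co/new-dataset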

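	### Q6: How do I see the upload progress?

	Progress bars are shown automatically through `tqdm`. If none appear, enable INFO-level logging to get more detail:
	```python
	# Add to the script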
	import logging
	logging.basicConfig(level=logging.INFO)
	```

	---

	## Advanced features

	### 1. Create a dataset card

	Edit `README.md` and add YAML metadata:

	```yaml
	---
	dataset_info:
	  features:
	  - name: video_id
	    dtype: string
	  - name: task_description
	    dtype: string
	  - name: source
	    dtype: string
	  - name: success
	    dtype: bool
	  splits:
	  - name: train
	    num_examples: 2500
	  dataset_size: 12.34GB
	task_categories:
	- video-classification
	- robotics
	tags:
	- robotics
	- manipulation
	- failure-detection
	license: mit
	---

	# DROID Failure Dataset

	...
	```

	### 2. Split the dataset (train/val/test)

	```python
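	from datasets import DatasetDict

	# `dataset` is assumed to be an existing datasets.Dataset with 2500 rows.
	# Shuffle first so the splits do not depend on file order.
	dataset = dataset.shuffle(seed=42)

	# Split
	dataset_dict = DatasetDict({
	    'train': dataset.select(range(2000)),
	    'validation': dataset.select(range(2000, 2250)),
	    'test': dataset.select(range(2250, 2500)),
	})

	# Upload
	dataset_dict.push_to_hub("jqliu/droid-failure-sampled")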
	```
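
When task categories are unevenly sized, a per-category split keeps every category represented in each split. The sketch below computes index lists that can be passed to `dataset.select`. It is an illustration under assumptions: `stratified_indices` is a hypothetical helper, and the 80/10/10 fractions mirror the 2000/250/250 split above.

```python
import random
from collections import defaultdict


def stratified_indices(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split row indices into train/validation/test, category by category."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, validation, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to test
    return train, validation, test
```

Assuming the category is stored in a column such as `task_description`, the lists would be used as `train, val, test = stratified_indices(dataset['task_description'])` followed by `dataset.select(train)` and so on.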

	### 3. Add video preprocessing

	```python
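	from datasets import Dataset, Features, Value, Video

	# Define the features (including the video)
	features = Features({
	    'video_id': Value('string'),
	    'video': Video(), # video feature
	    'task_description': Value('string'),
	    # ... (remaining fields)
	})

	# Create the dataset; `data` is a dict mapping each column name to a list of values
	dataset = Dataset.from_dict(data, features=features)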
	```
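
The `data` dictionary passed to `Dataset.from_dict` is not defined in the snippet above. One way to build it from the directory layout in step 1.3 is sketched below. The helper name `build_records` is hypothetical, and the sketch assumes that each `metadata/<name>.json` has a matching `videos/<name>.mp4` and contains a `task_description` key; adjust it to the real metadata schema.

```python
import json
from pathlib import Path


def build_records(root):
    """Collect column lists suitable for Dataset.from_dict from a prepared folder."""
    root = Path(root)
    data = {"video_id": [], "video": [], "task_description": []}

    for meta_path in sorted((root / "metadata").glob("*.json")):
        video_path = root / "videos" / f"{meta_path.stem}.mp4"
        if not video_path.is_file():
            continue  # skip metadata that has no matching video
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        data["video_id"].append(meta_path.stem)
        data["video"].append(str(video_path))  # video column holds the file path
        data["task_description"].append(meta.get("task_description", ""))
    return data
```

Each key in the returned dictionary must have a matching entry in `features`.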

	---

	## Complete workflow example

	```bash
	# 1. Prepare the dataset
	cd /home/jqliu/projects/RewardModel/data_sta
	python prepare_hf_dataset.py

	# 2. Set the HuggingFace token
	export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"

	# 3. Upload the dataset
	python upload_to_huggingface.py
	# Choose method 2

	# 4. Verify
	python -c "from datasets import load_dataset; ds = load_dataset('jqliu/droid-failure-sampled'); print(len(ds['train']))"
	```

	---

	## References

	- HuggingFace Hub documentation: https://huggingface.co/docs/hub
	- Datasets library documentation: https://huggingface.co/docs/datasets
	- Git LFS documentation: https://git-lfs.github.com/
	- DROID dataset: https://droid-dataset.github.io/

	---

	## Support

	If you run into problems:
	1. Check the HuggingFace community forum: https://discuss.huggingface.co/
	2. Open an issue: https://github.com/huggingface/datasets/issues
	3. Contact the dataset maintainer

	---

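	## Appendix: script quick reference

	### prepare_hf_dataset.py
	Samples videos and prepares the dataset.

	Key parameters:
	- `TOTAL_SAMPLES`: number of videos to sample
	- `SAMPLING_STRATEGY`: sampling strategy
	- `PROCESS_TYPE`: data type

	### upload_to_huggingface.py
	Uploads the dataset to HuggingFace.

	Key parameters:
	- `HF_USERNAME`: username
	- `DATASET_NAME`: dataset name
	- `PRIVATE`: whether the dataset is private

	Upload methods:
	1. upload_folder (simple)
	2. Datasets library (recommended)
	3. Chunked upload (large files)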