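Complete Guide to Uploading a Dataset to HuggingFace

Overview

This guide walks you through the following workflow:

1. Sample videos from the DROID dataset
2. Prepare the data in HuggingFace dataset format
3. Upload it to the HuggingFace Hub

Step 1: Prepare the dataset

1.1 Configure parameters

Edit prepare_hf_dataset.py and adjust the following parameters: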

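# Sampling configuration
TOTAL_SAMPLES = 2500            # total number of videos to sample (2000-3000 recommended)
PROCESS_TYPE = "failure"        # processing type
SAMPLING_STRATEGY = "balanced"  # sampling strategy

# Sampling strategies:
# - "balanced": sample evenly across task categories (recommended)
# - "random": sample uniformly at random
# - "proportional": sample in proportion to the original category sizes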
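The three strategies can be sketched as follows. This is a minimal illustration and not the code in prepare_hf_dataset.py, which is not shown in this guide; the function name `sample_videos` and the `(video_id, category)` input shape are assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_videos(items, total, strategy="balanced", seed=0):
    """Pick `total` items from a list of unique (video_id, category) pairs.

    Hypothetical helper: the real script may implement sampling differently.
    """
    rng = random.Random(seed)
    total = min(total, len(items))
    if strategy == "random":
        return rng.sample(items, total)

    # Group videos by category and shuffle inside each group.
    by_cat = defaultdict(list)
    for item in items:
        by_cat[item[1]].append(item)
    for group in by_cat.values():
        rng.shuffle(group)

    if strategy == "balanced":
        # Round-robin over categories so each contributes equally
        # until it runs out of videos.
        picked = []
        groups = list(by_cat.values())
        while len(picked) < total:
            for group in groups:
                if group and len(picked) < total:
                    picked.append(group.pop())
        return picked

    if strategy == "proportional":
        # Each category gets a share matching its size in the source data.
        picked = []
        for group in by_cat.values():
            quota = int(total * len(group) / len(items))
            picked.extend(group[:quota])
        # Rounding down can leave a shortfall; top it up at random.
        chosen = set(picked)
        rest = [item for item in items if item not in chosen]
        picked.extend(rng.sample(rest, total - len(picked)))
        return picked

    raise ValueError(f"unknown strategy: {strategy}")
```

With "balanced", small categories are exhausted first and larger ones fill the remainder, so the per-category counts are equal only while every category still has videos left.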

1.2 Run the sampling script

cd /home/jqliu/projects/RewardModel/data_sta

# Run the script
python prepare_hf_dataset.py

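Sample output (the script prints its messages in Chinese; they are translated here):

```
Preparing HuggingFace dataset

Step 1: Scanning video files...
Scanning data sources: 100%|██████████| 13/13
Found 15157 videos

Step 2: Sampling videos (strategy: balanced, target count: 2500)...
Task category distribution:
  Move object into or out of container: 2699
  Move object to a new position: 2494
  ...
Sampling complete: 2500 videos

Task distribution after sampling (top 10):
  Move object into or out of container: 125
  Move object to a new position: 122
  ...

Step 3: Copying files to /playpen-ssd/dataset/droid_raw/hg_data...
Copying files: 100%|██████████| 2500/2500

Step 4: Creating README.md...

==========================================
Dataset preparation complete!

Location: /playpen-ssd/dataset/droid_raw/hg_data
Total samples: 2500
Total size: 12.34 GB
```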

1.3 Verify the dataset

```bash
# Check the file layout (-L 1 lists only the top level, not all 5000 files)
tree -L 1 /playpen-ssd/dataset/droid_raw/hg_data

# Expected:
# hg_data/
# ├── videos/           (2500 .mp4 files)
# ├── metadata/         (2500 .json files)
# ├── dataset_info.json
# └── README.md

# Check the dataset info
jq '.total_samples' /playpen-ssd/dataset/droid_raw/hg_data/dataset_info.json
```
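The same check can be scripted. The sketch below assumes that each video and its metadata file share a file stem (for example `videos/abc.mp4` and `metadata/abc.json`); that naming rule is an assumption, not something this guide states, and `check_dataset_dir` is a hypothetical helper name.

```python
from pathlib import Path


def check_dataset_dir(root):
    """Return a list of problems found under `root` (empty list means OK)."""
    root = Path(root)
    problems = []

    # Top-level files the layout above expects.
    for name in ("dataset_info.json", "README.md"):
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")

    def stems(subdir, pattern):
        folder = root / subdir
        if not folder.is_dir():
            problems.append(f"missing directory: {subdir}")
            return set()
        return {p.stem for p in folder.glob(pattern)}

    videos = stems("videos", "*.mp4")
    metadata = stems("metadata", "*.json")

    # Assumed pairing rule: same stem in videos/ and metadata/.
    for stem in sorted(videos - metadata):
        problems.append(f"video without metadata: {stem}")
    for stem in sorted(metadata - videos):
        problems.append(f"metadata without video: {stem}")
    return problems
```

Calling `check_dataset_dir("/playpen-ssd/dataset/droid_raw/hg_data")` before uploading gives a quick list of anything that does not match the expected layout.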

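Step 2: Configure HuggingFace

2.1 Create a HuggingFace account

1. Go to https://huggingface.co/join
2. Sign up (skip this step if you already have an account)
3. Note your username (for example: jqliu)

2.2 Get an access token

1. Log in to HuggingFace
2. Go to https://huggingface.co/settings/tokens
3. Click "New token"
4. Fill in:
   - Name: dataset-upload
   - Role: Write (important: the token must have write permission)
5. Click "Generate"
6. Copy and save the token (it is shown only once)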

2.3 Set the token

Method 1: Environment variable (recommended)

# Temporary (current terminal only)
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"

# Permanent (append to ~/.bashrc)
echo 'export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"' >> ~/.bashrc
source ~/.bashrc
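A script that relies on the environment variable can fail early with a clear message when it is missing. The helper below is a small sketch; `read_hf_token` is a hypothetical name, and the placeholder check only guards against pasting the example value from this guide unchanged.

```python
import os


def read_hf_token(env=None):
    """Return the token from HF_TOKEN, or raise with a hint on how to fix it."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it or run `huggingface-cli login`"
        )
    # Catch the placeholder from this guide being pasted unchanged.
    if token.startswith("hf_") and set(token[3:]) == {"x"}:
        raise RuntimeError("HF_TOKEN still holds the placeholder value")
    return token
```

Passing a dict as `env` makes the function easy to exercise without touching the real environment.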

Method 2: Use huggingface-cli

# Install huggingface_hub
pip install huggingface_hub

# Log in
huggingface-cli login

# Enter your token when prompted

2.4 Verify the token

python -c "from huggingface_hub import HfApi; api = HfApi(); print(api.whoami())"

This should print your user information.
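The raw dictionary printed by `whoami()` can be long. The sketch below condenses it to one line; the key names `name` and `orgs` are what the call has been observed to return and are treated here as an assumption, so the code reads them defensively. `summarize_identity` and `check_login` are hypothetical helper names.

```python
def summarize_identity(info):
    """Format the dict returned by HfApi().whoami() as a one-line summary."""
    name = info.get("name", "<unknown>")
    # Assumed shape: "orgs" is a list of dicts that each carry a "name".
    orgs = [str(org.get("name")) for org in info.get("orgs", [])]
    return f"user={name} orgs={','.join(orgs) or '-'}"


def check_login():
    """Query the Hub with the current credentials and summarize the result."""
    # Imported lazily so summarize_identity works without the library.
    from huggingface_hub import HfApi

    return summarize_identity(HfApi().whoami())
```

If `check_login()` raises instead of returning a summary, the token is missing or invalid and section 2.3 should be repeated.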

Step 3: Upload the dataset

3.1 Configure the upload script

Edit upload_to_huggingface.py:

# HuggingFace configuration
HF_USERNAME = "jqliu"  # replace with your username
DATASET_NAME = "droid-failure-sampled"  # dataset name
PRIVATE = False  # True = private dataset, False = public dataset
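These three settings combine into a repository id of the form `username/dataset-name`. The sketch below shows one way a script might build and validate that id and create the repository up front; it is not taken from upload_to_huggingface.py. `build_repo_id` and `ensure_repo` are hypothetical helpers, and the name pattern is a simplification of the Hub's real naming rules.

```python
import re

HF_USERNAME = "jqliu"
DATASET_NAME = "droid-failure-sampled"
PRIVATE = False


def build_repo_id(username, dataset_name):
    """Join and sanity-check '<username>/<dataset_name>'."""
    # Simplified rule (assumption): letters, digits, '.', '_' and '-',
    # starting with a letter or digit.
    pattern = r"[A-Za-z0-9][A-Za-z0-9._-]*"
    for part in (username, dataset_name):
        if not re.fullmatch(pattern, part):
            raise ValueError(f"invalid repo name component: {part!r}")
    return f"{username}/{dataset_name}"


def ensure_repo(repo_id, private):
    """Create the dataset repository if it does not exist yet."""
    from huggingface_hub import create_repo  # lazy import

    # exist_ok=True makes the call safe to repeat on later uploads.
    return create_repo(repo_id, repo_type="dataset", private=private, exist_ok=True)


REPO_ID = build_repo_id(HF_USERNAME, DATASET_NAME)
```

Creating the repository before uploading also avoids the "Repository not found" error that appears when the target does not exist.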

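3.2 Choose an upload method

Choose according to dataset size:

| Dataset size | Recommended method | Notes |
| --- | --- | --- |
| < 5 GB | Method 1 or 2 | Simple and fast |
| 5-20 GB | Method 2 | Supports preview and streaming |
| > 20 GB | Method 3 | Chunked upload, more stable |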

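3.3 Run the upload script

cd /home/jqliu/projects/RewardModel/data_sta

# Install dependencies
pip install huggingface_hub datasets

# Run the upload script
python upload_to_huggingface.py

Interactive prompt:

Choose an upload method:
  1. upload_folder (simple and fast, suited to small datasets)
  2. Datasets library (recommended, supports preview and streaming)
  3. Chunked upload (suited to large datasets > 5GB)

Enter your choice (1/2/3): 2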
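If you are unsure which number to enter, the size thresholds from the table in section 3.2 can be applied to the prepared folder. This is a sketch under stated assumptions: GB is taken as 1024^3 bytes, a folder of exactly 20 GB is assigned to method 2, and `folder_size_bytes` and `recommend_methods` are hypothetical helper names.

```python
import os

GB = 1024 ** 3  # assumption: binary gigabytes


def folder_size_bytes(root):
    """Sum the sizes of all regular files below `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def recommend_methods(size_bytes):
    """Map a dataset size to the method numbers suggested in section 3.2."""
    if size_bytes < 5 * GB:
        return (1, 2)
    if size_bytes <= 20 * GB:
        return (2,)
    return (3,)
```

For the 12.34 GB sample from step 1, `recommend_methods` returns `(2,)`, which matches the choice entered at the prompt.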

Upload progress (messages translated from the script's Chinese output):

```
Method 2: Upload with the Datasets library

Creating Dataset object...
Reading dataset: 100%|██████████| 2500/2500
Dataset size: 2500 samples

Uploading to HuggingFace Hub: jqliu/droid-failure-sampled
Uploading: 100%|██████████| 12.3G/12.3G [15:23<00:00]

[SUCCESS] Upload complete!

Dataset URL: https://huggingface.co/datasets/jqliu/droid-failure-sampled
```
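For orientation, method 2 could be implemented roughly as below. The real upload_to_huggingface.py is not shown in this guide, so everything here is an assumption: that each metadata file is a flat JSON object with the same keys, that videos and metadata share a file stem, and the `video_path` column name. Note that this sketch pushes a metadata table only; it does not embed the video bytes.

```python
import json
from pathlib import Path


def load_records(root):
    """Collect one dict per metadata JSON file, plus the matching video path."""
    root = Path(root)
    records = []
    for meta_path in sorted((root / "metadata").glob("*.json")):
        with open(meta_path, encoding="utf-8") as f:
            record = json.load(f)
        # Hypothetical column pointing at the video with the same stem.
        record["video_path"] = str(root / "videos" / f"{meta_path.stem}.mp4")
        records.append(record)
    return records


def push_records(records, repo_id, private=False):
    """Build a Dataset from the records and push it to the Hub."""
    from datasets import Dataset  # lazy import; needs `pip install datasets`

    dataset = Dataset.from_list(records)
    dataset.push_to_hub(repo_id, private=private)
    return dataset
```

Pushing through the Datasets library stores the table in a format the Hub can preview and stream, which is the benefit the method table attributes to method 2.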


3.4 Upload directly from the command line (fallback)

If the script fails, you can upload from the command line:

```bash
# Option A: huggingface-cli
huggingface-cli upload jqliu/droid-failure-sampled \
  /playpen-ssd/dataset/droid_raw/hg_data \
  --repo-type=dataset

# Option B: Git LFS (suited to large files)
cd /playpen-ssd/dataset/droid_raw/hg_data
git init
git lfs install
git lfs track "*.mp4"
git add .
git commit -m "Initial commit"
git branch -M main  # make sure the local branch is named main before pushing
git remote add origin https://huggingface.co/datasets/jqliu/droid-failure-sampled
git push -u origin main
```
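The same fallback is available from Python through `HfApi.upload_folder`. The sketch below adds a dry-run helper so you can see which files would be sent; `files_to_upload` and `upload` are hypothetical names, and the allow patterns are an assumption based on the folder layout from section 1.3.

```python
from fnmatch import fnmatch
from pathlib import Path

# Assumed layout: videos/, metadata/, dataset_info.json, README.md
ALLOW_PATTERNS = ("videos/*.mp4", "metadata/*.json", "dataset_info.json", "README.md")


def files_to_upload(root, patterns=ALLOW_PATTERNS):
    """Dry run: list relative paths under `root` that match the patterns."""
    root = Path(root)
    rel = (p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
    return sorted(r for r in rel if any(fnmatch(r, pat) for pat in patterns))


def upload(root, repo_id):
    """Upload the matching files to a dataset repository in one commit."""
    from huggingface_hub import HfApi  # lazy import

    HfApi().upload_folder(
        folder_path=str(root),
        repo_id=repo_id,
        repo_type="dataset",
        allow_patterns=list(ALLOW_PATTERNS),
    )
```

Recent huggingface_hub releases also provide `HfApi.upload_large_folder` for very large trees; check that your installed version includes it before relying on it.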

Step 4: Verify and use

4.1 Visit the dataset page

Open in a browser:

https://huggingface.co/datasets/YOUR_USERNAME/droid-failure-sampled

You should see:

✅ README.md rendered automatically
✅ File browser
✅ Dataset statistics

4.2 Test loading the dataset

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("jqliu/droid-failure-sampled")

# Inspect the first sample
example = dataset['train'][0]
print(f"Video ID: {example['video_id']}")
print(f"Task: {example['task_description']}")
print(f"Source: {example['source']}")
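For a 12 GB dataset it can be quicker to stream a single example than to download everything. The sketch below does that and checks that the expected columns are present; the field list mirrors the three keys printed above, and `missing_fields` and `first_streamed_example` are hypothetical helper names.

```python
REQUIRED_FIELDS = ("video_id", "task_description", "source")


def missing_fields(example, required=REQUIRED_FIELDS):
    """Return the required field names that are absent from one example."""
    return [name for name in required if name not in example]


def first_streamed_example(repo_id):
    """Fetch only the first training example without a full download."""
    from datasets import load_dataset  # lazy import

    stream = load_dataset(repo_id, split="train", streaming=True)
    return next(iter(stream))
```

A typical check is `missing_fields(first_streamed_example("jqliu/droid-failure-sampled"))`, which returns an empty list when every expected column exists.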

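4.3 Update the README (optional)

To update the README:

cd /playpen-ssd/dataset/droid_raw/hg_data

# Edit README.md
vim README.md

# Re-upload with the script
python upload_to_huggingface.py
# Or use the CLI (--repo-type=dataset is needed; the default repo type is model)
huggingface-cli upload jqliu/droid-failure-sampled README.md --repo-type=dataset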

FAQ

Q1: What if the upload is very slow?

Solutions:

- Use chunked upload (method 3)
- Reduce the number of samples: set TOTAL_SAMPLES = 1000
- Use a network proxy:

export HTTP_PROXY=http://your-proxy:port
export HTTPS_PROXY=http://your-proxy:port

Q2: What if the upload is interrupted?

Solution:

HuggingFace supports resumable uploads, so re-running the upload script is enough:

python upload_to_huggingface.py
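On an unreliable connection, the re-run can be automated with a retry loop and exponential backoff. This is a generic sketch, not part of upload_to_huggingface.py; `retry` is a hypothetical helper, and the attempt count and delays are arbitrary defaults.

```python
import time


def retry(func, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call `func` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

Wrap whichever upload call you use, for example `retry(lambda: api.upload_folder(...))`, where `api` is an `HfApi` instance you have already created.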

Q3: How do I delete an uploaded dataset?

# Delete through the API (this cannot be undone)
from huggingface_hub import delete_repo
delete_repo(repo_id="jqliu/droid-failure-sampled", repo_type="dataset")

# Or delete it on the web
# https://huggingface.co/datasets/jqliu/droid-failure-sampled/settings

Q4: How do I make the dataset private?

In upload_to_huggingface.py:

PRIVATE = True

Or on the web:

Settings → Make private

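Q5: The upload fails with "Repository not found"

Cause: the repository has not been created, or the username is wrong.

Solutions:

- Check that HF_USERNAME is correct
- Create the repository manually at https://huggingface.co/new-dataset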

Q6: How do I see upload progress?

tqdm shows a progress bar automatically. If you do not see one:

# Add to the script
import logging
logging.basicConfig(level=logging.INFO)

Advanced features

1. Create a dataset card

Edit README.md and add YAML metadata:

---
dataset_info:
  features:
  - name: video_id
    dtype: string
  - name: task_description
    dtype: string
  - name: source
    dtype: string
  - name: success
    dtype: bool
  splits:
  - name: train
    num_examples: 2500
  dataset_size: 12340000000  # integer number of bytes (about 12.34 GB)
task_categories:
- video-classification
- robotics
tags:
- robotics
- manipulation
- failure-detection
license: mit
---

# DROID Failure Dataset

...

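2. Split the dataset (train/val/test)

from datasets import DatasetDict

# `dataset` is the single-split Dataset with 2500 samples built earlier.
# Shuffle first so the splits do not depend on file order.
dataset = dataset.shuffle(seed=42)

# Split
dataset_dict = DatasetDict({
    'train': dataset.select(range(2000)),
    'validation': dataset.select(range(2000, 2250)),
    'test': dataset.select(range(2250, 2500))
})

# Upload
dataset_dict.push_to_hub("jqliu/droid-failure-sampled")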
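The hard-coded index ranges only fit a dataset of exactly 2500 samples. A sketch of a size-independent variant is shown below; `split_bounds` and `make_splits` are hypothetical helpers, and the 80/10/10 fractions are chosen to reproduce the 2000/250/250 split above.

```python
def split_bounds(n, fractions=(0.8, 0.1, 0.1)):
    """Return (start, stop) index pairs that cover range(n) without overlap."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    bounds, start, cumulative = [], 0, 0.0
    for i, frac in enumerate(fractions):
        cumulative += frac
        # Pin the final stop to n so rounding never drops a sample.
        stop = n if i == len(fractions) - 1 else round(n * cumulative)
        bounds.append((start, stop))
        start = stop
    return bounds


def make_splits(dataset, names=("train", "validation", "test"), seed=42):
    """Shuffle a datasets.Dataset and cut it into named splits."""
    from datasets import DatasetDict  # lazy import

    shuffled = dataset.shuffle(seed=seed)
    bounds = split_bounds(len(shuffled))
    return DatasetDict(
        {name: shuffled.select(range(a, b)) for name, (a, b) in zip(names, bounds)}
    )
```

Shuffling with a fixed seed keeps the splits reproducible while avoiding any ordering that the sampling script may have left in the data.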

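3. Add video preprocessing

# The Video feature requires a recent release of the datasets library.
from datasets import Dataset, Features, Value, Video

# Define the features (including the video)
features = Features({
    'video_id': Value('string'),
    'video': Video(),  # video feature
    'task_description': Value('string'),
    # ... remaining fields
})

# Create the dataset; `data` maps each column name to a list of values
dataset = Dataset.from_dict(data, features=features)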

Complete workflow example

# 1. Prepare the dataset
cd /home/jqliu/projects/RewardModel/data_sta
python prepare_hf_dataset.py

# 2. Set the HuggingFace token
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"

# 3. Upload the dataset
python upload_to_huggingface.py
# Choose method 2

# 4. Verify
python -c "from datasets import load_dataset; ds = load_dataset('jqliu/droid-failure-sampled'); print(len(ds['train']))"

References

- HuggingFace Hub documentation: https://huggingface.co/docs/hub
- Datasets library documentation: https://huggingface.co/docs/datasets
- Git LFS documentation: https://git-lfs.github.com/
- DROID dataset: https://droid-dataset.github.io/

Getting help

If you run into problems:

- Check the HuggingFace community forum: https://discuss.huggingface.co/
- Open an issue: https://github.com/huggingface/datasets/issues
- Contact the dataset maintainer

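Appendix: Script quick reference

prepare_hf_dataset.py

Samples and prepares the dataset.

Key parameters:

- TOTAL_SAMPLES: number of samples
- SAMPLING_STRATEGY: sampling strategy
- PROCESS_TYPE: data type

upload_to_huggingface.py

Uploads the dataset to HuggingFace.

Key parameters:

- HF_USERNAME: username
- DATASET_NAME: dataset name
- PRIVATE: whether the dataset is private

Upload methods:

1. upload_folder (simple)
2. Datasets library (recommended)
3. Chunked upload (large files)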