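# Complete Guide to Uploading a Dataset to HuggingFace

## Overview

This guide walks through the following workflow:
1. Sample videos from the DROID dataset
2. Prepare the data in HuggingFace dataset format
3. Upload it to the HuggingFace Hub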

---

## Step 1: Prepare the dataset

### 1.1 Configure the parameters

Edit `prepare_hf_dataset.py` and adjust the following parameters:

```python
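# Sampling configuration
TOTAL_SAMPLES = 2500        # total number of videos to sample (2000-3000 recommended)
PROCESS_TYPE = "failure"    # processing type
SAMPLING_STRATEGY = "balanced"  # sampling strategy

# Sampling strategies:
# - "balanced": sample evenly across task categories (recommended)
# - "random": sample uniformly at random
# - "proportional": sample in proportion to the original distribution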
```
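
The three strategies can be illustrated with a short sketch. This is not the code in `prepare_hf_dataset.py`; the helper name `sample_videos` and the `(path, task)` input layout are assumptions made for the example, and the real script may balance categories differently.

```python
import random
from collections import defaultdict


def sample_videos(videos, total, strategy="balanced", seed=0):
    """Pick up to `total` items from `videos`, a list of (path, task) pairs.

    Hypothetical helper for illustration only.
    """
    if not videos:
        return []
    rng = random.Random(seed)
    if strategy == "random":
        return rng.sample(videos, min(total, len(videos)))

    # Group the videos by task and shuffle inside each group.
    by_task = defaultdict(list)
    for item in videos:
        by_task[item[1]].append(item)
    for items in by_task.values():
        rng.shuffle(items)

    if strategy == "proportional":
        # Each task keeps its share of the original distribution, rounded
        # down, so the result can be slightly smaller than `total`.
        picked = []
        for items in by_task.values():
            picked.extend(items[: total * len(items) // len(videos)])
        return picked

    if strategy == "balanced":
        # Round-robin over tasks so every task contributes as evenly as possible.
        picked = []
        queues = list(by_task.values())
        while len(picked) < total and any(queues):
            for queue in queues:
                if queue and len(picked) < total:
                    picked.append(queue.pop())
        return picked

    raise ValueError(f"unknown strategy: {strategy}")
```

With a 90/10 split between two tasks and `total=20`, the balanced strategy in this sketch returns 10 videos per task, while the proportional one returns 18 and 2.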

### 1.2 Run the sampling script

```bash
cd /home/jqliu/projects/RewardModel/data_sta

# Run the script
python prepare_hf_dataset.py
```

**Sample output**:
```
==========================================
Preparing HuggingFace dataset
==========================================

Step 1: Scanning video files...
Scanning data sources: 100%|██████████| 13/13
Found 15157 videos

Step 2: Sampling videos (strategy: balanced, target count: 2500)...
Task category distribution:
  Move object into or out of container: 2699
  Move object to a new position: 2494
  ...
Sampling complete: 2500 videos

Task distribution after sampling (top 10):
  Move object into or out of container: 125
  Move object to a new position: 122
  ...

Step 3: Copying files to /playpen-ssd/dataset/droid_raw/hg_data...
Copying files: 100%|██████████| 2500/2500

Step 4: Creating README.md...

==========================================
Dataset preparation complete!
==========================================
Location: /playpen-ssd/dataset/droid_raw/hg_data
Total samples: 2500
Total size: 12.34 GB
```

### 1.3 Verify the dataset

```bash
# Check the file structure (top level only, so the 2500 files are not listed)
tree -L 1 /playpen-ssd/dataset/droid_raw/hg_data

# Expected layout:
# hg_data/
# ├── videos/           (2500 .mp4 files)
# ├── metadata/         (2500 .json files)
# ├── dataset_info.json
# └── README.md

# Check the dataset info
jq '.total_samples' /playpen-ssd/dataset/droid_raw/hg_data/dataset_info.json
```
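
The same check can be scripted. The sketch below assumes that each video in `videos/` has a metadata file with the same file stem in `metadata/`; that naming convention and the helper name `check_dataset` are assumptions, not something the scripts are known to guarantee.

```python
import json
from pathlib import Path


def check_dataset(root):
    """Compare video and metadata file stems under `root` and read dataset_info.json."""
    root = Path(root)
    videos = {p.stem for p in (root / "videos").glob("*.mp4")}
    metadata = {p.stem for p in (root / "metadata").glob("*.json")}
    info = json.loads((root / "dataset_info.json").read_text(encoding="utf-8"))
    return {
        "num_videos": len(videos),
        "num_metadata": len(metadata),
        # Stems present on one side only point to an incomplete copy.
        "missing_metadata": sorted(videos - metadata),
        "missing_videos": sorted(metadata - videos),
        "total_samples": info.get("total_samples"),
    }
```

Calling `check_dataset("/playpen-ssd/dataset/droid_raw/hg_data")` returns a small report; for a complete copy both `missing_*` lists are empty and the three counts agree.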

---

## Step 2: Configure HuggingFace

### 2.1 Create a HuggingFace account

1. Go to https://huggingface.co/join
2. Sign up (skip this step if you already have an account)
3. Note your username (for example, `jqliu`)

### 2.2 Get an access token

1. Log in to HuggingFace
2. Go to https://huggingface.co/settings/tokens
3. Click **"New token"**
4. Fill in:
   - Name: `dataset-upload`
   - Role: **Write** (important: the token must have write permission)
5. Click **"Generate"**
6. **Copy and save** the token (it is shown only once)

### 2.3 Set the token

**Option 1: Environment variable (recommended)**

```bash
# Temporary (current terminal session only)
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"

# Permanent (append to ~/.bashrc)
echo 'export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"' >> ~/.bashrc
source ~/.bashrc
```

**Option 2: Use huggingface-cli**

```bash
# Install huggingface_hub
pip install huggingface_hub

# Log in
huggingface-cli login

# Enter your token when prompted
```

### 2.4 Verify the token

```bash
python -c "from huggingface_hub import HfApi; api = HfApi(); print(api.whoami())"
```

This should print your account information.
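
Inside a script, it can help to fail early when the token is missing. The following is a minimal sketch: `resolve_token` and `describe_account` are hypothetical helper names, while `HfApi(token=...)` and `whoami()` are the same `huggingface_hub` calls used in the one-liner above.

```python
import os


def resolve_token(env=None):
    """Return the token from HF_TOKEN, or raise with a hint if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it or run `huggingface-cli login`"
        )
    return token


def describe_account(token):
    """Ask the Hub which account owns `token` (needs network access)."""
    # Imported lazily so resolve_token() works without huggingface_hub installed.
    from huggingface_hub import HfApi

    return HfApi(token=token).whoami()["name"]
```

A script would call `describe_account(resolve_token())` once at startup and compare the result with the configured username.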

---

## Step 3: Upload the dataset

### 3.1 Configure the upload script

Edit `upload_to_huggingface.py`:

```python
# HuggingFace configuration
HF_USERNAME = "jqliu"  # replace with your username
DATASET_NAME = "droid-failure-sampled"  # dataset name
PRIVATE = False  # True = private dataset, False = public dataset
```
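
To show how these three settings fit together, here is a rough sketch of a folder upload with `huggingface_hub`. It is not the content of `upload_to_huggingface.py`; the helper names `build_repo_id` and `upload_dataset_folder` are assumptions, while `HfApi.create_repo` and `HfApi.upload_folder` are existing library calls.

```python
def build_repo_id(username, dataset_name):
    """Combine the username and dataset name into a Hub repo id."""
    if not username or not dataset_name:
        raise ValueError("both username and dataset name are required")
    return f"{username}/{dataset_name}"


def upload_dataset_folder(folder, username, dataset_name, private=False, token=None):
    """Create the dataset repo if needed and upload `folder` to it."""
    # Imported lazily so build_repo_id() can be used without the library installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    repo_id = build_repo_id(username, dataset_name)
    # exist_ok=True makes the call a no-op when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="dataset", private=private, exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="dataset")
    return f"https://huggingface.co/datasets/{repo_id}"
```

Creating the repository before uploading, as this sketch does, also avoids the "Repository not found" error for a repo that does not exist yet.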

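### 3.2 Choose an upload method

Choose based on the dataset size:

| Dataset size | Recommended method | Notes |
|-----------|---------|------|
| < 5 GB | Method 1 or 2 | Simple and fast |
| 5-20 GB | Method 2 | Supports preview and streaming |
| > 20 GB | Method 3 | Chunked upload, more stable |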
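
The table can be applied automatically by measuring the prepared folder. This is a small sketch under two assumptions: 1 GB is taken as 1024³ bytes, and the helper names `folder_size_bytes` and `recommend_methods` are made up for the example.

```python
from pathlib import Path

GB = 1024 ** 3  # assumption: binary gigabytes


def folder_size_bytes(folder):
    """Total size of all regular files below `folder`."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


def recommend_methods(size_bytes):
    """Return the method numbers the table above recommends for this size."""
    if size_bytes < 5 * GB:
        return (1, 2)
    if size_bytes <= 20 * GB:
        return (2,)
    return (3,)
```

For the 12.34 GB sample output shown in step 1.2, `recommend_methods` returns `(2,)`.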

### 3.3 Run the upload script

```bash
cd /home/jqliu/projects/RewardModel/data_sta

# Install dependencies
pip install huggingface_hub datasets

# Run the upload script
python upload_to_huggingface.py
```

**Interactive prompt**:
```
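Choose an upload method:
  1. upload_folder (simple and fast, suited to small datasets)
  2. Datasets library (recommended, supports preview and streaming)
  3. Chunked upload (suited to large datasets > 5GB)

Enter your choice (1/2/3): 2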
```

**Upload progress**:
```
==========================================
Method 2: Upload with the Datasets library
==========================================

Creating Dataset object...
Reading dataset: 100%|██████████| 2500/2500
Dataset size: 2500 samples

Uploading to HuggingFace Hub: jqliu/droid-failure-sampled
Uploading: 100%|██████████| 12.3G/12.3G [15:23<00:00]

[SUCCESS] Upload complete!

Dataset link: https://huggingface.co/datasets/jqliu/droid-failure-sampled
```

### 3.4 Upload directly from the command line (fallback)

If the script fails, upload from the command line instead:

```bash
# Option A: huggingface-cli
huggingface-cli upload jqliu/droid-failure-sampled \
  /playpen-ssd/dataset/droid_raw/hg_data \
  --repo-type=dataset

# Option B: Git LFS (suited to large files)
# The repo must already exist on the Hub. Clone it first, because a Hub repo
# starts with an initial commit and rejects a push from an unrelated history.
git lfs install
git clone https://huggingface.co/datasets/jqliu/droid-failure-sampled
cd droid-failure-sampled
cp -r /playpen-ssd/dataset/droid_raw/hg_data/. .
git lfs track "*.mp4"
git add .
git commit -m "Initial commit"
git push
```

---

## Step 4: Verify and use

### 4.1 Open the dataset page

Open in a browser:
```
https://huggingface.co/datasets/YOUR_USERNAME/droid-failure-sampled
```

You should see:
- ✅ README.md rendered automatically
- ✅ The file browser
- ✅ Dataset statistics

### 4.2 Test loading the dataset

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("jqliu/droid-failure-sampled")

# Inspect the first sample
example = dataset['train'][0]
print(f"Video ID: {example['video_id']}")
print(f"Task: {example['task_description']}")
print(f"Source: {example['source']}")
```

### 4.3 Update the README (optional)

To update the README:

```bash
cd /playpen-ssd/dataset/droid_raw/hg_data

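# Edit README.md
vim README.md

# Re-upload with the script
python upload_to_huggingface.py
# Or with the CLI (--repo-type is required; the default repo type is model)
huggingface-cli upload jqliu/droid-failure-sampled README.md --repo-type=dataset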
```

---

## FAQ

### Q1: What if the upload is very slow?

**Solutions**:

1. **Use chunked upload** (method 3)
2. **Reduce the number of samples**: set `TOTAL_SAMPLES = 1000`
3. **Use a network proxy**:
   ```bash
   export HTTP_PROXY=http://your-proxy:port
   export HTTPS_PROXY=http://your-proxy:port
   ```

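### Q2: What if the upload is interrupted?

**Solution**:

HuggingFace supports resuming interrupted uploads, so re-running the upload script should pick up where it left off: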
```bash
python upload_to_huggingface.py
```
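
If interruptions are frequent, the re-run can be automated with retries and exponential backoff. This is a generic sketch, not a feature of the upload script; `retry` is a hypothetical helper, and `action` stands for whatever upload call the script makes.

```python
import time


def retry(action, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call `action()` until it succeeds, waiting base_delay * 2**n between tries."""
    for n in range(attempts):
        try:
            return action()
        except Exception as exc:  # in real code, catch only network-related errors
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** n
            print(f"attempt {n + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use, `retry(lambda: upload())` is enough, where `upload` is the script's own upload function.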

### Q3: How do I delete an uploaded dataset?

```python
# Delete through the API
from huggingface_hub import delete_repo
delete_repo(repo_id="jqliu/droid-failure-sampled", repo_type="dataset")

# Or delete it on the website:
# https://huggingface.co/datasets/jqliu/droid-failure-sampled/settings
```

### Q4: How do I make the dataset private?

In `upload_to_huggingface.py`:
```python
PRIVATE = True
```

Or on the website:
```
Settings → Make private
```

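### Q5: The upload fails with "Repository not found"

**Cause**: the repository has not been created, or the username is wrong

**Solution**:
1. Check that `HF_USERNAME` is correct
2. Create the repository manually at https://huggingface.co/new-dataset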

### Q6: How do I see the upload progress?

A `tqdm` progress bar is shown automatically. If you do not see it:
```python
# Add this to the script
import logging
logging.basicConfig(level=logging.INFO)
```

---

## Advanced features

### 1. Create a dataset card

Edit `README.md` and add YAML metadata:

```yaml
---
dataset_info:
  features:
  - name: video_id
    dtype: string
  - name: task_description
    dtype: string
  - name: source
    dtype: string
  - name: success
    dtype: bool
  splits:
  - name: train
    num_examples: 2500
  dataset_size: 12340000000  # size in bytes (about 12.34 GB)
task_categories:
- video-classification
- robotics
tags:
- robotics
- manipulation
- failure-detection
license: mit
---

# DROID Failure Dataset

...
```

### 2. Split the dataset (train/val/test)

```python
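from datasets import DatasetDict, load_dataset

# Load the uploaded data; assumes a single "train" split with 2500 samples
dataset = load_dataset("jqliu/droid-failure-sampled", split="train")

# Split
dataset_dict = DatasetDict({
    'train': dataset.select(range(2000)),
    'validation': dataset.select(range(2000, 2250)),
    'test': dataset.select(range(2250, 2500))
})

# Upload
dataset_dict.push_to_hub("jqliu/droid-failure-sampled")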
```
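
The fixed ranges above keep the samples in their stored order, so if the files are grouped by task, a split could end up with only a few categories. One possible alternative is to shuffle the indices first. The sketch below uses plain Python; `split_indices` is a hypothetical helper, and the 80/10/10 fractions simply mirror the 2000/250/250 sizes above.

```python
import random


def split_indices(n, fractions=(0.8, 0.1, 0.1), seed=42):
    """Shuffle range(n) and cut it into consecutive parts sized by `fractions`."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes whatever is left so that no index is dropped.
        end = n if i == len(fractions) - 1 else start + int(n * frac)
        parts.append(indices[start:end])
        start = end
    return parts
```

Each returned list can be passed to `dataset.select(...)` in place of the `range(...)` arguments, for example `train_idx, val_idx, test_idx = split_indices(len(dataset))`.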

### 3. Add video preprocessing

```python
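from datasets import Dataset, Features, Value, Video

# Define the features, including the video column
# (the Video feature needs a recent release of `datasets`)
features = Features({
    'video_id': Value('string'),
    'video': Video(),  # video feature; each value is a path to a video file
    'task_description': Value('string'),
    # ... remaining fields
})

# Create the dataset; `data` maps each column name to a list of values
dataset = Dataset.from_dict(data, features=features)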
```
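
The `data` dictionary is not defined in the snippet above. One way to build it from the prepared folder is sketched below, assuming that every metadata JSON file shares its file stem with a video and carries `video_id` and `task_description` keys; both the layout and the helper name `build_columns` are assumptions.

```python
import json
from pathlib import Path


def build_columns(root):
    """Collect column lists from <root>/metadata/*.json and <root>/videos/*.mp4."""
    root = Path(root)
    columns = {"video_id": [], "video": [], "task_description": []}
    for meta_path in sorted((root / "metadata").glob("*.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        video_path = root / "videos" / f"{meta_path.stem}.mp4"
        if not video_path.exists():
            continue  # skip metadata that has no matching video
        # Fall back to the file stem when the JSON has no video_id key.
        columns["video_id"].append(meta.get("video_id", meta_path.stem))
        columns["video"].append(str(video_path))
        columns["task_description"].append(meta.get("task_description", ""))
    return columns
```

The result has one list per column, which is the shape `Dataset.from_dict` expects; the `features` passed alongside it must then list exactly these columns.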

---

## Complete workflow example

```bash
# 1. Prepare the dataset
cd /home/jqliu/projects/RewardModel/data_sta
python prepare_hf_dataset.py

# 2. Set the HuggingFace token
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxx"

# 3. Upload the dataset
python upload_to_huggingface.py
# Choose method 2

# 4. Verify
python -c "from datasets import load_dataset; ds = load_dataset('jqliu/droid-failure-sampled'); print(len(ds['train']))"
```

---

## References

- **HuggingFace Hub documentation**: https://huggingface.co/docs/hub
- **Datasets library documentation**: https://huggingface.co/docs/datasets
- **Git LFS documentation**: https://git-lfs.github.com/
- **DROID dataset**: https://droid-dataset.github.io/

---

## Getting help

If you run into problems:
1. Check the HuggingFace community forum: https://discuss.huggingface.co/
2. Open an issue: https://github.com/huggingface/datasets/issues
3. Contact the dataset maintainer

---

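## Appendix: Script quick reference

### prepare_hf_dataset.py
Samples and prepares the dataset

**Key parameters**:
- `TOTAL_SAMPLES`: number of samples
- `SAMPLING_STRATEGY`: sampling strategy
- `PROCESS_TYPE`: data type

### upload_to_huggingface.py
Uploads the dataset to HuggingFace

**Key parameters**:
- `HF_USERNAME`: username
- `DATASET_NAME`: dataset name
- `PRIVATE`: whether the dataset is private

**Upload methods**:
1. upload_folder (simple)
2. Datasets library (recommended)
3. Chunked upload (large files)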