# Dataset Overview

This repository bundles the final datasets selected for handoff, along with the upstream source materials and prompt assets needed to trace where those datasets came from.

## Data Layers

The bundled data is organized into two layers:

- data/source-materials/
  - Human text source pools
  - AI-generated text source pools
  - Prompt assets used to generate the AI text pools
- data/dataset/
  - Five experiment-ready datasets that can be used directly by the packaged scripts
  - One central manifest at data/dataset/90_manifests/dataset_manifests.json
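
As a minimal sketch of how a script might use the central manifest as its entry point, the snippet below loads the JSON file at the path given above and reports its top-level shape. The manifest schema is not documented here, so the code deliberately assumes nothing about its keys; `load_manifest` and `summarize` are hypothetical helper names, not part of the packaged scripts.

```python
import json
from pathlib import Path

# Path as stated in this README, relative to the repository root.
MANIFEST_PATH = Path("data/dataset/90_manifests/dataset_manifests.json")


def load_manifest(path=MANIFEST_PATH):
    """Load the repository-level manifest and return the parsed JSON."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(manifest):
    """Describe the top-level shape without assuming any schema."""
    if isinstance(manifest, dict):
        return sorted(manifest)      # top-level keys
    if isinstance(manifest, list):
        return len(manifest)         # number of entries
    return type(manifest).__name__


# Only runs when executed from the repository root with the data present.
if MANIFEST_PATH.exists():
    print(summarize(load_manifest()))
```

Inspecting the shape first is a cautious way to discover how the five datasets are keyed before hard-coding any field names.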

## Source Materials

data/source-materials/ keeps the upstream materials in a delivery-friendly layout:

- human-core-pool
  - The main manually collected human text pool
  - This is the core human-side source used to build the downstream human datasets
- human-recovery-pool-v2
  - High-confidence human texts recovered from the quarantine branch, version 2
  - Used as a supplement to the main human pool
- human-recovery-pool-v3
  - High-confidence human texts recovered from the quarantine branch, version 3
  - Used as an additional supplement to the main human pool
- ai-generated-standard
  - AI-generated texts under the standard generation style
  - Keeps the original topic/subtopic/prompt tree for traceability
- ai-generated-natural-v1
  - AI-generated texts under the more natural writing-style branch
  - Also keeps the original topic/subtopic/prompt tree
- prompts-standard
  - Prompt files corresponding to the standard AI generation branch
- prompts-natural-v1
  - Prompt files corresponding to the natural-style AI generation branch

Prompt files stay as .txt files inside the packaged data tree: documentation lives in docs/, while the data area holds only assets.

## Experiment-Ready Datasets

data/dataset/ includes five packaged datasets:

- DS04_Human_pools_merged_v1
  - Pure human text pool
  - Works as the main human-side source dataset
  - All records are kept in train.jsonl
- DS11_Generated_AI_v1
  - Standard-style AI text pool
  - Pairs naturally with DS04 when building a standard human-vs-AI setting
  - All records are kept in train.jsonl
- DS12_Generated_AI_natural_v1
  - Natural-style AI text pool
  - Used for the harder, more natural writing branch
  - All records are kept in train.jsonl
- DS06_External_core_balanced_v1
  - Balanced experiment set built from DS04 and DS11
  - Includes train/dev/test and is suitable for direct evaluation and cross-domain experiments
- DS07_External_long_v1
  - Balanced experiment set built from DS04 and DS12
  - Includes train/dev/test and emphasizes the natural-style branch
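
To illustrate what "balanced" means when pairing a human pool with an AI pool, here is a rough sketch that downsamples the larger pool, labels each record, and cuts train/dev/test splits. This is an assumption-laden illustration, not the procedure actually used to build DS06 or DS07: the `label` field, the 0/1 encoding, the split fractions, and the function names are all hypothetical.

```python
import random


def build_balanced(human_records, ai_records, seed=0):
    """Downsample the larger pool so both classes are equal, then label and shuffle."""
    rng = random.Random(seed)
    n = min(len(human_records), len(ai_records))
    human = rng.sample(human_records, n)
    ai = rng.sample(ai_records, n)
    # "label" is a hypothetical field name: 0 = human, 1 = AI.
    labeled = [{**r, "label": 0} for r in human] + [{**r, "label": 1} for r in ai]
    rng.shuffle(labeled)
    return labeled


def split_records(records, dev_frac=0.1, test_frac=0.1):
    """Cut an already shuffled list into train/dev/test by fraction."""
    n = len(records)
    n_dev = int(n * dev_frac)
    n_test = int(n * test_frac)
    n_train = n - n_dev - n_test
    return (
        records[:n_train],
        records[n_train:n_train + n_dev],
        records[n_train + n_dev:],
    )
```

Fixing the seed keeps the sampled subset reproducible, which matters when comparing results across the standard and natural-style branches.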

Each dataset directory keeps its own train/dev/test .jsonl files, a manifest.json, and a check_noise.py script. The repository-level manifest provides a single portable entry point for script loading.
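
For reading the split files directly, a small sketch along these lines may help. It assumes the usual JSONL convention of one JSON object per line and makes no assumption about record fields; `read_jsonl` and `load_splits` are hypothetical helpers, not functions shipped with the packaged scripts.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_splits(dataset_dir, splits=("train", "dev", "test")):
    """Load whichever split files exist in a dataset directory."""
    dataset_dir = Path(dataset_dir)
    return {
        name: list(read_jsonl(dataset_dir / f"{name}.jsonl"))
        for name in splits
        if (dataset_dir / f"{name}.jsonl").exists()
    }
```

For example, `load_splits("data/dataset/DS06_External_core_balanced_v1")` would return a dict keyed by split name, skipping any split file that is absent.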

## What Is Intentionally Not Bundled

- Raw public benchmark downloads such as NLPCC, HC3, CLTS, or Zhihu RLHF source packages
- Processed datasets that depend on the excluded public-source branches
- The deprecated DS10 branch

This repository is meant to be a focused research asset pack, not a mirror of every intermediate or publicly downloadable dataset used during exploration.