# Dataset Overview

This repository bundles the final datasets selected for handoff, together with the upstream source materials and prompt assets needed to trace where those datasets came from.

## Data Layers

The bundled data is organized into two layers:

  • data/source-materials/
    • Human text source pools
    • AI-generated text source pools
    • Prompt assets used to generate the AI text pools
  • data/dataset/
    • Five experiment-ready datasets that can be used directly by the packaged scripts
    • One central manifest at data/dataset/90_manifests/dataset_manifests.json
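
As a minimal sketch of using the central manifest as an entry point, the snippet below parses the file and reports only its top-level shape. The manifest schema is not described in this overview, so the code assumes nothing beyond valid JSON; the helper names `load_central_manifest` and `summarize_top_level` are hypothetical and not part of the packaged scripts.

```python
import json
from pathlib import Path

# Relative path as given in this overview; repo_root is wherever the pack is checked out.
MANIFEST_RELATIVE_PATH = Path("data/dataset/90_manifests/dataset_manifests.json")


def load_central_manifest(repo_root):
    """Parse the central manifest and return it as plain Python data."""
    manifest_path = Path(repo_root) / MANIFEST_RELATIVE_PATH
    with manifest_path.open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize_top_level(manifest):
    """Describe the top-level shape without assuming any schema.

    Returns the sorted keys for a JSON object, the entry count for a
    JSON array, and the Python type name for anything else.
    """
    if isinstance(manifest, dict):
        return sorted(manifest)
    if isinstance(manifest, list):
        return len(manifest)
    return type(manifest).__name__
```

Calling `summarize_top_level(load_central_manifest("."))` from the repository root is one way to inspect the real structure before writing any loader that depends on specific fields.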

## Source Materials

data/source-materials/ keeps the upstream materials in a delivery-friendly layout:

  • human-core-pool
    • The main manually collected human text pool
    • This is the core human-side source used to build the downstream human datasets
  • human-recovery-pool-v2
    • High-confidence human texts recovered from the quarantine branch, version 2
    • Used as a supplement to the main human pool
  • human-recovery-pool-v3
    • High-confidence human texts recovered from the quarantine branch, version 3
    • Used as an additional supplement to the main human pool
  • ai-generated-standard
    • AI-generated texts under the standard generation style
    • Keeps the original topic/subtopic/prompt tree for traceability
  • ai-generated-natural-v1
    • AI-generated texts under the more natural writing-style branch
    • Also keeps the original topic/subtopic/prompt tree
  • prompts-standard
    • Prompt files corresponding to the standard AI generation branch
  • prompts-natural-v1
    • Prompt files corresponding to the natural-style AI generation branch

Prompt files stay as .txt in the packaged data tree: documentation lives in docs/, while the data area holds only assets.
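
To get a quick inventory of a prompt branch such as data/source-materials/prompts-standard, a sketch along these lines counts the .txt files under each first-level folder. It assumes only that prompt files end in .txt, as stated above; the exact folder depth is not assumed, and `count_txt_files_by_top_folder` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def count_txt_files_by_top_folder(branch_dir):
    """Count .txt files under branch_dir, grouped by first-level subfolder."""
    branch_dir = Path(branch_dir)
    counts = Counter()
    for txt_path in sorted(branch_dir.rglob("*.txt")):
        relative_parts = txt_path.relative_to(branch_dir).parts
        # Files sitting directly in branch_dir have no subfolder; group them under ".".
        top_folder = relative_parts[0] if len(relative_parts) > 1 else "."
        counts[top_folder] += 1
    return dict(counts)
```

Running the same function on a prompt branch and on the matching AI-generated branch gives a rough way to compare the two trees, provided the generated texts are also stored as .txt (an assumption this overview does not confirm).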

## Experiment-Ready Datasets

data/dataset/ includes five packaged datasets:

  • DS04_Human_pools_merged_v1
    • Pure human text pool
    • Works as the main human-side source dataset
    • All records are kept in train.jsonl
  • DS11_Generated_AI_v1
    • Standard-style AI text pool
    • Pairs naturally with DS04 when building a standard human-vs-AI setting
    • All records are kept in train.jsonl
  • DS12_Generated_AI_natural_v1
    • Natural-style AI text pool
    • Used for the harder, more natural writing branch
    • All records are kept in train.jsonl
  • DS06_External_core_balanced_v1
    • Balanced experiment set built from DS04 and DS11
    • Includes train/dev/test and is suitable for direct evaluation and cross-domain experiments
  • DS07_External_long_v1
    • Balanced experiment set built from DS04 and DS12
    • Includes train/dev/test and emphasizes the natural-style branch
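
For readers who want to assemble a comparable human-vs-AI set from the pool datasets, here is an illustrative sketch of one common way to balance two pools: downsample the larger one, label both, and shuffle. This is not the documented procedure used to build DS06 or DS07, which this overview does not describe; the function name, the `"human"`/`"ai"` label strings, and the fixed seed are all assumptions made for the example.

```python
import random


def build_balanced_pairs(human_records, ai_records, seed=0):
    """Return a shuffled list of (record, label) pairs with equal class sizes.

    The larger pool is randomly downsampled to the size of the smaller one.
    Labels "human" and "ai" are illustrative, not taken from the packaged data.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    class_size = min(len(human_records), len(ai_records))
    human_sample = rng.sample(human_records, class_size)
    ai_sample = rng.sample(ai_records, class_size)
    labeled = [(record, "human") for record in human_sample]
    labeled += [(record, "ai") for record in ai_sample]
    rng.shuffle(labeled)
    return labeled
```

The inputs would typically be the records read from the train.jsonl of a human pool and an AI pool, for example DS04 with DS11 for the standard setting.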

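Each dataset directory keeps its own JSONL split files (train.jsonl only for the three pool datasets; train, dev, and test files for the two balanced sets), a manifest.json, and a check_noise.py, while the repository-level manifest provides one portable entry point for script loading.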
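
Because the pool datasets ship only a train split while the balanced sets ship all three, a loader can simply pick up whichever split files exist. The sketch below assumes the files are named train.jsonl, dev.jsonl, and test.jsonl and contain one JSON object per line; it makes no assumption about the fields inside each record, and the function names are hypothetical.

```python
import json
from pathlib import Path

# Assumed file stems; only the ones present on disk are loaded.
SPLIT_NAMES = ("train", "dev", "test")


def read_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_available_splits(dataset_dir):
    """Return {split_name: records} for every split file found in dataset_dir."""
    dataset_dir = Path(dataset_dir)
    splits = {}
    for name in SPLIT_NAMES:
        split_path = dataset_dir / f"{name}.jsonl"
        if split_path.is_file():
            splits[name] = read_jsonl(split_path)
    return splits
```

For a pool dataset such as data/dataset/DS04_Human_pools_merged_v1 the result would be expected to contain only a `"train"` key, whereas the balanced sets should yield all three.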

## What Is Intentionally Not Bundled

  • Raw public benchmark downloads such as NLPCC, HC3, CLTS, or Zhihu RLHF source packages
  • Processed datasets that depend on the excluded public-source branches
  • The deprecated DS10 branch

This repository is meant to be a focused research asset pack, not a mirror of every intermediate or publicly downloadable dataset used during exploration.