LUCIFerace/enhanced-replica-model-pack

enhanced-replica-model-pack / docs /dataset_overview.md

	# Dataset Overview

	This repository bundles the final datasets selected for handoff, together with the upstream source materials and prompt assets needed to trace where those datasets came from.

	## Data Layers

	The bundled data is organized into two layers:

	- `data/source-materials/`
	  - Human text source pools
	  - AI-generated text source pools
	  - Prompt assets used to generate the AI text pools
	- `data/dataset/`
	  - Five experiment-ready datasets that can be used directly by the packaged scripts
	  - One central manifest at `data/dataset/90_manifests/dataset_manifests.json`
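As a minimal sketch of using the central manifest as an entry point, the snippet below loads the file at the path given above and summarizes its top level. The manifest schema is not documented here, so the code deliberately assumes nothing about its keys; `load_central_manifest`, `summarize`, and the `repo_root` argument are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path

# Relative path as stated in this overview; the repository root is an assumption.
MANIFEST_RELPATH = Path("data/dataset/90_manifests/dataset_manifests.json")


def load_central_manifest(repo_root="."):
    """Parse the central manifest and return it, whatever its top-level shape is."""
    manifest_path = Path(repo_root) / MANIFEST_RELPATH
    with manifest_path.open(encoding="utf-8") as fh:
        return json.load(fh)


def summarize(manifest):
    """Describe the top level of the manifest without assuming a schema."""
    if isinstance(manifest, dict):
        return sorted(manifest.keys())
    if isinstance(manifest, list):
        return [f"list of {len(manifest)} entries"]
    return [type(manifest).__name__]
```

Calling `summarize(load_central_manifest("."))` from the repository root is one way to inspect the manifest before relying on any particular field.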

	## Source Materials

	`data/source-materials/` keeps the upstream materials in a delivery-friendly layout:

	- `human-core-pool`
	  - The main manually collected human text pool
	  - This is the core human-side source used to build the downstream human datasets
	- `human-recovery-pool-v2`
	  - High-confidence human texts recovered from the quarantine branch, version 2
	  - Used as a supplement to the main human pool
	- `human-recovery-pool-v3`
	  - High-confidence human texts recovered from the quarantine branch, version 3
	  - Used as an additional supplement to the main human pool
	- `ai-generated-standard`
	  - AI-generated texts under the standard generation style
	  - Keeps the original topic/subtopic/prompt tree for traceability
	- `ai-generated-natural-v1`
	  - AI-generated texts under the more natural writing-style branch
	  - Also keeps the original topic/subtopic/prompt tree
	- `prompts-standard`
	  - Prompt files corresponding to the standard AI generation branch
	- `prompts-natural-v1`
	  - Prompt files corresponding to the natural-style AI generation branch

	Prompt files stay as `.txt` in the packaged data tree: documentation lives in `docs/`, while the data area holds only assets.
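A short sketch for locating the prompt assets described above. It assumes only what this overview states: that the two prompt directories sit under `data/source-materials/` and contain `.txt` files. The recursive search and the function name `list_prompt_files` are illustrative assumptions, since the internal layout of each prompt directory is not specified here.

```python
from pathlib import Path

# Directory names as listed in this overview.
PROMPT_DIRS = ("prompts-standard", "prompts-natural-v1")


def list_prompt_files(source_root):
    """Map each prompt directory name to its sorted .txt files, searched recursively."""
    root = Path(source_root)
    found = {}
    for name in PROMPT_DIRS:
        branch = root / name
        # A missing directory yields an empty list instead of raising.
        found[name] = sorted(branch.rglob("*.txt")) if branch.is_dir() else []
    return found
```

For example, `list_prompt_files("data/source-materials")` returns a dictionary with one entry per prompt branch, which makes it easy to compare how many prompts each branch carries.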

	## Experiment-Ready Datasets

	`data/dataset/` includes five packaged datasets:

	- `DS04_Human_pools_merged_v1`
	  - Pure human text pool
	  - Works as the main human-side source dataset
	  - All records are kept in `train.jsonl`
	- `DS11_Generated_AI_v1`
	  - Standard-style AI text pool
	  - Pairs naturally with DS04 when building a standard human-vs-AI setting
	  - All records are kept in `train.jsonl`
	- `DS12_Generated_AI_natural_v1`
	  - Natural-style AI text pool
	  - Used for the harder, more natural writing branch
	  - All records are kept in `train.jsonl`
	- `DS06_External_core_balanced_v1`
	  - Balanced experiment set built from DS04 and DS11
	  - Includes `train`, `dev`, and `test` splits and is suitable for direct evaluation and cross-domain experiments
	- `DS07_External_long_v1`
	  - Balanced experiment set built from DS04 and DS12
	  - Includes `train`, `dev`, and `test` splits and emphasizes the natural-style branch
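The overview says DS06 and DS07 are balanced sets built from a human pool and an AI pool, but it does not document the build procedure. The sketch below shows one common way to do this and is not the procedure used for this repository: downsample both pools to the smaller size, attach a label, shuffle, and cut into splits. The `label` field, the 0/1 encoding, and the split fractions are all hypothetical.

```python
import random


def build_balanced_splits(human_records, ai_records, seed=0,
                          dev_frac=0.1, test_frac=0.1):
    """Downsample both pools to equal size, label, shuffle, and split.

    Assumptions (not from the repository): a `label` field with 0 = human and
    1 = AI, and 10% dev / 10% test fractions. Balance holds over the whole
    set; individual splits are only approximately balanced.
    """
    rng = random.Random(seed)
    n = min(len(human_records), len(ai_records))
    human = rng.sample(human_records, n)
    ai = rng.sample(ai_records, n)
    labeled = [{**r, "label": 0} for r in human] + [{**r, "label": 1} for r in ai]
    rng.shuffle(labeled)
    n_dev = int(len(labeled) * dev_frac)
    n_test = int(len(labeled) * test_frac)
    return {
        "dev": labeled[:n_dev],
        "test": labeled[n_dev:n_dev + n_test],
        "train": labeled[n_dev + n_test:],
    }
```

A stratified split, which balances each split separately, would be a reasonable alternative when per-split class balance matters for evaluation.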

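	Each dataset directory keeps its own JSONL split files (`train.jsonl`, `dev.jsonl`, `test.jsonl`; the three pool datasets keep all records in `train.jsonl`), a `manifest.json`, and a `check_noise.py`, while the repository-level manifest provides a single portable entry point for script loading.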
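A minimal sketch for reading the split files of one dataset directory, assuming each `.jsonl` file holds one JSON object per line (the usual JSON Lines convention). The record fields are not documented in this overview, so the code makes no assumption about them; `read_jsonl` and `load_splits` are hypothetical helper names, not part of the packaged scripts.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
    return records


def load_splits(dataset_dir):
    """Load whichever of train/dev/test.jsonl exist in a dataset directory."""
    splits = {}
    for name in ("train", "dev", "test"):
        split_path = Path(dataset_dir) / f"{name}.jsonl"
        if split_path.is_file():
            splits[name] = read_jsonl(split_path)
    return splits
```

For instance, `load_splits("data/dataset/DS06_External_core_balanced_v1")` would return up to three lists keyed by split name. Splits whose files are missing are omitted from the result.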

	## What Is Intentionally Not Bundled

	- Raw public benchmark downloads such as NLPCC, HC3, CLTS, or Zhihu RLHF source packages
	- Processed datasets that depend on the excluded public-source branches
	- The deprecated `DS10` branch

	This repository is meant to be a focused research asset pack, not a mirror of every intermediate or publicly downloadable dataset used during exploration.