Add files using upload-large-folder tool

Files changed (13)

.gitattributes +0 -9
.gitignore +2 -1
README.md +34 -12
docs/dataset_overview.md +74 -0
docs/project_overview.md +6 -3
scripts/inference/run_cross_domain_batch.py +1 -1
scripts/inference/run_cross_domain_ensemble.py +1 -1
scripts/inference/run_logistic_regression_ensemble.py +2 -2
scripts/inference/run_zero_shot_detectors.py +6 -2
scripts/train_eval/cross_domain/evaluate_mixed_label_zero_shot.py +2 -3
scripts/train_eval/data_checks/inspect_dataset_distribution.py +2 -8
src/enhanced_replica/data_utils.py +3 -3
src/enhanced_replica/io_utils.py +12 -4

.gitattributes CHANGED

@@ -2,12 +2,3 @@
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.gz -text
-reports/bert-baseline/dev_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
-reports/roberta-baseline/dev_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
-reports/bert-baseline/test_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
-reports/roberta-baseline/test_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
-reports/bert-baseline/train_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
-reports/roberta-baseline/train_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
-models/qwen-adapters/shared-tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
-models/bert-final/classifier_full_model.bin filter=lfs diff=lfs merge=lfs -text
-models/roberta-final/classifier_full_model.bin filter=lfs diff=lfs merge=lfs -text

 *.safetensors filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.gz -text

.gitignore CHANGED

@@ -5,4 +5,5 @@
 .pytest_cache/
 .mypy_cache/
 .ipynb_checkpoints/
-outputs/

 .pytest_cache/
 .mypy_cache/
 .ipynb_checkpoints/
+outputs/
+.cache/

README.md CHANGED

@@ -9,26 +9,30 @@ tags:
   - qwen
   - lora
   - research
 library_name: transformers
 ---
 # EnhancedReplica Research Asset Pack
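-This is a research handoff package that has already been organized in the style of a Hugging Face repository. It no longer tries to preserve the original messy hierarchy of the local experiment directories; instead, it consolidates the assets that actually need to be uploaded, handed off, and archived long-term into one clear project.
 ## What this repository contains
 - `models/`: the final retained BERT and RoBERTa weights, plus 3 Qwen LoRA adapters
 - `src/`: shared Python modules reused by the training, evaluation, and inference scripts
-- `scripts/`: entry-point scripts split by task, covering data building, training and evaluation, inference, and batch workflows
 - `configs/`: snapshots of the training, inference, and ensemble configs
 - `reports/`: experiment outputs, metrics, manifests, logs, and compressed prediction files
-- `docs/`: project notes, model inventory, script rename map, and upload notes
 ## What this repository does not contain
 - Does not include the Qwen base model weights
-- Does not include the final datasets themselves
 - Does not guarantee that the full training pipeline can be re-run with one command from the current directory
 - No longer mixes Markdown documents with source scripts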
@@ -45,6 +49,15 @@ library_name: transformers
   - `shared-tokenizer/` keeps one shared copy of the tokenizer assets
   - At inference time, `QWEN_BASE_MODEL_PATH` must point to a separately downloaded base model
 ## Directory map
 ```text
@@ -53,6 +66,7 @@ library_name: transformers
 ├── .gitattributes
 ├── .gitignore
 ├── requirements.txt
 ├── models/
 ├── src/
 ├── scripts/
@@ -69,6 +83,12 @@ library_name: transformers
 pip install -r requirements.txt
 ```
 Run BERT / RoBERTa inference:
 ```bash
@@ -82,17 +102,19 @@ export QWEN_BASE_MODEL_PATH=/path/to/Qwen2.5-7B-Instruct
 python scripts/inference/infer_qwen_adapters.py --dataset DS06_External_core_balanced_v1
 ```
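 ## Why it is suitable for uploading to Hugging Face
 - Large weights are configured for Git LFS tracking through `.gitattributes`
 - The top-level structure has converged on a common HF project layout
 - Documentation is concentrated in `docs/`, which keeps the source tree cleaner
 - Large prediction files are already compressed to `.csv.gz`
-- Top-level directories and core script names all use English / ASCII, which suits Git and the HF Hub better
-## Documents to read first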
-- `docs/project_overview.md`
-- `docs/model_inventory.md`
-- `docs/script_name_map.md`
-- `docs/huggingface_upload.md`

   - qwen
   - lora
   - research
+  - dataset
+  - ai-text-detection
 library_name: transformers
 ---
 # EnhancedReplica Research Asset Pack
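+This is a research handoff package that has already been organized in the style of a Hugging Face repository. It no longer preserves the messy hierarchy of the local experiment directories; instead, it consolidates the assets that actually need to be handed off, uploaded, and archived long-term into one clear project.
 ## What this repository contains
 - `models/`: the final retained BERT and RoBERTa weights, plus 3 Qwen LoRA adapters
 - `src/`: shared Python modules reused by the training, evaluation, and inference scripts
+- `scripts/`: entry-point scripts split by task, covering training and evaluation, inference, and batch workflows
 - `configs/`: snapshots of the training, inference, and ensemble configs
 - `reports/`: experiment outputs, metrics, manifests, logs, and compressed prediction files
+- `data/`: data assets bundled in the repository, split into experiment-ready datasets and raw source materials/prompts
+- `docs/`: project notes, model inventory, script map, data notes, and upload notes
 ## What this repository does not contain
 - Does not include the Qwen base model weights
+- Does not include raw download copies of every public benchmark dataset
+- Does not include the complete set of external benchmark data
 - Does not guarantee that the full training pipeline can be re-run with one command from the current directory
 - No longer mixes Markdown documents with source scripts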
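   - `shared-tokenizer/` keeps one shared copy of the tokenizer assets
   - At inference time, `QWEN_BASE_MODEL_PATH` must point to a separately downloaded base model
+## Data notes
+- `data/dataset/`
+  - Contains 5 datasets that can be used directly from within the repository: `DS04`, `DS06`, `DS07`, `DS11`, `DS12`
+  - `90_manifests/dataset_manifests.json` has been rewritten with repo-relative paths and can be read directly by the scripts
+- `data/source-materials/`
+  - Contains human text source material, AI-generated text source material, and two sets of prompt assets
+  - First-level directories consistently use ASCII names, which makes them easier for Git, Hugging Face, and downstream scripts to handle
 ## Directory map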
 ```text
 ├── .gitattributes
 ├── .gitignore
 ├── requirements.txt
+├── data/
 ├── models/
 ├── src/
 ├── scripts/
 pip install -r requirements.txt
 ```
+Inspect the datasets bundled in the repository:
+```bash
+python scripts/train_eval/data_checks/inspect_dataset_distribution.py --dataset_ids DS04,DS06,DS07,DS11,DS12 --smoke
+```
 Run BERT / RoBERTa inference:
 ```bash
 python scripts/inference/infer_qwen_adapters.py --dataset DS06_External_core_balanced_v1
 ```
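+## Documentation entry points
+- [Project overview](docs/project_overview.md)
+- [Dataset overview](docs/dataset_overview.md)
+- [Model inventory](docs/model_inventory.md)
+- [Script map](docs/script_name_map.md)
+- [Upload notes](docs/huggingface_upload.md)
 ## Why it is suitable for uploading to Hugging Face
 - Large weights are configured for Git LFS tracking through `.gitattributes`
 - The top-level structure has converged on a common HF project layout
 - Documentation is concentrated in `docs/`, which keeps the source tree cleaner
+- The selected data assets are now part of the repository, so it no longer depends on a self-provided data directory outside the repository
 - Large prediction files are already compressed to `.csv.gz`
+- Top-level directories and core script names all use English / ASCII, which suits Git and the HF Hub better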
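
The README requires `QWEN_BASE_MODEL_PATH` to point at a separately downloaded base model before the Qwen adapter inference script is run. A fail-fast check for that variable might look like the sketch below. The helper name `require_base_model_path` is hypothetical and does not appear in the repository; the packaged scripts may validate the variable differently.

```python
import os
from pathlib import Path


def require_base_model_path(env_var: str = "QWEN_BASE_MODEL_PATH") -> Path:
    """Return the directory named by env_var, or raise a clear error.

    Hypothetical helper for illustration only; not part of the repository.
    """
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        raise RuntimeError(f"{env_var} is not set; export it to the base model directory")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"{env_var} points to a missing directory: {path}")
    return path
```

Calling such a check before loading any adapter would surface a missing or mistyped path immediately, rather than partway through model loading.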

docs/dataset_overview.md ADDED

@@ -0,0 +1,74 @@
+# Dataset Overview
+This repository now bundles the selected final datasets that were actually kept for handoff, along with the upstream source materials and prompt assets needed to understand where those datasets came from.
+## Data Layers
+The bundled data is organized into two layers:
+- `data/source-materials/`
+  - Human text source pools
+  - AI-generated text source pools
+  - Prompt assets used to generate the AI text pools
+- `data/dataset/`
+  - Five experiment-ready datasets that can be used directly by the packaged scripts
+  - One central manifest at `data/dataset/90_manifests/dataset_manifests.json`
+## Source Materials
+`data/source-materials/` keeps the upstream materials in a delivery-friendly layout:
+- `human-core-pool`
+  - The main manually collected human text pool
+  - This is the core human-side source used to build the downstream human datasets
+- `human-recovery-pool-v2`
+  - High-confidence human texts recovered from the quarantine branch, version 2
+  - Used as a supplement to the main human pool
+- `human-recovery-pool-v3`
+  - High-confidence human texts recovered from the quarantine branch, version 3
+  - Used as an additional supplement to the main human pool
+- `ai-generated-standard`
+  - AI-generated texts under the standard generation style
+  - Keeps the original topic/subtopic/prompt tree for traceability
+- `ai-generated-natural-v1`
+  - AI-generated texts under the more natural writing-style branch
+  - Also keeps the original topic/subtopic/prompt tree
+- `prompts-standard`
+  - Prompt files corresponding to the standard AI generation branch
+- `prompts-natural-v1`
+  - Prompt files corresponding to the natural-style AI generation branch
+Prompt files remain as `.txt` in the packaged data tree, so documentation stays in `docs/` while the data area holds only assets.
+## Experiment-Ready Datasets
+`data/dataset/` includes five packaged datasets:
+- `DS04_Human_pools_merged_v1`
+  - Pure human text pool
+  - Works as the main human-side source dataset
+  - All records are kept in `train.jsonl`
+- `DS11_Generated_AI_v1`
+  - Standard-style AI text pool
+  - Pairs naturally with DS04 when building a standard human-vs-AI setting
+  - All records are kept in `train.jsonl`
+- `DS12_Generated_AI_natural_v1`
+  - Natural-style AI text pool
+  - Used for the harder, more natural writing branch
+  - All records are kept in `train.jsonl`
+- `DS06_External_core_balanced_v1`
+  - Balanced experiment set built from DS04 and DS11
+  - Includes `train/dev/test` and is suitable for direct evaluation and cross-domain experiments
+- `DS07_External_long_v1`
+  - Balanced experiment set built from DS04 and DS12
+  - Includes `train/dev/test` and emphasizes the natural-style branch
+Each dataset directory keeps its own `train/dev/test.jsonl`, `manifest.json`, and `check_noise.py`, while the repository-level manifest provides one portable entry point for script loading.
+## What Is Intentionally Not Bundled
+- Raw public benchmark downloads such as NLPCC, HC3, CLTS, or Zhihu RLHF source packages
+- Processed datasets that depend on the excluded public-source branches
+- The deprecated `DS10` branch
+This repository is meant to be a focused research asset pack, not a mirror of every intermediate or publicly downloadable dataset used during exploration.
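
The overview above describes one central manifest plus per-dataset JSONL splits. The sketch below shows one way a reader could load a split through that manifest. It assumes the manifest maps each dataset ID to an entry with a `dataset_dir` key (the key the packaged code reads) and that each split file holds one JSON object per line. The function name `load_bundled_split` is hypothetical.

```python
import json
from pathlib import Path


def load_bundled_split(repo_root: Path, ds_id: str, split: str = "train") -> list:
    """Read one JSONL split of a bundled dataset via the central manifest.

    Hypothetical helper; the repository ships its own loaders in
    src/enhanced_replica/data_utils.py.
    """
    manifest_path = repo_root / "data" / "dataset" / "90_manifests" / "dataset_manifests.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    if ds_id not in manifest:
        raise KeyError(f"{ds_id} not found in dataset manifest")

    # ASSUMPTION: dataset_dir is repo-relative unless it is already absolute.
    dataset_dir = Path(manifest[ds_id]["dataset_dir"])
    if not dataset_dir.is_absolute():
        dataset_dir = repo_root / dataset_dir

    records = []
    with (dataset_dir / f"{split}.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                records.append(json.loads(line))
    return records
```

For the pool-style datasets (`DS04`, `DS11`, `DS12`), `split="train"` would be the natural choice, since the overview says all of their records are kept in `train.jsonl`.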

docs/project_overview.md CHANGED

@@ -1,9 +1,10 @@
-# Project Overview
 This repository is the Hugging Face friendly version of the local research handoff package.
-It is organized around six durable parts:
 - `models/`: final checkpoints and adapters
 - `src/enhanced_replica/`: shared Python modules
 - `scripts/`: task-oriented entry points
@@ -15,5 +16,7 @@ Design choices:
 - Top-level folders use English / ASCII names for Git and HF Hub compatibility.
 - Markdown is concentrated under `docs/` so source trees stay uncluttered.
 - Prediction CSV files in `reports/` were compressed to `.csv.gz` to reduce repository weight.
-- The repository is upload-ready, but it is still an archive-oriented project pack rather than a fully reproducible end-to-end training repo.

+# Project Overview
 This repository is the Hugging Face friendly version of the local research handoff package.
+It is organized around seven durable parts:
+- `data/`: bundled datasets, source materials, prompts, and a repo-local dataset manifest
 - `models/`: final checkpoints and adapters
 - `src/enhanced_replica/`: shared Python modules
 - `scripts/`: task-oriented entry points
 - Top-level folders use English / ASCII names for Git and HF Hub compatibility.
 - Markdown is concentrated under `docs/` so source trees stay uncluttered.
+- Only the selected final data assets are bundled. Public benchmark downloads and the full external dataset zoo are intentionally excluded.
+- `data/dataset/90_manifests/dataset_manifests.json` is rewritten with repo-relative paths so the packaged scripts can resolve datasets inside the repository.
 - Prediction CSV files in `reports/` were compressed to `.csv.gz` to reduce repository weight.
+- The repository is upload-ready, but it is still an archive-oriented project pack rather than a fully reproducible end-to-end training repo.

scripts/inference/run_cross_domain_batch.py CHANGED

@@ -40,7 +40,7 @@ def main() -> None:
             "--dataset",
             dataset,
             "--fallback-dataset",
-            "DS01_NLPCC_core_v1",
             "--force-fallback",
         ])

             "--dataset",
             dataset,
             "--fallback-dataset",
+            "DS06_External_core_balanced_v1",
             "--force-fallback",
         ])

scripts/inference/run_cross_domain_ensemble.py CHANGED

@@ -249,7 +249,7 @@ def main():
     import argparse
     parser = argparse.ArgumentParser()
     parser.add_argument("--dataset", required=True)
-    parser.add_argument("--fallback-dataset", default="DS01_NLPCC_core_v1",
                         help="Fallback dev dataset for LR training")
     parser.add_argument("--force-fallback", action="store_true",
                         help="Force LR training on fallback dataset regardless of local dev size")

     import argparse
     parser = argparse.ArgumentParser()
     parser.add_argument("--dataset", required=True)
+    parser.add_argument("--fallback-dataset", default="DS06_External_core_balanced_v1",
                         help="Fallback dev dataset for LR training")
     parser.add_argument("--force-fallback", action="store_true",
                         help="Force LR training on fallback dataset regardless of local dev size")
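
The help strings in this hunk imply a size-based rule: train the LR ensemble on the local dev split unless it is too small or `--force-fallback` is set. The sketch below mirrors the three options from the diff and adds a hypothetical selection function. The `min_dev_rows` threshold and the name `pick_lr_dev_dataset` are assumptions for illustration; the script's actual rule is not shown in the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the three CLI options visible in the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--fallback-dataset", default="DS06_External_core_balanced_v1",
                        help="Fallback dev dataset for LR training")
    parser.add_argument("--force-fallback", action="store_true",
                        help="Force LR training on fallback dataset regardless of local dev size")
    return parser


def pick_lr_dev_dataset(args: argparse.Namespace, local_dev_rows: int,
                        min_dev_rows: int = 200) -> str:
    """Choose which dataset's dev split is used to fit the LR ensemble.

    ASSUMPTION: min_dev_rows is a made-up threshold, not a value from the repository.
    """
    if args.force_fallback or local_dev_rows < min_dev_rows:
        return args.fallback_dataset
    return args.dataset


args = build_parser().parse_args(["--dataset", "DS07_External_long_v1"])
print(pick_lr_dev_dataset(args, local_dev_rows=50))
print(pick_lr_dev_dataset(args, local_dev_rows=5000))
```

With the new default, an undersized dev split falls back to a dataset that is bundled in the repository rather than to the excluded NLPCC branch.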

scripts/inference/run_logistic_regression_ensemble.py CHANGED

@@ -259,8 +259,8 @@ def fit_lr_bucket(df_dev, feature_cols, global_scaler, global_clf, global_th):
 def main():
     import argparse
     parser = argparse.ArgumentParser()
-    parser.add_argument("--dataset", required=True, help="e.g. DS01_NLPCC_core_v1 or DS13_NLPCC_full_test_v1")
-    parser.add_argument("--fallback-dataset", default="DS01_NLPCC_core_v1",
                         help="If dev samples are insufficient, use this dataset's dev to train LR")
     args = parser.parse_args()

 def main():
     import argparse
     parser = argparse.ArgumentParser()
+    parser.add_argument("--dataset", required=True, help="e.g. DS06_External_core_balanced_v1 or DS13_NLPCC_full_test_v1")
+    parser.add_argument("--fallback-dataset", default="DS06_External_core_balanced_v1",
                         help="If dev samples are insufficient, use this dataset's dev to train LR")
     args = parser.parse_args()

scripts/inference/run_zero_shot_detectors.py CHANGED

@@ -12,8 +12,12 @@ from pathlib import Path
 from transformers import AutoTokenizer, AutoModelForCausalLM
 from modelscope import snapshot_download
-DATASET_ROOT = Path("data/dataset")
-OUTPUT_ROOT = Path("outputs/zero_shot")
 MAX_LENGTH = 512
 BATCH_SIZE = 16
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

 from transformers import AutoTokenizer, AutoModelForCausalLM
 from modelscope import snapshot_download
+REPO_ROOT = Path(__file__).resolve()
+while REPO_ROOT != REPO_ROOT.parent and not (REPO_ROOT / "src").exists():
+    REPO_ROOT = REPO_ROOT.parent
+DATASET_ROOT = REPO_ROOT / "data" / "dataset"
+OUTPUT_ROOT = REPO_ROOT / "outputs" / "zero_shot"
 MAX_LENGTH = 512
 BATCH_SIZE = 16
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
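
The new `REPO_ROOT` loop walks upward from the script until it finds a directory that contains `src`, so paths no longer depend on the current working directory. The same idea can be written as a small function, sketched below. This is a variation rather than the committed code: the name `find_repo_root` is hypothetical, and unlike the loop in the diff it raises when no marker is found instead of stopping silently at the filesystem root.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest ancestor of start that contains marker.

    Hypothetical variation on the loop in the diff; raises instead of
    falling through to the filesystem root when nothing matches.
    """
    current = start.resolve()
    # Check the starting path itself, then each parent in turn.
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains '{marker}'")
```

Inside a script, `find_repo_root(Path(__file__))` would play the role of `REPO_ROOT`, with `DATASET_ROOT` and `OUTPUT_ROOT` derived from it as in the diff.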

scripts/train_eval/cross_domain/evaluate_mixed_label_zero_shot.py CHANGED

@@ -86,11 +86,10 @@ def run_e09(args: argparse.Namespace) -> dict:
     # 3. Load target dataset split
     manifest = load_dataset_manifest(Path(args.manifest_file))
     ds_meta = get_ds_meta(manifest, args.dataset_id)
-    dataset_dir = Path(ds_meta["dataset_dir"])
-    split_path = dataset_dir / f"{args.split}.jsonl"
     if not split_path.exists():
         raise FileNotFoundError(f"Split file not found: {split_path}")
-    df = load_split_df(dataset_dir, args.split)
     logger.info(f"Target dataset: {args.dataset_id} | split={args.split} | rows={len(df)}")
     # 4. Load model and run inference

     # 3. Load target dataset split
     manifest = load_dataset_manifest(Path(args.manifest_file))
     ds_meta = get_ds_meta(manifest, args.dataset_id)
+    split_path = Path(ds_meta[args.split])
     if not split_path.exists():
         raise FileNotFoundError(f"Split file not found: {split_path}")
+    df = load_split_df(split_path)
     logger.info(f"Target dataset: {args.dataset_id} | split={args.split} | rows={len(df)}")
     # 4. Load model and run inference
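
After this change the split path comes straight from `ds_meta[args.split]`, so an unexpected split name would surface as a bare `KeyError`. The sketch below shows one way to make both failure modes explicit. It assumes `ds_meta` carries `train`, `dev`, and `test` path entries, which is the shape of the inline dict removed from `inspect_dataset_distribution.py` in this commit. The helper name `resolve_split_path` and the local `SPLITS` tuple are hypothetical; the real `SPLITS` constant lives in `enhanced_replica.data_utils` and its value is not shown here.

```python
from pathlib import Path

# ASSUMPTION: split names taken from the removed inline dict, not from the real constant.
SPLITS = ("train", "dev", "test")


def resolve_split_path(ds_meta: dict, split: str) -> Path:
    """Look up a split path in ds_meta and verify the file exists.

    Hypothetical helper for illustration only.
    """
    if split not in SPLITS:
        raise ValueError(f"Unknown split '{split}'; expected one of {SPLITS}")
    split_path = Path(ds_meta[split])
    if not split_path.exists():
        raise FileNotFoundError(f"Split file not found: {split_path}")
    return split_path
```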

scripts/train_eval/data_checks/inspect_dataset_distribution.py CHANGED

@@ -25,7 +25,7 @@ for _candidate in (REPO_ROOT, REPO_ROOT / "src"):
         sys.path.insert(0, _candidate_str)
 from enhanced_replica.cli_args import add_base_args
-from enhanced_replica.data_utils import load_dataset_manifest, load_dataset_splits, SPLITS, validate_schema
 from enhanced_replica.io_utils import create_run_context, ensure_dir, write_csv, write_json, write_run_manifest, write_run_report, write_yaml_minimal
@@ -61,13 +61,7 @@ def run_e00(args: argparse.Namespace) -> dict:
     for ds_id in ds_ids:
         info = manifest[ds_id]
-        dataset_dir = Path(info["dataset_dir"])
-        ds_meta = {
-            "dataset_id": info["dataset_id"],
-            "train": dataset_dir / "train.jsonl",
-            "dev": dataset_dir / "dev.jsonl",
-            "test": dataset_dir / "test.jsonl",
-        }
         # 1. Load splits (this tests _common.data_utils.load_dataset_splits)
         try:

         sys.path.insert(0, _candidate_str)
 from enhanced_replica.cli_args import add_base_args
+from enhanced_replica.data_utils import get_ds_meta, load_dataset_manifest, load_dataset_splits, SPLITS, validate_schema
 from enhanced_replica.io_utils import create_run_context, ensure_dir, write_csv, write_json, write_run_manifest, write_run_report, write_yaml_minimal
     for ds_id in ds_ids:
         info = manifest[ds_id]
+        ds_meta = get_ds_meta(manifest, ds_id)
         # 1. Load splits (this tests _common.data_utils.load_dataset_splits)
         try:

src/enhanced_replica/data_utils.py CHANGED

@@ -7,7 +7,7 @@ from typing import Dict, List, Sequence
 import pandas as pd
-from .io_utils import read_json
 DEFAULT_REQUIRED_FIELDS = ["record_id", "text", "label", "source", "split", "length_char", "topic", "model_slug"]
@@ -28,7 +28,7 @@ def load_dataset_manifest(manifest_file: Path | None = None) -> dict:
     if manifest_file is None:
         from .io_utils import DEFAULT_MANIFEST_FILE
         manifest_file = DEFAULT_MANIFEST_FILE
-    return read_json(manifest_file)
 def get_ds_meta(manifest: dict, ds_id: str) -> dict:
@@ -36,7 +36,7 @@ def get_ds_meta(manifest: dict, ds_id: str) -> dict:
     if ds_id not in manifest:
         raise KeyError(f"{ds_id} not found in dataset manifest")
     info = manifest[ds_id]
-    ds_dir = Path(info["dataset_dir"])
     out = {
         "dataset_id": info["dataset_id"],
         "dataset_dir": str(ds_dir),

 import pandas as pd
+from .io_utils import read_json, resolve_repo_path
 DEFAULT_REQUIRED_FIELDS = ["record_id", "text", "label", "source", "split", "length_char", "topic", "model_slug"]
     if manifest_file is None:
         from .io_utils import DEFAULT_MANIFEST_FILE
         manifest_file = DEFAULT_MANIFEST_FILE
+    return read_json(resolve_repo_path(manifest_file))
 def get_ds_meta(manifest: dict, ds_id: str) -> dict:
     if ds_id not in manifest:
         raise KeyError(f"{ds_id} not found in dataset manifest")
     info = manifest[ds_id]
+    ds_dir = resolve_repo_path(info["dataset_dir"])
     out = {
         "dataset_id": info["dataset_id"],
         "dataset_dir": str(ds_dir),

src/enhanced_replica/io_utils.py CHANGED

@@ -10,11 +10,12 @@ from typing import Any, Dict, Iterable, List
 import pandas as pd
-SCRIPT_ROOT = Path(__file__).resolve().parents[1]
-ROUTE_ROOT = SCRIPT_ROOT.parents[1]
-DATASET_ROOT = ROUTE_ROOT / "data" / "dataset"
 DEFAULT_MANIFEST_FILE = DATASET_ROOT / "90_manifests" / "dataset_manifests.json"
-DEFAULT_OUTPUT_ROOT = ROUTE_ROOT / "outputs"
 def now_ts() -> str:
@@ -30,6 +31,13 @@ def ensure_dir(path: Path) -> Path:
     return path
 def read_json(path: Path) -> Any:
     return json.loads(path.read_text(encoding="utf-8"))

 import pandas as pd
+PACKAGE_ROOT = Path(__file__).resolve().parent
+SRC_ROOT = PACKAGE_ROOT.parent
+REPO_ROOT = SRC_ROOT.parent
+DATASET_ROOT = REPO_ROOT / "data" / "dataset"
 DEFAULT_MANIFEST_FILE = DATASET_ROOT / "90_manifests" / "dataset_manifests.json"
+DEFAULT_OUTPUT_ROOT = REPO_ROOT / "outputs"
 def now_ts() -> str:
     return path
+def resolve_repo_path(path: str | Path) -> Path:
+    resolved = Path(path)
+    if resolved.is_absolute():
+        return resolved
+    return REPO_ROOT / resolved
 def read_json(path: Path) -> Any:
     return json.loads(path.read_text(encoding="utf-8"))
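
The new `resolve_repo_path` helper leaves absolute paths alone and anchors relative ones at `REPO_ROOT`, which is what lets the manifest store repo-relative `dataset_dir` values. The sketch below restates that behavior with the root passed in explicitly so it can be tried outside the package. The name `resolve_against` is hypothetical.

```python
from pathlib import Path
from typing import Union


def resolve_against(root: Path, path: Union[str, Path]) -> Path:
    """Anchor a relative path at root; return absolute paths unchanged.

    Hypothetical standalone restatement of resolve_repo_path from the diff.
    """
    candidate = Path(path)
    if candidate.is_absolute():
        return candidate
    return root / candidate


repo_root = Path.cwd()
print(resolve_against(repo_root, "data/dataset/DS06_External_core_balanced_v1"))
```

Because absolute paths pass through untouched, a manifest written with machine-specific absolute paths would still load; only relative entries depend on where the repository is checked out.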