LUCIFerace committed on
Commit
6b6f412
·
verified ·
1 Parent(s): d2048ce

Add files using upload-large-folder tool

.gitattributes CHANGED
@@ -2,12 +2,3 @@
2
  *.safetensors filter=lfs diff=lfs merge=lfs -text
3
  *.pkl filter=lfs diff=lfs merge=lfs -text
4
  *.gz -text
5
- reports/bert-baseline/dev_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
6
- reports/roberta-baseline/dev_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
7
- reports/bert-baseline/test_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
8
- reports/roberta-baseline/test_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
9
- reports/bert-baseline/train_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
10
- reports/roberta-baseline/train_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
11
- models/qwen-adapters/shared-tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
12
- models/bert-final/classifier_full_model.bin filter=lfs diff=lfs merge=lfs -text
13
- models/roberta-final/classifier_full_model.bin filter=lfs diff=lfs merge=lfs -text
 
2
  *.safetensors filter=lfs diff=lfs merge=lfs -text
3
  *.pkl filter=lfs diff=lfs merge=lfs -text
4
  *.gz -text
.gitignore CHANGED
@@ -5,4 +5,5 @@
5
  .pytest_cache/
6
  .mypy_cache/
7
  .ipynb_checkpoints/
8
- outputs/
 
 
5
  .pytest_cache/
6
  .mypy_cache/
7
  .ipynb_checkpoints/
8
+ outputs/
9
+ .cache/
README.md CHANGED
@@ -9,26 +9,30 @@ tags:
9
  - qwen
10
  - lora
11
  - research
 
 
12
  library_name: transformers
13
  ---
14
 
15
  # EnhancedReplica Research Asset Pack
16
 
17
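- This is a research delivery package already organized in the style of a Hugging Face repository. It no longer tries to preserve the original messy hierarchy of the local experiment directories; instead, it consolidates the assets that actually need to be uploaded, handed off, and kept long term into one clear project.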
18
 
19
  ## What is in this repository
20
 
21
  - `models/`: the final retained BERT and RoBERTa weights, plus 3 Qwen LoRA adapters
22
  - `src/`: shared Python modules reused by the training, evaluation, and inference scripts
23
- - `scripts/`: entry-point scripts split by task, including data construction, training and evaluation, inference, and batch workflows
24
  - `configs/`: snapshots of the training, inference, and ensemble configurations
25
  - `reports/`: experiment outputs, metrics, manifests, logs, and compressed prediction files
26
- - `docs/`: project description, model inventory, script rename map, and upload instructions
 
27
 
28
  ## What is not in this repository
29
 
30
  - Does not include the Qwen base model weights
31
- - Does not include the final data
 
32
  - Does not guarantee that the full training pipeline can be rerun with one command from the current directory
33
  - Markdown documents are no longer mixed in with source scripts
34
 
@@ -45,6 +49,15 @@ library_name: transformers
45
  - `shared-tokenizer/` keeps one shared copy of the tokenizer assets
46
  - Inference requires `QWEN_BASE_MODEL_PATH` to point to a separately downloaded base model
47
 
48
  ## Directory map
49
 
50
  ```text
@@ -53,6 +66,7 @@ library_name: transformers
53
  ├── .gitattributes
54
  ├── .gitignore
55
  ├── requirements.txt
 
56
  ├── models/
57
  ├── src/
58
  ├── scripts/
@@ -69,6 +83,12 @@ library_name: transformers
69
  pip install -r requirements.txt
70
  ```
71
 
 
 
 
 
 
 
72
  Run BERT / RoBERTa inference:
73
 
74
  ```bash
@@ -82,17 +102,19 @@ export QWEN_BASE_MODEL_PATH=/path/to/Qwen2.5-7B-Instruct
82
  python scripts/inference/infer_qwen_adapters.py --dataset DS06_External_core_balanced_v1
83
  ```
84
 
 
 
 
 
 
 
 
 
85
  ## Why it is suitable for uploading to Hugging Face
86
 
87
  - Large weights are configured for Git LFS tracking via `.gitattributes`
88
  - The top-level structure has converged on the common HF project layout
89
  - Documentation is concentrated in `docs/`, which keeps the source tree cleaner
 
90
  - Large prediction files are already compressed to `.csv.gz`
91
- - Top-level directories and core scripts are all named in English / ASCII, which suits Git and the HF Hub better
92
-
93
- ## Suggested documents to read first
94
-
95
- - `docs/project_overview.md`
96
- - `docs/model_inventory.md`
97
- - `docs/script_name_map.md`
98
- - `docs/huggingface_upload.md`
 
9
  - qwen
10
  - lora
11
  - research
12
+ - dataset
13
+ - ai-text-detection
14
  library_name: transformers
15
  ---
16
 
17
  # EnhancedReplica Research Asset Pack
18
 
19
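+ This is a research delivery package already organized in the style of a Hugging Face repository. It no longer preserves the messy hierarchy of the local experiment directories; instead, it consolidates the assets that actually need to be handed off, uploaded, and kept long term into one clear project.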
20
 
21
  ## What is in this repository
22
 
23
  - `models/`: the final retained BERT and RoBERTa weights, plus 3 Qwen LoRA adapters
24
  - `src/`: shared Python modules reused by the training, evaluation, and inference scripts
25
+ - `scripts/`: entry-point scripts split by task, including training and evaluation, inference, and batch workflows
26
  - `configs/`: snapshots of the training, inference, and ensemble configurations
27
  - `reports/`: experiment outputs, metrics, manifests, logs, and compressed prediction files
28
+ - `data/`: data assets bundled in the repository, split into experiment-ready datasets and raw source materials / prompts
29
+ - `docs/`: project description, model inventory, script map, data description, and upload instructions
30
 
31
  ## What is not in this repository
32
 
33
  - Does not include the Qwen base model weights
34
+ - Does not include raw download copies of all public benchmark data
35
+ - Does not include the complete set of external benchmark data
36
  - Does not guarantee that the full training pipeline can be rerun with one command from the current directory
37
  - Markdown documents are no longer mixed in with source scripts
38
 
 
49
  - `shared-tokenizer/` keeps one shared copy of the tokenizer assets
50
  - Inference requires `QWEN_BASE_MODEL_PATH` to point to a separately downloaded base model
51
 
52
+ ## Data notes
53
+
54
+ - `data/dataset/`
55
+ - Contains 5 datasets that can be used directly inside the repository: `DS04`, `DS06`, `DS07`, `DS11`, `DS12`
56
+ - `90_manifests/dataset_manifests.json` has been rewritten with repo-relative paths and can be read directly by the scripts
57
+ - `data/source-materials/`
58
+ - Contains human text source material, AI-generated text source material, and two sets of prompt assets
59
+ - First-level directories consistently use ASCII names, which makes handling by Git, Hugging Face, and downstream scripts easier
60
+
61
  ## Directory map
62
 
63
  ```text
 
66
  ├── .gitattributes
67
  ├── .gitignore
68
  ├── requirements.txt
69
+ ├── data/
70
  ├── models/
71
  ├── src/
72
  ├── scripts/
 
83
  pip install -r requirements.txt
84
  ```
85
 
86
+ Inspect the bundled dataset inventory:
87
+
88
+ ```bash
89
+ python scripts/train_eval/data_checks/inspect_dataset_distribution.py --dataset_ids DS04,DS06,DS07,DS11,DS12 --smoke
90
+ ```
91
+
92
  Run BERT / RoBERTa inference:
93
 
94
  ```bash
 
102
  python scripts/inference/infer_qwen_adapters.py --dataset DS06_External_core_balanced_v1
103
  ```
104
 
105
+ ## Data documentation entry points
106
+
107
+ - [Project overview](docs/project_overview.md)
108
+ - [Dataset overview](docs/dataset_overview.md)
109
+ - [Model inventory](docs/model_inventory.md)
110
+ - [Script map](docs/script_name_map.md)
111
+ - [Upload instructions](docs/huggingface_upload.md)
112
+
113
  ## Why it is suitable for uploading to Hugging Face
114
 
115
  - Large weights are configured for Git LFS tracking via `.gitattributes`
116
  - The top-level structure has converged on the common HF project layout
117
  - Documentation is concentrated in `docs/`, which keeps the source tree cleaner
118
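+ - The selected data assets are now merged into the repository, so it no longer depends on a self-provided data directory outside the repository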
119
  - Large prediction files are already compressed to `.csv.gz`
120
+ - Top-level directories and core scripts are all named in English / ASCII, which suits Git and the HF Hub better
 
docs/dataset_overview.md ADDED
@@ -0,0 +1,74 @@
1
+ # Dataset Overview
2
+
3
+ This repository now bundles the selected final datasets that were actually kept for handoff, along with the upstream source materials and prompt assets needed to understand where those datasets came from.
4
+
5
+ ## Data Layers
6
+
7
+ The bundled data is organized into two layers:
8
+
9
+ - `data/source-materials/`
10
+ - Human text source pools
11
+ - AI-generated text source pools
12
+ - Prompt assets used to generate the AI text pools
13
+ - `data/dataset/`
14
+ - Five experiment-ready datasets that can be used directly by the packaged scripts
15
+ - One central manifest at `data/dataset/90_manifests/dataset_manifests.json`
16
+
17
+ ## Source Materials
18
+
19
+ `data/source-materials/` keeps the upstream materials in a delivery-friendly layout:
20
+
21
+ - `human-core-pool`
22
+ - The main manually collected human text pool
23
+ - This is the core human-side source used to build the downstream human datasets
24
+ - `human-recovery-pool-v2`
25
+ - High-confidence human texts recovered from the quarantine branch, version 2
26
+ - Used as a supplement to the main human pool
27
+ - `human-recovery-pool-v3`
28
+ - High-confidence human texts recovered from the quarantine branch, version 3
29
+ - Used as an additional supplement to the main human pool
30
+ - `ai-generated-standard`
31
+ - AI-generated texts under the standard generation style
32
+ - Keeps the original topic/subtopic/prompt tree for traceability
33
+ - `ai-generated-natural-v1`
34
+ - AI-generated texts under the more natural writing-style branch
35
+ - Also keeps the original topic/subtopic/prompt tree
36
+ - `prompts-standard`
37
+ - Prompt files corresponding to the standard AI generation branch
38
+ - `prompts-natural-v1`
39
+ - Prompt files corresponding to the natural-style AI generation branch
40
+
41
+ Prompt files remain as `.txt` in the packaged data tree so the repository keeps documentation in `docs/` while the data area stays asset-oriented.
42
+
43
+ ## Experiment-Ready Datasets
44
+
45
+ `data/dataset/` includes five packaged datasets:
46
+
47
+ - `DS04_Human_pools_merged_v1`
48
+ - Pure human text pool
49
+ - Works as the main human-side source dataset
50
+ - All records are kept in `train.jsonl`
51
+ - `DS11_Generated_AI_v1`
52
+ - Standard-style AI text pool
53
+ - Pairs naturally with DS04 when building a standard human-vs-AI setting
54
+ - All records are kept in `train.jsonl`
55
+ - `DS12_Generated_AI_natural_v1`
56
+ - Natural-style AI text pool
57
+ - Used for the harder, more natural writing branch
58
+ - All records are kept in `train.jsonl`
59
+ - `DS06_External_core_balanced_v1`
60
+ - Balanced experiment set built from DS04 and DS11
61
+ - Includes `train/dev/test` and is suitable for direct evaluation and cross-domain experiments
62
+ - `DS07_External_long_v1`
63
+ - Balanced experiment set built from DS04 and DS12
64
+ - Includes `train/dev/test` and emphasizes the natural-style branch
65
+
66
+ Each dataset directory keeps its own `train/dev/test.jsonl`, `manifest.json`, and `check_noise.py`, while the repository-level manifest provides one portable entry point for script loading.
67
+
68
+ ## What Is Intentionally Not Bundled
69
+
70
+ - Raw public benchmark downloads such as NLPCC, HC3, CLTS, or Zhihu RLHF source packages
71
+ - Processed datasets that depend on the excluded public-source branches
72
+ - The deprecated `DS10` branch
73
+
74
+ This repository is meant to be a focused research asset pack, not a mirror of every intermediate or publicly downloadable dataset used during exploration.
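
As a rough illustration of how one of the bundled `*.jsonl` splits could be read and checked, the sketch below loads a file line by line and reports which required fields a record lacks. The field list is copied from `DEFAULT_REQUIRED_FIELDS` in `src/enhanced_replica/data_utils.py`; the helper names `load_jsonl` and `missing_fields` are hypothetical and are not the packaged `load_dataset_splits` / `validate_schema` functions, whose bodies are not shown in this commit.

```python
import json
from pathlib import Path

# Field list copied from src/enhanced_replica/data_utils.py in this commit.
REQUIRED_FIELDS = ["record_id", "text", "label", "source", "split",
                   "length_char", "topic", "model_slug"]


def load_jsonl(path: Path) -> list:
    """Read one JSON object per non-empty line (hypothetical helper)."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                records.append(json.loads(line))
    return records


def missing_fields(record: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required keys that a record lacks, in declaration order."""
    return [name for name in required if name not in record]
```

A call such as `load_jsonl(Path("data/dataset/<dataset-dir>/train.jsonl"))` would return a list of dicts; the `<dataset-dir>` placeholder stands for one of the five dataset directories listed above.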
docs/project_overview.md CHANGED
@@ -1,9 +1,10 @@
1
- # Project Overview
2
 
3
  This repository is the Hugging Face friendly version of the local research handoff package.
4
 
5
- It is organized around six durable parts:
6
 
 
7
  - `models/`: final checkpoints and adapters
8
  - `src/enhanced_replica/`: shared Python modules
9
  - `scripts/`: task-oriented entry points
@@ -15,5 +16,7 @@ Design choices:
15
 
16
  - Top-level folders use English / ASCII names for Git and HF Hub compatibility.
17
  - Markdown is concentrated under `docs/` so source trees stay uncluttered.
 
 
18
  - Prediction CSV files in `reports/` were compressed to `.csv.gz` to reduce repository weight.
19
- - The repository is upload-ready, but it is still an archive-oriented project pack rather than a fully reproducible end-to-end training repo.
 
1
+ # Project Overview
2
 
3
  This repository is the Hugging Face friendly version of the local research handoff package.
4
 
5
+ It is organized around seven durable parts:
6
 
7
+ - `data/`: bundled datasets, source materials, prompts, and a repo-local dataset manifest
8
  - `models/`: final checkpoints and adapters
9
  - `src/enhanced_replica/`: shared Python modules
10
  - `scripts/`: task-oriented entry points
 
16
 
17
  - Top-level folders use English / ASCII names for Git and HF Hub compatibility.
18
  - Markdown is concentrated under `docs/` so source trees stay uncluttered.
19
+ - Only the selected final data assets are bundled. Public benchmark downloads and the full external dataset zoo are intentionally excluded.
20
+ - `data/dataset/90_manifests/dataset_manifests.json` is rewritten with repo-relative paths so the packaged scripts can resolve datasets inside the repository.
21
  - Prediction CSV files in `reports/` were compressed to `.csv.gz` to reduce repository weight.
22
+ - The repository is upload-ready, but it is still an archive-oriented project pack rather than a fully reproducible end-to-end training repo.
scripts/inference/run_cross_domain_batch.py CHANGED
@@ -40,7 +40,7 @@ def main() -> None:
40
  "--dataset",
41
  dataset,
42
  "--fallback-dataset",
43
- "DS01_NLPCC_core_v1",
44
  "--force-fallback",
45
  ])
46
 
 
40
  "--dataset",
41
  dataset,
42
  "--fallback-dataset",
43
+ "DS06_External_core_balanced_v1",
44
  "--force-fallback",
45
  ])
46
 
scripts/inference/run_cross_domain_ensemble.py CHANGED
@@ -249,7 +249,7 @@ def main():
249
  import argparse
250
  parser = argparse.ArgumentParser()
251
  parser.add_argument("--dataset", required=True)
252
- parser.add_argument("--fallback-dataset", default="DS01_NLPCC_core_v1",
253
  help="Fallback dev dataset for LR training")
254
  parser.add_argument("--force-fallback", action="store_true",
255
  help="Force LR training on fallback dataset regardless of local dev size")
 
249
  import argparse
250
  parser = argparse.ArgumentParser()
251
  parser.add_argument("--dataset", required=True)
252
+ parser.add_argument("--fallback-dataset", default="DS06_External_core_balanced_v1",
253
  help="Fallback dev dataset for LR training")
254
  parser.add_argument("--force-fallback", action="store_true",
255
  help="Force LR training on fallback dataset regardless of local dev size")
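
The help strings above describe the intended behavior: train the LR on the fallback dataset's dev split when the local dev split is too small, or always when `--force-fallback` is set. A minimal sketch of that decision follows. The function `choose_lr_dataset` and the `min_dev_samples` threshold are assumptions made for illustration; the actual selection code in `run_cross_domain_ensemble.py` is not part of this hunk.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the three arguments visible in the hunk above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--fallback-dataset", default="DS06_External_core_balanced_v1",
                        help="Fallback dev dataset for LR training")
    parser.add_argument("--force-fallback", action="store_true",
                        help="Force LR training on fallback dataset regardless of local dev size")
    return parser


def choose_lr_dataset(args: argparse.Namespace, local_dev_size: int,
                      min_dev_samples: int = 100) -> str:
    """Pick the dataset whose dev split trains the LR.

    min_dev_samples is a hypothetical threshold, not a value from the repo.
    """
    if args.force_fallback or local_dev_size < min_dev_samples:
        return args.fallback_dataset
    return args.dataset
```

With the new default, a run that passes only `--dataset` falls back to a dataset that ships inside the repository rather than to `DS01_NLPCC_core_v1`, which is not bundled.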
scripts/inference/run_logistic_regression_ensemble.py CHANGED
@@ -259,8 +259,8 @@ def fit_lr_bucket(df_dev, feature_cols, global_scaler, global_clf, global_th):
259
  def main():
260
  import argparse
261
  parser = argparse.ArgumentParser()
262
- parser.add_argument("--dataset", required=True, help="e.g. DS01_NLPCC_core_v1 or DS13_NLPCC_full_test_v1")
263
- parser.add_argument("--fallback-dataset", default="DS01_NLPCC_core_v1",
264
  help="If dev samples are insufficient, use this dataset's dev to train LR")
265
  args = parser.parse_args()
266
 
 
259
  def main():
260
  import argparse
261
  parser = argparse.ArgumentParser()
262
+ parser.add_argument("--dataset", required=True, help="e.g. DS06_External_core_balanced_v1 or DS13_NLPCC_full_test_v1")
263
+ parser.add_argument("--fallback-dataset", default="DS06_External_core_balanced_v1",
264
  help="If dev samples are insufficient, use this dataset's dev to train LR")
265
  args = parser.parse_args()
266
 
scripts/inference/run_zero_shot_detectors.py CHANGED
@@ -12,8 +12,12 @@ from pathlib import Path
12
  from transformers import AutoTokenizer, AutoModelForCausalLM
13
  from modelscope import snapshot_download
14
 
15
- DATASET_ROOT = Path("data/dataset")
16
- OUTPUT_ROOT = Path("outputs/zero_shot")
 
 
 
 
17
  MAX_LENGTH = 512
18
  BATCH_SIZE = 16
19
  DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 
12
  from transformers import AutoTokenizer, AutoModelForCausalLM
13
  from modelscope import snapshot_download
14
 
15
+ REPO_ROOT = Path(__file__).resolve()
16
+ while REPO_ROOT != REPO_ROOT.parent and not (REPO_ROOT / "src").exists():
17
+     REPO_ROOT = REPO_ROOT.parent
18
+
19
+ DATASET_ROOT = REPO_ROOT / "data" / "dataset"
20
+ OUTPUT_ROOT = REPO_ROOT / "outputs" / "zero_shot"
21
  MAX_LENGTH = 512
22
  BATCH_SIZE = 16
23
  DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
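
The added lines locate the repository root by walking up from the script until a directory containing `src` is found. The same idea can be written as a small function, sketched below. The name `find_repo_root` and the `marker` parameter are illustrative and do not exist in the repository.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    If no ancestor contains the marker, the loop stops at the filesystem
    root and returns it, so callers may want to verify the result.
    """
    current = start.resolve()
    while current != current.parent and not (current / marker).exists():
        current = current.parent
    return current
```

Compared with the old `Path("data/dataset")`, which depends on the current working directory, this approach resolves the data and output directories the same way no matter where the script is launched from.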
scripts/train_eval/cross_domain/evaluate_mixed_label_zero_shot.py CHANGED
@@ -86,11 +86,10 @@ def run_e09(args: argparse.Namespace) -> dict:
86
  # 3. Load target dataset split
87
  manifest = load_dataset_manifest(Path(args.manifest_file))
88
  ds_meta = get_ds_meta(manifest, args.dataset_id)
89
- dataset_dir = Path(ds_meta["dataset_dir"])
90
- split_path = dataset_dir / f"{args.split}.jsonl"
91
  if not split_path.exists():
92
  raise FileNotFoundError(f"Split file not found: {split_path}")
93
- df = load_split_df(dataset_dir, args.split)
94
  logger.info(f"Target dataset: {args.dataset_id} | split={args.split} | rows={len(df)}")
95
 
96
  # 4. Load model and run inference
 
86
  # 3. Load target dataset split
87
  manifest = load_dataset_manifest(Path(args.manifest_file))
88
  ds_meta = get_ds_meta(manifest, args.dataset_id)
89
+ split_path = Path(ds_meta[args.split])
 
90
  if not split_path.exists():
91
  raise FileNotFoundError(f"Split file not found: {split_path}")
92
+ df = load_split_df(split_path)
93
  logger.info(f"Target dataset: {args.dataset_id} | split={args.split} | rows={len(df)}")
94
 
95
  # 4. Load model and run inference
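
The new code reads the split path straight from `ds_meta[args.split]`, which implies that `get_ds_meta` returns one path per split. The sketch below shows one way such a lookup could be built from a manifest entry. It is an assumption based on the visible fragments (`dataset_id`, `dataset_dir`, and the `train/dev/test.jsonl` layout); the function name `build_ds_meta`, the `repo_root` parameter, and the sample manifest entry are hypothetical.

```python
from pathlib import Path


def build_ds_meta(manifest: dict, ds_id: str, repo_root: Path) -> dict:
    """Turn one manifest entry into a dict with a path per split."""
    if ds_id not in manifest:
        raise KeyError(f"{ds_id} not found in dataset manifest")
    info = manifest[ds_id]
    # Assumes the manifest stores a repo-relative dataset directory.
    ds_dir = repo_root / info["dataset_dir"]
    meta = {"dataset_id": info["dataset_id"], "dataset_dir": str(ds_dir)}
    for split in ("train", "dev", "test"):
        meta[split] = ds_dir / f"{split}.jsonl"
    return meta


# Hypothetical manifest entry, for illustration only.
example_manifest = {
    "DS06": {
        "dataset_id": "DS06",
        "dataset_dir": "data/dataset/DS06_External_core_balanced_v1",
    }
}
```

Centralizing the path construction in one helper means callers such as `run_e09` no longer assemble `dataset_dir / f"{split}.jsonl"` themselves.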
scripts/train_eval/data_checks/inspect_dataset_distribution.py CHANGED
@@ -25,7 +25,7 @@ for _candidate in (REPO_ROOT, REPO_ROOT / "src"):
25
  sys.path.insert(0, _candidate_str)
26
 
27
  from enhanced_replica.cli_args import add_base_args
28
- from enhanced_replica.data_utils import load_dataset_manifest, load_dataset_splits, SPLITS, validate_schema
29
  from enhanced_replica.io_utils import create_run_context, ensure_dir, write_csv, write_json, write_run_manifest, write_run_report, write_yaml_minimal
30
 
31
 
@@ -61,13 +61,7 @@ def run_e00(args: argparse.Namespace) -> dict:
61
 
62
  for ds_id in ds_ids:
63
  info = manifest[ds_id]
64
- dataset_dir = Path(info["dataset_dir"])
65
- ds_meta = {
66
- "dataset_id": info["dataset_id"],
67
- "train": dataset_dir / "train.jsonl",
68
- "dev": dataset_dir / "dev.jsonl",
69
- "test": dataset_dir / "test.jsonl",
70
- }
71
 
72
  # 1. Load splits (this tests _common.data_utils.load_dataset_splits)
73
  try:
 
25
  sys.path.insert(0, _candidate_str)
26
 
27
  from enhanced_replica.cli_args import add_base_args
28
+ from enhanced_replica.data_utils import get_ds_meta, load_dataset_manifest, load_dataset_splits, SPLITS, validate_schema
29
  from enhanced_replica.io_utils import create_run_context, ensure_dir, write_csv, write_json, write_run_manifest, write_run_report, write_yaml_minimal
30
 
31
 
 
61
 
62
  for ds_id in ds_ids:
63
  info = manifest[ds_id]
64
+ ds_meta = get_ds_meta(manifest, ds_id)
 
 
 
 
 
 
65
 
66
  # 1. Load splits (this tests _common.data_utils.load_dataset_splits)
67
  try:
src/enhanced_replica/data_utils.py CHANGED
@@ -7,7 +7,7 @@ from typing import Dict, List, Sequence
7
 
8
  import pandas as pd
9
 
10
- from .io_utils import read_json
11
 
12
 
13
  DEFAULT_REQUIRED_FIELDS = ["record_id", "text", "label", "source", "split", "length_char", "topic", "model_slug"]
@@ -28,7 +28,7 @@ def load_dataset_manifest(manifest_file: Path | None = None) -> dict:
28
  if manifest_file is None:
29
  from .io_utils import DEFAULT_MANIFEST_FILE
30
  manifest_file = DEFAULT_MANIFEST_FILE
31
- return read_json(manifest_file)
32
 
33
 
34
  def get_ds_meta(manifest: dict, ds_id: str) -> dict:
@@ -36,7 +36,7 @@ def get_ds_meta(manifest: dict, ds_id: str) -> dict:
36
  if ds_id not in manifest:
37
  raise KeyError(f"{ds_id} not found in dataset manifest")
38
  info = manifest[ds_id]
39
- ds_dir = Path(info["dataset_dir"])
40
  out = {
41
  "dataset_id": info["dataset_id"],
42
  "dataset_dir": str(ds_dir),
 
7
 
8
  import pandas as pd
9
 
10
+ from .io_utils import read_json, resolve_repo_path
11
 
12
 
13
  DEFAULT_REQUIRED_FIELDS = ["record_id", "text", "label", "source", "split", "length_char", "topic", "model_slug"]
 
28
  if manifest_file is None:
29
  from .io_utils import DEFAULT_MANIFEST_FILE
30
  manifest_file = DEFAULT_MANIFEST_FILE
31
+ return read_json(resolve_repo_path(manifest_file))
32
 
33
 
34
  def get_ds_meta(manifest: dict, ds_id: str) -> dict:
 
36
  if ds_id not in manifest:
37
  raise KeyError(f"{ds_id} not found in dataset manifest")
38
  info = manifest[ds_id]
39
+ ds_dir = resolve_repo_path(info["dataset_dir"])
40
  out = {
41
  "dataset_id": info["dataset_id"],
42
  "dataset_dir": str(ds_dir),
src/enhanced_replica/io_utils.py CHANGED
@@ -10,11 +10,12 @@ from typing import Any, Dict, Iterable, List
10
  import pandas as pd
11
 
12
 
13
- SCRIPT_ROOT = Path(__file__).resolve().parents[1]
14
- ROUTE_ROOT = SCRIPT_ROOT.parents[1]
15
- DATASET_ROOT = ROUTE_ROOT / "data" / "dataset"
 
16
  DEFAULT_MANIFEST_FILE = DATASET_ROOT / "90_manifests" / "dataset_manifests.json"
17
- DEFAULT_OUTPUT_ROOT = ROUTE_ROOT / "outputs"
18
 
19
 
20
  def now_ts() -> str:
@@ -30,6 +31,13 @@ def ensure_dir(path: Path) -> Path:
30
  return path
31
 
32
 
 
 
 
 
 
 
 
33
  def read_json(path: Path) -> Any:
34
  return json.loads(path.read_text(encoding="utf-8"))
35
 
 
10
  import pandas as pd
11
 
12
 
13
+ PACKAGE_ROOT = Path(__file__).resolve().parent
14
+ SRC_ROOT = PACKAGE_ROOT.parent
15
+ REPO_ROOT = SRC_ROOT.parent
16
+ DATASET_ROOT = REPO_ROOT / "data" / "dataset"
17
  DEFAULT_MANIFEST_FILE = DATASET_ROOT / "90_manifests" / "dataset_manifests.json"
18
+ DEFAULT_OUTPUT_ROOT = REPO_ROOT / "outputs"
19
 
20
 
21
  def now_ts() -> str:
 
31
  return path
32
 
33
 
34
+ def resolve_repo_path(path: str | Path) -> Path:
35
+     resolved = Path(path)
36
+     if resolved.is_absolute():
37
+         return resolved
38
+     return REPO_ROOT / resolved
39
+
40
+
41
  def read_json(path: Path) -> Any:
42
  return json.loads(path.read_text(encoding="utf-8"))
43
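
To make the effect of `resolve_repo_path` concrete, the sketch below reproduces its logic with the repository root passed in as a parameter, so it can run outside the package. The `repo_root` argument is an assumption added for illustration; the real function uses the module-level `REPO_ROOT`.

```python
from pathlib import Path
from typing import Union


def resolve_repo_path(path: Union[str, Path], repo_root: Path) -> Path:
    """Return absolute paths unchanged; anchor relative ones at repo_root."""
    resolved = Path(path)
    if resolved.is_absolute():
        return resolved
    return repo_root / resolved
```

A repo-relative manifest entry such as `data/dataset/90_manifests/dataset_manifests.json` is therefore joined onto the checkout location, while an absolute path supplied by the user through a command-line flag passes through untouched.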