Text Classification
Transformers
Safetensors
Chinese
chinese
ai-text-detection
ensemble
bert
roberta
qwen
lora
research
dataset
Add files using upload-large-folder tool
- .gitattributes +0 -9
- .gitignore +2 -1
- README.md +34 -12
- docs/dataset_overview.md +74 -0
- docs/project_overview.md +6 -3
- scripts/inference/run_cross_domain_batch.py +1 -1
- scripts/inference/run_cross_domain_ensemble.py +1 -1
- scripts/inference/run_logistic_regression_ensemble.py +2 -2
- scripts/inference/run_zero_shot_detectors.py +6 -2
- scripts/train_eval/cross_domain/evaluate_mixed_label_zero_shot.py +2 -3
- scripts/train_eval/data_checks/inspect_dataset_distribution.py +2 -8
- src/enhanced_replica/data_utils.py +3 -3
- src/enhanced_replica/io_utils.py +12 -4
.gitattributes
CHANGED
|
@@ -2,12 +2,3 @@
|
|
| 2 |
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
| 3 |
*.pkl filter=lfs diff=lfs merge=lfs -text
|
| 4 |
*.gz -text
|
| 5 |
-
reports/bert-baseline/dev_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
|
| 6 |
-
reports/roberta-baseline/dev_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
|
| 7 |
-
reports/bert-baseline/test_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
|
| 8 |
-
reports/roberta-baseline/test_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
|
| 9 |
-
reports/bert-baseline/train_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
|
| 10 |
-
reports/roberta-baseline/train_pred.csv.gz filter=lfs diff=lfs merge=lfs -text
|
| 11 |
-
models/qwen-adapters/shared-tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
|
| 12 |
-
models/bert-final/classifier_full_model.bin filter=lfs diff=lfs merge=lfs -text
|
| 13 |
-
models/roberta-final/classifier_full_model.bin filter=lfs diff=lfs merge=lfs -text
|
|
|
|
| 2 |
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
| 3 |
*.pkl filter=lfs diff=lfs merge=lfs -text
|
| 4 |
*.gz -text
|
.gitignore
CHANGED
|
@@ -5,4 +5,5 @@
|
|
| 5 |
.pytest_cache/
|
| 6 |
.mypy_cache/
|
| 7 |
.ipynb_checkpoints/
|
| 8 |
-
outputs/
|
|
|
|
|
|
| 5 |
.pytest_cache/
|
| 6 |
.mypy_cache/
|
| 7 |
.ipynb_checkpoints/
|
| 8 |
+
outputs/
|
| 9 |
+
.cache/
|
README.md
CHANGED
|
@@ -9,26 +9,30 @@ tags:
|
|
| 9 |
- qwen
|
| 10 |
- lora
|
| 11 |
- research
|
|
|
|
|
|
|
| 12 |
library_name: transformers
|
| 13 |
---
|
| 14 |
|
| 15 |
# EnhancedReplica Research Asset Pack
|
| 16 |
|
| 17 |
-
This is a research handoff package already organized in Hugging Face repository style. It no longer
|
| 18 |
|
| 19 |
## What this repository contains
|
| 20 |
|
| 21 |
- `models/`: the final retained BERT and RoBERTa weights, plus 3 Qwen LoRA adapters
|
| 22 |
- `src/`: shared Python modules reused by the training, evaluation, and inference scripts
|
| 23 |
-
- `scripts/`: entry-point scripts split by task, including
|
| 24 |
- `configs/`: configuration snapshots for training, inference, and ensembling
|
| 25 |
- `reports/`: experiment outputs, metrics, manifests, logs, and compressed prediction files
|
| 26 |
-
- `
|
|
|
|
| 27 |
|
| 28 |
## What this repository does not contain
|
| 29 |
|
| 30 |
- No Qwen base model weights
|
| 31 |
-
- Does not include
|
|
|
|
| 32 |
- No guarantee that the full training pipeline can be rerun with one command from the current directory
|
| 33 |
- Markdown docs are no longer mixed in with source scripts
|
| 34 |
|
|
@@ -45,6 +49,15 @@ library_name: transformers
|
|
| 45 |
- `shared-tokenizer/` keeps one shared copy of the tokenizer assets
|
| 46 |
- At inference time, `QWEN_BASE_MODEL_PATH` must point to a separately downloaded base model
|
| 47 |
|
| 48 |
## Directory map
|
| 49 |
|
| 50 |
```text
|
|
@@ -53,6 +66,7 @@ library_name: transformers
|
|
| 53 |
├── .gitattributes
|
| 54 |
├── .gitignore
|
| 55 |
├── requirements.txt
|
|
|
|
| 56 |
├── models/
|
| 57 |
├── src/
|
| 58 |
├── scripts/
|
|
@@ -69,6 +83,12 @@ library_name: transformers
|
|
| 69 |
pip install -r requirements.txt
|
| 70 |
```
|
| 71 |
|
| 72 |
Run BERT / RoBERTa inference:
|
| 73 |
|
| 74 |
```bash
|
|
@@ -82,17 +102,19 @@ export QWEN_BASE_MODEL_PATH=/path/to/Qwen2.5-7B-Instruct
|
|
| 82 |
python scripts/inference/infer_qwen_adapters.py --dataset DS06_External_core_balanced_v1
|
| 83 |
```
|
| 84 |
|
| 85 |
## Why it is suitable for uploading to Hugging Face
|
| 86 |
|
| 87 |
- Large weights are tracked with Git LFS via `.gitattributes`
|
| 88 |
- The top-level structure follows the common HF project layout
|
| 89 |
- Documentation is concentrated in `docs/`, keeping the source tree cleaner
|
|
|
|
| 90 |
- Large prediction files are compressed to `.csv.gz`
|
| 91 |
-
- Top-level directories and core scripts all use English / ASCII names, which suits Git and the HF Hub better
|
| 92 |
-
|
| 93 |
-
## Suggested docs to read first
|
| 94 |
-
|
| 95 |
-
- `docs/project_overview.md`
|
| 96 |
-
- `docs/model_inventory.md`
|
| 97 |
-
- `docs/script_name_map.md`
|
| 98 |
-
- `docs/huggingface_upload.md`
|
|
|
|
| 9 |
- qwen
|
| 10 |
- lora
|
| 11 |
- research
|
| 12 |
+
- dataset
|
| 13 |
+
- ai-text-detection
|
| 14 |
library_name: transformers
|
| 15 |
---
|
| 16 |
|
| 17 |
# EnhancedReplica Research Asset Pack
|
| 18 |
|
| 19 |
+
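This is a research handoff package already organized in Hugging Face repository style. It no longer keeps the messy hierarchy of the local experiment directories; instead, it consolidates the assets that actually need to be handed off, uploaded, and preserved long-term into one clear project.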
|
| 20 |
|
| 21 |
## What this repository contains
|
| 22 |
|
| 23 |
- `models/`: the final retained BERT and RoBERTa weights, plus 3 Qwen LoRA adapters
|
| 24 |
- `src/`: shared Python modules reused by the training, evaluation, and inference scripts
|
| 25 |
+
- `scripts/`: entry-point scripts split by task, including training/evaluation, inference, and batch workflows
|
| 26 |
- `configs/`: configuration snapshots for training, inference, and ensembling
|
| 27 |
- `reports/`: experiment outputs, metrics, manifests, logs, and compressed prediction files
|
| 28 |
+
- `data/`: data assets bundled in the repository, split into experiment-ready datasets and source materials/prompts
|
| 29 |
+
- `docs/`: project overview, model inventory, script name map, data notes, and upload instructions
|
| 30 |
|
| 31 |
## What this repository does not contain
|
| 32 |
|
| 33 |
- No Qwen base model weights
|
| 34 |
+
- Raw download copies of the public benchmark data are not included
|
| 35 |
+
- The complete set of external benchmark data is not included
|
| 36 |
- No guarantee that the full training pipeline can be rerun with one command from the current directory
|
| 37 |
- Markdown docs are no longer mixed in with source scripts
|
| 38 |
|
|
|
|
| 49 |
- `shared-tokenizer/` keeps one shared copy of the tokenizer assets
|
| 50 |
- At inference time, `QWEN_BASE_MODEL_PATH` must point to a separately downloaded base model
|
| 51 |
|
| 52 |
+
## Data notes
|
| 53 |
+
|
| 54 |
+
- `data/dataset/`
|
| 55 |
+
- Contains 5 datasets usable directly within the repository: `DS04`, `DS06`, `DS07`, `DS11`, `DS12`
|
| 56 |
+
- `90_manifests/dataset_manifests.json` has been rewritten to use repo-relative paths and can be read directly by the scripts
|
| 57 |
+
- `data/source-materials/`
|
| 58 |
+
- Contains human-written source text, AI-generated source text, and two sets of prompt assets
|
| 59 |
+
- First-level directories consistently use ASCII names, which is easier for Git, Hugging Face, and downstream scripts to handle
|
| 60 |
+
|
| 61 |
## Directory map
|
| 62 |
|
| 63 |
```text
|
|
|
|
| 66 |
├── .gitattributes
|
| 67 |
├── .gitignore
|
| 68 |
├── requirements.txt
|
| 69 |
+
├── data/
|
| 70 |
├── models/
|
| 71 |
├── src/
|
| 72 |
├── scripts/
|
|
|
|
| 83 |
pip install -r requirements.txt
|
| 84 |
```
|
| 85 |
|
| 86 |
+
Inspect the datasets bundled in the repository:
|
| 87 |
+
|
| 88 |
+
```bash
|
| 89 |
+
python scripts/train_eval/data_checks/inspect_dataset_distribution.py --dataset_ids DS04,DS06,DS07,DS11,DS12 --smoke
|
| 90 |
+
```
|
| 91 |
+
|
| 92 |
Run BERT / RoBERTa inference:
|
| 93 |
|
| 94 |
```bash
|
|
|
|
| 102 |
python scripts/inference/infer_qwen_adapters.py --dataset DS06_External_core_balanced_v1
|
| 103 |
```
|
| 104 |
|
| 105 |
+
## Documentation entry points
|
| 106 |
+
|
| 107 |
+
- [Project overview](docs/project_overview.md)
|
| 108 |
+
- [Dataset overview](docs/dataset_overview.md)
|
| 109 |
+
- [Model inventory](docs/model_inventory.md)
|
| 110 |
+
- [Script name map](docs/script_name_map.md)
|
| 111 |
+
- [Upload instructions](docs/huggingface_upload.md)
|
| 112 |
+
|
| 113 |
## Why it is suitable for uploading to Hugging Face
|
| 114 |
|
| 115 |
- Large weights are tracked with Git LFS via `.gitattributes`
|
| 116 |
- The top-level structure follows the common HF project layout
|
| 117 |
- Documentation is concentrated in `docs/`, keeping the source tree cleaner
|
| 118 |
+
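- The selected data assets are now merged into the repository, so it no longer depends on a user-supplied data directory outside the repository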
|
| 119 |
- Large prediction files are compressed to `.csv.gz`
|
| 120 |
+
- Top-level directories and core scripts all use English / ASCII names, which suits Git and the HF Hub better
|
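The README states that inference requires `QWEN_BASE_MODEL_PATH` to point at a separately downloaded base model. A minimal sketch of how a script could validate that variable before loading any adapter might look like the following. The helper name `require_qwen_base_path` and its error messages are assumptions for illustration and are not taken from this repository.

```python
import os
from pathlib import Path


def require_qwen_base_path(env=None) -> Path:
    """Return the directory named by QWEN_BASE_MODEL_PATH, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    (Hypothetical helper, not part of the packaged scripts.)
    """
    env = os.environ if env is None else env
    raw = env.get("QWEN_BASE_MODEL_PATH", "").strip()
    if not raw:
        raise RuntimeError(
            "QWEN_BASE_MODEL_PATH is not set; point it at a local copy of the Qwen base model."
        )
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"QWEN_BASE_MODEL_PATH is not a directory: {path}")
    return path
```

Failing early with an explicit message is usually friendlier than letting a model loader fail later on a missing path.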
docs/dataset_overview.md
ADDED
|
@@ -0,0 +1,74 @@
|
| 1 |
+
# Dataset Overview
|
| 2 |
+
|
| 3 |
+
This repository now bundles the selected final datasets that were actually kept for handoff, along with the upstream source materials and prompt assets needed to understand where those datasets came from.
|
| 4 |
+
|
| 5 |
+
## Data Layers
|
| 6 |
+
|
| 7 |
+
The bundled data is organized into two layers:
|
| 8 |
+
|
| 9 |
+
- `data/source-materials/`
|
| 10 |
+
- Human text source pools
|
| 11 |
+
- AI-generated text source pools
|
| 12 |
+
- Prompt assets used to generate the AI text pools
|
| 13 |
+
- `data/dataset/`
|
| 14 |
+
- Five experiment-ready datasets that can be used directly by the packaged scripts
|
| 15 |
+
- One central manifest at `data/dataset/90_manifests/dataset_manifests.json`
|
| 16 |
+
|
| 17 |
+
## Source Materials
|
| 18 |
+
|
| 19 |
+
`data/source-materials/` keeps the upstream materials in a delivery-friendly layout:
|
| 20 |
+
|
| 21 |
+
- `human-core-pool`
|
| 22 |
+
- The main manually collected human text pool
|
| 23 |
+
- This is the core human-side source used to build the downstream human datasets
|
| 24 |
+
- `human-recovery-pool-v2`
|
| 25 |
+
- High-confidence human texts recovered from the quarantine branch, version 2
|
| 26 |
+
- Used as a supplement to the main human pool
|
| 27 |
+
- `human-recovery-pool-v3`
|
| 28 |
+
- High-confidence human texts recovered from the quarantine branch, version 3
|
| 29 |
+
- Used as an additional supplement to the main human pool
|
| 30 |
+
- `ai-generated-standard`
|
| 31 |
+
- AI-generated texts under the standard generation style
|
| 32 |
+
- Keeps the original topic/subtopic/prompt tree for traceability
|
| 33 |
+
- `ai-generated-natural-v1`
|
| 34 |
+
- AI-generated texts under the more natural writing-style branch
|
| 35 |
+
- Also keeps the original topic/subtopic/prompt tree
|
| 36 |
+
- `prompts-standard`
|
| 37 |
+
- Prompt files corresponding to the standard AI generation branch
|
| 38 |
+
- `prompts-natural-v1`
|
| 39 |
+
- Prompt files corresponding to the natural-style AI generation branch
|
| 40 |
+
|
| 41 |
+
Prompt files remain as `.txt` in the packaged data tree so the repository keeps documentation in `docs/` while the data area stays asset-oriented.
|
| 42 |
+
|
| 43 |
+
## Experiment-Ready Datasets
|
| 44 |
+
|
| 45 |
+
`data/dataset/` includes five packaged datasets:
|
| 46 |
+
|
| 47 |
+
- `DS04_Human_pools_merged_v1`
|
| 48 |
+
- Pure human text pool
|
| 49 |
+
- Works as the main human-side source dataset
|
| 50 |
+
- All records are kept in `train.jsonl`
|
| 51 |
+
- `DS11_Generated_AI_v1`
|
| 52 |
+
- Standard-style AI text pool
|
| 53 |
+
- Pairs naturally with DS04 when building a standard human-vs-AI setting
|
| 54 |
+
- All records are kept in `train.jsonl`
|
| 55 |
+
- `DS12_Generated_AI_natural_v1`
|
| 56 |
+
- Natural-style AI text pool
|
| 57 |
+
- Used for the harder, more natural writing branch
|
| 58 |
+
- All records are kept in `train.jsonl`
|
| 59 |
+
- `DS06_External_core_balanced_v1`
|
| 60 |
+
- Balanced experiment set built from DS04 and DS11
|
| 61 |
+
- Includes `train/dev/test` and is suitable for direct evaluation and cross-domain experiments
|
| 62 |
+
- `DS07_External_long_v1`
|
| 63 |
+
- Balanced experiment set built from DS04 and DS12
|
| 64 |
+
- Includes `train/dev/test` and emphasizes the natural-style branch
|
| 65 |
+
|
| 66 |
+
Each dataset directory keeps its own `train/dev/test.jsonl`, `manifest.json`, and `check_noise.py`, while the repository-level manifest provides one portable entry point for script loading.
|
| 67 |
+
|
| 68 |
+
## What Is Intentionally Not Bundled
|
| 69 |
+
|
| 70 |
+
- Raw public benchmark downloads such as NLPCC, HC3, CLTS, or Zhihu RLHF source packages
|
| 71 |
+
- Processed datasets that depend on the excluded public-source branches
|
| 72 |
+
- The deprecated `DS10` branch
|
| 73 |
+
|
| 74 |
+
This repository is meant to be a focused research asset pack, not a mirror of every intermediate or publicly downloadable dataset used during exploration.
|
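Since each packaged dataset is described as a set of `.jsonl` split files, a quick sanity check is to count labels in one split. The sketch below uses only the standard library and assumes one JSON object per line with a `label` field (the field name appears in `DEFAULT_REQUIRED_FIELDS` in `data_utils.py`); the function name `count_labels` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_labels(jsonl_path: Path) -> Counter:
    """Count how often each `label` value occurs in a JSON Lines file."""
    counts = Counter()
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            counts[record["label"]] += 1
    return counts
```

For a balanced set such as `DS06_External_core_balanced_v1`, one would expect the per-label counts of a split to be close to equal, though the exact label values are not documented here.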
docs/project_overview.md
CHANGED
|
@@ -1,9 +1,10 @@
|
|
| 1 |
-
# Project Overview
|
| 2 |
|
| 3 |
This repository is the Hugging Face friendly version of the local research handoff package.
|
| 4 |
|
| 5 |
-
It is organized around
|
| 6 |
|
|
|
|
| 7 |
- `models/`: final checkpoints and adapters
|
| 8 |
- `src/enhanced_replica/`: shared Python modules
|
| 9 |
- `scripts/`: task-oriented entry points
|
|
@@ -15,5 +16,7 @@ Design choices:
|
|
| 15 |
|
| 16 |
- Top-level folders use English / ASCII names for Git and HF Hub compatibility.
|
| 17 |
- Markdown is concentrated under `docs/` so source trees stay uncluttered.
|
|
|
|
|
|
|
| 18 |
- Prediction CSV files in `reports/` were compressed to `.csv.gz` to reduce repository weight.
|
| 19 |
-
- The repository is upload-ready, but it is still an archive-oriented project pack rather than a fully reproducible end-to-end training repo.
|
|
|
|
| 1 |
+
# Project Overview
|
| 2 |
|
| 3 |
This repository is the Hugging Face friendly version of the local research handoff package.
|
| 4 |
|
| 5 |
+
It is organized around seven durable parts:
|
| 6 |
|
| 7 |
+
- `data/`: bundled datasets, source materials, prompts, and a repo-local dataset manifest
|
| 8 |
- `models/`: final checkpoints and adapters
|
| 9 |
- `src/enhanced_replica/`: shared Python modules
|
| 10 |
- `scripts/`: task-oriented entry points
|
|
|
|
| 16 |
|
| 17 |
- Top-level folders use English / ASCII names for Git and HF Hub compatibility.
|
| 18 |
- Markdown is concentrated under `docs/` so source trees stay uncluttered.
|
| 19 |
+
- Only the selected final data assets are bundled. Public benchmark downloads and the full external dataset zoo are intentionally excluded.
|
| 20 |
+
- `data/dataset/90_manifests/dataset_manifests.json` is rewritten with repo-relative paths so the packaged scripts can resolve datasets inside the repository.
|
| 21 |
- Prediction CSV files in `reports/` were compressed to `.csv.gz` to reduce repository weight.
|
| 22 |
+
- The repository is upload-ready, but it is still an archive-oriented project pack rather than a fully reproducible end-to-end training repo.
|
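The overview notes that prediction CSV files under `reports/` are stored as `.csv.gz`. They can be read without decompressing to disk first. The following is a minimal standard-library sketch; the function name `read_predictions` is hypothetical and the column names of the real prediction files are not documented here.

```python
import csv
import gzip
from pathlib import Path


def read_predictions(path: Path) -> list[dict]:
    """Read a gzip-compressed CSV file into a list of row dictionaries."""
    # Text mode ("rt") lets csv.DictReader consume decoded lines directly.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))
```

A call such as `read_predictions(Path("reports/bert-baseline/dev_pred.csv.gz"))` would return one dictionary per prediction row, with all values as strings.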
scripts/inference/run_cross_domain_batch.py
CHANGED
|
@@ -40,7 +40,7 @@ def main() -> None:
|
|
| 40 |
"--dataset",
|
| 41 |
dataset,
|
| 42 |
"--fallback-dataset",
|
| 43 |
-
"
|
| 44 |
"--force-fallback",
|
| 45 |
])
|
| 46 |
|
|
|
|
| 40 |
"--dataset",
|
| 41 |
dataset,
|
| 42 |
"--fallback-dataset",
|
| 43 |
+
"DS06_External_core_balanced_v1",
|
| 44 |
"--force-fallback",
|
| 45 |
])
|
| 46 |
|
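The hunk above passes a fixed fallback dataset and `--force-fallback` to a child command. As a rough sketch of that pattern, the argument list could be built as follows. The target script path and the use of `sys.executable` are assumptions, since the surrounding call is not visible in the diff; only the flag names and the dataset ID come from the change itself.

```python
import sys

FALLBACK_DATASET = "DS06_External_core_balanced_v1"


def build_ensemble_command(
    dataset: str,
    script: str = "scripts/inference/run_cross_domain_ensemble.py",  # assumed target
) -> list[str]:
    """Build the argument list for one cross-domain ensemble run."""
    return [
        sys.executable,
        script,
        "--dataset",
        dataset,
        "--fallback-dataset",
        FALLBACK_DATASET,
        "--force-fallback",
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` so that a failing run stops the batch.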
scripts/inference/run_cross_domain_ensemble.py
CHANGED
|
@@ -249,7 +249,7 @@ def main():
|
|
| 249 |
import argparse
|
| 250 |
parser = argparse.ArgumentParser()
|
| 251 |
parser.add_argument("--dataset", required=True)
|
| 252 |
-
parser.add_argument("--fallback-dataset", default="
|
| 253 |
help="Fallback dev dataset for LR training")
|
| 254 |
parser.add_argument("--force-fallback", action="store_true",
|
| 255 |
help="Force LR training on fallback dataset regardless of local dev size")
|
|
|
|
| 249 |
import argparse
|
| 250 |
parser = argparse.ArgumentParser()
|
| 251 |
parser.add_argument("--dataset", required=True)
|
| 252 |
+
parser.add_argument("--fallback-dataset", default="DS06_External_core_balanced_v1",
|
| 253 |
help="Fallback dev dataset for LR training")
|
| 254 |
parser.add_argument("--force-fallback", action="store_true",
|
| 255 |
help="Force LR training on fallback dataset regardless of local dev size")
|
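To see how the new default behaves, the three arguments from this hunk can be reproduced in isolation. This is a self-contained sketch of the parser only; the rest of `main()` is not shown in the diff and is omitted here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the three CLI arguments visible in the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    parser.add_argument(
        "--fallback-dataset",
        default="DS06_External_core_balanced_v1",
        help="Fallback dev dataset for LR training",
    )
    parser.add_argument(
        "--force-fallback",
        action="store_true",
        help="Force LR training on fallback dataset regardless of local dev size",
    )
    return parser


# argparse exposes "--fallback-dataset" as the attribute `fallback_dataset`.
args = build_parser().parse_args(["--dataset", "DS07_External_long_v1"])
```

With only `--dataset` given, `args.fallback_dataset` takes the bundled DS06 default and `args.force_fallback` is `False`.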
scripts/inference/run_logistic_regression_ensemble.py
CHANGED
|
@@ -259,8 +259,8 @@ def fit_lr_bucket(df_dev, feature_cols, global_scaler, global_clf, global_th):
|
|
| 259 |
def main():
|
| 260 |
import argparse
|
| 261 |
parser = argparse.ArgumentParser()
|
| 262 |
-
parser.add_argument("--dataset", required=True, help="e.g.
|
| 263 |
-
parser.add_argument("--fallback-dataset", default="
|
| 264 |
help="If dev samples are insufficient, use this dataset's dev to train LR")
|
| 265 |
args = parser.parse_args()
|
| 266 |
|
|
|
|
| 259 |
def main():
|
| 260 |
import argparse
|
| 261 |
parser = argparse.ArgumentParser()
|
| 262 |
+
parser.add_argument("--dataset", required=True, help="e.g. DS06_External_core_balanced_v1 or DS13_NLPCC_full_test_v1")
|
| 263 |
+
parser.add_argument("--fallback-dataset", default="DS06_External_core_balanced_v1",
|
| 264 |
help="If dev samples are insufficient, use this dataset's dev to train LR")
|
| 265 |
args = parser.parse_args()
|
| 266 |
|
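The help text says the fallback dataset's dev split is used when the target's own dev samples are insufficient. One way such a rule could be expressed is sketched below. The function name and the threshold `min_dev_rows` are hypothetical; the actual cutoff used by the script is not visible in this diff.

```python
def choose_lr_training_dataset(
    dataset: str,
    fallback_dataset: str,
    n_local_dev: int,
    min_dev_rows: int = 50,  # hypothetical threshold, not taken from the script
) -> str:
    """Return the dataset whose dev split should be used to fit the LR ensemble."""
    if n_local_dev < min_dev_rows:
        return fallback_dataset
    return dataset
```

Keeping the decision in one small function makes it easy to log which dev split was actually used for a given run.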
scripts/inference/run_zero_shot_detectors.py
CHANGED
|
@@ -12,8 +12,12 @@ from pathlib import Path
|
|
| 12 |
from transformers import AutoTokenizer, AutoModelForCausalLM
|
| 13 |
from modelscope import snapshot_download
|
| 14 |
|
| 15 |
-
|
| 16 |
-
|
| 17 |
MAX_LENGTH = 512
|
| 18 |
BATCH_SIZE = 16
|
| 19 |
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
|
|
|
|
| 12 |
from transformers import AutoTokenizer, AutoModelForCausalLM
|
| 13 |
from modelscope import snapshot_download
|
| 14 |
|
| 15 |
+
REPO_ROOT = Path(__file__).resolve()
|
| 16 |
+
while REPO_ROOT != REPO_ROOT.parent and not (REPO_ROOT / "src").exists():
|
| 17 |
+
    REPO_ROOT = REPO_ROOT.parent
|
| 18 |
+
|
| 19 |
+
DATASET_ROOT = REPO_ROOT / "data" / "dataset"
|
| 20 |
+
OUTPUT_ROOT = REPO_ROOT / "outputs" / "zero_shot"
|
| 21 |
MAX_LENGTH = 512
|
| 22 |
BATCH_SIZE = 16
|
| 23 |
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
|
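The added lines locate the repository root by walking upward from the script until a directory containing `src` is found. The same idea is sketched here as a reusable function; the name `find_repo_root` and the `marker` parameter are illustrative and not part of the change.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    If no ancestor contains the marker, the loop stops at the filesystem root
    and returns it, so callers may want to verify the result.
    """
    current = start.resolve()
    while current != current.parent and not (current / marker).exists():
        current = current.parent
    return current
```

Note the fallthrough case: when the marker is never found, the returned path is the filesystem root, so `DATASET_ROOT` would silently point at a location such as `/data/dataset`.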
scripts/train_eval/cross_domain/evaluate_mixed_label_zero_shot.py
CHANGED
|
@@ -86,11 +86,10 @@ def run_e09(args: argparse.Namespace) -> dict:
|
|
| 86 |
# 3. Load target dataset split
|
| 87 |
manifest = load_dataset_manifest(Path(args.manifest_file))
|
| 88 |
ds_meta = get_ds_meta(manifest, args.dataset_id)
|
| 89 |
-
|
| 90 |
-
split_path = dataset_dir / f"{args.split}.jsonl"
|
| 91 |
if not split_path.exists():
|
| 92 |
raise FileNotFoundError(f"Split file not found: {split_path}")
|
| 93 |
-
df = load_split_df(
|
| 94 |
logger.info(f"Target dataset: {args.dataset_id} | split={args.split} | rows={len(df)}")
|
| 95 |
|
| 96 |
# 4. Load model and run inference
|
|
|
|
| 86 |
# 3. Load target dataset split
|
| 87 |
manifest = load_dataset_manifest(Path(args.manifest_file))
|
| 88 |
ds_meta = get_ds_meta(manifest, args.dataset_id)
|
| 89 |
+
split_path = Path(ds_meta[args.split])
|
|
|
|
| 90 |
if not split_path.exists():
|
| 91 |
raise FileNotFoundError(f"Split file not found: {split_path}")
|
| 92 |
+
df = load_split_df(split_path)
|
| 93 |
logger.info(f"Target dataset: {args.dataset_id} | split={args.split} | rows={len(df)}")
|
| 94 |
|
| 95 |
# 4. Load model and run inference
|
scripts/train_eval/data_checks/inspect_dataset_distribution.py
CHANGED
|
@@ -25,7 +25,7 @@ for _candidate in (REPO_ROOT, REPO_ROOT / "src"):
|
|
| 25 |
sys.path.insert(0, _candidate_str)
|
| 26 |
|
| 27 |
from enhanced_replica.cli_args import add_base_args
|
| 28 |
-
from enhanced_replica.data_utils import load_dataset_manifest, load_dataset_splits, SPLITS, validate_schema
|
| 29 |
from enhanced_replica.io_utils import create_run_context, ensure_dir, write_csv, write_json, write_run_manifest, write_run_report, write_yaml_minimal
|
| 30 |
|
| 31 |
|
|
@@ -61,13 +61,7 @@ def run_e00(args: argparse.Namespace) -> dict:
|
|
| 61 |
|
| 62 |
for ds_id in ds_ids:
|
| 63 |
info = manifest[ds_id]
|
| 64 |
-
|
| 65 |
-
ds_meta = {
|
| 66 |
-
"dataset_id": info["dataset_id"],
|
| 67 |
-
"train": dataset_dir / "train.jsonl",
|
| 68 |
-
"dev": dataset_dir / "dev.jsonl",
|
| 69 |
-
"test": dataset_dir / "test.jsonl",
|
| 70 |
-
}
|
| 71 |
|
| 72 |
# 1. Load splits (this tests _common.data_utils.load_dataset_splits)
|
| 73 |
try:
|
|
|
|
| 25 |
sys.path.insert(0, _candidate_str)
|
| 26 |
|
| 27 |
from enhanced_replica.cli_args import add_base_args
|
| 28 |
+
from enhanced_replica.data_utils import get_ds_meta, load_dataset_manifest, load_dataset_splits, SPLITS, validate_schema
|
| 29 |
from enhanced_replica.io_utils import create_run_context, ensure_dir, write_csv, write_json, write_run_manifest, write_run_report, write_yaml_minimal
|
| 30 |
|
| 31 |
|
|
|
|
| 61 |
|
| 62 |
for ds_id in ds_ids:
|
| 63 |
info = manifest[ds_id]
|
| 64 |
+
ds_meta = get_ds_meta(manifest, ds_id)
|
| 65 |
|
| 66 |
# 1. Load splits (this tests _common.data_utils.load_dataset_splits)
|
| 67 |
try:
|
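This script imports `validate_schema`, whose body is not part of the diff. As a hedged illustration of what a schema check against `DEFAULT_REQUIRED_FIELDS` could involve, the sketch below reports which required fields a record lacks. The helper `missing_fields` is hypothetical and is not the repository's `validate_schema`.

```python
# Field list as shown in src/enhanced_replica/data_utils.py.
DEFAULT_REQUIRED_FIELDS = [
    "record_id", "text", "label", "source",
    "split", "length_char", "topic", "model_slug",
]


def missing_fields(record: dict, required=DEFAULT_REQUIRED_FIELDS) -> list[str]:
    """Return the required field names that are absent from `record`."""
    return [name for name in required if name not in record]
```

An empty list means the record carries every required key; this says nothing about whether the values themselves are well-formed.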
src/enhanced_replica/data_utils.py
CHANGED
|
@@ -7,7 +7,7 @@ from typing import Dict, List, Sequence
|
|
| 7 |
|
| 8 |
import pandas as pd
|
| 9 |
|
| 10 |
-
from .io_utils import read_json
|
| 11 |
|
| 12 |
|
| 13 |
DEFAULT_REQUIRED_FIELDS = ["record_id", "text", "label", "source", "split", "length_char", "topic", "model_slug"]
|
|
@@ -28,7 +28,7 @@ def load_dataset_manifest(manifest_file: Path | None = None) -> dict:
|
|
| 28 |
if manifest_file is None:
|
| 29 |
from .io_utils import DEFAULT_MANIFEST_FILE
|
| 30 |
manifest_file = DEFAULT_MANIFEST_FILE
|
| 31 |
-
return read_json(manifest_file)
|
| 32 |
|
| 33 |
|
| 34 |
def get_ds_meta(manifest: dict, ds_id: str) -> dict:
|
|
@@ -36,7 +36,7 @@ def get_ds_meta(manifest: dict, ds_id: str) -> dict:
|
|
| 36 |
if ds_id not in manifest:
|
| 37 |
raise KeyError(f"{ds_id} not found in dataset manifest")
|
| 38 |
info = manifest[ds_id]
|
| 39 |
-
ds_dir =
|
| 40 |
out = {
|
| 41 |
"dataset_id": info["dataset_id"],
|
| 42 |
"dataset_dir": str(ds_dir),
|
|
|
|
| 7 |
|
| 8 |
import pandas as pd
|
| 9 |
|
| 10 |
+
from .io_utils import read_json, resolve_repo_path
|
| 11 |
|
| 12 |
|
| 13 |
DEFAULT_REQUIRED_FIELDS = ["record_id", "text", "label", "source", "split", "length_char", "topic", "model_slug"]
|
|
|
|
| 28 |
if manifest_file is None:
|
| 29 |
from .io_utils import DEFAULT_MANIFEST_FILE
|
| 30 |
manifest_file = DEFAULT_MANIFEST_FILE
|
| 31 |
+
return read_json(resolve_repo_path(manifest_file))
|
| 32 |
|
| 33 |
|
| 34 |
def get_ds_meta(manifest: dict, ds_id: str) -> dict:
|
|
|
|
| 36 |
if ds_id not in manifest:
|
| 37 |
raise KeyError(f"{ds_id} not found in dataset manifest")
|
| 38 |
info = manifest[ds_id]
|
| 39 |
+
ds_dir = resolve_repo_path(info["dataset_dir"])
|
| 40 |
out = {
|
| 41 |
"dataset_id": info["dataset_id"],
|
| 42 |
"dataset_dir": str(ds_dir),
|
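The diff shows only the first lines of `get_ds_meta`, while the updated scripts read split paths from it via `ds_meta[args.split]`. A plausible completion is sketched below under the assumption that every split lives at `<dataset_dir>/<split>.jsonl`, which matches the dictionary removed from `inspect_dataset_distribution.py`. The function name `sketch_ds_meta`, the `repo_root` parameter, and the value of `SPLITS` are assumptions.

```python
from pathlib import Path

SPLITS = ("train", "dev", "test")  # assumed value of the SPLITS constant


def sketch_ds_meta(manifest: dict, ds_id: str, repo_root: Path) -> dict:
    """Resolve a manifest entry into a dataset directory and per-split file paths."""
    if ds_id not in manifest:
        raise KeyError(f"{ds_id} not found in dataset manifest")
    info = manifest[ds_id]
    ds_dir = Path(info["dataset_dir"])
    if not ds_dir.is_absolute():
        ds_dir = repo_root / ds_dir  # repo-relative manifest paths
    out = {"dataset_id": info["dataset_id"], "dataset_dir": str(ds_dir)}
    for split in SPLITS:
        out[split] = str(ds_dir / f"{split}.jsonl")
    return out
```

Under that assumption, `Path(ds_meta["dev"])` gives the file that the evaluation script checks with `split_path.exists()`.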
src/enhanced_replica/io_utils.py
CHANGED
|
@@ -10,11 +10,12 @@ from typing import Any, Dict, Iterable, List
|
|
| 10 |
import pandas as pd
|
| 11 |
|
| 12 |
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
|
|
|
| 16 |
DEFAULT_MANIFEST_FILE = DATASET_ROOT / "90_manifests" / "dataset_manifests.json"
|
| 17 |
-
DEFAULT_OUTPUT_ROOT =
|
| 18 |
|
| 19 |
|
| 20 |
def now_ts() -> str:
|
|
@@ -30,6 +31,13 @@ def ensure_dir(path: Path) -> Path:
|
|
| 30 |
return path
|
| 31 |
|
| 32 |
|
| 33 |
def read_json(path: Path) -> Any:
|
| 34 |
    return json.loads(path.read_text(encoding="utf-8"))
|
| 35 |
|
|
|
|
| 10 |
import pandas as pd
|
| 11 |
|
| 12 |
|
| 13 |
+
PACKAGE_ROOT = Path(__file__).resolve().parent
|
| 14 |
+
SRC_ROOT = PACKAGE_ROOT.parent
|
| 15 |
+
REPO_ROOT = SRC_ROOT.parent
|
| 16 |
+
DATASET_ROOT = REPO_ROOT / "data" / "dataset"
|
| 17 |
DEFAULT_MANIFEST_FILE = DATASET_ROOT / "90_manifests" / "dataset_manifests.json"
|
| 18 |
+
DEFAULT_OUTPUT_ROOT = REPO_ROOT / "outputs"
|
| 19 |
|
| 20 |
|
| 21 |
def now_ts() -> str:
|
|
|
|
| 31 |
return path
|
| 32 |
|
| 33 |
|
| 34 |
+
def resolve_repo_path(path: str | Path) -> Path:
|
| 35 |
+
    resolved = Path(path)
|
| 36 |
+
    if resolved.is_absolute():
|
| 37 |
+
        return resolved
|
| 38 |
+
    return REPO_ROOT / resolved
|
| 39 |
+
|
| 40 |
+
|
| 41 |
def read_json(path: Path) -> Any:
|
| 42 |
    return json.loads(path.read_text(encoding="utf-8"))
|
| 43 |
|
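The behavior of the new `resolve_repo_path` helper is easiest to see with one relative and one absolute input. The sketch below copies the logic from the diff but takes the root as a parameter so it can run outside the repository; the `repo_root` argument and the example root path are assumptions added for illustration.

```python
from pathlib import Path

EXAMPLE_ROOT = Path.cwd() / "enhanced-replica"  # hypothetical checkout location


def resolve_repo_path(path, repo_root: Path = EXAMPLE_ROOT) -> Path:
    """Return absolute paths unchanged and anchor relative paths at the repo root."""
    resolved = Path(path)
    if resolved.is_absolute():
        return resolved
    return repo_root / resolved


relative = resolve_repo_path("data/dataset/90_manifests/dataset_manifests.json")
absolute = resolve_repo_path(Path.cwd() / "elsewhere" / "manifest.json")
```

Here `relative` ends up under `EXAMPLE_ROOT`, while `absolute` is returned untouched, which is what lets a manifest mix repo-relative entries with user-supplied absolute paths.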